Relationship endpoint integrity

Test: Relationship tree

Avg. Score: 94.3%
Scenarios: 2

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 3.1 Flash Lite (Reasoning) | 100.0% | $0.0033 | 3.2s | 100%
2 | Gemini 3.1 Flash Lite | 100.0% | $0.0031 | 3.3s | 100%
3 | LFM2 24B | 100.0% | $0.0004 | 7.8s | 100%
4 | Gemini 3.1 Flash Lite (Preview) | 100.0% | $0.0046 | 3.2s | 100%
5 | Gemini 2.5 Flash | 100.0% | $0.0075 | 6.1s | 100%
6 | Gemini 3 Flash (Preview) | 100.0% | $0.0087 | 7.1s | 100%
7 | Claude 3 Haiku | 100.0% | $0.0056 | 11.6s | 100%
8 | Grok 4.3 | 100.0% | $0.0094 | 7.8s | 100%
9 | DeepSeek V4 Pro | 100.0% | $0.0083 | 17.3s | 100%
10 | Mistral Large 3 | 100.0% | $0.0091 | 17.5s | 100%
11 | Gemma 4 26B | 100.0% | $0.0018 | 27.2s | 100%
12 | Grok 4.3 (Reasoning) | 100.0% | $0.013 | 13.9s | 100%
13 | Grok 4.20 (Beta) | 100.0% | $0.019 | 5.5s | 100%
14 | GPT-4.1 Mini | 100.0% | $0.0046 | 26.0s | 100%
15 | Grok 4.20 | 100.0% | $0.020 | 7.5s | 100%
16 | Ministral 8B | 99.3% | $0.0016 | 15.6s | 98%
17 | Gemini 3.5 Flash (Reasoning, Minimal) | 100.0% | $0.024 | 6.3s | 100%
18 | Claude Haiku 4.5 | 100.0% | $0.024 | 8.2s | 100%
19 | MiniMax M2.7 | 99.8% | $0.0063 | 21.4s | 99%
20 | Gemini 2.5 Flash Lite | 99.1% | $0.0023 | 4.8s | 95%
21 | DeepSeek V3.2 | 100.0% | $0.0046 | 33.9s | 100%
22 | DeepSeek-V2 Chat | 100.0% | $0.0049 | 34.0s | 100%
23 | DeepSeek V3 (2024-12-26) | 100.0% | $0.0045 | 40.2s | 100%
24 | GPT-4o, Aug. 6th (temp=0) | 100.0% | $0.032 | 8.4s | 100%
25 | MiniMax M2.5 | 100.0% | $0.0036 | 45.0s | 100%
26 | GPT-5.4 Nano (Reasoning) | 100.0% | $0.0088 | 41.7s | 100%
27 | DeepSeek V4 Flash | 100.0% | $0.0020 | 51.4s | 100%
28 | GPT-4.1 | 99.6% | $0.024 | 8.5s | 97%
29 | GPT-5.1 | 99.8% | $0.025 | 17.7s | 99%
30 | Ministral 3 14B | 99.3% | $0.0031 | 25.4s | 96%
31 | GPT-5.4 | 100.0% | $0.033 | 17.4s | 100%
32 | DeepSeek V3.1 | 100.0% | $0.0055 | 52.0s | 100%
33 | Ministral 3 8B | 98.8% | $0.0022 | 17.0s | 94%
34 | Mistral Large 2 | 100.0% | $0.037 | 20.5s | 100%
35 | Qwen 3.6 Flash | 100.0% | $0.015 | 51.7s | 100%
36 | Z.AI GLM 4.5 | 99.7% | $0.0075 | 48.1s | 98%
37 | Xiaomi MIMO v2.5 | 100.0% | $0.0027 | 1.1m | 100%
38 | Grok 4.20 (Reasoning) | 100.0% | $0.027 | 40.0s | 100%
39 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.030 | 40.3s | 100%
40 | Mistral Small 3.2 24B | 98.3% | $0.0017 | 13.8s | 90%
41 | Mistral Large | 100.0% | $0.036 | 41.1s | 100%
42 | Ministral 3 3B | 96.8% | $0.0007 | 9.9s | 90%
43 | GPT-4o, Aug. 6th (temp=1) | 98.9% | $0.029 | 9.1s | 93%
44 | ByteDance Seed 1.6 | 100.0% | $0.014 | 1.2m | 100%
45 | GPT-5.4 (Reasoning, Low) | 100.0% | $0.053 | 35.5s | 100%
46 | Qwen3 235B A22B Instruct 2507 | 98.8% | $0.0017 | 1.0m | 93%
47 | Claude Sonnet 4.5 | 100.0% | $0.077 | 18.6s | 100%
48 | GPT-5.4 Mini | 96.6% | $0.0087 | 7.9s | 86%
49 | Ministral 3B | 95.7% | $0.0007 | 9.8s | 86%
50 | Gemma 4 31B | 100.0% | $0.0028 | 1.9m | 100%
51 | Claude Sonnet 4.6 | 100.0% | $0.081 | 23.3s | 100%
52 | GPT-5.2 | 100.0% | $0.062 | 50.7s | 100%
53 | Grok 4.20 (Beta, Reasoning) | 98.3% | $0.032 | 30.1s | 90%
54 | Qwen 3.5 35B | 100.0% | $0.015 | 2.0m | 100%
55 | GPT-4o, May 13th (temp=0) | 99.3% | $0.088 | 5.5s | 96%
56 | GPT-5 Mini | 100.0% | $0.019 | 2.0m | 100%
57 | GPT-5 Nano | 100.0% | $0.0080 | 2.2m | 100%
58 | GPT-OSS 120B | 99.6% | $0.0040 | 2.2m | 98%
59 | GPT-5.5 (Reasoning, Low) | 100.0% | $0.097 | 42.6s | 100%
60 | DeepSeek V4 Pro (Reasoning) | 100.0% | $0.020 | 2.3m | 100%
61 | Qwen 2.5 72B | 97.5% | $0.0056 | 1.2m | 87%
62 | GPT-5.5 | 100.0% | $0.105 | 37.1s | 100%
63 | Writer: Palmyra X5 | 96.4% | $0.016 | 18.4s | 80%
64 | GPT-4o, May 13th (temp=1) | 98.5% | $0.086 | 5.4s | 91%
65 | Inception Mercury 2 | 95.3% | $0.0096 | 12.8s | 79%
66 | Gemini 2.5 Pro | 100.0% | $0.099 | 58.2s | 100%
67 | Claude Opus 4.5 | 100.0% | $0.131 | 19.8s | 100%
68 | GPT-4.1 Nano | 93.7% | $0.0009 | 6.9s | 75%
69 | Gemini 3.5 Flash (Reasoning) | 100.0% | $0.113 | 49.4s | 100%
70 | o4 Mini | 97.4% | $0.037 | 45.7s | 84%
71 | WizardLM 2 8x22b | 98.3% | $0.0098 | 2.3m | 93%
72 | Arcee AI: Trinity Mini | 98.1% | $0.0037 | 1.9m | 88%
73 | Z.AI GLM 4.6 | 100.0% | $0.018 | 3.0m | 100%
74 | Mistral Medium 3.1 | 95.7% | $0.010 | 21.2s | 74%
75 | Aion 2.0 | 100.0% | $0.024 | 3.0m | 100%
76 | Gemma 3 4B | 92.8% | $0.0007 | 27.5s | 75%
77 | o4 Mini High | 98.4% | $0.059 | 1.3m | 91%
78 | Qwen 3.5 397B A17B | 100.0% | $0.045 | 2.6m | 100%
79 | Qwen 3.5 Plus (2026-04-20) | 100.0% | $0.023 | 3.1m | 100%
80 | DeepSeek V4 Flash (Reasoning) | 100.0% | $0.0041 | 3.5m | 100%
81 | Claude Opus 4.6 | 100.0% | $0.154 | 31.4s | 100%
82 | Gemma 3 27B | 91.0% | $0.0016 | 27.5s | 75%
83 | GPT-5.4 Nano | 92.7% | $0.0026 | 11.2s | 70%
84 | Qwen 3.6 27B | 99.3% | $0.034 | 2.7m | 96%
85 | Qwen 3.5 Flash | 100.0% | $0.0040 | 3.8m | 100%
86 | ByteDance Seed 2.0 Lite | 100.0% | $0.014 | 3.6m | 100%
87 | Xiaomi MIMO v2.5 Pro | 97.0% | $0.0076 | 1.9m | 82%
88 | Claude Opus 4.7 | 100.0% | $0.182 | 24.3s | 100%
89 | GPT-5 | 100.0% | $0.092 | 2.3m | 100%
90 | Claude Opus 4.7 (Reasoning) | 100.0% | $0.184 | 25.0s | 100%
91 | Nemotron 3 Super | 100.0% | $0.0000 | 3.2m |
92 | Qwen 3.5 Plus (2026-02-15) | 100.0% | $0.021 | 3.9m | 100%
93 | Mistral NeMO | 88.5% | $0.0024 | 15.4s | 67%
94 | DeepSeek V3 (2025-03-24) | 92.6% | $0.0037 | 50.3s | 69%
95 | GPT-5.4 Mini (Reasoning, Low) | 91.6% | $0.014 | 14.4s | 64%
96 | Gemini 3.1 Pro (Preview) | 100.0% | $0.172 | 1.5m | 100%
97 | Z.AI GLM 4.7 Flash | 95.8% | $0.0050 | 2.8m | 81%
98 | GPT-5.4 Nano (Reasoning, Low) | 92.6% | $0.0038 | 13.8s | 56%
99 | Z.AI GLM 5 | 100.0% | $0.037 | 4.6m | 100%
100 | Qwen3.6 Max Preview | 100.0% | $0.084 | 3.7m | 100%
101 | Llama 3.1 70B | 92.7% | $0.0064 | 34.0s | 56%
102 | GPT-5.4 (Reasoning) | 100.0% | $0.175 | 2.6m | 100%
103 | Z.AI GLM 4.7 | 96.9% | $0.025 | 3.7m | 81%
104 | MoonshotAI: Kimi K2.6 | 100.0% | $0.082 | 5.0m | 100%
105 | Qwen 3 32B | 92.5% | $0.0042 | 2.9m | 66%
106 | Hermes 3 405B | 91.5% | $0.015 | 1.0m | 50%
107 | Claude Sonnet 4 | 92.2% | $0.072 | 17.5s | 53%
108 | GPT-5.5 (Reasoning) | 100.0% | $0.251 | 2.0m | 100%
109 | MoonshotAI: Kimi K2.5 | 98.2% | $0.031 | 5.4m | 89%
110 | Mistral Small 4 (Reasoning) | 85.8% | $0.0051 | 24.7s | 42%
111 | GPT-5.4 Mini (Reasoning) | 93.1% | $0.117 | 2.8m | 73%
112 | Z.AI GLM 5.1 | 100.0% | $0.081 | 7.3m | 100%
113 | Gemini 2.5 Flash Lite (Reasoning) | 76.9% | $0.0073 | 1.0m | 41%
114 | Z.AI GLM 4.5 Air | 90.8% | $0.0071 | 3.2m | 45%
115 | Claude Opus 4.8 (Reasoning, Low) | 100.0% | $0.383 | 1.9m | 100%
116 | Claude Opus 4.8 (Reasoning) | 100.0% | $0.381 | 1.9m | 100%
117 | ByteDance Seed 2.0 Mini | 98.3% | $0.0062 | 8.7m | 90%
118 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.356 | 2.6m | 100%
119 | MiniMax M3 | 100.0% | $0.036 | 9.3m | 100%
120 | Cydonia 24B V4.1 | 72.4% | $0.0040 | 1.0m | 32%
121 | Gemma 4 31B (Reasoning) | 93.2% | $0.0043 | 5.8m | 59%
122 | Qwen 3.6 35B | 80.5% | $0.014 | 1.1m | 22%
123 | Qwen 3.5 27B | 90.0% | $0.029 | 3.7m | 40%
124 | Z.AI GLM 5 Turbo | 90.0% | $0.044 | 3.6m | 40%
125 | Hermes 3 70B | 68.8% | $0.0045 | 42.7s | 24%
126 | Qwen 3.5 122B | 90.0% | $0.035 | 4.0m | 40%
127 | Rocinante 12B | 67.1% | $0.0033 | 32.1s | 20%
128 | Skyfall 36B V2 | 62.0% | $0.0054 | 26.5s | 19%
129 | Gemma 4 26B (Reasoning) | 80.3% | $0.0041 | 3.1m | 21%
130 | Qwen 3.5 9B | 86.5% | $0.0040 | 7.8m | 64%
131 | ByteDance Seed 1.6 Flash | 58.9% | $0.0019 | 38.2s | 13%
132 | Mistral Small 4 | 57.9% | $0.0030 | 7.4s | 5%
133 | Llama 3.1 8B | 55.1% | $0.0005 | 51.5s | 15%
134 | Qwen3.7 Max | 80.2% | $0.045 | 3.5m | 21%
135 | Gemini 2.5 Flash (Reasoning) | 50.1% | $0.023 | 37.3s | 15%
136 | Claude Opus 4 | 92.3% | $0.368 | 2.4m | 54%
137 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.448 | 5.9m | 100%
138 | Cohere Command R+ (Aug. 2024) | 57.4% | $0.057 | 56.5s | 4%
139 | Nemotron 3 Nano | 51.7% | $0.0039 | 2.8m | 7%
140 | Gemma 3 12B | 10.0% | $0.0014 | 30.4s | 0%
Avg. Score: 94.26%

Individual Scenarios

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.8 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.8 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 93 | 98.5%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 83 | 96.5%
Ministral 3 3B | 100 | 100 | 100 | 90 | 85 | 94.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 70 | 94.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 69 | 93.8%
Ministral 3B | 100 | 100 | 100 | 86 | 78 | 93.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 57 | 91.3%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 42 | 88.4%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 32 | 86.3%
Rocinante 12B | 100 | 100 | 86 | 77 | 67 | 86.0%
GPT-5.4 Nano | 100 | 100 | 100 | 71 | 56 | 85.4%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 26 | 85.3%
Claude Opus 4 | 100 | 100 | 100 | 100 | 23 | 84.6%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 22 | 84.4%
Qwen 3.5 9B | 100 | 100 | 92 | 67 | 55 | 82.6%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 8 | 81.7%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 0 | 80.0%
Skyfall 36B V2 | 100 | 100 | 77 | 60 | 57 | 78.9%
Hermes 3 70B | 100 | 100 | 67 | 52 | 48 | 73.6%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 55 | 33 | 24 | 62.3%
Cydonia 24B V4.1 | 100 | 95 | 60 | 44 | 0 | 59.9%
ByteDance Seed 1.6 Flash | 100 | 100 | 42 | 34 | 19 | 58.9%
Llama 3.1 8B | 100 | 88 | 32 | 32 | 23 | 54.9%
Nemotron 3 Nano | 100 | 100 | 48 | 12 | 0 | 52.0%
Gemini 2.5 Flash (Reasoning) | 100 | 27 | 27 | 25 | 22 | 40.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemma 3 12B | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral Small 4 | 77 | 2 | 0 | 0 | 0 | 15.8%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.8 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.8 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | | | | | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 98 | 99.7%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 98 | 99.5%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 97 | 99.3%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-4.1 | 100 | 100 | 100 | 100 | 96 | 99.1%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 93 | 98.7%
Ministral 3 3B | 100 | 100 | 100 | 98 | 95 | 98.6%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 93 | 98.6%
Ministral 3 14B | 100 | 100 | 100 | 100 | 93 | 98.6%
Ministral 8B | 100 | 100 | 98 | 98 | 97 | 98.5%
Ministral 3B | 100 | 100 | 100 | 97 | 94 | 98.4%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 91 | 98.2%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 98 | 92 | 98.1%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 89 | 97.8%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 88 | 97.7%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 88 | 97.6%
Ministral 3 8B | 100 | 100 | 100 | 97 | 91 | 97.6%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 85 | 97.0%
o4 Mini High | 100 | 100 | 100 | 100 | 84 | 96.9%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 83 | 96.7%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 83 | 96.7%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 82 | 96.4%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 81 | 96.1%
Qwen 2.5 72B | 100 | 100 | 100 | 97 | 78 | 94.9%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 74 | 94.8%
o4 Mini | 100 | 100 | 100 | 100 | 74 | 94.8%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 70 | 93.9%
GPT-5.4 Mini | 100 | 100 | 100 | 83 | 83 | 93.2%
Writer: Palmyra X5 | 100 | 100 | 100 | 97 | 67 | 92.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 96 | 61 | 91.5%
Inception Mercury 2 | 100 | 100 | 100 | 88 | 65 | 90.7%
Qwen 3.5 9B | 100 | 100 | 94 | 86 | 72 | 90.4%
GPT-4.1 Nano | 100 | 100 | 93 | 84 | 60 | 87.4%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 68 | 64 | 86.3%
Gemma 3 4B | 100 | 100 | 89 | 75 | 64 | 85.5%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 27 | 85.4%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 75 | 51 | 85.2%
Cydonia 24B V4.1 | 100 | 100 | 94 | 86 | 45 | 85.0%
Qwen 3 32B | 100 | 96 | 95 | 91 | 43 | 85.0%
Hermes 3 405B | 100 | 100 | 100 | 98 | 17 | 82.9%
Gemma 3 27B | 90 | 85 | 84 | 83 | 67 | 81.9%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral NeMO | 91 | 89 | 81 | 72 | 53 | 77.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 74 | 0 | 74.8%
Mistral Small 4 (Reasoning) | 100 | 90 | 89 | 78 | 1 | 71.7%
Hermes 3 70B | 100 | 100 | 100 | 20 | 0 | 64.0%
Qwen 3.6 35B | 100 | 100 | 100 | 5 | 0 | 61.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 3 | 0 | 60.7%
Qwen3.7 Max | 100 | 100 | 100 | 2 | 0 | 60.4%
Gemini 2.5 Flash (Reasoning) | 100 | 73 | 68 | 59 | 0 | 59.8%
Llama 3.1 8B | 94 | 89 | 50 | 22 | 22 | 55.3%
Nemotron 3 Nano | 100 | 100 | 52 | 5 | 0 | 51.5%
Rocinante 12B | 100 | 100 | 41 | 0 | 0 | 48.3%
Skyfall 36B V2 | 100 | 83 | 39 | 3 | 0 | 45.1%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0.0%