Red-herring resistance

Test: Relationship tree

Avg. Score: 99.1%
Scenarios: 2

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 3.1 Flash Lite (Reasoning) | 100.0% | $0.0033 | 3.2s | 100%
2 | Gemini 3.1 Flash Lite | 100.0% | $0.0031 | 3.3s | 100%
3 | Gemini 2.5 Flash Lite | 100.0% | $0.0023 | 4.8s | 100%
4 | GPT-4.1 Nano | 100.0% | $0.0009 | 6.9s | 100%
5 | LFM2 24B | 100.0% | $0.0004 | 7.8s | 100%
6 | Gemini 3.1 Flash Lite (Preview) | 100.0% | $0.0046 | 3.2s | 100%
7 | GPT-5.4 Nano | 100.0% | $0.0026 | 11.2s | 100%
8 | Gemini 2.5 Flash | 100.0% | $0.0075 | 6.1s | 100%
9 | Mistral Small 3.2 24B | 100.0% | $0.0017 | 13.8s | 100%
10 | Gemini 3 Flash (Preview) | 100.0% | $0.0087 | 7.1s | 100%
11 | Mistral NeMO | 100.0% | $0.0024 | 15.4s | 100%
12 | GPT-5.4 Nano (Reasoning, Low) | 100.0% | $0.0038 | 13.8s | 100%
13 | GPT-5.4 Mini | 100.0% | $0.0087 | 7.9s | 100%
14 | Grok 4.3 | 100.0% | $0.0094 | 7.8s | 100%
15 | Inception Mercury 2 | 100.0% | $0.0096 | 12.8s | 100%
16 | DeepSeek V4 Pro | 100.0% | $0.0083 | 17.3s | 100%
17 | Gemma 3 4B | 100.0% | $0.0007 | 27.5s | 100%
18 | Mistral Large 3 | 100.0% | $0.0091 | 17.5s | 100%
19 | Ministral 3 14B | 100.0% | $0.0031 | 25.4s | 100%
20 | MiniMax M2.7 | 100.0% | $0.0063 | 21.4s | 100%
21 | Gemma 4 26B | 100.0% | $0.0018 | 27.2s | 100%
22 | Grok 4.3 (Reasoning) | 100.0% | $0.013 | 13.9s | 100%
23 | Gemma 3 27B | 100.0% | $0.0016 | 27.5s | 100%
24 | Grok 4.20 (Beta) | 100.0% | $0.019 | 5.5s | 100%
25 | Mistral Small 4 (Reasoning) | 100.0% | $0.0051 | 24.7s | 100%
26 | GPT-5.4 Mini (Reasoning, Low) | 100.0% | $0.014 | 14.4s | 100%
27 | GPT-4.1 Mini | 100.0% | $0.0046 | 26.0s | 100%
28 | Grok 4.20 | 100.0% | $0.020 | 7.5s | 100%
29 | Gemma 3 12B | 100.0% | $0.0014 | 30.4s | 100%
30 | Mistral Medium 3.1 | 100.0% | $0.010 | 21.2s | 100%
31 | Gemini 3.5 Flash (Reasoning, Minimal) | 100.0% | $0.024 | 6.3s | 100%
32 | GPT-4.1 | 100.0% | $0.024 | 8.5s | 100%
33 | Claude Haiku 4.5 | 100.0% | $0.024 | 8.2s | 100%
34 | Writer: Palmyra X5 | 100.0% | $0.016 | 18.4s | 100%
35 | DeepSeek V3.2 | 100.0% | $0.0046 | 33.9s | 100%
36 | DeepSeek-V2 Chat | 100.0% | $0.0049 | 34.0s | 100%
37 | ByteDance Seed 1.6 Flash | 100.0% | $0.0019 | 38.2s | 100%
38 | Llama 3.1 70B | 100.0% | $0.0064 | 34.0s | 100%
39 | GPT-4o, Aug. 6th (temp=1) | 100.0% | $0.029 | 9.1s | 100%
40 | DeepSeek V3 (2024-12-26) | 100.0% | $0.0045 | 40.2s | 100%
41 | GPT-4o, Aug. 6th (temp=0) | 100.0% | $0.032 | 8.4s | 100%
42 | GPT-5.1 | 100.0% | $0.025 | 17.7s | 100%
43 | MiniMax M2.5 | 100.0% | $0.0036 | 45.0s | 100%
44 | Llama 3.1 8B | 100.0% | $0.0005 | 51.5s | 100%
45 | GPT-5.4 Nano (Reasoning) | 100.0% | $0.0088 | 41.7s | 100%
46 | DeepSeek V4 Flash | 100.0% | $0.0020 | 51.4s | 100%
47 | DeepSeek V3 (2025-03-24) | 100.0% | $0.0037 | 50.3s | 100%
48 | Z.AI GLM 4.5 | 100.0% | $0.0075 | 48.1s | 100%
49 | GPT-5.4 | 100.0% | $0.033 | 17.4s | 100%
50 | DeepSeek V3.1 | 100.0% | $0.0055 | 52.0s | 100%
51 | Cydonia 24B V4.1 | 100.0% | $0.0040 | 1.0m | 100%
52 | Mistral Large 2 | 100.0% | $0.037 | 20.5s | 100%
53 | Gemini 2.5 Flash (Reasoning) | 100.0% | $0.023 | 37.3s | 100%
54 | Gemini 2.5 Flash Lite (Reasoning) | 100.0% | $0.0073 | 1.0m | 100%
55 | Grok 4.20 (Beta, Reasoning) | 100.0% | $0.032 | 30.1s | 100%
56 | Qwen 3.6 Flash | 100.0% | $0.015 | 51.7s | 100%
57 | Xiaomi MIMO v2.5 | 100.0% | $0.0027 | 1.1m | 100%
58 | Grok 4.20 (Reasoning) | 100.0% | $0.027 | 40.0s | 100%
59 | Qwen 2.5 72B | 100.0% | $0.0056 | 1.2m | 100%
60 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.030 | 40.3s | 100%
61 | Hermes 3 405B | 100.0% | $0.015 | 1.0m | 100%
62 | Qwen 3.6 35B | 100.0% | $0.014 | 1.1m | 100%
63 | Mistral Large | 100.0% | $0.036 | 41.1s | 100%
64 | ByteDance Seed 1.6 | 100.0% | $0.014 | 1.2m | 100%
65 | o4 Mini | 100.0% | $0.037 | 45.7s | 100%
66 | GPT-5.4 (Reasoning, Low) | 100.0% | $0.053 | 35.5s | 100%
67 | Claude Sonnet 4 | 100.0% | $0.072 | 17.5s | 100%
68 | GPT-4o, May 13th (temp=1) | 100.0% | $0.086 | 5.4s | 100%
69 | Claude Sonnet 4.5 | 100.0% | $0.077 | 18.6s | 100%
70 | Gemma 4 31B | 100.0% | $0.0028 | 1.9m | 100%
71 | Arcee AI: Trinity Mini | 100.0% | $0.0037 | 1.9m | 100%
72 | Claude Sonnet 4.6 | 100.0% | $0.081 | 23.3s | 100%
73 | Xiaomi MIMO v2.5 Pro | 100.0% | $0.0076 | 1.9m | 100%
74 | Cohere Command R+ (Aug. 2024) | 100.0% | $0.057 | 56.5s | 100%
75 | GPT-5.2 | 100.0% | $0.062 | 50.7s | 100%
76 | GPT-OSS 120B | 100.0% | $0.0040 | 2.2m | 100%
77 | Qwen 3.5 35B | 100.0% | $0.015 | 2.0m | 100%
78 | GPT-5 Mini | 100.0% | $0.019 | 2.0m | 100%
79 | o4 Mini High | 100.0% | $0.059 | 1.3m | 100%
80 | WizardLM 2 8x22b | 100.0% | $0.0098 | 2.3m | 100%
81 | GPT-5.5 (Reasoning, Low) | 100.0% | $0.097 | 42.6s | 100%
82 | DeepSeek V4 Pro (Reasoning) | 100.0% | $0.020 | 2.3m | 100%
83 | GPT-5.5 | 100.0% | $0.105 | 37.1s | 100%
84 | Nemotron 3 Nano | 100.0% | $0.0039 | 2.8m | 100%
85 | Qwen 3 32B | 100.0% | $0.0042 | 2.9m | 100%
86 | Gemini 2.5 Pro | 100.0% | $0.099 | 58.2s | 100%
87 | Claude Opus 4.5 | 100.0% | $0.131 | 19.8s | 100%
88 | Gemini 3.5 Flash (Reasoning) | 100.0% | $0.113 | 49.4s | 100%
89 | Gemma 4 26B (Reasoning) | 100.0% | $0.0041 | 3.1m | 100%
90 | Qwen 3.6 27B | 100.0% | $0.034 | 2.7m | 100%
91 | Z.AI GLM 4.5 Air | 100.0% | $0.0071 | 3.2m | 100%
92 | Z.AI GLM 4.6 | 100.0% | $0.018 | 3.0m | 100%
93 | Aion 2.0 | 100.0% | $0.024 | 3.0m | 100%
94 | Qwen 3.5 397B A17B | 100.0% | $0.045 | 2.6m | 100%
95 | Qwen 3.5 Plus (2026-04-20) | 100.0% | $0.023 | 3.1m | 100%
96 | DeepSeek V4 Flash (Reasoning) | 100.0% | $0.0041 | 3.5m | 100%
97 | Claude Opus 4.6 | 100.0% | $0.154 | 31.4s | 100%
98 | Qwen 3.5 Flash | 100.0% | $0.0040 | 3.8m | 100%
99 | ByteDance Seed 2.0 Lite | 100.0% | $0.014 | 3.6m | 100%
100 | Claude Opus 4.7 | 100.0% | $0.182 | 24.3s | 100%
101 | Z.AI GLM 4.7 | 100.0% | $0.025 | 3.7m | 100%
102 | GPT-5 | 100.0% | $0.092 | 2.3m | 100%
103 | Claude Opus 4.7 (Reasoning) | 100.0% | $0.184 | 25.0s | 100%
104 | Nemotron 3 Super | 100.0% | $0.0000 | 3.2m |
105 | Qwen 3.5 27B | 100.0% | $0.029 | 3.7m | 100%
106 | Qwen 3.5 Plus (2026-02-15) | 100.0% | $0.021 | 3.9m | 100%
107 | Z.AI GLM 5 Turbo | 100.0% | $0.044 | 3.6m | 100%
108 | Qwen3.7 Max | 100.0% | $0.045 | 3.5m | 100%
109 | Qwen 3.5 122B | 100.0% | $0.035 | 4.0m | 100%
110 | Gemini 3.1 Pro (Preview) | 100.0% | $0.172 | 1.5m | 100%
111 | GPT-5.4 Mini (Reasoning) | 100.0% | $0.117 | 2.8m | 100%
112 | Z.AI GLM 5 | 100.0% | $0.037 | 4.6m | 100%
113 | Qwen3.6 Max Preview | 100.0% | $0.084 | 3.7m | 100%
114 | Gemma 4 31B (Reasoning) | 100.0% | $0.0043 | 5.8m | 100%
115 | MoonshotAI: Kimi K2.5 | 100.0% | $0.031 | 5.4m | 100%
116 | GPT-5.4 (Reasoning) | 100.0% | $0.175 | 2.6m | 100%
117 | MoonshotAI: Kimi K2.6 | 100.0% | $0.082 | 5.0m | 100%
118 | GPT-5.5 (Reasoning) | 100.0% | $0.251 | 2.0m | 100%
119 | Qwen 3.5 9B | 100.0% | $0.0040 | 7.8m | 100%
120 | Ministral 8B | 93.2% | $0.0016 | 15.6s | 59%
121 | Claude 3 Haiku | 93.2% | $0.0056 | 11.6s | 59%
122 | Ministral 3 8B | 93.2% | $0.0022 | 17.0s | 59%
123 | ByteDance Seed 2.0 Mini | 100.0% | $0.0062 | 8.7m | 100%
124 | Z.AI GLM 5.1 | 100.0% | $0.081 | 7.3m | 100%
125 | Skyfall 36B V2 | 93.2% | $0.0054 | 26.5s | 59%
126 | Qwen3 235B A22B Instruct 2507 | 93.2% | $0.0017 | 1.0m | 59%
127 | Claude Opus 4.8 (Reasoning, Low) | 100.0% | $0.383 | 1.9m | 100%
128 | Claude Opus 4.8 (Reasoning) | 100.0% | $0.381 | 1.9m | 100%
129 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.356 | 2.6m | 100%
130 | Claude Opus 4 | 100.0% | $0.368 | 2.4m | 100%
131 | MiniMax M3 | 100.0% | $0.036 | 9.3m | 100%
132 | GPT-4o, May 13th (temp=0) | 93.2% | $0.088 | 5.5s | 59%
133 | Z.AI GLM 4.7 Flash | 93.2% | $0.0050 | 2.8m | 59%
134 | Ministral 3 3B | 90.6% | $0.0007 | 9.9s | 44%
135 | Mistral Small 4 | 90.6% | $0.0030 | 7.4s | 44%
136 | Rocinante 12B | 90.6% | $0.0033 | 32.1s | 44%
137 | GPT-5 Nano | 90.6% | $0.0080 | 2.2m | 44%
138 | Hermes 3 70B | 86.3% | $0.0045 | 42.7s | 45%
139 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.448 | 5.9m | 100%
140 | Ministral 3B | 81.3% | $0.0007 | 9.8s | 25%
Average score: 99.15%

Individual Scenarios

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.8 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.8 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 6 | 81.3%
Mistral Small 4 | 100 | 100 | 100 | 100 | 6 | 81.3%
Ministral 3 3B | 100 | 100 | 100 | 100 | 6 | 81.3%
Rocinante 12B | 100 | 100 | 100 | 100 | 6 | 81.3%
Ministral 3B | 100 | 100 | 100 | 6 | 6 | 62.5%

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.8 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.8 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | | | | | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 32 | 86.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 32 | 86.3%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 32 | 86.3%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 32 | 86.3%
Ministral 3 8B | 100 | 100 | 100 | 100 | 32 | 86.3%
Ministral 8B | 100 | 100 | 100 | 100 | 32 | 86.3%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 32 | 86.3%
Hermes 3 70B | 100 | 100 | 100 | 32 | 32 | 72.7%