Correct "no violations" response

Test: Codex Red Herring (False Positive Detection)

Avg. Score: 50.5%
Scenarios: 8

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | GPT-5.4 Nano (Reasoning, Low) | 97.5% | $0.0008 | 3.9s | 73%
2 | Inception Mercury | 97.5% | $0.0004 | 8.0s | 69%
3 | Nemotron 3 Super | 98.8% | $0.0000 | 1.3m | 78%
4 | Grok 4.1 Fast | 96.3% | $0.0019 | 12.5s | 62%
5 | Inception Mercury 2 | 95.0% | $0.0027 | 3.8s | 56%
6 | Z.AI GLM 5 Turbo | 95.0% | $0.0071 | 16.0s | 56%
7 | GPT-5.4 Mini (Reasoning, Low) | 93.8% | $0.0034 | 4.0s | 52%
8 | o4 Mini | 95.0% | $0.014 | 25.0s | 56%
9 | ByteDance Seed 1.6 Flash | 92.5% | $0.0008 | 9.1s | 47%
10 | GPT-5.4 Nano (Reasoning) | 92.5% | $0.0017 | 11.4s | 47%
11 | o4 Mini High | 96.3% | $0.027 | 52.5s | 62%
12 | GPT-5.1 | 93.8% | $0.025 | 26.1s | 52%
13 | GPT-5 Mini | 91.3% | $0.0059 | 37.8s | 43%
14 | GPT-4.1 | 85.6% | $0.0048 | 1.2s | 38%
15 | Gemini 2.5 Flash Lite (Reasoning) | 88.8% | $0.0023 | 16.6s | 37%
16 | GPT-5 Nano | 90.0% | $0.0035 | 1.1m | 40%
17 | Ministral 8B | 74.4% | $0.0007 | 4.4s | 33%
18 | MiniMax M2.7 | 86.3% | $0.0047 | 34.6s | 31%
19 | Ministral 3 8B | 80.6% | $0.0013 | 17.9s | 30%
20 | Z.AI GLM 5 | 91.3% | $0.017 | 1.4m | 43%
21 | GPT-5.4 Mini (Reasoning) | 85.0% | $0.0090 | 10.8s | 29%
22 | Grok 4.20 (Beta, Reasoning) | 86.3% | $0.025 | 15.9s | 31%
23 | Gemini 2.5 Flash (Reasoning) | 82.5% | $0.0085 | 14.2s | 24%
24 | ByteDance Seed 1.6 | 81.3% | $0.0043 | 32.7s | 22%
25 | GPT-4.1 Nano | 68.1% | $0.0005 | 4.6s | 24%
26 | Stealth: Healer Alpha | 78.8% | $0.0000 | 21.5s | 18%
27 | GPT-5.2 | 81.3% | $0.013 | 14.5s | 22%
28 | Claude Sonnet 4.6 | 73.8% | $0.033 | 17.4s | 39%
29 | Claude Opus 4.6 | 81.9% | $0.049 | 13.2s | 38%
30 | Aion 2.0 | 83.8% | $0.0096 | 1.3m | 26%
31 | Mistral Small 4 (Reasoning) | 77.5% | $0.0025 | 22.6s | 16%
32 | GPT-5.4 Nano | 66.3% | $0.0005 | 1.5s | 18%
33 | Ministral 3B | 43.1% | $0.0004 | 12.4s | 33%
34 | MiniMax M2.5 | 73.1% | $0.0032 | 25.9s | 12%
35 | Claude Haiku 4.5 | 51.2% | $0.0080 | 4.2s | 24%
36 | LFM2 24B | 71.3% | $0.0004 | 39.0s | 11%
37 | GPT-5.4 (Reasoning, Low) | 71.3% | $0.012 | 8.8s | 9%
38 | GPT-4.1 Mini | 51.2% | $0.0018 | 6.9s | 18%
39 | Mistral Medium 3.1 | 53.1% | $0.0032 | 4.7s | 17%
40 | Nemotron 3 Nano | 83.8% | $0.0027 | 2.9m | 26%
41 | Arcee AI: Trinity Mini | 67.5% | $0.0009 | 26.3s | 6%
42 | Z.AI GLM 4.5 | 61.3% | $0.0038 | 25.0s | 12%
43 | Mistral Small Creative | 37.5% | $0.0011 | 8.9s | 26%
44 | Grok 4 Fast | 66.3% | $0.0019 | 19.2s | 5%
45 | Ministral 3 3B | 37.5% | $0.0013 | 31.1s | 28%
46 | Qwen 2.5 72B | 45.6% | $0.0009 | 11.3s | 15%
47 | Ministral 3 14B | 50.6% | $0.0019 | 29.2s | 15%
48 | Z.AI GLM 4.6 | 68.1% | $0.015 | 59.0s | 14%
49 | Claude Opus 4.6 (Reasoning) | 92.5% | $0.120 | 1.0m | 47%
50 | Stealth: Hunter Alpha | 64.4% | $0.0000 | 1.0m | 5%
51 | Mistral Small 4 | 28.1% | $0.0012 | 7.4s | 23%
52 | Qwen 3 32B | 60.6% | $0.0014 | 37.7s | 3%
53 | Llama 3.1 8B | 26.9% | $0.0003 | 33.8s | 25%
54 | Hermes 3 70B | 42.5% | $0.0018 | 12.1s | 9%
55 | Llama 3.1 Nemotron 70B | 35.0% | $0.0077 | 21.4s | 17%
56 | Arcee AI: Trinity Large (Preview) | 43.1% | $0.0000 | 1.7m | 22%
57 | GPT-4o Mini (temp=1) | 32.5% | $0.0008 | 5.6s | 7%
58 | Z.AI GLM 4.7 Flash | 70.0% | $0.0039 | 2.5m | 8%
59 | GPT-5.4 (Reasoning) | 63.7% | $0.032 | 31.9s | 4%
60 | Cohere Command R+ (Aug. 2024) | 50.0% | $0.017 | 7.4s | 0%
61 | Mistral Large 3 | 36.9% | $0.0040 | 11.7s | 0%
62 | Grok 4.20 (Beta) | 34.4% | $0.0061 | 4.3s | 0%
63 | ByteDance Seed 2.0 Mini | 68.1% | $0.0033 | 3.1m | 7%
64 | GPT-4o, Aug. 6th (temp=1) | 32.5% | $0.0094 | 2.4s | 0%
65 | Hermes 3 405B | 30.6% | $0.0059 | 5.2s | 0%
66 | Claude Opus 4.5 | 40.0% | $0.036 | 4.4s | 9%
67 | Gemini 2.5 Flash Lite | 25.0% | $0.0010 | 4.0s | 0%
68 | Mistral Small 3.2 24B | 26.3% | $0.0010 | 11.9s | 0%
69 | Mistral Large 2 | 35.0% | $0.016 | 11.4s | 0%
70 | GPT-5 | 67.5% | $0.048 | 1.5m | 6%
71 | Mistral Large | 33.1% | $0.016 | 12.3s | 0%
72 | Claude Sonnet 4 | 35.0% | $0.022 | 4.5s | 0%
73 | Llama 3.1 70B | 25.0% | $0.0031 | 16.4s | 0%
74 | Claude 3 Haiku | 20.6% | $0.0020 | 3.5s | 0%
75 | Rocinante 12B | 22.5% | $0.0015 | 13.3s | 0%
76 | GPT-5.4 Mini | 15.0% | $0.0020 | 1.2s | 0%
77 | Writer: Palmyra X5 | 22.5% | $0.0085 | 13.8s | 0%
78 | Gemini 3 Flash (Preview, Reasoning) | 44.4% | $0.028 | 49.2s | 0%
79 | GPT-4o Mini (temp=0) | 16.9% | $0.0009 | 14.5s | 0%
80 | Qwen3 235B A22B Instruct 2507 | 21.3% | $0.0011 | 32.0s | 0%
81 | Gemini 3.1 Flash Lite (Preview) | 11.9% | $0.0017 | 1.6s | 0%
82 | WizardLM 2 8x22b | 26.9% | $0.0041 | 55.3s | 0%
83 | Z.AI GLM 4.7 | 55.0% | $0.023 | 2.2m | 1%
84 | ByteDance Seed 2.0 Lite | 25.0% | $0.0067 | 1.0m | 0%
85 | Claude Sonnet 4.5 | 20.6% | $0.024 | 5.9s | 0%
86 | DeepSeek V3.1 | 12.5% | $0.0019 | 36.6s | 0%
87 | DeepSeek V3.2 | 13.8% | $0.0018 | 42.8s | 0%
88 | DeepSeek-V2 Chat | 5.0% | $0.0023 | 8.3s | 0%
89 | DeepSeek V3 (2024-12-26) | 4.4% | $0.0023 | 6.6s | 0%
90 | DeepSeek V3 (2025-03-24) | 5.6% | $0.0015 | 15.7s | 0%
91 | GPT-4o, Aug. 6th (temp=0) | 8.8% | $0.011 | 2.6s | 0%
92 | MoonshotAI: Kimi K2.5 | 52.5% | $0.017 | 2.6m | 0%
93 | Gemini 3 Flash (Preview) | 2.5% | $0.0034 | 3.2s | 0%
94 | Claude 3.7 Sonnet | 15.6% | $0.022 | 4.3s | 0%
95 | Qwen 3.5 Flash | 34.4% | $0.0071 | 2.0m | 0%
96 | Gemini 2.5 Flash | 0.6% | $0.0021 | 1.5s | 0%
97 | GPT-5.4 | 2.5% | $0.0048 | 3.1s | 0%
98 | Gemma 3 12B | 1.3% | $0.0005 | 11.7s | 0%
99 | Gemma 3 27B | 0.0% | $0.0007 | 12.6s | 0%
100 | Gemma 3 4B | 0.0% | $0.0003 | 15.0s | 0%
101 | GPT-4o, May 13th (temp=1) | 16.9% | $0.033 | 3.0s | 0%
102 | Grok 4 | 61.3% | $0.074 | 1.6m | 3%
103 | Claude 3.5 Sonnet | 18.8% | $0.045 | 6.9s | 0%
104 | Mistral NeMO | 8.8% | $0.0018 | 1.6m | 0%
105 | Gemini 2.5 Pro | 46.3% | $0.073 | 52.1s | 0%
106 | Qwen 3.5 Plus (2026-02-15) | 18.8% | $0.021 | 2.0m | 0%
107 | GPT-4o, May 13th (temp=0) | 1.9% | $0.043 | 6.4s | 0%
108 | Qwen 3.5 35B | 35.0% | $0.044 | 2.4m | 0%
109 | Qwen 3.5 9B | 42.5% | $0.0038 | 4.7m | 0%
110 | Claude Sonnet 4.6 (Reasoning) | 76.3% | $0.145 | 2.2m | 15%
111 | Gemini 3.1 Pro (Preview) | 57.5% | $0.120 | 1.5m | 1%
112 | Claude Opus 4 | 18.1% | $0.109 | 7.1s | 0%
113 | Qwen 3.5 27B | 50.6% | $0.056 | 4.8m | 2%
114 | Gemini 3 Pro (Preview) | 53.8% | $0.144 | 1.7m | 0%
115 | Qwen 3.5 397B A17B | 31.3% | $0.028 | 5.4m | 0%
116 | Qwen 3.5 122B | 49.4% | $0.135 | 6.4m | 0%

Individual Scenarios

basic entries

Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100100100100100100.0%
GPT-5.1100100100100100100100100100100100.0%
Aion 2.0100100100100100100100100100100100.0%
Stealth: Healer Alpha100100100100100100100100100100100.0%
Inception Mercury 2100100100100100100100100100100100.0%
GPT-5 Nano100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100100100100100100.0%
LFM2 24B100100100100100100100100100100100.0%
Claude Opus 4.61001001001001001001001001005095.0%
GPT-4.1 Nano1001001001001001001001001005095.0%
Ministral 3 8B1001001001001001001001001005095.0%
GPT-5100100100100100100100100100090.0%
GPT-5.2100100100100100100100100100090.0%
Grok 4.1 Fast100100100100100100100100100090.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100100100100090.0%
Nemotron 3 Super100100100100100100100100100090.0%
Mistral Small 4 (Reasoning)100100100100100100100100100090.0%
Inception Mercury100100100100100100100100100090.0%
Nemotron 3 Nano100100100100100100100100100090.0%
ByteDance Seed 1.6 Flash100100100100100100100100100090.0%
Claude Sonnet 4.6 (Reasoning)1001001001001001001001000080.0%
Z.AI GLM 51001001001001001001001000080.0%
o4 Mini High1001001001001001001001000080.0%
o4 Mini1001001001001001001001000080.0%
Gemini 2.5 Flash (Reasoning)1001001001001001001001000080.0%
GPT-5.4 Mini (Reasoning, Low)1001001001001001001001000080.0%
MoonshotAI: Kimi K2.510010010010010010010000070.0%
Z.AI GLM 4.7 Flash10010010010010010010000070.0%
Z.AI GLM 5 Turbo100100100100100100000060.0%
Claude Sonnet 4.6100100100100505050500060.0%
Ministral 8B10010010050505050500055.0%
GPT-5.4 (Reasoning)1001001001001000000050.0%
MiniMax M2.71001001001001000000050.0%
MiniMax M2.51001001001001000000050.0%
GPT-4.11001005050505050500050.0%
Stealth: Hunter Alpha1001001001001000000050.0%
Qwen 3 32B1001001001001000000050.0%
GPT-5.4 Nano (Reasoning)1001001001001000000050.0%
GPT-5.4 Nano1001001001005050000050.0%
Z.AI GLM 4.6100100100100500000045.0%
ByteDance Seed 1.610010010010000000040.0%
Writer: Palmyra X550505050505050500040.0%
Qwen 2.5 72B50505050505050500040.0%
Arcee AI: Trinity Large (Preview)50505050505050500040.0%
Mistral Small Creative50505050505050500040.0%
Arcee AI: Trinity Mini10010010010000000040.0%
Ministral 3B50505050505050500040.0%
Rocinante 12B10010010050500000040.0%
Claude Opus 4.55050505050505000035.0%
Grok 4.20 (Beta)1005050505050000035.0%
Qwen3 235B A22B Instruct 25075050505050505000035.0%
Mistral Small 45050505050505000035.0%
Ministral 3 14B5050505050505000035.0%
Claude 3 Haiku5050505050505000035.0%
GPT-5 Mini100100100000000030.0%
Grok 4.20 (Beta, Reasoning)100100100000000030.0%
GPT-5.4 (Reasoning, Low)100100100000000030.0%
Claude Sonnet 4.5505050505050000030.0%
Gemini 2.5 Flash Lite100100100000000030.0%
WizardLM 2 8x22b505050505050000030.0%
Ministral 3 3B505050505050000030.0%
Mistral NeMO100100100000000030.0%
Llama 3.1 8B505050505050000030.0%
Llama 3.1 Nemotron 70B50505050500000025.0%
Gemini 3.1 Pro (Preview)1001000000000020.0%
Claude Sonnet 45050505000000020.0%
Hermes 3 405B1001000000000020.0%
GPT-4o, Aug. 6th (temp=1)5050505000000020.0%
GPT-4o Mini (temp=1)5050505000000020.0%
Mistral Medium 3.15050505000000020.0%
Claude Opus 4505050000000015.0%
Z.AI GLM 4.5100500000000015.0%
Claude Haiku 4.5505050000000015.0%
GPT-4o, May 13th (temp=1)505050000000015.0%
DeepSeek V3.1505050000000015.0%
Llama 3.1 70B100500000000015.0%
Hermes 3 70B100500000000015.0%
Gemini 3 Pro (Preview)10000000000010.0%
Z.AI GLM 4.710000000000010.0%
Grok 410000000000010.0%
GPT-4.1 Mini50500000000010.0%
GPT-5.4 Mini10000000000010.0%
DeepSeek V3.250500000000010.0%
Cohere Command R+ (Aug. 2024)10000000000010.0%
Qwen 3.5 122B500000000005.0%
Qwen 3.5 27B500000000005.0%
Gemini 3 Flash (Preview, Reasoning)500000000005.0%
ByteDance Seed 2.0 Mini500000000005.0%
Gemini 3 Flash (Preview)500000000005.0%
Claude 3.5 Sonnet500000000005.0%
DeepSeek V3 (2024-12-26)500000000005.0%
Qwen 3.5 397B A17B00000000000.0%
GPT-5.4 Mini (Reasoning)00000000000.0%
Gemini 2.5 Pro00000000000.0%
Qwen 3.5 35B00000000000.0%
Qwen 3.5 Flash00000000000.0%
Grok 4 Fast00000000000.0%
Qwen 3.5 9B00000000000.0%
Qwen 3.5 Plus (2026-02-15)00000000000.0%
Gemini 3.1 Flash Lite (Preview)00000000000.0%
Mistral Large 300000000000.0%
GPT-4o, May 13th (temp=0)00000000000.0%
DeepSeek-V2 Chat00000000000.0%
ByteDance Seed 2.0 Lite00000000000.0%
GPT-5.400000000000.0%
Claude 3.7 Sonnet00000000000.0%
GPT-4o, Aug. 6th (temp=0)00000000000.0%
Mistral Large 200000000000.0%
DeepSeek V3 (2025-03-24)00000000000.0%
Gemini 2.5 Flash00000000000.0%
Mistral Large00000000000.0%
Mistral Small 3.2 24B00000000000.0%
Gemma 3 12B00000000000.0%
GPT-4o Mini (temp=0)00000000000.0%
Gemma 3 27B00000000000.0%
Gemma 3 4B00000000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
GPT-5.1100100100100100100100100100100100.0%
GPT-5100100100100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100100100100100100.0%
ByteDance Seed 1.6100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100100100100100100.0%
o4 Mini High100100100100100100100100100100100.0%
GPT-5.2100100100100100100100100100100100.0%
Grok 4.1 Fast100100100100100100100100100100100.0%
MiniMax M2.7100100100100100100100100100100100.0%
Z.AI GLM 4.7100100100100100100100100100100100.0%
GPT-4.1100100100100100100100100100100100.0%
o4 Mini100100100100100100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100100100100100100.0%
Grok 4 Fast100100100100100100100100100100100.0%
Stealth: Healer Alpha100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100100100100100100.0%
Nemotron 3 Super100100100100100100100100100100100.0%
Inception Mercury 2100100100100100100100100100100100.0%
GPT-5 Nano100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100100100100100100.0%
Inception Mercury100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100100100100100100.0%
Nemotron 3 Nano100100100100100100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100100100100100100.0%
LFM2 24B100100100100100100100100100100100.0%
Claude Sonnet 4.61001001001001001001001001005095.0%
Stealth: Hunter Alpha1001001001001001001001001005095.0%
Mistral Medium 3.11001001001001001001001001005095.0%
GPT-4.1 Nano1001001001001001001001001005095.0%
Aion 2.0100100100100100100100100100090.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100100100100090.0%
Ministral 3 8B100100100100100100100100100090.0%
MiniMax M2.51001001001001001001001000080.0%
Gemini 2.5 Pro1001001001001001001001000080.0%
Grok 41001001001001001001001000080.0%
Grok 4.20 (Beta)1001001001001001001005050080.0%
Ministral 8B100100100100100505050505075.0%
Gemini 2.5 Flash Lite10010010010010010010000070.0%
Mistral Small 3.2 24B10010010010010010010000070.0%
Hermes 3 70B10010010010010010050500070.0%
Z.AI GLM 4.51001001001001005050500065.0%
Gemini 3.1 Flash Lite (Preview)1001001001001001005000065.0%
GPT-5.4 Nano1001001001001001005000065.0%
Mistral Small 4 (Reasoning)100100100100100100000060.0%
Qwen 3 32B100100100100100100000060.0%
Qwen 3.5 27B10010010010010050000055.0%
Z.AI GLM 4.610010010010050505000055.0%
Claude Haiku 4.55050505050505050505050.0%
Z.AI GLM 4.7 Flash1001001001001000000050.0%
Claude Sonnet 4.5505050505050505050045.0%
GPT-4.1 Mini505050505050505050045.0%
Writer: Palmyra X5505050505050505050045.0%
Ministral 3 3B505050505050505050045.0%
Ministral 3B505050505050505050045.0%
Gemini 3.1 Pro (Preview)10010010010000000040.0%
Qwen 3.5 397B A17B10010010010000000040.0%
Llama 3.1 Nemotron 70B10050505050505000040.0%
Arcee AI: Trinity Mini10010010010000000040.0%
Claude Opus 4.65050505050505000035.0%
Claude Sonnet 45050505050505000035.0%
Qwen3 235B A22B Instruct 25075050505050505000035.0%
GPT-4o Mini (temp=1)1005050505050000035.0%
Mistral Small 41005050505050000035.0%
Qwen 3.5 122B100100100000000030.0%
Qwen 3.5 35B100100100000000030.0%
Claude Opus 4505050505050000030.0%
Qwen 3.5 9B100100100000000030.0%
WizardLM 2 8x22b100505050500000030.0%
Qwen 3.5 Plus (2026-02-15)50505050500000025.0%
GPT-4o, Aug. 6th (temp=1)10050505000000025.0%
DeepSeek V3.110050505000000025.0%
Qwen 2.5 72B10050505000000025.0%
GPT-4o, May 13th (temp=1)5050505000000020.0%
DeepSeek V3.25050505000000020.0%
Arcee AI: Trinity Large (Preview)5050505000000020.0%
Mistral Small Creative5050505000000020.0%
Mistral NeMO1001000000000020.0%
Llama 3.1 8B5050505000000020.0%
Rocinante 12B1005050000000020.0%
DeepSeek V3 (2025-03-24)505050000000015.0%
Ministral 3 14B505050000000015.0%
Gemini 3 Flash (Preview, Reasoning)10000000000010.0%
Claude Opus 4.550500000000010.0%
Gemini 3 Pro (Preview)10000000000010.0%
Qwen 3.5 Flash10000000000010.0%
Hermes 3 405B10000000000010.0%
GPT-5.4 Mini10000000000010.0%
Cohere Command R+ (Aug. 2024)10000000000010.0%
Gemini 3 Flash (Preview)500000000005.0%
DeepSeek-V2 Chat500000000005.0%
DeepSeek V3 (2024-12-26)500000000005.0%
Mistral Large 2500000000005.0%
Llama 3.1 70B500000000005.0%
Claude 3 Haiku500000000005.0%
Mistral Large 300000000000.0%
GPT-4o, May 13th (temp=0)00000000000.0%
ByteDance Seed 2.0 Lite00000000000.0%
GPT-5.400000000000.0%
Claude 3.5 Sonnet00000000000.0%
Claude 3.7 Sonnet00000000000.0%
GPT-4o, Aug. 6th (temp=0)00000000000.0%
Gemini 2.5 Flash00000000000.0%
Mistral Large00000000000.0%
Gemma 3 12B00000000000.0%
GPT-4o Mini (temp=0)00000000000.0%
Gemma 3 27B00000000000.0%
Gemma 3 4B00000000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
Claude Opus 4.6100100100100100100100100100100100.0%
ByteDance Seed 1.6100100100100100100100100100100100.0%
o4 Mini High100100100100100100100100100100100.0%
Grok 4.1 Fast100100100100100100100100100100100.0%
GPT-4.1100100100100100100100100100100100.0%
o4 Mini100100100100100100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100100100100100100.0%
Nemotron 3 Super100100100100100100100100100100100.0%
Inception Mercury 2100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100100100100100100.0%
LFM2 24B100100100100100100100100100100100.0%
Ministral 3 8B1001001001001001001001001005095.0%
GPT-5.1100100100100100100100100100090.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100100090.0%
GPT-5.4 Mini (Reasoning)100100100100100100100100100090.0%
Aion 2.0100100100100100100100100100090.0%
MiniMax M2.7100100100100100100100100100090.0%
ByteDance Seed 2.0 Mini100100100100100100100100100090.0%
Gemini 2.5 Flash (Reasoning)100100100100100100100100100090.0%
Mistral Small 4 (Reasoning)100100100100100100100100100090.0%
Qwen 3 32B100100100100100100100100100090.0%
Inception Mercury100100100100100100100100100090.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100100100100090.0%
Gemini 2.5 Pro1001001001001001001001000080.0%
GPT-5.4 Mini (Reasoning, Low)1001001001001001001001000080.0%
Z.AI GLM 4.7 Flash1001001001001001001001000080.0%
Nemotron 3 Nano1001001001001001001001000080.0%
Arcee AI: Trinity Mini1001001001001001001001000080.0%
Claude Sonnet 4100100100100100100505050075.0%
GPT-5.4 Nano100100100100100100505050075.0%
Ministral 8B100100100100100100505050075.0%
Z.AI GLM 510010010010010010010000070.0%
GPT-5.210010010010010010010000070.0%
Gemini 3 Pro (Preview)10010010010010010010000070.0%
Hermes 3 405B10010010010010010010000070.0%
GPT-5 Nano10010010010010010010000070.0%
Hermes 3 70B1001001001001005050500065.0%
Z.AI GLM 4.6100100100100100505000060.0%
MiniMax M2.5100100100100100100000060.0%
Stealth: Healer Alpha100100100100100100000060.0%
Cohere Command R+ (Aug. 2024)100100100100100100000060.0%
Claude Haiku 4.510050505050505050505055.0%
Grok 41001001001001000000050.0%
Qwen 3.5 9B1001001001001000000050.0%
Mistral Medium 3.15050505050505050505050.0%
Arcee AI: Trinity Large (Preview)5050505050505050505050.0%
Mistral Small Creative5050505050505050505050.0%
Ministral 3 3B5050505050505050505050.0%
Ministral 3B5050505050505050505050.0%
Claude Sonnet 4.6505050505050505050045.0%
GPT-4.1 Mini505050505050505050045.0%
GPT-4o, Aug. 6th (temp=1)100100100505050000045.0%
Qwen 2.5 72B100505050505050500045.0%
GPT-510010010010000000040.0%
Qwen 3.5 397B A17B10010010010000000040.0%
MoonshotAI: Kimi K2.510010010010000000040.0%
Qwen 3.5 27B10010010050500000040.0%
Gemini 3 Flash (Preview, Reasoning)10010010010000000040.0%
Z.AI GLM 4.710010010010000000040.0%
Z.AI GLM 4.510010010050500000040.0%
Claude 3.5 Sonnet10010010010000000040.0%
Ministral 3 14B50505050505050500040.0%
GPT-4.1 Nano10010010010000000040.0%
WizardLM 2 8x22b1005050505050000035.0%
Qwen 3.5 122B100100100000000030.0%
Stealth: Hunter Alpha100100100000000030.0%
Grok 4 Fast100100100000000030.0%
Qwen 3.5 Plus (2026-02-15)505050505050000030.0%
Grok 4.20 (Beta)100505050500000030.0%
Writer: Palmyra X5100100505000000030.0%
Llama 3.1 Nemotron 70B505050505050000030.0%
Llama 3.1 8B505050505050000030.0%
Mistral Large 350505050500000025.0%
Claude 3.7 Sonnet10010050000000025.0%
Mistral Large 250505050500000025.0%
Mistral Small 450505050500000025.0%
Qwen 3.5 35B1001000000000020.0%
Llama 3.1 70B1005050000000020.0%
Mistral NeMO1005050000000020.0%
DeepSeek V3.1505050000000015.0%
GPT-4o Mini (temp=0)505050000000015.0%
Claude 3 Haiku505050000000015.0%
Rocinante 12B100500000000015.0%
GPT-5.4 (Reasoning)10000000000010.0%
GPT-5.4 (Reasoning, Low)10000000000010.0%
Claude Opus 4.510000000000010.0%
Claude Sonnet 4.510000000000010.0%
Qwen 3.5 Flash10000000000010.0%
Gemini 3.1 Flash Lite (Preview)50500000000010.0%
ByteDance Seed 2.0 Lite10000000000010.0%
GPT-5.4 Mini10000000000010.0%
DeepSeek V3.250500000000010.0%
Mistral Large50500000000010.0%
Qwen3 235B A22B Instruct 250710000000000010.0%
GPT-4o Mini (temp=1)50500000000010.0%
Claude Opus 4500000000005.0%
GPT-4o, May 13th (temp=0)500000000005.0%
GPT-4o, May 13th (temp=1)500000000005.0%
DeepSeek V3 (2024-12-26)500000000005.0%
GPT-4o, Aug. 6th (temp=0)500000000005.0%
Gemini 2.5 Flash500000000005.0%
Mistral Small 3.2 24B500000000005.0%
Gemma 3 12B500000000005.0%
Gemini 3 Flash (Preview)00000000000.0%
DeepSeek-V2 Chat00000000000.0%
GPT-5.400000000000.0%
DeepSeek V3 (2025-03-24)00000000000.0%
Gemini 2.5 Flash Lite00000000000.0%
Gemma 3 27B00000000000.0%
Gemma 3 4B00000000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
Claude Opus 4.6100100100100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100100100100.0%
ByteDance Seed 1.6100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100100100100100100.0%
Grok 4.1 Fast100100100100100100100100100100100.0%
MiniMax M2.7100100100100100100100100100100100.0%
o4 Mini100100100100100100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100100100100100100.0%
Nemotron 3 Super100100100100100100100100100100100.0%
Inception Mercury 2100100100100100100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100100100100100100.0%
Inception Mercury100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100100100100100100.0%
LFM2 24B100100100100100100100100100100100.0%
Z.AI GLM 4.61001001001001001001001001005095.0%
GPT-5.1100100100100100100100100100090.0%
Z.AI GLM 5100100100100100100100100100090.0%
o4 Mini High100100100100100100100100100090.0%
MiniMax M2.5100100100100100100100100100090.0%
ByteDance Seed 2.0 Mini100100100100100100100100100090.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100100100100090.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100100100100090.0%
Nemotron 3 Nano100100100100100100100100100090.0%
Claude Sonnet 4.6 (Reasoning)1001001001001001001001000080.0%
Qwen 3.5 27B1001001001001001001001000080.0%
GPT-5 Nano1001001001001001001001000080.0%
Arcee AI: Trinity Mini1001001001001001001001000080.0%
Z.AI GLM 4.5100100100100100100100500075.0%
GPT-5.4 Nano100100100100100100100500075.0%
GPT-4.1 Nano100100100100100505050505075.0%
GPT-5.210010010010010010010000070.0%
Aion 2.010010010010010010010000070.0%
Gemini 3 Pro (Preview)10010010010010010010000070.0%
Grok 4 Fast10010010010010010010000070.0%
Qwen 3 32B10010010010010010010000070.0%
Llama 3.1 Nemotron 70B10010010010010010050500070.0%
GPT-4.1 Mini1001001001005050505050065.0%
Qwen 3.5 122B100100100100100100000060.0%
Claude Haiku 4.5100100505050505050505060.0%
Llama 3.1 70B100100100100100505000060.0%
Qwen 2.5 72B100100100100505050500060.0%
Cohere Command R+ (Aug. 2024)100100100100100100000060.0%
Qwen 3.5 397B A17B1001001001001000000050.0%
GPT-5.4 (Reasoning, Low)1001001001001000000050.0%
Gemini 3 Flash (Preview, Reasoning)1001001001001000000050.0%
Gemini 2.5 Pro1001001001001000000050.0%
Grok 41001001001001000000050.0%
Hermes 3 70B1001001005050505000050.0%
Ministral 3 14B5050505050505050505050.0%
Claude Sonnet 4.6505050505050505050045.0%
Claude Sonnet 4100100100100500000045.0%
Ministral 3 3B505050505050505050045.0%
Ministral 8B505050505050505050045.0%
Ministral 3B505050505050505050045.0%
Qwen 3.5 9B10010010010000000040.0%
Stealth: Healer Alpha10010010010000000040.0%
Mistral Large 350505050505050500040.0%
Claude 3.5 Sonnet10010010050500000040.0%
Grok 4.20 (Beta)10010010050500000040.0%
Hermes 3 405B10010010010000000040.0%
Mistral Medium 3.150505050505050500040.0%
Arcee AI: Trinity Large (Preview)10050505050505000040.0%
Mistral Small Creative50505050505050500040.0%
GPT-4.11001005050500000035.0%
Ministral 3 8B5050505050505000035.0%
Rocinante 12B1001001005000000035.0%
Z.AI GLM 4.7100100100000000030.0%
GPT-4o, Aug. 6th (temp=1)100505050500000030.0%
Mistral Large 2505050505050000030.0%
Mistral Large505050505050000030.0%
Claude Sonnet 4.510010050000000025.0%
DeepSeek V3.250505050500000025.0%
GPT-4o Mini (temp=1)10050505000000025.0%
GPT-51001000000000020.0%
Stealth: Hunter Alpha1001000000000020.0%
DeepSeek-V2 Chat1001000000000020.0%
GPT-5.4 Mini1001000000000020.0%
GPT-4o Mini (temp=0)5050505000000020.0%
Mistral Small 45050505000000020.0%
Claude 3 Haiku5050505000000020.0%
GPT-4o, May 13th (temp=1)505050000000015.0%
DeepSeek V3 (2024-12-26)100500000000015.0%
WizardLM 2 8x22b505050000000015.0%
Llama 3.1 8B505050000000015.0%
GPT-5.4 (Reasoning)10000000000010.0%
Qwen 3.5 35B10000000000010.0%
Qwen 3.5 Plus (2026-02-15)50500000000010.0%
GPT-5.410000000000010.0%
Writer: Palmyra X550500000000010.0%
Claude 3.7 Sonnet500000000005.0%
DeepSeek V3.1500000000005.0%
Qwen3 235B A22B Instruct 2507500000000005.0%
MoonshotAI: Kimi K2.500000000000.0%
Claude Opus 4.500000000000.0%
Claude Opus 400000000000.0%
Qwen 3.5 Flash00000000000.0%
Gemini 3.1 Flash Lite (Preview)00000000000.0%
GPT-4o, May 13th (temp=0)00000000000.0%
Gemini 3 Flash (Preview)00000000000.0%
ByteDance Seed 2.0 Lite00000000000.0%
GPT-4o, Aug. 6th (temp=0)00000000000.0%
DeepSeek V3 (2025-03-24)00000000000.0%
Gemini 2.5 Flash Lite00000000000.0%
Gemini 2.5 Flash00000000000.0%
Mistral Small 3.2 24B00000000000.0%
Gemma 3 12B00000000000.0%
Gemma 3 27B00000000000.0%
Gemma 3 4B00000000000.0%
Mistral NeMO00000000000.0%

detailed entries

Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
GPT-5.1100100100100100100100100100100100.0%
Claude Opus 4.6100100100100100100100100100100100.0%
GPT-5100100100100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100100100100.0%
o4 Mini High100100100100100100100100100100100.0%
Grok 4.1 Fast100100100100100100100100100100100.0%
Aion 2.0100100100100100100100100100100100.0%
GPT-4.1100100100100100100100100100100100.0%
o4 Mini100100100100100100100100100100100.0%
Stealth: Hunter Alpha100100100100100100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100100100100100100.0%
Stealth: Healer Alpha100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100100100100100100.0%
Nemotron 3 Super100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100100100100100100.0%
Inception Mercury100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100100100100100100.0%
Ministral 3 8B100100100100100100100100100100100.0%
Ministral 8B100100100100100100100100100100100.0%
Ministral 3 14B1001001001001001001001001005095.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100100100100090.0%
GPT-5.4 (Reasoning, Low)100100100100100100100100100090.0%
MoonshotAI: Kimi K2.5100100100100100100100100100090.0%
GPT-5.4 Mini (Reasoning)100100100100100100100100100090.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100100100100090.0%
GPT-5.2100100100100100100100100100090.0%
MiniMax M2.7100100100100100100100100100090.0%
Grok 4 Fast100100100100100100100100100090.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100100100100090.0%
GPT-5 Nano100100100100100100100100100090.0%
Arcee AI: Trinity Mini100100100100100100100100100090.0%
Qwen 3.5 122B1001001001001001001001000080.0%
ByteDance Seed 1.61001001001001001001001000080.0%
MiniMax M2.51001001001001001001001000080.0%
Gemini 2.5 Flash (Reasoning)1001001001001001001001000080.0%
Qwen 3.5 Flash1001001001001001001001000080.0%
Inception Mercury 21001001001001001001001000080.0%
Nemotron 3 Nano1001001001001001001001000080.0%
ByteDance Seed 1.6 Flash1001001001001001001001000080.0%
Grok 410010010010010010010000070.0%
Z.AI GLM 4.510010010010010010050500070.0%
GPT-5.4 Nano10010010010010010010000070.0%
Qwen 2.5 72B1001001001001005050500065.0%
Gemini 3.1 Pro (Preview)100100100100100100000060.0%
GPT-5.4 (Reasoning)100100100100100100000060.0%
Claude Sonnet 4.6100100505050505050505060.0%
Z.AI GLM 4.7100100100100100100000060.0%
Qwen 3.5 35B100100100100100100000060.0%
Arcee AI: Trinity Large (Preview)100100100505050505050060.0%
Z.AI GLM 4.610010010010010050000055.0%
Qwen 3 32B10010010010010050000055.0%
Claude Sonnet 4.51001005050505050500050.0%
Claude Haiku 4.55050505050505050505050.0%
Mistral Small 4 (Reasoning)1001001001001000000050.0%
Mistral Medium 3.15050505050505050505050.0%
GPT-4.1 Nano1001001001005050000050.0%
Cohere Command R+ (Aug. 2024)1001001001001000000050.0%
GPT-4.1 Mini505050505050505050045.0%
Mistral Small Creative505050505050505050045.0%
Qwen 3.5 397B A17B10010010010000000040.0%
Gemini 3 Pro (Preview)10010010010000000040.0%
Gemini 2.5 Flash Lite10010010050500000040.0%
Qwen3 235B A22B Instruct 250710050505050505000040.0%
Ministral 3B5050505050505000035.0%
Qwen 3.5 9B100100505000000030.0%
GPT-4o, Aug. 6th (temp=1)100100505000000030.0%
Writer: Palmyra X5505050505050000030.0%
Mistral Small 3.2 24B100100100000000030.0%
WizardLM 2 8x22b100100505000000030.0%
Ministral 3 3B505050505050000030.0%
Llama 3.1 8B505050505050000030.0%
Grok 4.20 (Beta)10050505000000025.0%
GPT-4o, May 13th (temp=1)50505050500000025.0%
GPT-4o, Aug. 6th (temp=0)50505050500000025.0%
DeepSeek V3.150505050500000025.0%
Claude Opus 45050505000000020.0%
GPT-5.4 Mini1001000000000020.0%
DeepSeek V3.25050505000000020.0%
Mistral Small 45050505000000020.0%
Claude Opus 4.5505050000000015.0%
Claude 3.7 Sonnet100500000000015.0%
Mistral Large505050000000015.0%
GPT-4o Mini (temp=1)505050000000015.0%
Llama 3.1 Nemotron 70B505050000000015.0%
Hermes 3 70B100500000000015.0%
Claude 3 Haiku505050000000015.0%
Rocinante 12B505050000000015.0%
Gemini 3 Flash (Preview)50500000000010.0%
DeepSeek-V2 Chat50500000000010.0%
Z.AI GLM 4.7 Flash10000000000010.0%
Qwen 3.5 27B500000000005.0%
Claude Sonnet 4500000000005.0%
Gemini 3.1 Flash Lite (Preview)500000000005.0%
Mistral Large 3500000000005.0%
Hermes 3 405B500000000005.0%
Mistral Large 2500000000005.0%
Llama 3.1 70B500000000005.0%
LFM2 24B500000000005.0%
Gemini 2.5 Pro00000000000.0%
Qwen 3.5 Plus (2026-02-15)00000000000.0%
GPT-4o, May 13th (temp=0)00000000000.0%
GPT-5.400000000000.0%
Claude 3.5 Sonnet00000000000.0%
DeepSeek V3 (2024-12-26)00000000000.0%
DeepSeek V3 (2025-03-24)00000000000.0%
Gemini 2.5 Flash00000000000.0%
Gemma 3 12B00000000000.0%
GPT-4o Mini (temp=0)00000000000.0%
Gemma 3 27B00000000000.0%
Gemma 3 4B00000000000.0%
Mistral NeMO00000000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
GPT-5.1100100100100100100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100100100100100100.0%
o4 Mini High100100100100100100100100100100100.0%
GPT-4.1100100100100100100100100100100100.0%
Stealth: Healer Alpha100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100100100100100100.0%
Nemotron 3 Super100100100100100100100100100100100.0%
Inception Mercury 2100100100100100100100100100100100.0%
Inception Mercury100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100100100100100100.0%
LFM2 24B100100100100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100100090.0%
GPT-5100100100100100100100100100090.0%
Gemini 3 Pro (Preview)100100100100100100100100100090.0%
o4 Mini100100100100100100100100100090.0%
ByteDance Seed 2.0 Mini100100100100100100100100100090.0%
Z.AI GLM 4.7 Flash100100100100100100100100100090.0%
GPT-5 Nano100100100100100100100100100090.0%
GPT-5.4 Nano (Reasoning)100100100100100100100100100090.0%
Claude Sonnet 4.610010010010010010010010050085.0%
Grok 4.20 (Beta, Reasoning)1001001001001001001001000080.0%
Grok 4.1 Fast1001001001001001001001000080.0%
Qwen 3.5 9B1001001001001001001001000080.0%
Qwen 3 32B1001001001001001001001000080.0%
Nemotron 3 Nano1001001001001001001001000080.0%
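Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 75.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 75.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 65.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 65.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 65.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 60.0%
Hermes 3 70B | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 60.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 55.0%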
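GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Claude Opus 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Claude Haiku 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Ministral 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
GPT-4.1 Mini | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Ministral 8B | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 45.0%
Claude Opus 4.6 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 40.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Mistral Small 4 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 40.0%
Qwen 2.5 72B | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 40.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 40.0%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
WizardLM 2 8x22b | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%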
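Qwen 3.5 122B | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Claude Sonnet 4 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 30.0%
Grok 4 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Claude Opus 4 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
GPT-5.4 Nano | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 30.0%
Ministral 3 8B | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Mistral Large 3 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Grok 4.20 (Beta) | 100 | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
Ministral 3 3B | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Llama 3.1 8B | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Gemini 2.5 Pro | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=1) | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Hermes 3 405B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Gemini 2.5 Flash Lite | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 70B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Claude 3 Haiku | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Arcee AI: Trinity Mini | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Mistral Large 2 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Mistral Small 3.2 24B | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Mistral Small Creative | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Ministral 3 14B | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%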
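MoonshotAI: Kimi K2.5 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
GPT-5.4 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Claude 3.7 Sonnet | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
DeepSeek V3.1 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
DeepSeek V3.2 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Mistral Large | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Qwen3 235B A22B Instruct 2507 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Writer: Palmyra X5 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Claude Sonnet 4.5 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Gemini 3.1 Flash Lite (Preview) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
DeepSeek-V2 Chat | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
DeepSeek V3 (2024-12-26) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Rocinante 12B | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Qwen 3.5 397B A17B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
ByteDance Seed 2.0 Lite | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2025-03-24) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%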
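Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg ▼
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%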
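GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%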
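Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 65.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 60.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 60.0%
Ministral 3 14B | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Qwen 2.5 72B | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 50.0%
GPT-4.1 Nano | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 50.0%
Mistral Small Creative | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Claude Haiku 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Claude 3.5 Sonnet | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 35.0%
Arcee AI: Trinity Large (Preview) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
WizardLM 2 8x22b | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 35.0%
Ministral 3 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Ministral 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%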
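Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Claude Sonnet 4 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Mistral Small 3.2 24B | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Llama 3.1 8B | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Claude Opus 4 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Grok 4.20 (Beta) | 100 | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
Mistral Small 4 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Llama 3.1 Nemotron 70B | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Hermes 3 70B | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o Mini (temp=1) | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 70B | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=1) | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Writer: Palmyra X5 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Hermes 3 405B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Claude 3 Haiku | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Rocinante 12B | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Gemini 3.1 Flash Lite (Preview) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
GPT-4o, May 13th (temp=0) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
DeepSeek V3.1 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
DeepSeek V3.2 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Gemma 3 12B | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
LFM2 24B | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%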
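Claude Sonnet 4.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek-V2 Chat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2024-12-26) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%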
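Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg ▼
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%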
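Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 85.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%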
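GPT-4.1 Nano | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 65.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 60.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 55.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 50.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 45.0%
Arcee AI: Trinity Large (Preview) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Mistral Small Creative | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 45.0%
Claude 3 Haiku | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Ministral 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
GPT-5 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Qwen 2.5 72B | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 40.0%
Hermes 3 70B | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 40.0%
Ministral 3 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 40.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
GPT-4o Mini (temp=1) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Llama 3.1 Nemotron 70B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Llama 3.1 8B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%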
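Qwen 3.5 35B | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Claude 3.7 Sonnet | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Mistral Small 4 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Claude Opus 4 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=1) | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Grok 4.20 (Beta) | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
ByteDance Seed 2.0 Mini | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
GPT-5.4 Mini | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
DeepSeek V3.2 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
DeepSeek V3 (2025-03-24) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Gemini 2.5 Flash Lite | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Qwen3 235B A22B Instruct 2507 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Gemini 3.1 Flash Lite (Preview) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
GPT-4o, May 13th (temp=0) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
WizardLM 2 8x22b | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Claude Sonnet 4.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek-V2 Chat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2024-12-26) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Writer: Palmyra X5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%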