Matches sentence count

Test: Write N of X

Avg. Score: 89.7%
Scenarios: 5

Overall Performance

| Rank | Model | Score | Avg. Cost | Avg. Time | Stability |
|---|---|---|---|---|---|
| 1 | Gemini 3 Flash (Preview) | 100.0% | $0.0019 | 3.2s | 100% |
| 2 | GPT-5.4 Mini (Reasoning, Low) | 99.9% | $0.0021 | 2.2s | 99% |
| 3 | Mistral Large 3 | 100.0% | $0.0013 | 6.2s | 100% |
| 4 | Mistral Small 3.2 24B | 100.0% | $0.0003 | 8.7s | 100% |
| 5 | GPT-5.4 Mini (Reasoning) | 100.0% | $0.0033 | 4.9s | 100% |
| 6 | Llama 3.1 Nemotron 70B | 100.0% | $0.0007 | 12.6s | 100% |
| 7 | Grok 4.20 (Beta) | 99.4% | $0.0023 | 1.4s | 94% |
| 8 | DeepSeek V4 Flash | 99.3% | $0.0001 | 5.8s | 93% |
| 9 | GPT-5 Mini | 100.0% | $0.0022 | 10.8s | 100% |
| 10 | Qwen 3.5 Plus (2026-02-15) | 100.0% | $0.0014 | 14.5s | 100% |
| 11 | Grok 4.20 | 99.2% | $0.0024 | 3.8s | 93% |
| 12 | Gemini 3.1 Flash Lite (Preview) | 98.6% | $0.0008 | 2.0s | 89% |
| 13 | Grok 4.20 (Beta, Reasoning) | 100.0% | $0.0071 | 4.8s | 100% |
| 14 | Grok 4.20 (Reasoning) | 100.0% | $0.0041 | 11.6s | 100% |
| 15 | GPT-5.4 | 100.0% | $0.0065 | 7.1s | 100% |
| 16 | GPT-5.4 (Reasoning, Low) | 100.0% | $0.0073 | 6.7s | 100% |
| 17 | Gemma 4 31B | 100.0% | $0.0003 | 23.8s | 100% |
| 18 | Z.AI GLM 5 Turbo | 100.0% | $0.0051 | 13.8s | 100% |
| 19 | Qwen 3.6 Flash | 100.0% | $0.0049 | 15.9s | 100% |
| 20 | GPT-5.2 | 99.8% | $0.0081 | 8.8s | 99% |
| 21 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.0080 | 13.6s | 100% |
| 22 | Inception Mercury 2 | 97.0% | $0.0009 | 1.3s | 81% |
| 23 | GPT-5.4 Mini | 98.2% | $0.0020 | 2.2s | 80% |
| 24 | GPT-5 | 100.0% | $0.0095 | 16.2s | 100% |
| 25 | o4 Mini High | 99.9% | $0.0081 | 18.2s | 99% |
| 26 | Z.AI GLM 4.7 Flash | 100.0% | $0.0009 | 35.2s | 100% |
| 27 | GPT-5.1 | 99.8% | $0.0095 | 14.1s | 98% |
| 28 | Stealth: Aurora Alpha | 97.4% | | 5.1s | 84% |
| 29 | GPT-5.4 (Reasoning) | 100.0% | $0.012 | 12.0s | 100% |
| 30 | Inception Mercury | 96.8% | $0.0004 | 1.6s | 73% |
| 31 | Grok 4.3 (Reasoning) | 100.0% | $0.0060 | 27.7s | 100% |
| 32 | MoonshotAI: Kimi K2.5 | 100.0% | $0.0045 | 31.5s | 100% |
| 33 | Mistral Small Creative | 96.8% | $0.0003 | 2.6s | 72% |
| 34 | GPT-5.5 | 100.0% | $0.017 | 7.1s | 100% |
| 35 | GPT-5.5 (Reasoning, Low) | 100.0% | $0.017 | 7.6s | 100% |
| 36 | Qwen 3.5 Flash | 100.0% | $0.0024 | 39.4s | 100% |
| 37 | ByteDance Seed 1.6 | 100.0% | $0.0036 | 36.9s | 100% |
| 38 | Mistral Medium 3.1 | 97.2% | $0.0014 | 8.1s | 71% |
| 39 | Nemotron 3 Super | 98.0% | $0.0000 | 14.0s | 72% |
| 40 | Gemini 2.5 Flash Lite (Reasoning) | 96.5% | $0.0009 | 10.8s | 71% |
| 41 | DeepSeek V3 (2025-03-24) | 98.0% | $0.0006 | 15.7s | 72% |
| 42 | GPT-5.4 Nano (Reasoning) | 95.8% | $0.0010 | 3.9s | 65% |
| 43 | o4 Mini | 98.4% | $0.0063 | 14.9s | 80% |
| 44 | Mistral Small 4 (Reasoning) | 96.1% | $0.0010 | 9.0s | 68% |
| 45 | Gemini 2.5 Pro | 99.4% | $0.016 | 13.2s | 94% |
| 46 | Grok 4.1 Fast | 95.6% | $0.0005 | 5.6s | 64% |
| 47 | DeepSeek-V2 Chat | 97.9% | $0.0003 | 20.5s | 72% |
| 48 | Claude Opus 4.7 (Reasoning) | 99.9% | $0.024 | 6.5s | 99% |
| 49 | Qwen 3.5 35B | 100.0% | $0.011 | 34.9s | 100% |
| 50 | Claude Opus 4.6 | 99.9% | $0.021 | 12.9s | 99% |
| 51 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.020 | 15.2s | 100% |
| 52 | Gemma 3 27B | 95.4% | $0.0003 | 12.6s | 69% |
| 53 | GPT-5.4 Nano (Reasoning, Low) | 93.2% | $0.0007 | 2.5s | 65% |
| 54 | Gemini 3.5 Flash (Reasoning) | 100.0% | $0.023 | 9.9s | 100% |
| 55 | Claude Opus 4.7 | 99.6% | $0.023 | 6.3s | 97% |
| 56 | Gemma 3 4B | 95.0% | $0.0001 | 5.0s | 61% |
| 57 | Gemini 3.1 Flash Lite | 94.7% | $0.0008 | 2.4s | 60% |
| 58 | Qwen 3.6 27B | 100.0% | $0.0096 | 42.0s | 100% |
| 59 | Qwen 3 32B | 95.8% | $0.0005 | 16.3s | 69% |
| 60 | Qwen 3.5 Plus (2026-04-20) | 100.0% | $0.0079 | 46.0s | 100% |
| 61 | GPT-5.5 (Reasoning) | 99.9% | $0.025 | 10.5s | 99% |
| 62 | Z.AI GLM 5 | 100.0% | $0.0074 | 48.6s | 100% |
| 63 | GPT-5 Nano | 98.0% | $0.0010 | 26.0s | 72% |
| 64 | MoonshotAI: Kimi K2.6 | 100.0% | $0.0077 | 49.9s | 100% |
| 65 | Ministral 3 14B | 92.1% | $0.0004 | 3.7s | 57% |
| 66 | Gemini 3 Pro (Preview) | 100.0% | $0.025 | 16.6s | 100% |
| 67 | ByteDance Seed 2.0 Lite | 100.0% | $0.0055 | 59.0s | 100% |
| 68 | Qwen 3.6 35B | 98.0% | $0.0048 | 24.5s | 72% |
| 69 | GPT-OSS 120B | 96.9% | $0.0005 | 32.1s | 71% |
| 70 | GPT-5.4 Nano | 93.5% | $0.0007 | 3.3s | 53% |
| 71 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.029 | 14.3s | 100% |
| 72 | DeepSeek V4 Flash (Reasoning) | 93.1% | $0.0002 | 7.4s | 51% |
| 73 | Gemma 4 31B (Reasoning) | 100.0% | $0.0008 | 1.3m | 100% |
| 74 | ByteDance Seed 2.0 Mini | 100.0% | $0.0014 | 1.3m | 100% |
| 75 | Qwen 3.5 9B | 100.0% | $0.0010 | 1.3m | 100% |
| 76 | MiniMax M2.7 | 95.5% | $0.0029 | 34.0s | 71% |
| 77 | Gemma 4 26B (Reasoning) | 100.0% | $0.0011 | 1.4m | 100% |
| 78 | Nemotron 3 Nano | 92.8% | $0.0004 | 21.8s | 58% |
| 79 | Claude Opus 4.5 | 95.9% | $0.017 | 8.5s | 67% |
| 80 | Z.AI GLM 4.6 | 97.7% | $0.0030 | 47.4s | 72% |
| 81 | Stealth: Healer Alpha | 90.5% | $0.0000 | 7.8s | 45% |
| 82 | Claude Sonnet 4.6 | 94.7% | $0.011 | 8.7s | 59% |
| 83 | DeepSeek V4 Pro (Reasoning) | 99.5% | $0.0054 | 1.2m | 94% |
| 84 | Gemini 3.1 Flash Lite (Reasoning) | 90.1% | $0.0008 | 3.1s | 42% |
| 85 | Grok 4 Fast | 89.9% | $0.0005 | 4.2s | 40% |
| 86 | Claude Sonnet 4.5 | 92.0% | $0.010 | 7.5s | 56% |
| 87 | DeepSeek V3.2 | 92.1% | $0.0005 | 21.2s | 49% |
| 88 | Grok 4.3 | 89.4% | $0.0024 | 4.2s | 42% |
| 89 | Z.AI GLM 4.7 | 100.0% | $0.0040 | 1.5m | 100% |
| 90 | Gemma 4 26B | 89.6% | $0.0003 | 13.6s | 44% |
| 91 | Gemini 3.1 Pro (Preview) | 100.0% | $0.034 | 29.2s | 100% |
| 92 | Z.AI GLM 5.1 | 100.0% | $0.013 | 1.3m | 100% |
| 93 | Qwen3.7 Max | 100.0% | $0.027 | 46.7s | 100% |
| 94 | Mistral Small 4 | 85.8% | $0.0005 | 4.2s | 39% |
| 95 | Qwen 3.5 122B | 100.0% | $0.023 | 55.5s | 100% |
| 96 | DeepSeek V3 (2024-12-26) | 88.6% | $0.0009 | 12.2s | 41% |
| 97 | DeepSeek V4 Pro | 90.0% | $0.0019 | 24.8s | 49% |
| 98 | Llama 3.1 8B | 85.6% | $0.0004 | 1.7s | 33% |
| 99 | Stealth: Hunter Alpha | 87.9% | $0.0000 | 19.7s | 42% |
| 100 | Gemini 2.5 Flash (Reasoning) | 85.2% | $0.0037 | 7.0s | 42% |
| 101 | Xiaomi MIMO v2.5 | 86.7% | $0.0015 | 7.5s | 34% |
| 102 | Qwen3.6 Max Preview | 100.0% | $0.021 | 1.2m | 100% |
| 103 | GPT-4.1 | 86.0% | $0.0041 | 5.8s | 32% |
| 104 | Xiaomi MIMO v2.5 Pro | 83.5% | $0.0017 | 9.5s | 31% |
| 105 | Aion 2.0 | 89.1% | $0.0031 | 27.9s | 39% |
| 106 | Claude 3.5 Sonnet | 93.2% | $0.010 | 38.4s | 54% |
| 107 | DeepSeek V3.1 | 83.3% | $0.0009 | 6.9s | 27% |
| 108 | Z.AI GLM 4.5 Air | 83.7% | $0.0008 | 11.6s | 29% |
| 109 | ByteDance Seed 1.6 Flash | 82.3% | $0.0005 | 8.3s | 26% |
| 110 | GPT-4.1 Mini | 79.4% | $0.0008 | 5.1s | 28% |
| 111 | Llama 3.1 70B | 80.5% | $0.0017 | 3.1s | 26% |
| 112 | Grok 4 | 88.0% | $0.011 | 15.0s | 35% |
| 113 | Gemma 3 12B | 80.1% | $0.0002 | 9.4s | 23% |
| 114 | Qwen3 235B A22B Instruct 2507 | 80.6% | $0.0003 | 10.5s | 22% |
| 115 | Gemini 3.5 Flash (Reasoning, Minimal) | 79.2% | $0.0051 | 2.5s | 22% |
| 116 | Claude Sonnet 4 | 84.2% | $0.012 | 10.0s | 31% |
| 117 | Hermes 3 405B | 80.0% | $0.0000 | 18.7s | 24% |
| 118 | Writer: Palmyra X5 | 79.9% | $0.0032 | 8.9s | 20% |
| 119 | Ministral 3 8B | 74.1% | $0.0003 | 3.0s | 19% |
| 120 | Ministral 3 3B | 74.8% | $0.0002 | 1.6s | 15% |
| 121 | Gemini 2.5 Flash Lite | 73.3% | $0.0003 | 1.7s | 18% |
| 122 | GPT-4.1 Nano | 75.2% | $0.0003 | 6.8s | 19% |
| 123 | Qwen 3.5 397B A17B | 100.0% | $0.017 | 2.0m | 100% |
| 124 | Z.AI GLM 4.5 | 74.8% | $0.0012 | 8.5s | 16% |
| 125 | GPT-4o, Aug. 6th (temp=1) | 74.2% | $0.0060 | 3.2s | 18% |
| 126 | GPT-4o, Aug. 6th (temp=0) | 74.0% | $0.0058 | 3.1s | 18% |
| 127 | Qwen 3.5 27B | 92.5% | $0.014 | 1.0m | 49% |
| 128 | MiniMax M2.5 | 75.6% | $0.0014 | 19.1s | 15% |
| 129 | Claude Haiku 4.5 | 70.3% | $0.0031 | 4.3s | 14% |
| 130 | Gemini 2.5 Flash | 64.8% | $0.0011 | 2.2s | 16% |
| 131 | Qwen 2.5 72B | 66.7% | $0.0007 | 6.3s | 11% |
| 132 | Ministral 3B | 62.2% | $0.0001 | 2.2s | 9% |
| 133 | GPT-4o Mini (temp=0) | 79.5% | $0.0004 | 53.1s | 20% |
| 134 | GPT-4o Mini (temp=1) | 79.6% | $0.0004 | 54.0s | 20% |
| 135 | Ministral 8B | 60.4% | $0.0002 | 2.8s | 9% |
| 136 | LFM2 24B | 63.9% | $0.0001 | 9.8s | 4% |
| 137 | WizardLM 2 8x22b | 71.5% | $0.0017 | 38.3s | 14% |
| 138 | Claude 3.7 Sonnet | 66.3% | $0.010 | 8.4s | 12% |
| 139 | GPT-4o, May 13th (temp=0) | 71.8% | $0.011 | 26.0s | 17% |
| 140 | Arcee AI: Trinity Large (Preview) | 57.2% | $0.0000 | 5.1s | 5% |
| 141 | Hermes 3 70B | 58.1% | $0.0007 | 6.7s | 6% |
| 142 | GPT-4o, May 13th (temp=1) | 74.0% | $0.011 | 33.0s | 17% |
| 143 | Cohere Command R+ (Aug. 2024) | 60.4% | $0.0059 | 4.3s | 6% |
| 144 | Mistral Large 2 | 58.8% | $0.0045 | 4.2s | 4% |
| 145 | Claude 3 Haiku | 61.9% | $0.0008 | 36.4s | 9% |
| 146 | Mistral NeMO | 47.6% | $0.0003 | 3.6s | 3% |
| 147 | Rocinante 12B | 53.1% | $0.0006 | 24.0s | 5% |
| 148 | Arcee AI: Trinity Mini | 44.0% | $0.0002 | 4.6s | 0% |
| 149 | Claude Opus 4 | 80.2% | $0.054 | 18.2s | 24% |
| 150 | Mistral Large | 63.8% | $0.020 | 52.3s | 12% |
Average score: 89.71%

Individual Scenarios

sentences

Model #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Avg
Qwen3.7 Max100100100100100100100100100100100.0%
Claude Opus 4.6 (Reasoning)100100100100100100100100100100100.0%
Qwen3.6 Max Preview100100100100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
Z.AI GLM 5.1100100100100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100100100100.0%
Gemini 3.5 Flash (Reasoning)100100100100100100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100100100100100100.0%
GPT-5.1100100100100100100100100100100100.0%
Claude Opus 4.6100100100100100100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100100100100100100.0%
GPT-5100100100100100100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100100100100100100.0%
Qwen 3.5 122B100100100100100100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100100100100100100.0%
Grok 4.20 (Reasoning)100100100100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100100100100.0%
Claude Sonnet 4.6100100100100100100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100100100100100100.0%
Qwen 3.5 27B100100100100100100100100100100100.0%
ByteDance Seed 1.6100100100100100100100100100100100.0%
Qwen 3.6 Flash100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100100100100100100.0%
o4 Mini High100100100100100100100100100100100.0%
GPT-5.2100100100100100100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100100100100100100.0%
Claude Opus 4.7100100100100100100100100100100100.0%
Qwen 3.6 27B100100100100100100100100100100100.0%
Claude Opus 4.5100100100100100100100100100100100.0%
Grok 4.1 Fast100100100100100100100100100100100.0%
Aion 2.0100100100100100100100100100100100.0%
Z.AI GLM 4.6100100100100100100100100100100100.0%
MiniMax M2.7100100100100100100100100100100100.0%
GPT-5.5100100100100100100100100100100100.0%
Qwen 3.6 35B100100100100100100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100100100100100100.0%
Claude Sonnet 4100100100100100100100100100100100.0%
MiniMax M2.5100100100100100100100100100100100.0%
Z.AI GLM 4.7100100100100100100100100100100100.0%
GPT-4.1100100100100100100100100100100100.0%
Gemini 2.5 Pro100100100100100100100100100100100.0%
o4 Mini100100100100100100100100100100100.0%
Grok 4100100100100100100100100100100100.0%
Claude Sonnet 4.5100100100100100100100100100100100.0%
Qwen 3.5 35B100100100100100100100100100100100.0%
Claude Opus 4100100100100100100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100100100100100100.0%
Stealth: Hunter Alpha100100100100100100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100100100100100100.0%
Gemma 4 31B100100100100100100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100100100100100100.0%
Gemini 3.5 Flash (Reasoning, Minimal)100100100100100100100100100100100.0%
GPT-OSS 120B100100100100100100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100100100100100100.0%
Qwen 3.5 Flash100100100100100100100100100100100.0%
Z.AI GLM 4.5100100100100100100100100100100100.0%
Grok 4 Fast100100100100100100100100100100100.0%
Qwen 3.5 9B100100100100100100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100100100100100100.0%
Stealth: Healer Alpha100100100100100100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100100100100100100.0%
Gemma 4 26B100100100100100100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100100100100100100.0%
Mistral Large 3100100100100100100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100100100100100100.0%
Claude Haiku 4.5100100100100100100100100100100100.0%
Xiaomi MIMO v2.5100100100100100100100100100100100.0%
DeepSeek-V2 Chat100100100100100100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100100100100100100.0%
Nemotron 3 Super100100100100100100100100100100100.0%
GPT-5.4100100100100100100100100100100100.0%
Claude 3.5 Sonnet100100100100100100100100100100100.0%
Grok 4.20 (Beta)100100100100100100100100100100100.0%
Inception Mercury 2100100100100100100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100100100100100100.0%
Stealth: Aurora Alpha100100100100100100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100100100100100100.0%
Claude 3.7 Sonnet100100100100100100100100100100100.0%
GPT-4.1 Mini100100100100100100100100100100100.0%
Z.AI GLM 4.5 Air100100100100100100100100100100100.0%
Hermes 3 405B100100100100100100100100100100100.0%
DeepSeek V4 Pro100100100100100100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100100100100100100.0%
GPT-5 Nano100100100100100100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100100100100100100.0%
GPT-5.4 Mini100100100100100100100100100100100.0%
Mistral Large 2100100100100100100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100100100100100100.0%
DeepSeek V3.1100100100100100100100100100100100.0%
DeepSeek V3.2100100100100100100100100100100100.0%
Qwen 3 32B100100100100100100100100100100100.0%
DeepSeek V4 Flash100100100100100100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100100100100100100.0%
Grok 4.20100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100100100100100100.0%
Gemini 2.5 Flash100100100100100100100100100100100.0%
Mistral Large100100100100100100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100100100100100100.0%
Writer: Palmyra X5100100100100100100100100100100100.0%
Inception Mercury100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100100100100100100.0%
Grok 4.3100100100100100100100100100100100.0%
Mistral Small 3.2 24B100100100100100100100100100100100.0%
Gemma 3 12B100100100100100100100100100100100.0%
Llama 3.1 70B100100100100100100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100100100100100100.0%
Gemma 3 27B100100100100100100100100100100100.0%
Mistral Medium 3.1100100100100100100100100100100100.0%
Nemotron 3 Nano100100100100100100100100100100100.0%
Mistral Small 4100100100100100100100100100100100.0%
Qwen 2.5 72B100100100100100100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100100100100100100.0%
GPT-5.4 Nano100100100100100100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100100100100100100.0%
Mistral Small Creative100100100100100100100100100100100.0%
Hermes 3 70B100100100100100100100100100100100.0%
Ministral 3 14B100100100100100100100100100100100.0%
GPT-4.1 Nano100100100100100100100100100100100.0%
Ministral 3 8B100100100100100100100100100100100.0%
Claude 3 Haiku100100100100100100100100100100100.0%
WizardLM 2 8x22b100100100100100100100100100100100.0%
Arcee AI: Trinity Mini100100100100100100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100100100100100100.0%
Gemma 3 4B100100100100100100100100100100100.0%
Ministral 3 3B100100100100100100100100100100100.0%
Mistral NeMO100100100100100100100100100100100.0%
Llama 3.1 8B100100100100100100100100100100100.0%
LFM2 24B100100100100100100100100100100100.0%
Ministral 3B100100100100100100100100100100100.0%
Ministral 8B100100100100100100100100100090.0%
Rocinante 12B1001001001001009898920078.9%

Model #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Avg
Qwen3.7 Max100100100100100100100100100100100.0%
Claude Opus 4.6 (Reasoning)100100100100100100100100100100100.0%
Qwen3.6 Max Preview100100100100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
Z.AI GLM 5.1100100100100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100100100100.0%
Gemini 3.5 Flash (Reasoning)100100100100100100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100100100100100100.0%
Claude Opus 4.6100100100100100100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100100100100100100.0%
GPT-5100100100100100100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100100100100100100.0%
Qwen 3.5 122B100100100100100100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100100100100100100.0%
Grok 4.20 (Reasoning)100100100100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100100100100.0%
Claude Sonnet 4.6100100100100100100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100100100100100100.0%
Qwen 3.5 27B100100100100100100100100100100100.0%
ByteDance Seed 1.6100100100100100100100100100100100.0%
Qwen 3.6 Flash100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100100100100100100.0%
o4 Mini High100100100100100100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100100100100100100.0%
Claude Opus 4.7100100100100100100100100100100100.0%
Qwen 3.6 27B100100100100100100100100100100100.0%
Claude Opus 4.5100100100100100100100100100100100.0%
Grok 4.1 Fast100100100100100100100100100100100.0%
Aion 2.0100100100100100100100100100100100.0%
Z.AI GLM 4.6100100100100100100100100100100100.0%
GPT-5.5100100100100100100100100100100100.0%
Qwen 3.6 35B100100100100100100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100100100100100100.0%
Claude Sonnet 4100100100100100100100100100100100.0%
Z.AI GLM 4.7100100100100100100100100100100100.0%
GPT-4.1100100100100100100100100100100100.0%
o4 Mini100100100100100100100100100100100.0%
Grok 4100100100100100100100100100100100.0%
Claude Sonnet 4.5100100100100100100100100100100100.0%
Qwen 3.5 35B100100100100100100100100100100100.0%
Claude Opus 4100100100100100100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100100100100100100.0%
Stealth: Hunter Alpha100100100100100100100100100100100.0%
Gemma 4 31B100100100100100100100100100100100.0%
Gemini 3.5 Flash (Reasoning, Minimal)100100100100100100100100100100100.0%
GPT-OSS 120B100100100100100100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100100100100100100.0%
Qwen 3.5 Flash100100100100100100100100100100100.0%
Z.AI GLM 4.5100100100100100100100100100100100.0%
Grok 4 Fast100100100100100100100100100100100.0%
Qwen 3.5 9B100100100100100100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100100100100100100.0%
Stealth: Healer Alpha100100100100100100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100100100100100100.0%
Gemma 4 26B100100100100100100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100100100100100100.0%
Mistral Large 3100100100100100100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100100100100100100.0%
Xiaomi MIMO v2.5100100100100100100100100100100100.0%
DeepSeek-V2 Chat100100100100100100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100100100100100100.0%
Nemotron 3 Super100100100100100100100100100100100.0%
GPT-5.4100100100100100100100100100100100.0%
Claude 3.5 Sonnet100100100100100100100100100100100.0%
Grok 4.20 (Beta)100100100100100100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100100100100100100.0%
GPT-4.1 Mini100100100100100100100100100100100.0%
Z.AI GLM 4.5 Air100100100100100100100100100100100.0%
Hermes 3 405B100100100100100100100100100100100.0%
DeepSeek V4 Pro100100100100100100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100100100100100100.0%
GPT-5 Nano100100100100100100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100100100100100100.0%
GPT-5.4 Mini100100100100100100100100100100100.0%
Mistral Large 2100100100100100100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100100100100100100.0%
DeepSeek V3.1100100100100100100100100100100100.0%
DeepSeek V3.2100100100100100100100100100100100.0%
Qwen 3 32B100100100100100100100100100100100.0%
DeepSeek V4 Flash100100100100100100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100100100100100100.0%
Grok 4.20100100100100100100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100100100100100100.0%
Mistral Large100100100100100100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100100100100100100.0%
Writer: Palmyra X5100100100100100100100100100100100.0%
Inception Mercury100100100100100100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100100100100100100.0%
Grok 4.3100100100100100100100100100100100.0%
Mistral Small 3.2 24B100100100100100100100100100100100.0%
Gemma 3 12B100100100100100100100100100100100.0%
Llama 3.1 70B100100100100100100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100100100100100100.0%
Gemma 3 27B100100100100100100100100100100100.0%
Mistral Medium 3.1100100100100100100100100100100100.0%
Nemotron 3 Nano100100100100100100100100100100100.0%
Qwen 2.5 72B100100100100100100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100100100100100100.0%
Mistral Small Creative100100100100100100100100100100100.0%
Hermes 3 70B100100100100100100100100100100100.0%
Ministral 3 14B100100100100100100100100100100100.0%
GPT-4.1 Nano100100100100100100100100100100100.0%
Ministral 3 8B100100100100100100100100100100100.0%
Claude 3 Haiku100100100100100100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100100100100100100.0%
Ministral 3 3B100100100100100100100100100100100.0%
LFM2 24B100100100100100100100100100100100.0%
GPT-5.1100100100100100100100100100100100.0%
MiniMax M2.7100100100100100100100100100100100.0%
Gemini 2.5 Pro100100100100100100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100100100100100100.0%
Claude Haiku 4.5100100100100100100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100100100100100100.0%
Inception Mercury 2100100100100100100100100100100100.0%
Claude 3.7 Sonnet100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100100100100100100.0%
Mistral Small 4100100100100100100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100100100100100100.0%
Arcee AI: Trinity Mini100100100100100100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100100100100100100.0%
Stealth: Aurora Alpha100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100100100100100100.0%
Gemini 2.5 Flash100100100100100100100100100100100.0%
GPT-5.4 Nano100100100100100100100100100100100.0%
Gemma 3 4B100100100100100100100100100100100.0%
MiniMax M2.5100100100100100100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100100100100100100.0%
Mistral NeMO100100100100100100100100100100100.0%
WizardLM 2 8x22b10010010010010010010010010010099.9%
Llama 3.1 8B1001001001001001001001001009899.8%
GPT-5.5 (Reasoning)1001001001001001001001001009899.8%
Ministral 3B1001001001001001001001001009899.8%
Ministral 8B100100100100100100100100989899.7%
GPT-5.2100100100100100100100100989899.7%
Rocinante 12B100100100100100100989877287.6%

Model #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Avg
Qwen3.7 Max100100100100100100100100100100100.0%
Claude Opus 4.6 (Reasoning)100100100100100100100100100100100.0%
Qwen3.6 Max Preview100100100100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
Z.AI GLM 5.1100100100100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100100100100.0%
Gemini 3.5 Flash (Reasoning)100100100100100100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100100100100100100.0%
GPT-5100100100100100100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100100100100100100.0%
Qwen 3.5 122B100100100100100100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100100100100100100.0%
Grok 4.20 (Reasoning)100100100100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100100100100.0%
Claude Sonnet 4.6100100100100100100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100100100100100100.0%
Qwen 3.5 27B100100100100100100100100100100100.0%
Qwen 3.6 Flash100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100100100100100100.0%
o4 Mini High100100100100100100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100100100100100100.0%
Claude Opus 4.7100100100100100100100100100100100.0%
Qwen 3.6 27B100100100100100100100100100100100.0%
Claude Opus 4.5100100100100100100100100100100100.0%
Qwen 3.6 35B100100100100100100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100100100100100100.0%
GPT-4.1100100100100100100100100100100100.0%
Grok 4100100100100100100100100100100100.0%
Qwen 3.5 35B100100100100100100100100100100100.0%
Gemma 4 31B100100100100100100100100100100100.0%
Qwen 3.5 Flash100100100100100100100100100100100.0%
Grok 4 Fast100100100100100100100100100100100.0%
Qwen 3.5 9B100100100100100100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100100100100100100.0%
Stealth: Healer Alpha100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100100100100100100.0%
Nemotron 3 Super100100100100100100100100100100100.0%
GPT-5.4100100100100100100100100100100100.0%
Grok 4.20 (Beta)100100100100100100100100100100100.0%
GPT-4.1 Mini100100100100100100100100100100100.0%
GPT-5 Nano100100100100100100100100100100100.0%
GPT-5.4 Mini100100100100100100100100100100100.0%
Grok 4.20100100100100100100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100100100100100100.0%
Mistral Small 3.2 24B100100100100100100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
Claude Opus 4.6100100100100100100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100100100100100100.0%
ByteDance Seed 1.6100100100100100100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100100100100100100.0%
Grok 4.1 Fast100100100100100100100100100100100.0%
Aion 2.0100100100100100100100100100100100.0%
GPT-5.5100100100100100100100100100100100.0%
Z.AI GLM 4.7100100100100100100100100100100100.0%
Gemini 2.5 Pro100100100100100100100100100100100.0%
Claude Opus 4100100100100100100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100100100100100100.0%
Claude Haiku 4.5100100100100100100100100100100100.0%
Xiaomi MIMO v2.5100100100100100100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100100100100100100.0%
Grok 4.3100100100100100100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100100100100100100.0%
Z.AI GLM 4.6100100100100100100100100100100100.0%
Claude Sonnet 4100100100100100100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100100100100100100.0%
Gemini 3.5 Flash (Reasoning, Minimal)100100100100100100100100100100100.0%
Z.AI GLM 4.5 Air100100100100100100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100100100100100100.0%
Stealth: Hunter Alpha100100100100100100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100100100100100100.0%
DeepSeek V4 Pro100100100100100100100100100100100.0%
DeepSeek V3.2100100100100100100100100100100100.0%
Z.AI GLM 4.5100100100100100100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100100100100100100.0%
GPT-4.1 Nano100100100100100100100100100100100.0%
Claude 3.5 Sonnet100100100100100100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100100100100100100.0%
DeepSeek V3.1100100100100100100100100100100100.0%
Mistral Small Creative100100100100100100100100100100100.0%
Gemini 3 Flash (Preview)10010010010010010010010010010099.9%
ByteDance Seed 1.6 Flash10010010010010010010010010010099.9%
GPT-4o, Aug. 6th (temp=0)10010010010010010010010010010099.9%
Gemma 4 26B10010010010010010010010010010099.9%
Gemma 3 12B10010010010010010010010010010099.9%
Gemma 3 27B10010010010010010010010010010099.9%
Mistral Small 4 (Reasoning)1001001001001001001001001009899.8%
DeepSeek V4 Flash1001001001001001001001001009899.8%
DeepSeek V3 (2025-03-24)1001001001001001001001001009899.8%
ByteDance Seed 2.0 Mini1001001001001001001001001009899.8%
GPT-5.11001001001001001001001001009899.8%
DeepSeek V4 Flash (Reasoning)1001001001001001001001001009899.8%
o4 Mini1001001001001001001001001009899.8%
Writer: Palmyra X51001001001001001001001001009899.8%
GPT-5.21001001001001001001001001009899.8%
GPT-4o, May 13th (temp=1)1001001001001001001001001009899.8%
LFM2 24B1001001001001001001001001009899.8%
Nemotron 3 Nano1001001001001001001001001009899.8%
Gemma 3 4B1001001001001001001001001009899.8%
Gemini 2.5 Flash (Reasoning)1001001001001001001001001009899.8%
Inception Mercury1001001001001001001001001009899.8%
Mistral Large 31001001001001001001001001009899.8%
Ministral 3 3B1001001001001001001001001009899.8%
GPT-5.4 Nano (Reasoning)100100100100100100100100989899.7%
GPT-5.4 Nano (Reasoning, Low)100100100100100100100100989899.7%
Ministral 3 14B100100100100100100100100989899.6%
GPT-OSS 120B10010010010010010010098989899.5%
GPT-5.4 Nano10010010010010010010098989899.5%
Mistral Medium 3.110010010010010010010098989899.5%
DeepSeek-V2 Chat1001001001001001009898989899.4%
Claude Sonnet 4.51001001001001001001001001009299.2%
Llama 3.1 70B1001001001001001001001001009299.2%
MiniMax M2.7100100100100100100100100989299.0%
Inception Mercury 2100100100100100100100100989299.0%
MiniMax M2.510010010010010010010098989298.9%
Stealth: Aurora Alpha1001001001001001009898989298.7%
Llama 3.1 8B100100100100100100100100929298.4%
Qwen 2.5 72B10098989898989892929296.7%
Mistral Small 410010010010098989898927796.3%
Qwen 3 32B100100100100100100100100985495.2%
Cohere Command R+ (Aug. 2024)10098989898989292777793.1%
Mistral Large 2100100989898929292777792.7%
Mistral Large10098989292929277777789.8%
DeepSeek V3 (2024-12-26)10010010010010010010010092289.4%
WizardLM 2 8x22b100100100989898989877987.8%
Ministral 3 8B10098989898989892542786.3%
Arcee AI: Trinity Large (Preview)10098929277777777777784.7%
Gemini 2.5 Flash Lite10010010010010098929254083.6%
Hermes 3 405B989892929292777777280.0%
Hermes 3 70B10010010010092927754542779.6%
Claude 3 Haiku10010010010077777754542776.7%
Claude 3.7 Sonnet9892929277775454542771.8%
Ministral 3B1001009898927777270067.1%
Rocinante 12B10010092927727200049.1%
Ministral 8B9277777754272790044.2%
Gemini 2.5 Flash989277542727922038.9%
Mistral NeMO777754545427999937.9%
Arcee AI: Trinity Mini100980000000019.8%

Model #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Avg
Qwen3.7 Max100100100100100100100100100100100.0%
Claude Opus 4.6 (Reasoning)100100100100100100100100100100100.0%
Qwen3.6 Max Preview100100100100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
Z.AI GLM 5.1100100100100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100100100100.0%
Gemini 3.5 Flash (Reasoning)100100100100100100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100100100100100100.0%
GPT-5100100100100100100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100100100100100100.0%
Qwen 3.5 122B100100100100100100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100100100100100100.0%
Grok 4.20 (Reasoning)100100100100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100100100100.0%
Claude Sonnet 4.6100100100100100100100100100100100.0%
Qwen 3.5 27B100100100100100100100100100100100.0%
ByteDance Seed 1.6100100100100100100100100100100100.0%
Qwen 3.6 Flash100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100100100100100100.0%
Qwen 3.6 27B100100100100100100100100100100100.0%
GPT-5.5100100100100100100100100100100100.0%
Qwen 3.6 35B100100100100100100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100100100100100100.0%
Qwen 3.5 35B100100100100100100100100100100100.0%
Qwen 3.5 Flash100100100100100100100100100100100.0%
Qwen 3.5 9B100100100100100100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100100100100100100.0%
Mistral Large 3100100100100100100100100100100100.0%
DeepSeek-V2 Chat100100100100100100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100100100100100100.0%
GPT-5.4100100100100100100100100100100100.0%
Grok 4.20 (Beta)100100100100100100100100100100100.0%
GPT-5.4 Mini100100100100100100100100100100100.0%
DeepSeek V4 Flash100100100100100100100100100100100.0%
Mistral Small 3.2 24B100100100100100100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100100100100100100.0%
o4 Mini100100100100100100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100100100100100100.0%
GPT-5 Nano100100100100100100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100100100100100100.0%
Z.AI GLM 4.7100100100100100100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100100100100100100.0%
Gemma 4 31B100100100100100100100100100100100.0%
Gemini 2.5 Pro100100100100100100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100100100100100100.0%
Aion 2.0100100100100100100100100100100100.0%
Claude Opus 4.5100100100100100100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100100100100100100.0%
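Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.9%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.9%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.6%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.5%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 99.4%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 99.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 99.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 98.7%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 98.7%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 92 | 98.1%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 92 | 98.1%
GPT-4o Mini (temp=0) | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 92 | 92 | 97.5%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 77 | 97.2%
Claude Sonnet 4.5 | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 92 | 92 | 92 | 96.9%
Inception Mercury | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 92 | 77 | 96.3%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 92 | 77 | 95.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 54 | 95.2%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 77 | 77 | 94.5%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 54 | 94.4%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 54 | 93.8%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 98 | 92 | 92 | 92 | 92 | 77 | 77 | 92.2%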
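Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 9 | 90.7%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 9 | 90.4%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 9 | 90.3%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 2 | 90.1%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 77 | 27 | 90.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 0 | 89.7%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 0 | 88.9%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 54 | 27 | 88.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 77 | 2 | 87.7%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 54 | 27 | 86.8%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 77 | 0 | 86.6%
Gemini 3.5 Flash (Reasoning, Minimal) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 92 | 54 | 27 | 86.2%
MiniMax M2.7 | 100 | 100 | 100 | 98 | 98 | 92 | 92 | 92 | 77 | 9 | 86.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 92 | 77 | 0 | 85.9%
Gemma 3 27B | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 54 | 9 | 85.0%
Mistral Small Creative | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 92 | 54 | 9 | 84.2%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 92 | 27 | 27 | 83.6%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 77 | 27 | 27 | 83.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 98 | 77 | 54 | 54 | 0 | 78.3%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 27 | 27 | 9 | 76.3%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 54 | 2 | 0 | 75.3%
Ministral 3 8B | 100 | 100 | 100 | 100 | 92 | 77 | 77 | 77 | 9 | 9 | 74.2%
Llama 3.1 70B | 100 | 100 | 98 | 92 | 92 | 92 | 77 | 54 | 27 | 2 | 73.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 98 | 92 | 92 | 92 | 77 | 27 | 27 | 9 | 71.6%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 98 | 92 | 77 | 77 | 77 | 54 | 27 | 9 | 71.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 98 | 92 | 77 | 54 | 54 | 27 | 2 | 70.4%
GPT-4o, Aug. 6th (temp=0) | 100 | 98 | 98 | 98 | 92 | 77 | 54 | 27 | 27 | 27 | 70.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
WizardLM 2 8x22b | 100 | 100 | 98 | 98 | 92 | 92 | 77 | 27 | 9 | 2 | 69.7%
GPT-4.1 Mini | 100 | 100 | 92 | 92 | 92 | 77 | 54 | 27 | 27 | 0 | 66.2%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 98 | 77 | 77 | 9 | 0 | 0 | 66.2%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 92 | 54 | 2 | 2 | 0 | 64.9%
Ministral 3 14B | 100 | 92 | 92 | 77 | 77 | 54 | 54 | 54 | 9 | 0 | 60.9%
GPT-4o, May 13th (temp=0) | 98 | 92 | 92 | 92 | 54 | 54 | 27 | 27 | 27 | 27 | 59.2%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 98 | 92 | 0 | 0 | 0 | 0 | 59.1%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 27 | 2 | 0 | 0 | 0 | 52.9%
Mistral Small 4 | 100 | 100 | 98 | 92 | 54 | 54 | 27 | 0 | 0 | 0 | 52.5%
Claude Haiku 4.5 | 98 | 92 | 77 | 77 | 77 | 27 | 27 | 27 | 9 | 0 | 51.4%
Gemini 2.5 Flash | 92 | 92 | 54 | 54 | 54 | 27 | 27 | 9 | 9 | 0 | 41.8%
Rocinante 12B | 100 | 100 | 92 | 54 | 54 | 0 | 0 | 0 | 0 | 0 | 39.9%
Qwen 2.5 72B | 98 | 98 | 54 | 54 | 27 | 27 | 9 | 0 | 0 | 0 | 36.8%
Ministral 3B | 100 | 98 | 77 | 27 | 27 | 9 | 0 | 0 | 0 | 0 | 34.0%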
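GPT-4.1 Nano | 100 | 77 | 54 | 54 | 9 | 9 | 2 | 0 | 0 | 0 | 30.4%
Claude 3 Haiku | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Mistral Large | 92 | 54 | 54 | 27 | 27 | 27 | 9 | 2 | 0 | 0 | 29.2%
Ministral 8B | 100 | 98 | 54 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 27.9%
Gemini 2.5 Flash Lite | 100 | 100 | 54 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 25.5%
LFM2 24B | 100 | 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19.8%
Hermes 3 70B | 100 | 9 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11.1%
Cohere Command R+ (Aug. 2024) | 77 | 9 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 9.0%
Mistral Large 2 | 9 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.2%
Arcee AI: Trinity Large (Preview) | 9 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.1%
Claude 3.7 Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%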
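Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%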
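Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.7%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.6%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.6%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 92 | 97.9%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 77 | 97.6%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 77 | 97.2%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 77 | 97.1%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 77 | 96.8%
Grok 4.20 | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 92 | 77 | 96.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 77 | 77 | 94.4%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 77 | 77 | 77 | 93.2%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 77 | 54 | 92.3%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 77 | 54 | 92.3%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 27 | 91.9%
GPT-5.4 Mini | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 92 | 27 | 91.2%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 54 | 54 | 90.4%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 89.9%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 0 | 89.8%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 77 | 9 | 87.8%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 77 | 0 | 86.8%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 77 | 0 | 86.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 77 | 77 | 0 | 85.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 98 | 92 | 92 | 77 | 77 | 0 | 83.7%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 77 | 27 | 9 | 81.2%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 77 | 27 | 0 | 80.5%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 77 | 27 | 0 | 80.3%
Claude Opus 4.5 | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 54 | 54 | 0 | 79.4%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 54 | 9 | 0 | 75.5%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 98 | 98 | 77 | 77 | 54 | 27 | 9 | 74.1%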
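Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 77 | 54 | 9 | 0 | 74.0%
Claude Sonnet 4.6 | 100 | 100 | 98 | 98 | 98 | 92 | 92 | 54 | 0 | 0 | 73.3%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 27 | 0 | 0 | 72.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 0 | 0 | 0 | 69.8%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 0 | 0 | 0 | 69.1%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 54 | 2 | 0 | 0 | 65.5%
Claude Sonnet 4.5 | 98 | 98 | 98 | 98 | 77 | 77 | 54 | 27 | 9 | 0 | 63.8%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 98 | 54 | 54 | 27 | 0 | 0 | 63.3%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 27 | 0 | 0 | 0 | 62.7%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 59.9%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 77 | 2 | 0 | 0 | 0 | 57.9%
Gemini 2.5 Flash Lite | 100 | 98 | 98 | 92 | 77 | 54 | 54 | 2 | 0 | 0 | 57.5%
Gemini 2.5 Flash (Reasoning) | 100 | 98 | 92 | 77 | 54 | 54 | 54 | 9 | 9 | 0 | 54.7%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 98 | 98 | 98 | 9 | 2 | 2 | 2 | 0 | 50.9%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Stealth: Hunter Alpha | 100 | 92 | 77 | 77 | 77 | 54 | 9 | 9 | 0 | 0 | 49.6%
Gemma 4 26B | 100 | 100 | 100 | 77 | 54 | 54 | 2 | 2 | 0 | 0 | 48.8%
Grok 4.3 | 100 | 100 | 98 | 92 | 54 | 27 | 0 | 0 | 0 | 0 | 47.2%
Aion 2.0 | 100 | 100 | 100 | 100 | 54 | 2 | 0 | 0 | 0 | 0 | 45.5%
GPT-4.1 Nano | 100 | 100 | 100 | 77 | 77 | 0 | 0 | 0 | 0 | 0 | 45.5%
Gemini 2.5 Flash | 100 | 100 | 98 | 98 | 27 | 9 | 0 | 0 | 0 | 0 | 43.3%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 27 | 2 | 0 | 0 | 0 | 0 | 42.9%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Ministral 8B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
GPT-4.1 | 100 | 100 | 98 | 98 | 0 | 0 | 0 | 0 | 0 | 0 | 39.7%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 92 | 92 | 9 | 2 | 0 | 0 | 0 | 0 | 39.5%
Z.AI GLM 4.5 Air | 100 | 100 | 98 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 32.6%
GPT-4.1 Mini | 100 | 100 | 54 | 27 | 27 | 0 | 0 | 0 | 0 | 0 | 30.8%
Claude Sonnet 4 | 100 | 92 | 77 | 27 | 9 | 0 | 0 | 0 | 0 | 0 | 30.6%
Llama 3.1 70B | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
DeepSeek V3.1 | 98 | 92 | 77 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 27.0%
Gemma 3 12B | 98 | 98 | 54 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
ByteDance Seed 1.6 Flash | 100 | 98 | 9 | 9 | 9 | 0 | 0 | 0 | 0 | 0 | 22.6%
MiniMax M2.5 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Writer: Palmyra X5 | 100 | 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19.8%
Rocinante 12B | 92 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.1%
Ministral 3 8B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Ministral 3B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Ministral 3 3B | 92 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.6%
Z.AI GLM 4.5 | 77 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.9%
Claude Opus 4 | 54 | 9 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.4%
Claude 3 Haiku | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.7%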
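GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Haiku 4.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Cohere Command R+ (Aug. 2024) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 2.5 72B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Large (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Hermes 3 70B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%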