Filter word density

Test: Bad Writing Habits

Avg. Score: 87.4%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Mistral Small Creative | 99.8% | $0.0007 | 9.1s | 97%
2 | Ministral 3 14B | 99.8% | $0.0007 | 11.7s | 97%
3 | Mistral Small 4 | 99.8% | $0.0014 | 18.2s | 98%
4 | Writer: Palmyra X5 | 100.0% | $0.011 | 22.0s | 99%
5 | Mistral Medium 3.1 | 99.6% | $0.0048 | 36.5s | 97%
6 | Mistral Large 3 | 98.8% | $0.0033 | 30.3s | 92%
7 | Mistral Small 4 (Reasoning) | 98.9% | $0.0022 | 30.2s | 91%
8 | Qwen3 235B A22B Instruct 2507 | 99.8% | $0.0011 | 59.2s | 96%
9 | o4 Mini | 99.1% | $0.015 | 25.7s | 91%
10 | GPT-5.4 Nano (Reasoning, Low) | 97.5% | $0.0055 | 20.6s | 88%
11 | Mistral Large 2 | 98.7% | $0.013 | 29.4s | 90%
12 | Mistral Large | 98.6% | $0.014 | 30.9s | 90%
13 | GPT-5.4 Mini | 97.8% | $0.015 | 16.8s | 88%
14 | GPT-5.4 Mini (Reasoning, Low) | 98.3% | $0.015 | 16.8s | 87%
15 | DeepSeek V3 (2025-03-24) | 98.2% | $0.0014 | 39.4s | 86%
16 | Qwen 3.5 9B | 99.5% | $0.0011 | 1.4m | 95%
17 | Ministral 3 8B | 97.2% | $0.0008 | 19.6s | 81%
18 | GPT-5.4 Nano (Reasoning) | 97.1% | $0.0061 | 24.5s | 85%
19 | GPT-5.4 Nano | 97.3% | $0.0057 | 26.3s | 84%
20 | ByteDance Seed 1.6 Flash | 97.1% | $0.0013 | 27.3s | 81%
21 | o4 Mini High | 99.2% | $0.025 | 47.2s | 94%
22 | Grok 4.3 | 97.1% | $0.0069 | 30.5s | 83%
23 | Stealth: Hunter Alpha | 97.4% | $0.0000 | 55.0s | 85%
24 | Grok 4.20 | 97.8% | $0.0093 | 45.7s | 86%
25 | GPT-4o Mini (temp=1) | 95.9% | $0.0012 | 34.8s | 82%
26 | DeepSeek V4 Flash | 96.2% | $0.0006 | 31.6s | 80%
27 | Grok 4.20 (Beta) | 96.0% | $0.018 | 15.8s | 83%
28 | GPT-4.1 | 98.2% | $0.018 | 44.7s | 88%
29 | Xiaomi MIMO v2.5 Pro | 97.3% | $0.0085 | 53.5s | 87%
30 | DeepSeek V4 Pro | 98.2% | $0.0048 | 1.3m | 89%
31 | DeepSeek V4 Flash (Reasoning) | 95.9% | $0.0007 | 31.1s | 78%
32 | Z.AI GLM 5 Turbo | 95.1% | $0.0081 | 33.2s | 80%
33 | GPT-5.4 Mini (Reasoning) | 96.9% | $0.022 | 28.1s | 83%
34 | GPT-4.1 Mini | 95.5% | $0.0027 | 19.0s | 73%
35 | Ministral 8B | 96.1% | $0.0004 | 10.4s | 69%
36 | Qwen 3 32B | 96.5% | $0.0015 | 54.6s | 79%
37 | GPT-5 Mini | 95.6% | $0.0100 | 57.4s | 84%
38 | Qwen 3.6 Flash | 96.3% | $0.010 | 41.4s | 79%
39 | Z.AI GLM 4.7 Flash | 96.1% | $0.0017 | 1.2m | 80%
40 | Z.AI GLM 4.7 | 97.0% | $0.010 | 1.4m | 85%
41 | Grok 4 Fast | 94.9% | $0.0017 | 24.1s | 68%
42 | Ministral 3B | 93.1% | $0.0001 | 8.1s | 64%
43 | Qwen 3.5 Flash | 95.0% | $0.0025 | 47.5s | 73%
44 | Qwen 3.6 35B | 95.6% | $0.0083 | 1.0m | 78%
45 | Grok 4.1 Fast | 96.6% | $0.0018 | 37.8s | 67%
46 | Claude Sonnet 4.5 | 96.4% | $0.035 | 38.1s | 83%
47 | Qwen 3.5 122B | 96.3% | $0.025 | 1.1m | 85%
48 | GPT-5.4 | 99.8% | $0.049 | 1.4m | 97%
49 | Qwen 3.5 Plus (2026-04-20) | 98.4% | $0.017 | 1.8m | 88%
50 | Claude Sonnet 4.6 | 96.6% | $0.031 | 39.3s | 79%
51 | GPT-5.4 (Reasoning, Low) | 99.7% | $0.055 | 1.4m | 97%
52 | Grok 4.20 (Beta, Reasoning) | 94.4% | $0.039 | 34.0s | 82%
53 | Gemma 3 27B | 93.1% | $0.0006 | 52.6s | 70%
54 | LFM2 24B | 92.2% | $0.0002 | 28.4s | 65%
55 | Aion 2.0 | 93.8% | $0.0064 | 1.3m | 77%
56 | GPT-4o, Aug. 6th (temp=1) | 92.3% | $0.018 | 24.4s | 70%
57 | Ministral 3 3B | 92.0% | $0.0005 | 11.1s | 59%
58 | Xiaomi MIMO v2.5 | 91.5% | $0.0054 | 31.8s | 66%
59 | Stealth: Healer Alpha | 90.7% | $0.0000 | 23.7s | 63%
60 | GPT-5.1 | 99.7% | $0.054 | 1.8m | 96%
61 | Claude Opus 4.7 (Reasoning) | 98.6% | $0.076 | 32.0s | 89%
62 | GPT-5.2 | 98.8% | $0.056 | 1.5m | 93%
63 | Z.AI GLM 5.1 | 95.7% | $0.014 | 1.5m | 77%
64 | GPT-4o Mini (temp=0) | 89.2% | $0.0012 | 34.8s | 65%
65 | Gemini 3 Pro (Preview) | 97.1% | $0.055 | 54.4s | 84%
66 | GPT-4.1 Nano | 89.6% | $0.0007 | 13.3s | 57%
67 | Qwen 3.5 35B | 93.9% | $0.018 | 1.0m | 70%
68 | DeepSeek-V2 Chat | 92.2% | $0.0021 | 53.3s | 63%
69 | Gemini 2.5 Pro | 93.3% | $0.036 | 36.2s | 72%
70 | DeepSeek V3 (2024-12-26) | 91.6% | $0.0021 | 54.6s | 63%
71 | Grok 4.20 (Reasoning) | 95.3% | $0.018 | 1.5m | 72%
72 | Gemini 2.5 Flash (Reasoning) | 87.6% | $0.011 | 21.5s | 60%
73 | Z.AI GLM 5 | 90.7% | $0.0084 | 1.2m | 66%
74 | Claude Opus 4.7 | 96.0% | $0.069 | 30.4s | 76%
75 | Claude Opus 4.6 | 98.2% | $0.078 | 1.2m | 88%
76 | Claude Sonnet 4.6 (Reasoning) | 97.5% | $0.060 | 1.2m | 79%
77 | DeepSeek V3.2 | 91.9% | $0.0014 | 1.9m | 69%
78 | Qwen 3.5 27B | 94.3% | $0.020 | 1.6m | 70%
79 | Mistral NeMO | 86.4% | $0.0005 | 10.1s | 48%
80 | Qwen 3.5 Plus (2026-02-15) | 85.2% | $0.0060 | 31.5s | 58%
81 | Rocinante 12B | 88.9% | $0.0014 | 38.4s | 52%
82 | Qwen 3.6 27B | 97.4% | $0.025 | 2.3m | 77%
83 | Gemma 3 12B | 85.3% | $0.0004 | 41.3s | 55%
84 | Hermes 3 405B | 90.1% | $0.0032 | 53.2s | 52%
85 | Claude Haiku 4.5 | 86.4% | $0.011 | 21.6s | 53%
86 | DeepSeek V4 Pro (Reasoning) | 96.8% | $0.015 | 3.1m | 82%
87 | Claude Opus 4.6 (Reasoning) | 97.7% | $0.088 | 1.4m | 85%
88 | Gemini 2.5 Flash | 83.7% | $0.0052 | 10.6s | 46%
89 | Gemma 4 31B | 85.5% | $0.0010 | 1.6m | 61%
90 | GPT-5.4 (Reasoning) | 99.8% | $0.089 | 2.6m | 98%
91 | DeepSeek V3.1 | 87.8% | $0.0020 | 1.8m | 60%
92 | MiniMax M2.5 | 86.4% | $0.0034 | 1.3m | 54%
93 | Gemini 3 Flash (Preview) | 81.3% | $0.0078 | 19.6s | 48%
94 | Gemma 3 4B | 79.9% | $0.0002 | 20.0s | 46%
95 | Arcee AI: Trinity Mini | 80.8% | $0.0003 | 9.2s | 42%
96 | Z.AI GLM 4.6 | 84.1% | $0.0065 | 51.5s | 51%
97 | GPT-5 | 98.3% | $0.065 | 2.8m | 90%
98 | Grok 4 | 95.0% | $0.048 | 1.7m | 68%
99 | MiniMax M2.7 | 84.4% | $0.0040 | 1.1m | 52%
100 | Claude 3 Haiku | 81.4% | $0.0025 | 14.9s | 41%
101 | Qwen3.6 Max Preview | 99.0% | $0.050 | 3.5m | 90%
102 | Gemma 4 31B (Reasoning) | 88.1% | $0.0014 | 2.2m | 60%
103 | GPT-5.5 | 100.0% | $0.139 | 1.7m | 100%
104 | Qwen 3.5 397B A17B | 92.6% | $0.014 | 3.0m | 70%
105 | GPT-4o, May 13th (temp=1) | 84.3% | $0.033 | 14.4s | 47%
106 | GPT-5.5 (Reasoning) | 100.0% | $0.142 | 1.8m | 100%
107 | GPT-5.5 (Reasoning, Low) | 99.9% | $0.139 | 1.8m | 98%
108 | Gemini 2.5 Flash Lite | 75.4% | $0.0009 | 9.5s | 39%
109 | Gemini 3.1 Pro (Preview) | 97.7% | $0.107 | 1.8m | 83%
110 | Grok 4.3 (Reasoning) | 90.8% | $0.021 | 2.3m | 55%
111 | MoonshotAI: Kimi K2.5 | 91.0% | $0.019 | 3.2m | 63%
112 | Cohere Command R+ (Aug. 2024) | 81.8% | $0.020 | 52.5s | 38%
113 | Gemini 3.1 Flash Lite | 70.8% | $0.0030 | 12.1s | 33%
114 | GPT-4o, May 13th (temp=0) | 79.2% | $0.035 | 14.1s | 37%
115 | Claude Opus 4.5 | 85.4% | $0.070 | 53.4s | 55%
116 | Gemini 3.1 Flash Lite (Preview) | 70.4% | $0.0030 | 8.4s | 32%
117 | Gemini 3 Flash (Preview, Reasoning) | 74.2% | $0.012 | 30.1s | 37%
118 | GPT-4o, Aug. 6th (temp=0) | 77.8% | $0.023 | 22.7s | 34%
119 | ByteDance Seed 2.0 Lite | 83.2% | $0.012 | 2.2m | 46%
120 | Gemini 3.1 Flash Lite (Reasoning) | 67.0% | $0.0030 | 11.9s | 29%
121 | Arcee AI: Trinity Large (Preview) | 73.3% | $0.0000 | 43.6s | 25%
122 | Hermes 3 70B | 73.8% | $0.0010 | 1.2m | 31%
123 | Gemma 4 26B | 71.4% | $0.0009 | 55.1s | 28%
124 | WizardLM 2 8x22b | 79.3% | $0.0026 | 1.8m | 31%
125 | Z.AI GLM 4.5 | 70.5% | $0.0051 | 42.1s | 26%
126 | Z.AI GLM 4.5 Air | 71.2% | $0.0029 | 58.2s | 26%
127 | Qwen 2.5 72B | 66.2% | $0.0010 | 36.7s | 24%
128 | Claude 3.7 Sonnet | 73.0% | $0.042 | 46.7s | 34%
129 | Gemini 2.5 Flash Lite (Reasoning) | 62.8% | $0.0028 | 30.8s | 21%
130 | MoonshotAI: Kimi K2.6 | 98.0% | $0.058 | 6.5m | 88%
131 | Claude 3.5 Sonnet | 71.2% | $0.048 | 35.5s | 27%
132 | ByteDance Seed 1.6 | 74.2% | $0.013 | 2.5m | 33%
133 | Gemma 4 26B (Reasoning) | 69.5% | $0.0013 | 2.0m | 25%
134 | GPT-5 Nano | 61.0% | $0.0042 | 1.4m | 25%
135 | Claude Sonnet 4 | 64.0% | $0.032 | 43.7s | 18%
136 | Claude Opus 4 | 93.7% | $0.209 | 1.4m | 72%
137 | Llama 3.1 70B | 44.9% | $0.0015 | 29.4s | 10%
138 | Inception Mercury 2 | 40.8% | $0.0032 | 7.0s | 10%
139 | Nemotron 3 Super | 51.5% | $0.0000 | 1.4m | 15%
140 | ByteDance Seed 2.0 Mini | 75.4% | $0.0045 | 4.9m | 36%
141 | Inception Mercury | 47.2% | $0.011 | 17.6s | 2%
142 | GPT-OSS 120B | 49.5% | $0.0015 | 1.8m | 12%
143 | Stealth: Aurora Alpha | 33.9% | $0.0000 | 9.8s | 4%
144 | Llama 3.1 Nemotron 70B | 37.5% | $0.0038 | 31.7s | 6%
145 | Llama 3.1 8B | 40.0% | $0.0003 | 1.3m | 9%
146 | Mistral Small 3.2 24B | 77.6% | $0.0068 | 5.6m | 26%
147 | Nemotron 3 Nano | 25.6% | $0.0010 | 1.1m | 2%

Individual Scenarios

Detailed Writing Rules

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 99 | 99.7%
Qwen 3 32B | 100 | 100 | 100 | 100 | 98 | 99.6%
Ministral 3 8B | 100 | 100 | 100 | 100 | 97 | 99.4%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 95 | 99.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 91 | 98.2%
LFM2 24B | 100 | 100 | 100 | 98 | 92 | 97.9%
WizardLM 2 8x22b | 100 | 100 | 100 | 99 | 90 | 97.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 89 | 97.7%
Aion 2.0 | 100 | 100 | 100 | 100 | 84 | 96.8%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 83 | 96.7%
Ministral 8B | 100 | 100 | 100 | 94 | 88 | 96.5%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 82 | 96.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 96 | 86 | 96.5%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 81 | 96.1%
Grok 4.20 (Beta, Reasoning) | 100 | 99 | 96 | 92 | 91 | 95.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 95 | 83 | 95.6%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 94 | 84 | 95.5%
Z.AI GLM 5 | 100 | 100 | 100 | 92 | 84 | 95.3%
GPT-5 Mini | 100 | 100 | 100 | 98 | 78 | 95.1%
Grok 4.20 (Beta) | 100 | 100 | 100 | 88 | 87 | 95.1%
Grok 4 | 100 | 100 | 100 | 100 | 73 | 94.5%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 71 | 94.1%
Z.AI GLM 4.7 | 100 | 100 | 93 | 91 | 84 | 93.7%
MiniMax M2.7 | 96 | 95 | 95 | 92 | 90 | 93.7%
ByteDance Seed 1.6 | 100 | 100 | 100 | 94 | 71 | 93.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 87 | 76 | 92.6%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 99 | 89 | 73 | 92.3%
Z.AI GLM 5 Turbo | 100 | 98 | 97 | 92 | 75 | 92.3%
GPT-4.1 Nano | 100 | 98 | 95 | 91 | 76 | 92.1%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 91 | 69 | 92.0%
Gemma 3 27B | 100 | 100 | 100 | 93 | 63 | 91.3%
Claude Opus 4 | 100 | 100 | 100 | 100 | 55 | 91.1%
DeepSeek V3.2 | 100 | 100 | 100 | 77 | 73 | 89.9%
Claude 3 Haiku | 100 | 100 | 100 | 85 | 62 | 89.4%
DeepSeek V4 Flash | 100 | 100 | 100 | 82 | 60 | 88.4%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 42 | 88.4%
GPT-4.1 Mini | 100 | 100 | 100 | 87 | 52 | 87.9%
Gemma 4 31B (Reasoning) | 100 | 100 | 98 | 85 | 54 | 87.1%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 98 | 37 | 87.1%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 34 | 86.8%
Claude Opus 4.5 | 100 | 100 | 90 | 78 | 66 | 86.8%
Stealth: Hunter Alpha | 100 | 100 | 89 | 75 | 55 | 83.8%
Ministral 3B | 100 | 100 | 100 | 100 | 9 | 81.8%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 73 | 34 | 81.5%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 84 | 74 | 49 | 81.4%
Rocinante 12B | 100 | 96 | 95 | 88 | 24 | 80.6%
Gemini 3.1 Flash Lite | 100 | 98 | 72 | 63 | 55 | 77.9%
Hermes 3 70B | 100 | 100 | 100 | 86 | 0 | 77.1%
Gemini 2.5 Flash (Reasoning) | 100 | 99 | 76 | 66 | 40 | 76.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 70 | 11 | 76.1%
Ministral 3 3B | 100 | 100 | 100 | 52 | 27 | 75.7%
Mistral NeMO | 100 | 91 | 88 | 88 | 11 | 75.6%
GPT-4o Mini (temp=0) | 87 | 82 | 75 | 73 | 61 | 75.5%
Claude 3.5 Sonnet | 100 | 86 | 75 | 68 | 49 | 75.5%
Gemini 3.1 Flash Lite (Preview) | 100 | 74 | 69 | 67 | 62 | 74.3%
Gemini 3 Flash (Preview) | 96 | 88 | 80 | 63 | 27 | 70.8%
Gemma 3 4B | 88 | 79 | 71 | 71 | 44 | 70.3%
Gemma 4 26B | 83 | 79 | 75 | 65 | 48 | 69.9%
Xiaomi MIMO v2.5 | 100 | 100 | 84 | 51 | 13 | 69.6%
Z.AI GLM 4.6 | 100 | 79 | 74 | 63 | 32 | 69.4%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 46 | 0 | 69.2%
GPT-4o, Aug. 6th (temp=0) | 100 | 94 | 88 | 62 | 0 | 68.9%
GPT-OSS 120B | 100 | 71 | 66 | 63 | 35 | 67.0%
Gemma 4 31B | 100 | 87 | 65 | 48 | 33 | 66.5%
Hermes 3 405B | 100 | 100 | 56 | 46 | 26 | 65.7%
Nemotron 3 Super | 100 | 86 | 54 | 45 | 38 | 64.8%
Z.AI GLM 4.5 | 100 | 83 | 73 | 65 | 0 | 64.2%
DeepSeek V3.1 | 94 | 88 | 69 | 60 | 10 | 64.1%
Claude Haiku 4.5 | 91 | 83 | 77 | 66 | 0 | 63.4%
Inception Mercury 2 | 100 | 73 | 63 | 40 | 35 | 62.3%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 72 | 56 | 54 | 15 | 59.4%
Claude Sonnet 4 | 100 | 77 | 52 | 33 | 19 | 56.2%
GPT-4o, May 13th (temp=0) | 94 | 59 | 58 | 39 | 23 | 54.5%
ByteDance Seed 2.0 Mini | 100 | 100 | 54 | 11 | 0 | 53.0%
GPT-5 Nano | 71 | 68 | 56 | 40 | 28 | 52.7%
MiniMax M2.5 | 100 | 86 | 41 | 36 | 0 | 52.6%
Gemini 2.5 Flash | 100 | 89 | 38 | 25 | 4 | 51.2%
Z.AI GLM 4.5 Air | 100 | 67 | 37 | 29 | 22 | 51.0%
Claude 3.7 Sonnet | 100 | 81 | 48 | 24 | 0 | 50.4%
Arcee AI: Trinity Large (Preview) | 95 | 89 | 60 | 2 | 0 | 49.2%
Gemma 4 26B (Reasoning) | 85 | 79 | 44 | 9 | 0 | 43.4%
Gemma 3 12B | 85 | 55 | 33 | 27 | 0 | 39.9%
Nemotron 3 Nano | 81 | 66 | 33 | 0 | 0 | 35.9%
Llama 3.1 70B | 100 | 75 | 0 | 0 | 0 | 35.0%
Gemini 2.5 Flash Lite | 63 | 46 | 42 | 18 | 0 | 33.8%
Stealth: Aurora Alpha | 100 | 34 | 16 | 9 | 0 | 31.8%
Llama 3.1 8B | 61 | 51 | 37 | 0 | 0 | 29.9%
Qwen 2.5 72B | 60 | 50 | 15 | 5 | 0 | 26.1%
Gemini 2.5 Flash Lite (Reasoning) | 61 | 29 | 15 | 0 | 0 | 20.9%
Mistral Small 3.2 24B | 38 | 32 | 21 | 0 | 0 | 18.3%
Cohere Command R+ (Aug. 2024) | 71 | 0 | 0 | 0 | 0 | 14.3%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0%
Llama 3.1 Nemotron 70B | 0 | 0 | 0 | 0 | 0 | 0.0%

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 99 | 99.7%
GPT-5 | 100 | 100 | 100 | 100 | 97 | 99.4%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 96 | 99.2%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 95 | 98.9%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 94 | 98.9%
Stealth: Hunter Alpha | 100 | 100 | 100 | 98 | 94 | 98.4%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 92 | 98.4%
Ministral 3 3B | 100 | 100 | 100 | 100 | 92 | 98.4%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 90 | 98.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 89 | 97.7%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 87 | 97.5%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 99 | 89 | 97.5%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 87 | 97.4%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 86 | 97.2%
Gemma 4 31B | 100 | 100 | 100 | 97 | 89 | 97.1%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 85 | 97.1%
Ministral 3B | 100 | 100 | 100 | 100 | 83 | 96.6%
Z.AI GLM 4.6 | 100 | 100 | 100 | 91 | 91 | 96.5%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 81 | 96.2%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 92 | 88 | 95.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 94 | 85 | 95.8%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 75 | 95.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 75 | 95.0%
DeepSeek V3.2 | 100 | 100 | 100 | 89 | 85 | 94.7%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 73 | 94.6%
MiniMax M2.5 | 100 | 100 | 100 | 87 | 85 | 94.3%
Claude Opus 4.5 | 100 | 100 | 98 | 87 | 86 | 94.3%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 71 | 94.1%
GPT-5.4 Nano | 100 | 100 | 100 | 89 | 80 | 93.8%
Mistral Large 2 | 100 | 100 | 100 | 82 | 69 | 90.2%
GPT-4o, May 13th (temp=1) | 100 | 100 | 94 | 79 | 72 | 89.0%
Stealth: Healer Alpha | 100 | 100 | 99 | 93 | 54 | 88.9%
Z.AI GLM 4.5 | 100 | 95 | 88 | 88 | 71 | 88.6%
GPT-4o Mini (temp=0) | 100 | 91 | 89 | 89 | 71 | 88.0%
Gemma 4 26B | 100 | 100 | 99 | 75 | 64 | 87.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 36 | 87.1%
GPT-4.1 Nano | 100 | 100 | 100 | 86 | 48 | 86.8%
Gemma 4 26B (Reasoning) | 100 | 98 | 97 | 88 | 49 | 86.3%
Gemini 2.5 Pro | 100 | 100 | 81 | 77 | 74 | 86.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 96 | 31 | 85.5%
MiniMax M2.7 | 100 | 100 | 100 | 72 | 55 | 85.4%
Gemini 3 Flash (Preview) | 100 | 100 | 98 | 83 | 43 | 84.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 82 | 76 | 60 | 83.5%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 83 | 74 | 58 | 83.0%
Ministral 8B | 100 | 100 | 100 | 100 | 13 | 82.6%
Gemma 3 27B | 100 | 93 | 82 | 81 | 53 | 81.7%
Gemini 2.5 Flash (Reasoning) | 100 | 88 | 81 | 70 | 67 | 81.3%
Gemini 2.5 Flash | 100 | 99 | 80 | 63 | 63 | 80.9%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 79 | 20 | 80.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 73 | 64 | 61 | 79.7%
Xiaomi MIMO v2.5 | 100 | 100 | 82 | 77 | 37 | 79.2%
Gemma 3 12B | 95 | 93 | 79 | 64 | 58 | 78.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 71 | 18 | 77.8%
Claude 3.5 Sonnet | 100 | 93 | 88 | 62 | 38 | 76.3%
Gemini 3.1 Flash Lite | 100 | 98 | 78 | 74 | 30 | 76.1%
GPT-4o, Aug. 6th (temp=0) | 97 | 89 | 83 | 62 | 30 | 72.3%
Hermes 3 70B | 100 | 91 | 90 | 73 | 0 | 70.8%
Gemini 3.1 Flash Lite (Reasoning) | 86 | 83 | 82 | 60 | 36 | 69.5%
Z.AI GLM 4.5 Air | 100 | 91 | 74 | 56 | 18 | 67.7%
ByteDance Seed 1.6 | 100 | 100 | 70 | 38 | 26 | 66.9%
GPT-4o, May 13th (temp=0) | 100 | 92 | 64 | 45 | 25 | 65.4%
Gemma 3 4B | 75 | 71 | 66 | 61 | 51 | 64.8%
Mistral NeMO | 100 | 76 | 70 | 59 | 0 | 60.9%
Nemotron 3 Super | 90 | 83 | 44 | 36 | 18 | 54.3%
Claude 3.7 Sonnet | 100 | 94 | 58 | 12 | 7 | 54.2%
Cohere Command R+ (Aug. 2024) | 100 | 68 | 44 | 36 | 0 | 49.5%
GPT-5 Nano | 100 | 63 | 44 | 39 | 0 | 49.2%
Gemini 2.5 Flash Lite | 89 | 45 | 42 | 24 | 11 | 42.2%
Claude 3 Haiku | 79 | 46 | 46 | 21 | 5 | 39.5%
Qwen 2.5 72B | 92 | 51 | 39 | 1 | 0 | 36.9%
GPT-OSS 120B | 52 | 48 | 38 | 32 | 2 | 34.2%
Llama 3.1 70B | 100 | 22 | 0 | 0 | 0 | 24.4%
Stealth: Aurora Alpha | 80 | 17 | 0 | 0 | 0 | 19.4%
Nemotron 3 Nano | 36 | 33 | 17 | 0 | 0 | 17.3%
Inception Mercury 2 | 37 | 33 | 3 | 0 | 0 | 14.5%
Gemini 2.5 Flash Lite (Reasoning) | 42 | 9 | 6 | 0 | 0 | 11.4%
Llama 3.1 8B | 54 | 0 | 0 | 0 | 0 | 10.7%
Llama 3.1 Nemotron 70B | 11 | 0 | 0 | 0 | 0 | 2.1%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0%

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 99.9%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 99.9%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 99 | 99.7%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 98 | 99.7%
Gemma 3 27B | 100 | 100 | 100 | 100 | 97 | 99.4%
Ministral 3B | 100 | 100 | 100 | 100 | 97 | 99.4%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 96 | 99.3%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 96 | 99.2%
Inception Mercury | 100 | 100 | 100 | 100 | 96 | 99.1%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 95 | 99.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.7%
o4 Mini High | 100 | 100 | 100 | 100 | 94 | 98.7%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 93 | 98.6%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 93 | 98.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 91 | 98.2%
GPT-5 Mini | 100 | 100 | 100 | 100 | 91 | 98.1%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 90 | 98.1%
LFM2 24B | 100 | 100 | 100 | 100 | 89 | 97.9%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 88 | 97.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 98 | 91 | 97.6%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 88 | 97.6%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 88 | 97.6%
Hermes 3 405B | 100 | 100 | 100 | 100 | 88 | 97.6%
Mistral Large | 100 | 100 | 100 | 100 | 87 | 97.4%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 85 | 96.9%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 83 | 96.7%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 83 | 96.7%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 81 | 96.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 80 | 96.1%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 80 | 96.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 80 | 96.0%
GPT-OSS 120B | 100 | 100 | 100 | 93 | 86 | 95.9%
GPT-5 | 100 | 100 | 100 | 100 | 76 | 95.3%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 76 | 95.3%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 99 | 77 | 95.2%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 90 | 85 | 95.0%
Rocinante 12B | 100 | 100 | 100 | 87 | 86 | 94.7%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 71 | 94.3%
Claude Opus 4.5 | 100 | 100 | 100 | 88 | 79 | 93.5%
MiniMax M2.5 | 100 | 100 | 100 | 96 | 69 | 93.1%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 62 | 92.4%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 61 | 92.2%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 84 | 75 | 91.8%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 56 | 91.3%
Inception Mercury 2 | 100 | 100 | 100 | 99 | 53 | 90.3%
DeepSeek V3.1 | 100 | 100 | 100 | 91 | 51 | 88.5%
Hermes 3 70B | 100 | 100 | 91 | 85 | 66 | 88.2%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 41 | 88.2%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 88 | 52 | 87.9%
Gemini 2.5 Flash Lite | 100 | 100 | 89 | 87 | 43 | 83.9%
Gemini 3.1 Flash Lite (Preview) | 100 | 98 | 95 | 91 | 26 | 81.9%
Mistral NeMO | 100 | 100 | 100 | 98 | 11 | 81.8%
GPT-5 Nano | 100 | 90 | 73 | 70 | 69 | 80.4%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 Nemotron 70B | 100 | 75 | 75 | 66 | 63 | 75.8%
Llama 3.1 70B | 100 | 85 | 71 | 43 | 38 | 67.5%
Nemotron 3 Nano | 100 | 100 | 86 | 35 | 0 | 64.2%
Llama 3.1 8B | 99 | 78 | 75 | 38 | 28 | 63.5%

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 99.9%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 99 | 99.7%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 98 | 99.5%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 97 | 99.4%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 97 | 99.3%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 96 | 99.2%
Claude Opus 4 | 100 | 100 | 100 | 100 | 96 | 99.1%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 95 | 99.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 95 | 99.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 95 | 99.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 95 | 98.9%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 94 | 98.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 94 | 98.7%
Mistral Small 4 | 100 | 100 | 100 | 100 | 94 | 98.7%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 98 | 93 | 98.2%
Grok 4.3 | 100 | 100 | 100 | 99 | 91 | 98.1%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 90 | 98.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 90 | 98.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 90 | 97.9%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 99 | 90 | 97.8%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 89 | 97.7%
Aion 2.0 | 100 | 100 | 100 | 100 | 88 | 97.6%
Mistral Large 3 | 100 | 100 | 100 | 100 | 88 | 97.6%
WizardLM 2 8x22b | 100 | 100 | 100 | 94 | 93 | 97.5%
GPT-4.1 Mini | 100 | 100 | 100 | 98 | 90 | 97.5%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 87 | 97.4%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 87 | 97.4%
GPT-5 | 100 | 100 | 100 | 97 | 90 | 97.2%
Z.AI GLM 5 | 100 | 100 | 98 | 94 | 93 | 97.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 83 | 96.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 82 | 96.5%
o4 Mini High | 100 | 100 | 100 | 100 | 82 | 96.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 82 | 96.4%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 81 | 96.1%
LFM2 24B | 100 | 100 | 100 | 100 | 81 | 96.1%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 80 | 96.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 77 | 95.5%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 76 | 95.3%
DeepSeek V3.2 | 100 | 100 | 98 | 91 | 85 | 94.9%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 95 | 79 | 94.7%
Claude Haiku 4.5 | 100 | 100 | 100 | 94 | 78 | 94.3%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 70 | 93.9%
Grok 4.20 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 89 | 77 | 93.2%
Hermes 3 70B | 100 | 100 | 100 | 100 | 66 | 93.1%
Qwen 3 32B | 100 | 100 | 100 | 89 | 75 | 92.8%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 99 | 63 | 92.4%
GPT-5 Mini | 100 | 100 | 100 | 87 | 73 | 91.9%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 99 | 87 | 70 | 91.2%
Grok 4 Fast | 100 | 100 | 100 | 86 | 63 | 90.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 91 | 59 | 90.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 98 | 83 | 68 | 89.7%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 91 | 55 | 89.2%
Ministral 3B | 100 | 100 | 100 | 90 | 55 | 89.1%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 87 | 54 | 88.1%
DeepSeek V3.1 | 100 | 100 | 88 | 84 | 61 | 86.7%
MiniMax M2.7 | 100 | 100 | 83 | 76 | 71 | 86.1%
Gemma 4 26B | 100 | 100 | 100 | 65 | 64 | 85.9%
Claude 3.5 Sonnet | 100 | 100 | 85 | 73 | 70 | 85.5%
MoonshotAI: Kimi K2.5 | 100 | 97 | 93 | 86 | 49 | 85.1%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 66 | 59 | 84.9%
Stealth: Healer Alpha | 100 | 94 | 87 | 71 | 71 | 84.7%
Z.AI GLM 4.5 | 100 | 100 | 82 | 81 | 52 | 82.8%
GPT-5 Nano | 100 | 93 | 87 | 74 | 60 | 82.6%
Claude Sonnet 4 | 100 | 100 | 93 | 86 | 19 | 79.5%
Gemini 3 Flash (Preview) | 100 | 100 | 95 | 76 | 24 | 79.0%
Gemini 2.5 Flash Lite | 99 | 95 | 94 | 57 | 41 | 77.2%
Arcee AI: Trinity Mini | 99 | 89 | 73 | 71 | 44 | 75.4%
Gemma 4 31B | 100 | 86 | 69 | 65 | 52 | 74.5%
Z.AI GLM 4.6 | 100 | 100 | 82 | 67 | 21 | 74.0%
Claude 3.7 Sonnet | 100 | 100 | 83 | 39 | 38 | 71.9%
Gemma 4 31B (Reasoning) | 100 | 91 | 88 | 52 | 28 | 71.7%
Gemini 3 Flash (Preview, Reasoning) | 87 | 76 | 72 | 69 | 34 | 67.9%
Gemma 3 4B | 100 | 86 | 79 | 49 | 25 | 67.7%
Qwen 2.5 72B | 100 | 100 | 62 | 52 | 24 | 67.6%
GPT-OSS 120B | 100 | 99 | 56 | 45 | 27 | 65.6%
Mistral Small 3.2 24B | 100 | 100 | 100 | 24 | 0 | 64.8%
Gemini 2.5 Flash Lite (Reasoning) | 74 | 71 | 68 | 59 | 45 | 63.2%
Nemotron 3 Nano | 100 | 87 | 54 | 52 | 13 | 61.1%
ByteDance Seed 1.6 | 96 | 76 | 70 | 43 | 16 | 60.2%
Inception Mercury | 100 | 100 | 100 | 0 | 0 | 60.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 71 | 17 | 0 | 57.6%
Inception Mercury 2 | 93 | 85 | 55 | 48 | 5 | 57.4%
Nemotron 3 Super | 100 | 94 | 48 | 33 | 0 | 55.2%
Gemini 3.1 Flash Lite (Preview) | 93 | 84 | 73 | 15 | 10 | 55.1%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 89 | 41 | 39 | 0 | 53.8%
Gemma 4 26B (Reasoning) | 90 | 63 | 62 | 48 | 0 | 52.7%
Gemini 3.1 Flash Lite | 82 | 75 | 48 | 32 | 10 | 49.3%
Llama 3.1 70B | 63 | 63 | 61 | 54 | 0 | 48.4%
Stealth: Aurora Alpha | 82 | 81 | 70 | 0 | 0 | 46.4%
Llama 3.1 Nemotron 70B | 100 | 48 | 17 | 0 | 0 | 32.9%
Llama 3.1 8B | 76 | 48 | 39 | 0 | 0 | 32.4%

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 99 | 99.8%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 99 | 99.8%
Ministral 8B | 100 | 100 | 100 | 100 | 99 | 99.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 98 | 99.6%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 98 | 99.6%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 98 | 99.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 98 | 99.5%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 98 | 99.5%
GPT-4.1 | 100 | 100 | 100 | 100 | 95 | 98.9%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.7%
GPT-5.2 | 100 | 100 | 100 | 100 | 93 | 98.6%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 93 | 98.5%
Ministral 3 3B | 100 | 100 | 100 | 100 | 93 | 98.5%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 92 | 98.4%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 91 | 98.3%
Grok 4.20 | 100 | 100 | 100 | 100 | 91 | 98.2%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 91 | 98.1%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 89 | 97.9%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 89 | 97.7%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 88 | 97.7%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 96 | 90 | 97.4%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 97 | 89 | 97.1%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 86 | 97.1%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 85 | 96.9%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 95 | 88 | 96.6%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 82 | 96.4%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 82 | 96.4%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 82 | 96.4%
Gemini 2.5 Pro | 100 | 100 | 100 | 96 | 86 | 96.3%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 81 | 96.1%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 79 | 95.8%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 78 | 95.6%
Gemma 4 26B | 100 | 100 | 100 | 94 | 83 | 95.6%
Claude Haiku 4.5 | 100 | 100 | 100 | 92 | 85 | 95.4%
Gemma 4 31B | 100 | 100 | 100 | 96 | 81 | 95.4%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 91 | 82 | 94.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 90 | 82 | 94.3%
Aion 2.0 | 100 | 100 | 100 | 95 | 76 | 94.1%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 90 | 79 | 93.7%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 68 | 93.5%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 68 | 93.5%
DeepSeek V3.1 | 100 | 100 | 96 | 89 | 82 | 93.5%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 99 | 68 | 93.3%
GPT-5 Mini | 100 | 100 | 100 | 89 | 77 | 93.3%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 99 | 68 | 93.3%
Z.AI GLM 5.1 | 100 | 100 | 100 | 93 | 72 | 93.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 61 | 92.2%
Gemini 2.5 Flash | 100 | 100 | 100 | 88 | 70 | 91.6%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 88 | 67 | 90.9%
MiniMax M2.5 | 100 | 100 | 96 | 84 | 72 | 90.4%
Gemma 3 4B | 100 | 100 | 100 | 100 | 51 | 90.1%
Ministral 3B | 100 | 100 | 100 | 100 | 50 | 90.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 45 | 89.1%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 45 | 89.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 45 | 89.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 80 | 61 | 88.3%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 81 | 58 | 87.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 85 | 48 | 86.6%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 80 | 51 | 86.1%
Gemma 3 12B | 100 | 100 | 79 | 74 | 71 | 84.8%
Gemini 3 Flash (Preview, Reasoning) | 100 | 95 | 93 | 76 | 60 | 84.8%
Gemini 2.5 Flash Lite | 100 | 89 | 87 | 68 | 67 | 82.4%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 9 | 81.7%
Rocinante 12B | 100 | 100 | 100 | 71 | 36 | 81.4%
Z.AI GLM 4.6 | 100 | 94 | 86 | 78 | 48 | 81.0%
Qwen 2.5 72B | 100 | 100 | 84 | 81 | 39 | 80.8%
Stealth: Aurora Alpha | 100 | 100 | 89 | 81 | 35 | 80.8%
Claude Opus 4.7 | 100 | 100 | 100 | 49 | 46 | 79.0%
Llama 3.1 70B | 100 | 100 | 90 | 66 | 37 | 78.5%
GPT-5 Nano | 100 | 100 | 89 | 85 | 13 | 77.4%
Z.AI GLM 4.5 | 100 | 100 | 100 | 80 | 0 | 76.0%
Gemini 3 Flash (Preview) | 92 | 89 | 87 | 60 | 44 | 74.5%
Nemotron 3 Super | 86 | 80 | 71 | 61 | 60 | 71.4%
Hermes 3 70B | 100 | 100 | 95 | 50 | 0 | 69.0%
Gemini 2.5 Flash Lite (Reasoning) | 97 | 71 | 55 | 45 | 45 | 62.6%
Gemini 3.1 Flash Lite (Reasoning) | 90 | 88 | 52 | 49 | 33 | 62.3%
Nemotron 3 Nano | 100 | 76 | 66 | 17 | 0 | 51.7%
Llama 3.1 8B | 100 | 75 | 66 | 0 | 0 | 48.1%
Llama 3.1 Nemotron 70B | 100 | 76 | 33 | 15 | 5 | 45.9%
ByteDance Seed 2.0 Mini | 60 | 46 | 38 | 13 | 11 | 33.7%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 99 | 99.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 98 | 99.6%
LFM2 24B | 100 | 100 | 100 | 100 | 98 | 99.5%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 97 | 99.5%
GPT-5.2 | 100 | 100 | 100 | 100 | 97 | 99.4%
Mistral Large | 100 | 100 | 100 | 100 | 97 | 99.4%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.4%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.4%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 96 | 99.3%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-5.1 | 100 | 100 | 100 | 100 | 92 | 98.5%
Grok 4.20 | 100 | 100 | 100 | 100 | 91 | 98.2%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 91 | 98.1%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 90 | 98.1%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 96 | 94 | 97.9%
Gemma 3 27B | 100 | 100 | 100 | 100 | 87 | 97.5%
Rocinante 12B | 100 | 100 | 100 | 100 | 87 | 97.4%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 87 | 97.4%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 94 | 93 | 97.3%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 98 | 89 | 97.2%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 86 | 97.1%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 86 | 97.1%
Mistral Small Creative | 100 | 100 | 100 | 100 | 86 | 97.1%
Ministral 3 14B | 100 | 100 | 100 | 100 | 86 | 97.1%
Grok 4.3 | 100 | 100 | 100 | 94 | 92 | 97.1%
Qwen 3 32B | 100 | 100 | 100 | 95 | 90 | 97.1%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 85 | 96.9%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 84 | 96.9%
Hermes 3 70B | 100 | 100 | 100 | 100 | 83 | 96.7%
Mistral Large 2 | 100 | 100 | 100 | 95 | 89 | 96.7%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 82 | 96.4%
Mistral Large 3 | 100 | 100 | 100 | 100 | 80 | 96.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 80 | 96.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 80 | 95.9%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 79 | 95.8%
Ministral 3 3B | 100 | 100 | 100 | 100 | 77 | 95.4%
DeepSeek V3.1 | 100 | 100 | 100 | 90 | 87 | 95.4%
Gemini 2.5 Pro | 100 | 100 | 93 | 93 | 89 | 95.1%
GPT-5 Mini | 100 | 100 | 100 | 98 | 75 | 94.7%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 71 | 94.3%
Z.AI GLM 4.6 | 100 | 100 | 100 | 87 | 84 | 94.1%
Stealth: Healer Alpha | 100 | 100 | 100 | 99 | 70 | 93.9%
Gemma 3 12B | 100 | 100 | 100 | 92 | 74 | 93.1%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 65 | 93.1%
Claude Haiku 4.5 | 100 | 100 | 100 | 93 | 71 | 92.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 63 | 92.7%
Grok 4.20 (Beta) | 100 | 100 | 96 | 88 | 78 | 92.5%
Claude Opus 4.5 | 100 | 100 | 96 | 86 | 80 | 92.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 97 | 64 | 92.3%
Grok 4 | 100 | 100 | 100 | 94 | 67 | 92.1%
Gemini 2.5 Flash Lite | 100 | 100 | 96 | 95 | 69 | 92.0%
GPT-4.1 Mini | 100 | 100 | 100 | 91 | 68 | 91.7%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 86 | 70 | 91.3%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 86 | 70 | 91.1%
Qwen 2.5 72B | 100 | 100 | 100 | 97 | 57 | 90.8%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 54 | 90.8%
Aion 2.0 | 100 | 100 | 89 | 88 | 75 | 90.4%
Gemma 4 31B | 100 | 98 | 88 | 83 | 81 | 90.2%
Claude Sonnet 4 | 100 | 100 | 100 | 91 | 60 | 90.1%
Z.AI GLM 4.5 | 100 | 100 | 92 | 91 | 66 | 89.8%
GPT-5.4 Nano (Reasoning, Low) | 100 | 89 | 88 | 85 | 84 | 89.2%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 44 | 88.7%
WizardLM 2 8x22b | 100 | 100 | 100 | 97 | 43 | 88.1%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 70 | 69 | 87.6%
Z.AI GLM 4.7 Flash | 100 | 100 | 98 | 86 | 55 | 87.6%
Mistral NeMO | 100 | 100 | 100 | 95 | 41 | 87.2%
Claude 3.5 Sonnet | 100 | 94 | 88 | 85 | 65 | 86.4%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 97 | 71 | 60 | 85.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 92 | 35 | 85.3%
Gemma 4 26B | 100 | 96 | 94 | 90 | 43 | 84.4%
GPT-4o, May 13th (temp=1) | 100 | 100 | 87 | 87 | 39 | 82.6%
GPT-5 Nano | 100 | 80 | 77 | 75 | 74 | 81.1%
Claude 3 Haiku | 100 | 100 | 89 | 81 | 35 | 80.9%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 90 | 80 | 30 | 80.0%
ByteDance Seed 2.0 Lite | 100 | 90 | 87 | 63 | 48 | 77.6%
Z.AI GLM 4.5 Air | 100 | 97 | 78 | 61 | 52 | 77.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 66 | 14 | 76.2%
Arcee AI: Trinity Mini | 100 | 88 | 75 | 66 | 52 | 76.0%
Claude 3.7 Sonnet | 100 | 89 | 88 | 54 | 49 | 75.9%
Gemini 3.1 Flash Lite (Preview) | 89 | 85 | 71 | 62 | 54 | 72.1%
Gemini 3 Flash (Preview) | 93 | 88 | 72 | 66 | 42 | 72.0%
Gemini 3 Flash (Preview, Reasoning) | 95 | 85 | 82 | 73 | 24 | 71.6%
Gemini 3.1 Flash Lite | 100 | 100 | 67 | 66 | 22 | 70.9%
Gemma 3 4B | 86 | 77 | 72 | 68 | 47 | 69.8%
Llama 3.1 70B | 100 | 75 | 59 | 54 | 52 | 67.8%
ByteDance Seed 1.6 | 100 | 69 | 61 | 55 | 48 | 66.6%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 27 | 0 | 65.4%
ByteDance Seed 2.0 Mini | 100 | 100 | 70 | 50 | 0 | 63.9%
Inception Mercury | 100 | 100 | 66 | 46 | 0 | 62.4%
Nemotron 3 Super | 82 | 75 | 58 | 44 | 38 | 59.2%
Llama 3.1 8B | 100 | 100 | 57 | 33 | 3 | 58.7%
GPT-OSS 120B | 56 | 49 | 47 | 26 | 22 | 39.8%
Stealth: Aurora Alpha | 77 | 55 | 31 | 29 | 0 | 38.6%
Inception Mercury 2 | 69 | 60 | 43 | 12 | 0 | 36.9%
Nemotron 3 Nano | 100 | 67 | 8 | 0 | 0 | 35.1%

genre

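Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 99 | 99.8%
Mistral Large | 100 | 100 | 100 | 100 | 99 | 99.7%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 97 | 99.4%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 97 | 99.4%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 95 | 98.9%
o4 Mini | 100 | 100 | 100 | 100 | 92 | 98.5%
Ministral 8B | 100 | 100 | 100 | 100 | 91 | 98.3%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 91 | 98.2%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 91 | 98.1%
Grok 4 Fast | 100 | 100 | 100 | 100 | 89 | 97.8%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 88 | 97.6%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 88 | 97.6%
Mistral Large 3 | 100 | 100 | 100 | 100 | 87 | 97.5%
Grok 4.3 | 100 | 100 | 100 | 99 | 87 | 97.2%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 83 | 96.7%
Qwen 3.5 35B | 100 | 100 | 100 | 95 | 87 | 96.3%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 82 | 96.3%
Qwen 3.6 27B | 100 | 100 | 100 | 95 | 86 | 96.3%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 81 | 96.1%
GPT-5.2 | 100 | 100 | 100 | 93 | 86 | 95.8%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 79 | 95.8%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 77 | 95.4%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 91 | 85 | 95.2%
DeepSeek V4 Pro | 100 | 100 | 100 | 98 | 74 | 94.4%
Grok 4 | 100 | 100 | 97 | 96 | 78 | 94.2%
GPT-5 Mini | 100 | 100 | 100 | 88 | 83 | 94.1%
Qwen 3.6 35B | 100 | 100 | 100 | 94 | 73 | 93.6%
DeepSeek V4 Flash | 100 | 100 | 91 | 88 | 87 | 93.3%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 86 | 80 | 93.2%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 66 | 93.1%
GPT-4.1 | 100 | 100 | 100 | 89 | 73 | 92.4%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 87 | 75 | 92.4%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 99 | 61 | 92.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 99 | 96 | 64 | 91.8%
DeepSeek V3.1 | 100 | 96 | 95 | 84 | 79 | 90.9%
GPT-5.4 Nano (Reasoning, Low) | 100 | 99 | 98 | 89 | 63 | 89.8%
Claude Opus 4 | 100 | 100 | 89 | 86 | 74 | 89.7%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 83 | 63 | 89.2%
DeepSeek-V2 Chat | 100 | 100 | 100 | 81 | 63 | 88.7%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 89 | 54 | 88.5%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 91 | 51 | 88.4%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 88 | 54 | 88.3%
Mistral NeMO | 100 | 100 | 100 | 73 | 68 | 88.1%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 96 | 88 | 56 | 88.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 37 | 87.5%
Qwen 3.5 122B | 100 | 100 | 93 | 79 | 65 | 87.4%
Ministral 3B | 100 | 100 | 100 | 76 | 59 | 87.0%
GPT-5.4 Mini | 100 | 93 | 86 | 85 | 68 | 86.6%
LFM2 24B | 100 | 100 | 100 | 98 | 32 | 86.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 80 | 48 | 85.5%
Z.AI GLM 5 Turbo | 100 | 100 | 86 | 79 | 61 | 85.2%
Xiaomi MIMO v2.5 | 100 | 99 | 95 | 66 | 65 | 85.1%
Rocinante 12B | 100 | 100 | 100 | 75 | 48 | 84.5%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 80 | 41 | 84.2%
Gemini 3.1 Pro (Preview) | 100 | 99 | 88 | 79 | 55 | 84.1%
Z.AI GLM 4.7 | 100 | 100 | 86 | 68 | 65 | 83.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 15 | 83.0%
Claude 3 Haiku | 100 | 98 | 91 | 88 | 36 | 82.5%
Qwen 3.5 397B A17B | 100 | 98 | 91 | 72 | 51 | 82.4%
MoonshotAI: Kimi K2.5 | 100 | 100 | 87 | 62 | 61 | 81.8%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 60 | 41 | 80.3%
Qwen 3.5 Flash | 100 | 93 | 79 | 60 | 59 | 78.1%
Aion 2.0 | 100 | 85 | 79 | 65 | 60 | 77.7%
GPT-4o Mini (temp=0) | 100 | 100 | 75 | 55 | 44 | 74.7%
Gemini 3 Flash (Preview, Reasoning) | 100 | 99 | 74 | 70 | 29 | 74.1%
DeepSeek V4 Flash (Reasoning) | 95 | 92 | 79 | 60 | 40 | 73.0%
Gemma 3 27B | 100 | 100 | 88 | 70 | 7 | 72.9%
Qwen 3.5 27B | 100 | 100 | 71 | 71 | 6 | 69.5%
Gemma 3 12B | 100 | 71 | 69 | 68 | 36 | 68.8%
Gemini 2.5 Pro | 100 | 87 | 85 | 52 | 14 | 67.6%
Z.AI GLM 5.1 | 100 | 89 | 73 | 44 | 32 | 67.5%
DeepSeek V3 (2024-12-26) | 100 | 100 | 56 | 41 | 39 | 67.2%
GPT-4o, May 13th (temp=0) | 100 | 83 | 69 | 54 | 29 | 67.1%
Gemini 2.5 Flash Lite | 84 | 73 | 65 | 65 | 45 | 66.6%
Gemma 4 31B (Reasoning) | 100 | 79 | 60 | 54 | 38 | 66.1%
MiniMax M2.5 | 100 | 82 | 68 | 46 | 24 | 63.9%
Gemini 3.1 Flash Lite (Preview) | 100 | 86 | 52 | 41 | 36 | 62.9%
Gemma 4 31B | 94 | 70 | 58 | 48 | 43 | 62.5%
Z.AI GLM 5 | 91 | 79 | 70 | 45 | 22 | 61.3%
Gemini 3.1 Flash Lite | 92 | 90 | 82 | 32 | 10 | 61.2%
DeepSeek V3.2 | 100 | 65 | 51 | 49 | 40 | 61.0%
Gemini 3 Flash (Preview) | 100 | 86 | 72 | 39 | 3 | 60.2%
Claude 3.7 Sonnet | 100 | 100 | 72 | 22 | 0 | 58.7%
Gemini 2.5 Flash (Reasoning) | 94 | 74 | 63 | 57 | 0 | 57.6%
Ministral 3 3B | 100 | 100 | 42 | 21 | 14 | 55.4%
Claude Opus 4.5 | 95 | 63 | 60 | 42 | 16 | 55.3%
Hermes 3 70B | 100 | 92 | 60 | 19 | 0 | 54.1%
Gemini 3.1 Flash Lite (Reasoning) | 89 | 63 | 56 | 29 | 27 | 52.6%
Stealth: Healer Alpha | 100 | 78 | 45 | 34 | 3 | 52.0%
GPT-4.1 Nano | 100 | 94 | 35 | 26 | 0 | 50.8%
Claude Haiku 4.5 | 81 | 72 | 68 | 19 | 2 | 48.2%
Gemini 2.5 Flash | 81 | 63 | 50 | 46 | 0 | 48.1%
ByteDance Seed 1.6 | 88 | 85 | 31 | 24 | 0 | 45.4%
Gemma 3 4B | 94 | 52 | 43 | 24 | 8 | 44.2%
GPT-4o, May 13th (temp=1) | 89 | 75 | 24 | 0 | 0 | 37.4%
Z.AI GLM 4.6 | 74 | 54 | 51 | 0 | 0 | 35.8%
Qwen 3.5 Plus (2026-02-15) | 48 | 48 | 40 | 27 | 0 | 32.3%
GPT-4o, Aug. 6th (temp=0) | 77 | 65 | 18 | 0 | 0 | 31.9%
GPT-5 Nano | 87 | 54 | 17 | 0 | 0 | 31.6%
MiniMax M2.7 | 78 | 36 | 31 | 11 | 0 | 31.2%
Arcee AI: Trinity Mini | 58 | 52 | 33 | 10 | 0 | 30.5%
Qwen 2.5 72B | 86 | 26 | 24 | 0 | 0 | 27.2%
Arcee AI: Trinity Large (Preview) | 79 | 29 | 16 | 0 | 0 | 24.6%
Llama 3.1 8B | 96 | 16 | 0 | 0 | 0 | 22.4%
Llama 3.1 Nemotron 70B | 92 | 13 | 7 | 0 | 0 | 22.3%
Claude Sonnet 4 | 42 | 31 | 20 | 11 | 0 | 20.8%
Gemma 4 26B | 46 | 32 | 8 | 0 | 0 | 17.1%
Z.AI GLM 4.5 | 49 | 27 | 0 | 0 | 0 | 15.2%
Z.AI GLM 4.5 Air | 38 | 35 | 0 | 0 | 0 | 14.6%
Gemini 2.5 Flash Lite (Reasoning) | 37 | 27 | 7 | 0 | 0 | 14.1%
WizardLM 2 8x22b | 48 | 11 | 0 | 0 | 0 | 11.6%
Claude 3.5 Sonnet | 43 | 0 | 0 | 0 | 0 | 8.6%
Inception Mercury | 23 | 0 | 0 | 0 | 0 | 4.7%
Llama 3.1 70B | 17 | 0 | 0 | 0 | 0 | 3.4%
Nemotron 3 Super | 16 | 0 | 0 | 0 | 0 | 3.2%
Gemma 4 26B (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%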