Adverbs in dialogue tags

Test: Bad Writing Habits

Avg. Score: 90.9%
Scenarios: 18

Overall Performance

| Rank | Model | Score | Avg. Cost | Avg. Time | Stability |
|---|---|---|---|---|---|
| 1 | Inception Mercury 2 | 100.0% | $0.0032 | 7.0s | 100% |
| 2 | Gemini 3.1 Flash Lite (Reasoning) | 99.7% | $0.0030 | 11.9s | 96% |
| 3 | Gemini 3.1 Flash Lite (Preview) | 98.9% | $0.0030 | 8.4s | 89% |
| 4 | Gemini 3.1 Flash Lite | 99.3% | $0.0030 | 12.1s | 86% |
| 5 | ByteDance Seed 1.6 Flash | 99.2% | $0.0013 | 27.3s | 88% |
| 6 | Grok 4.1 Fast | 98.7% | $0.0018 | 37.8s | 89% |
| 7 | Stealth: Aurora Alpha | 98.6% | $0.0000 | 9.8s | 78% |
| 8 | Qwen 3.6 35B | 99.2% | $0.0083 | 1.0m | 91% |
| 9 | Ministral 8B | 97.6% | $0.0004 | 10.4s | 76% |
| 10 | Gemini 3 Flash (Preview) | 98.4% | $0.0078 | 19.6s | 78% |
| 11 | GPT-5.4 Nano | 97.2% | $0.0057 | 26.3s | 81% |
| 12 | DeepSeek V4 Flash | 97.8% | $0.0006 | 31.6s | 78% |
| 13 | Ministral 3 3B | 96.8% | $0.0005 | 11.1s | 70% |
| 14 | Qwen 3.5 Flash | 98.1% | $0.0025 | 47.5s | 77% |
| 15 | Gemini 3 Flash (Preview, Reasoning) | 97.8% | $0.012 | 30.1s | 77% |
| 16 | GPT-5.4 Mini | 97.8% | $0.015 | 16.8s | 74% |
| 17 | Gemma 4 26B | 97.5% | $0.0009 | 55.1s | 78% |
| 18 | Qwen 2.5 72B | 97.7% | $0.0010 | 36.7s | 71% |
| 19 | Mistral NeMO | 95.6% | $0.0005 | 10.1s | 64% |
| 20 | DeepSeek V4 Flash (Reasoning) | 96.2% | $0.0007 | 31.1s | 67% |
| 21 | Qwen 3.5 35B | 98.5% | $0.018 | 1.0m | 78% |
| 22 | Z.AI GLM 4.6 | 96.6% | $0.0065 | 51.5s | 73% |
| 23 | DeepSeek V4 Pro | 97.6% | $0.0048 | 1.3m | 76% |
| 24 | o4 Mini | 95.8% | $0.015 | 25.7s | 71% |
| 25 | GPT-5.4 Mini (Reasoning, Low) | 96.8% | $0.015 | 16.8s | 66% |
| 26 | Qwen 3.5 27B | 98.9% | $0.020 | 1.6m | 85% |
| 27 | Stealth: Healer Alpha | 95.0% | $0.0000 | 23.7s | 64% |
| 28 | Ministral 3B | 94.3% | $0.0001 | 8.1s | 61% |
| 29 | Qwen 3.6 Flash | 96.3% | $0.010 | 41.4s | 68% |
| 30 | Qwen 3.5 Plus (2026-02-15) | 94.6% | $0.0060 | 31.5s | 65% |
| 31 | Nemotron 3 Nano | 96.2% | $0.0010 | 1.1m | 67% |
| 32 | Mistral Small 4 (Reasoning) | 92.5% | $0.0022 | 30.2s | 65% |
| 33 | Stealth: Hunter Alpha | 95.4% | $0.0000 | 55.0s | 64% |
| 34 | Z.AI GLM 5 Turbo | 95.2% | $0.0081 | 33.2s | 63% |
| 35 | GPT-5.4 Nano (Reasoning) | 94.2% | $0.0061 | 24.5s | 61% |
| 36 | GPT-5.4 Nano (Reasoning, Low) | 94.1% | $0.0055 | 20.6s | 59% |
| 37 | Gemma 4 31B | 97.4% | $0.0010 | 1.6m | 70% |
| 38 | GPT-OSS 120B | 96.5% | $0.0015 | 1.8m | 73% |
| 39 | Gemma 4 26B (Reasoning) | 97.8% | $0.0013 | 2.0m | 74% |
| 40 | GPT-4o Mini (temp=0) | 93.4% | $0.0012 | 34.8s | 60% |
| 41 | Mistral Small Creative | 93.2% | $0.0007 | 9.1s | 53% |
| 42 | Qwen 3.5 Plus (2026-04-20) | 98.0% | $0.017 | 1.8m | 76% |
| 43 | GPT-5.4 Mini (Reasoning) | 95.6% | $0.022 | 28.1s | 62% |
| 44 | Nemotron 3 Super | 95.4% | $0.0000 | 1.4m | 66% |
| 45 | Xiaomi MIMO v2.5 Pro | 94.8% | $0.0085 | 53.5s | 62% |
| 46 | Mistral Medium 3.1 | 93.4% | $0.0048 | 36.5s | 58% |
| 47 | WizardLM 2 8x22b | 94.7% | $0.0026 | 1.8m | 73% |
| 48 | Qwen 3.5 122B | 96.2% | $0.025 | 1.1m | 69% |
| 49 | Grok 4 Fast | 92.2% | $0.0017 | 24.1s | 55% |
| 50 | Qwen3 235B A22B Instruct 2507 | 93.5% | $0.0011 | 59.2s | 59% |
| 51 | GPT-5.4 | 98.4% | $0.049 | 1.4m | 79% |
| 52 | Gemma 4 31B (Reasoning) | 97.2% | $0.0014 | 2.2m | 69% |
| 53 | GPT-4o, May 13th (temp=0) | 94.0% | $0.035 | 14.1s | 61% |
| 54 | Gemini 2.5 Pro | 94.8% | $0.036 | 36.2s | 65% |
| 55 | Ministral 3 8B | 90.8% | $0.0008 | 19.6s | 52% |
| 56 | Grok 4.3 | 92.6% | $0.0069 | 30.5s | 54% |
| 57 | Writer: Palmyra X5 | 92.3% | $0.011 | 22.0s | 54% |
| 58 | Qwen 3.5 9B | 94.6% | $0.0011 | 1.4m | 59% |
| 59 | Cohere Command R+ (Aug. 2024) | 93.8% | $0.020 | 52.5s | 62% |
| 60 | Gemini 2.5 Flash Lite | 89.1% | $0.0009 | 9.5s | 50% |
| 61 | GPT-4.1 | 93.4% | $0.018 | 44.7s | 58% |
| 62 | Qwen 3 32B | 92.7% | $0.0015 | 54.6s | 54% |
| 63 | Qwen 3.5 397B A17B | 98.9% | $0.014 | 3.0m | 79% |
| 64 | Gemini 2.5 Flash | 89.8% | $0.0052 | 10.6s | 49% |
| 65 | Gemma 3 4B | 89.6% | $0.0002 | 20.0s | 49% |
| 66 | LFM2 24B | 89.1% | $0.0002 | 28.4s | 51% |
| 67 | Z.AI GLM 4.5 | 89.3% | $0.0051 | 42.1s | 56% |
| 68 | Qwen3.6 Max Preview | 100.0% | $0.050 | 3.5m | 100% |
| 69 | Z.AI GLM 4.5 Air | 91.8% | $0.0029 | 58.2s | 54% |
| 70 | Inception Mercury | 92.2% | $0.011 | 17.6s | 46% |
| 71 | Mistral Large | 92.0% | $0.014 | 30.9s | 50% |
| 72 | GPT-4o, Aug. 6th (temp=0) | 90.5% | $0.023 | 22.7s | 54% |
| 73 | Xiaomi MIMO v2.5 | 90.6% | $0.0054 | 31.8s | 48% |
| 74 | Aion 2.0 | 92.8% | $0.0064 | 1.3m | 55% |
| 75 | Grok 4.3 (Reasoning) | 96.9% | $0.021 | 2.3m | 69% |
| 76 | Ministral 3 14B | 88.8% | $0.0007 | 11.7s | 43% |
| 77 | MiniMax M2.7 | 90.6% | $0.0040 | 1.1m | 53% |
| 78 | o4 Mini High | 91.8% | $0.025 | 47.2s | 56% |
| 79 | Z.AI GLM 4.7 | 92.0% | $0.010 | 1.4m | 57% |
| 80 | Mistral Large 2 | 90.1% | $0.013 | 29.4s | 47% |
| 81 | Gemma 3 27B | 87.8% | $0.0006 | 52.6s | 49% |
| 82 | DeepSeek V3.2 | 91.7% | $0.0014 | 1.9m | 56% |
| 83 | Mistral Large 3 | 88.9% | $0.0033 | 30.3s | 41% |
| 84 | Z.AI GLM 4.7 Flash | 89.9% | $0.0017 | 1.2m | 49% |
| 85 | Claude Sonnet 4.5 | 91.7% | $0.035 | 38.1s | 52% |
| 86 | GPT-5.5 (Reasoning, Low) | 100.0% | $0.139 | 1.8m | 100% |
| 87 | GPT-5.2 | 95.2% | $0.056 | 1.5m | 65% |
| 88 | Grok 4.20 (Beta) | 85.6% | $0.018 | 15.8s | 45% |
| 89 | Gemini 3 Pro (Preview) | 93.1% | $0.055 | 54.4s | 57% |
| 90 | ByteDance Seed 1.6 | 94.4% | $0.013 | 2.5m | 59% |
| 91 | Llama 3.1 8B | 89.1% | $0.0003 | 1.3m | 44% |
| 92 | Mistral Small 4 | 84.2% | $0.0014 | 18.2s | 38% |
| 93 | Z.AI GLM 5 | 87.7% | $0.0084 | 1.2m | 48% |
| 94 | Gemini 2.5 Flash Lite (Reasoning) | 85.4% | $0.0028 | 30.8s | 39% |
| 95 | Z.AI GLM 5.1 | 91.2% | $0.014 | 1.5m | 48% |
| 96 | Gemini 2.5 Flash (Reasoning) | 85.8% | $0.011 | 21.5s | 39% |
| 97 | Grok 4.20 | 85.2% | $0.0093 | 45.7s | 44% |
| 98 | Claude 3 Haiku | 83.8% | $0.0025 | 14.9s | 35% |
| 99 | Claude Opus 4.7 (Reasoning) | 93.9% | $0.076 | 32.0s | 55% |
| 100 | Claude Sonnet 4.6 | 89.0% | $0.031 | 39.3s | 45% |
| 101 | DeepSeek V3 (2025-03-24) | 85.0% | $0.0014 | 39.4s | 36% |
| 102 | Grok 4.20 (Reasoning) | 89.4% | $0.018 | 1.5m | 47% |
| 103 | GPT-5 Mini | 86.1% | $0.0100 | 57.4s | 41% |
| 104 | GPT-5.4 (Reasoning, Low) | 93.0% | $0.055 | 1.4m | 55% |
| 105 | Llama 3.1 Nemotron 70B | 83.2% | $0.0038 | 31.7s | 34% |
| 106 | Gemma 3 12B | 82.5% | $0.0004 | 41.3s | 36% |
| 107 | Claude Opus 4.6 | 93.4% | $0.078 | 1.2m | 59% |
| 108 | Llama 3.1 70B | 83.2% | $0.0015 | 29.4s | 32% |
| 109 | Grok 4 | 92.2% | $0.048 | 1.7m | 53% |
| 110 | Arcee AI: Trinity Mini | 80.1% | $0.0003 | 9.2s | 30% |
| 111 | MiniMax M2.5 | 82.7% | $0.0034 | 1.3m | 42% |
| 112 | Qwen 3.6 27B | 91.3% | $0.025 | 2.3m | 53% |
| 113 | ByteDance Seed 2.0 Lite | 90.8% | $0.012 | 2.2m | 45% |
| 114 | GPT-5.1 | 90.8% | $0.054 | 1.8m | 58% |
| 115 | Claude Opus 4.6 (Reasoning) | 93.8% | $0.088 | 1.4m | 60% |
| 116 | DeepSeek V3 (2024-12-26) | 82.4% | $0.0021 | 54.6s | 34% |
| 117 | GPT-4o Mini (temp=1) | 78.2% | $0.0012 | 34.8s | 36% |
| 118 | Claude Sonnet 4.6 (Reasoning) | 90.1% | $0.060 | 1.2m | 48% |
| 119 | DeepSeek-V2 Chat | 80.8% | $0.0021 | 53.3s | 30% |
| 120 | DeepSeek V3.1 | 84.8% | $0.0020 | 1.8m | 37% |
| 121 | GPT-5.5 (Reasoning) | 98.0% | $0.142 | 1.8m | 77% |
| 122 | Rocinante 12B | 79.5% | $0.0014 | 38.4s | 28% |
| 123 | MoonshotAI: Kimi K2.5 | 91.8% | $0.019 | 3.2m | 52% |
| 124 | GPT-4o, May 13th (temp=1) | 81.2% | $0.033 | 14.4s | 33% |
| 125 | Claude Opus 4.7 | 89.1% | $0.069 | 30.4s | 40% |
| 126 | GPT-5.4 (Reasoning) | 95.4% | $0.089 | 2.6m | 68% |
| 127 | GPT-5 | 94.3% | $0.065 | 2.8m | 62% |
| 128 | DeepSeek V4 Pro (Reasoning) | 90.4% | $0.015 | 3.1m | 47% |
| 129 | Claude Sonnet 4 | 82.1% | $0.032 | 43.7s | 35% |
| 130 | Claude Opus 4.5 | 87.5% | $0.070 | 53.4s | 44% |
| 131 | GPT-5.5 | 97.2% | $0.139 | 1.7m | 69% |
| 132 | Grok 4.20 (Beta, Reasoning) | 83.6% | $0.039 | 34.0s | 30% |
| 133 | GPT-4o, Aug. 6th (temp=1) | 77.0% | $0.018 | 24.4s | 30% |
| 134 | Claude 3.5 Sonnet | 82.7% | $0.048 | 35.4s | 34% |
| 135 | Gemini 3.1 Pro (Preview) | 93.3% | $0.107 | 1.8m | 59% |
| 136 | Arcee AI: Trinity Large (Preview) | 75.3% | $0.0000 | 43.6s | 22% |
| 137 | Claude 3.7 Sonnet | 78.0% | $0.042 | 46.7s | 34% |
| 138 | ByteDance Seed 2.0 Mini | 92.9% | $0.0045 | 4.9m | 51% |
| 139 | Claude Haiku 4.5 | 72.0% | $0.011 | 21.6s | 22% |
| 140 | GPT-4.1 Mini | 67.0% | $0.0027 | 19.0s | 20% |
| 141 | Hermes 3 405B | 70.7% | $0.0032 | 53.2s | 18% |
| 142 | Mistral Small 3.2 24B | 92.8% | $0.0069 | 5.7m | 50% |
| 143 | GPT-5 Nano | 65.9% | $0.0042 | 1.4m | 18% |
| 144 | Hermes 3 70B | 63.7% | $0.0010 | 1.2m | 13% |
| 145 | MoonshotAI: Kimi K2.6 | 94.4% | $0.058 | 6.5m | 61% |
| 146 | Claude Opus 4 | 88.7% | $0.209 | 1.4m | 50% |
| 147 | GPT-4.1 Nano | 46.6% | $0.0007 | 13.3s | 4% |

Individual Scenarios

Detailed Writing Rules

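| Model | #1 | #2 | #3 | #4 | #5 | Avg |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 95 | 98.9% |
| Grok 4.20 | 100 | 100 | 100 | 100 | 92 | 98.4% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 92 | 98.4% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 89 | 97.8% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 89 | 97.8% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 89 | 97.8% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 89 | 97.8% |
| LFM2 24B | 100 | 100 | 100 | 100 | 89 | 97.8% |
| o4 Mini | 100 | 100 | 100 | 100 | 86 | 97.1% |
| Mistral Large | 100 | 100 | 100 | 100 | 86 | 97.1% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 85 | 96.9% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 82 | 96.5% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 82 | 96.5% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 79 | 95.8% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 92 | 84 | 95.2% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 75 | 95.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 75 | 95.0% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 75 | 95.0% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 70 | 93.9% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 70 | 93.9% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 67 | 93.3% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 65 | 93.0% |
| Claude Opus 4 | 100 | 100 | 100 | 86 | 75 | 92.3% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 57 | 91.4% |
| Ministral 3B | 100 | 100 | 100 | 100 | 57 | 91.4% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 95 | 62 | 91.4% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 52 | 90.4% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 50 | 90.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 42 | 88.4% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 33 | 86.7% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 70 | 64 | 86.6% |
| GPT-5 Nano | 100 | 100 | 100 | 87 | 44 | 86.1% |
| GPT-4o Mini (temp=1) | 100 | 100 | 100 | 85 | 40 | 84.9% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 24 | 84.7% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 82 | 40 | 84.5% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 78 | 40 | 83.5% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 67 | 46 | 82.6% |
| Claude Sonnet 4 | 100 | 100 | 100 | 57 | 52 | 81.8% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Grok 4.3 | 100 | 100 | 100 | 67 | 33 | 80.0% |
| Ministral 3 3B | 100 | 100 | 100 | 100 | 0 | 80.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 89 | 0 | 77.8% |
| MiniMax M2.5 | 100 | 97 | 87 | 57 | 42 | 76.7% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 57 | 18 | 75.1% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 75 | 0 | 75.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 42 | 29 | 74.1% |
| Z.AI GLM 5 | 100 | 100 | 100 | 57 | 10 | 73.3% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 46 | 6 | 70.5% |
| Mistral Small 4 | 100 | 100 | 71 | 50 | 29 | 69.9% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 18 | 0 | 63.6% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 0 | 0 | 60.0% |
| Hermes 3 405B | 100 | 100 | 89 | 0 | 0 | 57.8% |
| Hermes 3 70B | 100 | 100 | 33 | 0 | 0 | 46.7% |
| GPT-4.1 Nano | 100 | 100 | 33 | 0 | 0 | 46.7% |
| Rocinante 12B | 100 | 52 | 0 | 0 | 0 | 30.4% |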
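
| Model | #1 | #2 | #3 | #4 | #5 | Avg |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 95 | 98.9% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 89 | 97.8% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 89 | 97.8% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 89 | 97.8% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 89 | 97.8% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 89 | 97.8% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 89 | 97.8% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 89 | 97.8% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 82 | 96.5% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 82 | 96.5% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 75 | 95.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 57 | 91.4% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 57 | 91.4% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 57 | 91.4% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 57 | 91.4% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 89 | 62 | 90.2% |
| Llama 3.1 8B | 100 | 100 | 100 | 75 | 75 | 90.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 33 | 86.7% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 33 | 86.7% |
| Grok 4 | 100 | 100 | 100 | 100 | 33 | 86.7% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 33 | 86.7% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 33 | 86.7% |
| Claude 3.7 Sonnet | 100 | 100 | 89 | 82 | 50 | 84.2% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 18 | 83.6% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 0 | 80.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 0 | 80.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 0 | 80.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 80.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 95 | 0 | 78.9% |
| Ministral 3 8B | 100 | 100 | 100 | 89 | 0 | 77.8% |
| Mistral Large | 100 | 100 | 100 | 82 | 0 | 76.5% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 82 | 0 | 76.5% |
| Hermes 3 405B | 100 | 100 | 100 | 75 | 0 | 75.0% |
| Gemma 3 4B | 100 | 100 | 89 | 75 | 0 | 72.8% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 57 | 0 | 71.4% |
| GPT-5 Nano | 100 | 100 | 100 | 26 | 0 | 65.2% |
| GPT-4.1 Mini | 100 | 100 | 67 | 57 | 0 | 64.8% |
| GPT-5 Mini | 100 | 100 | 100 | 0 | 0 | 60.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 0 | 0 | 60.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 0 | 0 | 60.0% |
| MiniMax M2.5 | 100 | 100 | 100 | 0 | 0 | 60.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 0 | 0 | 60.0% |
| Mistral Small 4 | 100 | 100 | 100 | 0 | 0 | 60.0% |
| Ministral 3 14B | 100 | 100 | 100 | 0 | 0 | 60.0% |
| Claude 3 Haiku | 100 | 100 | 100 | 0 | 0 | 60.0% |
| Claude Sonnet 4 | 100 | 100 | 67 | 0 | 0 | 53.3% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 33 | 18 | 0 | 50.3% |
| GPT-4o Mini (temp=1) | 67 | 33 | 33 | 33 | 0 | 33.3% |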
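
| Model | #1 | #2 | #3 | #4 | #5 | Avg |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.7% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 97 | 99.3% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 97 | 99.3% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 95 | 98.9% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 93 | 98.6% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 93 | 98.6% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 92 | 98.4% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 91 | 98.2% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 90 | 98.1% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 89 | 97.8% |
| Mistral NeMO | 100 | 100 | 100 | 100 | 89 | 97.8% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 87 | 97.5% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 87 | 97.4% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 85 | 96.9% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 85 | 96.9% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 82 | 96.5% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 82 | 96.5% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 80 | 96.0% |
| Llama 3.1 70B | 100 | 100 | 100 | 89 | 89 | 95.6% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 75 | 95.0% |
| Llama 3.1 8B | 100 | 100 | 100 | 100 | 75 | 95.0% |
| Z.AI GLM 4.6 | 100 | 100 | 97 | 97 | 80 | 95.0% |
| Grok 4.20 | 100 | 100 | 100 | 97 | 77 | 94.7% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 86 | 86 | 94.3% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 71 | 94.2% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 65 | 93.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 62 | 92.4% |
| GPT-4o Mini (temp=0) | 100 | 100 | 92 | 89 | 79 | 91.9% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 84 | 73 | 91.3% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 92 | 64 | 91.3% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 43 | 88.6% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 42 | 88.4% |
| Claude 3.7 Sonnet | 100 | 100 | 90 | 82 | 69 | 88.2% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 38 | 87.6% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 38 | 87.6% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 33 | 86.7% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 33 | 86.7% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 75 | 57 | 86.4% |
| GPT-4.1 Mini | 100 | 100 | 97 | 71 | 62 | 85.9% |
| Gemma 3 12B | 100 | 100 | 100 | 82 | 46 | 85.7% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 67 | 57 | 84.8% |
| o4 Mini High | 100 | 100 | 100 | 100 | 21 | 84.3% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 93 | 63 | 60 | 83.1% |
| Ministral 3 8B | 100 | 100 | 100 | 65 | 43 | 81.6% |
| DeepSeek V3.1 | 100 | 100 | 100 | 79 | 29 | 81.5% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Mistral Small 4 | 100 | 100 | 100 | 98 | 0 | 79.7% |
| Ministral 3 14B | 100 | 100 | 100 | 67 | 31 | 79.4% |
| Z.AI GLM 4.5 | 100 | 100 | 77 | 53 | 51 | 76.1% |
| Gemma 3 4B | 100 | 100 | 100 | 80 | 0 | 76.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 39 | 36 | 75.0% |
| Claude 3 Haiku | 100 | 100 | 100 | 36 | 33 | 73.9% |
| Hermes 3 405B | 100 | 100 | 100 | 67 | 0 | 73.3% |
| Gemma 3 27B | 100 | 95 | 95 | 48 | 18 | 71.2% |
| LFM2 24B | 100 | 82 | 71 | 60 | 33 | 69.4% |
| GPT-4o Mini (temp=1) | 100 | 93 | 93 | 57 | 0 | 68.7% |
| Hermes 3 70B | 100 | 100 | 89 | 46 | 0 | 67.0% |
| MiniMax M2.5 | 100 | 89 | 73 | 39 | 0 | 60.2% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 50 | 33 | 0 | 56.7% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 75 | 0 | 0 | 55.0% |
| Claude Haiku 4.5 | 100 | 71 | 40 | 3 | 0 | 42.8% |
| GPT-4.1 Nano | 40 | 24 | 0 | 0 | 0 | 12.7% |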
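
| Model | #1 | #2 | #3 | #4 | #5 | Avg |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 92 | 98.4% |
| Rocinante 12B | 100 | 100 | 100 | 100 | 89 | 97.8% |
| Ministral 3 3B | 100 | 100 | 100 | 100 | 82 | 96.5% |
| LFM2 24B | 100 | 100 | 100 | 100 | 82 | 96.5% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 80 | 96.0% |
| Mistral NeMO | 100 | 100 | 100 | 100 | 80 | 96.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 79 | 95.8% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 79 | 95.8% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 78 | 95.5% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 75 | 95.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 75 | 95.0% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 97 | 77 | 94.6% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 71 | 94.2% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 70 | 93.9% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 64 | 92.9% |
| Grok 4.20 | 100 | 100 | 100 | 100 | 64 | 92.7% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 62 | 92.4% |
| MiniMax M2.5 | 100 | 100 | 100 | 85 | 73 | 91.5% |
| GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 57 | 91.4% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 75 | 75 | 90.0% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 50 | 90.0% |
| Grok 4 Fast | 100 | 100 | 100 | 79 | 71 | 90.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 33 | 86.7% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 33 | 86.7% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 33 | 86.7% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 31 | 86.1% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 78 | 52 | 85.9% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 29 | 85.7% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 25 | 84.9% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 86 | 33 | 83.8% |
| Mistral Large | 100 | 100 | 100 | 100 | 18 | 83.6% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 18 | 83.6% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 89 | 29 | 83.5% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 89 | 57 | 57 | 80.6% |
| Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.1% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Claude 3 Haiku | 100 | 100 | 100 | 100 | 0 | 80.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 95 | 0 | 78.9% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 46 | 37 | 76.6% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 40 | 0 | 68.0% |
| Hermes 3 405B | 100 | 100 | 71 | 57 | 0 | 65.6% |
| GPT-4.1 Mini | 100 | 79 | 72 | 57 | 14 | 64.4% |
| GPT-4.1 Nano | 71 | 67 | 0 | 0 | 0 | 27.5% |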
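
| Model | #1 | #2 | #3 | #4 | #5 | Avg |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 99 | 99.7% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 97 | 99.5% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 97 | 99.5% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 95 | 98.9% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 95 | 98.9% |
| Grok 4 | 100 | 100 | 100 | 100 | 95 | 98.9% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 95 | 98.9% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 95 | 98.9% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 95 | 98.9% |
| Grok 4.20 | 100 | 100 | 100 | 100 | 93 | 98.6% |
| Ministral 3 14B | 100 | 100 | 100 | 100 | 89 | 97.8% |
| WizardLM 2 8x22b | 100 | 100 | 98 | 95 | 95 | 97.5% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 87 | 97.4% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 86 | 97.3% |
| LFM2 24B | 100 | 100 | 100 | 100 | 84 | 96.8% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 79 | 95.8% |
| o4 Mini High | 100 | 100 | 100 | 100 | 78 | 95.5% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 98 | 79 | 95.4% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 77 | 95.4% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 75 | 95.0% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 75 | 95.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 75 | 95.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 75 | 95.0% |
| Llama 3.1 8B | 100 | 100 | 100 | 100 | 75 | 95.0% |
| Mistral Large 2 | 100 | 100 | 100 | 86 | 82 | 93.6% |
| GPT-5 Mini | 100 | 100 | 100 | 98 | 69 | 93.4% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 65 | 93.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 60 | 92.1% |
| Claude Haiku 4.5 | 100 | 100 | 97 | 84 | 78 | 91.8% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 95 | 63 | 91.6% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 52 | 90.4% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 97 | 55 | 90.2% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 43 | 88.6% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 42 | 88.4% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 42 | 88.4% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 33 | 86.7% |
| MiniMax M2.7 | 100 | 100 | 100 | 89 | 44 | 86.6% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 26 | 85.2% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 89 | 18 | 81.4% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 97 | 85 | 24 | 81.1% |
| GPT-4.1 Mini | 100 | 100 | 97 | 55 | 50 | 80.4% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 89 | 13 | 80.3% |
| Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 0 | 80.0% |
| GPT-4o Mini (temp=1) | 100 | 95 | 89 | 85 | 28 | 79.2% |
| Gemma 3 4B | 100 | 100 | 97 | 57 | 33 | 77.6% |
| Rocinante 12B | 100 | 100 | 100 | 82 | 0 | 76.5% |
| GPT-5 Nano | 100 | 100 | 93 | 68 | 15 | 75.2% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 67 | 57 | 52 | 75.1% |
| Claude 3 Haiku | 100 | 100 | 100 | 54 | 18 | 74.4% |
| Mistral Small 4 | 100 | 100 | 100 | 71 | 0 | 74.2% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 43 | 26 | 73.8% |
| Hermes 3 70B | 100 | 100 | 100 | 13 | 0 | 62.5% |
| Gemma 3 27B | 100 | 100 | 57 | 46 | 5 | 61.6% |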
Gemini 2.5 Flash Lite (Reasoning)100100700053.9%
DeepSeek-V2 Chat10079628049.8%
Llama 3.1 70B10075670048.3%
Llama 3.1 Nemotron 70B100753318045.3%
Arcee AI: Trinity Large (Preview)10010000040.0%
GPT-4.1 Nano6218130018.6%

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Qwen3.6 Max Preview100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
Qwen 3.6 Flash100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
GPT-5.2100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100.0%
Claude Opus 4.7100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
MiniMax M2.7100100100100100100.0%
GPT-5.5100100100100100100.0%
Qwen 3.6 35B100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
Gemini 2.5 Pro100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
Gemma 4 31B100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100.0%
GPT-OSS 120B100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Gemma 4 26B100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
GPT-5.4100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
Z.AI GLM 4.5 Air100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
Qwen 3 32B100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
Z.AI GLM 4.71001001001009799.5%
DeepSeek V3 (2024-12-26)1001001001009598.9%
Gemma 3 27B1001001001009598.9%
WizardLM 2 8x22b1001001001009598.9%
Z.AI GLM 51001001001009298.4%
Qwen 3.5 Plus (2026-04-20)1001001001008997.8%
o4 Mini1001001001008997.8%
Claude Opus 41001001001008997.8%
Mistral Medium 3.11001001001008296.5%
Gemini 3 Flash (Preview)1001001001007595.0%
Grok 4.31001001001007595.0%
DeepSeek V3 (2025-03-24)1001001001007194.2%
MiniMax M2.5100100100927593.4%
o4 Mini High1001001001006793.3%
Nemotron 3 Super100100100897592.8%
Claude Sonnet 4.51001001001006292.4%
Grok 4.20100100100926791.7%
Grok 4 Fast1001001001005791.4%
DeepSeek V4 Flash1001001001005791.4%
MoonshotAI: Kimi K2.51001001001005290.4%
Grok 4.20 (Reasoning)1001001001003386.7%
Z.AI GLM 4.61001001001003386.7%
Llama 3.1 70B1001001001003386.7%
Claude Sonnet 4100100100923385.0%
Hermes 3 70B1001001001001883.6%
Gemini 2.5 Flash1001001001001081.9%
DeepSeek V3.1100100100574680.7%
DeepSeek-V2 Chat100100100891480.6%
Qwen 3.6 27B100100100100080.0%
GPT-4.1100100100100080.0%
Grok 4100100100100080.0%
ByteDance Seed 2.0 Mini100100100100080.0%
Stealth: Healer Alpha100100100100080.0%
Claude Haiku 4.5100100100100080.0%
Xiaomi MIMO v2.5100100100100080.0%
ByteDance Seed 2.0 Lite100100100100080.0%
Hermes 3 405B100100100100080.0%
DeepSeek V4 Pro100100100100080.0%
GPT-5 Nano100100100100080.0%
DeepSeek V3.2100100100100080.0%
Arcee AI: Trinity Mini100100100100080.0%
Gemma 3 4B100100100100080.0%
Ministral 8B100100100100080.0%
Qwen 3.5 Plus (2026-02-15)10010010079075.8%
GPT-4.1 Mini100100100571875.1%
Gemma 3 12B100100100521874.0%
Mistral Small 410010010067073.3%
Grok 4.20 (Beta, Reasoning)100100100521372.9%
LFM2 24B10010010057071.4%
Claude Sonnet 4.6 (Reasoning)1001008267069.8%
GPT-4.1 Nano10010010033066.7%
Rocinante 12B10010010033066.7%
Claude 3.7 Sonnet1001001008061.5%
GPT-4o, May 13th (temp=1)1001001000060.0%
Inception Mercury1001001000060.0%
Aion 2.0100100950058.9%
Claude 3.5 Sonnet100100820056.5%
GPT-4o, Aug. 6th (temp=1)100100180043.6%

genre

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Qwen3.6 Max Preview100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100.0%
Qwen 3.6 27B100100100100100100.0%
Aion 2.0100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
GPT-5.5100100100100100100.0%
Qwen 3.6 35B100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100.0%
o4 Mini100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Gemma 4 31B100100100100100100.0%
GPT-OSS 120B100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
Stealth: Healer Alpha100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Gemma 4 26B100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
Z.AI GLM 4.5 Air100100100100100100.0%
DeepSeek V4 Pro100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Qwen 3 32B100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
ByteDance Seed 1.6 Flash1001001001009799.3%
GPT-4o, Aug. 6th (temp=0)1001001001009598.9%
Claude 3 Haiku1001001001009598.9%
Ministral 8B1001001001009598.9%
MoonshotAI: Kimi K2.61001001001008997.8%
o4 Mini High1001001001008997.8%
Grok 4.1 Fast1001001001008997.8%
GPT-4.11001001001008997.8%
Grok 41001001001008997.8%
ByteDance Seed 2.0 Lite1001001001008997.8%
Grok 4 Fast1001001001008697.1%
Claude Sonnet 41001001001008296.5%
Mistral Large 31001001001008296.5%
GPT-5.21001001001007795.4%
Grok 4.20 (Reasoning)100100100977995.2%
ByteDance Seed 1.61001001001007595.0%
Qwen 3.6 Flash1001001001007595.0%
Gemini 2.5 Pro1001001001007595.0%
GPT-4o, May 13th (temp=1)1001001001007595.0%
WizardLM 2 8x22b1001001001007595.0%
Claude 3.7 Sonnet1001001001006793.3%
Mistral Small 4 (Reasoning)1001001001006793.3%
DeepSeek V3.2100100100897592.8%
Gemini 2.5 Flash (Reasoning)1001001001005791.4%
DeepSeek V3.11001001001005791.4%
DeepSeek V4 Flash1001001001005791.4%
Ministral 3 14B1001001001005791.4%
GPT-5.4 (Reasoning, Low)100100100757590.0%
Z.AI GLM 51001001001005090.0%
Xiaomi MIMO v2.5100100100757189.2%
GPT-4.1 Mini100100100825787.9%
Gemini 3 Pro (Preview)100100100895087.8%
Mistral Medium 3.11001001001003887.6%
Claude Opus 4.51001001001003386.7%
Writer: Palmyra X51001001001003386.7%
Gemma 3 12B1001001001003386.7%
Claude Haiku 4.510010089825785.7%
DeepSeek V3 (2025-03-24)100100100953385.6%
GPT-5.4 (Reasoning)100100100705785.3%
Claude Opus 4100100100854084.9%
GPT-5.4 Mini (Reasoning)100100100823884.0%
Claude Opus 4.7 (Reasoning)1001001001001883.6%
GPT-5.4 Mini (Reasoning, Low)1001001001001883.6%
Qwen 3.5 9B100100100754283.4%
LFM2 24B10010075706782.2%
Claude Opus 4.610010078706482.2%
Z.AI GLM 5 Turbo100100100891881.4%
DeepSeek V3 (2024-12-26)100100100703380.6%
Qwen 3.5 122B100100100100080.0%
GPT-4o, May 13th (temp=0)100100100100080.0%
Nemotron 3 Super100100100100080.0%
Gemini 2.5 Flash100100100100080.0%
Inception Mercury100100100100080.0%
Llama 3.1 Nemotron 70B100100100100080.0%
Rocinante 12B100100100100080.0%
Grok 4.20100100100672979.0%
MiniMax M2.710010010095078.9%
Gemini 2.5 Flash Lite (Reasoning)10010010095078.9%
Grok 4.310010010095078.9%
GPT-4o, Aug. 6th (temp=1)100100100573378.1%
Z.AI GLM 4.510010095602676.3%
GPT-51001009780075.5%
MiniMax M2.51009582673375.4%
Arcee AI: Trinity Mini100100100571875.1%
Gemini 3.1 Pro (Preview)10010010067073.3%
Mistral Small 41008986642672.9%
Claude Sonnet 4.610010010057071.4%
GPT-4o Mini (temp=1)100100100421370.9%
Grok 4.3 (Reasoning)1001008267069.8%
GPT-5.11001006755064.4%
Claude Opus 4.6 (Reasoning)1001007150064.2%
Z.AI GLM 4.7100956457063.1%
Hermes 3 405B1001007924060.5%
Grok 4.20 (Beta, Reasoning)1001001000060.0%
Llama 3.1 70B1001001000060.0%
Hermes 3 70B100100890057.8%
Arcee AI: Trinity Large (Preview)100100808057.5%
GPT-5 Nano100894239053.9%
GPT-4.1 Nano757575181852.3%
Claude Opus 4.71001002618048.9%
Claude Sonnet 4.6 (Reasoning)10089130040.3%

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Qwen3.6 Max Preview100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
Qwen 3.6 Flash100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Aion 2.0100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
MiniMax M2.7100100100100100100.0%
Qwen 3.6 35B100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
GPT-4.1100100100100100100.0%
Grok 4100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
Gemma 4 31B100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
Stealth: Healer Alpha100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Gemma 4 26B100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Xiaomi MIMO v2.5100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
Inception Mercury 2100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
Hermes 3 405B100100100100100100.0%
DeepSeek V4 Pro100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
DeepSeek V3.2100100100100100100.0%
DeepSeek V4 Flash100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Arcee AI: Trinity Mini100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Ministral 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
LFM2 24B100100100100100