Dialogue tag variety (said vs. fancy)

Test: Bad Writing Habits

Avg. Score: 66.7%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Claude Sonnet 4.6 | 99.4% | $0.031 | 39.3s | 95%
2 | Claude Sonnet 4.6 (Reasoning) | 99.4% | $0.060 | 1.2m | 93%
3 | Z.AI GLM 5 Turbo | 93.9% | $0.0081 | 33.2s | 56%
4 | DeepSeek V4 Flash | 90.5% | $0.0006 | 31.6s | 54%
5 | MiniMax M2.5 | 92.4% | $0.0034 | 1.3m | 60%
6 | MiniMax M2.7 | 89.5% | $0.0040 | 1.1m | 54%
7 | Z.AI GLM 5.1 | 94.1% | $0.014 | 1.5m | 54%
8 | DeepSeek V4 Flash (Reasoning) | 86.9% | $0.0007 | 31.1s | 41%
9 | Qwen 3.6 35B | 89.5% | $0.0083 | 1.0m | 45%
10 | Mistral Large | 85.7% | $0.014 | 30.9s | 42%
11 | Qwen 3.5 397B A17B | 95.7% | $0.014 | 3.0m | 66%
12 | Mistral Large 2 | 86.8% | $0.013 | 29.4s | 39%
13 | DeepSeek V4 Pro | 86.3% | $0.0048 | 1.3m | 46%
14 | Ministral 8B | 80.0% | $0.0004 | 10.4s | 35%
15 | Z.AI GLM 5 | 88.4% | $0.0084 | 1.2m | 42%
16 | Claude Sonnet 4.5 | 88.4% | $0.035 | 38.1s | 45%
17 | Mistral Large 3 | 82.1% | $0.0033 | 30.3s | 35%
18 | GPT-5 Mini | 81.8% | $0.0100 | 57.4s | 44%
19 | Qwen 3.5 9B | 85.3% | $0.0011 | 1.4m | 40%
20 | Mistral Small Creative | 78.8% | $0.0007 | 9.1s | 29%
21 | Qwen 3.5 35B | 86.5% | $0.018 | 1.0m | 41%
22 | Ministral 3 3B | 79.1% | $0.0005 | 11.1s | 29%
23 | Claude Opus 4.5 | 91.0% | $0.070 | 53.4s | 56%
24 | Ministral 3B | 77.5% | $0.0001 | 8.1s | 28%
25 | Qwen 3.5 Flash | 82.4% | $0.0025 | 47.5s | 30%
26 | Qwen3.6 Max Preview | 97.2% | $0.050 | 3.5m | 73%
27 | ByteDance Seed 1.6 Flash | 77.0% | $0.0013 | 27.3s | 29%
28 | Claude Haiku 4.5 | 77.2% | $0.011 | 21.6s | 32%
29 | Grok 4.3 | 80.8% | $0.0069 | 30.5s | 28%
30 | Qwen 3.6 Flash | 82.6% | $0.010 | 41.4s | 30%
31 | GPT-5.4 Nano (Reasoning) | 75.3% | $0.0061 | 24.5s | 32%
32 | Mistral Small 4 (Reasoning) | 76.3% | $0.0022 | 30.2s | 30%
33 | Ministral 3 8B | 75.1% | $0.0008 | 19.6s | 27%
34 | GPT-5.4 Mini (Reasoning, Low) | 77.8% | $0.015 | 16.8s | 30%
35 | Mistral Small 4 | 74.4% | $0.0014 | 18.2s | 26%
36 | Writer: Palmyra X5 | 77.5% | $0.011 | 22.0s | 28%
37 | GPT-5.4 Mini | 76.8% | $0.015 | 16.8s | 27%
38 | GPT-5.4 Nano (Reasoning, Low) | 72.6% | $0.0055 | 20.6s | 28%
39 | Qwen 3.5 27B | 84.3% | $0.020 | 1.6m | 40%
40 | Qwen3 235B A22B Instruct 2507 | 77.6% | $0.0011 | 59.2s | 29%
41 | GPT-5.4 | 87.7% | $0.049 | 1.4m | 44%
42 | Grok 4.3 (Reasoning) | 87.6% | $0.021 | 2.3m | 44%
43 | Mistral Medium 3.1 | 75.6% | $0.0048 | 36.5s | 25%
44 | Ministral 3 14B | 69.6% | $0.0007 | 11.7s | 22%
45 | Claude Sonnet 4 | 80.1% | $0.032 | 43.7s | 31%
46 | Inception Mercury | 73.6% | $0.011 | 17.6s | 21%
47 | GPT-5.4 Nano | 69.9% | $0.0057 | 26.3s | 23%
48 | GPT-5.4 (Reasoning, Low) | 85.1% | $0.055 | 1.4m | 41%
49 | GPT-5.4 Mini (Reasoning) | 75.3% | $0.022 | 28.1s | 24%
50 | Gemini 3.1 Flash Lite (Reasoning) | 70.9% | $0.0030 | 11.9s | 16%
51 | Qwen 3.5 122B | 80.3% | $0.025 | 1.1m | 28%
52 | ByteDance Seed 1.6 | 85.2% | $0.013 | 2.5m | 34%
53 | GPT-5 | 90.2% | $0.065 | 2.8m | 55%
54 | Stealth: Healer Alpha | 67.2% | $0.0000 | 23.7s | 16%
55 | DeepSeek V3 (2025-03-24) | 68.8% | $0.0014 | 39.4s | 17%
56 | Claude Opus 4.6 (Reasoning) | 88.5% | $0.088 | 1.4m | 43%
57 | Gemini 3.1 Flash Lite | 66.7% | $0.0030 | 12.1s | 12%
58 | Stealth: Hunter Alpha | 68.6% | $0.0000 | 55.0s | 19%
59 | Claude Opus 4.7 | 81.3% | $0.069 | 30.4s | 29%
60 | Claude Opus 4.6 | 86.0% | $0.078 | 1.2m | 37%
61 | Mistral NeMO | 62.9% | $0.0005 | 10.1s | 14%
62 | Aion 2.0 | 70.8% | $0.0064 | 1.3m | 24%
63 | Xiaomi MIMO v2.5 | 66.9% | $0.0054 | 31.8s | 17%
64 | Grok 4.1 Fast | 66.1% | $0.0018 | 37.8s | 16%
65 | Gemini 2.5 Pro | 73.3% | $0.036 | 36.2s | 23%
66 | Llama 3.1 70B | 64.3% | $0.0015 | 29.4s | 11%
67 | Gemini 3.1 Flash Lite (Preview) | 61.5% | $0.0030 | 8.4s | 9%
68 | ByteDance Seed 2.0 Lite | 77.4% | $0.012 | 2.2m | 24%
69 | Claude Opus 4.7 (Reasoning) | 78.8% | $0.076 | 32.0s | 26%
70 | Qwen 3 32B | 61.9% | $0.0015 | 54.6s | 18%
71 | DeepSeek V4 Pro (Reasoning) | 80.4% | $0.015 | 3.1m | 34%
72 | Xiaomi MIMO v2.5 Pro | 66.0% | $0.0085 | 53.5s | 15%
73 | Z.AI GLM 4.6 | 65.4% | $0.0065 | 51.5s | 14%
74 | GPT-5.5 | 89.8% | $0.139 | 1.7m | 58%
75 | DeepSeek V3 (2024-12-26) | 63.6% | $0.0021 | 54.6s | 14%
76 | GPT-5.4 (Reasoning) | 87.7% | $0.089 | 2.6m | 47%
77 | Claude 3.5 Sonnet | 70.3% | $0.048 | 35.4s | 19%
78 | GPT-5.5 (Reasoning, Low) | 89.8% | $0.139 | 1.8m | 55%
79 | Grok 4.20 (Beta, Reasoning) | 67.6% | $0.039 | 34.0s | 17%
80 | Gemini 2.5 Flash | 56.9% | $0.0052 | 10.6s | 9%
81 | GPT-4.1 | 63.7% | $0.018 | 44.7s | 15%
82 | LFM2 24B | 57.5% | $0.0002 | 28.4s | 10%
83 | Gemma 4 26B (Reasoning) | 72.3% | $0.0013 | 2.0m | 14%
84 | o4 Mini | 59.3% | $0.015 | 25.7s | 11%
85 | Qwen 3.5 Plus (2026-04-20) | 72.7% | $0.017 | 1.8m | 16%
86 | Qwen 2.5 72B | 56.5% | $0.0010 | 36.7s | 10%
87 | Z.AI GLM 4.5 | 59.4% | $0.0051 | 42.1s | 10%
88 | WizardLM 2 8x22b | 66.5% | $0.0026 | 1.8m | 16%
89 | Grok 4.20 (Reasoning) | 67.7% | $0.018 | 1.5m | 16%
90 | Arcee AI: Trinity Large (Preview) | 56.3% | $0.0000 | 43.6s | 9%
91 | Gemma 4 26B | 59.6% | $0.0009 | 55.1s | 8%
92 | MoonshotAI: Kimi K2.5 | 77.1% | $0.019 | 3.2m | 29%
93 | GPT-4.1 Nano | 51.8% | $0.0007 | 13.3s | 7%
94 | Grok 4.20 | 56.9% | $0.0093 | 45.7s | 12%
95 | Grok 4 Fast | 52.9% | $0.0017 | 24.1s | 8%
96 | Gemma 3 27B | 56.7% | $0.0006 | 52.6s | 9%
97 | Arcee AI: Trinity Mini | 50.0% | $0.0003 | 9.2s | 5%
98 | DeepSeek V3.2 | 64.4% | $0.0014 | 1.9m | 14%
99 | GPT-4o, Aug. 6th (temp=0) | 54.9% | $0.023 | 22.7s | 13%
100 | Gemma 4 31B (Reasoning) | 66.9% | $0.0014 | 2.2m | 14%
101 | DeepSeek-V2 Chat | 55.7% | $0.0021 | 53.3s | 8%
102 | o4 Mini High | 61.2% | $0.025 | 47.2s | 10%
103 | GPT-4o, May 13th (temp=0) | 57.7% | $0.035 | 14.1s | 10%
104 | GPT-5.1 | 72.0% | $0.054 | 1.8m | 23%
105 | DeepSeek V3.1 | 60.6% | $0.0020 | 1.8m | 13%
106 | GPT-5.5 (Reasoning) | 87.2% | $0.142 | 1.8m | 42%
107 | GPT-4.1 Mini | 43.7% | $0.0027 | 19.0s | 8%
108 | Grok 4.20 (Beta) | 48.1% | $0.018 | 15.8s | 7%
109 | Nemotron 3 Super | 52.7% | $0.0000 | 1.4m | 8%
110 | Claude 3 Haiku | 40.3% | $0.0025 | 14.9s | 6%
111 | Hermes 3 405B | 48.9% | $0.0032 | 53.2s | 5%
112 | Gemini 2.5 Flash Lite | 40.0% | $0.0009 | 9.5s | 4%
113 | Claude 3.7 Sonnet | 56.4% | $0.042 | 46.7s | 12%
114 | Gemini 3.1 Pro (Preview) | 79.9% | $0.107 | 1.8m | 28%
115 | Z.AI GLM 4.7 Flash | 49.2% | $0.0017 | 1.2m | 8%
116 | Rocinante 12B | 46.1% | $0.0014 | 38.4s | 3%
117 | Z.AI GLM 4.5 Air | 46.6% | $0.0029 | 58.2s | 8%
118 | Inception Mercury 2 | 38.2% | $0.0032 | 7.0s | 4%
119 | Gemini 3 Flash (Preview, Reasoning) | 45.3% | $0.012 | 30.1s | 5%
120 | Gemini 2.5 Flash Lite (Reasoning) | 43.3% | $0.0028 | 30.8s | 3%
121 | Z.AI GLM 4.7 | 51.7% | $0.010 | 1.4m | 8%
122 | GPT-5 Nano | 47.2% | $0.0042 | 1.4m | 9%
123 | GPT-5.2 | 61.4% | $0.056 | 1.5m | 17%
124 | Gemini 2.5 Flash (Reasoning) | 41.7% | $0.011 | 21.5s | 2%
125 | ByteDance Seed 2.0 Mini | 77.8% | $0.0045 | 4.9m | 23%
126 | Stealth: Aurora Alpha | 35.7% | $0.0000 | 9.8s | 0%
127 | Llama 3.1 8B | 45.2% | $0.0003 | 1.3m | 5%
128 | Gemini 3 Flash (Preview) | 37.4% | $0.0078 | 19.6s | 3%
129 | Qwen 3.6 27B | 58.5% | $0.025 | 2.3m | 8%
130 | Llama 3.1 Nemotron 70B | 33.9% | $0.0038 | 31.7s | 1%
131 | GPT-4o Mini (temp=0) | 34.6% | $0.0012 | 34.8s | 0%
132 | Gemma 4 31B | 44.1% | $0.0010 | 1.6m | 3%
133 | Gemma 3 12B | 31.2% | $0.0004 | 41.3s | 3%
134 | Cohere Command R+ (Aug. 2024) | 38.6% | $0.020 | 52.5s | 3%
135 | Qwen 3.5 Plus (2026-02-15) | 30.5% | $0.0060 | 31.5s | 0%
136 | Grok 4 | 52.5% | $0.048 | 1.7m | 9%
137 | MoonshotAI: Kimi K2.6 | 87.6% | $0.058 | 6.5m | 42%
138 | Nemotron 3 Nano | 30.7% | $0.0010 | 1.1m | 2%
139 | Claude Opus 4 | 85.3% | $0.209 | 1.4m | 36%
140 | Hermes 3 70B | 30.8% | $0.0010 | 1.2m | 0%
141 | Gemma 3 4B | 19.7% | $0.0002 | 20.0s | 0%
142 | GPT-4o Mini (temp=1) | 20.1% | $0.0012 | 34.8s | 0%
143 | GPT-4o, May 13th (temp=1) | 25.2% | $0.033 | 14.4s | 0%
144 | Mistral Small 3.2 24B | 68.9% | $0.0069 | 5.7m | 15%
145 | GPT-4o, Aug. 6th (temp=1) | 19.5% | $0.018 | 24.4s | 0%
146 | Gemini 3 Pro (Preview) | 31.2% | $0.055 | 54.4s | 4%
147 | GPT-OSS 120B | 19.7% | $0.0015 | 1.8m | 0%
Average score: 66.74%

Individual Scenarios

Detailed Writing Rules

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
GPT-5.5 (Reasoning, Low)1001001001009298.3%
MiniMax M2.71001001001009198.2%
ByteDance Seed 1.6100100100978897.0%
Claude Opus 4.6 (Reasoning)100100100919196.5%
GPT-5.4 (Reasoning, Low)100100100968496.0%
Claude Opus 41001001001007995.9%
Claude Opus 4.7 (Reasoning)100100100918394.9%
Claude Sonnet 4.5100100100947994.6%
GPT-5.4 Mini (Reasoning)100100100966993.0%
GPT-5.4 Mini (Reasoning, Low)100100100966492.0%
GPT-5.5 (Reasoning)1001001001005991.8%
GPT-5.4100100100965890.8%
ByteDance Seed 2.0 Lite100100100836790.0%
Qwen 3.5 35B1001001001004789.5%
GPT-510010086827789.0%
Qwen 3.5 397B A17B1001001001004288.4%
Qwen 3.6 35B1001001001003787.4%
DeepSeek V4 Flash (Reasoning)10010093835987.2%
GPT-5.510010085756484.8%
DeepSeek V4 Pro100100100675384.0%
Gemini 3.1 Pro (Preview)10010083825083.0%
Claude Opus 4.71001001001001082.1%
Gemma 4 26B (Reasoning)100100100100080.0%
MiniMax M2.5100100100791879.4%
Claude Opus 4.6100100100801579.0%
Z.AI GLM 5100100100504378.6%
ByteDance Seed 2.0 Mini10010010091078.2%
Mistral Large 210010010079075.7%
GPT-5.4 Mini10010070623673.6%
DeepSeek V4 Pro (Reasoning)10010088472070.7%
Ministral 8B1001007973070.3%
DeepSeek V4 Flash1001009059069.8%
Writer: Palmyra X594938864067.9%
Mistral Medium 3.11009995211265.4%
Qwen 3.5 9B1009667352865.1%
Qwen3 235B A22B Instruct 2507100998825062.2%
GPT-5.4 Nano89777667061.6%
Qwen 3.5 27B1009362371260.9%
Qwen3.6 Max Preview1001007032060.5%
Qwen 3.5 122B100100984060.3%
GPT-4o, Aug. 6th (temp=0)100100947060.3%
Gemma 4 31B (Reasoning)1001001000060.0%
Claude 3.5 Sonnet91796759059.1%
Ministral 3 14B767059454558.9%
Mistral Large 3917867421157.7%
Qwen 3.5 Flash100100840056.8%
GPT-5.11001004737056.7%
GPT-5.2817168312955.9%
ByteDance Seed 1.6 Flash1007367211054.1%
Qwen 2.5 72B716353503354.1%
Qwen 3 32B10094760054.1%
Claude Sonnet 4100100700054.0%
Xiaomi MIMO v2.5 Pro100100700054.0%
Ministral 3 8B76696755053.3%
MoonshotAI: Kimi K2.5100100590051.8%
Rocinante 12B100100590051.8%
GPT-4.19383810051.4%
Mistral Small 497895021051.4%
GPT-5 Mini85846817051.0%
Mistral Small 3.2 24B10077690049.3%
GPT-5.4 Nano (Reasoning)1005041351447.9%
Grok 4.1 Fast100100390047.8%
Qwen 3.6 27B10093390046.3%
Mistral NeMO100673025044.3%
Grok 4.3 (Reasoning)67505050043.3%
Grok 4.20 (Reasoning)100100170043.3%
Qwen 3.5 Plus (2026-04-20)10010000040.0%
Qwen 3.6 Flash10010000040.0%
o4 Mini64503932738.5%
Gemini 2.5 Flash8369320036.9%
Mistral Small 4 (Reasoning)7362470036.5%
GPT-5.4 Nano (Reasoning, Low)1007720035.9%
WizardLM 2 8x22b7667310034.6%
Mistral Large947900034.6%
Ministral 3 3B887900033.2%
Claude 3 Haiku50393932032.0%
Grok 463353021029.7%
Ministral 3B835900028.5%
Aion 2.0944700028.4%
Hermes 3 70B5353250026.3%
Llama 3.1 Nemotron 70B913900026.0%
Stealth: Healer Alpha1002800025.5%
Inception Mercury1002700025.4%
Grok 4.31002500025.0%
GPT-4o, May 13th (temp=0)6339190024.2%
Nemotron 3 Super5932250023.3%
Gemma 4 26B793600022.9%
GPT-4o, May 13th (temp=1)882020021.8%
Grok 4.20 (Beta)6128190021.6%
Claude Haiku 4.54747120021.2%
Grok 4 Fast732570021.0%
Gemini 3 Flash (Preview, Reasoning)100000020.0%
o4 Mini High100000020.0%
Gemma 4 31B100000020.0%
DeepSeek-V2 Chat100000020.0%
Claude 3.7 Sonnet100000020.0%
DeepSeek V3 (2024-12-26)791700019.0%
DeepSeek V3 (2025-03-24)672500018.3%
DeepSeek V3.1592570018.2%
Arcee AI: Trinity Large (Preview)91000018.2%
Mistral Small Creative82000016.4%
Grok 4.202825182014.8%
GPT-4o, Aug. 6th (temp=1)73000014.6%
Z.AI GLM 4.7 Flash67000013.3%
Z.AI GLM 4.5 Air61400013.0%
Hermes 3 405B353000012.9%
DeepSeek V3.264000012.9%
Gemini 2.5 Pro63000012.6%
Gemini 3 Flash (Preview)59300012.4%
GPT-5 Nano52000010.3%
GPT-4.1 Mini43700010.0%
Llama 3.1 70B50000010.0%
Llama 3.1 8B50000010.0%
Z.AI GLM 4.728174009.7%
Stealth: Hunter Alpha4700009.5%
Z.AI GLM 4.53170007.6%
Z.AI GLM 4.62500005.0%
LFM2 24B2500005.0%
Grok 4.20 (Beta, Reasoning)1770004.8%
Gemini 3.1 Flash Lite (Preview)1700003.3%
Cohere Command R+ (Aug. 2024)1700003.3%
Gemini 3 Pro (Preview)700001.4%
Gemini 2.5 Flash Lite (Reasoning)700001.4%
Gemma 3 12B200000.4%
Gemini 2.5 Flash (Reasoning)000000.0%
GPT-OSS 120B000000.0%
Gemini 3.1 Flash Lite (Reasoning)000000.0%
Qwen 3.5 Plus (2026-02-15)000000.0%
Gemini 3.1 Flash Lite000000.0%
Xiaomi MIMO v2.5000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Gemini 2.5 Flash Lite000000.0%
GPT-4o Mini (temp=1)000000.0%
GPT-4o Mini (temp=0)000000.0%
Gemma 3 27B000000.0%
Nemotron 3 Nano000000.0%
GPT-4.1 Nano000000.0%
Arcee AI: Trinity Mini000000.0%
Gemma 3 4B000000.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Qwen3.6 Max Preview100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
GPT-5.5100100100100100100.0%
Gemini 2.5 Pro100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Ministral 3 3B100100100100100100.0%
GPT-5.5 (Reasoning, Low)1001001001009799.5%
Qwen 3.5 9B1001001001009799.5%
Z.AI GLM 5.11001001001008897.5%
GPT-51001001001008396.7%
Qwen 3.5 397B A17B1001001001008396.7%
Claude Sonnet 41001001001007695.2%
GPT-5.5 (Reasoning)1001001001005991.8%
Claude Opus 4.510010097886790.3%
GPT-5.41009794945989.1%
Claude Sonnet 4.51001001001003987.8%
MoonshotAI: Kimi K2.61001001001002585.0%
GPT-5.4 Nano (Reasoning)100100100675083.3%
GPT-5 Mini100100100892382.5%
Qwen 3.5 Flash100100100991282.2%
DeepSeek V4 Pro100100100100781.4%
DeepSeek V4 Flash100100100100781.4%
Stealth: Hunter Alpha100100100891781.2%
Z.AI GLM 5 Turbo100100100100080.0%
Qwen 3.5 Plus (2026-04-20)100100100100080.0%
Z.AI GLM 4.6100100100100080.0%
Mistral Large 3100100100100080.0%
Grok 4.3100100100100080.0%
Llama 3.1 70B100100100100080.0%
Arcee AI: Trinity Mini100100100100080.0%
Qwen 3.6 Flash100100100732579.6%
Claude Opus 41001009994078.8%
GPT-5.4 (Reasoning)1007973676777.0%
GPT-5.4 (Reasoning, Low)1001009779075.2%
MiniMax M2.710010010067774.8%
MiniMax M2.510010010071074.2%
DeepSeek V4 Flash (Reasoning)10010010067073.3%
Qwen 3.6 35B10010010050070.0%
Writer: Palmyra X510010010050070.0%
Qwen 3.6 27B10010010039067.8%
Qwen 3 32B908973572566.9%
Ministral 3B10010010025065.0%
Qwen 3.5 35B10010010012062.4%
DeepSeek V3.11001001007061.4%
GPT-5.110010050391761.1%
Gemini 3.1 Pro (Preview)1001001000060.0%
Z.AI GLM 51001001000060.0%
MoonshotAI: Kimi K2.51001001000060.0%
ByteDance Seed 1.61001001000060.0%
Gemini 2.5 Flash (Reasoning)1001001000060.0%
Gemini 2.5 Flash Lite (Reasoning)1001001000060.0%
Hermes 3 405B1001001000060.0%
Inception Mercury1001001000060.0%
Mistral Small Creative1001001000060.0%
Gemma 4 31B (Reasoning)100100837058.1%
GPT-5.4 Nano (Reasoning, Low)100100880057.5%
GPT-5.4 Mini (Reasoning, Low)969139391756.3%
GPT-5.294797925055.3%
Claude Opus 4.7 (Reasoning)100100590051.8%
Gemma 4 26B (Reasoning)100100500050.0%
Claude Opus 4.71005039252547.8%
Grok 4.1 Fast100100390047.8%
o4 Mini100100250045.0%
ByteDance Seed 2.0 Lite100100250045.0%
Z.AI GLM 4.7 Flash88735014044.8%
DeepSeek V4 Pro (Reasoning)10085390044.7%
Rocinante 12B10085307044.5%
GPT-5.4 Nano10079327043.6%
Mistral Small 4 (Reasoning)10059500041.8%
Ministral 3 14B10059500041.8%
GPT-5.4 Mini (Reasoning)1009477041.7%
Arcee AI: Trinity Large (Preview)10073323041.7%
DeepSeek V3 (2025-03-24)10010070041.4%
Claude Opus 4.610010000040.0%
Grok 4 Fast10010000040.0%
Gemma 4 26B10010000040.0%
Mistral Small 3.2 24B10010000040.0%
Mistral NeMO10010000040.0%
LFM2 24B10010000040.0%
Qwen 3.5 122B1009400038.9%
Z.AI GLM 4.51009100038.2%
Claude 3.5 Sonnet10045390036.7%
Hermes 3 70B1006770034.8%
DeepSeek V3.21006300032.6%
Stealth: Healer Alpha1005900031.8%
WizardLM 2 8x22b1005000030.0%
Ministral 8B10025250030.0%
GPT-4o, Aug. 6th (temp=0)1003900027.8%
Ministral 3 8B1003900027.8%
Claude 3 Haiku1003900027.8%
Mistral Small 41002570026.4%
GPT-5.4 Mini7043140025.3%
Claude Opus 4.6 (Reasoning)1002500025.0%
Grok 4.20 (Reasoning)1002500025.0%
Aion 2.01002500025.0%
GPT-4.11002500025.0%
Mistral Medium 3.11001700023.3%
Gemini 3 Flash (Preview)7620170022.4%
Grok 4.20 (Beta, Reasoning)100000020.0%
Grok 4100000020.0%
Gemma 4 31B100000020.0%
GPT-4o, May 13th (temp=0)100000020.0%
DeepSeek-V2 Chat100000020.0%
DeepSeek V3 (2024-12-26)100000020.0%
GPT-4o, Aug. 6th (temp=1)100000020.0%
ByteDance Seed 1.6 Flash100000020.0%
GPT-4.1 Nano100000020.0%
Llama 3.1 8B100000020.0%
Gemini 3.1 Flash Lite (Reasoning)593200018.3%
GPT-4.1 Mini83000016.7%
Grok 4.2059000011.8%
Claude 3.7 Sonnet56000011.2%
GPT-5 Nano3900007.8%
Gemini 3.1 Flash Lite21140006.9%
Qwen 2.5 72B17170006.7%
Gemini 3 Flash (Preview, Reasoning)2500005.0%
Gemma 3 27B2500005.0%
Gemma 3 4B2500005.0%
o4 Mini High000000.0%
Gemini 3 Pro (Preview)000000.0%
Z.AI GLM 4.7000000.0%
Xiaomi MIMO v2.5 Pro000000.0%
GPT-OSS 120B000000.0%
Qwen 3.5 Plus (2026-02-15)000000.0%
Gemini 3.1 Flash Lite (Preview)000000.0%
Xiaomi MIMO v2.5000000.0%
Nemotron 3 Super000000.0%
Grok 4.20 (Beta)000000.0%
Inception Mercury 2000000.0%
GPT-4o, May 13th (temp=1)000000.0%
Stealth: Aurora Alpha000000.0%
Z.AI GLM 4.5 Air000000.0%
Gemini 2.5 Flash Lite000000.0%
Gemini 2.5 Flash000000.0%
GPT-4o Mini (temp=1)000000.0%
Gemma 3 12B000000.0%
GPT-4o Mini (temp=0)000000.0%
Nemotron 3 Nano000000.0%
Llama 3.1 Nemotron 70B000000.0%
Cohere Command R+ (Aug. 2024)000000.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Qwen3.6 Max Preview100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100.0%
Claude Opus 4.7100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Aion 2.0100100100100100100.0%
MiniMax M2.7100100100100100100.0%
GPT-5.5100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
MiniMax M2.5100100100100100100.0%
o4 Mini100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Gemma 4 26B100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Hermes 3 405B100100100100100100.0%
DeepSeek V4 Pro100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V4 Flash100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Inception Mercury100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Ministral 8B100100100100100100.0%
MoonshotAI: Kimi K2.51001001001009999.7%
Z.AI GLM 4.61001001001009999.7%
DeepSeek-V2 Chat1001001001009899.7%
Grok 4.201001001001009899.7%
Gemma 3 27B1001001001009799.5%
Mistral Small Creative1001001001009699.3%
Grok 4.20 (Reasoning)1001001001009699.2%
Gemma 4 31B1001001001009699.2%
LFM2 24B1001001001009398.7%
GPT-4.11001001001009198.2%
Gemini 2.5 Flash (Reasoning)1001001001009198.2%
ByteDance Seed 2.0 Lite1001001001009198.2%
Mistral NeMO1001001001009198.2%
Grok 4100100100999298.2%
Gemini 3.1 Pro (Preview)1001001001009098.1%
Qwen 3.5 Flash1001001001008997.9%
Gemma 4 31B (Reasoning)1001001001008997.8%
Xiaomi MIMO v2.5100100100979097.5%
Mistral Small 3.2 24B1001001001008697.2%
Xiaomi MIMO v2.5 Pro1001001001008597.0%
DeepSeek V3.21001001001008396.7%
Gemma 3 12B100100100919096.2%
GPT-5 Mini100100100988296.0%
GPT-5.4 Nano (Reasoning)100100100948495.7%
Z.AI GLM 4.5 Air1001001001007995.7%
Llama 3.1 70B1001001001007995.7%
Ministral 3B1001001001007995.7%
Claude Haiku 4.51001001001007695.2%
Gemini 2.5 Pro100100100888895.0%
Qwen 3 32B100100100898594.8%
DeepSeek V3.110010093918994.8%
Grok 4.1 Fast1001001001007394.6%
Nemotron 3 Super1001001001007394.6%
Stealth: Hunter Alpha100100100868694.4%
Z.AI GLM 4.71001001001007094.0%
Stealth: Healer Alpha1001001001006893.6%
Claude 3.7 Sonnet1001001001006693.2%
Grok 4.31001001001004388.6%
Mistral Small 41001001001003987.8%
GPT-5 Nano100100100855487.7%
GPT-4o Mini (temp=0)1009289817587.4%
Grok 4 Fast100100100923685.6%
Qwen 3.6 35B1001001001002885.5%
Qwen 3.6 Flash100100100901480.7%
GPT-5.4 Nano (Reasoning, Low)958276727079.3%
Qwen 3.5 Plus (2026-04-20)10010010096079.2%
GPT-4.1 Nano1009191813279.1%
GPT-4o, Aug. 6th (temp=0)10010083692876.1%
Gemini 3 Flash (Preview)979592484475.4%
Arcee AI: Trinity Mini10010079762175.1%
GPT-4.1 Mini10010077474273.1%
Qwen 3.5 Plus (2026-02-15)1009185473571.8%
Gemini 3.1 Flash Lite (Preview)1001007962068.1%
Cohere Command R+ (Aug. 2024)10010010035066.9%
Gemini 3 Flash (Preview, Reasoning)1009356453064.7%
Gemini 2.5 Flash Lite (Reasoning)998561501862.6%
Gemini 3.1 Flash Lite (Reasoning)1008581211760.7%
Gemini 2.5 Flash1001001000060.0%
Llama 3.1 8B100100880057.5%
Nemotron 3 Nano86848325757.2%
GPT-4o, May 13th (temp=1)1008061271356.1%
Gemini 2.5 Flash Lite1008150221754.0%
Rocinante 12B1001004517052.3%
Gemini 3 Pro (Preview)100794023549.5%
Inception Mercury 210073670047.9%
Gemma 3 4B100554732046.7%
Llama 3.1 Nemotron 70B100882014745.6%
Gemini 3.1 Flash Lite100453632042.6%
Qwen 3.6 27B10010020040.4%
GPT-OSS 120B79312927033.1%
GPT-4o, Aug. 6th (temp=1)9950170033.1%
GPT-4o Mini (temp=1)36251715018.7%
Stealth: Aurora Alpha523900018.3%
Hermes 3 70B4700009.5%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
GPT-5.2100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100.0%
Claude Opus 4.7100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
MiniMax M2.7100100100100100100.0%
GPT-5.5100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
GPT-5.4100100100100100100.0%
DeepSeek V4 Pro100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
DeepSeek V4 Flash100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Inception Mercury100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Grok 4.1 Fast1001001001009999.7%
Claude Sonnet 41001001001009298.5%
WizardLM 2 8x22b1001001001009298.5%
ByteDance Seed 2.0 Mini1001001001009198.2%
Qwen 3.6 Flash1001001001009098.0%
Mistral Small 41001001001008897.5%
MoonshotAI: Kimi K2.51001001001008797.3%
Qwen 3.5 Plus (2026-04-20)1001001001008597.0%
Qwen3.6 Max Preview1001001001008396.7%
Grok 4.20 (Beta, Reasoning)1001001001008396.7%
Grok 4.20 (Beta)1001001001008196.2%
Mistral Large100100100898895.4%
Grok 4.31001001001007695.2%
DeepSeek V3 (2025-03-24)100100100918394.9%
Z.AI GLM 4.5100100100967594.1%
Claude Opus 41001001001007094.0%
o4 Mini10010095888393.2%
Mistral Small Creative100100100887692.7%
Mistral Medium 3.11001001001006292.4%
Mistral Small 3.2 24B1001001001005991.8%
Qwen 3.5 27B100100100946491.6%
Stealth: Hunter Alpha1001001001005390.6%
Stealth: Healer Alpha100100100797089.7%
Claude 3 Haiku100100100965089.2%
Qwen 3.5 9B1001001001004689.2%
Grok 4.20100100100776889.0%
o4 Mini High10010090797689.0%
Qwen 3.5 122B100100100836188.9%
Gemma 4 26B (Reasoning)100100100994588.7%
Rocinante 12B10010097736787.4%
DeepSeek V3.2100100100766187.4%
Gemma 4 26B1009983777787.3%
Grok 4.20 (Reasoning)1001001001003687.1%
Gemma 4 31B (Reasoning)100100100706386.6%
ByteDance Seed 1.6 Flash100100100715986.0%
GPT-5 Mini959089837286.0%
Aion 2.0100100100646285.2%
Mistral Small 4 (Reasoning)100100100893584.8%
Gemini 3.1 Pro (Preview)1001001001002384.6%
GPT-5.4 Nano (Reasoning)10010087756084.3%
Arcee AI: Trinity Large (Preview)100100100734583.6%
GPT-5.4 Nano (Reasoning, Low)1009182747083.4%
Ministral 8B1001001001001783.3%
DeepSeek-V2 Chat100100100564880.8%
DeepSeek V3 (2024-12-26)1009084794880.3%
Qwen 3.6 35B100100100100080.0%
Claude 3.5 Sonnet10010010088778.9%
Claude 3.7 Sonnet10010095692778.3%
Nemotron 3 Super10010083703477.4%
Gemma 3 27B1008885753676.7%
Grok 4 Fast1008179635976.3%
GPT-4.110010010080076.0%
Ministral 3 8B1001009173774.3%
Llama 3.1 8B888883595073.5%
Qwen 3.5 Flash1001008380072.6%
Gemini 2.5 Pro10010010055071.0%
Xiaomi MIMO v2.510010059593670.8%
Grok 4.3 (Reasoning)10010010050070.0%
Ministral 3 14B100888870069.0%
Hermes 3 70B10010010045068.9%
Mistral NeMO10010091361768.8%
Ministral 3 3B10010079322567.2%
GPT-4o, May 13th (temp=0)1001009125063.2%
Z.AI GLM 4.7 Flash1001006943062.3%
Hermes 3 405B1001001000060.0%
Gemini 2.5 Flash86757556058.5%
Z.AI GLM 4.7837947433958.0%
Arcee AI: Trinity Mini1006753452557.9%
Ministral 3B100100797057.1%
Qwen 2.5 72B1006750441555.0%
Z.AI GLM 4.5 Air100685947054.7%
LFM2 24B94916320053.6%
DeepSeek V3.1100794239051.9%
Gemini 3 Flash (Preview, Reasoning)100804225750.8%
Grok 489853932049.2%
GPT-4o, Aug. 6th (temp=0)595352392846.3%
GPT-4o Mini (temp=0)9965630045.5%
Qwen 3 32B735050252143.8%
Llama 3.1 Nemotron 70B67675525042.6%
Gemini 3.1 Flash Lite (Preview)9181350041.4%
Cohere Command R+ (Aug. 2024)81553925039.9%
Gemini 2.5 Flash (Reasoning)81453932039.4%
Gemma 3 12B7975285037.1%
GPT-4.1 Mini75572922036.6%
GPT-4.1 Nano73633017036.6%
Qwen 3.6 27B1006300032.5%
Gemini 3 Pro (Preview)92252315031.0%
Gemma 4 31B6753290029.7%
Qwen 3.5 Plus (2026-02-15)6347360029.1%
Gemini 3.1 Flash Lite9430170028.2%
GPT-5 Nano48251512019.8%
Gemini 2.5 Flash Lite (Reasoning)59777016.1%
GPT-4o, May 13th (temp=1)521800014.2%
GPT-4o, Aug. 6th (temp=1)363500014.1%
Gemini 3.1 Flash Lite (Reasoning)25120007.4%
GPT-4o Mini (temp=1)2500005.0%
Gemini 3 Flash (Preview)1743004.7%
Gemini 2.5 Flash Lite2000003.9%
GPT-OSS 120B000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Nemotron 3 Nano000000.0%
Gemma 3 4B000000.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Qwen3.6 Max Preview100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100.0%
Claude Opus 4.7100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
MiniMax M2.7100100100100100100.0%
GPT-5.5100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
GPT-4.1100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
DeepSeek V4 Pro100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
DeepSeek V4 Flash100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Qwen 2.5 72B1001001001009999.7%
GPT-5.21001001001009599.1%
ByteDance Seed 2.0 Mini1001001001009498.9%
Gemini 3.1 Pro (Preview)1001001001009498.8%
DeepSeek V3 (2024-12-26)1001001001009298.5%
ByteDance Seed 1.61001001001009198.2%
Grok 4.1 Fast1001001001009198.2%
Gemma 4 31B (Reasoning)1001001001008897.5%
Grok 4.31001001001008897.5%
GPT-5.4 Nano (Reasoning)1001001001008797.3%
GPT-4o, May 13th (temp=0)1001001001008697.2%
o4 Mini100100100949297.1%
WizardLM 2 8x22b1001001001008496.9%
DeepSeek V3.21001001001008396.7%
MiniMax M2.5100100100988496.5%
Gemma 4 31B100100100938996.5%
o4 Mini High10010097928995.6%
Llama 3.1 70B100100100888895.0%
Mistral Small Creative1001001001007595.0%
Mistral Medium 3.110010096958194.6%
Aion 2.01001001001006993.9%
Qwen 3.6 35B100100100977293.8%
GPT-5 Mini100100100888193.6%
Gemini 2.5 Flash100100100947393.5%
Z.AI GLM 4.7 Flash100100100917493.1%
Gemma 4 26B1001001001006593.1%
Claude Haiku 4.5100100100927393.0%
DeepSeek V3.110010095937592.7%
Xiaomi MIMO v2.51001001001006292.4%
Mistral Small 41001001001006192.1%
Grok 4.3 (Reasoning)1001001001005991.8%
Qwen 3.5 122B1001001001005791.5%
Gemini 3 Flash (Preview, Reasoning)100100100827591.4%
Llama 3.1 8B100100100837391.3%
Claude 3.7 Sonnet100100100817491.1%
Grok 4.20 (Beta, Reasoning)100100100994588.7%
DeepSeek V3 (2025-03-24)100100100736988.4%
LFM2 24B1001001001004288.4%
Ministral 3B100100100816188.3%
Z.AI GLM 4.5 Air100100100934888.3%
DeepSeek-V2 Chat100100100964488.1%
Gemini 2.5 Pro10010085807387.6%
GPT-4o, Aug. 6th (temp=0)100100100766287.5%
ByteDance Seed 2.0 Lite100100100795987.5%
Ministral 3 14B100100100973987.3%
GPT-5.4 Nano (Reasoning, Low)100100100696787.2%
Z.AI GLM 4.6939189837987.1%
ByteDance Seed 1.6 Flash1001001001002985.8%
Mistral Large100100100814785.7%
Grok 4.20 (Beta)1009189757285.6%
Gemini 2.5 Flash (Reasoning)10010098834485.0%
Inception Mercury1001001001002585.0%
Stealth: Healer Alpha100100100913084.2%
GPT-5 Nano1008581767583.4%
Qwen 3.6 Flash100100100100080.0%
Hermes 3 405B100100100100080.0%
Mistral Small 4 (Reasoning)10010094673980.0%
GPT-4o Mini (temp=0)10010096881579.8%
Ministral 3 8B100100100792079.6%
MoonshotAI: Kimi K2.5100100100564179.5%
GPT-5.4 Nano1009768636177.8%
Gemini 3 Pro (Preview)1009172646177.7%
Ministral 8B100100100533577.7%
Qwen 3 32B1009183635077.5%
Mistral NeMO10010010088077.5%
Grok 4.20100949492777.5%
Grok 4 Fast10010010085077.1%
Nemotron 3 Super10010088752276.9%
Gemma 3 27B100100100542876.4%
Gemini 3 Flash (Preview)1008987733175.9%
Grok 410010097453174.6%
Mistral Small 3.2 24B1001008879373.8%
Grok 4.20 (Reasoning)1008265625272.1%
Z.AI GLM 4.710010073582871.9%
Gemini 3.1 Flash Lite (Preview)1001007768069.0%
Arcee AI: Trinity Large (Preview)10010010035066.9%
Qwen 3.6 27B1001009341066.8%
Gemini 3.1 Flash Lite (Reasoning)1009767482166.6%
GPT-4o Mini (temp=1)807354524861.3%
GPT-4.1 Nano1007059561259.5%
Qwen 3.5 Plus (2026-02-15)1001006431059.1%
Rocinante 12B100100810056.2%
Claude 3 Haiku1001004729055.1%
Hermes 3 70B100100630052.6%
Nemotron 3 Nano100776322052.3%
Ministral 3 3B100100554051.7%
Gemini 3.1 Flash Lite96935514051.4%
GPT-4o, May 13th (temp=1)83675041048.3%
GPT-4.1 Mini100752322043.8%
Inception Mercury 28382430041.6%
Gemma 3 12B76563931040.5%
Cohere Command R+ (Aug. 2024)7971200033.9%
Arcee AI: Trinity Mini10045200032.9%
Gemma 3 4B79292221030.0%
Gemini 2.5 Flash Lite (Reasoning)636300025.2%
Llama 3.1 Nemotron 70B1002500025.0%
GPT-4o, Aug. 6th (temp=1)794300024.3%
Stealth: Aurora Alpha100000020.0%
GPT-OSS 120B523630018.2%
Gemini 2.5 Flash Lite56000011.3%
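Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 91 | 98.2%
Qwen 3.5 Flash | 100 | 100 | 100 | 97 | 93 | 98.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 89 | 97.8%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 88 | 97.5%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 88 | 97.5%
DeepSeek V3.1 | 100 | 100 | 100 | 99 | 88 | 97.2%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 83 | 96.7%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 81 | 96.2%
GPT-5.5 | 100 | 100 | 100 | 89 | 88 | 95.4%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 73 | 94.6%
GPT-5.4 | 100 | 99 | 97 | 89 | 86 | 94.4%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 8B | 100 | 100 | 100 | 88 | 73 | 92.1%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 59 | 91.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 55 | 91.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 54 | 90.8%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 52 | 90.5%
Ministral 8B | 100 | 100 | 100 | 99 | 50 | 89.7%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 83 | 62 | 89.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 94 | 39 | 86.7%
Qwen 3.5 35B | 100 | 100 | 93 | 72 | 67 | 86.2%
Qwen 3.5 9B | 100 | 100 | 100 | 89 | 39 | 85.7%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 25 | 85.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 25 | 85.0%
Claude Opus 4 | 100 | 100 | 100 | 73 | 39 | 82.4%
GPT-5.1 | 100 | 100 | 86 | 70 | 55 | 82.2%
GPT-5 Mini | 100 | 100 | 100 | 93 | 14 | 81.3%
MiniMax M2.7 | 100 | 100 | 100 | 97 | 7 | 80.9%
DeepSeek V4 Flash | 100 | 100 | 100 | 79 | 25 | 80.7%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 0 | 80.0%
Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 96 | 0 | 79.2%
Claude Sonnet 4 | 100 | 100 | 100 | 96 | 0 | 79.2%
Mistral Large 2 | 100 | 100 | 100 | 94 | 0 | 78.9%
Qwen 3.6 Flash | 100 | 100 | 100 | 93 | 0 | 78.6%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 85 | 7 | 78.5%
Mistral Large | 100 | 100 | 94 | 79 | 20 | 78.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 50 | 39 | 77.8%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 79 | 57 | 53 | 77.8%
Mistral Medium 3.1 | 100 | 100 | 81 | 62 | 45 | 77.5%
Mistral Small 3.2 24B | 100 | 100 | 100 | 83 | 0 | 76.7%
Gemma 3 27B | 100 | 88 | 85 | 63 | 45 | 76.2%
DeepSeek V3.2 | 100 | 100 | 100 | 77 | 0 | 75.4%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 76 | 0 | 75.2%
GPT-5.4 Nano | 100 | 100 | 100 | 76 | 0 | 75.2%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 82 | 69 | 22 | 74.5%
Grok 4.20 (Reasoning) | 100 | 100 | 79 | 67 | 25 | 74.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 67 | 0 | 73.3%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 67 | 0 | 73.3%
Stealth: Hunter Alpha | 100 | 100 | 100 | 39 | 17 | 71.1%
Grok 4.20 | 100 | 100 | 100 | 53 | 0 | 70.6%
Claude Opus 4.5 | 100 | 100 | 81 | 67 | 0 | 69.5%
o4 Mini | 100 | 100 | 100 | 45 | 0 | 68.9%
Mistral Small 4 | 100 | 100 | 83 | 59 | 0 | 68.5%
Qwen 3 32B | 100 | 100 | 94 | 32 | 7 | 66.6%
Grok 4.3 | 100 | 97 | 83 | 50 | 0 | 66.1%
Claude 3.7 Sonnet | 100 | 88 | 73 | 35 | 33 | 65.7%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 77 | 44 | 7 | 65.5%
Mistral Large 3 | 100 | 100 | 100 | 25 | 0 | 65.0%
Llama 3.1 70B | 100 | 100 | 100 | 25 | 0 | 65.0%
WizardLM 2 8x22b | 100 | 100 | 72 | 39 | 0 | 62.2%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 7 | 0 | 61.4%
Mistral Small Creative | 100 | 96 | 55 | 53 | 0 | 60.8%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 59 | 45 | 0 | 60.8%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 0 | 0 | 60.0%
o4 Mini High | 100 | 100 | 100 | 0 | 0 | 60.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 0 | 0 | 60.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemma 4 31B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 0 | 0 | 60.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude Opus 4.7 | 100 | 100 | 94 | 0 | 0 | 58.9%
Grok 4.20 (Beta, Reasoning) | 100 | 76 | 56 | 39 | 17 | 57.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 83 | 0 | 0 | 56.7%
Z.AI GLM 4.7 | 100 | 99 | 75 | 7 | 0 | 56.2%
Z.AI GLM 4.5 | 100 | 88 | 73 | 20 | 0 | 56.0%
Ministral 3B | 100 | 100 | 73 | 0 | 0 | 54.6%
Ministral 3 14B | 100 | 83 | 76 | 14 | 0 | 54.6%
Hermes 3 405B | 100 | 100 | 50 | 17 | 0 | 53.3%
Mistral Small 4 (Reasoning) | 100 | 94 | 39 | 14 | 12 | 51.8%
Stealth: Healer Alpha | 100 | 100 | 39 | 14 | 0 | 50.5%
Mistral NeMO | 100 | 100 | 25 | 21 | 0 | 49.2%
DeepSeek-V2 Chat | 100 | 63 | 55 | 25 | 0 | 48.6%
Gemini 3.1 Pro (Preview) | 67 | 67 | 55 | 50 | 3 | 48.2%
Z.AI GLM 4.5 Air | 100 | 64 | 62 | 13 | 0 | 47.8%
GPT-5.4 Nano (Reasoning, Low) | 68 | 61 | 57 | 31 | 21 | 47.7%
Grok 4 Fast | 100 | 100 | 17 | 17 | 0 | 46.7%
Grok 4.1 Fast | 100 | 100 | 25 | 7 | 0 | 46.4%
ByteDance Seed 2.0 Mini | 100 | 100 | 25 | 7 | 0 | 46.4%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 25 | 7 | 0 | 46.4%
Arcee AI: Trinity Large (Preview) | 100 | 93 | 32 | 0 | 0 | 45.0%
DeepSeek V3 (2024-12-26) | 100 | 45 | 39 | 32 | 0 | 43.2%
Gemma 4 26B | 100 | 45 | 43 | 25 | 0 | 42.5%
Grok 4 | 94 | 75 | 25 | 7 | 0 | 40.3%
Qwen 3.6 27B | 100 | 100 | 1 | 0 | 0 | 40.2%
Qwen 2.5 72B | 100 | 100 | 0 | 0 | 0 | 40.0%
GPT-4.1 Nano | 100 | 100 | 0 | 0 | 0 | 40.0%
Rocinante 12B | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemma 3 4B | 100 | 79 | 12 | 0 | 0 | 38.1%
GPT-4.1 | 100 | 55 | 29 | 0 | 0 | 36.7%
Gemini 3.1 Flash Lite (Reasoning) | 88 | 59 | 30 | 2 | 0 | 35.7%
Llama 3.1 Nemotron 70B | 100 | 59 | 17 | 0 | 0 | 35.2%
Gemini 3 Flash (Preview) | 100 | 56 | 7 | 0 | 0 | 32.7%
Cohere Command R+ (Aug. 2024) | 100 | 35 | 14 | 7 | 7 | 32.5%
Gemma 3 12B | 76 | 67 | 14 | 0 | 0 | 31.2%
Grok 4.20 (Beta) | 100 | 29 | 27 | 0 | 0 | 31.1%
GPT-4.1 Mini | 100 | 39 | 0 | 0 | 0 | 27.8%
Hermes 3 70B | 100 | 32 | 0 | 0 | 0 | 26.5%
Llama 3.1 8B | 100 | 0 | 0 | 0 | 0 | 20.0%
Gemini 3.1 Flash Lite (Preview) | 63 | 35 | 0 | 0 | 0 | 19.5%
Qwen 3.5 Plus (2026-02-15) | 91 | 0 | 0 | 0 | 0 | 18.2%
GPT-4o, May 13th (temp=0) | 89 | 0 | 0 | 0 | 0 | 17.8%
Nemotron 3 Super | 79 | 0 | 0 | 0 | 0 | 15.7%
Gemini 3.1 Flash Lite | 50 | 14 | 0 | 0 | 0 | 12.7%
Gemini 3 Pro (Preview) | 32 | 31 | 0 | 0 | 0 | 12.7%
Claude 3 Haiku | 25 | 25 | 0 | 0 | 0 | 10.0%
Nemotron 3 Nano | 45 | 0 | 0 | 0 | 0 | 8.9%
GPT-5 Nano | 11 | 0 | 0 | 0 | 0 | 2.2%
Gemini 3 Flash (Preview, Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%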

genre

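Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 97 | 99.5%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 94 | 88 | 96.4%
MiniMax M2.5 | 100 | 94 | 91 | 83 | 75 | 88.8%
GPT-5.5 | 99 | 93 | 92 | 89 | 64 | 87.4%
Grok 4.3 (Reasoning) | 100 | 100 | 83 | 83 | 67 | 86.7%
DeepSeek V4 Flash | 100 | 97 | 85 | 79 | 67 | 85.5%
Mistral Large 3 | 100 | 91 | 89 | 80 | 62 | 84.4%
GPT-5.4 Mini (Reasoning, Low) | 99 | 91 | 85 | 82 | 59 | 83.2%
ByteDance Seed 1.6 Flash | 100 | 100 | 94 | 81 | 37 | 82.5%
GPT-5.5 (Reasoning, Low) | 100 | 90 | 84 | 67 | 63 | 80.7%
Ministral 8B | 100 | 100 | 100 | 73 | 28 | 80.2%
Z.AI GLM 5 | 100 | 100 | 100 | 99 | 2 | 80.1%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V4 Pro | 89 | 88 | 88 | 83 | 47 | 79.0%
GPT-5.4 (Reasoning) | 100 | 85 | 79 | 70 | 56 | 77.9%
ByteDance Seed 1.6 | 100 | 100 | 94 | 88 | 0 | 76.4%
GPT-5.4 (Reasoning, Low) | 100 | 84 | 82 | 75 | 41 | 76.4%
GPT-5.5 (Reasoning) | 100 | 95 | 82 | 62 | 37 | 75.2%
Qwen 3.5 9B | 100 | 100 | 91 | 80 | 0 | 74.3%
GPT-5.4 Mini | 96 | 94 | 73 | 56 | 50 | 73.9%
MiniMax M2.7 | 100 | 100 | 85 | 62 | 12 | 71.9%
Z.AI GLM 5 Turbo | 100 | 100 | 97 | 53 | 7 | 71.5%
Mistral Large 2 | 81 | 71 | 67 | 67 | 64 | 69.9%
GPT-5.4 | 100 | 96 | 78 | 57 | 17 | 69.6%
DeepSeek V4 Flash (Reasoning) | 100 | 81 | 79 | 53 | 32 | 69.0%
Ministral 3 3B | 100 | 100 | 94 | 50 | 0 | 68.9%
Mistral Large | 96 | 63 | 59 | 59 | 56 | 66.7%
Claude Haiku 4.5 | 100 | 70 | 59 | 47 | 39 | 62.9%
Claude Opus 4.6 (Reasoning) | 100 | 85 | 77 | 25 | 7 | 58.9%
DeepSeek V4 Pro (Reasoning) | 100 | 85 | 69 | 39 | 0 | 58.7%
Claude Opus 4.5 | 100 | 67 | 50 | 42 | 35 | 58.6%
GPT-5.4 Mini (Reasoning) | 89 | 69 | 67 | 32 | 25 | 56.5%
Xiaomi MIMO v2.5 | 100 | 77 | 53 | 52 | 0 | 56.4%
Claude Opus 4.7 | 100 | 100 | 50 | 20 | 0 | 53.9%
Qwen 3.5 Flash | 100 | 100 | 67 | 0 | 0 | 53.3%
Mistral Small 4 | 100 | 63 | 52 | 36 | 11 | 52.5%
GPT-4o, May 13th (temp=0) | 100 | 73 | 47 | 32 | 2 | 50.8%
GPT-4.1 | 100 | 100 | 50 | 0 | 0 | 50.0%
Mistral Small 4 (Reasoning) | 83 | 73 | 47 | 46 | 0 | 49.9%
Qwen3 235B A22B Instruct 2507 | 99 | 73 | 43 | 35 | 0 | 49.9%
Inception Mercury | 100 | 100 | 40 | 0 | 0 | 48.0%
GPT-5 Mini | 91 | 56 | 50 | 39 | 0 | 47.3%
Mistral Medium 3.1 | 100 | 89 | 42 | 0 | 0 | 46.3%
ByteDance Seed 2.0 Lite | 100 | 59 | 39 | 25 | 7 | 46.0%
Ministral 3B | 100 | 79 | 50 | 0 | 0 | 45.7%
MoonshotAI: Kimi K2.6 | 88 | 73 | 39 | 29 | 0 | 45.7%
Qwen 2.5 72B | 100 | 97 | 20 | 0 | 0 | 43.4%
Qwen 3.5 122B | 100 | 100 | 17 | 0 | 0 | 43.3%
Mistral Small Creative | 100 | 50 | 39 | 25 | 0 | 42.8%
LFM2 24B | 100 | 83 | 30 | 0 | 0 | 42.7%
Qwen 3 32B | 73 | 68 | 56 | 12 | 0 | 42.0%
GPT-5.2 | 58 | 39 | 39 | 35 | 32 | 40.5%
Ministral 3 14B | 69 | 67 | 65 | 0 | 0 | 40.2%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 0 | 0 | 0 | 40.0%
Qwen 3.5 35B | 100 | 100 | 0 | 0 | 0 | 40.0%
MoonshotAI: Kimi K2.5 | 83 | 65 | 47 | 0 | 0 | 39.0%
Claude Opus 4.7 (Reasoning) | 85 | 59 | 25 | 20 | 0 | 37.8%
Claude Opus 4.6 | 82 | 36 | 36 | 25 | 7 | 37.2%
WizardLM 2 8x22b | 100 | 73 | 7 | 0 | 0 | 36.0%
Claude Sonnet 4.5 | 88 | 53 | 25 | 12 | 0 | 35.6%
Claude 3 Haiku | 75 | 52 | 47 | 0 | 0 | 35.0%
Mistral NeMO | 88 | 79 | 7 | 0 | 0 | 34.6%
GPT-5 | 63 | 46 | 41 | 18 | 0 | 33.6%
Aion 2.0 | 85 | 56 | 15 | 7 | 0 | 32.7%
GPT-4o, Aug. 6th (temp=0) | 73 | 45 | 25 | 14 | 0 | 31.3%
GPT-5 Nano | 90 | 59 | 0 | 0 | 0 | 29.8%
Claude 3.7 Sonnet | 57 | 56 | 33 | 2 | 0 | 29.8%
Qwen 3.5 27B | 45 | 36 | 35 | 21 | 4 | 28.1%
ByteDance Seed 2.0 Mini | 100 | 21 | 14 | 0 | 0 | 26.9%
Inception Mercury 2 | 47 | 41 | 36 | 3 | 0 | 25.4%
Gemini 2.5 Pro | 100 | 25 | 0 | 0 | 0 | 25.0%
Ministral 3 8B | 100 | 18 | 2 | 0 | 0 | 24.1%
GPT-4.1 Mini | 70 | 50 | 0 | 0 | 0 | 24.0%
Xiaomi MIMO v2.5 Pro | 76 | 34 | 7 | 0 | 0 | 23.4%
Hermes 3 405B | 59 | 55 | 3 | 0 | 0 | 23.4%
Grok 4.20 | 64 | 50 | 0 | 0 | 0 | 22.9%
Claude Opus 4 | 92 | 19 | 0 | 0 | 0 | 22.2%
Arcee AI: Trinity Large (Preview) | 63 | 35 | 11 | 0 | 0 | 21.8%
Hermes 3 70B | 76 | 32 | 0 | 0 | 0 | 21.7%
Stealth: Aurora Alpha | 56 | 32 | 17 | 0 | 0 | 21.1%
GPT-4o, May 13th (temp=1) | 43 | 43 | 17 | 0 | 0 | 20.5%
Gemini 3.1 Pro (Preview) | 100 | 0 | 0 | 0 | 0 | 20.0%
Gemma 4 31B (Reasoning) | 100 | 0 | 0 | 0 | 0 | 20.0%
Qwen 3.6 27B | 100 | 0 | 0 | 0 | 0 | 20.0%
Grok 4.3 | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral Small 3.2 24B | 100 | 0 | 0 | 0 | 0 | 20.0%
Gemini 2.5 Flash Lite | 53 | 47 | 0 | 0 | 0 | 20.0%
Claude Sonnet 4 | 67 | 25 | 0 | 0 | 0 | 18.3%
DeepSeek V3.2 | 88 | 0 | 0 | 0 | 0 | 17.5%
Nemotron 3 Nano | 30 | 25 | 17 | 12 | 0 | 16.7%
Qwen 3.5 Plus (2026-02-15) | 64 | 10 | 3 | 0 | 0 | 15.5%
Z.AI GLM 4.5 Air | 46 | 29 | 0 | 0 | 0 | 15.0%
GPT-5.1 | 30 | 30 | 14 | 2 | 0 | 14.9%
GPT-5.4 Nano | 72 | 0 | 0 | 0 | 0 | 14.3%
Grok 4.20 (Beta, Reasoning) | 69 | 0 | 0 | 0 | 0 | 13.8%
Writer: Palmyra X5 | 62 | 4 | 3 | 0 | 0 | 13.7%
GPT-5.4 Nano (Reasoning) | 41 | 25 | 0 | 0 | 0 | 13.2%
DeepSeek V3 (2024-12-26) | 41 | 22 | 0 | 0 | 0 | 12.6%
Claude 3.5 Sonnet | 59 | 0 | 0 | 0 | 0 | 11.8%
GPT-5.4 Nano (Reasoning, Low) | 37 | 20 | 0 | 0 | 0 | 11.4%
Stealth: Hunter Alpha | 45 | 12 | 0 | 0 | 0 | 11.4%
Grok 4.20 (Beta) | 42 | 14 | 0 | 0 | 0 | 11.1%
Llama 3.1 8B | 55 | 0 | 0 | 0 | 0 | 11.0%
Grok 4.1 Fast | 35 | 20 | 0 | 0 | 0 | 10.8%
DeepSeek V3 (2025-03-24) | 45 | 0 | 0 | 0 | 0 | 8.9%
Z.AI GLM 4.7 Flash | 35 | 7 | 0 | 0 | 0 | 8.4%
DeepSeek V3.1 | 39 | 0 | 0 | 0 | 0 | 7.8%
Llama 3.1 Nemotron 70B | 39 | 0 | 0 | 0 | 0 | 7.8%
GPT-4o Mini (temp=0) | 22 | 15 | 0 | 0 | 0 | 7.3%
Gemma 3 27B | 25 | 2 | 0 | 0 | 0 | 5.4%
Z.AI GLM 4.5 | 18 | 7 | 0 | 0 | 0 | 5.1%
o4 Mini High | 25 | 0 | 0 | 0 | 0 | 5.0%
Z.AI GLM 4.6 | 25 | 0 | 0 | 0 | 0 | 5.0%
Gemma 3 4B | 20 | 0 | 0 | 0 | 0 | 3.9%
o4 Mini | 7 | 0 | 0 | 0 | 0 | 1.4%
Gemini 2.5 Flash (Reasoning) | 7 | 0 | 0 | 0 | 0 | 1.4%
Rocinante 12B | 7 | 0 | 0 | 0 | 0 | 1.4%
Grok 4 Fast | 7 | 0 | 0 | 0 | 0 | 1.4%
Gemini 2.5 Flash | 2 | 0 | 0 | 0 | 0 | 0.4%
Gemma 4 26B (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.20 (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview, Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 4.7 | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 31B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Healer Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 26B | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash Lite (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek-V2 Chat | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Super | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0.0%
Llama 3.1 70B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0.0%
Cohere Command R+ (Aug. 2024) | 0 | 0 | 0 | 0 | 0 | 0.0%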