Narrator intent-glossing

Test: Bad Writing Habits

Avg. Score: 70.4%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Grok 4.1 Fast | 97.8% | $0.0018 | 37.8s | 81%
2 | o4 Mini | 92.2% | $0.015 | 25.7s | 68%
3 | Qwen 3.6 Flash | 93.1% | $0.010 | 41.4s | 65%
4 | Qwen 3.6 35B | 93.7% | $0.0083 | 1.0m | 63%
5 | Qwen 3.5 Plus (2026-04-20) | 95.1% | $0.017 | 1.8m | 71%
6 | GPT-5.4 Mini (Reasoning, Low) | 87.7% | $0.015 | 16.8s | 56%
7 | o4 Mini High | 92.9% | $0.025 | 47.2s | 62%
8 | Grok 4.3 | 87.3% | $0.0069 | 30.5s | 55%
9 | GPT-5.4 Mini | 86.0% | $0.015 | 16.8s | 56%
10 | Qwen 3.5 9B | 88.7% | $0.0011 | 1.4m | 57%
11 | DeepSeek V3 (2025-03-24) | 85.8% | $0.0014 | 39.4s | 50%
12 | Hermes 3 405B | 89.5% | $0.0032 | 53.2s | 50%
13 | ByteDance Seed 1.6 Flash | 84.2% | $0.0013 | 27.3s | 48%
14 | GPT-5.4 (Reasoning, Low) | 93.1% | $0.055 | 1.4m | 72%
15 | Mistral Small Creative | 81.1% | $0.0007 | 9.1s | 45%
16 | Qwen 3.5 Flash | 85.5% | $0.0025 | 47.5s | 49%
17 | Qwen3 235B A22B Instruct 2507 | 87.0% | $0.0011 | 59.2s | 47%
18 | GPT-5.4 Mini (Reasoning) | 84.4% | $0.022 | 28.1s | 51%
19 | Claude 3 Haiku | 83.6% | $0.0025 | 14.9s | 41%
20 | GPT-5.4 | 92.0% | $0.049 | 1.4m | 67%
21 | Qwen 3.6 27B | 93.5% | $0.025 | 2.3m | 65%
22 | Rocinante 12B | 85.5% | $0.0014 | 38.4s | 41%
23 | Mistral NeMO | 81.9% | $0.0005 | 10.1s | 36%
24 | Writer: Palmyra X5 | 84.3% | $0.011 | 22.0s | 41%
25 | Mistral Medium 3.1 | 83.0% | $0.0048 | 36.5s | 42%
26 | Mistral Small 4 | 79.1% | $0.0014 | 18.2s | 38%
27 | Qwen3.6 Max Preview | 96.4% | $0.050 | 3.5m | 82%
28 | Mistral Large | 80.7% | $0.014 | 30.9s | 42%
29 | Qwen 3 32B | 80.7% | $0.0015 | 54.6s | 42%
30 | Qwen 3.5 27B | 86.5% | $0.020 | 1.6m | 52%
31 | Grok 4 Fast | 77.9% | $0.0017 | 24.1s | 37%
32 | Grok 4.20 (Reasoning) | 85.4% | $0.018 | 1.5m | 49%
33 | Qwen 3.5 35B | 82.4% | $0.018 | 1.0m | 45%
34 | GPT-5.4 (Reasoning) | 96.0% | $0.089 | 2.6m | 81%
35 | Ministral 3 14B | 75.1% | $0.0007 | 11.7s | 33%
36 | Qwen 2.5 72B | 77.4% | $0.0010 | 36.7s | 34%
37 | GPT-4.1 | 79.6% | $0.018 | 44.7s | 40%
38 | GPT-5.5 | 98.0% | $0.139 | 1.7m | 84%
39 | Qwen 3.5 397B A17B | 89.1% | $0.014 | 3.0m | 58%
40 | Qwen 3.5 122B | 83.1% | $0.025 | 1.1m | 43%
41 | LFM2 24B | 75.6% | $0.0002 | 28.4s | 32%
42 | Gemini 3.1 Pro (Preview) | 95.7% | $0.107 | 1.8m | 72%
43 | Hermes 3 70B | 80.5% | $0.0010 | 1.2m | 36%
44 | DeepSeek V3 (2024-12-26) | 78.1% | $0.0021 | 54.6s | 35%
45 | DeepSeek-V2 Chat | 78.1% | $0.0021 | 53.3s | 34%
46 | GPT-4o, May 13th (temp=0) | 81.6% | $0.035 | 14.1s | 36%
47 | DeepSeek V4 Flash (Reasoning) | 75.1% | $0.0007 | 31.1s | 30%
48 | Mistral Large 3 | 74.6% | $0.0033 | 30.3s | 31%
49 | GPT-5.5 (Reasoning) | 96.7% | $0.142 | 1.8m | 82%
50 | GPT-4o Mini (temp=1) | 74.4% | $0.0012 | 34.8s | 31%
51 | Xiaomi MIMO v2.5 Pro | 76.6% | $0.0085 | 53.5s | 35%
52 | GPT-4o Mini (temp=0) | 73.6% | $0.0012 | 34.8s | 30%
53 | Grok 4.3 (Reasoning) | 86.8% | $0.021 | 2.3m | 48%
54 | Stealth: Hunter Alpha | 74.0% | $0.0000 | 55.0s | 32%
55 | Ministral 3 3B | 70.8% | $0.0005 | 11.1s | 24%
56 | GPT-5.5 (Reasoning, Low) | 96.1% | $0.139 | 1.8m | 76%
57 | Ministral 3B | 70.2% | $0.0001 | 8.1s | 22%
58 | ByteDance Seed 2.0 Lite | 83.7% | $0.012 | 2.2m | 40%
59 | Mistral Large 2 | 72.3% | $0.013 | 29.4s | 28%
60 | Grok 4.20 (Beta) | 69.0% | $0.018 | 15.8s | 30%
61 | Ministral 3 8B | 70.1% | $0.0008 | 19.6s | 21%
62 | DeepSeek V4 Flash | 68.3% | $0.0006 | 31.6s | 24%
63 | Mistral Small 4 (Reasoning) | 68.6% | $0.0022 | 30.2s | 24%
64 | DeepSeek V3.2 | 75.5% | $0.0014 | 1.9m | 34%
65 | Gemma 3 12B | 68.9% | $0.0004 | 41.3s | 24%
66 | Stealth: Healer Alpha | 66.8% | $0.0000 | 23.7s | 21%
67 | Gemini 3.1 Flash Lite (Preview) | 64.5% | $0.0030 | 8.4s | 20%
68 | GPT-4o, Aug. 6th (temp=0) | 72.1% | $0.023 | 22.7s | 24%
69 | Gemini 3.1 Flash Lite | 65.9% | $0.0030 | 12.1s | 20%
70 | DeepSeek V4 Pro | 71.9% | $0.0048 | 1.3m | 29%
71 | Qwen 3.5 Plus (2026-02-15) | 67.8% | $0.0060 | 31.5s | 23%
72 | Grok 4.20 | 68.1% | $0.0093 | 45.7s | 27%
73 | GPT-4o, May 13th (temp=1) | 71.3% | $0.033 | 14.4s | 27%
74 | Ministral 8B | 65.4% | $0.0004 | 10.4s | 18%
75 | Xiaomi MIMO v2.5 | 67.5% | $0.0054 | 31.8s | 22%
76 | Z.AI GLM 4.7 | 72.6% | $0.010 | 1.4m | 31%
77 | Arcee AI: Trinity Mini | 66.2% | $0.0003 | 9.2s | 16%
78 | Gemma 3 27B | 68.7% | $0.0006 | 52.6s | 23%
79 | MoonshotAI: Kimi K2.5 | 83.0% | $0.019 | 3.2m | 46%
80 | Gemini 2.5 Pro | 72.1% | $0.036 | 36.2s | 29%
81 | Cohere Command R+ (Aug. 2024) | 74.2% | $0.020 | 52.5s | 24%
82 | Grok 4.20 (Beta, Reasoning) | 72.3% | $0.039 | 34.0s | 29%
83 | Gemini 3.1 Flash Lite (Reasoning) | 63.2% | $0.0030 | 11.9s | 18%
84 | GPT-5.1 | 79.5% | $0.054 | 1.8m | 41%
85 | Gemma 3 4B | 63.6% | $0.0002 | 20.0s | 16%
86 | Gemini 3 Flash (Preview) | 61.1% | $0.0078 | 19.6s | 21%
87 | Aion 2.0 | 69.0% | $0.0064 | 1.3m | 25%
88 | GPT-4.1 Mini | 63.3% | $0.0027 | 19.0s | 15%
89 | Z.AI GLM 4.7 Flash | 66.1% | $0.0017 | 1.2m | 23%
90 | Gemini 2.5 Flash | 58.5% | $0.0052 | 10.6s | 18%
91 | Gemini 3 Pro (Preview) | 74.2% | $0.055 | 54.4s | 32%
92 | Claude Sonnet 4.6 | 69.8% | $0.031 | 39.3s | 23%
93 | ByteDance Seed 1.6 | 76.7% | $0.013 | 2.5m | 33%
94 | Z.AI GLM 5 | 66.0% | $0.0084 | 1.2m | 22%
95 | GPT-4o, Aug. 6th (temp=1) | 64.3% | $0.018 | 24.4s | 17%
96 | Claude Opus 4.7 (Reasoning) | 75.3% | $0.076 | 32.0s | 30%
97 | Claude Opus 4.7 | 72.8% | $0.069 | 30.4s | 28%
98 | Z.AI GLM 4.5 | 61.9% | $0.0051 | 42.1s | 14%
99 | Claude Sonnet 4.5 | 65.3% | $0.035 | 38.1s | 21%
100 | Llama 3.1 70B | 59.0% | $0.0015 | 29.4s | 12%
101 | Z.AI GLM 5 Turbo | 58.3% | $0.0081 | 33.2s | 14%
102 | GPT-4.1 Nano | 53.9% | $0.0007 | 13.3s | 11%
103 | Arcee AI: Trinity Large (Preview) | 58.1% | $0.0000 | 43.6s | 12%
104 | WizardLM 2 8x22b | 66.7% | $0.0026 | 1.8m | 19%
105 | Z.AI GLM 5.1 | 65.2% | $0.014 | 1.5m | 20%
106 | Z.AI GLM 4.6 | 58.5% | $0.0065 | 51.5s | 14%
107 | DeepSeek V3.1 | 63.6% | $0.0020 | 1.8m | 19%
108 | Z.AI GLM 4.5 Air | 59.6% | $0.0029 | 58.2s | 12%
109 | Grok 4 | 71.9% | $0.048 | 1.7m | 26%
110 | Gemini 2.5 Flash (Reasoning) | 51.1% | $0.011 | 21.5s | 13%
111 | Gemini 2.5 Flash Lite | 48.0% | $0.0009 | 9.5s | 9%
112 | DeepSeek V4 Pro (Reasoning) | 72.7% | $0.015 | 3.1m | 28%
113 | GPT-5.4 Nano (Reasoning, Low) | 45.1% | $0.0055 | 20.6s | 15%
114 | Gemini 3 Flash (Preview, Reasoning) | 51.7% | $0.012 | 30.1s | 12%
115 | Llama 3.1 Nemotron 70B | 48.9% | $0.0038 | 31.7s | 10%
116 | Claude Opus 4.6 | 68.8% | $0.078 | 1.2m | 29%
117 | MiniMax M2.5 | 56.4% | $0.0034 | 1.3m | 11%
118 | Claude 3.7 Sonnet | 61.0% | $0.042 | 46.7s | 16%
119 | GPT-5.4 Nano (Reasoning) | 44.3% | $0.0061 | 24.5s | 12%
120 | MiniMax M2.7 | 53.9% | $0.0040 | 1.1m | 10%
121 | GPT-5.4 Nano | 42.2% | $0.0057 | 26.3s | 12%
122 | Claude Opus 4.6 (Reasoning) | 73.0% | $0.088 | 1.4m | 27%
123 | Claude 3.5 Sonnet | 61.1% | $0.048 | 35.5s | 13%
124 | ByteDance Seed 2.0 Mini | 79.3% | $0.0045 | 4.9m | 34%
125 | Llama 3.1 8B | 53.1% | $0.0003 | 1.3m | 9%
126 | Gemma 4 31B (Reasoning) | 57.2% | $0.0014 | 2.2m | 16%
127 | Inception Mercury | 50.1% | $0.011 | 17.6s | 2%
128 | Gemma 4 31B | 52.1% | $0.0010 | 1.6m | 13%
129 | Gemma 4 26B | 47.1% | $0.0009 | 55.1s | 8%
130 | Claude Haiku 4.5 | 45.5% | $0.011 | 21.6s | 7%
131 | Claude Sonnet 4.6 (Reasoning) | 64.8% | $0.060 | 1.2m | 17%
132 | Claude Sonnet 4 | 54.4% | $0.032 | 43.7s | 9%
133 | GPT-5 | 72.2% | $0.065 | 2.8m | 32%
134 | Gemini 2.5 Flash Lite (Reasoning) | 36.8% | $0.0028 | 30.8s | 7%
135 | GPT-5 Mini | 39.3% | $0.0100 | 57.4s | 8%
136 | Gemma 4 26B (Reasoning) | 49.8% | $0.0013 | 2.0m | 8%
137 | Inception Mercury 2 | 32.3% | $0.0032 | 7.0s | 1%
138 | Mistral Small 3.2 24B | 80.0% | $0.0069 | 5.7m | 27%
139 | Stealth: Aurora Alpha | 29.3% | $0.0000 | 9.8s | 0%
140 | GPT-5.2 | 54.6% | $0.056 | 1.5m | 15%
141 | Claude Opus 4.5 | 52.8% | $0.070 | 53.4s | 11%
142 | MoonshotAI: Kimi K2.6 | 86.5% | $0.058 | 6.5m | 47%
143 | Nemotron 3 Super | 32.9% | $0.0000 | 1.4m | 5%
144 | GPT-OSS 120B | 31.9% | $0.0015 | 1.8m | 5%
145 | Nemotron 3 Nano | 15.6% | $0.0010 | 1.1m | 0%
146 | Claude Opus 4 | 71.2% | $0.209 | 1.4m | 29%
147 | GPT-5 Nano | 10.4% | $0.0042 | 1.4m | 0%
Average score: 70.37%

Individual Scenarios

Detailed Writing Rules

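Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 97 | 99.5%
GPT-5.5 | 100 | 100 | 100 | 100 | 97 | 99.4%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 90 | 98.0%
Hermes 3 405B | 100 | 100 | 100 | 94 | 93 | 97.5%
Grok 4.3 | 100 | 100 | 100 | 100 | 87 | 97.3%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 99 | 83 | 96.5%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 75 | 95.1%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 93 | 81 | 94.7%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 72 | 94.4%
Claude 3 Haiku | 100 | 100 | 100 | 97 | 73 | 94.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 56 | 91.1%
Qwen 3.5 Flash | 100 | 100 | 100 | 96 | 60 | 91.1%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 54 | 90.8%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 53 | 90.6%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 53 | 90.5%
Qwen 3 32B | 100 | 100 | 100 | 76 | 73 | 89.8%
Grok 4.20 (Reasoning) | 100 | 100 | 97 | 85 | 65 | 89.5%
Claude Opus 4.6 | 100 | 100 | 85 | 82 | 79 | 89.0%
GPT-5.4 | 100 | 100 | 100 | 85 | 57 | 88.4%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 71 | 68 | 87.7%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 99 | 37 | 87.3%
Grok 4 Fast | 100 | 100 | 92 | 84 | 60 | 87.1%
GPT-5.4 (Reasoning, Low) | 100 | 90 | 88 | 82 | 75 | 86.8%
Qwen 3.5 9B | 100 | 100 | 98 | 88 | 48 | 86.7%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 93 | 36 | 85.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 98 | 64 | 62 | 84.8%
o4 Mini | 100 | 100 | 100 | 78 | 44 | 84.3%
Qwen 3.5 397B A17B | 100 | 93 | 90 | 79 | 59 | 84.3%
Mistral Small 4 | 100 | 100 | 93 | 82 | 36 | 82.4%
GPT-4.1 Mini | 100 | 100 | 100 | 71 | 38 | 81.7%
Mistral Large | 100 | 100 | 100 | 75 | 28 | 80.6%
Qwen 3.5 35B | 100 | 100 | 98 | 75 | 29 | 80.5%
Ministral 3 14B | 100 | 100 | 93 | 84 | 25 | 80.5%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 95 | 85 | 21 | 80.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 74 | 27 | 80.3%
DeepSeek V4 Pro | 100 | 100 | 78 | 62 | 61 | 80.1%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 62 | 38 | 80.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 70B | 100 | 100 | 93 | 67 | 36 | 79.2%
GPT-4.1 | 100 | 90 | 89 | 89 | 25 | 78.7%
Qwen 3.6 27B | 100 | 100 | 100 | 61 | 28 | 77.8%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 81 | 3 | 76.7%
Mistral Small Creative | 100 | 100 | 89 | 57 | 29 | 75.1%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 54 | 18 | 74.5%
GPT-5.4 Mini | 100 | 98 | 79 | 47 | 46 | 74.0%
Mistral NeMO | 100 | 100 | 98 | 68 | 0 | 73.2%
Aion 2.0 | 100 | 85 | 82 | 73 | 11 | 70.2%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 50 | 0 | 70.0%
Gemini 3 Pro (Preview) | 100 | 83 | 73 | 68 | 25 | 69.9%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 90 | 46 | 0 | 67.3%
ByteDance Seed 1.6 | 100 | 89 | 64 | 41 | 36 | 66.1%
Arcee AI: Trinity Mini | 100 | 100 | 76 | 54 | 0 | 66.1%
Stealth: Hunter Alpha | 100 | 78 | 71 | 68 | 13 | 65.8%
Grok 4.20 (Beta) | 100 | 73 | 59 | 59 | 38 | 65.6%
Mistral Large 2 | 100 | 96 | 72 | 50 | 8 | 65.2%
ByteDance Seed 2.0 Lite | 100 | 99 | 67 | 41 | 18 | 65.1%
Qwen 2.5 72B | 100 | 71 | 67 | 44 | 38 | 64.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 51 | 47 | 22 | 64.0%
Claude Sonnet 4.5 | 100 | 100 | 90 | 23 | 0 | 62.5%
Gemini 3.1 Flash Lite (Preview) | 100 | 76 | 69 | 67 | 0 | 62.5%
Writer: Palmyra X5 | 100 | 100 | 100 | 8 | 0 | 61.7%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 64 | 43 | 0 | 61.3%
Claude Opus 4.7 | 100 | 94 | 64 | 31 | 11 | 60.1%
Ministral 3B | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-4o, May 13th (temp=1) | 100 | 79 | 43 | 41 | 36 | 59.8%
Z.AI GLM 4.6 | 100 | 85 | 64 | 47 | 0 | 59.2%
Claude 3.5 Sonnet | 100 | 93 | 84 | 18 | 0 | 59.2%
GPT-5 | 85 | 84 | 80 | 42 | 0 | 58.3%
Claude Opus 4.5 | 100 | 74 | 61 | 53 | 1 | 57.7%
Cohere Command R+ (Aug. 2024) | 100 | 93 | 86 | 0 | 0 | 55.8%
Z.AI GLM 5.1 | 100 | 88 | 67 | 18 | 0 | 54.7%
LFM2 24B | 100 | 73 | 50 | 46 | 0 | 53.8%
Gemini 3 Flash (Preview) | 90 | 76 | 54 | 46 | 0 | 53.2%
Gemma 3 27B | 76 | 75 | 69 | 36 | 3 | 52.1%
Grok 4 | 100 | 100 | 59 | 0 | 0 | 51.8%
Mistral Medium 3.1 | 100 | 68 | 55 | 27 | 0 | 49.8%
Gemini 2.5 Pro | 100 | 70 | 41 | 33 | 0 | 48.9%
Gemma 4 26B (Reasoning) | 100 | 100 | 33 | 11 | 0 | 48.8%
Gemini 2.5 Flash | 69 | 69 | 40 | 36 | 27 | 48.5%
Qwen 3.5 Plus (2026-02-15) | 100 | 93 | 42 | 6 | 0 | 48.3%
Z.AI GLM 4.7 | 100 | 100 | 36 | 0 | 0 | 47.3%
DeepSeek V4 Pro (Reasoning) | 100 | 50 | 46 | 38 | 0 | 46.8%
Xiaomi MIMO v2.5 Pro | 94 | 92 | 47 | 0 | 0 | 46.6%
DeepSeek-V2 Chat | 100 | 100 | 31 | 0 | 0 | 46.2%
Mistral Large 3 | 100 | 77 | 48 | 0 | 0 | 45.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 71 | 48 | 3 | 0 | 44.3%
DeepSeek V3.1 | 90 | 88 | 43 | 0 | 0 | 44.0%
MiniMax M2.7 | 100 | 100 | 16 | 0 | 0 | 43.2%
GPT-4o, May 13th (temp=0) | 100 | 100 | 10 | 0 | 0 | 42.1%
Gemma 4 31B (Reasoning) | 100 | 59 | 46 | 0 | 0 | 41.0%
Z.AI GLM 5 | 100 | 52 | 50 | 3 | 0 | 41.0%
Ministral 8B | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemini 3 Flash (Preview, Reasoning) | 86 | 57 | 55 | 0 | 0 | 39.5%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 61 | 18 | 16 | 0 | 39.0%
DeepSeek V4 Flash | 93 | 52 | 50 | 0 | 0 | 39.0%
Gemma 3 12B | 100 | 50 | 44 | 0 | 0 | 38.9%
Llama 3.1 70B | 100 | 93 | 0 | 0 | 0 | 38.6%
GPT-OSS 120B | 94 | 73 | 13 | 7 | 0 | 37.5%
Gemma 3 4B | 79 | 71 | 38 | 0 | 0 | 37.5%
Z.AI GLM 5 Turbo | 100 | 59 | 25 | 0 | 0 | 36.8%
Llama 3.1 8B | 100 | 83 | 0 | 0 | 0 | 36.7%
Xiaomi MIMO v2.5 | 90 | 73 | 10 | 0 | 0 | 34.7%
Claude Opus 4 | 79 | 60 | 17 | 12 | 0 | 33.6%
DeepSeek V4 Flash (Reasoning) | 100 | 52 | 7 | 0 | 0 | 31.8%
GPT-5 Mini | 67 | 58 | 20 | 8 | 0 | 30.6%
Claude Sonnet 4.6 | 59 | 52 | 39 | 0 | 0 | 30.0%
Gemma 4 26B | 56 | 53 | 36 | 0 | 0 | 29.0%
Mistral Small 4 (Reasoning) | 100 | 33 | 6 | 0 | 0 | 27.7%
Llama 3.1 Nemotron 70B | 100 | 22 | 15 | 0 | 0 | 27.3%
Grok 4.20 | 54 | 42 | 34 | 5 | 0 | 27.1%
Z.AI GLM 4.5 | 100 | 35 | 0 | 0 | 0 | 26.9%
GPT-4o Mini (temp=0) | 68 | 57 | 5 | 0 | 0 | 26.1%
Nemotron 3 Super | 85 | 40 | 0 | 0 | 0 | 25.0%
Arcee AI: Trinity Large (Preview) | 100 | 23 | 0 | 0 | 0 | 24.7%
GPT-4.1 Nano | 100 | 14 | 0 | 0 | 0 | 22.7%
Rocinante 12B | 100 | 7 | 0 | 0 | 0 | 21.4%
Z.AI GLM 4.5 Air | 100 | 0 | 0 | 0 | 0 | 20.0%
Inception Mercury | 100 | 0 | 0 | 0 | 0 | 20.0%
Ministral 3 8B | 98 | 0 | 0 | 0 | 0 | 19.6%
Inception Mercury 2 | 64 | 32 | 0 | 0 | 0 | 19.2%
Gemini 2.5 Flash Lite (Reasoning) | 57 | 20 | 17 | 0 | 0 | 18.8%
Ministral 3 3B | 90 | 0 | 0 | 0 | 0 | 18.1%
GPT-5.4 Nano (Reasoning, Low) | 80 | 5 | 0 | 0 | 0 | 17.0%
Gemini 2.5 Flash Lite | 50 | 35 | 0 | 0 | 0 | 16.9%
GPT-5.4 Nano (Reasoning) | 45 | 21 | 9 | 5 | 0 | 16.1%
Claude 3.7 Sonnet | 76 | 0 | 0 | 0 | 0 | 15.3%
Stealth: Healer Alpha | 46 | 26 | 0 | 0 | 0 | 14.4%
Gemma 4 31B | 72 | 0 | 0 | 0 | 0 | 14.4%
WizardLM 2 8x22b | 55 | 10 | 0 | 0 | 0 | 12.9%
Z.AI GLM 4.7 Flash | 40 | 14 | 0 | 0 | 0 | 10.8%
Gemini 3.1 Flash Lite | 54 | 0 | 0 | 0 | 0 | 10.8%
MiniMax M2.5 | 48 | 6 | 0 | 0 | 0 | 10.7%
GPT-5.4 Nano | 37 | 9 | 7 | 0 | 0 | 10.5%
Stealth: Aurora Alpha | 39 | 0 | 0 | 0 | 0 | 7.8%
Nemotron 3 Nano | 31 | 0 | 0 | 0 | 0 | 6.2%
GPT-5.2 | 17 | 3 | 0 | 0 | 0 | 4.1%
Claude Haiku 4.5 | 7 | 3 | 1 | 0 | 0 | 2.3%
Gemini 2.5 Flash (Reasoning) | 5 | 5 | 0 | 0 | 0 | 1.9%
Claude Sonnet 4 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small 3.2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%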
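Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.1%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.9%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 93 | 98.6%
Qwen 3 32B | 100 | 100 | 100 | 98 | 94 | 98.5%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 90 | 98.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 89 | 97.7%
GPT-5.1 | 100 | 100 | 100 | 94 | 93 | 97.4%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 76 | 95.1%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 88 | 88 | 95.1%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 74 | 94.8%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 72 | 94.4%
Mistral Small 4 | 100 | 100 | 100 | 100 | 65 | 93.1%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 62 | 92.5%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 62 | 92.5%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 61 | 92.1%
Rocinante 12B | 100 | 100 | 100 | 100 | 57 | 91.5%
GPT-5.5 | 100 | 100 | 100 | 100 | 57 | 91.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 79 | 78 | 91.2%
Qwen 3.5 397B A17B | 100 | 100 | 98 | 89 | 63 | 90.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 80 | 65 | 89.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 44 | 88.9%
WizardLM 2 8x22b | 100 | 100 | 100 | 96 | 45 | 88.2%
Grok 4.20 (Reasoning) | 100 | 94 | 91 | 86 | 68 | 87.9%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 95 | 44 | 87.9%
ByteDance Seed 1.6 Flash | 100 | 100 | 90 | 88 | 61 | 87.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 81 | 54 | 86.9%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 79 | 56 | 86.8%
Xiaomi MIMO v2.5 Pro | 100 | 98 | 98 | 71 | 65 | 86.6%
o4 Mini | 100 | 100 | 100 | 81 | 46 | 85.3%
Qwen 3.5 122B | 100 | 100 | 100 | 64 | 61 | 84.9%
GPT-4.1 | 100 | 88 | 88 | 83 | 65 | 84.8%
DeepSeek V4 Flash | 100 | 100 | 100 | 88 | 36 | 84.8%
Mistral Small 4 (Reasoning) | 100 | 94 | 87 | 75 | 67 | 84.5%
Mistral Large 3 | 100 | 100 | 89 | 79 | 54 | 84.3%
Ministral 3 8B | 100 | 100 | 100 | 65 | 54 | 83.8%
GPT-4.1 Mini | 100 | 100 | 82 | 78 | 59 | 83.8%
GPT-5.4 Mini | 100 | 94 | 90 | 74 | 59 | 83.6%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 16 | 83.2%
Z.AI GLM 5.1 | 100 | 100 | 90 | 80 | 46 | 83.2%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 79 | 67 | 67 | 82.7%
Writer: Palmyra X5 | 100 | 100 | 100 | 65 | 48 | 82.6%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 11 | 82.2%
Grok 4.20 | 100 | 92 | 92 | 78 | 49 | 82.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 9 | 81.8%
Mistral Small Creative | 100 | 100 | 87 | 67 | 56 | 81.8%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 6 | 81.2%
Z.AI GLM 5 | 100 | 100 | 100 | 71 | 35 | 81.1%
DeepSeek V3.2 | 100 | 100 | 98 | 82 | 24 | 80.7%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 73 | 29 | 80.4%
Grok 4 Fast | 100 | 100 | 94 | 82 | 27 | 80.4%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Grok 4.3 | 100 | 100 | 98 | 84 | 13 | 79.2%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 79 | 62 | 54 | 79.1%
GPT-5 | 100 | 100 | 84 | 65 | 45 | 78.7%
Qwen 3.5 35B | 100 | 100 | 96 | 59 | 36 | 78.2%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 61 | 30 | 78.0%
Claude Opus 4 | 100 | 100 | 100 | 50 | 40 | 77.9%
Aion 2.0 | 100 | 100 | 89 | 73 | 27 | 77.6%
Gemini 3 Pro (Preview) | 100 | 100 | 81 | 76 | 27 | 76.7%
Grok 4.20 (Beta) | 99 | 92 | 80 | 67 | 45 | 76.5%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 79 | 0 | 75.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 99 | 46 | 33 | 75.5%
DeepSeek V4 Pro | 100 | 100 | 78 | 59 | 40 | 75.2%
Mistral Large 2 | 100 | 100 | 71 | 56 | 48 | 74.9%
Claude Opus 4.5 | 100 | 80 | 72 | 72 | 50 | 74.7%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 52 | 21 | 74.5%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 89 | 55 | 28 | 74.4%
Claude 3 Haiku | 100 | 100 | 100 | 57 | 11 | 73.7%
Gemini 2.5 Pro | 100 | 75 | 74 | 61 | 57 | 73.5%
Qwen 3.5 27B | 100 | 86 | 79 | 69 | 34 | 73.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 96 | 55 | 17 | 73.5%
GPT-5.2 | 100 | 95 | 69 | 52 | 48 | 72.9%
Qwen 3.5 9B | 100 | 100 | 84 | 72 | 8 | 72.9%
Claude Sonnet 4.5 | 100 | 100 | 88 | 76 | 0 | 72.8%
Claude Sonnet 4.6 | 100 | 100 | 100 | 57 | 3 | 72.1%
Claude Opus 4.6 | 97 | 85 | 81 | 76 | 20 | 71.9%
Stealth: Healer Alpha | 100 | 81 | 72 | 70 | 35 | 71.5%
Qwen 3.5 Flash | 100 | 82 | 79 | 64 | 28 | 70.5%
Claude Opus 4.7 | 100 | 100 | 100 | 27 | 25 | 70.4%
Ministral 3 14B | 100 | 96 | 96 | 44 | 16 | 70.2%
Qwen 2.5 72B | 95 | 89 | 82 | 75 | 0 | 68.1%
GPT-4o Mini (temp=0) | 100 | 86 | 85 | 53 | 15 | 67.7%
Claude Sonnet 4 | 100 | 92 | 67 | 65 | 14 | 67.5%
Gemini 3.1 Flash Lite | 100 | 100 | 80 | 57 | 0 | 67.4%
Xiaomi MIMO v2.5 | 100 | 92 | 66 | 60 | 18 | 67.2%
DeepSeek V3.1 | 100 | 100 | 100 | 27 | 6 | 66.5%
ByteDance Seed 1.6 | 100 | 73 | 57 | 54 | 48 | 66.5%
Hermes 3 405B | 100 | 100 | 84 | 48 | 0 | 66.4%
Z.AI GLM 4.5 Air | 100 | 100 | 67 | 59 | 0 | 65.2%
GPT-5.4 Nano (Reasoning, Low) | 94 | 84 | 76 | 71 | 0 | 65.0%
LFM2 24B | 100 | 100 | 55 | 54 | 16 | 65.0%
Z.AI GLM 4.7 Flash | 96 | 93 | 90 | 43 | 0 | 64.3%
Mistral NeMO | 100 | 100 | 88 | 33 | 0 | 64.2%
Hermes 3 70B | 100 | 100 | 48 | 48 | 22 | 63.5%
Stealth: Hunter Alpha | 97 | 94 | 78 | 45 | 1 | 62.9%
GPT-5.4 Nano | 100 | 64 | 57 | 55 | 29 | 61.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 0 | 0 | 60.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemma 3 4B | 88 | 84 | 59 | 39 | 22 | 58.3%
Gemini 2.5 Flash | 100 | 85 | 51 | 48 | 0 | 56.9%
Gemma 3 27B | 86 | 72 | 60 | 50 | 15 | 56.5%
Gemma 3 12B | 100 | 100 | 56 | 25 | 0 | 56.3%
MiniMax M2.7 | 100 | 96 | 50 | 35 | 0 | 56.2%
Arcee AI: Trinity Mini | 100 | 100 | 81 | 0 | 0 | 56.1%
Z.AI GLM 5 Turbo | 75 | 71 | 71 | 51 | 11 | 55.8%
Ministral 8B | 100 | 92 | 71 | 6 | 0 | 53.7%
Gemini 2.5 Flash (Reasoning) | 90 | 90 | 71 | 14 | 0 | 53.0%
Claude Haiku 4.5 | 100 | 90 | 73 | 0 | 0 | 52.8%
Gemini 3 Flash (Preview) | 100 | 100 | 64 | 0 | 0 | 52.8%
Grok 4 | 100 | 81 | 50 | 21 | 11 | 52.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 36 | 25 | 0 | 52.3%
Ministral 3 3B | 100 | 81 | 46 | 25 | 0 | 50.3%
Ministral 3B | 100 | 72 | 57 | 7 | 0 | 47.3%
Claude 3.7 Sonnet | 100 | 73 | 34 | 16 | 3 | 45.1%
Gemini 3.1 Flash Lite (Preview) | 87 | 74 | 47 | 11 | 0 | 43.9%
Llama 3.1 70B | 100 | 84 | 29 | 3 | 0 | 43.2%
GPT-4o Mini (temp=1) | 100 | 50 | 33 | 28 | 0 | 42.2%
MiniMax M2.5 | 100 | 73 | 15 | 0 | 0 | 37.6%
Gemma 4 31B (Reasoning) | 88 | 43 | 38 | 15 | 0 | 36.7%
GPT-5 Mini | 89 | 34 | 23 | 0 | 0 | 29.1%
Gemma 4 26B | 50 | 36 | 25 | 20 | 0 | 26.2%
Gemini 2.5 Flash Lite (Reasoning) | 46 | 43 | 41 | 0 | 0 | 26.0%
GPT-4.1 Nano | 82 | 25 | 11 | 0 | 0 | 23.7%
Gemma 4 31B | 85 | 31 | 0 | 0 | 0 | 23.2%
Gemma 4 26B (Reasoning) | 86 | 17 | 7 | 0 | 0 | 22.0%
Llama 3.1 Nemotron 70B | 79 | 18 | 6 | 0 | 0 | 20.6%
Gemini 2.5 Flash Lite | 100 | 1 | 0 | 0 | 0 | 20.2%
Nemotron 3 Super | 72 | 20 | 6 | 0 | 0 | 19.6%
Gemini 3 Flash (Preview, Reasoning) | 67 | 18 | 0 | 0 | 0 | 17.0%
GPT-OSS 120B | 38 | 23 | 11 | 0 | 0 | 14.3%
GPT-5 Nano | 12 | 9 | 5 | 0 | 0 | 5.2%
Inception Mercury 2 | 17 | 0 | 0 | 0 | 0 | 3.3%
Llama 3.1 8B | 17 | 0 | 0 | 0 | 0 | 3.3%
Inception Mercury | 5 | 0 | 0 | 0 | 0 | 0.9%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%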
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 99 | 99.9%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 99 | 99.9%
Mistral Small Creative | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 99 | 99.7%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 98 | 99.7%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 97 | 99.4%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 98 | 97 | 98.9%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 96 | 94 | 97.9%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 89 | 97.8%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 87 | 97.3%
Claude Sonnet 4.6 | 100 | 100 | 99 | 94 | 92 | 97.1%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 94 | 90 | 97.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 84 | 96.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 82 | 96.5%
Writer: Palmyra X5 | 100 | 100 | 100 | 96 | 86 | 96.3%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 81 | 96.1%
MoonshotAI: Kimi K2.6 | 100 | 100 | 99 | 96 | 86 | 96.1%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 80 | 95.9%
o4 Mini | 100 | 100 | 100 | 100 | 79 | 95.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 78 | 95.5%
Hermes 3 70B | 100 | 100 | 100 | 96 | 81 | 95.2%
Qwen 3.5 Flash | 100 | 100 | 100 | 99 | 75 | 95.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 74 | 94.8%
Grok 4.3 | 100 | 100 | 100 | 100 | 74 | 94.8%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 72 | 94.5%
Grok 4 | 100 | 100 | 100 | 100 | 72 | 94.4%
Ministral 3 3B | 100 | 100 | 100 | 100 | 72 | 94.4%
Claude Opus 4 | 100 | 100 | 100 | 90 | 78 | 93.6%
MiniMax M2.5 | 100 | 100 | 100 | 93 | 72 | 93.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 64 | 92.8%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 62 | 92.5%
Claude Sonnet 4.5 | 100 | 100 | 100 | 82 | 80 | 92.2%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 59 | 91.8%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 59 | 91.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 56 | 91.3%
GPT-5 | 100 | 100 | 96 | 88 | 72 | 91.1%
GPT-4.1 | 100 | 100 | 100 | 100 | 54 | 90.8%
Mistral NeMO | 100 | 100 | 100 | 85 | 68 | 90.5%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 50 | 90.0%
Stealth: Healer Alpha | 100 | 100 | 98 | 83 | 68 | 89.9%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 96 | 54 | 89.9%
GPT-5.4 | 100 | 100 | 100 | 100 | 49 | 89.8%
Claude Sonnet 4 | 100 | 100 | 100 | 82 | 65 | 89.4%
Rocinante 12B | 100 | 100 | 100 | 100 | 46 | 89.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 46 | 89.2%
GPT-5.5 | 100 | 100 | 100 | 100 | 42 | 88.4%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 84 | 56 | 88.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 40 | 87.9%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 39 | 87.8%
Z.AI GLM 5 | 100 | 100 | 92 | 74 | 72 | 87.6%
Qwen 3.5 9B | 100 | 100 | 100 | 89 | 48 | 87.4%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 36 | 87.3%
Mistral Large | 100 | 100 | 100 | 96 | 38 | 86.7%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 73 | 60 | 86.6%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 67 | 64 | 86.1%
Grok 4.20 (Beta) | 100 | 100 | 85 | 82 | 61 | 85.5%
Qwen 2.5 72B | 100 | 100 | 92 | 74 | 61 | 85.4%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 79 | 46 | 84.9%
Nemotron 3 Super | 100 | 100 | 100 | 86 | 31 | 83.4%
Z.AI GLM 4.5 Air | 100 | 99 | 90 | 88 | 36 | 82.7%
DeepSeek V3.1 | 100 | 91 | 82 | 79 | 62 | 82.6%
Mistral Small 4 (Reasoning) | 100 | 100 | 86 | 66 | 57 | 81.8%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 69 | 36 | 81.1%
Ministral 3 8B | 100 | 100 | 100 | 59 | 44 | 80.4%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek-V2 Chat | 100 | 100 | 72 | 65 | 62 | 79.9%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 99 | 0 | 79.8%
Claude Opus 4.5 | 100 | 100 | 96 | 82 | 20 | 79.6%
Claude Opus 4.6 | 100 | 100 | 100 | 57 | 40 | 79.4%
Gemini 3.1 Flash Lite | 100 | 100 | 96 | 61 | 36 | 78.5%
Gemini 2.5 Pro | 100 | 100 | 76 | 67 | 50 | 78.5%
Mistral Large 3 | 100 | 100 | 100 | 77 | 13 | 78.0%
Stealth: Aurora Alpha | 100 | 95 | 94 | 75 | 23 | 77.5%
Ministral 3B | 100 | 100 | 93 | 63 | 31 | 77.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 85 | 0 | 77.0%
DeepSeek V3.2 | 100 | 94 | 85 | 81 | 23 | 76.6%
o4 Mini High | 100 | 100 | 100 | 80 | 0 | 75.9%
Aion 2.0 | 100 | 100 | 91 | 82 | 6 | 75.7%
Inception Mercury 2 | 100 | 100 | 100 | 75 | 0 | 75.0%
Gemini 2.5 Flash | 100 | 100 | 82 | 69 | 18 | 73.9%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 85 | 82 | 0 | 73.5%
Z.AI GLM 5.1 | 100 | 72 | 69 | 64 | 59 | 72.8%
Gemma 4 31B | 100 | 100 | 93 | 69 | 0 | 72.5%
Gemma 4 31B (Reasoning) | 100 | 100 | 94 | 52 | 11 | 71.5%
DeepSeek V4 Flash | 100 | 100 | 98 | 36 | 22 | 71.2%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 52 | 1 | 70.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 51 | 0 | 70.3%
Mistral Small 4 | 100 | 100 | 100 | 41 | 10 | 70.2%
Ministral 3 14B | 100 | 100 | 99 | 34 | 17 | 69.9%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 48 | 0 | 69.6%
Claude Opus 4.7 (Reasoning) | 100 | 89 | 79 | 79 | 0 | 69.2%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 46 | 0 | 69.2%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 95 | 47 | 0 | 68.5%
GPT-5.2 | 100 | 79 | 69 | 57 | 30 | 66.9%
Gemma 3 12B | 100 | 100 | 61 | 41 | 25 | 65.4%
Gemma 3 27B | 100 | 100 | 95 | 29 | 0 | 64.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 68 | 32 | 17 | 63.4%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 57 | 54 | 0 | 62.3%
Claude Haiku 4.5 | 100 | 76 | 73 | 61 | 0 | 62.1%
GPT-5.4 Nano (Reasoning, Low) | 100 | 86 | 49 | 41 | 34 | 62.0%
GPT-5.1 | 100 | 67 | 56 | 51 | 30 | 60.8%
GPT-4.1 Mini | 100 | 100 | 100 | 0 | 0 | 60.0%
Inception Mercury | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 97 | 0 | 0 | 59.4%
Stealth: Hunter Alpha | 100 | 100 | 50 | 36 | 10 | 59.2%
GPT-OSS 120B | 91 | 68 | 57 | 44 | 22 | 56.6%
GPT-4o Mini (temp=1) | 100 | 76 | 68 | 35 | 0 | 55.8%
Llama 3.1 70B | 100 | 100 | 46 | 25 | 3 | 54.8%
Gemma 4 26B | 100 | 100 | 46 | 25 | 0 | 54.2%
GPT-5.4 Nano | 80 | 70 | 48 | 39 | 29 | 53.2%
GPT-5 Mini | 92 | 91 | 36 | 33 | 8 | 52.1%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 57 | 0 | 0 | 51.5%
Mistral Large 2 | 100 | 100 | 46 | 8 | 0 | 50.8%
GPT-5.4 Nano (Reasoning) | 100 | 87 | 28 | 19 | 17 | 50.4%
GPT-4.1 Nano | 100 | 65 | 61 | 8 | 0 | 46.9%
Nemotron 3 Nano | 100 | 100 | 27 | 0 | 0 | 45.5%
Gemini 3 Flash (Preview) | 87 | 85 | 46 | 8 | 0 | 45.1%
Llama 3.1 8B | 100 | 100 | 25 | 0 | 0 | 45.0%
Gemma 3 4B | 100 | 82 | 16 | 3 | 3 | 40.9%
LFM2 24B | 100 | 61 | 23 | 0 | 0 | 36.7%
Gemini 2.5 Flash Lite | 100 | 36 | 20 | 0 | 0 | 31.2%
Llama 3.1 Nemotron 70B | 81 | 28 | 22 | 0 | 0 | 26.1%
GPT-5 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 99 | 99.9%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 96 | 99.1%
Rocinante 12B | 100 | 100 | 100 | 100 | 93 | 98.5%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 90 | 98.1%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 98 | 93 | 98.1%
GPT-5.5 | 100 | 100 | 100 | 100 | 90 | 98.0%
Mistral Large | 100 | 100 | 100 | 98 | 92 | 98.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 89 | 97.8%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 88 | 97.5%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 83 | 96.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 82 | 96.5%
Claude Opus 4.7 | 100 | 100 | 100 | 92 | 90 | 96.5%
o4 Mini | 100 | 100 | 100 | 92 | 90 | 96.5%
Llama 3.1 8B | 100 | 100 | 100 | 99 | 82 | 96.1%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 79 | 95.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 79 | 95.7%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 78 | 95.6%
GPT-4.1 | 100 | 100 | 100 | 100 | 76 | 95.3%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 76 | 95.3%
GPT-5.1 | 100 | 100 | 100 | 100 | 76 | 95.1%
GPT-4.1 Mini | 100 | 100 | 100 | 94 | 81 | 95.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 74 | 94.8%
GPT-5.2 | 100 | 100 | 97 | 91 | 86 | 94.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 73 | 94.6%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 69 | 93.9%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5 | 100 | 100 | 100 | 100 | 66 | 93.1%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 65 | 93.1%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 65 | 93.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 64 | 92.8%
Grok 4.3 | 100 | 100 | 100 | 100 | 64 | 92.8%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 61 | 92.1%
Gemma 3 27B | 100 | 100 | 100 | 100 | 59 | 91.8%
Gemma 3 12B | 100 | 100 | 100 | 94 | 64 | 91.6%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 94 | 62 | 91.3%
Grok 4.20 | 100 | 100 | 98 | 94 | 62 | 90.8%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 54 | 90.8%
Ministral 3 3B | 100 | 100 | 100 | 100 | 54 | 90.8%
Hermes 3 70B | 100 | 100 | 100 | 88 | 64 | 90.3%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 50 | 90.0%
Qwen 3.5 27B | 100 | 100 | 100 | 99 | 50 | 89.8%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 90 | 54 | 88.9%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 81 | 62 | 88.6%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 97 | 44 | 88.1%
Qwen 3.5 35B | 100 | 100 | 100 | 79 | 61 | 88.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 92 | 48 | 88.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 39 | 87.8%
Mistral Large 3 | 100 | 100 | 100 | 100 | 39 | 87.8%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 92 | 46 | 87.5%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 88 | 50 | 87.5%
Claude 3 Haiku | 100 | 100 | 100 | 88 | 50 | 87.5%
Claude Opus 4 | 100 | 100 | 100 | 73 | 62 | 87.2%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 76 | 54 | 86.1%
Claude 3.5 Sonnet | 100 | 100 | 100 | 72 | 57 | 85.9%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 27 | 85.4%
Gemma 3 4B | 100 | 100 | 100 | 76 | 47 | 84.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 88 | 73 | 62 | 84.7%
Aion 2.0 | 100 | 100 | 100 | 62 | 59 | 84.3%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 65 | 56 | 84.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 67 | 54 | 84.1%
WizardLM 2 8x22b | 100 | 97 | 90 | 68 | 64 | 83.9%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 90 | 29 | 83.9%
Z.AI GLM 4.5 | 100 | 97 | 97 | 89 | 28 | 82.1%
Grok 4 | 100 | 100 | 85 | 78 | 46 | 81.7%
Z.AI GLM 4.7 | 100 | 100 | 97 | 74 | 35 | 81.3%
Z.AI GLM 5 Turbo | 100 | 100 | 93 | 69 | 44 | 81.2%
MiniMax M2.5 | 100 | 100 | 100 | 57 | 49 | 81.2%
DeepSeek V4 Flash | 100 | 100 | 93 | 69 | 41 | 80.8%
Gemma 4 26B | 100 | 83 | 81 | 79 | 60 | 80.4%
Writer: Palmyra X5 | 100 | 100 | 81 | 73 | 48 | 80.3%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 0 | 80.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3B | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.5 Air | 100 | 100 | 86 | 64 | 50 | 79.9%
Mistral Small Creative | 100 | 100 | 100 | 98 | 0 | 79.6%
GPT-4.1 Nano | 100 | 100 | 99 | 56 | 41 | 79.2%
Claude Haiku 4.5 | 100 | 100 | 100 | 83 | 11 | 78.9%
Mistral Small 4 | 100 | 100 | 96 | 94 | 0 | 78.0%
DeepSeek V3.1 | 100 | 100 | 100 | 61 | 27 | 77.6%
GPT-5.4 Nano (Reasoning, Low) | 100 | 94 | 91 | 83 | 18 | 77.2%
Z.AI GLM 5 | 100 | 100 | 72 | 71 | 25 | 73.5%
Z.AI GLM 5.1 | 100 | 100 | 100 | 67 | 0 | 73.3%
MiniMax M2.7 | 100 | 100 | 97 | 39 | 31 | 73.3%
Gemini 2.5 Flash (Reasoning) | 100 | 82 | 69 | 62 | 53 | 72.9%
Arcee AI: Trinity Mini | 100 | 100 | 96 | 67 | 0 | 72.5%
Grok 4.3 (Reasoning) | 100 | 100 | 74 | 46 | 41 | 72.3%
Claude Opus 4.6 | 100 | 83 | 80 | 72 | 25 | 72.0%
Gemini 2.5 Flash | 100 | 100 | 69 | 61 | 17 | 69.3%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 96 | 73 | 62 | 14 | 68.9%
Mistral Large 2 | 100 | 100 | 72 | 44 | 18 | 66.8%
Ministral 8B | 100 | 92 | 72 | 61 | 3 | 65.5%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 25 | 0 | 65.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 69 | 54 | 0 | 64.6%
GPT-OSS 120B | 100 | 100 | 50 | 41 | 31 | 64.5%
Gemini 3 Flash (Preview) | 100 | 100 | 62 | 46 | 11 | 63.7%
Z.AI GLM 4.7 Flash | 100 | 100 | 97 | 8 | 0 | 61.2%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 89 | 16 | 0 | 61.0%
Nemotron 3 Super | 100 | 99 | 46 | 30 | 28 | 60.5%
GPT-5.4 Nano (Reasoning) | 100 | 90 | 68 | 26 | 16 | 60.2%
Claude 3.7 Sonnet | 100 | 100 | 57 | 35 | 0 | 58.4%
Stealth: Aurora Alpha | 100 | 100 | 91 | 0 | 0 | 58.2%
Gemini 3.1 Flash Lite (Preview) | 100 | 81 | 59 | 46 | 5 | 58.1%
Inception Mercury 2 | 100 | 86 | 55 | 38 | 0 | 55.8%
Grok 4 Fast | 100 | 79 | 67 | 33 | 0 | 55.6%
Gemini 3 Pro (Preview) | 100 | 86 | 57 | 23 | 7 | 54.8%
GPT-5 Mini | 100 | 72 | 57 | 29 | 16 | 54.6%
Nemotron 3 Nano | 100 | 86 | 60 | 21 | 0 | 53.2%
GPT-5.4 Nano | 93 | 51 | 47 | 38 | 23 | 50.3%
Ministral 3 8B | 100 | 100 | 50 | 0 | 0 | 50.0%
Stealth: Healer Alpha | 89 | 52 | 31 | 31 | 22 | 44.9%
Z.AI GLM 4.6 | 80 | 74 | 62 | 7 | 0 | 44.6%
Ministral 3 14B | 100 | 67 | 36 | 18 | 0 | 44.3%
Gemini 2.5 Flash Lite | 100 | 80 | 35 | 7 | 0 | 44.3%
Claude Opus 4.5 | 100 | 54 | 46 | 18 | 0 | 43.6%
Gemma 4 31B | 100 | 64 | 40 | 11 | 0 | 42.9%
Gemini 3 Flash (Preview, Reasoning) | 85 | 61 | 30 | 25 | 7 | 41.5%
Gemini 2.5 Flash Lite (Reasoning) | 97 | 68 | 41 | 0 | 0 | 41.2%
Claude Sonnet 4 | 100 | 100 | 0 | 0 | 0 | 40.0%
Inception Mercury | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemma 4 26B (Reasoning) | 100 | 73 | 6 | 0 | 0 | 35.8%
GPT-5 Nano | 9 | 0 | 0 | 0 | 0 | 1.8%
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 99 | 99.9%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 98 | 99.6%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 98 | 99.6%
o4 Mini | 100 | 100 | 100 | 100 | 97 | 99.4%
Mistral NeMO | 100 | 100 | 100 | 100 | 97 | 99.4%
Mistral Large 3 | 100 | 100 | 100 | 100 | 94 | 98.9%
Z.AI GLM 4.5 | 100 | 100 | 100 | 98 | 96 | 98.7%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 93 | 98.6%
o4 Mini High | 100 | 100 | 100 | 100 | 92 | 98.4%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 97 | 94 | 98.1%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 85 | 97.1%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 82 | 96.5%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 82 | 96.5%
Qwen 3.5 122B | 100 | 100 | 100 | 90 | 90 | 96.2%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 81 | 96.1%
Mistral Small 4 | 100 | 100 | 100 | 100 | 81 | 96.1%
ByteDance Seed 1.6 | 100 | 100 | 100 | 89 | 89 | 95.6%
Ministral 8B | 100 | 100 | 100 | 100 | 77 | 95.4%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 76 | 95.3%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 76 | 95.3%
GPT-5.4 | 100 | 100 | 100 | 99 | 76 | 95.2%
GPT-5.4 Mini | 100 | 96 | 94 | 94 | 88 | 94.6%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 90 | 82 | 94.6%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 98 | 74 | 94.4%
Grok 4.3 | 100 | 100 | 100 | 100 | 69 | 93.9%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 87 | 81 | 93.5%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 67 | 93.3%
Aion 2.0 | 100 | 100 | 100 | 94 | 69 | 92.8%
Qwen 3.5 9B | 100 | 100 | 100 | 89 | 74 | 92.7%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 62 | 92.3%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 59 | 91.8%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 93 | 65 | 91.7%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 56 | 91.1%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 96 | 59 | 90.9%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 52 | 90.4%
Claude Opus 4.7 | 100 | 93 | 90 | 82 | 81 | 89.3%
Ministral 3 14B | 100 | 100 | 100 | 100 | 46 | 89.2%
GPT-5.1 | 100 | 100 | 100 | 81 | 65 | 89.1%
DeepSeek-V2 Chat | 100 | 100 | 100 | 92 | 54 | 89.1%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 97 | 46 | 88.5%
Claude Opus 4 | 100 | 100 | 100 | 100 | 43 | 88.5%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 94 | 48 | 88.5%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 40 | 88.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 35 | 86.9%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 92 | 41 | 86.6%
Inception Mercury 2 | 100 | 100 | 91 | 87 | 54 | 86.3%
Stealth: Healer Alpha | 100 | 100 | 100 | 74 | 57 | 86.3%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 29 | 85.8%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 92 | 34 | 85.1%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 67 | 57 | 84.8%
GPT-5 | 100 | 100 | 100 | 78 | 46 | 84.8%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 23 | 84.6%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 68 | 51 | 84.0%
Grok 4 | 100 | 100 | 100 | 74 | 43 | 83.4%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 15 | 83.0%
Z.AI GLM 5 | 100 | 100 | 96 | 64 | 54 | 82.7%
Gemini 2.5 Flash | 100 | 100 | 79 | 73 | 61 | 82.6%
Stealth: Hunter Alpha | 100 | 100 | 100 | 61 | 50 | 82.1%
Gemma 3 12B | 100 | 100 | 100 | 56 | 54 | 81.9%
Claude Sonnet 4.5 | 100 | 96 | 92 | 86 | 35 | 81.6%
Mistral Small Creative | 100 | 100 | 97 | 64 | 46 | 81.4%
Inception Mercury | 100 | 100 | 100 | 100 | 6 | 81.2%
Writer: Palmyra X5 | 100 | 100 | 92 | 88 | 25 | 80.9%
Gemma 4 31B | 100 | 100 | 89 | 59 | 56 | 80.8%
Claude Sonnet 4 | 100 | 100 | 88 | 62 | 52 | 80.3%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 0 | 80.0%
LFM2 24B | 100 | 100 | 100 | 99 | 0 | 79.8%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 81 | 18 | 79.8%
Ministral 3 3B | 100 | 100 | 100 | 61 | 36 | 79.4%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 99 | 98 | 0 | 79.4%
MiniMax M2.5 | 100 | 100 | 100 | 93 | 3 | 79.1%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 90 | 67 | 36 | 78.7%
Grok 4.20 | 100 | 100 | 100 | 75 | 18 | 78.7%
WizardLM 2 8x22b | 100 | 100 | 100 | 89 | 0 | 77.8%
Mistral Small 4 (Reasoning) | 100 | 100 | 84 | 61 | 39 | 76.8%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 81 | 0 | 76.1%
Gemini 2.5 Pro | 100 | 100 | 71 | 56 | 51 | 75.5%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 74 | 0 | 74.8%
GPT-5.4 Nano (Reasoning) | 94 | 93 | 89 | 82 | 16 | 74.8%
Claude 3.7 Sonnet | 100 | 100 | 100 | 38 | 35 | 74.5%
GPT-OSS 120B | 100 | 100 | 74 | 65 | 31 | 74.0%
Mistral Large 2 | 100 | 100 | 76 | 56 | 36 | 73.7%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 68 | 0 | 73.6%
MiniMax M2.7 | 100 | 100 | 100 | 60 | 7 | 73.4%
Gemini 3 Flash (Preview) | 100 | 100 | 78 | 53 | 27 | 71.4%
Claude Opus 4.5 | 100 | 100 | 100 | 56 | 0 | 71.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 79 | 75 | 0 | 70.8%
GPT-5.4 Nano | 96 | 91 | 85 | 40 | 39 | 70.2%
Gemma 3 27B | 100 | 100 | 73 | 67 | 11 | 70.2%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 98 | 97 | 56 | 0 | 70.1%
GPT-4o Mini (temp=1) | 100 | 100 | 62 | 59 | 27 | 69.7%
Claude Opus 4.6 | 100 | 84 | 81 | 56 | 11 | 66.3%
Ministral 3B | 100 | 100 | 64 | 41 | 25 | 66.0%
Claude Sonnet 4.6 | 92 | 90 | 84 | 62 | 0 | 65.8%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 22 | 0 | 64.4%
Z.AI GLM 4.6 | 100 | 100 | 69 | 52 | 0 | 64.3%
Gemma 3 4B | 100 | 100 | 100 | 8 | 6 | 62.9%
Grok 4 Fast | 100 | 83 | 68 | 46 | 17 | 62.8%
GPT-4.1 Nano | 100 | 97 | 96 | 21 | 0 | 62.6%
Stealth: Aurora Alpha | 100 | 100 | 75 | 30 | 0 | 61.0%
Ministral 3 8B | 100 | 82 | 74 | 44 | 0 | 60.1%
DeepSeek V3.2 | 100 | 90 | 74 | 35 | 0 | 59.7%
Gemini 2.5 Flash Lite | 100 | 86 | 59 | 32 | 0 | 55.5%
DeepSeek V4 Flash | 100 | 100 | 76 | 0 | 0 | 55.3%
Qwen 3 32B | 100 | 76 | 50 | 31 | 11 | 53.7%
Gemma 4 26B (Reasoning) | 100 | 100 | 23 | 18 | 0 | 48.3%
Gemma 4 26B | 100 | 71 | 44 | 7 | 0 | 44.3%
Llama 3.1 Nemotron 70B | 88 | 64 | 44 | 6 | 0 | 40.1%
Nemotron 3 Nano | 90 | 55 | 40 | 0 | 0 | 36.9%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 75 | 7 | 1 | 0 | 36.7%
GPT-5 Nano | 100 | 47 | 13 | 6 | 0 | 33.3%
GPT-5 Mini | 100 | 30 | 20 | 0 | 0 | 29.9%
GPT-5.4 Nano (Reasoning, Low) | 76 | 42 | 31 | 0 | 0 | 29.8%
Llama 3.1 8B | 65 | 65 | 11 | 0 | 0 | 28.3%
GPT-5.2 | 51 | 39 | 28 | 14 | 4 | 27.2%
Claude Haiku 4.5 | 100 | 29 | 0 | 0 | 0 | 25.8%
Nemotron 3 Super | 44 | 41 | 21 | 1 | 0 | 21.5%
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
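Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 99 | 99.8%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.6%
Claude 3 Haiku | 100 | 100 | 100 | 99 | 97 | 99.2%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 93 | 98.6%
GPT-4.1 | 100 | 100 | 100 | 100 | 93 | 98.6%
Grok 4 Fast | 100 | 100 | 100 | 97 | 89 | 97.2%
GPT-5.4 Mini | 100 | 100 | 100 | 94 | 91 | 97.1%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 82 | 96.5%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 93 | 88 | 96.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 80 | 95.9%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 79 | 95.7%
Claude Opus 4.6 | 100 | 99 | 98 | 98 | 82 | 95.7%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 89 | 82 | 94.3%
Hermes 3 405B | 100 | 100 | 100 | 100 | 71 | 94.1%
Qwen 3.6 35B | 100 | 100 | 100 | 88 | 81 | 93.9%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 69 | 93.9%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 68 | 93.7%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 83 | 82 | 93.0%
Z.AI GLM 4.5 | 100 | 100 | 96 | 88 | 82 | 92.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 64 | 92.8%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 82 | 82 | 92.6%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 62 | 92.4%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 99 | 59 | 91.6%
GPT-5.2 | 100 | 100 | 100 | 84 | 73 | 91.4%
LFM2 24B | 100 | 100 | 100 | 88 | 68 | 91.3%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 56 | 91.1%
Gemma 3 4B | 100 | 100 | 100 | 76 | 75 | 90.4%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 49 | 89.7%
GPT-5.1 | 100 | 100 | 100 | 87 | 56 | 88.6%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 73 | 68 | 88.3%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 71 | 68 | 87.7%
Ministral 3 8B | 100 | 100 | 89 | 74 | 74 | 87.5%
Grok 4 | 100 | 100 | 98 | 77 | 62 | 87.4%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 73 | 61 | 86.8%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 82 | 51 | 86.6%
Z.AI GLM 5 | 100 | 100 | 90 | 88 | 51 | 85.9%
Stealth: Healer Alpha | 100 | 100 | 93 | 74 | 60 | 85.4%
Claude Opus 4.7 | 100 | 100 | 100 | 98 | 27 | 85.0%
DeepSeek V4 Pro | 100 | 100 | 86 | 76 | 61 | 84.6%
Z.AI GLM 5.1 | 100 | 100 | 100 | 97 | 23 | 84.0%
GPT-5 | 100 | 100 | 93 | 75 | 52 | 84.0%
Qwen 3 32B | 100 | 100 | 100 | 72 | 47 | 83.8%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 18 | 83.7%
Mistral Small Creative | 100 | 100 | 100 | 68 | 50 | 83.6%
Qwen 3.5 Plus (2026-02-15) | 100 | 93 | 92 | 67 | 65 | 83.3%
Qwen 3.6 27B | 100 | 100 | 100 | 97 | 18 | 83.2%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 66 | 50 | 83.1%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 15 | 83.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 14 | 82.7%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 11 | 82.2%
Mistral Small 4 | 100 | 100 | 100 | 74 | 35 | 81.9%
Ministral 3 14B | 100 | 100 | 98 | 73 | 35 | 81.1%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 74 | 31 | 81.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 97 | 61 | 46 | 80.7%
Qwen 3.5 35B | 100 | 99 | 90 | 72 | 41 | 80.5%
Inception Mercury | 100 | 100 | 100 | 84 | 17 | 80.1%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V4 Flash | 100 | 100 | 89 | 71 | 38 | 79.5%
Claude Opus 4 | 100 | 100 | 90 | 59 | 49 | 79.5%
Stealth: Hunter Alpha | 100 | 100 | 76 | 75 | 42 | 78.9%
Z.AI GLM 4.7 | 100 | 100 | 89 | 57 | 46 | 78.4%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 99 | 80 | 5 | 76.8%
Mistral Medium 3.1 | 98 | 93 | 81 | 73 | 33 | 75.6%
Claude Sonnet 4 | 100 | 100 | 78 | 59 | 36 | 74.5%
DeepSeek V3.2 | 100 | 98 | 62 | 62 | 50 | 74.4%
Z.AI GLM 4.6 | 100 | 100 | 86 | 82 | 0 | 73.7%
Gemma 3 12B | 100 | 100 | 93 | 72 | 0 | 72.9%
Gemma 4 26B (Reasoning) | 100 | 100 | 89 | 73 | 0 | 72.4%
Llama 3.1 Nemotron 70B | 100 | 100 | 81 | 81 | 0 | 72.2%
MiniMax M2.7 | 100 | 89 | 59 | 56 | 56 | 72.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 47 | 12 | 71.9%
DeepSeek V3.1 | 100 | 100 | 89 | 70 | 0 | 71.8%
Ministral 3 3B | 100 | 100 | 82 | 69 | 7 | 71.8%
Stealth: Aurora Alpha | 100 | 81 | 64 | 57 | 57 | 71.7%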
Mistral NeMO | 100 | 100 | 50 | 33 | | 70.7%
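Gemini 2.5 Pro | 100 | 100 | 69 | 60 | 22 | 70.3%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 80 | 65 | 6 | 70.2%
Aion 2.0 | 100 | 100 | 98 | 46 | 3 | 69.3%
Claude 3.5 Sonnet | 100 | 100 | 86 | 61 | 0 | 69.3%
Mistral Large 3 | 100 | 100 | 91 | 38 | 15 | 68.8%
Grok 4.20 | 100 | 100 | 73 | 58 | 9 | 68.1%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 84 | 54 | 0 | 67.6%
Gemini 3.1 Flash Lite | 100 | 100 | 71 | 67 | 0 | 67.5%
WizardLM 2 8x22b | 83 | 72 | 66 | 56 | 53 | 66.1%
Mistral Large | 100 | 100 | 100 | 27 | 0 | 65.4%
Gemma 4 31B (Reasoning) | 100 | 71 | 69 | 54 | 27 | 64.2%
GPT-5.4 Nano | 100 | 90 | 76 | 44 | 11 | 64.2%
Mistral Small 4 (Reasoning) | 100 | 100 | 72 | 47 | 0 | 63.8%
Mistral Large 2 | 100 | 100 | 98 | 16 | 0 | 62.8%
Claude Haiku 4.5 | 100 | 100 | 88 | 13 | 11 | 62.5%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 60 | 50 | 0 | 62.0%
Ministral 8B | 100 | 100 | 46 | 41 | 16 | 60.6%
Xiaomi MIMO v2.5 | 92 | 84 | 67 | 59 | 0 | 60.3%
Hermes 3 70B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemma 4 31B | 100 | 81 | 55 | 47 | 17 | 59.9%
Grok 4.20 (Beta) | 96 | 60 | 59 | 43 | 38 | 59.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 74 | 53 | 46 | 20 | 58.8%
Gemini 3 Pro (Preview) | 100 | 69 | 66 | 32 | 11 | 55.7%
Claude Opus 4.5 | 100 | 69 | 49 | 33 | 20 | 54.2%
Gemini 2.5 Flash | 93 | 68 | 62 | 46 | 1 | 53.9%
GPT-5 Mini | 97 | 67 | 45 | 36 | 15 | 52.0%
Nemotron 3 Super | 100 | 64 | 51 | 36 | 0 | 50.3%
Claude 3.7 Sonnet | 99 | 84 | 66 | 0 | 0 | 49.9%
Gemini 2.5 Flash Lite | 100 | 78 | 71 | 0 | 0 | 49.6%
GPT-5.4 Nano (Reasoning) | 100 | 63 | 47 | 35 | 0 | 49.1%
Gemini 2.5 Flash (Reasoning) | 100 | 62 | 50 | 20 | 0 | 46.5%
MiniMax M2.5 | 100 | 100 | 29 | 0 | 0 | 45.8%
Gemini 3 Flash (Preview) | 100 | 65 | 64 | 0 | 0 | 45.7%
GPT-OSS 120B | 62 | 59 | 57 | 47 | 0 | 45.0%
Nemotron 3 Nano | 100 | 100 | 22 | 0 | 0 | 44.4%
Z.AI GLM 4.5 Air | 100 | 82 | 32 | 0 | 0 | 43.0%
Inception Mercury 2 | 100 | 47 | 11 | 10 | 0 | 33.6%
Gemma 4 26B | 100 | 27 | 26 | 0 | 0 | 30.6%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 13 | 7 | 0 | 0 | 24.0%
GPT-5 Nano | 44 | 29 | 0 | 0 | 0 | 14.6%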

genre

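Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg ▼
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 83 | 96.6%
o4 Mini | 100 | 100 | 100 | 91 | 90 | 96.2%
GPT-5.5 | 100 | 100 | 100 | 97 | 75 | 94.3%
Gemini 3.1 Pro (Preview) | 100 | 95 | 94 | 88 | 83 | 92.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 54 | 90.9%
GPT-5.5 (Reasoning) | 100 | 97 | 97 | 89 | 70 | 90.7%
GPT-5.4 (Reasoning) | 100 | 100 | 99 | 73 | 57 | 85.8%
Mistral NeMO | 100 | 100 | 100 | 61 | 59 | 83.9%
Claude 3 Haiku | 100 | 100 | 92 | 72 | 48 | 82.3%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 88 | 85 | 33 | 81.2%
Qwen 3.6 Flash | 100 | 100 | 89 | 68 | 48 | 81.0%
Hermes 3 405B | 100 | 100 | 81 | 67 | 52 | 79.8%
Qwen 3.6 27B | 100 | 100 | 85 | 70 | 38 | 78.6%
Claude Sonnet 4.6 | 100 | 100 | 97 | 75 | 11 | 76.7%
Grok 4 Fast | 100 | 88 | 73 | 66 | 52 | 75.7%
Qwen 3.5 Plus (2026-04-20) | 100 | 96 | 88 | 60 | 29 | 74.5%
Llama 3.1 8B | 100 | 100 | 96 | 74 | 0 | 74.0%
GPT-5.4 | 100 | 98 | 77 | 49 | 43 | 73.5%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 65 | 0 | 73.1%
ByteDance Seed 1.6 | 100 | 100 | 100 | 41 | 22 | 72.6%
Grok 4.20 (Reasoning) | 100 | 100 | 75 | 45 | 41 | 72.2%
Ministral 3 14B | 100 | 100 | 90 | 60 | 0 | 69.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 62 | 44 | 34 | 67.9%
Qwen 3.6 35B | 100 | 100 | 81 | 58 | 0 | 67.8%
GPT-4o, May 13th (temp=0) | 100 | 94 | 69 | 62 | 11 | 67.3%
Qwen 3.5 9B | 100 | 100 | 68 | 49 | 18 | 67.0%
o4 Mini High | 100 | 92 | 88 | 55 | 0 | 66.8%
Grok 4.3 | 90 | 81 | 63 | 54 | 44 | 66.2%
GPT-5.4 (Reasoning, Low) | 100 | 81 | 68 | 43 | 37 | 65.9%
Writer: Palmyra X5 | 100 | 100 | 72 | 29 | 18 | 63.9%
GPT-4o Mini (temp=1) | 100 | 100 | 68 | 41 | 0 | 61.9%
DeepSeek-V2 Chat | 100 | 100 | 82 | 20 | 5 | 61.4%
ByteDance Seed 1.6 Flash | 78 | 62 | 59 | 57 | 52 | 61.4%
Rocinante 12B | 100 | 100 | 76 | 18 | 11 | 61.2%
GPT-5.4 Mini | 100 | 62 | 47 | 45 | 44 | 59.6%
MoonshotAI: Kimi K2.5 | 81 | 62 | 56 | 52 | 47 | 59.6%
Gemini 3 Pro (Preview) | 100 | 71 | 63 | 60 | 0 | 58.7%
Qwen 3.5 397B A17B | 76 | 57 | 56 | 50 | 49 | 57.4%
Mistral Large | 100 | 76 | 72 | 18 | 16 | 56.6%
LFM2 24B | 100 | 78 | 64 | 40 | 0 | 56.2%
Claude Opus 4.7 | 100 | 100 | 44 | 34 | 0 | 55.5%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 49 | 21 | 0 | 54.1%
Qwen 3.5 122B | 100 | 75 | 69 | 21 | 0 | 53.2%
Grok 4.20 | 95 | 57 | 56 | 36 | 22 | 53.0%
Gemini 2.5 Pro | 100 | 100 | 53 | 11 | 0 | 52.7%
Stealth: Hunter Alpha | 100 | 94 | 61 | 7 | 0 | 52.5%
Qwen 3 32B | 100 | 74 | 56 | 31 | 1 | 52.3%
Gemma 3 27B | 100 | 73 | 54 | 35 | 0 | 52.3%
Z.AI GLM 5 | 74 | 72 | 68 | 41 | 0 | 51.1%
DeepSeek V3.1 | 100 | 100 | 35 | 16 | 0 | 50.1%
Gemma 3 12B | 76 | 62 | 56 | 44 | 0 | 47.8%
Claude Opus 4.7 (Reasoning) | 100 | 50 | 48 | 36 | 3 | 47.5%
GPT-4o Mini (temp=0) | 100 | 69 | 64 | 0 | 0 | 46.6%
Qwen 3.5 35B | 100 | 96 | 25 | 9 | 0 | 46.1%
DeepSeek V4 Pro | 74 | 62 | 50 | 42 | 0 | 45.8%
GPT-5.4 Mini (Reasoning) | 100 | 61 | 45 | 22 | 0 | 45.5%
GPT-5 | 100 | 72 | 51 | 4 | 0 | 45.3%
Mistral Small Creative | 100 | 67 | 59 | 0 | 0 | 45.2%
Mistral Medium 3.1 | 100 | 91 | 34 | 0 | 0 | 45.0%
Z.AI GLM 4.7 Flash | 94 | 46 | 40 | 28 | 15 | 44.6%
Mistral Large 2 | 100 | 100 | 23 | 0 | 0 | 44.6%
Grok 4.20 (Beta, Reasoning) | 100 | 67 | 29 | 25 | 0 | 44.3%
MoonshotAI: Kimi K2.6 | 100 | 64 | 25 | 23 | 9 | 44.3%
Grok 4.3 (Reasoning) | 88 | 68 | 56 | 8 | 0 | 44.1%
Grok 4.20 (Beta) | 100 | 53 | 36 | 30 | 0 | 43.8%
Gemini 2.5 Flash Lite | 88 | 82 | 34 | 9 | 0 | 42.7%
Claude Opus 4.6 | 89 | 82 | 23 | 17 | 0 | 42.1%
Hermes 3 70B | 96 | 96 | 16 | 0 | 0 | 41.5%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 5 | 0 | 0 | 41.0%
ByteDance Seed 2.0 Lite | 100 | 72 | 31 | 0 | 0 | 40.6%
Z.AI GLM 4.6 | 100 | 86 | 17 | 0 | 0 | 40.5%
Ministral 3B | 100 | 76 | 22 | 3 | 0 | 40.2%
Gemini 2.5 Flash | 100 | 94 | 0 | 0 | 0 | 38.9%
Mistral Small 4 (Reasoning) | 100 | 71 | 18 | 0 | 0 | 37.9%
Claude Opus 4.6 (Reasoning) | 100 | 85 | 0 | 0 | 0 | 37.0%
DeepSeek V3 (2024-12-26) | 100 | 50 | 33 | 0 | 0 | 36.6%
DeepSeek V4 Flash (Reasoning) | 73 | 67 | 38 | 0 | 0 | 35.6%
Gemini 3.1 Flash Lite (Preview) | 89 | 46 | 43 | 0 | 0 | 35.5%
Qwen 3.5 Flash | 76 | 56 | 43 | 0 | 0 | 35.1%
GPT-4.1 | 75 | 65 | 30 | 1 | 0 | 34.2%
DeepSeek V3.2 | 62 | 53 | 47 | 6 | 0 | 33.5%
Xiaomi MIMO v2.5 Pro | 94 | 52 | 22 | 0 | 0 | 33.5%
Grok 4 | 81 | 51 | 34 | 0 | 0 | 33.1%
Gemini 3.1 Flash Lite | 72 | 36 | 31 | 23 | 3 | 33.1%
Qwen 3.5 27B | 57 | 57 | 50 | 0 | 0 | 33.0%
GPT-4.1 Nano | 100 | 64 | 0 | 0 | 0 | 32.8%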
DeepSeek V4 Flash