Subordinate conjunction sentence starts

Test: Bad Writing Habits

Avg. Score: 32.4%
Scenarios: 18

Overall Performance

Rank Model Score Avg. Cost Avg. Time Stability
1GPT-5.4 Nano44.2%$0.005726.3s15%
2ByteDance Seed 1.6 Flash45.7%$0.001327.3s11%
3Gemini 2.5 Flash Lite43.8%$0.00099.5s11%
4Stealth: Healer Alpha37.9%$0.000023.7s13%
5Gemini 2.5 Flash Lite (Reasoning)44.3%$0.002830.8s10%
6Grok 4.20 (Beta)38.0%$0.01815.8s12%
7Qwen 3 32B46.1%$0.001554.6s10%
8Stealth: Hunter Alpha41.8%$0.000055.0s11%
9Gemma 3 4B42.9%$0.000220.0s9%
10Writer: Palmyra X544.4%$0.01122.0s9%
11Gemma 3 12B37.9%$0.000441.3s10%
12Gemini 3.1 Flash Lite (Preview)47.6%$0.00308.4s6%
13Rocinante 12B48.7%$0.001438.4s6%
14GPT-5.4 Nano (Reasoning, Low)39.1%$0.005520.6s9%
15Z.AI GLM 541.3%$0.00841.2m10%
16Qwen 3.5 Plus (2026-02-15)38.5%$0.006031.5s9%
17Xiaomi MIMO v2.532.9%$0.005431.8s11%
18Qwen 3.5 397B A17B54.2%$0.0143.0m10%
19Gemini 3.1 Flash Lite (Reasoning)46.6%$0.003011.9s5%
20Z.AI GLM 5 Turbo36.5%$0.008133.2s9%
21Grok 4.2034.7%$0.009345.7s10%
22Mistral Small Creative36.9%$0.00079.1s7%
23GPT-5.4 Mini33.1%$0.01516.8s10%
24Gemma 3 27B39.2%$0.000652.6s8%
25Qwen3 235B A22B Instruct 250739.3%$0.001159.2s8%
26Qwen 3.6 35B42.3%$0.00831.0m8%
27Gemini 3.1 Flash Lite45.0%$0.003012.1s4%
28Gemini 2.5 Flash (Reasoning)38.0%$0.01121.5s7%
29GPT-5.4 Mini (Reasoning, Low)32.8%$0.01516.8s9%
30Claude Sonnet 445.8%$0.03243.7s7%
31GPT-5.4 (Reasoning, Low)44.9%$0.0551.4m10%
32o4 Mini29.5%$0.01525.7s9%
33GPT-5 Nano38.3%$0.00421.4m8%
34Z.AI GLM 4.736.5%$0.0101.4m9%
35GPT-5.4 (Reasoning)47.6%$0.0892.6m14%
36GPT-4o, Aug. 6th (temp=1)45.8%$0.01824.4s4%
37GPT-5.4 Mini (Reasoning)32.4%$0.02228.1s9%
38Claude Opus 4.634.7%$0.0781.2m14%
39GPT-5.444.6%$0.0491.4m9%
40Qwen 3.6 Flash36.0%$0.01041.4s7%
41Aion 2.032.1%$0.00641.3m9%
42Grok 4.20 (Reasoning)37.5%$0.0181.5m9%
43GPT-5.4 Nano (Reasoning)30.1%$0.006124.5s8%
44Z.AI GLM 5.139.3%$0.0141.5m8%
45GPT-5 Mini33.0%$0.010057.4s8%
46Gemini 3 Pro (Preview)40.3%$0.05554.4s9%
47GPT-5.135.4%$0.0541.8m12%
48Cohere Command R+ (Aug. 2024)47.1%$0.02052.5s3%
49Hermes 3 70B47.9%$0.00101.2m3%
50Claude Opus 4.6 (Reasoning)38.8%$0.0881.4m11%
51Qwen 3.5 35B36.3%$0.0181.0m5%
52Mistral Small 4 (Reasoning)31.0%$0.002230.2s4%
53Mistral NeMO38.7%$0.000510.1s0%
54Qwen 3.5 Flash35.4%$0.002547.5s3%
55DeepSeek V4 Pro (Reasoning)38.5%$0.0153.1m8%
56GPT-4o Mini (temp=1)40.9%$0.001234.8s0%
57GPT-4.1 Nano36.0%$0.000713.3s0%
58Llama 3.1 Nemotron 70B37.6%$0.003831.7s0%
59Ministral 3 14B33.2%$0.000711.7s0%
60Gemini 3.5 Flash (Reasoning, Minimal)37.4%$0.01812.0s0%
61Llama 3.1 8B41.4%$0.00031.3m0%
62Claude Haiku 4.535.3%$0.01121.6s0%
63GPT-534.6%$0.0652.8m10%
64Arcee AI: Trinity Mini30.9%$0.00039.2s0%
65o4 Mini High26.5%$0.02547.2s5%
66Claude 3 Haiku31.4%$0.002514.9s0%
67Gemini 2.5 Flash31.2%$0.005210.6s0%
68Grok 4 Fast31.9%$0.001724.1s0%
69Mistral Large35.1%$0.01430.9s0%
70Mistral Small 430.1%$0.001418.2s0%
71Gemini 3 Flash (Preview, Reasoning)33.5%$0.01230.1s0%
72Z.AI GLM 4.7 Flash36.3%$0.00171.2m0%
73GPT-5.5 (Reasoning, Low)34.3%$0.1391.8m12%
74DeepSeek V4 Flash (Reasoning)30.2%$0.000731.1s0%
75Qwen 3.5 Plus (2026-04-20)30.6%$0.0171.8m4%
76GPT-5.224.8%$0.0561.5m8%
77GPT-5.532.9%$0.1391.7m12%
78Z.AI GLM 4.531.8%$0.005142.1s0%
79Claude Sonnet 4.637.6%$0.03139.3s0%
80Ministral 3 8B27.6%$0.000819.6s0%
81Gemma 4 26B31.6%$0.000955.1s0%
82Z.AI GLM 4.5 Air32.3%$0.002958.2s0%
83Z.AI GLM 4.632.0%$0.006551.5s0%
84Mistral Large 328.0%$0.003330.3s0%
85GPT-4o, May 13th (temp=1)33.2%$0.03314.4s0%
86Ministral 8B23.9%$0.000410.4s0%
87Mistral Medium 3.128.3%$0.004836.5s0%
88GPT-4.1 Mini25.4%$0.002719.0s0%
89Arcee AI: Trinity Large (Preview)27.9%$0.000043.6s0%
90GPT-4o Mini (temp=0)27.0%$0.001234.8s0%
91Hermes 3 405B29.9%$0.003253.2s0%
92Gemma 4 31B34.7%$0.00101.6m0%
93Gemini 3 Flash (Preview)25.2%$0.007819.6s0%
94Claude 3.7 Sonnet36.6%$0.04246.7s0%
95Qwen3.6 Max Preview37.1%$0.0503.5m7%
96Mistral Large 226.5%$0.01329.4s0%
97DeepSeek V3.234.6%$0.00141.9m0%
98DeepSeek-V2 Chat26.5%$0.002153.3s0%
99DeepSeek V4 Flash23.2%$0.000631.6s0%
100MiniMax M2.728.6%$0.00401.1m0%
101Llama 3.1 70B23.0%$0.001529.4s0%
102GPT-5.5 (Reasoning)30.1%$0.1421.8m11%
103MiniMax M2.529.0%$0.00341.3m0%
104LFM2 24B22.0%$0.000228.4s0%
105DeepSeek V3 (2024-12-26)25.4%$0.002154.6s0%
106Ministral 3B18.4%$0.00018.1s0%
107Xiaomi MIMO v2.5 Pro26.3%$0.008553.5s0%
108Qwen 2.5 72B22.0%$0.001036.7s0%
109DeepSeek V3.131.7%$0.00201.8m0%
110Nemotron 3 Nano25.5%$0.00101.1m0%
111Gemini 2.5 Pro29.5%$0.03636.2s0%
112DeepSeek V4 Pro27.0%$0.00481.3m0%
113GPT-4.125.9%$0.01844.7s0%
114Gemma 4 31B (Reasoning)33.1%$0.00142.2m0%
115GPT-4o, Aug. 6th (temp=0)23.4%$0.02322.7s0%
116Grok 4.20 (Beta, Reasoning)28.6%$0.03934.0s0%
117Qwen 3.5 122B29.0%$0.0251.1m0%
118ByteDance Seed 1.636.2%$0.0132.5m0%
119Grok 4.319.1%$0.006930.5s0%
120Gemini 3.5 Flash (Reasoning)35.3%$0.07137.6s0%
121GPT-4o, May 13th (temp=0)23.3%$0.03514.1s0%
122WizardLM 2 8x22b27.8%$0.00261.8m0%
123Claude Sonnet 4.526.0%$0.03538.1s0%
124DeepSeek V3 (2025-03-24)17.4%$0.001439.4s0%
125Ministral 3 3B13.0%$0.000511.1s0%
126Qwen 3.5 9B22.3%$0.00111.4m0%
127Stealth: Aurora Alpha12.5%$0.00009.8s0%
128Claude Opus 4.535.1%$0.07053.4s0%
129ByteDance Seed 2.0 Lite31.3%$0.0122.2m0%
130Nemotron 3 Super21.1%$0.00001.4m0%
131Qwen 3.5 27B27.1%$0.0201.6m0%
132Grok 4.1 Fast14.6%$0.001837.8s0%
133Gemma 4 26B (Reasoning)24.7%$0.00132.0m0%
134MoonshotAI: Kimi K2.537.5%$0.0193.2m0%
135Inception Mercury 29.2%$0.00327.0s0%
136Claude Opus 4.727.4%$0.06930.4s0%
137Qwen 3.6 27B28.4%$0.0252.3m0%
138Claude Sonnet 4.6 (Reasoning)27.8%$0.0601.2m0%
139Grok 428.4%$0.0481.7m0%
140Claude Opus 4.7 (Reasoning)26.0%$0.07632.0s0%
141Claude Opus 439.1%$0.2091.4m8%
142Claude 3.5 Sonnet16.6%$0.04835.5s0%
143Inception Mercury3.7%$0.01117.6s0%
144Qwen3.7 Max33.2%$0.0682.3m0%
145ByteDance Seed 2.0 Mini38.0%$0.00454.9m0%
146GPT-OSS 120B11.3%$0.00151.8m0%
147Grok 4.3 (Reasoning)19.1%$0.0212.3m0%
148Gemini 3.1 Pro (Preview)30.7%$0.1071.8m0%
149MoonshotAI: Kimi K2.628.5%$0.0586.5m0%
150Mistral Small 3.2 24B5.9%$0.00695.7m0%
Avg. Score: 32.43%

Individual Scenarios

Detailed Writing Rules

Model #1 #2 #3 #4 #5 Avg
Gemma 3 27B1001001001006492.8%
Cohere Command R+ (Aug. 2024)100100100100080.0%
Gemini 2.5 Flash Lite10010070646078.8%
Hermes 3 70B10010010071074.3%
Gemma 3 4B10010010071074.3%
Stealth: Hunter Alpha1001009865072.6%
DeepSeek V3.110010010062072.3%
Gemini 2.5 Pro1001008870071.6%
Gemma 3 12B1001007971070.2%
GPT-4o Mini (temp=0)1001007965068.9%
Mistral Large 2100887263064.5%
Gemini 3 Pro (Preview)1008552513163.6%
GPT-4o, Aug. 6th (temp=1)1001001000060.0%
Rocinante 12B1001001000060.0%
Llama 3.1 8B100100960059.2%
Xiaomi MIMO v2.5100675639052.4%
Z.AI GLM 4.7100625342051.3%
Claude Sonnet 4.610085710051.2%
LFM2 24B10077760050.5%
Writer: Palmyra X5100100490049.7%
GPT-5.4 (Reasoning)885945282348.7%
Gemini 3 Flash (Preview, Reasoning)100100430048.5%
Ministral 3 8B10070670047.4%
DeepSeek V3.210083490046.4%
Z.AI GLM 510060600044.0%
Gemini 2.5 Flash Lite (Reasoning)8665630042.9%
Xiaomi MIMO v2.5 Pro10065390040.9%
Arcee AI: Trinity Large (Preview)10056480040.9%
Qwen3 235B A22B Instruct 250710010000040.0%
Ministral 3 14B10010000040.0%
Mistral Small 4 (Reasoning)8368440039.1%
GPT-4.150494947038.8%
ByteDance Seed 2.0 Mini1009300038.5%
MoonshotAI: Kimi K2.51008800037.5%
Grok 41008300036.7%
ByteDance Seed 1.6 Flash10050280035.4%
DeepSeek V4 Pro1007700035.4%
Stealth: Healer Alpha56493534034.8%
Gemini 3.5 Flash (Reasoning, Minimal)1006800033.7%
GPT-4o, May 13th (temp=0)937500033.4%
Qwen 3.5 Flash10052140033.2%
Z.AI GLM 4.51006500033.0%
MiniMax M2.71006400032.8%
Claude Opus 47064300032.8%
Grok 4.207059330032.5%
Mistral Large9838270032.5%
Grok 4 Fast1006100032.2%
Ministral 8B6752420032.0%
Claude 3 Haiku897000031.9%
DeepSeek V4 Pro (Reasoning)1005800031.6%
GPT-5.4 (Reasoning, Low)9634270031.4%
WizardLM 2 8x22b7242400030.9%
Grok 4.20 (Reasoning)6944390030.5%
Z.AI GLM 5 Turbo1004900029.8%
DeepSeek V4 Flash915600029.4%
Qwen3.7 Max1004700029.3%
Gemini 2.5 Flash786800029.2%
Claude Sonnet 4.51004200028.5%
GPT-4o, May 13th (temp=1)717000028.4%
Mistral Small Creative716700027.6%
Aion 2.01003800027.6%
DeepSeek V4 Flash (Reasoning)746100026.9%
Mistral NeMO963700026.6%
GPT-5.4 Nano (Reasoning, Low)58272522026.3%
Nemotron 3 Nano1003100026.2%
GPT-5 Mini1003000026.0%
Mistral Small 4685700025.1%
GPT-5.55929260022.8%
Claude Sonnet 4.6 (Reasoning)575700022.7%
Claude Opus 4.6585500022.6%
Grok 4.20 (Beta)5129270021.5%
GPT-56324200021.5%
Claude Opus 4.6 (Reasoning)574900021.2%
GPT-5.4 Nano (Reasoning)4827270020.4%
ByteDance Seed 1.6100000020.0%
Claude Opus 4.7100000020.0%
Mistral Large 3100000020.0%
Claude Haiku 4.5100000020.0%
DeepSeek-V2 Chat100000020.0%
Z.AI GLM 4.7 Flash100000020.0%
ByteDance Seed 2.0 Lite100000020.0%
DeepSeek V3 (2024-12-26)100000020.0%
Hermes 3 405B100000020.0%
DeepSeek V3 (2025-03-24)100000020.0%
Llama 3.1 70B100000020.0%
Llama 3.1 Nemotron 70B100000020.0%
GPT-4.1 Nano100000020.0%
Arcee AI: Trinity Mini100000020.0%
Ministral 3B100000020.0%
Qwen 3.5 35B543800018.4%
GPT-5.4 Nano642500017.8%
GPT-4o, Aug. 6th (temp=0)88000017.5%
MoonshotAI: Kimi K2.6454100017.1%
Claude Sonnet 483000016.7%
Claude Opus 4.7 (Reasoning)78000015.6%
Gemma 4 31B (Reasoning)75000014.9%
Qwen 2.5 72B75000014.9%
Gemma 4 31B72000014.5%
Qwen 3.6 35B432900014.5%
GPT-4.1 Mini69000013.9%
GPT-5.4 Mini (Reasoning, Low)353300013.7%
Gemini 2.5 Flash (Reasoning)68000013.5%
GPT-5.5 (Reasoning)2422220013.5%
GPT-5.4 Mini343300013.4%
Gemini 3.1 Flash Lite67000013.3%
GPT-5.465000013.1%
Claude Opus 4.565000013.0%
Gemini 3.5 Flash (Reasoning)61000012.2%
Qwen 3 32B60000012.0%
Qwen 3.5 Plus (2026-02-15)59000011.8%
Qwen 3.5 9B58000011.6%
Gemma 4 26B52000010.4%
Gemini 3 Flash (Preview)51000010.1%
Qwen 3.5 397B A17B28220009.9%
Z.AI GLM 4.5 Air4900009.8%
Qwen 3.5 Plus (2026-04-20)4900009.7%
Grok 4.20 (Beta, Reasoning)4800009.6%
Mistral Medium 3.14800009.6%
o4 Mini4700009.4%
Gemini 3.1 Pro (Preview)4700009.3%
GPT-5.5 (Reasoning, Low)24220009.2%
GPT-5.124220009.1%
GPT-5 Nano24210009.1%
GPT-5.4 Mini (Reasoning)4100008.2%
Claude 3.7 Sonnet4000007.9%
o4 Mini High3100006.3%
Qwen3.6 Max Preview3100006.3%
GPT-5.23000006.0%
Z.AI GLM 5.1000000.0%
Grok 4.3 (Reasoning)000000.0%
Qwen 3.5 122B000000.0%
Gemma 4 26B (Reasoning)000000.0%
Qwen 3.5 27B000000.0%
Qwen 3.6 Flash000000.0%
Qwen 3.6 27B000000.0%
Grok 4.1 Fast000000.0%
Z.AI GLM 4.6000000.0%
MiniMax M2.5000000.0%
GPT-OSS 120B000000.0%
Gemini 3.1 Flash Lite (Reasoning)000000.0%
Gemini 3.1 Flash Lite (Preview)000000.0%
Nemotron 3 Super000000.0%
Claude 3.5 Sonnet000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Inception Mercury000000.0%
GPT-4o Mini (temp=1)000000.0%
Grok 4.3000000.0%
Mistral Small 3.2 24B000000.0%
Ministral 3 3B000000.0%
Model #1 #2 #3 #4 #5 Avg
Gemini 2.5 Flash Lite10010095747188.0%
Claude Opus 4.7 (Reasoning)100100100100080.0%
ByteDance Seed 2.0 Lite10010010077075.4%
MiniMax M2.71009978583373.6%
Claude Sonnet 4.610010010061072.2%
Aion 2.01008069563668.2%
GPT-5 Mini10010075362968.1%
Qwen3 235B A22B Instruct 250710010010040067.9%
ByteDance Seed 1.6 Flash1009745423764.2%
Mistral Medium 3.11006853514763.8%
Z.AI GLM 5 Turbo1006053535263.5%
Rocinante 12B1001006552063.3%
GPT-4o, Aug. 6th (temp=1)1001006947063.3%
GPT-5737069643862.7%
Gemini 2.5 Flash Lite (Reasoning)100896360062.4%
GPT-4o, May 13th (temp=1)1001001000060.0%
Gemma 3 27B1001005246059.5%
MiniMax M2.5100945147058.4%
Mistral Large 31001004943058.3%
DeepSeek V4 Flash (Reasoning)1001005536058.2%
Writer: Palmyra X51001004441057.1%
DeepSeek V4 Pro (Reasoning)1005653472856.9%
Claude Sonnet 4.51001004341056.8%
GPT-5.5 (Reasoning, Low)1008250312056.7%
Z.AI GLM 4.5 Air100985036056.7%
GPT-5.4 Nano846564432055.2%
Claude Opus 4.7100595654053.9%
Cohere Command R+ (Aug. 2024)100100680053.5%
Grok 4.20 (Beta, Reasoning)836942413153.3%
Claude Opus 4.6 (Reasoning)10086790053.0%
Claude Sonnet 4.6 (Reasoning)100100650053.0%
WizardLM 2 8x22b1006836352653.0%
Gemma 3 4B100615746052.9%
GPT-5.11005145412752.7%
Grok 4.3 (Reasoning)93626148052.7%
ByteDance Seed 2.0 Mini9386830052.4%
Stealth: Hunter Alpha10085760052.2%
GPT-4.1 Nano100100570051.5%
Claude Haiku 4.593784638051.2%
Xiaomi MIMO v2.51004940343251.1%
Mistral Large10083710051.0%
Xiaomi MIMO v2.5 Pro846137363650.8%
GPT-5.4 (Reasoning)837834302850.5%
Grok 4100100520050.4%
GPT-5.4 Nano (Reasoning, Low)969023212150.3%
Gemini 3 Flash (Preview)100100450049.0%
Gemini 3.5 Flash (Reasoning)99535043048.9%
Gemini 3 Pro (Preview)10079650048.9%
Claude Opus 4.6100823428048.9%
Claude Sonnet 4100100430048.7%
Claude 3.7 Sonnet83724543048.5%
GPT-5.4 Mini (Reasoning)100703731047.6%
Grok 4.20 (Beta)735548392247.3%
Nemotron 3 Super10070630046.6%
Z.AI GLM 4.510071610046.5%
GPT-4.1 Mini10070590045.8%
Ministral 8B10064630045.5%
Z.AI GLM 510074500044.7%
Qwen 3.5 Plus (2026-02-15)100563227043.1%
GPT-5.5 (Reasoning)1003535212142.6%
GPT-4o Mini (temp=1)10058520042.0%
Mistral NeMO8662610041.8%
GPT-5.595643018041.4%
Z.AI GLM 5.110056420039.6%
Mistral Large 27467560039.3%
Hermes 3 70B1009400038.9%
GPT-4.19052510038.5%
GPT-5.28358500038.1%
Grok 4.20 (Reasoning)10047380036.8%
GPT-5.4 Mini (Reasoning, Low)67602926036.2%
Stealth: Healer Alpha10041390035.9%
Gemma 3 12B8951380035.7%
GPT-5.466572826035.5%
Gemini 3.5 Flash (Reasoning, Minimal)1007500034.9%
MoonshotAI: Kimi K2.61007300034.6%
Mistral Small 4 (Reasoning)9445270033.3%
DeepSeek V3.210034320033.3%
GPT-OSS 120B1006500033.0%
MoonshotAI: Kimi K2.51006200032.3%
Gemma 4 31B1006200032.3%
GPT-5 Nano52383834032.3%
Qwen 3 32B57532626032.2%
Nemotron 3 Nano1005600031.1%
Ministral 3 8B1005500031.0%
Gemini 3.1 Flash Lite (Reasoning)1005200030.4%
Gemini 2.5 Pro1004900029.8%
Grok 4.1 Fast1004300028.7%
DeepSeek V4 Pro1004000027.9%
Mistral Small Creative5441360026.4%
Gemini 2.5 Flash834900026.3%
DeepSeek-V2 Chat884400026.3%
Gemma 4 26B (Reasoning)665700024.5%
Gemini 2.5 Flash (Reasoning)714900024.0%
Gemma 4 31B (Reasoning)605400022.7%
Claude Opus 4.5595400022.5%
GPT-5.4 (Reasoning, Low)29292825022.1%
Mistral Small 4555300021.6%
o4 Mini733400021.4%
Claude Opus 44439190020.4%
ByteDance Seed 1.6100000020.0%
Z.AI GLM 4.7 Flash100000020.0%
DeepSeek V3 (2024-12-26)100000020.0%
DeepSeek V4 Flash100000020.0%
Llama 3.1 70B100000020.0%
Llama 3.1 Nemotron 70B100000020.0%
Arcee AI: Trinity Mini100000020.0%
Ministral 3 3B100000020.0%
GPT-5.4 Nano (Reasoning)682900019.4%
Qwen3.7 Max474300018.0%
Claude 3 Haiku89000017.9%
Gemini 3 Flash (Preview, Reasoning)444300017.4%
o4 Mini High2929270017.0%
GPT-5.4 Mini522800016.1%
Gemma 4 26B393700015.3%
Ministral 3 14B76000015.2%
Grok 4.20512200014.5%
Grok 4.3432800014.1%
LFM2 24B69000013.9%
Claude 3.5 Sonnet67000013.3%
DeepSeek V3.166000013.2%
GPT-4o Mini (temp=0)64000012.8%
Inception Mercury 263000012.7%
Qwen 3.6 35B372400012.2%
GPT-4o, May 13th (temp=0)60000012.0%
Z.AI GLM 4.657000011.5%
Arcee AI: Trinity Large (Preview)55000011.0%
Hermes 3 405B54000010.8%
Qwen 3.5 Flash351900010.7%
Gemini 3.1 Flash Lite (Preview)51000010.2%
Qwen 2.5 72B51000010.2%
Qwen 3.6 27B4800009.6%
Llama 3.1 8B4300008.7%
Qwen 3.5 35B24180008.4%
Gemini 3.1 Pro (Preview)4100008.3%
Qwen 3.5 Plus (2026-04-20)3800007.6%
Z.AI GLM 4.73800007.6%
Qwen 3.6 Flash3100006.2%
Qwen 3.5 27B2100004.3%
GPT-4o, Aug. 6th (temp=0)700001.4%
Qwen3.6 Max Preview000000.0%
Qwen 3.5 397B A17B000000.0%
Qwen 3.5 122B000000.0%
Grok 4 Fast000000.0%
Qwen 3.5 9B000000.0%
Gemini 3.1 Flash Lite000000.0%
Stealth: Aurora Alpha000000.0%
DeepSeek V3 (2025-03-24)000000.0%
Inception Mercury000000.0%
Mistral Small 3.2 24B000000.0%
Ministral 3B000000.0%
Model #1 #2 #3 #4 #5 Avg
Llama 3.1 8B1001008282072.8%
GPT-4.1 Nano1001008377072.1%
Claude Haiku 4.510010010057071.4%
Claude Opus 410010010056071.2%
Qwen 3.5 397B A17B10010086362068.6%
Mistral Medium 3.11001008043064.6%
ByteDance Seed 1.6 Flash976959443661.1%
Qwen 3 32B93856661060.8%
Mistral Large1001001000060.0%
Writer: Palmyra X510096740054.0%
MoonshotAI: Kimi K2.510085820053.3%
GPT-4o, Aug. 6th (temp=1)9691790053.3%
Grok 4.2010091740053.0%
Rocinante 12B10094670052.2%
Mistral Small 4 (Reasoning)95744542051.1%
Gemma 3 4B100100540050.9%
GPT-5.4 Nano (Reasoning, Low)100953127050.6%
o4 Mini High100863531050.5%
Grok 4.20 (Beta)100625037049.8%
Qwen 3.6 35B100675130049.4%
GPT-5.4 Nano (Reasoning)74595655048.8%
Gemini 2.5 Flash Lite (Reasoning)76615347047.2%
Qwen3.6 Max Preview100633428045.0%
GPT-5.4 Nano10075490044.8%
Claude Sonnet 410064580044.4%
GPT-5.181534336042.7%
Qwen 2.5 72B7471650042.0%
Stealth: Hunter Alpha10057520041.8%
Mistral Small 49570380040.5%
LFM2 24B7067640040.2%
GPT-5.4 Mini8874390040.2%
MoonshotAI: Kimi K2.610010000040.0%
ByteDance Seed 2.0 Mini10010000040.0%
Llama 3.1 70B10010000040.0%
GPT-5.4100353330039.7%
DeepSeek V4 Pro (Reasoning)7265600039.5%
Nemotron 3 Nano8168470039.3%
Gemini 3.1 Flash Lite1009600039.2%
GPT-5.4 (Reasoning, Low)100363030039.2%
Qwen 3.5 35B80643811038.8%
Z.AI GLM 4.68257530038.4%
DeepSeek V3.21008800037.7%
Z.AI GLM 58355500037.6%
Claude Opus 4.51008800037.5%
Mistral Large 310046410037.4%
Gemini 3 Pro (Preview)1008600037.2%
Gemma 4 31B968900037.1%
Arcee AI: Trinity Large (Preview)1008300036.7%
Gemini 2.5 Pro7567420036.7%
GPT-5.4 Mini (Reasoning, Low)10042390036.2%
GPT-5 Nano9743380035.7%
Grok 41007800035.6%
Qwen 3.5 Plus (2026-04-20)8254400035.2%
Gemini 3.5 Flash (Reasoning, Minimal)898100034.0%
GPT-4o Mini (temp=1)1006900033.9%
Qwen 3.5 27B1006900033.8%
Gemma 4 26B1006700033.3%
Grok 4.20 (Reasoning)848200033.2%
Qwen 3.6 Flash7858290033.1%
GPT-5.5 (Reasoning, Low)71402726032.9%
Ministral 3 14B1006400032.8%
Gemini 2.5 Flash (Reasoning)1006300032.7%
Ministral 3B887500032.5%
Grok 4.31006100032.2%
Qwen 3.5 Plus (2026-02-15)5655500032.1%
GPT-4o, May 13th (temp=1)887100031.8%
Qwen 3.5 9B1005090031.7%
DeepSeek V3 (2024-12-26)867100031.5%
GPT-55856430031.3%
Ministral 8B1005600031.2%
Grok 4 Fast1005500031.0%
GPT-5.4 Mini (Reasoning)7841350031.0%
Claude Sonnet 4.6827100030.7%
Xiaomi MIMO v2.5955800030.7%
Qwen3 235B A22B Instruct 25071005200030.3%
Nemotron 3 Super935400029.4%
Gemini 3.5 Flash (Reasoning)964900029.0%
GPT-5.56453280029.0%
GPT-5.5 (Reasoning)56343023028.4%
Mistral Small Creative1004200028.4%
Stealth: Healer Alpha1004100028.2%
Qwen3.7 Max726800028.2%
Xiaomi MIMO v2.5 Pro1003700027.4%
Claude Opus 4.6634600021.8%
DeepSeek V4 Flash575100021.6%
Z.AI GLM 5.1565200021.5%
Claude 3.7 Sonnet525100020.6%
Z.AI GLM 5 Turbo100000020.0%
Claude Opus 4.7 (Reasoning)100000020.0%
Grok 4.20 (Beta, Reasoning)100000020.0%
ByteDance Seed 1.6100000020.0%
MiniMax M2.7100000020.0%
GPT-4.1100000020.0%
Hermes 3 405B100000020.0%
Mistral Large 2100000020.0%
Gemma 3 12B100000020.0%
Llama 3.1 Nemotron 70B100000020.0%
Claude 3 Haiku100000020.0%
GPT-5.4 (Reasoning)633000018.8%
Gemma 4 31B (Reasoning)93000018.5%
WizardLM 2 8x22b484200017.9%
Z.AI GLM 4.5 Air86000017.2%
Claude Opus 4.785000016.9%
Gemini 3.1 Flash Lite (Reasoning)85000016.9%
GPT-4o, Aug. 6th (temp=0)82000016.4%
Hermes 3 70B79000015.9%
o4 Mini423400015.3%
Qwen 3.6 27B393400014.6%
Gemini 3 Flash (Preview, Reasoning)72000014.5%
GPT-4o, May 13th (temp=0)71000014.3%
DeepSeek V4 Pro68000013.7%
Arcee AI: Trinity Mini66000013.2%
GPT-5 Mini343100013.0%
Grok 4.1 Fast64000012.8%
Gemini 2.5 Flash Lite62000012.3%
Qwen 3.5 Flash60000011.9%
DeepSeek V4 Flash (Reasoning)58000011.6%
GPT-5.2292900011.5%
GPT-4.1 Mini54000010.9%
Claude Sonnet 4.551000010.2%
Gemini 3.1 Pro (Preview)4900009.8%
Stealth: Aurora Alpha4900009.8%
Mistral Small 3.2 24B4520009.4%
DeepSeek V3.14600009.2%
Gemma 3 27B4200008.5%
Aion 2.03600007.2%
Z.AI GLM 4.73400006.8%
Ministral 3 8B3300006.6%
GPT-OSS 120B3200006.3%
Qwen 3.5 122B2900005.7%
Claude Opus 4.6 (Reasoning)000000.0%
Claude Sonnet 4.6 (Reasoning)000000.0%
Grok 4.3 (Reasoning)000000.0%
Gemma 4 26B (Reasoning)000000.0%
MiniMax M2.5000000.0%
Z.AI GLM 4.5000000.0%
Gemini 3.1 Flash Lite (Preview)000000.0%
Gemini 3 Flash (Preview)000000.0%
DeepSeek-V2 Chat000000.0%
Z.AI GLM 4.7 Flash000000.0%
ByteDance Seed 2.0 Lite000000.0%
Claude 3.5 Sonnet000000.0%
Inception Mercury 2000000.0%
DeepSeek V3 (2025-03-24)000000.0%
Gemini 2.5 Flash000000.0%
Inception Mercury000000.0%
GPT-4o Mini (temp=0)000000.0%
Cohere Command R+ (Aug. 2024)000000.0%
Ministral 3 3B000000.0%
Mistral NeMO000000.0%
Model #1 #2 #3 #4 #5 Avg
GPT-4o Mini (temp=1)10010096937592.7%
Mistral NeMO10010085824782.7%
Claude Haiku 4.510010068524573.0%
Z.AI GLM 51001001000060.0%
ByteDance Seed 1.61001001000060.0%
Claude Opus 4.6 (Reasoning)91895450056.8%
Writer: Palmyra X510088840054.5%
Z.AI GLM 5.1100685646054.0%
Claude Sonnet 4100100690053.9%
Stealth: Hunter Alpha100575753053.5%
Gemma 3 27B100100650053.0%
Gemma 3 12B75686356052.2%
Z.AI GLM 4.510082780052.0%
Qwen3 235B A22B Instruct 250777686548051.7%
GPT-4o Mini (temp=0)10082660049.6%
Mistral Small 4 (Reasoning)10081620048.5%
Gemini 3 Pro (Preview)100100410048.3%
Gemini 3.5 Flash (Reasoning, Minimal)10072650047.5%
GPT-51004341282547.3%
Claude Opus 4.7 (Reasoning)8576700046.2%
Gemini 3.1 Flash Lite (Preview)10060600044.0%
Grok 410063560043.8%
Stealth: Healer Alpha88454433042.1%
Claude 3.7 Sonnet10056500041.1%
Claude Opus 4.510010000040.0%
Z.AI GLM 4.610010000040.0%
Z.AI GLM 4.5 Air10010000040.0%
GPT-4o, Aug. 6th (temp=1)10010000040.0%
DeepSeek V4 Flash10010000040.0%
Hermes 3 70B10010000040.0%
Llama 3.1 8B10010000040.0%
GPT-4o, May 13th (temp=0)1009800039.6%
Gemma 3 4B7169560039.4%
o4 Mini High8575360039.2%
Ministral 3 14B1009400038.9%
Mistral Large7470500038.8%
Cohere Command R+ (Aug. 2024)1009300038.5%
GPT-5.4 Nano474640272637.3%
GPT-4.1 Nano1008100036.1%
Gemini 2.5 Flash Lite6156560034.5%
Gemma 4 31B1007200034.5%
Xiaomi MIMO v2.553434135034.3%
Mistral Large 27460360034.1%
DeepSeek V3.11006900033.9%
Gemini 2.5 Flash (Reasoning)6756470033.9%
GPT-5.4 (Reasoning)10037320033.7%
Z.AI GLM 5 Turbo1006700033.3%
GPT-4o, May 13th (temp=1)937100032.8%
Qwen 3 32B5654520032.4%
Qwen 3.5 397B A17B97272315032.3%
Qwen 3.5 Plus (2026-04-20)10047140032.1%
Claude Sonnet 4.6827800032.0%
DeepSeek V4 Flash (Reasoning)1005700031.5%
GPT-5.4 (Reasoning, Low)6460340031.4%
Rocinante 12B985800031.2%
Claude Sonnet 4.6 (Reasoning)1005500031.0%
Gemini 3 Flash (Preview)1005500031.0%
Gemma 4 26B1005200030.4%
MiniMax M2.7916000030.1%
Qwen 3.5 Plus (2026-02-15)1004500028.9%
Mistral Medium 3.15949350028.5%
Qwen3.7 Max746800028.4%
GPT-5.5 (Reasoning, Low)6253250028.2%
o4 Mini1003400026.8%
Gemini 3.5 Flash (Reasoning)765800026.8%
GPT-5.4 Nano (Reasoning)6835290026.5%
Grok 4.20 (Beta)5345330026.4%
GPT-5.137323130026.0%
Grok 4 Fast685900025.3%
Inception Mercury 2853500024.0%
GPT-5.56627260023.8%
Mistral Small Creative724200023.0%
Z.AI GLM 4.7 Flash605000022.0%
DeepSeek-V2 Chat614900022.0%
GPT-5 Nano6622220022.0%
GPT-5.4743300021.4%
Z.AI GLM 4.7100000020.0%
ByteDance Seed 2.0 Mini100000020.0%
Gemini 3.1 Flash Lite (Reasoning)100000020.0%
ByteDance Seed 2.0 Lite100000020.0%
Claude 3.5 Sonnet100000020.0%
Hermes 3 405B100000020.0%
DeepSeek V3 (2025-03-24)100000020.0%
Llama 3.1 70B100000020.0%
Arcee AI: Trinity Mini100000020.0%
Ministral 3B100000020.0%
LFM2 24B100000020.0%
GPT-5.4 Mini3633300019.9%
Grok 4.1 Fast593900019.6%
Mistral Small 4514600019.5%
Mistral Small 3.2 24B87640019.4%
Claude Sonnet 4.596000019.2%
Gemini 2.5 Flash Lite (Reasoning)94000018.9%
Arcee AI: Trinity Large (Preview)91000018.2%
GPT-5.4 Nano (Reasoning, Low)4027220017.9%
Qwen3.6 Max Preview543400017.7%
Mistral Large 388000017.7%
GPT-5 Mini3228270017.5%
GPT-4.1 Mini86000017.2%
Claude Opus 4483700016.9%
Gemini 3.1 Flash Lite83000016.7%
GPT-5.4 Mini (Reasoning)433900016.4%
DeepSeek V3 (2024-12-26)81000016.1%
MiniMax M2.571000014.3%
Qwen 3.5 35B2626170013.8%
MoonshotAI: Kimi K2.568000013.7%
GPT-4o, Aug. 6th (temp=0)68000013.7%
Claude Opus 4.665000013.0%
DeepSeek V4 Pro64000012.8%
Gemini 2.5 Pro63000012.5%
Xiaomi MIMO v2.5 Pro63000012.5%
ByteDance Seed 1.6 Flash57000011.4%
Grok 4.356000011.2%
GPT-5.5 (Reasoning)282800011.0%
Ministral 3 8B55000011.0%
DeepSeek V4 Pro (Reasoning)54000010.9%
GPT-5.2272600010.7%
Aion 2.053000010.6%
Grok 4.20 (Reasoning)53000010.5%
DeepSeek V3.250000010.0%
GPT-4.14900009.7%
Qwen 3.6 27B4800009.5%
Qwen 3.5 Flash31150009.3%
Nemotron 3 Super4400008.8%
GPT-OSS 120B3900007.8%
Nemotron 3 Nano3900007.8%
GPT-5.4 Mini (Reasoning, Low)3800007.6%
Qwen 3.5 27B3610007.4%
Grok 4.203200006.3%
Qwen 3.6 Flash2900005.7%
Qwen 3.5 122B1900003.9%
Qwen 3.6 35B600001.2%
Gemini 3.1 Pro (Preview)000000.0%
Grok 4.3 (Reasoning)000000.0%
MoonshotAI: Kimi K2.6000000.0%
Gemma 4 31B (Reasoning)000000.0%
Gemma 4 26B (Reasoning)000000.0%
Grok 4.20 (Beta, Reasoning)000000.0%
Gemini 3 Flash (Preview, Reasoning)000000.0%
Claude Opus 4.7000000.0%
Qwen 3.5 9B000000.0%
Stealth: Aurora Alpha000000.0%
Gemini 2.5 Flash000000.0%
Inception Mercury000000.0%
Qwen 2.5 72B000000.0%
Llama 3.1 Nemotron 70B000000.0%
Claude 3 Haiku000000.0%
WizardLM 2 8x22b000000.0%
Ministral 3 3B000000.0%
Ministral 8B000000.0%
Model #1 #2 #3 #4 #5 Avg
Cohere Command R+ (Aug. 2024)10010098856990.4%
DeepSeek V3.21001009795078.5%
Gemma 3 4B10010010085076.9%
Writer: Palmyra X510010010068073.5%
Qwen 3.5 397B A17B1009669632670.9%
Gemini 2.5 Flash Lite (Reasoning)1009166494369.8%
GPT-4o Mini (temp=1)100888181069.8%
Arcee AI: Trinity Mini100947776069.4%
Z.AI GLM 4.6100797571065.1%
Qwen 3.5 Flash1001006148061.9%
Claude Sonnet 41001001000060.0%
Hermes 3 70B1001001000060.0%
ByteDance Seed 1.6100100910058.2%
Gemini 3.5 Flash (Reasoning, Minimal)10096880056.8%
Stealth: Healer Alpha100815742055.8%
Gemini 2.5 Flash Lite100100680053.7%
Xiaomi MIMO v2.510096680052.7%
Gemma 4 31B (Reasoning)9181770049.7%
Gemma 4 31B8582770048.7%
Claude 3 Haiku9378710048.4%
Qwen 3.5 27B100100410048.2%
GPT-4o Mini (temp=0)8581700047.2%
GPT-5.472705834046.9%
Z.AI GLM 5.110068630046.0%
Grok 410075520045.3%
Gemini 3 Flash (Preview, Reasoning)10063600044.4%
GPT-4o, May 13th (temp=1)8875570044.0%
Gemini 2.5 Flash10070440042.9%
Z.AI GLM 4.7 Flash10059540042.6%
Gemma 3 12B10056550042.2%
Mistral Large 26968680041.1%
Claude Haiku 4.57566640040.9%
Z.AI GLM 510060420040.4%
Claude Sonnet 4.510010000040.0%
ByteDance Seed 2.0 Mini10010000040.0%
Gemini 3.1 Flash Lite (Preview)10010000040.0%
Ministral 8B10010000040.0%
Qwen 3.6 Flash10062350039.3%
GPT-5.4 Nano (Reasoning, Low)86552727039.1%
GPT-5.4 (Reasoning, Low)10060350038.9%
Llama 3.1 Nemotron 70B1009300038.5%
Mistral Small 4 (Reasoning)10049410038.0%
Claude Opus 4.6 (Reasoning)10051360037.3%
Qwen 3.5 Plus (2026-02-15)10048390037.3%
Claude Opus 4.71008600037.2%
Z.AI GLM 4.5 Air1008500036.9%
GPT-5 Mini10043420036.9%
GPT-5.4 Mini (Reasoning, Low)10045370036.3%
Mistral Large908900035.9%
Claude Opus 4.56755530034.8%
GPT-4.1 Mini987500034.5%
Gemma 3 27B1007000034.1%
Grok 4.20 (Reasoning)888100033.8%
LFM2 24B1006800033.7%
Gemini 2.5 Flash (Reasoning)1006600033.2%
Claude Opus 47056370032.8%
Qwen 3.6 35B7558300032.8%
Z.AI GLM 4.5887400032.2%
Qwen 3.5 Plus (2026-04-20)1006000032.0%
Mistral Small Creative1005900031.8%
Gemini 3.1 Pro (Preview)896400030.7%
Claude Sonnet 4.6836800030.4%
Claude Opus 4.65451460030.2%
ByteDance Seed 1.6 Flash5746460029.8%
DeepSeek V3 (2024-12-26)816800029.6%
MiniMax M2.5746800028.5%
Gemini 3 Pro (Preview)1004200028.4%
DeepSeek V4 Pro (Reasoning)1004200028.3%
GPT-5.4 Nano46412624027.4%
Rocinante 12B696400026.7%
GPT-OSS 120B755300025.6%
Grok 4.204939380025.2%
Qwen3 235B A22B Instruct 2507685700024.9%
o4 Mini794300024.4%
GPT-5.4 Nano (Reasoning)5233270022.3%
Mistral Medium 3.1515000020.0%
Qwen 3.5 122B100000020.0%
MoonshotAI: Kimi K2.5100000020.0%
Gemini 3.1 Flash Lite100000020.0%
Mistral Large 3100000020.0%
GPT-4o, May 13th (temp=0)100000020.0%
ByteDance Seed 2.0 Lite100000020.0%
Claude 3.5 Sonnet100000020.0%
GPT-4o, Aug. 6th (temp=1)100000020.0%
GPT-5 Nano100000020.0%
Mistral Small 3.2 24B100000020.0%
Llama 3.1 70B100000020.0%
Nemotron 3 Nano100000020.0%
Mistral Small 4100000020.0%
Qwen 2.5 72B100000020.0%
Arcee AI: Trinity Large (Preview)100000020.0%
Ministral 3 14B100000020.0%
Mistral NeMO100000020.0%
DeepSeek V4 Pro514800019.8%
Grok 4.3 (Reasoning)98000019.6%
GPT-5.4 (Reasoning)633500019.6%
Qwen 3.6 27B722300019.0%
MiniMax M2.7504500018.9%
Z.AI GLM 4.791000018.2%
DeepSeek V4 Flash (Reasoning)88000017.5%
Gemini 3.1 Flash Lite (Reasoning)88000017.5%
Claude Opus 4.7 (Reasoning)83000016.7%
Stealth: Aurora Alpha77000015.4%
GPT-4.1 Nano77000015.4%
Llama 3.1 8B76000015.2%
DeepSeek V3.175000014.9%
Gemma 4 26B (Reasoning)74000014.7%
Qwen 3 32B74000014.7%
Ministral 3B71000014.2%
GPT-5.5 (Reasoning)452500013.9%
GPT-5.4 Mini432500013.5%
GPT-5.2362900013.0%
DeepSeek-V2 Chat63000012.7%
GPT-5302900011.8%
Grok 4.20 (Beta)401800011.6%
Gemini 2.5 Pro54000010.9%
Claude 3.7 Sonnet54000010.8%
GPT-5.5 (Reasoning, Low)272600010.6%
Qwen 3.5 9B52000010.4%
Grok 4.20 (Beta, Reasoning)4900009.7%
Gemini 3 Flash (Preview)4600009.3%
Xiaomi MIMO v2.5 Pro4500009.0%
Grok 4 Fast4500009.0%
GPT-5.54300008.7%
Inception Mercury 24200008.4%
Qwen3.6 Max Preview23180008.3%
Stealth: Hunter Alpha4100008.2%
Aion 2.03900007.8%
o4 Mini High3300006.5%
WizardLM 2 8x22b2700005.3%
Qwen 3.5 35B400000.7%
Qwen3.7 Max000000.0%
Z.AI GLM 5 Turbo000000.0%
Gemini 3.5 Flash (Reasoning)000000.0%
Claude Sonnet 4.6 (Reasoning)000000.0%
GPT-5.1000000.0%
MoonshotAI: Kimi K2.6000000.0%
GPT-5.4 Mini (Reasoning)000000.0%
Grok 4.1 Fast000000.0%
GPT-4.1000000.0%
Gemma 4 26B000000.0%
Nemotron 3 Super000000.0%
Hermes 3 405B000000.0%
GPT-4o, Aug. 6th (temp=0)000000.0%
DeepSeek V4 Flash000000.0%
DeepSeek V3 (2025-03-24)000000.0%
Inception Mercury000000.0%
Grok 4.3000000.0%
Ministral 3 8B000000.0%
Ministral 3 3B000000.0%
Model #1 #2 #3 #4 #5 Avg
Cohere Command R+ (Aug. 2024)10010010091078.2%
Gemini 2.5 Flash (Reasoning)10010010088077.5%
Rocinante 12B10010010079075.7%
Gemma 3 4B10010010059071.8%
Qwen3 235B A22B Instruct 2507989465583870.7%
Claude 3 Haiku100987772069.5%
Gemini 3.5 Flash (Reasoning, Minimal)100888660066.8%
GPT-4o, May 13th (temp=1)1001006152062.6%
GPT-4o Mini (temp=0)100995750061.2%
Gemini 3.1 Flash Lite (Preview)1001001000060.0%
Llama 3.1 Nemotron 70B1001001000060.0%
Mistral NeMO100100960059.2%
Gemini 3 Flash (Preview, Reasoning)100100900058.0%
Qwen 3.5 397B A17B1009555202057.9%
Gemini 2.5 Flash Lite (Reasoning)100100860057.2%
Gemma 3 27B99775343054.3%
GPT-4o, May 13th (temp=0)97595855053.8%
Gemma 3 12B10097690053.3%
Claude Opus 4.7100100570051.4%
Gemma 4 26B (Reasoning)10097600051.3%
Qwen 3.5 Plus (2026-02-15)100634846051.3%
WizardLM 2 8x22b854847332948.4%
Z.AI GLM 4.774666438048.2%
Xiaomi MIMO v2.5 Pro100653731046.5%
Writer: Palmyra X510090420046.4%
Arcee AI: Trinity Large (Preview)10069600045.8%
Gemini 2.5 Flash Lite10060600044.1%
Stealth: Healer Alpha9176390041.1%
Gemma 4 31B10056490040.9%
Gemini 3.1 Flash Lite10010000040.0%
DeepSeek-V2 Chat10010000040.0%
GPT-4o Mini (temp=1)10010000040.0%
Llama 3.1 8B10010000040.0%
Qwen 3 32B1009800039.6%
Mistral Small 48867430039.6%
Z.AI GLM 4.5 Air10050480039.5%
Z.AI GLM 5 Turbo58524441039.1%
GPT-5706121201838.3%
MiniMax M2.57560530037.5%
Grok 4.20 (Beta)72652621036.9%
GPT-5.4 Nano (Reasoning)8775210036.7%
GPT-5.4 Nano66474422035.8%
GPT-4o, Aug. 6th (temp=1)1007500034.9%
Claude 3.5 Sonnet1007400034.7%
Qwen 3.6 35B54504523034.5%
Nemotron 3 Nano1007200034.4%
GPT-4.1 Nano1006700033.3%
Claude Sonnet 4.61006600033.2%
Claude 3.7 Sonnet8540400033.0%
Claude Sonnet 4.6 (Reasoning)1006500033.0%
Mistral Large 21006500033.0%
DeepSeek V3.1947000032.9%
GPT-5.4 Mini (Reasoning)8344370032.8%
Gemini 3.5 Flash (Reasoning)1006300032.5%
Grok 4.3906800031.7%
Gemma 4 31B (Reasoning)1005700031.5%
Z.AI GLM 4.7 Flash857200031.3%
Gemini 3.1 Pro (Preview)45413832031.3%
Claude Opus 4.67448340031.2%
GPT-5.4443028272631.2%
DeepSeek V4 Pro (Reasoning)5856390030.6%
Claude Sonnet 45655400030.1%
GPT-5 Nano61402920030.0%
o4 Mini High8534300029.9%
Qwen 2.5 72B6346410029.8%
Grok 4.20 (Reasoning)8434300029.6%
Gemini 2.5 Pro6447370029.6%
DeepSeek V4 Pro6343410029.2%
GPT-5.4 (Reasoning, Low)8630270028.6%
Mistral Medium 3.11004200028.4%
Mistral Small Creative1003900027.8%
Ministral 8B716700027.6%
Z.AI GLM 4.6874700026.8%
GPT-4.14544420026.3%
Qwen 3.6 Flash775300026.0%
Gemini 2.5 Flash795000025.9%
GPT-OSS 120B5742280025.6%
Gemma 4 26B814600025.3%
GPT-5.5 (Reasoning)422421211925.3%
GPT-5 Mini43292822024.4%
Stealth: Aurora Alpha5240290024.2%
Claude Opus 45531310023.5%
Mistral Small 4 (Reasoning)4834330022.9%
GPT-5.4 Nano (Reasoning, Low)4745190022.4%
DeepSeek V3 (2024-12-26)575400022.4%
Z.AI GLM 5.1684100022.0%
GPT-5.4 (Reasoning)555400021.9%
GPT-5.144232120021.8%
Claude Opus 4.6 (Reasoning)4037310021.8%
Grok 4.20 (Beta, Reasoning)3838300021.2%
MoonshotAI: Kimi K2.5100000020.0%
ByteDance Seed 1.6100000020.0%
DeepSeek V4 Flash (Reasoning)100000020.0%
ByteDance Seed 2.0 Mini100000020.0%
DeepSeek V3.2100000020.0%
DeepSeek V3 (2025-03-24)100000020.0%
Hermes 3 70B100000020.0%
Arcee AI: Trinity Mini100000020.0%
Llama 3.1 70B98000019.6%
Grok 497000019.4%
GPT-5.54435180019.3%
Xiaomi MIMO v2.595000019.0%
Ministral 3 8B494400018.6%
Claude Opus 4.592000018.3%
Qwen 3.6 27B593200018.2%
Stealth: Hunter Alpha533600017.9%
Gemini 3 Flash (Preview)474100017.5%
Z.AI GLM 5454300017.5%
Claude Sonnet 4.5473700016.9%
Hermes 3 405B83000016.7%
Grok 4.203428200016.5%
Claude Haiku 4.5404000016.1%
Grok 4 Fast473200015.9%
o4 Mini2926210015.3%
Qwen 3.5 122B373600014.8%
Gemini 3 Pro (Preview)423000014.4%
Gemini 3.1 Flash Lite (Reasoning)69000013.9%
Z.AI GLM 4.567000013.3%
GPT-5.4 Mini66000013.2%
Qwen 3.5 9B65000013.0%
GPT-5.5 (Reasoning, Low)382500012.7%
Mistral Large63000012.7%
GPT-4.1 Mini63000012.5%
Grok 4.3 (Reasoning)58000011.6%
DeepSeek V4 Flash57000011.4%
Ministral 3 14B54000010.9%
Mistral Large 353000010.5%
Qwen3.6 Max Preview51000010.2%
Inception Mercury 2282300010.1%
MoonshotAI: Kimi K2.64900009.7%
Grok 4.1 Fast4500009.1%
MiniMax M2.73900007.8%
ByteDance Seed 1.6 Flash3800007.6%
Qwen 3.5 Plus (2026-04-20)3200006.5%
Qwen 3.5 35B3100006.3%
GPT-5.4 Mini (Reasoning, Low)3100006.3%
GPT-5.22300004.7%
Qwen 3.5 Flash2100004.2%
Inception Mercury1200002.4%
Qwen3.7 Max000000.0%
Claude Opus 4.7 (Reasoning)000000.0%
Qwen 3.5 27B000000.0%
Aion 2.0000000.0%
ByteDance Seed 2.0 Lite000000.0%
Nemotron 3 Super000000.0%
GPT-4o, Aug. 6th (temp=0)000000.0%
Mistral Small 3.2 24B000000.0%
Ministral 3 3B000000.0%
Ministral 3B000000.0%
LFM2 24B000000.0%

genre

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100.0%
Qwen 3.5 397B A17B1001001001006091.9%
Gemini 3.1 Pro (Preview)100100100635283.1%
GPT-4o Mini (temp=0)1007977756980.1%
Z.AI GLM 4.5100100100100080.0%
Stealth: Hunter Alpha10010093643979.3%
GPT-5.4 (Reasoning)1009491604578.2%
GPT-5.11009085793277.3%
Claude Opus 410010066535274.1%
Gemini 3.5 Flash (Reasoning)10010010064072.8%
Gemini 3.1 Flash Lite (Reasoning)10010010063072.7%
Claude Sonnet 41001008671071.5%
Claude 3.7 Sonnet10010010052070.4%
GPT-4o Mini (temp=1)1001007866068.8%
Z.AI GLM 4.71001009546068.3%
GPT-5.4 Nano1009774412968.2%
Qwen 3.6 27B10010010039067.8%
ByteDance Seed 1.6 Flash10010010038067.5%
Claude Opus 4.7 (Reasoning)98857972066.9%
Qwen 3.5 Plus (2026-02-15)1001007062066.4%
GPT-5.21009977282765.9%
GPT-5.5 (Reasoning)937861613565.6%
GPT-5.4 (Reasoning, Low)1008365502765.0%
Z.AI GLM 4.7 Flash1001006357064.0%
Qwen 3.5 9B1001006355063.6%
Qwen3.7 Max100967645063.4%
Z.AI GLM 5 Turbo100827260062.9%
Claude Sonnet 4.579777768060.3%
Qwen 3.5 27B1001001000060.0%
Gemini 3 Flash (Preview, Reasoning)1001001000060.0%
Z.AI GLM 4.61001001000060.0%
Qwen 3.5 35B1001001000060.0%
Hermes 3 405B1001001000060.0%
Gemini 2.5 Flash Lite1001001000060.0%
Rocinante 12B1001001000060.0%
Z.AI GLM 5100716861060.0%
Gemini 2.5 Pro100865653058.9%
Claude Opus 4.5100716359058.7%
Gemini 2.5 Flash1005050474558.2%
GPT-5 Mini1007544373457.9%
GPT-4o, Aug. 6th (temp=1)10096890057.1%
Gemini 3.5 Flash (Reasoning, Minimal)10093910056.7%
Grok 4.20827360333356.4%
GPT-5.4 Mini (Reasoning, Low)100896229055.9%
Arcee AI: Trinity Mini100100750054.9%
ByteDance Seed 2.0 Mini10094740053.6%
Grok 4.20 (Reasoning)1007434312953.3%
Cohere Command R+ (Aug. 2024)10098680053.3%
Claude Opus 4.6100565149051.1%
Z.AI GLM 4.5 Air10094560050.0%
Claude Opus 4.78982760049.4%
GPT-4.1 Mini8585760049.0%
Xiaomi MIMO v2.5100100430048.7%
Mistral Large100100430048.6%
Gemini 2.5 Flash (Reasoning)10094450047.8%
GPT-5.4 Nano (Reasoning)665249432547.0%
Gemma 4 26B10068670046.8%
MoonshotAI: Kimi K2.59578600046.6%
Mistral Small 410089420046.3%
GPT-5.5 (Reasoning, Low)695350361945.5%
DeepSeek V4 Flash (Reasoning)10063630045.2%
Claude Sonnet 4.6 (Reasoning)10068580045.1%
GPT-5.4 Mini (Reasoning)993531302945.0%
GPT-5.4 Nano (Reasoning, Low)734739372644.4%
Gemma 3 12B10066540044.0%
Mistral Large 310076440044.0%
Claude Sonnet 4.610061590044.0%
GPT-4.110061550043.2%
MiniMax M2.510059570043.1%
GPT-5.5635748291842.8%
Grok 4.20 (Beta)10077360042.6%
Qwen3 235B A22B Instruct 25078875490042.3%
DeepSeek V4 Pro (Reasoning)8968540042.2%
Qwen3.6 Max Preview773934322942.1%
DeepSeek V3 (2024-12-26)8364620041.8%
Claude Opus 4.6 (Reasoning)8862530040.7%
MoonshotAI: Kimi K2.610057460040.7%
Qwen 3.5 122B10063400040.6%
Qwen 3.6 Flash8678380040.4%
Claude 3.5 Sonnet10010000040.0%
DeepSeek V3.110010000040.0%
Llama 3.1 70B10010000040.0%
Llama 3.1 Nemotron 70B10010000040.0%
GPT-5.48579320039.2%
Gemini 3 Pro (Preview)10051420038.5%
Qwen 3.5 Plus (2026-04-20)10046400037.3%
Z.AI GLM 5.110052340037.1%
Gemini 2.5 Flash Lite (Reasoning)53524833037.0%
Hermes 3 70B1008500036.9%
GPT-5.4 Mini1008400036.8%
Xiaomi MIMO v2.5 Pro54494536036.6%
GPT-4.1 Nano868300033.9%
DeepSeek V4 Pro1006600033.2%
Writer: Palmyra X55854520032.9%
Ministral 3 14B6859370032.9%
Gemma 4 31B (Reasoning)1006400032.8%
Gemini 3 Flash (Preview)1006400032.8%
DeepSeek V3 (2025-03-24)1006200032.3%
DeepSeek-V2 Chat1005700031.4%
Claude 3 Haiku817600031.3%
Qwen 3.5 Flash817500031.2%
Mistral Large 21005600031.1%
GPT-4o, Aug. 6th (temp=0)797400030.6%
Grok 4.36551350030.1%
Aion 2.0945100029.0%
Qwen 3 32B826000028.3%
Grok 4.3 (Reasoning)4949410027.9%
Qwen 3.6 35B7233320027.4%
Claude Haiku 4.5706500027.1%
o4 Mini874800026.9%
GPT-5 Nano7528240025.4%
Stealth: Healer Alpha794700025.2%
Ministral 3 8B6842100023.9%
Mistral NeMO635600023.7%
Mistral Medium 3.1634400021.5%
Grok 4.1 Fast535200020.9%
DeepSeek V3.2515000020.1%
Gemma 4 26B (Reasoning)100000020.0%
ByteDance Seed 1.6100000020.0%
MiniMax M2.7100000020.0%
Gemma 4 31B100000020.0%
ByteDance Seed 2.0 Lite100000020.0%
Llama 3.1 8B100000020.0%
Ministral 3B100000020.0%
Ministral 3 3B96000019.2%
WizardLM 2 8x22b93000018.5%
GPT-5494300018.3%
Gemma 3 4B86000017.2%
Mistral Small Creative81000016.1%
Mistral Small 4 (Reasoning)77000015.3%
o4 Mini High403400014.8%
Arcee AI: Trinity Large (Preview)74000014.7%
Ministral 8B71000014.2%
Gemma 3 27B69000013.9%
Grok 4.20 (Beta, Reasoning)323100012.6%
GPT-4o, May 13th (temp=1)60000012.0%
Nemotron 3 Super58000011.6%
GPT-OSS 120B56000011.1%
Grok 4 Fast54000010.9%
GPT-4o, May 13th (temp=0)54000010.9%
DeepSeek V4 Flash51000010.2%
LFM2 24B51000010.1%
Nemotron 3 Nano5000009.9%
Mistral Small 3.2 24B200000.4%
Grok 4000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Inception Mercury000000.0%
Qwen 2.5 72B000000.0%

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
GPT-5.4100100100100100100.0%
Z.AI GLM 4.71001001001009098.1%
Writer: Palmyra X51001001001007294.4%
Claude Opus 4.6 (Reasoning)100100100906991.8%
GPT-5.4 (Reasoning, Low)100100100955890.5%
Gemini 3.1 Flash Lite1001001001005190.1%
Qwen 3.6 27B1009985838089.4%
Claude Opus 4.5100100100795586.9%
GPT-5.4 Nano (Reasoning, Low)999999845286.4%
Ministral 3 14B10010094736285.8%
Claude Opus 4.610010074717083.1%
Qwen 3 32B100100100575482.2%
Claude Sonnet 4.6 (Reasoning)10010087685381.5%
Z.AI GLM 510010091753480.1%
GPT-5.4 Nano100100100533777.9%
Gemma 3 27B10010091534577.9%
Gemini 2.5 Flash (Reasoning)1008277715877.8%
Qwen 3.5 397B A17B10010074664877.6%
GPT-5 Nano10010088722276.4%
GPT-5.4 Nano (Reasoning)100100100631976.3%
Qwen 3.5 Plus (2026-02-15)100100100413875.7%
Z.AI GLM 5.11009367615875.7%
Qwen3.6 Max Preview1009593603075.5%
GPT-51009865555474.4%
Gemini 3 Pro (Preview)1009383563573.4%
GPT-5.11009474534272.6%
Gemini 3.1 Pro (Preview)1009769642971.9%
Gemma 4 31B (Reasoning)10010054534570.5%
Nemotron 3 Nano10010010051070.2%
GPT-4.110010010044068.8%
MoonshotAI: Kimi K2.6888275702968.8%
Llama 3.1 8B1001007568068.4%
GPT-4o, Aug. 6th (temp=1)1001007268068.2%
GPT-5.5 (Reasoning)857969653967.5%
Mistral Small Creative1001009835066.6%
Gemini 2.5 Flash1001009733066.0%
Mistral Medium 3.11001006960065.9%
GPT-5.51007066652765.5%
GPT-5 Mini10010010027065.4%
Qwen 3.6 35B1009959363265.2%
GPT-5.4 Mini (Reasoning, Low)1001006359064.3%
Claude Sonnet 4686866615664.0%
Qwen 3.6 Flash100968043063.7%
MiniMax M2.71001007938063.4%
Gemini 2.5 Flash Lite (Reasoning)1001008234063.2%
GPT-5.4 (Reasoning)1006060524262.9%
Mistral Small 4 (Reasoning)1001005955062.8%
DeepSeek V3 (2025-03-24)100796760061.1%
GPT-5.21001006739061.0%
Ministral 3 8B1006453434160.3%
Gemini 3 Flash (Preview)878443434259.9%
GPT-5.4 Mini797371502659.7%
Llama 3.1 Nemotron 70B100100960059.2%
Qwen 3.5 Flash1001006531059.0%
GPT-4o Mini (temp=1)100726360058.9%
Gemma 3 12B100965345058.8%
Stealth: Healer Alpha100877531058.7%
Gemini 3.5 Flash (Reasoning, Minimal)86817848058.5%
Qwen 3.5 35B100817634058.2%
Gemma 4 26B1005048474658.0%
MoonshotAI: Kimi K2.5100756051057.2%
GPT-5.5 (Reasoning, Low)10010043281356.9%
GPT-5.4 Mini (Reasoning)100847526056.8%
Grok 41001004341056.8%
Gemini 3 Flash (Preview, Reasoning)846945453856.2%
Gemini 3.1 Flash Lite (Preview)100645754055.1%
Claude Sonnet 4.695646352054.8%
Z.AI GLM 5 Turbo97765640054.0%
DeepSeek V4 Flash100100690053.8%
Mistral Large 2100636342053.5%
DeepSeek V4 Pro100100670053.4%
Grok 4.20 (Reasoning)705854523153.0%
Gemini 3.1 Flash Lite (Reasoning)100585350052.1%
Qwen 3.5 9B100765031051.4%
GPT-4.1 Nano10075720049.4%
Mistral Large 310081650049.1%
GPT-4o, Aug. 6th (temp=0)10075700049.0%
Qwen3.7 Max79704946048.8%
Qwen 3.5 27B100100390047.8%
Gemini 3.5 Flash (Reasoning)76704646047.6%
DeepSeek V4 Pro (Reasoning)100100380047.6%
Mistral Small 4100100350047.1%
Claude Opus 4.7 (Reasoning)10072600046.5%
DeepSeek V3.1100504240046.3%
Stealth: Aurora Alpha100100310046.2%
Stealth: Hunter Alpha82595535046.2%
Gemma 3 4B10065630045.6%
ByteDance Seed 1.610068570045.2%
GPT-4o, May 13th (temp=1)10067590045.1%
Mistral Large10072500044.4%
Qwen 3.5 122B100443936043.9%
Gemma 4 31B10062570043.8%
Arcee AI: Trinity Mini7968650042.6%
Grok 4.20835826252042.5%
ByteDance Seed 1.6 Flash10064480042.3%
o4 Mini100393635041.9%
Gemini 2.5 Pro8684360041.3%
Aion 2.071663533041.0%
Qwen3 235B A22B Instruct 25079474350040.7%
Rocinante 12B10010000040.0%
Cohere Command R+ (Aug. 2024)8361560040.0%
GPT-4o Mini (temp=0)7264590039.1%
Grok 4.20 (Beta)77543427038.5%
Arcee AI: Trinity Large (Preview)7663530038.3%
Gemini 2.5 Flash Lite10047430038.0%
Z.AI GLM 4.7 Flash8755440037.2%
Z.AI GLM 4.5 Air10043380036.3%
Xiaomi MIMO v2.57965340035.4%
GPT-4.1 Mini6355540034.5%
ByteDance Seed 2.0 Mini1006800033.7%
Z.AI GLM 4.56154520033.4%
DeepSeek V3 (2024-12-26)1006700033.3%
LFM2 24B1006500033.0%
o4 Mini High60413430032.9%
Grok 4.20 (Beta, Reasoning)10033300032.5%
Xiaomi MIMO v2.5 Pro1006100032.2%
Nemotron 3 Super1006000032.0%
Hermes 3 70B1005800031.6%
Ministral 8B1005800031.6%
ByteDance Seed 2.0 Lite857000031.0%
Mistral NeMO826600029.6%
Claude Sonnet 4.51004500028.9%
Grok 4.31004300028.5%
Z.AI GLM 4.65847380028.5%
Ministral 3 3B706300026.7%
Grok 4 Fast4843390026.0%
Qwen 3.5 Plus (2026-04-20)1003000026.0%
Claude 3.7 Sonnet4843360025.5%
GPT-4o, May 13th (temp=0)685500024.5%
Inception Mercury100600021.2%
DeepSeek-V2 Chat545200021.2%
Gemma 4 26B (Reasoning)535000020.4%
Claude Opus 4.7100000020.0%
DeepSeek V4 Flash (Reasoning)100000020.0%
Hermes 3 405B100000020.0%
Llama 3.1 70B100000020.0%
Claude Haiku 4.5544200019.2%
Inception Mercury 295000019.0%
DeepSeek V3.23530280018.7%
MiniMax M2.5494300018.3%
Claude 3.5 Sonnet88000017.5%
Grok 4.1 Fast533400017.2%
GPT-OSS 120B68000013.5%
WizardLM 2 8x22b57000011.5%
Grok 4.3 (Reasoning)52000010.4%
Claude Opus 4000000.0%
Mistral Small 3.2 24B000000.0%
Qwen 2.5 72B000000.0%
Claude 3 Haiku000000.0%
Ministral 3B000000.0%

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
GPT-5.510010093833481.9%
Mistral Small Creative1009582715480.6%
Ministral 3 8B100100100100080.0%
Mistral Large10010010096079.2%
Llama 3.1 Nemotron 70B10010010091078.2%
Qwen 3.6 35B1001009691077.4%
ByteDance Seed 1.6 Flash100969690076.5%
Qwen3.6 Max Preview1008584743074.8%
Grok 4.20 (Beta, Reasoning)10010081424273.1%
Gemini 3.1 Pro (Preview)1008180544471.8%
Nemotron 3 Super10010010053070.5%
GPT-5.4 Mini (Reasoning)1009999312270.1%
GPT-5.4 Mini (Reasoning, Low)1008482562669.6%
Ministral 8B10010010048069.5%
Grok 4 Fast10010010046069.2%
GPT-5.4 Mini10010062572568.9%
Qwen 3.6 Flash1001008154066.9%
Grok 4.20 (Beta)1001007553065.6%
Qwen 3.5 397B A17B1008353493964.9%
MoonshotAI: Kimi K2.51001006161064.4%
GPT-5.5 (Reasoning, Low)1009959361862.4%
Qwen 3.5 Plus (2026-04-20)1001001000060.0%
Qwen 3 32B1001001000060.0%
Mistral Medium 3.11001001000060.0%
Nemotron 3 Nano1001001000060.0%
Gemini 3.1 Flash Lite (Reasoning)100955649060.0%
Xiaomi MIMO v2.5 Pro1001005542059.3%
GPT-5.490896843058.4%
Stealth: Hunter Alpha100965441058.2%
GPT-5.4 Nano1007344383557.9%
Gemini 2.5 Flash Lite (Reasoning)92915340055.1%
GPT-5.21006447422054.7%
Claude Opus 4100100660053.2%
Rocinante 12B82786637052.6%
GPT-5.5 (Reasoning)746658451952.4%
Gemini 3.1 Flash Lite100100610052.2%
MoonshotAI: Kimi K2.6100100600051.9%
Grok 4100545450051.6%
Claude Sonnet 4100100570051.5%
Qwen3.7 Max100754138050.9%
Mistral Small 4100684537050.1%
Z.AI GLM 4.7100534947049.7%
Grok 4.20 (Reasoning)81804640049.5%
Qwen 3.6 27B10095520049.4%
GPT-5.4 Nano (Reasoning)1001002521049.3%
Qwen 3.5 27B100100460049.2%
GPT-5.4 (Reasoning, Low)98744825049.0%
Grok 4.3 (Reasoning)9479710049.0%
Ministral 3 14B100100420048.5%
o4 Mini10086430045.9%
Stealth: Aurora Alpha65625050045.2%
Gemini 3.5 Flash (Reasoning, Minimal)10068530044.3%
Z.AI GLM 5.110065560044.1%
Qwen 3.5 9B10061590044.0%
Qwen 3.5 122B8271660043.8%
Gemini 3 Flash (Preview, Reasoning)7668680042.5%
Claude 3.7 Sonnet10057520041.8%
Gemini 3.1 Flash Lite (Preview)10056510041.4%
GPT-5 Mini10064420041.1%
GPT-575683327040.6%
Gemini 2.5 Flash Lite10054480040.5%
ByteDance Seed 1.610010000040.0%
Qwen 3.5 Flash10010000040.0%
Arcee AI: Trinity Large (Preview)10010000040.0%
Arcee AI: Trinity Mini10010000040.0%
Ministral 3B10010000040.0%
Mistral Large 210052480039.8%
Grok 4.1 Fast1009900039.8%
Llama 3.1 8B1009800039.6%
GPT-5.4 Nano (Reasoning, Low)100522422039.5%
Hermes 3 70B1009600039.2%
Grok 4.2083423635039.0%
Claude Opus 4.69653450038.8%
Hermes 3 405B1009300038.5%
GPT-4.1 Mini1009100038.2%
Gemma 4 31B (Reasoning)1008600037.2%
Qwen 2.5 72B1008200036.4%
GPT-5.4 (Reasoning)8466270035.6%
Inception Mercury 26558470034.0%
GPT-4o, Aug. 6th (temp=1)868300033.9%
GPT-5.1712523232232.7%
ByteDance Seed 2.0 Mini1006300032.7%
DeepSeek V3 (2025-03-24)937000032.6%
Gemini 3.5 Flash (Reasoning)5554510031.9%
Qwen3 235B A22B Instruct 2507986100031.8%
GPT-4o, May 13th (temp=1)827200030.9%
o4 Mini High5151500030.2%
Claude Sonnet 4.6786800029.3%
Gemini 3 Pro (Preview)1004600029.3%
DeepSeek V3.15049430028.3%
Gemini 2.5 Flash5447390028.1%
Gemma 4 26B746700028.0%
GPT-5 Nano6140340026.8%
DeepSeek V4 Pro (Reasoning)5142420026.8%
Mistral Large 3686000025.4%
DeepSeek V3.2893700025.2%
Mistral NeMO824100024.5%
Aion 2.04140380023.8%
Gemma 3 27B535200021.1%
Z.AI GLM 5100000020.0%
Claude Opus 4.7100000020.0%
Claude Opus 4.5100000020.0%
DeepSeek V4 Flash (Reasoning)100000020.0%
MiniMax M2.5100000020.0%
Gemini 2.5 Flash (Reasoning)100000020.0%
Inception Mercury100000020.0%
GPT-4o Mini (temp=1)100000020.0%
Grok 4.3100000020.0%
Llama 3.1 70B100000020.0%
WizardLM 2 8x22b100000020.0%
Gemma 3 4B100000020.0%
ByteDance Seed 2.0 Lite98000019.6%
Z.AI GLM 4.7 Flash494800019.4%
Cohere Command R+ (Aug. 2024)96000019.2%
Stealth: Healer Alpha514200018.5%
Qwen 3.5 35B464200017.7%
Claude Opus 4.6 (Reasoning)434100016.8%
Claude 3 Haiku83000016.7%
GPT-4o, Aug. 6th (temp=0)81000016.1%
Gemini 2.5 Pro80000016.0%
Gemma 4 31B77000015.4%
Qwen 3.5 Plus (2026-02-15)75000014.9%
Writer: Palmyra X569000013.9%
Z.AI GLM 5 Turbo68000013.7%
Z.AI GLM 4.568000013.7%
Claude Haiku 4.568000013.7%
Claude Sonnet 4.6 (Reasoning)68000013.5%
DeepSeek V4 Flash67000013.3%
GPT-4o, May 13th (temp=0)66000013.2%
Gemini 3 Flash (Preview)66000013.2%
Gemma 4 26B (Reasoning)65000013.0%
DeepSeek-V2 Chat60000012.0%
Claude Sonnet 4.559000011.8%
Gemma 3 12B58000011.6%
GPT-4.157000011.4%
GPT-OSS 120B57000011.4%
DeepSeek V3 (2024-12-26)56000011.1%
Mistral Small 3.2 24B56000011.1%
Z.AI GLM 4.652000010.3%
Xiaomi MIMO v2.551000010.2%
MiniMax M2.74600009.2%
Z.AI GLM 4.5 Air4300008.5%
Mistral Small 4 (Reasoning)4200008.3%
Ministral 3 3B3100006.2%
Claude Opus 4.7 (Reasoning)000000.0%
Claude 3.5 Sonnet000000.0%
DeepSeek V4 Pro000000.0%
GPT-4o Mini (temp=0)000000.0%
GPT-4.1 Nano000000.0%
LFM2 24B000000.0%

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Gemini 3.1 Flash Lite1001001001005490.9%
Qwen 3.6 35B10010096916690.6%
Qwen 3.5 397B A17B1001001001004889.5%
Claude Sonnet 4.6100100100726988.4%
Qwen 3 32B10010094934386.0%
Qwen3.7 Max100100100100080.0%
Gemma 4 31B (Reasoning)100100100100080.0%
Qwen3.6 Max Preview10010090564578.1%
GPT-5.4 Nano100100100572376.2%
Mistral NeMO10010010081076.1%
GPT-5.4 (Reasoning)10010097542975.9%
GPT-5 Mini1008875644073.5%
GPT-4o Mini (temp=1)100918682071.8%
Gemini 3.1 Flash Lite (Reasoning)10010010057071.5%
GPT-5.41001008271070.6%
Gemini 3.1 Pro (Preview)1001009452069.2%
Gemini 2.5 Flash Lite (Reasoning)10010010045069.0%
GPT-5.4 Nano (Reasoning, Low)1008887442268.3%
GPT-5.4 (Reasoning, Low)10010083282567.2%
Grok 4.20 (Beta, Reasoning)10010010036067.2%
Grok 4 Fast1001007263067.0%
Claude Opus 4.7 (Reasoning)89827770063.7%
GPT-5.4 Mini100817760063.6%
GPT-5.5 (Reasoning, Low)1008146393660.5%
Qwen 3.5 35B1001005348060.3%
Claude 3.7 Sonnet1001001000060.0%
Hermes 3 70B1001001000060.0%
GPT-4o, Aug. 6th (temp=1)100100930058.5%
Claude 3 Haiku100100930058.5%
Gemma 4 26B100696261058.4%
Mistral Small Creative100686757058.2%
Grok 4.20100817633058.0%
Aion 2.090885555057.5%
MoonshotAI: Kimi K2.6100924946057.4%
Qwen 3.5 Flash100100810056.2%
Gemini 2.5 Flash (Reasoning)9595880055.8%
GPT-5.5907156391955.0%
ByteDance Seed 2.0 Mini10091830054.8%
Rocinante 12B100100710054.3%
Grok 4.3 (Reasoning)10091790054.1%
ByteDance Seed 1.6100100680053.7%
Gemini 3.5 Flash (Reasoning)10098640052.4%
GPT-4.1 Nano10086760052.4%
GPT-5 Nano100745730052.2%
Claude Opus 4100100600052.0%
Gemini 3.5 Flash (Reasoning, Minimal)100100590051.8%
Gemini 2.5 Flash100664745051.6%
GPT-5965750292551.5%
Qwen 3.5 122B100100550051.0%
GPT-5.4 Mini (Reasoning)100823932050.8%
Mistral Small 4 (Reasoning)100100540050.8%
Stealth: Healer Alpha100100510050.2%
ByteDance Seed 1.6 Flash686449373249.9%
Qwen 3.5 Plus (2026-02-15)75635951049.4%
Qwen 3.5 Plus (2026-04-20)86854231048.9%
DeepSeek V3.2100100440048.8%
Grok 4.20 (Reasoning)100100420048.5%
GPT-4o, May 13th (temp=0)10079600047.8%
Grok 4.310086530047.8%
Gemini 3 Flash (Preview)72575650047.2%
Z.AI GLM 4.58685640047.0%
Gemma 4 31B10067640046.2%
o4 Mini High88525040046.0%
Claude Sonnet 4.510069570045.4%
Gemini 2.5 Flash Lite9279530044.7%
Mistral Large 310067520043.6%
Grok 4.1 Fast78544240042.7%
Writer: Palmyra X557575345042.5%
Claude Opus 4.6 (Reasoning)72514542042.0%
GPT-5.4 Nano (Reasoning)10064430041.4%
Qwen3 235B A22B Instruct 25077866630041.3%
Z.AI GLM 5 Turbo10061440041.0%
MoonshotAI: Kimi K2.510010000040.0%
DeepSeek-V2 Chat10010000040.0%
Z.AI GLM 4.7 Flash10010000040.0%
Claude 3.5 Sonnet10010000040.0%
Llama 3.1 Nemotron 70B10010000040.0%
GPT-5.2100512523039.7%
Z.AI GLM 51009400038.9%
Qwen 3.5 27B7068560038.6%
o4 Mini8954450037.6%
Gemma 3 12B6861580037.3%
Qwen 3.6 Flash10051340037.0%
DeepSeek V3 (2024-12-26)1008300036.7%
Grok 4.20 (Beta)10048340036.3%
Gemini 2.5 Pro49464540036.0%
Qwen 2.5 72B1007800035.6%
Cohere Command R+ (Aug. 2024)1007700035.4%
Z.AI GLM 4.66160550035.2%
GPT-5.4 Mini (Reasoning, Low)10043300034.5%
Inception Mercury 21007000034.0%
Ministral 3 8B6854460033.7%
Hermes 3 405B1006600033.2%
DeepSeek V3.11006600033.2%
GPT-4o Mini (temp=0)887700032.9%
DeepSeek V4 Pro (Reasoning)6649490032.8%
Arcee AI: Trinity Large (Preview)1006200032.3%
Nemotron 3 Nano1005500031.0%
Xiaomi MIMO v2.56251400030.5%
GPT-4.11005300030.5%
Ministral 3 14B1005300030.5%
GPT-4o, May 13th (temp=1)777500030.3%
Gemini 3 Pro (Preview)5951410030.1%
Grok 41005000030.0%
GPT-5.5 (Reasoning)553920191529.5%
DeepSeek V4 Pro5048470028.9%
Gemma 3 27B895400028.7%
GPT-5.163292625028.7%
Z.AI GLM 4.75346420028.2%
Mistral Small 41004100028.1%
Gemma 4 26B (Reasoning)716800028.0%
Stealth: Hunter Alpha5145430027.9%
Nemotron 3 Super776200027.7%
Claude Opus 4.64946420027.4%
Z.AI GLM 4.5 Air716500027.3%
Claude Haiku 4.5785700027.1%
Mistral Small 3.2 24B645680025.5%
MiniMax M2.5725300025.0%
GPT-OSS 120B793800023.4%
Mistral Large753900022.8%
Qwen 3.6 27B605000022.0%
Qwen 3.5 9B712900020.1%
Z.AI GLM 5.1100000020.0%
Gemini 3 Flash (Preview, Reasoning)100000020.0%
GPT-4o, Aug. 6th (temp=0)100000020.0%
Llama 3.1 70B100000020.0%
WizardLM 2 8x22b100000020.0%
Arcee AI: Trinity Mini100000020.0%
Llama 3.1 8B100000020.0%
ByteDance Seed 2.0 Lite98000019.6%
Claude Opus 4.5524400019.2%
Mistral Medium 3.1464400018.0%
DeepSeek V4 Flash (Reasoning)89000017.9%
LFM2 24B85000016.9%
Ministral 3B83000016.7%
DeepSeek V4 Flash74000014.7%
Gemma 3 4B72000014.5%
Mistral Large 271000014.3%
Ministral 8B68000013.5%
Stealth: Aurora Alpha363000013.4%
MiniMax M2.763000012.7%
Ministral 3 3B55000011.0%
Claude Sonnet 4.6 (Reasoning)5000009.9%
Xiaomi MIMO v2.5 Pro4100008.3%
Claude Opus 4.7000000.0%
Claude Sonnet 4000000.0%
GPT-4.1 Mini000000.0%
DeepSeek V3 (2025-03-24)000000.0%
Inception Mercury000000.0%

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Qwen3.6 Max Preview1001001001008997.9%
Qwen 3.6 Flash100100100755185.0%
Gemini 3.1 Flash Lite (Reasoning)100100100605983.8%
Qwen 3.6 35B100100100100080.0%
Llama 3.1 8B10010010093078.5%
GPT-5.4 Mini (Reasoning, Low)1009765585374.5%
GPT-5.4 (Reasoning, Low)1009466535373.4%
Gemma 3 12B10010010065073.0%
Qwen 3 32B96938883072.1%
MoonshotAI: Kimi K2.510010010054070.9%
ByteDance Seed 1.6 Flash10010010054070.9%
GPT-5.410010093372170.4%
MiniMax M2.71007957575469.4%
GPT-5.4 Mini (Reasoning)10010088302869.2%
Gemini 2.5 Flash (Reasoning)1001007665068.2%
Qwen 3.5 397B A17B10010010032066.3%
Rocinante 12B1001007952066.3%
Gemini 3.1 Flash Lite1001007058065.7%
Z.AI GLM 5 Turbo1001006357064.0%
Aion 2.01009154423063.3%
GPT-5.4 (Reasoning)10010044432863.0%
Qwen 3.5 35B1001007040062.0%
GPT-5.11008452472461.3%
Grok 4 Fast1001005449060.7%
Grok 494916355060.5%
Hermes 3 70B1001001000060.0%
Mistral Small 4838245453958.6%
Mistral Small Creative100776548058.0%
ByteDance Seed 2.0 Mini100100830056.7%
Hermes 3 405B100100770055.4%
DeepSeek V4 Pro (Reasoning)100100750055.0%
DeepSeek-V2 Chat100100690053.9%
Gemma 3 27B100595651053.2%
Gemini 2.5 Pro100874138053.2%
Z.AI GLM 5.1100605550053.0%
Gemini 3.1 Pro (Preview)100100560051.2%
DeepSeek V3.2100100550051.0%
ByteDance Seed 1.610077760050.5%
Claude Opus 4.563636362050.3%
LFM2 24B10076680048.9%
Gemma 4 26B (Reasoning)10076670048.5%
Ministral 3 8B100962917048.4%
DeepSeek V4 Flash (Reasoning)10072680048.2%
DeepSeek V4 Pro10085530047.6%
Stealth: Hunter Alpha9695460047.5%
GPT-5 Nano864241333046.4%
GPT-5.5 (Reasoning, Low)827229281845.7%
Z.AI GLM 4.510068600045.6%
Mistral Large10075510045.1%
GPT-5.4 Nano946823231845.1%
Mistral Small 4 (Reasoning)100593329044.3%
Claude Opus 4.6 (Reasoning)9180490044.0%
Nemotron 3 Nano10060600043.8%
Qwen 3.5 Plus (2026-02-15)7871690043.8%
GPT-5.4 Mini80585427043.7%
Gemini 3.1 Flash Lite (Preview)9965530043.3%
Claude 3.7 Sonnet10056540042.0%
GPT-5.5674737312441.2%
Qwen 3.6 27B61524942040.8%
Z.AI GLM 4.5 Air7970530040.6%
Qwen3.7 Max10010000040.0%
ByteDance Seed 2.0 Lite10010000040.0%
Ministral 3B1009800039.6%
GPT-5.2656228211938.9%
GPT-5.4 Nano (Reasoning)77494623038.7%
o4 Mini61504934038.6%
GPT-4o, Aug. 6th (temp=1)1009300038.5%
Ministral 8B78494321037.9%
Claude 3 Haiku1008900037.9%
GPT-5.4 Nano (Reasoning, Low)1008800037.7%
Gemma 4 31B1008300036.7%
Qwen 3.5 Flash6755550035.4%
Stealth: Aurora Alpha1007500035.0%
Writer: Palmyra X51006700033.3%
Gemini 3.5 Flash (Reasoning)1006200032.3%
Mistral Large 31006100032.2%
DeepSeek V3 (2024-12-26)1006100032.2%
Qwen 3.5 122B1006000032.0%
Claude Sonnet 41006000031.9%
Ministral 3 3B916800031.9%
GPT-5 Mini7545350030.8%
Xiaomi MIMO v2.55452450030.3%
Qwen 2.5 72B797100030.2%
Z.AI GLM 4.65353410029.2%
Gemma 4 26B776800028.9%
Z.AI GLM 4.71004100028.2%
GPT-5.5 (Reasoning)463621181727.5%
GPT-5825400027.1%
Xiaomi MIMO v2.5 Pro4541410025.5%
GPT-4.1675800025.0%
Arcee AI: Trinity Large (Preview)685000023.5%
Qwen 3.5 Plus (2026-04-20)635400023.4%
Gemini 3 Flash (Preview, Reasoning)635300023.3%
Gemini 2.5 Flash Lite644800022.4%
Claude Opus 4.6575000021.3%
Gemini 2.5 Flash Lite (Reasoning)555000020.9%
DeepSeek V3.1545100020.9%
Claude Opus 4544800020.4%
Z.AI GLM 5544700020.2%
Claude Sonnet 4.6 (Reasoning)100000020.0%
Qwen 3.5 27B100000020.0%
Claude Sonnet 4.5100000020.0%
GPT-4o, May 13th (temp=1)100000020.0%
GPT-4.1 Mini100000020.0%
Llama 3.1 70B100000020.0%
Llama 3.1 Nemotron 70B100000020.0%
Ministral 3 14B100000020.0%
WizardLM 2 8x22b100000020.0%
Cohere Command R+ (Aug. 2024)100000020.0%
Gemma 3 4B100000020.0%
Mistral NeMO100000020.0%
o4 Mini High524700019.9%
Arcee AI: Trinity Mini98000019.6%
DeepSeek V3 (2025-03-24)94000018.9%
GPT-4.1 Nano94000018.9%
Claude 3.5 Sonnet89000017.9%
Grok 4.20 (Reasoning)464000017.2%
Gemini 3.5 Flash (Reasoning, Minimal)83000016.7%
Grok 4.3 (Reasoning)81000016.1%
Z.AI GLM 4.7 Flash79000015.9%
Grok 4.20 (Beta)413300014.8%
Gemini 3 Pro (Preview)413300014.8%
Gemma 4 31B (Reasoning)70000014.1%
GPT-4o Mini (temp=1)70000014.1%
GPT-4o Mini (temp=0)69000013.9%
GPT-4o, May 13th (temp=0)68000013.5%
Claude Haiku 4.567000013.3%
Nemotron 3 Super67000013.3%
DeepSeek V4 Flash65000013.0%
Gemini 3 Flash (Preview)63000012.5%
Grok 4.358000011.6%
Qwen3 235B A22B Instruct 250753000010.5%
Grok 4.1 Fast4500008.9%
Stealth: Healer Alpha4400008.8%
Mistral Medium 3.14200008.3%
GPT-OSS 120B4000007.9%
Grok 4.20 (Beta, Reasoning)3400006.8%
Grok 4.202600005.2%
Inception Mercury1700003.4%
Claude Opus 4.7 (Reasoning)000000.0%
MoonshotAI: Kimi K2.6000000.0%
Claude Sonnet 4.6000000.0%
Claude Opus 4.7000000.0%
MiniMax M2.5000000.0%
Qwen 3.5 9B000000.0%
Inception Mercury 2000000.0%
GPT-4o, Aug. 6th (temp=0)000000.0%
Mistral Large 2000000.0%
Gemini 2.5 Flash000000.0%
Mistral Small 3.2 24B000000.0%

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Qwen3.7 Max100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
Gemini 3.5 Flash (Reasoning)1001001001009599.0%
Qwen 3.5 35B1001001001008196.3%
Gemini 3.1 Flash Lite (Reasoning)1001001001007795.4%
GPT-5.4 (Reasoning)100100100908695.2%
Qwen 3.5 Flash1001001001007194.3%
Qwen 3.5 27B1001001001005991.8%
Gemma 4 26B1001001001005691.1%
Gemini 3 Flash (Preview, Reasoning)1001001001005390.5%
Gemini 3.1 Flash Lite100100100727188.8%
Qwen 3.6 Flash10010093806186.7%
GPT-5 Nano10010092766586.6%
Qwen 3.6 35B10010093705282.9%
GPT-5.4 Nano (Reasoning, Low)100100100664782.6%
o4 Mini High100100100684181.7%
Gemini 3.5 Flash (Reasoning, Minimal)100100100100080.0%
Hermes 3 405B100100100100080.0%
Hermes 3 70B100100100100080.0%
Qwen 3.5 122B10010093574779.4%
Qwen 3.5 Plus (2026-02-15)10010010097079.4%
GPT-5.4 (Reasoning, Low)100100100722579.4%
Gemini 3.1 Pro (Preview)10010010096079.2%
Gemma 4 31B (Reasoning)10010068666279.0%
LFM2 24B868578686776.8%
Claude Sonnet 410010010083076.7%
Gemini 3.1 Flash Lite (Preview)10010010077075.4%
Z.AI GLM 4.7 Flash10010010077075.4%
Qwen3.6 Max Preview10010071643974.9%
GPT-5.4 Mini (Reasoning, Low)100100100323172.7%
o4 Mini1009983423972.6%
MoonshotAI: Kimi K2.510010010057071.5%
Qwen 3.6 27B10010063603170.9%
Gemma 4 31B10010010053070.6%
Claude Opus 4.6 (Reasoning)100968968070.4%
GPT-5.4 Mini (Reasoning)10010063543169.6%
Gemini 3 Pro (Preview)1008578444069.5%
Qwen 3.5 9B1001008957069.4%
Gemini 2.5 Flash Lite (Reasoning)10010066403668.4%
Stealth: Hunter Alpha1009072433668.2%
Grok 4.20 (Beta, Reasoning)97927473067.1%
GPT-5.5 (Reasoning)1008680521767.1%
ByteDance Seed 1.693868370066.5%
Claude Opus 4.51001008642065.6%
GPT-5.41008858532965.5%
GPT-5.11009661412664.7%
Gemma 3 4B100886868064.6%
GPT-5.4 Nano1009045444163.9%
Aion 2.0100767567063.5%
Z.AI GLM 4.793767471062.9%
Gemini 2.5 Flash (Reasoning)10010039383662.7%
Qwen3 235B A22B Instruct 25071001006746062.6%
Nemotron 3 Super1001006943062.4%
GPT-4o, Aug. 6th (temp=0)88817262060.5%
Claude Opus 41001001000060.0%
DeepSeek-V2 Chat1001001000060.0%
ByteDance Seed 2.0 Lite1001001000060.0%
Gemini 2.5 Flash1001001000060.0%
Llama 3.1 70B1001001000060.0%
Llama 3.1 Nemotron 70B1001001000060.0%
GPT-4.1 Nano1001001000060.0%
Mistral NeMO100100970059.4%
GPT-4.1100915452059.2%
Z.AI GLM 5 Turbo100965147058.9%
Ministral 3 8B100675958056.7%
Ministral 3B100100830056.6%
GPT-51007851231854.1%
Z.AI GLM 5.1100100690053.8%
Claude Opus 4.7100100680053.7%
Gemini 2.5 Pro1004442393952.9%
Gemini 3 Flash (Preview)100100610052.2%
Arcee AI: Trinity Mini100100610052.2%
Claude 3 Haiku100100580051.6%
Claude Opus 4.6100694542051.2%
GPT-5.4 Mini100735427050.8%
Grok 4.20 (Reasoning)100763735049.7%
DeepSeek V4 Flash (Reasoning)95545445049.7%
DeepSeek V4 Pro (Reasoning)100100490049.7%
Z.AI GLM 4.610093560049.6%
Z.AI GLM 4.510085630049.4%
Z.AI GLM 4.5 Air66656352049.2%
Grok 4.3100100450049.1%
Grok 4.20 (Beta)100635824048.9%
Grok 4.2010074670048.2%
Gemma 3 27B10092460047.6%
Mistral Small 481625736047.4%
Qwen 3 32B10079560047.0%
GPT-5.4 Nano (Reasoning)88724523045.6%
Claude Sonnet 4.610077470044.8%
MiniMax M2.510064580044.4%
GPT-4o, May 13th (temp=0)10068540044.4%
Xiaomi MIMO v2.583484644044.3%
DeepSeek V3 (2024-12-26)10071470043.7%
Qwen 2.5 72B9363620043.5%
GPT-5 Mini9963550043.4%
Grok 4.3 (Reasoning)8171630042.9%
Claude Haiku 4.510059480041.4%
DeepSeek V3.110057470040.9%
Writer: Palmyra X58961520040.4%
Ministral 8B10052500040.3%
ByteDance Seed 2.0 Mini10010000040.0%
Cohere Command R+ (Aug. 2024)10010000040.0%
ByteDance Seed 1.6 Flash100383226039.2%
Nemotron 3 Nano6865560037.9%
Grok 4 Fast1008900037.9%
GPT-4o, Aug. 6th (temp=1)1008800037.5%
GPT-5.5714828271337.4%
GPT-4.1 Mini1008300036.7%
DeepSeek V3 (2025-03-24)1007600035.2%
Claude 3.5 Sonnet938200034.9%
GPT-5.28067230034.1%
MoonshotAI: Kimi K2.66355520034.1%
Llama 3.1 8B917900034.1%
Mistral Small Creative8740380033.1%
Claude Sonnet 4.6 (Reasoning)1006300032.7%
Mistral Large6561350032.2%
MiniMax M2.71005600031.2%
GPT-4o Mini (temp=1)836900030.6%
Gemini 2.5 Flash Lite1005100030.1%
Gemma 3 12B1005000030.0%
Z.AI GLM 55746420029.2%
DeepSeek V4 Pro994600029.0%
GPT-5.5 (Reasoning, Low)553316161426.8%
Claude Opus 4.7 (Reasoning)676600026.5%
DeepSeek V4 Flash933800026.3%
Stealth: Healer Alpha4545400026.1%
Xiaomi MIMO v2.5 Pro4241380024.3%
GPT-4o Mini (temp=0)635400023.4%
Claude Sonnet 4.5100000020.0%
Claude 3.7 Sonnet100000020.0%
Mistral Large 2100000020.0%
Arcee AI: Trinity Large (Preview)100000020.0%
Ministral 3 14B593900019.6%
DeepSeek V3.2583800019.2%
Mistral Small 4 (Reasoning)3731280019.1%
Grok 4.1 Fast78000015.6%
Rocinante 12B75000014.9%
Stealth: Aurora Alpha462800014.7%
Ministral 3 3B68000013.5%
GPT-4o, May 13th (temp=1)57000011.5%
GPT-OSS 120B362200011.5%
Inception Mercury 23100006.3%
Mistral Medium 3.13100006.3%
Mistral Small 3.2 24B200000.5%
Grok 4000000.0%
Mistral Large 3000000.0%
Inception Mercury000000.0%
WizardLM 2 8x22b000000.0%

Novelcrafter Default Prompt

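Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 71 | 0 | 74.3%
Gemma 3 4B | 79 | 75 | 72 | 67 | 63 | 71.1%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 54 | 0 | 70.8%
DeepSeek V3.1 | 100 | 76 | 76 | 70 | 0 | 64.4%
ByteDance Seed 1.6 Flash | 100 | 100 | 68 | 46 | 0 | 62.8%
Gemini 3.1 Flash Lite (Preview) | 100 | 70 | 66 | 64 | 0 | 60.1%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 0 | 0 | 60.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude Opus 4.6 (Reasoning) | 100 | 85 | 52 | 52 | 0 | 57.8%
MiniMax M2.5 | 100 | 98 | 79 | 0 | 0 | 55.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 74 | 0 | 0 | 54.7%
Hermes 3 70B | 100 | 100 | 71 | 0 | 0 | 54.3%
Claude Haiku 4.5 | 100 | 100 | 67 | 0 | 0 | 53.3%
Xiaomi MIMO v2.5 | 99 | 87 | 42 | 36 | 0 | 52.8%
Gemini 2.5 Flash | 82 | 65 | 55 | 54 | 0 | 51.2%
WizardLM 2 8x22b | 100 | 67 | 45 | 41 | 0 | 50.6%
Gemma 4 31B | 100 | 69 | 68 | 0 | 0 | 47.6%
DeepSeek V4 Pro | 97 | 79 | 56 | 0 | 0 | 46.5%
DeepSeek V4 Flash | 81 | 77 | 69 | 0 | 0 | 45.4%
GPT-5.4 | 100 | 94 | 28 | 0 | 0 | 44.4%
Z.AI GLM 5 | 100 | 64 | 54 | 0 | 0 | 43.7%
Mistral Small Creative | 98 | 70 | 41 | 0 | 0 | 41.8%
Qwen 3.6 35B | 63 | 54 | 50 | 42 | 0 | 41.7%
Qwen 3.5 35B | 57 | 57 | 47 | 45 | 0 | 41.3%
Gemini 2.5 Flash Lite | 78 | 71 | 56 | 0 | 0 | 41.1%
Grok 4.20 | 97 | 79 | 26 | 0 | 0 | 40.5%
Gemini 3 Flash (Preview, Reasoning) | 88 | 58 | 54 | 0 | 0 | 40.0%
Claude 3.5 Sonnet | 100 | 100 | 0 | 0 | 0 | 40.0%
Hermes 3 405B | 100 | 100 | 0 | 0 | 0 | 40.0%
Arcee AI: Trinity Mini | 100 | 100 | 0 | 0 | 0 | 40.0%
GPT-4.1 | 100 | 98 | 0 | 0 | 0 | 39.6%
Z.AI GLM 4.5 | 100 | 96 | 0 | 0 | 0 | 39.2%
Arcee AI: Trinity Large (Preview) | 100 | 96 | 0 | 0 | 0 | 39.2%
Llama 3.1 8B | 100 | 96 | 0 | 0 | 0 | 39.2%
GPT-4o, Aug. 6th (temp=1) | 100 | 93 | 0 | 0 | 0 | 38.5%
Ministral 3 14B | 85 | 60 | 44 | 0 | 0 | 37.8%
Qwen 3 32B | 100 | 62 | 26 | 0 | 0 | 37.6%
Z.AI GLM 4.6 | 100 | 86 | 0 | 0 | 0 | 37.2%
GPT-5.4 Nano | 68 | 63 | 32 | 19 | 0 | 36.4%
Claude Opus 4 | 69 | 68 | 45 | 0 | 0 | 36.3%
Gemma 3 12B | 61 | 60 | 57 | 0 | 0 | 35.5%
Mistral Small 4 | 76 | 55 | 45 | 0 | 0 | 35.3%
DeepSeek V3 (2024-12-26) | 100 | 72 | 0 | 0 | 0 | 34.5%
Gemma 4 26B | 100 | 71 | 0 | 0 | 0 | 34.3%
Stealth: Healer Alpha | 63 | 60 | 48 | 0 | 0 | 34.2%
Claude Opus 4.7 | 93 | 78 | 0 | 0 | 0 | 34.1%
DeepSeek V4 Flash (Reasoning) | 100 | 68 | 0 | 0 | 0 | 33.5%
GPT-5.5 (Reasoning, Low) | 54 | 42 | 35 | 35 | 0 | 33.3%
Grok 4.20 (Beta) | 89 | 47 | 26 | 0 | 0 | 32.5%
Z.AI GLM 5.1 | 100 | 61 | 0 | 0 | 0 | 32.2%
GPT-4o, May 13th (temp=1) | 100 | 59 | 0 | 0 | 0 | 31.8%
DeepSeek V3.2 | 60 | 52 | 45 | 0 | 0 | 31.4%
Qwen3 235B A22B Instruct 2507 | 68 | 45 | 44 | 0 | 0 | 31.4%
GPT-5.1 | 60 | 47 | 26 | 22 | 0 | 30.7%
Gemini 2.5 Flash (Reasoning) | 55 | 54 | 43 | 0 | 0 | 30.6%
GPT-4o Mini (temp=1) | 76 | 72 | 0 | 0 | 0 | 29.6%
Writer: Palmyra X5 | 54 | 50 | 42 | 0 | 0 | 29.1%
Claude Sonnet 4.6 (Reasoning) | 74 | 71 | 0 | 0 | 0 | 29.0%
GPT-5.4 Nano (Reasoning) | 100 | 21 | 20 | 0 | 0 | 28.3%
Gemini 2.5 Pro | 68 | 60 | 0 | 0 | 0 | 25.7%
Qwen 3.6 Flash | 60 | 40 | 27 | 0 | 0 | 25.5%
MiniMax M2.7 | 71 | 56 | 0 | 0 | 0 | 25.5%
Gemini 3 Pro (Preview) | 44 | 41 | 38 | 0 | 0 | 24.8%
GPT-5.4 (Reasoning) | 64 | 32 | 26 | 0 | 0 | 24.4%
Stealth: Hunter Alpha | 74 | 47 | 0 | 0 | 0 | 24.1%
MoonshotAI: Kimi K2.5 | 69 | 51 | 0 | 0 | 0 | 24.0%
Z.AI GLM 4.7 | 57 | 56 | 0 | 0 | 0 | 22.7%
Gemini 3 Flash (Preview) | 63 | 49 | 0 | 0 | 0 | 22.3%
GPT-5.5 | 52 | 41 | 18 | 0 | 0 | 22.2%
Qwen 3.5 397B A17B | 62 | 31 | 17 | 0 | 0 | 22.1%
Qwen 3.5 Flash | 93 | 13 | 5 | 0 | 0 | 22.1%
Gemini 3.5 Flash (Reasoning) | 60 | 48 | 0 | 0 | 0 | 21.6%
Qwen3.7 Max | 73 | 34 | 0 | 0 | 0 | 21.5%
GPT-5.2 | 50 | 30 | 24 | 0 | 0 | 20.8%
GPT-5.4 (Reasoning, Low) | 53 | 51 | 0 | 0 | 0 | 20.8%
GPT-5 Mini | 63 | 37 | 0 | 0 | 0 | 20.0%
Claude Opus 4.5 | 100 | 0 | 0 | 0 | 0 | 20.0%
Claude Sonnet 4 | 100 | 0 | 0 | 0 | 0 | 20.0%
ByteDance Seed 2.0 Mini | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4.1 Mini | 100 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 70B | 100 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 Nemotron 70B | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral NeMO | 100 | 0 | 0 | 0 | 0 | 20.0%
Nemotron 3 Nano | 68 | 31 | 0 | 0 | 0 | 19.9%
Aion 2.0 | 50 | 49 | 0 | 0 | 0 | 19.7%
Z.AI GLM 4.7 Flash | 97 | 0 | 0 | 0 | 0 | 19.4%
DeepSeek-V2 Chat | 62 | 33 | 0 | 0 | 0 | 18.9%
Claude 3 Haiku | 94 | 0 | 0 | 0 | 0 | 18.9%
Grok 4.20 (Reasoning) | 56 | 36 | 0 | 0 | 0 | 18.5%
Claude Sonnet 4.6 | 89 | 0 | 0 | 0 | 0 | 17.9%
DeepSeek V3 (2025-03-24) | 89 | 0 | 0 | 0 | 0 | 17.9%
GPT-4.1 Nano | 89 | 0 | 0 | 0 | 0 | 17.9%
Claude Opus 4.6 | 45 | 44 | 0 | 0 | 0 | 17.8%
Gemini 3.5 Flash (Reasoning, Minimal) | 86 | 0 | 0 | 0 | 0 | 17.2%
Mistral Medium 3.1 | 43 | 43 | 0 | 0 | 0 | 17.2%
Grok 4.3 (Reasoning) | 83 | 0 | 0 | 0 | 0 | 16.7%
GPT-5.4 Nano (Reasoning, Low) | 23 | 22 | 19 | 17 | 0 | 16.3%
GPT-5.4 Mini | 81 | 0 | 0 | 0 | 0 | 16.1%
Claude Opus 4.7 (Reasoning) | 79 | 0 | 0 | 0 | 0 | 15.9%
GPT-4o, Aug. 6th (temp=0) | 76 | 0 | 0 | 0 | 0 | 15.2%
o4 Mini | 45 | 28 | 0 | 0 | 0 | 14.5%
Gemma 4 31B (Reasoning) | 71 | 0 | 0 | 0 | 0 | 14.3%
LFM2 24B | 67 | 0 | 0 | 0 | 0 | 13.3%
Ministral 8B | 60 | 0 | 0 | 0 | 0 | 11.9%
GPT-4o Mini (temp=0) | 57 | 0 | 0 | 0 | 0 | 11.4%
GPT-5 | 28 | 26 | 0 | 0 | 0 | 10.8%
Mistral Small 4 (Reasoning) | 51 | 0 | 0 | 0 | 0 | 10.2%
MoonshotAI: Kimi K2.6 | 49 | 0 | 0 | 0 | 0 | 9.7%
ByteDance Seed 1.6 | 48 | 0 | 0 | 0 | 0 | 9.6%
GPT-5.5 (Reasoning) | 26 | 21 | 0 | 0 | 0 | 9.3%
Qwen 3.5 27B | 45 | 0 | 0 | 0 | 0 | 9.0%
Mistral Large 2 | 44 | 0 | 0 | 0 | 0 | 8.8%
Grok 4.1 Fast | 40 | 0 | 0 | 0 | 0 | 8.1%
Qwen 3.6 27B | 26 | 14 | 0 | 0 | 0 | 8.0%
Grok 4 | 38 | 0 | 0 | 0 | 0 | 7.7%
Rocinante 12B | 33 | 0 | 0 | 0 | 0 | 6.7%
Grok 4.3 | 32 | 0 | 0 | 0 | 0 | 6.3%
Grok 4.20 (Beta, Reasoning) | 30 | 0 | 0 | 0 | 0 | 6.1%
o4 Mini High | 29 | 0 | 0 | 0 | 0 | 5.7%
GPT-5 Nano | 25 | 0 | 0 | 0 | 0 | 5.0%
Qwen3.6 Max Preview | 20 | 0 | 0 | 0 | 0 | 3.9%
Gemini 3.1 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 5 Turbo | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 122B | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 Plus (2026-04-20) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 26B (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Mini (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V4 Pro (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Sonnet 4.5 | 0 | 0 | 0 | 0 | 0 | 0.0%
Xiaomi MIMO v2.5 Pro | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4 Fast | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 9B | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Mini (Reasoning, Low) | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large 3 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
ByteDance Seed 2.0 Lite | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Super | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small 3.2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 2.5 72B | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 8B | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 3B | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3B | 0 | 0 | 0 | 0 | 0 | 0.0%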
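Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 76 | 74 | 70 | 83.9%
Mistral NeMO | 100 | 94 | 82 | 76 | 57 | 81.9%
Hermes 3 70B | 100 | 100 | 96 | 85 | 0 | 76.2%
Qwen 3 32B | 100 | 88 | 75 | 62 | 47 | 74.4%
Aion 2.0 | 100 | 100 | 86 | 41 | 38 | 73.0%
Llama 3.1 8B | 100 | 100 | 100 | 60 | 0 | 71.9%
GPT-5.4 (Reasoning) | 100 | 96 | 83 | 29 | 29 | 67.5%
WizardLM 2 8x22b | 100 | 84 | 76 | 37 | 35 | 66.3%
GPT-4.1 Nano | 100 | 100 | 65 | 61 | 0 | 65.2%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 66 | 57 | 0 | 64.7%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 70 | 33 | 0 | 60.6%
GPT-5.4 Nano | 98 | 89 | 42 | 40 | 30 | 59.8%
Gemini 3 Pro (Preview) | 100 | 100 | 93 | 0 | 0 | 58.7%
Claude Haiku 4.5 | 100 | 100 | 86 | 0 | 0 | 57.2%
DeepSeek V3.2 | 100 | 95 | 89 | 0 | 0 | 56.9%
Gemini 2.5 Flash Lite | 100 | 67 | 63 | 53 | 0 | 56.6%
GPT-4o, Aug. 6th (temp=0) | 82 | 77 | 67 | 57 | 0 | 56.5%
Rocinante 12B | 100 | 100 | 81 | 0 | 0 | 56.1%
Claude Sonnet 4 | 100 | 100 | 75 | 0 | 0 | 54.9%
Arcee AI: Trinity Mini | 100 | 100 | 74 | 0 | 0 | 54.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 66 | 38 | 36 | 32 | 54.3%
Ministral 3 14B | 100 | 63 | 58 | 48 | 0 | 53.7%
DeepSeek V4 Flash (Reasoning) | 99 | 68 | 62 | 39 | 0 | 53.5%
ByteDance Seed 1.6 Flash | 100 | 68 | 53 | 43 | 0 | 52.8%
Grok 4.20 (Beta) | 100 | 63 | 54 | 23 | 23 | 52.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 62 | 0 | 0 | 52.3%
Z.AI GLM 5 | 97 | 85 | 76 | 0 | 0 | 51.6%
DeepSeek V4 Pro (Reasoning) | 100 | 86 | 37 | 32 | 0 | 51.1%
MiniMax M2.5 | 100 | 100 | 55 | 0 | 0 | 51.0%
Gemma 3 12B | 100 | 52 | 52 | 50 | 0 | 50.6%
Claude Opus 4 | 100 | 94 | 54 | 0 | 0 | 49.6%
Z.AI GLM 5.1 | 100 | 99 | 45 | 0 | 0 | 48.9%
Z.AI GLM 4.7 Flash | 100 | 72 | 40 | 32 | 0 | 48.9%
DeepSeek V4 Flash | 100 | 83 | 60 | 0 | 0 | 48.7%
GPT-5.4 (Reasoning, Low) | 86 | 53 | 52 | 27 | 26 | 48.7%
GPT-5 | 100 | 65 | 27 | 25 | 19 | 47.3%
Cohere Command R+ (Aug. 2024) | 100 | 69 | 63 | 0 | 0 | 46.4%
Xiaomi MIMO v2.5 | 100 | 49 | 43 | 40 | 0 | 46.3%
Claude Opus 4.6 (Reasoning) | 100 | 72 | 31 | 28 | 0 | 46.2%
Claude Opus 4.6 | 100 | 94 | 33 | 0 | 0 | 45.3%
GPT-4o, May 13th (temp=1) | 100 | 69 | 54 | 0 | 0 | 44.8%
Gemini 3.1 Flash Lite | 100 | 69 | 52 | 0 | 0 | 44.3%
Claude Opus 4.7 (Reasoning) | 100 | 62 | 59 | 0 | 0 | 44.1%
Gemini 3.5 Flash (Reasoning) | 64 | 63 | 52 | 35 | 0 | 42.6%
GPT-5.5 (Reasoning, Low) | 66 | 46 | 34 | 34 | 32 | 42.4%
Claude Sonnet 4.6 | 74 | 72 | 62 | 0 | 0 | 41.5%
Stealth: Healer Alpha | 82 | 42 | 40 | 40 | 0 | 40.6%
Gemma 3 4B | 100 | 58 | 43 | 0 | 0 | 40.2%
GPT-4o Mini (temp=1) | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemma 3 27B | 100 | 100 | 0 | 0 | 0 | 40.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 0 | 0 | 0 | 40.0%
Writer: Palmyra X5 | 78 | 72 | 46 | 0 | 0 | 39.1%
MoonshotAI: Kimi K2.6 | 88 | 42 | 33 | 31 | 0 | 38.8%
DeepSeek V4 Pro | 100 | 88 | 0 | 0 | 0 | 37.5%
Gemini 2.5 Flash | 100 | 88 | 0 | 0 | 0 | 37.5%
GPT-4.1 Mini | 100 | 86 | 0 | 0 | 0 | 37.2%
DeepSeek V3 (2024-12-26) | 69 | 61 | 55 | 0 | 0 | 37.1%
Gemma 4 31B (Reasoning) | 49 | 46 | 45 | 45 | 0 | 37.1%
Claude Opus 4.5 | 100 | 83 | 0 | 0 | 0 | 36.5%
GPT-5.5 | 54 | 52 | 39 | 35 | 0 | 35.8%
GPT-5 Nano | 97 | 78 | 0 | 0 | 0 | 34.9%
Ministral 8B | 100 | 71 | 0 | 0 | 0 | 34.3%
Gemini 3 Flash (Preview) | 41 | 40 | 34 | 34 | 20 | 33.7%
Mistral Medium 3.1 | 60 | 57 | 50 | 0 | 0 | 33.3%
Grok 4.20 (Beta, Reasoning) | 86 | 49 | 31 | 0 | 0 | 33.3%
Stealth: Hunter Alpha | 47 | 32 | 29 | 28 | 28 | 32.8%
ByteDance Seed 2.0 Mini | 100 | 63 | 0 | 0 | 0 | 32.5%
Mistral Large 3 | 57 | 56 | 49 | 0 | 0 | 32.3%
GPT-5.5 (Reasoning) | 44 | 40 | 36 | 21 | 19 | 31.9%
GPT-4.1 | 100 | 60 | 0 | 0 | 0 | 31.9%
GPT-5.4 Mini | 49 | 47 | 32 | 29 | 0 | 31.1%
Grok 4.20 | 58 | 34 | 27 | 19 | 16 | 31.0%
Gemma 4 31B | 55 | 54 | 45 | 0 | 0 | 30.9%
o4 Mini High | 66 | 35 | 27 | 25 | 0 | 30.6%
GPT-5.4 | 72 | 51 | 29 | 0 | 0 | 30.4%
Gemma 4 26B (Reasoning) | 95 | 53 | 0 | 0 | 0 | 29.6%
DeepSeek V3.1 | 53 | 50 | 43 | 0 | 0 | 29.1%
Qwen 3.5 397B A17B | 100 | 26 | 19 | 0 | 0 | 29.0%
Gemini 3.1 Pro (Preview) | 38 | 38 | 37 | 30 | 0 | 28.7%
Hermes 3 405B | 79 | 64 | 0 | 0 | 0 | 28.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 41 | 0 | 0 | 0 | 28.3%
Claude Sonnet 4.5 | 46 | 46 | 46 | 0 | 0 | 27.8%
Qwen3.7 Max | 75 | 63 | 0 | 0 | 0 | 27.7%
Z.AI GLM 4.6 | 52 | 49 | 35 | 0 | 0 | 27.1%
Nemotron 3 Nano | 82 | 53 | 0 | 0 | 0 | 27.0%
Qwen 3.6 Flash | 86 | 35 | 12 | 0 | 0 | 26.7%
Grok 4 | 79 | 54 | 0 | 0 | 0 | 26.7%
Grok 4 Fast | 88 | 40 | 0 | 0 | 0 | 25.8%
Claude Opus 4.7 | 65 | 63 | 0 | 0 | 0 | 25.5%
Qwen 2.5 72B | 64 | 63 | 0 | 0 | 0 | 25.3%
Mistral Small 4 (Reasoning) | 82 | 43 | 0 | 0 | 0 | 25.1%
MiniMax M2.7 | 80 | 43 | 0 | 0 | 0 | 24.5%
Gemini 2.5 Flash (Reasoning) | 64 | 57 | 0 | 0 | 0 | 24.3%
GPT-5.1 | 75 | 44 | 0 | 0 | 0 | 23.8%
Xiaomi MIMO v2.5 Pro | 72 | 45 | 0 | 0 | 0 | 23.3%
Qwen 3.6 27B | 80 | 18 | 17 | 0 | 0 | 23.0%
GPT-5.2 | 57 | 55 | 0 | 0 | 0 | 22.5%
GPT-5.4 Mini (Reasoning, Low) | 31 | 29 | 26 | 24 | 0 | 22.0%
Z.AI GLM 4.7 | 40 | 37 | 33 | 0 | 0 | 21.9%
Grok 4.3 (Reasoning) | 60 | 26 | 20 | 0 | 0 | 21.3%
Gemma 4 26B | 52 | 52 | 0 | 0 | 0 | 20.6%
Z.AI GLM 4.5 Air | 51 | 50 | 0 | 0 | 0 | 20.1%
DeepSeek-V2 Chat | 60 | 41 | 0 | 0 | 0 | 20.1%
Mistral Large 2 | 100 | 0 | 0 | 0 | 0 | 20.0%
Ministral 3 8B | 100 | 0 | 0 | 0 | 0 | 20.0%
LFM2 24B | 100 | 0 | 0 | 0 | 0 | 20.0%
Claude 3.7 Sonnet | 98 | 0 | 0 | 0 | 0 | 19.6%
GPT-5.4 Mini (Reasoning) | 32 | 32 | 32 | 0 | 0 | 19.3%
Claude 3 Haiku | 96 | 0 | 0 | 0 | 0 | 19.2%
Qwen 3.5 Flash | 70 | 23 | 0 | 0 | 0 | 18.6%
MoonshotAI: Kimi K2.5 | 48 | 42 | 0 | 0 | 0 | 18.1%
GPT-5.4 Nano (Reasoning, Low) | 29 | 27 | 19 | 14 | 0 | 17.8%
GPT-5 Mini | 88 | 0 | 0 | 0 | 0 | 17.5%
Qwen 3.5 122B | 66 | 18 | 0 | 0 | 0 | 16.8%
Mistral Large | 71 | 0 | 0 | 0 | 0 | 14.3%
Gemini 3.5 Flash (Reasoning, Minimal) | 61 | 0 | 0 | 0 | 0 | 12.2%
Qwen 3.5 35B | 22 | 19 | 10 | 8 | 0 | 11.8%
GPT-OSS 120B | 59 | 0 | 0 | 0 | 0 | 11.8%
Gemini 3.1 Flash Lite (Reasoning) | 59 | 0 | 0 | 0 | 0 | 11.8%
Mistral Small 4 | 56 | 0 | 0 | 0 | 0 | 11.1%
Gemini 2.5 Pro | 54 | 0 | 0 | 0 | 0 | 10.9%
Gemini 3.1 Flash Lite (Preview) | 54 | 0 | 0 | 0 | 0 | 10.9%
Z.AI GLM 4.5 | 51 | 0 | 0 | 0 | 0 | 10.1%
Z.AI GLM 5 Turbo | 49 | 0 | 0 | 0 | 0 | 9.7%
Mistral Small Creative | 48 | 0 | 0 | 0 | 0 | 9.5%
GPT-4o, May 13th (temp=0) | 47 | 0 | 0 | 0 | 0 | 9.4%
Gemini 3 Flash (Preview, Reasoning) | 46 | 0 | 0 | 0 | 0 | 9.2%
GPT-5.4 Nano (Reasoning) | 22 | 18 | 0 | 0 | 0 | 7.9%
Grok 4.1 Fast | 32 | 0 | 0 | 0 | 0 | 6.4%
Qwen 3.5 Plus (2026-04-20) | 30 | 0 | 0 | 0 | 0 | 6.0%
o4 Mini | 29 | 0 | 0 | 0 | 0 | 5.8%
Grok 4.3 | 21 | 0 | 0 | 0 | 0 | 4.2%
Qwen3.6 Max Preview | 19 | 0 | 0 | 0 | 0 | 3.8%
Qwen 3.6 35B | 19 | 0 | 0 | 0 | 0 | 3.7%
Grok 4.20 (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 27B | 0 | 0 | 0 | 0 | 0 | 0.0%
ByteDance Seed 1.6 | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 9B | 0 | 0 | 0 | 0 | 0 | 0.0%
ByteDance Seed 2.0 Lite | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Super | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Sonnet | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2025-03-24) | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small 3.2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%
Llama 3.1 70B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 3B | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3B | 0 | 0 | 0 | 0 | 0 | 0.0%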
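Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Grok 4 Fast | 100 | 100 | 100 | 92 | 49 | 88.1%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 92 | 24 | 83.2%
Ministral 3 14B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5 Nano | 100 | 100 | 100 | 67 | 0 | 73.4%
Claude Sonnet 4 | 100 | 100 | 76 | 67 | 0 | 68.5%
Claude Opus 4 | 100 | 100 | 77 | 63 | 0 | 68.0%
Gemini 2.5 Flash Lite | 100 | 88 | 86 | 58 | 0 | 66.4%
Qwen 3.5 397B A17B | 100 | 85 | 76 | 46 | 23 | 66.0%
Z.AI GLM 4.7 Flash | 100 | 93 | 62 | 58 | 0 | 62.5%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 8B | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Small Creative | 100 | 100 | 87 | 0 | 0 | 57.4%
Hermes 3 405B | 100 | 100 | 81 | 0 | 0 | 56.1%
Qwen 3.6 35B | 100 | 63 | 51 | 39 | 27 | 55.9%
Qwen3.6 Max Preview | 97 | 74 | 52 | 28 | 23 | 54.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 74 | 0 | 0 | 54.7%
Z.AI GLM 5.1 | 100 | 100 | 62 | 0 | 0 | 52.3%
Mistral Large 3 | 100 | 83 | 69 | 0 | 0 | 50.6%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 46 | 0 | 0 | 49.3%
Qwen 3.5 Plus (2026-04-20) | 100 | 44 | 38 | 36 | 26 | 48.7%
MoonshotAI: Kimi K2.6 | 100 | 100 | 38 | 0 | 0 | 47.7%
Z.AI GLM 5 | 100 | 68 | 63 | 0 | 0 | 46.2%
ByteDance Seed 1.6 Flash | 62 | 60 | 56 | 50 | 0 | 45.4%
Grok 4.20 (Beta) | 100 | 60 | 31 | 31 | 0 | 44.5%
Z.AI GLM 4.6 | 100 | 67 | 52 | 0 | 0 | 43.6%
Qwen 3.5 122B | 100 | 68 | 50 | 0 | 0 | 43.5%
Claude 3.7 Sonnet | 100 | 100 | 0 | 0 | 0 | 40.0%
Z.AI GLM 4.5 Air | 100 | 100 | 0 | 0 | 0 | 40.0%
Ministral 3 8B | 100 | 100 | 0 | 0 | 0 | 40.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 0 | 0 | 0 | 40.0%
Qwen 3.5 35B | 100 | 99 | 0 | 0 | 0 | 39.9%
Gemini 2.5 Flash | 81 | 60 | 57 | 0 | 0 | 39.7%
Claude Opus 4.7 | 100 | 94 | 0 | 0 | 0 | 38.9%
Ministral 3B | 100 | 93 | 0 | 0 | 0 | 38.5%
GPT-5.4 | 56 | 52 | 30 | 28 | 26 | 38.3%
Stealth: Hunter Alpha | 89 | 59 | 43 | 0 | 0 | 38.2%
Z.AI GLM 5 Turbo | 100 | 89 | 0 | 0 | 0 | 37.9%
GPT-4.1 Nano | 100 | 88 | 0 | 0 | 0 | 37.5%
GPT-5.4 (Reasoning) | 58 | 55 | 38 | 35 | 0 | 37.5%
GPT-5.4 Mini | 78 | 39 | 35 | 33 | 0 | 37.0%
GPT-4.1 Mini | 93 | 91 | 0 | 0 | 0 | 36.7%
Ministral 3 3B | 60 | 44 | 42 | 31 | 0 | 35.5%
MiniMax M2.7 | 70 | 59 | 45 | 0 | 0 | 34.9%
Mistral Small 4 (Reasoning) | 100 | 74 | 0 | 0 | 0 | 34.7%
GPT-4o Mini (temp=1) | 86 | 86 | 0 | 0 | 0 | 34.5%
o4 Mini | 100 | 40 | 33 | 0 | 0 | 34.5%
Qwen 3.5 27B | 100 | 38 | 34 | 0 | 0 | 34.5%
Mistral Large 2 | 100 | 71 | 0 | 0 | 0 | 34.3%
Z.AI GLM 4.7 | 100 | 70 | 0 | 0 | 0 | 34.1%
Grok 4.20 (Beta, Reasoning) | 100 | 37 | 33 | 0 | 0 | 34.0%
Grok 4.1 Fast | 71 | 60 | 36 | 0 | 0 | 33.5%
Gemma 4 26B (Reasoning) | 89 | 78 | 0 | 0 | 0 | 33.5%
Mistral Medium 3.1 | 67 | 54 | 45 | 0 | 0 | 33.1%
DeepSeek V4 Pro (Reasoning) | 100 | 63 | 0 | 0 | 0 | 32.6%
GPT-4o, Aug. 6th (temp=0) | 85 | 78 | 0 | 0 | 0 | 32.6%
WizardLM 2 8x22b | 92 | 68 | 0 | 0 | 0 | 32.0%
DeepSeek-V2 Chat | 93 | 67 | 0 | 0 | 0 | 31.9%
Mistral Large | 100 | 57 | 0 | 0 | 0 | 31.4%
Qwen3 235B A22B Instruct 2507 | 100 | 57 | 0 | 0 | 0 | 31.4%
Writer: Palmyra X5 | 100 | 56 | 0 | 0 | 0 | 31.1%
Gemini 3.1 Flash Lite (Preview) | 100 | 54 | 0 | 0 | 0 | 30.8%
Qwen 3.5 Plus (2026-02-15) | 77 | 74 | 0 | 0 | 0 | 30.1%
Qwen 3.5 9B | 58 | 48 | 39 | 0 | 0 | 29.1%
Qwen 3.5 Flash | 77 | 63 | 0 | 0 | 0 | 28.0%
o4 Mini High | 100 | 37 | 0 | 0 | 0 | 27.5%
Qwen3.7 Max | 100 | 33 | 0 | 0 | 0 | 26.6%
Gemini 2.5 Flash Lite (Reasoning) | 69 | 63 | 0 | 0 | 0 | 26.4%
DeepSeek V4 Flash | 77 | 55 | 0 | 0 | 0 | 26.4%
GPT-4o, May 13th (temp=0) | 68 | 63 | 0 | 0 | 0 | 26.2%
Qwen 3.6 Flash | 49 | 46 | 17 | 16 | 0 | 25.6%
Gemma 3 27B | 68 | 56 | 0 | 0 | 0 | 24.7%
GPT-5.4 Nano | 70 | 26 | 21 | 0 | 0 | 23.4%
Mistral Small 4 | 68 | 47 | 0 | 0 | 0 | 23.0%
Claude Opus 4.6 | 58 | 50 | 0 | 0 | 0 | 21.6%
GPT-5.1 | 39 | 37 | 29 | 0 | 0 | 21.1%
ByteDance Seed 1.6 | 54 | 49 | 0 | 0 | 0 | 20.7%
GPT-5.4 Nano (Reasoning, Low) | 41 | 38 | 25 | 0 | 0 | 20.6%
Claude Opus 4.5 | 100 | 0 | 0 | 0 | 0 | 20.0%
Claude Sonnet 4.5 | 100 | 0 | 0 | 0 | 0 | 20.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 0 | 0 | 0 | 0 | 20.0%
Claude Haiku 4.5 | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=1) | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V3.1 | 100 | 0 | 0 | 0 | 0 | 20.0%
Qwen 3 32B | 100 | 0 | 0 | 0 | 0 | 20.0%
Inception Mercury | 100 | 0 | 0 | 0 | 0 | 20.0%
Arcee AI: Trinity Large (Preview) | 100 | 0 | 0 | 0 | 0 | 20.0%
Arcee AI: Trinity Mini | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral NeMO | 100 | 0 | 0 | 0 | 0 | 20.0%
Claude Opus 4.6 (Reasoning) | 53 | 44 | 0 | 0 | 0 | 19.4%
Gemini 3 Flash (Preview) | 54 | 41 | 0 | 0 | 0 | 19.1%
Gemini 3 Flash (Preview, Reasoning) | 95 | 0 | 0 | 0 | 0 | 19.0%
Grok 4.20 | 93 | 0 | 0 | 0 | 0 | 18.5%
Claude Sonnet 4.6 | 91 | 0 | 0 | 0 | 0 | 18.2%
LFM2 24B | 89 | 0 | 0 | 0 | 0 | 17.9%
Gemma 4 26B | 88 | 0 | 0 | 0 | 0 | 17.5%
Nemotron 3 Super | 88 | 0 | 0 | 0 | 0 | 17.5%
Llama 3.1 70B | 88 | 0 | 0 | 0 | 0 | 17.5%
GPT-5.4 (Reasoning, Low) | 58 | 29 | 0 | 0 | 0 | 17.3%
GPT-5.5 | 39 | 23 | 22 | 0 | 0 | 16.8%
DeepSeek V3 (2025-03-24) | 83 | 0 | 0 | 0 | 0 | 16.7%
Stealth: Healer Alpha | 82 | 0 | 0 | 0 | 0 | 16.4%
Claude 3 Haiku | 81 | 0 | 0 | 0 | 0 | 16.1%
Hermes 3 70B | 76 | 0 | 0 | 0 | 0 | 15.2%
GPT-5 Mini | 42 | 33 | 0 | 0 | 0 | 15.0%
GPT-5.5 (Reasoning) | 26 | 25 | 24 | 0 | 0 | 15.0%
Rocinante 12B | 75 | 0 | 0 | 0 | 0 | 14.9%
ByteDance Seed 2.0 Mini | 72 | 0 | 0 | 0 | 0 | 14.5%
DeepSeek V4 Flash (Reasoning) | 70 | 0 | 0 | 0 | 0 | 14.1%
Z.AI GLM 4.5 | 70 | 0 | 0 | 0 | 0 | 14.1%
Gemma 3 12B | 70 | 0 | 0 | 0 | 0 | 14.1%
Gemma 3 4B | 67 | 0 | 0 | 0 | 0 | 13.3%
DeepSeek V4 Pro | 66 | 0 | 0 | 0 | 0 | 13.2%
GPT-5.5 (Reasoning, Low) | 47 | 19 | 0 | 0 | 0 | 13.1%
GPT-5.4 Mini (Reasoning, Low) | 32 | 31 | 0 | 0 | 0 | 12.6%
DeepSeek V3.2 | 62 | 0 | 0 | 0 | 0 | 12.3%
Gemini 3.5 Flash (Reasoning) | 60 | 0 | 0 | 0 | 0 | 12.0%
Stealth: Aurora Alpha | 59 | 0 | 0 | 0 | 0 | 11.8%
MiniMax M2.5 | 58 | 0 | 0 | 0 | 0 | 11.6%
Grok 4 | 58 | 0 | 0 | 0 | 0 | 11.6%
Gemini 3.1 Flash Lite | 53 | 0 | 0 | 0 | 0 | 10.5%
Gemini 3 Pro (Preview) | 51 | 0 | 0 | 0 | 0 | 10.2%
Xiaomi MIMO v2.5 | 45 | 0 | 0 | 0 | 0 | 8.9%
Qwen 3.6 27B | 44 | 0 | 0 | 0 | 0 | 8.8%
GPT-5 | 29 | 15 | 0 | 0 | 0 | 8.7%
Xiaomi MIMO v2.5 Pro | 38 | 0 | 0 | 0 | 0 | 7.6%
Grok 4.3 (Reasoning) | 32 | 0 | 0 | 0 | 0 | 6.4%
GPT-5.4 Nano (Reasoning) | 21 | 0 | 0 | 0 | 0 | 4.2%
Gemini 3.1 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Sonnet 4.6 (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Opus 4.7 (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 31B (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Mini (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Aion 2.0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Pro | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 31B | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Sonnet | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2024-12-26) | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.3 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small 3.2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 2.5 72B | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 8B | 0 | 0 | 0 | 0 | 0 | 0.0%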
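Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 70 | 94.1%
ByteDance Seed 1.6 | 100 | 100 | 100 | 86 | 0 | 77.2%
Writer: Palmyra X5 | 100 | 100 | 67 | 61 | 0 | 65.5%
Grok 4.20 (Reasoning) | 100 | 100 | 88 | 22 | 0 | 62.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral NeMO | 100 | 85 | 66 | 48 | 0 | 59.6%
ByteDance Seed 2.0 Mini | 100 | 100 | 96 | 0 | 0 | 59.2%
Claude 3 Haiku | 100 | 100 | 91 | 0 | 0 | 58.2%
Hermes 3 70B | 100 | 100 | 86 | 0 | 0 | 57.2%
Gemini 3.1 Flash Lite (Preview) | 100 | 75 | 65 | 46 | 0 | 57.1%
Qwen3.6 Max Preview | 100 | 67 | 67 | 47 | 0 | 56.2%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 79 | 0 | 0 | 55.9%
Gemma 4 31B | 100 | 100 | 77 | 0 | 0 | 55.4%
Rocinante 12B | 100 | 100 | 77 | 0 | 0 | 55.4%
Qwen 3 32B | 100 | 76 | 67 | 29 | 0 | 54.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 98 | 72 | 0 | 0 | 54.1%
Claude Sonnet 4.6 | 100 | 93 | 74 | 0 | 0 | 53.2%
Claude 3.7 Sonnet | 100 | 76 | 76 | 0 | 0 | 50.3%
Stealth: Healer Alpha | 100 | 62 | 44 | 42 | 0 | 49.5%
Z.AI GLM 5 | 86 | 54 | 52 | 43 | 0 | 46.9%
Claude Sonnet 4.6 (Reasoning) | 79 | 76 | 69 | 0 | 0 | 44.9%
Gemma 3 27B | 86 | 79 | 57 | 0 | 0 | 44.5%
Gemini 2.5 Flash Lite (Reasoning) | 82 | 77 | 61 | 0 | 0 | 44.0%
Gemma 4 31B (Reasoning) | 72 | 69 | 68 | 0 | 0 | 41.9%
GPT-5.1 | 80 | 64 | 59 | 0 | 0 | 40.5%
Z.AI GLM 5.1 | 100 | 100 | 0 | 0 | 0 | 40.0%
Claude Sonnet 4 | 100 | 100 | 0 | 0 | 0 | 40.0%
Grok 4.20 (Beta) | 77 | 65 | 57 | 0 | 0 | 40.0%
GPT-5.4 | 100 | 66 | 30 | 0 | 0 | 39.3%
Gemini 3 Flash (Preview) | 100 | 96 | 0 | 0 | 0 | 39.2%
Claude Opus 4.5 | 100 | 51 | 45 | 0 | 0 | 39.0%
o4 Mini | 79 | 76 | 39 | 0 | 0 | 38.9%
GPT-4o, May 13th (temp=1) | 100 | 94 | 0 | 0 | 0 | 38.9%
Qwen 2.5 72B | 100 | 94 | 0 | 0 | 0 | 38.9%
DeepSeek V3 (2025-03-24) | 100 | 91 | 0 | 0 | 0 | 38.2%
GPT-5.4 (Reasoning) | 63 | 61 | 36 | 30 | 0 | 38.0%
DeepSeek V4 Pro | 100 | 90 | 0 | 0 | 0 | 38.0%
GPT-5 | 100 | 51 | 39 | 0 | 0 | 37.9%
Gemma 3 4B | 81 | 58 | 50 | 0 | 0 | 37.7%
GPT-4o, Aug. 6th (temp=1) | 96 | 89 | 0 | 0 | 0 | 37.1%
Gemini 3 Pro (Preview) | 100 | 47 | 37 | 0 | 0 | 36.9%
Qwen 3.5 Flash | 79 | 73 | 30 | 0 | 0 | 36.6%
GPT-5.4 Nano (Reasoning, Low) | 80 | 36 | 34 | 33 | 0 | 36.5%
Z.AI GLM 5 Turbo | 100 | 81 | 0 | 0 | 0 | 36.1%
Grok 4.20 | 83 | 56 | 36 | 0 | 0 | 35.1%
DeepSeek-V2 Chat | 83 | 79 | 0 | 0 | 0 | 32.5%
Grok 4 Fast | 100 | 61 | 0 | 0 | 0 | 32.2%
GPT-4o, May 13th (temp=0) | 94 | 66 | 0 | 0 | 0 | 32.0%
Qwen 3.6 35B | 100 | 22 | 22 | 14 | 0 | 31.8%
GPT-5.4 (Reasoning, Low) | 88 | 70 | 0 | 0 | 0 | 31.6%
Z.AI GLM 4.6 | 100 | 58 | 0 | 0 | 0 | 31.6%
GPT-5 Nano | 100 | 36 | 21 | 0 | 0 | 31.6%
Mistral Large | 89 | 64 | 0 | 0 | 0 | 30.7%
Claude Opus 4.6 | 53 | 51 | 42 | 0 | 0 | 29.1%
Claude Opus 4 | 85 | 60 | 0 | 0 | 0 | 29.0%
Stealth: Hunter Alpha | 78 | 67 | 0 | 0 | 0 | 29.0%
ByteDance Seed 1.6 Flash | 100 | 43 | 0 | 0 | 0 | 28.7%
Claude Opus 4.6 (Reasoning) | 88 | 42 | 0 | 0 | 0 | 25.9%
Qwen 3.5 397B A17B | 100 | 21 | 0 | 0 | 0 | 24.2%
GPT-5.4 Nano | 41 | 30 | 15 | 14 | 13 | 22.7%
Qwen3 235B A22B Instruct 2507 | 61 | 50 | 0 | 0 | 0 | 22.1%
Claude Haiku 4.5 | 57 | 53 | 0 | 0 | 0 | 21.9%
GPT-5.4 Nano (Reasoning) | 42 | 30 | 18 | 18 | 0 | 21.6%
DeepSeek V3.2 | 57 | 48 | 0 | 0 | 0 | 21.1%
Claude Opus 4.7 | 100 | 0 | 0 | 0 | 0 | 20.0%
MiniMax M2.5 | 100 | 0 | 0 | 0 | 0 | 20.0%
Grok 4 | 100 | 0 | 0 | 0 | 0 | 20.0%
Z.AI GLM 4.5 | 100 | 0 | 0 | 0 | 0 | 20.0%
Z.AI GLM 4.7 Flash | 100 | 0 | 0 | 0 | 0 | 20.0%
Nemotron 3 Super | 100 | 0 | 0 | 0 | 0 | 20.0%
Claude 3.5 Sonnet | 100 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V3.1 | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral Small 3.2 24B | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o Mini (temp=0) | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral Medium 3.1 | 100 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 Nemotron 70B | 100 | 0 | 0 | 0 | 0 | 20.0%
Arcee AI: Trinity Large (Preview) | 100 | 0 | 0 | 0 | 0 | 20.0%
WizardLM 2 8x22b | 100 | 0 | 0 | 0 | 0 | 20.0%
Gemini 2.5 Flash (Reasoning) | 54 | 45 | 0 | 0 | 0 | 20.0%
Qwen 3.6 Flash | 38 | 34 | 28 | 0 | 0 | 19.8%
Xiaomi MIMO v2.5 Pro | 94 | 0 | 0 | 0 | 0 | 18.9%
Hermes 3 405B | 91 | 0 | 0 | 0 | 0 | 18.2%
Ministral 3 14B | 89 | 0 | 0 | 0 | 0 | 17.9%
Ministral 3 3B | 89 | 0 | 0 | 0 | 0 | 17.9%
Qwen 3.5 122B | 39 | 25 | 25 | 0 | 0 | 17.8%
Llama 3.1 8B | 88 | 0 | 0 | 0 | 0 | 17.5%
GPT-5.5 | 65 | 23 | 0 | 0 | 0 | 17.5%
Inception Mercury 2 | 45 | 40 | 0 | 0 | 0 | 17.0%
Llama 3.1 70B | 83 | 0 | 0 | 0 | 0 | 16.7%
Mistral Small 4 (Reasoning) | 78 | 0 | 0 | 0 | 0 | 15.6%
GPT-5.4 Mini (Reasoning) | 46 | 30 | 0 | 0 | 0 | 15.2%
DeepSeek V4 Flash | 76 | 0 | 0 | 0 | 0 | 15.2%
Claude Opus 4.7 (Reasoning) | 75 | 0 | 0 | 0 | 0 | 14.9%
GPT-5.2 | 28 | 25 | 20 | 0 | 0 | 14.8%
GPT-4o Mini (temp=1) | 74 | 0 | 0 | 0 | 0 | 14.7%
Qwen 3.5 Plus (2026-04-20) | 48 | 22 | 0 | 0 | 0 | 14.0%
GPT-4.1 Nano | 69 | 0 | 0 | 0 | 0 | 13.9%
MiniMax M2.7 | 68 | 0 | 0 | 0 | 0 | 13.7%
Grok 4.20 (Beta, Reasoning) | 38 | 30 | 0 | 0 | 0 | 13.6%
MoonshotAI: Kimi K2.5 | 68 | 0 | 0 | 0 | 0 | 13.5%
Gemini 3.5 Flash (Reasoning, Minimal) | 68 | 0 | 0 | 0 | 0 | 13.5%
Mistral Small Creative | 68 | 0 | 0 | 0 | 0 | 13.5%
GPT-5.5 (Reasoning, Low) | 43 | 19 | 0 | 0 | 0 | 12.4%
Aion 2.0 | 61 | 0 | 0 | 0 | 0 | 12.2%
Gemini 2.5 Flash Lite | 60 | 0 | 0 | 0 | 0 | 12.0%
Gemma 3 12B | 60 | 0 | 0 | 0 | 0 | 12.0%
Qwen 3.5 35B | 59 | 0 | 0 | 0 | 0 | 11.7%
Gemini 2.5 Flash | 57 | 0 | 0 | 0 | 0 | 11.4%
DeepSeek V4 Flash (Reasoning) | 56 | 0 | 0 | 0 | 0 | 11.2%
Mistral Large 2 | 54 | 0 | 0 | 0 | 0 | 10.8%
Qwen 3.5 Plus (2026-02-15) | 53 | 0 | 0 | 0 | 0 | 10.5%
GPT-5.5 (Reasoning) | 27 | 26 | 0 | 0 | 0 | 10.4%
Z.AI GLM 4.7 | 52 | 0 | 0 | 0 | 0 | 10.4%
Mistral Large 3 | 52 | 0 | 0 | 0 | 0 | 10.4%
Grok 4.3 | 52 | 0 | 0 | 0 | 0 | 10.3%
Gemini 3.1 Flash Lite | 50 | 0 | 0 | 0 | 0 | 10.0%
Xiaomi MIMO v2.5 | 45 | 0 | 0 | 0 | 0 | 9.0%
Stealth: Aurora Alpha | 43 | 0 | 0 | 0 | 0 | 8.7%
Qwen 3.6 27B | 38 | 0 | 0 | 0 | 0 | 7.5%
GPT-5.4 Mini (Reasoning, Low) | 35 | 0 | 0 | 0 | 0 | 7.0%
Grok 4.3 (Reasoning) | 29 | 0 | 0 | 0 | 0 | 5.8%
GPT-5 Mini | 29 | 0 | 0 | 0 | 0 | 5.7%
GPT-5.4 Mini | 28 | 0 | 0 | 0 | 0 | 5.7%
Qwen 3.5 27B | 16 | 0 | 0 | 0 | 0 | 3.1%
Qwen3.7 Max | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3.1 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3.5 Flash (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
MoonshotAI: Kimi K2.6 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 26B (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview, Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
o4 Mini High | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.1 Fast | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Pro | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Sonnet 4.5 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 9B | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 26B | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2024-12-26) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Mini | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 4.5 Air | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small 4 | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 8B | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 8B | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3B | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%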
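Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Qwen 3.5 122B | 100 | 100 | 100 | 63 | 0 | 72.7%
Claude Sonnet 4 | 100 | 100 | 85 | 72 | 0 | 71.4%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 49 | 0 | 69.7%
Mistral Small Creative | 100 | 100 | 66 | 59 | 0 | 64.9%
Qwen 3.5 35B | 100 | 100 | 74 | 48 | 0 | 64.4%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 20 | 0 | 63.9%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 93 | 0 | 0 | 58.7%
Qwen 3.5 Flash | 100 | 100 | 93 | 0 | 0 | 58.6%
Qwen 3.6 35B | 100 | 100 | 72 | 0 | 0 | 54.4%
ByteDance Seed 1.6 Flash | 100 | 62 | 62 | 40 | 0 | 52.7%
Qwen 3.6 Flash | 100 | 97 | 44 | 19 | 0 | 52.1%
Gemini 3 Pro (Preview) | 100 | 80 | 40 | 38 | 0 | 51.7%
Rocinante 12B | 100 | 100 | 57 | 0 | 0 | 51.4%
ByteDance Seed 2.0 Lite | 88 | 64 | 57 | 45 | 0 | 50.8%
Qwen 3 32B | 100 | 96 | 53 | 0 | 0 | 49.8%
DeepSeek V4 Pro (Reasoning) | 100 | 95 | 52 | 0 | 0 | 49.4%
Stealth: Healer Alpha | 96 | 57 | 47 | 46 | 0 | 49.2%
Claude Opus 4 | 100 | 61 | 60 | 0 | 0 | 44.1%
Gemma 3 12B | 72 | 70 | 67 | 0 | 0 | 41.9%
GPT-5.4 Mini | 68 | 62 | 38 | 33 | 0 | 40.2%
Mistral Large 3 | 100 | 100 | 0 | 0 | 0 | 40.0%
Hermes 3 70B | 100 | 100 | 0 | 0 | 0 | 40.0%
Llama 3.1 8B | 100 | 100 | 0 | 0 | 0 | 40.0%
Claude Opus 4.7 | 100 | 98 | 0 | 0 | 0 | 39.6%
Cohere Command R+ (Aug. 2024) | 100 | 98 | 0 | 0 | 0 | 39.6%
Qwen 2.5 72B | 100 | 88 | 0 | 0 | 0 | 37.5%
GPT-5.5 (Reasoning, Low) | 100 | 44 | 23 | 21 | 0 | 37.5%
Z.AI GLM 5 Turbo | 76 | 54 | 51 | 0 | 0 | 36.2%
WizardLM 2 8x22b | 100 | 78 | 0 | 0 | 0 | 35.6%
Qwen 3.5 Plus (2026-04-20) | 57 | 55 | 46 | 16 | 0 | 34.9%
GPT-4o, Aug. 6th (temp=1) | 96 | 78 | 0 | 0 | 0 | 34.9%
DeepSeek V4 Flash (Reasoning) | 100 | 74 | 0 | 0 | 0 | 34.7%
ByteDance Seed 2.0 Mini | 100 | 74 | 0 | 0 | 0 | 34.7%
MiniMax M2.5 | 100 | 72 | 0 | 0 | 0 | 34.5%
Claude Sonnet 4.5 | 96 | 76 | 0 | 0 | 0 | 34.4%
o4 Mini | 89 | 46 | 35 | 0 | 0 | 34.2%
GPT-5 | 93 | 78 | 0 | 0 | 0 | 34.1%
Claude 3.7 Sonnet | 100 | 69 | 0 | 0 | 0 | 33.9%
Z.AI GLM 4.7 | 63 | 60 | 46 | 0 | 0 | 33.7%
Qwen 3.6 27B | 100 | 68 | 0 | 0 | 0 | 33.6%
GPT-4o Mini (temp=1) | 86 | 81 | 0 | 0 | 0 | 33.4%
Ministral 3B | 100 | 66 | 0 | 0 | 0 | 33.2%
Xiaomi MIMO v2.5 Pro | 72 | 59 | 32 | 0 | 0 | 32.6%
Claude 3 Haiku | 83 | 79 | 0 | 0 | 0 | 32.5%
Gemma 3 27B | 83 | 77 | 0 | 0 | 0 | 32.1%
Qwen 3.5 9B | 86 | 70 | 0 | 0 | 0 | 31.3%
Grok 4.20 | 55 | 53 | 25 | 23 | 0 | 31.2%
GPT-5.4 (Reasoning, Low) | 54 | 41 | 32 | 28 | 0 | 31.1%
Grok 4.20 (Beta, Reasoning) | 99 | 33 | 23 | 0 | 0 | 31.0%
ByteDance Seed 1.6 | 100 | 53 | 0 | 0 | 0 | 30.6%
DeepSeek V3.2 | 93 | 58 | 0 | 0 | 0 | 30.3%
GPT-5.4 (Reasoning) | 62 | 57 | 32 | 0 | 0 | 30.3%
MoonshotAI: Kimi K2.6 | 100 | 50 | 0 | 0 | 0 | 29.9%
Mistral Small 4 | 100 | 46 | 0 | 0 | 0 | 29.3%
GPT-5 Nano | 100 | 45 | 0 | 0 | 0 | 28.9%
DeepSeek V4 Flash | 81 | 63 | 0 | 0 | 0 | 28.8%
Z.AI GLM 4.7 Flash | 75 | 61 | 0 | 0 | 0 | 27.1%
GPT-5.4 | 79 | 26 | 25 | 0 | 0 | 26.2%
Z.AI GLM 5 | 63 | 63 | 0 | 0 | 0 | 25.0%
o4 Mini High | 90 | 32 | 0 | 0 | 0 | 24.4%
GPT-5.5 | 45 | 34 | 25 | 16 | 0 | 24.1%
Grok 4.20 (Beta) | 50 | 37 | 29 | 0 | 0 | 23.2%
Aion 2.0 | 70 | 45 | 0 | 0 | 0 | 23.2%
Mistral Small 4 (Reasoning) | 44 | 35 | 34 | 0 | 0 | 22.6%
Gemini 2.5 Pro | 63 | 49 | 0 | 0 | 0 | 22.2%
GPT-5.5 (Reasoning) | 69 | 41 | 0 | 0 | 0 | 21.9%
MoonshotAI: Kimi K2.5 | 61 | 48 | 0 | 0 | 0 | 21.8%
Gemini 2.5 Flash (Reasoning) | 63 | 44 | 0 | 0 | 0 | 21.3%
Gemini 2.5 Flash | 53 | 50 | 0 | 0 | 0 | 20.5%
Claude Opus 4.6 | 55 | 46 | 0 | 0 | 0 | 20.2%
Claude Opus 4.6 (Reasoning) | 52 | 49 | 0 | 0 | 0 | 20.1%
Qwen3.7 Max | 100 | 0 | 0 | 0 | 0 | 20.0%
Qwen 3.5 27B | 100 | 0 | 0 | 0 | 0 | 20.0%
MiniMax M2.7 | 100 | 0 | 0 | 0 | 0 | 20.0%
Stealth: Hunter Alpha | 100 | 0 | 0 | 0 | 0 | 20.0%
Gemini 3.1 Flash Lite | 100 | 0 | 0 | 0 | 0 | 20.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 0 | 0 | 0 | 0 | 20.0%
Nemotron 3 Super | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4.1 Mini | 100 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V4 Pro | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral Large | 100 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 70B | 100 | 0 | 0 | 0 | 0 | 20.0%
Ministral 3 8B | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral NeMO | 100 | 0 | 0 | 0 | 0 | 20.0%
Z.AI GLM 5.1 | 56 | 41 | 0 | 0 | 0 | 19.5%
Arcee AI: Trinity Mini | 96 | 0 | 0 | 0 | 0 | 19.2%
GPT-5.4 Nano | 43 | 35 | 17 | 0 | 0 | 19.2%
Gemma 4 26B (Reasoning) | 94 | 0 | 0 | 0 | 0 | 18.9%
GPT-4o, May 13th (temp=1) | 94 | 0 | 0 | 0 | 0 | 18.9%
Grok 4 Fast | 63 | 31 | 0 | 0 | 0 | 18.7%
DeepSeek V3 (2024-12-26) | 86 | 0 | 0 | 0 | 0 | 17.2%
Grok 4.20 (Reasoning) | 79 | 0 | 0 | 0 | 0 | 15.9%
Gemini 3.5 Flash (Reasoning, Minimal) | 78 | 0 | 0 | 0 | 0 | 15.6%
Z.AI GLM 4.5 Air | 78 | 0 | 0 | 0 | 0 | 15.6%
GPT-5.4 Mini (Reasoning, Low) | 39 | 37 | 0 | 0 | 0 | 15.1%
GPT-5.4 Mini (Reasoning) | 39 | 36 | 0 | 0 | 0 | 15.1%
Claude Haiku 4.5 | 75 | 0 | 0 | 0 | 0 | 14.9%
Z.AI GLM 4.5 | 72 | 0 | 0 | 0 | 0 | 14.5%
Arcee AI: Trinity Large (Preview) | 65 | 0 | 0 | 0 | 0 | 13.0%
Qwen 3.5 Plus (2026-02-15) | 64 | 0 | 0 | 0 | 0 | 12.8%
Writer: Palmyra X5 | 64 | 0 | 0 | 0 | 0 | 12.8%
Mistral Large 2 | 63 | 0 | 0 | 0 | 0 | 12.5%
GPT-5.4 Nano (Reasoning, Low) | 31 | 30 | 0 | 0 | 0 | 12.3%
Gemini 3 Flash (Preview, Reasoning) | 56 | 0 | 0 | 0 | 0 | 11.1%
DeepSeek V3.1 | 54 | 0 | 0 | 0 | 0 | 10.9%
GPT-5 Mini | 39 | 0 | 0 | 0 | 0 | 7.8%
Qwen3.6 Max Preview | 18 | 17 | 0 | 0 | 0 | 7.1%
GPT-5.1 | 32 | 0 | 0 | 0 | 0 | 6.4%
GPT-5.2 | 21 | 0 | 0 | 0 | 0 | 4.3%
Gemini 3.1 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3.5 Flash (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Sonnet 4.6 (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.3 (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Opus 4.7 (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 31B (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Sonnet 4.6 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Opus 4.5 | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.1 Fast | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 4.6 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 31B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 26B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Xiaomi MIMO v2.5 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek-V2 Chat | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Sonnet | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Hermes 3 405B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2025-03-24) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Nano (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash Lite | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen3 235B A22B Instruct 2507 | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.3 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small 3.2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Medium 3.1 | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 14B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 3B | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 8B | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%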
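Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Rocinante 12B | 100 | 100 | 100 | 91 | 81 | 94.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 88 | 85 | 66 | 87.7%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 64 | 0 | 72.8%
Grok 4.20 | 100 | 79 | 69 | 58 | 51 | 71.5%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 79 | 74 | 0 | 70.6%
Gemma 4 26B | 100 | 91 | 91 | 70 | 0 | 70.4%
GPT-4.1 Nano | 100 | 100 | 77 | 71 | 0 | 69.7%
Stealth: Healer Alpha | 93 | 79 | 71 | 46 | 43 | 66.5%
Gemini 2.5 Flash Lite | 100 | 69 | 69 | 49 | 35 | 64.6%
Hermes 3 70B | 100 | 83 | 81 | 51 | 0 | 62.9%
Gemma 3 4B | 100 | 100 | 54 | 54 | 0 | 61.7%
DeepSeek V3.2 | 100 | 81 | 74 | 51 | 0 | 61.1%
MiniMax M2.5 | 100 | 98 | 57 | 44 | 0 | 60.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 88 | 0 | 0 | 57.5%
Z.AI GLM 5 Turbo | 96 | 75 | 60 | 50 | 0 | 56.1%
GPT-5.4 (Reasoning) | 90 | 80 | 51 | 30 | 24 | 54.9%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 68 | 0 | 0 | 53.7%
Claude Haiku 4.5 | 97 | 63 | 57 | 50 | 0 | 53.4%
GPT-5 Nano | 100 | 71 | 51 | 39 | 0 | 52.1%
Grok 4.20 (Beta) | 100 | 71 | 37 | 27 | 24 | 51.6%
Ministral 3 3B | 100 | 83 | 75 | 0 | 0 | 51.6%
Gemma 4 31B (Reasoning) | 100 | 99 | 57 | 0 | 0 | 51.2%
Aion 2.0 | 100 | 57 | 50 | 48 | 0 | 51.0%
Qwen 3.5 Plus (2026-02-15) | 99 | 76 | 45 | 34 | 0 | 50.8%
Claude Opus 4.6 | 100 | 72 | 42 | 38 | 0 | 50.5%
Gemma 4 31B | 93 | 54 | 53 | 49 | 0 | 49.9%
WizardLM 2 8x22b | 100 | 62 | 47 | 40 | 0 | 49.7%
Claude Opus 4.5 | 92 | 73 | 43 | 40 | 0 | 49.6%
Grok 4.20 (Reasoning) | 93 | 77 | 39 | 39 | 0 | 49.5%
Z.AI GLM 5.1 | 100 | 84 | 57 | 0 | 0 | 48.2%
Gemini 3 Flash (Preview, Reasoning) | 100 | 74 | 65 | 0 | 0 | 47.8%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 36 | 0 | 0 | 47.2%
Grok 4.3 | 100 | 54 | 46 | 34 | 0 | 46.9%
Qwen3 235B A22B Instruct 2507 | 100 | 97 | 37 | 0 | 0 | 46.9%
Z.AI GLM 4.7 | 79 | 75 | 44 | 36 | 0 | 46.9%
Gemini 3 Pro (Preview) | 100 | 89 | 44 | 0 | 0 | 46.5%
Qwen 3.5 35B | 100 | 68 | 54 | 0 | 0 | 44.3%
GPT-4o, Aug. 6th (temp=1) | 88 | 76 | 57 | 0 | 0 | 44.1%
GPT-5.4 (Reasoning, Low) | 100 | 59 | 55 | 0 | 0 | 42.8%
Z.AI GLM 4.6 | 100 | 66 | 45 | 0 | 0 | 42.2%
Gemma 4 26B (Reasoning) | 100 | 69 | 40 | 0 | 0 | 41.8%
GPT-4o Mini (temp=1) | 74 | 72 | 62 | 0 | 0 | 41.5%
Qwen 3.5 397B A17B | 80 | 50 | 29 | 27 | 18 | 40.9%
ByteDance Seed 1.6 | 100 | 100 | 0 | 0 | 0 | 40.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 0 | 0 | 0 | 40.0%
DeepSeek V3.1 | 100 | 100 | 0 | 0 | 0 | 40.0%
GPT-4o, Aug. 6th (temp=0) | 68 | 61 | 60 | 0 | 0 | 37.8%
Z.AI GLM 5 | 87 | 61 | 41 | 0 | 0 | 37.8%
Claude 3 Haiku | 98 | 89 | 0 | 0 | 0 | 37.5%
Qwen 3.5 122B | 100 | 83 | 0 | 0 | 0 | 36.6%
GPT-5.4 Mini (Reasoning) | 98 | 50 | 32 | 0 | 0 | 35.9%
Claude Sonnet 4.6 | 100 | 79 | 0 | 0 | 0 | 35.9%
Stealth: Hunter Alpha | 91 | 56 | 30 | 0 | 0 | 35.5%
GPT-5.4 Nano (Reasoning, Low) | 99 | 56 | 19 | 0 | 0 | 34.9%
DeepSeek-V2 Chat | 100 | 74 | 0 | 0 | 0 | 34.7%
Xiaomi MIMO v2.5 | 100 | 39 | 34 | 0 | 0 | 34.7%
Gemini 3.1 Flash Lite | 100 | 70 | 0 | 0 | 0 | 34.1%
Claude Opus 4.6 (Reasoning) | 82 | 47 | 41 | 0 | 0 | 33.9%
GPT-4o, May 13th (temp=1) | 100 | 68 | 0 | 0 | 0 | 33.5%
MoonshotAI: Kimi K2.6 | 74 | 54 | 38 | 0 | 0 | 33.4%
GPT-5.4 | 76 | 59 | 28 | 0 | 0 | 32.7%
Claude 3.7 Sonnet | 100 | 63 | 0 | 0 | 0 | 32.7%
Mistral Large | 81 | 81 | 0 | 0 | 0 | 32.3%
ByteDance Seed 2.0 Lite | 93 | 68 | 0 | 0 | 0 | 32.2%
Qwen 2.5 72B | 100 | 56 | 0 | 0 | 0 | 31.2%
Mistral NeMO | 100 | 55 | 0 | 0 | 0 | 31.0%
GPT-5 Mini | 100 | 28 | 26 | 0 | 0 | 30.9%
GPT-5.5 (Reasoning, Low) | 82 | 50 | 22 | 0 | 0 | 30.8%
GPT-4.1 Mini | 88 | 65 | 0 | 0 | 0 | 30.5%
DeepSeek V3 (2024-12-26) | 60 | 52 | 38 | 0 | 0 | 30.0%
DeepSeek V4 Flash (Reasoning) | 88 | 60 | 0 | 0 | 0 | 29.7%
Arcee AI: Trinity Mini | 81 | 68 | 0 | 0 | 0 | 29.6%
Qwen 3.5 Plus (2026-04-20) | 42 | 40 | 40 | 21 | 0 | 28.7%
Writer: Palmyra X5 | 89 | 53 | 0 | 0 | 0 | 28.5%
Z.AI GLM 4.5 | 72 | 69 | 0 | 0 | 0 | 28.4%
Claude Sonnet 4.5 | 82 | 59 | 0 | 0 | 0 | 28.2%
o4 Mini | 58 | 52 | 27 | 0 | 0 | 27.5%
GPT-5.4 Nano | 53 | 48 | 18 | 16 | 0 | 26.9%
GPT-5.4 Mini | 100 | 33 | 0 | 0 | 0 | 26.7%
Qwen 3.5 9B | 94 | 33 | 4 | 0 | 0 | 26.3%
Qwen 3 32B | 76 | 56 | 0 | 0 | 0 | 26.3%
GPT-4o Mini (temp=0) | 68 | 63 | 0 | 0 | 0 | 26.2%
GPT-5.4 Mini (Reasoning, Low) | 70 | 29 | 29 | 0 | 0 | 25.8%
Qwen 3.5 Flash | 100 | 29 | 0 | 0 | 0 | 25.7%
GPT-5.1 | 42 | 33 | 30 | 23 | 0 | 25.6%
ByteDance Seed 1.6 Flash | 95 | 28 | 0 | 0 | 0 | 24.7%
GPT-4o, May 13th (temp=0) | 78 | 43 | 0 | 0 | 0 | 24.2%
Grok 4 Fast | 49 | 37 | 34 | 0 | 0 | 24.1%
GPT-4.1 | 64 | 53 | 0 | 0 | 0 | 23.5%
Grok 4.20 (Beta, Reasoning) | 68 | 49 | 0 | 0 | 0 | 23.4%
Gemini 2.5 Pro | 62 | 54 | 0 | 0 | 0 | 23.2%
Z.AI GLM 4.5 Air | 79 | 34 | 0 | 0 | 0 | 22.6%
Gemma 3 27B | 54 | 48 | 0 | 0 | 0 | 20.4%
Claude Sonnet 4.6 (Reasoning) | 100 | 0 | 0 | 0 | 0 | 20.0%
Qwen 3.5 27B | 100 | 0 | 0 | 0 | 0 | 20.0%
Claude Sonnet 4 | 100 | 0 | 0 | 0 | 0 | 20.0%
Claude Opus 4 | 100 | 0 | 0 | 0 | 0 | 20.0%
Hermes 3 405B | 100 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 70B | 100 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 Nemotron 70B | 100 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 8B | 100 | 0 | 0 | 0 | 0 | 20.0%
Qwen 3.6 27B | 74 | 23 | 0 | 0 | 0 | 19.5%
Nemotron 3 Super | 96 | 0 | 0 | 0 | 0 | 19.2%
Xiaomi MIMO v2.5 Pro | 47 | 44 | 0 | 0 | 0 | 18.2%
Arcee AI: Trinity Large (Preview) | 85 | 0 | 0 | 0 | 0 | 16.9%
Mistral Small 4 (Reasoning) | 79 | 0 | 0 | 0 | 0 | 15.7%
Qwen3.7 Max | 40 | 37 | 0 | 0 | 0 | 15.3%
MoonshotAI: Kimi K2.5 | 76 | 0 | 0 | 0 | 0 | 15.2%
GPT-OSS 120B | 75 | 0 | 0 | 0 | 0 | 15.0%
Grok 4 | 75 | 0 | 0 | 0 | 0 | 14.9%
o4 Mini High | 69 | 0 | 0 | 0 | 0 | 13.8%
Ministral 3 14B | 68 | 0 | 0 | 0 | 0 | 13.5%
Claude Opus 4.7 | 66 | 0 | 0 | 0 | 0 | 13.2%
Gemini 3.5 Flash (Reasoning, Minimal) | 63 | 0 | 0 | 0 | 0 | 12.7%
Mistral Large 3 | 63 | 0 | 0 | 0 | 0 | 12.7%
Gemini 2.5 Flash Lite (Reasoning) | 62 | 0 | 0 | 0 | 0 | 12.3%
Gemini 2.5 Flash | 61 | 0 | 0 | 0 | 0 | 12.2%
Qwen3.6 Max Preview | 56 | 0 | 0 | 0 | 0 | 11.2%
Claude Opus 4.7 (Reasoning) | 56 | 0 | 0 | 0 | 0 | 11.1%
Mistral Medium 3.1 | 52 | 0 | 0 | 0 | 0 | 10.4%
Grok 4.3 (Reasoning) | 45 | 0 | 0 | 0 | 0 | 8.9%
Gemini 3.1 Pro (Preview) | 41 | 0 | 0 | 0 | 0 | 8.2%
GPT-5.4 Nano (Reasoning) | 40 | 0 | 0 | 0 | 0 | 8.1%
GPT-5.5 (Reasoning) | 23 | 18 | 0 | 0 | 0 | 8.1%
Qwen 3.6 Flash | 38 | 0 | 0 | 0 | 0 | 7.6%
GPT-5.5 | 21 | 16 | 0 | 0 | 0 | 7.5%
MiniMax M2.7 | 34 | 0 | 0 | 0 | 0 | 6.8%
Qwen 3.6 35B | 28 | 0 | 0 | 0 | 0 | 5.6%
GPT-5.2 | 27 | 0 | 0 | 0 | 0 | 5.5%
GPT-5 | 27 | 0 | 0 | 0 | 0 | 5.3%
Gemini 3 Flash (Preview) | 22 | 0 | 0 | 0 | 0 | 4.3%
Grok 4.1 Fast | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Sonnet | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V4 Pro | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V4 Flash | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2025-03-24) | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small 3.2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small 4 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small Creative | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 8B | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 8B | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3B | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%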