Subject-first sentence starts

Test: Bad Writing Habits

Avg. Score: 35.3%
Scenarios: 18

Overall Performance

| Rank | Model | Score | Avg. Cost | Avg. Time | Stability |
| --- | --- | --- | --- | --- | --- |
| 1 | Writer: Palmyra X5 | 83.2% | $0.011 | 22.0s | 50% |
| 2 | Qwen3 235B A22B Instruct 2507 | 81.0% | $0.0011 | 59.2s | 48% |
| 3 | Rocinante 12B | 77.8% | $0.0014 | 38.4s | 38% |
| 4 | Llama 3.1 8B | 71.0% | $0.0003 | 1.3m | 29% |
| 5 | GPT-5.4 | 70.2% | $0.049 | 1.4m | 39% |
| 6 | Mistral Small 4 (Reasoning) | 60.6% | $0.0022 | 30.2s | 25% |
| 7 | Mistral Small 4 | 56.1% | $0.0014 | 18.2s | 24% |
| 8 | Ministral 3 14B | 50.5% | $0.0007 | 11.7s | 25% |
| 9 | GPT-5.4 (Reasoning, Low) | 67.7% | $0.055 | 1.4m | 36% |
| 10 | Claude Sonnet 4.5 | 63.6% | $0.035 | 38.1s | 27% |
| 11 | Grok 4.1 Fast | 54.6% | $0.0018 | 37.8s | 24% |
| 12 | Mistral Small Creative | 49.3% | $0.0007 | 9.1s | 22% |
| 13 | Z.AI GLM 5 | 58.9% | $0.0084 | 1.2m | 26% |
| 14 | GPT-5.4 Mini | 50.0% | $0.015 | 16.8s | 25% |
| 15 | Claude Haiku 4.5 | 52.5% | $0.011 | 21.6s | 21% |
| 16 | Z.AI GLM 5 Turbo | 49.6% | $0.0081 | 33.2s | 22% |
| 17 | GPT-5.4 Mini (Reasoning, Low) | 47.8% | $0.015 | 16.8s | 23% |
| 18 | Grok 4.20 | 45.9% | $0.0093 | 45.7s | 26% |
| 19 | Hermes 3 70B | 57.9% | $0.0010 | 1.2m | 20% |
| 20 | Mistral Medium 3.1 | 43.3% | $0.0048 | 36.5s | 25% |
| 21 | MiniMax M2.5 | 52.7% | $0.0034 | 1.3m | 23% |
| 22 | Grok 4 Fast | 45.4% | $0.0017 | 24.1s | 20% |
| 23 | Hermes 3 405B | 53.5% | $0.0032 | 53.2s | 19% |
| 24 | DeepSeek V4 Pro | 51.0% | $0.0048 | 1.3m | 23% |
| 25 | Grok 4.20 (Beta) | 38.9% | $0.018 | 15.8s | 25% |
| 26 | Claude Opus 4.7 | 60.1% | $0.069 | 30.4s | 26% |
| 27 | Claude Sonnet 4 | 53.0% | $0.032 | 43.7s | 22% |
| 28 | Claude Sonnet 4.6 | 53.3% | $0.031 | 39.3s | 21% |
| 29 | Llama 3.1 Nemotron 70B | 45.5% | $0.0038 | 31.7s | 18% |
| 30 | Arcee AI: Trinity Large (Preview) | 44.0% | $0.0000 | 43.6s | 19% |
| 31 | Llama 3.1 70B | 44.4% | $0.0015 | 29.4s | 17% |
| 32 | Claude Opus 4.5 | 58.8% | $0.070 | 53.4s | 27% |
| 33 | Claude 3 Haiku | 45.6% | $0.0025 | 14.9s | 14% |
| 34 | DeepSeek V4 Flash (Reasoning) | 41.5% | $0.0007 | 31.1s | 18% |
| 35 | Gemini 2.5 Flash Lite | 34.5% | $0.0009 | 9.5s | 18% |
| 36 | DeepSeek V4 Flash | 42.5% | $0.0006 | 31.6s | 16% |
| 37 | GPT-5.4 Mini (Reasoning) | 45.0% | $0.022 | 28.1s | 19% |
| 38 | ByteDance Seed 1.6 Flash | 38.0% | $0.0013 | 27.3s | 18% |
| 39 | Z.AI GLM 5.1 | 52.9% | $0.014 | 1.5m | 20% |
| 40 | Grok 4.20 (Reasoning) | 45.7% | $0.018 | 1.5m | 25% |
| 41 | Ministral 8B | 37.6% | $0.0004 | 10.4s | 15% |
| 42 | GPT-4o, Aug. 6th (temp=1) | 44.8% | $0.018 | 24.4s | 17% |
| 43 | MiniMax M2.7 | 46.3% | $0.0040 | 1.1m | 17% |
| 44 | Stealth: Healer Alpha | 34.3% | $0.0000 | 23.7s | 18% |
| 45 | Xiaomi MIMO v2.5 Pro | 41.1% | $0.0085 | 53.5s | 19% |
| 46 | DeepSeek V3 (2025-03-24) | 42.0% | $0.0014 | 39.4s | 14% |
| 47 | GPT-4.1 | 39.4% | $0.018 | 44.7s | 21% |
| 48 | Ministral 3 8B | 38.5% | $0.0008 | 19.6s | 14% |
| 49 | Gemma 3 12B | 36.6% | $0.0004 | 41.3s | 17% |
| 50 | Mistral Large 2 | 40.1% | $0.013 | 29.4s | 16% |
| 51 | Mistral Large 3 | 36.9% | $0.0033 | 30.3s | 16% |
| 52 | Stealth: Hunter Alpha | 40.0% | $0.0000 | 55.0s | 17% |
| 53 | Gemini 2.5 Flash Lite (Reasoning) | 33.3% | $0.0028 | 30.8s | 17% |
| 54 | GPT-4o Mini (temp=1) | 38.9% | $0.0012 | 34.8s | 13% |
| 55 | Xiaomi MIMO v2.5 | 36.7% | $0.0054 | 31.8s | 15% |
| 56 | Gemma 3 27B | 39.8% | $0.0006 | 52.6s | 14% |
| 57 | Claude Sonnet 4.6 (Reasoning) | 53.7% | $0.060 | 1.2m | 22% |
| 58 | Claude Opus 4.6 | 56.4% | $0.078 | 1.2m | 25% |
| 59 | Qwen 3.6 Flash | 38.1% | $0.010 | 41.4s | 16% |
| 60 | GPT-4.1 Nano | 34.2% | $0.0007 | 13.3s | 12% |
| 61 | Claude Opus 4.6 (Reasoning) | 58.5% | $0.088 | 1.4m | 26% |
| 62 | Qwen 3.6 35B | 37.5% | $0.0083 | 1.0m | 17% |
| 63 | Cohere Command R+ (Aug. 2024) | 45.9% | $0.020 | 52.5s | 13% |
| 64 | Claude Opus 4.7 (Reasoning) | 52.7% | $0.076 | 32.0s | 20% |
| 65 | Mistral Large | 37.3% | $0.014 | 30.9s | 13% |
| 66 | GPT-5.4 Nano | 28.8% | $0.0057 | 26.3s | 16% |
| 67 | Gemini 2.5 Flash | 28.1% | $0.0052 | 10.6s | 13% |
| 68 | GPT-5.4 (Reasoning) | 65.0% | $0.089 | 2.6m | 29% |
| 69 | LFM2 24B | 28.6% | $0.0002 | 28.4s | 13% |
| 70 | GPT-5.4 Nano (Reasoning, Low) | 26.5% | $0.0055 | 20.6s | 15% |
| 71 | GPT-5.1 | 51.4% | $0.054 | 1.8m | 22% |
| 72 | Z.AI GLM 4.6 | 28.9% | $0.0065 | 51.5s | 17% |
| 73 | Gemma 3 4B | 29.4% | $0.0002 | 20.0s | 11% |
| 74 | GPT-5.4 Nano (Reasoning) | 26.8% | $0.0061 | 24.5s | 15% |
| 75 | Mistral NeMO | 28.6% | $0.0005 | 10.1s | 10% |
| 76 | Grok 4.20 (Beta, Reasoning) | 34.7% | $0.039 | 34.0s | 18% |
| 77 | Qwen 3 32B | 32.1% | $0.0015 | 54.6s | 13% |
| 78 | Gemini 2.5 Flash (Reasoning) | 32.2% | $0.011 | 21.5s | 11% |
| 79 | WizardLM 2 8x22b | 38.5% | $0.0026 | 1.8m | 16% |
| 80 | DeepSeek V4 Pro (Reasoning) | 46.7% | $0.015 | 3.1m | 23% |
| 81 | GPT-4o, May 13th (temp=1) | 30.5% | $0.033 | 14.4s | 15% |
| 82 | Gemini 3.1 Flash Lite (Reasoning) | 23.5% | $0.0030 | 11.9s | 11% |
| 83 | DeepSeek V3.2 | 33.7% | $0.0014 | 1.9m | 17% |
| 84 | Ministral 3B | 23.3% | $0.0001 | 8.1s | 10% |
| 85 | Gemini 3.5 Flash (Reasoning, Minimal) | 30.7% | $0.018 | 12.0s | 10% |
| 86 | o4 Mini | 25.8% | $0.015 | 25.7s | 14% |
| 87 | Z.AI GLM 4.5 | 29.0% | $0.0051 | 42.1s | 12% |
| 88 | GPT-4.1 Mini | 24.8% | $0.0027 | 19.0s | 10% |
| 89 | Aion 2.0 | 28.8% | $0.0064 | 1.3m | 15% |
| 90 | Z.AI GLM 4.5 Air | 28.2% | $0.0029 | 58.2s | 12% |
| 91 | Claude 3.7 Sonnet | 32.1% | $0.042 | 46.7s | 17% |
| 92 | Gemini 2.5 Pro | 29.5% | $0.036 | 36.2s | 16% |
| 93 | Qwen 3.5 Plus (2026-02-15) | 22.7% | $0.0060 | 31.5s | 12% |
| 94 | Gemini 3.1 Flash Lite (Preview) | 21.3% | $0.0030 | 8.4s | 8% |
| 95 | MoonshotAI: Kimi K2.5 | 43.7% | $0.019 | 3.2m | 22% |
| 96 | GPT-5.5 (Reasoning, Low) | 55.5% | $0.139 | 1.8m | 32% |
| 97 | o4 Mini High | 28.1% | $0.025 | 47.2s | 14% |
| 98 | Grok 4 | 43.0% | $0.048 | 1.7m | 16% |
| 99 | GPT-5.5 | 56.1% | $0.139 | 1.7m | 29% |
| 100 | Grok 4.3 | 21.7% | $0.0069 | 30.5s | 9% |
| 101 | DeepSeek V3.1 | 26.6% | $0.0020 | 1.8m | 13% |
| 102 | Ministral 3 3B | 18.7% | $0.0005 | 11.1s | 5% |
| 103 | Claude 3.5 Sonnet | 32.1% | $0.048 | 35.5s | 11% |
| 104 | DeepSeek V3 (2024-12-26) | 22.0% | $0.0021 | 54.6s | 7% |
| 105 | Qwen 3.5 Plus (2026-04-20) | 31.8% | $0.017 | 1.8m | 12% |
| 106 | Gemini 3 Flash (Preview) | 17.4% | $0.0078 | 19.6s | 7% |
| 107 | DeepSeek-V2 Chat | 21.7% | $0.0021 | 53.3s | 6% |
| 108 | Gemini 3.1 Flash Lite | 19.0% | $0.0030 | 12.1s | 3% |
| 109 | GPT-4o, Aug. 6th (temp=0) | 17.5% | $0.023 | 22.7s | 8% |
| 110 | GPT-4o, May 13th (temp=0) | 18.6% | $0.035 | 14.1s | 8% |
| 111 | Qwen 3.5 397B A17B | 33.8% | $0.014 | 3.0m | 15% |
| 112 | Z.AI GLM 4.7 Flash | 16.8% | $0.0017 | 1.2m | 8% |
| 113 | Gemini 3 Flash (Preview, Reasoning) | 17.3% | $0.012 | 30.1s | 5% |
| 114 | GPT-5.5 (Reasoning) | 51.9% | $0.142 | 1.8m | 25% |
| 115 | GPT-5 Mini | 17.7% | $0.0100 | 57.4s | 7% |
| 116 | Gemini 3.5 Flash (Reasoning) | 30.0% | $0.071 | 37.6s | 11% |
| 117 | Gemma 4 26B | 14.2% | $0.0009 | 55.1s | 6% |
| 118 | Gemma 4 31B | 19.2% | $0.0010 | 1.6m | 8% |
| 119 | Qwen3.6 Max Preview | 44.1% | $0.050 | 3.5m | 19% |
| 120 | Arcee AI: Trinity Mini | 11.6% | $0.0003 | 9.2s | 0% |
| 121 | Z.AI GLM 4.7 | 16.6% | $0.010 | 1.4m | 8% |
| 122 | Qwen 3.5 Flash | 12.9% | $0.0025 | 47.5s | 2% |
| 123 | Qwen 2.5 72B | 8.8% | $0.0010 | 36.7s | 2% |
| 124 | Nemotron 3 Super | 13.3% | $0.0000 | 1.4m | 5% |
| 125 | Claude Opus 4 | 57.5% | $0.209 | 1.4m | 28% |
| 126 | GPT-4o Mini (temp=0) | 9.5% | $0.0012 | 34.8s | 0% |
| 127 | Gemini 3 Pro (Preview) | 20.1% | $0.055 | 54.4s | 9% |
| 128 | GPT-5.2 | 22.9% | $0.056 | 1.5m | 11% |
| 129 | Stealth: Aurora Alpha | 0.7% | $0.0000 | 9.8s | 0% |
| 130 | Inception Mercury 2 | 1.1% | $0.0032 | 7.0s | 0% |
| 131 | ByteDance Seed 2.0 Lite | 22.0% | $0.012 | 2.2m | 6% |
| 132 | GPT-5 Nano | 13.5% | $0.0042 | 1.4m | 2% |
| 133 | Qwen 3.5 35B | 14.4% | $0.018 | 1.0m | 1% |
| 134 | Gemma 4 31B (Reasoning) | 15.2% | $0.0014 | 2.2m | 5% |
| 135 | Qwen 3.6 27B | 22.7% | $0.025 | 2.3m | 6% |
| 136 | Inception Mercury | 1.2% | $0.011 | 17.6s | 0% |
| 137 | Nemotron 3 Nano | 7.5% | $0.0010 | 1.1m | 0% |
| 138 | Gemma 4 26B (Reasoning) | 13.4% | $0.0013 | 2.0m | 3% |
| 139 | Qwen 3.5 9B | 7.6% | $0.0011 | 1.4m | 0% |
| 140 | GPT-5 | 29.2% | $0.065 | 2.8m | 13% |
| 141 | Grok 4.3 (Reasoning) | 16.9% | $0.021 | 2.3m | 5% |
| 142 | Qwen 3.5 122B | 8.2% | $0.025 | 1.1m | 0% |
| 143 | Qwen 3.5 27B | 6.7% | $0.020 | 1.6m | 0% |
| 144 | GPT-OSS 120B | 0.9% | $0.0015 | 1.8m | 0% |
| 145 | ByteDance Seed 1.6 | 11.8% | $0.013 | 2.5m | 0% |
| 146 | ByteDance Seed 2.0 Mini | 22.7% | $0.0045 | 4.9m | 9% |
| 147 | Gemini 3.1 Pro (Preview) | 23.9% | $0.107 | 1.8m | 6% |
| 148 | Mistral Small 3.2 24B | 25.5% | $0.0069 | 5.7m | 9% |
| 149 | Qwen3.7 Max | 17.6% | $0.068 | 2.3m | 1% |
| 150 | MoonshotAI: Kimi K2.6 | 35.9% | $0.058 | 6.5m | 16% |

Average score: 35.32%

Individual Scenarios

Detailed Writing Rules

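| Model | #1 | #2 | #3 | #4 | #5 | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Writer: Palmyra X5 | 100 | 100 | 96 | 88 | 80 | 92.7% |
| Rocinante 12B | 100 | 100 | 98 | 83 | 78 | 91.8% |
| GPT-5.4 (Reasoning) | 100 | 91 | 87 | 86 | 60 | 84.7% |
| GPT-5.4 (Reasoning, Low) | 100 | 89 | 80 | 71 | 71 | 82.5% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 98 | 87 | 27 | 82.5% |
| Claude Sonnet 4.6 (Reasoning) | 97 | 92 | 81 | 74 | 62 | 81.2% |
| Claude Sonnet 4.6 | 96 | 79 | 78 | 77 | 63 | 78.4% |
| Claude Sonnet 4.5 | 100 | 87 | 71 | 67 | 64 | 77.9% |
| Claude Opus 4.7 (Reasoning) | 100 | 77 | 73 | 60 | 34 | 68.7% |
| Claude Opus 4.7 | 93 | 72 | 70 | 59 | 46 | 68.0% |
| Gemma 3 4B | 100 | 78 | 76 | 42 | 39 | 67.1% |
| Hermes 3 70B | 100 | 96 | 60 | 56 | 23 | 66.9% |
| GPT-5.4 | 81 | 75 | 67 | 60 | 51 | 66.8% |
| Claude Opus 4.5 | 83 | 81 | 72 | 65 | 29 | 65.9% |
| Claude Opus 4.6 (Reasoning) | 100 | 72 | 71 | 39 | 35 | 63.2% |
| Llama 3.1 Nemotron 70B | 85 | 73 | 63 | 41 | 35 | 59.2% |
| MiniMax M2.7 | 88 | 60 | 51 | 50 | 42 | 58.2% |
| Cohere Command R+ (Aug. 2024) | 99 | 82 | 40 | 36 | 34 | 58.2% |
| Claude Opus 4 | 75 | 74 | 71 | 42 | 24 | 57.3% |
| Claude Opus 4.6 | 76 | 70 | 50 | 46 | 44 | 57.3% |
| WizardLM 2 8x22b | 97 | 69 | 53 | 39 | 27 | 57.0% |
| Gemma 3 12B | 87 | 73 | 53 | 41 | 31 | 56.9% |
| GPT-5.4 Mini | 77 | 70 | 55 | 41 | 41 | 56.6% |
| GPT-5.5 (Reasoning, Low) | 63 | 59 | 58 | 57 | 44 | 56.2% |
| Z.AI GLM 5 | 100 | 78 | 67 | 20 | 13 | 55.5% |
| Llama 3.1 8B | 100 | 100 | 56 | 18 | 0 | 54.8% |
| GPT-5.4 Mini (Reasoning) | 75 | 64 | 55 | 46 | 34 | 54.6% |
| Claude Haiku 4.5 | 100 | 61 | 43 | 36 | 33 | 54.4% |
| GPT-5.5 | 68 | 54 | 53 | 52 | 44 | 54.2% |
| Hermes 3 405B | 100 | 100 | 45 | 14 | 11 | 54.0% |
| MoonshotAI: Kimi K2.5 | 74 | 60 | 54 | 48 | 29 | 53.2% |
| GPT-5.4 Mini (Reasoning, Low) | 69 | 63 | 56 | 46 | 30 | 53.1% |
| Aion 2.0 | 64 | 59 | 55 | 43 | 33 | 50.8% |
| Z.AI GLM 5 Turbo | 78 | 67 | 47 | 43 | 16 | 50.2% |
| Xiaomi MIMO v2.5 Pro | 58 | 55 | 49 | 46 | 38 | 49.2% |
| Gemini 2.5 Pro | 63 | 61 | 61 | 45 | 11 | 48.1% |
| Gemini 2.5 Flash | 70 | 48 | 42 | 41 | 30 | 46.3% |
| Gemma 3 27B | 59 | 56 | 55 | 43 | 18 | 46.2% |
| GPT-4o Mini (temp=1) | 90 | 45 | 42 | 38 | 15 | 46.0% |
| Arcee AI: Trinity Large (Preview) | 68 | 50 | 47 | 39 | 22 | 45.0% |
| Gemini 2.5 Flash Lite | 52 | 49 | 43 | 42 | 37 | 44.7% |
| Xiaomi MIMO v2.5 | 72 | 53 | 47 | 44 | 7 | 44.7% |
| Z.AI GLM 5.1 | 69 | 67 | 47 | 41 | 0 | 44.6% |
| Ministral 8B | 79 | 60 | 46 | 23 | 15 | 44.5% |
| DeepSeek V4 Pro (Reasoning) | 73 | 48 | 43 | 39 | 18 | 44.2% |
| Gemini 2.5 Flash (Reasoning) | 81 | 77 | 37 | 14 | 9 | 43.7% |
| DeepSeek V4 Pro | 79 | 45 | 37 | 31 | 24 | 43.1% |
| Ministral 3 14B | 75 | 50 | 43 | 28 | 18 | 43.1% |
| Claude Sonnet 4 | 68 | 54 | 49 | 29 | 11 | 42.1% |
| Stealth: Hunter Alpha | 88 | 34 | 33 | 31 | 20 | 41.4% |
| DeepSeek V4 Flash (Reasoning) | 88 | 51 | 26 | 22 | 19 | 41.3% |
| GPT-5.5 (Reasoning) | 56 | 54 | 37 | 32 | 24 | 40.6% |
| Qwen 3.6 35B | 73 | 44 | 38 | 24 | 23 | 40.4% |
| DeepSeek V3 (2025-03-24) | 94 | 56 | 26 | 26 | 0 | 40.3% |
| GPT-5.1 | 54 | 49 | 38 | 36 | 25 | 40.2% |
| DeepSeek V3.2 | 54 | 52 | 43 | 39 | 12 | 40.1% |
| Grok 4.3 | 63 | 45 | 42 | 36 | 12 | 39.7% |
| Gemini 2.5 Flash Lite (Reasoning) | 77 | 38 | 32 | 30 | 20 | 39.5% |
| LFM2 24B | 86 | 61 | 22 | 21 | 0 | 37.9% |
| Grok 4.20 (Reasoning) | 53 | 46 | 43 | 24 | 23 | 37.7% |
| Mistral Small 4 (Reasoning) | 55 | 55 | 44 | 17 | 15 | 37.3% |
| DeepSeek V4 Flash | 100 | 42 | 32 | 7 | 0 | 36.2% |
| GPT-4o, Aug. 6th (temp=1) | 75 | 58 | 29 | 18 | 0 | 36.2% |
| Gemini 3.5 Flash (Reasoning, Minimal) | 49 | 46 | 35 | 25 | 25 | 36.0% |
| Grok 4.20 (Beta) | 64 | 45 | 39 | 21 | 10 | 35.9% |
| Mistral Small Creative | 53 | 46 | 43 | 17 | 13 | 34.6% |
| ByteDance Seed 1.6 Flash | 79 | 34 | 29 | 23 | 8 | 34.6% |
| Mistral Small 3.2 24B | 81 | 35 | 29 | 28 | 0 | 34.6% |
| GPT-4.1 | 62 | 53 | 30 | 27 | 0 | 34.4% |
| Stealth: Healer Alpha | 80 | 40 | 38 | 14 | 0 | 34.4% |
| Gemini 3 Flash (Preview, Reasoning) | 48 | 39 | 35 | 28 | 16 | 33.2% |
| Ministral 3B | 65 | 60 | 37 | 0 | 0 | 32.4% |
| Z.AI GLM 4.6 | 49 | 41 | 28 | 23 | 22 | 32.4% |
| Qwen 3.6 Flash | 77 | 33 | 28 | 22 | 2 | 32.2% |
| Mistral Large 2 | 39 | 37 | 32 | 29 | 18 | 31.2% |
| Claude 3 Haiku | 73 | 26 | 21 | 21 | 14 | 30.9% |
| Grok 4.20 | 54 | 36 | 31 | 16 | 16 | 30.6% |
| Gemma 4 31B (Reasoning) | 35 | 31 | 31 | 28 | 27 | 30.4% |
| MoonshotAI: Kimi K2.6 | 58 | 28 | 28 | 27 | 10 | 30.0% |
| Llama 3.1 70B | 100 | 31 | 10 | 7 | 0 | 29.6% |
| Z.AI GLM 4.5 Air | 49 | 44 | 21 | 19 | 15 | 29.5% |
| Mistral Medium 3.1 | 46 | 42 | 31 | 15 | 13 | 29.2% |
| GPT-5.2 | 42 | 41 | 37 | 16 | 5 | 28.2% |
| MiniMax M2.5 | 73 | 26 | 25 | 16 | 0 | 28.0% |
| Grok 4 Fast | 43 | 31 | 31 | 21 | 13 | 28.0% |
| Gemini 3 Pro (Preview) | 49 | 36 | 27 | 20 | 7 | 27.6% |
| Claude 3.7 Sonnet | 50 | 40 | 28 | 20 | 0 | 27.6% |
| Gemini 3 Flash (Preview) | 38 | 37 | 36 | 24 | 0 | 27.0% |
| Mistral Large 3 | 46 | 31 | 20 | 19 | 17 | 26.4% |
| GPT-4o, May 13th (temp=1) | 60 | 26 | 16 | 14 | 12 | 25.7% |
| Mistral Large | 48 | 31 | 27 | 20 | 1 | 25.4% |
| Grok 4.1 Fast | 48 | 27 | 26 | 17 | 6 | 24.8% |
| Claude 3.5 Sonnet | 70 | 23 | 14 | 8 | 7 | 24.1% |
| ByteDance Seed 2.0 Lite | 51 | 31 | 23 | 16 | 0 | 24.1% |
| DeepSeek V3.1 | 33 | 32 | 25 | 20 | 9 | 24.0% |
| Gemma 4 31B | 49 | 43 | 25 | 0 | 0 | 23.5% |
| DeepSeek-V2 Chat | 54 | 22 | 18 | 11 | 10 | 22.9% |
| GPT-5.4 Nano | 43 | 28 | 18 | 14 | 11 | 22.9% |
| Ministral 3 8B | 60 | 39 | 12 | 2 | 0 | 22.6% |
| Gemini 3.5 Flash (Reasoning) | 36 | 27 | 27 | 21 | 0 | 22.2% |
| Ministral 3 3B | 45 | 41 | 13 | 10 | 0 | 21.8% |
| Qwen 3.5 397B A17B | 40 | 31 | 28 | 8 | 0 | 21.4% |
| Qwen 3.6 27B | 63 | 24 | 16 | 4 | 0 | 21.2% |
| GPT-5.4 Nano (Reasoning, Low) | 27 | 27 | 24 | 20 | 4 | 20.4% |
| GPT-5.4 Nano (Reasoning) | 37 | 35 | 14 | 11 | 5 | 20.3% |
| Qwen3.6 Max Preview | 60 | 21 | 11 | 5 | 0 | 19.5% |
| Z.AI GLM 4.7 | 28 | 27 | 23 | 14 | 2 | 18.9% |
| Qwen3.7 Max | 42 | 27 | 16 | 2 | 0 | 17.4% |
| ByteDance Seed 2.0 Mini | 55 | 25 | 8 | 0 | 0 | 17.4% |
| o4 Mini High | 46 | 25 | 9 | 7 | 0 | 17.4% |
| DeepSeek V3 (2024-12-26) | 38 | 25 | 21 | 3 | 0 | 17.4% |
| Mistral Small 4 | 34 | 17 | 15 | 12 | 9 | 17.3% |
| GPT-5 | 40 | 18 | 10 | 9 | 7 | 16.9% |
| GPT-4.1 Mini | 41 | 23 | 18 | 0 | 0 | 16.2% |
| Qwen 3.5 Plus (2026-04-20) | 39 | 25 | 10 | 5 | 0 | 15.8% |
| Grok 4 | 30 | 28 | 10 | 6 | 2 | 15.3% |
| Gemini 3.1 Flash Lite (Preview) | 37 | 17 | 13 | 5 | 3 | 15.1% |
| GPT-4o, Aug. 6th (temp=0) | 28 | 23 | 17 | 4 | 0 | 14.3% |
| Mistral NeMO | 32 | 27 | 8 | 0 | 0 | 13.5% |
| Gemma 4 26B | 40 | 17 | 8 | 0 | 0 | 13.0% |
| GPT-4.1 Nano | 25 | 20 | 16 | 2 | 0 | 12.3% |
| Arcee AI: Trinity Mini | 43 | 17 | 0 | 0 | 0 | 12.0% |
| Qwen 3.5 9B | 39 | 18 | 0 | 0 | 0 | 11.4% |
| Z.AI GLM 4.7 Flash | 25 | 15 | 14 | 0 | 0 | 10.9% |
| Qwen 2.5 72B | 24 | 19 | 5 | 3 | 2 | 10.7% |
| Gemini 3.1 Pro (Preview) | 23 | 16 | 11 | 3 | 0 | 10.5% |
| Grok 4.20 (Beta, Reasoning) | 23 | 13 | 10 | 2 | 0 | 9.7% |
| Gemma 4 26B (Reasoning) | 20 | 15 | 14 | 0 | 0 | 9.6% |
| Qwen 3.5 Plus (2026-02-15) | 21 | 14 | 7 | 3 | 0 | 9.0% |
| Z.AI GLM 4.5 | 23 | 13 | 5 | 4 | 0 | 9.0% |
| Qwen 3.5 Flash | 45 | 0 | 0 | 0 | 0 | 9.0% |
| Grok 4.3 (Reasoning) | 26 | 14 | 5 | 0 | 0 | 8.9% |
| GPT-4o, May 13th (temp=0) | 24 | 16 | 5 | 0 | 0 | 8.8% |
| Gemini 3.1 Flash Lite | 23 | 11 | 7 | 2 | 0 | 8.7% |
| GPT-5 Mini | 23 | 8 | 5 | 0 | 0 | 7.2% |
| Qwen 3.5 35B | 31 | 2 | 0 | 0 | 0 | 6.5% |
| GPT-5 Nano | 17 | 9 | 6 | 0 | 0 | 6.3% |
| Qwen 3 32B | 18 | 8 | 4 | 1 | 0 | 6.3% |
| o4 Mini | 11 | 10 | 6 | 2 | 1 | 6.3% |
| GPT-4o Mini (temp=0) | 27 | 3 | 1 | 0 | 0 | 6.1% |
| Nemotron 3 Super | 17 | 13 | 0 | 0 | 0 | 6.0% |
| Nemotron 3 Nano | 15 | 7 | 0 | 0 | 0 | 4.4% |
| Qwen 3.5 27B | 19 | 0 | 0 | 0 | 0 | 3.8% |
| Gemini 3.1 Flash Lite (Reasoning) | 12 | 6 | 0 | 0 | 0 | 3.6% |
| ByteDance Seed 1.6 | 14 | 0 | 0 | 0 | 0 | 2.8% |
| Stealth: Aurora Alpha | 5 | 0 | 0 | 0 | 0 | 1.0% |
| Inception Mercury 2 | 4 | 0 | 0 | 0 | 0 | 0.9% |
| Qwen 3.5 122B | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0% |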
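
| Model | #1 | #2 | #3 | #4 | #5 | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 85 | 97.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 94 | 92 | 97.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 99 | 76 | 95.1% |
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 69 | 93.9% |
| Claude Haiku 4.5 | 100 | 100 | 99 | 94 | 77 | 93.9% |
| GPT-4o Mini (temp=1) | 100 | 100 | 95 | 95 | 76 | 93.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 93 | 93 | 88 | 82 | 91.2% |
| Claude Opus 4 | 100 | 100 | 100 | 88 | 57 | 89.1% |
| Rocinante 12B | 100 | 100 | 100 | 85 | 60 | 88.9% |
| Z.AI GLM 5.1 | 100 | 98 | 95 | 78 | 70 | 88.2% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 83 | 57 | 88.0% |
| Claude Sonnet 4.6 | 99 | 90 | 89 | 84 | 76 | 87.5% |
| GPT-5.4 (Reasoning) | 97 | 93 | 82 | 80 | 79 | 86.2% |
| Mistral Small 4 (Reasoning) | 100 | 98 | 89 | 78 | 62 | 85.3% |
| DeepSeek V4 Flash | 100 | 100 | 95 | 83 | 46 | 84.9% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 21 | 84.3% |
| Z.AI GLM 5 Turbo | 97 | 93 | 89 | 76 | 66 | 84.2% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 95 | 68 | 57 | 84.1% |
| GPT-5.4 | 100 | 86 | 81 | 77 | 74 | 83.8% |
| GPT-5.4 (Reasoning, Low) | 94 | 85 | 84 | 77 | 74 | 82.8% |
| Claude Opus 4.7 | 100 | 83 | 80 | 78 | 68 | 81.8% |
| Xiaomi MIMO v2.5 Pro | 100 | 90 | 81 | 74 | 64 | 81.8% |
| Claude 3 Haiku | 100 | 100 | 90 | 85 | 29 | 80.7% |
| Claude Sonnet 4 | 100 | 100 | 88 | 59 | 54 | 80.0% |
| Grok 4 | 100 | 96 | 90 | 65 | 46 | 79.5% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 74 | 59 | 58 | 78.0% |
| Stealth: Hunter Alpha | 90 | 84 | 83 | 74 | 56 | 77.5% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 90 | 60 | 37 | 77.4% |
| Llama 3.1 8B | 100 | 100 | 99 | 77 | 8 | 76.8% |
| GPT-5.5 | 91 | 83 | 79 | 73 | 58 | 76.6% |
| Z.AI GLM 5 | 100 | 85 | 74 | 62 | 56 | 75.4% |
| WizardLM 2 8x22b | 95 | 89 | 69 | 63 | 59 | 74.7% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 74 | 58 | 38 | 73.9% |
| Claude Opus 4.5 | 100 | 100 | 82 | 44 | 42 | 73.7% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 83 | 56 | 28 | 73.4% |
| Mistral Large 2 | 100 | 89 | 73 | 56 | 48 | 73.1% |
| Gemma 3 4B | 99 | 81 | 81 | 51 | 50 | 72.4% |
| GPT-5.5 (Reasoning) | 75 | 74 | 73 | 73 | 60 | 71.0% |
| Z.AI GLM 4.5 | 94 | 94 | 81 | 55 | 31 | 71.0% |
| DeepSeek V3.2 | 90 | 72 | 66 | 66 | 58 | 70.6% |
| Mistral Small Creative | 99 | 85 | 79 | 52 | 37 | 70.3% |
| MiniMax M2.5 | 92 | 73 | 68 | 66 | 51 | 70.1% |
| GPT-5.4 Mini (Reasoning, Low) | 85 | 73 | 67 | 64 | 59 | 69.8% |
| Arcee AI: Trinity Large (Preview) | 100 | 91 | 77 | 48 | 16 | 66.4% |
| Ministral 3 14B | 96 | 68 | 66 | 50 | 46 | 65.3% |
| GPT-5.1 | 90 | 71 | 71 | 52 | 41 | 65.2% |
| Z.AI GLM 4.5 Air | 99 | 78 | 67 | 48 | 29 | 64.2% |
| Mistral Large 3 | 100 | 100 | 48 | 45 | 26 | 63.8% |
| Mistral Large | 93 | 71 | 69 | 53 | 27 | 62.7% |
| Aion 2.0 | 81 | 67 | 62 | 60 | 43 | 62.7% |
| Ministral 8B | 82 | 74 | 65 | 60 | 31 | 62.4% |
| GPT-5.5 (Reasoning, Low) | 80 | 64 | 58 | 51 | 49 | 60.6% |
| Gemini 2.5 Flash Lite | 82 | 74 | 60 | 41 | 41 | 59.6% |
| Cohere Command R+ (Aug. 2024) | 100 | 82 | 72 | 43 | 0 | 59.5% |
| Mistral Medium 3.1 | 87 | 60 | 57 | 48 | 44 | 59.1% |
| Xiaomi MIMO v2.5 | 78 | 66 | 63 | 61 | 27 | 58.8% |
| Hermes 3 70B | 100 | 92 | 62 | 38 | 1 | 58.5% |
| Gemma 3 12B | 71 | 67 | 63 | 59 | 32 | 58.3% |
| GPT-5.4 Mini | 76 | 60 | 51 | 48 | 48 | 56.8% |
| Grok 4.1 Fast | 91 | 49 | 47 | 47 | 44 | 55.6% |
| GPT-4.1 | 86 | 59 | 51 | 38 | 37 | 54.1% |
| Gemini 2.5 Pro | 67 | 60 | 60 | 49 | 34 | 54.0% |
| Gemini 3 Pro (Preview) | 87 | 57 | 51 | 48 | 24 | 53.5% |
| MoonshotAI: Kimi K2.5 | 77 | 61 | 59 | 40 | 28 | 53.0% |
| Llama 3.1 70B | 77 | 65 | 60 | 49 | 13 | 53.0% |
| GPT-5.4 Mini (Reasoning) | 80 | 53 | 50 | 48 | 33 | 52.7% |
| Claude 3.7 Sonnet | 67 | 64 | 54 | 47 | 32 | 52.6% |
| DeepSeek V3.1 | 92 | 80 | 39 | 31 | 21 | 52.6% |
| Gemini 2.5 Flash (Reasoning) | 81 | 80 | 58 | 43 | 0 | 52.4% |
| Gemini 2.5 Flash | 100 | 48 | 43 | 36 | 33 | 51.8% |
| DeepSeek V3 (2024-12-26) | 100 | 72 | 47 | 19 | 18 | 51.2% |
| ByteDance Seed 1.6 Flash | 63 | 55 | 52 | 51 | 35 | 51.1% |
| LFM2 24B | 79 | 71 | 53 | 43 | 6 | 50.7% |
| GPT-4.1 Nano | 72 | 56 | 50 | 39 | 35 | 50.4% |
| GPT-4o, May 13th (temp=1) | 68 | 59 | 53 | 36 | 32 | 49.6% |
| Grok 4.20 (Beta) | 55 | 48 | 48 | 46 | 45 | 48.4% |
| Grok 4 Fast | 71 | 69 | 65 | 29 | 6 | 48.2% |
| Ministral 3 8B | 75 | 48 | 48 | 37 | 34 | 48.2% |
| GPT-4.1 Mini | 66 | 59 | 58 | 32 | 26 | 48.1% |
| Claude 3.5 Sonnet | 100 | 48 | 47 | 41 | 1 | 47.4% |
| Mistral NeMO | 100 | 59 | 39 | 35 | 3 | 47.2% |
| GPT-5.4 Nano (Reasoning, Low) | 69 | 57 | 46 | 32 | 29 | 46.5% |
| Mistral Small 4 | 68 | 64 | 56 | 40 | 0 | 45.6% |
| GPT-5.4 Nano | 66 | 49 | 47 | 44 | 19 | 45.2% |
| Gemini 2.5 Flash Lite (Reasoning) | 85 | 80 | 35 | 24 | 0 | 44.9% |
| Hermes 3 405B | 93 | 84 | 33 | 11 | 0 | 44.2% |
| Stealth: Healer Alpha | 61 | 53 | 47 | 41 | 17 | 43.7% |
| Grok 4.20 (Reasoning) | 54 | 50 | 46 | 39 | 24 | 42.6% |
| Llama 3.1 Nemotron 70B | 100 | 53 | 29 | 17 | 6 | 41.1% |
| Grok 4.20 | 46 | 46 | 42 | 38 | 31 | 40.6% |
| Gemini 3.5 Flash (Reasoning) | 44 | 44 | 42 | 39 | 30 | 39.9% |
| DeepSeek-V2 Chat | 100 | 64 | 19 | 13 | 4 | 39.8% |
| Z.AI GLM 4.6 | 51 | 48 | 36 | 32 | 29 | 39.2% |
| ByteDance Seed 2.0 Mini | 55 | 43 | 40 | 29 | 27 | 38.9% |
| Nemotron 3 Super | 73 | 50 | 46 | 16 | 9 | 38.8% |
| Z.AI GLM 4.7 | 74 | 32 | 32 | 28 | 28 | 38.7% |
| Qwen 3.6 27B | 64 | 60 | 47 | 21 | 0 | 38.2% |
| MoonshotAI: Kimi K2.6 | 55 | 54 | 43 | 20 | 15 | 37.6% |
| Qwen 3 32B | 52 | 49 | 45 | 39 | 1 | 37.2% |
| GPT-5 | 55 | 46 | 36 | 33 | 15 | 37.0% |
| GPT-5.2 | 57 | 43 | 36 | 32 | 17 | 36.9% |
| Arcee AI: Trinity Mini | 100 | 41 | 23 | 6 | 3 | 34.5% |
| Gemini 3.5 Flash (Reasoning, Minimal) | 54 | 53 | 27 | 27 | 5 | 33.3% |
| Qwen 3.5 Plus (2026-04-20) | 55 | 48 | 33 | 22 | 8 | 33.2% |
| Ministral 3 3B | 69 | 40 | 25 | 16 | 14 | 32.6% |
| Ministral 3B | 54 | 38 | 36 | 24 | 9 | 32.0% |
| GPT-5.4 Nano (Reasoning) | 51 | 33 | 32 | 32 | 9 | 31.4% |
| Qwen 3.5 Plus (2026-02-15) | 58 | 27 | 25 | 25 | 13 | 29.7% |
| Grok 4.20 (Beta, Reasoning) | 55 | 38 | 18 | 15 | 12 | 27.4% |
| Gemini 3.1 Pro (Preview) | 42 | 30 | 25 | 20 | 19 | 27.4% |
| Gemini 3 Flash (Preview) | 39 | 32 | 28 | 23 | 9 | 26.3% |
| o4 Mini | 40 | 34 | 23 | 16 | 16 | 25.8% |
| GPT-4o Mini (temp=0) | 37 | 30 | 28 | 18 | 15 | 25.4% |
| Z.AI GLM 4.7 Flash | 33 | 30 | 22 | 20 | 17 | 24.4% |
| Gemma 4 31B | 35 | 34 | 25 | 12 | 12 | 23.8% |
| o4 Mini High | 39 | 29 | 25 | 15 | 11 | 23.7% |
| Qwen3.6 Max Preview | 82 | 29 | 3 | 3 | 0 | 23.6% |
| Qwen 3.6 35B | 46 | 31 | 22 | 17 | 0 | 23.2% |
| Grok 4.3 (Reasoning) | 45 | 25 | 22 | 13 | 11 | 23.1% |
| Grok 4.3 | 55 | 38 | 13 | 7 | 0 | 22.7% |
| Gemma 4 31B (Reasoning) | 43 | 30 | 25 | 8 | 7 | 22.6% |
| Qwen 3.6 Flash | 50 | 34 | 23 | 4 | 0 | 22.2% |
| ByteDance Seed 1.6 | 25 | 23 | 23 | 15 | 12 | 19.3% |
| ByteDance Seed 2.0 Lite | 64 | 12 | 10 | 10 | 0 | 19.3% |
| Gemma 4 26B | 23 | 19 | 19 | 16 | 0 | 15.4% |
| Gemini 3 Flash (Preview, Reasoning) | 35 | 33 | 7 | 0 | 0 | 15.1% |
| Gemma 4 26B (Reasoning) | 46 | 23 | 5 | 0 | 0 | 14.8% |
| GPT-5 Mini | 30 | 12 | 10 | 10 | 0 | 12.6% |
| GPT-4o, May 13th (temp=0) | 54 | 5 | 2 | 0 | 0 | 12.4% |
| GPT-4o, Aug. 6th (temp=0) | 35 | 16 | 4 | 2 | 0 | 11.4% |
| Qwen 2.5 72B | 36 | 9 | 6 | 6 | 0 | 11.2% |
| GPT-5 Nano | 23 | 17 | 12 | 2 | 1 | 11.1% |
| Qwen 3.5 397B A17B | 16 | 15 | 8 | 6 | 6 | 10.1% |
| Gemini 3.1 Flash Lite (Reasoning) | 26 | 4 | 0 | 0 | 0 | 6.0% |
| Gemini 3.1 Flash Lite (Preview) | 16 | 6 | 2 | 0 | 0 | 4.9% |
| Mistral Small 3.2 24B | 13 | 1 | 1 | 0 | 0 | 3.1% |
| Qwen3.7 Max | 16 | 0 | 0 | 0 | 0 | 3.1% |
| Qwen 3.5 35B | 15 | 0 | 0 | 0 | 0 | 3.0% |
| Qwen 3.5 122B | 13 | 1 | 0 | 0 | 0 | 2.9% |
| Qwen 3.5 Flash | 9 | 0 | 0 | 0 | 0 | 1.8% |
| Nemotron 3 Nano | 9 | 0 | 0 | 0 | 0 | 1.8% |
| Gemini 3.1 Flash Lite | 0 | 0 | 0 | 0 | 0 | 0.1% |
| Qwen 3.5 27B | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Qwen 3.5 9B | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0% |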
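
| Model | #1 | #2 | #3 | #4 | #5 | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 78 | 95.7% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 99 | 82 | 67 | 89.7% |
| Llama 3.1 8B | 100 | 100 | 100 | 83 | 40 | 84.5% |
| Claude Opus 4 | 100 | 100 | 100 | 62 | 50 | 82.5% |
| Rocinante 12B | 100 | 100 | 100 | 52 | 2 | 70.8% |
| Mistral Small 4 | 95 | 93 | 68 | 60 | 13 | 66.1% |
| Mistral Small 4 (Reasoning) | 79 | 77 | 68 | 57 | 9 | 58.1% |
| Claude Opus 4.7 | 90 | 70 | 70 | 29 | 16 | 55.0% |
| GPT-5.4 | 68 | 67 | 62 | 44 | 34 | 54.8% |
| Claude Opus 4.5 | 83 | 66 | 43 | 43 | 31 | 53.5% |
| Claude Haiku 4.5 | 100 | 53 | 46 | 35 | 34 | 53.4% |
| Claude Sonnet 4 | 100 | 56 | 47 | 37 | 25 | 53.0% |
| Mistral Large 2 | 85 | 67 | 47 | 34 | 28 | 52.3% |
| Ministral 3 8B | 100 | 91 | 29 | 23 | 19 | 52.3% |
| Claude Sonnet 4.5 | 75 | 72 | 61 | 29 | 17 | 50.9% |
| Claude Sonnet 4.6 (Reasoning) | 65 | 56 | 46 | 45 | 41 | 50.7% |
| DeepSeek V4 Pro | 90 | 76 | 31 | 25 | 25 | 49.6% |
| Hermes 3 405B | 88 | 66 | 45 | 39 | 0 | 47.5% |
| Hermes 3 70B | 74 | 69 | 47 | 42 | 0 | 46.4% |
| Mistral Medium 3.1 | 68 | 61 | 55 | 45 | 0 | 45.8% |
| MiniMax M2.5 | 71 | 56 | 50 | 45 | 6 | 45.8% |
| Mistral Large 3 | 100 | 62 | 24 | 19 | 17 | 44.6% |
| Ministral 3 14B | 85 | 58 | 43 | 31 | 5 | 44.2% |
| Llama 3.1 70B | 82 | 48 | 41 | 24 | 17 | 42.4% |
| Claude Sonnet 4.6 | 74 | 58 | 55 | 12 | 0 | 40.0% |
| GPT-5.4 (Reasoning, Low) | 59 | 50 | 44 | 28 | 17 | 39.8% |
| Claude Opus 4.6 | 57 | 48 | 48 | 24 | 23 | 39.8% |
| Mistral Large | 75 | 45 | 42 | 27 | 4 | 38.6% |
| Z.AI GLM 5 | 54 | 48 | 44 | 27 | 15 | 37.6% |
| DeepSeek V4 Pro (Reasoning) | 50 | 40 | 38 | 34 | 23 | 37.1% |
| Llama 3.1 Nemotron 70B | 69 | 43 | 41 | 31 | 0 | 36.9% |
| Grok 4.1 Fast | 71 | 48 | 48 | 11 | 0 | 35.8% |
| MoonshotAI: Kimi K2.5 | 54 | 41 | 28 | 27 | 26 | 35.0% |
| DeepSeek V4 Flash | 81 | 61 | 27 | 1 | 0 | 33.9% |
| Qwen3.6 Max Preview | 64 | 55 | 20 | 17 | 12 | 33.7% |
| GPT-5 | 97 | 54 | 8 | 3 | 0 | 32.4% |
| ByteDance Seed 1.6 Flash | 55 | 31 | 30 | 23 | 23 | 32.3% |
| Qwen 3.5 397B A17B | 50 | 45 | 34 | 29 | 0 | 31.6% |
| GPT-5.5 (Reasoning, Low) | 56 | 31 | 27 | 26 | 14 | 31.0% |
| Ministral 8B | 46 | 45 | 33 | 20 | 9 | 30.8% |
| Arcee AI: Trinity Large (Preview) | 88 | 60 | 5 | 0 | 0 | 30.6% |
| Z.AI GLM 5 Turbo | 41 | 37 | 33 | 22 | 19 | 30.3% |
| Gemma 3 12B | 61 | 35 | 29 | 20 | 0 | 29.2% |
| GPT-5.5 | 36 | 34 | 34 | 25 | 13 | 28.6% |
| Claude Opus 4.7 (Reasoning) | 55 | 36 | 34 | 12 | 0 | 27.4% |
| DeepSeek V3 (2025-03-24) | 55 | 49 | 31 | 0 | 0 | 27.1% |
| Gemma 3 27B | 49 | 37 | 25 | 12 | 11 | 27.0% |
| MiniMax M2.7 | 40 | 34 | 28 | 24 | 7 | 26.9% |
| GPT-4.1 | 60 | 41 | 27 | 0 | 0 | 25.5% |
| GPT-4o, Aug. 6th (temp=1) | 42 | 39 | 27 | 17 | 0 | 25.0% |
| Qwen 3 32B | 45 | 36 | 35 | 6 | 0 | 24.5% |
| Z.AI GLM 5.1 | 59 | 35 | 25 | 0 | 0 | 23.7% |
| WizardLM 2 8x22b | 43 | 39 | 31 | 3 | 0 | 23.3% |
| Grok 4 Fast | 53 | 31 | 18 | 6 | 5 | 22.8% |
| Grok 4.20 (Beta) | 47 | 35 | 16 | 10 | 0 | 21.5% |
| GPT-5.4 Nano (Reasoning, Low) | 39 | 24 | 18 | 15 | 7 | 20.7% |
| GPT-4o Mini (temp=1) | 43 | 23 | 23 | 7 | 6 | 20.6% |
| GPT-5.4 (Reasoning) | 39 | 33 | 31 | 0 | 0 | 20.6% |
| GPT-5.5 (Reasoning) | 41 | 28 | 16 | 14 | 0 | 20.1% |
| Gemini 2.5 Flash (Reasoning) | 100 | 0 | 0 | 0 | 0 | 20.0% |
| Xiaomi MIMO v2.5 Pro | 50 | 18 | 17 | 15 | 0 | 19.9% |
| Grok 4.20 | 42 | 30 | 25 | 1 | 0 | 19.7% |
| Qwen 3.5 35B | 44 | 40 | 14 | 0 | 0 | 19.6% |
| Claude 3 Haiku | 71 | 14 | 13 | 0 | 0 | 19.6% |
| Z.AI GLM 4.5 Air | 52 | 46 | 0 | 0 | 0 | 19.6% |
| ByteDance Seed 2.0 Lite | 39 | 30 | 26 | 3 | 0 | 19.5% |
| Claude Opus 4.6 (Reasoning) | 47 | 41 | 7 | 0 | 0 | 19.0% |
| Mistral Small Creative | 44 | 37 | 11 | 2 | 0 | 19.0% |
| DeepSeek V3 (2024-12-26) | 60 | 20 | 10 | 0 | 0 | 18.1% |
| GPT-5.1 | 45 | 25 | 14 | 3 | 0 | 17.2% |
| Claude 3.5 Sonnet | 48 | 16 | 11 | 9 | 0 | 16.8% |
| LFM2 24B | 49 | 16 | 13 | 5 | 0 | 16.8% |
| Gemma 3 4B | 25 | 23 | 14 | 14 | 5 | 16.2% |
| Mistral NeMO | 39 | 18 | 16 | 0 | 0 | 14.7% |
| GPT-4.1 Nano | 30 | 22 | 19 | 2 | 0 | 14.5% |
| GPT-4o, May 13th (temp=1) | 53 | 19 | 0 | 0 | 0 | 14.3% |
| DeepSeek-V2 Chat | 38 | 14 | 11 | 7 | 0 | 13.8% |
| GPT-5.4 Nano | 23 | 18 | 12 | 9 | 5 | 13.3% |
| GPT-5.4 Mini | 26 | 20 | 12 | 7 | 0 | 13.1% |
| DeepSeek V4 Flash (Reasoning) | 59 | 5 | 0 | 0 | 0 | 12.9% |
| GPT-5.4 Mini (Reasoning) | 43 | 18 | 0 | 0 | 0 | 12.2% |
| Claude 3.7 Sonnet | 34 | 26 | 0 | 0 | 0 | 12.0% |
| Grok 4.20 (Reasoning) | 34 | 16 | 6 | 1 | 0 | 11.5% |
| Ministral 3B | 25 | 18 | 13 | 0 | 0 | 11.1% |
| Qwen 3.6 35B | 51 | 4 | 0 | 0 | 0 | 11.0% |
| MoonshotAI: Kimi K2.6 | 32 | 21 | 0 | 0 | 0 | 10.6% |
| Gemini 2.5 Flash Lite | 31 | 11 | 5 | 4 | 0 | 10.3% |
| GPT-4o, May 13th (temp=0) | 31 | 16 | 3 | 0 | 0 | 10.2% |
| GPT-5 Nano | 51 | 0 | 0 | 0 | 0 | 10.2% |
| Cohere Command R+ (Aug. 2024) | 41 | 10 | 0 | 0 | 0 | 10.2% |
| Qwen 3.6 Flash | 38 | 12 | 0 | 0 | 0 | 10.1% |
| GPT-5.4 Nano (Reasoning) | 27 | 13 | 10 | 0 | 0 | 10.0% |
| GPT-4.1 Mini | 28 | 14 | 5 | 0 | 0 | 9.6% |
| Mistral Small 3.2 24B | 41 | 5 | 0 | 0 | 0 | 9.2% |
| GPT-5.4 Mini (Reasoning, Low) | 23 | 16 | 6 | 0 | 0 | 9.1% |
| DeepSeek V3.2 | 23 | 10 | 6 | 0 | 0 | 7.9% |
| Z.AI GLM 4.6 | 23 | 9 | 6 | 0 | 0 | 7.7% |
| Xiaomi MIMO v2.5 | 27 | 9 | 1 | 0 | 0 | 7.3% |
| Z.AI GLM 4.5 | 36 | 0 | 0 | 0 | 0 | 7.3% |
| Ministral 3 3B | 34 | 1 | 0 | 0 | 0 | 7.1% |
| ByteDance Seed 1.6 | 31 | 3 | 0 | 0 | 0 | 7.0% |
| Aion 2.0 | 19 | 13 | 1 | 0 | 0 | 6.6% |
| Qwen 3.5 27B | 19 | 14 | 0 | 0 | 0 | 6.4% |
| Qwen 3.5 Plus (2026-02-15) | 17 | 15 | 0 | 0 | 0 | 6.4% |
| Gemini 2.5 Flash | 31 | 0 | 0 | 0 | 0 | 6.3% |
| Nemotron 3 Super | 31 | 0 | 0 | 0 | 0 | 6.1% |
| o4 Mini | 15 | 11 | 5 | 0 | 0 | 6.1% |
| Stealth: Healer Alpha | 15 | 9 | 4 | 1 | 0 | 5.8% |
| Grok 4 | 26 | 0 | 0 | 0 | 0 | 5.1% |
| o4 Mini High | 12 | 9 | 4 | 0 | 0 | 5.0% |
| Qwen 3.5 Flash | 22 | 3 | 0 | 0 | 0 | 4.9% |
| Gemini 3 Pro (Preview) | 13 | 11 | 0 | 0 | 0 | 4.7% |
| GPT-5.2 | 23 | 0 | 0 | 0 | 0 | 4.5% |
| Qwen 3.5 Plus (2026-04-20) | 15 | 3 | 0 | 0 | 0 | 3.6% |
| Qwen 2.5 72B | 17 | 0 | 0 | 0 | 0 | 3.4% |
| Arcee AI: Trinity Mini | 11 | 6 | 0 | 0 | 0 | 3.4% |
| GPT-4o, Aug. 6th (temp=0) | 16 | 0 | 0 | 0 | 0 | 3.1% |
| Gemini 2.5 Flash Lite (Reasoning) | 7 | 5 | 3 | 0 | 0 | 2.9% |
| Gemini 3.1 Pro (Preview) | 13 | 0 | 0 | 0 | 0 | 2.6% |
| GPT-4o Mini (temp=0) | 9 | 2 | 0 | 0 | 0 | 2.2% |
| Gemini 2.5 Pro | 10 | 0 | 0 | 0 | 0 | 2.1% |
| Z.AI GLM 4.7 Flash | 10 | 0 | 0 | 0 | 0 | 2.1% |
| Grok 4.20 (Beta, Reasoning) | 7 | 1 | 0 | 0 | 0 | 1.7% |
| GPT-5 Mini | 7 | 1 | 1 | 0 | 0 | 1.6% |
| Gemini 3.5 Flash (Reasoning) | 8 | 0 | 0 | 0 | 0 | 1.6% |
| Gemini 3.1 Flash Lite | 8 | 0 | 0 | 0 | 0 | 1.5% |
| Qwen 3.6 27B | 7 | 0 | 0 | 0 | 0 | 1.5% |
| Gemini 3 Flash (Preview, Reasoning) | 7 | 0 | 0 | 0 | 0 | 1.4% |
| Gemma 4 26B (Reasoning) | 3 | 1 | 0 | 0 | 0 | 0.7% |
| Grok 4.3 | 3 | 0 | 0 | 0 | 0 | 0.6% |
| Gemini 3.1 Flash Lite (Preview) | 2 | 0 | 0 | 0 | 0 | 0.5% |
| ByteDance Seed 2.0 Mini | 1 | 1 | 0 | 0 | 0 | 0.3% |
| Gemma 4 31B (Reasoning) | 1 | 0 | 0 | 0 | 0 | 0.2% |
| Qwen 3.5 9B | 1 | 0 | 0 | 0 | 0 | 0.2% |
| Stealth: Hunter Alpha | 1 | 0 | 0 | 0 | 0 | 0.2% |
| Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.1% |
| Qwen3.7 Max | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Grok 4.3 (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Qwen 3.5 122B | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Z.AI GLM 4.7 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Gemma 4 31B | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Gemini 3.5 Flash (Reasoning, Minimal) | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Gemini 3.1 Flash Lite (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Gemma 4 26B | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0% |
| DeepSeek V3.1 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0% |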
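
| Model | #1 | #2 | #3 | #4 | #5 | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 83 | 96.7% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 92 | 73 | 93.0% |
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 90 | 86 | 46 | 84.6% |
| Claude Sonnet 4.5 | 100 | 100 | 94 | 80 | 45 | 83.7% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 94 | 60 | 41 | 79.0% |
| Stealth: Hunter Alpha | 100 | 93 | 79 | 66 | 45 | 76.8% |
| Claude Sonnet 4.6 | 100 | 98 | 77 | 60 | 46 | 76.2% |
| GPT-5.1 | 100 | 96 | 68 | 67 | 50 | 76.1% |
| Hermes 3 405B | 100 | 96 | 72 | 58 | 54 | 76.0% |
| Claude Sonnet 4 | 100 | 77 | 72 | 62 | 57 | 73.7% |
| Mistral Small 4 (Reasoning) | 100 | 95 | 65 | 57 | 52 | 73.7% |
| GPT-5.4 | 100 | 94 | 65 | 63 | 45 | 73.4% |
| Rocinante 12B | 100 | 100 | 88 | 46 | 31 | 73.0% |
| Grok 4.1 Fast | 95 | 84 | 74 | 53 | 51 | 71.4% |
| Claude Sonnet 4.6 (Reasoning) | 95 | 90 | 74 | 48 | 48 | 71.1% |
| Claude Opus 4.6 | 100 | 100 | 59 | 51 | 45 | 71.0% |
| Claude Opus 4.7 (Reasoning) | 96 | 89 | 87 | 44 | 36 | 70.5% |
| DeepSeek V4 Flash | 100 | 73 | 72 | 65 | 42 | 70.5% |
| Z.AI GLM 5 | 100 | 72 | 65 | 64 | 52 | 70.5% |
| GPT-5.4 (Reasoning) | 97 | 72 | 71 | 63 | 44 | 69.4% |
| Claude Opus 4 | 93 | 82 | 73 | 61 | 35 | 68.8% |
| DeepSeek V4 Flash (Reasoning) | 100 | 87 | 58 | 53 | 45 | 68.6% |
| Claude Opus 4.7 | 100 | 87 | 76 | 63 | 13 | 67.9% |
| MoonshotAI: Kimi K2.5 | 100 | 70 | 65 | 63 | 39 | 67.4% |
| Mistral Small Creative | 100 | 71 | 67 | 57 | 34 | 66.0% |
| GPT-5.4 Mini | 82 | 73 | 62 | 62 | 51 | 66.0% |
| Llama 3.1 8B | 100 | 100 | 60 | 50 | 16 | 65.0% |
| Mistral Small 4 | 100 | 66 | 57 | 57 | 41 | 64.2% |
| MiniMax M2.7 | 81 | 79 | 79 | 45 | 33 | 63.4% |
| GPT-5.5 | 91 | 73 | 55 | 55 | 40 | 62.8% |
| Mistral Large | 92 | 75 | 66 | 63 | 7 | 60.7% |
| Z.AI GLM 5 Turbo | 79 | 66 | 63 | 60 | 30 | 59.8% |
| Ministral 3 14B | 95 | 77 | 63 | 54 | 9 | 59.6% |
| Hermes 3 70B | 100 | 100 | 65 | 30 | 0 | 59.0% |
| GPT-5.5 (Reasoning, Low) | 67 | 66 | 64 | 54 | 44 | 58.9% |
| Claude Opus 4.5 | 90 | 90 | 45 | 39 | 29 | 58.5% |
| Gemma 3 27B | 100 | 71 | 57 | 44 | 20 | 58.4% |
| MiniMax M2.5 | 81 | 81 | 67 | 31 | 24 | 57.1% |
| Xiaomi MIMO v2.5 | 100 | 92 | 38 | 30 | 24 | 56.6% |
| Qwen3.6 Max Preview | 77 | 60 | 55 | 45 | 37 | 54.8% |
| GPT-5.4 Mini (Reasoning) | 59 | 59 | 59 | 51 | 45 | 54.6% |
| Arcee AI: Trinity Large (Preview) | 87 | 62 | 56 | 42 | 24 | 54.2% |
| DeepSeek V4 Pro | 80 | 65 | 48 | 43 | 30 | 53.5% |
| Z.AI GLM 5.1 | 66 | 63 | 51 | 50 | 33 | 52.6% |
| GPT-5.4 Mini (Reasoning, Low) | 74 | 62 | 52 | 41 | 29 | 51.5% |
| Stealth: Healer Alpha | 67 | 66 | 65 | 36 | 23 | 51.4% |
| DeepSeek V4 Pro (Reasoning) | 81 | 57 | 48 | 46 | 20 | 50.4% |
| Qwen 3 32B | 94 | 60 | 49 | 39 | 0 | 48.3% |
| Qwen 3.6 Flash | 83 | 78 | 59 | 21 | 0 | 48.1% |
| DeepSeek V3.1 | 85 | 47 | 39 | 36 | 32 | 47.9% |
| Mistral Medium 3.1 | 72 | 69 | 59 | 25 | 13 | 47.6% |
| GPT-5 | 84 | 69 | 35 | 33 | 12 | 46.4% |
| GPT-5.5 (Reasoning) | 65 | 52 | 51 | 50 | 12 | 46.3% |
| Claude 3.7 Sonnet | 81 | 41 | 40 | 38 | 31 | 46.2% |
| WizardLM 2 8x22b | 86 | 78 | 40 | 25 | 0 | 46.0% |
| Claude 3.5 Sonnet | 100 | 57 | 45 | 19 | 7 | 45.7% |
| Gemma 3 4B | 60 | 53 | 44 | 43 | 25 | 45.0% |
| Mistral Large 2 | 79 | 62 | 53 | 30 | 0 | 44.8% |
| Claude Haiku 4.5 | 64 | 59 | 51 | 48 | 1 | 44.5% |
| Qwen 3.6 35B | 63 | 52 | 51 | 37 | 17 | 44.2% |
| Grok 4 | 66 | 59 | 36 | 31 | 28 | 43.9% |
| GPT-4.1 Nano | 67 | 57 | 49 | 43 | 0 | 43.1% |
| DeepSeek V3 (2025-03-24) | 100 | 43 | 36 | 21 | 13 | 42.5% |
| Gemma 3 12B | 88 | 50 | 48 | 28 | 0 | 42.5% |
| ByteDance Seed 1.6 Flash | 100 | 48 | 23 | 21 | 20 | 42.3% |
| Grok 4.20 (Beta) | 58 | 52 | 34 | 33 | 33 | 42.0% |
| Gemini 2.5 Flash Lite | 78 | 51 | 43 | 19 | 17 | 41.6% |
| Qwen 3.5 397B A17B | 73 | 67 | 57 | 11 | 0 | 41.6% |
| Gemini 2.5 Flash (Reasoning) | 68 | 66 | 47 | 24 | 3 | 41.5% |
| Grok 4.20 | 87 | 71 | 31 | 8 | 7 | 40.8% |
| Aion 2.0 | 66 | 50 | 43 | 25 | 15 | 40.0% |
| GPT-5.4 Nano | 59 | 41 | 38 | 32 | 23 | 38.7% |
| Gemini 2.5 Flash Lite (Reasoning) | 67 | 41 | 37 | 30 | 17 | 38.4% |
| GPT-4o, Aug. 6th (temp=1) | 54 | 42 | 38 | 38 | 14 | 37.1% |
| Mistral Small 3.2 24B | 84 | 55 | 44 | 0 | 0 | 36.6% |
| Mistral Large 3 | 84 | 34 | 28 | 20 | 16 | 36.5% |
| Qwen 3.5 Plus (2026-04-20) | 65 | 38 | 28 | 23 | 18 | 34.2% |
| Gemma 4 31B | 72 | 47 | 23 | 16 | 11 | 33.6% |
| Gemini 3.5 Flash (Reasoning, Minimal) | 75 | 58 | 18 | 16 | 0 | 33.3% |
| Gemini 2.5 Flash | 45 | 45 | 37 | 34 | 0 | 32.1% |
| GPT-5.4 Nano (Reasoning, Low) | 58 | 53 | 18 | 16 | 13 | 31.8% |
| Ministral 3 8B | 59 | 32 | 28 | 25 | 11 | 31.1% |
| GPT-4o Mini (temp=1) | 57 | 50 | 37 | 12 | 0 | 31.1% |
| MoonshotAI: Kimi K2.6 | 55 | 31 | 30 | 23 | 15 | 30.9% |
| Gemini 3.5 Flash (Reasoning) | 66 | 41 | 20 | 15 | 10 | 30.4% |
| Ministral 8B | 87 | 42 | 14 | 5 | 0 | 29.8% |
| Grok 4.20 (Reasoning) | 60 | 44 | 23 | 11 | 4 | 28.6% |
| LFM2 24B | 62 | 30 | 24 | 14 | 12 | 28.3% |
| GPT-4o, May 13th (temp=1) | 69 | 34 | 22 | 15 | 0 | 27.9% |
| Gemini 3 Pro (Preview) | 59 | 40 | 25 | 12 | 1 | 27.3% |
| o4 Mini High | 79 | 22 | 18 | 8 | 0 | 25.6% |
| DeepSeek V3.2 | 40 | 38 | 34 | 15 | 1 | 25.6% |
| Cohere Command R+ (Aug. 2024) | 63 | 53 | 4 | 4 | 0 | 24.9% |
| Mistral NeMO | 42 | 30 | 22 | 19 | 11 | 24.9% |
| GPT-4.1 | 41 | 39 | 38 | 0 | 0 | 23.6% |
| Grok 4 Fast | 48 | 33 | 23 | 11 | 1 | 23.2% |
| Claude 3 Haiku | 77 | 11 | 10 | 9 | 3 | 22.0% |
| Gemini 3.1 Flash Lite (Preview) | 37 | 26 | 22 | 16 | 9 | 21.9% |
| Grok 4.3 (Reasoning) | 50 | 20 | 20 | 19 | 0 | 21.9% |
| Arcee AI: Trinity Mini | 75 | 23 | 9 | 0 | 0 | 21.2% |
| Grok 4.3 | 42 | 32 | 16 | 14 | 1 | 20.9% |
| Z.AI GLM 4.7 | 61 | 28 | 12 | 2 | 0 | 20.7% |
| Z.AI GLM 4.6 | 52 | 32 | 11 | 5 | 2 | 20.5% |
| Ministral 3B | 52 | 26 | 16 | 9 | 0 | 20.4% |
| Gemini 3 Flash (Preview) | 37 | 31 | 18 | 13 | 0 | 19.9% |
| Ministral 3 3B | 38 | 24 | 16 | 12 | 8 | 19.5% |
| Gemini 2.5 Pro | 51 | 23 | 11 | 11 | 2 | 19.4% |
| GPT-4o, May 13th (temp=0) | 44 | 29 | 19 | 4 | 0 | 19.1% |
| o4 Mini | 35 | 30 | 21 | 7 | 0 | 18.6% |
| GPT-5.4 Nano (Reasoning) | 44 | 38 | 4 | 1 | 0 | 17.5% |
| Gemini 3 Flash (Preview, Reasoning) | 38 | 23 | 18 | 8 | 0 | 17.4% |
| Llama 3.1 70B | 31 | 29 | 25 | 0 | 0 | 17.1% |
| Z.AI GLM 4.7 Flash | 39 | 32 | 10 | 3 | 0 | 17.0% |
| Z.AI GLM 4.5 | 47 | 25 | 9 | 2 | 0 | 16.5% |
| GPT-5.2 | 28 | 26 | 19 | 9 | 0 | 16.3% |
| DeepSeek V3 (2024-12-26) | 40 | 25 | 10 | 0 | 0 | 15.0% |
| Qwen 3.5 Plus (2026-02-15) | 27 | 26 | 20 | 1 | 0 | 14.8% |
| Grok 4.20 (Beta, Reasoning) | 34 | 21 | 10 | 8 | 0 | 14.7% |
| Qwen 3.6 27B | 65 | 5 | 0 | 0 | 0 | 13.9% |
| Z.AI GLM 4.5 Air | 27 | 26 | 14 | 0 | 0 | 13.4% |
| GPT-4o, Aug. 6th (temp=0) | 35 | 21 | 10 | 0 | 0 | 13.3% |
| Gemma 4 26B | 25 | 14 | 12 | 11 | 0 | 12.2% |
| Qwen 2.5 72B | 47 | 8 | 4 | 0 | 0 | 11.9% |
| Gemini 3.1 Pro (Preview) | 23 | 17 | 15 | 3 | 1 | 11.7% |
| Gemini 3.1 Flash Lite (Reasoning) | 27 | 20 | 5 | 4 | 0 | 11.3% |
| Llama 3.1 Nemotron 70B | 46 | 5 | 0 | 0 | 0 | 10.3% |
| GPT-5 Mini | 22 | 12 | 7 | 0 | 0 | 8.1% |
| ByteDance Seed 1.6 | 20 | 18 | 0 | 0 | 0 | 7.5% |
| Qwen 3.5 35B | 28 | 0 | 0 | 0 | 0 | 5.5% |
| Gemini 3.1 Flash Lite | 18 | 5 | 2 | 0 | 0 | 5.0% |
| Gemma 4 31B (Reasoning) | 13 | 10 | 0 | 0 | 0 | 4.6% |
| Nemotron 3 Super | 22 | 0 | 0 | 0 | 0 | 4.4% |
| DeepSeek-V2 Chat | 9 | 8 | 4 | 0 | 0 | 4.1% |
| Qwen 3.5 122B | 16 | 0 | 0 | 0 | 0 | 3.3% |
| Gemma 4 26B (Reasoning) | 11 | 5 | 0 | 0 | 0 | 3.2% |
| ByteDance Seed 2.0 Mini | 12 | 0 | 0 | 0 | 0 | 2.4% |
| Qwen 3.5 27B | 10 | 0 | 0 | 0 | 0 | 2.0% |
| GPT-4.1 Mini | 9 | 0 | 0 | 0 | 0 | 1.7% |
| Qwen3.7 Max | 2 | 1 | 0 | 0 | 0 | 0.7% |
| GPT-4o Mini (temp=0) | 3 | 0 | 0 | 0 | 0 | 0.6% |
| GPT-OSS 120B | 3 | 0 | 0 | 0 | 0 | 0.6% |
| ByteDance Seed 2.0 Lite | 3 | 0 | 0 | 0 | 0 | 0.6% |
| Qwen 3.5 Flash | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Qwen 3.5 9B | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5 Nano | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0% |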
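
| Model | #1 | #2 | #3 | #4 | #5 | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Rocinante 12B | 100 | 100 | 100 | 78 | 61 | 87.8% |
| Writer: Palmyra X5 | 100 | 96 | 82 | 76 | 43 | 79.4% |
| Llama 3.1 8B | 100 | 100 | 66 | 60 | 57 | 76.7% |
| Hermes 3 70B | 100 | 93 | 93 | 77 | 20 | 76.6% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 80 | 74 | 0 | 70.8% |
| Claude Opus 4.7 | 100 | 83 | 81 | 76 | 11 | 70.1% |
| Claude 3 Haiku | 83 | 81 | 71 | 67 | 38 | 68.1% |
| Mistral Medium 3.1 | 86 | 84 | 74 | 51 | 37 | 66.4% |
| Claude Opus 4 | 98 | 78 | 54 | 47 | 43 | 64.0% |
| GPT-5.4 (Reasoning, Low) | 74 | 68 | 61 | 58 | 53 | 62.7% |
| Claude Opus 4.5 | 86 | 86 | 60 | 43 | 35 | 62.1% |
| Claude Opus 4.6 (Reasoning) | 75 | 67 | 61 | 57 | 46 | 61.2% |
| Claude Sonnet 4 | 100 | 72 | 61 | 42 | 20 | 59.0% |
| Qwen3 235B A22B Instruct 2507 | 80 | 61 | 61 | 60 | 23 | 57.1% |
| DeepSeek V4 Flash (Reasoning) | 100 | 53 | 47 | 47 | 21 | 53.7% |
| DeepSeek V4 Flash | 96 | 66 | 57 | 49 | 0 | 53.7% |
| GPT-5.4 | 80 | 61 | 38 | 37 | 34 | 49.9% |
| Claude Opus 4.6 | 82 | 62 | 48 | 35 | 19 | 49.4% |
| Grok 4.1 Fast | 88 | 69 | 32 | 29 | 27 | 49.0% |
| Mistral NeMO | 75 | 66 | 51 | 45 | 6 | 48.7% |
| MiniMax M2.5 | 60 | 51 | 49 | 37 | 34 | 46.1% |
| Z.AI GLM 5.1 | 100 | 62 | 35 | 21 | 9 | 45.5% |
| GPT-5.5 | 67 | 48 | 36 | 32 | 30 | 42.6% |
| Mistral Small Creative | 100 | 58 | 42 | 10 | 1 | 42.3% |
| Z.AI GLM 5 | 81 | 43 | 32 | 32 | 21 | 42.0% |
| Ministral 3 8B | 81 | 69 | 45 | 10 | 0 | 41.0% |
| Grok 4 Fast | 61 | 53 | 39 | 33 | 19 | 40.9% |
| Mistral Small 4 (Reasoning) | 58 | 58 | 37 | 29 | 23 | 40.8% |
| Mistral Small 3.2 24B | 100 | 35 | 34 | 30 | 1 | 40.0% |
| Qwen3.6 Max Preview | 82 | 49 | 37 | 30 | 0 | 39.5% |
| Llama 3.1 70B | 100 | 43 | 28 | 17 | 1 | 37.7% |
| Grok 4.20 | 51 | 40 | 36 | 33 | 29 | 37.7% |
| DeepSeek V4 Pro | 50 | 43 | 40 | 27 | 26 | 37.4% |
| Xiaomi MIMO v2.5 Pro | 61 | 41 | 31 | 30 | 23 | 37.3% |
| Claude Sonnet 4.6 (Reasoning) | 58 | 56 | 43 | 29 | 0 | 37.2% |
| Claude Sonnet 4.5 | 66 | 48 | 36 | 33 | 0 | 36.6% |
| GPT-4o, Aug. 6th (temp=1) | 60 | 51 | 41 | 31 | 0 | 36.6% |
| Mistral Large | 81 | 76 | 16 | 4 | 0 | 35.5% |
| Grok 4.20 (Beta) | 57 | 43 | 38 | 38 | 0 | 35.3% |
| Qwen 3.5 397B A17B | 54 | 43 | 36 | 23 | 13 | 33.8% |
| Claude Haiku 4.5 | 57 | 41 | 37 | 19 | 13 | 33.5% |
| Mistral Large 3 | 60 | 52 | 38 | 15 | 0 | 33.0% |
| Ministral 3 14B | 71 | 45 | 31 | 18 | 0 | 32.9% |
| Qwen 3.5 Plus (2026-02-15) | 89 | 48 | 26 | 0 | 0 | 32.5% |
| Qwen 3.6 Flash | 55 | 46 | 23 | 23 | 16 | 32.5% |
| MiniMax M2.7 | 74 | 46 | 32 | 11 | 0 | 32.4% |
| GPT-4o, May 13th (temp=0) | 47 | 46 | 29 | 27 | 14 | 32.4% |
| Ministral 8B | 63 | 47 | 31 | 19 | 0 | 32.0% |
| Mistral Small 4 | 71 | 55 | 29 | 0 | 0 | 31.1% |
| Claude Sonnet 4.6 | 100 | 35 | 11 | 9 | 0 | 31.0% |
| Gemma 3 27B | 55 | 43 | 23 | 20 | 9 | 30.0% |
| GPT-5.4 Mini | 41 | 36 | 34 | 26 | 11 | 29.7% |
| Gemini 2.5 Flash Lite | 49 | 40 | 27 | 22 | 9 | 29.5% |
| Hermes 3 405B | 54 | 50 | 28 | 13 | 0 | 29.0% |
| GPT-5.5 (Reasoning, Low) | 37 | 34 | 31 | 26 | 16 | 29.0% |
| Stealth: Hunter Alpha | 50 | 45 | 38 | 9 | 1 | 28.6% |
| Z.AI GLM 5 Turbo | 63 | 57 | 10 | 5 | 5 | 28.1% |
| Z.AI GLM 4.5 Air | 76 | 36 | 15 | 12 | 0 | 27.9% |
| Gemma 3 12B | 36 | 34 | 31 | 21 | 16 | 27.7% |
| GPT-4.1 Nano | 48 | 33 | 29 | 25 | 0 | 27.0% |
| o4 Mini | 71 | 21 | 19 | 12 | 11 | 26.7% |
| Stealth: Healer Alpha | 85 | 27 | 17 | 4 | 0 | 26.6% |
| MoonshotAI: Kimi K2.5 | 46 | 43 | 32 | 9 | 0 | 26.1% |
| Arcee AI: Trinity Large (Preview) | 46 | 41 | 31 | 12 | 0 | 25.9% |
| GPT-4o, May 13th (temp=1) | 42 | 31 | 23 | 17 | 13 | 25.2% |
| Grok 4.20 (Reasoning) | 58 | 45 | 13 | 9 | 0 | 25.0% |
| Gemma 3 4B | 61 | 28 | 23 | 11 | 0 | 24.5% |
| ByteDance Seed 1.6 Flash | 48 | 29 | 25 | 20 | 0 | 24.4% |
| Llama 3.1 Nemotron 70B | 56 | 25 | 18 | 16 | 6 | 24.1% |
| Ministral 3 3B | 93 | 26 | 2 | 0 | 0 | 24.1% |
| DeepSeek V3.2 | 45 | 41 | 33 | 0 | 0 | 23.9% |
| LFM2 24B | 54 | 30 | 22 | 14 | 0 | 23.9% |
| GPT-5.4 (Reasoning) | 50 | 29 | 16 | 14 | 10 | 23.7% |
| ByteDance Seed 2.0 Mini | 65 | 23 | 17 | 9 | 0 | 22.6% |
| GPT-5.4 Nano (Reasoning, Low) | 48 | 26 | 19 | 9 | 9 | 22.3% |
| Gemini 2.5 Flash | 60 | 30 | 16 | 0 | 0 | 21.2% |
| Gemini 2.5 Flash Lite (Reasoning) | 46 | 37 | 17 | 6 | 0 | 21.1% |
| Qwen 3 32B | 90 | 6 | 4 | 0 | 0 | 20.1% |
| WizardLM 2 8x22b | 40 | 23 | 19 | 18 | 0 | 19.8% |
| Claude Opus 4.7 (Reasoning) | 62 | 27 | 9 | 1 | 0 | 19.7% |
| GPT-5 Nano | 74 | 11 | 8 | 4 | 0 | 19.3% |
| Grok 4 | 34 | 32 | 12 | 12 | 5 | 19.1% |
| MoonshotAI: Kimi K2.6 | 37 | 21 | 19 | 14 | 3 | 18.9% |
| Qwen 3.5 Plus (2026-04-20) | 55 | 22 | 8 | 8 | 0 | 18.5% |
| Qwen 3.6 35B | 48 | 30 | 15 | 0 | 0 | 18.5% |
| DeepSeek V3 (2025-03-24) | 38 | 27 | 15 | 13 | 0 | 18.3% |
| GPT-5.1 | 42 | 33 | 11 | 5 | 0 | 18.2% |
| Mistral Large 2 | 35 | 24 | 16 | 14 | 0 | 17.9% |
| Arcee AI: Trinity Mini | 45 | 18 | 17 | 6 | 0 | 17.1% |
| Claude 3.7 Sonnet | 56 | 20 | 6 | 1 | 0 | 16.7% |
| Claude 3.5 Sonnet | 37 | 31 | 15 | 0 | 0 | 16.6% |
| GPT-4o, Aug. 6th (temp=0) | 37 | 17 | 14 | 14 | 0 | 16.3% |
| GPT-5.5 (Reasoning) | 27 | 24 | 16 | 14 | 0 | 16.2% |
| Xiaomi MIMO v2.5 | 32 | 23 | 14 | 9 | 1 | 15.7% |
| GPT-5.4 Nano | 24 | 20 | 17 | 13 | 4 | 15.5% |
| Qwen 3.5 Flash | 47 | 15 | 9 | 4 | 0 | 15.0% |
| GPT-5.4 Mini (Reasoning, Low) | 28 | 23 | 11 | 6 | 0 | 13.6% |
| Z.AI GLM 4.6 | 40 | 20 | 3 | 0 | 0 | 12.6% |
| Ministral 3B | 51 | 7 | 3 | 0 | 0 | 12.2% |
| ByteDance Seed 1.6 | 30 | 30 | 0 | 0 | 0 | 12.0% |
| GPT-4.1 | 28 | 17 | 6 | 5 | 0 | 11.2% |
| Grok 4.20 (Beta, Reasoning) | 39 | 14 | 3 | 0 | 0 | 11.1% |
| GPT-5.4 Mini (Reasoning) | 21 | 20 | 13 | 0 | 0 | 10.9% |
| Qwen 3.6 27B | 29 | 20 | 3 | 0 | 0 | 10.3% |
| DeepSeek V4 Pro (Reasoning) | 30 | 16 | 6 | 0 | 0 | 10.3% |
| Qwen 3.5 27B | 25 | 25 | 0 | 0 | 0 | 10.0% |
| Gemini 2.5 Flash (Reasoning) | 44 | 4 | 0 | 0 | 0 | 9.6% |
| Aion 2.0 | 31 | 14 | 0 | 0 | 0 | 9.0% |
| GPT-5.4 Nano (Reasoning) | 20 | 12 | 6 | 4 | 2 | 8.8% |
| o4 Mini High | 42 | 1 | 0 | 0 | 0 | 8.6% |
| Gemini 3 Flash (Preview, Reasoning) | 32 | 5 | 4 | 0 | 0 | 8.3% |
| GPT-5 Mini | 27 | 14 | 0 | 0 | 0 | 8.1% |
| Gemini 3.5 Flash (Reasoning, Minimal) | 30 | 8 | 0 | 0 | 0 | 7.7% |
| Gemini 3 Pro (Preview) | 15 | 12 | 8 | 1 | 0 | 7.2% |
| GPT-4o Mini (temp=1) | 25 | 6 | 0 | 0 | 0 | 6.2% |
| Z.AI GLM 4.5 | 19 | 12 | 0 | 0 | 0 | 6.1% |
| DeepSeek V3 (2024-12-26) | 16 | 10 | 1 | 0 | 0 | 5.4% |
| Gemini 3.1 Pro (Preview) | 24 | 3 | 0 | 0 | 0 | 5.3% |
| Qwen 3.5 35B | 27 | 0 | 0 | 0 | 0 | 5.3% |
| Grok 4.3 | 14 | 7 | 4 | 0 | 0 | 4.9% |
| Gemini 2.5 Pro | 20 | 2 | 0 | 0 | 0 | 4.5% |
| GPT-5 | 10 | 7 | 4 | 0 | 0 | 4.4% |
| Gemini 3.5 Flash (Reasoning) | 10 | 9 | 0 | 0 | 0 | 3.8% |
| GPT-4.1 Mini | 9 | 5 | 5 | 0 | 0 | 3.7% |
| Z.AI GLM 4.7 Flash | 13 | 3 | 1 | 0 | 0 | 3.5% |
| Nemotron 3 Super | 17 | 0 | 0 | 0 | 0 | 3.4% |
| Gemini 3 Flash (Preview) | 14 | 0 | 0 | 0 | 0 | 2.7% |
| Gemini 3.1 Flash Lite | 11 | 2 | 0 | 0 | 0 | 2.6% |
| Z.AI GLM 4.7 | 7 | 5 | 0 | 0 | 0 | 2.6% |
| Qwen 2.5 72B | 5 | 3 | 3 | 0 | 0 | 2.5% |
| DeepSeek-V2 Chat | 11 | 0 | 0 | 0 | 0 | 2.1% |
| DeepSeek V3.1 | 6 | 3 | 0 | 0 | 0 | 1.8% |
| Gemini 3.1 Flash Lite (Preview) | 4 | 3 | 0 | 0 | 0 | 1.6% |
| ByteDance Seed 2.0 Lite | 5 | 0 | 0 | 0 | 0 | 1.1% |
| Qwen 3.5 122B | 5 | 0 | 0 | 0 | 0 | 1.1% |
| GPT-4o Mini (temp=0) | 5 | 0 | 0 | 0 | 0 | 0.9% |
| Gemma 4 26B (Reasoning) | 4 | 0 | 0 | 0 | 0 | 0.8% |
| GPT-5.2 | 3 | 0 | 0 | 0 | 0 | 0.6% |
| Gemma 4 31B | 2 | 0 | 0 | 0 | 0 | 0.5% |
| Grok 4.3 (Reasoning) | 2 | 0 | 0 | 0 | 0 | 0.5% |
| Qwen3.7 Max | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Gemma 4 31B (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Gemini 3.1 Flash Lite (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Qwen 3.5 9B | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Gemma 4 26B | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0% |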
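
| Model | #1 | #2 | #3 | #4 | #5 | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 94 | 98.7% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 81 | 96.2% |
| Rocinante 12B | 100 | 100 | 100 | 93 | 82 | 95.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 84 | 49 | 86.6% |
| Llama 3.1 8B | 100 | 100 | 100 | 78 | 48 | 85.2% |
| Hermes 3 405B | 100 | 100 | 93 | 91 | 36 | 84.1% |
| DeepSeek V4 Flash (Reasoning) | 99 | 97 | 96 | 83 | 42 | 83.5% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 83 | 76 | 55 | 82.7% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 94 | 85 | 32 | 82.3% |
| Claude Opus 4.7 | 100 | 100 | 100 | 62 | 40 | 80.5% |
| DeepSeek V4 Flash | 100 | 87 | 74 | 71 | 62 | 78.7% |
| Claude Haiku 4.5 | 100 | 90 | 84 | 69 | 43 | 77.2% |
| Claude Opus 4.5 | 92 | 84 | 75 | 64 | 64 | 75.9% |
| Hermes 3 70B | 100 | 100 | 85 | 74 | 18 | 75.3% |
| Claude Sonnet 4.6 | 100 | 90 | 78 | 56 | 46 | 74.2% |
| Claude Sonnet 4.5 | 100 | 94 | 65 | 57 | 40 | 71.4% |
| GPT-5.4 (Reasoning, Low) | 89 | 77 | 70 | 63 | 50 | 69.6% |
| Claude Sonnet 4 | 97 | 82 | 64 | 54 | 47 | 69.0% |
| Xiaomi MIMO v2.5 Pro | 90 | 72 | 68 | 64 | 52 | 69.0% |
| Mistral Small 4 | 100 | 100 | 78 | 43 | 23 | 68.7% |
| Z.AI GLM 5.1 | 100 | 88 | 63 | 52 | 37 | 68.2% |
| MiniMax M2.7 | 100 | 93 | 49 | 49 | 48 | 67.7% |
| GPT-5.4 | 84 | 76 | 60 | 59 | 57 | 67.2% |
| Mistral Medium 3.1 | 91 | 88 | 67 | 48 | 36 | 66.2% |
| Claude Opus 4.6 (Reasoning) | 83 | 83 | 57 | 53 | 49 | 64.9% |
| DeepSeek V4 Pro | 100 | 98 | 97 | 16 | 13 | 64.8% |
| Cohere Command R+ (Aug. 2024) | 100 | 75 | 73 | 51 | 23 | 64.6% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 83 | 55 | 42 | 42 | 64.5% |
| Z.AI GLM 5 Turbo | 79 | 70 | 69 | 59 | 44 | 64.3% |
| Claude Opus 4.6 | 82 | 76 | 72 | 53 | 37 | 64.0% |
| Arcee AI: Trinity Large (Preview) | 85 | 85 | 70 | 57 | 20 | 63.5% |
| WizardLM 2 8x22b | 86 | 69 | 60 | 50 | 49 | 62.8% |
| DeepSeek V4 Pro (Reasoning) | 77 | 68 | 65 | 53 | 51 | 62.5% |
| Qwen 3.5 397B A17B | 100 | 67 | 55 | 45 | 42 | 61.9% |
| Grok 4.1 Fast | 74 | 70 | 61 | 59 | 45 | 61.6% |
| GPT-4o Mini (temp=1) | 96 | 70 | 53 | 48 | 38 | 60.9% |
| Mistral Large 3 | 72 | 67 | 65 | 64 | 28 | 59.2% |
| Gemma 3 27B | 88 | 60 | 52 | 44 | 40 | 56.8% |
| Z.AI GLM 4.5 Air | 100 | 70 | 54 | 47 | 11 | 56.2% |
| GPT-4.1 | 87 | 65 | 61 | 41 | 27 | 56.2% |
| Mistral Small 4 (Reasoning) | 98 | 75 | 57 | 48 | 2 | 56.1% |
| Claude 3 Haiku | 100 | 100 | 45 | 29 | 0 | 54.8% |
| MiniMax M2.5 | 100 | 71 | 43 | 30 | 27 | 54.1% |
| GPT-5.1 | 78 | 59 | 51 | 48 | 34 | 54.1% |
| Gemini 2.5 Flash Lite | 70 | 68 | 53 | 50 | 27 | 53.9% |
| GPT-4.1 Nano | 76 | 73 | 68 | 27 | 26 | 53.7% |
| Mistral Large | 99 | 86 | 61 | 20 | 0 | 53.4% |
| Qwen 3.5 Plus (2026-04-20) | 75 | 75 | 63 | 28 | 25 | 53.3% |
| Gemini 2.5 Flash Lite (Reasoning) | 63 | 53 | 53 | 53 | 44 | 53.2% |
| Claude Opus 4 | 79 | 68 | 50 | 38 | 26 | 52.3% |
| GPT-4o, May 13th (temp=1) | 75 | 58 | 57 | 46 | 23 | 51.7% |
| DeepSeek V3.2 | 69 | 57 | 52 | 41 | 34 | 50.5% |
| Mistral Large 2 | 90 | 53 | 31 | 31 | 29 | 46.8% |
| MoonshotAI: Kimi K2.5 | 62 | 54 | 40 | 38 | 38 | 46.2% |
| Grok 4 Fast | 60 | 56 | 51 | 31 | 26 | 45.0% |
| GPT-5.5 (Reasoning, Low) | 69 | 45 | 40 | 39 | 30 | 44.8% |
| Mistral Small Creative | 81 | 65 | 36 | 28 | 13 | 44.6% |
| DeepSeek V3 (2025-03-24) | 69 | 67 | 43 | 27 | 16 | 44.4% |
| Qwen 3 32B | 75 | 52 | 47 | 42 | 0 | 43.3% |
| GPT-5.4 (Reasoning) | 54 | 49 | 40 | 37 | 34 | 42.8% |
| Z.AI GLM 4.6 | 64 | 52 | 35 | 34 | 26 | 42.3% |
| Gemini 2.5 Flash | 67 | 53 | 43 | 31 | 15 | 41.9% |
| Grok 4 | 67 | 55 | 32 | 31 | 24 | 41.7% |
| Z.AI GLM 4.5 | 60 | 47 | 47 | 28 | 26 | 41.6% |
| Stealth: Healer Alpha | 69 | 65 | 45 | 28 | 0 | 41.4% |
| MoonshotAI: Kimi K2.6 | 75 | 50 | 38 | 35 | 8 | 41.2% |
| Grok 4.20 | 62 | 52 | 43 | 33 | 17 | 41.2% |
| Llama 3.1 70B | 60 | 43 | 40 | 38 | 24 | 41.1% |
| GPT-5.5 (Reasoning) | 48 | 42 | 42 | 38 | 34 | 40.6% |
| Ministral 3B | 75 | 48 | 31 | 27 | 19 | 40.2% |
| o4 Mini High | 60 | 52 | 42 | 28 | 15 | 39.5% |
| Qwen3.6 Max Preview | 73 | 52 | 39 | 20 | 11 | 39.1% |
| GPT-5.5 | 52 | 52 | 50 | 36 | 4 | 38.9% |
| Aion 2.0 | 66 | 36 | 35 | 31 | 26 | 38.9% |
| GPT-5 | 57 | 50 | 44 | 26 | 16 | 38.6% |
| Ministral 3 14B | 91 | 29 | 27 | 25 | 20 | 38.4% |
| GPT-4.1 Mini | 60 | 52 | 32 | 30 | 17 | 38.1% |
| Ministral 3 3B | 97 | 49 | 31 | 9 | 3 | 37.9% |
| Claude 3.5 Sonnet | 68 | 65 | 56 | 0 | 0 | 37.7% |
| GPT-5.4 Mini (Reasoning, Low) | 70 | 48 | 25 | 23 | 23 | 37.7% |
| Stealth: Hunter Alpha | 58 | 45 | 33 | 26 | 23 | 36.9% |
| Qwen 3.6 Flash | 72 | 45 | 29 | 27 | 8 | 36.1% |
| Gemini 2.5 Pro | 56 | 55 | 49 | 11 | 2 | 34.5% |
| Xiaomi MIMO v2.5 | 65 | 34 | 29 | 27 | 16 | 34.3% |
| GPT-5.4 Nano | 56 | 43 | 35 | 21 | 16 | 34.2% |
| Gemma 3 4B | 75 | 34 | 31 | 16 | 15 | 34.1% |
| Llama 3.1 Nemotron 70B | 57 | 48 | 37 | 14 | 13 | 33.8% |
| Z.AI GLM 4.7 Flash | 57 | 54 | 23 | 19 | 13 | 33.2% |
| Grok 4.20 (Beta) | 52 | 36 | 34 | 28 | 15 | 33.1% |
| Ministral 3 8B | 81 | 66 | 13 | 0 | 0 | 32.2% |
| o4 Mini | 63 | 41 | 31 | 17 | 7 | 31.7% |
| Claude 3.7 Sonnet | 68 | 37 | 32 | 14 | 8 | 31.6% |
| LFM2 24B | 40 | 38 | 34 | 27 | 20 | 31.6% |
| DeepSeek V3.1 | 47 | 41 | 26 | 23 | 16 | 30.7% |
| Qwen 3.6 35B | 61 | 50 | 31 | 8 | 0 | 30.0% |
| Mistral Small 3.2 24B | 60 | 45 | 22 | 22 | 0 | 29.7% |
| Grok 4.20 (Beta, Reasoning) | 48 | 35 | 23 | 21 | 20 | 29.2% |
| ByteDance Seed 1.6 Flash | 45 | 41 | 35 | 13 | 4 | 27.5% |
| Gemma 3 12B | 44 | 34 | 34 | 23 | 2 | 27.4% |
| Qwen 3.5 Plus (2026-02-15) | 43 | 36 | 32 | 13 | 1 | 25.0% |
| Grok 4.3 | 86 | 35 | 3 | 0 | 0 | 24.9% |
| Ministral 8B | 53 | 37 | 31 | 1 | 0 | 24.6% |
| Gemini 3.5 Flash (Reasoning) | 35 | 27 | 25 | 16 | 16 | 23.7% |
| Gemini 3.5 Flash (Reasoning, Minimal) | 56 | 46 | 12 | 4 | 0 | 23.7% |
| GPT-5.4 Nano (Reasoning, Low) | 31 | 30 | 30 | 21 | 2 | 23.1% |
| GPT-5.4 Mini | 37 | 25 | 22 | 16 | 15 | 23.1% |
| Grok 4.20 (Reasoning) | 34 | 28 | 26 | 23 | 5 | 23.0% |
| GPT-5.4 Nano (Reasoning) | 40 | 33 | 22 | 16 | 0 | 21.9% |
| Gemini 3.1 Pro (Preview) | 33 | 31 | 21 | 13 | 10 | 21.7% |
| DeepSeek-V2 Chat | 42 | 25 | 22 | 19 | 0 | 21.6% |
| Mistral NeMO | 43 | 40 | 21 | 0 | 0 | 20.9% |
| Gemini 3 Pro (Preview) | 34 | 21 | 20 | 18 | 0 | 18.6% |
| GPT-4o, Aug. 6th (temp=0) | 59 | 18 | 15 | 0 | 0 | 18.3% |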
Gemini 2.5 Flash (Reasoning)25242213016.8%
Qwen 3.6 27B641900016.6%
GPT-5.4 Mini (Reasoning)3429135116.5%
Nemotron 3 Super2924150013.7%
Qwen 3.5 9B67000013.4%
Z.AI GLM 4.720171613013.2%
Gemini 3 Flash (Preview, Reasoning)2519192012.8%
Grok 4.3 (Reasoning)302370012.1%
GPT-4o Mini (temp=0)2516150011.4%
ByteDance Seed 2.0 Mini282040010.4%
Gemini 3.1 Flash Lite46410010.0%
DeepSeek V3 (2024-12-26)2199408.7%
Gemma 4 31B21119008.1%
Qwen 2.5 72B19166008.1%
GPT-5.23070007.4%
GPT-5 Nano12109406.9%
Gemma 4 26B3310006.8%
Gemini 3 Flash (Preview)1875306.4%
Gemma 4 26B (Reasoning)1770005.0%
Arcee AI: Trinity Mini13101004.7%
ByteDance Seed 1.61751004.7%
ByteDance Seed 2.0 Lite1640004.0%
GPT-5 Mini1820003.9%
Gemini 3.1 Flash Lite (Reasoning)754203.7%
Gemini 3.1 Flash Lite (Preview)760002.6%
GPT-4o, May 13th (temp=0)930002.4%
GPT-OSS 120B700001.4%
Nemotron 3 Nano700001.4%
Qwen3.7 Max300000.5%
Qwen 3.5 Flash200000.4%
Gemma 4 31B (Reasoning)100000.1%
Qwen 3.5 122B000000.0%
Qwen 3.5 27B000000.0%
Qwen 3.5 35B000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Inception Mercury000000.0%

genre

Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5.5 (Reasoning, Low)999695847890.5%
GPT-5.4 (Reasoning, Low)1009089887989.3%
GPT-5.4 (Reasoning)1009486857788.4%
GPT-5.4 Mini (Reasoning, Low)1009184837185.8%
Rocinante 12B100100100645082.7%
GPT-5.4 Mini (Reasoning)888679767681.2%
GPT-5.5 (Reasoning)868281807380.2%
Llama 3.1 8B10010098623879.6%
Llama 3.1 Nemotron 70B10010091514978.2%
Gemini 3.5 Flash (Reasoning)988175745977.3%
GPT-5.5977575716576.6%
Claude Opus 4.710010067584975.0%
GPT-5.4887575716374.4%
GPT-5.4 Mini848166666672.6%
GPT-5.11008563575271.5%
Writer: Palmyra X51008863594470.9%
Xiaomi MIMO v2.51006864635269.5%
Gemini 3.1 Flash Lite878073544768.1%
Llama 3.1 70B898278484368.0%
Gemini 3.5 Flash (Reasoning, Minimal)787170624365.0%
Mistral Small 4987572452562.8%
WizardLM 2 8x22b878253533862.5%
Qwen3.6 Max Preview816960524661.4%
GPT-4o Mini (temp=1)936564503060.4%
GPT-4o, Aug. 6th (temp=1)937666481459.5%
MiniMax M2.5917052483559.2%
Claude Sonnet 4.5796856524159.2%
Grok 4 Fast907752403759.1%
Claude Opus 4.7 (Reasoning)706863474759.0%
Qwen 3.5 397B A17B857869382559.0%
Grok 4.20655959575559.0%
Gemini 3.1 Flash Lite (Reasoning)756665454258.6%
Gemini 3 Flash (Preview, Reasoning)836557464058.4%
Gemini 3.1 Pro (Preview)787454503658.3%
DeepSeek V4 Pro (Reasoning)746756514157.8%
Claude Sonnet 4.6 (Reasoning)706861414156.5%
Gemini 3.1 Flash Lite (Preview)765453504956.4%
GPT-4.1656359553956.2%
Qwen3 235B A22B Instruct 2507926358531255.5%
Grok 4.20 (Reasoning)716159493755.4%
Claude Sonnet 4.6786449473855.1%
Hermes 3 405B737251502855.0%
ByteDance Seed 1.6 Flash100776529054.4%
DeepSeek V3 (2025-03-24)1005045433454.4%
Z.AI GLM 4.5786451413854.4%
MoonshotAI: Kimi K2.61006248332854.3%
Claude Opus 4.6 (Reasoning)666654483453.8%
Qwen3.7 Max725656552853.4%
Qwen 3.6 Flash85715440851.7%
Z.AI GLM 5 Turbo746747471750.5%
Qwen 3.6 35B595954433750.2%
Claude Sonnet 4816337343149.4%
DeepSeek V3.2646258441648.9%
Grok 4.20 (Beta)705343423448.3%
GPT-4o, May 13th (temp=1)874638383248.1%
Z.AI GLM 5.1695351343347.9%
Xiaomi MIMO v2.5 Pro645643413747.9%
GPT-5.4 Nano585651502447.6%
Gemini 2.5 Flash (Reasoning)685949402147.6%
Gemini 2.5 Pro684943423447.1%
Gemma 4 26B715140363446.5%
GPT-5 Mini564645444046.3%
Claude Opus 4.672714731946.1%
Gemini 3 Flash (Preview)796637242345.9%
Gemma 4 31B (Reasoning)80635037045.9%
Cohere Command R+ (Aug. 2024)70604843845.8%
Qwen 3.5 Plus (2026-02-15)665939313045.0%
Grok 4695043422145.0%
Z.AI GLM 5824746311544.1%
Grok 4.1 Fast644543402744.0%
Grok 4.20 (Beta, Reasoning)535251481643.9%
GPT-5.2525146383043.2%
Claude 3.7 Sonnet64545444043.2%
GPT-5634539353242.8%
Gemma 4 26B (Reasoning)56545346542.7%
GPT-5.4 Nano (Reasoning)504745442742.6%
MoonshotAI: Kimi K2.5754336322642.5%
DeepSeek V4 Flash (Reasoning)615140362342.2%
Hermes 3 70B534340383441.7%
Claude 3 Haiku66535330040.4%
Z.AI GLM 4.5 Air534040383140.3%
Stealth: Hunter Alpha574438372540.2%
ByteDance Seed 2.0 Lite69624325040.0%
Gemma 4 31B545141401239.6%
Grok 4.3534137313138.9%
Mistral Small 3.2 24B10052420038.8%
Qwen 3.5 Plus (2026-04-20)694734321238.7%
Mistral Medium 3.1504136362637.8%
DeepSeek-V2 Chat74572523737.1%
Claude Haiku 4.564623016936.1%
Ministral 3 14B685432161035.9%
DeepSeek V4 Flash464240381035.0%
GPT-4.1 Nano6863376035.0%
LFM2 24B593734281634.7%
GPT-5.4 Nano (Reasoning, Low)413939282634.5%
Qwen 3.6 27B564127262234.3%
Stealth: Healer Alpha504331271633.5%
GPT-4.1 Mini70362928032.7%
Claude Opus 4693023231932.7%
DeepSeek V4 Pro48413925731.9%
Mistral Large 3413928262531.9%
Gemini 2.5 Flash Lite (Reasoning)593923231731.9%
Qwen 3 32B57503911031.6%
GPT-4o, May 13th (temp=0)554228181331.2%
Gemma 3 12B46393931131.0%
Z.AI GLM 4.6643029201130.8%
DeepSeek V3 (2024-12-26)535227111130.7%
Gemini 2.5 Flash574519161530.5%
Qwen 3.5 35B55512716029.7%
Claude Opus 4.554523011029.3%
Grok 4.3 (Reasoning)59343014929.1%
ByteDance Seed 1.66360230029.0%
Gemini 3 Pro (Preview)62382914028.6%
Gemma 3 4B46432623328.2%
Mistral Large 24945318728.0%
o4 Mini51471716727.7%
Z.AI GLM 4.7423128261127.7%
Nemotron 3 Super40402424927.5%
Aion 2.059332619027.4%
Mistral Small 4 (Reasoning)52302817826.9%
GPT-4o Mini (temp=0)363229231226.8%
GPT-4o, Aug. 6th (temp=0)42383119426.7%
Claude 3.5 Sonnet7931184126.6%
Mistral Large363524231226.1%
Mistral NeMO423519161325.0%
DeepSeek V3.14342229824.9%
o4 Mini High323128221124.6%
Gemma 3 27B36332821424.5%
Nemotron 3 Nano36302919423.9%
ByteDance Seed 2.0 Mini45412310023.9%
Mistral Small Creative46371610923.6%
Qwen 3.5 122B5747120023.2%
Qwen 3.5 Flash4237233021.2%
Gemini 2.5 Flash Lite312519161320.9%
Ministral 8B4232240019.5%
Z.AI GLM 4.7 Flash32231514618.0%
Ministral 3B4334111017.9%
Qwen 2.5 72B35221611016.8%
Ministral 3 8B413040014.9%
Arcee AI: Trinity Large (Preview)372600012.7%
Qwen 3.5 9B3577009.9%
Inception Mercury 2181413008.9%
MiniMax M2.71698006.6%
GPT-OSS 120B1084004.5%
Qwen 3.5 27B2020004.4%
Ministral 3 3B1400002.8%
GPT-5 Nano553002.7%
Arcee AI: Trinity Mini1200002.3%
Inception Mercury400000.8%
Stealth: Aurora Alpha000000.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5.4100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 (Reasoning)1001001001009699.2%
GPT-5.4 (Reasoning, Low)100100100979197.5%
GPT-5.1100100100988696.8%
GPT-5.5 (Reasoning)10010093908393.2%
Grok 4.1 Fast100100100986292.0%
Claude Opus 4.6 (Reasoning)10010098818191.8%
Rocinante 12B100100100817591.3%
GPT-5.51009087878689.9%
Gemini 3.5 Flash (Reasoning, Minimal)10010097955589.3%
Claude Sonnet 4.51009886857488.6%
Claude Sonnet 4.6 (Reasoning)10010095736887.3%
GPT-5.4 Mini (Reasoning)10010086836486.5%
Claude Opus 4.5100100100795286.2%
Claude Sonnet 41009590766585.3%
DeepSeek V4 Pro10010091696685.3%
GPT-5.5 (Reasoning, Low)958781807583.8%
MiniMax M2.5100100100883083.6%
Claude Haiku 4.51009790795183.5%
GPT-4o, Aug. 6th (temp=1)10010097764383.3%
Claude Opus 4.61009675747183.3%
Mistral Small 4 (Reasoning)10010084834983.0%
DeepSeek V3 (2025-03-24)1009087874982.7%
Claude Opus 4.710010085705882.6%
GPT-4o Mini (temp=1)928585767382.3%
Z.AI GLM 510010088675582.0%
Z.AI GLM 5 Turbo1008481736580.6%
GPT-5.4 Mini (Reasoning, Low)10010079774279.8%
GPT-5.4 Mini948684736279.8%
DeepSeek V3.2918980686578.8%
Z.AI GLM 5.11009082645678.3%
DeepSeek V4 Pro (Reasoning)868379707077.5%
Mistral Small 4908274717077.4%
Mistral Small Creative979181793376.1%
Grok 41009089633776.0%
Llama 3.1 8B100100100472774.8%
MoonshotAI: Kimi K2.6878675725174.3%
Ministral 3 8B10010066594273.5%
Gemma 3 27B828174656573.4%
LFM2 24B1008483633773.4%
GPT-5.4 Nano (Reasoning)1007968625673.0%
Xiaomi MIMO v2.51007868635572.9%
GPT-5.2837776666372.8%
Grok 4 Fast1009272702571.7%
Gemma 3 12B797871646170.4%
Claude 3.5 Sonnet1009360564070.0%
GPT-4.1 Nano10010076393469.6%
Gemini 2.5 Flash1007265574868.2%
Grok 4.20 (Reasoning)777069645667.5%
Arcee AI: Trinity Large (Preview)1009758532967.5%
Z.AI GLM 4.5796964646267.4%
GPT-4.1979070483167.3%
Stealth: Healer Alpha858280463866.3%
Claude Sonnet 4.6846963595566.0%
Ministral 3 14B878555534965.8%
Qwen 3.6 35B957871444065.6%
Qwen 3 32B1009057552365.0%
Gemini 3.5 Flash (Reasoning)796966624764.7%
Cohere Command R+ (Aug. 2024)1007764542764.3%
Grok 4.20 (Beta)867659584164.0%
Aion 2.0906560525263.8%
Claude Opus 4.7 (Reasoning)856161565463.7%
DeepSeek V4 Flash1006657494763.6%
DeepSeek V4 Flash (Reasoning)926560574363.3%
Gemma 3 4B836557555463.0%
Ministral 8B908272462562.9%
GPT-5.4 Nano (Reasoning, Low)736967624262.5%
MiniMax M2.71009151392962.1%
Hermes 3 70B948757363461.7%
Ministral 3B100977731060.9%
Gemini 2.5 Flash (Reasoning)917862482560.9%
Claude Opus 4787760542659.1%
Gemma 4 26B (Reasoning)777171383458.1%
GPT-5.4 Nano796052524657.6%
Grok 4.20646362564357.6%
Qwen 3.5 Plus (2026-02-15)736554494657.4%
Hermes 3 405B100855843057.3%
GPT-5716362543356.8%
Gemini 3 Flash (Preview, Reasoning)747158522856.5%
ByteDance Seed 1.6 Flash1007556282256.2%
GPT-4o, May 13th (temp=1)847859332856.2%
MoonshotAI: Kimi K2.5836160502656.1%
GPT-5 Mini615955554955.9%
Llama 3.1 Nemotron 70B100755251055.6%
Grok 4.20 (Beta, Reasoning)866145444255.5%
Qwen 3.6 Flash816959333254.9%
Gemini 2.5 Pro885551493154.8%
Xiaomi MIMO v2.5 Pro676551474154.1%
Claude 3 Haiku1009131281853.7%
Gemini 3.1 Pro (Preview)897240373053.6%
Z.AI GLM 4.7846554382653.5%
Claude 3.7 Sonnet805653413653.3%
Gemini 2.5 Flash Lite (Reasoning)676256513053.3%
DeepSeek-V2 Chat100746320652.6%
Mistral Medium 3.1656149473451.4%
Mistral NeMO95795032051.3%
Gemini 2.5 Flash Lite636053413851.2%
ByteDance Seed 2.0 Mini767467181850.6%
Gemma 4 31B635353453850.4%
Mistral Large76585749849.4%
Qwen 3.5 Plus (2026-04-20)1004934323149.2%
GPT-4.1 Mini595553413648.8%
Mistral Large 2625448413547.8%
Stealth: Hunter Alpha706354351447.2%
DeepSeek V3.1525145424046.0%
Gemini 3 Pro (Preview)825350331345.9%
Gemma 4 31B (Reasoning)676342391945.9%
Llama 3.1 70B1004038341745.6%
Gemini 3 Flash (Preview)736444251945.0%
o4 Mini High727038261744.6%
Qwen 3.5 397B A17B575349313144.2%
Qwen3.6 Max Preview595238373043.2%
o4 Mini575637292641.0%
Ministral 3 3B74713616239.8%
Z.AI GLM 4.5 Air714843201639.5%
Z.AI GLM 4.7 Flash514743322439.5%
Grok 4.3584036352839.5%
Gemma 4 26B554735332438.8%
Qwen 3.6 27B524543391238.4%
Gemini 3.1 Flash Lite88463121638.3%
Mistral Large 365463836337.5%
GPT-4o Mini (temp=0)645636181337.5%
DeepSeek V3 (2024-12-26)10049316037.3%
Z.AI GLM 4.6584734241635.7%
Qwen3.7 Max84432318935.4%
Nemotron 3 Super564034281735.0%
ByteDance Seed 2.0 Lite72373625034.2%
WizardLM 2 8x22b694530151133.9%
Qwen 3.5 Flash73393516533.8%
GPT-4o, May 13th (temp=0)58482910930.9%
Arcee AI: Trinity Mini54423812830.8%
Grok 4.3 (Reasoning)50453814530.5%
Mistral Small 3.2 24B4847446028.9%
GPT-5 Nano353531181426.5%
ByteDance Seed 1.642353118025.5%
Gemini 3.1 Flash Lite (Preview)4643259525.5%
Gemini 3.1 Flash Lite (Reasoning)35343424025.4%
Qwen 3.5 122B45372114724.8%
GPT-4o, Aug. 6th (temp=0)6620129322.0%
Qwen 3.5 35B38242412520.4%
Qwen 3.5 9B36261913018.8%
Qwen 3.5 27B3626104015.2%
Nemotron 3 Nano2418119112.5%
Qwen 2.5 72B3910007.9%
GPT-OSS 120B1120002.5%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Inception Mercury000000.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Llama 3.1 8B100100100975690.7%
Qwen3 235B A22B Instruct 25071009281655478.5%
Mistral Large979082683274.1%
Mistral Small 4928785594172.7%
GPT-5.5 (Reasoning, Low)808073694870.0%
GPT-5.4908574643168.9%
Writer: Palmyra X51009657523768.3%
Grok 4.20 (Beta, Reasoning)897468624667.9%
GPT-5.4 Mini797669654767.4%
GPT-5.4 Mini (Reasoning, Low)786967565464.9%
Llama 3.1 Nemotron 70B858054535164.4%
Mistral Large 21007370522764.3%
Grok 4.20 (Reasoning)797266643763.5%
Grok 4 Fast878166493563.4%
Llama 3.1 70B10010063401263.2%
GPT-5.4 (Reasoning)756863585162.9%
GPT-5.4 (Reasoning, Low)846160584762.0%
GPT-5.5 (Reasoning)706861545461.4%
GPT-5.4 Mini (Reasoning)666361575660.7%
MiniMax M2.5886858551957.8%
Mistral Small Creative817150473657.2%
GPT-5.5756457464156.7%
Qwen3.6 Max Preview745958414054.3%
Ministral 3 14B726760511953.7%
Grok 4745752443853.0%
GPT-5.1656161502352.1%
Claude Opus 4726558461851.7%
Grok 4.1 Fast925340393451.6%
DeepSeek V3 (2025-03-24)716643423651.6%
Qwen3.7 Max695554512751.2%
Ministral 8B897641262150.6%
Mistral Medium 3.1856554341450.5%
Claude Sonnet 4.51006053221750.3%
Grok 4.20 (Beta)716460342049.8%
Grok 4.20555452493649.3%
Claude 3 Haiku1007033291148.5%
Rocinante 12B100774617148.3%
Mistral Small 4 (Reasoning)100634234047.9%
Claude Opus 4.583754535247.7%
GPT-5.4 Nano (Reasoning)695747352947.3%
Hermes 3 70B604847383545.5%
MoonshotAI: Kimi K2.661595739844.8%
Hermes 3 405B1004034252544.6%
Ministral 3 8B1004234271543.5%
Gemini 3.1 Pro (Preview)86563830342.6%
Qwen 3.6 35B69584342042.3%
MoonshotAI: Kimi K2.5767227181742.1%
Claude Opus 4.6604241392841.9%
Qwen 3.6 Flash67554539041.1%
DeepSeek V4 Pro (Reasoning)684941271940.7%
Z.AI GLM 5 Turbo88534217040.2%
Grok 4.3 (Reasoning)63534635039.5%
Claude Haiku 4.578503524839.1%
GPT-5 Mini604641271938.6%
Qwen 3.5 397B A17B48474743838.5%
Z.AI GLM 5575149211438.3%
Z.AI GLM 5.1100382512936.8%
Qwen 3 32B9153279035.9%
o4 Mini47474633535.6%
ByteDance Seed 1.6 Flash534237271935.5%
Arcee AI: Trinity Large (Preview)10040245334.4%
GPT-5.4 Nano (Reasoning, Low)503831282333.8%
LFM2 24B51433431833.6%
Claude Sonnet 4.6692827212033.3%
WizardLM 2 8x22b10040149032.7%
Claude Sonnet 449444029032.4%
Gemma 3 12B53503515832.3%
GPT-4.151433716931.4%
DeepSeek V3 (2024-12-26)54383620029.8%
ByteDance Seed 2.0 Mini10030190029.8%
DeepSeek V4 Flash (Reasoning)71421715329.7%
Claude Sonnet 4.6 (Reasoning)63362816429.3%
Mistral NeMO582520181727.7%
Z.AI GLM 4.5353525211827.0%
Claude Opus 4.743333127026.9%
Nemotron 3 Nano5048308026.9%
Ministral 3B54342119726.7%
GPT-5.2482724201526.6%
Claude Opus 4.7 (Reasoning)60352115026.3%
GPT-4o Mini (temp=1)342727232126.1%
DeepSeek V4 Pro41332624525.9%
GPT-4o, Aug. 6th (temp=1)433626121125.6%
Mistral Large 36141251025.6%
o4 Mini High56262511524.4%
Claude Opus 4.6 (Reasoning)34302820723.8%
Cohere Command R+ (Aug. 2024)6437117023.8%
Gemini 3.1 Flash Lite (Reasoning)38313020023.7%
Stealth: Hunter Alpha42272221022.4%
MiniMax M2.745332012021.9%
GPT-4o, May 13th (temp=0)46252116021.4%
Gemini 3.5 Flash (Reasoning, Minimal)343017151121.4%
GPT-5.4 Nano33262119520.8%
Z.AI GLM 4.5 Air5031166020.6%
Qwen 3.5 Plus (2026-04-20)513684019.7%
Gemini 2.5 Flash Lite (Reasoning)33292411019.3%
Mistral Small 3.2 24B464133018.6%
Xiaomi MIMO v2.54521193017.7%
Gemini 2.5 Flash (Reasoning)562280017.3%
Xiaomi MIMO v2.5 Pro65764016.4%
Qwen 3.5 Plus (2026-02-15)3720147115.7%
Qwen 3.6 27B4122106015.7%
Gemini 2.5 Pro29221612015.6%
GPT-52525195415.6%
Gemini 3.1 Flash Lite3518159015.4%
Qwen 2.5 72B2721189015.1%
Gemini 3.5 Flash (Reasoning)3720160014.8%
DeepSeek-V2 Chat342585014.5%
GPT-4.1 Mini3324160014.4%
Qwen 3.5 35B2524200013.8%
Gemini 2.5 Flash Lite2716147413.7%
Z.AI GLM 4.62423174013.5%
Qwen 3.5 27B412700013.5%
Z.AI GLM 4.7 Flash29141312013.4%
GPT-4o, Aug. 6th (temp=0)3220140013.3%
Stealth: Healer Alpha2319169013.2%
Ministral 3 3B382232013.0%
GPT-4o, May 13th (temp=1)282590012.4%
Grok 4.32419105011.8%
Claude 3.5 Sonnet47550011.6%
Gemma 3 4B232366011.4%
Gemma 4 31B (Reasoning)362000011.2%
Claude 3.7 Sonnet2916120011.2%
Gemma 4 26B261355210.2%
Gemma 4 26B (Reasoning)1812128010.1%
DeepSeek V3.230143009.3%
GPT-4.1 Nano2496508.8%
Gemini 2.5 Flash23143208.6%
Gemini 3.1 Flash Lite (Preview)17118608.5%
Nemotron 3 Super18117708.4%
Gemma 4 31B14145528.1%
Qwen 3.5 Flash3143007.6%
GPT-5 Nano14108337.5%
GPT-4o Mini (temp=0)17140006.1%
Aion 2.01398006.1%
Inception Mercury 23000006.0%
DeepSeek V4 Flash2230005.1%
Qwen 3.5 122B1770004.9%
Gemini 3 Pro (Preview)1670004.6%
DeepSeek V3.11093004.5%
Gemma 3 27B1432003.7%
Z.AI GLM 4.71700003.5%
ByteDance Seed 2.0 Lite750002.5%
Gemini 3 Flash (Preview)210000.7%
Gemini 3 Flash (Preview, Reasoning)300000.6%
Arcee AI: Trinity Mini100000.2%
ByteDance Seed 1.6000000.0%
GPT-OSS 120B000000.0%
Qwen 3.5 9B000000.0%
Stealth: Aurora Alpha000000.0%
Inception Mercury000000.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5.4100100100907592.9%
Claude Sonnet 4.510010096957092.4%
Grok 4 Fast1009898887892.3%
Writer: Palmyra X51001001001005891.5%
GPT-5.4 (Reasoning)100100100935890.2%
GPT-5.4 (Reasoning, Low)10010091867189.7%
GPT-5.510010083837087.1%
Mistral Small 4 (Reasoning)1008989757585.7%
Qwen3.6 Max Preview10010083666582.9%
Mistral Small 4998982766281.7%
GPT-5.5 (Reasoning, Low)1008879736480.9%
Grok 410010074695780.1%
GPT-5.5 (Reasoning)918888645877.9%
Ministral 3 14B10010084634077.4%
Qwen 3.6 35B1009592662776.0%
Claude Opus 4.7 (Reasoning)1009876574575.2%
GPT-5.4 Mini977975645373.5%
Qwen3 235B A22B Instruct 25071009874593573.1%
Grok 4.1 Fast1007571625272.0%
Qwen 3.6 Flash1007768565170.1%
Claude Opus 4.7929080454270.0%
Qwen 3.5 397B A17B1008368663169.9%
Grok 4.20 (Reasoning)918065594568.0%
GPT-5.1967058585467.3%
Claude Opus 4917566653867.0%
Claude Sonnet 4.61008371473166.5%
Gemini 3.1 Flash Lite (Preview)827271575066.2%
Gemini 2.5 Flash (Reasoning)888472444165.8%
MiniMax M2.51006560554765.4%
GPT-5.4 Mini (Reasoning, Low)706867625865.1%
Grok 4.20888173443965.1%
MoonshotAI: Kimi K2.5866563555264.3%
Claude 3.5 Sonnet978556493364.0%
Llama 3.1 8B10010010016463.9%
Rocinante 12B100968825863.6%
GPT-5.4 Mini (Reasoning)928461503063.4%
Z.AI GLM 5 Turbo926864504062.7%
Hermes 3 70B10010076201662.3%
MoonshotAI: Kimi K2.6807470533462.1%
Gemini 2.5 Flash Lite (Reasoning)1007049463259.4%
DeepSeek V4 Flash746356564258.1%
Grok 4.20 (Beta, Reasoning)786457563357.6%
Claude Opus 4.6 (Reasoning)726859513657.3%
Llama 3.1 Nemotron 70B968243412156.7%
Z.AI GLM 5.1976056363156.0%
Claude Opus 4.51005744423154.9%
GPT-5716057533454.8%
Mistral NeMO925949403354.5%
Grok 4.390745546553.9%
Qwen3.7 Max757370252553.3%
Gemma 3 12B96706129953.3%
Stealth: Hunter Alpha835448463052.1%
GPT-4.1 Nano835651462351.8%
Mistral Large 294756022851.7%
Qwen 3 32B1008036251751.6%
Claude Sonnet 4736954431751.4%
DeepSeek V4 Pro (Reasoning)706843393451.0%
Claude Sonnet 4.6 (Reasoning)1008933181450.7%
Stealth: Healer Alpha765552383150.6%
Qwen 3.6 27B99694935050.3%
DeepSeek V4 Pro87715529850.3%
Mistral Large 3925348451350.2%
Claude Opus 4.686624743849.3%
Mistral Large64625753849.0%
GPT-4.1767255201948.3%
Gemini 3.1 Flash Lite726147441447.9%
Gemini 3 Flash (Preview)806944271847.6%
GPT-5.4 Nano635149482647.4%
Mistral Small Creative71575553047.1%
Mistral Medium 3.1774747343147.0%
GPT-5.4 Nano (Reasoning, Low)655956262546.2%
Hermes 3 405B1005527242345.6%
Gemini 3.1 Pro (Preview)100754012045.3%
DeepSeek V3 (2025-03-24)79706214045.1%
Gemini 3.5 Flash (Reasoning, Minimal)985825222044.6%
GPT-5.4 Nano (Reasoning)494843424044.5%
Xiaomi MIMO v2.562624845344.0%
MiniMax M2.7944231262343.2%
Claude 3.7 Sonnet815049221142.7%
Gemini 3.1 Flash Lite (Reasoning)59535249042.6%
GPT-4o, Aug. 6th (temp=1)565537343042.5%
Z.AI GLM 5554745323242.3%
DeepSeek V3.280524037041.8%
Arcee AI: Trinity Large (Preview)624636313041.1%
Xiaomi MIMO v2.5 Pro434342403540.7%
Claude Haiku 4.5605452221440.2%
GPT-5.2625438281840.0%
Ministral 8B634837341539.4%
Gemma 4 31B605034282339.1%
Gemma 3 27B635832241638.7%
Qwen 3.5 Plus (2026-04-20)534543381538.7%
DeepSeek V4 Flash (Reasoning)54494731937.8%
Mistral Small 3.2 24B7169434037.4%
Grok 4.20 (Beta)524847191736.8%
Ministral 3 8B7769380036.7%
Z.AI GLM 4.5 Air605130241836.7%
GPT-4o Mini (temp=1)69494618136.7%
WizardLM 2 8x22b654838161135.4%
Z.AI GLM 4.6593429272635.1%
ByteDance Seed 2.0 Mini52514625034.7%
Grok 4.3 (Reasoning)63423131534.7%
Claude 3 Haiku71464016034.6%
ByteDance Seed 1.687492011033.5%
DeepSeek-V2 Chat75433311032.6%
ByteDance Seed 1.6 Flash57383821832.3%
Qwen 3.5 122B53484312031.3%
Gemini 2.5 Pro673123201431.0%
Ministral 3B53432927030.4%
Gemini 3.5 Flash (Reasoning)8729295029.9%
Gemini 2.5 Flash Lite55482420229.8%
o4 Mini High393932231529.4%
Gemini 2.5 Flash64322619529.1%
Qwen 3.5 Flash46462725028.8%
o4 Mini40363528628.8%
ByteDance Seed 2.0 Lite8741100027.7%
Aion 2.05343329027.7%
GPT-5 Mini453523201427.5%
Qwen 3.5 Plus (2026-02-15)47352718726.7%
Ministral 3 3B48392712025.2%
Z.AI GLM 4.752392312025.1%
Gemini 3 Flash (Preview, Reasoning)8026170024.6%
Llama 3.1 70B5130287524.2%
Nemotron 3 Nano7920152023.1%
GPT-4.1 Mini36353111022.6%
LFM2 24B6226205022.5%
Cohere Command R+ (Aug. 2024)57241810021.9%
GPT-4o, Aug. 6th (temp=0)4938145421.8%
Nemotron 3 Super45291612521.3%
Gemma 4 31B (Reasoning)34282517321.3%
DeepSeek V3.15435134021.1%
Qwen 3.5 35B711895020.7%
Gemma 4 26B (Reasoning)3428248720.3%
Z.AI GLM 4.556211211019.9%
Gemma 4 26B29271816919.7%
GPT-4o, May 13th (temp=1)31212014117.2%
GPT-4o, May 13th (temp=0)4325160017.0%
Qwen 3.5 27B4022145016.2%
Gemini 3 Pro (Preview)2625139515.5%
Gemma 3 4B3022197015.5%
DeepSeek V3 (2024-12-26)4611102013.7%
Qwen 2.5 72B3117137013.6%
Z.AI GLM 4.7 Flash22171311513.5%
Qwen 3.5 9B262630011.0%
GPT-5 Nano23109809.8%
GPT-4o Mini (temp=0)21140007.0%
Arcee AI: Trinity Mini3400006.7%
GPT-OSS 120B770002.9%
Inception Mercury900001.9%
Inception Mercury 2700001.4%
Stealth: Aurora Alpha000000.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Rocinante 12B100100100876991.3%
GPT-5.41009284807987.0%
Llama 3.1 8B10010010093980.4%
GPT-5.4 Mini (Reasoning, Low)1008684735479.5%
GPT-5.4 (Reasoning)928382755677.5%
GPT-5.4 Mini (Reasoning)939380724275.9%
Qwen3 235B A22B Instruct 2507868577704672.7%
Qwen3.6 Max Preview838275705172.3%
GPT-5.4 Mini928872525070.9%
Mistral Small 41009554494668.8%
GPT-5.4 (Reasoning, Low)767066605364.7%
Ministral 3 14B887765523864.0%
MiniMax M2.7796862554962.8%
Qwen 3.6 35B767668582460.6%
Qwen 3.6 Flash767672661160.1%
Mistral Small 4 (Reasoning)926358434259.9%
GPT-5.5 (Reasoning, Low)726161584359.0%
GPT-5.5 (Reasoning)827755552659.0%
Llama 3.1 70B1008049471357.9%
GPT-5.5626258555057.6%
Grok 4877942403857.3%
Mistral Large 31007245363156.8%
Writer: Palmyra X5100876032055.9%
GPT-4.1716560493355.5%
Grok 4.1 Fast816050503655.3%
Claude Sonnet 4.5857268391054.9%
Grok 4 Fast756952413454.4%
Grok 4.20 (Reasoning)646356464154.1%
Claude Opus 4.5806153423253.8%
Hermes 3 405B1005248451852.7%
Grok 4.20675957502852.0%
Claude Opus 4.6904949432851.9%
Gemini 3.1 Flash Lite (Reasoning)715650423951.8%
GPT-5.1695349423850.3%
Claude Opus 4925943271948.1%
DeepSeek V4 Pro (Reasoning)856942261547.5%
ByteDance Seed 1.6 Flash635756422047.5%
DeepSeek V3 (2025-03-24)100683434047.3%
Gemini 3.1 Pro (Preview)90813327046.2%
Grok 4.20 (Beta)726246262546.1%
DeepSeek V4 Pro81554241745.3%
Qwen 3.5 397B A17B685956301044.5%
Claude Haiku 4.5776037341444.4%
Claude 3 Haiku645856222044.0%
Z.AI GLM 5775244311544.0%
Grok 4.20 (Beta, Reasoning)615538363043.8%
Cohere Command R+ (Aug. 2024)664339373343.6%
Z.AI GLM 5.1685542251841.8%
MoonshotAI: Kimi K2.560565136040.5%
Qwen 3.5 35B724341241940.0%
Mistral Small Creative74454238039.7%
Claude Opus 4.6 (Reasoning)585237292239.5%
Z.AI GLM 5 Turbo545342351439.4%
GPT-5.4 Nano (Reasoning)624742301439.2%
Ministral 3 8B69564327039.2%
MiniMax M2.590423427038.7%
Ministral 8B66595110437.8%
Gemini 2.5 Pro584634242437.2%
Gemini 3.5 Flash (Reasoning)714934171537.1%
GPT-5 Mini504031302735.7%
GPT-4o Mini (temp=1)583633301935.5%
Claude Sonnet 463552824034.3%
Mistral Medium 3.1434139272034.0%
GPT-5.4 Nano (Reasoning, Low)463937361133.9%
Claude Opus 4.763474712033.9%
DeepSeek V4 Flash89382012933.6%
Gemini 3.1 Flash Lite524030291633.3%
Qwen 3 32B56523721033.2%
Arcee AI: Trinity Large (Preview)56463021731.9%
Gemini 2.5 Flash (Reasoning)653122211731.5%
DeepSeek-V2 Chat78272420530.8%
DeepSeek V3 (2024-12-26)513030261330.1%
GPT-4o, May 13th (temp=0)612928201029.8%
Mistral Large5047466029.6%
Gemini 2.5 Flash Lite56393811029.0%
Hermes 3 70B72332711028.8%
Mistral Large 269362611128.7%
Mistral Small 3.2 24B494121181428.3%
Z.AI GLM 4.548322624928.0%
Claude 3.5 Sonnet313028252427.6%
Gemini 3.1 Flash Lite (Preview)4944440027.5%
LFM2 24B423826171327.4%
GPT-5.4 Nano353426231927.4%
o4 Mini4947279327.1%
GPT-4.1 Nano7535190026.0%
Gemini 2.5 Flash Lite (Reasoning)373426181426.0%
Llama 3.1 Nemotron 70B512919171125.5%
MoonshotAI: Kimi K2.652242321825.4%
Stealth: Hunter Alpha373422171525.0%
Aion 2.051411914024.9%
DeepSeek V3.16128258024.4%
Xiaomi MIMO v2.5 Pro52341818024.4%
Qwen3.7 Max48351614824.2%
o4 Mini High5434270023.0%
Ministral 3B4038289022.9%
GPT-4.1 Mini45361716122.8%
Xiaomi MIMO v2.533322118922.4%
GPT-5.2352919161322.4%
Gemma 4 26B41302113622.2%
DeepSeek V3.24740174322.2%
Gemma 3 27B48312111022.0%
GPT-4o, May 13th (temp=1)5636170021.8%
Ministral 3 3B3733228019.9%
GPT-4o, Aug. 6th (temp=1)3430259019.6%
GPT-5362218111019.3%
Gemma 4 26B (Reasoning)282019171019.0%
Mistral NeMO5520190018.8%
Gemini 2.5 Flash4523157318.5%
Z.AI GLM 4.63128257018.3%
Claude Sonnet 4.6 (Reasoning)3828109618.2%
Gemma 4 31B (Reasoning)3728158117.7%
Claude 3.7 Sonnet3827201017.0%
Z.AI GLM 4.5 Air661900016.8%
Stealth: Healer Alpha453090016.8%
WizardLM 2 8x22b581282016.1%
ByteDance Seed 1.63718139315.9%
DeepSeek V4 Flash (Reasoning)3125180014.9%
Qwen 3.5 Flash442190014.8%
GPT-4o, Aug. 6th (temp=0)382760014.3%
Qwen 3.5 Plus (2026-04-20)392830014.2%
Gemma 4 31B23181612013.9%
Gemini 3.5 Flash (Reasoning, Minimal)181711111113.8%
ByteDance Seed 2.0 Lite3016125212.8%
ByteDance Seed 2.0 Mini2725102012.7%
GPT-4o Mini (temp=0)432000012.7%
Qwen 3.6 27B2824120012.6%
Stealth: Aurora Alpha60000012.0%
Gemma 3 4B311863011.8%
Grok 4.3301286011.3%
Gemini 3 Flash (Preview)2815100010.5%
Qwen 2.5 72B311380010.5%
Qwen 3.5 27B4280009.9%
Nemotron 3 Super30160009.2%
Grok 4.3 (Reasoning)28160008.9%
Z.AI GLM 4.73173008.3%
Inception Mercury4100008.2%
Claude Sonnet 4.63154008.1%
Gemma 3 12B23125008.0%
Gemini 3 Flash (Preview, Reasoning)17127307.7%
Gemini 3 Pro (Preview)14109607.7%
Qwen 3.5 122B2084307.0%
Nemotron 3 Nano1464004.9%
Z.AI GLM 4.7 Flash1087004.8%
Claude Opus 4.7 (Reasoning)1174004.3%
Qwen 3.5 9B1900003.9%
Arcee AI: Trinity Mini1900003.8%
GPT-5 Nano1032003.0%
Inception Mercury 21200002.5%
Qwen 3.5 Plus (2026-02-15)1010002.2%
GPT-OSS 120B000000.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Llama 3.1 8B100100100100100100.0%
GPT-5.4 (Reasoning)100100100938595.6%
GPT-5.41009896938293.8%
Writer: Palmyra X51001001001006492.6%
Qwen3 235B A22B Instruct 2507100100100807891.7%
Mistral Small 4 (Reasoning)10010080787586.5%
GPT-5.5 (Reasoning)919087817484.6%
GPT-5.4 (Reasoning, Low)1008775716679.8%
Claude Sonnet 4.51009369676679.1%
Z.AI GLM 5.11008177716378.6%
GPT-5.11008781626278.4%
GPT-5.5948985665777.9%
Claude 3 Haiku958870696777.9%
Claude Opus 4.7 (Reasoning)1009285605077.3%
Hermes 3 70B10010090631774.2%
Gemini 3.5 Flash (Reasoning)967873655873.8%
GPT-4o Mini (temp=1)1007971665273.5%
Z.AI GLM 4.5857979665772.9%
Z.AI GLM 51009163624572.2%
Claude Opus 4.51009068633972.1%
Gemini 2.5 Flash Lite1008276663571.9%
Llama 3.1 70B908277565171.2%
Claude Opus 4.71009666524170.9%
Claude Sonnet 4959381434070.7%
GPT-5.4 Mini (Reasoning)1008467673570.7%
MoonshotAI: Kimi K2.5877970635170.1%
GPT-5.5 (Reasoning, Low)787267666669.8%
Qwen3.6 Max Preview887773664369.7%
Claude Haiku 4.51007162605268.9%
GPT-4.1 Mini1008756524467.9%
MiniMax M2.51007765514667.7%
Z.AI GLM 5 Turbo1008068523767.5%
MiniMax M2.7898966523365.7%
Qwen 3.6 Flash828069623365.1%
Claude Sonnet 4.6 (Reasoning)876860535364.1%
Rocinante 12B1006257524863.6%
Mistral Small 41009651422963.5%
Qwen 3.6 27B838369512762.5%
DeepSeek V3 (2025-03-24)996766463462.4%
GPT-5.4 Mini817875413662.4%
GPT-4.1996357514162.1%
Gemini 2.5 Flash Lite (Reasoning)797461593661.8%
GPT-5.4 Mini (Reasoning, Low)786857545061.3%
Claude Opus 41006955433660.6%
Stealth: Hunter Alpha836459494760.6%
Grok 4.20 (Beta, Reasoning)736761514759.9%
Hermes 3 405B100885347859.2%
Ministral 3 8B887448454058.9%
GPT-4.1 Nano998373261158.4%
o4 Mini High797748444157.8%
Qwen 3.5 397B A17B846855493357.7%
GPT-4o, Aug. 6th (temp=1)928554401857.7%
MoonshotAI: Kimi K2.61007451431757.1%
Llama 3.1 Nemotron 70B786258483956.9%
Gemini 3.1 Pro (Preview)856051464056.4%
Grok 4 Fast817857431755.2%
Xiaomi MIMO v2.5 Pro916749392855.1%
Grok 4.3 (Reasoning)81757441054.4%
Claude 3.5 Sonnet916553362353.6%
Gemini 3.5 Flash (Reasoning, Minimal)646054493853.1%
Grok 4855949363452.5%
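Gemini 2.5 Flash (Reasoning) | 74 | 70 | 44 | 40 | 34 | 52.3%
GPT-4o, Aug. 6th (temp=0) | 92 | 77 | 41 | 32 | 16 | 51.7%
Claude Opus 4.6 (Reasoning) | 75 | 58 | 43 | 41 | 38 | 51.0%
Ministral 8B | 100 | 70 | 58 | 16 | 11 | 51.0%
o4 Mini | 69 | 65 | 51 | 36 | 31 | 50.6%
GPT-5.4 Nano | 69 | 53 | 51 | 49 | 30 | 50.5%
Gemma 3 12B | 75 | 53 | 48 | 40 | 36 | 50.3%
LFM2 24B | 77 | 63 | 49 | 36 | 20 | 49.2%
Grok 4.1 Fast | 65 | 53 | 48 | 41 | 38 | 49.1%
GPT-5.4 Nano (Reasoning) | 58 | 56 | 54 | 41 | 36 | 49.1%
Claude Sonnet 4.6 | 96 | 52 | 40 | 30 | 27 | 48.9%
GPT-5.2 | 61 | 59 | 45 | 43 | 36 | 48.8%
Claude 3.7 Sonnet | 70 | 68 | 49 | 33 | 21 | 48.2%
DeepSeek V4 Pro (Reasoning) | 81 | 50 | 43 | 34 | 33 | 47.9%
GPT-5.4 Nano (Reasoning, Low) | 62 | 56 | 51 | 43 | 26 | 47.6%
Mistral Medium 3.1 | 63 | 57 | 43 | 41 | 32 | 47.1%
Grok 4.20 (Beta) | 60 | 56 | 50 | 39 | 29 | 46.9%
Grok 4.20 | 64 | 48 | 44 | 42 | 33 | 46.2%
DeepSeek V4 Flash (Reasoning) | 82 | 49 | 36 | 36 | 20 | 44.8%
Claude Opus 4.6 | 61 | 46 | 46 | 43 | 25 | 44.0%
Grok 4.20 (Reasoning) | 60 | 59 | 52 | 49 | 0 | 43.8%
Ministral 3 3B | 90 | 50 | 45 | 22 | 7 | 42.9%
Qwen 3.5 Plus (2026-04-20) | 91 | 35 | 34 | 27 | 26 | 42.8%
Qwen 3.6 35B | 57 | 53 | 47 | 29 | 28 | 42.8%
DeepSeek V4 Pro | 66 | 50 | 48 | 25 | 24 | 42.6%
Z.AI GLM 4.5 Air | 77 | 59 | 33 | 30 | 14 | 42.4%
DeepSeek V4 Flash | 75 | 49 | 45 | 38 | 3 | 41.9%
Z.AI GLM 4.6 | 71 | 54 | 37 | 31 | 16 | 41.8%
Gemini 2.5 Pro | 51 | 48 | 48 | 36 | 26 | 41.8%
Z.AI GLM 4.7 Flash | 64 | 56 | 45 | 31 | 12 | 41.7%
Mistral NeMO | 84 | 62 | 38 | 12 | 12 | 41.6%
Qwen 3.5 Plus (2026-02-15) | 53 | 51 | 46 | 38 | 20 | 41.6%
Qwen3.7 Max | 61 | 60 | 33 | 27 | 25 | 41.1%
GPT-4o, May 13th (temp=1) | 65 | 49 | 40 | 34 | 12 | 39.9%
Grok 4.3 | 56 | 51 | 42 | 36 | 15 | 39.9%
Ministral 3 14B | 80 | 42 | 36 | 29 | 11 | 39.7%
Mistral Large 3 | 61 | 49 | 46 | 29 | 12 | 39.5%
Gemini 2.5 Flash | 71 | 63 | 36 | 19 | 8 | 39.2%
Mistral Small Creative | 64 | 48 | 43 | 34 | 6 | 39.0%
Qwen 3.5 9B | 67 | 58 | 46 | 23 | 0 | 38.8%
Gemma 3 27B | 66 | 47 | 39 | 35 | 8 | 38.8%
DeepSeek V3.2 | 59 | 44 | 36 | 32 | 20 | 38.1%
Cohere Command R+ (Aug. 2024) | 77 | 33 | 32 | 28 | 19 | 38.0%
DeepSeek V3.1 | 54 | 54 | 36 | 35 | 10 | 37.8%
Aion 2.0 | 55 | 53 | 42 | 39 | 0 | 37.8%
Arcee AI: Trinity Large (Preview) | 55 | 50 | 45 | 37 | 0 | 37.5%
Gemma 4 31B (Reasoning) | 68 | 57 | 26 | 22 | 14 | 37.4%
Gemma 4 26B (Reasoning) | 48 | 46 | 34 | 28 | 24 | 35.9%
DeepSeek-V2 Chat | 76 | 34 | 34 | 23 | 12 | 35.8%
ByteDance Seed 1.6 Flash | 54 | 36 | 35 | 27 | 25 | 35.4%
Stealth: Healer Alpha | 52 | 42 | 37 | 31 | 11 | 34.5%
Xiaomi MIMO v2.5 | 70 | 44 | 32 | 16 | 10 | 34.3%
Qwen 3.5 35B | 86 | 44 | 17 | 13 | 8 | 33.6%
Qwen 3 32B | 51 | 49 | 43 | 23 | 0 | 33.2%
Mistral Large | 51 | 43 | 37 | 27 | 3 | 32.3%
Gemma 4 26B | 44 | 38 | 34 | 26 | 19 | 32.3%
ByteDance Seed 2.0 Lite | 100 | 39 | 20 | 2 | 0 | 32.0%
Nemotron 3 Super | 54 | 37 | 36 | 30 | 0 | 31.5%
WizardLM 2 8x22b | 100 | 53 | 4 | 0 | 0 | 31.4%
GPT-5 | 56 | 33 | 30 | 25 | 9 | 30.6%
GPT-4o, May 13th (temp=0) | 70 | 54 | 25 | 3 | 0 | 30.5%
Mistral Large 2 | 43 | 42 | 34 | 31 | 0 | 30.0%
DeepSeek V3 (2024-12-26) | 96 | 19 | 17 | 17 | 0 | 29.8%
Gemma 4 31B | 50 | 43 | 24 | 23 | 8 | 29.6%
Gemini 3.1 Flash Lite (Reasoning) | 71 | 26 | 22 | 20 | 6 | 28.9%
Qwen 3.5 Flash | 49 | 37 | 35 | 14 | 8 | 28.5%
GPT-5 Mini | 53 | 39 | 26 | 16 | 9 | 28.4%
Ministral 3B | 68 | 20 | 18 | 17 | 2 | 25.0%
Mistral Small 3.2 24B | 72 | 21 | 19 | 0 | 0 | 22.5%
Gemma 3 4B | 48 | 29 | 18 | 14 | 0 | 21.8%
Gemini 3 Flash (Preview) | 33 | 24 | 19 | 17 | 13 | 21.3%
Arcee AI: Trinity Mini | 40 | 31 | 20 | 9 | 5 | 21.1%
Gemini 3 Flash (Preview, Reasoning) | 43 | 18 | 17 | 16 | 10 | 20.7%
Qwen 3.5 27B | 34 | 27 | 18 | 17 | 7 | 20.6%
Nemotron 3 Nano | 31 | 27 | 20 | 15 | 5 | 19.6%
Gemini 3.1 Flash Lite (Preview) | 30 | 27 | 22 | 15 | 0 | 18.6%
GPT-4o Mini (temp=0) | 25 | 23 | 14 | 13 | 11 | 17.2%
ByteDance Seed 2.0 Mini | 33 | 21 | 16 | 9 | 6 | 16.7%
Gemini 3 Pro (Preview) | 32 | 24 | 18 | 4 | 0 | 15.5%
Qwen 2.5 72B | 49 | 16 | 11 | 0 | 0 | 15.4%
Gemini 3.1 Flash Lite | 25 | 24 | 23 | 3 | 0 | 14.9%
ByteDance Seed 1.6 | 45 | 12 | 10 | 6 | 0 | 14.5%
Qwen 3.5 122B | 37 | 15 | 7 | 6 | 0 | 12.9%
GPT-5 Nano | 23 | 15 | 12 | 8 | 0 | 11.7%
Z.AI GLM 4.7 | 28 | 14 | 12 | 2 | 0 | 11.2%
GPT-OSS 120B | 17 | 0 | 0 | 0 | 0 | 3.4%
Inception Mercury | 6 | 0 | 0 | 0 | 0 | 1.3%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%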

Novelcrafter Default Prompt

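Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Hermes 3 70B | 100 | 100 | 94 | 81 | 42 | 83.4%
Qwen3 235B A22B Instruct 2507 | 84 | 82 | 77 | 75 | 70 | 77.4%
Rocinante 12B | 100 | 93 | 73 | 60 | 60 | 77.2%
GPT-5.4 | 95 | 86 | 84 | 62 | 54 | 76.4%
Writer: Palmyra X5 | 100 | 83 | 78 | 65 | 43 | 74.0%
Llama 3.1 8B | 100 | 100 | 74 | 56 | 23 | 70.6%
GPT-5.4 (Reasoning) | 85 | 80 | 71 | 66 | 49 | 70.1%
Claude Sonnet 4.6 | 100 | 79 | 77 | 46 | 23 | 64.9%
GPT-5.4 (Reasoning, Low) | 82 | 82 | 76 | 43 | 32 | 63.0%
Claude Opus 4.6 (Reasoning) | 84 | 73 | 58 | 49 | 48 | 62.5%
Llama 3.1 Nemotron 70B | 100 | 100 | 73 | 27 | 13 | 62.4%
Claude Opus 4.7 (Reasoning) | 100 | 95 | 57 | 40 | 18 | 62.1%
Gemini 3.1 Flash Lite | 88 | 76 | 55 | 40 | 39 | 59.4%
Claude Opus 4.6 | 76 | 67 | 66 | 46 | 41 | 59.2%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 89 | 63 | 23 | 9 | 56.8%
ByteDance Seed 2.0 Lite | 85 | 85 | 79 | 21 | 12 | 56.3%
Hermes 3 405B | 98 | 82 | 63 | 23 | 16 | 56.1%
WizardLM 2 8x22b | 80 | 62 | 57 | 55 | 21 | 55.2%
Mistral Small Creative | 94 | 70 | 48 | 33 | 29 | 55.0%
GPT-5.1 | 72 | 55 | 49 | 43 | 41 | 52.3%
Claude 3 Haiku | 73 | 64 | 43 | 40 | 35 | 51.2%
Claude Sonnet 4.5 | 69 | 49 | 46 | 45 | 42 | 50.2%
GPT-5.5 | 62 | 54 | 51 | 43 | 40 | 50.2%
Gemma 3 12B | 100 | 51 | 43 | 39 | 16 | 50.0%
Claude Sonnet 4.6 (Reasoning) | 77 | 63 | 50 | 31 | 24 | 48.9%
Gemma 3 4B | 63 | 60 | 53 | 47 | 20 | 48.6%
GPT-5.5 (Reasoning, Low) | 59 | 52 | 47 | 45 | 40 | 48.5%
Llama 3.1 70B | 100 | 43 | 41 | 37 | 21 | 48.4%
DeepSeek V4 Pro | 88 | 57 | 42 | 27 | 23 | 47.5%
Claude Opus 4.7 | 62 | 56 | 48 | 38 | 30 | 46.9%
GPT-5.5 (Reasoning) | 63 | 57 | 42 | 38 | 33 | 46.4%
Mistral Small 4 | 60 | 59 | 51 | 32 | 23 | 45.0%
GPT-5.4 Mini | 63 | 46 | 42 | 39 | 33 | 44.4%
Claude Haiku 4.5 | 78 | 57 | 43 | 27 | 16 | 44.3%
Z.AI GLM 5 | 85 | 50 | 36 | 25 | 25 | 44.1%
Claude Opus 4.5 | 82 | 55 | 34 | 26 | 21 | 43.5%
Gemini 3.1 Flash Lite (Preview) | 60 | 56 | 45 | 27 | 26 | 42.8%
DeepSeek V4 Pro (Reasoning) | 82 | 51 | 43 | 18 | 18 | 42.7%
GPT-5.4 Mini (Reasoning, Low) | 71 | 41 | 39 | 35 | 23 | 41.9%
Claude Opus 4 | 67 | 64 | 42 | 34 | 0 | 41.6%
ByteDance Seed 1.6 Flash | 67 | 58 | 43 | 25 | 12 | 40.8%
Claude Sonnet 4 | 72 | 42 | 39 | 33 | 18 | 40.7%
Z.AI GLM 4.5 | 66 | 50 | 36 | 27 | 17 | 39.4%
Grok 4.1 Fast | 75 | 53 | 25 | 21 | 20 | 39.0%
MiniMax M2.5 | 100 | 48 | 39 | 5 | 3 | 39.0%
Gemini 3.1 Flash Lite (Reasoning) | 59 | 53 | 39 | 37 | 3 | 38.4%
Grok 4.20 (Reasoning) | 52 | 46 | 43 | 33 | 18 | 38.4%
Gemini 2.5 Flash | 58 | 51 | 50 | 26 | 4 | 37.7%
DeepSeek V3.1 | 63 | 51 | 37 | 21 | 15 | 37.4%
Ministral 3 14B | 60 | 43 | 36 | 25 | 20 | 36.9%
Claude 3.7 Sonnet | 70 | 50 | 34 | 20 | 10 | 36.7%
Gemma 4 31B | 49 | 49 | 49 | 36 | 0 | 36.6%
GPT-4o, Aug. 6th (temp=1) | 65 | 62 | 21 | 21 | 10 | 35.8%
Grok 4.20 (Beta, Reasoning) | 45 | 42 | 35 | 33 | 23 | 35.4%
Mistral Small 4 (Reasoning) | 58 | 48 | 37 | 20 | 11 | 34.7%
DeepSeek V3 (2025-03-24) | 71 | 71 | 14 | 5 | 3 | 32.8%
Cohere Command R+ (Aug. 2024) | 100 | 41 | 21 | 0 | 0 | 32.5%
Gemini 2.5 Flash (Reasoning) | 53 | 39 | 34 | 21 | 14 | 32.4%
GPT-5.4 Mini (Reasoning) | 55 | 49 | 30 | 14 | 13 | 32.1%
Z.AI GLM 5.1 | 94 | 39 | 26 | 0 | 0 | 31.8%
DeepSeek V4 Flash | 83 | 50 | 16 | 8 | 0 | 31.5%
Grok 4.3 | 55 | 43 | 23 | 21 | 14 | 31.4%
Stealth: Healer Alpha | 50 | 47 | 41 | 17 | 2 | 31.3%
Xiaomi MIMO v2.5 Pro | 55 | 39 | 30 | 16 | 15 | 31.1%
Gemini 3 Flash (Preview, Reasoning) | 49 | 39 | 31 | 23 | 12 | 30.9%
Grok 4.20 | 61 | 35 | 27 | 27 | 0 | 30.1%
Gemini 2.5 Pro | 70 | 40 | 32 | 2 | 0 | 29.0%
Gemini 2.5 Flash Lite | 46 | 44 | 31 | 21 | 2 | 28.9%
Xiaomi MIMO v2.5 | 35 | 34 | 31 | 30 | 14 | 28.9%
GPT-4o, Aug. 6th (temp=0) | 74 | 34 | 25 | 7 | 3 | 28.5%
DeepSeek V3.2 | 37 | 33 | 27 | 25 | 20 | 28.3%
Grok 4 | 51 | 33 | 27 | 23 | 7 | 28.2%
ByteDance Seed 2.0 Mini | 49 | 39 | 33 | 19 | 0 | 28.0%
Stealth: Hunter Alpha | 46 | 34 | 33 | 21 | 5 | 27.8%
GPT-4o, May 13th (temp=1) | 39 | 33 | 26 | 25 | 15 | 27.3%
GPT-4o Mini (temp=1) | 47 | 45 | 21 | 13 | 10 | 27.0%
Gemini 3.5 Flash (Reasoning) | 36 | 32 | 27 | 23 | 16 | 26.8%
Gemini 2.5 Flash Lite (Reasoning) | 47 | 31 | 29 | 26 | 0 | 26.7%
GPT-4.1 | 59 | 36 | 19 | 19 | 0 | 26.6%
Ministral 3 8B | 81 | 34 | 16 | 0 | 0 | 26.3%
Grok 4.20 (Beta) | 40 | 34 | 25 | 16 | 16 | 26.3%
Mistral Medium 3.1 | 46 | 29 | 28 | 16 | 12 | 26.1%
Z.AI GLM 4.6 | 45 | 25 | 23 | 20 | 17 | 26.1%
MoonshotAI: Kimi K2.5 | 43 | 37 | 29 | 16 | 0 | 25.0%
Mistral Large 2 | 48 | 38 | 25 | 13 | 0 | 24.6%
Qwen 3.6 35B | 57 | 25 | 20 | 19 | 1 | 24.5%
o4 Mini | 40 | 38 | 23 | 21 | 0 | 24.4%
GPT-5.4 Nano | 34 | 29 | 25 | 18 | 14 | 24.2%
Qwen 3.5 397B A17B | 43 | 31 | 27 | 12 | 7 | 24.1%
o4 Mini High | 64 | 22 | 17 | 15 | 3 | 24.1%
GPT-4o, May 13th (temp=0) | 41 | 35 | 26 | 15 | 0 | 23.3%
Grok 4 Fast | 48 | 27 | 25 | 9 | 6 | 22.9%
GPT-5.2 | 45 | 21 | 20 | 19 | 5 | 22.2%
Qwen 3.5 Plus (2026-04-20) | 66 | 25 | 11 | 7 | 0 | 21.9%
Qwen 3.6 Flash | 55 | 22 | 17 | 8 | 5 | 21.6%
Z.AI GLM 5 Turbo | 40 | 34 | 23 | 11 | 0 | 21.6%
Arcee AI: Trinity Large (Preview) | 42 | 29 | 20 | 9 | 8 | 21.4%
Qwen3.6 Max Preview | 54 | 37 | 16 | 0 | 0 | 21.3%
Gemini 3 Flash (Preview) | 51 | 29 | 12 | 9 | 3 | 20.7%
Claude 3.5 Sonnet | 38 | 38 | 26 | 0 | 0 | 20.4%
Gemma 4 26B | 26 | 24 | 24 | 20 | 1 | 19.0%
MiniMax M2.7 | 42 | 27 | 13 | 8 | 3 | 18.7%
GPT-5.4 Nano (Reasoning) | 31 | 24 | 24 | 14 | 0 | 18.5%
Qwen 3.5 35B | 57 | 35 | 0 | 0 | 0 | 18.5%
Mistral NeMO | 50 | 30 | 11 | 0 | 0 | 18.1%
GPT-4.1 Mini | 45 | 28 | 9 | 7 | 2 | 18.1%
Mistral Large 3 | 45 | 23 | 16 | 5 | 2 | 18.1%
Qwen 3 32B | 43 | 28 | 19 | 0 | 0 | 17.9%
Z.AI GLM 4.5 Air | 29 | 28 | 23 | 8 | 0 | 17.8%
Gemini 3 Pro (Preview) | 34 | 26 | 14 | 7 | 4 | 17.1%
GPT-5 | 31 | 23 | 17 | 14 | 0 | 17.0%
Mistral Large | 38 | 27 | 19 | 0 | 0 | 16.6%
Gemma 3 27B | 56 | 21 | 3 | 0 | 0 | 16.1%
GPT-4.1 Nano | 23 | 17 | 16 | 16 | 7 | 15.7%
DeepSeek V4 Flash (Reasoning) | 36 | 18 | 14 | 9 | 0 | 15.4%
Gemma 4 31B (Reasoning) | 31 | 20 | 12 | 12 | 0 | 15.2%
Ministral 8B | 36 | 23 | 16 | 0 | 0 | 14.9%
LFM2 24B | 39 | 20 | 11 | 0 | 0 | 13.9%
MoonshotAI: Kimi K2.6 | 34 | 18 | 8 | 6 | 0 | 13.4%
Z.AI GLM 4.7 | 29 | 23 | 12 | 0 | 0 | 12.9%
Qwen 3.6 27B | 30 | 29 | 5 | 0 | 0 | 12.7%
GPT-5 Mini | 37 | 18 | 7 | 0 | 0 | 12.5%
DeepSeek V3 (2024-12-26) | 27 | 14 | 13 | 8 | 0 | 12.3%
GPT-5.4 Nano (Reasoning, Low) | 21 | 16 | 14 | 7 | 2 | 11.9%
Aion 2.0 | 40 | 10 | 3 | 2 | 0 | 10.8%
Qwen 2.5 72B | 24 | 15 | 9 | 0 | 0 | 9.5%
Ministral 3B | 27 | 18 | 1 | 0 | 0 | 9.0%
Grok 4.3 (Reasoning) | 25 | 10 | 10 | 0 | 0 | 8.9%
Mistral Small 3.2 24B | 24 | 13 | 6 | 0 | 0 | 8.7%
Qwen 3.5 Plus (2026-02-15) | 42 | 1 | 0 | 0 | 0 | 8.6%
Gemma 4 26B (Reasoning) | 20 | 16 | 3 | 1 | 0 | 8.2%
Qwen 3.5 Flash | 16 | 11 | 7 | 0 | 0 | 6.7%
Qwen3.7 Max | 30 | 0 | 0 | 0 | 0 | 6.1%
Z.AI GLM 4.7 Flash | 24 | 4 | 0 | 0 | 0 | 5.5%
DeepSeek-V2 Chat | 9 | 6 | 0 | 0 | 0 | 3.0%
Qwen 3.5 9B | 13 | 2 | 0 | 0 | 0 | 3.0%
Gemini 3.1 Pro (Preview) | 10 | 2 | 1 | 0 | 0 | 2.7%
GPT-4o Mini (temp=0) | 6 | 5 | 0 | 0 | 0 | 2.2%
Nemotron 3 Nano | 7 | 0 | 0 | 0 | 0 | 1.5%
Qwen 3.5 27B | 7 | 0 | 0 | 0 | 0 | 1.4%
Nemotron 3 Super | 7 | 0 | 0 | 0 | 0 | 1.3%
Ministral 3 3B | 3 | 2 | 0 | 0 | 0 | 1.0%
Qwen 3.5 122B | 3 | 1 | 0 | 0 | 0 | 0.8%
GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.1%
ByteDance Seed 1.6 | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0.0%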
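
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 97 | 99.4%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 91 | 98.2%
Claude Sonnet 4.5 | 100 | 100 | 100 | 99 | 86 | 97.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 95 | 91 | 85 | 94.2%
Claude Opus 4.5 | 99 | 92 | 91 | 84 | 84 | 89.9%
GPT-4o, Aug. 6th (temp=1) | 100 | 94 | 89 | 87 | 78 | 89.4%
Claude 3 Haiku | 100 | 100 | 97 | 87 | 55 | 87.8%
Claude Opus 4.6 (Reasoning) | 100 | 97 | 93 | 77 | 69 | 87.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 85 | 80 | 69 | 86.8%
Gemma 3 27B | 99 | 91 | 89 | 86 | 65 | 86.2%
Grok 4.1 Fast | 100 | 100 | 89 | 84 | 57 | 86.2%
Llama 3.1 8B | 100 | 100 | 100 | 79 | 51 | 86.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 97 | 76 | 56 | 85.8%
GPT-5.4 | 100 | 89 | 87 | 83 | 66 | 85.1%
Claude Sonnet 4 | 100 | 90 | 83 | 83 | 64 | 84.0%
GPT-5.4 (Reasoning) | 100 | 88 | 86 | 85 | 60 | 83.9%
Hermes 3 70B | 100 | 100 | 100 | 69 | 47 | 83.2%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 79 | 34 | 82.5%
Claude Opus 4.6 | 100 | 100 | 74 | 72 | 64 | 82.0%
Z.AI GLM 5 Turbo | 100 | 93 | 80 | 72 | 64 | 81.8%
MiniMax M2.5 | 96 | 93 | 81 | 68 | 57 | 79.2%
GPT-5.4 (Reasoning, Low) | 100 | 83 | 77 | 74 | 55 | 77.8%
Z.AI GLM 5.1 | 100 | 100 | 99 | 46 | 42 | 77.4%
Hermes 3 405B | 100 | 97 | 71 | 59 | 56 | 76.6%
Claude Opus 4 | 100 | 100 | 92 | 57 | 31 | 76.0%
MiniMax M2.7 | 100 | 72 | 68 | 68 | 61 | 73.8%
Stealth: Hunter Alpha | 100 | 76 | 75 | 73 | 44 | 73.6%
DeepSeek V4 Pro (Reasoning) | 100 | 85 | 72 | 53 | 48 | 71.4%
Claude Opus 4.7 (Reasoning) | 100 | 99 | 66 | 53 | 38 | 71.0%
GPT-5.5 (Reasoning) | 82 | 81 | 77 | 64 | 50 | 70.9%
Grok 4.20 | 91 | 82 | 77 | 59 | 42 | 70.2%
Ministral 8B | 100 | 99 | 81 | 67 | 3 | 69.9%
DeepSeek V4 Pro | 86 | 72 | 70 | 60 | 59 | 69.3%
GPT-5.5 (Reasoning, Low) | 74 | 72 | 70 | 69 | 60 | 69.0%
Z.AI GLM 5 | 88 | 82 | 66 | 63 | 44 | 68.6%
GPT-5.1 | 100 | 81 | 65 | 54 | 39 | 67.7%
Claude Haiku 4.5 | 100 | 98 | 57 | 42 | 34 | 66.1%
Claude Opus 4.7 | 100 | 66 | 55 | 54 | 51 | 65.3%
GPT-5.5 | 74 | 70 | 69 | 63 | 49 | 65.1%
Mistral Small 4 | 100 | 84 | 60 | 54 | 19 | 63.5%
Stealth: Healer Alpha | 89 | 71 | 71 | 39 | 36 | 61.3%
GPT-5.4 Mini | 81 | 71 | 55 | 54 | 45 | 61.3%
Claude Sonnet 4.6 | 92 | 71 | 70 | 61 | 0 | 59.1%
Xiaomi MIMO v2.5 | 72 | 67 | 64 | 52 | 38 | 58.6%
Mistral Large 2 | 83 | 70 | 65 | 47 | 22 | 57.3%
GPT-5.4 Mini (Reasoning, Low) | 69 | 63 | 57 | 51 | 46 | 57.2%
Llama 3.1 Nemotron 70B | 79 | 70 | 58 | 45 | 31 | 56.6%
ByteDance Seed 1.6 Flash | 73 | 65 | 63 | 55 | 27 | 56.5%
Grok 4 | 88 | 64 | 47 | 44 | 38 | 56.2%
Arcee AI: Trinity Large (Preview) | 75 | 60 | 52 | 47 | 46 | 56.0%
DeepSeek V3 (2025-03-24) | 89 | 63 | 54 | 40 | 30 | 55.1%
Xiaomi MIMO v2.5 Pro | 74 | 69 | 58 | 43 | 29 | 54.5%
MoonshotAI: Kimi K2.6 | 100 | 66 | 43 | 37 | 23 | 53.8%
Ministral 3 14B | 84 | 70 | 66 | 41 | 7 | 53.5%
Mistral Small Creative | 87 | 66 | 46 | 35 | 31 | 53.0%
WizardLM 2 8x22b | 81 | 76 | 44 | 34 | 29 | 52.9%
GPT-4.1 Nano | 61 | 57 | 52 | 51 | 34 | 51.2%
Grok 4 Fast | 69 | 66 | 48 | 41 | 31 | 51.0%
ByteDance Seed 2.0 Mini | 70 | 60 | 58 | 35 | 32 | 50.9%
Z.AI GLM 4.6 | 84 | 64 | 48 | 31 | 23 | 50.1%
Gemma 3 12B | 65 | 53 | 49 | 42 | 38 | 49.5%
Grok 4.20 (Beta) | 67 | 61 | 50 | 41 | 27 | 49.1%
Aion 2.0 | 56 | 50 | 50 | 47 | 40 | 48.6%
Mistral Large 3 | 79 | 64 | 49 | 23 | 23 | 47.6%
GPT-4.1 | 50 | 49 | 49 | 48 | 40 | 47.4%
DeepSeek V4 Flash (Reasoning) | 64 | 50 | 46 | 42 | 34 | 47.3%
Ministral 3 8B | 78 | 52 | 48 | 40 | 12 | 45.9%
Mistral Large | 96 | 51 | 41 | 33 | 7 | 45.6%
Gemini 2.5 Flash Lite (Reasoning) | 62 | 61 | 51 | 34 | 13 | 44.4%
Gemini 2.5 Flash Lite | 56 | 51 | 45 | 36 | 33 | 44.2%
DeepSeek V3.2 | 54 | 45 | 41 | 40 | 38 | 43.7%
GPT-4.1 Mini | 77 | 46 | 39 | 38 | 17 | 43.6%
GPT-5.4 Nano | 53 | 51 | 47 | 34 | 33 | 43.4%
Mistral Medium 3.1 | 59 | 52 | 51 | 43 | 11 | 43.4%
Gemini 3.5 Flash (Reasoning, Minimal) | 60 | 50 | 37 | 36 | 33 | 43.3%
GPT-4o, May 13th (temp=1) | 51 | 50 | 42 | 36 | 36 | 43.0%
GPT-5.4 Nano (Reasoning) | 54 | 50 | 49 | 34 | 27 | 42.8%
Claude 3.7 Sonnet | 81 | 63 | 37 | 24 | 8 | 42.7%
DeepSeek V4 Flash | 72 | 54 | 39 | 26 | 15 | 41.2%
GPT-4o Mini (temp=1) | 70 | 60 | 31 | 23 | 21 | 40.9%
Z.AI GLM 4.5 Air | 82 | 55 | 29 | 23 | 9 | 39.6%
Qwen3.6 Max Preview | 99 | 67 | 23 | 5 | 0 | 38.9%
Gemini 3.5 Flash (Reasoning) | 69 | 60 | 30 | 19 | 12 | 38.1%
Qwen 3.5 Plus (2026-02-15) | 52 | 40 | 36 | 32 | 29 | 37.8%
Gemini 2.5 Pro | 78 | 44 | 38 | 18 | 9 | 37.5%
GPT-5.4 Mini (Reasoning) | 39 | 36 | 35 | 35 | 34 | 35.9%
DeepSeek V3 (2024-12-26) | 100 | 53 | 23 | 3 | 0 | 35.7%
Claude 3.5 Sonnet | 60 | 45 | 45 | 24 | 0 | 34.6%
DeepSeek-V2 Chat | 85 | 60 | 28 | 0 | 0 | 34.5%
GPT-5 | 84 | 40 | 35 | 7 | 6 | 34.3%
Llama 3.1 70B | 60 | 57 | 24 | 16 | 11 | 33.5%
Z.AI GLM 4.7 | 51 | 46 | 29 | 20 | 18 | 32.7%
Ministral 3B | 60 | 51 | 24 | 11 | 10 | 31.2%
Grok 4.20 (Reasoning) | 58 | 29 | 28 | 21 | 18 | 30.8%
Qwen 3.6 Flash | 66 | 49 | 17 | 11 | 9 | 30.5%
Z.AI GLM 4.7 Flash | 59 | 42 | 36 | 8 | 5 | 30.3%
Arcee AI: Trinity Mini | 55 | 51 | 35 | 8 | 0 | 29.8%
Gemini 3 Pro (Preview) | 46 | 31 | 28 | 28 | 11 | 28.8%
DeepSeek V3.1 | 50 | 29 | 26 | 24 | 13 | 28.5%
Gemini 3.1 Flash Lite (Preview) | 71 | 43 | 20 | 0 | 0 | 26.8%
GPT-5.2 | 40 | 40 | 29 | 19 | 6 | 26.8%
Gemini 2.5 Flash (Reasoning) | 43 | 35 | 24 | 20 | 12 | 26.7%
Mistral NeMO | 35 | 27 | 26 | 23 | 21 | 26.4%
Qwen 3 32B | 39 | 36 | 28 | 21 | 2 | 25.3%
Gemma 3 4B | 45 | 38 | 21 | 18 | 0 | 24.5%
o4 Mini High | 52 | 43 | 17 | 8 | 0 | 24.1%
Mistral Small 3.2 24B | 67 | 33 | 21 | 0 | 0 | 24.1%
Gemini 2.5 Flash | 52 | 27 | 21 | 20 | 0 | 24.1%
ByteDance Seed 2.0 Lite | 48 | 33 | 16 | 13 | 5 | 23.1%
Qwen 3.6 35B | 70 | 22 | 17 | 2 | 0 | 22.3%
GPT-5.4 Nano (Reasoning, Low) | 28 | 25 | 21 | 19 | 14 | 21.5%
Qwen 3.5 9B | 54 | 27 | 26 | 0 | 0 | 21.3%
MoonshotAI: Kimi K2.5 | 43 | 30 | 15 | 8 | 7 | 20.7%
Z.AI GLM 4.5 | 35 | 30 | 26 | 0 | 0 | 18.1%
LFM2 24B | 27 | 26 | 13 | 11 | 9 | 17.0%
Ministral 3 3B | 61 | 16 | 8 | 0 | 0 | 16.9%
o4 Mini | 39 | 24 | 14 | 7 | 0 | 16.7%
GPT-5 Mini | 33 | 28 | 13 | 0 | 0 | 14.8%
Gemini 3 Flash (Preview) | 30 | 28 | 5 | 5 | 4 | 14.4%
Grok 4.20 (Beta, Reasoning) | 54 | 10 | 5 | 3 | 0 | 14.4%
Gemma 4 31B | 25 | 15 | 14 | 11 | 5 | 14.1%
Grok 4.3 | 34 | 19 | 16 | 0 | 0 | 13.7%
Gemini 3.1 Flash Lite (Reasoning) | 29 | 26 | 9 | 1 | 0 | 13.0%
Qwen 3.5 397B A17B | 20 | 20 | 10 | 7 | 0 | 11.5%
Gemini 3 Flash (Preview, Reasoning) | 33 | 13 | 6 | 1 | 0 | 10.7%
GPT-4o, Aug. 6th (temp=0) | 28 | 22 | 0 | 0 | 0 | 9.9%
Qwen 3.5 Flash | 25 | 20 | 2 | 0 | 0 | 9.5%
Qwen 2.5 72B | 40 | 6 | 1 | 0 | 0 | 9.5%
Nemotron 3 Super | 19 | 17 | 8 | 0 | 0 | 8.8%
Qwen 3.5 35B | 26 | 16 | 0 | 0 | 0 | 8.4%
Gemma 4 26B (Reasoning) | 12 | 8 | 8 | 7 | 0 | 7.0%
Gemma 4 31B (Reasoning) | 20 | 10 | 4 | 0 | 0 | 6.7%
Gemini 3.1 Pro (Preview) | 15 | 13 | 2 | 0 | 0 | 5.9%
Qwen 3.5 122B | 29 | 0 | 0 | 0 | 0 | 5.8%
GPT-5 Nano | 21 | 7 | 0 | 0 | 0 | 5.6%
Gemma 4 26B | 17 | 8 | 0 | 0 | 0 | 5.0%
Grok 4.3 (Reasoning) | 14 | 7 | 0 | 0 | 0 | 4.3%
Qwen 3.6 27B | 12 | 5 | 0 | 0 | 0 | 3.4%
GPT-4o, May 13th (temp=0) | 12 | 5 | 0 | 0 | 0 | 3.3%
GPT-4o Mini (temp=0) | 7 | 5 | 3 | 0 | 0 | 3.0%
Gemini 3.1 Flash Lite | 12 | 0 | 0 | 0 | 0 | 2.4%
Qwen3.7 Max | 6 | 4 | 0 | 0 | 0 | 2.1%
ByteDance Seed 1.6 | 3 | 0 | 0 | 0 | 0 | 0.5%
Nemotron 3 Nano | 1 | 0 | 0 | 0 | 0 | 0.2%
Qwen 3.5 27B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0%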
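
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 77 | 60 | 87.5%
Llama 3.1 8B | 100 | 100 | 72 | 51 | 40 | 72.6%
Writer: Palmyra X5 | 99 | 82 | 66 | 56 | 18 | 64.2%
Qwen3 235B A22B Instruct 2507 | 87 | 69 | 62 | 61 | 35 | 62.9%
Llama 3.1 Nemotron 70B | 100 | 100 | 74 | 25 | 11 | 62.0%
Ministral 3 14B | 100 | 83 | 58 | 36 | 24 | 60.3%
GPT-5 Nano | 100 | 100 | 100 | 0 | 0 | 60.1%
Mistral Large 2 | 98 | 74 | 60 | 51 | 13 | 59.2%
Llama 3.1 70B | 83 | 63 | 62 | 45 | 42 | 59.0%
Grok 4.1 Fast | 100 | 92 | 59 | 40 | 0 | 58.2%
Grok 4.20 (Beta, Reasoning) | 90 | 68 | 62 | 45 | 22 | 57.4%
Mistral Small 4 | 78 | 68 | 49 | 49 | 28 | 54.5%
Z.AI GLM 5 | 100 | 58 | 56 | 35 | 21 | 54.0%
Mistral Small 4 (Reasoning) | 84 | 67 | 48 | 41 | 19 | 51.6%
Mistral Small Creative | 85 | 64 | 54 | 38 | 12 | 50.6%
Claude Opus 4.5 | 56 | 54 | 47 | 41 | 21 | 43.8%
Mistral NeMO | 100 | 100 | 16 | 0 | 0 | 43.1%
Rocinante 12B | 100 | 100 | 11 | 0 | 0 | 42.2%
Cohere Command R+ (Aug. 2024) | 85 | 68 | 56 | 0 | 0 | 41.9%
Claude Opus 4.7 | 83 | 42 | 40 | 21 | 21 | 41.2%
Hermes 3 405B | 80 | 68 | 49 | 0 | 0 | 39.2%
o4 Mini High | 79 | 75 | 35 | 4 | 0 | 38.4%
Claude Sonnet 4.6 | 79 | 58 | 24 | 24 | 6 | 38.2%
Claude Sonnet 4.5 | 95 | 48 | 24 | 23 | 0 | 38.0%
DeepSeek V4 Pro | 65 | 52 | 29 | 26 | 18 | 37.8%
Ministral 3 8B | 94 | 46 | 24 | 18 | 4 | 37.5%
Mistral Medium 3.1 | 58 | 47 | 41 | 37 | 3 | 37.1%
Grok 4.20 | 57 | 43 | 41 | 25 | 18 | 36.8%
Arcee AI: Trinity Large (Preview) | 100 | 67 | 16 | 0 | 0 | 36.5%
Claude Sonnet 4 | 66 | 55 | 53 | 5 | 0 | 36.0%
Grok 4 Fast | 57 | 46 | 29 | 23 | 21 | 35.1%
Mistral Large 3 | 52 | 43 | 40 | 25 | 14 | 34.5%
Hermes 3 70B | 89 | 35 | 23 | 20 | 0 | 33.2%
GPT-5.4 | 39 | 34 | 28 | 24 | 18 | 28.5%
DeepSeek V4 Pro (Reasoning) | 73 | 59 | 3 | 1 | 0 | 27.2%
Grok 4.20 (Beta) | 43 | 41 | 28 | 14 | 2 | 25.8%
Claude Opus 4 | 35 | 33 | 25 | 22 | 11 | 25.1%
Z.AI GLM 5.1 | 67 | 31 | 13 | 9 | 0 | 24.1%
Claude Sonnet 4.6 (Reasoning) | 40 | 27 | 24 | 24 | 5 | 24.1%
MiniMax M2.7 | 50 | 30 | 25 | 14 | 0 | 23.8%
Claude Haiku 4.5 | 52 | 49 | 9 | 6 | 0 | 23.1%
Qwen 3 32B | 71 | 27 | 9 | 0 | 0 | 21.4%
Gemini 3.1 Flash Lite (Reasoning) | 34 | 30 | 23 | 17 | 0 | 20.6%
Claude 3.7 Sonnet | 53 | 25 | 21 | 3 | 0 | 20.2%
MiniMax M2.5 | 70 | 15 | 14 | 1 | 0 | 20.0%
Claude Opus 4.6 | 40 | 24 | 19 | 16 | 1 | 19.8%
Mistral Small 3.2 24B | 80 | 19 | 0 | 0 | 0 | 19.8%
MoonshotAI: Kimi K2.5 | 51 | 20 | 16 | 4 | 3 | 18.7%
Z.AI GLM 5 Turbo | 42 | 31 | 18 | 2 | 0 | 18.7%
GPT-4o, May 13th (temp=1) | 73 | 16 | 1 | 0 | 0 | 18.1%
LFM2 24B | 40 | 33 | 16 | 1 | 0 | 18.0%
Z.AI GLM 4.6 | 53 | 23 | 12 | 1 | 0 | 18.0%
Claude Opus 4.6 (Reasoning) | 35 | 27 | 24 | 2 | 0 | 17.6%
GPT-5.5 (Reasoning, Low) | 30 | 27 | 26 | 0 | 0 | 16.7%
DeepSeek V4 Flash | 34 | 31 | 14 | 4 | 0 | 16.6%
Qwen3.6 Max Preview | 41 | 30 | 10 | 0 | 0 | 16.3%
DeepSeek V3 (2024-12-26) | 48 | 16 | 13 | 3 | 0 | 15.9%
Qwen 3.5 Plus (2026-04-20) | 79 | 0 | 0 | 0 | 0 | 15.7%
Qwen 3.5 27B | 70 | 6 | 0 | 0 | 0 | 15.2%
GPT-4o, Aug. 6th (temp=1) | 29 | 26 | 19 | 3 | 0 | 15.2%
Gemma 3 27B | 55 | 14 | 5 | 0 | 0 | 14.8%
o4 Mini | 31 | 16 | 16 | 10 | 0 | 14.4%
GPT-5.5 (Reasoning) | 34 | 19 | 12 | 7 | 0 | 14.2%
GPT-5.4 (Reasoning) | 29 | 13 | 13 | 10 | 6 | 14.2%
Stealth: Healer Alpha | 54 | 13 | 3 | 0 | 0 | 14.1%
MoonshotAI: Kimi K2.6 | 34 | 25 | 6 | 4 | 1 | 14.1%
GPT-5 | 36 | 33 | 0 | 0 | 0 | 13.9%
GPT-4.1 | 23 | 22 | 22 | 0 | 0 | 13.1%
Claude Opus 4.7 (Reasoning) | 34 | 30 | 0 | 0 | 0 | 12.8%
Qwen 3.6 Flash | 21 | 20 | 20 | 0 | 0 | 12.1%
GPT-5.4 Mini | 26 | 19 | 10 | 3 | 2 | 11.9%
Z.AI GLM 4.5 | 48 | 5 | 2 | 0 | 0 | 10.9%
GPT-5.5 | 22 | 15 | 11 | 6 | 0 | 10.9%
Mistral Large | 28 | 19 | 3 | 3 | 0 | 10.6%
ByteDance Seed 2.0 Lite | 49 | 3 | 0 | 0 | 0 | 10.6%
DeepSeek V3 (2025-03-24) | 35 | 17 | 0 | 0 | 0 | 10.5%
GPT-4o, May 13th (temp=0) | 41 | 11 | 0 | 0 | 0 | 10.3%
Stealth: Hunter Alpha | 36 | 14 | 0 | 0 | 0 | 10.0%
DeepSeek V4 Flash (Reasoning) | 16 | 16 | 14 | 0 | 0 | 9.1%
GPT-5.4 (Reasoning, Low) | 18 | 17 | 11 | 0 | 0 | 9.0%
DeepSeek-V2 Chat | 25 | 20 | 0 | 0 | 0 | 9.0%
Ministral 8B | 35 | 10 | 0 | 0 | 0 | 8.9%
Gemini 3.1 Flash Lite (Preview) | 30 | 11 | 0 | 0 | 0 | 8.2%
ByteDance Seed 1.6 Flash | 25 | 16 | 0 | 0 | 0 | 8.1%
DeepSeek V3.1 | 39 | 0 | 0 | 0 | 0 | 7.7%
Grok 4 | 35 | 2 | 0 | 0 | 0 | 7.3%
Ministral 3B | 16 | 14 | 6 | 0 | 0 | 7.2%
Qwen3.7 Max | 22 | 13 | 0 | 0 | 0 | 6.9%
GPT-5.4 Mini (Reasoning, Low) | 23 | 10 | 0 | 0 | 0 | 6.6%
GPT-4o Mini (temp=1) | 20 | 12 | 0 | 0 | 0 | 6.4%
Gemini 2.5 Pro | 20 | 8 | 4 | 0 | 0 | 6.3%
Gemini 3 Pro (Preview) | 20 | 11 | 0 | 0 | 0 | 6.3%
Qwen 2.5 72B | 30 | 0 | 0 | 0 | 0 | 6.0%
GPT-5.4 Mini (Reasoning) | 15 | 11 | 2 | 0 | 0 | 5.7%
Aion 2.0 | 16 | 10 | 0 | 0 | 0 | 5.1%
Qwen 3.5 397B A17B | 14 | 6 | 5 | 0 | 0 | 5.1%
Claude 3 Haiku | 23 | 3 | 0 | 0 | 0 | 5.0%
Xiaomi MIMO v2.5 Pro | 21 | 3 | 1 | 0 | 0 | 4.9%
Gemma 3 4B | 14 | 8 | 0 | 0 | 0 | 4.3%
Gemini 3.5 Flash (Reasoning, Minimal) | 16 | 4 | 0 | 0 | 0 | 4.1%
Qwen 3.5 35B | 20 | 0 | 0 | 0 | 0 | 4.0%
Z.AI GLM 4.7 | 19 | 0 | 0 | 0 | 0 | 3.8%
Qwen 3.6 35B | 16 | 3 | 0 | 0 | 0 | 3.8%
Gemma 3 12B | 16 | 2 | 0 | 0 | 0 | 3.7%
WizardLM 2 8x22b | 8 | 7 | 0 | 0 | 0 | 3.1%
Z.AI GLM 4.7 Flash | 16 | 0 | 0 | 0 | 0 | 3.1%
Claude 3.5 Sonnet | 16 | 0 | 0 | 0 | 0 | 3.1%
GPT-4.1 Mini | 16 | 0 | 0 | 0 | 0 | 3.1%
GPT-5.1 | 11 | 3 | 1 | 0 | 0 | 3.1%
DeepSeek V3.2 | 6 | 5 | 3 | 0 | 0 | 3.0%
Gemini 3 Flash (Preview) | 14 | 0 | 0 | 0 | 0 | 2.9%
Nemotron 3 Super | 14 | 0 | 0 | 0 | 0 | 2.7%
Z.AI GLM 4.5 Air | 14 | 0 | 0 | 0 | 0 | 2.7%
Gemma 4 26B | 13 | 0 | 0 | 0 | 0 | 2.5%
Gemini 2.5 Flash Lite (Reasoning) | 9 | 2 | 0 | 0 | 0 | 2.1%
GPT-4.1 Nano | 5 | 4 | 0 | 0 | 0 | 1.9%
ByteDance Seed 1.6 | 9 | 0 | 0 | 0 | 0 | 1.8%
GPT-5.4 Nano (Reasoning, Low) | 9 | 0 | 0 | 0 | 0 | 1.8%
Ministral 3 3B | 8 | 0 | 0 | 0 | 0 | 1.6%
Qwen 3.5 Plus (2026-02-15) | 8 | 0 | 0 | 0 | 0 | 1.5%
Qwen 3.5 Flash | 6 | 0 | 0 | 0 | 0 | 1.2%
Gemini 2.5 Flash Lite | 3 | 1 | 0 | 0 | 0 | 0.8%
Xiaomi MIMO v2.5 | 4 | 0 | 0 | 0 | 0 | 0.8%
Gemma 4 31B | 3 | 0 | 0 | 0 | 0 | 0.6%
Gemini 3 Flash (Preview, Reasoning) | 3 | 0 | 0 | 0 | 0 | 0.6%
Qwen 3.5 122B | 2 | 1 | 0 | 0 | 0 | 0.5%
GPT-5 Mini | 3 | 0 | 0 | 0 | 0 | 0.5%
Gemini 3.1 Flash Lite | 2 | 0 | 0 | 0 | 0 | 0.3%
Nemotron 3 Nano | 2 | 0 | 0 | 0 | 0 | 0.3%
Grok 4.3 | 1 | 0 | 0 | 0 | 0 | 0.2%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.1%
Gemini 3.1 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3.5 Flash (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.3 (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 31B (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 26B (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.6 27B | 0 | 0 | 0 | 0 | 0 | 0.0%
ByteDance Seed 2.0 Mini | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 9B | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Nano (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0.0%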
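
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Qwen3 235B A22B Instruct 2507 | 100 | 93 | 92 | 79 | 76 | 88.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 39 | 87.8%
Rocinante 12B | 100 | 100 | 100 | 95 | 22 | 83.2%
GPT-5.4 (Reasoning) | 100 | 100 | 88 | 46 | 45 | 76.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 64 | 10 | 74.9%
GPT-5.4 | 88 | 77 | 74 | 72 | 59 | 73.8%
Qwen3.6 Max Preview | 96 | 87 | 61 | 54 | 52 | 70.2%
Claude Opus 4.6 (Reasoning) | 83 | 83 | 83 | 51 | 48 | 69.4%
Z.AI GLM 5 | 100 | 84 | 78 | 46 | 38 | 69.2%
Grok 4.20 (Reasoning) | 90 | 73 | 62 | 60 | 57 | 68.5%
Claude Sonnet 4.6 | 100 | 90 | 70 | 36 | 36 | 66.5%
GPT-5.1 | 87 | 81 | 67 | 49 | 43 | 65.3%
Grok 4 | 100 | 100 | 78 | 37 | 10 | 65.0%
Mistral Small 4 (Reasoning) | 99 | 73 | 64 | 47 | 34 | 63.4%
GPT-5.5 | 76 | 70 | 66 | 61 | 40 | 62.5%
Claude Opus 4.6 | 94 | 81 | 51 | 51 | 34 | 62.3%
GPT-5.4 (Reasoning, Low) | 69 | 64 | 62 | 62 | 48 | 61.1%
Claude Opus 4 | 82 | 79 | 56 | 54 | 11 | 56.4%
Arcee AI: Trinity Large (Preview) | 85 | 65 | 60 | 56 | 15 | 56.3%
GPT-5.5 (Reasoning) | 78 | 56 | 52 | 49 | 46 | 56.3%
Mistral Small Creative | 75 | 63 | 57 | 46 | 38 | 55.9%
GPT-5.5 (Reasoning, Low) | 69 | 61 | 59 | 50 | 41 | 55.9%
Grok 4.20 | 100 | 71 | 42 | 34 | 24 | 54.2%
Qwen 3.6 35B | 100 | 66 | 57 | 31 | 14 | 53.4%
Claude Opus 4.7 (Reasoning) | 74 | 57 | 52 | 46 | 34 | 52.7%
Claude Haiku 4.5 | 92 | 71 | 68 | 25 | 5 | 52.1%
Claude Opus 4.7 | 96 | 56 | 52 | 45 | 11 | 52.0%
Stealth: Healer Alpha | 81 | 65 | 52 | 40 | 17 | 50.9%
Stealth: Hunter Alpha | 89 | 53 | 48 | 34 | 27 | 50.2%
DeepSeek V4 Pro | 68 | 65 | 46 | 41 | 27 | 49.3%
ByteDance Seed 2.0 Lite | 80 | 76 | 52 | 36 | 2 | 49.2%
Z.AI GLM 5.1 | 82 | 46 | 39 | 39 | 32 | 47.6%
MiniMax M2.7 | 93 | 64 | 41 | 23 | 15 | 47.1%
MoonshotAI: Kimi K2.5 | 75 | 74 | 55 | 22 | 10 | 47.0%
Claude Opus 4.5 | 76 | 66 | 36 | 34 | 20 | 46.5%
GPT-5.4 Mini (Reasoning) | 70 | 57 | 44 | 33 | 25 | 46.0%
Claude Sonnet 4.6 (Reasoning) | 88 | 50 | 39 | 36 | 16 | 45.7%
GPT-5 | 94 | 67 | 26 | 22 | 19 | 45.6%
Xiaomi MIMO v2.5 | 79 | 64 | 48 | 27 | 11 | 45.4%
Claude 3.5 Sonnet | 100 | 55 | 48 | 23 | 0 | 45.0%
Z.AI GLM 5 Turbo | 81 | 67 | 34 | 32 | 0 | 42.8%
Claude Sonnet 4 | 83 | 56 | 40 | 35 | 0 | 42.8%
Grok 4.1 Fast | 71 | 63 | 54 | 25 | 0 | 42.5%
Hermes 3 405B | 90 | 71 | 42 | 8 | 0 | 42.3%
GPT-4.1 | 62 | 51 | 43 | 31 | 22 | 42.1%
DeepSeek V4 Flash (Reasoning) | 63 | 56 | 42 | 30 | 11 | 40.4%
Grok 4.20 (Beta, Reasoning) | 68 | 60 | 45 | 28 | 0 | 40.3%
o4 Mini | 100 | 46 | 34 | 17 | 2 | 39.9%
WizardLM 2 8x22b | 65 | 56 | 31 | 24 | 23 | 39.8%
Mistral Large 2 | 72 | 48 | 34 | 34 | 8 | 39.4%
ByteDance Seed 2.0 Mini | 98 | 80 | 18 | 0 | 0 | 39.2%
Grok 4 Fast | 80 | 41 | 37 | 33 | 0 | 38.3%
DeepSeek V3.1 | 72 | 51 | 46 | 16 | 0 | 37.1%
Ministral 3 14B | 54 | 51 | 34 | 23 | 23 | 36.8%
Mistral Medium 3.1 | 64 | 56 | 41 | 24 | 0 | 36.8%
Gemini 2.5 Flash (Reasoning) | 78 | 49 | 37 | 20 | 0 | 36.7%
o4 Mini High | 57 | 51 | 47 | 23 | 5 | 36.7%
MiniMax M2.5 | 51 | 45 | 39 | 26 | 21 | 36.3%
ByteDance Seed 1.6 Flash | 100 | 45 | 21 | 15 | 0 | 36.0%
GPT-4o, May 13th (temp=0) | 54 | 54 | 43 | 27 | 0 | 35.7%
Grok 4.20 (Beta) | 52 | 45 | 35 | 33 | 15 | 35.6%
GPT-5.4 Mini (Reasoning, Low) | 62 | 40 | 39 | 33 | 0 | 35.1%
Qwen 3.6 27B | 66 | 40 | 29 | 29 | 11 | 34.8%
DeepSeek V3.2 | 62 | 46 | 27 | 23 | 11 | 33.9%
Z.AI GLM 4.6 | 47 | 42 | 40 | 26 | 9 | 33.0%
Qwen 3.5 Plus (2026-04-20) | 69 | 51 | 35 | 3 | 0 | 31.7%
Claude Sonnet 4.5 | 51 | 50 | 49 | 5 | 2 | 31.5%
Mistral Small 4 | 77 | 42 | 21 | 9 | 7 | 31.3%
Xiaomi MIMO v2.5 Pro | 78 | 28 | 25 | 17 | 8 | 31.2%
Hermes 3 70B | 74 | 43 | 36 | 3 | 0 | 31.2%
Gemini 2.5 Pro | 35 | 34 | 31 | 29 | 24 | 30.6%
Gemini 2.5 Flash Lite | 61 | 42 | 33 | 14 | 0 | 30.2%
DeepSeek V3 (2025-03-24) | 42 | 39 | 31 | 27 | 11 | 30.0%
GPT-5.4 Mini | 43 | 40 | 34 | 28 | 3 | 29.7%
Qwen 3.6 Flash | 88 | 58 | 1 | 1 | 0 | 29.5%
Llama 3.1 70B | 100 | 41 | 2 | 1 | 0 | 28.7%
Qwen 3.5 Flash | 47 | 36 | 29 | 17 | 10 | 28.0%
Ministral 3B | 65 | 48 | 27 | 0 | 0 | 28.0%
MoonshotAI: Kimi K2.6 | 56 | 33 | 33 | 18 | 0 | 28.0%
Gemini 3.5 Flash (Reasoning) | 42 | 38 | 34 | 24 | 0 | 27.7%
Claude 3.7 Sonnet | 43 | 40 | 29 | 21 | 1 | 26.9%
Qwen 3 32B | 58 | 33 | 31 | 5 | 4 | 26.5%
DeepSeek V4 Pro (Reasoning) | 57 | 28 | 19 | 15 | 13 | 26.3%
Qwen 3.5 Plus (2026-02-15) | 53 | 27 | 27 | 23 | 0 | 26.0%
GPT-4o, Aug. 6th (temp=1) | 47 | 42 | 40 | 1 | 0 | 25.9%
Ministral 3 8B | 65 | 29 | 18 | 17 | 0 | 25.8%
Mistral Large | 82 | 39 | 8 | 0 | 0 | 25.7%
Ministral 8B | 60 | 50 | 17 | 0 | 0 | 25.4%
Gemini 3.1 Flash Lite (Reasoning) | 47 | 45 | 31 | 0 | 0 | 24.7%
ByteDance Seed 1.6 | 53 | 40 | 17 | 13 | 0 | 24.6%
Gemma 3 27B | 55 | 34 | 20 | 5 | 5 | 24.0%
Grok 4.3 | 31 | 31 | 28 | 20 | 1 | 22.3%
Mistral Small 3.2 24B | 71 | 31 | 5 | 0 | 0 | 21.6%
Gemini 2.5 Flash | 43 | 24 | 23 | 19 | 0 | 21.5%
DeepSeek V4 Flash | 56 | 37 | 13 | 0 | 0 | 21.3%
Qwen 3.5 397B A17B | 53 | 26 | 25 | 0 | 0 | 20.8%
Gemini 3.1 Flash Lite (Preview) | 55 | 18 | 15 | 12 | 0 | 20.1%
Aion 2.0 | 45 | 27 | 25 | 0 | 0 | 19.5%
Qwen 3.5 122B | 38 | 29 | 23 | 6 | 0 | 19.3%
LFM2 24B | 36 | 33 | 27 | 0 | 0 | 19.2%
Claude 3 Haiku | 33 | 31 | 28 | 0 | 0 | 18.4%
Gemini 2.5 Flash Lite (Reasoning) | 29 | 27 | 24 | 9 | 0 | 18.0%
Gemini 3 Pro (Preview) | 42 | 31 | 16 | 0 | 0 | 17.7%
Gemini 3.1 Flash Lite | 50 | 36 | 2 | 0 | 0 | 17.6%
Gemma 3 4B | 53 | 19 | 8 | 7 | 0 | 17.5%
Nemotron 3 Super | 41 | 27 | 9 | 9 | 0 | 17.2%
Llama 3.1 Nemotron 70B | 28 | 27 | 26 | 5 | 0 | 17.1%
Gemma 4 31B | 32 | 14 | 14 | 13 | 12 | 16.9%
GPT-5.4 Nano | 28 | 25 | 18 | 9 | 0 | 16.0%
DeepSeek V3 (2024-12-26) | 55 | 23 | 0 | 0 | 0 | 15.5%
Gemini 3 Flash (Preview) | 32 | 23 | 11 | 11 | 0 | 15.4%
Z.AI GLM 4.7 Flash | 55 | 13 | 3 | 2 | 0 | 14.4%
Gemma 3 12B | 32 | 22 | 16 | 0 | 0 | 13.9%
GPT-4o Mini (temp=1) | 28 | 26 | 7 | 5 | 3 | 13.9%
GPT-4o, Aug. 6th (temp=0) | 58 | 7 | 3 | 0 | 0 | 13.7%
Z.AI GLM 4.5 Air | 27 | 21 | 18 | 0 | 0 | 13.3%
GPT-4o, May 13th (temp=1) | 25 | 17 | 16 | 7 | 2 | 13.1%
Llama 3.1 8B | 28 | 21 | 10 | 4 | 0 | 12.7%
Gemini 3.5 Flash (Reasoning, Minimal) | 34 | 25 | 3 | 0 | 0 | 12.6%
GPT-5.2 | 23 | 15 | 12 | 9 | 0 | 11.9%
GPT-4.1 Nano | 23 | 22 | 9 | 1 | 0 | 10.8%
GPT-5.4 Nano (Reasoning, Low) | 16 | 16 | 16 | 2 | 0 | 9.8%
Inception Mercury | 46 | 0 | 0 | 0 | 0 | 9.3%
Mistral NeMO | 28 | 10 | 4 | 3 | 0 | 8.9%
Gemini 3 Flash (Preview, Reasoning) | 18 | 13 | 7 | 3 | 2 | 8.9%
Mistral Large 3 | 28 | 14 | 2 | 0 | 0 | 8.8%
GPT-5.4 Nano (Reasoning) | 17 | 12 | 12 | 2 | 0 | 8.7%
Gemma 4 26B | 31 | 8 | 0 | 0 | 0 | 7.8%
Gemma 4 31B (Reasoning) | 16 | 13 | 7 | 0 | 0 | 7.2%
Z.AI GLM 4.7 | 17 | 11 | 7 | 0 | 0 | 6.9%
GPT-4o Mini (temp=0) | 34 | 0 | 0 | 0 | 0 | 6.7%
DeepSeek-V2 Chat | 25 | 8 | 0 | 0 | 0 | 6.6%
Qwen 3.5 35B | 32 | 0 | 0 | 0 | 0 | 6.3%
Gemini 3.1 Pro (Preview) | 13 | 9 | 8 | 0 | 0 | 6.1%
Grok 4.3 (Reasoning) | 15 | 4 | 0 | 0 | 0 | 3.9%
GPT-5 Mini | 12 | 5 | 1 | 0 | 0 | 3.6%
Arcee AI: Trinity Mini | 17 | 0 | 0 | 0 | 0 | 3.4%
GPT-4.1 Mini | 14 | 2 | 0 | 0 | 0 | 3.1%
Qwen3.7 Max | 13 | 0 | 0 | 0 | 0 | 2.7%
Ministral 3 3B | 13 | 0 | 0 | 0 | 0 | 2.6%
GPT-5 Nano | 9 | 2 | 0 | 0 | 0 | 2.2%
Z.AI GLM 4.5 | 3 | 0 | 0 | 0 | 0 | 0.6%
Qwen 2.5 72B | 1 | 0 | 0 | 0 | 0 | 0.2%
Gemma 4 26B (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.1%
Qwen 3.5 27B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 9B | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%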
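
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Rocinante 12B | 100 | 98 | 96 | 69 | 0 | 72.6%
Mistral Small Creative | 100 | 72 | 65 | 62 | 31 | 66.2%
Mistral Small 4 (Reasoning) | 100 | 72 | 51 | 35 | 25 | 56.6%
Claude Opus 4 | 97 | 58 | 46 | 40 | 36 | 55.4%
Claude Opus 4.6 (Reasoning) | 85 | 67 | 49 | 48 | 20 | 53.8%
Ministral 8B | 83 | 47 | 47 | 42 | 39 | 51.8%
Qwen 3.6 35B | 71 | 66 | 55 | 38 | 21 | 50.3%
Ministral 3 14B | 75 | 63 | 59 | 46 | 7 | 50.2%
Hermes 3 405B | 100 | 56 | 33 | 28 | 25 | 48.4%
Qwen 3 32B | 77 | 68 | 44 | 37 | 11 | 47.3%
Z.AI GLM 5 | 96 | 54 | 41 | 25 | 17 | 46.8%
Claude Opus 4.6 | 71 | 45 | 42 | 38 | 33 | 45.8%
Arcee AI: Trinity Large (Preview) | 78 | 68 | 57 | 10 | 7 | 44.0%
Qwen3 235B A22B Instruct 2507 | 70 | 52 | 49 | 30 | 19 | 43.9%
Z.AI GLM 5.1 | 100 | 50 | 41 | 27 | 0 | 43.9%
Grok 4.20 | 59 | 40 | 40 | 37 | 36 | 42.4%
Writer: Palmyra X5 | 82 | 59 | 40 | 15 | 12 | 41.4%
Qwen 3.6 Flash | 77 | 52 | 40 | 36 | 0 | 41.0%
MiniMax M2.5 | 55 | 47 | 39 | 36 | 25 | 40.5%
MiniMax M2.7 | 71 | 48 | 37 | 27 | 10 | 38.6%
Mistral Small 4 | 100 | 71 | 16 | 5 | 0 | 38.6%
GPT-5.4 Mini | 46 | 38 | 38 | 34 | 32 | 37.7%
GPT-5 Nano | 100 | 83 | 5 | 0 | 0 | 37.5%
Claude Opus 4.5 | 58 | 49 | 31 | 24 | 23 | 36.9%
DeepSeek V4 Pro | 51 | 48 | 32 | 27 | 23 | 36.4%
Llama 3.1 70B | 56 | 45 | 38 | 34 | 8 | 36.1%
Llama 3.1 Nemotron 70B | 66 | 45 | 41 | 17 | 8 | 35.6%
ByteDance Seed 1.6 Flash | 59 | 41 | 36 | 28 | 14 | 35.4%
GPT-5.4 (Reasoning, Low) | 41 | 41 | 39 | 27 | 26 | 34.7%
Ministral 3 8B | 76 | 38 | 31 | 19 | 0 | 32.9%
GPT-5.5 | 41 | 40 | 37 | 30 | 16 | 32.8%
Gemini 3.1 Flash Lite (Reasoning) | 64 | 47 | 28 | 22 | 0 | 32.0%
Qwen 3.5 Plus (2026-04-20) | 91 | 35 | 33 | 0 | 0 | 31.7%
Llama 3.1 8B | 69 | 69 | 12 | 5 | 0 | 31.2%
Z.AI GLM 5 Turbo | 90 | 42 | 13 | 10 | 0 | 31.0%
Claude 3 Haiku | 45 | 45 | 35 | 23 | 5 | 30.7%
o4 Mini High | 50 | 43 | 32 | 19 | 0 | 28.9%
GPT-5.4 (Reasoning) | 40 | 31 | 31 | 25 | 13 | 28.0%
Mistral Large 3 | 77 | 44 | 17 | 0 | 0 | 27.8%
GPT-5.4 Mini (Reasoning) | 51 | 47 | 21 | 20 | 0 | 27.6%
DeepSeek V4 Flash | 47 | 36 | 24 | 23 | 8 | 27.6%
DeepSeek V4 Pro (Reasoning) | 53 | 32 | 22 | 18 | 13 | 27.5%
Hermes 3 70B | 71 | 31 | 21 | 12 | 0 | 27.2%
Grok 4 Fast | 66 | 31 | 24 | 14 | 0 | 27.2%
GPT-5.5 (Reasoning, Low) | 46 | 30 | 29 | 26 | 1 | 26.2%
DeepSeek V4 Flash (Reasoning) | 59 | 37 | 34 | 0 | 0 | 25.9%
Grok 4.20 (Beta, Reasoning) | 59 | 42 | 26 | 0 | 0 | 25.5%
Grok 4.20 (Reasoning) | 60 | 39 | 16 | 12 | 0 | 25.3%
DeepSeek V3 (2024-12-26) | 63 | 29 | 27 | 6 | 0 | 25.1%
ByteDance Seed 2.0 Lite | 45 | 36 | 22 | 18 | 0 | 24.2%
Z.AI GLM 4.6 | 64 | 24 | 19 | 11 | 0 | 23.6%
Grok 4.1 Fast | 35 | 35 | 32 | 8 | 8 | 23.5%
Claude Opus 4.7 (Reasoning) | 54 | 37 | 14 | 8 | 2 | 23.0%
MoonshotAI: Kimi K2.5 | 49 | 39 | 16 | 8 | 0 | 22.3%
GPT-5.4 | 39 | 36 | 28 | 6 | 0 | 21.7%
Claude Opus 4.7 | 43 | 34 | 19 | 8 | 0 | 20.8%
Claude Sonnet 4.6 (Reasoning) | 49 | 23 | 18 | 12 | 0 | 20.5%
GPT-4.1 Nano | 33 | 23 | 17 | 16 | 12 | 20.2%
Gemini 3.1 Flash Lite (Preview) | 39 | 25 | 21 | 17 | 0 | 20.1%
GPT-4o, Aug. 6th (temp=1) | 37 | 30 | 26 | 7 | 0 | 20.1%
o4 Mini | 43 | 28 | 27 | 0 | 0 | 19.8%
Claude Sonnet 4.5 | 56 | 28 | 14 | 0 | 0 | 19.7%
Gemma 3 27B | 31 | 24 | 22 | 20 | 2 | 19.5%
WizardLM 2 8x22b | 79 | 11 | 6 | 0 | 0 | 19.3%
Claude Haiku 4.5 | 50 | 23 | 15 | 6 | 0 | 18.8%
GPT-4o Mini (temp=1) | 46 | 17 | 16 | 10 | 2 | 18.3%
DeepSeek V3 (2025-03-24) | 35 | 25 | 20 | 10 | 0 | 18.0%
GPT-5 | 45 | 43 | 2 | 0 | 0 | 18.0%
DeepSeek V3.1 | 66 | 17 | 5 | 0 | 0 | 17.6%
Qwen 3.5 397B A17B | 26 | 23 | 21 | 17 | 0 | 17.4%
DeepSeek V3.2 | 30 | 30 | 24 | 0 | 0 | 16.7%
GPT-4o, May 13th (temp=1) | 27 | 22 | 17 | 17 | 0 | 16.5%
GPT-5.1 | 42 | 14 | 9 | 8 | 7 | 16.0%
Xiaomi MIMO v2.5 | 27 | 23 | 21 | 7 | 0 | 15.6%
Grok 4.20 (Beta) | 34 | 27 | 16 | 1 | 0 | 15.6%
Mistral Small 3.2 24B | 50 | 18 | 9 | 0 | 0 | 15.5%
Gemma 3 12B | 32 | 23 | 16 | 5 | 0 | 15.4%
Stealth: Healer Alpha | 48 | 14 | 11 | 0 | 0 | 14.6%
Claude Sonnet 4 | 28 | 27 | 11 | 3 | 2 | 14.2%
Mistral Large | 29 | 25 | 16 | 0 | 0 | 13.9%
Mistral Medium 3.1 | 26 | 22 | 21 | 0 | 0 | 13.7%
Cohere Command R+ (Aug. 2024) | 34 | 19 | 16 | 0 | 0 | 13.6%
Qwen3.6 Max Preview | 40 | 27 | 0 | 0 | 0 | 13.4%
MoonshotAI: Kimi K2.6 | 51 | 14 | 0 | 0 | 0 | 13.1%
GPT-5.5 (Reasoning) | 19 | 15 | 13 | 11 | 5 | 12.8%
ByteDance Seed 2.0 Mini | 55 | 5 | 4 | 0 | 0 | 12.8%
Gemini 2.5 Flash Lite | 31 | 18 | 11 | 3 | 0 | 12.7%
Gemini 2.5 Pro | 39 | 18 | 6 | 0 | 0 | 12.6%
Grok 4 | 21 | 14 | 13 | 7 | 5 | 12.0%
GPT-5.4 Mini (Reasoning, Low) | 34 | 11 | 8 | 6 | 0 | 11.7%
Gemini 2.5 Flash Lite (Reasoning) | 17 | 16 | 14 | 9 | 0 | 11.4%
GPT-4.1 Mini | 30 | 26 | 1 | 0 | 0 | 11.3%
Mistral Large 2 | 57 | 0 | 0 | 0 | 0 | 11.3%
Stealth: Hunter Alpha | 26 | 14 | 12 | 3 | 0 | 11.1%
GPT-4o, Aug. 6th (temp=0) | 39 | 10 | 5 | 0 | 0 | 10.9%
Qwen 3.5 35B | 31 | 23 | 0 | 0 | 0 | 10.8%
Xiaomi MIMO v2.5 Pro | 43 | 10 | 0 | 0 | 0 | 10.7%
GPT-4.1 | 22 | 16 | 14 | 2 | 0 | 10.7%
Mistral NeMO | 42 | 11 | 1 | 0 | 0 | 10.7%
Gemini 3 Pro (Preview) | 41 | 10 | 0 | 0 | 0 | 10.1%
Grok 4.3 (Reasoning) | 45 | 0 | 0 | 0 | 0 | 9.0%
Qwen 3.5 122B | 26 | 18 | 1 | 0 | 0 | 9.0%
Qwen 3.5 Flash | 38 | 5 | 2 | 0 | 0 | 9.0%
Claude Sonnet 4.6 | 17 | 15 | 13 | 0 | 0 | 8.9%
Gemini 3.1 Flash Lite | 31 | 9 | 0 | 0 | 0 | 8.1%
Claude 3.5 Sonnet | 16 | 13 | 11 | 0 | 0 | 7.9%
Z.AI GLM 4.5 | 26 | 11 | 0 | 0 | 0 | 7.3%
Aion 2.0 | 16 | 15 | 3 | 2 | 0 | 7.2%
Gemini 2.5 Flash | 34 | 0 | 0 | 0 | 0 | 6.7%
Claude 3.7 Sonnet | 16 | 14 | 0 | 0 | 0 | 5.9%
Z.AI GLM 4.7 | 29 | 0 | 0 | 0 | 0 | 5.7%
Grok 4.3 | 26 | 0 | 0 | 0 | 0 | 5.1%
Gemma 3 4B | 16 | 9 | 0 | 0 | 0 | 4.9%
Qwen3.7 Max | 22 | 0 | 0 | 0 | 0 | 4.4%
GPT-4o, May 13th (temp=0) | 10 | 7 | 0 | 0 | 0 | 3.5%
ByteDance Seed 1.6 | 13 | 3 | 0 | 0 | 0 | 3.1%
DeepSeek-V2 Chat | 15 | 0 | 0 | 0 | 0 | 2.9%
Gemini 3.5 Flash (Reasoning) | 14 | 0 | 0 | 0 | 0 | 2.8%
Gemini 2.5 Flash (Reasoning) | 13 | 0 | 0 | 0 | 0 | 2.6%
Z.AI GLM 4.5 Air | 13 | 0 | 0 | 0 | 0 | 2.6%
GPT-4o Mini (temp=0) | 13 | 0 | 0 | 0 | 0 | 2.5%
Ministral 3 3B | 12 | 0 | 0 | 0 | 0 | 2.3%
GPT-5.2 | 11 | 0 | 0 | 0 | 0 | 2.2%
LFM2 24B | 8 | 1 | 0 | 0 | 0 | 1.9%
Arcee AI: Trinity Mini | 9 | 0 | 0 | 0 | 0 | 1.8%
Qwen 3.5 9B | 6 | 0 | 0 | 0 | 0 | 1.2%
Nemotron 3 Super | 3 | 2 | 0 | 0 | 0 | 1.0%
Qwen 2.5 72B | 4 | 0 | 0 | 0 | 0 | 0.8%
Qwen 3.6 27B | 3 | 0 | 0 | 0 | 0 | 0.6%
Ministral 3B | 2 | 0 | 0 | 0 | 0 | 0.5%
GPT-5.4 Nano | 1 | 0 | 0 | 0 | 0 | 0.3%
Gemini 3.5 Flash (Reasoning, Minimal) | 1 | 0 | 0 | 0 | 0 | 0.2%
Z.AI GLM 4.7 Flash | 1 | 0 | 0 | 0 | 0 | 0.2%
Gemini 3.1 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5 Mini | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 31B (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 26B (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 27B | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview, Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 31B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 26B | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Nano (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Nano (Reasoning, Low) | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%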
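
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 96 | 89 | 62 | 89.4%
Writer: Palmyra X5 | 100 | 100 | 100 | 81 | 61 | 88.3%
Z.AI GLM 5 | 100 | 99 | 96 | 90 | 54 | 87.8%
Hermes 3 70B | 100 | 100 | 93 | 72 | 71 | 87.3%
Rocinante 12B | 100 | 100 | 71 | 58 | 55 | 76.7%
Claude Opus 4.7 | 99 | 85 | 78 | 55 | 49 | 73.1%
Claude Haiku 4.5 | 100 | 100 | 75 | 60 | 25 | 71.9%
Llama 3.1 8B | 100 | 100 | 65 | 62 | 31 | 71.8%
Grok 4.1 Fast | 100 | 83 | 78 | 57 | 40 | 71.6%
Claude Sonnet 4.5 | 89 | 75 | 75 | 57 | 49 | 69.1%
Arcee AI: Trinity Large (Preview) | 87 | 87 | 59 | 53 | 53 | 67.7%
GPT-5.4 | 86 | 76 | 64 | 54 | 51 | 66.0%
Z.AI GLM 5.1 | 100 | 84 | 62 | 58 | 24 | 65.6%
GPT-4.1 Nano | 91 | 73 | 68 | 46 | 45 | 64.4%
Claude Opus 4.6 | 100 | 79 | 73 | 39 | 29 | 64.0%
Claude Opus 4.5 | 88 | 73 | 72 | 44 | 41 | 63.7%
Claude Opus 4.7 (Reasoning) | 82 | 71 | 61 | 56 | 37 | 61.5%
MiniMax M2.5 | 73 | 69 | 58 | 54 | 48 | 60.6%
Mistral Small 4 (Reasoning) | 100 | 84 | 82 | 27 | 9 | 60.5%
GPT-5.4 (Reasoning, Low) | 79 | 68 | 64 | 54 | 34 | 59.6%
DeepSeek V4 Pro | 72 | 67 | 63 | 48 | 48 | 59.5%
Claude Opus 4.6 (Reasoning) | 79 | 67 | 50 | 50 | 44 | 58.2%
Mistral Small 4 | 97 | 84 | 64 | 31 | 14 | 57.8%
Claude Sonnet 4.6 | 85 | 79 | 55 | 47 | 19 | 57.1%
MoonshotAI: Kimi K2.5 | 94 | 89 | 55 | 28 | 16 | 56.4%
GPT-5.4 (Reasoning) | 77 | 74 | 48 | 44 | 39 | 56.4%
Grok 4.20 | 70 | 53 | 52 | 52 | 36 | 52.6%
Claude 3 Haiku | 83 | 82 | 67 | 19 | 10 | 52.3%
Grok 4.20 (Reasoning) | 58 | 57 | 57 | 52 | 34 | 51.7%
Hermes 3 405B | 70 | 66 | 49 | 47 | 23 | 51.0%
Ministral 3 14B | 87 | 68 | 47 | 31 | 20 | 50.4%
Gemini 2.5 Flash Lite | 78 | 61 | 42 | 38 | 23 | 48.2%
GPT-4o, Aug. 6th (temp=1) | 69 | 56 | 40 | 38 | 36 | 47.9%
GPT-5.5 (Reasoning, Low) | 59 | 56 | 51 | 38 | 35 | 47.7%
Mistral Small Creative | 82 | 56 | 39 | 39 | 20 | 47.3%
Claude Opus 4 | 63 | 54 | 41 | 39 | 36 | 46.7%
DeepSeek V4 Pro (Reasoning) | 69 | 52 | 42 | 39 | 29 | 46.1%
Gemini 2.5 Flash Lite (Reasoning) | 77 | 41 | 39 | 35 | 34 | 45.3%
Cohere Command R+ (Aug. 2024) | 71 | 65 | 42 | 28 | 15 | 44.4%
GPT-4.1 | 56 | 45 | 44 | 37 | 36 | 43.7%
Claude 3.7 Sonnet | 59 | 49 | 47 | 39 | 21 | 42.9%
GPT-5.4 Mini | 71 | 46 | 41 | 36 | 20 | 42.8%
DeepSeek V4 Flash (Reasoning) | 81 | 44 | 44 | 31 | 14 | 42.8%
GPT-5.5 (Reasoning) | 50 | 48 | 43 | 37 | 35 | 42.7%
Llama 3.1 70B | 90 | 48 | 35 | 23 | 13 | 41.8%
Qwen 3.6 27B | 67 | 57 | 45 | 25 | 14 | 41.8%
Mistral Small 3.2 24B | 85 | 65 | 30 | 29 | 0 | 41.7%
Llama 3.1 Nemotron 70B | 65 | 50 | 37 | 30 | 28 | 41.7%
GPT-4o, May 13th (temp=1) | 67 | 63 | 41 | 35 | 0 | 41.2%
GPT-4.1 Mini | 74 | 57 | 48 | 21 | 2 | 40.5%
Z.AI GLM 4.6 | 50 | 42 | 40 | 37 | 32 | 40.4%
Mistral Medium 3.1 | 66 | 51 | 31 | 28 | 23 | 39.9%
Grok 4.20 (Beta) | 61 | 45 | 43 | 27 | 24 | 39.8%
Z.AI GLM 5 Turbo | 57 | 55 | 42 | 35 | 8 | 39.5%
Qwen3.6 Max Preview | 70 | 54 | 39 | 24 | 9 | 39.0%
Gemma 3 27B | 62 | 56 | 38 | 24 | 15 | 39.0%
Gemma 3 12B | 61 | 49 | 44 | 21 | 17 | 38.6%
Grok 4 Fast | 68 | 42 | 38 | 22 | 22 | 38.5%
Stealth: Hunter Alpha | 65 | 65 | 35 | 26 | 0 | 38.1%
GPT-5.5 | 47 | 40 | 39 | 33 | 30 | 37.9%
Gemini 3.1 Flash Lite (Reasoning) | 70 | 47 | 42 | 17 | 12 | 37.7%
Grok 4 | 63 | 46 | 42 | 20 | 13 | 36.9%
MoonshotAI: Kimi K2.6 | 49 | 45 | 36 | 35 | 20 | 36.8%
GPT-5.4 Mini (Reasoning, Low) | 48 | 47 | 36 | 31 | 18 | 36.1%
Claude Sonnet 4 | 100 | 55 | 14 | 10 | 0 | 35.8%
DeepSeek V3.1 | 57 | 49 | 44 | 15 | 7 | 34.6%
ByteDance Seed 1.6 Flash | 60 | 52 | 34 | 18 | 8 | 34.4%
Gemini 3.1 Pro (Preview) | 74 | 47 | 32 | 13 | 5 | 34.3%
Claude Sonnet 4.6 (Reasoning) | 49 | 43 | 31 | 26 | 21 | 34.2%
GPT-5.1 | 48 | 39 | 38 | 26 | 18 | 33.7%
Xiaomi MIMO v2.5 | 58 | 45 | 33 | 23 | 5 | 32.9%
Xiaomi MIMO v2.5 Pro | 58 | 54 | 32 | 10 | 8 | 32.5%
Aion 2.0 | 40 | 40 | 30 | 30 | 18 | 31.4%
DeepSeek V4 Flash | 51 | 40 | 25 | 24 | 15 | 31.0%
o4 Mini High | 50 | 38 | 26 | 21 | 14 | 29.7%
Ministral 3 8B | 96 | 26 | 14 | 12 | 0 | 29.6%
Grok 4.20 (Beta, Reasoning) | 42 | 37 | 28 | 21 | 17 | 29.0%
Qwen 3.5 Plus (2026-02-15) | 59 | 43 | 23 | 19 | 0 | 28.8%
DeepSeek-V2 Chat | 43 | 40 | 34 | 16 | 3 | 27.2%
Qwen 3.6 Flash | 40 | 36 | 34 | 13 | 13 | 27.1%
Stealth: Healer Alpha | 63 | 42 | 15 | 11 | 3 | 26.8%
Z.AI GLM 4.7 Flash | 43 | 30 | 27 | 19 | 13 | 26.5%
WizardLM 2 8x22b | 46 | 31 | 25 | 24 | 6 | 26.5%
Gemini 2.5 Pro | 51 | 29 | 22 | 14 | 12 | 25.6%
Claude 3.5 Sonnet | 43 | 42 | 36 | 7 | 0 | 25.5%
Gemini 3.5 Flash (Reasoning) | 60 | 24 | 22 | 14 | 5 | 25.1%
GPT-4o, Aug. 6th (temp=0) | 39 | 37 | 22 | 20 | 8 | 25.1%
Z.AI GLM 4.5 Air | 40 | 40 | 23 | 21 | 1 | 25.0%
Ministral 3 3B | 43 | 35 | 24 | 12 | 8 | 24.6%
Gemini 3 Pro (Preview) | 46 | 34 | 30 | 9 | 2 | 24.3%
DeepSeek V3.2 | 41 | 26 | 23 | 19 | 12 | 24.2%
MiniMax M2.7 | 47 | 31 | 17 | 13 | 12 | 24.0%
Z.AI GLM 4.5 | 42 | 27 | 25 | 16 | 11 | 24.0%
o4 Mini | 51 | 33 | 16 | 15 | 5 | 23.8%
GPT-5.4 Mini (Reasoning) | 41 | 30 | 29 | 14 | 5 | 23.8%
Qwen 3.5 Plus (2026-04-20) | 70 | 48 | 0 | 0 | 0 | 23.7%
GPT-5 Nano | 100 | 14 | 0 | 0 | 0 | 22.9%
Mistral Large | 70 | 17 | 16 | 8 | 1 | 22.5%
GPT-4o Mini (temp=1) | 40 | 22 | 19 | 19 | 11 | 22.3%
Gemini 2.5 Flash | 35 | 28 | 27 | 19 | 1 | 21.9%
Mistral Large 3 | 36 | 29 | 25 | 11 | 8 | 21.8%
Gemini 2.5 Flash (Reasoning) | 38 | 34 | 21 | 15 | 0 | 21.5%
Ministral 8B | 58 | 31 | 10 | 1 | 0 | 19.9%
Gemma 3 4B | 26 | 25 | 14 | 14 | 14 | 18.7%
Mistral NeMO | 33 | 28 | 17 | 15 | 0 | 18.6%
DeepSeek V3 (2025-03-24) | 34 | 31 | 17 | 2 | 0 | 16.9%
ByteDance Seed 2.0 Mini | 30 | 30 | 24 | 0 | 0 | 16.8%
Arcee AI: Trinity Mini | 28 | 23 | 16 | 14 | 0 | 16.2%
Qwen 3.6 35B | 54 | 16 | 11 | 0 | 0 | 16.1%
Gemini 3.1 Flash Lite (Preview) | 40 | 26 | 7 | 2 | 0 | 15.2%
LFM2 24B | 36 | 20 | 11 | 8 | 0 | 15.1%
ByteDance Seed 2.0 Lite | 39 | 14 | 11 | 9 | 2 | 15.0%
Grok 4.3 (Reasoning) | 35 | 31 | 8 | 0 | 0 | 15.0%
Qwen 3.5 397B A17B | 35 | 35 | 5 | 0 | 0 | 14.9%
Nemotron 3 Nano | 49 | 18 | 7 | 0 | 0 | 14.9%
Gemini 3.5 Flash (Reasoning, Minimal) | 38 | 17 | 13 | 2 | 0 | 14.1%
Qwen3.7 Max | 51 | 17 | 0 | 0 | 0 | 13.6%
Z.AI GLM 4.7 | 28 | 15 | 11 | 8 | 5 | 13.4%
Qwen 3.5 35B | 30 | 25 | 9 | 0 | 0 | 12.9%
GPT-5.4 Nano | 33 | 24 | 4 | 1 | 1 | 12.6%
Mistral Large 2 | 31 | 16 | 9 | 8 | 0 | 12.6%
GPT-5 Mini | 37 | 16 | 5 | 4 | 0 | 12.5%
GPT-4o, May 13th (temp=0) | 28 | 20 | 12 | 0 | 0 | 12.0%
Ministral 3B | 27 | 21 | 11 | 2 | 0 | 12.0%
Qwen 3.5 Flash | 38 | 15 | 6 | 0 | 0 | 11.7%
Qwen 3 32B | 21 | 19 | 10 | 0 | 0 | 9.9%
ByteDance Seed 1.6 | 41 | 8 | 0 | 0 | 0 | 9.9%
GPT-5.4 Nano (Reasoning, Low) | 19 | 15 | 7 | 4 | 3 | 9.4%
Grok 4.3 | 29 | 12 | 3 | 1 | 0 | 9.0%
Gemini 3.1 Flash Lite | 42 | 0 | 0 | 0 | 0 | 8.4%
Gemma 4 31B | 25 | 13 | 0 | 0 | 0 | 7.8%
Gemma 4 31B (Reasoning) | 14 | 11 | 10 | 0 | 0 | 7.1%
GPT-5.4 Nano (Reasoning) | 24 | 9 | 0 | 0 | 0 | 6.4%
Gemini 3 Flash (Preview) | 12 | 10 | 4 | 3 | 0 | 5.9%
Gemma 4 26B (Reasoning) | 16 | 13 | 0 | 0 | 0 | 5.7%
Qwen 2.5 72B | 15 | 11 | 1 | 0 | 0 | 5.1%
Qwen 3.5 9B | 19 | 0 | 0 | 0 | 0 | 3.8%
DeepSeek V3 (2024-12-26) | 18 | 0 | 0 | 0 | 0 | 3.6%
Gemma 4 26B | 16 | 1 | 0 | 0 | 0 | 3.4%
Gemini 3 Flash (Preview, Reasoning) | 13 | 1 | 0 | 0 | 0 | 2.8%
Nemotron 3 Super | 12 | 0 | 0 | 0 | 0 | 2.3%
GPT-4o Mini (temp=0) | 11 | 0 | 0 | 0 | 0 | 2.2%
GPT-5 | 11 | 0 | 0 | 0 | 0 | 2.2%
GPT-5.2 | 9 | 0 | 0 | 0 | 0 | 1.8%
Qwen 3.5 27B | 5 | 0 | 0 | 0 | 0 | 1.1%
Qwen 3.5 122B | 3 | 1 | 0 | 0 | 0 | 0.8%
GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0%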