Subject-first sentence starts

Test: Bad Writing Habits

Avg. Score: 35.5%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Writer: Palmyra X5 | 83.2% | $0.011 | 22.0s | 50%
2 | Qwen3 235B A22B Instruct 2507 | 81.0% | $0.0011 | 59.2s | 48%
3 | Rocinante 12B | 77.8% | $0.0014 | 38.4s | 38%
4 | Llama 3.1 8B | 71.0% | $0.0003 | 1.3m | 29%
5 | GPT-5.4 | 70.2% | $0.049 | 1.4m | 39%
6 | Mistral Small 4 (Reasoning) | 60.6% | $0.0022 | 30.2s | 25%
7 | Mistral Small 4 | 56.1% | $0.0014 | 18.2s | 24%
8 | Ministral 3 14B | 50.5% | $0.0007 | 11.7s | 25%
9 | GPT-5.4 (Reasoning, Low) | 67.7% | $0.055 | 1.4m | 36%
10 | Claude Sonnet 4.5 | 63.6% | $0.035 | 38.1s | 27%
11 | Grok 4.1 Fast | 54.6% | $0.0018 | 37.8s | 24%
12 | Mistral Small Creative | 49.3% | $0.0007 | 9.1s | 22%
13 | Z.AI GLM 5 | 58.9% | $0.0084 | 1.2m | 26%
14 | GPT-5.4 Mini | 50.0% | $0.015 | 16.8s | 25%
15 | Claude Haiku 4.5 | 52.5% | $0.011 | 21.6s | 21%
16 | Z.AI GLM 5 Turbo | 49.6% | $0.0081 | 33.2s | 22%
17 | GPT-5.4 Mini (Reasoning, Low) | 47.8% | $0.015 | 16.8s | 23%
18 | Grok 4.20 | 45.9% | $0.0093 | 45.7s | 26%
19 | Hermes 3 70B | 57.9% | $0.0010 | 1.2m | 20%
20 | Mistral Medium 3.1 | 43.3% | $0.0048 | 36.5s | 25%
21 | MiniMax M2.5 | 52.7% | $0.0034 | 1.3m | 23%
22 | Grok 4 Fast | 45.4% | $0.0017 | 24.1s | 20%
23 | Hermes 3 405B | 53.5% | $0.0032 | 53.2s | 19%
24 | DeepSeek V4 Pro | 51.0% | $0.0048 | 1.3m | 23%
25 | Grok 4.20 (Beta) | 38.9% | $0.018 | 15.8s | 25%
26 | Claude Opus 4.7 | 60.1% | $0.069 | 30.4s | 26%
27 | Claude Sonnet 4 | 53.0% | $0.032 | 43.7s | 22%
28 | Claude Sonnet 4.6 | 53.3% | $0.031 | 39.3s | 21%
29 | Llama 3.1 Nemotron 70B | 45.5% | $0.0038 | 31.7s | 18%
30 | Arcee AI: Trinity Large (Preview) | 44.0% | $0.0000 | 43.6s | 19%
31 | Llama 3.1 70B | 44.4% | $0.0015 | 29.4s | 17%
32 | Claude Opus 4.5 | 58.8% | $0.070 | 53.4s | 27%
33 | Claude 3 Haiku | 45.6% | $0.0025 | 14.9s | 14%
34 | DeepSeek V4 Flash (Reasoning) | 41.5% | $0.0007 | 31.1s | 18%
35 | Gemini 2.5 Flash Lite | 34.5% | $0.0009 | 9.5s | 18%
36 | DeepSeek V4 Flash | 42.5% | $0.0006 | 31.6s | 16%
37 | GPT-5.4 Mini (Reasoning) | 45.0% | $0.022 | 28.1s | 19%
38 | ByteDance Seed 1.6 Flash | 38.0% | $0.0013 | 27.3s | 18%
39 | Z.AI GLM 5.1 | 52.9% | $0.014 | 1.5m | 20%
40 | Grok 4.20 (Reasoning) | 45.7% | $0.018 | 1.5m | 25%
41 | Ministral 8B | 37.6% | $0.0004 | 10.4s | 15%
42 | GPT-4o, Aug. 6th (temp=1) | 44.8% | $0.018 | 24.4s | 17%
43 | MiniMax M2.7 | 46.3% | $0.0040 | 1.1m | 17%
44 | Stealth: Healer Alpha | 34.3% | $0.0000 | 23.7s | 18%
45 | Xiaomi MIMO v2.5 Pro | 41.1% | $0.0085 | 53.5s | 19%
46 | DeepSeek V3 (2025-03-24) | 42.0% | $0.0014 | 39.4s | 14%
47 | GPT-4.1 | 39.4% | $0.018 | 44.7s | 21%
48 | Ministral 3 8B | 38.5% | $0.0008 | 19.6s | 14%
49 | Gemma 3 12B | 36.6% | $0.0004 | 41.3s | 17%
50 | Mistral Large 2 | 40.1% | $0.013 | 29.4s | 16%
51 | Mistral Large 3 | 36.9% | $0.0033 | 30.3s | 16%
52 | Stealth: Hunter Alpha | 40.0% | $0.0000 | 55.0s | 17%
53 | Gemini 2.5 Flash Lite (Reasoning) | 33.3% | $0.0028 | 30.8s | 17%
54 | GPT-4o Mini (temp=1) | 38.9% | $0.0012 | 34.8s | 13%
55 | Xiaomi MIMO v2.5 | 36.7% | $0.0054 | 31.8s | 15%
56 | Gemma 3 27B | 39.8% | $0.0006 | 52.6s | 14%
57 | Claude Sonnet 4.6 (Reasoning) | 53.7% | $0.060 | 1.2m | 22%
58 | Claude Opus 4.6 | 56.4% | $0.078 | 1.2m | 25%
59 | Qwen 3.6 Flash | 38.1% | $0.010 | 41.4s | 16%
60 | GPT-4.1 Nano | 34.2% | $0.0007 | 13.3s | 12%
61 | Claude Opus 4.6 (Reasoning) | 58.5% | $0.088 | 1.4m | 26%
62 | Qwen 3.6 35B | 37.5% | $0.0083 | 1.0m | 17%
63 | Cohere Command R+ (Aug. 2024) | 45.9% | $0.020 | 52.5s | 13%
64 | Claude Opus 4.7 (Reasoning) | 52.7% | $0.076 | 32.0s | 20%
65 | Mistral Large | 37.3% | $0.014 | 30.9s | 13%
66 | GPT-5.4 Nano | 28.8% | $0.0057 | 26.3s | 16%
67 | Gemini 2.5 Flash | 28.1% | $0.0052 | 10.6s | 13%
68 | GPT-5.4 (Reasoning) | 65.0% | $0.089 | 2.6m | 29%
69 | LFM2 24B | 28.6% | $0.0002 | 28.4s | 13%
70 | GPT-5.4 Nano (Reasoning, Low) | 26.5% | $0.0055 | 20.6s | 15%
71 | GPT-5.1 | 51.4% | $0.054 | 1.8m | 22%
72 | Z.AI GLM 4.6 | 28.9% | $0.0065 | 51.5s | 17%
73 | Gemma 3 4B | 29.4% | $0.0002 | 20.0s | 11%
74 | GPT-5.4 Nano (Reasoning) | 26.8% | $0.0061 | 24.5s | 15%
75 | Mistral NeMO | 28.6% | $0.0005 | 10.1s | 10%
76 | Grok 4.20 (Beta, Reasoning) | 34.7% | $0.039 | 34.0s | 18%
77 | Qwen 3 32B | 32.1% | $0.0015 | 54.6s | 13%
78 | Gemini 2.5 Flash (Reasoning) | 32.2% | $0.011 | 21.5s | 11%
79 | WizardLM 2 8x22b | 38.5% | $0.0026 | 1.8m | 16%
80 | DeepSeek V4 Pro (Reasoning) | 46.7% | $0.015 | 3.1m | 23%
81 | GPT-4o, May 13th (temp=1) | 30.5% | $0.033 | 14.4s | 15%
82 | Gemini 3.1 Flash Lite (Reasoning) | 23.5% | $0.0030 | 11.9s | 11%
83 | DeepSeek V3.2 | 33.7% | $0.0014 | 1.9m | 17%
84 | Ministral 3B | 23.3% | $0.0001 | 8.1s | 10%
85 | o4 Mini | 25.8% | $0.015 | 25.7s | 14%
86 | Z.AI GLM 4.5 | 29.0% | $0.0051 | 42.1s | 12%
87 | GPT-4.1 Mini | 24.8% | $0.0027 | 19.0s | 10%
88 | Aion 2.0 | 28.8% | $0.0064 | 1.3m | 15%
89 | Z.AI GLM 4.5 Air | 28.2% | $0.0029 | 58.2s | 12%
90 | Claude 3.7 Sonnet | 32.1% | $0.042 | 46.7s | 17%
91 | Gemini 2.5 Pro | 29.5% | $0.036 | 36.2s | 16%
92 | Qwen 3.5 Plus (2026-02-15) | 22.7% | $0.0060 | 31.5s | 12%
93 | Gemini 3.1 Flash Lite (Preview) | 21.3% | $0.0030 | 8.4s | 8%
94 | MoonshotAI: Kimi K2.5 | 43.7% | $0.019 | 3.2m | 22%
95 | GPT-5.5 (Reasoning, Low) | 55.5% | $0.139 | 1.8m | 32%
96 | o4 Mini High | 28.1% | $0.025 | 47.2s | 14%
97 | Grok 4 | 43.0% | $0.048 | 1.7m | 16%
98 | GPT-5.5 | 56.1% | $0.139 | 1.7m | 29%
99 | Grok 4.3 | 21.7% | $0.0069 | 30.5s | 9%
100 | DeepSeek V3.1 | 26.6% | $0.0020 | 1.8m | 13%
101 | Ministral 3 3B | 18.7% | $0.0005 | 11.1s | 5%
102 | Claude 3.5 Sonnet | 32.1% | $0.048 | 35.5s | 11%
103 | DeepSeek V3 (2024-12-26) | 22.0% | $0.0021 | 54.6s | 7%
104 | Qwen 3.5 Plus (2026-04-20) | 31.8% | $0.017 | 1.8m | 12%
105 | Gemini 3 Flash (Preview) | 17.4% | $0.0078 | 19.6s | 7%
106 | DeepSeek-V2 Chat | 21.7% | $0.0021 | 53.3s | 6%
107 | Gemini 3.1 Flash Lite | 19.0% | $0.0030 | 12.1s | 3%
108 | GPT-4o, Aug. 6th (temp=0) | 17.5% | $0.023 | 22.7s | 8%
109 | GPT-4o, May 13th (temp=0) | 18.6% | $0.035 | 14.1s | 8%
110 | Qwen 3.5 397B A17B | 33.8% | $0.014 | 3.0m | 15%
111 | Z.AI GLM 4.7 Flash | 16.8% | $0.0017 | 1.2m | 8%
112 | Gemini 3 Flash (Preview, Reasoning) | 17.3% | $0.012 | 30.1s | 5%
113 | GPT-5.5 (Reasoning) | 51.9% | $0.142 | 1.8m | 25%
114 | GPT-5 Mini | 17.7% | $0.0100 | 57.4s | 7%
115 | Gemma 4 26B | 14.2% | $0.0009 | 55.1s | 6%
116 | Gemma 4 31B | 19.2% | $0.0010 | 1.6m | 8%
117 | Qwen3.6 Max Preview | 44.1% | $0.050 | 3.5m | 19%
118 | Arcee AI: Trinity Mini | 11.6% | $0.0003 | 9.2s | 0%
119 | Z.AI GLM 4.7 | 16.6% | $0.010 | 1.4m | 8%
120 | Qwen 3.5 Flash | 12.9% | $0.0025 | 47.5s | 2%
121 | Qwen 2.5 72B | 8.8% | $0.0010 | 36.7s | 2%
122 | Nemotron 3 Super | 13.3% | $0.0000 | 1.4m | 5%
123 | Claude Opus 4 | 57.5% | $0.209 | 1.4m | 28%
124 | GPT-4o Mini (temp=0) | 9.5% | $0.0012 | 34.8s | 0%
125 | Gemini 3 Pro (Preview) | 20.1% | $0.055 | 54.4s | 9%
126 | GPT-5.2 | 22.9% | $0.056 | 1.5m | 11%
127 | Stealth: Aurora Alpha | 0.7% | $0.0000 | 9.8s | 0%
128 | Inception Mercury 2 | 1.1% | $0.0032 | 7.0s | 0%
129 | ByteDance Seed 2.0 Lite | 22.0% | $0.012 | 2.2m | 6%
130 | GPT-5 Nano | 13.5% | $0.0042 | 1.4m | 2%
131 | Qwen 3.5 35B | 14.4% | $0.018 | 1.0m | 1%
132 | Gemma 4 31B (Reasoning) | 15.2% | $0.0014 | 2.2m | 5%
133 | Qwen 3.6 27B | 22.7% | $0.025 | 2.3m | 6%
134 | Inception Mercury | 1.2% | $0.011 | 17.6s | 0%
135 | Nemotron 3 Nano | 7.5% | $0.0010 | 1.1m | 0%
136 | Gemma 4 26B (Reasoning) | 13.4% | $0.0013 | 2.0m | 3%
137 | Qwen 3.5 9B | 7.6% | $0.0011 | 1.4m | 0%
138 | GPT-5 | 29.2% | $0.065 | 2.8m | 13%
139 | Grok 4.3 (Reasoning) | 16.9% | $0.021 | 2.3m | 5%
140 | Qwen 3.5 122B | 8.2% | $0.025 | 1.1m | 0%
141 | Qwen 3.5 27B | 6.7% | $0.020 | 1.6m | 0%
142 | GPT-OSS 120B | 0.9% | $0.0015 | 1.8m | 0%
143 | ByteDance Seed 1.6 | 11.8% | $0.013 | 2.5m | 0%
144 | ByteDance Seed 2.0 Mini | 22.7% | $0.0045 | 4.9m | 9%
145 | Gemini 3.1 Pro (Preview) | 23.9% | $0.107 | 1.8m | 6%
146 | Mistral Small 3.2 24B | 25.5% | $0.0069 | 5.7m | 9%
147 | MoonshotAI: Kimi K2.6 | 35.9% | $0.058 | 6.5m | 16%
Avg. Score: 35.51%

Individual Scenarios

Detailed Writing Rules

Model #1 #2 #3 #4 #5 Avg
Writer: Palmyra X510010096888092.7%
Rocinante 12B10010098837891.8%
GPT-5.4 (Reasoning)1009187866084.7%
GPT-5.4 (Reasoning, Low)1008980717182.5%
Qwen3 235B A22B Instruct 250710010098872782.5%
Claude Sonnet 4.6 (Reasoning)979281746281.2%
Claude Sonnet 4.6967978776378.4%
Claude Sonnet 4.51008771676477.9%
Claude Opus 4.7 (Reasoning)1007773603468.7%
Claude Opus 4.7937270594668.0%
Gemma 3 4B1007876423967.1%
Hermes 3 70B1009660562366.9%
GPT-5.4817567605166.8%
Claude Opus 4.5838172652965.9%
Claude Opus 4.6 (Reasoning)1007271393563.2%
Llama 3.1 Nemotron 70B857363413559.2%
MiniMax M2.7886051504258.2%
Cohere Command R+ (Aug. 2024)998240363458.2%
Claude Opus 4757471422457.3%
Claude Opus 4.6767050464457.3%
WizardLM 2 8x22b976953392757.0%
Gemma 3 12B877353413156.9%
GPT-5.4 Mini777055414156.6%
GPT-5.5 (Reasoning, Low)635958574456.2%
Z.AI GLM 51007867201355.5%
Llama 3.1 8B1001005618054.8%
GPT-5.4 Mini (Reasoning)756455463454.6%
Claude Haiku 4.51006143363354.4%
GPT-5.5685453524454.2%
Hermes 3 405B10010045141154.0%
MoonshotAI: Kimi K2.5746054482953.2%
GPT-5.4 Mini (Reasoning, Low)696356463053.1%
Aion 2.0645955433350.8%
Z.AI GLM 5 Turbo786747431650.2%
Xiaomi MIMO v2.5 Pro585549463849.2%
Gemini 2.5 Pro636161451148.1%
Gemini 2.5 Flash704842413046.3%
Gemma 3 27B595655431846.2%
GPT-4o Mini (temp=1)904542381546.0%
Arcee AI: Trinity Large (Preview)685047392245.0%
Gemini 2.5 Flash Lite524943423744.7%
Xiaomi MIMO v2.572534744744.7%
Z.AI GLM 5.169674741044.6%
Ministral 8B796046231544.5%
DeepSeek V4 Pro (Reasoning)734843391844.2%
Gemini 2.5 Flash (Reasoning)81773714943.7%
DeepSeek V4 Pro794537312443.1%
Ministral 3 14B755043281843.1%
Claude Sonnet 4685449291142.1%
Stealth: Hunter Alpha883433312041.4%
DeepSeek V4 Flash (Reasoning)885126221941.3%
GPT-5.5 (Reasoning)565437322440.6%
Qwen 3.6 35B734438242340.4%
DeepSeek V3 (2025-03-24)94562626040.3%
GPT-5.1544938362540.2%
DeepSeek V3.2545243391240.1%
Grok 4.3634542361239.7%
Gemini 2.5 Flash Lite (Reasoning)773832302039.5%
LFM2 24B86612221037.9%
Grok 4.20 (Reasoning)534643242337.7%
Mistral Small 4 (Reasoning)555544171537.3%
DeepSeek V4 Flash10042327036.2%
GPT-4o, Aug. 6th (temp=1)75582918036.2%
Grok 4.20 (Beta)644539211035.9%
Mistral Small Creative534643171334.6%
ByteDance Seed 1.6 Flash79342923834.6%
Mistral Small 3.2 24B81352928034.6%
GPT-4.162533027034.4%
Stealth: Healer Alpha80403814034.4%
Gemini 3 Flash (Preview, Reasoning)483935281633.2%
Ministral 3B6560370032.4%
Z.AI GLM 4.6494128232232.4%
Qwen 3.6 Flash77332822232.2%
Mistral Large 2393732291831.2%
Claude 3 Haiku732621211430.9%
Grok 4.20543631161630.6%
Gemma 4 31B (Reasoning)353131282730.4%
MoonshotAI: Kimi K2.6582828271030.0%
Llama 3.1 70B10031107029.6%
Z.AI GLM 4.5 Air494421191529.5%
Mistral Medium 3.1464231151329.2%
GPT-5.242413716528.2%
MiniMax M2.573262516028.0%
Grok 4 Fast433131211328.0%
Gemini 3 Pro (Preview)49362720727.6%
Claude 3.7 Sonnet50402820027.6%
Gemini 3 Flash (Preview)38373624027.0%
Mistral Large 3463120191726.4%
GPT-4o, May 13th (temp=1)602616141225.7%
Mistral Large48312720125.4%
Grok 4.1 Fast48272617624.8%
Claude 3.5 Sonnet7023148724.1%
ByteDance Seed 2.0 Lite51312316024.1%
DeepSeek V3.133322520924.0%
Gemma 4 31B4943250023.5%
DeepSeek-V2 Chat542218111022.9%
GPT-5.4 Nano432818141122.9%
Ministral 3 8B6039122022.6%
Ministral 3 3B45411310021.8%
Qwen 3.5 397B A17B4031288021.4%
Qwen 3.6 27B6324164021.2%
GPT-5.4 Nano (Reasoning, Low)27272420420.4%
GPT-5.4 Nano (Reasoning)37351411520.3%
Qwen3.6 Max Preview6021115019.5%
Z.AI GLM 4.728272314218.9%
ByteDance Seed 2.0 Mini552580017.4%
o4 Mini High462597017.4%
DeepSeek V3 (2024-12-26)3825213017.4%
Mistral Small 434171512917.3%
GPT-54018109716.9%
GPT-4.1 Mini4123180016.2%
Qwen 3.5 Plus (2026-04-20)3925105015.8%
Grok 43028106215.3%
Gemini 3.1 Flash Lite (Preview)3717135315.1%
GPT-4o, Aug. 6th (temp=0)2823174014.3%
Mistral NeMO322780013.5%
Gemma 4 26B401780013.0%
GPT-4.1 Nano2520162012.3%
Arcee AI: Trinity Mini431700012.0%
Qwen 3.5 9B391800011.4%
Z.AI GLM 4.7 Flash2515140010.9%
Qwen 2.5 72B241953210.7%
Gemini 3.1 Pro (Preview)2316113010.5%
Grok 4.20 (Beta, Reasoning)231310209.7%
Gemma 4 26B (Reasoning)201514009.6%
Qwen 3.5 Plus (2026-02-15)21147309.0%
Z.AI GLM 4.523135409.0%
Qwen 3.5 Flash4500009.0%
Grok 4.3 (Reasoning)26145008.9%
GPT-4o, May 13th (temp=0)24165008.8%
Gemini 3.1 Flash Lite23117208.7%
GPT-5 Mini2385007.2%
Qwen 3.5 35B3120006.5%
GPT-5 Nano1796006.3%
Qwen 3 32B1884106.3%
o4 Mini11106216.3%
GPT-4o Mini (temp=0)2731006.1%
Nemotron 3 Super17130006.0%
Nemotron 3 Nano1570004.4%
Qwen 3.5 27B1900003.8%
Gemini 3.1 Flash Lite (Reasoning)1260003.6%
ByteDance Seed 1.61400002.8%
Stealth: Aurora Alpha500001.0%
Inception Mercury 2400000.9%
Qwen 3.5 122B000000.0%
GPT-OSS 120B000000.0%
Inception Mercury000000.0%
Model #1 #2 #3 #4 #5 Avg
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Gemma 3 27B1001001001008597.0%
MiniMax M2.7100100100949297.0%
Claude Sonnet 4.5100100100997695.1%
Claude Opus 4.6 (Reasoning)1001001001006993.9%
Claude Haiku 4.510010099947793.9%
GPT-4o Mini (temp=1)10010095957693.0%
Claude Opus 4.7 (Reasoning)1009393888291.2%
Claude Opus 4100100100885789.1%
Rocinante 12B100100100856088.9%
Z.AI GLM 5.11009895787088.2%
DeepSeek V4 Pro100100100835788.0%
Claude Sonnet 4.6999089847687.5%
GPT-5.4 (Reasoning)979382807986.2%
Mistral Small 4 (Reasoning)1009889786285.3%
DeepSeek V4 Flash10010095834684.9%
Claude Opus 4.61001001001002184.3%
Z.AI GLM 5 Turbo979389766684.2%
GPT-4o, Aug. 6th (temp=1)10010095685784.1%
GPT-5.41008681777483.8%
GPT-5.4 (Reasoning, Low)948584777482.8%
Claude Opus 4.71008380786881.8%
Xiaomi MIMO v2.5 Pro1009081746481.8%
Claude 3 Haiku10010090852980.7%
Claude Sonnet 410010088595480.0%
Grok 41009690654679.5%
Claude Sonnet 4.6 (Reasoning)10010074595878.0%
Stealth: Hunter Alpha908483745677.5%
DeepSeek V3 (2025-03-24)10010090603777.4%
Llama 3.1 8B1001009977876.8%
GPT-5.5918379735876.6%
Z.AI GLM 51008574625675.4%
WizardLM 2 8x22b958969635974.7%
DeepSeek V4 Flash (Reasoning)10010074583873.9%
Claude Opus 4.510010082444273.7%
DeepSeek V4 Pro (Reasoning)10010083562873.4%
Mistral Large 21008973564873.1%
Gemma 3 4B998181515072.4%
GPT-5.5 (Reasoning)757473736071.0%
Z.AI GLM 4.5949481553171.0%
DeepSeek V3.2907266665870.6%
Mistral Small Creative998579523770.3%
MiniMax M2.5927368665170.1%
GPT-5.4 Mini (Reasoning, Low)857367645969.8%
Arcee AI: Trinity Large (Preview)1009177481666.4%
Ministral 3 14B966866504665.3%
GPT-5.1907171524165.2%
Z.AI GLM 4.5 Air997867482964.2%
Mistral Large 310010048452663.8%
Mistral Large937169532762.7%
Aion 2.0816762604362.7%
Ministral 8B827465603162.4%
GPT-5.5 (Reasoning, Low)806458514960.6%
Gemini 2.5 Flash Lite827460414159.6%
Cohere Command R+ (Aug. 2024)100827243059.5%
Mistral Medium 3.1876057484459.1%
Xiaomi MIMO v2.5786663612758.8%
Hermes 3 70B100926238158.5%
Gemma 3 12B716763593258.3%
GPT-5.4 Mini766051484856.8%
Grok 4.1 Fast914947474455.6%
GPT-4.1865951383754.1%
Gemini 2.5 Pro676060493454.0%
Gemini 3 Pro (Preview)875751482453.5%
MoonshotAI: Kimi K2.5776159402853.0%
Llama 3.1 70B776560491353.0%
GPT-5.4 Mini (Reasoning)805350483352.7%
Claude 3.7 Sonnet676454473252.6%
DeepSeek V3.1928039312152.6%
Gemini 2.5 Flash (Reasoning)81805843052.4%
Gemini 2.5 Flash1004843363351.8%
DeepSeek V3 (2024-12-26)1007247191851.2%
ByteDance Seed 1.6 Flash635552513551.1%
LFM2 24B79715343650.7%
GPT-4.1 Nano725650393550.4%
GPT-4o, May 13th (temp=1)685953363249.6%
Grok 4.20 (Beta)554848464548.4%
Grok 4 Fast71696529648.2%
Ministral 3 8B754848373448.2%
GPT-4.1 Mini665958322648.1%
Claude 3.5 Sonnet100484741147.4%
Mistral NeMO100593935347.2%
GPT-5.4 Nano (Reasoning, Low)695746322946.5%
Mistral Small 468645640045.6%
GPT-5.4 Nano664947441945.2%
Gemini 2.5 Flash Lite (Reasoning)85803524044.9%
Hermes 3 405B93843311044.2%
Stealth: Healer Alpha615347411743.7%
Grok 4.20 (Reasoning)545046392442.6%
Llama 3.1 Nemotron 70B100532917641.1%
Grok 4.20464642383140.6%
DeepSeek-V2 Chat100641913439.8%
Z.AI GLM 4.6514836322939.2%
ByteDance Seed 2.0 Mini554340292738.9%
Nemotron 3 Super73504616938.8%
Z.AI GLM 4.7743232282838.7%
Qwen 3.6 27B64604721038.2%
MoonshotAI: Kimi K2.6555443201537.6%
Qwen 3 32B52494539137.2%
GPT-5554636331537.0%
GPT-5.2574336321736.9%
Arcee AI: Trinity Mini10041236334.5%
Qwen 3.5 Plus (2026-04-20)55483322833.2%
Ministral 3 3B694025161432.6%
Ministral 3B54383624932.0%
GPT-5.4 Nano (Reasoning)51333232931.4%
Qwen 3.5 Plus (2026-02-15)582725251329.7%
Grok 4.20 (Beta, Reasoning)553818151227.4%
Gemini 3.1 Pro (Preview)423025201927.4%
Gemini 3 Flash (Preview)39322823926.3%
o4 Mini403423161625.8%
GPT-4o Mini (temp=0)373028181525.4%
Z.AI GLM 4.7 Flash333022201724.4%
Gemma 4 31B353425121223.8%
o4 Mini High392925151123.7%
Qwen3.6 Max Preview822933023.6%
Qwen 3.6 35B46312217023.2%
Grok 4.3 (Reasoning)452522131123.1%
Grok 4.35538137022.7%
Gemma 4 31B (Reasoning)4330258722.6%
Qwen 3.6 Flash5034234022.2%
ByteDance Seed 1.6252323151219.3%
ByteDance Seed 2.0 Lite64121010019.3%
Gemma 4 26B23191916015.4%
Gemini 3 Flash (Preview, Reasoning)353370015.1%
Gemma 4 26B (Reasoning)462350014.8%
GPT-5 Mini30121010012.6%
GPT-4o, May 13th (temp=0)54520012.4%
GPT-4o, Aug. 6th (temp=0)351642011.4%
Qwen 2.5 72B36966011.2%
GPT-5 Nano2317122111.1%
Qwen 3.5 397B A17B161586610.1%
Gemini 3.1 Flash Lite (Reasoning)2640006.0%
Gemini 3.1 Flash Lite (Preview)1662004.9%
Mistral Small 3.2 24B1311003.1%
Qwen 3.5 35B1500003.0%
Qwen 3.5 122B1310002.9%
Qwen 3.5 Flash900001.8%
Nemotron 3 Nano900001.8%
Gemini 3.1 Flash Lite000000.1%
Qwen 3.5 27B000000.0%
GPT-OSS 120B000000.0%
Qwen 3.5 9B000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Inception Mercury000000.0%
Model #1 #2 #3 #4 #5 Avg
Writer: Palmyra X51001001001007895.7%
Qwen3 235B A22B Instruct 250710010099826789.7%
Llama 3.1 8B100100100834084.5%
Claude Opus 4100100100625082.5%
Rocinante 12B10010010052270.8%
Mistral Small 4959368601366.1%
Mistral Small 4 (Reasoning)79776857958.1%
Claude Opus 4.7907070291655.0%
GPT-5.4686762443454.8%
Claude Opus 4.5836643433153.5%
Claude Haiku 4.51005346353453.4%
Claude Sonnet 41005647372553.0%
Mistral Large 2856747342852.3%
Ministral 3 8B1009129231952.3%
Claude Sonnet 4.5757261291750.9%
Claude Sonnet 4.6 (Reasoning)655646454150.7%
DeepSeek V4 Pro907631252549.6%
Hermes 3 405B88664539047.5%
Hermes 3 70B74694742046.4%
Mistral Medium 3.168615545045.8%
MiniMax M2.571565045645.8%
Mistral Large 31006224191744.6%
Ministral 3 14B85584331544.2%
Llama 3.1 70B824841241742.4%
Claude Sonnet 4.674585512040.0%
GPT-5.4 (Reasoning, Low)595044281739.8%
Claude Opus 4.6574848242339.8%
Mistral Large75454227438.6%
Z.AI GLM 5544844271537.6%
DeepSeek V4 Pro (Reasoning)504038342337.1%
Llama 3.1 Nemotron 70B69434131036.9%
Grok 4.1 Fast71484811035.8%
MoonshotAI: Kimi K2.5544128272635.0%
DeepSeek V4 Flash8161271033.9%
Qwen3.6 Max Preview645520171233.7%
GPT-5975483032.4%
ByteDance Seed 1.6 Flash553130232332.3%
Qwen 3.5 397B A17B50453429031.6%
GPT-5.5 (Reasoning, Low)563127261431.0%
Ministral 8B46453320930.8%
Arcee AI: Trinity Large (Preview)886050030.6%
Z.AI GLM 5 Turbo413733221930.3%
Gemma 3 12B61352920029.2%
GPT-5.5363434251328.6%
Claude Opus 4.7 (Reasoning)55363412027.4%
DeepSeek V3 (2025-03-24)5549310027.1%
Gemma 3 27B493725121127.0%
MiniMax M2.740342824726.9%
GPT-4.16041270025.5%
GPT-4o, Aug. 6th (temp=1)42392717025.0%
Qwen 3 32B4536356024.5%
Z.AI GLM 5.15935250023.7%
WizardLM 2 8x22b4339313023.3%
Grok 4 Fast5331186522.8%
Grok 4.20 (Beta)47351610021.5%
GPT-5.4 Nano (Reasoning, Low)39241815720.7%
GPT-4o Mini (temp=1)4323237620.6%
GPT-5.4 (Reasoning)3933310020.6%
GPT-5.5 (Reasoning)41281614020.1%
Gemini 2.5 Flash (Reasoning)100000020.0%
Xiaomi MIMO v2.5 Pro50181715019.9%
Grok 4.204230251019.7%
Qwen 3.5 35B4440140019.6%
Claude 3 Haiku7114130019.6%
Z.AI GLM 4.5 Air524600019.6%
ByteDance Seed 2.0 Lite3930263019.5%
Claude Opus 4.6 (Reasoning)474170019.0%
Mistral Small Creative4437112019.0%
DeepSeek V3 (2024-12-26)6020100018.1%
GPT-5.14525143017.2%
Claude 3.5 Sonnet4816119016.8%
LFM2 24B4916135016.8%
Gemma 3 4B25231414516.2%
Mistral NeMO3918160014.7%
GPT-4.1 Nano3022192014.5%
GPT-4o, May 13th (temp=1)531900014.3%
DeepSeek-V2 Chat3814117013.8%
GPT-5.4 Nano2318129513.3%
GPT-5.4 Mini2620127013.1%
DeepSeek V4 Flash (Reasoning)59500012.9%
GPT-5.4 Mini (Reasoning)431800012.2%
Claude 3.7 Sonnet342600012.0%
Grok 4.20 (Reasoning)341661011.5%
Ministral 3B2518130011.1%
Qwen 3.6 35B51400011.0%
MoonshotAI: Kimi K2.6322100010.6%
Gemini 2.5 Flash Lite311154010.3%
GPT-4o, May 13th (temp=0)311630010.2%
GPT-5 Nano51000010.2%
Cohere Command R+ (Aug. 2024)411000010.2%
Qwen 3.6 Flash381200010.1%
GPT-5.4 Nano (Reasoning)2713100010.0%
GPT-4.1 Mini28145009.6%
Mistral Small 3.2 24B4150009.2%
GPT-5.4 Mini (Reasoning, Low)23166009.1%
DeepSeek V3.223106007.9%
Z.AI GLM 4.62396007.7%
Xiaomi MIMO v2.52791007.3%
Z.AI GLM 4.53600007.3%
Ministral 3 3B3410007.1%
ByteDance Seed 1.63130007.0%
Aion 2.019131006.6%
Qwen 3.5 27B19140006.4%
Qwen 3.5 Plus (2026-02-15)17150006.4%
Gemini 2.5 Flash3100006.3%
Nemotron 3 Super3100006.1%
o4 Mini15115006.1%
Stealth: Healer Alpha1594105.8%
Grok 42600005.1%
o4 Mini High1294005.0%
Qwen 3.5 Flash2230004.9%
Gemini 3 Pro (Preview)13110004.7%
GPT-5.22300004.5%
Qwen 3.5 Plus (2026-04-20)1530003.6%
Qwen 2.5 72B1700003.4%
Arcee AI: Trinity Mini1160003.4%
GPT-4o, Aug. 6th (temp=0)1600003.1%
Gemini 2.5 Flash Lite (Reasoning)753002.9%
Gemini 3.1 Pro (Preview)1300002.6%
GPT-4o Mini (temp=0)920002.2%
Gemini 2.5 Pro1000002.1%
Z.AI GLM 4.7 Flash1000002.1%
Grok 4.20 (Beta, Reasoning)710001.7%
GPT-5 Mini711001.6%
Gemini 3.1 Flash Lite800001.5%
Qwen 3.6 27B700001.5%
Gemini 3 Flash (Preview, Reasoning)700001.4%
Gemma 4 26B (Reasoning)310000.7%
Grok 4.3300000.6%
Gemini 3.1 Flash Lite (Preview)200000.5%
ByteDance Seed 2.0 Mini110000.3%
Gemma 4 31B (Reasoning)100000.2%
Qwen 3.5 9B100000.2%
Stealth: Hunter Alpha100000.2%
Nemotron 3 Nano000000.1%
Grok 4.3 (Reasoning)000000.0%
Qwen 3.5 122B000000.0%
Z.AI GLM 4.7000000.0%
Gemma 4 31B000000.0%
GPT-OSS 120B000000.0%
Gemini 3.1 Flash Lite (Reasoning)000000.0%
Gemma 4 26B000000.0%
Gemini 3 Flash (Preview)000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
DeepSeek V3.1000000.0%
Inception Mercury000000.0%
Model #1 #2 #3 #4 #5 Avg
Writer: Palmyra X5100100100100100100.0%
Qwen3 235B A22B Instruct 25071001001001008396.7%
GPT-5.4 (Reasoning, Low)100100100927393.0%
Claude Opus 4.6 (Reasoning)10010090864684.6%
Claude Sonnet 4.510010094804583.7%
Xiaomi MIMO v2.5 Pro10010094604179.0%
Stealth: Hunter Alpha1009379664576.8%
Claude Sonnet 4.61009877604676.2%
GPT-5.11009668675076.1%
Hermes 3 405B1009672585476.0%
Claude Sonnet 41007772625773.7%
Mistral Small 4 (Reasoning)1009565575273.7%
GPT-5.41009465634573.4%
Rocinante 12B10010088463173.0%
Grok 4.1 Fast958474535171.4%
Claude Sonnet 4.6 (Reasoning)959074484871.1%
Claude Opus 4.610010059514571.0%
Claude Opus 4.7 (Reasoning)968987443670.5%
DeepSeek V4 Flash1007372654270.5%
Z.AI GLM 51007265645270.5%
GPT-5.4 (Reasoning)977271634469.4%
Claude Opus 4938273613568.8%
DeepSeek V4 Flash (Reasoning)1008758534568.6%
Claude Opus 4.71008776631367.9%
MoonshotAI: Kimi K2.51007065633967.4%
Mistral Small Creative1007167573466.0%
GPT-5.4 Mini827362625166.0%
Llama 3.1 8B10010060501665.0%
Mistral Small 41006657574164.2%
MiniMax M2.7817979453363.4%
GPT-5.5917355554062.8%
Mistral Large92756663760.7%
Z.AI GLM 5 Turbo796663603059.8%
Ministral 3 14B95776354959.6%
Hermes 3 70B1001006530059.0%
GPT-5.5 (Reasoning, Low)676664544458.9%
Claude Opus 4.5909045392958.5%
Gemma 3 27B1007157442058.4%
MiniMax M2.5818167312457.1%
Xiaomi MIMO v2.51009238302456.6%
Qwen3.6 Max Preview776055453754.8%
GPT-5.4 Mini (Reasoning)595959514554.6%
Arcee AI: Trinity Large (Preview)876256422454.2%
DeepSeek V4 Pro806548433053.5%
Z.AI GLM 5.1666351503352.6%
GPT-5.4 Mini (Reasoning, Low)746252412951.5%
Stealth: Healer Alpha676665362351.4%
DeepSeek V4 Pro (Reasoning)815748462050.4%
Qwen 3 32B94604939048.3%
Qwen 3.6 Flash83785921048.1%
DeepSeek V3.1854739363247.9%
Mistral Medium 3.1726959251347.6%
GPT-5846935331246.4%
GPT-5.5 (Reasoning)655251501246.3%
Claude 3.7 Sonnet814140383146.2%
WizardLM 2 8x22b86784025046.0%
Claude 3.5 Sonnet100574519745.7%
Gemma 3 4B605344432545.0%
Mistral Large 279625330044.8%
Claude Haiku 4.564595148144.5%
Qwen 3.6 35B635251371744.2%
Grok 4665936312843.9%
GPT-4.1 Nano67574943043.1%
DeepSeek V3 (2025-03-24)1004336211342.5%
Gemma 3 12B88504828042.5%
ByteDance Seed 1.6 Flash1004823212042.3%
Grok 4.20 (Beta)585234333342.0%
Gemini 2.5 Flash Lite785143191741.6%
Qwen 3.5 397B A17B73675711041.6%
Gemini 2.5 Flash (Reasoning)68664724341.5%
Grok 4.208771318740.8%
Aion 2.0665043251540.0%
GPT-5.4 Nano594138322338.7%
Gemini 2.5 Flash Lite (Reasoning)674137301738.4%
GPT-4o, Aug. 6th (temp=1)544238381437.1%
Mistral Small 3.2 24B8455440036.6%
Mistral Large 3843428201636.5%
Qwen 3.5 Plus (2026-04-20)653828231834.2%
Gemma 4 31B724723161133.6%
Gemini 2.5 Flash45453734032.1%
GPT-5.4 Nano (Reasoning, Low)585318161331.8%
Ministral 3 8B593228251131.1%
GPT-4o Mini (temp=1)57503712031.1%
MoonshotAI: Kimi K2.6553130231530.9%
Ministral 8B8742145029.8%
Grok 4.20 (Reasoning)60442311428.6%
LFM2 24B623024141228.3%
GPT-4o, May 13th (temp=1)69342215027.9%
Gemini 3 Pro (Preview)59402512127.3%
o4 Mini High7922188025.6%
DeepSeek V3.240383415125.6%
Cohere Command R+ (Aug. 2024)635344024.9%
Mistral NeMO423022191124.9%
GPT-4.14139380023.6%
Grok 4 Fast48332311123.2%
Claude 3 Haiku7711109322.0%
Gemini 3.1 Flash Lite (Preview)37262216921.9%
Grok 4.3 (Reasoning)50202019021.9%
Arcee AI: Trinity Mini752390021.2%
Grok 4.342321614120.9%
Z.AI GLM 4.76128122020.7%
Z.AI GLM 4.65232115220.5%
Ministral 3B5226169020.4%
Gemini 3 Flash (Preview)37311813019.9%
Ministral 3 3B38241612819.5%
Gemini 2.5 Pro51231111219.4%
GPT-4o, May 13th (temp=0)4429194019.1%
o4 Mini3530217018.6%
GPT-5.4 Nano (Reasoning)443841017.5%
Gemini 3 Flash (Preview, Reasoning)3823188017.4%
Llama 3.1 70B3129250017.1%
Z.AI GLM 4.7 Flash3932103017.0%
Z.AI GLM 4.5472592016.5%
GPT-5.22826199016.3%
DeepSeek V3 (2024-12-26)4025100015.0%
Qwen 3.5 Plus (2026-02-15)2726201014.8%
Grok 4.20 (Beta, Reasoning)3421108014.7%
Qwen 3.6 27B65500013.9%
Z.AI GLM 4.5 Air2726140013.4%
GPT-4o, Aug. 6th (temp=0)3521100013.3%
Gemma 4 26B25141211012.2%
Qwen 2.5 72B47840011.9%
Gemini 3.1 Pro (Preview)2317153111.7%
Gemini 3.1 Flash Lite (Reasoning)272054011.3%
Llama 3.1 Nemotron 70B46500010.3%
GPT-5 Mini22127008.1%
ByteDance Seed 1.620180007.5%
Qwen 3.5 35B2800005.5%
Gemini 3.1 Flash Lite1852005.0%
Gemma 4 31B (Reasoning)13100004.6%
Nemotron 3 Super2200004.4%
DeepSeek-V2 Chat984004.1%
Qwen 3.5 122B1600003.3%
Gemma 4 26B (Reasoning)1150003.2%
ByteDance Seed 2.0 Mini1200002.4%
Qwen 3.5 27B1000002.0%
GPT-4.1 Mini900001.7%
GPT-4o Mini (temp=0)300000.6%
GPT-OSS 120B300000.6%
ByteDance Seed 2.0 Lite300000.6%
Qwen 3.5 Flash000000.0%
Qwen 3.5 9B000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
GPT-5 Nano000000.0%
Inception Mercury000000.0%
Nemotron 3 Nano000000.0%
Model #1 #2 #3 #4 #5 Avg
Rocinante 12B100100100786187.8%
Writer: Palmyra X51009682764379.4%
Llama 3.1 8B10010066605776.7%
Hermes 3 70B1009393772076.6%
Cohere Command R+ (Aug. 2024)1001008074070.8%
Claude Opus 4.71008381761170.1%
Claude 3 Haiku838171673868.1%
Mistral Medium 3.1868474513766.4%
Claude Opus 4987854474364.0%
GPT-5.4 (Reasoning, Low)746861585362.7%
Claude Opus 4.5868660433562.1%
Claude Opus 4.6 (Reasoning)756761574661.2%
Claude Sonnet 41007261422059.0%
Qwen3 235B A22B Instruct 2507806161602357.1%
DeepSeek V4 Flash (Reasoning)1005347472153.7%
DeepSeek V4 Flash96665749053.7%
GPT-5.4806138373449.9%
Claude Opus 4.6826248351949.4%
Grok 4.1 Fast886932292749.0%
Mistral NeMO75665145648.7%
MiniMax M2.5605149373446.1%
Z.AI GLM 5.1100623521945.5%
GPT-5.5674836323042.6%
Mistral Small Creative100584210142.3%
Z.AI GLM 5814332322142.0%
Ministral 3 8B81694510041.0%
Grok 4 Fast615339331940.9%
Mistral Small 4 (Reasoning)585837292340.8%
Mistral Small 3.2 24B100353430140.0%
Qwen3.6 Max Preview82493730039.5%
Llama 3.1 70B100432817137.7%
Grok 4.20514036332937.7%
DeepSeek V4 Pro504340272637.4%
Xiaomi MIMO v2.5 Pro614131302337.3%
Claude Sonnet 4.6 (Reasoning)58564329037.2%
Claude Sonnet 4.566483633036.6%
GPT-4o, Aug. 6th (temp=1)60514131036.6%
Mistral Large8176164035.5%
Grok 4.20 (Beta)57433838035.3%
Qwen 3.5 397B A17B544336231333.8%
Claude Haiku 4.5574137191333.5%
Mistral Large 360523815033.0%
Ministral 3 14B71453118032.9%
Qwen 3.5 Plus (2026-02-15)8948260032.5%
Qwen 3.6 Flash554623231632.5%
MiniMax M2.774463211032.4%
GPT-4o, May 13th (temp=0)474629271432.4%
Ministral 8B63473119032.0%
Mistral Small 47155290031.1%
Claude Sonnet 4.610035119031.0%
Gemma 3 27B55432320930.0%
GPT-5.4 Mini413634261129.7%
Gemini 2.5 Flash Lite49402722929.5%
Hermes 3 405B54502813029.0%
GPT-5.5 (Reasoning, Low)373431261629.0%
Stealth: Hunter Alpha5045389128.6%
Z.AI GLM 5 Turbo6357105528.1%
Z.AI GLM 4.5 Air76361512027.9%
Gemma 3 12B363431211627.7%
GPT-4.1 Nano48332925027.0%
o4 Mini712119121126.7%
Stealth: Healer Alpha8527174026.6%
MoonshotAI: Kimi K2.54643329026.1%
Arcee AI: Trinity Large (Preview)46413112025.9%
GPT-4o, May 13th (temp=1)423123171325.2%
Grok 4.20 (Reasoning)5845139025.0%
Gemma 3 4B61282311024.5%
ByteDance Seed 1.6 Flash48292520024.4%
Llama 3.1 Nemotron 70B56251816624.1%
Ministral 3 3B932620024.1%
DeepSeek V3.24541330023.9%
LFM2 24B54302214023.9%
GPT-5.4 (Reasoning)502916141023.7%
ByteDance Seed 2.0 Mini6523179022.6%
GPT-5.4 Nano (Reasoning, Low)4826199922.3%
Gemini 2.5 Flash6030160021.2%
Gemini 2.5 Flash Lite (Reasoning)4637176021.1%
Qwen 3 32B90640020.1%
WizardLM 2 8x22b40231918019.8%
Claude Opus 4.7 (Reasoning)622791019.7%
GPT-5 Nano741184019.3%
Grok 434321212519.1%
MoonshotAI: Kimi K2.637211914318.9%
Qwen 3.5 Plus (2026-04-20)552288018.5%
Qwen 3.6 35B4830150018.5%
DeepSeek V3 (2025-03-24)38271513018.3%
GPT-5.14233115018.2%
Mistral Large 235241614017.9%
Arcee AI: Trinity Mini4518176017.1%
Claude 3.7 Sonnet562061016.7%
Claude 3.5 Sonnet3731150016.6%
GPT-4o, Aug. 6th (temp=0)37171414016.3%
GPT-5.5 (Reasoning)27241614016.2%
Xiaomi MIMO v2.53223149115.7%
GPT-5.4 Nano24201713415.5%
Qwen 3.5 Flash471594015.0%
GPT-5.4 Mini (Reasoning, Low)2823116013.6%
Z.AI GLM 4.6402030012.6%
Ministral 3B51730012.2%
ByteDance Seed 1.6303000012.0%
GPT-4.1281765011.2%
Grok 4.20 (Beta, Reasoning)391430011.1%
GPT-5.4 Mini (Reasoning)2120130010.9%
Qwen 3.6 27B292030010.3%
DeepSeek V4 Pro (Reasoning)301660010.3%
Qwen 3.5 27B252500010.0%
Gemini 2.5 Flash (Reasoning)4440009.6%
Aion 2.031140009.0%
GPT-5.4 Nano (Reasoning)20126428.8%
o4 Mini High4210008.6%
Gemini 3 Flash (Preview, Reasoning)3254008.3%
GPT-5 Mini27140008.1%
Gemini 3 Pro (Preview)15128107.2%
GPT-4o Mini (temp=1)2560006.2%
Z.AI GLM 4.519120006.1%
DeepSeek V3 (2024-12-26)16101005.4%
Gemini 3.1 Pro (Preview)2430005.3%
Qwen 3.5 35B2700005.3%
Grok 4.31474004.9%
Gemini 2.5 Pro2020004.5%
GPT-51074004.4%
GPT-4.1 Mini955003.7%
Z.AI GLM 4.7 Flash1331003.5%
Nemotron 3 Super1700003.4%
Gemini 3 Flash (Preview)1400002.7%
Gemini 3.1 Flash Lite1120002.6%
Z.AI GLM 4.7750002.6%
Qwen 2.5 72B533002.5%
DeepSeek-V2 Chat1100002.1%
DeepSeek V3.1630001.8%
Gemini 3.1 Flash Lite (Preview)430001.6%
ByteDance Seed 2.0 Lite500001.1%
Qwen 3.5 122B500001.1%
GPT-4o Mini (temp=0)500000.9%
Gemma 4 26B (Reasoning)400000.8%
GPT-5.2300000.6%
Gemma 4 31B200000.5%
Grok 4.3 (Reasoning)200000.5%
Gemma 4 31B (Reasoning)000000.0%
GPT-OSS 120B000000.0%
Gemini 3.1 Flash Lite (Reasoning)000000.0%
Qwen 3.5 9B000000.0%
Gemma 4 26B000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Inception Mercury000000.0%
Nemotron 3 Nano000000.0%
Model #1 #2 #3 #4 #5 Avg
Qwen3 235B A22B Instruct 25071001001001009498.7%
Writer: Palmyra X51001001001008196.2%
Rocinante 12B100100100938295.0%
Z.AI GLM 5100100100844986.6%
Llama 3.1 8B100100100784885.2%
Hermes 3 405B10010093913684.1%
DeepSeek V4 Flash (Reasoning)999796834283.5%
Claude Sonnet 4.6 (Reasoning)10010083765582.7%
Claude Opus 4.7 (Reasoning)10010094853282.3%
Claude Opus 4.7100100100624080.5%
DeepSeek V4 Flash1008774716278.7%
Claude Haiku 4.51009084694377.2%
Claude Opus 4.5928475646475.9%
Hermes 3 70B10010085741875.3%
Claude Sonnet 4.61009078564674.2%
Claude Sonnet 4.51009465574071.4%
GPT-5.4 (Reasoning, Low)897770635069.6%
Claude Sonnet 4978264544769.0%
Xiaomi MIMO v2.5 Pro907268645269.0%
Mistral Small 410010078432368.7%
Z.AI GLM 5.11008863523768.2%
MiniMax M2.71009349494867.7%
GPT-5.4847660595767.2%
Mistral Medium 3.1918867483666.2%
Claude Opus 4.6 (Reasoning)838357534964.9%
DeepSeek V4 Pro1009897161364.8%
Cohere Command R+ (Aug. 2024)1007573512364.6%
GPT-4o, Aug. 6th (temp=1)1008355424264.5%
Z.AI GLM 5 Turbo797069594464.3%
Claude Opus 4.6827672533764.0%
Arcee AI: Trinity Large (Preview)858570572063.5%
WizardLM 2 8x22b866960504962.8%
DeepSeek V4 Pro (Reasoning)776865535162.5%
Qwen 3.5 397B A17B1006755454261.9%
Grok 4.1 Fast747061594561.6%
GPT-4o Mini (temp=1)967053483860.9%
Mistral Large 3726765642859.2%
Gemma 3 27B886052444056.8%
Z.AI GLM 4.5 Air1007054471156.2%
GPT-4.1876561412756.2%
Mistral Small 4 (Reasoning)98755748256.1%
Claude 3 Haiku1001004529054.8%
MiniMax M2.51007143302754.1%
GPT-5.1785951483454.1%
Gemini 2.5 Flash Lite706853502753.9%
GPT-4.1 Nano767368272653.7%
Mistral Large99866120053.4%
Qwen 3.5 Plus (2026-04-20)757563282553.3%
Gemini 2.5 Flash Lite (Reasoning)635353534453.2%
Claude Opus 4796850382652.3%
GPT-4o, May 13th (temp=1)755857462351.7%
DeepSeek V3.2695752413450.5%
Mistral Large 2905331312946.8%
MoonshotAI: Kimi K2.5625440383846.2%
Grok 4 Fast605651312645.0%
GPT-5.5 (Reasoning, Low)694540393044.8%
Mistral Small Creative816536281344.6%
DeepSeek V3 (2025-03-24)696743271644.4%
Qwen 3 32B75524742043.3%
GPT-5.4 (Reasoning)544940373442.8%
Z.AI GLM 4.6645235342642.3%
Gemini 2.5 Flash675343311541.9%
Grok 4675532312441.7%
Z.AI GLM 4.5604747282641.6%
Stealth: Healer Alpha69654528041.4%
MoonshotAI: Kimi K2.675503835841.2%
Grok 4.20625243331741.2%
Llama 3.1 70B604340382441.1%
GPT-5.5 (Reasoning)484242383440.6%
Ministral 3B754831271940.2%
o4 Mini High605242281539.5%
Qwen3.6 Max Preview735239201139.1%
GPT-5.552525036438.9%
Aion 2.0663635312638.9%
GPT-5575044261638.6%
Ministral 3 14B912927252038.4%
GPT-4.1 Mini605232301738.1%
Ministral 3 3B9749319337.9%
Claude 3.5 Sonnet6865560037.7%
GPT-5.4 Mini (Reasoning, Low)704825232337.7%
Stealth: Hunter Alpha584533262336.9%
Qwen 3.6 Flash72452927836.1%
Gemini 2.5 Pro56554911234.5%
Xiaomi MIMO v2.5653429271634.3%
GPT-5.4 Nano564335211634.2%
Gemma 3 4B753431161534.1%
Llama 3.1 Nemotron 70B574837141333.8%
Z.AI GLM 4.7 Flash575423191333.2%
Grok 4.20 (Beta)523634281533.1%
Ministral 3 8B8166130032.2%
o4 Mini63413117731.7%
Claude 3.7 Sonnet68373214831.6%
LFM2 24B403834272031.6%
DeepSeek V3.1474126231630.7%
Qwen 3.6 35B6150318030.0%
Mistral Small 3.2 24B60452222029.7%
Grok 4.20 (Beta, Reasoning)483523212029.2%
ByteDance Seed 1.6 Flash45413513427.5%
Gemma 3 12B44343423227.4%
Qwen 3.5 Plus (2026-02-15)43363213125.0%
Grok 4.3863530024.9%
Ministral 8B5337311024.6%
GPT-5.4 Nano (Reasoning, Low)31303021223.1%
GPT-5.4 Mini37