Sentence opener variety

Test: Bad Writing Habits

Avg. Score: 54.7%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Grok 4.1 Fast | 87.6% | $0.0018 | 37.8s | 72%
2 | GPT-4o Mini (temp=1) | 82.2% | $0.0012 | 34.8s | 66%
3 | GPT-4o, Aug. 6th (temp=1) | 84.8% | $0.018 | 24.4s | 66%
4 | Llama 3.1 Nemotron 70B | 82.4% | $0.0038 | 31.7s | 63%
5 | Rocinante 12B | 82.5% | $0.0014 | 38.4s | 57%
6 | Llama 3.1 8B | 83.5% | $0.0003 | 1.3m | 59%
7 | DeepSeek V3 (2025-03-24) | 77.8% | $0.0014 | 39.4s | 56%
8 | Hermes 3 405B | 78.7% | $0.0032 | 53.2s | 57%
9 | Grok 4 Fast | 72.0% | $0.0017 | 24.1s | 57%
10 | Claude 3.5 Sonnet | 79.7% | $0.048 | 35.5s | 62%
11 | GPT-4.1 Mini | 69.6% | $0.0027 | 19.0s | 55%
12 | Claude Sonnet 4 | 75.8% | $0.032 | 43.7s | 58%
13 | Claude 3 Haiku | 68.6% | $0.0025 | 14.9s | 52%
14 | GPT-4o, May 13th (temp=1) | 72.7% | $0.033 | 14.4s | 55%
15 | Claude Sonnet 4.5 | 72.8% | $0.035 | 38.1s | 58%
16 | Cohere Command R+ (Aug. 2024) | 73.8% | $0.020 | 52.5s | 52%
17 | Hermes 3 70B | 75.6% | $0.0010 | 1.2m | 49%
18 | GPT-4.1 Nano | 64.7% | $0.0007 | 13.3s | 51%
19 | Z.AI GLM 4.5 | 68.1% | $0.0051 | 42.1s | 50%
20 | GPT-4.1 | 65.8% | $0.018 | 44.7s | 54%
21 | Gemma 3 27B | 65.9% | $0.0006 | 52.6s | 48%
22 | Gemma 3 12B | 63.1% | $0.0004 | 41.3s | 49%
23 | Grok 4 | 73.8% | $0.048 | 1.7m | 58%
24 | Claude Haiku 4.5 | 63.3% | $0.011 | 21.6s | 49%
25 | Claude 3.7 Sonnet | 67.8% | $0.042 | 46.7s | 54%
26 | DeepSeek V3 (2024-12-26) | 63.9% | $0.0021 | 54.6s | 47%
27 | Z.AI GLM 5 Turbo | 62.5% | $0.0081 | 33.2s | 47%
28 | GPT-4o Mini (temp=0) | 62.1% | $0.0012 | 34.8s | 45%
29 | DeepSeek-V2 Chat | 63.5% | $0.0021 | 53.3s | 46%
30 | Claude Opus 4.7 | 69.1% | $0.069 | 30.4s | 53%
31 | Arcee AI: Trinity Large (Preview) | 61.4% | $0.0000 | 43.6s | 45%
32 | Writer: Palmyra X5 | 58.2% | $0.011 | 22.0s | 47%
33 | Z.AI GLM 5 | 63.0% | $0.0084 | 1.2m | 48%
34 | Gemma 3 4B | 57.9% | $0.0002 | 20.0s | 44%
35 | Llama 3.1 70B | 60.5% | $0.0015 | 29.4s | 43%
36 | LFM2 24B | 58.8% | $0.0002 | 28.4s | 43%
37 | Qwen3 235B A22B Instruct 2507 | 58.7% | $0.0011 | 59.2s | 46%
38 | Gemini 2.5 Flash Lite | 56.3% | $0.0009 | 9.5s | 42%
39 | Z.AI GLM 4.5 Air | 61.8% | $0.0029 | 58.2s | 43%
40 | Grok 4.3 | 56.6% | $0.0069 | 30.5s | 44%
41 | Grok 4.20 | 56.1% | $0.0093 | 45.7s | 46%
42 | Gemini 2.5 Flash | 58.2% | $0.0052 | 10.6s | 39%
43 | Claude Opus 4.7 (Reasoning) | 65.9% | $0.076 | 32.0s | 50%
44 | Gemini 3.5 Flash (Reasoning, Minimal) | 57.5% | $0.018 | 12.0s | 42%
45 | DeepSeek V4 Flash | 59.4% | $0.0006 | 31.6s | 38%
46 | Claude Sonnet 4.6 | 62.6% | $0.031 | 39.3s | 43%
47 | Qwen 3.5 Plus (2026-02-15) | 54.6% | $0.0060 | 31.5s | 43%
48 | Mistral Small 4 | 54.9% | $0.0014 | 18.2s | 40%
49 | Qwen 3 32B | 58.5% | $0.0015 | 54.6s | 41%
50 | Grok 4.20 (Beta) | 55.2% | $0.018 | 15.8s | 43%
51 | Gemini 2.5 Flash (Reasoning) | 55.3% | $0.011 | 21.5s | 42%
52 | DeepSeek V4 Pro | 61.4% | $0.0048 | 1.3m | 42%
53 | o4 Mini | 53.0% | $0.015 | 25.7s | 45%
54 | Z.AI GLM 5.1 | 62.2% | $0.014 | 1.5m | 44%
55 | Grok 4.20 (Reasoning) | 59.0% | $0.018 | 1.5m | 48%
56 | GPT-4o, Aug. 6th (temp=0) | 57.0% | $0.023 | 22.7s | 42%
57 | DeepSeek V4 Flash (Reasoning) | 57.1% | $0.0007 | 31.1s | 38%
58 | Arcee AI: Trinity Mini | 55.9% | $0.0003 | 9.2s | 36%
59 | Qwen 2.5 72B | 52.8% | $0.0010 | 36.7s | 42%
60 | Gemini 2.5 Flash Lite (Reasoning) | 53.0% | $0.0028 | 30.8s | 40%
61 | MiniMax M2.7 | 57.3% | $0.0040 | 1.1m | 41%
62 | Mistral Medium 3.1 | 52.0% | $0.0048 | 36.5s | 42%
63 | Claude Opus 4.5 | 62.7% | $0.070 | 53.4s | 49%
64 | Grok 4.20 (Beta, Reasoning) | 55.8% | $0.039 | 34.0s | 46%
65 | MiniMax M2.5 | 56.9% | $0.0034 | 1.3m | 41%
66 | Mistral Large 2 | 51.7% | $0.013 | 29.4s | 42%
67 | Mistral Small Creative | 47.2% | $0.0007 | 9.1s | 39%
68 | Mistral Large 3 | 49.7% | $0.0033 | 30.3s | 40%
69 | Mistral Small 4 (Reasoning) | 50.1% | $0.0022 | 30.2s | 39%
70 | ByteDance Seed 1.6 Flash | 50.8% | $0.0013 | 27.3s | 37%
71 | Mistral Large | 50.6% | $0.014 | 30.9s | 41%
72 | Ministral 3 14B | 48.2% | $0.0007 | 11.7s | 37%
73 | GPT-5.4 Mini (Reasoning, Low) | 46.4% | $0.015 | 16.8s | 41%
74 | GPT-5.4 Mini | 46.6% | $0.015 | 16.8s | 41%
75 | GPT-4o, May 13th (temp=0) | 53.3% | $0.035 | 14.1s | 40%
76 | WizardLM 2 8x22b | 57.4% | $0.0026 | 1.8m | 40%
77 | Ministral 3B | 46.8% | $0.0001 | 8.1s | 36%
78 | Ministral 3 8B | 45.3% | $0.0008 | 19.6s | 38%
79 | o4 Mini High | 53.1% | $0.025 | 47.2s | 40%
80 | Xiaomi MIMO v2.5 Pro | 50.4% | $0.0085 | 53.5s | 39%
81 | Claude Sonnet 4.6 (Reasoning) | 61.1% | $0.060 | 1.2m | 43%
82 | Ministral 8B | 44.3% | $0.0004 | 10.4s | 37%
83 | Xiaomi MIMO v2.5 | 47.2% | $0.0054 | 31.8s | 38%
84 | Mistral NeMO | 46.4% | $0.0005 | 10.1s | 34%
85 | GPT-5.4 Mini (Reasoning) | 47.7% | $0.022 | 28.1s | 40%
86 | Nemotron 3 Super | 48.5% | $0.0000 | 1.4m | 41%
87 | GPT-5.4 | 55.7% | $0.049 | 1.4m | 46%
88 | Ministral 3 3B | 44.7% | $0.0005 | 11.1s | 35%
89 | Stealth: Hunter Alpha | 47.7% | $0.0000 | 55.0s | 37%
90 | Stealth: Healer Alpha | 45.6% | $0.0000 | 23.7s | 35%
91 | GPT-5.4 Nano (Reasoning) | 41.2% | $0.0061 | 24.5s | 39%
92 | Aion 2.0 | 49.4% | $0.0064 | 1.3m | 39%
93 | Inception Mercury 2 | 41.1% | $0.0032 | 7.0s | 36%
94 | GPT-5.4 Nano (Reasoning, Low) | 40.7% | $0.0055 | 20.6s | 39%
95 | Gemini 2.5 Pro | 50.7% | $0.036 | 36.2s | 39%
96 | GPT-5.4 Nano | 40.5% | $0.0057 | 26.3s | 39%
97 | Z.AI GLM 4.6 | 47.4% | $0.0065 | 51.5s | 36%
98 | Gemini 3.1 Flash Lite | 43.6% | $0.0030 | 12.1s | 33%
99 | Stealth: Aurora Alpha | 40.2% | $0.0000 | 9.8s | 34%
100 | Gemini 3.1 Flash Lite (Preview) | 43.2% | $0.0030 | 8.4s | 32%
101 | Gemini 3 Flash (Preview) | 43.7% | $0.0078 | 19.6s | 34%
102 | Gemini 3.1 Flash Lite (Reasoning) | 43.4% | $0.0030 | 11.9s | 32%
103 | GPT-5.4 (Reasoning, Low) | 53.8% | $0.055 | 1.4m | 44%
104 | DeepSeek V3.2 | 47.8% | $0.0014 | 1.9m | 40%
105 | GPT-5.1 | 54.4% | $0.054 | 1.8m | 44%
106 | Gemini 3 Flash (Preview, Reasoning) | 43.5% | $0.012 | 30.1s | 33%
107 | MoonshotAI: Kimi K2.5 | 57.7% | $0.019 | 3.2m | 42%
108 | DeepSeek V3.1 | 51.6% | $0.0020 | 1.8m | 33%
109 | DeepSeek V4 Pro (Reasoning) | 57.6% | $0.015 | 3.1m | 40%
110 | Qwen 3.6 Flash | 46.5% | $0.010 | 41.4s | 31%
111 | Z.AI GLM 4.7 | 45.1% | $0.010 | 1.4m | 37%
112 | Claude Opus 4.6 | 55.1% | $0.078 | 1.2m | 42%
113 | GPT-5 Mini | 41.9% | $0.0100 | 57.4s | 36%
114 | Grok 4.3 (Reasoning) | 53.5% | $0.021 | 2.3m | 38%
115 | Z.AI GLM 4.7 Flash | 43.1% | $0.0017 | 1.2m | 34%
116 | Claude Opus 4.6 (Reasoning) | 57.2% | $0.088 | 1.4m | 43%
117 | Claude Opus 4 | 72.7% | $0.209 | 1.4m | 57%
118 | GPT-OSS 120B | 43.5% | $0.0015 | 1.8m | 37%
119 | ByteDance Seed 2.0 Lite | 55.7% | $0.012 | 2.2m | 32%
120 | Gemma 4 26B | 39.7% | $0.0009 | 55.1s | 33%
121 | Gemini 3.5 Flash (Reasoning) | 50.3% | $0.071 | 37.6s | 38%
122 | Gemini 3 Pro (Preview) | 48.5% | $0.055 | 54.4s | 38%
123 | Qwen 3.5 Flash | 38.0% | $0.0025 | 47.5s | 34%
124 | Nemotron 3 Nano | 40.9% | $0.0010 | 1.1m | 33%
125 | Gemma 4 31B | 43.1% | $0.0010 | 1.6m | 34%
126 | Qwen 3.6 35B | 43.0% | $0.0083 | 1.0m | 30%
127 | ByteDance Seed 1.6 | 51.7% | $0.013 | 2.5m | 34%
128 | Qwen 3.5 Plus (2026-04-20) | 47.5% | $0.017 | 1.8m | 32%
129 | GPT-5.2 | 42.7% | $0.056 | 1.5m | 41%
130 | Gemma 4 31B (Reasoning) | 41.0% | $0.0014 | 2.2m | 34%
131 | Qwen 3.5 122B | 37.9% | $0.025 | 1.1m | 33%
132 | Gemma 4 26B (Reasoning) | 39.8% | $0.0013 | 2.0m | 32%
133 | Qwen 3.5 9B | 37.1% | $0.0011 | 1.4m | 30%
134 | Qwen 3.6 27B | 46.7% | $0.025 | 2.3m | 33%
135 | Qwen 3.5 35B | 35.7% | $0.018 | 1.0m | 31%
136 | GPT-5.4 (Reasoning) | 53.8% | $0.089 | 2.6m | 42%
137 | Qwen 3.5 27B | 37.5% | $0.020 | 1.6m | 33%
138 | Inception Mercury | 32.4% | $0.011 | 17.6s | 25%
139 | GPT-5 Nano | 33.5% | $0.0042 | 1.4m | 31%
140 | Qwen 3.5 397B A17B | 42.7% | $0.014 | 3.0m | 34%
141 | Gemini 3.1 Pro (Preview) | 49.1% | $0.107 | 1.8m | 37%
142 | ByteDance Seed 2.0 Mini | 50.8% | $0.0045 | 4.9m | 36%
143 | Qwen3.7 Max | 45.2% | $0.068 | 2.3m | 35%
144 | Qwen3.6 Max Preview | 48.2% | $0.050 | 3.5m | 35%
145 | GPT-5 | 44.5% | $0.065 | 2.8m | 36%
146 | GPT-5.5 (Reasoning) | 48.3% | $0.142 | 1.8m | 41%
147 | GPT-5.5 (Reasoning, Low) | 48.4% | $0.139 | 1.8m | 40%
148 | GPT-5.5 | 48.2% | $0.139 | 1.7m | 39%
149 | MoonshotAI: Kimi K2.6 | 52.9% | $0.058 | 6.5m | 39%
150 | Mistral Small 3.2 24B | 37.6% | $0.0068 | 5.6m | 30%
Avg. Score: 54.66%

Individual Scenarios

Detailed Writing Rules

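Model | #1 | #2 | #3 | #4 | #5 | Avg
Rocinante 12B | 100 | 100 | 95 | 93 | 89 | 95.4%
DeepSeek V3 (2025-03-24) | 100 | 98 | 94 | 92 | 83 | 93.4%
GPT-4o Mini (temp=1) | 100 | 98 | 89 | 87 | 82 | 91.2%
Hermes 3 405B | 99 | 96 | 87 | 84 | 63 | 85.8%
Claude 3.5 Sonnet | 90 | 86 | 85 | 84 | 82 | 85.3%
Claude Opus 4.7 (Reasoning) | 93 | 93 | 81 | 80 | 79 | 85.2%
GPT-4o, Aug. 6th (temp=1) | 98 | 97 | 79 | 77 | 71 | 84.4%
Hermes 3 70B | 99 | 97 | 82 | 72 | 70 | 84.0%
Claude Sonnet 4 | 100 | 90 | 84 | 75 | 70 | 83.9%
Grok 4.1 Fast | 91 | 87 | 86 | 83 | 70 | 83.4%
Llama 3.1 Nemotron 70B | 98 | 97 | 79 | 73 | 69 | 83.1%
DeepSeek V3 (2024-12-26) | 91 | 84 | 82 | 81 | 65 | 80.8%
Claude Opus 4.7 | 91 | 88 | 80 | 73 | 72 | 80.8%
DeepSeek-V2 Chat | 98 | 95 | 81 | 65 | 61 | 79.9%
Z.AI GLM 5 | 84 | 83 | 81 | 79 | 71 | 79.6%
Claude Sonnet 4.5 | 94 | 81 | 80 | 80 | 62 | 79.4%
Llama 3.1 8B | 100 | 99 | 97 | 51 | 48 | 79.0%
DeepSeek V4 Flash | 84 | 84 | 82 | 79 | 64 | 78.7%
Gemma 3 12B | 99 | 87 | 74 | 72 | 62 | 78.6%
Grok 4 | 86 | 81 | 80 | 73 | 72 | 78.4%
DeepSeek V4 Pro | 87 | 86 | 80 | 75 | 63 | 78.3%
Claude Sonnet 4.6 | 80 | 80 | 79 | 75 | 74 | 77.7%
DeepSeek V4 Pro (Reasoning) | 100 | 91 | 71 | 65 | 60 | 77.6%
DeepSeek V4 Flash (Reasoning) | 83 | 80 | 78 | 76 | 70 | 77.6%
Z.AI GLM 5.1 | 88 | 84 | 77 | 70 | 66 | 77.0%
Z.AI GLM 5 Turbo | 84 | 81 | 77 | 75 | 67 | 77.0%
Claude Haiku 4.5 | 86 | 76 | 73 | 72 | 70 | 75.3%
Gemma 3 4B | 94 | 80 | 73 | 65 | 62 | 74.8%
ByteDance Seed 2.0 Lite | 100 | 90 | 88 | 55 | 41 | 74.6%
GPT-4o, May 13th (temp=1) | 94 | 76 | 73 | 65 | 63 | 74.2%
Grok 4 Fast | 86 | 77 | 76 | 66 | 64 | 73.7%
Gemma 3 27B | 83 | 82 | 78 | 66 | 58 | 73.3%
Gemini 2.5 Flash | 92 | 84 | 77 | 58 | 54 | 73.1%
Claude Sonnet 4.6 (Reasoning) | 82 | 80 | 76 | 67 | 56 | 72.4%
ByteDance Seed 1.6 | 85 | 72 | 72 | 71 | 60 | 72.3%
Z.AI GLM 4.5 | 83 | 76 | 75 | 70 | 56 | 72.1%
MiniMax M2.7 | 81 | 78 | 69 | 66 | 65 | 71.9%
Qwen3 235B A22B Instruct 2507 | 88 | 78 | 70 | 63 | 58 | 71.5%
Claude Opus 4 | 82 | 80 | 72 | 71 | 53 | 71.4%
Claude 3 Haiku | 79 | 73 | 72 | 69 | 63 | 71.3%
GPT-4o Mini (temp=0) | 79 | 76 | 75 | 65 | 61 | 70.9%
GPT-4.1 | 84 | 72 | 70 | 64 | 64 | 70.8%
Claude Opus 4.6 (Reasoning) | 79 | 72 | 72 | 67 | 63 | 70.6%
Gemini 2.5 Flash (Reasoning) | 78 | 75 | 69 | 64 | 63 | 70.0%
Z.AI GLM 4.5 Air | 88 | 79 | 64 | 64 | 55 | 69.8%
MoonshotAI: Kimi K2.5 | 75 | 73 | 72 | 65 | 64 | 69.8%
MoonshotAI: Kimi K2.6 | 97 | 73 | 60 | 60 | 59 | 69.8%
Claude Opus 4.5 | 75 | 72 | 68 | 67 | 66 | 69.6%
Grok 4.3 | 93 | 80 | 62 | 60 | 50 | 68.8%
Cohere Command R+ (Aug. 2024) | 90 | 73 | 67 | 58 | 55 | 68.7%
Gemini 3.5 Flash (Reasoning, Minimal) | 77 | 74 | 71 | 65 | 53 | 68.1%
Claude 3.7 Sonnet | 85 | 84 | 60 | 56 | 55 | 67.9%
Grok 4.20 | 75 | 74 | 64 | 64 | 58 | 67.0%
LFM2 24B | 86 | 64 | 64 | 61 | 59 | 66.8%
Arcee AI: Trinity Large (Preview) | 77 | 77 | 69 | 58 | 53 | 66.7%
Claude Opus 4.6 | 73 | 71 | 63 | 62 | 61 | 65.9%
GPT-5.4 (Reasoning) | 76 | 70 | 67 | 60 | 56 | 65.9%
GPT-4.1 Mini | 75 | 68 | 65 | 64 | 57 | 65.8%
Grok 4.3 (Reasoning) | 79 | 69 | 66 | 62 | 53 | 65.8%
Grok 4.20 (Beta) | 70 | 67 | 65 | 65 | 58 | 65.2%
GPT-5.4 (Reasoning, Low) | 77 | 68 | 65 | 58 | 56 | 64.8%
Mistral Large 2 | 76 | 75 | 70 | 57 | 46 | 64.7%
Gemini 2.5 Flash Lite | 77 | 71 | 64 | 63 | 47 | 64.6%
Grok 4.20 (Reasoning) | 72 | 70 | 67 | 64 | 50 | 64.5%
Writer: Palmyra X5 | 72 | 66 | 61 | 59 | 55 | 62.5%
GPT-5.4 | 72 | 63 | 63 | 60 | 55 | 62.4%
GPT-4.1 Nano | 69 | 68 | 65 | 58 | 50 | 62.2%
GPT-5.4 Mini (Reasoning) | 70 | 64 | 61 | 57 | 54 | 61.2%
Grok 4.20 (Beta, Reasoning) | 66 | 65 | 60 | 59 | 55 | 61.1%
Xiaomi MIMO v2.5 Pro | 69 | 65 | 63 | 57 | 50 | 60.9%
Gemini 3.1 Flash Lite (Preview) | 70 | 69 | 57 | 57 | 51 | 60.8%
Gemini 2.5 Pro | 65 | 63 | 60 | 56 | 56 | 59.9%
ByteDance Seed 2.0 Mini | 82 | 70 | 57 | 49 | 39 | 59.4%
WizardLM 2 8x22b | 73 | 59 | 59 | 53 | 52 | 59.2%
Gemini 3 Pro (Preview) | 71 | 67 | 59 | 50 | 48 | 58.9%
Gemini 3.1 Pro (Preview) | 76 | 62 | 53 | 52 | 51 | 58.7%
GPT-5.5 | 66 | 61 | 59 | 57 | 50 | 58.6%
Mistral Small 4 (Reasoning) | 73 | 72 | 52 | 52 | 43 | 58.5%
MiniMax M2.5 | 80 | 69 | 55 | 48 | 40 | 58.4%
ByteDance Seed 1.6 Flash | 81 | 63 | 58 | 47 | 43 | 58.4%
GPT-5.5 (Reasoning, Low) | 60 | 59 | 57 | 57 | 55 | 57.8%
Mistral Small 4 | 72 | 59 | 57 | 54 | 44 | 57.3%
Gemini 2.5 Flash Lite (Reasoning) | 62 | 59 | 58 | 56 | 51 | 57.2%
Qwen 3.6 Flash | 64 | 63 | 63 | 62 | 33 | 57.2%
Aion 2.0 | 64 | 61 | 60 | 51 | 47 | 56.8%
Nemotron 3 Super | 62 | 58 | 55 | 55 | 53 | 56.8%
Qwen3.6 Max Preview | 71 | 59 | 56 | 55 | 42 | 56.7%
Qwen 3.5 Plus (2026-02-15) | 67 | 65 | 55 | 49 | 47 | 56.5%
o4 Mini | 64 | 62 | 55 | 54 | 47 | 56.4%
GPT-5.5 (Reasoning) | 62 | 59 | 57 | 56 | 47 | 56.2%
Ministral 3 3B | 98 | 53 | 50 | 46 | 33 | 56.1%
Gemini 3.1 Flash Lite (Reasoning) | 68 | 67 | 53 | 51 | 38 | 55.4%
Ministral 3 8B | 73 | 55 | 54 | 48 | 46 | 55.2%
Llama 3.1 70B | 68 | 62 | 51 | 48 | 46 | 55.0%
Z.AI GLM 4.7 | 63 | 57 | 55 | 54 | 45 | 54.8%
Qwen 3.6 35B | 64 | 63 | 57 | 49 | 40 | 54.5%
Arcee AI: Trinity Mini | 72 | 61 | 49 | 46 | 44 | 54.4%
DeepSeek V3.2 | 61 | 55 | 54 | 54 | 47 | 54.3%
DeepSeek V3.1 | 67 | 61 | 52 | 46 | 45 | 54.3%
GPT-5.1 | 60 | 58 | 56 | 56 | 41 | 54.3%
Mistral Large | 73 | 64 | 52 | 42 | 40 | 54.1%
Qwen 3.5 9B | 80 | 67 | 49 | 41 | 32 | 53.9%
Stealth: Hunter Alpha | 59 | 57 | 56 | 51 | 42 | 53.1%
o4 Mini High | 70 | 54 | 48 | 47 | 46 | 53.0%
Ministral 3 14B | 69 | 53 | 51 | 51 | 41 | 52.8%
Gemma 4 31B (Reasoning) | 60 | 56 | 53 | 48 | 47 | 52.6%
Gemma 4 31B | 66 | 60 | 51 | 44 | 42 | 52.5%
Qwen 3.6 27B | 64 | 59 | 53 | 46 | 39 | 52.5%
Mistral Medium 3.1 | 56 | 56 | 55 | 50 | 44 | 52.4%
Gemini 3.5 Flash (Reasoning) | 59 | 57 | 52 | 49 | 45 | 52.2%
Nemotron 3 Nano | 64 | 62 | 60 | 44 | 32 | 52.2%
Qwen 3 32B | 61 | 60 | 48 | 46 | 45 | 52.1%
Stealth: Healer Alpha | 63 | 56 | 53 | 43 | 41 | 51.1%
GPT-4o, May 13th (temp=0) | 64 | 58 | 50 | 46 | 37 | 51.1%
GPT-5.4 Mini | 63 | 53 | 50 | 46 | 42 | 50.9%
Gemini 3.1 Flash Lite | 59 | 56 | 53 | 51 | 33 | 50.4%
GPT-5 | 60 | 55 | 50 | 48 | 37 | 50.0%
Mistral Large 3 | 58 | 51 | 49 | 47 | 44 | 50.0%
Xiaomi MIMO v2.5 | 64 | 51 | 50 | 40 | 39 | 48.9%
GPT-OSS 120B | 62 | 51 | 48 | 42 | 41 | 48.7%
Gemini 3 Flash (Preview, Reasoning) | 57 | 51 | 47 | 46 | 43 | 48.7%
GPT-4o, Aug. 6th (temp=0) | 50 | 50 | 49 | 49 | 46 | 48.7%
GPT-5.4 Mini (Reasoning, Low) | 54 | 52 | 48 | 47 | 43 | 48.7%
Z.AI GLM 4.7 Flash | 63 | 50 | 46 | 45 | 39 | 48.6%
Z.AI GLM 4.6 | 56 | 55 | 50 | 45 | 36 | 48.4%
Qwen 3.5 397B A17B | 59 | 52 | 45 | 44 | 42 | 48.4%
Qwen 2.5 72B | 61 | 52 | 46 | 43 | 40 | 48.3%
GPT-5 Mini | 56 | 53 | 47 | 41 | 38 | 47.1%
Qwen3.7 Max | 64 | 46 | 44 | 41 | 40 | 47.0%
Ministral 8B | 49 | 48 | 47 | 44 | 41 | 45.8%
GPT-5.2 | 49 | 45 | 45 | 44 | 44 | 45.5%
Ministral 3B | 58 | 46 | 46 | 41 | 34 | 45.1%
Gemma 4 26B (Reasoning) | 62 | 50 | 43 | 41 | 28 | 45.0%
Gemini 3 Flash (Preview) | 57 | 50 | 42 | 38 | 38 | 44.8%
GPT-5.4 Nano (Reasoning, Low) | 48 | 47 | 45 | 43 | 39 | 44.6%
Mistral NeMO | 56 | 54 | 43 | 35 | 34 | 44.4%
Qwen 3.5 Plus (2026-04-20) | 57 | 50 | 41 | 37 | 36 | 44.2%
Mistral Small Creative | 47 | 46 | 44 | 43 | 40 | 44.0%
Gemma 4 26B | 51 | 49 | 42 | 41 | 37 | 43.9%
GPT-5.4 Nano | 46 | 44 | 44 | 41 | 38 | 42.7%
GPT-5.4 Nano (Reasoning) | 44 | 43 | 42 | 40 | 39 | 41.4%
Qwen 3.5 122B | 42 | 42 | 40 | 40 | 40 | 40.8%
Qwen 3.5 27B | 42 | 41 | 41 | 39 | 39 | 40.4%
Inception Mercury 2 | 45 | 41 | 40 | 37 | 37 | 40.0%
Stealth: Aurora Alpha | 47 | 40 | 38 | 34 | 33 | 38.6%
Qwen 3.5 Flash | 43 | 40 | 40 | 36 | 32 | 38.4%
Mistral Small 3.2 24B | 45 | 41 | 40 | 38 | 25 | 37.7%
Qwen 3.5 35B | 44 | 41 | 37 | 33 | 29 | 36.8%
Inception Mercury | 50 | 40 | 30 | 29 | 25 | 34.8%
GPT-5 Nano | 35 | 34 | 32 | 28 | 25 | 30.8%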
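Model | #1 | #2 | #3 | #4 | #5 | Avg
DeepSeek V3 (2025-03-24) | 100 | 94 | 85 | 79 | 77 | 86.9%
GPT-4o Mini (temp=1) | 86 | 86 | 85 | 85 | 73 | 82.9%
Grok 4.1 Fast | 96 | 94 | 91 | 68 | 65 | 82.5%
Hermes 3 70B | 100 | 98 | 84 | 75 | 35 | 78.5%
GPT-4o, Aug. 6th (temp=1) | 99 | 83 | 77 | 72 | 53 | 76.7%
Cohere Command R+ (Aug. 2024) | 95 | 78 | 75 | 73 | 58 | 75.8%
Rocinante 12B | 100 | 100 | 92 | 61 | 25 | 75.7%
Hermes 3 405B | 100 | 100 | 64 | 57 | 56 | 75.5%
Claude Sonnet 4 | 83 | 74 | 73 | 70 | 69 | 74.0%
Llama 3.1 Nemotron 70B | 99 | 89 | 67 | 57 | 55 | 73.4%
Claude Sonnet 4.5 | 79 | 76 | 75 | 73 | 58 | 72.2%
DeepSeek V3 (2024-12-26) | 78 | 76 | 76 | 64 | 44 | 67.4%
Claude Opus 4 | 87 | 82 | 63 | 58 | 44 | 67.0%
Gemma 3 27B | 78 | 67 | 66 | 64 | 58 | 66.8%
Llama 3.1 8B | 99 | 63 | 62 | 59 | 50 | 66.5%
GPT-4.1 | 71 | 70 | 68 | 63 | 61 | 66.4%
Claude 3.5 Sonnet | 74 | 70 | 63 | 63 | 62 | 66.3%
Mistral Small 4 | 100 | 73 | 56 | 53 | 48 | 66.0%
Claude Opus 4.5 | 74 | 66 | 66 | 61 | 60 | 65.3%
Grok 4 | 80 | 71 | 59 | 57 | 56 | 64.6%
Claude 3 Haiku | 68 | 67 | 65 | 60 | 60 | 64.0%
Claude Opus 4.7 | 74 | 68 | 60 | 58 | 57 | 63.5%
Claude 3.7 Sonnet | 70 | 66 | 64 | 62 | 53 | 63.2%
DeepSeek-V2 Chat | 79 | 79 | 70 | 44 | 44 | 63.0%
Claude Opus 4.7 (Reasoning) | 66 | 65 | 61 | 60 | 57 | 61.9%
Z.AI GLM 5.1 | 72 | 67 | 58 | 56 | 55 | 61.5%
DeepSeek V4 Pro | 75 | 60 | 60 | 58 | 54 | 61.5%
Gemini 3.5 Flash (Reasoning, Minimal) | 92 | 58 | 57 | 49 | 46 | 60.6%
Z.AI GLM 5 Turbo | 68 | 64 | 59 | 59 | 52 | 60.5%
DeepSeek V4 Flash | 74 | 66 | 57 | 51 | 50 | 59.7%
GPT-4.1 Mini | 66 | 64 | 63 | 53 | 51 | 59.5%
Z.AI GLM 4.5 | 73 | 63 | 58 | 57 | 46 | 59.2%
DeepSeek V4 Pro (Reasoning) | 70 | 69 | 64 | 47 | 44 | 58.9%
GPT-4o, May 13th (temp=1) | 69 | 63 | 62 | 56 | 45 | 58.9%
Z.AI GLM 5 | 72 | 56 | 56 | 54 | 53 | 58.1%
Claude Opus 4.6 (Reasoning) | 65 | 65 | 60 | 58 | 42 | 58.0%
Qwen 3.6 27B | 78 | 67 | 52 | 49 | 44 | 57.9%
MiniMax M2.7 | 68 | 62 | 57 | 57 | 45 | 57.9%
Grok 4 Fast | 66 | 64 | 62 | 52 | 45 | 57.7%
Arcee AI: Trinity Large (Preview) | 69 | 62 | 58 | 52 | 48 | 57.6%
Qwen 3.5 Plus (2026-04-20) | 87 | 56 | 56 | 46 | 38 | 56.6%
Writer: Palmyra X5 | 72 | 60 | 52 | 52 | 45 | 56.4%
Claude Sonnet 4.6 | 60 | 59 | 56 | 53 | 46 | 54.9%
Claude Opus 4.6 | 62 | 61 | 56 | 54 | 41 | 54.9%
GPT-5.4 | 61 | 55 | 54 | 53 | 50 | 54.8%
MiniMax M2.5 | 56 | 55 | 55 | 54 | 53 | 54.6%
Qwen3 235B A22B Instruct 2507 | 64 | 58 | 56 | 48 | 47 | 54.5%
DeepSeek V4 Flash (Reasoning) | 74 | 61 | 54 | 47 | 35 | 54.3%
Claude Sonnet 4.6 (Reasoning) | 64 | 54 | 53 | 51 | 49 | 54.2%
MoonshotAI: Kimi K2.5 | 65 | 56 | 52 | 49 | 48 | 53.8%
Claude Haiku 4.5 | 60 | 56 | 53 | 51 | 43 | 52.7%
Gemini 2.5 Flash Lite (Reasoning) | 65 | 58 | 49 | 46 | 42 | 52.1%
Gemini 3.5 Flash (Reasoning) | 60 | 58 | 58 | 44 | 39 | 51.8%
Gemini 3.1 Pro (Preview) | 64 | 56 | 51 | 47 | 41 | 51.8%
GPT-4.1 Nano | 60 | 54 | 51 | 47 | 45 | 51.7%
DeepSeek V3.1 | 81 | 50 | 46 | 41 | 40 | 51.6%
GPT-4o Mini (temp=0) | 59 | 53 | 50 | 47 | 47 | 51.2%
Arcee AI: Trinity Mini | 76 | 50 | 43 | 43 | 40 | 50.5%
WizardLM 2 8x22b | 68 | 59 | 44 | 41 | 38 | 50.1%
GPT-5.4 (Reasoning) | 60 | 52 | 50 | 44 | 44 | 50.1%
GPT-5.4 (Reasoning, Low) | 57 | 56 | 48 | 47 | 43 | 49.9%
LFM2 24B | 66 | 49 | 48 | 45 | 40 | 49.7%
Gemini 2.5 Flash Lite | 56 | 53 | 52 | 48 | 38 | 49.6%
ByteDance Seed 2.0 Lite | 65 | 50 | 47 | 47 | 36 | 49.0%
Z.AI GLM 4.5 Air | 68 | 53 | 48 | 41 | 35 | 49.0%
Mistral Large 3 | 56 | 53 | 52 | 45 | 37 | 48.6%
Gemini 3 Pro (Preview) | 53 | 51 | 50 | 48 | 40 | 48.5%
Gemma 3 12B | 60 | 50 | 46 | 46 | 39 | 48.3%
Grok 4.20 (Beta, Reasoning) | 64 | 46 | 45 | 45 | 40 | 48.1%
GPT-5.1 | 54 | 51 | 50 | 44 | 40 | 47.8%
Qwen 3 32B | 56 | 51 | 47 | 42 | 41 | 47.7%
Ministral 3 14B | 71 | 52 | 42 | 37 | 36 | 47.6%
GPT-4o, Aug. 6th (temp=0) | 56 | 47 | 46 | 45 | 43 | 47.5%
Gemini 2.5 Flash (Reasoning) | 66 | 46 | 43 | 41 | 40 | 47.3%
Xiaomi MIMO v2.5 Pro | 60 | 51 | 45 | 41 | 38 | 47.1%
Grok 4.20 (Reasoning) | 52 | 47 | 45 | 45 | 43 | 46.5%
Nemotron 3 Super | 60 | 46 | 43 | 43 | 40 | 46.3%
Gemma 3 4B | 53 | 53 | 49 | 37 | 37 | 45.9%
Grok 4.20 (Beta) | 51 | 50 | 49 | 40 | 39 | 45.7%
Gemini 2.5 Pro | 54 | 50 | 45 | 40 | 38 | 45.5%
Grok 4.20 | 49 | 46 | 45 | 44 | 42 | 45.4%
Mistral Small 4 (Reasoning) | 53 | 47 | 47 | 42 | 37 | 45.3%
Gemini 2.5 Flash | 56 | 45 | 43 | 42 | 40 | 45.3%
Stealth: Aurora Alpha | 61 | 45 | 44 | 39 | 38 | 45.3%
Mistral Medium 3.1 | 55 | 49 | 47 | 42 | 32 | 45.2%
Mistral NeMO | 55 | 53 | 44 | 38 | 35 | 45.2%
GPT-5.5 (Reasoning, Low) | 49 | 48 | 45 | 43 | 41 | 45.1%
Llama 3.1 70B | 70 | 44 | 43 | 41 | 25 | 44.9%
o4 Mini | 47 | 47 | 46 | 44 | 41 | 44.8%
Z.AI GLM 4.7 Flash | 54 | 45 | 45 | 42 | 37 | 44.5%
Grok 4.3 (Reasoning) | 48 | 44 | 44 | 43 | 43 | 44.4%
o4 Mini High | 50 | 48 | 45 | 41 | 39 | 44.4%
Inception Mercury 2 | 55 | 45 | 42 | 40 | 40 | 44.3%
GPT-5.5 | 54 | 46 | 41 | 41 | 39 | 44.2%
Mistral Large | 56 | 48 | 40 | 39 | 38 | 44.1%
Z.AI GLM 4.7 | 54 | 51 | 48 | 34 | 33 | 44.0%
Qwen 3.6 35B | 69 | 48 | 38 | 34 | 29 | 43.6%
Aion 2.0 | 49 | 47 | 42 | 42 | 38 | 43.5%
GPT-5.5 (Reasoning) | 50 | 45 | 43 | 41 | 37 | 43.0%
Mistral Large 2 | 54 | 47 | 44 | 36 | 33 | 42.7%
GPT-5.4 Mini (Reasoning) | 43 | 43 | 42 | 42 | 41 | 42.3%
GPT-OSS 120B | 46 | 44 | 42 | 41 | 39 | 42.2%
Grok 4.3 | 47 | 44 | 42 | 41 | 37 | 42.2%
Gemini 3.1 Flash Lite (Preview) | 49 | 48 | 38 | 37 | 34 | 41.6%
GPT-5.4 Mini | 46 | 42 | 42 | 40 | 38 | 41.6%
Ministral 3 3B | 55 | 45 | 40 | 40 | 27 | 41.5%
GPT-5.4 Mini (Reasoning, Low) | 42 | 42 | 41 | 41 | 39 | 41.2%
ByteDance Seed 1.6 | 56 | 49 | 36 | 34 | 31 | 41.1%
Qwen3.6 Max Preview | 64 | 41 | 39 | 35 | 25 | 40.8%
Qwen 3.5 397B A17B | 44 | 43 | 40 | 40 | 35 | 40.5%
Qwen 2.5 72B | 43 | 43 | 41 | 39 | 36 | 40.5%
DeepSeek V3.2 | 47 | 46 | 42 | 34 | 33 | 40.2%
ByteDance Seed 1.6 Flash | 42 | 41 | 40 | 40 | 38 | 40.0%
Qwen 3.5 Plus (2026-02-15) | 43 | 42 | 41 | 38 | 34 | 39.7%
Ministral 3 8B | 45 | 42 | 39 | 39 | 34 | 39.7%
Stealth: Hunter Alpha | 48 | 44 | 44 | 36 | 26 | 39.6%
GPT-5 | 46 | 42 | 37 | 37 | 35 | 39.4%
Qwen 3.6 Flash | 44 | 42 | 38 | 37 | 35 | 39.4%
GPT-5 Mini | 44 | 41 | 40 | 37 | 33 | 38.9%
Ministral 3B | 48 | 44 | 37 | 37 | 27 | 38.7%
GPT-5.2 | 45 | 40 | 38 | 38 | 34 | 38.7%
ByteDance Seed 2.0 Mini | 48 | 39 | 37 | 35 | 35 | 38.7%
MoonshotAI: Kimi K2.6 | 44 | 39 | 37 | 36 | 36 | 38.6%
Gemma 4 31B (Reasoning) | 52 | 38 | 36 | 36 | 30 | 38.6%
GPT-4o, May 13th (temp=0) | 45 | 44 | 40 | 34 | 25 | 37.6%
Mistral Small Creative | 43 | 42 | 37 | 35 | 30 | 37.4%
Gemini 3.1 Flash Lite | 42 | 41 | 37 | 34 | 32 | 37.2%
GPT-5.4 Nano (Reasoning) | 40 | 37 | 36 | 36 | 34 | 36.8%
Qwen3.7 Max | 41 | 39 | 37 | 34 | 33 | 36.8%
Gemini 3.1 Flash Lite (Reasoning) | 44 | 41 | 35 | 30 | 30 | 36.2%
GPT-5.4 Nano (Reasoning, Low) | 39 | 38 | 35 | 35 | 33 | 36.1%
Gemini 3 Flash (Preview, Reasoning) | 39 | 38 | 36 | 34 | 33 | 35.9%
Stealth: Healer Alpha | 43 | 38 | 36 | 32 | 29 | 35.7%
GPT-5.4 Nano | 38 | 36 | 36 | 35 | 34 | 35.7%
Qwen 3.5 9B | 44 | 39 | 33 | 31 | 31 | 35.5%
Z.AI GLM 4.6 | 43 | 36 | 34 | 33 | 29 | 35.2%
Ministral 8B | 38 | 36 | 35 | 33 | 33 | 35.1%
Qwen 3.5 27B | 41 | 38 | 35 | 31 | 30 | 35.0%
Gemini 3 Flash (Preview) | 37 | 36 | 35 | 35 | 32 | 35.0%
Gemma 4 31B | 43 | 35 | 34 | 33 | 30 | 34.8%
Qwen 3.5 122B | 39 | 36 | 33 | 32 | 32 | 34.3%
Nemotron 3 Nano | 45 | 38 | 32 | 28 | 26 | 34.0%
Xiaomi MIMO v2.5 | 35 | 35 | 35 | 34 | 30 | 33.8%
Qwen 3.5 Flash | 38 | 37 | 34 | 32 | 28 | 33.6%
Mistral Small 3.2 24B | 44 | 38 | 37 | 25 | 25 | 33.6%
Gemma 4 26B (Reasoning) | 43 | 33 | 30 | 30 | 25 | 32.2%
Gemma 4 26B | 36 | 34 | 33 | 29 | 25 | 31.4%
Inception Mercury | 41 | 34 | 25 | 25 | 25 | 30.1%
Qwen 3.5 35B | 36 | 30 | 30 | 28 | 25 | 29.9%
GPT-5 Nano | 25 | 25 | 25 | 25 | 25 | 25.0%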
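Model | #1 | #2 | #3 | #4 | #5 | Avg
Grok 4.1 Fast | 98 | 98 | 96 | 95 | 89 | 95.2%
GPT-4o, Aug. 6th (temp=1) | 99 | 98 | 98 | 91 | 87 | 94.4%
Llama 3.1 Nemotron 70B | 96 | 94 | 92 | 88 | 84 | 90.7%
Llama 3.1 8B | 100 | 98 | 97 | 92 | 60 | 89.7%
GPT-4o Mini (temp=1) | 98 | 95 | 93 | 81 | 79 | 89.3%
Claude 3.5 Sonnet | 92 | 90 | 90 | 86 | 84 | 88.5%
DeepSeek V3 (2025-03-24) | 100 | 92 | 86 | 80 | 69 | 85.3%
GPT-4o, May 13th (temp=1) | 91 | 89 | 88 | 84 | 70 | 84.4%
Hermes 3 70B | 96 | 95 | 83 | 79 | 67 | 83.9%
Claude Sonnet 4 | 92 | 89 | 85 | 76 | 71 | 82.5%
Cohere Command R+ (Aug. 2024) | 95 | 82 | 82 | 77 | 75 | 82.1%
Grok 4 | 99 | 82 | 81 | 74 | 70 | 81.3%
DeepSeek V3 (2024-12-26) | 90 | 88 | 77 | 77 | 74 | 81.2%
Z.AI GLM 4.5 Air | 93 | 90 | 79 | 78 | 59 | 79.6%
Hermes 3 405B | 93 | 88 | 78 | 77 | 60 | 79.4%
Claude Sonnet 4.5 | 95 | 80 | 77 | 75 | 68 | 79.3%
Claude 3 Haiku | 94 | 89 | 73 | 72 | 64 | 78.5%
Claude Opus 4 | 88 | 82 | 76 | 75 | 70 | 78.4%
Rocinante 12B | 99 | 93 | 85 | 67 | 47 | 78.1%
Grok 4 Fast | 84 | 82 | 81 | 72 | 71 | 78.0%
Z.AI GLM 4.5 | 90 | 82 | 78 | 69 | 67 | 77.0%
Gemini 2.5 Flash | 100 | 76 | 75 | 68 | 64 | 76.6%
GPT-4.1 Mini | 91 | 82 | 78 | 68 | 55 | 74.8%
Claude 3.7 Sonnet | 81 | 77 | 74 | 73 | 63 | 73.7%
GPT-4o Mini (temp=0) | 76 | 75 | 75 | 74 | 68 | 73.6%
Z.AI GLM 5.1 | 81 | 81 | 72 | 72 | 61 | 73.4%
DeepSeek-V2 Chat | 95 | 84 | 67 | 66 | 53 | 72.9%
Grok 4.20 (Beta) | 79 | 78 | 72 | 65 | 64 | 71.7%
DeepSeek V4 Flash (Reasoning) | 79 | 78 | 76 | 67 | 50 | 70.0%
Gemma 3 12B | 82 | 74 | 72 | 67 | 53 | 69.7%
Z.AI GLM 5 | 89 | 68 | 66 | 61 | 59 | 68.4%
DeepSeek V4 Pro | 79 | 74 | 67 | 66 | 56 | 68.3%
Mistral Medium 3.1 | 90 | 74 | 63 | 61 | 53 | 68.1%
Grok 4.3 | 94 | 82 | 60 | 57 | 45 | 67.7%
Z.AI GLM 5 Turbo | 74 | 69 | 69 | 65 | 58 | 66.9%
Claude Opus 4.5 | 78 | 73 | 66 | 62 | 55 | 66.6%
GPT-4o, Aug. 6th (temp=0) | 82 | 74 | 65 | 62 | 50 | 66.6%
Z.AI GLM 4.6 | 81 | 72 | 66 | 63 | 51 | 66.6%
GPT-4.1 Nano | 74 | 72 | 72 | 62 | 52 | 66.3%
GPT-4.1 | 69 | 69 | 65 | 65 | 64 | 66.2%
Claude Opus 4.7 (Reasoning) | 69 | 68 | 68 | 66 | 58 | 65.9%
MiniMax M2.5 | 73 | 73 | 69 | 62 | 52 | 65.8%
ByteDance Seed 1.6 | 97 | 69 | 55 | 55 | 54 | 65.7%
Gemini 2.5 Flash (Reasoning) | 97 | 65 | 63 | 58 | 46 | 65.7%
Gemma 3 27B | 75 | 67 | 64 | 61 | 59 | 65.3%
Grok 4.20 (Reasoning) | 71 | 67 | 66 | 61 | 60 | 65.0%
DeepSeek V4 Flash | 74 | 72 | 67 | 57 | 50 | 63.8%
Qwen 3 32B | 78 | 71 | 65 | 53 | 49 | 63.3%
Claude Opus 4.7 | 70 | 65 | 65 | 60 | 56 | 63.3%
Qwen 3.5 Plus (2026-02-15) | 69 | 67 | 63 | 60 | 57 | 63.3%
Mistral Large 2 | 75 | 68 | 58 | 58 | 57 | 63.2%
Arcee AI: Trinity Large (Preview) | 79 | 77 | 64 | 59 | 34 | 62.8%
DeepSeek V4 Pro (Reasoning) | 67 | 64 | 62 | 61 | 59 | 62.6%
Writer: Palmyra X5 | 71 | 70 | 65 | 52 | 51 | 61.9%
Gemini 2.5 Flash Lite | 73 | 67 | 59 | 56 | 55 | 61.9%
Claude Sonnet 4.6 (Reasoning) | 69 | 64 | 63 | 59 | 53 | 61.6%
Arcee AI: Trinity Mini | 93 | 71 | 58 | 47 | 39 | 61.5%
Gemini 2.5 Flash Lite (Reasoning) | 68 | 67 | 62 | 57 | 50 | 60.8%
GPT-4o, May 13th (temp=0) | 65 | 63 | 62 | 58 | 56 | 60.6%
Claude Sonnet 4.6 | 79 | 60 | 59 | 53 | 51 | 60.4%
Mistral Large | 70 | 66 | 60 | 56 | 45 | 59.5%
MiniMax M2.7 | 72 | 67 | 61 | 51 | 46 | 59.5%
WizardLM 2 8x22b | 71 | 64 | 59 | 53 | 51 | 59.4%
Gemma 3 4B | 71 | 65 | 58 | 52 | 45 | 58.4%
Qwen 2.5 72B | 64 | 58 | 57 | 56 | 55 | 58.0%
Claude Opus 4.6 | 65 | 63 | 61 | 54 | 47 | 57.9%
Claude Haiku 4.5 | 75 | 59 | 55 | 53 | 48 | 57.9%
Llama 3.1 70B | 68 | 68 | 62 | 58 | 32 | 57.4%
Claude Opus 4.6 (Reasoning) | 63 | 61 | 59 | 56 | 49 | 57.3%
Grok 4.20 | 68 | 58 | 56 | 55 | 48 | 57.0%
Grok 4.20 (Beta, Reasoning) | 64 | 60 | 59 | 53 | 47 | 56.7%
o4 Mini High | 77 | 64 | 50 | 49 | 43 | 56.7%
MoonshotAI: Kimi K2.5 | 66 | 62 | 57 | 53 | 43 | 56.3%
Mistral Large 3 | 61 | 59 | 57 | 52 | 46 | 55.1%
Qwen3 235B A22B Instruct 2507 | 67 | 65 | 52 | 45 | 44 | 54.4%
MoonshotAI: Kimi K2.6 | 69 | 56 | 53 | 51 | 42 | 54.1%
Gemini 3.1 Flash Lite | 65 | 58 | 53 | 50 | 43 | 54.0%
Aion 2.0 | 61 | 61 | 54 | 49 | 44 | 53.6%
Xiaomi MIMO v2.5 Pro | 57 | 55 | 53 | 52 | 50 | 53.4%
Xiaomi MIMO v2.5 | 61 | 55 | 53 | 49 | 49 | 53.2%
Mistral Small 4 | 69 | 54 | 52 | 48 | 43 | 53.1%
LFM2 24B | 65 | 55 | 52 | 50 | 43 | 53.0%
o4 Mini | 59 | 59 | 54 | 48 | 46 | 52.9%
ByteDance Seed 1.6 Flash | 58 | 56 | 53 | 51 | 45 | 52.8%
Gemini 2.5 Pro | 62 | 59 | 58 | 46 | 38 | 52.5%
Qwen 3.6 Flash | 65 | 61 | 54 | 45 | 35 | 52.0%
Qwen3.6 Max Preview | 66 | 55 | 55 | 44 | 38 | 51.6%
Stealth: Hunter Alpha | 61 | 51 | 50 | 48 | 45 | 51.2%
DeepSeek V3.2 | 53 | 52 | 50 | 50 | 49 | 50.9%
Grok 4.3 (Reasoning) | 65 | 56 | 49 | 44 | 40 | 50.7%
Nemotron 3 Super | 57 | 54 | 51 | 47 | 43 | 50.6%
Qwen3.7 Max | 58 | 55 | 52 | 45 | 42 | 50.4%
Mistral Small 4 (Reasoning) | 51 | 51 | 50 | 47 | 46 | 49.2%
Gemini 3.1 Pro (Preview) | 53 | 50 | 48 | 48 | 47 | 49.0%
Gemini 3 Pro (Preview) | 54 | 50 | 49 | 45 | 45 | 48.5%
ByteDance Seed 2.0 Lite | 59 | 58 | 44 | 42 | 38 | 48.3%
Ministral 3 14B | 53 | 53 | 47 | 46 | 42 | 48.2%
Ministral 3B | 53 | 48 | 48 | 46 | 45 | 48.1%
Ministral 3 3B | 55 | 47 | 47 | 45 | 42 | 47.4%
Gemini 3.5 Flash (Reasoning, Minimal) | 56 | 53 | 50 | 44 | 35 | 47.3%
Stealth: Healer Alpha | 57 | 55 | 44 | 41 | 38 | 47.2%
Ministral 3 8B | 51 | 48 | 46 | 46 | 45 | 47.1%
GPT-5.4 | 51 | 48 | 47 | 46 | 43 | 46.9%
Gemini 3.5 Flash (Reasoning) | 55 | 48 | 46 | 43 | 41 | 46.4%
Ministral 8B | 48 | 47 | 47 | 45 | 41 | 45.9%
Mistral Small Creative | 48 | 46 | 46 | 45 | 44 | 45.8%
GPT-5.4 (Reasoning, Low) | 51 | 46 | 45 | 44 | 43 | 45.7%
GPT-5.4 Mini | 48 | 47 | 45 | 44 | 43 | 45.5%
GPT-5.4 Mini (Reasoning) | 48 | 48 | 46 | 44 | 39 | 45.1%
Nemotron 3 Nano | 53 | 45 | 44 | 42 | 42 | 45.1%
Gemma 4 31B | 52 | 51 | 43 | 39 | 38 | 44.6%
DeepSeek V3.1 | 56 | 48 | 44 | 41 | 33 | 44.6%
GPT-OSS 120B | 47 | 47 | 44 | 43 | 42 | 44.5%
GPT-5.1 | 53 | 45 | 43 | 42 | 40 | 44.5%
GPT-5.5 (Reasoning) | 48 | 44 | 44 | 43 | 43 | 44.4%
GPT-5.4 (Reasoning) | 46 | 44 | 44 | 44 | 43 | 44.3%
GPT-5.5 | 46 | 45 | 44 | 44 | 43 | 44.2%
GPT-5.5 (Reasoning, Low) | 45 | 45 | 44 | 44 | 42 | 44.2%
GPT-5.4 Mini (Reasoning, Low) | 47 | 45 | 43 | 43 | 42 | 44.0%
Inception Mercury 2 | 47 | 44 | 44 | 43 | 41 | 44.0%
GPT-5.2 | 46 | 44 | 44 | 43 | 42 | 43.8%
Qwen 3.5 Plus (2026-04-20) | 50 | 47 | 43 | 43 | 36 | 43.7%
Z.AI GLM 4.7 | 54 | 47 | 43 | 37 | 37 | 43.7%
Stealth: Aurora Alpha | 47 | 45 | 43 | 42 | 42 | 43.7%
Mistral NeMO | 49 | 48 | 44 | 43 | 34 | 43.6%
Qwen 3.5 397B A17B | 48 | 45 | 44 | 41 | 38 | 43.3%
GPT-5.4 Nano (Reasoning) | 44 | 44 | 44 | 43 | 41 | 43.1%
ByteDance Seed 2.0 Mini | 59 | 43 | 41 | 38 | 34 | 43.1%
GPT-5.4 Nano | 45 | 44 | 42 | 42 | 41 | 43.0%
Gemma 4 31B (Reasoning) | 48 | 43 | 42 | 41 | 39 | 42.8%
GPT-5 | 55 | 43 | 39 | 37 | 37 | 42.2%
Qwen 3.6 35B | 62 | 45 | 45 | 32 | 25 | 42.0%
Gemma 4 26B (Reasoning) | 48 | 48 | 41 | 36 | 35 | 41.7%
GPT-5.4 Nano (Reasoning, Low) | 44 | 43 | 42 | 41 | 39 | 41.6%
Inception Mercury | 50 | 47 | 43 | 43 | 25 | 41.6%
Mistral Small 3.2 24B | 50 | 50 | 45 | 33 | 29 | 41.5%
Gemini 3 Flash (Preview, Reasoning) | 54 | 44 | 37 | 37 | 35 | 41.3%
Gemini 3.1 Flash Lite (Reasoning) | 54 | 39 | 39 | 38 | 33 | 40.3%
GPT-5 Mini | 41 | 41 | 40 | 39 | 39 | 40.2%
Gemini 3.1 Flash Lite (Preview) | 45 | 43 | 40 | 38 | 31 | 39.7%
Qwen 3.5 122B | 46 | 41 | 40 | 35 | 32 | 38.9%
Gemini 3 Flash (Preview) | 40 | 38 | 38 | 38 | 38 | 38.5%
Z.AI GLM 4.7 Flash | 43 | 43 | 34 | 34 | 33 | 37.6%
Qwen 3.6 27B | 43 | 42 | 39 | 34 | 29 | 37.5%
Qwen 3.5 9B | 42 | 41 | 39 | 37 | 29 | 37.5%
Qwen 3.5 27B | 41 | 40 | 36 | 34 | 34 | 37.1%
GPT-5 Nano | 40 | 38 | 37 | 36 | 35 | 37.1%
Qwen 3.5 Flash | 38 | 38 | 38 | 36 | 30 | 35.9%
Gemma 4 26B | 39 | 39 | 38 | 32 | 30 | 35.5%
Qwen 3.5 35B | 39 | 38 | 37 | 37 | 25 | 35.5%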
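Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 98 | 98 | 92 | 97.7%
Hermes 3 405B | 99 | 98 | 95 | 95 | 91 | 95.7%
Grok 4.1 Fast | 100 | 100 | 98 | 93 | 87 | 95.4%
Claude 3.5 Sonnet | 100 | 99 | 97 | 96 | 86 | 95.4%
Llama 3.1 Nemotron 70B | 100 | 100 | 97 | 89 | 83 | 93.9%
DeepSeek V3 (2025-03-24) | 100 | 99 | 96 | 88 | 81 | 92.7%
Rocinante 12B | 100 | 98 | 95 | 85 | 84 | 92.5%
Cohere Command R+ (Aug. 2024) | 97 | 97 | 95 | 94 | 69 | 90.5%
DeepSeek V4 Flash | 98 | 97 | 92 | 89 | 76 | 90.4%
DeepSeek V4 Pro | 97 | 91 | 89 | 88 | 87 | 90.2%
Z.AI GLM 4.5 | 100 | 97 | 88 | 84 | 77 | 89.1%
Z.AI GLM 4.5 Air | 98 | 96 | 92 | 83 | 76 | 89.1%
Claude 3 Haiku | 100 | 97 | 87 | 83 | 74 | 88.1%
Claude Opus 4.7 | 98 | 93 | 91 | 83 | 74 | 87.9%
Hermes 3 70B | 100 | 96 | 92 | 85 | 63 | 87.3%
GPT-4o Mini (temp=1) | 92 | 89 | 89 | 84 | 74 | 85.6%
DeepSeek V4 Flash (Reasoning) | 99 | 90 | 88 | 77 | 74 | 85.4%
Claude Opus 4 | 94 | 90 | 89 | 77 | 73 | 84.6%
Claude Sonnet 4.5 | 97 | 90 | 85 | 75 | 75 | 84.5%
Claude Sonnet 4.6 | 90 | 87 | 84 | 84 | 74 | 83.7%
GPT-4o Mini (temp=0) | 89 | 87 | 87 | 81 | 74 | 83.4%
Claude Sonnet 4 | 95 | 84 | 80 | 79 | 74 | 82.5%
DeepSeek V4 Pro (Reasoning) | 91 | 87 | 84 | 76 | 74 | 82.5%
Llama 3.1 8B | 98 | 98 | 97 | 64 | 52 | 81.9%
Claude Opus 4.5 | 90 | 87 | 84 | 82 | 64 | 81.4%
Z.AI GLM 5 Turbo | 97 | 89 | 78 | 72 | 71 | 81.4%
GPT-4o, May 13th (temp=1) | 94 | 92 | 75 | 74 | 69 | 80.8%
Grok 4 Fast | 86 | 86 | 80 | 78 | 74 | 80.8%
GPT-4.1 Mini | 94 | 79 | 79 | 74 | 73 | 79.7%
MiniMax M2.7 | 95 | 85 | 77 | 72 | 69 | 79.4%
Grok 4 | 92 | 80 | 76 | 74 | 72 | 78.8%
Z.AI GLM 5 | 82 | 82 | 80 | 77 | 73 | 78.8%
Claude Opus 4.7 (Reasoning) | 84 | 83 | 81 | 75 | 72 | 78.7%
Claude Opus 4.6 (Reasoning) | 87 | 80 | 78 | 76 | 72 | 78.7%
Gemini 2.5 Flash | 100 | 82 | 73 | 72 | 63 | 78.0%
Gemma 3 27B | 90 | 84 | 75 | 70 | 69 | 77.4%
MoonshotAI: Kimi K2.5 | 90 | 89 | 81 | 71 | 56 | 77.3%
GPT-4o, Aug. 6th (temp=0) | 88 | 85 | 78 | 68 | 67 | 77.2%
GPT-4o, May 13th (temp=0) | 98 | 84 | 80 | 67 | 54 | 76.2%
Claude 3.7 Sonnet | 86 | 82 | 78 | 67 | 66 | 75.7%
MiniMax M2.5 | 95 | 77 | 71 | 68 | 66 | 75.3%
Z.AI GLM 5.1 | 86 | 79 | 78 | 72 | 56 | 74.2%
Qwen3 235B A22B Instruct 2507 | 77 | 76 | 76 | 76 | 61 | 73.2%
Gemma 3 12B | 78 | 76 | 70 | 70 | 68 | 72.5%
Xiaomi MIMO v2.5 Pro | 83 | 76 | 73 | 72 | 58 | 72.4%
GPT-4.1 | 80 | 80 | 69 | 69 | 64 | 72.3%
GPT-4.1 Nano | 81 | 78 | 68 | 67 | 66 | 72.0%
Claude Opus 4.6 | 93 | 82 | 66 | 59 | 59 | 71.8%
Claude Sonnet 4.6 (Reasoning) | 88 | 86 | 67 | 61 | 57 | 71.7%
WizardLM 2 8x22b | 90 | 88 | 63 | 61 | 54 | 71.2%
DeepSeek-V2 Chat | 84 | 83 | 75 | 60 | 54 | 71.0%
Aion 2.0 | 81 | 73 | 70 | 67 | 60 | 70.0%
DeepSeek V3.1 | 95 | 67 | 65 | 63 | 59 | 69.9%
Gemini 3.5 Flash (Reasoning, Minimal) | 97 | 73 | 67 | 58 | 54 | 69.9%
Writer: Palmyra X5 | 81 | 70 | 68 | 65 | 64 | 69.8%
DeepSeek V3 (2024-12-26) | 84 | 75 | 74 | 67 | 48 | 69.6%
Gemini 3 Pro (Preview) | 82 | 76 | 72 | 62 | 56 | 69.5%
Stealth: Hunter Alpha | 77 | 74 | 73 | 66 | 56 | 69.2%
Llama 3.1 70B | 82 | 74 | 66 | 61 | 60 | 68.5%
Arcee AI: Trinity Large (Preview) | 78 | 75 | 72 | 70 | 47 | 68.3%
Mistral Small 4 | 79 | 71 | 67 | 64 | 59 | 68.2%
Grok 4.20 (Beta) | 73 | 72 | 66 | 65 | 62 | 67.9%
GPT-5.4 | 72 | 68 | 68 | 68 | 63 | 67.8%
ByteDance Seed 2.0 Lite | 100 | 73 | 68 | 52 | 46 | 67.8%
ByteDance Seed 1.6 | 74 | 70 | 66 | 64 | 63 | 67.2%
Claude Haiku 4.5 | 75 | 73 | 70 | 62 | 56 | 67.2%
GPT-5.1 | 74 | 74 | 68 | 59 | 58 | 66.8%
GPT-5.4 (Reasoning, Low) | 73 | 67 | 67 | 65 | 62 | 66.6%
Arcee AI: Trinity Mini | 79 | 76 | 66 | 62 | 50 | 66.6%
Gemini 3.1 Pro (Preview) | 77 | 76 | 71 | 66 | 42 | 66.2%
Gemini 2.5 Flash Lite | 88 | 65 | 64 | 59 | 57 | 66.2%
LFM2 24B | 77 | 74 | 69 | 57 | 51 | 65.6%
Grok 4.20 | 75 | 71 | 65 | 62 | 54 | 65.5%
Qwen 2.5 72B | 73 | 68 | 68 | 59 | 59 | 65.4%
Grok 4.20 (Reasoning) | 80 | 70 | 61 | 60 | 57 | 65.4%
Grok 4.3 | 79 | 69 | 61 | 60 | 57 | 65.1%
ByteDance Seed 2.0 Mini | 92 | 70 | 58 | 56 | 45 | 64.4%
Grok 4.20 (Beta, Reasoning) | 70 | 66 | 65 | 64 | 52 | 63.4%
Ministral 3 3B | 70 | 69 | 65 | 63 | 49 | 63.1%
Qwen 3.6 27B | 81 | 68 | 63 | 61 | 42 | 62.9%
GPT-5.4 (Reasoning) | 67 | 65 | 63 | 62 | 56 | 62.9%
ByteDance Seed 1.6 Flash | 87 | 70 | 60 | 54 | 43 | 62.9%
GPT-5.5 | 68 | 67 | 67 | 59 | 54 | 62.8%
Gemini 3.5 Flash (Reasoning) | 73 | 65 | 62 | 57 | 55 | 62.5%
Xiaomi MIMO v2.5 | 83 | 63 | 56 | 56 | 53 | 62.3%
Ministral 3 14B | 79 | 79 | 54 | 50 | 49 | 62.1%
Ministral 3B | 83 | 79 | 53 | 51 | 45 | 62.1%
Qwen3.6 Max Preview | 75 | 69 | 64 | 53 | 49 | 61.8%
Gemini 2.5 Pro | 72 | 71 | 69 | 49 | 48 | 61.7%
o4 Mini High | 74 | 73 | 58 | 54 | 48 | 61.4%
Mistral NeMO | 70 | 65 | 63 | 62 | 45 | 60.9%
Gemma 3 4B | 74 | 63 | 56 | 55 | 54 | 60.5%
o4 Mini | 73 | 62 | 62 | 54 | 51 | 60.4%
GPT-5.5 (Reasoning, Low) | 66 | 63 | 61 | 56 | 54 | 60.1%
Qwen 3.6 Flash | 87 | 65 | 59 | 52 | 38 | 60.1%
Gemini 3.1 Flash Lite | 67 | 66 | 62 | 60 | 45 | 60.0%
GPT-5.5 (Reasoning) | 64 | 64 | 59 | 58 | 54 | 59.5%
Gemini 2.5 Flash (Reasoning) | 78 | 65 | 65 | 44 | 44 | 59.3%
Gemini 3.1 Flash Lite (Reasoning) | 74 | 68 | 63 | 46 | 45 | 59.2%
Qwen 3.6 35B | 83 | 65 | 53 | 52 | 43 | 59.1%
Mistral Large | 63 | 61 | 60 | 57 | 53 | 58.9%
Qwen 3 32B | 72 | 64 | 60 | 55 | 43 | 58.7%
Gemini 2.5 Flash Lite (Reasoning) | 84 | 63 | 51 | 47 | 45 | 58.2%
Qwen3.7 Max | 65 | 64 | 58 | 54 | 46 | 57.7%
DeepSeek V3.2 | 60 | 58 | 56 | 56 | 56 | 57.3%
Z.AI GLM 4.6 | 69 | 63 | 61 | 47 | 46 | 57.2%
Mistral Small 4 (Reasoning) | 66 | 64 | 58 | 54 | 43 | 57.0%
Gemini 3 Flash (Preview) | 70 | 59 | 52 | 52 | 47 | 56.0%
Mistral Medium 3.1 | 67 | 62 | 53 | 48 | 47 | 55.6%
Gemini 3.1 Flash Lite (Preview) | 63 | 55 | 53 | 53 | 51 | 55.1%
Grok 4.3 (Reasoning) | 75 | 55 | 54 | 46 | 45 | 55.0%
MoonshotAI: Kimi K2.6 | 70 | 61 | 55 | 45 | 43 | 54.7%
Stealth: Healer Alpha | 61 | 58 | 58 | 49 | 47 | 54.5%
Qwen 3.5 Plus (2026-02-15) | 62 | 57 | 55 | 52 | 46 | 54.3%
Qwen 3.5 397B A17B | 74 | 59 | 49 | 45 | 44 | 54.1%
Z.AI GLM 4.7 | 75 | 53 | 47 | 45 | 44 | 52.9%
Ministral 8B | 67 | 64 | 47 | 44 | 39 | 52.3%
Mistral Small Creative | 59 | 58 | 52 | 52 | 41 | 52.2%
GPT-5.4 Mini | 56 | 55 | 52 | 50 | 47 | 52.1%
Nemotron 3 Super | 62 | 58 | 49 | 46 | 44 | 51.8%
Gemma 4 31B | 63 | 53 | 50 | 47 | 45 | 51.7%
GPT-5 | 57 | 52 | 52 | 50 | 48 | 51.6%
Qwen 3.5 Plus (2026-04-20) | 70 | 49 | 47 | 45 | 39 | 50.1%
GPT-5 Mini | 61 | 56 | 49 | 46 | 36 | 49.6%
Ministral 3 8B | 59 | 57 | 48 | 43 | 42 | 49.6%
GPT-5.4 Mini (Reasoning, Low) | 59 | 49 | 48 | 46 | 45 | 49.6%
Mistral Large 3 | 66 | 55 | 46 | 42 | 38 | 49.4%
Z.AI GLM 4.7 Flash | 62 | 51 | 50 | 42 | 39 | 48.8%
Mistral Large 2 | 61 | 56 | 43 | 42 | 42 | 48.7%
GPT-OSS 120B | 61 | 52 | 46 | 43 | 42 | 48.6%
GPT-5.4 Mini (Reasoning) | 53 | 49 | 47 | 47 | 46 | 48.2%
Mistral Small 3.2 24B | 71 | 50 | 50 | 43 | 26 | 47.7%
Nemotron 3 Nano | 56 | 49 | 47 | 46 | 40 | 47.5%
Gemini 3 Flash (Preview, Reasoning) | 53 | 49 | 46 | 45 | 43 | 47.1%
Gemma 4 31B (Reasoning) | 61 | 54 | 42 | 39 | 31 | 45.4%
GPT-5.2 | 47 | 46 | 44 | 43 | 43 | 44.4%
Gemma 4 26B | 48 | 46 | 45 | 44 | 38 | 44.3%
Inception Mercury 2 | 54 | 44 | 44 | 38 | 37 | 43.6%
Qwen 3.5 122B | 65 | 41 | 39 | 39 | 33 | 43.2%
GPT-5.4 Nano | 44 | 44 | 43 | 42 | 42 | 43.1%
Gemma 4 26B (Reasoning) | 48 | 46 | 39 | 39 | 39 | 42.2%
GPT-5.4 Nano (Reasoning) | 45 | 44 | 41 | 40 | 40 | 42.1%
Stealth: Aurora Alpha | 46 | 43 | 42 | 41 | 34 | 41.3%
Qwen 3.5 27B | 48 | 48 | 43 | 37 | 25 | 40.3%
GPT-5.4 Nano (Reasoning, Low) | 42 | 41 | 41 | 38 | 38 | 40.1%
Qwen 3.5 Flash | 46 | 41 | 38 | 33 | 33 | 38.1%
Qwen 3.5 35B | 44 | 39 | 38 | 36 | 33 | 37.8%
Inception Mercury | 47 | 39 | 37 | 34 | 25 | 36.4%
Qwen 3.5 9B | 50 | 44 | 29 | 25 | 25 | 34.6%
GPT-5 Nano | 38 | 35 | 33 | 32 | 31 | 33.9%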
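Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-4o, Aug. 6th (temp=1) | 100 | 95 | 91 | 90 | 86 | 92.5%
Grok 4.1 Fast | 99 | 98 | 88 | 88 | 69 | 88.6%
GPT-4o Mini (temp=1) | 100 | 91 | 86 | 80 | 75 | 86.5%
Llama 3.1 Nemotron 70B | 98 | 91 | 88 | 82 | 73 | 86.4%
Hermes 3 70B | 98 | 93 | 91 | 89 | 60 | 86.0%
Claude 3.5 Sonnet | 97 | 92 | 84 | 79 | 77 | 86.0%
GPT-4o, May 13th (temp=1) | 95 | 93 | 81 | 80 | 68 | 83.4%
Rocinante 12B | 100 | 98 | 97 | 72 | 48 | 82.9%
Z.AI GLM 4.5 | 93 | 88 | 81 | 81 | 72 | 82.9%
Cohere Command R+ (Aug. 2024) | 100 | 93 | 91 | 69 | 54 | 81.2%
Hermes 3 405B | 99 | 83 | 78 | 73 | 69 | 80.4%
DeepSeek V3 (2025-03-24) | 86 | 80 | 78 | 77 | 76 | 79.5%
DeepSeek V3 (2024-12-26) | 97 | 81 | 81 | 69 | 57 | 76.8%
Grok 4 | 84 | 82 | 79 | 75 | 64 | 76.8%
Grok 4 Fast | 89 | 77 | 76 | 75 | 67 | 76.8%
Z.AI GLM 5.1 | 93 | 79 | 74 | 74 | 56 | 75.5%
Mistral Small 4 | 100 | 100 | 64 | 57 | 55 | 75.3%
Qwen 3 32B | 86 | 85 | 73 | 70 | 58 | 74.6%
DeepSeek V4 Flash (Reasoning) | 82 | 82 | 79 | 74 | 53 | 74.1%
Llama 3.1 8B | 100 | 95 | 91 | 45 | 35 | 73.3%
Claude Sonnet 4 | 78 | 76 | 73 | 70 | 65 | 72.5%
GPT-4.1 Mini | 86 | 79 | 76 | 63 | 59 | 72.4%
GPT-4o Mini (temp=0) | 86 | 80 | 69 | 66 | 60 | 72.1%
DeepSeek V4 Flash | 97 | 75 | 74 | 66 | 49 | 71.9%
GPT-4.1 Nano | 86 | 83 | 68 | 63 | 59 | 71.7%
Claude Opus 4.7 | 79 | 78 | 73 | 69 | 58 | 71.2%
DeepSeek-V2 Chat | 78 | 74 | 67 | 67 | 67 | 70.6%
Claude 3 Haiku | 89 | 71 | 64 | 63 | 57 | 69.0%
Llama 3.1 70B | 100 | 66 | 59 | 59 | 58 | 68.6%
Z.AI GLM 4.5 Air | 76 | 76 | 70 | 64 | 56 | 68.3%
Claude Sonnet 4.5 | 73 | 71 | 70 | 68 | 59 | 68.3%
GPT-4.1 | 75 | 73 | 70 | 61 | 58 | 67.4%
Grok 4.3 | 71 | 70 | 67 | 64 | 64 | 67.4%
Qwen3 235B A22B Instruct 2507 | 84 | 73 | 67 | 61 | 51 | 67.2%
Claude Sonnet 4.6 (Reasoning) | 83 | 68 | 66 | 59 | 58 | 67.0%
Gemma 3 27B | 77 | 73 | 69 | 62 | 54 | 67.0%
MoonshotAI: Kimi K2.5 | 78 | 71 | 68 | 59 | 56 | 66.4%
Gemma 3 4B | 79 | 76 | 66 | 57 | 52 | 66.2%
Qwen 3.6 Flash | 94 | 83 | 62 | 46 | 45 | 66.2%
Claude Opus 4.7 (Reasoning) | 73 | 68 | 64 | 64 | 59 | 65.5%
Gemini 2.5 Flash (Reasoning) | 79 | 71 | 60 | 59 | 58 | 65.3%
Gemini 3.5 Flash (Reasoning, Minimal) | 72 | 64 | 63 | 63 | 62 | 64.7%
Gemini 2.5 Flash | 83 | 72 | 59 | 57 | 53 | 64.7%
Claude Sonnet 4.6 | 87 | 61 | 59 | 59 | 57 | 64.7%
Claude 3.7 Sonnet | 70 | 66 | 65 | 63 | 59 | 64.7%
Qwen 2.5 72B | 70 | 69 | 64 | 59 | 59 | 64.2%
Claude Haiku 4.5 | 76 | 67 | 60 | 59 | 59 | 64.1%
Claude Opus 4 | 70 | 67 | 66 | 59 | 57 | 64.0%
Z.AI GLM 5 Turbo | 71 | 68 | 66 | 59 | 51 | 63.0%
DeepSeek V4 Pro | 70 | 67 | 66 | 58 | 53 | 62.9%
GPT-4o, May 13th (temp=0) | 70 | 69 | 69 | 56 | 48 | 62.3%
Grok 4.20 (Reasoning) | 71 | 65 | 60 | 59 | 52 | 61.4%
Z.AI GLM 4.6 | 73 | 65 | 58 | 57 | 53 | 61.2%
Grok 4.3 (Reasoning) | 70 | 67 | 61 | 58 | 50 | 61.2%
Grok 4.20 (Beta) | 70 | 68 | 68 | 52 | 46 | 60.6%
Grok 4.20 | 71 | 67 | 61 | 56 | 48 | 60.6%
Gemini 2.5 Flash Lite | 82 | 59 | 58 | 53 | 49 | 60.3%
Writer: Palmyra X5 | 64 | 63 | 61 | 57 | 55 | 60.2%
GPT-4o, Aug. 6th (temp=0) | 78 | 63 | 58 | 51 | 49 | 59.7%
Qwen3.6 Max Preview | 73 | 68 | 60 | 53 | 44 | 59.5%
Claude Opus 4.5 | 66 | 66 | 61 | 56 | 48 | 59.5%
Claude Opus 4.6 | 68 | 64 | 59 | 58 | 47 | 59.0%
Gemma 3 12B | 77 | 57 | 56 | 56 | 50 | 59.0%
ByteDance Seed 1.6 | 66 | 62 | 61 | 59 | 45 | 58.6%
MiniMax M2.7 | 76 | 61 | 57 | 49 | 49 | 58.6%
LFM2 24B | 68 | 64 | 57 | 56 | 48 | 58.5%
o4 Mini | 63 | 61 | 61 | 57 | 48 | 58.0%
Claude Opus 4.6 (Reasoning) | 61 | 59 | 57 | 57 | 51 | 57.0%
Gemini 2.5 Flash Lite (Reasoning) | 65 | 64 | 55 | 51 | 50 | 56.9%
Mistral Small 4 (Reasoning) | 61 | 59 | 57 | 55 | 50 | 56.3%
Arcee AI: Trinity Mini | 77 | 56 | 52 | 48 | 46 | 55.9%
MiniMax M2.5 | 67 | 57 | 54 | 54 | 47 | 55.7%
Arcee AI: Trinity Large (Preview) | 66 | 65 | 55 | 47 | 44 | 55.4%
Mistral NeMO | 68 | 61 | 54 | 49 | 46 | 55.4%
DeepSeek V4 Pro (Reasoning) | 73 | 62 | 53 | 45 | 43 | 55.3%
Z.AI GLM 5 | 67 | 60 | 60 | 52 | 36 | 55.1%
ByteDance Seed 1.6 Flash | 68 | 57 | 55 | 51 | 45 | 55.0%
Gemini 3.1 Flash Lite (Reasoning) | 67 | 61 | 60 | 52 | 35 | 54.9%
Aion 2.0 | 65 | 63 | 58 | 44 | 43 | 54.5%
Mistral Medium 3.1 | 62 | 57 | 53 | 53 | 47 | 54.3%
Grok 4.20 (Beta, Reasoning) | 60 | 57 | 52 | 52 | 50 | 54.1%
Xiaomi MIMO v2.5 Pro | 71 | 53 | 52 | 50 | 44 | 53.9%
Xiaomi MIMO v2.5 | 58 | 57 | 52 | 52 | 49 | 53.8%
Mistral Large 3 | 67 | 57 | 50 | 47 | 46 | 53.4%
Gemini 3 Pro (Preview) | 60 | 60 | 56 | 49 | 43 | 53.4%
DeepSeek V3.2 | 60 | 55 | 52 | 50 | 50 | 53.2%
Stealth: Hunter Alpha | 59 | 59 | 56 | 49 | 44 | 53.2%
Mistral Large 2 | 62 | 56 | 54 | 45 | 44 | 52.2%
Gemini 3.5 Flash (Reasoning) | 63 | 56 | 50 | 46 | 45 | 52.1%
Qwen 3.5 Plus (2026-04-20) | 73 | 56 | 55 | 42 | 33 | 51.9%
WizardLM 2 8x22b | 61 | 60 | 53 | 43 | 42 | 51.9%
GPT-5.4 (Reasoning, Low) | 55 | 55 | 52 | 49 | 49 | 51.8%
Ministral 3B | 70 | 68 | 48 | 39 | 35 | 51.8%
Gemini 2.5 Pro | 62 | 56 | 48 | 48 | 45 | 51.7%
Nemotron 3 Super | 57 | 54 | 51 | 50 | 46 | 51.6%
Qwen 3.5 Plus (2026-02-15) | 63 | 52 | 49 | 49 | 45 | 51.5%
GPT-5.4 | 57 | 51 | 51 | 51 | 47 | 51.3%
Mistral Large | 63 | 54 | 50 | 46 | 44 | 51.2%
Stealth: Aurora Alpha | 60 | 59 | 49 | 48 | 40 | 50.9%
Stealth: Healer Alpha | 65 | 51 | 48 | 47 | 42 | 50.6%
MoonshotAI: Kimi K2.6 | 56 | 52 | 49 | 49 | 46 | 50.2%
Gemini 3.1 Pro (Preview) | 69 | 48 | 48 | 45 | 40 | 50.2%
DeepSeek V3.1 | 59 | 51 | 51 | 50 | 37 | 49.9%
Mistral Small Creative | 61 | 49 | 49 | 47 | 41 | 49.5%
GPT-OSS 120B | 62 | 51 | 46 | 45 | 44 | 49.4%
ByteDance Seed 2.0 Lite | 69 | 68 | 54 | 29 | 25 | 49.2%
Z.AI GLM 4.7 | 62 | 51 | 47 | 45 | 37 | 48.6%
Ministral 3 14B | 56 | 48 | 46 | 45 | 45 | 48.0%
Gemini 3.1 Flash Lite | 52 | 51 | 51 | 49 | 37 | 47.9%
o4 Mini High | 54 | 48 | 48 | 46 | 42 | 47.6%
GPT-5.4 (Reasoning) | 50 | 49 | 48 | 46 | 43 | 47.1%
Ministral 3 3B | 53 | 50 | 45 | 45 | 43 | 47.1%
ByteDance Seed 2.0 Mini | 66 | 48 | 42 | 42 | 37 | 47.0%
GPT-5.1 | 54 | 48 | 47 | 45 | 40 | 46.6%
GPT-5.4 Mini (Reasoning) | 47 | 46 | 46 | 46 | 46 | 46.3%
Inception Mercury 2 | 48 | 48 | 48 | 44 | 42 | 46.2%
GPT-5.5 (Reasoning, Low) | 52 | 45 | 45 | 44 | 44 | 46.0%
Ministral 8B | 50 | 47 | 47 | 46 | 39 | 45.8%
GPT-5.5 (Reasoning) | 49 | 45 | 45 | 44 | 44 | 45.4%
GPT-5.4 Mini | 49 | 46 | 45 | 45 | 42 | 45.4%
Nemotron 3 Nano | 48 | 48 | 47 | 41 | 40 | 45.0%
Gemini 3.1 Flash Lite (Preview) | 56 | 50 | 44 | 42 | 33 | 44.9%
GPT-5.5 | 50 | 46 | 45 | 43 | 42 | 44.9%
GPT-5.4 Mini (Reasoning, Low) | 49 | 48 | 45 | 42 | 41 | 44.8%
Ministral 3 8B | 46 | 46 | 45 | 44 | 43 | 44.7%
GPT-5 | 57 | 44 | 42 | 41 | 36 | 44.0%
GPT-5.2 | 45 | 45 | 44 | 44 | 42 | 44.0%
GPT-5.4 Nano (Reasoning) | 44 | 44 | 43 | 41 | 41 | 42.9%
Qwen 3.6 35B | 56 | 43 | 42 | 39 | 34 | 42.7%
Z.AI GLM 4.7 Flash | 51 | 46 | 45 | 39 | 32 | 42.6%
Qwen 3.6 27B | 51 | 48 | 42 | 38 | 34 | 42.5%
GPT-5 Mini | 48 | 45 | 42 | 41 | 38 | 42.5%
Qwen 3.5 397B A17B | 51 | 46 | 43 | 40 | 31 | 42.2%
Gemma 4 31B (Reasoning) | 48 | 44 | 40 | 40 | 38 | 42.1%
GPT-5.4 Nano | 44 | 43 | 43 | 43 | 37 | 42.0%
GPT-5.4 Nano (Reasoning, Low) | 43 | 42 | 42 | 42 | 40 | 41.8%
Mistral Small 3.2 24B | 52 | 50 | 48 | 34 | 25 | 41.7%
Gemma 4 31B | 47 | 41 | 41 | 40 | 37 | 41.3%
Qwen 3.5 9B | 64 | 37 | 37 | 37 | 31 | 41.3%
Gemini 3 Flash (Preview) | 46 | 42 | 40 | 39 | 38 | 40.9%
Qwen3.7 Max | 46 | 43 | 39 | 39 | 34 | 40.2%
Qwen 3.5 27B | 45 | 42 | 40 | 39 | 35 | 40.1%
Qwen 3.5 122B | 46 | 42 | 38 | 36 | 34 | 39.0%
Gemma 4 26B | 48 | 39 | 37 | 35 | 34 | 38.6%
GPT-5 Nano | 41 | 40 | 39 | 37 | 35 | 38.3%
Gemma 4 26B (Reasoning) | 40 | 39 | 38 | 36 | 36 | 37.7%
Qwen 3.5 Flash | 43 | 39 | 38 | 33 | 32 | 36.9%
Gemini 3 Flash (Preview, Reasoning) | 45 | 40 | 37 | 31 | 30 | 36.8%
Inception Mercury | 50 | 39 | 30 | 25 | 25 | 33.7%
Qwen 3.5 35B | 39 | 36 | 33 | 27 | 25 | 32.2%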
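
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Grok 4.1 Fast | 100 | 96 | 93 | 93 | 87 | 93.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 93 | 84 | 77 | 90.8%
Hermes 3 70B | 100 | 98 | 86 | 82 | 71 | 87.6%
Rocinante 12B | 96 | 94 | 87 | 86 | 67 | 85.9%
Hermes 3 405B | 98 | 89 | 86 | 77 | 76 | 85.2%
Llama 3.1 Nemotron 70B | 100 | 93 | 83 | 81 | 68 | 85.0%
Claude 3.5 Sonnet | 99 | 97 | 77 | 77 | 64 | 82.7%
Llama 3.1 8B | 100 | 100 | 98 | 61 | 48 | 81.4%
GPT-4o, Aug. 6th (temp=1) | 97 | 90 | 85 | 77 | 57 | 81.1%
GPT-4o Mini (temp=1) | 96 | 82 | 80 | 79 | 68 | 81.1%
Claude Sonnet 4.5 | 88 | 83 | 79 | 78 | 67 | 79.0%
Claude 3 Haiku | 99 | 76 | 75 | 70 | 69 | 77.6%
Z.AI GLM 4.5 Air | 97 | 79 | 75 | 65 | 64 | 76.1%
DeepSeek V4 Pro | 86 | 84 | 82 | 78 | 48 | 75.7%
MiniMax M2.5 | 85 | 83 | 81 | 69 | 59 | 75.5%
Claude Opus 4.7 | 89 | 76 | 76 | 72 | 63 | 75.3%
Claude Sonnet 4 | 90 | 83 | 80 | 69 | 51 | 74.5%
Claude Opus 4 | 79 | 76 | 76 | 67 | 65 | 72.5%
Grok 4 Fast | 80 | 80 | 72 | 68 | 62 | 72.5%
Claude Sonnet 4.6 (Reasoning) | 80 | 79 | 73 | 67 | 63 | 72.4%
Writer: Palmyra X5 | 81 | 74 | 73 | 67 | 65 | 72.2%
DeepSeek V4 Flash | 91 | 77 | 73 | 67 | 52 | 71.8%
Z.AI GLM 5 Turbo | 80 | 72 | 71 | 69 | 66 | 71.8%
Claude Opus 4.5 | 94 | 72 | 66 | 65 | 61 | 71.8%
Claude 3.7 Sonnet | 78 | 76 | 73 | 65 | 64 | 71.2%
MiniMax M2.7 | 90 | 74 | 69 | 62 | 59 | 70.8%
DeepSeek V4 Flash (Reasoning) | 84 | 74 | 72 | 64 | 58 | 70.4%
Z.AI GLM 5.1 | 79 | 75 | 72 | 66 | 60 | 70.4%
DeepSeek-V2 Chat | 85 | 79 | 67 | 65 | 51 | 69.5%
Claude Sonnet 4.6 | 80 | 75 | 71 | 61 | 59 | 69.3%
Claude Haiku 4.5 | 76 | 75 | 73 | 64 | 58 | 69.3%
Cohere Command R+ (Aug. 2024) | 90 | 70 | 64 | 63 | 57 | 69.0%
GPT-4o, May 13th (temp=1) | 71 | 70 | 70 | 69 | 63 | 68.5%
Grok 4 | 77 | 72 | 71 | 63 | 56 | 67.7%
Qwen3.6 Max Preview | 83 | 72 | 68 | 64 | 51 | 67.6%
Z.AI GLM 4.5 | 74 | 66 | 66 | 65 | 65 | 67.2%
Z.AI GLM 5 | 81 | 71 | 65 | 60 | 58 | 67.1%
GPT-4.1 Mini | 73 | 69 | 66 | 63 | 63 | 66.9%
GPT-5.4 | 70 | 70 | 69 | 65 | 59 | 66.5%
MoonshotAI: Kimi K2.5 | 77 | 72 | 66 | 59 | 56 | 66.0%
Qwen 3.5 Plus (2026-04-20) | 83 | 73 | 69 | 53 | 49 | 65.3%
Claude Opus 4.6 (Reasoning) | 72 | 66 | 65 | 63 | 60 | 65.3%
Claude Opus 4.6 | 75 | 71 | 70 | 58 | 53 | 65.3%
Grok 4.20 (Reasoning) | 70 | 70 | 63 | 61 | 61 | 65.1%
Qwen3 235B A22B Instruct 2507 | 73 | 72 | 65 | 61 | 53 | 64.9%
Claude Opus 4.7 (Reasoning) | 76 | 66 | 62 | 60 | 59 | 64.6%
DeepSeek V4 Pro (Reasoning) | 85 | 78 | 60 | 56 | 42 | 64.3%
DeepSeek V3 (2024-12-26) | 75 | 73 | 66 | 57 | 48 | 64.0%
Grok 4.20 | 71 | 70 | 61 | 60 | 59 | 64.0%
Grok 4.20 (Beta, Reasoning) | 72 | 72 | 67 | 61 | 46 | 63.7%
GPT-4.1 | 71 | 67 | 62 | 62 | 50 | 62.8%
o4 Mini High | 77 | 72 | 59 | 56 | 49 | 62.6%
Gemini 3.5 Flash (Reasoning, Minimal) | 75 | 64 | 63 | 55 | 54 | 62.2%
GPT-5.4 (Reasoning, Low) | 67 | 67 | 63 | 59 | 53 | 61.7%
Xiaomi MIMO v2.5 Pro | 68 | 66 | 64 | 64 | 45 | 61.4%
MoonshotAI: Kimi K2.6 | 79 | 61 | 59 | 58 | 47 | 60.9%
GPT-4.1 Nano | 71 | 63 | 60 | 60 | 50 | 60.7%
Gemini 3.5 Flash (Reasoning) | 67 | 66 | 64 | 55 | 50 | 60.5%
Grok 4.20 (Beta) | 64 | 64 | 63 | 59 | 53 | 60.4%
GPT-5.4 (Reasoning) | 67 | 63 | 61 | 56 | 55 | 60.4%
Mistral Small 4 (Reasoning) | 87 | 60 | 56 | 47 | 46 | 59.2%
Gemma 3 12B | 70 | 70 | 66 | 46 | 40 | 58.3%
Arcee AI: Trinity Large (Preview) | 66 | 64 | 63 | 53 | 43 | 57.8%
Gemini 3.1 Pro (Preview) | 67 | 63 | 56 | 53 | 48 | 57.2%
Grok 4.3 (Reasoning) | 74 | 56 | 56 | 52 | 49 | 57.2%
WizardLM 2 8x22b | 66 | 66 | 56 | 55 | 42 | 57.0%
Gemma 3 4B | 66 | 63 | 62 | 51 | 43 | 56.9%
Qwen 3.5 397B A17B | 74 | 58 | 54 | 53 | 43 | 56.2%
Aion 2.0 | 71 | 55 | 55 | 54 | 46 | 56.1%
GPT-4o Mini (temp=0) | 64 | 58 | 53 | 52 | 50 | 55.5%
Llama 3.1 70B | 67 | 58 | 53 | 50 | 46 | 55.1%
GPT-5.4 Mini (Reasoning) | 59 | 56 | 55 | 54 | 50 | 54.9%
Grok 4.3 | 66 | 62 | 55 | 47 | 43 | 54.6%
Gemma 3 27B | 58 | 57 | 55 | 52 | 51 | 54.6%
Arcee AI: Trinity Mini | 67 | 67 | 47 | 46 | 46 | 54.5%
GPT-5.5 (Reasoning) | 57 | 57 | 53 | 53 | 51 | 54.2%
Z.AI GLM 4.7 Flash | 63 | 55 | 55 | 50 | 49 | 54.1%
Gemini 3 Pro (Preview) | 62 | 59 | 54 | 53 | 42 | 53.9%
ByteDance Seed 1.6 Flash | 67 | 60 | 60 | 43 | 39 | 53.8%
Qwen 3 32B | 77 | 65 | 46 | 43 | 37 | 53.5%
GPT-5.1 | 61 | 56 | 54 | 53 | 42 | 53.2%
ByteDance Seed 2.0 Mini | 62 | 55 | 54 | 53 | 41 | 53.2%
Gemini 2.5 Flash | 56 | 54 | 54 | 53 | 49 | 53.2%
Mistral Small Creative | 62 | 53 | 53 | 49 | 48 | 53.1%
o4 Mini | 71 | 55 | 51 | 47 | 41 | 53.0%
GPT-5.4 Mini | 65 | 54 | 50 | 49 | 45 | 52.8%
GPT-5.4 Mini (Reasoning, Low) | 59 | 52 | 51 | 51 | 50 | 52.5%
ByteDance Seed 1.6 | 59 | 54 | 53 | 51 | 45 | 52.4%
Gemini 2.5 Flash (Reasoning) | 67 | 57 | 52 | 44 | 40 | 52.2%
Qwen 3.5 Plus (2026-02-15) | 60 | 59 | 50 | 48 | 42 | 52.0%
Qwen 3.6 Flash | 70 | 64 | 50 | 38 | 36 | 51.6%
ByteDance Seed 2.0 Lite | 68 | 63 | 55 | 41 | 29 | 51.4%
Mistral Medium 3.1 | 61 | 53 | 51 | 49 | 42 | 51.3%
GPT-5.5 (Reasoning, Low) | 55 | 54 | 51 | 48 | 47 | 51.0%
LFM2 24B | 63 | 58 | 46 | 46 | 42 | 51.0%
GPT-5.5 | 59 | 54 | 50 | 50 | 42 | 51.0%
Nemotron 3 Super | 56 | 54 | 50 | 47 | 47 | 50.8%
Stealth: Hunter Alpha | 61 | 55 | 54 | 45 | 36 | 50.3%
Mistral Small 4 | 74 | 54 | 44 | 41 | 36 | 49.8%
DeepSeek V3.2 | 57 | 56 | 47 | 46 | 42 | 49.5%
Nemotron 3 Nano | 63 | 58 | 50 | 40 | 33 | 49.1%
Mistral Large 2 | 61 | 55 | 43 | 43 | 42 | 48.6%
Qwen 3.6 27B | 76 | 50 | 48 | 37 | 31 | 48.6%
Qwen 3.6 35B | 57 | 55 | 52 | 51 | 28 | 48.5%
Stealth: Healer Alpha | 56 | 54 | 49 | 46 | 38 | 48.3%
Ministral 3B | 62 | 50 | 47 | 43 | 39 | 48.1%
Gemini 3.1 Flash Lite | 67 | 53 | 43 | 40 | 36 | 47.8%
GPT-4o, Aug. 6th (temp=0) | 59 | 47 | 45 | 43 | 42 | 47.5%
Gemini 2.5 Flash Lite | 54 | 53 | 48 | 41 | 39 | 47.0%
Gemini 2.5 Pro | 54 | 52 | 46 | 42 | 41 | 47.0%
Gemini 2.5 Flash Lite (Reasoning) | 61 | 54 | 46 | 37 | 35 | 46.7%
Mistral Large | 57 | 45 | 45 | 44 | 39 | 46.0%
Inception Mercury 2 | 63 | 43 | 41 | 41 | 40 | 45.7%
Ministral 3 14B | 60 | 44 | 43 | 43 | 38 | 45.5%
DeepSeek V3.1 | 49 | 49 | 46 | 43 | 39 | 45.2%
Xiaomi MIMO v2.5 | 49 | 47 | 46 | 42 | 41 | 45.1%
Z.AI GLM 4.7 | 52 | 48 | 43 | 42 | 41 | 45.1%
Gemini 3.1 Flash Lite (Reasoning) | 52 | 50 | 42 | 41 | 40 | 44.9%
Ministral 8B | 48 | 47 | 44 | 43 | 42 | 44.8%
Qwen 2.5 72B | 47 | 45 | 45 | 45 | 41 | 44.6%
Ministral 3 8B | 50 | 46 | 42 | 42 | 41 | 44.1%
Ministral 3 3B | 56 | 45 | 42 | 38 | 38 | 43.9%
GPT-4o, May 13th (temp=0) | 48 | 45 | 44 | 42 | 40 | 43.9%
Z.AI GLM 4.6 | 56 | 49 | 48 | 38 | 29 | 43.8%
Gemini 3 Flash (Preview, Reasoning) | 53 | 43 | 42 | 40 | 36 | 43.1%
GPT-OSS 120B | 57 | 41 | 40 | 38 | 38 | 42.6%
Stealth: Aurora Alpha | 49 | 46 | 41 | 40 | 37 | 42.5%
Gemini 3.1 Flash Lite (Preview) | 52 | 51 | 38 | 36 | 35 | 42.2%
GPT-5.2 | 44 | 43 | 43 | 41 | 40 | 42.2%
Mistral Large 3 | 48 | 42 | 41 | 41 | 37 | 42.1%
GPT-5 | 50 | 43 | 41 | 39 | 37 | 42.1%
Qwen 3.5 9B | 70 | 42 | 38 | 30 | 28 | 41.7%
GPT-5.4 Nano (Reasoning) | 45 | 42 | 41 | 40 | 39 | 41.4%
GPT-5.4 Nano (Reasoning, Low) | 43 | 43 | 42 | 40 | 39 | 41.3%
Gemini 3 Flash (Preview) | 50 | 43 | 41 | 37 | 33 | 40.9%
GPT-5.4 Nano | 44 | 42 | 40 | 40 | 37 | 40.5%
Qwen 3.5 Flash | 51 | 40 | 40 | 36 | 32 | 39.9%
Qwen3.7 Max | 41 | 41 | 40 | 38 | 38 | 39.6%
Gemma 4 26B | 44 | 44 | 38 | 34 | 31 | 38.3%
Gemma 4 31B | 42 | 40 | 39 | 37 | 32 | 37.9%
Gemma 4 31B (Reasoning) | 45 | 45 | 39 | 32 | 29 | 37.9%
Qwen 3.5 35B | 39 | 37 | 37 | 37 | 36 | 37.3%
GPT-5 Mini | 37 | 36 | 36 | 34 | 34 | 35.4%
Gemma 4 26B (Reasoning) | 38 | 38 | 34 | 32 | 32 | 34.8%
GPT-5 Nano | 37 | 36 | 34 | 31 | 30 | 33.5%
Qwen 3.5 122B | 38 | 37 | 35 | 31 | 25 | 33.2%
Mistral NeMO | 42 | 39 | 33 | 25 | 25 | 32.8%
Qwen 3.5 27B | 37 | 36 | 30 | 28 | 25 | 31.3%
Inception Mercury | 35 | 33 | 29 | 25 | 25 | 29.5%
Mistral Small 3.2 24B | 25 | 25 | 25 | 25 | 25 | 25.0%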

genre

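Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Rocinante 12B | 100 | 96 | 95 | 88 | 78 | 91.4%
Llama 3.1 8B | 100 | 97 | 87 | 83 | 60 | 85.5%
GPT-4o Mini (temp=1) | 96 | 92 | 86 | 79 | 71 | 85.0%
ByteDance Seed 2.0 Lite | 96 | 94 | 87 | 77 | 64 | 83.5%
GPT-4o, Aug. 6th (temp=1) | 100 | 96 | 85 | 69 | 64 | 82.8%
Llama 3.1 Nemotron 70B | 96 | 95 | 94 | 71 | 55 | 82.2%
Claude 3.5 Sonnet | 92 | 87 | 86 | 77 | 65 | 81.3%
Grok 4.1 Fast | 86 | 84 | 83 | 76 | 71 | 80.0%
Claude Sonnet 4 | 99 | 97 | 84 | 67 | 53 | 79.8%
Hermes 3 405B | 100 | 81 | 74 | 72 | 69 | 79.0%
Hermes 3 70B | 100 | 79 | 76 | 73 | 67 | 78.9%
DeepSeek V3 (2025-03-24) | 98 | 89 | 74 | 72 | 62 | 78.8%
ByteDance Seed 1.6 | 90 | 81 | 77 | 72 | 65 | 77.0%
Claude Opus 4.7 | 86 | 78 | 77 | 76 | 65 | 76.1%
GPT-4.1 | 87 | 81 | 77 | 71 | 62 | 75.6%
Gemma 3 4B | 82 | 75 | 75 | 71 | 68 | 73.9%
Claude Opus 4.7 (Reasoning) | 80 | 79 | 73 | 68 | 61 | 72.4%
Grok 4 Fast | 79 | 76 | 72 | 68 | 63 | 71.8%
GPT-4.1 Mini | 83 | 83 | 74 | 64 | 53 | 71.3%
GPT-4o, May 13th (temp=1) | 78 | 77 | 67 | 66 | 66 | 70.7%
Claude 3 Haiku | 76 | 72 | 71 | 71 | 62 | 70.3%
Z.AI GLM 5 Turbo | 79 | 73 | 71 | 64 | 64 | 70.0%
Claude Opus 4 | 79 | 76 | 71 | 66 | 56 | 69.4%
Grok 4 | 72 | 71 | 71 | 67 | 62 | 68.6%
Z.AI GLM 5 | 87 | 82 | 70 | 53 | 48 | 68.1%
Claude Haiku 4.5 | 85 | 72 | 71 | 58 | 50 | 67.2%
Claude Sonnet 4.5 | 86 | 74 | 65 | 57 | 53 | 67.0%
Cohere Command R+ (Aug. 2024) | 86 | 71 | 65 | 57 | 53 | 66.6%
Qwen 3 32B | 85 | 75 | 68 | 57 | 47 | 66.4%
Gemini 3.5 Flash (Reasoning, Minimal) | 86 | 68 | 62 | 59 | 56 | 66.3%
DeepSeek V3 (2024-12-26) | 75 | 69 | 69 | 64 | 55 | 66.2%
Claude 3.7 Sonnet | 77 | 67 | 64 | 63 | 60 | 66.1%
GPT-4o Mini (temp=0) | 74 | 70 | 65 | 60 | 59 | 65.7%
Gemma 3 12B | 71 | 70 | 66 | 64 | 57 | 65.6%
DeepSeek-V2 Chat | 73 | 69 | 65 | 64 | 53 | 65.0%
Claude Opus 4.6 (Reasoning) | 82 | 71 | 61 | 59 | 52 | 65.0%
GPT-4.1 Nano | 78 | 68 | 62 | 55 | 53 | 63.2%
Z.AI GLM 4.5 | 66 | 65 | 63 | 63 | 58 | 63.0%
GPT-5.1 | 70 | 66 | 62 | 59 | 53 | 62.2%
Gemma 3 27B | 68 | 63 | 62 | 61 | 55 | 61.8%
Llama 3.1 70B | 81 | 71 | 59 | 53 | 45 | 61.6%
Qwen 3.5 Plus (2026-02-15) | 66 | 64 | 60 | 59 | 58 | 61.4%
DeepSeek V4 Flash | 77 | 69 | 65 | 53 | 42 | 61.0%
Claude Sonnet 4.6 | 67 | 65 | 63 | 59 | 51 | 60.9%
Writer: Palmyra X5 | 73 | 60 | 59 | 57 | 56 | 60.8%
Claude Opus 4.6 | 67 | 65 | 61 | 58 | 51 | 60.5%
Z.AI GLM 5.1 | 78 | 68 | 60 | 49 | 46 | 60.3%
Claude Sonnet 4.6 (Reasoning) | 80 | 65 | 55 | 55 | 46 | 60.0%
MoonshotAI: Kimi K2.6 | 65 | 64 | 63 | 54 | 52 | 59.7%
GPT-5.4 (Reasoning) | 74 | 60 | 57 | 55 | 53 | 59.6%
DeepSeek V4 Flash (Reasoning) | 78 | 61 | 60 | 50 | 49 | 59.6%
Gemini 3.1 Pro (Preview) | 75 | 64 | 59 | 54 | 43 | 59.3%
GPT-4o, Aug. 6th (temp=0) | 70 | 64 | 62 | 51 | 49 | 59.1%
Claude Opus 4.5 | 66 | 63 | 62 | 57 | 47 | 59.0%
Gemini 3 Flash (Preview, Reasoning) | 77 | 58 | 57 | 52 | 50 | 58.9%
GPT-5.5 (Reasoning, Low) | 63 | 62 | 59 | 57 | 51 | 58.5%
Gemini 2.5 Flash (Reasoning) | 66 | 64 | 62 | 51 | 49 | 58.3%
ByteDance Seed 2.0 Mini | 63 | 62 | 59 | 55 | 52 | 58.2%
Grok 4.20 (Reasoning) | 60 | 60 | 58 | 55 | 55 | 57.8%
WizardLM 2 8x22b | 63 | 59 | 57 | 55 | 50 | 57.0%
Qwen3 235B A22B Instruct 2507 | 61 | 59 | 58 | 54 | 52 | 56.7%
Gemini 3.5 Flash (Reasoning) | 69 | 58 | 55 | 54 | 46 | 56.4%
GPT-5.4 | 59 | 57 | 56 | 56 | 54 | 56.2%
Gemini 3 Flash (Preview) | 79 | 57 | 51 | 50 | 44 | 56.1%
Arcee AI: Trinity Mini | 66 | 63 | 61 | 48 | 42 | 56.1%
GPT-5.4 (Reasoning, Low) | 65 | 58 | 53 | 53 | 51 | 56.0%
GPT-5 Mini | 64 | 60 | 54 | 53 | 48 | 55.9%
Z.AI GLM 4.7 Flash | 70 | 63 | 55 | 47 | 45 | 55.8%
Gemma 4 31B | 68 | 59 | 54 | 50 | 47 | 55.6%
Gemini 2.5 Flash | 75 | 58 | 51 | 48 | 46 | 55.3%
Gemini 2.5 Flash Lite | 72 | 60 | 51 | 48 | 45 | 55.1%
GPT-5.4 Mini (Reasoning) | 69 | 58 | 54 | 50 | 44 | 54.9%
DeepSeek V3.1 | 64 | 64 | 53 | 52 | 41 | 54.9%
Arcee AI: Trinity Large (Preview) | 61 | 55 | 55 | 52 | 50 | 54.7%
Grok 4.3 | 61 | 60 | 54 | 51 | 46 | 54.5%
MiniMax M2.5 | 67 | 64 | 54 | 46 | 39 | 54.0%
Z.AI GLM 4.5 Air | 64 | 60 | 59 | 46 | 41 | 53.9%
Grok 4.20 (Beta) | 56 | 55 | 54 | 54 | 49 | 53.6%
o4 Mini | 61 | 59 | 53 | 49 | 46 | 53.6%
Qwen 3.5 397B A17B | 65 | 52 | 52 | 51 | 47 | 53.3%
Grok 4.20 (Beta, Reasoning) | 60 | 59 | 51 | 50 | 46 | 53.3%
DeepSeek V4 Pro (Reasoning) | 72 | 57 | 49 | 47 | 42 | 53.1%
Gemini 2.5 Pro | 62 | 58 | 50 | 49 | 46 | 53.1%
MoonshotAI: Kimi K2.5 | 65 | 56 | 50 | 46 | 46 | 52.8%
GPT-4o, May 13th (temp=0) | 65 | 56 | 49 | 47 | 46 | 52.6%
Z.AI GLM 4.7 | 62 | 52 | 52 | 50 | 47 | 52.6%
Gemini 3.1 Flash Lite | 58 | 53 | 53 | 49 | 49 | 52.5%
Grok 4.20 | 56 | 54 | 53 | 52 | 45 | 52.3%
ByteDance Seed 1.6 Flash | 64 | 59 | 53 | 45 | 41 | 52.3%
Qwen 2.5 72B | 63 | 60 | 49 | 45 | 45 | 52.2%
Mistral Large 3 | 59 | 56 | 54 | 49 | 43 | 52.1%
Xiaomi MIMO v2.5 | 57 | 56 | 51 | 49 | 47 | 52.0%
MiniMax M2.7 | 64 | 62 | 50 | 46 | 39 | 52.0%
DeepSeek V4 Pro | 58 | 54 | 50 | 49 | 48 | 51.9%
Z.AI GLM 4.6 | 56 | 56 | 55 | 47 | 45 | 51.9%
Mistral Large 2 | 66 | 54 | 47 | 47 | 45 | 51.8%
Mistral Medium 3.1 | 60 | 54 | 51 | 49 | 43 | 51.7%
Mistral Large | 66 | 60 | 49 | 42 | 42 | 51.5%
o4 Mini High | 56 | 56 | 53 | 52 | 40 | 51.4%
Qwen 3.6 27B | 64 | 59 | 48 | 47 | 39 | 51.4%
Stealth: Healer Alpha | 63 | 57 | 49 | 46 | 42 | 51.3%
Qwen 3.5 Plus (2026-04-20) | 68 | 55 | 52 | 42 | 40 | 51.2%
GPT-5.5 (Reasoning) | 57 | 55 | 51 | 49 | 43 | 51.2%
Ministral 3B | 72 | 54 | 43 | 42 | 42 | 50.7%
Grok 4.3 (Reasoning) | 56 | 53 | 49 | 48 | 47 | 50.6%
Mistral Small 4 | 63 | 54 | 52 | 41 | 40 | 50.1%
Gemma 4 31B (Reasoning) | 56 | 52 | 49 | 47 | 45 | 49.9%
Stealth: Hunter Alpha | 63 | 54 | 49 | 45 | 37 | 49.7%
Gemma 4 26B (Reasoning) | 54 | 54 | 51 | 46 | 43 | 49.6%
Gemini 2.5 Flash Lite (Reasoning) | 64 | 55 | 46 | 42 | 40 | 49.6%
Gemini 3.1 Flash Lite (Reasoning) | 67 | 47 | 46 | 45 | 43 | 49.2%
Mistral NeMO | 62 | 55 | 53 | 37 | 37 | 49.1%
GPT-5.4 Mini (Reasoning, Low) | 56 | 53 | 45 | 44 | 44 | 48.7%
Mistral Small Creative | 65 | 46 | 46 | 43 | 41 | 48.2%
Gemini 3 Pro (Preview) | 59 | 58 | 44 | 40 | 40 | 48.2%
Gemini 3.1 Flash Lite (Preview) | 69 | 58 | 39 | 39 | 36 | 48.2%
Qwen 3.5 35B | 80 | 42 | 41 | 41 | 37 | 48.1%
GPT-5.4 Mini | 56 | 49 | 47 | 45 | 43 | 47.9%
Gemma 4 26B | 54 | 48 | 47 | 45 | 45 | 47.7%
Ministral 3 8B | 56 | 48 | 46 | 45 | 42 | 47.5%
GPT-5.5 | 61 | 46 | 45 | 44 | 42 | 47.5%
Qwen3.6 Max Preview | 54 | 52 | 49 | 42 | 40 | 47.4%
Nemotron 3 Super | 54 | 50 | 48 | 45 | 41 | 47.4%
Qwen3.7 Max | 50 | 49 | 48 | 46 | 44 | 47.4%
DeepSeek V3.2 | 58 | 47 | 46 | 43 | 40 | 46.8%
Xiaomi MIMO v2.5 Pro | 56 | 49 | 48 | 40 | 40 | 46.5%
LFM2 24B | 54 | 49 | 47 | 42 | 40 | 46.5%
GPT-5 | 53 | 50 | 46 | 44 | 40 | 46.5%
Qwen 3.5 122B | 53 | 52 | 43 | 43 | 41 | 46.3%
Aion 2.0 | 51 | 50 | 47 | 42 | 41 | 46.1%
Qwen 3.6 Flash | 57 | 55 | 42 | 38 | 36 | 45.7%
GPT-5.2 | 47 | 45 | 44 | 43 | 42 | 44.3%
Qwen 3.5 Flash | 49 | 47 | 44 | 42 | 37 | 43.8%
GPT-5.4 Nano (Reasoning) | 45 | 44 | 43 | 43 | 42 | 43.2%
Ministral 3 14B | 47 | 43 | 43 | 42 | 38 | 42.6%
Mistral Small 4 (Reasoning) | 48 | 42 | 41 | 40 | 39 | 42.3%
GPT-5.4 Nano | 44 | 42 | 42 | 41 | 41 | 41.9%
GPT-5.4 Nano (Reasoning, Low) | 43 | 42 | 42 | 41 | 41 | 41.8%
Qwen 3.6 35B | 48 | 45 | 43 | 35 | 34 | 41.0%
Qwen 3.5 27B | 43 | 43 | 41 | 41 | 37 | 40.9%
Nemotron 3 Nano | 48 | 46 | 38 | 38 | 34 | 40.9%
Ministral 8B | 50 | 42 | 39 | 37 | 37 | 40.9%
GPT-OSS 120B | 44 | 42 | 40 | 39 | 37 | 40.2%
Ministral 3 3B | 42 | 41 | 39 | 38 | 36 | 39.4%
Inception Mercury 2 | 42 | 40 | 39 | 37 | 33 | 38.2%
Qwen 3.5 9B | 45 | 43 | 39 | 31 | 27 | 37.0%
Stealth: Aurora Alpha | 39 | 38 | 36 | 32 | 30 | 35.0%
Mistral Small 3.2 24B | 50 | 43 | 29 | 25 | 25 | 34.3%
GPT-5 Nano | 35 | 35 | 33 | 31 | 31 | 32.9%
Inception Mercury | 38 | 31 | 26 | 25 | 25 | 29.0%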
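
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Llama 3.1 8B | 99 | 96 | 95 | 86 | 43 | 83.6%
Grok 4 | 100 | 99 | 70 | 64 | 62 | 79.0%
Rocinante 12B | 96 | 90 | 76 | 73 | 55 | 78.1%
Grok 4.1 Fast | 80 | 80 | 77 | 75 | 74 | 77.1%
GPT-4o, Aug. 6th (temp=1) | 85 | 79 | 66 | 63 | 61 | 70.7%
GPT-4o Mini (temp=1) | 80 | 69 | 69 | 67 | 65 | 70.0%
Grok 4 Fast | 100 | 69 | 62 | 61 | 57 | 69.9%
Llama 3.1 Nemotron 70B | 83 | 77 | 72 | 64 | 51 | 69.6%
DeepSeek V3 (2025-03-24) | 77 | 72 | 64 | 64 | 51 | 65.4%
Hermes 3 405B | 83 | 69 | 59 | 58 | 47 | 63.2%
Claude Opus 4 | 69 | 67 | 56 | 55 | 55 | 60.3%
GPT-4o, May 13th (temp=1) | 74 | 59 | 59 | 57 | 52 | 60.1%
Hermes 3 70B | 73 | 72 | 62 | 52 | 41 | 59.9%
Claude Sonnet 4 | 63 | 62 | 58 | 57 | 52 | 58.4%
Claude Sonnet 4.5 | 63 | 62 | 59 | 56 | 49 | 57.9%
Claude Opus 4.7 | 69 | 59 | 59 | 54 | 49 | 57.8%
Cohere Command R+ (Aug. 2024) | 69 | 68 | 53 | 50 | 47 | 57.3%
Claude Opus 4.7 (Reasoning) | 62 | 60 | 55 | 55 | 51 | 56.7%
GPT-5.4 | 62 | 57 | 56 | 53 | 52 | 56.0%
Arcee AI: Trinity Large (Preview) | 73 | 59 | 56 | 47 | 45 | 55.9%
Claude 3.5 Sonnet | 69 | 56 | 53 | 53 | 46 | 55.4%
GPT-4.1 Mini | 59 | 59 | 57 | 56 | 46 | 55.3%
Claude 3 Haiku | 79 | 52 | 50 | 47 | 47 | 55.0%
DeepSeek-V2 Chat | 76 | 62 | 55 | 44 | 38 | 54.8%
Claude Haiku 4.5 | 68 | 57 | 49 | 49 | 46 | 53.6%
DeepSeek V4 Pro (Reasoning) | 85 | 52 | 48 | 46 | 36 | 53.6%
Claude 3.7 Sonnet | 55 | 54 | 54 | 53 | 49 | 53.1%
Z.AI GLM 4.5 | 63 | 54 | 51 | 50 | 46 | 52.9%
Writer: Palmyra X5 | 61 | 56 | 53 | 49 | 44 | 52.6%
Claude Opus 4.5 | 65 | 59 | 51 | 45 | 42 | 52.5%
GPT-4.1 | 62 | 52 | 52 | 49 | 42 | 51.7%
Gemma 3 27B | 54 | 53 | 51 | 48 | 42 | 49.6%
Z.AI GLM 5 Turbo | 57 | 54 | 52 | 41 | 35 | 48.0%
GPT-5.4 (Reasoning) | 55 | 52 | 46 | 44 | 43 | 47.7%
GPT-4.1 Nano | 63 | 48 | 43 | 43 | 40 | 47.4%
Claude Sonnet 4.6 | 53 | 50 | 48 | 43 | 41 | 47.2%
Mistral NeMO | 57 | 56 | 50 | 39 | 34 | 47.1%
LFM2 24B | 53 | 50 | 46 | 44 | 42 | 46.9%
Gemma 3 12B | 58 | 48 | 45 | 44 | 40 | 46.9%
Gemini 2.5 Flash | 59 | 48 | 45 | 39 | 39 | 46.1%
Qwen3 235B A22B Instruct 2507 | 58 | 46 | 45 | 43 | 37 | 45.8%
DeepSeek V3 (2024-12-26) | 52 | 48 | 47 | 43 | 37 | 45.3%
o4 Mini High | 58 | 51 | 42 | 40 | 35 | 45.2%
Llama 3.1 70B | 48 | 48 | 46 | 44 | 39 | 44.9%
MiniMax M2.5 | 62 | 50 | 43 | 42 | 25 | 44.4%
Mistral Large 2 | 62 | 53 | 36 | 36 | 34 | 44.4%
GPT-5.4 (Reasoning, Low) | 50 | 47 | 41 | 41 | 40 | 44.0%
GPT-5.1 | 49 | 48 | 44 | 39 | 39 | 43.9%
GPT-4o Mini (temp=0) | 46 | 44 | 44 | 43 | 42 | 43.8%
Claude Sonnet 4.6 (Reasoning) | 58 | 49 | 45 | 34 | 33 | 43.7%
Qwen 2.5 72B | 58 | 42 | 41 | 39 | 38 | 43.4%
Z.AI GLM 5.1 | 54 | 53 | 40 | 36 | 34 | 43.3%
Gemma 3 4B | 49 | 46 | 41 | 40 | 39 | 43.1%
Grok 4.20 | 45 | 44 | 43 | 43 | 40 | 43.1%
Gemini 2.5 Flash Lite | 49 | 45 | 42 | 40 | 39 | 43.0%
GPT-4o, Aug. 6th (temp=0) | 49 | 46 | 42 | 40 | 37 | 43.0%
Grok 4.20 (Beta, Reasoning) | 46 | 44 | 42 | 42 | 40 | 42.9%
Z.AI GLM 5 | 48 | 44 | 42 | 42 | 39 | 42.8%
Grok 4.3 (Reasoning) | 45 | 44 | 43 | 41 | 40 | 42.5%
Qwen 3 32B | 49 | 42 | 42 | 41 | 36 | 42.1%
MoonshotAI: Kimi K2.6 | 52 | 44 | 41 | 36 | 35 | 41.8%
DeepSeek V4 Flash (Reasoning) | 55 | 46 | 38 | 35 | 35 | 41.7%
Grok 4.3 | 46 | 43 | 41 | 39 | 39 | 41.6%
Grok 4.20 (Beta) | 48 | 44 | 39 | 39 | 39 | 41.5%
Claude Opus 4.6 (Reasoning) | 47 | 46 | 41 | 40 | 33 | 41.4%
Grok 4.20 (Reasoning) | 46 | 44 | 40 | 39 | 38 | 41.3%
o4 Mini | 44 | 43 | 42 | 42 | 35 | 41.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 47 | 45 | 40 | 37 | 36 | 40.9%
Gemini 2.5 Flash (Reasoning) | 59 | 39 | 36 | 35 | 34 | 40.9%
GPT-5.4 Mini (Reasoning, Low) | 46 | 41 | 40 | 39 | 39 | 40.8%
GPT-5.4 Mini | 42 | 41 | 41 | 41 | 38 | 40.7%
Mistral Medium 3.1 | 48 | 45 | 41 | 34 | 34 | 40.6%
DeepSeek V4 Pro | 48 | 46 | 40 | 36 | 33 | 40.5%
DeepSeek V4 Flash | 44 | 44 | 39 | 37 | 37 | 40.2%
WizardLM 2 8x22b | 46 | 41 | 40 | 39 | 36 | 40.2%
Nemotron 3 Super | 46 | 44 | 39 | 36 | 34 | 40.0%
GPT-5.5 | 43 | 40 | 40 | 39 | 38 | 40.0%
Mistral Small 4 (Reasoning) | 45 | 43 | 42 | 38 | 30 | 39.7%
Qwen 3.6 35B | 49 | 40 | 39 | 37 | 33 | 39.6%
Qwen3.6 Max Preview | 45 | 40 | 40 | 40 | 32 | 39.3%
Aion 2.0 | 43 | 41 | 38 | 38 | 36 | 39.3%
Mistral Small Creative | 45 | 39 | 39 | 38 | 35 | 39.2%
Mistral Large 3 | 46 | 41 | 40 | 38 | 31 | 39.1%
Gemini 2.5 Pro | 46 | 41 | 40 | 35 | 34 | 39.1%
GPT-5.5 (Reasoning) | 41 | 39 | 39 | 38 | 38 | 39.0%
Qwen 3.5 Plus (2026-04-20) | 44 | 43 | 37 | 36 | 34 | 38.9%
Gemini 3.1 Pro (Preview) | 45 | 44 | 37 | 34 | 34 | 38.8%
GPT-5.4 Mini (Reasoning) | 44 | 39 | 37 | 37 | 36 | 38.7%
ByteDance Seed 1.6 | 62 | 35 | 34 | 31 | 31 | 38.6%
GPT-5.5 (Reasoning, Low) | 40 | 39 | 38 | 38 | 38 | 38.4%
Mistral Small 4 | 42 | 41 | 38 | 34 | 34 | 38.0%
MoonshotAI: Kimi K2.5 | 42 | 41 | 38 | 35 | 35 | 38.0%
ByteDance Seed 1.6 Flash | 44 | 38 | 38 | 38 | 32 | 37.9%
ByteDance Seed 2.0 Lite | 48 | 43 | 37 | 34 | 28 | 37.8%
Gemini 2.5 Flash Lite (Reasoning) | 44 | 41 | 40 | 34 | 30 | 37.7%
Qwen 3.5 Plus (2026-02-15) | 47 | 43 | 35 | 32 | 31 | 37.7%
Xiaomi MIMO v2.5 Pro | 41 | 39 | 38 | 36 | 34 | 37.6%
GPT-5.4 Nano | 40 | 39 | 39 | 35 | 35 | 37.6%
DeepSeek V3.2 | 44 | 40 | 35 | 35 | 33 | 37.5%
Gemini 3.5 Flash (Reasoning) | 43 | 39 | 38 | 36 | 32 | 37.3%
Gemini 3 Flash (Preview) | 44 | 40 | 36 | 35 | 32 | 37.3%
GPT-4o, May 13th (temp=0) | 44 | 42 | 37 | 33 | 30 | 37.2%
Qwen3.7 Max | 46 | 40 | 35 | 33 | 32 | 37.0%
Xiaomi MIMO v2.5 | 43 | 41 | 36 | 34 | 30 | 37.0%
Gemini 3 Pro (Preview) | 41 | 40 | 37 | 35 | 32 | 36.9%
GPT-OSS 120B | 39 | 38 | 36 | 36 | 34 | 36.6%
Ministral 3 14B | 39 | 38 | 36 | 36 | 34 | 36.5%
GPT-5.2 | 40 | 39 | 36 | 34 | 33 | 36.5%
Ministral 3 8B | 39 | 37 | 36 | 36 | 34 | 36.4%
Mistral Large | 45 | 38 | 35 | 34 | 30 | 36.4%
Qwen 3.6 Flash | 45 | 36 | 35 | 33 | 31 | 36.2%
Claude Opus 4.6 | 40 | 39 | 39 | 33 | 31 | 36.2%
Z.AI GLM 4.6 | 39 | 38 | 37 | 37 | 29 | 36.1%
Z.AI GLM 4.7 | 43 | 37 | 37 | 34 | 26 | 35.4%
DeepSeek V3.1 | 39 | 38 | 36 | 33 | 31 | 35.4%
Arcee AI: Trinity Mini | 39 | 37 | 35 | 34 | 32 | 35.4%
GPT-5.4 Nano (Reasoning) | 37 | 36 | 36 | 35 | 32 | 35.2%
Qwen 3.5 122B | 43 | 35 | 35 | 33 | 31 | 35.2%
GPT-5.4 Nano (Reasoning, Low) | 37 | 36 | 36 | 36 | 32 | 35.2%
ByteDance Seed 2.0 Mini | 45 | 42 | 33 | 29 | 25 | 35.0%
Inception Mercury 2 | 39 | 38 | 37 | 33 | 27 | 34.7%
GPT-5 | 38 | 37 | 34 | 31 | 30 | 34.1%
Ministral 8B | 36 | 36 | 35 | 33 | 29 | 34.1%
Qwen 3.5 Flash | 40 | 39 | 34 | 31 | 26 | 33.9%
Ministral 3B | 36 | 36 | 33 | 32 | 31 | 33.8%
GPT-5 Mini | 36 | 35 | 35 | 31 | 31 | 33.7%
Gemini 3 Flash (Preview, Reasoning) | 38 | 36 | 32 | 32 | 31 | 33.7%
Qwen 3.6 27B | 40 | 36 | 34 | 29 | 29 | 33.6%
Qwen 3.5 27B | 41 | 33 | 32 | 32 | 31 | 33.5%
Z.AI GLM 4.5 Air | 36 | 35 | 35 | 34 | 27 | 33.5%
Gemma 4 31B | 37 | 35 | 34 | 31 | 31 | 33.4%
Gemini 3.1 Flash Lite (Reasoning) | 37 | 35 | 33 | 31 | 30 | 33.2%
Z.AI GLM 4.7 Flash | 35 | 34 | 34 | 33 | 30 | 33.2%
Ministral 3 3B | 38 | 35 | 34 | 31 | 27 | 33.0%
MiniMax M2.7 | 35 | 35 | 33 | 30 | 30 | 32.7%
Stealth: Hunter Alpha | 39 | 36 | 33 | 30 | 26 | 32.7%
Gemma 4 31B (Reasoning) | 38 | 34 | 33 | 33 | 25 | 32.5%
Gemma 4 26B (Reasoning) | 38 | 35 | 31 | 29 | 28 | 32.2%
Qwen 3.5 397B A17B | 34 | 33 | 32 | 31 | 30 | 31.9%
Stealth: Healer Alpha | 34 | 32 | 32 | 31 | 30 | 31.8%
Nemotron 3 Nano | 35 | 34 | 34 | 29 | 26 | 31.5%
Gemini 3.1 Flash Lite | 37 | 32 | 32 | 28 | 27 | 31.2%
Stealth: Aurora Alpha | 36 | 36 | 28 | 27 | 26 | 30.6%
Gemma 4 26B | 37 | 33 | 30 | 27 | 25 | 30.5%
Qwen 3.5 9B | 33 | 31 | 31 | 30 | 26 | 30.4%
Gemini 3.1 Flash Lite (Preview) | 34 | 33 | 31 | 27 | 26 | 30.1%
Qwen 3.5 35B | 34 | 33 | 30 | 27 | 25 | 29.7%
Inception Mercury | 40 | 30 | 28 | 25 | 25 | 29.6%
GPT-5 Nano | 32 | 27 | 26 | 25 | 25 | 27.0%
Mistral Small 3.2 24B | 25 | 25 | 25 | 25 | 25 | 25.0%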
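
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
GPT-4o, Aug. 6th (temp=1) | 99 | 95 | 85 | 79 | 73 | 86.2%
Grok 4.1 Fast | 100 | 87 | 81 | 81 | 76 | 85.1%
DeepSeek V3 (2025-03-24) | 100 | 94 | 93 | 80 | 55 | 84.5%
GPT-4o Mini (temp=1) | 89 | 87 | 86 | 84 | 76 | 84.3%
Claude 3.5 Sonnet | 91 | 87 | 87 | 82 | 71 | 83.6%
Llama 3.1 Nemotron 70B | 99 | 90 | 81 | 78 | 67 | 82.7%
Llama 3.1 8B | 100 | 98 | 97 | 68 | 37 | 80.1%
GPT-4.1 Mini | 93 | 79 | 78 | 75 | 72 | 79.6%
Cohere Command R+ (Aug. 2024) | 79 | 79 | 77 | 73 | 68 | 75.3%
Hermes 3 405B | 86 | 78 | 72 | 70 | 70 | 75.3%
Grok 4 | 85 | 80 | 74 | 69 | 67 | 74.9%
GPT-4.1 Nano | 86 | 71 | 70 | 69 | 66 | 72.6%
Grok 4 Fast | 85 | 75 | 73 | 69 | 60 | 72.5%
Claude 3.7 Sonnet | 79 | 76 | 70 | 67 | 67 | 72.0%
Qwen 3 32B | 87 | 80 | 73 | 62 | 49 | 70.2%
GPT-4o, May 13th (temp=1) | 79 | 73 | 72 | 67 | 57 | 69.6%
WizardLM 2 8x22b | 92 | 78 | 69 | 56 | 53 | 69.5%
Claude Opus 4 | 82 | 77 | 67 | 60 | 59 | 69.2%
Llama 3.1 70B | 94 | 70 | 68 | 60 | 52 | 68.7%
Z.AI GLM 4.5 | 81 | 70 | 69 | 67 | 57 | 68.6%
Claude Sonnet 4.5 | 76 | 72 | 68 | 64 | 59 | 67.8%
DeepSeek V3 (2024-12-26) | 75 | 74 | 69 | 62 | 53 | 66.7%
Claude Sonnet 4 | 75 | 67 | 66 | 62 | 60 | 65.9%
Arcee AI: Trinity Large (Preview) | 96 | 62 | 58 | 57 | 56 | 65.6%
Grok 4.3 (Reasoning) | 77 | 68 | 65 | 63 | 53 | 65.4%
Gemma 3 12B | 69 | 66 | 66 | 65 | 60 | 65.3%
GPT-4.1 | 71 | 69 | 68 | 67 | 50 | 64.9%
Rocinante 12B | 98 | 84 | 58 | 43 | 38 | 64.2%
DeepSeek-V2 Chat | 86 | 75 | 59 | 51 | 48 | 63.7%
Claude 3 Haiku | 68 | 66 | 62 | 61 | 60 | 63.5%
Z.AI GLM 5 | 91 | 68 | 59 | 54 | 45 | 63.3%
LFM2 24B | 71 | 68 | 62 | 60 | 55 | 63.1%
Hermes 3 70B | 80 | 76 | 65 | 51 | 42 | 62.8%
Gemma 3 27B | 71 | 66 | 62 | 55 | 51 | 61.1%
ByteDance Seed 2.0 Mini | 88 | 60 | 58 | 52 | 46 | 61.0%
Z.AI GLM 4.5 Air | 82 | 71 | 56 | 49 | 47 | 60.9%
GPT-4o, Aug. 6th (temp=0) | 67 | 64 | 61 | 59 | 52 | 60.7%
GPT-4o Mini (temp=0) | 73 | 70 | 58 | 55 | 47 | 60.6%
Qwen 2.5 72B | 65 | 64 | 60 | 59 | 55 | 60.3%
Claude Opus 4.7 (Reasoning) | 77 | 63 | 60 | 60 | 40 | 60.0%
Claude Opus 4.7 | 69 | 60 | 59 | 56 | 53 | 59.3%
GPT-4o, May 13th (temp=0) | 72 | 61 | 57 | 56 | 50 | 59.1%
Claude Haiku 4.5 | 62 | 60 | 60 | 58 | 54 | 58.8%
Gemma 3 4B | 63 | 63 | 56 | 56 | 53 | 58.1%
Z.AI GLM 5.1 | 74 | 58 | 56 | 56 | 45 | 58.0%
Arcee AI: Trinity Mini | 67 | 66 | 60 | 50 | 38 | 56.1%
Writer: Palmyra X5 | 60 | 58 | 55 | 54 | 50 | 55.6%
Qwen 3.5 Plus (2026-02-15) | 61 | 59 | 54 | 52 | 51 | 55.6%
Claude Opus 4.5 | 64 | 57 | 57 | 52 | 47 | 55.4%
Gemini 2.5 Flash Lite (Reasoning) | 61 | 57 | 57 | 57 | 41 | 54.5%
Claude Sonnet 4.6 | 65 | 58 | 54 | 48 | 46 | 54.5%
Mistral Large 2 | 60 | 59 | 59 | 49 | 45 | 54.4%
Mistral Small 4 (Reasoning) | 73 | 58 | 52 | 45 | 44 | 54.2%
Grok 4.20 (Beta, Reasoning) | 65 | 56 | 52 | 49 | 47 | 53.9%
MoonshotAI: Kimi K2.5 | 65 | 61 | 52 | 48 | 43 | 53.7%
Z.AI GLM 5 Turbo | 64 | 56 | 55 | 48 | 46 | 53.6%
Grok 4.20 (Reasoning) | 65 | 55 | 52 | 51 | 44 | 53.5%
Grok 4.20 (Beta) | 61 | 53 | 53 | 52 | 49 | 53.5%
Gemini 2.5 Flash Lite | 58 | 54 | 54 | 52 | 48 | 53.0%
Gemini 2.5 Flash (Reasoning) | 61 | 54 | 54 | 47 | 47 | 52.7%
Qwen3 235B A22B Instruct 2507 | 58 | 55 | 55 | 50 | 45 | 52.6%
MiniMax M2.5 | 59 | 55 | 54 | 51 | 43 | 52.3%
Grok 4.20 | 57 | 57 | 54 | 46 | 45 | 52.1%
Mistral Medium 3.1 | 59 | 51 | 51 | 50 | 48 | 52.0%
GPT-5.1 | 58 | 56 | 52 | 49 | 45 | 51.9%
ByteDance Seed 2.0 Lite | 57 | 54 | 52 | 50 | 45 | 51.7%
MiniMax M2.7 | 54 | 54 | 52 | 50 | 46 | 51.3%
Mistral Small Creative | 65 | 50 | 48 | 46 | 45 | 50.8%
Grok 4.3 | 62 | 55 | 46 | 46 | 45 | 50.8%
Nemotron 3 Super | 62 | 53 | 47 | 47 | 45 | 50.6%
Mistral Small 3.2 24B | 69 | 57 | 47 | 41 | 37 | 50.3%
Mistral Large 3 | 64 | 49 | 47 | 45 | 44 | 49.9%
Gemini 2.5 Flash | 59 | 51 | 48 | 47 | 43 | 49.6%
Claude Sonnet 4.6 (Reasoning) | 66 | 57 | 47 | 44 | 34 | 49.4%
Ministral 3 14B | 56 | 52 | 52 | 46 | 40 | 49.1%
Mistral Large | 60 | 55 | 45 | 44 | 41 | 49.1%
MoonshotAI: Kimi K2.6 | 60 | 53 | 44 | 44 | 43 | 48.8%
Gemini 3.5 Flash (Reasoning) | 57 | 52 | 48 | 47 | 40 | 48.7%
Mistral Small 4 | 56 | 49 | 48 | 45 | 44 | 48.4%
Nemotron 3 Nano | 60 | 48 | 48 | 43 | 42 | 48.2%
Gemini 3.5 Flash (Reasoning, Minimal) | 57 | 51 | 46 | 43 | 43 | 48.0%
Xiaomi MIMO v2.5 | 53 | 52 | 47 | 47 | 41 | 48.0%
Xiaomi MIMO v2.5 Pro | 59 | 46 | 46 | 43 | 42 | 47.2%
o4 Mini | 51 | 50 | 47 | 45 | 43 | 47.1%
Claude Opus 4.6 | 52 | 51 | 45 | 44 | 43 | 46.9%
o4 Mini High | 57 | 46 | 45 | 45 | 41 | 46.9%
ByteDance Seed 1.6 Flash | 55 | 47 | 45 | 44 | 42 | 46.8%
DeepSeek V4 Pro | 54 | 50 | 47 | 44 | 39 | 46.6%
DeepSeek V4 Flash | 51 | 49 | 45 | 44 | 43 | 46.4%
Qwen 3.6 27B | 62 | 51 | 41 | 39 | 38 | 46.3%
DeepSeek V4 Flash (Reasoning) | 57 | 48 | 47 | 45 | 35 | 46.3%
GPT-5.4 | 47 | 46 | 46 | 45 | 45 | 45.9%
GPT-5.4 (Reasoning) | 47 | 46 | 45 | 45 | 44 | 45.7%
Gemini 3 Pro (Preview) | 53 | 45 | 45 | 43 | 40 | 45.2%
Gemini 3 Flash (Preview, Reasoning) | 48 | 48 | 47 | 44 | 39 | 45.0%
DeepSeek V4 Pro (Reasoning) | 53 | 45 | 45 | 44 | 37 | 44.9%
GPT-5.4 Mini | 48 | 47 | 44 | 44 | 42 | 44.9%
GPT-OSS 120B | 50 | 46 | 43 | 43 | 42 | 44.8%
GPT-5.4 (Reasoning, Low) | 46 | 45 | 45 | 45 | 43 | 44.8%
Gemma 4 26B | 47 | 47 | 44 | 43 | 43 | 44.7%
Stealth: Hunter Alpha | 50 | 48 | 46 | 40 | 38 | 44.5%
Ministral 3 3B | 48 | 48 | 46 | 41 | 40 | 44.4%
Claude Opus 4.6 (Reasoning) | 49 | 49 | 43 | 43 | 38 | 44.3%
Gemini 2.5 Pro | 54 | 49 | 42 | 39 | 36 | 44.2%
GPT-5.5 | 45 | 45 | 44 | 43 | 43 | 44.0%
GPT-5.4 Mini (Reasoning, Low) | 45 | 45 | 45 | 43 | 42 | 43.8%
Aion 2.0 | 53 | 53 | 41 | 37 | 35 | 43.7%
Stealth: Healer Alpha | 55 | 45 | 43 | 41 | 33 | 43.5%
ByteDance Seed 1.6 | 54 | 44 | 41 | 40 | 36 | 43.0%
GPT-5.4 Mini (Reasoning) | 45 | 44 | 43 | 43 | 39 | 42.8%
GPT-5.5 (Reasoning) | 44 | 43 | 43 | 42 | 42 | 42.8%
GPT-5.5 (Reasoning, Low) | 44 | 43 | 43 | 42 | 42 | 42.8%
Mistral NeMO | 64 | 53 | 35 | 33 | 29 | 42.8%
Ministral 3B | 48 | 44 | 42 | 41 | 38 | 42.6%
Inception Mercury 2 | 45 | 44 | 42 | 41 | 40 | 42.4%
GPT-5 | 45 | 44 | 41 | 41 | 40 | 42.3%
DeepSeek V3.2 | 49 | 47 | 43 | 38 | 34 | 42.3%
Stealth: Aurora Alpha | 46 | 45 | 45 | 38 | 36 | 42.0%
GPT-5.2 | 44 | 42 | 42 | 42 | 40 | 42.0%
Ministral 8B | 45 | 43 | 43 | 41 | 36 | 41.7%
Gemini 3.1 Pro (Preview) | 56 | 41 | 40 | 39 | 33 | 41.6%
GPT-5.4 Nano (Reasoning) | 43 | 43 | 42 | 39 | 39 | 41.5%
Ministral 3 8B | 43 | 42 | 41 | 41 | 38 | 41.1%
Qwen3.7 Max | 50 | 45 | 43 | 35 | 33 | 41.0%
GPT-5.4 Nano (Reasoning, Low) | 42 | 42 | 41 | 40 | 39 | 40.9%
GPT-5.4 Nano | 44 | 43 | 39 | 39 | 39 | 40.8%
Qwen 3.6 Flash | 51 | 43 | 40 | 38 | 32 | 40.8%
Gemma 4 31B | 47 | 42 | 40 | 36 | 36 | 40.2%
GPT-5 Mini | 43 | 43 | 39 | 38 | 36 | 39.8%
Qwen 3.6 35B | 43 | 41 | 39 | 38 | 37 | 39.7%
Gemini 3 Flash (Preview) | 44 | 44 | 40 | 36 | 34 | 39.7%
Gemma 4 26B (Reasoning) | 53 | 38 | 37 | 36 | 34 | 39.6%
DeepSeek V3.1 | 42 | 41 | 40 | 37 | 37 | 39.6%
Qwen3.6 Max Preview | 50 | 39 | 37 | 37 | 32 | 39.1%
GPT-5 Nano | 41 | 39 | 38 | 38 | 37 | 38.9%
Gemma 4 31B (Reasoning) | 41 | 40 | 39 | 38 | 36 | 38.7%
Qwen 3.5 122B | 42 | 40 | 38 | 38 | 33 | 38.4%
Z.AI GLM 4.6 | 41 | 39 | 39 | 38 | 33 | 38.0%
Z.AI GLM 4.7 | 45 | 42 | 38 | 32 | 31 | 37.9%
Qwen 3.5 35B | 43 | 41 | 36 | 36 | 33 | 37.8%
Qwen 3.5 Plus (2026-04-20) | 46 | 37 | 36 | 35 | 34 | 37.6%
Qwen 3.5 Flash | 39 | 39 | 38 | 35 | 34 | 37.2%
Inception Mercury | 49 | 45 | 35 | 27 | 26 | 36.6%
Gemini 3.1 Flash Lite (Preview) | 40 | 38 | 37 | 36 | 29 | 36.2%
Z.AI GLM 4.7 Flash | 40 | 39 | 32 | 31 | 30 | 34.4%
Qwen 3.5 27B | 41 | 38 | 37 | 30 | 25 | 34.3%
Gemini 3.1 Flash Lite (Reasoning) | 39 | 36 | 35 | 30 | 29 | 33.7%
Qwen 3.5 397B A17B | 37 | 37 | 35 | 30 | 25 | 32.9%
Gemini 3.1 Flash Lite | 36 | 34 | 33 | 33 | 27 | 32.7%
Qwen 3.5 9B | 39 | 34 | 32 | 28 | 28 | 32.0%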
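
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Rocinante 12B | 99 | 98 | 98 | 95 | 86 | 95.2%
Claude 3.5 Sonnet | 99 | 94 | 90 | 90 | 84 | 91.3%
GPT-4o Mini (temp=1) | 100 | 98 | 89 | 84 | 84 | 91.1%
Hermes 3 405B | 98 | 96 | 91 | 86 | 76 | 89.5%
Grok 4.1 Fast | 97 | 88 | 87 | 85 | 80 | 87.2%
Llama 3.1 8B | 99 | 97 | 95 | 70 | 69 | 86.2%
Llama 3.1 Nemotron 70B | 97 | 87 | 81 | 79 | 77 | 84.0%
Cohere Command R+ (Aug. 2024) | 98 | 88 | 86 | 84 | 59 | 82.9%
GPT-4o, Aug. 6th (temp=1) | 90 | 86 | 80 | 78 | 76 | 82.0%
Grok 4 Fast | 89 | 83 | 79 | 76 | 75 | 80.3%
Claude Sonnet 4 | 99 | 79 | 79 | 78 | 66 | 80.0%
Claude Opus 4.7 (Reasoning) | 89 | 83 | 79 | 77 | 67 | 79.1%
Claude Opus 4.7 | 89 | 84 | 83 | 71 | 65 | 78.5%
GPT-4o, May 13th (temp=1) | 89 | 80 | 76 | 73 | 72 | 77.9%
Claude Sonnet 4.5 | 87 | 83 | 77 | 74 | 70 | 77.9%
Claude Opus 4 | 93 | 78 | 75 | 73 | 61 | 75.9%
Grok 4 | 85 | 79 | 74 | 74 | 64 | 75.3%
Qwen 3 32B | 98 | 89 | 64 | 60 | 59 | 74.1%
DeepSeek V3 (2025-03-24) | 88 | 82 | 79 | 75 | 42 | 73.1%
Z.AI GLM 4.5 Air | 82 | 75 | 70 | 70 | 67 | 73.0%
MiniMax M2.7 | 78 | 74 | 73 | 71 | 69 | 72.9%
Claude 3.7 Sonnet | 81 | 80 | 69 | 67 | 66 | 72.8%
GPT-4.1 Nano | 80 | 76 | 74 | 68 | 64 | 72.5%
GPT-4.1 Mini | 87 | 77 | 71 | 68 | 58 | 72.1%
ByteDance Seed 2.0 Lite | 96 | 83 | 72 | 61 | 49 | 72.1%
Hermes 3 70B | 89 | 82 | 78 | 68 | 41 | 71.5%
Z.AI GLM 4.5 | 78 | 74 | 70 | 70 | 64 | 71.1%
Claude 3 Haiku | 81 | 78 | 70 | 68 | 53 | 70.2%
Claude Haiku 4.5 | 85 | 80 | 72 | 68 | 46 | 70.0%
GPT-4.1 | 78 | 74 | 72 | 69 | 56 | 69.9%
Gemma 3 27B | 94 | 69 | 64 | 59 | 58 | 68.8%
Claude Sonnet 4.6 | 76 | 70 | 68 | 64 | 63 | 68.3%
GPT-5.1 | 72 | 70 | 68 | 64 | 64 | 67.5%
Arcee AI: Trinity Large (Preview) | 100 | 75 | 72 | 47 | 41 | 67.2%
ByteDance Seed 2.0 Mini | 84 | 72 | 65 | 58 | 58 | 67.1%
GPT-4o Mini (temp=0) | 77 | 68 | 67 | 66 | 57 | 67.0%
Z.AI GLM 5 Turbo | 79 | 68 | 67 | 64 | 57 | 66.8%
Qwen 2.5 72B | 82 | 68 | 65 | 63 | 56 | 66.8%
Arcee AI: Trinity Mini | 91 | 77 | 69 | 48 | 46 | 66.3%
GPT-5.4 (Reasoning) | 75 | 67 | 67 | 62 | 58 | 65.7%
WizardLM 2 8x22b | 83 | 62 | 62 | 61 | 58 | 65.1%
Gemini 2.5 Flash Lite | 87 | 67 | 63 | 60 | 49 | 65.1%
Grok 4.3 | 79 | 70 | 63 | 57 | 54 | 64.7%
DeepSeek-V2 Chat | 78 | 71 | 65 | 57 | 52 | 64.7%
Claude Sonnet 4.6 (Reasoning) | 72 | 70 | 63 | 60 | 59 | 64.7%
ByteDance Seed 1.6 | 76 | 68 | 66 | 65 | 48 | 64.7%
Claude Opus 4.5 | 76 | 69 | 67 | 57 | 54 | 64.5%
Qwen 3.5 Plus (2026-02-15) | 85 | 68 | 60 | 60 | 49 | 64.3%
Gemini 3.5 Flash (Reasoning, Minimal) | 79 | 74 | 59 | 56 | 53 | 64.1%
Z.AI GLM 5 | 70 | 67 | 67 | 59 | 57 | 63.9%
GPT-5.4 | 67 | 66 | 65 | 63 | 57 | 63.7%
Llama 3.1 70B | 76 | 63 | 61 | 59 | 58 | 63.4%
Grok 4.20 (Reasoning) | 72 | 66 | 62 | 59 | 56 | 63.0%
Gemini 2.5 Flash (Reasoning) | 73 | 65 | 62 | 60 | 51 | 62.4%
Gemini 3.5 Flash (Reasoning) | 88 | 65 | 60 | 54 | 45 | 62.3%
o4 Mini High | 73 | 64 | 61 | 57 | 54 | 61.9%
Z.AI GLM 5.1 | 74 | 63 | 62 | 58 | 54 | 61.9%
MiniMax M2.5 | 72 | 65 | 61 | 60 | 49 | 61.6%
Gemma 3 12B | 77 | 64 | 63 | 59 | 46 | 61.5%
LFM2 24B | 78 | 69 | 61 | 51 | 49 | 61.4%
Gemini 2.5 Flash Lite (Reasoning) | 78 | 69 | 65 | 51 | 44 | 61.1%
GPT-4o, Aug. 6th (temp=0) | 81 | 57 | 56 | 55 | 54 | 60.6%
Writer: Palmyra X5 | 78 | 62 | 56 | 52 | 49 | 59.4%
Gemma 3 4B | 70 | 68 | 60 | 51 | 47 | 59.2%
DeepSeek V4 Flash | 76 | 75 | 55 | 50 | 40 | 59.0%
Gemini 3 Flash (Preview) | 68 | 65 | 57 | 53 | 49 | 58.5%
GPT-5.4 (Reasoning, Low) | 63 | 63 | 57 | 56 | 53 | 58.5%
o4 Mini | 68 | 63 | 56 | 55 | 48 | 58.2%
Qwen3 235B A22B Instruct 2507 | 68 | 58 | 58 | 55 | 50 | 57.8%
GPT-5 | 65 | 62 | 56 | 54 | 50 | 57.3%
Gemini 2.5 Flash | 74 | 66 | 54 | 49 | 42 | 57.2%
Claude Opus 4.6 (Reasoning) | 66 | 65 | 53 | 51 | 49 | 57.0%
Gemini 2.5 Pro | 64 | 63 | 60 | 53 | 44 | 56.7%
Grok 4.20 (Beta, Reasoning) | 62 | 61 | 57 | 52 | 51 | 56.7%
Stealth: Healer Alpha | 64 | 61 | 56 | 55 | 46 | 56.6%
DeepSeek V4 Pro | 67 | 57 | 57 | 53 | 48 | 56.4%
Grok 4.20 (Beta) | 67 | 59 | 54 | 51 | 46 | 55.4%
DeepSeek V3 (2024-12-26) | 75 | 53 | 49 | 49 | 48 | 55.1%
Mistral Large 3 | 64 | 55 | 54 | 51 | 51 | 55.0%
DeepSeek V4 Pro (Reasoning) | 64 | 54 | 53 | 53 | 52 | 55.0%
MoonshotAI: Kimi K2.6 | 64 | 64 | 53 | 49 | 44 | 54.9%
Gemma 4 31B | 63 | 58 | 56 | 53 | 43 | 54.6%
Grok 4.3 (Reasoning) | 60 | 59 | 56 | 56 | 43 | 54.6%
Mistral Large | 57 | 55 | 55 | 53 | 53 | 54.5%
GPT-5.5 | 61 | 61 | 51 | 50 | 49 | 54.4%
Qwen3.7 Max | 66 | 61 | 56 | 48 | 42 | 54.4%
DeepSeek V3.2 | 74 | 60 | 50 | 45 | 42 | 54.1%
MoonshotAI: Kimi K2.5 | 58 | 57 | 55 | 52 | 48 | 54.0%
GPT-4o, May 13th (temp=0) | 69 | 53 | 52 | 51 | 44 | 53.8%
Xiaomi MIMO v2.5 | 65 | 54 | 53 | 48 | 47 | 53.4%
Stealth: Hunter Alpha | 62 | 60 | 52 | 49 | 44 | 53.3%
Mistral NeMO | 71 | 58 | 51 | 47 | 39 | 53.2%
Claude Opus 4.6 | 63 | 56 | 55 | 49 | 42 | 53.2%
Grok 4.20 | 62 | 56 | 55 | 49 | 43 | 53.0%
Gemini 3.1 Pro (Preview) | 62 | 53 | 50 | 49 | 48 | 52.4%
Mistral Large 2 | 57 | 57 | 55 | 46 | 46 | 52.1%
GPT-5.5 (Reasoning, Low) | 59 | 55 | 50 | 49 | 47 | 52.1%
DeepSeek V3.1 | 61 | 61 | 50 | 47 | 41 | 52.0%
Gemini 3 Flash (Preview, Reasoning) | 69 | 55 | 54 | 41 | 39 | 51.6%
Aion 2.0 | 58 | 57 | 52 | 49 | 42 | 51.5%
Nemotron 3 Super | 58 | 55 | 50 | 49 | 41 | 50.6%
Qwen3.6 Max Preview | 55 | 54 | 53 | 47 | 42 | 50.4%
Mistral Small 4 | 57 | 54 | 50 | 46 | 43 | 50.1%
GPT-5.5 (Reasoning) | 54 | 51 | 50 | 50 | 44 | 49.8%
Mistral Small 4 (Reasoning) | 62 | 50 | 47 | 46 | 44 | 49.7%
Gemini 3 Pro (Preview) | 62 | 50 | 50 | 44 | 40 | 49.2%
Qwen 3.6 27B | 55 | 51 | 50 | 45 | 43 | 48.7%
Mistral Medium 3.1 | 57 | 50 | 49 | 46 | 41 | 48.6%
Qwen 3.5 397B A17B | 63 | 52 | 44 | 43 | 41 | 48.5%
Gemma 4 26B (Reasoning) | 57 | 52 | 49 | 46 | 38 | 48.4%
Mistral Small Creative | 59 | 48 | 46 | 45 | 44 | 48.2%
Xiaomi MIMO v2.5 Pro | 63 | 53 | 46 | 44 | 33 | 47.9%
Z.AI GLM 4.6 | 57 | 54 | 51 | 40 | 36 | 47.7%
Z.AI GLM 4.7 | 52 | 50 | 49 | 47 | 41 | 47.7%
Gemma 4 31B (Reasoning) | 56 | 51 | 50 | 44 | 34 | 47.1%
Z.AI GLM 4.7 Flash | 57 | 50 | 45 | 41 | 40 | 46.8%
GPT-5.4 Mini | 53 | 49 | 47 | 45 | 39 | 46.6%
Nemotron 3 Nano | 62 | 50 | 42 | 40 | 38 | 46.4%
Qwen 3.6 35B | 54 | 51 | 46 | 42 | 38 | 46.4%
GPT-5.4 Mini (Reasoning) | 50 | 50 | 46 | 44 | 38 | 45.5%
Ministral 3 8B | 53 | 45 | 45 | 43 | 41 | 45.5%
Ministral 8B | 50 | 47 | 46 | 43 | 40 | 45.5%
ByteDance Seed 1.6 Flash | 48 | 46 | 46 | 45 | 43 | 45.3%
GPT-5.4 Mini (Reasoning, Low) | 49 | 48 | 45 | 43 | 42 | 45.2%
Qwen 3.5 27B | 53 | 49 | 47 | 38 | 37 | 44.7%
DeepSeek V4 Flash (Reasoning) | 57 | 45 | 41 | 41 | 39 | 44.3%
Ministral 3B | 48 | 48 | 42 | 42 | 41 | 44.2%
GPT-5.2 | 49 | 45 | 43 | 42 | 41 | 43.8%
GPT-5 Mini | 49 | 44 | 43 | 41 | 41 | 43.8%
Qwen 3.5 Flash | 52 | 44 | 41 | 41 | 40 | 43.5%
Qwen 3.5 Plus (2026-04-20) | 54 | 44 | 41 | 38 | 38 | 42.9%
Gemma 4 26B | 55 | 44 | 40 | 38 | 37 | 42.8%
Qwen 3.5 122B | 53 | 49 | 39 | 38 | 34 | 42.8%
Gemini 3.1 Flash Lite | 51 | 49 | 40 | 37 | 36 | 42.5%
Qwen 3.6 Flash | 53 | 43 | 42 | 40 | 34 | 42.3%
Ministral 3 14B | 48 | 43 | 42 | 40 | 38 | 42.0%
GPT-5.4 Nano (Reasoning) | 44 | 44 | 41 | 41 | 40 | 41.9%
GPT-5.4 Nano (Reasoning, Low) | 44 | 43 | 43 | 39 | 37 | 41.3%
Gemini 3.1 Flash Lite (Reasoning) | 45 | 44 | 43 | 40 | 33 | 41.1%
Gemini 3.1 Flash Lite (Preview) | 48 | 46 | 41 | 37 | 34 | 41.0%
GPT-5.4 Nano | 44 | 42 | 40 | 40 | 39 | 41.0%
Ministral 3 3B | 48 | 43 | 43 | 40 | 25 | 39.6%
Mistral Small 3.2 24B | 44 | 42 | 42 | 35 | 34 | 39.3%
Qwen 3.5 9B | 56 | 42 | 37 | 31 | 30 | 39.2%
GPT-OSS 120B | 46 | 41 | 40 | 36 | 33 | 39.0%
Qwen 3.5 35B | 41 | 39 | 39 | 39 | 36 | 38.8%
Inception Mercury 2 | 39 | 39 | 37 | 37 | 32 | 36.9%
GPT-5 Nano | 37 | 37 | 36 | 32 | 28 | 34.1%
Stealth: Aurora Alpha | 37 | 36 | 30 | 29 | 25 | 31.5%
Inception Mercury | 43 | 36 | 25 | 25 | 25 | 30.8%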
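
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Llama 3.1 8B | 100 | 97 | 94 | 93 | 90 | 94.7%
Rocinante 12B | 100 | 94 | 94 | 94 | 66 | 89.5%
GPT-4o, Aug. 6th (temp=1) | 95 | 95 | 94 | 85 | 68 | 87.3%
GPT-4.1 Mini | 87 | 84 | 75 | 74 | 72 | 78.3%
Grok 4.1 Fast | 86 | 85 | 77 | 74 | 70 | 78.2%
Claude 3.5 Sonnet | 83 | 82 | 78 | 77 | 67 | 77.4%
GPT-4o Mini (temp=1) | 82 | 81 | 79 | 78 | 67 | 77.4%
Hermes 3 405B | 96 | 93 | 80 | 63 | 53 | 77.1%
Llama 3.1 Nemotron 70B | 88 | 83 | 81 | 75 | 57 | 76.6%
DeepSeek V3 (2025-03-24) | 90 | 80 | 79 | 68 | 61 | 75.3%
GPT-4o, May 13th (temp=1) | 88 | 79 | 75 | 69 | 65 | 75.1%
Hermes 3 70B | 98 | 80 | 77 | 67 | 46 | 73.6%
Cohere Command R+ (Aug. 2024) | 86 | 74 | 73 | 69 | 64 | 73.0%
Grok 4 | 81 | 75 | 71 | 69 | 61 | 71.5%
Claude Sonnet 4.5 | 80 | 77 | 66 | 66 | 65 | 70.8%
Claude 3 Haiku | 91 | 79 | 64 | 64 | 53 | 70.2%
Claude Opus 4 | 76 | 75 | 68 | 67 | 64 | 69.9%
GPT-4.1 Nano | 84 | 76 | 71 | 60 | 59 | 69.9%
Grok 4 Fast | 75 | 74 | 70 | 61 | 59 | 68.0%
Claude Sonnet 4 | 74 | 72 | 70 | 62 | 59 | 67.3%
Qwen 3 32B | 90 | 77 | 63 | 56 | 47 | 66.9%
Claude 3.7 Sonnet | 71 | 70 | 65 | 63 | 62 | 66.2%
GPT-4o, Aug. 6th (temp=0) | 68 | 68 | 67 | 64 | 63 | 66.0%
DeepSeek V3 (2024-12-26) | 76 | 67 | 67 | 66 | 53 | 65.9%
Claude Haiku 4.5 | 71 | 70 | 65 | 62 | 57 | 64.9%
WizardLM 2 8x22b | 85 | 67 | 67 | 55 | 51 | 64.8%
Z.AI GLM 4.5 Air | 74 | 70 | 64 | 55 | 55 | 63.7%
DeepSeek-V2 Chat | 74 | 63 | 63 | 61 | 55 | 63.3%
Llama 3.1 70B | 77 | 64 | 57 | 55 | 54 | 61.5%
Grok 4.3 (Reasoning) | 86 | 63 | 54 | 53 | 51 | 61.4%
Claude Opus 4.5 | 66 | 62 | 61 | 58 | 58 | 61.1%
Z.AI GLM 4.5 | 70 | 69 | 60 | 55 | 51 | 60.9%
LFM2 24B | 82 | 59 | 57 | 56 | 48 | 60.2%
Gemma 3 12B | 65 | 65 | 62 | 59 | 48 | 59.7%
GPT-4.1 | 65 | 61 | 61 | 55 | 55 | 59.6%
GPT-4o, May 13th (temp=0) | 67 | 62 | 58 | 55 | 55 | 59.4%
ByteDance Seed 2.0 Lite | 77 | 63 | 60 | 60 | 37 | 59.3%
Gemma 3 27B | 72 | 70 | 59 | 48 | 47 | 59.1%
Z.AI GLM 5.1 | 85 | 60 | 53 | 49 | 46 | 58.6%
MiniMax M2.7 | 68 | 62 | 60 | 55 | 47 | 58.4%
Arcee AI: Trinity Mini | 80 | 63 | 61 | 52 | 35 | 58.3%
Mistral Large | 73 | 67 | 57 | 45 | 45 | 57.4%
Claude Opus 4.7 | 61 | 58 | 58 | 56 | 51 | 57.0%
Qwen3 235B A22B Instruct 2507 | 68 | 60 | 54 | 52 | 50 | 56.8%
Qwen 3.5 Plus (2026-02-15) | 62 | 60 | 54 | 51 | 50 | 55.3%
Gemini 3.5 Flash (Reasoning, Minimal) | 60 | 57 | 55 | 52 | 50 | 54.9%
Grok 4.20 (Reasoning) | 68 | 59 | 52 | 49 | 45 | 54.7%
Mistral Large 2 | 61 | 58 | 55 | 51 | 47 | 54.6%
MoonshotAI: Kimi K2.5 | 67 | 52 | 52 | 51 | 49 | 54.3%
Gemini 2.5 Flash (Reasoning) | 64 | 62 | 51 | 46 | 45 | 53.6%
Gemini 2.5 Flash Lite | 61 | 54 | 53 | 52 | 47 | 53.6%
Gemini 2.5 Flash | 72 | 53 | 51 | 46 | 44 | 53.4%
Writer: Palmyra X5 | 62 | 57 | 53 | 46 | 46 | 53.0%
Gemma 3 4B | 64 | 59 | 56 | 44 | 42 | 52.9%
Grok 4.20 (Beta, Reasoning) | 56 | 55 | 53 | 53 | 45 | 52.3%
Qwen 2.5 72B | 70 | 54 | 50 | 45 | 42 | 52.3%
DeepSeek V3.1 | 70 | 52 | 50 | 47 | 41 | 51.8%
DeepSeek V4 Flash (Reasoning) | 58 | 58 | 57 | 46 | 38 | 51.5%
Ministral 3 14B | 56 | 53 | 52 | 51 | 46 | 51.4%
ByteDance Seed 2.0 Mini | 65 | 63 | 50 | 45 | 33 | 51.3%
o4 Mini | 59 | 58 | 48 | 47 | 43 | 51.1%
Mistral Large 3 | 61 | 53 | 49 | 48 | 44 | 51.0%
Grok 4.3 | 62 | 51 | 49 | 47 | 46 | 51.0%
Arcee AI: Trinity Large (Preview) | 76 | 49 | 45 | 45 | 40 | 50.9%
Z.AI GLM 5 | 56 | 54 | 52 | 47 | 45 | 50.9%
Claude Sonnet 4.6 (Reasoning) | 59 | 51 | 49 | 49 | 46 | 50.8%
GPT-4o Mini (temp=0) | 54 | 52 | 50 | 49 | 48 | 50.6%
Gemini 2.5 Pro | 56 | 53 | 49 | 48 | 46 | 50.4%
Aion 2.0 | 59 | 51 | 50 | 49 | 41 | 49.9%
Claude Opus 4.7 (Reasoning) | 60 | 53 | 50 | 47 | 39 | 49.9%
GPT-5.1 | 59 | 50 | 49 | 46 | 44 | 49.6%
ByteDance Seed 1.6 | 64 | 56 | 49 | 42 | 37 | 49.6%
Mistral Small Creative | 55 | 53 | 49 | 47 | 44 | 49.5%
Grok 4.20 (Beta) | 53 | 52 | 51 | 47 | 42 | 49.1%
MoonshotAI: Kimi K2.6 | 54 | 53 | 49 | 48 | 42 | 49.0%
Mistral Medium 3.1 | 58 | 52 | 45 | 44 | 44 | 48.4%
o4 Mini High | 58 | 53 | 46 | 44 | 41 | 48.4%
Gemini 3.5 Flash (Reasoning) | 52 | 51 | 49 | 45 | 44 | 48.2%
Claude Opus 4.6 | 53 | 50 | 49 | 49 | 39 | 48.1%
Gemma 4 26B | 61 | 49 | 49 | 41 | 40 | 48.1%
Grok 4.20 | 54 | 51 | 47 | 44 | 44 | 48.0%
Claude Sonnet 4.6 | 56 | 52 | 51 | 43 | 38 | 47.9%
DeepSeek V4 Flash | 53 | 52 | 52 | 48 | 35 | 47.9%
Mistral Small 4 | 52 | 50 | 47 | 46 | 44 | 47.9%
Z.AI GLM 4.6 | 54 | 51 | 48 | 46 | 39 | 47.8%
Z.AI GLM 5 Turbo | 62 | 52 | 51 | 42 | 31 | 47.7%
GPT-5.4 | 55 | 46 | 46 | 46 | 44 | 47.7%
Mistral NeMO | 79 | 46 | 43 | 39 | 32 | 47.6%
DeepSeek V4 Pro (Reasoning) | 60 | 49 | 48 | 40 | 40 | 47.4%
Ministral 3 8B | 64 | 46 | 45 | 42 | 41 | 47.4%
Gemini 3 Flash (Preview) | 52 | 49 | 49 | 44 | 42 | 47.3%
Xiaomi MIMO v2.5 Pro | 59 | 50 | 45 | 44 | 37 | 46.9%
DeepSeek V3.2 | 49 | 49 | 47 | 46 | 42 | 46.5%
Gemini 2.5 Flash Lite (Reasoning) | 60 | 46 | 45 | 44 | 36 | 46.2%
DeepSeek V4 Pro | 65 | 48 | 42 | 39 | 36 | 46.1%
GPT-5.4 (Reasoning, Low) | 49 | 47 | 46 | 45 | 43 | 46.0%
ByteDance Seed 1.6 Flash | 50 | 50 | 47 | 43 | 39 | 45.9%
GPT-OSS 120B | 50 | 46 | 44 | 44 | 43 | 45.6%
Claude Opus 4.6 (Reasoning) | 50 | 50 | 45 | 42 | 41 | 45.5%
Mistral Small 4 (Reasoning) | 60 | 44 | 43 | 43 | 38 | 45.5%
Nemotron 3 Super | 54 | 44 | 44 | 42 | 42 | 45.3%
Xiaomi MIMO v2.5 | 51 | 49 | 45 | 42 | 38 | 45.2%
GPT-5.4 (Reasoning) | 46 | 46 | 45 | 45 | 43 | 44.9%
GPT-5.4 Mini (Reasoning, Low) | 48 | 47 | 43 | 43 | 43 | 44.8%
Ministral 3B | 47 | 46 | 44 | 42 | 40 | 43.8%
Ministral 3 3B | 47 | 46 | 45 | 43 | 38 | 43.7%
MiniMax M2.5 | 56 | 47 | 44 | 41 | 31 | 43.6%
GPT-5.5 | 45 | 44 | 43 | 43 | 42 | 43.4%
Mistral Small 3.2 24B | 48 | 47 | 42 | 41 | 38 | 43.4%
GPT-5.4 Mini | 46 | 44 | 43 | 42 | 41 | 43.3%
Gemini 3 Pro (Preview) | 52 | 43 | 43 | 41 | 37 | 43.3%
GPT-5.5 (Reasoning, Low) | 44 | 44 | 43 | 43 | 42 | 43.2%
GPT-5.2 | 45 | 44 | 43 | 42 | 41 | 43.2%
GPT-5.5 (Reasoning) | 44 | 43 | 43 | 42 | 42 | 42.9%
Gemini 3 Flash (Preview, Reasoning) | 49 | 44 | 43 | 41 | 38 | 42.9%
Stealth: Hunter Alpha | 47 | 47 | 43 | 39 | 39 | 42.8%
GPT-5 Mini | 45 | 44 | 43 | 41 | 40 | 42.7%
GPT-5.4 Nano (Reasoning, Low) | 46 | 43 | 42 | 41 | 41 | 42.7%
GPT-5.4 Mini (Reasoning) | 45 | 44 | 44 | 41 | 39 | 42.7%
Gemma 4 31B | 47 | 44 | 43 | 40 | 39 | 42.5%
Gemma 4 26B (Reasoning) | 49 | 44 | 42 | 41 | 36 | 42.4%
Qwen3.7 Max | 50 | 46 | 40 | 38 | 38 | 42.2%
Qwen 3.5 Plus (2026-04-20) | 50 | 43 | 42 | 40 | 35 | 42.1%
GPT-5 | 47 | 43 | 41 | 40 | 39 | 41.9%
Qwen 3.6 27B | 53 | 41 | 40 | 39 | 36 | 41.9%
Gemma 4 31B (Reasoning) | 46 | 44 | 41 | 40 | 38 | 41.6%
Qwen3.6 Max Preview | 48 | 44 | 40 | 39 | 38 | 41.5%
GPT-5.4 Nano (Reasoning) | 46 | 42 | 41 | 39 | 38 | 41.3%
Stealth: Aurora Alpha | 45 | 44 | 40 | 38 | 37 | 40.8%
Gemini 3.1 Flash Lite (Reasoning) | 49 | 43 | 42 | 38 | 32 | 40.7%
Ministral 8B | 49 | 44 | 39 | 36 | 35 | 40.7%
Stealth: Healer Alpha | 44 | 44 | 42 | 40 | 34 | 40.6%
GPT-5.4 Nano | 42 | 41 | 41 | 39 | 39 | 40.2%
Gemini 3.1 Flash Lite (Preview) | 50 | 44 | 41 | 32 | 31 | 39.4%
Inception Mercury 2 | 46 | 44 | 37 | 36 | 34 | 39.4%
GPT-5 Nano | 41 | 40 | 39 | 38 | 38 | 39.1%
Gemini 3.1 Pro (Preview) | 52 | 38 | 37 | 37 | 29 | 38.6%
Qwen 3.5 27B | 42 | 40 | 38 | 37 | 35 | 38.4%
Z.AI GLM 4.7 Flash | 44 | 39 | 38 | 37 | 33 | 38.2%
Gemini 3.1 Flash Lite | 48 | 40 | 35 | 34 | 33 | 38.1%
Qwen 3.5 122B | 40 | 39 | 38 | 37 | 36 | 38.0%
Qwen 3.5 Flash | 42 | 42 | 36 | 35 | 34 | 37.8%
Qwen 3.5 397B A17B | 43 | 40 | 39 | 38 | 30 | 37.7%
Z.AI GLM 4.7 | 40 | 38 | 37 | 36 | 35 | 37.1%
Qwen 3.6 Flash | 44 | 40 | 38 | 36 | 26 | 36.9%
Nemotron 3 Nano | 43 | 42 | 34 | 34 | 30 | 36.6%
Qwen 3.6 35B | 39 | 37 | 35 | 34 | 32 | 35.1%
Qwen 3.5 9B | 39 | 38 | 36 | 31 | 28 | 34.4%
Qwen 3.5 35B | 37 | 34 | 34 | 31 | 30 | 33.2%
Inception Mercury | 38 | 34 | 26 | 25 | 25 | 29.6%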
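
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Llama 3.1 8B | 100 | 100 | 100 | 99 | 99 | 99.7%
Grok 4.1 Fast | 100 | 87 | 83 | 82 | 80 | 86.5%
WizardLM 2 8x22b | 100 | 100 | 77 | 73 | 63 | 82.6%
Llama 3.1 Nemotron 70B | 91 | 83 | 81 | 76 | 75 | 81.3%
GPT-4o, Aug. 6th (temp=1) | 94 | 80 | 80 | 78 | 71 | 80.9%
DeepSeek V3 (2025-03-24) | 94 | 80 | 78 | 76 | 63 | 78.3%
Claude 3.5 Sonnet | 88 | 81 | 75 | 71 | 67 | 76.2%
Grok 4.3 (Reasoning) | 100 | 87 | 75 | 65 | 51 | 75.5%
Grok 4 Fast | 82 | 81 | 73 | 69 | 69 | 74.6%
Claude Sonnet 4 | 83 | 75 | 75 | 73 | 61 | 73.4%
GPT-4o Mini (temp=1) | 82 | 82 | 68 | 67 | 64 | 72.6%
Grok 4 | 100 | 79 | 70 | 57 | 54 | 72.1%
Claude 3 Haiku | 98 | 76 | 67 | 60 | 55 | 71.0%
GPT-4.1 Mini | 76 | 75 | 69 | 69 | 65 | 70.9%
Claude Opus 4.7 | 74 | 74 | 74 | 71 | 57 | 70.0%
LFM2 24B | 87 | 75 | 71 | 65 | 49 | 69.3%
DeepSeek-V2 Chat | 85 | 74 | 73 | 63 | 47 | 68.4%
GPT-4o, May 13th (temp=1) | 85 | 66 | 61 | 61 | 59 | 66.7%
Claude Sonnet 4.5 | 81 | 68 | 64 | 63 | 55 | 66.2%
Claude Opus 4 | 75 | 71 | 63 | 62 | 59 | 66.1%
Claude 3.7 Sonnet | 77 | 67 | 66 | 59 | 59 | 65.6%
Z.AI GLM 4.5 | 79 | 77 | 63 | 57 | 49 | 64.8%
Llama 3.1 70B | 76 | 71 | 67 | 62 | 47 | 64.7%
Claude Opus 4.7 (Reasoning) | 79 | 70 | 64 | 60 | 50 | 64.6%
GPT-5.4 (Reasoning, Low) | 68 | 67 | 64 | 64 | 58 | 64.2%
DeepSeek V3 (2024-12-26) | 77 | 71 | 66 | 60 | 46 | 63.9%
GPT-5.4 | 75 | 66 | 61 | 60 | 54 | 63.3%
ByteDance Seed 2.0 Lite | 89 | 63 | 60 | 53 | 52 | 63.3%
MoonshotAI: Kimi K2.5 | 80 | 63 | 60 | 59 | 53 | 62.9%
Claude Haiku 4.5 | 72 | 71 | 60 | 57 | 53 | 62.6%
Hermes 3 405B | 78 | 76 | 70 | 47 | 41 | 62.3%
Rocinante 12B | 83 | 66 | 61 | 54 | 43 | 61.6%
MoonshotAI: Kimi K2.6 | 72 | 72 | 67 | 54 | 43 | 61.5%
Cohere Command R+ (Aug. 2024) | 71 | 68 | 65 | 60 | 44 | 61.5%
GPT-4.1 | 72 | 67 | 62 | 59 | 47 | 61.2%
Gemma 3 4B | 68 | 66 | 63 | 60 | 48 | 61.1%
Grok 4.20 (Reasoning) | 80 | 61 | 56 | 55 | 53 | 61.0%
GPT-5.4 (Reasoning) | 64 | 63 | 60 | 59 | 58 | 61.0%
Writer: Palmyra X5 | 68 | 66 | 57 | 56 | 51 | 59.7%
Claude Opus 4.5 | 71 | 62 | 59 | 54 | 48 | 58.8%
Grok 4.20 | 64 | 60 | 58 | 56 | 53 | 58.3%
Qwen3 235B A22B Instruct 2507 | 67 | 63 | 55 | 53 | 53 | 58.2%
Grok 4.20 (Beta, Reasoning) | 66 | 65 | 61 | 58 | 42 | 58.2%
o4 Mini | 71 | 59 | 58 | 55 | 46 | 57.8%
Gemma 3 12B | 65 | 64 | 57 | 54 | 50 | 57.8%
GPT-4.1 Nano | 68 | 64 | 56 | 50 | 49 | 57.4%
Hermes 3 70B | 100 | 75 | 42 | 36 | 33 | 57.4%
Arcee AI: Trinity Large (Preview) | 74 | 66 | 50 | 49 | 45 | 56.8%
Z.AI GLM 4.5 Air | 69 | 64 | 57 | 48 | 44 | 56.5%
GPT-5.1 | 62 | 59 | 56 | 54 | 49 | 55.9%
Grok 4.3 | 64 | 62 | 57 | 50 | 47 | 55.9%
Z.AI GLM 5 Turbo | 70 | 60 | 50 | 50 | 46 | 55.4%
Gemini 2.5 Flash | 78 | 59 | 49 | 47 | 45 | 55.4%
Gemini 3 Flash (Preview, Reasoning) | 73 | 63 | 54 | 50 | 37 | 55.3%
GPT-5.4 Mini (Reasoning) | 64 | 54 | 53 | 51 | 50 | 54.5%
Arcee AI: Trinity Mini | 71 | 63 | 58 | 41 | 39 | 54.3%
Z.AI GLM 5.1 | 60 | 55 | 53 | 52 | 52 | 54.3%
Grok 4.20 (Beta) | 59 | 57 | 54 | 53 | 45 | 53.4%
GPT-4o, May 13th (temp=0) | 70 | 52 | 51 | 48 | 45 | 53.4%
Z.AI GLM 5 | 67 | 61 | 56 | 44 | 38 | 53.1%
GPT-5.5 (Reasoning) | 60 | 52 | 50 | 49 | 49 | 52.2%
MiniMax M2.7 | 60 | 54 | 51 | 49 | 46 | 52.2%
Qwen3.6 Max Preview | 65 | 53 | 49 | 48 | 46 | 52.1%
Qwen 3 32B | 62 | 55 | 50 | 46 | 46 | 52.0%
MiniMax M2.5 | 64 | 53 | 52 | 49 | 40 | 51.7%
Qwen 3.6 27B | 63 | 56 | 48 | 46 | 45 | 51.5%
Gemini 2.5 Flash Lite (Reasoning) | 65 | 51 | 51 | 46 | 44 | 51.5%
DeepSeek V4 Pro | 59 | 57 | 51 | 48 | 42 | 51.3%
DeepSeek V4 Pro (Reasoning) | 65 | 63 | 46 | 42 | 36 | 50.5%
o4 Mini High | 58 | 50 | 48 | 48 | 46 | 50.3%
Qwen 3.5 Plus (2026-02-15) | 61 | 59 | 51 | 47 | 34 | 50.2%
Mistral Medium 3.1 | 69 | 52 | 45 | 44 | 40 | 50.0%
GPT-4o, Aug. 6th (temp=0) | 59 | 50 | 49 | 47 | 44 | 49.9%
Gemini 3.5 Flash (Reasoning, Minimal) | 55 | 55 | 50 | 46 | 41 | 49.5%
DeepSeek V4 Flash | 53 | 51 | 49 | 47 | 47 | 49.4%
DeepSeek V3.1 | 73 | 61 | 39 | 38 | 36 | 49.3%
ByteDance Seed 1.6 Flash | 63 | 50 | 47 | 45 | 39 | 48.7%
GPT-5.4 Mini (Reasoning, Low) | 53 | 51 | 49 | 47 | 44 | 48.7%
GPT-4o Mini (temp=0) | 55 | 50 | 49 | 47 | 41 | 48.6%
GPT-5.4 Mini | 55 | 51 | 48 | 46 | 43 | 48.6%
Mistral Small 4 | 59 | 51 | 48 | 45 | 39 | 48.5%
Gemini 3.5 Flash (Reasoning) | 56 | 48 | 47 | 45 | 44 | 48.1%
GPT-5.5 (Reasoning, Low) | 51 | 49 | 48 | 47 | 42 | 47.5%
Gemini 2.5 Flash (Reasoning) | 58 | 56 | 43 | 43 | 38 | 47.4%
GPT-5.5 | 57 | 50 | 45 | 43 | 42 | 47.3%
Mistral Small Creative | 57 | 52 | 43 | 42 | 39 | 46.7%
Mistral Large | 54 | 50 | 46 | 43 | 40 | 46.6%
Gemini 3.1 Pro (Preview) | 60 | 49 | 44 | 40 | 39 | 46.5%
Qwen 3.6 Flash | 55 | 49 | 49 | 44 | 34 | 46.2%
Gemma 3 27B | 54 | 52 | 44 | 43 | 37 | 45.9%
DeepSeek V3.2 | 54 | 47 | 46 | 45 | 38 | 45.9%
ByteDance Seed 1.6 | 61 | 51 | 44 | 39 | 33 | 45.7%
Claude Opus 4.6 | 55 | 53 | 48 | 42 | 28 | 45.2%
Nemotron 3 Super | 49 | 48 | 45 | 45 | 38 | 45.0%
Mistral Small 4 (Reasoning) | 53 | 50 | 41 | 40 | 40 | 44.8%
Ministral 3B | 55 | 46 | 43 | 40 | 39 | 44.6%
Claude Opus 4.6 (Reasoning) | 50 | 49 | 44 | 43 | 36 | 44.6%
Claude Sonnet 4.6 | 55 | 53 | 42 | 39 | 33 | 44.5%
Mistral Large 2 | 53 | 45 | 45 | 40 | 39 | 44.5%
DeepSeek V4 Flash (Reasoning) | 48 | 47 | 47 | 44 | 37 | 44.4%
Mistral Large 3 | 59 | 46 | 40 | 38 | 36 | 43.8%
Qwen 2.5 72B | 53 | 42 | 42 | 40 | 40 | 43.6%
Gemini 3 Flash (Preview) | 52 | 48 | 43 | 38 | 36 | 43.5%
ByteDance Seed 2.0 Mini | 48 | 47 | 45 | 42 | 35 | 43.3%
GPT-5.2 | 46 | 44 | 44 | 43 | 38 | 42.9%
Xiaomi MIMO v2.5 Pro | 47 | 46 | 44 | 40 | 36 | 42.7%
Aion 2.0 | 52 | 44 | 41 | 40 | 37 | 42.7%
Ministral 3 8B | 47 | 44 | 43 | 41 | 37 | 42.5%
Nemotron 3 Nano | 50 | 47 | 43 | 37 | 34 | 42.4%
Claude Sonnet 4.6 (Reasoning) | 55 | 43 | 43 | 37 | 34 | 42.2%
Ministral 3 14B | 47 | 42 | 41 | 41 | 39 | 42.1%
Gemma 4 31B | 56 | 45 | 41 | 38 | 28 | 41.6%
Gemini 2.5 Flash Lite | 44 | 43 | 43 | 42 | 36 | 41.6%
Qwen 3.5 Plus (2026-04-20) | 57 | 47 | 39 | 34 | 29 | 41.5%
GPT-5.4 Nano (Reasoning) | 42 | 42 | 42 | 41 | 40 | 41.5%
Ministral 3 3B | 46 | 44 | 41 | 38 | 36 | 41.3%
Qwen 3.5 397B A17B | 50 | 47 | 43 | 33 | 33 | 41.1%
Stealth: Hunter Alpha | 44 | 43 | 42 | 39 | 37 | 40.8%
Qwen3.7 Max | 47 | 41 | 40 | 40 | 36 | 40.8%
Gemini 2.5 Pro | 44 | 43 | 42 | 41 | 33 | 40.6%
GPT-5.4 Nano (Reasoning, Low) | 45 | 41 | 41 | 39 | 36 | 40.5%
Z.AI GLM 4.6 | 53 | 42 | 41 | 33 | 31 | 40.0%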
Gemini 3 Pro (Preview)484539353440.0%
Gemini 3.1 Flash Lite (Reasoning)534336333239.3%
GPT-5.4 Nano434040363438.6%
Z.AI GLM 4.7 Flash513939352938.6%
GPT-OSS 120B444338383038.5%
GPT-5403939373638.3%
Qwen 3.6 35B513938322937.9%
Xiaomi MIMO v2.5453936333237.2%
Ministral 8B484035332836.7%
GPT-5 Mini414035343236.5%
Z.AI GLM 4.7403737363336.4%
Inception Mercury 2404036333136.1%
Qwen 3.5 9B503534322535.4%
Stealth: Healer Alpha393635343135.1%
Gemini 3.1 Flash Lite (Preview)473534322835.0%
Gemma 4 26B (Reasoning)433534342935.0%
Gemma 4 26B383535343234.9%
Gemma 4 31B (Reasoning)393634333234.6%
Mistral NeMO403934312533.8%
Gemini 3.1 Flash Lite363633322532.4%
Stealth: Aurora Alpha403331312532.2%
Qwen 3.5 27B363533312632.2%
Qwen 3.5 35B353231303031.7%
Qwen 3.5 Flash363431292831.7%
Qwen 3.5 122B343331302831.2%
Inception Mercury372825252528.0%
Mistral Small 3.2 24B372525252527.4%
GPT-5 Nano322727252527.1%

Novelcrafter Default Prompt

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
GPT-4o Mini (temp=1) 98 96 95 90 77 91.1%
Llama 3.1 8B 98 97 95 81 81 90.5%
Grok 4.1 Fast 99 94 85 84 75 87.4%
Rocinante 12B 100 98 91 79 65 86.4%
Hermes 3 405B 98 86 82 80 72 83.7%
Claude Sonnet 4 94 92 84 78 67 83.2%
Claude Opus 4 95 92 87 73 63 82.1%
Claude Sonnet 4.6 84 83 83 78 70 79.5%
Claude Sonnet 4.5 90 80 79 78 60 77.4%
GPT-4o, Aug. 6th (temp=1) 87 84 75 73 68 77.3%
Llama 3.1 Nemotron 70B 96 85 83 61 59 76.9%
Claude Opus 4.7 85 82 78 74 64 76.4%
Claude Sonnet 4.6 (Reasoning) 89 87 77 63 61 75.6%
GPT-4o, May 13th (temp=1) 94 84 76 63 57 74.5%
Gemma 3 27B 89 79 75 66 64 74.4%
ByteDance Seed 2.0 Lite 96 86 74 65 50 74.3%
Claude 3.5 Sonnet 85 76 74 69 67 74.1%
Claude Opus 4.6 (Reasoning) 85 80 72 71 63 74.0%
Claude Opus 4.5 80 75 72 70 69 73.0%
GPT-4.1 86 80 78 65 55 72.7%
Gemma 3 12B 88 79 71 70 54 72.4%
Claude Opus 4.7 (Reasoning) 84 77 76 71 53 72.3%
Cohere Command R+ (Aug. 2024) 86 74 72 66 62 72.0%
Grok 4 87 75 75 64 56 71.2%
Gemma 3 4B 75 75 72 67 65 70.7%
GPT-4.1 Nano 83 71 70 69 60 70.5%
Z.AI GLM 5 77 76 71 67 61 70.3%
DeepSeek V4 Pro 83 74 67 66 57 69.5%
Arcee AI: Trinity Large (Preview) 86 72 70 60 56 69.1%
Claude 3 Haiku 81 70 65 65 59 68.1%
DeepSeek V3 (2025-03-24) 100 74 59 54 51 67.6%
Grok 4 Fast 73 72 69 63 61 67.5%
Gemini 2.5 Flash Lite 80 68 64 60 60 66.3%
Gemini 2.5 Flash 77 74 72 65 44 66.3%
DeepSeek V4 Pro (Reasoning) 82 68 67 59 55 66.1%
Hermes 3 70B 73 70 68 60 58 65.9%
DeepSeek V4 Flash 74 72 66 60 57 65.9%
Claude 3.7 Sonnet 81 71 67 57 53 65.8%
GPT-4.1 Mini 72 71 68 57 57 64.9%
Claude Haiku 4.5 79 69 64 59 53 64.8%
Gemini 2.5 Pro 89 70 57 56 48 64.0%
GPT-5.1 74 71 68 54 51 63.6%
Gemini 3.5 Flash (Reasoning, Minimal) 79 73 67 54 43 63.3%
Z.AI GLM 5.1 73 70 67 56 50 63.2%
Z.AI GLM 4.5 72 63 61 60 59 63.1%
GPT-4o Mini (temp=0) 71 67 59 59 58 62.9%
WizardLM 2 8x22b 72 69 61 57 55 62.7%
Llama 3.1 70B 78 65 62 62 46 62.6%
DeepSeek V4 Flash (Reasoning) 72 65 62 58 56 62.5%
Grok 4.20 73 63 60 58 58 62.3%
DeepSeek V3.1 98 58 58 51 47 62.0%
Gemini 2.5 Flash (Reasoning) 75 69 57 55 54 62.0%
MiniMax M2.5 66 66 61 59 57 61.7%
GPT-5.4 (Reasoning, Low) 77 62 59 56 54 61.6%
Qwen3 235B A22B Instruct 2507 76 61 61 55 52 61.3%
Claude Opus 4.6 77 61 59 54 54 61.1%
Grok 4.20 (Beta, Reasoning) 64 62 59 52 51 57.7%
LFM2 24B 76 62 58 51 41 57.6%
Grok 4.20 (Reasoning) 67 59 58 53 50 57.3%
GPT-5.4 64 62 58 53 50 57.3%
Qwen 3 32B 64 63 63 49 47 57.3%
GPT-4o, Aug. 6th (temp=0) 76 61 56 50 43 57.1%
GPT-5.5 (Reasoning) 62 59 56 56 53 57.0%
Qwen 3.5 Plus (2026-02-15) 63 60 55 55 51 56.8%
MoonshotAI: Kimi K2.5 73 59 54 49 48 56.8%
DeepSeek V3 (2024-12-26) 75 60 52 51 43 56.2%
Gemini 3.1 Flash Lite 63 60 58 54 44 56.0%
Z.AI GLM 5 Turbo 61 57 55 54 52 55.8%
o4 Mini High 63 61 56 52 47 55.8%
ByteDance Seed 1.6 64 63 62 46 43 55.6%
o4 Mini 60 59 54 53 51 55.5%
Z.AI GLM 4.5 Air 67 61 57 50 43 55.5%
Writer: Palmyra X5 59 57 56 55 49 55.0%
Z.AI GLM 4.6 70 63 57 43 39 54.4%
GPT-5 63 54 53 52 50 54.3%
Mistral Small Creative 67 59 57 46 42 54.0%
ByteDance Seed 1.6 Flash 58 55 55 54 48 53.8%
Stealth: Healer Alpha 59 58 52 51 49 53.8%
Grok 4.3 64 59 50 48 48 53.6%
GPT-5.4 (Reasoning) 63 59 49 49 48 53.6%
Mistral Medium 3.1 64 56 54 49 45 53.5%
MiniMax M2.7 70 68 46 45 37 53.2%
GPT-5.5 (Reasoning, Low) 55 55 53 52 48 52.7%
GPT-5.4 Mini (Reasoning, Low) 59 55 51 50 48 52.6%
Xiaomi MIMO v2.5 Pro 60 52 51 51 47 52.1%
GPT-5.4 Mini (Reasoning) 60 56 53 46 46 51.9%
Gemini 2.5 Flash Lite (Reasoning) 64 55 53 45 42 51.9%
GPT-4o, May 13th (temp=0) 64 58 52 43 43 51.8%
Z.AI GLM 4.7 60 55 54 49 41 51.7%
Mistral Small 4 67 61 45 44 40 51.6%
ByteDance Seed 2.0 Mini 63 51 49 45 44 50.5%
Qwen 3.5 Plus (2026-04-20) 68 57 57 41 29 50.3%
DeepSeek V3.2 57 54 50 49 41 50.2%
Nemotron 3 Super 54 51 50 48 47 50.1%
Grok 4.20 (Beta) 61 52 47 45 45 49.8%
Grok 4.3 (Reasoning) 66 54 45 42 40 49.5%
Mistral NeMO 60 59 47 43 38 49.5%
MoonshotAI: Kimi K2.6 57 55 50 43 42 49.3%
Mistral Large 53 52 51 46 44 49.1%
GPT-5.5 57 50 48 46 44 48.9%
Qwen 3.6 27B 65 51 50 42 36 48.8%
Qwen 2.5 72B 60 48 47 45 44 48.8%
Arcee AI: Trinity Mini 65 53 48 46 31 48.7%
Aion 2.0 53 51 50 47 43 48.7%
Gemini 3 Pro (Preview) 58 54 44 43 43 48.4%
Ministral 8B 55 54 45 45 43 48.3%
Qwen3.7 Max 55 53 53 42 36 47.7%
Mistral Small 4 (Reasoning) 63 48 43 42 42 47.5%
Gemini 3.1 Flash Lite (Preview) 56 52 47 44 37 47.3%
Mistral Large 3 57 50 45 42 41 47.0%
DeepSeek-V2 Chat 62 45 45 42 40 46.7%
GPT-5.4 Mini 51 47 46 45 45 46.7%
Ministral 3 14B 55 49 45 43 41 46.7%
Ministral 3B 70 46 41 38 38 46.6%
Gemini 3.1 Flash Lite (Reasoning) 52 49 47 44 40 46.4%
Xiaomi MIMO v2.5 49 46 46 45 44 46.2%
Gemini 3.1 Pro (Preview) 63 46 42 42 37 46.0%
Mistral Large 2 53 47 45 43 42 45.9%
Gemma 4 31B (Reasoning) 61 49 42 40 38 45.8%
Qwen 3.6 Flash 67 41 40 39 39 45.3%
GPT-5 Mini 53 45 44 43 42 45.2%
Gemma 4 31B 51 49 47 41 38 45.2%
Stealth: Hunter Alpha 49 48 47 42 35 44.5%
Qwen3.6 Max Preview 53 46 44 42 36 44.4%
Ministral 3 8B 59 48 42 38 34 44.4%
Gemini 3 Flash (Preview) 53 50 42 39 38 44.3%
Gemma 4 26B (Reasoning) 69 43 37 37 36 44.2%
Z.AI GLM 4.7 Flash 51 47 42 42 38 44.0%
GPT-5.4 Nano (Reasoning, Low) 46 45 43 43 42 43.8%
GPT-5.2 47 44 43 42 41 43.5%
GPT-5.4 Nano (Reasoning) 44 44 43 43 42 43.3%
Gemini 3.5 Flash (Reasoning) 48 45 44 40 39 43.3%
Qwen 3.5 397B A17B 50 45 43 41 38 43.2%
GPT-5.4 Nano 44 43 43 42 41 42.6%
GPT-OSS 120B 44 44 43 41 39 42.1%
Mistral Small 3.2 24B 46 43 42 41 36 41.7%
Qwen 3.5 9B 44 44 41 41 37 41.5%
Ministral 3 3B 48 42 42 41 33 41.2%
Qwen 3.5 Flash 51 46 41 34 32 41.0%
Qwen 3.6 35B 53 47 39 37 28 40.6%
Gemini 3 Flash (Preview, Reasoning) 45 44 40 39 34 40.5%
Qwen 3.5 122B 44 43 41 40 33 40.1%
Qwen 3.5 27B 43 43 40 37 36 39.9%
Inception Mercury 50 41 38 37 31 39.3%
Gemma 4 26B 41 41 40 39 32 38.6%
Inception Mercury 2 44 43 39 37 30 38.5%
Stealth: Aurora Alpha 45 39 37 37 31 37.9%
Qwen 3.5 35B 41 39 37 34 32 36.7%
Nemotron 3 Nano 38 38 35 32 25 33.8%
GPT-5 Nano 34 31 31 30 25 30.2%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Rocinante 12B 100 99 98 81 76 90.6%
Grok 4.1 Fast 100 93 89 85 84 90.0%
Llama 3.1 8B 100 100 96 88 66 89.9%
GPT-4o, Aug. 6th (temp=1) 100 91 83 82 78 86.8%
Hermes 3 70B 100 80 71 64 58 74.7%
Cohere Command R+ (Aug. 2024) 97 75 70 61 61 72.7%
Llama 3.1 Nemotron 70B 79 76 69 69 64 71.5%
Claude Sonnet 4 78 75 72 65 60 69.8%
GPT-4o Mini (temp=1) 74 73 67 66 63 68.5%
Qwen 3.5 Plus (2026-04-20) 85 77 73 61 46 68.2%
Arcee AI: Trinity Large (Preview) 83 74 68 63 50 67.7%
Hermes 3 405B 80 74 68 65 52 67.7%
DeepSeek V3 (2025-03-24) 81 78 67 58 51 66.8%
Gemma 3 27B 75 74 62 59 58 65.5%
Grok 4 73 69 63 58 57 63.9%
Claude Opus 4 79 64 61 60 50 62.7%
GPT-4o, May 13th (temp=1) 78 67 60 55 47 61.3%
Grok 4 Fast 70 61 58 56 56 60.2%
Claude Sonnet 4.5 67 66 59 56 53 60.1%
Claude 3 Haiku 68 61 60 56 55 60.0%
Claude 3.5 Sonnet 64 61 60 57 57 59.8%
Z.AI GLM 5 Turbo 71 67 56 53 48 59.0%
GPT-4.1 Mini 63 63 60 57 47 58.2%
DeepSeek V4 Pro 61 59 56 55 55 57.0%
Claude Opus 4.7 (Reasoning) 62 58 58 52 52 56.6%
MoonshotAI: Kimi K2.6 100 51 48 42 40 56.1%
Claude Opus 4.7 62 57 55 50 45 53.9%
GPT-4.1 61 54 54 54 46 53.8%
Claude Sonnet 4.6 (Reasoning) 60 59 54 51 42 53.5%
Claude 3.7 Sonnet 61 55 51 49 47 52.7%
Gemma 3 12B 63 57 51 50 41 52.6%
Claude Haiku 4.5 66 64 50 41 41 52.5%
DeepSeek V3 (2024-12-26) 88 51 46 42 35 52.4%
GPT-5.1 62 57 51 47 44 51.9%
Qwen3 235B A22B Instruct 2507 57 56 53 46 41 50.7%
GPT-4o Mini (temp=0) 54 52 50 50 47 50.7%
Z.AI GLM 5 58 51 50 46 45 49.9%
Claude Opus 4.5 57 51 50 49 40 49.6%
Writer: Palmyra X5 52 51 51 50 43 49.5%
GPT-4o, Aug. 6th (temp=0) 55 51 50 46 45 49.4%
Mistral Small 4 (Reasoning) 65 50 44 43 43 49.0%
Grok 4.20 (Reasoning) 62 51 45 44 42 48.9%
GPT-5.4 (Reasoning, Low) 54 50 50 47 43 48.7%
GPT-5.4 (Reasoning) 53 52 49 44 44 48.5%
MoonshotAI: Kimi K2.5 67 54 47 40 35 48.5%
Llama 3.1 70B 62 50 50 42 38 48.4%
Claude Sonnet 4.6 67 50 50 45 29 48.4%
LFM2 24B 67 49 49 39 37 48.3%
GPT-5.4 52 52 47 45 45 48.3%
Claude Opus 4.6 (Reasoning) 54 54 47 44 41 47.9%
Mistral Small 4 50 49 48 47 43 47.3%
GPT-4.1 Nano 60 49 44 42 42 47.2%
Claude Opus 4.6 56 49 48 44 38 47.0%
Z.AI GLM 5.1 66 47 46 43 31 46.6%
o4 Mini 54 47 46 42 41 46.1%
Qwen 3 32B 54 46 44 41 41 45.2%
DeepSeek-V2 Chat 57 45 43 41 40 45.2%
o4 Mini High 51 47 44 43 40 45.0%
Grok 4.20 (Beta, Reasoning) 52 49 43 42 39 44.9%
Gemini 2.5 Flash Lite (Reasoning) 51 47 45 44 37 44.7%
MiniMax M2.5 58 57 37 35 34 44.3%
Grok 4.20 55 43 42 42 40 44.3%
Grok 4.3 46 46 45 42 42 44.3%
Gemini 3.5 Flash (Reasoning, Minimal) 53 48 42 39 38 44.0%
Z.AI GLM 4.5 57 44 42 40 37 43.8%
DeepSeek V4 Pro (Reasoning) 58 46 41 38 34 43.3%
Z.AI GLM 4.7 52 46 43 39 36 43.3%
Qwen 3.6 Flash 64 44 39 36 33 43.3%
Mistral Medium 3.1 47 45 44 43 37 43.1%
ByteDance Seed 1.6 Flash 47 46 43 41 38 43.1%
MiniMax M2.7 54 44 44 37 36 43.0%
Gemini 2.5 Flash Lite 45 44 44 41 40 42.8%
Grok 4.20 (Beta) 46 43 42 42 40 42.5%
GPT-4o, May 13th (temp=0) 45 45 45 38 37 41.9%
Mistral Large 2 55 41 40 39 34 41.8%
Z.AI GLM 4.5 Air 65 38 37 34 33 41.6%
GPT-5.4 Mini 44 42 42 40 40 41.6%
GPT-5.4 Mini (Reasoning, Low) 45 45 40 39 38 41.5%
GPT-5.5 (Reasoning, Low) 46 45 39 39 38 41.4%
Qwen3.7 Max 49 43 42 36 35 41.2%
DeepSeek V3.1 59 45 35 34 32 41.1%
Mistral Large 52 44 43 40 27 41.1%
Mistral Large 3 48 46 38 36 36 41.1%
GPT-5 44 42 40 39 39 41.0%
GPT-5.5 47 41 40 39 38 40.9%
Arcee AI: Trinity Mini 52 46 43 37 27 40.9%
Gemini 2.5 Pro 52 43 42 35 32 40.8%
Gemini 3 Pro (Preview) 49 40 39 39 36 40.8%
DeepSeek V4 Flash 53 41 40 39 32 40.8%
Mistral Small Creative 56 45 36 35 31 40.5%
GPT-5.4 Mini (Reasoning) 42 42 40 39 39 40.5%
Qwen 2.5 72B 42 41 40 40 39 40.4%
Ministral 3B 53 41 40 35 31 40.2%
Gemini 3.1 Flash Lite (Preview) 49 40 39 38 35 40.2%
Qwen 3.5 397B A17B 49 43 37 36 35 40.1%
Gemma 3 4B 45 40 39 39 38 40.1%
GPT-OSS 120B 53 40 39 38 31 40.0%
Stealth: Hunter Alpha 46 44 38 38 33 40.0%
Ministral 3 8B 45 40 40 38 37 39.9%
ByteDance Seed 2.0 Mini 49 45 39 34 33 39.9%
Ministral 3 14B 53 41 37 35 33 39.8%
Stealth: Aurora Alpha 44 42 40 39 35 39.8%
Gemini 3.5 Flash (Reasoning) 49 40 38 36 34 39.2%
GPT-5.5 (Reasoning) 43 40 38 38 36 39.1%
Aion 2.0 48 42 39 37 29 39.1%
GPT-5.2 40 40 39 39 37 39.0%
Xiaomi MIMO v2.5 Pro 46 45 39 34 31 39.0%
Inception Mercury 2 46 41 40 38 28 38.6%
Ministral 8B 41 40 39 38 35 38.6%
GPT-5.4 Nano (Reasoning) 40 40 38 37 37 38.4%
GPT-5 Mini 43 40 38 35 35 38.3%
Nemotron 3 Super 42 41 39 35 33 37.8%
Qwen 3.5 Plus (2026-02-15) 46 40 36 34 34 37.8%
Qwen 3.5 Flash 39 39 38 36 35 37.4%
WizardLM 2 8x22b 46 39 35 34 34 37.4%
DeepSeek V3.2 42 38 36 35 34 37.2%
Ministral 3 3B 45 38 38 35 30 37.1%
Gemini 2.5 Flash 47 40 39 31 28 37.0%
Gemini 3.1 Pro (Preview) 42 39 35 35 34 36.9%
Xiaomi MIMO v2.5 43 41 34 32 32 36.6%
Z.AI GLM 4.7 Flash 39 38 37 35 34 36.6%
Qwen3.6 Max Preview 57 36 34 31 25 36.5%
Mistral NeMO 48 37 37 34 25 36.4%
Gemini 3.1 Flash Lite (Reasoning) 38 38 38 36 31 36.2%
DeepSeek V4 Flash (Reasoning) 42 38 35 35 30 36.2%
Stealth: Healer Alpha 39 38 37 35 31 36.0%
Z.AI GLM 4.6 47 38 32 31 30 35.7%
Gemini 3 Flash (Preview, Reasoning) 39 37 36 35 31 35.7%
Gemini 2.5 Flash (Reasoning) 39 39 38 37 25 35.6%
GPT-5.4 Nano 40 35 35 34 34 35.5%
Gemini 3 Flash (Preview) 42 37 35 35 29 35.5%
ByteDance Seed 1.6 43 41 33 31 30 35.5%
ByteDance Seed 2.0 Lite 46 36 36 30 29 35.2%
Gemma 4 31B 37 36 36 34 32 35.0%
Qwen 3.6 35B 43 34 33 32 32 34.9%
Gemini 3.1 Flash Lite 39 38 34 32 29 34.4%
Grok 4.3 (Reasoning) 46 34 33 29 29 34.2%
Qwen 3.5 122B 38 37 36 29 27 33.5%
GPT-5.4 Nano (Reasoning, Low) 37 35 33 32 31 33.5%
Qwen 3.5 9B 41 35 33 31 25 33.1%
Gemma 4 31B (Reasoning) 34 34 33 29 28 31.6%
Qwen 3.5 27B 37 33 31 29 27 31.4%
Qwen 3.5 35B 40 33 30 28 25 31.2%
Qwen 3.6 27B 38 36 29 26 25 30.7%
Nemotron 3 Nano 33 32 31 29 27 30.5%
Gemma 4 26B (Reasoning) 30 29 29 29 29 29.3%
Mistral Small 3.2 24B 33 31 25 25 28.6%
Gemma 4 26B 31 30 27 26 25 27.7%
Inception Mercury 34 27 27 25 25 27.7%
GPT-5 Nano 30 28 27 25 25 27.2%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Grok 4.1 Fast 99 98 98 96 90 96.1%
Llama 3.1 8B 100 100 100 75 62 87.3%
Llama 3.1 Nemotron 70B 91 90 87 83 81 86.2%
GPT-4o Mini (temp=1) 94 93 85 82 77 86.1%
Grok 4 90 87 87 87 77 85.5%
GPT-4o, Aug. 6th (temp=1) 90 89 84 82 69 82.8%
Cohere Command R+ (Aug. 2024) 98 88 88 86 51 82.0%
Hermes 3 405B 88 83 82 82 73 81.5%
Claude Sonnet 4 94 86 81 80 59 80.1%
Grok 4 Fast 84 82 77 76 75 78.7%
Hermes 3 70B 98 97 76 74 43 77.4%
Rocinante 12B 100 79 74 64 61 75.5%
Claude 3.5 Sonnet 89 75 74 72 67 75.4%
Claude Opus 4 80 80 77 67 62 73.1%
Claude Sonnet 4.5 97 76 75 59 58 72.9%
Z.AI GLM 4.5 94 76 69 66 59 72.6%
GPT-4.1 77 77 71 71 67 72.5%
Claude 3.7 Sonnet 86 72 68 68 63 71.4%
LFM2 24B 89 73 72 69 53 71.3%
GPT-4o Mini (temp=0) 93 71 65 64 63 71.1%
GPT-4o, May 13th (temp=1) 87 70 68 67 64 71.1%
GPT-4.1 Mini 85 74 73 71 49 70.5%
Gemma 3 27B 92 78 68 57 56 70.2%
GPT-4.1 Nano 82 70 65 65 64 69.4%
Gemma 3 12B 71 71 69 69 64 68.9%
Arcee AI: Trinity Mini 89 69 63 63 61 68.9%
DeepSeek V3 (2024-12-26) 85 78 67 57 56 68.7%
Claude Opus 4.7 74 70 67 63 59 66.6%
Gemini 2.5 Flash Lite 77 71 65 61 58 66.2%
Qwen 3.5 Plus (2026-02-15) 77 72 67 63 51 66.1%
Z.AI GLM 5 76 69 67 63 55 66.1%
DeepSeek V3 (2025-03-24) 79 70 64 63 54 66.0%
Grok 4.20 (Reasoning) 76 67 63 62 61 65.7%
GPT-4o, Aug. 6th (temp=0) 70 66 62 61 59 63.5%
o4 Mini High 81 78 56 50 50 62.9%
Llama 3.1 70B 100 60 58 53 42 62.4%
Grok 4.20 69 64 63 60 54 62.1%
DeepSeek-V2 Chat 80 64 63 55 47 61.9%
Grok 4.20 (Beta, Reasoning) 81 64 58 57 49 61.7%
Grok 4.20 (Beta) 79 65 60 54 49 61.6%
Mistral Medium 3.1 72 68 60 55 52 61.4%
Z.AI GLM 4.5 Air 72 70 59 56 46 60.4%
Z.AI GLM 5 Turbo 71 63 59 55 54 60.2%
Mistral Large 3 70 69 61 60 38 59.6%
Gemini 2.5 Flash 75 74 52 52 43 59.4%
Claude Opus 4.5 66 62 58 56 55 59.3%
GPT-4o, May 13th (temp=0) 71 67 57 53 45 58.7%
Mistral Large 2 72 60 58 53 51 58.6%
Mistral Small 4 65 65 58 54 50 58.5%
Claude Haiku 4.5 63 60 59 58 52 58.3%
Gemini 2.5 Flash Lite (Reasoning) 63 62 61 60 43 58.0%
Claude 3 Haiku 63 63 56 55 51 57.3%
Grok 4.3 65 61 59 53 46 56.9%
Claude Sonnet 4.6 (Reasoning) 75 60 54 49 46 56.6%
Qwen 3 32B 66 65 57 48 46 56.5%
Z.AI GLM 5.1 78 59 53 50 41 56.3%
Arcee AI: Trinity Large (Preview) 79 64 55 51 28 55.5%
Qwen 2.5 72B 67 60 55 47 47 55.1%
Gemini 2.5 Flash (Reasoning) 70 54 52 51 45 54.6%
Claude Sonnet 4.6 62 58 58 57 36 54.5%
DeepSeek V4 Flash 83 56 48 44 41 54.5%
Mistral Small 4 (Reasoning) 61 58 54 51 45 53.9%
DeepSeek V4 Pro 62 57 55 50 46 53.9%
Claude Opus 4.7 (Reasoning) 69 54 53 47 45 53.5%
Mistral NeMO 68 59 58 46 35 53.0%
Gemma 3 4B 58 56 55 53 42 52.8%
DeepSeek V4 Pro (Reasoning) 54 52 52 51 49 51.6%
WizardLM 2 8x22b 69 55 45 44 44 51.4%
Qwen3 235B A22B Instruct 2507 64 55 51 44 40 50.9%
Gemini 2.5 Pro 58 54 52 47 44 50.8%
o4 Mini 54 51 50 48 48 50.1%
Mistral Large 56 55 50 47 42 50.0%
MoonshotAI: Kimi K2.5 56 52 50 47 42 49.4%
Nemotron 3 Super 58 56 47 43 43 49.4%
Writer: Palmyra X5 53 52 49 48 44 49.0%
Ministral 3 8B 55 51 48 46 45 48.9%
Mistral Small Creative 61 52 50 43 38 48.9%
Gemini 3.5 Flash (Reasoning, Minimal) 57 51 51 44 41 48.9%
Claude Opus 4.6 57 52 48 46 40 48.8%
MiniMax M2.5 55 52 47 46 43 48.7%
Gemma 4 26B 60 51 50 43 39 48.6%
ByteDance Seed 1.6 Flash 58 48 46 45 44 48.4%
GPT-OSS 120B 51 49 48 47 45 48.1%
ByteDance Seed 2.0 Lite 65 57 47 42 29 47.9%
Stealth: Hunter Alpha 54 52 50 44 39 47.8%
GPT-5.1 54 48 47 45 43 47.2%
Xiaomi MIMO v2.5 Pro 60 50 43 42 41 47.2%
DeepSeek V4 Flash (Reasoning) 50 47 47 46 44 46.9%
MiniMax M2.7 54 49 46 42 41 46.5%
Claude Opus 4.6 (Reasoning) 50 49 47 43 42 46.3%
GPT-5.4 Mini (Reasoning, Low) 48 47 46 45 43 45.9%
Stealth: Aurora Alpha 50 49 49 42 39 45.8%
Inception Mercury 2 50 49 46 44 41 45.8%
Ministral 3 14B 49 49 48 42 40 45.8%
DeepSeek V3.1 68 44 42 38 36 45.8%
GPT-5.4 47 46 46 45 44 45.7%
DeepSeek V3.2 59 47 45 39 37 45.3%
GPT-5.5 47 46 46 44 43 45.3%
GPT-5.4 (Reasoning) 46 45 45 45 45 45.2%
GPT-5.4 Mini 46 46 46 45 44 45.1%
Qwen 3.6 27B 64 55 42 39 25 45.1%
Ministral 8B 48 48 46 44 40 45.1%
GPT-5.4 Mini (Reasoning) 47 45 45 45 42 44.9%
GPT-5.4 (Reasoning, Low) 45 45 45 45 43 44.7%
Xiaomi MIMO v2.5 50 48 43 41 40 44.3%
Gemini 3.1 Flash Lite (Preview) 57 44 42 41 38 44.3%
GPT-5.5 (Reasoning, Low) 45 45 45 43 43 44.3%
Gemma 4 26B (Reasoning) 48 47 46 41 38 44.1%
ByteDance Seed 2.0 Mini 55 45 43 41 36 44.0%
Gemini 3.5 Flash (Reasoning) 50 46 45 42 37 44.0%
Mistral Small 3.2 24B 50 49 49 48 25 44.0%
GPT-5.2 46 45 43 43 42 43.7%
GPT-5.5 (Reasoning) 44 44 43 42 42 43.3%
Grok 4.3 (Reasoning) 60 50 41 36 30 43.3%
Gemini 3.1 Pro (Preview) 44 44 43 43 42 43.2%
Aion 2.0 47 46 44 42 35 42.9%
GPT-5.4 Nano (Reasoning) 45 43 43 42 41 42.8%
Nemotron 3 Nano 44 44 44 43 38 42.7%
Stealth: Healer Alpha 47 44 41 41 39 42.7%
Ministral 3 3B 45 45 42 42 39 42.6%
Z.AI GLM 4.7 48 46 41 40 38 42.6%
Gemini 3 Pro (Preview) 50 49 41 39 33 42.3%
GPT-5.4 Nano (Reasoning, Low) 42 42 42 42 41 41.7%
Gemma 4 31B (Reasoning) 46 42 41 41 39 41.7%
Gemma 4 31B 43 43 42 41 41 41.7%
MoonshotAI: Kimi K2.6 43 43 42 41 38 41.6%
Ministral 3B 49 43 41 41 34 41.6%
Z.AI GLM 4.6 50 46 39 36 35 41.3%
Qwen3.7 Max 49 41 40 39 37 41.2%
GPT-5.4 Nano 44 41 41 40 40 40.9%
GPT-5 Mini 43 42 41 41 37 40.7%
GPT-5 Nano 43 41 38 37 35 38.9%
ByteDance Seed 1.6 50 48 36 30 30 38.9%
Gemini 3 Flash (Preview, Reasoning) 43 40 37 37 36 38.8%
Gemini 3 Flash (Preview) 41 40 39 38 36 38.6%
GPT-5 42 37 37 36 36 37.5%
Qwen 3.5 Flash 44 38 38 37 30 37.5%
Gemini 3.1 Flash Lite (Reasoning) 48 46 33 31 28 37.3%
Gemini 3.1 Flash Lite 46 40 39 32 29 37.0%
Qwen 3.5 397B A17B 38 37 37 36 34 36.4%
Qwen 3.5 27B 43 41 40 33 25 36.4%
Z.AI GLM 4.7 Flash 45 37 37 32 30 36.3%
Qwen 3.5 Plus (2026-04-20) 54 34 33 33 25 35.7%
Inception Mercury 50 39 38 27 25 35.7%
Qwen 3.5 122B 38 36 34 34 30 34.6%
Qwen 3.6 Flash 44 35 33 31 25 33.7%
Qwen 3.5 9B 39 38 38 29 25 33.7%
Qwen3.6 Max Preview 40 34 34 31 25 32.7%
Qwen 3.5 35B 39 38 32 29 25 32.5%
Qwen 3.6 35B 35 31 31 25 25 29.5%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude 3.5 Sonnet 96 95 94 92 88 92.9%
Grok 4.1 Fast 96 91 87 87 81 88.5%
Claude Opus 4 95 93 89 82 81 88.1%
GPT-4o, Aug. 6th (temp=1) 100 99 82 79 79 87.7%
Cohere Command R+ (Aug. 2024) 94 93 86 84 77 86.9%
Rocinante 12B 100 100 100 71 61 86.6%
Llama 3.1 Nemotron 70B 90 89 83 83 79 85.0%
Gemma 3 27B 99 89 80 78 77 84.5%
Hermes 3 405B 100 85 81 81 75 84.3%
Grok 4 Fast 95 92 86 83 65 84.3%
DeepSeek V3.1 96 94 89 72 64 83.0%
GPT-4o Mini (temp=1) 92 88 85 79 71 82.9%
Claude Sonnet 4.6 94 85 81 78 75 82.9%
Arcee AI: Trinity Mini 89 86 84 81 69 82.0%
Claude 3.7 Sonnet 96 88 79 74 71 81.6%
Hermes 3 70B 100 95 82 74 57 81.5%
GPT-4o, May 13th (temp=1) 98 87 80 73 70 81.4%
Claude Sonnet 4.5 93 90 85 69 69 81.3%
Claude Sonnet 4.6 (Reasoning) 90 89 76 76 75 81.2%
Claude Sonnet 4 87 83 82 80 73 81.1%
Z.AI GLM 4.5 99 81 75 75 70 80.0%
Z.AI GLM 5 Turbo 90 88 79 76 63 79.2%
Grok 4 91 87 74 70 69 78.2%
Claude Opus 4.7 89 86 76 73 63 77.4%
Claude Haiku 4.5 84 83 73 73 73 77.3%
DeepSeek V3 (2025-03-24) 98 88 70 68 62 77.1%
Llama 3.1 8B 100 93 66 59 58 75.4%
GPT-4o Mini (temp=0) 80 80 74 72 70 75.2%
Claude Opus 4.7 (Reasoning) 81 78 75 70 68 74.6%
Z.AI GLM 5 89 73 71 67 67 73.5%
GPT-4.1 Mini 86 74 71 69 67 73.4%
Arcee AI: Trinity Large (Preview) 90 75 72 69 58 72.8%
GPT-4.1 78 78 74 66 65 72.4%
DeepSeek V4 Pro 88 69 69 65 65 71.0%
Claude 3 Haiku 82 76 67 66 63 70.9%
Gemma 3 12B 81 77 71 65 61 70.8%
Gemini 2.5 Flash Lite 79 76 74 61 61 70.2%
Llama 3.1 70B 100 69 66 58 54 69.5%
MiniMax M2.5 76 73 69 67 61 69.2%
GPT-4.1 Nano 82 71 68 63 57 68.3%
Claude Opus 4.6 (Reasoning) 77 73 67 62 60 67.9%
GPT-5.1 71 71 70 64 61 67.5%
Gemini 2.5 Flash Lite (Reasoning) 81 70 67 62 56 67.4%
Qwen 3.5 Plus (2026-02-15) 72 71 64 63 63 66.6%
MoonshotAI: Kimi K2.5 77 73 64 61 57 66.4%
MiniMax M2.7 89 63 60 60 59 66.2%
Grok 4.20 (Reasoning) 75 73 66 58 58 66.0%
Z.AI GLM 5.1 78 70 68 62 49 65.4%
Claude Opus 4.6 77 76 66 59 47 65.2%
Gemini 3.5 Flash (Reasoning, Minimal) 74 72 66 59 55 65.1%
Qwen3 235B A22B Instruct 2507 74 64 64 62 62 65.0%
GPT-5.4 (Reasoning) 68 66 65 62 60 64.3%
Claude Opus 4.5 73 69 64 60 54 64.2%
Xiaomi MIMO v2.5 69 67 64 62 58 63.9%
Ministral 8B 81 68 58 57 56 63.9%
Z.AI GLM 4.5 Air 67 67 63 62 60 63.8%
Qwen 2.5 72B 75 71 61 55 53 63.0%
Mistral Large 2 65 65 64 64 56 62.8%
LFM2 24B 83 64 63 56 48 62.7%
GPT-4o, Aug. 6th (temp=0) 81 67 65 50 50 62.7%
GPT-4o, May 13th (temp=0) 70 65 64 62 53 62.7%
DeepSeek V4 Flash 75 65 64 57 50 62.0%
GPT-5.4 73 65 62 59 52 62.0%
Writer: Palmyra X5 66 66 61 60 54 61.3%
Gemini 2.5 Pro 78 62 58 54 54 61.1%
Ministral 3 14B 79 72 66 46 42 61.0%
Qwen3.7 Max 67 63 59 57 57 60.9%
Gemini 2.5 Flash (Reasoning) 68 63 62 59 53 60.9%
Ministral 3B 71 67 63 62 42 60.8%
DeepSeek-V2 Chat 66 66 62 59 51 60.8%
Mistral Small 4 66 64 59 58 56 60.7%
Gemini 2.5 Flash 70 67 62 53 50 60.6%
o4 Mini High 67 67 65 56 46 60.3%
DeepSeek V4 Pro (Reasoning) 70 68 59 55 48 59.9%
Mistral Large 3 83 63 57 51 45 59.8%
Stealth: Hunter Alpha 69 61 59 57 51 59.5%
Gemini 3.5 Flash (Reasoning) 77 70 56 52 43 59.5%
Stealth: Healer Alpha 68 66 60 59 44 59.4%
Aion 2.0 71 69 59 52 43 58.8%
Grok 4.20 (Beta, Reasoning) 71 65 60 58 40 58.8%
ByteDance Seed 1.6 Flash 68 63 62 55 46 58.7%
Gemma 3 4B 68 64 62 59 40 58.7%
DeepSeek V4 Flash (Reasoning) 74 65 61 54 39 58.6%
Gemini 3 Pro (Preview) 69 66 65 50 44 58.6%
GPT-5.5 64 60 57 56 55 58.3%
Grok 4.20 69 67 55 55 46 58.3%
Mistral Large 68 63 59 58 43 58.2%
DeepSeek V3 (2024-12-26) 75 69 53 47 46 58.0%
o4 Mini 66 60 58 53 52 57.8%
Z.AI GLM 4.7 64 59 57 56 50 57.1%
GPT-5.4 (Reasoning, Low) 60 59 59 58 48 56.8%
Nemotron 3 Super 78 61 50 48 46 56.5%
Xiaomi MIMO v2.5 Pro 64 62 54 52 49 56.3%
Qwen 3 32B 63 60 59 51 48 56.2%
ByteDance Seed 2.0 Mini 71 65 53 50 40 55.6%
GPT-5.5 (Reasoning, Low) 58 56 56 54 53 55.5%
Grok 4.3 64 60 59 49 45 55.3%
Z.AI GLM 4.6 64 57 56 52 47 55.1%
MoonshotAI: Kimi K2.6 60 56 55 51 48 54.0%
Grok 4.20 (Beta) 59 57 55 50 49 53.8%
WizardLM 2 8x22b 62 57 54 50 45 53.8%
Mistral Medium 3.1 61 58 56 47 47 53.7%
GPT-5.5 (Reasoning) 59 57 52 51 48 53.4%
ByteDance Seed 2.0 Lite 62 58 52 52 43 53.3%
GPT-5 64 58 52 51 42 53.1%
Mistral Small 4 (Reasoning) 65 59 54 45 42 53.0%
Mistral NeMO 65 62 54 50 30 52.4%
Qwen3.6 Max Preview 64 58 56 47 34 51.8%
Ministral 3 8B 63 60 53 42 41 51.7%
DeepSeek V3.2 57 55 49 48 46 50.9%
Qwen 3.5 Plus (2026-04-20) 57 55 54 52 34 50.5%
Gemini 3.1 Flash Lite (Preview) 58 54 50 50 38 49.9%
Ministral 3 3B 64 53 45 45 43 49.7%
Gemini 3.1 Pro (Preview) 59 51 49 42 42 48.6%
Gemini 3 Flash (Preview, Reasoning) 60 59 44 41 39 48.5%
GPT-5.4 Mini 52 50 47 46 45 47.9%
Gemini 3.1 Flash Lite (Reasoning) 66 49 46 40 38 47.7%
Qwen 3.6 Flash 70 51 44 42 30 47.2%
GPT-5.4 Mini (Reasoning, Low) 51 48 47 45 45 47.2%
Mistral Small Creative 56 50 46 43 40 47.2%
GPT-5.4 Mini (Reasoning) 50 48 48 46 42 46.9%
Gemini 3 Flash (Preview) 61 47 44 42 41 46.9%
Gemma 4 31B 53 53 46 44 38 46.8%
Qwen 3.6 35B 69 50 40 39 35 46.5%
Qwen 3.6 27B 57 53 51 35 35 46.3%
Z.AI GLM 4.7 Flash 58 45 43 43 41 45.9%
Gemma 4 26B (Reasoning) 52 49 44 43 41 45.7%
GPT-5 Mini 50 47 45 44 42 45.6%
Qwen 3.5 397B A17B 58 53 42 38 37 45.6%
GPT-OSS 120B 63 45 41 40 39 45.5%
GPT-5.2 48 46 46 45 40 44.8%
GPT-5.4 Nano (Reasoning) 45 45 43 42 41 43.2%
GPT-5.4 Nano 43 43 43 42 42 42.5%
GPT-5.4 Nano (Reasoning, Low) 43 43 43 43 40 42.5%
ByteDance Seed 1.6 43 43 43 42 41 42.3%
Qwen 3.5 Flash 50 42 40 40 40 42.3%
Qwen 3.5 27B 49 47 42 40 32 42.0%
Gemini 3.1 Flash Lite 48 47 41 36 36 41.5%
Gemma 4 31B (Reasoning) 44 44 40 40 34 40.4%
Grok 4.3 (Reasoning) 45 44 41 40 31 40.2%
Gemma 4 26B 42 41 39 39 38 39.9%
Mistral Small 3.2 24B 58 50 39 25 25 39.4%
Inception Mercury 2 45 42 39 37 33 39.2%
Qwen 3.5 122B 48 40 39 35 34 39.2%
Stealth: Aurora Alpha 42 42 39 39 35 39.1%
Inception Mercury 42 41 37 34 33 37.5%
Qwen 3.5 9B 47 45 41 29 25 37.3%
Qwen 3.5 35B 41 41 38 36 25 36.2%
Nemotron 3 Nano 41 37 36 33 26 34.7%
GPT-5 Nano 37 34 34 33 32 33.7%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Llama 3.1 Nemotron 70B 95 95 87 83 79 87.8%
GPT-4o, Aug. 6th (temp=1) 95 95 84 79 77 86.2%
Claude 3.5 Sonnet 90 84 84 84 78 84.1%
Grok 4.1 Fast 88 87 86 79 78 83.7%
Claude Sonnet 4 91 89 85 80 68 82.4%
GPT-4o, May 13th (temp=1) 93 84 83 77 71 81.5%
Hermes 3 405B 91 90 75 72 72 80.0%
GPT-4o Mini (temp=1) 86 81 80 77 73 79.5%
Llama 3.1 8B 100 76 75 75 69 78.8%
Gemma 3 27B 93 81 75 69 63 76.4%
Hermes 3 70B 98 82 76 66 59 76.2%
Rocinante 12B 100 97 76 50 46 74.0%
Claude Opus 4 87 73 69 68 66 72.7%
GPT-4.1 Mini 82 77 72 68 58 71.4%
Claude Sonnet 4.5 86 75 74 60 60 71.1%
Grok 4 75 74 70 67 64 69.9%
DeepSeek V3 (2025-03-24) 83 73 67 64 61 69.4%
GPT-4.1 Nano 84 76 69 64 52 69.0%
Cohere Command R+ (Aug. 2024) 79 70 69 64 64 69.0%
LFM2 24B 96 71 65 61 50 68.6%
Grok 4.3 78 67 67 65 65 68.6%
Z.AI GLM 4.5 72 71 67 66 65 68.1%
Llama 3.1 70B 88 69 65 63 55 67.8%
Claude 3.7 Sonnet 73 70 66 64 59 66.2%
Claude Haiku 4.5 74 72 63 61 59 65.8%
GPT-4o Mini (temp=0) 76 70 62 61 54 64.7%
Gemma 3 12B 78 65 64 61 55 64.6%
Qwen 3 32B 77 66 65 59 55 64.3%
DeepSeek-V2 Chat 76 65 61 59 58 63.9%
Grok 4 Fast 71 70 62 62 54 63.5%
GPT-4.1 76 65 61 59 55 63.2%
DeepSeek V3 (2024-12-26) 70 70 60 58 57 62.8%
GPT-4o, Aug. 6th (temp=0) 74 73 55 52 52 61.4%
Claude Opus 4.7 (Reasoning) 70 66 59 57 55 61.2%
Claude Sonnet 4.6 70 67 61 52 50 60.3%
Qwen 2.5 72B 64 64 59 58 56 60.2%
Claude 3 Haiku 74 68 60 55 43 60.0%
Qwen 3.5 Plus (2026-02-15) 75 63 60 60 42 59.9%
Claude Opus 4.7 75 63 59 52 46 59.1%
Mistral Small 4 70 66 56 55 49 59.0%
Grok 4.3 (Reasoning) 76 73 65 43 37 58.8%
Gemma 3 4B 65 63 60 53 51 58.5%
Gemini 2.5 Flash (Reasoning) 73 62 60 49 48 58.3%
Z.AI GLM 4.5 Air 64 64 63 53 47 58.0%
Arcee AI: Trinity Large (Preview) 67 64 59 53 47 57.9%
Z.AI GLM 5 66 65 53 52 51 57.5%
Claude Sonnet 4.6 (Reasoning) 62 60 59 56 47 56.9%
Gemini 2.5 Flash 65 62 55 54 48 56.9%
Grok 4.20 71 57 53 51 50 56.4%
Mistral Large 2 67 60 58 50 48 56.4%
ByteDance Seed 1.6 Flash 70 64 57 48 41 55.9%
Claude Opus 4.5 64 57 57 54 47 55.7%
DeepSeek V4 Pro 59 59 57 54 47 55.3%
Mistral Medium 3.1 65 56 55 51 49 55.2%
GPT-4o, May 13th (temp=0) 60 57 57 51 50 55.1%
DeepSeek V4 Flash 61 58 58 49 48 55.0%
o4 Mini 62 61 54 49 48 54.8%
Gemini 3.5 Flash (Reasoning, Minimal) 59 56 55 54 44 53.8%
Gemini 2.5 Flash Lite 71 60 49 45 41 53.4%
Grok 4.20 (Reasoning) 57 55 52 51 50 53.0%
DeepSeek V4 Pro (Reasoning) 59 54 50 50 49 52.7%
Z.AI GLM 5.1 65 58 57 50 33 52.7%
Gemini 2.5 Pro 65 53 52 48 45 52.3%
Qwen3 235B A22B Instruct 2507 59 55 54 48 45 52.3%
Mistral Large 59 58 58 45 41 52.0%
Ministral 3 14B 68 54 51 44 43 51.9%
Grok 4.20 (Beta, Reasoning) 56 56 54 47 46 51.6%
MiniMax M2.7 64 58 52 45 38 51.5%
Ministral 3 8B 61 58 48 46 44 51.4%
Claude Opus 4.6 (Reasoning) 57 54 53 45 45 50.8%
DeepSeek V4 Flash (Reasoning) 59 56 48 47 41 50.3%
Mistral Small Creative 62 56 47 43 43 50.3%
Mistral Large 3 58 51 51 47 43 49.9%
Grok 4.20 (Beta) 57 54 50 45 43 49.8%
o4 Mini High 63 48 48 46 44 49.8%
MoonshotAI: Kimi K2.5 54 52 50 49 42 49.4%
WizardLM 2 8x22b 60 54 47 44 42 49.4%
Claude Opus 4.6 51 51 50 50 45 49.4%
Ministral 3 3B 63 48 48 44 43 49.3%
Mistral Small 4 (Reasoning) 56 54 51 43 40 49.1%
Writer: Palmyra X5 58 57 44 43 41 48.6%
MoonshotAI: Kimi K2.6 55 51 51 44 42 48.4%
Gemini 2.5 Flash Lite (Reasoning) 53 51 47 46 43 47.9%
Nemotron 3 Super 58 48 44 44 44 47.8%
GPT-5.1 53 49 46 44 44 47.3%
MiniMax M2.5 51 49 48 46 42 47.2%
Mistral NeMO 67 49 44 42 33 47.2%
GPT-5.4 48 47 47 47 46 46.9%
ByteDance Seed 2.0 Mini 66 45 43 43 38 46.9%
Z.AI GLM 4.6 63 49 47 38 35 46.5%
Inception Mercury 2 49 47 47 45 44 46.5%
Ministral 8B 50 49 48 46 40 46.5%
Qwen 3.6 Flash 92 37 36 34 33 46.4%
Z.AI GLM 5 Turbo 54 49 48 40 40 46.4%
GPT-5.4 (Reasoning, Low) 50 47 45 44 44 46.3%
Ministral 3B 49 48 47 47 38 45.9%
DeepSeek V3.2 52 51 46 41 40 45.8%
GPT-5.4 Mini 47 47 45 45 44 45.8%
GPT-5.4 (Reasoning) 47 46 46 45 45 45.8%
GPT-5.4 Mini (Reasoning) 46 46 46 45 44 45.5%
GPT-OSS 120B 47 47 47 44 42 45.4%
Arcee AI: Trinity Mini 54 53 45 38 37 45.3%
GPT-5.4 Mini (Reasoning, Low) 46 46 46 44 44 45.2%
GPT-5.5 (Reasoning) 46 46 45 45 44 45.1%
DeepSeek V3.1 58 43 43 41 40 44.9%
Xiaomi MIMO v2.5 Pro 53 48 45 40 38 44.8%
Xiaomi MIMO v2.5 52 46 42 42 40 44.3%
Qwen 3.6 35B 77 43 35 34 32 44.2%
GPT-5.2 46 45 44 43 43 44.0%
Gemini 3 Flash (Preview) 47 45 45 44 39 44.0%
GPT-5.5 (Reasoning, Low) 45 45 44 44 42 43.9%
Aion 2.0 51 47 43 41 37 43.8%
Stealth: Aurora Alpha 49 46 43 41 40 43.8%
GPT-5.5 45 44 44 43 42 43.7%
Z.AI GLM 4.7 46 46 44 42 37 43.0%
Qwen3.7 Max 46 46 43 41 38 42.9%
Stealth: Hunter Alpha 50 42 42 42 38 42.9%
Gemini 3.5 Flash (Reasoning) 46 44 44 41 39 42.8%
Gemini 3.1 Pro (Preview) 44 43 43 42 42 42.8%
Mistral Small 3.2 24B 49 47 43 41 34 42.5%
GPT-5.4 Nano (Reasoning) 43 43 43 41 40 42.1%
Stealth: Healer Alpha 50 42 41 39 38 42.0%
GPT-5.4 Nano (Reasoning, Low) 44 42 42 42 39 41.8%
Gemma 4 26B 47 42 41 41 38 41.8%
Gemini 3 Pro (Preview) 47 45 42 39 34 41.6%
GPT-5 Nano 43 42 41 40 40 41.0%
GPT-5.4 Nano 42 41 41 41 40 40.9%
Gemini 3 Flash (Preview, Reasoning) 43 41 41 40 39 40.8%
GPT-5 43 42 40 40 38 40.6%
Z.AI GLM 4.7 Flash 47 40 38 38 38 40.2%
Gemma 4 26B (Reasoning) 43 42 41 37 36 39.8%
GPT-5 Mini 42 41 40 39 37 39.6%
Qwen 3.6 27B 54 43 39 35 26 39.4%
ByteDance Seed 1.6 50 38 38 37 33 39.3%
Gemma 4 31B 42 41 41 35 34 38.8%
Qwen 3.5 27B 46 44 37 34 32 38.6%
Gemma 4 31B (Reasoning) 42 39 38 37 36 38.3%
Gemini 3.1 Flash Lite 44 44 36 33 32 37.7%
Qwen 3.5 122B 41 40 38 34 32 37.0%
Qwen3.6 Max Preview 46 45 33 32 25 36.3%
Gemini 3.1 Flash Lite (Reasoning) 51 36 34 29 29 35.8%
Nemotron 3 Nano 43 42 38 29 25 35.5%
Qwen 3.5 Plus (2026-04-20) 48 42 38 25 25 35.5%
Qwen 3.5 397B A17B 40 39 35 34 27 35.0%
Qwen 3.5 35B 40 37 35 34 29 34.9%
Qwen 3.5 Flash 42 38 32 32 30 34.7%
Gemini 3.1 Flash Lite (Preview) 48 38 29 25 25 33.1%
ByteDance Seed 2.0 Lite 38 37 36 29 25 33.0%
Qwen 3.5 9B 37 36 28 27 25 30.6%
Inception Mercury 31 29 25 25 25 27.1%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Grok 4.1 Fast 100 100 100 99 92 98.2%
GPT-4o, Aug. 6th (temp=1) 100 100 88 80 73 88.3%
Llama 3.1 Nemotron 70B 100 100 92 75 69 87.1%
Claude Opus 4 96 88 76 74 74 81.6%
Rocinante 12B 97 97 86 62 60 80.7%
Llama 3.1 8B 100 97 93 60 52 80.4%
Claude 3.5 Sonnet 91 91 74 72 64 78.5%
Claude Sonnet 4.5 83 82 78 74 71 77.6%
GPT-4o Mini (temp=1) 86 74 74 72 63 73.8%
Hermes 3 70B 92 80 74 72 48 73.1%
Claude Sonnet 4 84 78 78 62 60 72.6%
GPT-4.1 Nano 84 78 75 69 56 72.3%
Grok 4.20 (Reasoning) 77 75 73 71 63 71.8%
Claude 3.7 Sonnet 86 72 70 67 62 71.4%
Grok 4 83 83 79 57 54 71.1%
Hermes 3 405B 93 82 66 64 46 70.4%
Claude 3 Haiku 78 71 69 68 64 69.8%
Z.AI GLM 4.5 77 76 69 66 61 69.7%
Claude Opus 4.7 84 77 68 65 51 69.0%
DeepSeek V3 (2025-03-24) 76 71 70 66 62 68.8%
DeepSeek V4 Pro 75 73 71 69 54 68.6%
GPT-4.1 Mini 76 70 68 65 62 68.2%
GPT-4o, May 13th (temp=1) 76 72 69 68 56 68.1%
Z.AI GLM 5 72 71 68 66 63 68.0%
Z.AI GLM 5.1 80 73 69 65 48 66.9%
Claude Sonnet 4.6 77 70 68 61 57 66.6%
Grok 4.20 (Beta, Reasoning) 80 71 65 57 56 65.8%
Grok 4 Fast 80 72 64 58 53 65.4%
Claude Sonnet 4.6 (Reasoning) 80 76 61 60 47 64.8%
Gemma 3 27B 69 67 63 61 60 64.0%
Claude Opus 4.7 (Reasoning) 81 68 58 56 55 63.7%
Llama 3.1 70B 76 74 65 59 42 63.4%
Gemini 3.5 Flash (Reasoning, Minimal) 79 62 61 59 52 62.5%
Gemma 3 12B 89 69 54 54 47 62.5%
Cohere Command R+ (Aug. 2024) 77 68 66 52 49 62.5%
MoonshotAI: Kimi K2.5 83 79 58 53 39 62.3%
Z.AI GLM 5 Turbo 72 66 60 57 56 62.0%
Claude Opus 4.5 65 63 62 61 58 61.9%
Qwen3 235B A22B Instruct 2507 69 63 62 60 56 61.9%
Arcee AI: Trinity Large (Preview) 66 66 66 57 54 61.7%
GPT-4.1 73 66 60 55 49 60.9%
Writer: Palmyra X5 72 69 56 55 51 60.8%
MiniMax M2.5 81 58 58 55 51 60.7%
Grok 4.20 69 63 63 55 53 60.6%
GPT-5.4 68 64 59 56 54 60.2%
MoonshotAI: Kimi K2.6 71 61 60 58 48 59.7%
Gemini 2.5 Flash 82 59 56 55 46 59.4%
Z.AI GLM 4.5 Air 65 60 59 56 56 59.3%
Qwen3.6 Max Preview 67 60 58 55 55 59.0%
LFM2 24B 73 59 55 55 51 58.7%
DeepSeek-V2 Chat 74 64 61 55 39 58.4%
Grok 4.20 (Beta) 76 57 56 52 49 58.0%
DeepSeek V4 Pro (Reasoning) 69 60 56 56 49 58.0%
Mistral Small 4 72 56 54 54 53 57.7%
Claude Opus 4.6 (Reasoning) 66 60 57 54 51 57.6%
Claude Haiku 4.5 63 59 57 54 53 57.1%
GPT-5.1 59 58 58 58 52 57.1%
GPT-5.4 (Reasoning, Low) 63 58 56 54 52 56.6%
Gemini 3.1 Pro (Preview) 67 60 53 50 50 55.9%
Claude Opus 4.6 68 66 52 48 44 55.6%
ByteDance Seed 2.0 Mini 62 61 59 52 44 55.5%
Grok 4.3 65 57 55 54 47 55.4%
GPT-5.4 (Reasoning) 66 58 51 51 50 55.2%
Qwen 3.6 27B 82 66 48 42 37 54.7%
o4 Mini 65 60 55 51 42 54.7%
ByteDance Seed 1.6 Flash 86 53 49 44 41 54.7%
Qwen 3.5 Plus (2026-02-15) 59 56 54 52 51 54.4%
Ministral 3 14B 63 62 58 44 42 53.9%
Ministral 3B 61 59 54 49 46 53.8%
DeepSeek V3.1 74 56 52 46 41 53.6%
MiniMax M2.7 68 56 51 51 42 53.5%
DeepSeek V4 Flash (Reasoning) 65 58 53 50 39 53.2%
Gemini 2.5 Flash Lite 67 56 50 48 43 53.1%
DeepSeek V3.2 65 53 51 50 45 52.8%
Grok 4.3 (Reasoning) 68 56 53 51 32 52.1%
GPT-5.4 Mini 61 55 49 47 46 51.8%
Qwen 3 32B 64 52 50 47 46 51.7%
Gemini 2.5 Flash Lite (Reasoning) 61 58 54 44 41 51.5%
o4 Mini High 55 55 52 50 45 51.5%
ByteDance Seed 2.0 Lite 86 50 43 40 37 51.3%
Mistral Medium 3.1 66 58 48 42 42 51.2%
GPT-5.5 (Reasoning) 56 51 51 49 49 51.1%
Arcee AI: Trinity Mini 69 50 50 44 41 50.8%
GPT-5.4 Mini (Reasoning) 56 54 53 49 42 50.8%
Gemma 3 4B 56 54 49 48 47 50.7%
Mistral Large 61 52 49 46 45 50.7%
Gemini 3.1 Flash Lite 62 56 48 45 42 50.7%
GPT-5.4 Mini (Reasoning, Low) 56 54 49 47 46 50.4%
Gemini 3.1 Flash Lite (Reasoning) 58 58 56 50 30 50.3%
DeepSeek V4 Flash 62 58 53 43 34 50.1%
GPT-4o Mini (temp=0) 54 54 50 46 45 49.9%
Xiaomi MIMO v2.5 Pro 57 53 52 45 41 49.7%
WizardLM 2 8x22b 60 57 52 42 37 49.6%
Gemini 3.5 Flash (Reasoning) 66 53 47 44 38 49.5%
Z.AI GLM 4.7 Flash 57 55 53 45 37 49.4%
Gemini 2.5 Flash (Reasoning) 64 50 45 44 40 48.7%
Qwen 3.6 35B 68 61 42 39 33 48.5%
GPT-5.5 52 49 48 47 47 48.5%
DeepSeek V3 (2024-12-26) 71 46 44 41 40 48.5%
Gemini 3.1 Flash Lite (Preview) 65 53 51 36 36 48.4%
Mistral Large 3 53 52 47 46 44 48.3%
Aion 2.0 56 54 48 43 40 48.2%
Qwen 3.5 Plus (2026-04-20) 70 58 42 41 30 48.0%
Mistral Small 4 (Reasoning) 54 54 54 41 37 47.8%
Qwen 3.6 Flash 58 48 46 45 38 46.8%
GPT-5.5 (Reasoning, Low) 51 51 44 44 44 46.8%
Gemini 3 Pro (Preview) 50 50 48 44 41 46.6%
Z.AI GLM 4.6 71 49 39 37 33 45.7%
Ministral 8B 52 48 45 43 40 45.7%
Qwen3.7 Max 55 52 41 40 39 45.5%
Nemotron 3 Super 48 48 45 43 42 45.2%
GPT-4o, Aug. 6th (temp=0) 54 45 45 42 39 45.1%
GPT-5 46 46 45 44 41 44.4%
Xiaomi MIMO v2.5 52 49 43 42 35 44.3%
ByteDance Seed 1.6 70 45 36 35 35 44.1%
Ministral 3 3B 46 46 45 41 41 43.6%
Mistral Large 2 45 44 44 43 42 43.5%
Stealth: Hunter Alpha 49 49 45 41 33 43.5%
Mistral Small Creative 47 46 42 42 38 43.2%
Qwen 2.5 72B 45 44 43 42 40 42.7%
GPT-5.2 45 44 43 41 39 42.5%
Stealth: Aurora Alpha 46 44 44 39 37 42.1%
Gemini 2.5 Pro 49 43 40 39 38 42.0%
Mistral NeMO 47 43 42 41 35 41.5%
Qwen 3.5 35B 49 45 42 38 33 41.5%
GPT-5.4 Nano (Reasoning, Low) 44 42 42 41 37 41.3%
GPT-4o, May 13th (temp=0) 56 45 38 34 33 41.2%
Stealth: Healer Alpha 50 44 42 39 30 41.1%
Qwen 3.5 Flash 44 43 43 40 35 40.8%
GPT-OSS 120B 48 40 40 38 36 40.5%
Inception Mercury 2 44 43 40 38 35 40.1%
GPT-5.4 Nano 43 42 40 39 36 39.9%
GPT-5.4 Nano (Reasoning) 42 40 40 40 38 39.7%
Nemotron 3 Nano 44 41 39 38 36 39.6%
Qwen 3.5 27B 46 42 40 36 32 39.2%
Qwen 3.5 9B 44 44 42 40 25 39.0%
Qwen 3.5 397B A17B 51 39 39 34 31 38.6%
Gemini 3 Flash (Preview) 41 41 37 36 36 38.4%
Ministral 3 8B 51 40 38 32 29 38.1%
Gemini 3 Flash (Preview, Reasoning) 41 40 39 36 34 38.0%
Z.AI GLM 4.7 41 40 38 37 34 38.0%
GPT-5 Mini 44 42 39 36 28 37.8%
Gemma 4 26B 41 39 39 36 31 37.1%
Gemma 4 31B 41 40 37 34 33 37.0%
Gemma 4 31B (Reasoning) 40 37 36 33 33 35.9%
Qwen 3.5 122B 37 37 36 35 34 35.9%
GPT-5 Nano 36 35 35 34 33 34.5%
Gemma 4 26B (Reasoning) 34 34 33 33 33 33.3%
Mistral Small 3.2 24B 42 39 29 25 25 32.0%
Inception Mercury 32 25 25 25 25 26.4%