Name drop frequency

Test: Bad Writing Habits

Avg. Score: 66.6%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | GPT-4.1 Nano | 94.8% | $0.0007 | 13.3s | 81%
2 | Gemini 3.1 Flash Lite (Preview) | 90.2% | $0.0030 | 8.4s | 66%
3 | Gemma 3 4B | 88.9% | $0.0002 | 20.0s | 65%
4 | Gemini 2.5 Pro | 92.7% | $0.036 | 36.2s | 69%
5 | Stealth: Healer Alpha | 86.1% | $0.0000 | 23.7s | 60%
6 | Grok 4 Fast | 86.6% | $0.0017 | 24.1s | 60%
7 | Z.AI GLM 4.6 | 89.7% | $0.0065 | 51.5s | 59%
8 | DeepSeek V3.1 | 90.6% | $0.0020 | 1.8m | 69%
9 | Gemini 2.5 Flash Lite (Reasoning) | 84.6% | $0.0028 | 30.8s | 56%
10 | Stealth: Hunter Alpha | 85.1% | $0.0000 | 55.0s | 55%
11 | GPT-5 Nano | 85.9% | $0.0042 | 1.4m | 63%
12 | Grok 4.20 (Beta, Reasoning) | 86.1% | $0.039 | 34.0s | 59%
13 | Stealth: Aurora Alpha | 77.7% | $0.0000 | 9.8s | 48%
14 | Inception Mercury 2 | 76.8% | $0.0032 | 7.0s | 48%
15 | GPT-4o Mini (temp=1) | 80.4% | $0.0012 | 34.8s | 48%
16 | Nemotron 3 Nano | 83.3% | $0.0010 | 1.1m | 52%
17 | Gemini 2.5 Flash | 80.2% | $0.0052 | 10.6s | 42%
18 | Gemini 2.5 Flash Lite | 78.5% | $0.0009 | 9.5s | 41%
19 | Grok 4.20 (Beta) | 81.8% | $0.018 | 15.8s | 45%
20 | GPT-5 Mini | 82.2% | $0.0100 | 57.4s | 47%
21 | Nemotron 3 Super | 82.5% | $0.0000 | 1.4m | 50%
22 | DeepSeek V3.2 | 83.5% | $0.0014 | 1.9m | 54%
23 | ByteDance Seed 2.0 Lite | 86.1% | $0.012 | 2.2m | 59%
24 | Gemini 2.5 Flash (Reasoning) | 79.5% | $0.011 | 21.5s | 41%
25 | Arcee AI: Trinity Mini | 76.0% | $0.0003 | 9.2s | 36%
26 | Aion 2.0 | 81.1% | $0.0064 | 1.3m | 45%
27 | ByteDance Seed 1.6 | 84.3% | $0.013 | 2.5m | 53%
28 | Gemma 3 27B | 76.7% | $0.0006 | 52.6s | 35%
29 | Claude Sonnet 4.6 | 81.4% | $0.031 | 39.3s | 37%
30 | Z.AI GLM 5 Turbo | 75.6% | $0.0081 | 33.2s | 32%
31 | Inception Mercury | 72.7% | $0.011 | 17.6s | 33%
32 | Gemini 3 Flash (Preview, Reasoning) | 70.9% | $0.012 | 30.1s | 37%
33 | GPT-4o, Aug. 6th (temp=1) | 72.7% | $0.018 | 24.4s | 36%
34 | GPT-4.1 Mini | 71.9% | $0.0027 | 19.0s | 30%
35 | Claude 3.5 Haiku | 69.1% | $0.0035 | 10.8s | 32%
36 | Gemini 3 Flash (Preview) | 68.4% | $0.0078 | 19.6s | 36%
37 | Z.AI GLM 5 | 77.1% | $0.0084 | 1.2m | 37%
38 | Grok 4.1 Fast | 71.8% | $0.0018 | 37.8s | 32%
39 | MiniMax M2.5 | 75.3% | $0.0034 | 1.3m | 35%
40 | GPT-4.1 | 71.7% | $0.018 | 44.7s | 36%
41 | MiniMax M2.7 | 72.5% | $0.0040 | 1.1m | 33%
42 | Mistral Medium 3.1 | 67.9% | $0.0048 | 36.5s | 31%
43 | Z.AI GLM 4.7 Flash | 71.5% | $0.0017 | 1.2m | 33%
44 | Claude Sonnet 4.6 (Reasoning) | 81.8% | $0.060 | 1.2m | 42%
45 | Qwen 3 32B | 67.6% | $0.0015 | 54.6s | 32%
46 | LFM2 24B | 66.3% | $0.0002 | 28.4s | 26%
47 | o4 Mini High | 73.6% | $0.025 | 47.2s | 31%
48 | Mistral Small Creative | 62.1% | $0.0007 | 9.1s | 26%
49 | Claude Haiku 4.5 | 67.0% | $0.011 | 21.6s | 27%
50 | Grok 4 | 80.3% | $0.048 | 1.7m | 43%
51 | Gemma 3 12B | 68.7% | $0.0004 | 41.3s | 25%
52 | Qwen 3.5 Plus (2026-02-15) | 64.3% | $0.0060 | 31.5s | 28%
53 | Ministral 3 14B | 59.1% | $0.0007 | 11.7s | 27%
54 | Claude Sonnet 4 | 69.2% | $0.032 | 43.7s | 34%
55 | o4 Mini | 67.0% | $0.015 | 25.7s | 26%
56 | ByteDance Seed 2.0 Mini | 88.4% | $0.0045 | 4.9m | 58%
57 | Mistral Large 3 | 61.2% | $0.0033 | 30.3s | 28%
58 | Qwen3 235B A22B Instruct 2507 | 70.8% | $0.0011 | 59.2s | 23%
59 | Ministral 3 8B | 59.7% | $0.0008 | 19.6s | 26%
60 | Mistral Small 4 | 58.8% | $0.0014 | 18.2s | 27%
61 | DeepSeek V3 (2024-12-26) | 64.7% | $0.0021 | 54.6s | 26%
62 | Writer: Palmyra X5 | 64.7% | $0.011 | 22.0s | 21%
63 | Ministral 8B | 57.3% | $0.0004 | 10.4s | 22%
64 | Z.AI GLM 4.7 | 69.1% | $0.010 | 1.4m | 29%
65 | GPT-4o Mini (temp=0) | 61.3% | $0.0012 | 34.8s | 22%
66 | Llama 3.1 8B | 67.8% | $0.0003 | 1.3m | 24%
67 | WizardLM 2 8x22b | 69.3% | $0.0026 | 1.8m | 30%
68 | ByteDance Seed 1.6 Flash | 57.8% | $0.0013 | 27.3s | 24%
69 | Claude Opus 4.6 (Reasoning) | 79.4% | $0.088 | 1.4m | 42%
70 | Rocinante 12B | 59.4% | $0.0014 | 38.4s | 24%
71 | Cohere Command R+ (Aug. 2024) | 64.2% | $0.020 | 52.5s | 28%
72 | DeepSeek-V2 Chat | 61.4% | $0.0021 | 53.3s | 25%
73 | Mistral NeMO | 54.7% | $0.0005 | 10.1s | 21%
74 | Ministral 3B | 53.1% | $0.0001 | 8.1s | 22%
75 | Claude Opus 4.6 | 77.6% | $0.078 | 1.2m | 37%
76 | GPT-5.4 Mini (Reasoning) | 61.2% | $0.022 | 28.1s | 24%
77 | Mistral Large 2 | 57.5% | $0.013 | 29.4s | 25%
78 | Mistral Small 4 (Reasoning) | 56.5% | $0.0022 | 30.2s | 22%
79 | Mistral Large | 57.1% | $0.014 | 30.9s | 25%
80 | GPT-5.4 Mini | 54.6% | $0.015 | 16.8s | 25%
81 | Hermes 3 70B | 59.7% | $0.0010 | 1.2m | 25%
82 | DeepSeek V3 (2025-03-24) | 55.6% | $0.0014 | 39.4s | 22%
83 | Llama 3.1 Nemotron 70B | 53.2% | $0.0038 | 31.7s | 24%
84 | GPT-5 | 80.8% | $0.065 | 2.8m | 45%
85 | Claude Sonnet 4.5 | 62.8% | $0.035 | 38.1s | 23%
86 | GPT-4o, Aug. 6th (temp=0) | 55.7% | $0.023 | 22.7s | 22%
87 | Arcee AI: Trinity Large (Preview) | 55.2% | $0.0000 | 43.6s | 19%
88 | Claude Opus 4.5 | 67.7% | $0.070 | 53.4s | 32%
89 | Ministral 3 3B | 47.8% | $0.0005 | 11.1s | 19%
90 | Qwen 3.5 Flash | 55.3% | $0.0025 | 47.5s | 19%
91 | MoonshotAI: Kimi K2.5 | 73.1% | $0.019 | 3.2m | 36%
92 | GPT-5.4 Mini (Reasoning, Low) | 52.2% | $0.015 | 16.8s | 18%
93 | Llama 3.1 70B | 47.7% | $0.0015 | 29.4s | 19%
94 | GPT-4o, May 13th (temp=1) | 54.0% | $0.033 | 14.4s | 18%
95 | GPT-5.4 | 63.0% | $0.049 | 1.4m | 28%
96 | GPT-5.1 | 71.7% | $0.054 | 1.8m | 23%
97 | Gemini 3 Pro (Preview) | 61.3% | $0.055 | 54.4s | 22%
98 | Hermes 3 405B | 50.4% | $0.0032 | 53.2s | 15%
99 | Qwen 2.5 72B | 40.8% | $0.0010 | 36.7s | 21%
100 | Claude 3.5 Sonnet | 55.0% | $0.048 | 35.5s | 21%
101 | GPT-5.4 (Reasoning, Low) | 61.4% | $0.055 | 1.4m | 22%
102 | Z.AI GLM 4.5 | 42.0% | $0.0051 | 42.1s | 14%
103 | GPT-5.4 Nano (Reasoning) | 39.1% | $0.0061 | 24.5s | 12%
104 | GPT-5.4 Nano (Reasoning, Low) | 36.5% | $0.0055 | 20.6s | 13%
105 | Qwen 3.5 35B | 49.7% | $0.018 | 1.0m | 11%
106 | Qwen 3.5 122B | 51.5% | $0.025 | 1.1m | 12%
107 | Claude 3 Haiku | 30.6% | $0.0025 | 14.9s | 16%
108 | GPT-5.4 Nano | 36.9% | $0.0057 | 26.3s | 12%
109 | Qwen 3.5 397B A17B | 57.9% | $0.014 | 3.0m | 22%
110 | GPT-4o, May 13th (temp=0) | 42.3% | $0.035 | 14.1s | 9%
111 | Qwen 3.5 9B | 42.1% | $0.0011 | 1.4m | 7%
112 | Gemini 3.1 Pro (Preview) | 64.3% | $0.107 | 1.8m | 25%
113 | GPT-5.4 (Reasoning) | 64.8% | $0.089 | 2.6m | 26%
114 | Claude 3.7 Sonnet | 37.5% | $0.042 | 46.7s | 14%
115 | Qwen 3.5 27B | 44.2% | $0.020 | 1.6m | 8%
116 | GPT-5.2 | 28.0% | $0.056 | 1.5m | 6%
117 | Mistral Small 3.2 24B | 52.8% | $0.0069 | 5.7m | 15%
118 | Claude Opus 4 | 56.3% | $0.209 | 1.4m | 22%

Individual Scenarios

Detailed Writing Rules

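Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-4.1 Nano | 100 | 100 | 100 | 98 | 94 | 98.3%
Inception Mercury | 100 | 100 | 100 | 91 | 88 | 95.9%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 95 | 83 | 95.6%
Gemma 3 4B | 100 | 100 | 100 | 92 | 83 | 95.2%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 2.0 Lite | 100 | 100 | 94 | 83 | 81 | 91.7%
Z.AI GLM 4.7 | 100 | 100 | 97 | 83 | 69 | 89.8%
Grok 4 Fast | 99 | 91 | 89 | 86 | 83 | 89.7%
Gemini 2.5 Flash (Reasoning) | 100 | 90 | 88 | 88 | 70 | 87.1%
Claude Sonnet 4.6 | 100 | 86 | 83 | 83 | 83 | 87.1%
Gemini 3.1 Pro (Preview) | 100 | 100 | 83 | 83 | 67 | 86.7%
Llama 3.1 8B | 100 | 100 | 83 | 80 | 66 | 85.8%
DeepSeek V3.2 | 100 | 94 | 83 | 83 | 67 | 85.5%
GPT-4.1 Mini | 100 | 100 | 80 | 78 | 67 | 85.0%
Grok 4 | 92 | 91 | 85 | 83 | 68 | 83.8%
GPT-5 | 100 | 100 | 83 | 83 | 50 | 83.3%
GPT-5 Nano | 100 | 100 | 83 | 67 | 67 | 83.3%
Stealth: Aurora Alpha | 100 | 97 | 83 | 67 | 67 | 82.7%
Inception Mercury 2 | 89 | 88 | 87 | 82 | 67 | 82.6%
Gemma 3 12B | 100 | 98 | 98 | 67 | 50 | 82.5%
Stealth: Hunter Alpha | 91 | 87 | 85 | 82 | 67 | 82.4%
Nemotron 3 Super | 95 | 88 | 77 | 76 | 69 | 81.0%
Claude 3.5 Haiku | 100 | 94 | 89 | 64 | 49 | 79.1%
GPT-5.1 | 100 | 83 | 83 | 78 | 50 | 78.9%
Gemini 2.5 Pro | 100 | 100 | 91 | 67 | 33 | 78.3%
Gemini 3 Pro (Preview) | 100 | 100 | 83 | 63 | 43 | 77.8%
ByteDance Seed 1.6 | 83 | 82 | 82 | 71 | 70 | 77.6%
Grok 4.20 (Beta) | 96 | 89 | 79 | 67 | 50 | 76.1%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 68 | 61 | 51 | 76.1%
DeepSeek V3.1 | 100 | 84 | 78 | 67 | 50 | 75.8%
Ministral 3 3B | 93 | 87 | 79 | 69 | 50 | 75.7%
Z.AI GLM 4.6 | 96 | 94 | 83 | 67 | 33 | 74.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 85 | 83 | 50 | 50 | 73.6%
GPT-5 Mini | 100 | 98 | 67 | 50 | 50 | 73.0%
Nemotron 3 Nano | 99 | 83 | 82 | 67 | 33 | 72.8%
GPT-5.4 Mini | 83 | 78 | 67 | 67 | 65 | 72.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 67 | 56 | 36 | 71.7%
Grok 4.1 Fast | 82 | 75 | 67 | 67 | 67 | 71.6%
Claude Opus 4.6 | 87 | 82 | 76 | 62 | 50 | 71.4%
Z.AI GLM 5 | 100 | 100 | 62 | 61 | 33 | 71.2%
Stealth: Healer Alpha | 83 | 71 | 68 | 67 | 66 | 71.0%
Gemini 3 Flash (Preview) | 83 | 83 | 71 | 67 | 50 | 70.8%
Gemini 3 Flash (Preview, Reasoning) | 87 | 79 | 68 | 67 | 50 | 70.2%
Claude Opus 4.6 (Reasoning) | 100 | 83 | 67 | 67 | 33 | 70.0%
Ministral 3B | 94 | 93 | 67 | 48 | 46 | 69.5%
Gemini 2.5 Flash Lite | 84 | 82 | 67 | 64 | 50 | 69.4%
GPT-5.4 (Reasoning) | 83 | 83 | 66 | 64 | 50 | 69.3%
GPT-4o Mini (temp=1) | 84 | 83 | 67 | 62 | 50 | 69.0%
GPT-5.4 | 84 | 67 | 67 | 67 | 58 | 68.4%
Gemini 2.5 Flash | 83 | 81 | 76 | 50 | 50 | 68.2%
Gemma 3 27B | 100 | 79 | 67 | 61 | 33 | 68.0%
Claude Haiku 4.5 | 82 | 67 | 65 | 64 | 59 | 67.2%
Llama 3.1 Nemotron 70B | 100 | 67 | 67 | 61 | 39 | 66.7%
Z.AI GLM 5 Turbo | 91 | 67 | 67 | 58 | 50 | 66.4%
Grok 4.20 (Beta, Reasoning) | 79 | 72 | 65 | 62 | 50 | 65.7%
MiniMax M2.7 | 93 | 67 | 60 | 58 | 50 | 65.6%
Hermes 3 70B | 88 | 78 | 50 | 50 | 49 | 63.1%
Ministral 3 8B | 90 | 68 | 65 | 50 | 39 | 62.5%
Z.AI GLM 4.7 Flash | 94 | 83 | 58 | 41 | 33 | 61.9%
Aion 2.0 | 91 | 83 | 67 | 33 | 33 | 61.5%
Mistral Large | 83 | 75 | 50 | 50 | 47 | 61.0%
Mistral Small 3.2 24B | 89 | 81 | 69 | 67 | 0 | 61.0%
Ministral 8B | 100 | 67 | 56 | 48 | 33 | 60.8%
Qwen 3.5 Plus (2026-02-15) | 100 | 78 | 66 | 59 | 0 | 60.6%
MiniMax M2.5 | 83 | 77 | 50 | 48 | 42 | 59.9%
LFM2 24B | 76 | 70 | 62 | 49 | 41 | 59.9%
GPT-5.4 Mini (Reasoning) | 79 | 64 | 64 | 50 | 42 | 59.7%
ByteDance Seed 1.6 Flash | 67 | 61 | 60 | 56 | 55 | 59.7%
Qwen3 235B A22B Instruct 2507 | 83 | 76 | 67 | 50 | 16 | 58.5%
MoonshotAI: Kimi K2.5 | 72 | 68 | 64 | 54 | 33 | 58.2%
o4 Mini | 63 | 60 | 57 | 55 | 52 | 57.5%
WizardLM 2 8x22b | 100 | 67 | 58 | 50 | 0 | 55.0%
GPT-5.4 Mini (Reasoning, Low) | 89 | 75 | 60 | 33 | 17 | 54.9%
Claude Sonnet 4 | 67 | 58 | 56 | 50 | 42 | 54.5%
Mistral Large 2 | 67 | 60 | 56 | 55 | 34 | 54.5%
Hermes 3 405B | 93 | 90 | 68 | 19 | 0 | 54.1%
Llama 3.1 70B | 82 | 71 | 62 | 50 | 0 | 53.2%
Mistral Large 3 | 67 | 55 | 50 | 49 | 45 | 53.2%
Arcee AI: Trinity Mini | 72 | 67 | 63 | 37 | 27 | 53.1%
DeepSeek-V2 Chat | 68 | 53 | 51 | 50 | 39 | 52.3%
Mistral Medium 3.1 | 66 | 61 | 60 | 42 | 27 | 51.3%
Mistral NeMO | 82 | 67 | 59 | 44 | 0 | 50.3%
Claude Opus 4 | 64 | 54 | 50 | 50 | 33 | 50.2%
Rocinante 12B | 94 | 67 | 50 | 33 | 0 | 48.8%
Cohere Command R+ (Aug. 2024) | 100 | 50 | 48 | 46 | 0 | 48.8%
Claude Sonnet 4.5 | 70 | 68 | 44 | 44 | 14 | 47.9%
GPT-4o, Aug. 6th (temp=1) | 67 | 59 | 49 | 45 | 18 | 47.4%
Mistral Small Creative | 67 | 66 | 38 | 33 | 31 | 47.0%
Qwen 2.5 72B | 63 | 55 | 50 | 43 | 19 | 46.0%
DeepSeek V3 (2025-03-24) | 63 | 56 | 42 | 38 | 29 | 45.6%
Writer: Palmyra X5 | 67 | 59 | 53 | 50 | 0 | 45.6%
Mistral Small 4 | 81 | 63 | 50 | 33 | 0 | 45.4%
Ministral 3 14B | 67 | 64 | 49 | 27 | 17 | 44.9%
GPT-4.1 | 54 | 50 | 50 | 35 | 33 | 44.4%
o4 Mini High | 83 | 51 | 48 | 21 | 17 | 44.1%
Qwen 3.5 35B | 67 | 55 | 50 | 44 | 0 | 43.0%
GPT-5.4 (Reasoning, Low) | 67 | 64 | 50 | 33 | 0 | 42.8%
Claude Opus 4.5 | 67 | 50 | 44 | 33 | 17 | 42.4%
Qwen 3 32B | 60 | 50 | 50 | 33 | 14 | 41.3%
Qwen 3.5 Flash | 67 | 60 | 42 | 31 | 0 | 40.0%
DeepSeek V3 (2024-12-26) | 65 | 54 | 42 | 22 | 16 | 39.6%
Mistral Small 4 (Reasoning) | 62 | 50 | 45 | 33 | 0 | 38.0%
GPT-4o Mini (temp=0) | 67 | 56 | 33 | 33 | 0 | 37.9%
Claude 3.5 Sonnet | 45 | 41 | 40 | 32 | 19 | 35.4%
GPT-5.4 Nano (Reasoning, Low) | 47 | 42 | 33 | 33 | 17 | 34.4%
Qwen 3.5 397B A17B | 60 | 47 | 45 | 17 | 0 | 33.7%
GPT-4o, May 13th (temp=0) | 72 | 42 | 24 | 17 | 13 | 33.5%
GPT-5.2 | 83 | 50 | 33 | 0 | 0 | 33.3%
GPT-4o, May 13th (temp=1) | 60 | 50 | 35 | 17 | 0 | 32.3%
GPT-4o, Aug. 6th (temp=0) | 59 | 35 | 34 | 33 | 0 | 32.2%
GPT-5.4 Nano (Reasoning) | 46 | 43 | 33 | 33 | 0 | 31.1%
GPT-5.4 Nano | 42 | 33 | 29 | 22 | 17 | 28.6%
Z.AI GLM 4.5 | 50 | 33 | 29 | 28 | 0 | 28.1%
Qwen 3.5 27B | 97 | 33 | 0 | 0 | 0 | 26.1%
Claude 3 Haiku | 41 | 33 | 29 | 0 | 0 | 20.6%
Claude 3.7 Sonnet | 33 | 20 | 17 | 0 | 0 | 14.0%
Qwen 3.5 122B | 28 | 0 | 0 | 0 | 0 | 5.6%
Qwen 3.5 9B | 0 | 0 | 0 | 0 | 0 | 0.0%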
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 99 | 99.9%
Aion 2.0 | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 99 | 99.8%
LFM2 24B | 100 | 100 | 100 | 100 | 99 | 99.8%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 96 | 99.2%
Z.AI GLM 5 | 100 | 100 | 100 | 97 | 96 | 98.7%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 92 | 98.4%
Inception Mercury 2 | 100 | 100 | 99 | 98 | 91 | 97.6%
Grok 4 Fast | 100 | 100 | 100 | 96 | 89 | 97.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 83 | 96.7%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 83 | 96.7%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 83 | 96.7%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemma 3 4B | 100 | 100 | 100 | 100 | 83 | 96.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 83 | 96.7%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 81 | 96.3%
Grok 4.1 Fast | 100 | 100 | 100 | 97 | 84 | 96.1%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 98 | 97 | 83 | 95.8%
GPT-4.1 Mini | 100 | 100 | 100 | 92 | 83 | 95.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 90 | 83 | 94.7%
Qwen 3.5 122B | 100 | 100 | 100 | 86 | 83 | 93.8%
GPT-5 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
o4 Mini High | 100 | 100 | 100 | 83 | 83 | 93.3%
Claude Opus 4.5 | 100 | 100 | 100 | 83 | 83 | 93.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 83 | 83 | 93.3%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude 3.5 Haiku | 100 | 100 | 100 | 93 | 72 | 93.1%
Grok 4.20 (Beta) | 100 | 100 | 100 | 83 | 67 | 90.0%
Mistral Large 3 | 100 | 100 | 94 | 84 | 70 | 89.7%
Mistral Small Creative | 100 | 100 | 90 | 83 | 71 | 88.8%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 91 | 50 | 88.1%
Qwen 3.5 397B A17B | 100 | 96 | 89 | 86 | 69 | 88.0%
Gemini 3 Flash (Preview) | 100 | 90 | 83 | 83 | 78 | 86.9%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-5.4 Nano | 100 | 83 | 83 | 83 | 83 | 86.7%
Qwen 3.5 35B | 100 | 100 | 83 | 83 | 67 | 86.7%
DeepSeek V3 (2024-12-26) | 100 | 97 | 94 | 74 | 67 | 86.4%
Ministral 3 3B | 100 | 97 | 93 | 85 | 54 | 85.7%
Claude Haiku 4.5 | 100 | 100 | 98 | 95 | 33 | 85.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 83 | 76 | 67 | 85.2%
Claude Sonnet 4.5 | 100 | 100 | 83 | 75 | 67 | 85.1%
Qwen 3.5 Plus (2026-02-15) | 97 | 95 | 83 | 83 | 67 | 85.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 99 | 92 | 33 | 84.8%
Inception Mercury | 100 | 100 | 80 | 74 | 69 | 84.6%
GPT-4.1 | 100 | 100 | 89 | 83 | 50 | 84.5%
DeepSeek V3 (2025-03-24) | 100 | 100 | 98 | 67 | 56 | 84.2%
Stealth: Aurora Alpha | 100 | 99 | 83 | 83 | 54 | 84.0%
Mistral Large | 100 | 100 | 94 | 76 | 50 | 83.9%
Gemini 3 Flash (Preview, Reasoning) | 100 | 86 | 83 | 81 | 67 | 83.3%
Hermes 3 70B | 100 | 100 | 100 | 87 | 28 | 83.1%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 63 | 48 | 82.3%
o4 Mini | 100 | 100 | 71 | 70 | 61 | 80.5%
Claude Sonnet 4 | 100 | 98 | 93 | 62 | 49 | 80.2%
MoonshotAI: Kimi K2.5 | 100 | 93 | 83 | 79 | 46 | 80.2%
GPT-4o Mini (temp=0) | 100 | 100 | 83 | 67 | 50 | 80.0%
Mistral Medium 3.1 | 96 | 83 | 79 | 74 | 67 | 79.8%
DeepSeek-V2 Chat | 100 | 83 | 74 | 70 | 67 | 78.9%
Mistral Large 2 | 95 | 91 | 91 | 67 | 50 | 78.7%
Ministral 3 14B | 99 | 97 | 78 | 63 | 56 | 78.7%
Ministral 8B | 100 | 92 | 92 | 91 | 17 | 78.4%
GPT-5.4 Nano (Reasoning, Low) | 90 | 83 | 80 | 67 | 67 | 77.3%
GPT-5.2 | 100 | 86 | 83 | 83 | 33 | 77.2%
Arcee AI: Trinity Mini | 100 | 100 | 85 | 83 | 17 | 77.0%
Llama 3.1 8B | 100 | 83 | 83 | 60 | 58 | 76.9%
Mistral Small 4 | 100 | 100 | 83 | 48 | 48 | 75.8%
Qwen 3.5 Flash | 100 | 83 | 83 | 67 | 44 | 75.6%
Ministral 3 8B | 82 | 80 | 75 | 70 | 67 | 74.7%
GPT-5.4 Mini | 99 | 91 | 84 | 67 | 33 | 74.7%
ByteDance Seed 1.6 Flash | 83 | 83 | 76 | 67 | 62 | 74.3%
Mistral Small 4 (Reasoning) | 100 | 83 | 83 | 50 | 50 | 73.2%
Mistral NeMO | 100 | 91 | 72 | 67 | 33 | 72.7%
Ministral 3B | 100 | 98 | 75 | 73 | 17 | 72.5%
Cohere Command R+ (Aug. 2024) | 100 | 83 | 71 | 54 | 50 | 71.8%
Mistral Small 3.2 24B | 100 | 95 | 83 | 42 | 37 | 71.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 84 | 83 | 56 | 17 | 68.1%
GPT-5.4 | 83 | 83 | 67 | 50 | 50 | 66.7%
Z.AI GLM 4.5 | 82 | 78 | 71 | 67 | 33 | 66.1%
GPT-4o, May 13th (temp=1) | 95 | 94 | 70 | 48 | 17 | 64.6%
Claude Opus 4 | 76 | 67 | 67 | 63 | 50 | 64.6%
Claude 3 Haiku | 72 | 68 | 67 | 64 | 51 | 64.3%
Llama 3.1 70B | 94 | 88 | 81 | 58 | 0 | 64.1%
Qwen 2.5 72B | 75 | 71 | 67 | 56 | 50 | 63.7%
Claude 3.5 Sonnet | 100 | 74 | 63 | 50 | 31 | 63.6%
Llama 3.1 Nemotron 70B | 68 | 68 | 67 | 59 | 46 | 61.6%
Claude 3.7 Sonnet | 83 | 81 | 81 | 31 | 17 | 58.8%
Qwen 3 32B | 95 | 83 | 67 | 33 | 0 | 55.6%
GPT-4o, May 13th (temp=0) | 100 | 90 | 36 | 33 | 17 | 55.2%
Qwen 3.5 27B | 67 | 67 | 63 | 50 | 28 | 54.8%
Hermes 3 405B | 80 | 67 | 48 | 39 | 22 | 51.3%
Qwen 3.5 9B | 100 | 62 | 58 | 33 | 0 | 50.7%
Rocinante 12B | 81 | 41 | 17 | 0 | 0 | 27.8%
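Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 83 | 96.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 93 | 82 | 95.1%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 83 | 83 | 93.3%
Gemma 3 4B | 100 | 100 | 100 | 83 | 83 | 93.3%
Stealth: Healer Alpha | 100 | 100 | 100 | 97 | 67 | 92.8%
Llama 3.1 8B | 100 | 100 | 100 | 83 | 67 | 90.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 83 | 83 | 72 | 87.7%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 32 | 86.5%
Mistral Large 2 | 100 | 100 | 96 | 83 | 50 | 85.9%
GPT-5 Mini | 100 | 100 | 100 | 67 | 50 | 83.3%
Claude Opus 4 | 100 | 83 | 83 | 75 | 71 | 82.5%
Gemma 3 27B | 100 | 100 | 83 | 67 | 62 | 82.4%
Mistral Large | 100 | 100 | 78 | 70 | 58 | 81.2%
Z.AI GLM 4.7 | 100 | 100 | 100 | 67 | 30 | 79.3%
Grok 4.20 (Beta) | 100 | 99 | 83 | 80 | 33 | 79.3%
Aion 2.0 | 100 | 100 | 100 | 93 | 0 | 78.6%
DeepSeek V3 (2024-12-26) | 100 | 96 | 89 | 64 | 42 | 78.3%
GPT-4.1 Mini | 100 | 94 | 69 | 66 | 59 | 77.6%
GPT-4o Mini (temp=1) | 100 | 100 | 80 | 67 | 33 | 76.0%
Z.AI GLM 5 | 100 | 100 | 100 | 80 | 0 | 76.0%
Gemini 2.5 Flash | 100 | 100 | 94 | 83 | 0 | 75.4%
Stealth: Hunter Alpha | 100 | 100 | 83 | 67 | 17 | 73.3%
Nemotron 3 Super | 100 | 82 | 77 | 67 | 28 | 70.7%
Claude 3.5 Sonnet | 100 | 100 | 96 | 41 | 17 | 70.6%
Hermes 3 70B | 100 | 83 | 80 | 48 | 39 | 70.2%
LFM2 24B | 100 | 99 | 83 | 67 | 0 | 69.6%
Gemini 2.5 Flash Lite | 100 | 100 | 83 | 46 | 17 | 69.1%
Qwen 3 32B | 83 | 83 | 67 | 57 | 50 | 68.1%
Stealth: Aurora Alpha | 91 | 67 | 67 | 61 | 50 | 67.1%
Mistral Medium 3.1 | 90 | 82 | 67 | 63 | 33 | 67.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 98 | 67 | 67 | 0 | 66.3%
Gemini 3 Flash (Preview, Reasoning) | 83 | 83 | 67 | 61 | 33 | 65.5%
Grok 4 | 100 | 78 | 75 | 71 | 0 | 64.9%
Rocinante 12B | 82 | 79 | 67 | 50 | 41 | 63.9%
Z.AI GLM 4.6 | 100 | 100 | 93 | 17 | 0 | 61.9%
Claude Opus 4.5 | 100 | 83 | 76 | 35 | 14 | 61.7%
GPT-4o, Aug. 6th (temp=1) | 98 | 91 | 75 | 41 | 0 | 61.1%
Grok 4 Fast | 83 | 83 | 73 | 46 | 16 | 60.2%
ByteDance Seed 2.0 Mini | 100 | 100 | 33 | 33 | 33 | 60.0%
ByteDance Seed 2.0 Lite | 83 | 67 | 50 | 50 | 50 | 60.0%
Nemotron 3 Nano | 83 | 67 | 67 | 50 | 33 | 60.0%
ByteDance Seed 1.6 | 100 | 83 | 65 | 33 | 17 | 59.7%
Inception Mercury 2 | 83 | 72 | 54 | 50 | 31 | 58.2%
GPT-5 | 83 | 83 | 67 | 50 | 0 | 56.7%
DeepSeek V3.2 | 100 | 100 | 50 | 17 | 17 | 56.7%
MiniMax M2.5 | 100 | 83 | 67 | 17 | 16 | 56.5%
Grok 4.20 (Beta, Reasoning) | 95 | 80 | 78 | 20 | 8 | 56.1%
Llama 3.1 Nemotron 70B | 92 | 67 | 58 | 33 | 29 | 55.6%
Inception Mercury | 83 | 67 | 64 | 45 | 18 | 55.6%
DeepSeek-V2 Chat | 100 | 63 | 57 | 57 | 0 | 55.4%
Z.AI GLM 4.7 Flash | 83 | 67 | 60 | 44 | 18 | 54.5%
Claude Opus 4.6 (Reasoning) | 99 | 83 | 78 | 0 | 0 | 52.1%
Ministral 3 8B | 83 | 66 | 50 | 33 | 26 | 51.7%
Gemini 3 Flash (Preview) | 67 | 50 | 50 | 50 | 33 | 50.0%
Cohere Command R+ (Aug. 2024) | 91 | 77 | 35 | 33 | 12 | 49.4%
Mistral Small 4 (Reasoning) | 83 | 67 | 50 | 42 | 0 | 48.5%
Claude Sonnet 4 | 100 | 100 | 33 | 0 | 0 | 46.7%
Arcee AI: Trinity Large (Preview) | 100 | 83 | 50 | 0 | 0 | 46.7%
GPT-4o, May 13th (temp=1) | 67 | 67 | 67 | 32 | 0 | 46.3%
MiniMax M2.7 | 94 | 83 | 36 | 17 | 0 | 46.0%
Grok 4.1 Fast | 98 | 74 | 50 | 0 | 0 | 44.3%
Hermes 3 405B | 100 | 55 | 33 | 32 | 0 | 44.0%
Gemini 3 Pro (Preview) | 83 | 67 | 33 | 33 | 0 | 43.3%
GPT-5.4 | 83 | 83 | 50 | 0 | 0 | 43.3%
MoonshotAI: Kimi K2.5 | 67 | 63 | 33 | 33 | 11 | 41.5%
Mistral Small 4 | 100 | 50 | 50 | 0 | 0 | 40.0%
Ministral 3 14B | 72 | 69 | 50 | 0 | 0 | 38.3%
GPT-4.1 | 100 | 83 | 0 | 0 | 0 | 36.7%
Qwen 3.5 Plus (2026-02-15) | 67 | 50 | 33 | 33 | 0 | 36.7%
WizardLM 2 8x22b | 100 | 83 | 0 | 0 | 0 | 36.7%
GPT-5.4 Mini | 67 | 50 | 50 | 17 | 0 | 36.7%
Qwen3 235B A22B Instruct 2507 | 100 | 81 | 0 | 0 | 0 | 36.1%
Gemini 3.1 Pro (Preview) | 67 | 50 | 25 | 17 | 17 | 35.0%
Qwen 3.5 9B | 86 | 67 | 17 | 0 | 0 | 33.8%
Z.AI GLM 5 Turbo | 83 | 83 | 0 | 0 | 0 | 33.3%
GPT-5.1 | 100 | 67 | 0 | 0 | 0 | 33.3%
o4 Mini | 100 | 65 | 0 | 0 | 0 | 33.0%
Mistral Large 3 | 67 | 67 | 17 | 14 | 0 | 32.8%
GPT-5.4 Nano (Reasoning, Low) | 50 | 45 | 33 | 33 | 0 | 32.3%
o4 Mini High | 100 | 58 | 0 | 0 | 0 | 31.5%
Writer: Palmyra X5 | 90 | 48 | 17 | 0 | 0 | 30.9%
DeepSeek V3 (2025-03-24) | 57 | 50 | 43 | 3 | 0 | 30.5%
Qwen 3.5 Flash | 50 | 50 | 17 | 17 | 17 | 30.0%
Claude 3 Haiku | 45 | 39 | 33 | 28 | 0 | 29.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 44 | 0 | 0 | 0 | 28.8%
Claude 3.7 Sonnet | 54 | 50 | 33 | 0 | 0 | 27.4%
Ministral 8B | 98 | 17 | 17 | 0 | 0 | 26.3%
Qwen 3.5 122B | 80 | 31 | 17 | 0 | 0 | 25.5%
Ministral 3B | 67 | 60 | 0 | 0 | 0 | 25.2%
Mistral NeMO | 35 | 33 | 33 | 17 | 0 | 23.6%
Mistral Small 3.2 24B | 57 | 33 | 22 | 6 | 0 | 23.5%
Claude Haiku 4.5 | 100 | 17 | 0 | 0 | 0 | 23.3%
Claude Opus 4.6 | 67 | 33 | 11 | 0 | 0 | 22.3%
GPT-5.4 (Reasoning, Low) | 67 | 17 | 17 | 0 | 0 | 20.0%
Llama 3.1 70B | 33 | 25 | 24 | 17 | 0 | 19.9%
Ministral 3 3B | 49 | 45 | 0 | 0 | 0 | 18.7%
Mistral Small Creative | 50 | 17 | 17 | 7 | 0 | 18.0%
ByteDance Seed 1.6 Flash | 83 | 6 | 0 | 0 | 0 | 17.9%
Gemini 2.5 Flash (Reasoning) | 33 | 17 | 17 | 17 | 0 | 16.7%
Claude 3.5 Haiku | 60 | 22 | 0 | 0 | 0 | 16.4%
Qwen 3.5 397B A17B | 33 | 31 | 17 | 0 | 0 | 16.3%
Claude Sonnet 4.5 | 67 | 13 | 0 | 0 | 0 | 15.9%
GPT-5.4 Mini (Reasoning) | 45 | 26 | 0 | 0 | 0 | 14.2%
Claude Sonnet 4.6 | 33 | 17 | 0 | 0 | 0 | 10.0%
Qwen 3.5 27B | 33 | 17 | 0 | 0 | 0 | 10.0%
GPT-4o, Aug. 6th (temp=0) | 33 | 17 | 0 | 0 | 0 | 10.0%
GPT-5.4 Nano (Reasoning) | 33 | 17 | 0 | 0 | 0 | 10.0%
Qwen 2.5 72B | 33 | 17 | 0 | 0 | 0 | 10.0%
GPT-5.4 Nano | 33 | 17 | 0 | 0 | 0 | 10.0%
GPT-4o, May 13th (temp=0) | 41 | 0 | 0 | 0 | 0 | 8.3%
Z.AI GLM 4.5 | 23 | 0 | 0 | 0 | 0 | 4.5%
GPT-5.4 Mini (Reasoning, Low) | 10 | 0 | 0 | 0 | 0 | 2.1%
GPT-5.4 (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 35B | 0 | 0 | 0 | 0 | 0 | 0.0%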
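Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Sonnet 4.6 | 100 | 100 | 100 | 86 | 80 | 93.1%
ByteDance Seed 2.0 Mini | 100 | 100 | 90 | 83 | 83 | 91.3%
Z.AI GLM 4.6 | 100 | 100 | 93 | 77 | 77 | 89.3%
Stealth: Hunter Alpha | 100 | 100 | 92 | 83 | 65 | 88.1%
Gemma 3 27B | 100 | 100 | 91 | 91 | 59 | 88.1%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 90 | 83 | 67 | 88.0%
Claude Opus 4.6 | 100 | 100 | 93 | 83 | 64 | 88.0%
GPT-4.1 Nano | 100 | 92 | 85 | 83 | 80 | 87.8%
ByteDance Seed 2.0 Lite | 100 | 91 | 84 | 83 | 80 | 87.7%
Gemma 3 4B | 100 | 100 | 100 | 83 | 50 | 86.7%
Gemini 3.1 Flash Lite (Preview) | 100 | 84 | 83 | 83 | 68 | 83.8%
Gemini 2.5 Pro | 86 | 85 | 83 | 83 | 79 | 83.6%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 83 | 67 | 65 | 82.9%
Stealth: Healer Alpha | 100 | 85 | 83 | 67 | 67 | 80.3%
GPT-5 Nano | 100 | 100 | 83 | 67 | 50 | 80.0%
Mistral Large 3 | 90 | 83 | 79 | 67 | 67 | 76.9%
ByteDance Seed 1.6 | 98 | 83 | 67 | 67 | 63 | 75.7%
Claude Opus 4.6 (Reasoning) | 100 | 83 | 77 | 67 | 50 | 75.5%
DeepSeek V3.1 | 91 | 86 | 83 | 55 | 48 | 72.6%
GPT-5 | 96 | 83 | 83 | 67 | 33 | 72.6%
Qwen3 235B A22B Instruct 2507 | 97 | 93 | 87 | 82 | 0 | 71.8%
GPT-4o Mini (temp=1) | 89 | 76 | 73 | 63 | 50 | 70.4%
GPT-5.4 (Reasoning, Low) | 83 | 83 | 67 | 67 | 50 | 70.0%
Gemini 3 Flash (Preview, Reasoning) | 95 | 74 | 67 | 65 | 50 | 70.0%
Gemini 2.5 Flash Lite | 83 | 79 | 79 | 73 | 34 | 69.7%
GPT-5.1 | 83 | 83 | 82 | 50 | 50 | 69.7%
Mistral Small 4 | 83 | 83 | 67 | 56 | 50 | 67.9%
GPT-4.1 | 85 | 72 | 67 | 66 | 50 | 67.8%
Z.AI GLM 5 | 97 | 67 | 62 | 61 | 50 | 67.2%
Mistral Small Creative | 83 | 83 | 79 | 50 | 37 | 66.7%
Claude Sonnet 4.6 (Reasoning) | 100 | 83 | 67 | 50 | 33 | 66.6%
Grok 4.20 (Beta, Reasoning) | 98 | 88 | 78 | 35 | 33 | 66.5%
DeepSeek V3.2 | 83 | 74 | 67 | 67 | 33 | 64.9%
Aion 2.0 | 83 | 77 | 67 | 62 | 33 | 64.5%
Gemini 2.5 Flash (Reasoning) | 83 | 80 | 67 | 50 | 33 | 62.5%
Mistral Large | 89 | 79 | 50 | 50 | 44 | 62.4%
Grok 4 Fast | 83 | 83 | 74 | 33 | 33 | 61.5%
MiniMax M2.5 | 83 | 69 | 67 | 50 | 35 | 60.9%
Hermes 3 70B | 100 | 67 | 66 | 66 | 0 | 59.7%
MiniMax M2.7 | 90 | 83 | 67 | 33 | 17 | 58.0%
Claude Sonnet 4.5 | 83 | 69 | 52 | 44 | 38 | 57.2%
GPT-5.4 | 83 | 83 | 50 | 33 | 33 | 56.7%
GPT-5.4 Mini (Reasoning) | 77 | 76 | 67 | 50 | 13 | 56.5%
Z.AI GLM 5 Turbo | 84 | 81 | 59 | 52 | 0 | 55.3%
Qwen 3 32B | 71 | 66 | 57 | 55 | 24 | 54.7%
Gemini 3 Flash (Preview) | 63 | 58 | 50 | 50 | 50 | 54.2%
Qwen 3.5 397B A17B | 78 | 61 | 60 | 45 | 17 | 52.2%
Z.AI GLM 4.7 | 78 | 66 | 50 | 33 | 33 | 52.2%
Stealth: Aurora Alpha | 89 | 67 | 46 | 43 | 17 | 52.1%
Mistral Medium 3.1 | 67 | 67 | 63 | 59 | 0 | 51.1%
Mistral Small 4 (Reasoning) | 87 | 67 | 53 | 49 | 0 | 51.0%
Inception Mercury 2 | 76 | 54 | 52 | 39 | 31 | 50.6%
MoonshotAI: Kimi K2.5 | 86 | 64 | 56 | 33 | 13 | 50.4%
GPT-5.4 (Reasoning) | 83 | 65 | 50 | 33 | 17 | 49.6%
Qwen 3.5 Plus (2026-02-15) | 74 | 51 | 50 | 50 | 17 | 48.5%
WizardLM 2 8x22b | 68 | 52 | 50 | 36 | 33 | 48.0%
Arcee AI: Trinity Mini | 76 | 67 | 50 | 45 | 0 | 47.5%
Gemini 3.1 Pro (Preview) | 83 | 50 | 50 | 33 | 20 | 47.4%
Gemma 3 12B | 74 | 67 | 50 | 37 | 9 | 47.2%
Nemotron 3 Super | 62 | 59 | 51 | 46 | 17 | 47.0%
Nemotron 3 Nano | 80 | 53 | 50 | 50 | 1 | 46.8%
Claude Haiku 4.5 | 67 | 67 | 50 | 31 | 18 | 46.5%
Claude Opus 4.5 | 67 | 56 | 45 | 44 | 19 | 46.3%
Writer: Palmyra X5 | 83 | 70 | 63 | 14 | 0 | 45.9%
ByteDance Seed 1.6 Flash | 55 | 52 | 52 | 50 | 12 | 44.2%
Ministral 3 14B | 71 | 50 | 42 | 34 | 19 | 43.1%
DeepSeek V3 (2025-03-24) | 52 | 49 | 46 | 38 | 29 | 43.1%
GPT-4o, Aug. 6th (temp=1) | 61 | 52 | 35 | 33 | 26 | 41.3%
Gemini 2.5 Flash | 70 | 58 | 46 | 29 | 0 | 40.5%
Rocinante 12B | 65 | 54 | 48 | 33 | 0 | 40.2%
Gemini 3 Pro (Preview) | 48 | 46 | 46 | 44 | 17 | 40.2%
Mistral Large 2 | 77 | 50 | 40 | 33 | 0 | 40.0%
Inception Mercury | 100 | 68 | 31 | 0 | 0 | 39.6%
Ministral 8B | 71 | 54 | 43 | 26 | 0 | 38.7%
Claude 3.5 Haiku | 80 | 55 | 38 | 16 | 0 | 37.9%
Z.AI GLM 4.7 Flash | 59 | 50 | 50 | 17 | 9 | 36.9%
GPT-5 Mini | 67 | 50 | 33 | 17 | 17 | 36.7%
o4 Mini High | 50 | 47 | 34 | 33 | 18 | 36.4%
Claude Sonnet 4 | 96 | 39 | 20 | 10 | 4 | 33.9%
Cohere Command R+ (Aug. 2024) | 77 | 39 | 33 | 18 | 0 | 33.5%
Ministral 3 8B | 57 | 56 | 37 | 17 | 0 | 33.3%
Grok 4.1 Fast | 72 | 33 | 26 | 17 | 17 | 32.8%
Claude Opus 4 | 57 | 33 | 30 | 26 | 17 | 32.8%
Qwen 3.5 27B | 83 | 50 | 17 | 0 | 0 | 30.0%
Hermes 3 405B | 100 | 42 | 6 | 0 | 0 | 29.7%
Grok 4 | 52 | 41 | 38 | 12 | 5 | 29.5%
Qwen 3.5 9B | 100 | 30 | 0 | 0 | 0 | 26.0%
LFM2 24B | 52 | 33 | 24 | 16 | 0 | 24.9%
Mistral NeMO | 49 | 41 | 17 | 17 | 0 | 24.7%
GPT-5.4 Mini | 67 | 33 | 17 | 0 | 0 | 23.3%
Llama 3.1 8B | 67 | 50 | 0 | 0 | 0 | 23.3%
Qwen 2.5 72B | 46 | 38 | 16 | 16 | 0 | 23.0%
Ministral 3 3B | 35 | 33 | 23 | 21 | 0 | 22.3%
GPT-4o, May 13th (temp=1) | 50 | 33 | 22 | 0 | 0 | 21.1%
GPT-5.4 Nano (Reasoning) | 33 | 33 | 17 | 16 | 0 | 19.9%
Ministral 3B | 67 | 12 | 8 | 6 | 2 | 18.9%
Claude 3.7 Sonnet | 76 | 17 | 0 | 0 | 0 | 18.6%
GPT-5.4 Nano | 50 | 30 | 0 | 0 | 0 | 16.0%
Grok 4.20 (Beta) | 67 | 12 | 0 | 0 | 0 | 15.8%
Claude 3.5 Sonnet | 39 | 38 | 1 | 0 | 0 | 15.7%
DeepSeek V3 (2024-12-26) | 56 | 0 | 0 | 0 | 0 | 11.2%
DeepSeek-V2 Chat | 40 | 14 | 0 | 0 | 0 | 10.6%
GPT-5.4 Mini (Reasoning, Low) | 33 | 17 | 0 | 0 | 0 | 10.0%
GPT-4.1 Mini | 32 | 17 | 0 | 0 | 0 | 9.7%
o4 Mini | 17 | 17 | 13 | 0 | 0 | 9.3%
Qwen 3.5 Flash | 27 | 17 | 0 | 0 | 0 | 8.8%
GPT-4o Mini (temp=0) | 43 | 0 | 0 | 0 | 0 | 8.7%
GPT-4o, Aug. 6th (temp=0) | 27 | 16 | 0 | 0 | 0 | 8.6%
Qwen 3.5 35B | 37 | 0 | 0 | 0 | 0 | 7.5%
Llama 3.1 70B | 34 | 3 | 0 | 0 | 0 | 7.5%
GPT-5.2 | 17 | 17 | 0 | 0 | 0 | 6.7%
GPT-5.4 Nano (Reasoning, Low) | 17 | 17 | 0 | 0 | 0 | 6.7%
Qwen 3.5 122B | 32 | 0 | 0 | 0 | 0 | 6.4%
Z.AI GLM 4.5 | 17 | 0 | 0 | 0 | 0 | 3.3%
Mistral Small 3.2 24B | 17 | 0 | 0 | 0 | 0 | 3.3%
Llama 3.1 Nemotron 70B | 9 | 4 | 0 | 0 | 0 | 2.5%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0.0%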
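Model | #1 | #2 | #3 | #4 | #5 | Avg
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 96 | 99.2%
Grok 4 Fast | 100 | 100 | 100 | 100 | 95 | 99.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 94 | 98.7%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 91 | 98.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 84 | 96.8%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 83 | 96.7%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 83 | 96.7%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 83 | 96.7%
o4 Mini High | 100 | 100 | 100 | 100 | 83 | 96.7%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemma 3 4B | 100 | 100 | 100 | 100 | 83 | 96.7%
MiniMax M2.7 | 100 | 100 | 100 | 98 | 84 | 96.4%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 80 | 96.1%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 78 | 95.6%
Nemotron 3 Super | 100 | 100 | 100 | 93 | 83 | 95.2%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 97 | 95 | 83 | 95.0%
Claude Opus 4.5 | 100 | 100 | 96 | 94 | 83 | 94.6%
GPT-5 Mini | 100 | 100 | 100 | 83 | 83 | 93.3%
GPT-5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Sonnet 4.6 | 100 | 100 | 100 | 83 | 83 | 93.3%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Hermes 3 405B | 100 | 100 | 100 | 100 | 66 | 93.2%
GPT-4.1 | 100 | 100 | 97 | 90 | 79 | 93.1%
Z.AI GLM 5 Turbo | 100 | 100 | 94 | 83 | 83 | 92.1%
Mistral Medium 3.1 | 100 | 100 | 99 | 83 | 76 | 91.8%
Z.AI GLM 4.7 | 100 | 100 | 94 | 90 | 67 | 90.2%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 97 | 91 | 62 | 89.9%
Gemini 3 Pro (Preview) | 100 | 100 | 88 | 83 | 78 | 89.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 78 | 68 | 89.1%
LFM2 24B | 100 | 94 | 93 | 80 | 74 | 88.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 91 | 50 | 88.3%
GPT-5.4 (Reasoning) | 100 | 95 | 83 | 83 | 79 | 88.1%
Arcee AI: Trinity Mini | 97 | 96 | 92 | 89 | 67 | 88.0%
Claude Opus 4.6 | 100 | 100 | 83 | 83 | 67 | 86.7%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
Grok 4 | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-5.1 | 100 | 100 | 83 | 83 | 67 | 86.7%
Z.AI GLM 4.6 | 100 | 100 | 83 | 83 | 67 | 86.7%
Claude 3.5 Haiku | 100 | 100 | 92 | 73 | 68 | 86.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 83 | 48 | 86.3%
ByteDance Seed 1.6 | 100 | 95 | 83 | 81 | 69 | 85.6%
Writer: Palmyra X5 | 100 | 100 | 92 | 92 | 43 | 85.4%
Claude Sonnet 4 | 100 | 100 | 96 | 67 | 59 | 84.4%
Stealth: Hunter Alpha | 100 | 100 | 100 | 67 | 50 | 83.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 83 | 33 | 83.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 95 | 81 | 81 | 55 | 82.3%
Qwen 3 32B | 100 | 100 | 75 | 69 | 67 | 82.2%
GPT-5 Nano | 100 | 100 | 93 | 67 | 50 | 82.0%
Mistral Large 2 | 100 | 100 | 83 | 75 | 45 | 80.6%
Mistral Small 4 | 100 | 100 | 70 | 67 | 66 | 80.4%
ByteDance Seed 1.6 Flash | 100 | 83 | 83 | 67 | 67 | 80.0%
Claude Haiku 4.5 | 100 | 100 | 97 | 50 | 50 | 79.5%
GPT-5.4 (Reasoning, Low) | 91 | 83 | 80 | 74 | 67 | 79.1%
Ministral 3B | 100 | 100 | 83 | 64 | 42 | 77.8%
GPT-5.4 | 100 | 88 | 83 | 67 | 50 | 77.6%
o4 Mini | 100 | 98 | 87 | 61 | 38 | 77.0%
GPT-4o, Aug. 6th (temp=1) | 95 | 83 | 83 | 67 | 42 | 74.1%
GPT-4.1 Mini | 83 | 81 | 81 | 79 | 33 | 71.7%
WizardLM 2 8x22b | 100 | 83 | 83 | 67 | 17 | 70.0%
Claude 3.5 Sonnet | 100 | 99 | 55 | 54 | 34 | 68.2%
Llama 3.1 8B | 100 | 100 | 83 | 57 | 0 | 68.1%
GPT-4o, May 13th (temp=1) | 100 | 92 | 83 | 38 | 11 | 64.9%
GPT-5.4 Mini (Reasoning, Low) | 83 | 73 | 67 | 50 | 47 | 63.9%
Qwen 3.5 Flash | 100 | 83 | 67 | 33 | 33 | 63.3%
Nemotron 3 Nano | 100 | 100 | 67 | 50 | 0 | 63.3%
Inception Mercury 2 | 77 | 69 | 64 | 57 | 50 | 63.3%
GPT-5.4 Mini | 90 | 83 | 60 | 50 | 33 | 63.3%
Rocinante 12B | 100 | 100 | 100 | 14 | 0 | 62.9%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 80 | 33 | 0 | 62.7%
GPT-5.4 Mini (Reasoning) | 95 | 66 | 56 | 50 | 43 | 61.9%
Qwen 3.5 397B A17B | 100 | 75 | 67 | 50 | 17 | 61.6%
Stealth: Aurora Alpha | 67 | 67 | 63 | 60 | 50 | 61.3%
Claude 3.7 Sonnet | 83 | 75 | 67 | 65 | 15 | 60.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 Nemotron 70B | 100 | 67 | 63 | 33 | 33 | 59.5%
Grok 4.1 Fast | 93 | 83 | 50 | 50 | 17 | 58.6%
Claude Sonnet 4.5 | 93 | 87 | 67 | 44 | 0 | 58.4%
Qwen 3.5 Plus (2026-02-15) | 91 | 83 | 50 | 49 | 0 | 54.7%
Ministral 3 14B | 95 | 89 | 55 | 17 | 17 | 54.4%
Ministral 3 3B | 100 | 100 | 35 | 33 | 0 | 53.7%
Claude Opus 4 | 100 | 100 | 67 | 0 | 0 | 53.3%
Mistral Small 4 (Reasoning) | 100 | 67 | 50 | 33 | 17 | 53.3%
Hermes 3 70B | 81 | 70 | 67 | 47 | 0 | 52.9%
Mistral Large | 100 | 83 | 35 | 32 | 0 | 49.9%
Ministral 3 8B | 88 | 84 | 77 | 0 | 0 | 49.8%
Inception Mercury | 100 | 67 | 67 | 13 | 0 | 49.2%
Mistral Small 3.2 24B | 83 | 67 | 33 | 30 | 0 | 42.6%
Mistral NeMO | 90 | 50 | 33 | 17 | 17 | 41.3%
Llama 3.1 70B | 50 | 49 | 45 | 42 | 19 | 41.2%
Mistral Large 3 | 100 | 56 | 33 | 12 | 0 | 40.4%
DeepSeek V3 (2025-03-24) | 63 | 48 | 46 | 41 | 0 | 39.7%
Claude 3 Haiku | 75 | 59 | 33 | 23 | 0 | 38.0%
Qwen 3.5 27B | 83 | 50 | 33 | 16 | 0 | 36.4%
Qwen 2.5 72B | 55 | 49 | 33 | 33 | 11 | 36.3%
Qwen 3.5 122B | 99 | 50 | 31 | 0 | 0 | 36.0%
Ministral 8B | 99 | 67 | 0 | 0 | 0 | 33.2%
GPT-5.4 Nano (Reasoning, Low) | 67 | 46 | 33 | 13 | 0 | 31.7%
GPT-4o, May 13th (temp=0) | 66 | 33 | 27 | 17 | 11 | 30.8%
Arcee AI: Trinity Large (Preview) | 90 | 40 | 17 | 0 | 0 | 29.3%
Mistral Small Creative | 64 | 62 | 17 | 0 | 0 | 28.6%
Qwen 3.5 35B | 67 | 50 | 17 | 0 | 0 | 26.7%
Z.AI GLM 4.5 | 67 | 38 | 17 | 0 | 0 | 24.2%
GPT-5.4 Nano | 50 | 33 | 17 | 17 | 0 | 23.3%
GPT-5.4 Nano (Reasoning) | 37 | 33 | 17 | 0 | 0 | 17.4%
Qwen 3.5 9B | 33 | 14 | 0 | 0 | 0 | 9.5%
GPT-5.2 | 17 | 17 | 11 | 0 | 0 | 8.9%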
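Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 92 | 98.4%
Grok 4 Fast | 100 | 100 | 100 | 97 | 94 | 98.4%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 99 | 90 | 97.9%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemma 3 27B | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 93 | 83 | 95.3%
Grok 4.1 Fast | 100 | 100 | 95 | 93 | 83 | 94.3%
GPT-5 | 100 | 100 | 100 | 83 | 83 | 93.3%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 97 | 83 | 81 | 92.3%
ByteDance Seed 2.0 Lite | 100 | 100 | 94 | 83 | 83 | 92.0%
Aion 2.0 | 100 | 99 | 93 | 83 | 83 | 91.9%
GPT-5 Mini | 100 | 100 | 100 | 83 | 67 | 90.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 83 | 67 | 90.0%
GPT-5 Nano | 100 | 100 | 83 | 83 | 83 | 90.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 83 | 67 | 90.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 83 | 67 | 90.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 82 | 67 | 89.8%
Claude Opus 4.6 | 100 | 100 | 96 | 83 | 67 | 89.3%
MiniMax M2.5 | 100 | 99 | 98 | 83 | 62 | 88.6%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 92 | 83 | 67 | 88.3%
o4 Mini | 100 | 88 | 86 | 83 | 81 | 87.6%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 33 | 86.7%
Stealth: Hunter Alpha | 100 | 100 | 99 | 83 | 50 | 86.5%
Mistral Medium 3.1 | 100 | 100 | 100 | 80 | 50 | 86.0%
Inception Mercury | 98 | 94 | 88 | 81 | 65 | 85.2%
Claude Opus 4.5 | 93 | 86 | 83 | 83 | 76 | 84.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 83 | 76 | 63 | 84.4%
Grok 4.20 (Beta, Reasoning) | 87 | 83 | 83 | 82 | 81 | 83.4%
MoonshotAI: Kimi K2.5 | 100 | 100 | 83 | 79 | 50 | 82.6%
Gemma 3 4B | 100 | 96 | 83 | 67 | 67 | 82.5%
Claude Haiku 4.5 | 100 | 92 | 89 | 67 | 61 | 81.7%
Rocinante 12B | 100 | 100 | 98 | 90 | 17 | 80.9%
o4 Mini High | 100 | 100 | 80 | 75 | 50 | 80.8%
Nemotron 3 Super | 100 | 81 | 80 | 76 | 67 | 80.8%
DeepSeek V3.2 | 100 | 100 | 100 | 83 | 17 | 80.0%
Qwen3 235B A22B Instruct 2507 | 100 | 83 | 83 | 67 | 67 | 80.0%
Mistral NeMO | 100 | 100 | 86 | 33 | | 79.9%
Grok 4 | 100 | 94 | 88 | 83 | 33 | 79.6%
Grok 4.20 (Beta) | 100 | 99 | 78 | 67 | 50 | 78.8%
Z.AI GLM 5 | 100 | 100 | 91 | 83 | 17 | 78.1%
Hermes 3 70B | 100 | 100 | 100 | 86 | 3 | 77.9%
ByteDance Seed 1.6 | 83 | 83 | 78 | 76 | 67 | 77.4%
LFM2 24B | 100 | 73 | 72 | 71 | 67 | 76.5%
Gemini 3 Flash (Preview, Reasoning) | 83 | 83 | 81 | 67 | 67 | 76.2%
ByteDance Seed 1.6 Flash | 92 | 83 | 72 | 67 | 67 | 76.1%
Mistral Small Creative | 100 | 100 | 79 | 50 | 50 | 75.9%
GPT-4.1 Mini | 88 | 83 | 81 | 67 | 56 | 75.1%
GPT-4o, Aug. 6th (temp=1) | 100 | 80 | 78 | 67 | 50 | 75.0%
Gemini 3 Pro (Preview) | 100 | 100 | 67 | 56 | 50 | 74.6%
Claude Sonnet 4.5 | 100 | 83 | 80 | 56 | 50 | 73.9%
GPT-4o, May 13th (temp=1) | 97 | 79 | 77 | 67 | 49 | 73.8%
Mistral Small 4 (Reasoning) | 100 | 99 | 83 | 83 | 0 | 73.1%
Nemotron 3 Nano | 92 | 85 | 74 | 67 | 47 | 72.9%
Mistral Small 4 | 100 | 83 | 75 | 67 | 33 | 71.8%
DeepSeek-V2 Chat | 88 | 83 | 80 | 55 | 52 | 71.7%
GPT-4o Mini (temp=0) | 100 | 88 | 58 | 55 | 55 | 71.5%
Claude Sonnet 4 | 95 | 88 | 83 | 59 | 25 | 70.1%
Gemma 3 12B | 100 | 67 | 67 | 67 | 50 | 70.0%
Claude 3.5 Haiku | 100 | 66 | 63 | 63 | 58 | 70.0%
WizardLM 2 8x22b | 100 | 100 | 67 | 60 | 17 | 68.7%
Stealth: Aurora Alpha | 100 | 72 | 61 | 60 | 50 | 68.6%
Mistral Large 3 | 98 | 86 | 76 | 67 | 17 | 68.6%
DeepSeek V3 (2024-12-26) | 100 | 73 | 68 | 65 | 33 | 67.9%
MiniMax M2.7 | 100 | 82 | 67 | 50 | 31 | 66.0%
Ministral 3 14B | 85 | 67 | 67 | 58 | 53 | 66.0%
GPT-5.4 | 90 | 75 | 67 | 63 | 33 | 65.6%
Cohere Command R+ (Aug. 2024) | 100 | 77 | 67 | 50 | 33 | 65.5%
GPT-5.4 Nano | 83 | 67 | 66 | 60 | 50 | 65.1%
Mistral Large 2 | 67 | 67 | 67 | 65 | 60 | 65.0%
GPT-5.4 (Reasoning, Low) | 81 | 67 | 67 | 59 | 50 | 64.8%
Mistral Small 3.2 24B | 100 | 100 | 67 | 39 | 17 | 64.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 52 | 50 | 17 | 63.7%
GPT-5.4 Mini (Reasoning) | 82 | 63 | 62 | 60 | 50 | 63.4%
Inception Mercury 2 | 78 | 70 | 67 | 54 | 47 | 63.1%
GPT-5.4 (Reasoning) | 79 | 67 | 67 | 50 | 50 | 62.5%
Mistral Large | 83 | 77 | 50 | 50 | 50 | 62.0%
Gemini 3 Flash (Preview) | 73 | 67 | 67 | 50 | 50 | 61.2%
Z.AI GLM 4.5 | 67 | 66 | 61 | 56 | 50 | 60.1%
Z.AI GLM 4.7 | 100 | 83 | 83 | 33 | 0 | 60.0%
Gemini 3.1 Pro (Preview) | 100 | 50 | 50 | 50 | 50 | 60.0%
Ministral 8B | 67 | 66 | 65 | 50 | 44 | 58.5%
GPT-5.4 Mini (Reasoning, Low) | 75 | 67 | 67 | 67 | 17 | 58.4%
Qwen 3 32B | 83 | 74 | 67 | 50 | 17 | 58.2%
Arcee AI: Trinity Mini | 100 | 72 | 50 | 47 | 20 | 57.9%
GPT-5.4 Nano (Reasoning, Low) | 83 | 56 | 50 | 50 | 50 | 57.9%
Qwen 3.5 Plus (2026-02-15) | 76 | 67 | 67 | 50 | 17 | 55.3%
Ministral 3 3B | 81 | 71 | 54 | 38 | 31 | 54.9%
DeepSeek V3 (2025-03-24) | 96 | 66 | 60 | 39 | 7 | 53.5%
Ministral 3B | 76 | 73 | 54 | 34 | 21 | 51.7%
GPT-4.1 | 71 | 57 | 50 | 50 | 17 | 49.0%
Qwen 3.5 397B A17B | 94 | 67 | 33 | 30 | 17 | 48.1%
GPT-4o, Aug. 6th (temp=0) | 83 | 65 | 56 | 33 | 0 | 47.4%
Llama 3.1 8B | 100 | 67 | 50 | 15 | 0 | 46.2%
Qwen 2.5 72B | 67 | 62 | 52 | 33 | 0 | 42.8%
Qwen 3.5 122B | 58 | 52 | 50 | 33 | 17 | 41.9%
Ministral 3 8B | 67 | 67 | 37 | 33 | 0 | 40.7%
GPT-5.4 Nano (Reasoning) | 67 | 55 | 33 | 27 | 17 | 39.8%
Claude 3.5 Sonnet | 74 | 63 | 31 | 24 | 0 | 38.4%
GPT-5.4 Mini | 61 | 50 | 33 | 33 | 0 | 35.5%
GPT-4o, May 13th (temp=0) | 93 | 46 | 19 | 17 | 0 | 35.0%
Hermes 3 405B | 56 | 35 | 33 | 31 | 18 | 34.8%
Llama 3.1 Nemotron 70B | 49 | 47 | 39 | 31 | 0 | 33.3%
Claude Opus 4 | 67 | 33 | 17 | 17 | 0 | 26.8%
Qwen 3.5 Flash | 65 | 33 | 17 | 17 | 0 | 26.3%
Claude 3.7 Sonnet | 48 | 31 | 17 | 17 | 0 | 22.4%
Qwen 3.5 9B | 78 | 28 | 0 | 0 | 0 | 21.2%
Llama 3.1 70B | 47 | 40 | 7 | 1 | 0 | 19.0%
Qwen 3.5 27B | 33 | 33 | 17 | 0 | 0 | 16.7%
Qwen 3.5 35B | 41 | 25 | 12 | 0 | 0 | 15.6%
Claude 3 Haiku | 23 | 22 | 17 | 4 | 0 | 13.2%
GPT-5.2 | 33 | 0 | 0 | 0 | 0 | 6.7%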

genre

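Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5 Mini | 100 | 100 | 100 | 100 | 93 | 98.6%
Stealth: Aurora Alpha | 100 | 100 | 100 | 95 | 94 | 97.9%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 83 | 96.7%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemma 3 4B | 100 | 100 | 100 | 100 | 83 | 96.7%
Inception Mercury 2 | 100 | 100 | 99 | 92 | 83 | 94.8%
ByteDance Seed 2.0 Lite | 100 | 100 | 95 | 95 | 83 | 94.8%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 93 | 91 | 90 | 94.7%
GPT-4.1 Nano | 100 | 100 | 100 | 88 | 83 | 94.0%
Nemotron 3 Nano | 100 | 100 | 94 | 83 | 83 | 92.1%
Arcee AI: Trinity Mini | 100 | 100 | 91 | 83 | 77 | 90.2%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 83 | 67 | 90.0%
GPT-4.1 Mini | 100 | 89 | 89 | 87 | 81 | 89.3%
Gemini 3.1 Flash Lite (Preview) | 100 | 99 | 94 | 92 | 50 | 87.2%
ByteDance Seed 1.6 | 99 | 91 | 89 | 80 | 77 | 87.1%
Inception Mercury | 100 | 100 | 85 | 83 | 67 | 87.0%
LFM2 24B | 96 | 91 | 90 | 84 | 75 | 86.9%
Llama 3.1 Nemotron 70B | 100 | 100 | 91 | 75 | 64 | 86.0%
WizardLM 2 8x22b | 95 | 92 | 81 | 80 | 76 | 84.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 83 | 83 | 81 | 70 | 83.5%
GPT-5 | 100 | 83 | 83 | 83 | 67 | 83.3%
Stealth: Hunter Alpha | 100 | 100 | 98 | 83 | 33 | 83.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 98 | 67 | 50 | 82.9%
Grok 4.20 (Beta) | 96 | 85 | 83 | 83 | 67 | 82.8%
Llama 3.1 8B | 100 | 100 | 100 | 64 | 49 | 82.7%
Grok 4 | 91 | 89 | 83 | 83 | 67 | 82.6%
Llama 3.1 70B | 100 | 93 | 80 | 76 | 59 | 81.7%
Grok 4.1 Fast | 94 | 87 | 83 | 73 | 69 | 81.3%
GPT-4o Mini (temp=1) | 100 | 83 | 76 | 76 | 67 | 80.3%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-5.1 | 100 | 100 | 83 | 67 | 50 | 80.0%
Claude Sonnet 4 | 100 | 83 | 81 | 78 | 55 | 79.7%
Gemini 2.5 Flash | 100 | 83 | 83 | 67 | 65 | 79.6%
Gemini 2.5 Pro | 100 | 97 | 93 | 74 | 33 | 79.4%
DeepSeek V3.1 | 96 | 83 | 83 | 67 | 67 | 79.2%
Ministral 8B | 86 | 85 | 83 | 69 | 67 | 78.0%
Grok 4 Fast | 84 | 83 | 76 | 76 | 71 | 77.9%
Claude Opus 4.6 (Reasoning) | 95 | 83 | 76 | 67 | 67 | 77.6%
o4 Mini High | 87 | 83 | 83 | 75 | 60 | 77.6%
GPT-4o, Aug. 6th (temp=0) | 87 | 86 | 81 | 67 | 66 | 77.4%
GPT-5.4 Mini (Reasoning) | 83 | 83 | 83 | 67 | 67 | 76.7%
Qwen 3.5 Flash | 83 | 83 | 76 | 71 | 68 | 76.4%
Qwen 2.5 72B | 91 | 79 | 71 | 67 | 67 | 74.9%
DeepSeek V3.2 | 100 | 91 | 83 | 83 | 17 | 74.8%
Claude 3.5 Haiku | 96 | 85 | 74 | 62 | 57 | 74.6%
Z.AI GLM 5 Turbo | 83 | 83 | 83 | 67 | 55 | 74.3%
GPT-4.1 | 91 | 83 | 73 | 72 | 50 | 73.8%
Claude Opus 4.6 | 100 | 89 | 67 | 62 | 50 | 73.5%
Qwen 3.5 35B | 80 | 79 | 77 | 67 | 63 | 73.1%
Stealth: Healer Alpha | 100 | 83 | 82 | 67 | 33 | 73.0%
Qwen 3.5 397B A17B | 83 | 76 | 73 | 67 | 65 | 72.7%
Gemini 3 Flash (Preview, Reasoning) | 97 | 83 | 75 | 50 | 50 | 71.1%
Claude Sonnet 4.6 | 100 | 83 | 73 | 65 | 33 | 70.9%
GPT-4o, Aug. 6th (temp=1) | 78 | 76 | 68 | 67 | 59 | 69.5%
MiniMax M2.5 | 83 | 80 | 67 | 67 | 50 | 69.4%
Gemini 3 Flash (Preview) | 81 | 72 | 67 | 62 | 61 | 68.7%
Mistral Small 3.2 24B | 87 | 83 | 67 | 67 | 39 | 68.5%
GPT-5.4 (Reasoning) | 83 | 75 | 67 | 67 | 50 | 68.4%
Cohere Command R+ (Aug. 2024) | 89 | 84 | 83 | 50 | 33 | 68.0%
Gemma 3 12B | 80 | 74 | 67 | 64 | 50 | 67.0%
Claude Haiku 4.5 | 83 | 79 | 77 | 48 | 45 | 66.5%
GPT-5.4 | 83 | 79 | 67 | 67 | 33 | 65.8%
o4 Mini | 73 | 70 | 66 | 63 | 50 | 64.3%
Qwen 3 32B | 79 | 73 | 66 | 61 | 40 | 63.9%
GPT-5.4 Mini (Reasoning, Low) | 97 | 67 | 65 | 50 | 33 | 62.5%
GPT-5.4 Mini | 67 | 67 | 65 | 61 | 50 | 61.8%
Ministral 3B | 86 | 64 | 63 | 50 | 44 | 61.3%
Ministral 3 3B | 83 | 65 | 59 | 58 | 42 | 61.3%
Gemini 3 Pro (Preview) | 78 | 67 | 61 | 50 | 50 | 61.1%
Z.AI GLM 4.7 | 67 | 67 | 65 | 57 | 50 | 61.0%
Mistral Medium 3.1 | 84 | 71 | 50 | 50 | 50 | 60.9%
ByteDance Seed 1.6 Flash | 83 | 75 | 67 | 41 | 38 | 60.7%
Rocinante 12B | 99 | 80 | 75 | 49 | 0 | 60.5%
DeepSeek V3 (2024-12-26) | 82 | 67 | 61 | 54 | 38 | 60.4%
DeepSeek-V2 Chat | 73 | 68 | 58 | 55 | 46 | 60.0%
MiniMax M2.7 | 67 | 67 | 67 | 50 | 50 | 60.0%
GPT-5 Nano | 83 | 83 | 67 | 50 | 17 | 60.0%
Claude Sonnet 4.5 | 79 | 67 | 58 | 50 | 44 | 59.3%
Gemini 3.1 Pro (Preview) | 83 | 67 | 63 | 50 | 33 | 59.3%
GPT-4o, May 13th (temp=1) | 87 | 66 | 57 | 50 | 33 | 58.8%
Mistral NeMO | 80 | 68 | 55 | 50 | 38 | 58.2%
GPT-4o, May 13th (temp=0) | 84 | 60 | 59 | 55 | 33 | 58.0%
Z.AI GLM 5 | 83 | 73 | 67 | 33 | 33 | 58.0%
Gemma 3 27B | 77 | 69 | 63 | 47 | 32 | 57.6%
Arcee AI: Trinity Large (Preview) | 87 | 67 | 66 | 50 | 17 | 57.4%
Hermes 3 70B | 81 | 57 | 53 | 50 | 43 | 56.7%
Gemini 2.5 Flash Lite | 100 | 67 | 67 | 33 | 17 | 56.7%
Ministral 3 8B | 83 | 77 | 62 | 52 | 0 | 55.0%
Qwen 3.5 122B | 73 | 63 | 62 | 39 | 33 | 54.1%
Z.AI GLM 4.7 Flash | 67 | 53 | 50 | 50 | 50 | 53.9%
DeepSeek V3 (2025-03-24) | 74 | 59 | 57 | 53 | 25 | 53.7%
GPT-5.4 (Reasoning, Low) | 83 | 67 | 67 | 33 | 17 | 53.3%
Aion 2.0 | 67 | 67 | 67 | 33 | 33 | 53.3%
Mistral Small 4 | 83 | 67 | 50 | 50 | 17 | 53.3%
Claude 3.5 Sonnet | 85 | 73 | 44 | 33 | 28 | 52.8%
GPT-5.4 Nano (Reasoning) | 57 | 55 | 50 | 50 | 50 | 52.5%
Mistral Large 2 | 74 | 60 | 56 | 50 | 23 | 52.4%
Qwen 3.5 Plus (2026-02-15) | 77 | 67 | 51 | 33 | 33 | 52.3%
GPT-4o Mini (temp=0) | 66 | 52 | 50 | 49 | 33 | 50.1%
Claude Opus 4.5 | 83 | 50 | 50 | 33 | 33 | 50.0%
GPT-5.4 Nano | 67 | 50 | 48 | 45 | 33 | 48.7%
MoonshotAI: Kimi K2.5 | 67 | 50 | 50 | 43 | 33 | 48.6%
Ministral 3 14B | 78 | 57 | 50 | 36 | 17 | 47.4%
Hermes 3 405B | 83 | 76 | 60 | 17 | 0 | 47.2%
Claude Opus 4 | 62 | 58 | 55 | 33 | 28 | 47.2%
Writer: Palmyra X5 | 60 | 50 | 50 | 33 | 33 | 45.4%
GPT-5.4 Nano (Reasoning, Low) | 63 | 50 | 33 | 33 | 33 | 42.5%
Mistral Large | 73 | 50 | 33 | 25 | 23 | 40.9%
Qwen 3.5 9B | 54 | 50 | 42 | 39 | 17 | 40.4%
Mistral Large 3 | 62 | 52 | 45 | 42 | 0 | 40.2%
Z.AI GLM 4.5 | 54 | 50 | 42 | 28 | 26 | 39.9%
Mistral Small Creative | 56 | 50 | 33 | 33 | 17 | 37.8%
Claude 3.7 Sonnet | 58 | 50 | 30 | 27 | 0 | 33.1%
Mistral Small 4 (Reasoning) | 67 | 33 | 32 | 17 | 17 | 33.0%
Qwen 3.5 27B | 42 | 39 | 33 | 30 | 17 | 32.0%
Qwen3 235B A22B Instruct 2507 | 50 | 50 | 33 | 0 | 0 | 26.7%
Claude 3 Haiku | 56 | 27 | 17 | 0 | 0 | 19.8%
GPT-5.2 | 33 | 17 | 0 | 0 | 0 | 10.0%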
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 99.9%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 99 | 99.9%
Grok 4 Fast | 100 | 100 | 100 | 100 | 99 | 99.8%
Gemma 3 12B | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 97 | 99.4%
o4 Mini High | 100 | 100 | 100 | 100 | 97 | 99.3%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 93 | 98.7%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 92 | 98.5%
Qwen 3 32B | 100 | 100 | 100 | 96 | 95 | 98.2%
Claude Sonnet 4.5 | 100 | 100 | 99 | 96 | 94 | 97.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 99 | 89 | 97.6%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 94 | 93 | 97.3%
Grok 4.1 Fast | 100 | 100 | 100 | 95 | 88 | 96.7%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 83 | 96.7%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 83 | 96.7%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-5 Nano | 100 | 100 | 100 | 100 | 83 | 96.7%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 83 | 96.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 83 | 96.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 83 | 96.7%
Mistral Large 3 | 100 | 100 | 100 | 98 | 84 | 96.4%
DeepSeek V3 (2024-12-26) | 100 | 100 | 99 | 96 | 86 | 96.2%
Gemma 3 27B | 100 | 100 | 100 | 98 | 83 | 96.2%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 77 | 95.5%
o4 Mini | 100 | 100 | 100 | 97 | 80 | 95.4%
Claude 3.5 Sonnet | 100 | 100 | 100 | 91 | 85 | 95.1%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 89 | 83 | 94.5%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 88 | 83 | 94.3%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini | 100 | 100 | 100 | 96 | 67 | 92.5%
Inception Mercury | 100 | 100 | 100 | 87 | 66 | 90.6%
GPT-4.1 | 100 | 100 | 100 | 83 | 70 | 90.6%
Inception Mercury 2 | 100 | 100 | 95 | 91 | 67 | 90.5%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 50 | 90.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 50 | 90.0%
GPT-5.4 Nano | 100 | 99 | 97 | 83 | 67 | 89.3%
Ministral 3 8B | 100 | 93 | 91 | 82 | 79 | 88.8%
Mistral Large 2 | 100 | 99 | 90 | 87 | 67 | 88.6%
Mistral Large | 100 | 96 | 85 | 82 | 78 | 88.1%
Ministral 3B | 99 | 95 | 83 | 81 | 80 | 87.6%
GPT-5.4 Nano (Reasoning, Low) | 98 | 94 | 83 | 83 | 79 | 87.4%
Claude Sonnet 4 | 100 | 100 | 83 | 83 | 67 | 86.7%
Z.AI GLM 4.5 | 100 | 95 | 88 | 84 | 67 | 86.6%
Stealth: Aurora Alpha | 100 | 100 | 92 | 67 | 67 | 85.1%
Ministral 3 14B | 93 | 91 | 88 | 83 | 67 | 84.5%
ByteDance Seed 1.6 Flash | 100 | 100 | 82 | 67 | 67 | 83.1%
WizardLM 2 8x22b | 100 | 83 | 83 | 73 | 73 | 82.4%
Mistral Small 4 | 100 | 100 | 91 | 67 | 50 | 81.6%
DeepSeek V3 (2025-03-24) | 93 | 89 | 88 | 72 | 64 | 81.2%
GPT-4o, Aug. 6th (temp=0) | 100 | 85 | 84 | 69 | 67 | 81.0%
Llama 3.1 Nemotron 70B | 100 | 82 | 79 | 74 | 66 | 80.3%
Mistral NeMO | 100 | 100 | 92 | 71 | 33 | 79.3%
DeepSeek-V2 Chat | 100 | 100 | 85 | 67 | 42 | 78.8%
Mistral Small Creative | 100 | 88 | 83 | 67 | 50 | 77.6%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 50 | 33 | 76.7%
Ministral 8B | 100 | 100 | 82 | 67 | 33 | 76.4%
Mistral Small 4 (Reasoning) | 100 | 100 | 83 | 50 | 33 | 73.3%
Arcee AI: Trinity Large (Preview) | 98 | 94 | 67 | 58 | 48 | 72.9%
Ministral 3 3B | 100 | 86 | 82 | 59 | 33 | 72.1%
Mistral Medium 3.1 | 99 | 81 | 67 | 61 | 50 | 71.5%
Claude Opus 4 | 100 | 100 | 83 | 33 | 33 | 70.0%
Llama 3.1 70B | 100 | 78 | 64 | 59 | 49 | 69.8%
Hermes 3 70B | 91 | 90 | 64 | 50 | 49 | 68.8%
Hermes 3 405B | 100 | 100 | 56 | 43 | 32 | 66.2%
Rocinante 12B | 100 | 79 | 66 | 60 | 0 | 61.0%
Qwen 2.5 72B | 93 | 77 | 67 | 33 | 0 | 54.0%
Claude 3.7 Sonnet | 74 | 55 | 55 | 33 | 33 | 50.2%
Claude 3 Haiku | 74 | 63 | 40 | 36 | 33 | 49.2%
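Model | #1 | #2 | #3 | #4 | #5 | Avg
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 98 | 99.7%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 99 | 93 | 98.3%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 83 | 96.7%
Grok 4 | 100 | 100 | 100 | 100 | 83 | 96.7%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 83 | 96.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 83 | 96.7%
LFM2 24B | 100 | 100 | 98 | 92 | 83 | 94.8%
GPT-4.1 Nano | 100 | 100 | 100 | 98 | 72 | 94.1%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash | 100 | 100 | 100 | 83 | 79 | 92.4%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 50 | 90.0%
Claude 3.5 Haiku | 100 | 98 | 91 | 82 | 76 | 89.3%
Ministral 3 8B | 100 | 100 | 98 | 97 | 50 | 89.2%
Qwen 3 32B | 100 | 100 | 97 | 83 | 60 | 88.0%
Ministral 3 14B | 100 | 92 | 83 | 81 | 80 | 87.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 88 | 80 | 67 | 87.1%
o4 Mini High | 100 | 100 | 100 | 100 | 33 | 86.7%
Mistral Small Creative | 100 | 100 | 100 | 67 | 67 | 86.7%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 83 | 50 | 86.7%
Qwen 3.5 35B | 100 | 95 | 83 | 83 | 67 | 85.6%
Grok 4.1 Fast | 100 | 100 | 100 | 94 | 33 | 85.4%
Z.AI GLM 4.7 | 100 | 83 | 83 | 83 | 67 | 83.3%
Stealth: Hunter Alpha | 100 | 100 | 100 | 67 | 50 | 83.3%
Nemotron 3 Super | 100 | 100 | 83 | 67 | 67 | 83.3%
Mistral Large 3 | 100 | 83 | 83 | 82 | 67 | 83.0%
Nemotron 3 Nano | 100 | 100 | 94 | 83 | 33 | 82.0%
Llama 3.1 8B | 100 | 100 | 73 | 68 | 67 | 81.7%
Mistral Medium 3.1 | 100 | 89 | 83 | 67 | 67 | 81.1%
GPT-4o Mini (temp=1) | 100 | 95 | 90 | 83 | 33 | 80.2%
Ministral 8B | 100 | 100 | 95 | 72 | 33 | 80.1%
Claude Sonnet 4 | 100 | 83 | 83 | 83 | 50 | 80.0%
Grok 4 Fast | 100 | 100 | 83 | 83 | 33 | 80.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 50 | 50 | 80.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 0 | 79.9%
GPT-4.1 Mini | 100 | 100 | 99 | 78 | 17 | 78.9%
Qwen 3.5 Plus (2026-02-15) | 90 | 83 | 83 | 67 | 67 | 78.0%
Qwen 3.5 122B | 100 | 100 | 83 | 67 | 33 | 76.7%
Aion 2.0 | 100 | 100 | 100 | 83 | 0 | 76.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 83 | 0 | 76.7%
Gemini 3.1 Flash Lite (Preview) | 100 | 99 | 83 | 83 | 17 | 76.5%
GPT-5 Mini | 100 | 100 | 67 | 67 | 33 | 73.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 67 | 0 | 73.3%
GPT-5.4 | 83 | 83 | 83 | 67 | 50 | 73.3%
Qwen 3.5 9B | 100 | 83 | 67 | 67 | 50 | 73.3%
Z.AI GLM 4.7 Flash | 100 | 83 | 67 | 67 | 50 | 73.3%
Claude Haiku 4.5 | 100 | 83 | 80 | 67 | 25 | 71.0%
ByteDance Seed 1.6 | 100 | 100 | 87 | 33 | 33 | 70.6%
DeepSeek V3 (2024-12-26) | 86 | 83 | 83 | 67 | 33 | 70.5%
Stealth: Healer Alpha | 100 | 100 | 100 | 50 | 0 | 70.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 83 | 50 | 12 | 69.0%
Gemini 2.5 Flash Lite | 100 | 91 | 67 | 67 | 17 | 68.2%
Rocinante 12B | 100 | 83 | 67 | 50 | 38 | 67.7%
Gemini 3.1 Pro (Preview) | 83 | 83 | 67 | 67 | 33 | 66.7%
Mistral Small 4 | 83 | 67 | 67 | 67 | 50 | 66.7%
Gemini 3 Flash (Preview) | 83 | 83 | 67 | 66 | 33 | 66.4%
DeepSeek V3 (2025-03-24) | 100 | 80 | 67 | 67 | 17 | 66.1%
WizardLM 2 8x22b | 100 | 67 | 61 | 50 | 50 | 65.5%
MiniMax M2.5 | 100 | 100 | 100 | 17 | 0 | 63.3%
Gemma 3 27B | 100 | 83 | 67 | 50 | 17 | 63.3%
GPT-4o Mini (temp=0) | 83 | 83 | 80 | 67 | 0 | 62.6%
Cohere Command R+ (Aug. 2024) | 78 | 67 | 67 | 52 | 50 | 62.5%
DeepSeek-V2 Chat | 100 | 67 | 67 | 67 | 0 | 60.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 83 | 16 | 0 | 59.9%
Claude Sonnet 4.6 | 100 | 100 | 33 | 33 | 17 | 56.7%
Qwen 3.5 27B | 100 | 100 | 67 | 17 | 0 | 56.7%
GPT-5.4 Mini | 83 | 83 | 67 | 50 | 0 | 56.7%
Llama 3.1 70B | 100 | 100 | 67 | 17 | 0 | 56.7%
Hermes 3 405B | 100 | 89 | 50 | 42 | 0 | 56.3%
Claude 3.5 Sonnet | 83 | 83 | 76 | 17 | 17 | 55.2%
Claude Opus 4 | 92 | 83 | 83 | 16 | 0 | 54.9%
Inception Mercury | 93 | 83 | 46 | 45 | 0 | 53.4%
GPT-5.4 (Reasoning, Low) | 100 | 83 | 67 | 17 | 0 | 53.3%
ByteDance Seed 2.0 Lite | 83 | 67 | 67 | 33 | 17 | 53.3%
Writer: Palmyra X5 | 100 | 98 | 67 | 0 | 0 | 52.8%
Ministral 3B | 83 | 67 | 60 | 50 | 0 | 52.1%
Arcee AI: Trinity Large (Preview) | 83 | 82 | 33 | 33 | 28 | 52.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 50 | 50 | 33 | 17 | 50.0%
Gemini 3 Flash (Preview, Reasoning) | 83 | 67 | 50 | 33 | 17 | 50.0%
Qwen 3.5 Flash | 83 | 82 | 50 | 33 | 0 | 49.7%
Claude 3.7 Sonnet | 77 | 67 | 67 | 33 | 0 | 48.7%
GPT-4o, Aug. 6th (temp=0) | 100 | 91 | 33 | 17 | 0 | 48.2%
Claude 3 Haiku | 89 | 58 | 40 | 32 | 16 | 47.2%
GPT-5.4 Nano | 67 | 50 | 50 | 33 | 33 | 46.7%
ByteDance Seed 1.6 Flash | 99 | 83 | 50 | 0 | 0 | 46.6%
Qwen 3.5 397B A17B | 82 | 67 | 33 | 33 | 17 | 46.4%
Z.AI GLM 5 | 100 | 48 | 41 | 33 | 0 | 44.4%
GPT-5.4 Nano (Reasoning) | 69 | 50 | 50 | 33 | 17 | 43.8%
GPT-5.4 Nano (Reasoning, Low) | 67 | 50 | 50 | 50 | 0 | 43.3%
GPT-5 | 100 | 50 | 33 | 17 | 0 | 40.0%
MiniMax M2.7 | 100 | 33 | 33 | 17 | 17 | 40.0%
Mistral Large | 83 | 67 | 50 | 0 | 0 | 40.0%
GPT-4o, May 13th (temp=0) | 100 | 99 | 0 | 0 | 0 | 39.8%
GPT-5.4 (Reasoning) | 100 | 83 | 0 | 0 | 0 | 36.7%
Claude Opus 4.5 | 83 | 67 | 33 | 0 | 0 | 36.7%
Mistral Large 2 | 100 | 67 | 17 | 0 | 0 | 36.7%
Mistral NeMO | 67 | 50 | 33 | 33 | 0 | 36.7%
Claude Opus 4.6 | 100 | 33 | 33 | 17 | 0 | 36.5%
Hermes 3 70B | 76 | 33 | 28 | 23 | 12 | 34.6%
Claude Opus 4.6 (Reasoning) | 50 | 50 | 33 | 33 | 0 | 33.3%
Claude Sonnet 4.5 | 100 | 50 | 6 | 0 | 0 | 31.2%
Gemini 3 Pro (Preview) | 67 | 50 | 33 | 0 | 0 | 30.0%
GPT-5.4 Mini (Reasoning, Low) | 83 | 67 | 0 | 0 | 0 | 30.0%
Mistral Small 3.2 24B | 67 | 50 | 29 | 0 | 0 | 29.1%
Qwen 2.5 72B | 78 | 50 | 17 | 0 | 0 | 28.9%
Ministral 3 3B | 78 | 33 | 17 | 11 | 0 | 27.8%
Z.AI GLM 4.5 | 67 | 33 | 17 | 17 | 0 | 26.7%
GPT-4o, May 13th (temp=1) | 67 | 33 | 12 | 0 | 0 | 22.5%
GPT-5.2 | 50 | 0 | 0 | 0 | 0 | 10.0%
GPT-5.4 Mini (Reasoning) | 33 | 0 | 0 | 0 | 0 | 6.7%
Gemma 3 12B | 17 | 16 | 0 | 0 | 0 | 6.5%
GPT-5.1 | 0 | 0 | 0 | 0 | 0 | 0.0%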
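Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-4.1 Nano | 100 | 100 | 100 | 86 | 83 | 93.7%
Grok 4 Fast | 99 | 97 | 83 | 83 | 67 | 85.8%
Nemotron 3 Nano | 100 | 92 | 83 | 79 | 67 | 84.1%
ByteDance Seed 2.0 Mini | 100 | 100 | 83 | 67 | 67 | 83.3%
GPT-5 Nano | 100 | 100 | 83 | 67 | 67 | 83.3%
MiniMax M2.5 | 100 | 93 | 83 | 83 | 50 | 81.8%
Stealth: Aurora Alpha | 100 | 85 | 79 | 77 | 67 | 81.6%
ByteDance Seed 2.0 Lite | 98 | 83 | 83 | 67 | 67 | 79.6%
Nemotron 3 Super | 100 | 100 | 67 | 67 | 50 | 76.7%
ByteDance Seed 1.6 | 83 | 83 | 83 | 80 | 50 | 76.0%
Claude Sonnet 4.5 | 82 | 79 | 72 | 67 | 61 | 72.0%
Gemini 3.1 Flash Lite (Preview) | 83 | 83 | 75 | 67 | 50 | 71.7%
Qwen 3 32B | 85 | 85 | 75 | 62 | 50 | 71.5%
Inception Mercury | 92 | 83 | 83 | 67 | 31 | 71.1%
Grok 4.20 (Beta) | 100 | 100 | 97 | 33 | 17 | 69.4%
Hermes 3 70B | 100 | 83 | 62 | 55 | 41 | 68.3%
Inception Mercury 2 | 83 | 80 | 76 | 50 | 50 | 67.9%
MiniMax M2.7 | 94 | 83 | 57 | 50 | 50 | 66.7%
Gemini 2.5 Pro | 74 | 73 | 67 | 57 | 50 | 64.1%
Mistral Medium 3.1 | 100 | 99 | 67 | 50 | 0 | 63.1%
Grok 4.20 (Beta, Reasoning) | 83 | 67 | 67 | 60 | 33 | 62.0%
GPT-5 Mini | 83 | 76 | 67 | 50 | 33 | 61.9%
Stealth: Healer Alpha | 83 | 75 | 67 | 50 | 33 | 61.8%
Mistral NeMO | 100 | 81 | 49 | 41 | 33 | 60.8%
Qwen 3.5 Flash | 83 | 83 | 67 | 50 | 17 | 60.0%
Rocinante 12B | 78 | 78 | 66 | 64 | 10 | 59.2%
Qwen 3.5 27B | 85 | 66 | 55 | 50 | 33 | 57.9%
Mistral Large 3 | 81 | 71 | 67 | 50 | 17 | 57.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 83 | 67 | 33 | 0 | 56.7%
DeepSeek V3.1 | 75 | 56 | 50 | 50 | 50 | 56.2%
Z.AI GLM 5 | 94 | 84 | 50 | 33 | 17 | 55.5%
Qwen 3.5 Plus (2026-02-15) | 83 | 77 | 50 | 35 | 30 | 55.1%
GPT-4.1 | 94 | 81 | 67 | 33 | 0 | 55.1%
Gemma 3 4B | 72 | 67 | 61 | 52 | 17 | 53.6%
Gemini 2.5 Flash Lite (Reasoning) | 67 | 67 | 50 | 50 | 33 | 53.3%
GPT-5.4 (Reasoning) | 100 | 67 | 67 | 17 | 17 | 53.3%
MoonshotAI: Kimi K2.5 | 66 | 58 | 52 | 50 | 40 | 53.1%
Z.AI GLM 4.6 | 74 | 64 | 50 | 45 | 31 | 52.7%
Claude Sonnet 4 | 83 | 66 | 48 | 33 | 29 | 51.9%
GPT-4o Mini (temp=1) | 81 | 77 | 50 | 33 | 17 | 51.5%
Aion 2.0 | 70 | 67 | 48 | 33 | 33 | 50.2%
WizardLM 2 8x22b | 79 | 72 | 64 | 36 | 0 | 50.2%
GPT-5 | 100 | 83 | 33 | 33 | 0 | 50.0%
Claude Sonnet 4.6 | 83 | 67 | 50 | 50 | 0 | 50.0%
Stealth: Hunter Alpha | 100 | 67 | 33 | 33 | 17 | 50.0%
Gemini 2.5 Flash (Reasoning) | 80 | 67 | 50 | 33 | 17 | 49.3%
Gemma 3 27B | 65 | 63 | 50 | 33 | 33 | 49.0%
Claude Opus 4.6 | 100 | 61 | 50 | 33 | 0 | 48.9%
ByteDance Seed 1.6 Flash | 79 | 68 | 33 | 33 | 30 | 48.9%
DeepSeek V3.2 | 74 | 50 | 50 | 33 | 33 | 48.1%
GPT-4o, Aug. 6th (temp=1) | 83 | 73 | 50 | 17 | 17 | 48.0%
Mistral Large 2 | 75 | 74 | 47 | 38 | 0 | 46.9%
Mistral Small 3.2 24B | 67 | 67 | 51 | 50 | 0 | 46.9%
Gemma 3 12B | 100 | 67 | 67 | 0 | 0 | 46.7%
Grok 4 | 80 | 67 | 33 | 33 | 17 | 45.9%
Gemini 2.5 Flash Lite | 62 | 50 | 50 | 33 | 33 | 45.6%
Grok 4.1 Fast | 83 | 63 | 44 | 33 | 0 | 44.7%
GPT-4o, May 13th (temp=1) | 61 | 49 | 41 | 33 | 33 | 43.7%
Claude Opus 4.6 (Reasoning) | 67 | 67 | 33 | 33 | 17 | 43.3%
Z.AI GLM 5 Turbo | 83 | 67 | 33 | 33 | 0 | 43.3%
GPT-5.4 Mini | 83 | 67 | 33 | 33 | 0 | 43.3%
GPT-5.1 | 83 | 50 | 50 | 33 | 0 | 43.3%
Ministral 3 14B | 67 | 50 | 33 | 33 | 33 | 43.3%
GPT-4o Mini (temp=0) | 68 | 51 | 50 | 48 | 0 | 43.3%
Ministral 8B | 75 | 59 | 33 | 31 | 17 | 43.0%
Gemini 3 Flash (Preview, Reasoning) | 67 | 50 | 48 | 33 | 17 | 42.9%
Arcee AI: Trinity Mini | 67 | 57 | 55 | 28 | 0 | 41.2%
Cohere Command R+ (Aug. 2024) | 67 | 56 | 48 | 33 | 0 | 40.7%
GPT-5.4 Mini (Reasoning) | 83 | 50 | 33 | 17 | 17 | 40.0%
LFM2 24B | 67 | 57 | 28 | 28 | 14 | 38.8%
GPT-5.4 Mini (Reasoning, Low) | 83 | 33 | 33 | 17 | 17 | 36.7%
Claude 3.5 Haiku | 60 | 47 | 34 | 22 | 18 | 36.3%
Mistral Small Creative | 51 | 50 | 33 | 29 | 17 | 36.1%
Gemini 3 Flash (Preview) | 67 | 33 | 33 | 28 | 17 | 35.5%
Llama 3.1 8B | 59 | 50 | 33 | 28 | 1 | 34.4%
GPT-5.4 | 67 | 50 | 33 | 17 | 0 | 33.3%
Mistral Small 4 | 63 | 54 | 33 | 16 | 0 | 33.3%
Qwen 3.5 397B A17B | 72 | 39 | 28 | 24 | 0 | 32.8%
DeepSeek V3 (2025-03-24) | 56 | 52 | 33 | 21 | 0 | 32.5%
Mistral Large | 65 | 50 | 30 | 17 | 0 | 32.3%
Ministral 3 3B | 50 | 43 | 38 | 26 | 0 | 31.5%
o4 Mini | 53 | 49 | 22 | 20 | 13 | 31.3%
Claude Opus 4.5 | 52 | 50 | 33 | 17 | 0 | 30.5%
Qwen 3.5 35B | 50 | 33 | 33 | 28 | 7 | 30.2%
Mistral Small 4 (Reasoning) | 67 | 50 | 17 | 17 | 0 | 30.0%
GPT-5.4 (Reasoning, Low) | 81 | 50 | 17 | 0 | 0 | 29.5%
o4 Mini High | 50 | 50 | 30 | 17 | 0 | 29.4%
Ministral 3B | 48 | 45 | 33 | 16 | 0 | 28.6%
Gemini 3.1 Pro (Preview) | 78 | 63 | 0 | 0 | 0 | 28.2%
GPT-4o, Aug. 6th (temp=0) | 49 | 30 | 23 | 17 | 10 | 25.8%
DeepSeek-V2 Chat | 42 | 32 | 29 | 17 | 10 | 25.7%
GPT-4.1 Mini | 50 | 46 | 17 | 10 | 0 | 24.6%
Claude 3.5 Sonnet | 60 | 33 | 27 | 0 | 0 | 23.8%
GPT-5.2 | 33 | 33 | 17 | 17 | 17 | 23.3%
GPT-5.4 Nano (Reasoning) | 33 | 33 | 33 | 17 | 0 | 23.3%
Claude Haiku 4.5 | 39 | 30 | 24 | 22 | 0 | 23.1%
Qwen 3.5 122B | 47 | 35 | 33 | 0 | 0 | 23.1%
Arcee AI: Trinity Large (Preview) | 83 | 27 | 4 | 0 | 0 | 22.9%
Ministral 3 8B | 58 | 47 | 8 | 0 | 0 | 22.5%
Gemini 2.5 Flash | 33 | 33 | 17 | 15 | 0 | 19.6%
Claude Opus 4 | 76 | 20 | 1 | 0 | 0 | 19.4%
Qwen 3.5 9B | 50 | 43 | 4 | 0 | 0 | 19.2%
Z.AI GLM 4.7 Flash | 33 | 28 | 17 | 17 | 0 | 19.0%
Hermes 3 405B | 49 | 33 | 7 | 2 | 0 | 18.3%
Qwen 2.5 72B | 27 | 25 | 19 | 17 | 0 | 17.4%
Z.AI GLM 4.7 | 50 | 33 | 0 | 0 | 0 | 16.7%
Writer: Palmyra X5 | 50 | 17 | 17 | 0 | 0 | 16.7%
Llama 3.1 Nemotron 70B | 54 | 18 | 7 | 0 | 0 | 15.8%
Llama 3.1 70B | 48 | 24 | 4 | 0 | 0 | 15.3%
DeepSeek V3 (2024-12-26) | 33 | 17 | 17 | 3 | 0 | 14.0%
GPT-5.4 Nano (Reasoning, Low) | 47 | 17 | 0 | 0 | 0 | 12.7%
GPT-4o, May 13th (temp=0) | 19 | 19 | 0 | 0 | 0 | 7.6%
Qwen3 235B A22B Instruct 2507 | 33 | 0 | 0 | 0 | 0 | 6.7%
Claude 3 Haiku | 16 | 11 | 0 | 0 | 0 | 5.4%
Gemini 3 Pro (Preview) | 17 | 8 | 0 | 0 | 0 | 4.9%
Claude 3.7 Sonnet | 17 | 0 | 0 | 0 | 0 | 3.3%
GPT-5.4 Nano | 17 | 0 | 0 | 0 | 0 | 3.3%
Z.AI GLM 4.5 | 4 | 0 | 0 | 0 | 0 | 0.7%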
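Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 99 | 99.8%
Gemma 3 4B | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 97 | 99.4%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
Claude Opus 4 | 100 | 100 | 100 | 98 | 95 | 98.5%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 92 | 98.4%
LFM2 24B | 100 | 100 | 100 | 100 | 84 | 96.8%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 83 | 96.7%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 83 | 96.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-4.1 | 100 | 100 | 100 | 100 | 83 | 96.7%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 83 | 96.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 83 | 96.7%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 83 | 96.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 83 | 96.7%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 83 | 96.7%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-5 Nano | 100 | 100 | 100 | 99 | 83 | 96.5%
Inception Mercury 2 | 100 | 100 | 100 | 98 | 83 | 96.2%
GPT-5.4 | 100 | 100 | 100 | 98 | 83 | 96.2%
Grok 4.20 (Beta) | 100 | 100 | 100 | 87 | 83 | 94.1%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 95 | 75 | 93.9%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 96 | 90 | 83 | 93.8%
Qwen 3.5 27B | 100 | 99 | 98 | 87 | 83 | 93.5%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5 | 100 | 100 | 100 | 100 | 67 | 93.3%
MiniMax M2.7 | 100 | 100 | 100 | 83 | 83 | 93.3%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 8B | 100 | 100 | 100 | 83 | 83 | 93.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 90 | 76 | 93.2%
Claude Opus 4.5 | 100 | 100 | 91 | 91 | 83 | 93.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 98 | 67 | 92.9%
Claude 3.5 Haiku | 100 | 100 | 100 | 89 | 74 | 92.6%
Claude 3.5 Sonnet | 100 | 94 | 91 | 87 | 83 | 91.3%
GPT-5.4 Mini (Reasoning) | 100 | 99 | 90 | 83 | 83 | 91.2%
GPT-5.1 | 100 | 100 | 83 | 83 | 83 | 90.0%
Qwen 3.5 Flash | 100 | 100 | 83 | 83 | 83 | 90.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 83 | 67 | 90.0%
Gemma 3 12B | 100 | 100 | 100 | 97 | 50 | 89.3%
Claude Sonnet 4 | 100 | 100 | 100 | 83 | 63 | 89.2%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 95 | 50 | 88.9%
Claude Haiku 4.5 | 100 | 100 | 83 | 82 | 77 | 88.4%
Qwen 3.5 122B | 98 | 88 | 83 | 83 | 83 | 87.1%
ByteDance Seed 1.6 | 100 | 100 | 83 | 83 | 67 | 86.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 83 | 83 | 67 | 86.7%
Gemini 2.5 Flash | 100 | 100 | 83 | 83 | 67 | 86.7%
Z.AI GLM 4.7 | 100 | 83 | 83 | 83 | 83 | 86.7%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 83 | 50 | 86.5%
Mistral Medium 3.1 | 100 | 100 | 96 | 67 | 67 | 85.8%
Inception Mercury | 100 | 100 | 100 | 100 | 27 | 85.3%
o4 Mini | 100 | 93 | 83 | 83 | 67 | 85.2%
Qwen 3.5 9B | 100 | 100 | 99 | 92 | 33 | 84.8%
Gemini 3.1 Pro (Preview) | 83 | 83 | 83 | 83 | 83 | 83.3%
GPT-5.4 (Reasoning) | 100 | 83 | 83 | 83 | 67 | 83.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 87 | 82 | 81 | 66 | 83.2%
Qwen 3.5 397B A17B | 100 | 95 | 83 | 67 | 67 | 82.4%
Gemini 3 Pro (Preview) | 100 | 93 | 83 | 67 | 67 | 81.9%
GPT-4o, Aug. 6th (temp=0) | 100 | 94 | 83 | 75 | 52 | 81.0%
Z.AI GLM 4.5 | 92 | 83 | 83 | 79 | 67 | 80.6%
Ministral 3 14B | 97 | 97 | 82 | 78 | 50 | 80.6%
Gemini 3 Flash (Preview) | 85 | 83 | 83 | 83 | 67 | 80.3%
DeepSeek V3 (2024-12-26) | 100 | 94 | 83 | 67 | 50 | 78.8%
Rocinante 12B | 100 | 90 | 86 | 76 | 33 | 77.1%
WizardLM 2 8x22b | 100 | 83 | 75 | 69 | 57 | 76.8%
GPT-5.4 Mini (Reasoning, Low) | 100 | 83 | 83 | 67 | 50 | 76.7%
Qwen 3 32B | 100 | 99 | 90 | 83 | 10 | 76.4%
Hermes 3 405B | 100 | 100 | 90 | 83 | 0 | 74.7%
GPT-4o, May 13th (temp=0) | 96 | 92 | 83 | 83 | 17 | 74.3%
Ministral 3B | 97 | 85 | 67 | 67 | 44 | 71.7%
Cohere Command R+ (Aug. 2024) | 91 | 83 | 83 | 66 | 33 | 71.4%
Mistral Large 3 | 100 | 95 | 70 | 52 | 40 | 71.3%
DeepSeek-V2 Chat | 100 | 100 | 83 | 67 | 0 | 70.0%
Claude 3.7 Sonnet | 100 | 71 | 67 | 53 | 50 | 68.2%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 33 | 0 | 66.7%
Mistral Small 4 | 100 | 100 | 83 | 50 | 0 | 66.7%
Llama 3.1 70B | 100 | 79 | 64 | 45 | 33 | 64.4%
Mistral Small 4 (Reasoning) | 83 | 79 | 67 | 67 | 17 | 62.5%
GPT-4o, May 13th (temp=1) | 100 | 83 | 79 | 50 | 0 | 62.5%
GPT-5.4 Nano (Reasoning) | 75 | 67 | 67 | 67 | 33 | 61.7%
Ministral 3 3B | 79 | 74 | 70 | 47 | 38 | 61.5%
DeepSeek V3 (2025-03-24) | 95 | 94 | 67 | 50 | 0 | 61.3%
Hermes 3 70B | 100 | 83 | 67 | 32 | 22 | 60.8%
Llama 3.1 Nemotron 70B | 81 | 77 | 69 | 42 | 33 | 60.6%
GPT-5.4 Mini | 83 | 67 | 67 | 50 | 33 | 60.0%
Gemma 3 27B | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Small 3.2 24B | 77 | 67 | 50 | 49 | 41 | 56.7%
Qwen 2.5 72B | 81 | 67 | 65 | 50 | 0 | 52.5%
Mistral NeMO | 85 | 50 | 50 | 42 | 17 | 48.8%
Ministral 3 8B | 98 | 62 | 59 | 17 | 0 | 47.2%
GPT-5.4 Nano (Reasoning, Low) | 67 | 50 | 50 | 33 | 17 | 43.3%
Claude 3 Haiku | 56 | 54 | 45 | 36 | 26 | 43.2%
GPT-5.2 | 72 | 50 | 33 | 33 | 17 | 41.1%
Ministral 8B | 100 | 69 | 33 | 0 | 0 | 40.5%
Mistral Small Creative | 100 | 67 | 19 | 0 | 0 | 37.1%
GPT-5.4 Nano | 50 | 33 | 33 | 17 | 17 | 30.0%
Mistral Large 2 | 64 | 50 | 33 | 0 | 0 | 29.5%
Mistral Large | 54 | 52 | 0 | 0 | 0 | 21.3%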
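Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 99 | 99.8%
Grok 4 Fast | 100 | 100 | 100 | 100 | 99 | 99.8%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 92 | 98.3%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 91 | 98.1%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 90 | 98.1%
Z.AI GLM 5 | 100 | 100 | 100 | 96 | 92 | 97.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 96 | 91 | 97.4%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 83 | 96.7%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 83 | 96.7%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 83 | 96.7%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 83 | 96.7%
Qwen 3.5 397B A17B | 100 | 100 | 98 | 94 | 90 | 96.5%
Inception Mercury | 100 | 100 | 100 | 100 | 79 | 95.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 99 | 96 | 83 | 95.5%
Gemini 2.5 Flash | 100 | 100 | 96 | 95 | 83 | 94.7%
Qwen 3.5 122B | 100 | 100 | 98 | 91 | 83 | 94.3%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 83 | 83 | 93.3%
GPT-5 | 100 | 100 | 100 | 83 | 83 | 93.3%
Grok 4.1 Fast | 100 | 100 | 100 | 83 | 83 | 93.3%
MiniMax M2.5 | 100 | 100 | 100 | 83 | 83 | 93.3%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5 Nano | 100 | 100 | 100 | 83 | 83 | 93.3%
ByteDance Seed 1.6 | 100 | 97 | 97 | 83 | 83 | 92.1%
GPT-4o Mini (temp=1) | 100 | 100 | 98 | 90 | 67 | 90.9%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 83 | 67 | 90.0%
Claude Opus 4.6 | 100 | 100 | 100 | 83 | 67 | 90.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 50 | 90.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 83 | 67 | 90.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 50 | 90.0%
Grok 4.20 (Beta) | 100 | 98 | 83 | 83 | 83 | 89.6%
DeepSeek V3.2 | 100 | 100 | 97 | 83 | 67 | 89.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 97 | 83 | 67 | 89.4%
Stealth: Hunter Alpha | 100 | 99 | 97 | 83 | 67 | 89.1%
Claude Sonnet 4 | 100 | 98 | 89 | 88 | 70 | 89.0%
Qwen 3.5 27B | 100 | 100 | 92 | 83 | 67 | 88.4%
Claude Sonnet 4.5 | 100 | 94 | 89 | 86 | 67 | 87.2%
Grok 4 | 100 | 98 | 86 | 84 | 67 | 87.0%
Aion 2.0 | 100 | 85 | 83 | 83 | 83 | 87.0%
GPT-5.1 | 100 | 100 | 100 | 83 | 50 | 86.7%
GPT-4.1 | 100 | 100 | 83 | 83 | 67 | 86.7%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 83 | 83 | 67 | 86.7%
Gemma 3 27B | 100 | 100 | 100 | 83 | 50 | 86.7%
o4 Mini High | 100 | 99 | 83 | 83 | 67 | 86.4%
Ministral 3 8B | 100 | 100 | 88 | 87 | 50 | 84.9%
Gemma 3 4B | 100 | 93 | 82 | 82 | 67 | 84.7%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 17 | 83.3%
o4 Mini | 88 | 83 | 83 | 83 | 77 | 83.1%
GPT-5.4 (Reasoning, Low) | 83 | 83 | 83 | 83 | 80 | 82.8%
DeepSeek V3 (2025-03-24) | 100 | 87 | 79 | 79 | 67 | 82.4%
GPT-4.1 Mini | 90 | 83 | 83 | 81 | 75 | 82.4%
LFM2 24B | 89 | 85 | 83 | 78 | 76 | 82.4%
Gemma 3 12B | 100 | 93 | 83 | 83 | 50 | 81.9%
Inception Mercury 2 | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-5.4 | 100 | 98 | 67 | 67 | 67 | 79.6%
GPT-5.4 Mini (Reasoning, Low) | 98 | 83 | 83 | 67 | 62 | 78.6%
GPT-4o, Aug. 6th (temp=0) | 95 | 83 | 78 | 68 | 67 | 78.1%
Gemini 3.1 Pro (Preview) | 83 | 81 | 80 | 79 | 67 | 78.0%
GPT-5.4 Mini (Reasoning) | 83 | 83 | 83 | 72 | 67 | 77.7%
GPT-5.4 (Reasoning) | 91 | 83 | 83 | 80 | 50 | 77.6%
Qwen 3 32B | 95 | 91 | 67 | 66 | 65 | 76.7%
Claude Opus 4.5 | 100 | 83 | 67 | 67 | 67 | 76.7%
Z.AI GLM 4.7 Flash | 100 | 100 | 67 | 67 | 50 | 76.7%
Hermes 3 70B | 100 | 83 | 78 | 62 | 60 | 76.6%
Mistral Small 4 | 83 | 83 | 83 | 67 | 67 | 76.6%
Arcee AI: Trinity Mini | 100 | 96 | 71 | 67 | 49 | 76.5%
GPT-4o, May 13th (temp=1) | 100 | 97 | 88 | 50 | 48 | 76.4%
Mistral Small Creative | 100 | 94 | 76 | 62 | 50 | 76.3%
GPT-5.4 Nano | 100 | 83 | 74 | 67 | 50 | 74.8%
Claude Haiku 4.5 | 91 | 83 | 83 | 62 | 50 | 73.8%
Mistral Small 4 (Reasoning) | 94 | 83 | 82 | 76 | 33 | 73.7%
Claude 3.5 Sonnet | 84 | 75 | 71 | 71 | 66 | 73.4%
Cohere Command R+ (Aug. 2024) | 100 | 75 | 67 | 67 | 58 | 73.3%
Gemini 3 Flash (Preview, Reasoning) | 83 | 77 | 67 | 67 | 60 | 70.7%
Ministral 8B | 100 | 100 | 100 | 50 | 0 | 70.0%
Ministral 3 3B | 94 | 83 | 68 | 67 | 33 | 69.1%
DeepSeek-V2 Chat | 100 | 84 | 63 | 50 | 46 | 68.6%
Rocinante 12B | 100 | 80 | 67 | 53 | 44 | 68.5%
GPT-5.4 Nano (Reasoning) | 83 | 83 | 67 | 67 | 33 | 66.7%
Mistral Large 2 | 96 | 71 | 69 | 61 | 33 | 66.1%
Mistral Medium 3.1 | 100 | 83 | 64 | 63 | 17 | 65.6%
Mistral Large 3 | 98 | 74 | 55 | 50 | 50 | 65.5%
GPT-4o Mini (temp=0) | 73 | 67 | 67 | 65 | 52 | 64.8%
Claude 3.5 Haiku | 77 | 75 | 68 | 58 | 45 | 64.5%
Claude Opus 4 | 81 | 74 | 70 | 63 | 33 | 64.3%
Gemini 3 Flash (Preview) | 83 | 83 | 67 | 50 | 33 | 63.3%
Z.AI GLM 4.7 | 81 | 67 | 59 | 50 | 50 | 61.2%
Mistral NeMO | 86 | 83 | 83 | 50 | 0 | 60.5%
GPT-5.2 | 83 | 67 | 67 | 50 | 33 | 60.0%
GPT-5.4 Mini | 83 | 67 | 67 | 50 | 33 | 60.0%
WizardLM 2 8x22b | 100 | 100 | 50 | 50 | 0 | 60.0%
Mistral Large | 72 | 67 | 67 | 58 | 33 | 59.4%
Z.AI GLM 4.5 | 99 | 87 | 54 | 33 | 17 | 57.9%
Ministral 3 14B | 82 | 56 | 50 | 50 | 50 | 57.6%
DeepSeek V3 (2024-12-26) | 70 | 67 | 53 | 48 | 46 | 56.7%
Mistral Small 3.2 24B | 99 | 50 | 50 | 50 | 33 | 56.5%
Ministral 3B | 97 | 58 | 58 | 50 | 17 | 56.1%
Gemini 3 Pro (Preview) | 67 | 50 | 50 | 50 | 50 | 53.3%
GPT-5.4 Nano (Reasoning, Low) | 83 | 67 | 67 | 33 | 17 | 53.2%
Llama 3.1 Nemotron 70B | 72 | 69 | 64 | 32 | 21 | 51.6%
Claude 3 Haiku | 78 | 58 | 42 | 40 | 26 | 48.7%
Llama 3.1 70B | 83 | 55 | 49 | 43 | 0 | 46.2%
Qwen 2.5 72B | 94 | 50 | 50 | 33 | 0 | 45.4%
GPT-4o, May 13th (temp=0) | 65 | 61 | 59 | 17 | 0 | 40.3%
Arcee AI: Trinity Large (Preview) | 90 | 50 | 43 | 17 | 0 | 39.9%
Hermes 3 405B | 93 | 48 | 30 | 17 | 0 | 37.5%
ByteDance Seed 1.6 Flash | 67 | 33 | 33 | 0 | 0 | 26.7%
Claude 3.7 Sonnet | 36 | 23 | 17 | 17 | 0 | 18.3%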

Novelcrafter Default Prompt

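Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Sonnet 4.6 | 100 | 100 | 100 | 93 | 89 | 96.4%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 95 | 83 | 95.7%
Nemotron 3 Nano | 100 | 97 | 96 | 93 | 83 | 93.8%
Z.AI GLM 4.6 | 100 | 100 | 100 | 88 | 77 | 93.1%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 83 | 82 | 93.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 97 | 67 | 92.7%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 97 | 83 | 83 | 92.6%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 89 | 73 | 92.4%
Gemini 2.5 Flash | 100 | 100 | 100 | 92 | 67 | 91.7%
ByteDance Seed 2.0 Lite | 100 | 100 | 83 | 83 | 83 | 90.0%
GPT-5 | 100 | 100 | 86 | 83 | 78 | 89.6%
Gemma 3 4B | 100 | 100 | 83 | 83 | 80 | 89.3%
Nemotron 3 Super | 100 | 93 | 83 | 83 | 78 | 87.5%
Inception Mercury 2 | 100 | 93 | 92 | 83 | 67 | 86.9%
GPT-5.1 | 100 | 100 | 100 | 83 | 49 | 86.5%
Gemini 2.5 Flash (Reasoning) | 100 | 88 | 85 | 83 | 69 | 85.0%
Claude Opus 4.6 | 100 | 83 | 83 | 83 | 67 | 83.4%
GPT-5 Nano | 100 | 100 | 83 | 83 | 50 | 83.3%
Llama 3.1 8B | 100 | 100 | 100 | 67 | 47 | 82.8%
Grok 4 | 91 | 91 | 90 | 71 | 70 | 82.6%
Gemini 2.5 Flash Lite | 100 | 83 | 80 | 73 | 68 | 81.0%
Stealth: Aurora Alpha | 100 | 90 | 77 | 70 | 67 | 80.7%
GPT-4.1 Nano | 86 | 83 | 82 | 81 | 69 | 80.2%
GPT-5 Mini | 100 | 100 | 83 | 83 | 33 | 80.0%
DeepSeek V3.1 | 95 | 86 | 81 | 67 | 67 | 78.9%
Grok 4 Fast | 91 | 81 | 78 | 73 | 66 | 77.7%
Gemma 3 12B | 100 | 88 | 71 | 67 | 60 | 77.1%
Arcee AI: Trinity Mini | 100 | 83 | 77 | 76 | 48 | 77.0%
Grok 4.20 (Beta) | 92 | 82 | 77 | 67 | 67 | 76.9%
Claude Opus 4.6 (Reasoning) | 88 | 83 | 83 | 67 | 61 | 76.2%
Gemini 3 Flash (Preview) | 96 | 76 | 75 | 67 | 65 | 75.6%
Gemini 3 Flash (Preview, Reasoning) | 89 | 83 | 70 | 67 | 67 | 74.9%
Claude 3.5 Haiku | 97 | 91 | 73 | 64 | 44 | 74.1%
Llama 3.1 Nemotron 70B | 82 | 73 | 73 | 70 | 66 | 72.6%
Stealth: Hunter Alpha | 92 | 83 | 70 | 67 | 50 | 72.4%
Arcee AI: Trinity Large (Preview) | 92 | 91 | 78 | 67 | 33 | 72.1%
MiniMax M2.5 | 85 | 77 | 68 | 65 | 62 | 71.2%
ByteDance Seed 1.6 | 85 | 83 | 67 | 67 | 50 | 70.4%
GPT-5.4 (Reasoning, Low) | 97 | 83 | 67 | 55 | 50 | 70.4%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 67 | 67 | 67 | 50 | 70.0%
Claude Opus 4.5 | 90 | 83 | 74 | 50 | 50 | 69.4%
Aion 2.0 | 100 | 81 | 67 | 64 | 33 | 68.9%
Qwen3 235B A22B Instruct 2507 | 83 | 76 | 66 | 65 | 50 | 68.0%
Mistral Small 3.2 24B | 100 | 100 | 67 | 58 | 13 | 67.5%
o4 Mini High | 83 | 77 | 69 | 56 | 50 | 67.2%
Inception Mercury | 94 | 81 | 81 | 79 | 0 | 67.1%
Qwen 3.5 397B A17B | 83 | 83 | 83 | 50 | 33 | 66.5%
MiniMax M2.7 | 99 | 67 | 67 | 50 | 50 | 66.4%
WizardLM 2 8x22b | 87 | 78 | 67 | 50 | 47 | 65.6%
DeepSeek V3 (2024-12-26) | 81 | 73 | 72 | 57 | 43 | 65.6%
Grok 4.1 Fast | 84 | 80 | 63 | 51 | 50 | 65.6%
Stealth: Healer Alpha | 76 | 67 | 67 | 67 | 48 | 64.8%
GPT-4.1 | 96 | 67 | 61 | 50 | 50 | 64.6%
Mistral NeMO | 83 | 73 | 70 | 63 | 33 | 64.5%
Ministral 3 8B | 76 | 70 | 67 | 58 | 50 | 64.2%
GPT-5.4 Mini (Reasoning, Low) | 78 | 67 | 67 | 61 | 33 | 61.1%
GPT-5.4 Mini (Reasoning) | 86 | 67 | 65 | 50 | 33 | 60.2%
GPT-5.4 (Reasoning) | 76 | 73 | 67 | 50 | 33 | 59.8%
DeepSeek-V2 Chat | 67 | 62 | 59 | 53 | 51 | 58.5%
DeepSeek V3.2 | 67 | 67 | 59 | 50 | 50 | 58.5%
Ministral 3 14B | 79 | 60 | 59 | 50 | 42 | 58.1%
Gemini 3 Pro (Preview) | 86 | 54 | 50 | 50 | 50 | 57.9%
Hermes 3 70B | 87 | 67 | 56 | 56 | 23 | 57.6%
GPT-5.4 | 69 | 59 | 59 | 50 | 50 | 57.2%
Ministral 3B | 73 | 69 | 59 | 47 | 35 | 56.7%
GPT-4o, Aug. 6th (temp=1) | 95 | 65 | 52 | 46 | 25 | 56.4%
ByteDance Seed 1.6 Flash | 62 | 61 | 60 | 50 | 48 | 56.2%
Z.AI GLM 5 | 83 | 81 | 78 | 33 | 0 | 55.3%
Claude Opus 4 | 74 | 67 | 52 | 48 | 33 | 55.0%
Writer: Palmyra X5 | 80 | 67 | 63 | 46 | 17 | 54.5%
GPT-4.1 Mini | 83 | 79 | 46 | 33 | 31 | 54.4%
Gemma 3 27B | 72 | 65 | 52 | 42 | 40 | 54.2%
Claude Haiku 4.5 | 67 | 66 | 53 | 50 | 33 | 54.0%
Claude 3.7 Sonnet | 65 | 61 | 55 | 50 | 39 | 53.9%
Ministral 8B | 73 | 65 | 60 | 52 | 17 | 53.1%
Mistral Medium 3.1 | 71 | 50 | 50 | 47 | 47 | 53.1%
Mistral Large | 67 | 67 | 63 | 41 | 27 | 52.9%
Llama 3.1 70B | 74 | 64 | 50 | 39 | 33 | 52.1%
Qwen 2.5 72B | 61 | 54 | 53 | 50 | 41 | 51.8%
Hermes 3 405B | 67 | 66 | 52 | 50 | 23 | 51.7%
Claude Sonnet 4 | 62 | 62 | 58 | 55 | 18 | 51.1%
MoonshotAI: Kimi K2.5 | 81 | 58 | 49 | 33 | 33 | 50.9%
Z.AI GLM 4.7 Flash | 70 | 67 | 50 | 33 | 33 | 50.6%
LFM2 24B | 80 | 56 | 50 | 47 | 20 | 50.5%
Qwen 3 32B | 71 | 54 | 50 | 37 | 33 | 49.2%
Ministral 3 3B | 57 | 55 | 50 | 49 | 35 | 49.2%
GPT-4o, Aug. 6th (temp=0) | 71 | 51 | 43 | 41 | 38 | 48.8%
o4 Mini | 67 | 50 | 47 | 44 | 33 | 48.2%
GPT-4o Mini (temp=1) | 80 | 50 | 44 | 41 | 25 | 47.9%
Claude Sonnet 4.5 | 65 | 56 | 52 | 38 | 29 | 47.9%
DeepSeek V3 (2025-03-24) | 70 | 52 | 47 | 45 | 23 | 47.1%
Z.AI GLM 5 Turbo | 83 | 67 | 33 | 33 | 17 | 46.7%
Z.AI GLM 4.5 | 77 | 53 | 37 | 33 | 33 | 46.6%
Mistral Large 2 | 58 | 51 | 50 | 39 | 33 | 46.1%
Claude 3.5 Sonnet | 55 | 51 | 47 | 44 | 33 | 45.9%
Claude 3 Haiku | 71 | 55 | 50 | 33 | 16 | 45.0%
Mistral Large 3 | 58 | 49 | 47 | 36 | 26 | 43.1%
Mistral Small 4 | 67 | 64 | 50 | 33 | 0 | 42.8%
Mistral Small Creative | 63 | 50 | 33 | 32 | 28 | 41.1%
Rocinante 12B | 76 | 50 | 43 | 37 | 0 | 41.0%
Gemini 3.1 Pro (Preview) | 58 | 50 | 43 | 33 | 17 | 40.2%
Cohere Command R+ (Aug. 2024) | 49 | 47 | 47 | 42 | 16 | 40.0%
Qwen 3.5 Plus (2026-02-15) | 83 | 50 | 33 | 17 | 17 | 40.0%
Z.AI GLM 4.7 | 78 | 50 | 33 | 30 | 0 | 38.4%
Mistral Small 4 (Reasoning) | 56 | 46 | 40 | 27 | 18 | 37.3%
Qwen 3.5 122B | 78 | 53 | 49 | 0 | 0 | 35.8%
GPT-5.4 Mini | 44 | 42 | 41 | 33 | 17 | 35.5%
GPT-5.4 Nano | 50 | 48 | 43 | 17 | 17 | 34.9%
Qwen 3.5 Flash | 87 | 50 | 33 | 0 | 0 | 34.1%
GPT-4o, May 13th (temp=1) | 43 | 38 | 31 | 29 | 28 | 33.8%
GPT-4o, May 13th (temp=0) | 67 | 27 | 27 | 26 | 17 | 32.7%
GPT-5.4 Nano (Reasoning, Low) | 60 | 33 | 33 | 17 | 0 | 28.6%
GPT-4o Mini (temp=0) | 59 | 50 | 17 | 13 | 0 | 27.8%
Qwen 3.5 9B | 69 | 50 | 16 | 0 | 0 | 27.1%
Qwen 3.5 35B | 64 | 64 | 0 | 0 | 0 | 25.6%
GPT-5.4 Nano (Reasoning) | 33 | 26 | 16 | 0 | 0 | 14.9%
Qwen 3.5 27B | 29 | 0 | 0 | 0 | 0 | 5.9%
GPT-5.2 | 16 | 0 | 0 | 0 | 0 | 3.2%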
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 99 | 99.8%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 96 | 99.3%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 98 | 96 | 98.8%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 93 | 98.6%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 99 | 93 | 98.3%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 91 | 98.3%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 90 | 98.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 89 | 97.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 87 | 97.4%
Hermes 3 405B | 100 | 100 | 100 | 95 | 91 | 97.3%
GPT-4.1 | 100 | 100 | 98 | 96 | 91 | 96.9%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 83 | 96.7%
o4 Mini High | 100 | 100 | 100 | 100 | 83 | 96.7%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 83 | 96.7%
o4 Mini | 100 | 100 | 100 | 100 | 83 | 96.7%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 83 | 96.7%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 83 | 96.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 83 | 96.7%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 83 | 96.7%
Claude Sonnet 4 | 100 | 100 | 100 | 99 | 83 | 96.4%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 98 | 83 | 96.3%
MiniMax M2.7 | 100 | 100 | 100 | 96 | 83 | 95.9%
Grok 4.20 (Beta) | 100 | 100 | 98 | 96 | 83 | 95.5%
Qwen 3.5 27B | 100 | 100 | 100 | 97 | 75 | 94.5%
Claude Sonnet 4.5 | 100 | 100 | 97 | 94 | 80 | 94.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 70 | 93.9%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 86 | 81 | 93.3%
Grok 4.1 Fast | 100 | 100 | 100 | 83 | 81 | 92.9%
Gemma 3 27B | 100 | 100 | 100 | 83 | 79 | 92.5%
Claude 3.5 Haiku | 100 | 100 | 100 | 88 | 73 | 92.3%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 94 | 83 | 83 | 92.2%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 61 | 92.1%
LFM2 24B | 100 | 94 | 93 | 91 | 82 | 92.0%
Qwen 3.5 122B | 100 | 100 | 91 | 83 | 83 | 91.6%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 90 | 67 | 91.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 98 | 85 | 69 | 90.5%
Writer: Palmyra X5 | 100 | 100 | 99 | 83 | 67 | 89.9%
Gemini 3 Flash (Preview) | 100 | 100 | 99 | 83 | 67 | 89.7%
Claude Haiku 4.5 | 100 | 100 | 92 | 89 | 67 | 89.5%
Qwen 3.5 9B | 100 | 100 | 83 | 83 | 74 | 88.1%
Llama 3.1 Nemotron 70B | 100 | 100 | 97 | 88 | 50 | 86.8%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 33 | 86.7%
Llama 3.1 8B | 100 | 100 | 100 | 83 | 50 | 86.7%
Mistral Medium 3.1 | 100 | 95 | 84 | 80 | 74 | 86.4%
Claude 3.5 Sonnet | 95 | 88 | 84 | 82 | 76 | 85.1%
Ministral 8B | 95 | 85 | 85 | 82 | 78 | 84.8%
Qwen 3 32B | 100 | 100 | 90 | 77 | 50 | 83.4%
ByteDance Seed 2.0 Lite | 100 | 100 | 83 | 83 | 50 | 83.3%
GPT-5.4 | 100 | 100 | 100 | 100 | 17 | 83.3%
Inception Mercury | 100 | 100 | 87 | 75 | 54 | 83.2%
Ministral 3 14B | 100 | 90 | 81 | 74 | 70 | 83.0%
Claude Opus 4 | 100 | 100 | 99 | 79 | 33 | 82.3%
Stealth: Aurora Alpha | 100 | 92 | 83 | 67 | 65 | 81.4%
Qwen 3.5 Flash | 100 | 88 | 83 | 67 | 67 | 80.9%
DeepSeek V3 (2024-12-26) | 100 | 94 | 83 | 76 | 51 | 80.6%
Mistral Small 3.2 24B | 100 | 100 | 100 | 99 | 0 | 79.7%
GPT-5.4 Mini | 100 | 98 | 83 | 82 | 33 | 79.5%
Z.AI GLM 4.5 | 100 | 83 | 83 | 81 | 50 | 79.4%
Ministral 3B | 100 | 84 | 79 | 79 | 56 | 79.4%
GPT-5.4 Nano (Reasoning, Low) | 100 | 83 | 80 | 67 | 67 | 79.4%
GPT-4o, May 13th (temp=0) | 100 | 100 | 79 | 67 | 50 | 79.1%
Gemma 3 12B | 100 | 83 | 77 | 67 | 67 | 78.8%
Mistral Large | 94 | 84 | 83 | 82 | 50 | 78.7%
Mistral Small Creative | 100 | 88 | 82 | 72 | 50 | 78.3%
Ministral 3 8B | 98 | 86 | 85 | 77 | 42 | 77.6%
Mistral Large 2 | 91 | 83 | 77 | 69 | 66 | 77.1%
Inception Mercury 2 | 100 | 77 | 72 | 68 | 67 | 76.7%
Mistral Large 3 | 93 | 87 | 78 | 67 | 55 | 76.1%
GPT-5.4 Nano (Reasoning) | 83 | 83 | 77 | 67 | 67 | 75.4%
Mistral Small 4 (Reasoning) | 100 | 100 | 92 | 83 | 0 | 75.0%
Qwen 2.5 72B | 83 | 80 | 75 | 67 | 67 | 74.4%
DeepSeek V3 (2025-03-24) | 86 | 83 | 76 | 72 | 54 | 74.2%
GPT-4o, Aug. 6th (temp=0) | 100 | 78 | 73 | 67 | 50 | 73.6%
GPT-4o, May 13th (temp=1) | 100 | 83 | 77 | 74 | 33 | 73.5%
Rocinante 12B | 91 | 89 | 76 | 67 | 43 | 73.3%
Claude 3.7 Sonnet | 83 | 82 | 79 | 63 | 59 | 73.2%
Qwen 3.5 35B | 100 | 100 | 50 | 50 | 50 | 70.0%
GPT-5.4 Nano | 83 | 83 | 67 | 67 | 50 | 70.0%
ByteDance Seed 1.6 Flash | 100 | 83 | 82 | 67 | 17 | 69.7%
Ministral 3 3B | 82 | 82 | 70 | 57 | 49 | 68.1%
GPT-4o Mini (temp=0) | 100 | 88 | 67 | 50 | 33 | 67.5%
Mistral NeMO | 100 | 83 | 67 | 66 | 17 | 66.5%
GPT-5.2 | 83 | 81 | 75 | 59 | 33 | 66.3%
Llama 3.1 70B | 100 | 63 | 59 | 56 | 41 | 63.9%
Hermes 3 70B | 67 | 62 | 57 | 56 | 50 | 58.3%
Arcee AI: Trinity Large (Preview) | 100 | 47 | 44 | 33 | 22 | 49.2%
Mistral Small 4 | 83 | 67 | 50 | 33 | 0 | 46.7%
Claude 3 Haiku | 65 | 63 | 52 | 45 | 0 | 45.1%
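Model | #1 | #2 | #3 | #4 | #5 | Avg
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 83 | 96.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 83 | 96.6%
Inception Mercury 2 | 100 | 100 | 94 | 88 | 81 | 92.7%
Grok 4 Fast | 100 | 100 | 97 | 83 | 83 | 92.7%
Nemotron 3 Nano | 100 | 100 | 94 | 83 | 83 | 92.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 90 | 83 | 82 | 91.2%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 50 | 90.0%
DeepSeek V3.2 | 100 | 100 | 100 | 83 | 67 | 90.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 33 | 86.7%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 67 | 60 | 85.3%
Stealth: Aurora Alpha | 100 | 100 | 83 | 77 | 62 | 84.6%
Gemma 3 27B | 100 | 96 | 83 | 75 | 65 | 83.7%
Mistral Small Creative | 100 | 100 | 92 | 91 | 33 | 83.1%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 98 | 87 | 26 | 82.2%
Qwen 3 32B | 100 | 100 | 97 | 56 | 50 | 80.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 93 | 92 | 83 | 33 | 80.4%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large 2 | 100 | 89 | 72 | 69 | 67 | 79.4%
GPT-4o Mini (temp=1) | 100 | 100 | 68 | 67 | 50 | 77.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 83 | 83 | 17 | 76.7%
Aion 2.0 | 100 | 100 | 100 | 83 | 0 | 76.7%
GPT-5.4 Mini | 100 | 100 | 83 | 50 | 50 | 76.7%
Z.AI GLM 4.7 Flash | 100 | 83 | 83 | 80 | 33 | 76.0%
Mistral Large 3 | 100 | 100 | 83 | 56 | 40 | 75.8%
GPT-4.1 Mini | 83 | 83 | 75 | 67 | 67 | 75.1%
o4 Mini | 100 | 100 | 100 | 67 | 0 | 73.3%
WizardLM 2 8x22b | 100 | 100 | 79 | 67 | 17 | 72.5%
Llama 3.1 70B | 100 | 94 | 69 | 53 | 46 | 72.3%
Gemini 3.1 Flash Lite (Preview) | 100 | 99 | 67 | 52 | 33 | 70.1%
DeepSeek-V2 Chat | 100 | 96 | 83 | 71 | 0 | 70.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 83 | 67 | 0 | 70.0%
GPT-4o Mini (temp=0) | 81 | 78 | 64 | 64 | 61 | 69.8%
o4 Mini High | 100 | 100 | 100 | 48 | 0 | 69.5%
GPT-5 Nano | 100 | 83 | 75 | 50 | 33 | 68.3%
GPT-4o, May 13th (temp=1) | 99 | 90 | 83 | 67 | 0 | 67.9%
ByteDance Seed 2.0 Lite | 100 | 100 | 65 | 57 | 17 | 67.7%
Ministral 3 14B | 100 | 92 | 79 | 50 | 17 | 67.6%
Grok 4.1 Fast | 83 | 83 | 67 | 67 | 33 | 66.7%
GPT-4o, May 13th (temp=0) | 93 | 83 | 78 | 67 | 0 | 64.3%
LFM2 24B | 89 | 89 | 75 | 33 | 33 | 64.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 67 | 51 | 50 | 50 | 63.6%
Inception Mercury | 100 | 100 | 70 | 43 | 0 | 62.6%
Gemini 2.5 Flash | 100 | 92 | 83 | 33 | 0 | 61.6%
Ministral 3B | 90 | 67 | 50 | 50 | 50 | 61.4%
Claude Sonnet 4 | 100 | 100 | 67 | 33 | 0 | 60.0%
Z.AI GLM 4.5 | 100 | 67 | 67 | 67 | 0 | 60.0%
Mistral Small 4 | 100 | 83 | 83 | 33 | 0 | 59.8%
Mistral Medium 3.1 | 83 | 83 | 81 | 50 | 0 | 59.6%
Gemini 2.5 Flash Lite | 100 | 100 | 67 | 25 | 0 | 58.4%
Claude Opus 4.6 (Reasoning) | 100 | 50 | 50 | 50 | 33 | 56.7%
Claude Opus 4.6 | 100 | 67 | 50 | 50 | 17 | 56.7%
Gemini 3 Flash (Preview) | 83 | 67 | 67 | 50 | 17 | 56.7%
Claude Haiku 4.5 | 100 | 100 | 67 | 17 | 0 | 56.7%
Hermes 3 405B | 100 | 50 | 50 | 50 | 33 | 56.7%
Mistral Large | 100 | 93 | 67 | 17 | 0 | 55.2%
ByteDance Seed 2.0 Mini | 100 | 99 | 50 | 17 | 0 | 53.2%
Ministral 8B | 100 | 98 | 67 | 0 | 0 | 52.9%
Gemini 3 Flash (Preview, Reasoning) | 67 | 63 | 50 | 50 | 33 | 52.7%
GPT-4o, Aug. 6th (temp=1) | 96 | 67 | 50 | 50 | 0 | 52.5%
Rocinante 12B | 90 | 67 | 61 | 34 | 7 | 51.8%
Mistral NeMO | 83 | 58 | 50 | 50 | 17 | 51.6%
Nemotron 3 Super | 99 | 95 | 46 | 17 | 0 | 51.2%
Claude Sonnet 4.6 (Reasoning) | 100 | 83 | 33 | 17 | 17 | 50.0%
Ministral 3 8B | 100 | 83 | 67 | 0 | 0 | 50.0%
Mistral Small 4 (Reasoning) | 83 | 67 | 67 | 17 | 17 | 50.0%
Claude 3.5 Haiku | 76 | 62 | 56 | 44 | 8 | 49.2%
Z.AI GLM 4.7 | 95 | 83 | 67 | 0 | 0 | 49.1%
Z.AI GLM 5 | 100 | 75 | 41 | 29 | 0 | 49.1%
Claude 3 Haiku | 70 | 63 | 56 | 33 | 19 | 48.2%
Hermes 3 70B | 100 | 83 | 50 | 0 | 0 | 46.7%
Qwen 3.5 397B A17B | 100 | 83 | 33 | 17 | 0 | 46.7%
GPT-5 | 100 | 67 | 50 | 17 | 0 | 46.7%
Llama 3.1 8B | 100 | 67 | 33 | 32 | 0 | 46.4%
Gemma 3 12B | 98 | 67 | 67 | 0 | 0 | 46.2%
Qwen 3.5 122B | 83 | 68 | 33 | 33 | 0 | 43.6%
Gemini 2.5 Flash (Reasoning) | 100 | 63 | 33 | 17 | 0 | 42.6%
GPT-5.4 | 67 | 63 | 33 | 33 | 17 | 42.6%
GPT-4.1 | 83 | 61 | 33 | 33 | 0 | 42.3%
Gemini 3 Pro (Preview) | 83 | 67 | 33 | 17 | 0 | 40.0%
Writer: Palmyra X5 | 83 | 50 | 50 | 17 | 0 | 40.0%
Qwen 3.5 Flash | 80 | 67 | 47 | 0 | 0 | 38.6%
Arcee AI: Trinity Large (Preview) | 90 | 75 | 26 | 0 | 0 | 38.1%
Qwen 3.5 Plus (2026-02-15) | 67 | 50 | 50 | 17 | 0 | 36.7%
Llama 3.1 Nemotron 70B | 64 | 33 | 33 | 33 | 17 | 36.0%
Z.AI GLM 5 Turbo | 100 | 50 | 17 | 7 | 0 | 34.8%
MiniMax M2.7 | 83 | 50 | 33 | 0 | 0 | 33.3%
DeepSeek V3 (2025-03-24) | 100 | 50 | 11 | 0 | 0 | 32.2%
ByteDance Seed 1.6 Flash | 100 | 33 | 17 | 10 | 0 | 32.0%
Claude 3.7 Sonnet | 56 | 50 | 33 | 17 | 0 | 31.2%
Claude 3.5 Sonnet | 75 | 67 | 3 | 0 | 0 | 28.9%
Gemini 3.1 Pro (Preview) | 45 | 33 | 33 | 17 | 0 | 25.6%
Mistral Small 3.2 24B | 82 | 17 | 17 | 0 | 0 | 23.1%
Qwen 3.5 35B | 67 | 32 | 17 | 0 | 0 | 23.0%
Ministral 3 3B | 43 | 33 | 33 | 0 | 0 | 22.0%
Claude Opus 4 | 83 | 17 | 0 | 0 | 0 | 20.0%
Claude Opus 4.5 | 50 | 17 | 17 | 17 | 0 | 20.0%
Qwen 2.5 72B | 91 | 0 | 0 | 0 | 0 | 18.2%
MiniMax M2.5 | 33 | 17 | 17 | 17 | 6 | 17.9%
Qwen 3.5 9B | 83 | 6 | 0 | 0 | 0 | 17.8%
Claude Sonnet 4.5 | 83 | 0 | 0 | 0 | 0 | 16.7%
Claude Sonnet 4.6 | 48 | 17 | 17 | 0 | 0 | 16.3%
GPT-5.4 Nano (Reasoning) | 29 | 26 | 0 | 0 | 0 | 11.0%
Qwen 3.5 27B | 33 | 0 | 0 | 0 | 0 | 6.7%
GPT-5.4 (Reasoning) | 17 | 0 | 0 | 0 | 0 | 3.3%
GPT-5.4 Nano | 17 | 0 | 0 | 0 | 0 | 3.3%
GPT-5.1 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 (Reasoning, Low) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Mini (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.2 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Mini (Reasoning, Low) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Nano (Reasoning, Low) | 0 | 0 | 0 | 0 | 0 | 0.0%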
Model #1 #2 #3 #4 #5 Avg ▼
Grok 4.20 (Beta, Reasoning) 100 100 100 100 91 98.3%
Claude Sonnet 4.6 100 100 100 100 83 96.6%
ByteDance Seed 2.0 Mini 100 100 94 83 82 92.0%
Stealth: Hunter Alpha 100 100 90 83 83 91.3%
Gemini 2.5 Pro 100 96 90 83 83 90.6%
ByteDance Seed 1.6 100 100 83 83 83 90.0%
Gemini 3.1 Flash Lite (Preview) 100 100 98 83 67 89.6%
DeepSeek V3.1 100 97 90 83 67 87.4%
Z.AI GLM 4.6 100 100 87 79 59 84.9%
GPT-5 Nano 100 83 83 83 67 83.3%
Nemotron 3 Nano 100 85 83 67 66 80.2%
ByteDance Seed 2.0 Lite 100 100 83 67 33 76.7%
Claude Opus 4.6 (Reasoning) 100 100 83 50 50 76.7%
GPT-4.1 Nano 100 90 76 67 50 76.5%
DeepSeek V3.2 91 78 78 78 51 75.2%
Cohere Command R+ (Aug. 2024) 99 98 87 51 33 73.6%
Stealth: Healer Alpha 83 83 83 67 50 73.3%
GPT-5 100 100 67 50 50 73.3%
Claude Opus 4.6 100 98 67 50 50 73.0%
Gemma 3 4B 100 100 67 62 33 72.4%
Gemini 2.5 Flash Lite (Reasoning) 100 78 67 62 50 71.5%
Z.AI GLM 5 Turbo 83 80 79 57 48 69.6%
Mistral Medium 3.1 90 88 67 60 27 66.1%
Grok 4 Fast 96 71 66 50 33 63.3%
Gemini 2.5 Flash (Reasoning) 78 67 67 54 50 63.1%
Claude Sonnet 4.6 (Reasoning) 83 67 64 50 50 62.8%
GPT-4.1 83 83 50 50 48 62.8%
Mistral Small Creative 95 66 62 50 33 61.3%
Gemini 2.5 Flash Lite 86 69 67 46 33 60.1%
GPT-4o, Aug. 6th (temp=1) 90 67 62 50 33 60.1%
GPT-5.4 (Reasoning) 83 67 50 50 50 60.0%
Stealth: Aurora Alpha 67 64 58 50 50 57.7%
GPT-5.4 Mini (Reasoning) 75 67 67 44 33 57.1%
Inception Mercury 2 82 61 50 50 33 55.2%
Mistral Small 3.2 24B 83 73 67 33 17 54.7%
Claude Sonnet 4.5 75 67 66 42 17 53.4%
GPT-5.1 100 67 50 33 17 53.3%
Z.AI GLM 5 100 83 50 17 17 53.3%
Grok 4.20 (Beta) 98 98 50 17 0 52.6%
Mistral Large 65 62 55 45 33 52.1%
Aion 2.0 95 65 50 33 17 51.9%
Gemini 2.5 Flash 67 64 51 41 33 51.3%
Rocinante 12B 83 72 62 38 1 51.3%
o4 Mini High 77 75 67 33 0 50.4%
Qwen 3.5 Plus (2026-02-15) 84 57 51 41 17 49.8%
Mistral NeMO 98 54 50 28 17 49.3%
Inception Mercury 50 50 50 50 46 49.3%
Ministral 3 8B 74 62 62 46 0 48.8%
Nemotron 3 Super 72 70 57 40 0 47.6%
Claude Sonnet 4 63 58 49 37 31 47.6%
WizardLM 2 8x22b 67 67 54 50 0 47.4%
MiniMax M2.7 100 50 50 33 0 46.7%
Claude Opus 4.5 69 65 50 45 3 46.3%
MiniMax M2.5 83 67 50 17 0 43.3%
Claude Haiku 4.5 67 67 38 26 17 43.0%
GPT-4o Mini (temp=1) 56 53 36 33 33 42.4%
GPT-5.4 67 61 50 33 0 42.2%
Grok 4 67 67 50 26 0 41.9%
Claude 3.5 Haiku 77 69 30 18 12 41.2%
Gemini 3 Flash (Preview, Reasoning) 67 67 50 0 0 36.7%
Ministral 8B 68 62 48 5 0 36.6%
Claude Opus 4 52 43 42 33 10 36.2%
Arcee AI: Trinity Large (Preview) 74 60 31 0 0 33.1%
Gemma 3 12B 83 39 25 9 6 32.5%
Ministral 3B 68 54 32 8 0 32.4%
Qwen 3 32B 84 42 17 17 0 31.7%
Mistral Small 4 (Reasoning) 86 33 17 17 0 30.6%
Gemini 3 Flash (Preview) 50 49 33 17 0 29.9%
DeepSeek-V2 Chat 67 50 17 10 0 28.6%
Hermes 3 70B 73 47 16 7 0 28.5%
DeepSeek V3 (2024-12-26) 60 42 30 8 0 27.9%
Claude 3.5 Sonnet 44 42 27 17 7 27.6%
MoonshotAI: Kimi K2.5 56 33 33 11 0 26.9%
GPT-5 Mini 67 33 17 17 0 26.7%
Mistral Large 2 53 32 32 16 0 26.7%
GPT-5.4 Mini (Reasoning, Low) 67 33 17 14 0 26.1%
Writer: Palmyra X5 83 47 0 0 0 26.0%
GPT-5.4 (Reasoning, Low) 63 33 33 0 0 25.9%
Mistral Large 3 57 44 17 9 0 25.4%
GPT-4o, Aug. 6th (temp=0) 65 59 0 0 0 24.9%
o4 Mini 74 33 17 0 0 24.7%
Arcee AI: Trinity Mini 46 39 26 10 0 23.9%
Qwen 3.5 397B A17B 35 33 33 17 0 23.7%
Z.AI GLM 4.7 Flash 42 32 24 17 0 23.0%
ByteDance Seed 1.6 Flash 81 17 14 0 0 22.4%
Qwen3 235B A22B Instruct 2507 67 27 17 0 0 22.1%
Mistral Small 4 33 33 24 17 0 21.4%
Llama 3.1 70B 67 33 0 0 0 20.0%
Ministral 3 14B 49 47 2 0 0 19.8%
Gemma 3 27B 67 16 0 0 0 16.5%
GPT-5.4 Mini 39 33 7 0 0 15.9%
Llama 3.1 8B 50 29 0 0 0 15.7%
Gemini 3.1 Pro (Preview) 31 17 17 10 0 15.0%
Z.AI GLM 4.7 50 24 0 0 0 14.8%
Grok 4.1 Fast 47 17 8 0 0 14.3%
Ministral 3 3B 22 20 16 11 0 13.8%
Gemini 3 Pro (Preview) 43 23 0 0 0 13.2%
GPT-5.2 17 17 17 15 0 13.0%
Hermes 3 405B 29 20 15 0 0 12.8%
GPT-4.1 Mini 57 0 0 0 0 11.4%
LFM2 24B 31 19 0 0 0 10.1%
Qwen 3.5 Flash 50 0 0 0 0 10.0%
Qwen 2.5 72B 25 17 0 0 0 8.4%
DeepSeek V3 (2025-03-24) 26 14 0 0 0 7.9%
Qwen 3.5 35B 33 0 0 0 0 6.7%
GPT-4o, May 13th (temp=1) 17 12 0 0 0 5.8%
Claude 3.7 Sonnet 28 0 0 0 0 5.5%
Qwen 3.5 9B 28 0 0 0 0 5.5%
Llama 3.1 Nemotron 70B 23 1 0 0 0 4.9%
Qwen 3.5 27B 17 0 0 0 0 3.3%
GPT-4o, May 13th (temp=0) 17 0 0 0 0 3.3%
Claude 3 Haiku 13 0 0 0 0 2.6%
Z.AI GLM 4.5 10 0 0 0 0 1.9%
Qwen 3.5 122B 0 0 0 0 0 0.0%
GPT-5.4 Nano (Reasoning) 0 0 0 0 0 0.0%
GPT-5.4 Nano (Reasoning, Low) 0 0 0 0 0 0.0%
GPT-4o Mini (temp=0) 0 0 0 0 0 0.0%
GPT-5.4 Nano 0 0 0 0 0 0.0%
Model #1 #2 #3 #4 #5 Avg ▼
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100.0%
Aion 2.0 100 100 100 100 100 100.0%
MiniMax M2.7 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100.0%
Stealth: Hunter Alpha 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 100 100.0%
Z.AI GLM 4.7 Flash 100 100 100 100 100 100.0%
DeepSeek V3.1 100 100 100 100 100 100.0%
GPT-4o Mini (temp=1) 100 100 100 100 100 100.0%
Grok 4.20 (Beta) 100 100 98 96 96 98.2%
Arcee AI: Trinity Mini 100 100 100 100 90 98.1%
GPT-5.1 100 100 100 100 83 96.7%
Z.AI GLM 5 100 100 100 100 83 96.7%
Claude Sonnet 4.6 100 100 100 100 83 96.7%
Grok 4 100 100 100 100 83 96.7%
GPT-4.1 Nano 100 100 100 100 83 96.7%
ByteDance Seed 1.6 100 100 100 100 83 96.7%
MiniMax M2.5 100 100 100 100 83 96.7%
Claude Opus 4.6 (Reasoning) 100 100 100 100 83 96.6%
Claude Opus 4.6 100 100 100 98 83 96.3%
Gemma 3 27B 100 100 100 97 83 96.1%
Gemini 3.1 Flash Lite (Preview) 100 100 100 97 83 96.0%
Nemotron 3 Super 100 100 100 95 83 95.7%
MoonshotAI: Kimi K2.5 100 100 100 94 81 95.1%
Nemotron 3 Nano 100 100 98 97 81 95.0%
Mistral Small Creative 100 100 97 93 84 94.8%
GPT-5 Mini 100 100 100 83 83 93.3%
o4 Mini 100 100 100 83 83 93.3%
Claude Opus 4 100 100 100 83 83 93.3%
Stealth: Healer Alpha 100 100 100 100 67 93.3%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 83 83 93.3%
DeepSeek V3.2 100 100 100 100 67 93.3%
Gemini 2.5 Flash 100 100 100 83 83 93.3%
GPT-4o Mini (temp=0) 100 100 100 83 83 93.3%
Claude Opus 4.5 100 100 95 83 83 92.3%
o4 Mini High 100 100 100 100 61 92.3%
GPT-4.1 100 96 94 86 84 92.0%
GPT-5.4 (Reasoning) 100 100 96 83 78 91.6%
DeepSeek V3 (2025-03-24) 100 100 87 83 83 90.8%
Gemma 3 4B 100 99 95 93 67 90.7%
ByteDance Seed 2.0 Mini 100 100 83 83 83 90.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 50 90.0%
ByteDance Seed 2.0 Lite 100 100 83 83 83 90.0%
Gemini 2.5 Flash Lite 100 100 83 83 83 90.0%
WizardLM 2 8x22b 100 100 100 83 67 90.0%
Z.AI GLM 4.7 100 100 96 83 70 89.8%
Gemini 2.5 Flash (Reasoning) 100 100 100 98 50 89.7%
Gemini 3 Flash (Preview, Reasoning) 100 100 92 83 72 89.5%
Gemini 3 Pro (Preview) 100 95 83 83 78 87.9%
GPT-4o, Aug. 6th (temp=1) 100 100 100 83 55 87.6%
GPT-5 100 100 100 83 50 86.7%
Z.AI GLM 4.6 100 100 100 100 33 86.6%
Grok 4.20 (Beta, Reasoning) 100 95 83 83 67 85.6%
GPT-4o, Aug. 6th (temp=0) 100 100 79 78 66 84.6%
GPT-4.1 Mini 100 98 95 79 50 84.3%
GPT-5.4 (Reasoning, Low) 100 89 83 83 65 84.1%
Gemini 3 Flash (Preview) 95 94 86 83 59 83.4%
Z.AI GLM 5 Turbo 100 89 83 81 56 81.9%
DeepSeek-V2 Chat 100 100 78 67 56 80.0%
ByteDance Seed 1.6 Flash 100 94 89 83 33 79.9%
Qwen3 235B A22B Instruct 2507 100 100 100 99 0 79.8%
Qwen 3 32B 100 100 82 65 50 79.4%
Claude 3.5 Haiku 86 85 85 80 57 78.7%
Mistral Small 4 (Reasoning) 96 92 71 67 67 78.5%
Mistral Small 4 100 99 70 67 50 77.3%
Claude Sonnet 4 100 90 89 83 21 76.7%
Qwen 3.5 Flash 83 83 81 67 67 76.1%
Qwen 3.5 Plus (2026-02-15) 96 79 78 73 50 75.4%
Claude Haiku 4.5 100 100 87 70 17 74.8%
Grok 4.1 Fast 100 100 88 83 0 74.2%
Writer: Palmyra X5 100 100 95 67 0 72.3%
GPT-5.4 Mini (Reasoning) 82 74 69 67 67 71.7%
Inception Mercury 100 100 75 68 9 70.3%
Claude 3.5 Sonnet 87 77 73 59 54 70.0%
Llama 3.1 8B 100 97 67 58 28 69.8%
Stealth: Aurora Alpha 83 79 75 61 50 69.6%
Llama 3.1 70B 87 83 67 61 50 69.5%
GPT-5.4 Mini 95 83 67 50 50 69.1%
GPT-4o, May 13th (temp=1) 97 92 65 56 33 68.7%
Cohere Command R+ (Aug. 2024) 90 83 83 55 24 67.1%
Mistral NeMO 94 84 79 76 0 66.7%
Ministral 3 8B 93 91 58 48 33 64.7%
Ministral 8B 77 75 67 66 33 63.8%
Mistral Large 100 77 69 53 20 63.8%
Qwen 3.5 122B 100 83 67 67 0 63.3%
GPT-5 Nano 100 67 50 50 50 63.3%
DeepSeek V3 (2024-12-26) 100 100 67 46 2 63.0%
Ministral 3 14B 100 73 53 50 31 61.3%
Qwen 3.5 35B 100 83 83 33 0 60.0%
Claude Sonnet 4.5 100 92 74 20 0 57.2%
Inception Mercury 2 68 67 51 50 50 57.2%
Gemini 3.1 Pro (Preview) 76 69 67 43 30 57.0%
Llama 3.1 Nemotron 70B 67 63 55 52 44 56.0%
Claude 3.7 Sonnet 82 72 50 37 33 54.8%
Gemma 3 12B 100 67 49 33 17 53.2%
Rocinante 12B 100 88 67 0 0 50.8%
GPT-5.4 Mini (Reasoning, Low) 63 56 50 50 33 50.6%
Qwen 3.5 27B 100 67 50 17 17 50.0%
Arcee AI: Trinity Large (Preview) 80 73 50 37 7 49.4%
Qwen 3.5 9B 100 92 33 0 0 45.1%
GPT-5.4 92 81 50 0 0 44.6%
Qwen 2.5 72B 76 55 47 32 0 42.2%
Hermes 3 405B 82 72 56 0 0 42.1%
Z.AI GLM 4.5 65 60 50 29 0 40.8%
Mistral Large 3 87 66 47 0 0 40.2%
LFM2 24B 61 46 38 28 15 37.5%
Hermes 3 70B 83 65 21 7 0 35.1%
Mistral Large 2 83 59 0 0 0 28.5%
Ministral 3 3B 100 17 13 0 0 25.9%
Mistral Small 3.2 24B 49 33 23 23 0 25.6%
Qwen 3.5 397B A17B 50 41 33 0 0 24.9%
Mistral Medium 3.1 100 17 5 0 0 24.4%
Claude 3 Haiku 53 33 32 0 0 23.5%
GPT-5.2 43 29 15 0 0 17.5%
GPT-5.4 Nano 33 17 0 0 0 10.0%
Ministral 3B 38 3 0 0 0 8.1%
GPT-5.4 Nano (Reasoning) 19 17 0 0 0 7.2%
GPT-5.4 Nano (Reasoning, Low) 0 0 0 0 0 0.0%
Model #1 #2 #3 #4 #5 Avg ▼
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 100.0%
DeepSeek V3.1 100 100 100 100 100 100.0%
DeepSeek V3.2 100 100 100 100 100 99.9%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 98 99.6%
GPT-5 100 100 100 100 96 99.2%
Aion 2.0 100 100 100 97 97 98.8%
MoonshotAI: Kimi K2.5 100 100 100 100 93 98.7%
Gemini 2.5 Flash 100 100 100 100 83 96.7%
Claude Opus 4.6 (Reasoning) 100 100 100 98 83 96.3%
Z.AI GLM 4.7 Flash 100 100 100 96 83 95.8%
Z.AI GLM 5 100 100 100 99 78 95.4%
Z.AI GLM 4.6 100 100 100 90 86 95.3%
Qwen3 235B A22B Instruct 2507 100 100 100 93 83 95.2%
GPT-4.1 Nano 100 100 100 84 83 93.5%
Stealth: Healer Alpha 100 100 100 83 83 93.3%
ByteDance Seed 2.0 Lite 100 100 100 100 67 93.3%
Z.AI GLM 5 Turbo 100 100 100 100 50 90.0%
Gemma 3 27B 100 100 100 83 67 90.0%
GPT-5 Mini 100 100 100 82 67 89.7%
Gemini 2.5 Flash Lite 100 100 100 83 61 88.8%
Grok 4.20 (Beta, Reasoning) 100 97 89 79 73 87.6%
DeepSeek V3 (2024-12-26) 100 100 95 83 60 87.6%
Claude Opus 4.6 100 100 83 83 67 86.7%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 83 50 86.7%
GPT-5 Nano 100 100 83 83 67 86.7%
o4 Mini High 100 100 83 83 67 86.7%
GPT-5.1 100 100 97 83 50 86.2%
Writer: Palmyra X5 100 94 83 83 67 85.4%
GPT-4.1 Mini 100 93 83 81 67 84.9%
ByteDance Seed 1.6 100 100 88 83 50 84.4%
Claude Opus 4.5 100 91 83 80 67 84.3%
DeepSeek-V2 Chat 100 83 80 78 77 83.6%
ByteDance Seed 2.0 Mini 100 100 83 83 50 83.3%
Rocinante 12B 100 100 90 76 50 83.3%
Nemotron 3 Super 100 100 94 66 56 83.1%
Claude Haiku 4.5 95 93 83 80 63 82.5%
Inception Mercury 100 100 99 67 47 82.5%
Mistral Small Creative 93 91 83 76 67 82.1%
Stealth: Hunter Alpha 100 83 81 80 67 82.1%
Grok 4 Fast 99 94 83 67 67 82.1%
GPT-4o, Aug. 6th (temp=1) 100 100 93 83 33 82.0%
Claude Sonnet 4.5 100 84 80 79 67 81.9%
Gemini 3.1 Pro (Preview) 100 100 81 67 59 81.4%
Grok 4.20 (Beta) 90 89 78 74 70 80.4%
Grok 4.1 Fast 99 91 84 67 59 79.8%
WizardLM 2 8x22b 100 98 83 67 50 79.5%
GPT-5.4 (Reasoning) 94 90 83 78 50 78.9%
MiniMax M2.5 100 100 74 71 50 78.9%
Mistral Medium 3.1 95 83 79 67 67 78.1%
Mistral Small 3.2 24B 100 100 99 83 0 76.5%
GPT-4o Mini (temp=1) 99 94 73 68 49 76.4%
Hermes 3 70B 100 100 83 50 48 76.2%
MiniMax M2.7 100 100 83 79 17 75.7%
DeepSeek V3 (2025-03-24) 88 88 78 72 46 74.3%
Nemotron 3 Nano 100 93 67 62 50 74.2%
GPT-4.1 100 82 72 67 50 74.2%
GPT-5.4 Mini (Reasoning) 90 72 67 64 62 71.0%
Z.AI GLM 4.7 100 95 67 59 33 70.9%
Inception Mercury 2 100 83 67 50 46 69.2%
Ministral 3 8B 83 77 74 61 50 68.8%
ByteDance Seed 1.6 Flash 100 94 82 67 0 68.6%
Claude Sonnet 4 88 83 76 61 33 68.1%
Grok 4 82 67 67 63 61 68.0%
Claude 3.5 Haiku 81 76 70 56 55 67.5%
GPT-4o, Aug. 6th (temp=0) 89 81 58 53 50 66.4%
Mistral Large 3 80 74 64 64 48 66.2%
o4 Mini 83 67 67 62 50 65.8%
GPT-5.4 (Reasoning, Low) 82 67 64 62 50 65.0%
Gemini 3 Pro (Preview) 89 83 77 67 0 63.1%
Arcee AI: Trinity Large (Preview) 78 74 60 51 51 62.6%
Gemma 3 4B 83 67 67 50 45 62.6%
GPT-4o, May 13th (temp=1) 100 65 50 49 46 62.1%
Claude Opus 4 84 75 61 48 42 62.0%
Gemini 3 Flash (Preview) 83 76 67 50 33 61.9%
Stealth: Aurora Alpha 83 66 58 50 50 61.5%
Cohere Command R+ (Aug. 2024) 100 67 64 33 28 58.4%
Gemma 3 12B 83 75 67 67 0 58.3%
Gemini 3 Flash (Preview, Reasoning) 73 67 67 50 33 58.0%
Qwen 3 32B 80 80 50 43 33 57.3%
Arcee AI: Trinity Mini 87 83 65 50 0 56.9%
Ministral 8B 83 79 56 50 16 56.8%
Mistral Small 4 (Reasoning) 91 61 52 42 33 55.9%
Mistral NeMO 67 64 54 49 33 53.4%
Mistral Large 2 90 50 48 48 23 51.9%
Mistral Small 4 59 59 50 49 42 51.8%
GPT-4o Mini (temp=0) 72 59 47 43 34 50.9%
Qwen 3.5 397B A17B 69 67 50 50 17 50.6%
Qwen 3.5 122B 75 67 53 50 0 48.9%
Qwen 3.5 Plus (2026-02-15) 83 61 50 33 17 48.8%
Claude 3.5 Sonnet 64 57 54 41 28 48.8%
Z.AI GLM 4.5 69 67 40 33 33 48.5%
Ministral 3 14B 78 67 46 33 17 48.3%
Ministral 3 3B 71 50 50 50 17 47.5%
Qwen 3.5 35B 83 66 55 33 0 47.5%
Llama 3.1 8B 100 67 50 19 0 47.1%
GPT-5.4 Mini (Reasoning, Low) 83 67 50 33 0 46.6%
Ministral 3B 87 49 34 30 27 45.5%
Qwen 2.5 72B 53 50 49 38 33 44.8%
Mistral Large 63 51 50 29 24 43.5%
GPT-5.4 Nano (Reasoning) 67 50 50 50 0 43.3%
Llama 3.1 70B 57 55 44 31 24 41.9%
Llama 3.1 Nemotron 70B 56 46 43 30 28 40.3%
Hermes 3 405B 55 48 48 44 0 39.0%
GPT-5.4 67 50 33 17 17 36.7%
Qwen 3.5 Flash 83 82 17 0 0 36.4%
Claude 3.7 Sonnet 50 50 33 17 16 33.3%
Qwen 3.5 9B 44 43 40 38 0 33.0%
GPT-4o, May 13th (temp=0) 50 43 39 29 0 32.1%
Qwen 3.5 27B 71 67 22 0 0 32.0%
GPT-5.4 Mini 67 33 17 17 0 26.7%
GPT-5.4 Nano (Reasoning, Low) 67 33 33 0 0 26.7%
GPT-5.2 50 33 32 17 0 26.3%
GPT-5.4 Nano 50 39 17 17 0 24.4%
LFM2 24B 39 29 22 10 0 19.9%
Claude 3 Haiku 37 4 2 0 0 8.7%