Passive voice overuse

Test: Bad Writing Habits

Avg. Score: 96.6%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Grok 4.1 Fast | 99.9% | $0.0018 | 37.8s | 99%
2 | Grok 4 Fast | 99.6% | $0.0017 | 24.1s | 96%
3 | o4 Mini | 99.5% | $0.015 | 25.7s | 97%
4 | o4 Mini High | 99.8% | $0.025 | 47.2s | 99%
5 | GPT-4o Mini (temp=1) | 99.0% | $0.0012 | 34.8s | 96%
6 | Writer: Palmyra X5 | 98.8% | $0.011 | 22.0s | 96%
7 | Qwen3 235B A22B Instruct 2507 | 99.0% | $0.0011 | 59.2s | 96%
8 | Mistral Small 4 (Reasoning) | 98.7% | $0.0022 | 30.2s | 95%
9 | DeepSeek V3 (2025-03-24) | 98.8% | $0.0014 | 39.4s | 94%
10 | GPT-4.1 | 99.3% | $0.018 | 44.7s | 95%
11 | Mistral Small 4 | 98.2% | $0.0014 | 18.2s | 93%
12 | LFM2 24B | 98.2% | $0.0002 | 28.4s | 93%
13 | Gemini 2.5 Flash (Reasoning) | 98.1% | $0.011 | 21.5s | 93%
14 | GPT-4.1 Mini | 98.1% | $0.0027 | 19.0s | 91%
15 | Mistral Small Creative | 97.3% | $0.0007 | 9.1s | 92%
16 | GPT-5.4 Nano (Reasoning, Low) | 97.7% | $0.0055 | 20.6s | 92%
17 | GPT-5.4 Nano (Reasoning) | 98.0% | $0.0061 | 24.5s | 91%
18 | GPT-5.4 Mini (Reasoning, Low) | 97.9% | $0.015 | 16.8s | 92%
19 | GPT-4o, May 13th (temp=1) | 98.2% | $0.033 | 14.4s | 93%
20 | Grok 4.20 (Beta) | 98.1% | $0.018 | 15.8s | 91%
21 | Ministral 3 3B | 97.3% | $0.0005 | 11.1s | 90%
22 | GPT-5.4 Mini | 97.8% | $0.015 | 16.8s | 91%
23 | Hermes 3 405B | 97.9% | $0.0032 | 53.2s | 93%
24 | Stealth: Aurora Alpha | 97.0% | $0.0000 | 9.8s | 90%
25 | DeepSeek V3 (2024-12-26) | 97.8% | $0.0021 | 54.6s | 92%
26 | GPT-5.4 Nano | 97.3% | $0.0057 | 26.3s | 91%
27 | GPT-4.1 Nano | 97.0% | $0.0007 | 13.3s | 90%
28 | GPT-4o Mini (temp=0) | 97.4% | $0.0012 | 34.8s | 91%
29 | Gemini 2.5 Flash | 97.3% | $0.0052 | 10.6s | 89%
30 | Qwen 3 32B | 97.6% | $0.0015 | 54.6s | 92%
31 | Qwen 3.5 Plus (2026-02-15) | 97.6% | $0.0060 | 31.5s | 90%
32 | Mistral Medium 3.1 | 97.5% | $0.0048 | 36.5s | 91%
33 | Qwen 3.5 Flash | 97.5% | $0.0025 | 47.5s | 91%
34 | Gemma 3 4B | 96.7% | $0.0002 | 20.0s | 90%
35 | GPT-4o, Aug. 6th (temp=1) | 97.9% | $0.018 | 24.4s | 90%
36 | DeepSeek-V2 Chat | 97.5% | $0.0021 | 53.3s | 91%
37 | Gemma 3 12B | 97.1% | $0.0004 | 41.3s | 91%
38 | Gemma 3 27B | 97.2% | $0.0006 | 52.6s | 91%
39 | Ministral 3B | 96.5% | $0.0001 | 8.1s | 89%
40 | Claude Sonnet 4.5 | 98.1% | $0.035 | 38.1s | 92%
41 | Ministral 3 14B | 96.5% | $0.0007 | 11.7s | 89%
42 | Qwen 3.5 122B | 98.1% | $0.025 | 1.1m | 93%
43 | ByteDance Seed 1.6 Flash | 96.6% | $0.0013 | 27.3s | 90%
44 | Qwen 3.5 9B | 97.9% | $0.0011 | 1.4m | 92%
45 | Ministral 3 8B | 96.3% | $0.0008 | 19.6s | 90%
46 | Inception Mercury 2 | 96.8% | $0.0032 | 7.0s | 87%
47 | Claude Haiku 4.5 | 96.9% | $0.011 | 21.6s | 90%
48 | Grok 4.20 (Beta, Reasoning) | 98.0% | $0.039 | 34.0s | 92%
49 | Claude 3.7 Sonnet | 98.1% | $0.042 | 46.7s | 93%
50 | Z.AI GLM 4.5 | 97.0% | $0.0051 | 42.1s | 90%
51 | GPT-5.1 | 99.5% | $0.054 | 1.8m | 96%
52 | Z.AI GLM 5 Turbo | 97.0% | $0.0081 | 33.2s | 89%
53 | Mistral Large | 96.8% | $0.014 | 30.9s | 90%
54 | Mistral NeMO | 95.9% | $0.0005 | 10.1s | 88%
55 | Claude 3.5 Haiku | 96.7% | $0.0035 | 10.8s | 87%
56 | GPT-5.4 | 98.8% | $0.049 | 1.4m | 95%
57 | GPT-5.4 Mini (Reasoning) | 97.5% | $0.022 | 28.1s | 89%
58 | Z.AI GLM 5 | 97.1% | $0.0084 | 1.2m | 91%
59 | Gemini 2.5 Flash Lite (Reasoning) | 96.2% | $0.0028 | 30.8s | 89%
60 | Claude 3 Haiku | 95.9% | $0.0025 | 14.9s | 88%
61 | Grok 4 | 99.1% | $0.048 | 1.7m | 94%
62 | Qwen 3.5 27B | 97.8% | $0.020 | 1.6m | 92%
63 | Arcee AI: Trinity Large (Preview) | 96.1% | $0.0000 | 43.6s | 89%
64 | GPT-5.2 | 98.7% | $0.056 | 1.5m | 94%
65 | MiniMax M2.5 | 96.5% | $0.0034 | 1.3m | 91%
66 | MiniMax M2.7 | 96.7% | $0.0040 | 1.1m | 90%
67 | Gemini 2.5 Flash Lite | 95.4% | $0.0009 | 9.5s | 87%
68 | GPT-5 Nano | 96.7% | $0.0042 | 1.4m | 91%
69 | Ministral 8B | 95.6% | $0.0004 | 10.4s | 86%
70 | GPT-4o, May 13th (temp=0) | 96.1% | $0.035 | 14.1s | 89%
71 | GPT-5.4 (Reasoning, Low) | 98.4% | $0.055 | 1.4m | 93%
72 | Mistral Large 3 | 95.5% | $0.0033 | 30.3s | 87%
73 | Claude 3.5 Sonnet | 97.1% | $0.048 | 35.4s | 90%
74 | Arcee AI: Trinity Mini | 95.3% | $0.0003 | 9.2s | 85%
75 | Mistral Large 2 | 96.2% | $0.013 | 29.4s | 86%
76 | Rocinante 12B | 95.9% | $0.0014 | 38.4s | 86%
77 | Claude Sonnet 4 | 96.6% | $0.032 | 43.7s | 89%
78 | Stealth: Hunter Alpha | 95.8% | $0.0000 | 55.0s | 87%
79 | Stealth: Healer Alpha | 94.9% | $0.0000 | 23.7s | 85%
80 | Gemini 3 Pro (Preview) | 96.9% | $0.055 | 54.4s | 91%
81 | Gemini 3.1 Flash Lite (Preview) | 94.5% | $0.0030 | 8.4s | 84%
82 | Z.AI GLM 4.7 | 96.0% | $0.010 | 1.4m | 88%
83 | GPT-4o, Aug. 6th (temp=0) | 95.6% | $0.023 | 22.7s | 85%
84 | Hermes 3 70B | 95.5% | $0.0010 | 1.2m | 86%
85 | GPT-5 Mini | 95.3% | $0.0100 | 57.4s | 86%
86 | Qwen 3.5 397B A17B | 98.2% | $0.014 | 3.0m | 91%
87 | Claude Opus 4.5 | 97.1% | $0.070 | 53.4s | 90%
88 | Gemini 3 Flash (Preview) | 94.1% | $0.0078 | 19.6s | 85%
89 | Cohere Command R+ (Aug. 2024) | 95.5% | $0.020 | 52.5s | 86%
90 | GPT-5 | 98.8% | $0.065 | 2.8m | 95%
91 | Llama 3.1 8B | 95.4% | $0.0003 | 1.3m | 85%
92 | Qwen 3.5 35B | 96.1% | $0.018 | 1.0m | 84%
93 | Z.AI GLM 4.7 Flash | 94.5% | $0.0017 | 1.2m | 85%
94 | Gemini 3 Flash (Preview, Reasoning) | 94.4% | $0.012 | 30.1s | 83%
95 | GPT-5.4 (Reasoning) | 98.9% | $0.089 | 2.6m | 95%
96 | DeepSeek V3.2 | 95.3% | $0.0014 | 1.9m | 86%
97 | MoonshotAI: Kimi K2.5 | 97.9% | $0.019 | 3.2m | 89%
98 | Nemotron 3 Super | 95.2% | $0.0000 | 1.4m | 83%
99 | Aion 2.0 | 94.3% | $0.0064 | 1.3m | 84%
100 | Gemini 3.1 Pro (Preview) | 98.3% | $0.107 | 1.8m | 91%
101 | Z.AI GLM 4.6 | 93.7% | $0.0065 | 51.5s | 83%
102 | Qwen 2.5 72B | 93.3% | $0.0010 | 36.7s | 81%
103 | Claude Opus 4.6 | 96.4% | $0.078 | 1.2m | 88%
104 | Gemini 2.5 Pro | 94.1% | $0.036 | 36.2s | 84%
105 | Nemotron 3 Nano | 94.4% | $0.0010 | 1.1m | 80%
106 | Llama 3.1 Nemotron 70B | 93.2% | $0.0038 | 31.7s | 80%
107 | Claude Sonnet 4.6 | 93.7% | $0.031 | 39.3s | 83%
108 | Claude Opus 4.6 (Reasoning) | 96.2% | $0.088 | 1.4m | 88%
109 | WizardLM 2 8x22b | 94.8% | $0.0026 | 1.8m | 80%
110 | Llama 3.1 70B | 92.5% | $0.0015 | 29.4s | 78%
111 | Claude Sonnet 4.6 (Reasoning) | 94.6% | $0.060 | 1.2m | 84%
112 | DeepSeek V3.1 | 93.2% | $0.0020 | 1.8m | 81%
113 | Claude Opus 4 | 98.5% | $0.209 | 1.4m | 94%
114 | Inception Mercury | 90.1% | $0.011 | 17.6s | 71%
115 | Mistral Small 3.2 24B | 95.8% | $0.0069 | 5.7m | 84%
116 | ByteDance Seed 1.6 | 92.1% | $0.013 | 2.5m | 76%
117 | ByteDance Seed 2.0 Lite | 89.5% | $0.012 | 2.2m | 71%
118 | ByteDance Seed 2.0 Mini | 87.5% | $0.0045 | 4.9m | 70%
Average score: 96.61%

Individual Scenarios

Detailed Writing Rules

Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 99.9%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 99 | 99.9%
Grok 4 Fast | 100 | 100 | 100 | 100 | 99 | 99.9%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-5.4 | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 99 | 99.7%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 99 | 99.7%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 98 | 99.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 98 | 99.6%
Claude Sonnet 4 | 100 | 100 | 100 | 99 | 99 | 99.6%
LFM2 24B | 100 | 100 | 100 | 100 | 98 | 99.6%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 98 | 99.5%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 97 | 99.5%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 97 | 99.5%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 99 | 98 | 99.5%
Z.AI GLM 4.7 | 100 | 100 | 100 | 99 | 98 | 99.4%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 99 | 98 | 99.4%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.3%
Mistral Small 4 | 100 | 100 | 100 | 100 | 97 | 99.3%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 96 | 99.3%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 98 | 98 | 99.3%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 96 | 99.3%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 98 | 98 | 99.3%
Rocinante 12B | 100 | 100 | 100 | 99 | 98 | 99.2%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.2%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 95 | 99.1%
GPT-5 Nano | 100 | 100 | 100 | 99 | 96 | 99.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 98 | 97 | 99.0%
Ministral 8B | 100 | 100 | 100 | 100 | 95 | 99.0%
Ministral 3 14B | 100 | 100 | 100 | 99 | 96 | 99.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 95 | 98.9%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 94 | 98.9%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 94 | 98.8%
Gemma 3 27B | 100 | 100 | 100 | 100 | 94 | 98.7%
Qwen 3 32B | 100 | 100 | 100 | 99 | 94 | 98.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 93 | 98.6%
Nemotron 3 Super | 100 | 100 | 100 | 98 | 95 | 98.6%
GPT-5 Mini | 100 | 100 | 99 | 99 | 95 | 98.6%
Gemma 3 4B | 100 | 100 | 100 | 96 | 96 | 98.5%
Llama 3.1 8B | 100 | 100 | 100 | 97 | 95 | 98.5%
Mistral Large | 100 | 100 | 100 | 99 | 93 | 98.5%
Aion 2.0 | 99 | 99 | 98 | 98 | 98 | 98.3%
Hermes 3 405B | 100 | 100 | 100 | 98 | 93 | 98.3%
Stealth: Hunter Alpha | 100 | 100 | 100 | 97 | 94 | 98.2%
Ministral 3B | 100 | 100 | 100 | 96 | 95 | 98.2%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 99 | 92 | 98.2%
Ministral 3 8B | 100 | 100 | 100 | 96 | 94 | 98.1%
WizardLM 2 8x22b | 100 | 100 | 100 | 98 | 92 | 98.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 99 | 98 | 92 | 97.8%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 98 | 91 | 97.7%
MiniMax M2.5 | 100 | 100 | 100 | 95 | 93 | 97.6%
ByteDance Seed 1.6 | 100 | 100 | 100 | 98 | 90 | 97.5%
Gemma 3 12B | 100 | 100 | 100 | 94 | 93 | 97.5%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 94 | 93 | 97.5%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 97 | 90 | 97.4%
DeepSeek V3.2 | 100 | 99 | 98 | 97 | 93 | 97.4%
Stealth: Healer Alpha | 100 | 100 | 100 | 98 | 89 | 97.3%
Claude 3.5 Sonnet | 100 | 100 | 100 | 98 | 89 | 97.3%
Mistral Small Creative | 100 | 100 | 100 | 99 | 88 | 97.2%
Mistral Large 3 | 100 | 99 | 96 | 95 | 93 | 96.8%
Claude Opus 4 | 100 | 97 | 96 | 95 | 95 | 96.6%
Nemotron 3 Nano | 100 | 100 | 96 | 95 | 92 | 96.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 95 | 87 | 96.5%
Gemini 2.5 Pro | 100 | 100 | 96 | 94 | 93 | 96.5%
ByteDance Seed 1.6 Flash | 100 | 98 | 96 | 95 | 93 | 96.4%
Gemini 3 Flash (Preview) | 100 | 100 | 98 | 95 | 90 | 96.4%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 82 | 96.4%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 81 | 96.2%
Hermes 3 70B | 100 | 100 | 95 | 94 | 91 | 96.1%
Ministral 3 3B | 100 | 100 | 100 | 96 | 84 | 96.0%
Inception Mercury | 100 | 100 | 100 | 100 | 79 | 95.8%
Mistral Large 2 | 100 | 99 | 96 | 96 | 88 | 95.6%
DeepSeek V3.1 | 100 | 96 | 96 | 95 | 91 | 95.6%
Z.AI GLM 4.6 | 100 | 100 | 96 | 93 | 88 | 95.6%
Claude Sonnet 4.6 | 100 | 100 | 96 | 94 | 84 | 94.9%
Claude 3 Haiku | 100 | 100 | 97 | 91 | 85 | 94.7%
Llama 3.1 Nemotron 70B | 100 | 100 | 96 | 94 | 83 | 94.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 88 | 84 | 94.4%
Gemini 2.5 Flash Lite | 100 | 96 | 93 | 91 | 89 | 94.0%
Mistral NeMO | 100 | 99 | 98 | 95 | 78 | 93.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 92 | 91 | 84 | 93.5%
Qwen 2.5 72B | 100 | 100 | 99 | 96 | 71 | 93.1%
Mistral Small 3.2 24B | 100 | 97 | 92 | 89 | 87 | 92.8%
ByteDance Seed 2.0 Mini | 100 | 100 | 99 | 87 | 78 | 92.7%
Llama 3.1 70B | 100 | 97 | 96 | 83 | 70 | 89.2%
ByteDance Seed 2.0 Lite | 98 | 96 | 79 | 73 | 72 | 83.9%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 99.9%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 99.9%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 99.9%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 99.9%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 99 | 99.8%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 99 | 99.8%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 99 | 99.8%
Ministral 3 3B | 100 | 100 | 100 | 100 | 99 | 99.8%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 99 | 99.8%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 99 | 99.7%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 98 | 99.7%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 99 | 99 | 99.7%
Mistral Large 3 | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-5 Mini | 100 | 100 | 100 | 100 | 99 | 99.6%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 98 | 99.6%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 99 | 99.6%
Mistral Large 2 | 100 | 100 | 100 | 100 | 98 | 99.6%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 97 | 99.4%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 97 | 99.4%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 99 | 98 | 99.4%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 97 | 99.4%
Claude Opus 4.5 | 100 | 100 | 100 | 98 | 98 | 99.3%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 96 | 99.3%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 96 | 99.3%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 96 | 99.2%
Hermes 3 405B | 100 | 100 | 100 | 100 | 96 | 99.2%
Gemini 2.5 Flash | 100 | 100 | 100 | 98 | 97 | 99.1%
Mistral Small Creative | 100 | 100 | 100 | 97 | 97 | 98.9%
Rocinante 12B | 100 | 100 | 100 | 98 | 97 | 98.9%
Gemma 3 12B | 100 | 100 | 99 | 99 | 96 | 98.9%
Claude Haiku 4.5 | 100 | 100 | 100 | 97 | 96 | 98.7%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 93 | 98.7%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 99 | 97 | 96 | 98.3%
Gemma 3 27B | 100 | 99 | 99 | 97 | 96 | 98.3%
Stealth: Hunter Alpha | 100 | 100 | 99 | 98 | 95 | 98.2%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 91 | 98.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 95 | 95 | 98.1%
DeepSeek V3.2 | 100 | 100 | 99 | 97 | 95 | 98.0%
Qwen 3 32B | 100 | 100 | 99 | 96 | 94 | 98.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 90 | 97.9%
Mistral Medium 3.1 | 100 | 100 | 99 | 98 | 93 | 97.9%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 89 | 97.9%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 89 | 97.7%
Claude Sonnet 4.6 (Reasoning) | 100 | 99 | 98 | 95 | 95 | 97.5%
Inception Mercury 2 | 100 | 100 | 99 | 97 | 92 | 97.5%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 88 | 97.5%
Mistral Large | 100 | 100 | 100 | 96 | 92 | 97.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 94 | 93 | 97.3%
Gemini 3 Pro (Preview) | 100 | 100 | 98 | 98 | 91 | 97.3%
WizardLM 2 8x22b | 100 | 97 | 97 | 96 | 95 | 97.3%
Aion 2.0 | 100 | 100 | 98 | 94 | 93 | 97.1%
Stealth: Aurora Alpha | 100 | 100 | 98 | 94 | 93 | 97.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 85 | 96.9%
Z.AI GLM 4.5 | 100 | 100 | 96 | 95 | 92 | 96.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 96 | 93 | 93 | 96.5%
GPT-4o, May 13th (temp=0) | 100 | 100 | 99 | 93 | 90 | 96.4%
Gemini 2.5 Pro | 100 | 97 | 96 | 96 | 93 | 96.3%
Z.AI GLM 4.7 | 100 | 99 | 98 | 96 | 87 | 96.1%
GPT-5 Nano | 100 | 99 | 98 | 94 | 90 | 96.1%
Ministral 3 8B | 100 | 100 | 95 | 93 | 92 | 96.0%
Llama 3.1 70B | 100 | 100 | 96 | 93 | 90 | 95.9%
Stealth: Healer Alpha | 98 | 98 | 97 | 96 | 91 | 95.8%
ByteDance Seed 1.6 Flash | 100 | 98 | 95 | 93 | 92 | 95.8%
Z.AI GLM 4.7 Flash | 100 | 99 | 97 | 93 | 89 | 95.6%
Gemini 3 Flash (Preview) | 100 | 100 | 96 | 93 | 89 | 95.6%
Ministral 3B | 100 | 100 | 100 | 100 | 78 | 95.5%
Nemotron 3 Nano | 100 | 100 | 98 | 91 | 89 | 95.4%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 98 | 96 | 82 | 95.4%
Ministral 8B | 100 | 99 | 97 | 95 | 86 | 95.1%
Gemma 3 4B | 100 | 98 | 97 | 91 | 90 | 95.1%
Claude 3 Haiku | 100 | 100 | 99 | 94 | 83 | 95.1%
Claude Sonnet 4.6 | 100 | 95 | 94 | 94 | 90 | 94.3%
Hermes 3 70B | 100 | 100 | 100 | 100 | 70 | 94.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 96 | 90 | 83 | 93.9%
GPT-4.1 Nano | 100 | 100 | 97 | 97 | 75 | 93.9%
Mistral NeMO | 100 | 99 | 93 | 89 | 88 | 93.9%
Qwen 2.5 72B | 100 | 100 | 93 | 90 | 86 | 93.8%
Z.AI GLM 4.6 | 97 | 96 | 95 | 93 | 86 | 93.4%
DeepSeek V3.1 | 100 | 100 | 98 | 97 | 72 | 93.3%
ByteDance Seed 1.6 | 100 | 100 | 94 | 90 | 80 | 92.6%
Llama 3.1 8B | 100 | 100 | 99 | 93 | 67 | 91.9%
Gemini 2.5 Flash Lite | 100 | 98 | 91 | 91 | 79 | 91.8%
ByteDance Seed 2.0 Lite | 98 | 97 | 95 | 86 | 79 | 91.1%
LFM2 24B | 96 | 95 | 92 | 87 | 85 | 91.1%
ByteDance Seed 2.0 Mini | 100 | 94 | 89 | 87 | 81 | 90.1%
Llama 3.1 Nemotron 70B | 97 | 92 | 92 | 85 | 76 | 88.5%
Inception Mercury | 100 | 98 | 94 | 84 | 66 | 88.5%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 99.9%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 99.9%
GPT-5.4 | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 99 | 99.8%
Mistral Small 4 | 100 | 100 | 100 | 100 | 99 | 99.8%
Qwen 3 32B | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-5.4 Nano | 100 | 100 | 100 | 99 | 99 | 99.8%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 98 | 99.6%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 98 | 99.6%
Z.AI GLM 4.7 | 100 | 100 | 100 | 99 | 99 | 99.6%
Mistral Large 2 | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.6%
Z.AI GLM 5 | 100 | 100 | 100 | 99 | 99 | 99.6%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 98 | 99.5%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 98 | 99.5%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.4%
DeepSeek V3.2 | 100 | 99 | 99 | 99 | 99 | 99.4%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 98 | 98 | 99.3%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 98 | 98 | 99.3%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
Mistral Small Creative | 100 | 100 | 100 | 100 | 96 | 99.2%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 99 | 97 | 99.1%
Ministral 3 14B | 100 | 100 | 100 | 100 | 96 | 99.1%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 96 | 99.1%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 96 | 99.1%
GPT-4.1 Mini | 100 | 100 | 100 | 98 | 98 | 99.1%
GPT-5.2 | 100 | 100 | 100 | 100 | 95 | 99.1%
Claude 3.5 Sonnet | 100 | 100 | 100 | 98 | 97 | 99.1%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 96 | 99.1%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 95 | 99.0%
Nemotron 3 Super | 100 | 100 | 100 | 99 | 96 | 99.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 95 | 99.0%
Ministral 3 3B | 100 | 100 | 99 | 98 | 97 | 98.9%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 99 | 95 | 98.9%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 94 | 98.9%
GPT-4.1 Nano | 100 | 100 | 100 | 99 | 95 | 98.9%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.7%
Aion 2.0 | 100 | 100 | 100 | 100 | 93 | 98.6%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 98 | 95 | 98.6%
LFM2 24B | 100 | 100 | 100 | 96 | 96 | 98.6%
MiniMax M2.7 | 100 | 100 | 100 | 97 | 96 | 98.5%
Gemma 3 27B | 100 | 100 | 100 | 98 | 94 | 98.5%
WizardLM 2 8x22b | 100 | 100 | 100 | 99 | 93 | 98.5%
ByteDance Seed 1.6 Flash | 100 | 100 | 98 | 98 | 96 | 98.4%
Hermes 3 405B | 100 | 100 | 100 | 100 | 92 | 98.4%
Claude Sonnet 4.5 | 100 | 100 | 100 | 96 | 95 | 98.2%
GPT-5 Mini | 100 | 100 | 100 | 96 | 94 | 98.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 90 | 98.0%
Mistral Large | 100 | 100 | 100 | 95 | 94 | 97.9%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 96 | 93 | 97.9%
Ministral 3 8B | 100 | 100 | 99 | 96 | 94 | 97.9%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 90 | 97.9%
Claude Sonnet 4 | 100 | 100 | 100 | 96 | 93 | 97.7%
Ministral 8B | 100 | 100 | 100 | 95 | 94 | 97.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 89 | 97.7%
Hermes 3 70B | 100 | 100 | 100 | 96 | 93 | 97.7%
Claude Opus 4.6 | 100 | 99 | 97 | 97 | 95 | 97.6%
GPT-5 Nano | 100 | 100 | 100 | 97 | 90 | 97.5%
Qwen 3.5 9B | 100 | 100 | 99 | 96 | 92 | 97.4%
Claude 3 Haiku | 100 | 100 | 99 | 95 | 93 | 97.4%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 94 | 93 | 97.4%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 99 | 98 | 96 | 93 | 97.2%
Claude Haiku 4.5 | 100 | 100 | 98 | 94 | 94 | 97.2%
Gemini 2.5 Pro | 100 | 100 | 100 | 97 | 87 | 96.8%
Stealth: Healer Alpha | 100 | 99 | 95 | 95 | 95 | 96.6%
Z.AI GLM 4.7 Flash | 100 | 99 | 97 | 95 | 92 | 96.6%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 83 | 96.6%
Llama 3.1 8B | 100 | 100 | 100 | 94 | 89 | 96.5%
ByteDance Seed 1.6 | 100 | 100 | 96 | 96 | 91 | 96.5%
Claude 3.5 Haiku | 100 | 100 | 100 | 92 | 90 | 96.4%
Llama 3.1 Nemotron 70B | 100 | 100 | 96 | 96 | 89 | 96.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 96 | 96 | 87 | 95.9%
Mistral Large 3 | 100 | 100 | 100 | 90 | 88 | 95.7%
ByteDance Seed 2.0 Mini | 100 | 99 | 98 | 91 | 89 | 95.3%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 76 | 95.2%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 96 | 93 | 86 | 95.1%
Gemma 3 12B | 100 | 97 | 96 | 91 | 90 | 94.9%
Z.AI GLM 4.6 | 100 | 100 | 93 | 92 | 90 | 94.8%
Claude Sonnet 4.6 (Reasoning) | 97 | 94 | 94 | 93 | 93 | 94.4%
DeepSeek V3.1 | 100 | 99 | 96 | 86 | 85 | 93.3%
Gemma 3 4B | 100 | 95 | 92 | 91 | 87 | 93.1%
Ministral 3B | 100 | 100 | 88 | 88 | 86 | 92.5%
Gemini 3 Flash (Preview) | 100 | 99 | 94 | 85 | 82 | 91.9%
Inception Mercury | 100 | 100 | 100 | 83 | 76 | 91.9%
Claude Sonnet 4.6 | 97 | 93 | 91 | 90 | 88 | 91.9%
Rocinante 12B | 100 | 100 | 95 | 93 | 55 | 88.7%
Llama 3.1 70B | 100 | 89 | 88 | 80 | 62 | 83.7%
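Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 99.9%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 99 | 99.9%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 99 | 99.7%
o4 Mini High | 100 | 100 | 100 | 100 | 98 | 99.7%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 99 | 98 | 99.6%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 99 | 99 | 99.5%
GPT-5.4 Mini | 100 | 100 | 100 | 99 | 98 | 99.4%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 97 | 99.4%
Qwen 3.5 27B | 100 | 100 | 100 | 99 | 97 | 99.2%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 99 | 97 | 99.2%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 96 | 99.2%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 96 | 99.1%
Claude 3 Haiku | 100 | 100 | 100 | 99 | 96 | 99.1%
Rocinante 12B | 100 | 100 | 100 | 100 | 95 | 99.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 98 | 97 | 98.9%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 95 | 98.9%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 95 | 98.9%
GPT-5 | 100 | 100 | 100 | 97 | 97 | 98.8%
Claude Opus 4 | 100 | 100 | 100 | 100 | 94 | 98.8%
Z.AI GLM 5 | 100 | 100 | 99 | 98 | 96 | 98.7%
Nemotron 3 Super | 100 | 99 | 99 | 99 | 97 | 98.7%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 98 | 95 | 98.6%
GPT-5.2 | 100 | 100 | 97 | 97 | 97 | 98.4%
Claude Sonnet 4.5 | 100 | 100 | 98 | 97 | 96 | 98.3%
Hermes 3 405B | 100 | 100 | 100 | 98 | 93 | 98.2%
DeepSeek V3 (2025-03-24) | 100 | 100 | 97 | 97 | 96 | 98.1%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 99 | 98 | 93 | 98.1%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 98 | 96 | 95 | 98.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 99 | 91 | 97.9%
Gemini 2.5 Flash | 100 | 98 | 97 | 97 | 97 | 97.9%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 99 | 91 | 97.9%
Hermes 3 70B | 100 | 100 | 97 | 96 | 96 | 97.9%
Llama 3.1 8B | 100 | 100 | 99 | 97 | 93 | 97.8%
LFM2 24B | 100 | 100 | 100 | 99 | 90 | 97.8%
GPT-4o Mini (temp=0) | 100 | 100 | 99 | 96 | 94 | 97.7%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 86 | 97.2%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 97 | 95 | 93 | 97.1%
GPT-5.4 | 99 | 99 | 98 | 96 | 94 | 97.0%
Claude Haiku 4.5 | 100 | 99 | 97 | 96 | 93 | 97.0%
Qwen 3 32B | 100 | 100 | 100 | 92 | 92 | 96.8%
Llama 3.1 Nemotron 70B | 100 | 100 | 95 | 95 | 94 | 96.7%
Claude Sonnet 4 | 100 | 98 | 96 | 96 | 93 | 96.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 83 | 96.6%
GPT-5.4 Mini (Reasoning, Low) | 100 | 98 | 96 | 96 | 93 | 96.6%
Ministral 3B | 100 | 100 | 100 | 99 | 84 | 96.5%
Nemotron 3 Nano | 100 | 100 | 100 | 98 | 84 | 96.4%
Qwen 3.5 122B | 100 | 100 | 100 | 91 | 89 | 96.1%
Claude 3.7 Sonnet | 100 | 99 | 99 | 94 | 88 | 96.1%
Claude Opus 4.5 | 100 | 100 | 100 | 93 | 87 | 96.0%
Gemini 3 Pro (Preview) | 100 | 97 | 96 | 96 | 91 | 95.8%
ByteDance Seed 1.6 Flash | 100 | 99 | 94 | 93 | 92 | 95.6%
GPT-5.4 Nano | 100 | 99 | 97 | 91 | 90 | 95.4%
Ministral 3 3B | 100 | 100 | 98 | 94 | 84 | 95.1%
Qwen 3.5 Flash | 100 | 100 | 96 | 96 | 84 | 95.1%
Mistral NeMO | 100 | 100 | 100 | 89 | 86 | 95.1%
Mistral Small 4 (Reasoning) | 100 | 97 | 95 | 93 | 89 | 95.0%
Qwen 3.5 35B | 98 | 96 | 96 | 94 | 91 | 95.0%
GPT-5 Nano | 98 | 98 | 94 | 92 | 92 | 94.9%
Mistral Small Creative | 98 | 97 | 96 | 94 | 88 | 94.7%
Gemma 3 27B | 100 | 100 | 94 | 93 | 86 | 94.6%
GPT-5 Mini | 100 | 98 | 98 | 96 | 81 | 94.6%
Mistral Small 4 | 100 | 100 | 97 | 92 | 84 | 94.6%
GPT-4o, May 13th (temp=1) | 99 | 96 | 96 | 93 | 88 | 94.6%
Gemma 3 12B | 100 | 96 | 95 | 94 | 87 | 94.4%
Claude 3.5 Haiku | 100 | 100 | 100 | 94 | 78 | 94.4%
Claude Opus 4.6 | 100 | 97 | 96 | 90 | 89 | 94.4%
Claude Sonnet 4.6 (Reasoning) | 100 | 98 | 94 | 92 | 87 | 94.3%
Mistral Small 3.2 24B | 100 | 100 | 100 | 97 | 74 | 94.3%
Z.AI GLM 4.7 Flash | 100 | 98 | 96 | 94 | 82 | 93.9%
Ministral 8B | 99 | 97 | 96 | 91 | 87 | 93.8%
Aion 2.0 | 100 | 99 | 94 | 88 | 88 | 93.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 98 | 70 | 93.6%
Stealth: Healer Alpha | 100 | 93 | 93 | 93 | 88 | 93.4%
GPT-4.1 Nano | 100 | 97 | 90 | 90 | 90 | 93.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 92 | 91 | 84 | 93.1%
Z.AI GLM 4.5 | 100 | 100 | 90 | 90 | 85 | 93.0%
Ministral 3 8B | 100 | 98 | 92 | 88 | 87 | 92.8%
Arcee AI: Trinity Mini | 100 | 100 | 92 | 88 | 84 | 92.8%
Z.AI GLM 5 Turbo | 100 | 100 | 94 | 94 | 75 | 92.7%
GPT-4o, May 13th (temp=0) | 94 | 94 | 94 | 92 | 89 | 92.6%
GPT-4o, Aug. 6th (temp=0) | 100 | 96 | 95 | 91 | 82 | 92.6%
Z.AI GLM 4.7 | 100 | 95 | 95 | 86 | 85 | 92.3%
Mistral Medium 3.1 | 96 | 95 | 94 | 88 | 86 | 91.9%
Gemma 3 4B | 100 | 96 | 91 | 90 | 82 | 91.7%
DeepSeek V3.2 | 98 | 95 | 90 | 89 | 83 | 91.1%
MiniMax M2.5 | 96 | 95 | 93 | 88 | 82 | 90.9%
ByteDance Seed 1.6 | 100 | 90 | 89 | 89 | 87 | 90.9%
Ministral 3 14B | 96 | 94 | 93 | 90 | 81 | 90.8%
Gemini 3.1 Flash Lite (Preview) | 97 | 93 | 91 | 86 | 84 | 90.3%
Llama 3.1 70B | 100 | 99 | 96 | 94 | 61 | 89.9%
WizardLM 2 8x22b | 95 | 94 | 92 | 84 | 84 | 89.8%
Gemini 3 Flash (Preview, Reasoning) | 100 | 92 | 92 | 84 | 81 | 89.6%
Gemini 2.5 Pro | 100 | 95 | 92 | 89 | 72 | 89.5%
Inception Mercury | 100 | 100 | 100 | 83 | 62 | 89.1%
Mistral Large 2 | 100 | 99 | 98 | 83 | 65 | 88.8%
Gemini 2.5 Flash Lite | 98 | 97 | 91 | 81 | 76 | 88.7%
Mistral Large 3 | 96 | 91 | 88 | 87 | 78 | 88.0%
DeepSeek V3.1 | 99 | 92 | 88 | 86 | 75 | 88.0%
Gemini 2.5 Flash Lite (Reasoning) | 96 | 90 | 87 | 83 | 83 | 87.6%
Claude Sonnet 4.6 | 100 | 95 | 88 | 78 | 74 | 87.2%
ByteDance Seed 2.0 Mini | 97 | 96 | 91 | 88 | 64 | 87.0%
Z.AI GLM 4.6 | 100 | 96 | 83 | 81 | 75 | 87.0%
Mistral Large | 96 | 93 | 86 | 84 | 76 | 86.8%
Gemini 3 Flash (Preview) | 95 | 89 | 85 | 85 | 79 | 86.5%
Stealth: Hunter Alpha | 98 | 87 | 84 | 84 | 70 | 84.8%
Qwen 2.5 72B | 96 | 89 | 85 | 79 | 69 | 83.5%
ByteDance Seed 2.0 Lite | 100 | 78 | 77 | 74 | 57 | 77.2%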
Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 99.9%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 99.9%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 99.9%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 99 | 99.9%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 99 | 99.8%
Hermes 3 405B | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 99 | 99.8%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 99 | 99.8%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 99 | 99.8%
Qwen 3 32B | 100 | 100 | 100 | 100 | 99 | 99.8%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 99 | 99.7%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 99 | 99.7%
LFM2 24B | 100 | 100 | 100 | 100 | 99 | 99.7%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 98 | 99.6%
Mistral NeMO | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 98 | 99.6%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 99 | 99 | 99.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 97 | 99.5%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 97 | 99.5%
Mistral Large 2 | 100 | 100 | 100 | 100 | 96 | 99.3%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 96 | 99.2%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 96 | 99.2%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 96 | 99.2%
Mistral Medium 3.1 | 100 | 100 | 99 | 99 | 98 | 99.2%
GPT-5 Mini | 100 | 100 | 100 | 99 | 97 | 99.2%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 96 | 99.1%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.1%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 96 | 99.1%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 99 | 96 | 99.1%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 95 | 99.1%
Mistral Large | 100 | 100 | 100 | 99 | 96 | 99.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 99 | 98 | 98 | 99.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 95 | 99.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 95 | 99.0%
Gemma 3 12B | 100 | 100 | 100 | 98 | 96 | 98.9%
Stealth: Aurora Alpha | 100 | 100 | 100 | 98 | 96 | 98.8%
Gemma 3 4B | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4.1 Nano | 100 | 100 | 100 | 99 | 95 | 98.7%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 93 | 98.7%
Claude Haiku 4.5 | 100 | 100 | 100 | 97 | 96 | 98.6%
Ministral 3B | 100 | 100 | 100 | 98 | 95 | 98.5%
Gemini 2.5 Flash Lite | 100 | 99 | 99 | 98 | 96 | 98.3%
Mistral Large 3 | 100 | 100 | 100 | 100 | 91 | 98.3%
GPT-5 Nano | 100 | 100 | 99 | 96 | 96 | 98.2%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 97 | 93 | 98.2%
GPT-4o Mini (temp=0) | 100 | 100 | 99 | 99 | 93 | 98.1%
Gemini 3 Pro (Preview) | 100 | 100 | 98 | 97 | 93 | 97.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 99 | 90 | 97.8%
Claude 3 Haiku | 100 | 100 | 99 | 95 | 94 | 97.6%
Qwen 2.5 72B | 100 | 100 | 100 | 95 | 93 | 97.6%
DeepSeek V3.2 | 99 | 98 | 98 | 97 | 96 | 97.5%
Ministral 3 3B | 100 | 100 | 96 | 95 | 95 | 97.4%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 97 | 96 | 94 | 97.3%
ByteDance Seed 1.6 | 100 | 100 | 99 | 97 | 91 | 97.3%
Ministral 3 8B | 100 | 100 | 100 | 100 | 86 | 97.2%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 85 | 97.1%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 85 | 96.9%
Gemini 3.1 Flash Lite (Preview) | 100 | 98 | 98 | 95 | 94 | 96.8%
Gemma 3 27B | 100 | 100 | 96 | 95 | 93 | 96.7%
Ministral 8B | 100 | 98 | 98 | 94 | 92 | 96.5%
Nemotron 3 Nano | 100 | 100 | 100 | 96 | 86 | 96.4%
Stealth: Healer Alpha | 100 | 100 | 97 | 94 | 89 | 96.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 99 | 79 | 95.7%
ByteDance Seed 2.0 Mini | 100 | 99 | 97 | 92 | 88 | 95.3%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 98 | 93 | 83 | 94.9%
Llama 3.1 70B | 100 | 100 | 96 | 92 | 86 | 94.8%
Llama 3.1 8B | 100 | 100 | 96 | 89 | 89 | 94.7%
Rocinante 12B | 100 | 100 | 98 | 88 | 86 | 94.2%
Claude Sonnet 4.6 | 100 | 100 | 96 | 94 | 82 | 94.2%
Gemini 2.5 Pro | 100 | 96 | 92 | 90 | 90 | 93.7%
Hermes 3 70B | 100 | 100 | 100 | 97 | 70 | 93.5%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 39 | 87.8%
Llama 3.1 Nemotron 70B | 100 | 99 | 97 | 93 | 49 | 87.5%
ByteDance Seed 2.0 Lite | 100 | 100 | 97 | 52 | 50 | 79.9%
Inception Mercury | 100 | 100 | 100 | 71 | 28 | 79.8%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 99.9%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 99.9%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 99.9%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 99.9%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude Opus 4 | 100 | 100 | 100 | 100 | 99 | 99.8%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 99 | 99.8%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 99 | 99.8%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 99 | 99.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 98 | 99.6%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 98 | 99.6%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 99 | 99 | 99.6%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-5.2 | 100 | 100 | 100 | 100 | 97 | 99.3%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 97 | 99.3%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 96 | 99.3%
Qwen 3.5 27B | 100 | 100 | 100 | 99 | 98 | 99.3%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 96 | 99.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 96 | 99.3%
LFM2 24B | 100 | 100 | 100 | 99 | 98 | 99.3%
Qwen 3 32B | 100 | 100 | 100 | 99 | 97 | 99.2%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.2%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 98 | 98 | 99.1%
Claude Opus 4.6 | 100 | 100 | 100 | 99 | 97 | 99.1%
Claude Opus 4.5 | 100 | 100 | 100 | 98 | 97 | 99.1%
Rocinante 12B | 100 | 100 | 100 | 98 | 97 | 99.1%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 95 | 99.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 99 | 96 | 99.0%
MiniMax M2.7 | 100 | 100 | 100 | 99 | 96 | 99.0%
Qwen 3.5 9B | 100 | 100 | 100 | 98 | 97 | 99.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 99 | 99 | 96 | 99.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 95 | 99.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 95 | 98.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 94 | 98.8%
Gemini 3 Pro (Preview) | 100 | 100 | 98 | 98 | 97 | 98.6%
Ministral 3 14B | 100 | 100 | 100 | 98 | 95 | 98.5%
DeepSeek V3.2 | 100 | 100 | 99 | 97 | 96 | 98.4%
Z.AI GLM 4.7 | 100 | 99 | 98 | 97 | 97 | 98.4%
MiniMax M2.5 | 100 | 100 | 100 | 97 | 95 | 98.3%
GPT-5.4 Nano | 100 | 100 | 98 | 98 | 95 | 98.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 99 | 92 | 98.2%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 91 | 98.2%
WizardLM 2 8x22b | 100 | 100 | 100 | 99 | 92 | 98.2%
Z.AI GLM 5 Turbo | 100 | 99 | 98 | 98 | 95 | 98.1%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 97 | 97 | 96 | 98.1%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 99 | 98 | 93 | 98.1%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 90 | 97.9%
GPT-5 Mini | 100 | 100 | 99 | 97 | 94 | 97.9%
Z.AI GLM 4.5 | 100 | 100 | 99 | 98 | 93 | 97.9%
Llama 3.1 70B | 100 | 100 | 100 | 99 | 91 | 97.8%
Ministral 3 3B | 100 | 100 | 100 | 100 | 89 | 97.8%
Gemma 3 12B | 100 | 99 | 99 | 98 | 93 | 97.8%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 97 | 92 | 97.7%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 97 | 91 | 97.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 99 | 99 | 89 | 97.5%
Claude 3 Haiku | 100 | 100 | 100 | 97 | 90 | 97.4%
Qwen 3.5 35B | 100 | 100 | 97 | 96 | 93 | 97.4%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 87 | 97.4%
GPT-4o, May 13th (temp=1) | 100 | 100 | 98 | 97 | 92 | 97.3%
Gemma 3 4B | 100 | 100 | 97 | 96 | 93 | 97.3%
Mistral Large | 100 | 99 | 99 | 99 | 90 | 97.3%
GPT-5 Nano | 100 | 100 | 99 | 96 | 91 | 97.2%
Claude Sonnet 4.6 | 100 | 100 | 98 | 97 | 90 | 97.1%
Mistral Small 4 | 100 | 100 | 98 | 97 | 91 | 97.1%
Stealth: Hunter Alpha | 98 | 98 | 98 | 96 | 95 | 97.0%
Ministral 3B | 100 | 100 | 99 | 96 | 90 | 97.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 98 | 87 | 97.0%
Mistral Small Creative | 100 | 98 | 97 | 96 | 92 | 96.8%
Z.AI GLM 4.7 Flash | 100 | 100 | 96 | 94 | 94 | 96.8%
Claude 3.7 Sonnet | 100 | 100 | 98 | 97 | 89 | 96.8%
Hermes 3 70B | 100 | 100 | 100 | 99 | 83 | 96.4%
ByteDance Seed 1.6 | 100 | 98 | 95 | 95 | 93 | 96.2%
GPT-4o, May 13th (temp=0) | 100 | 97 | 96 | 94 | 93 | 96.0%
Aion 2.0 | 100 | 100 | 99 | 92 | 89 | 95.9%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 92 | 88 | 95.9%
Z.AI GLM 5 | 100 | 100 | 95 | 94 | 90 | 95.7%
Llama 3.1 Nemotron 70B | 100 | 100 | 96 | 96 | 85 | 95.2%
DeepSeek V3.1 | 100 | 97 | 95 | 94 | 89 | 95.2%
ByteDance Seed 2.0 Lite | 100 | 99 | 96 | 96 | 85 | 95.0%
Stealth: Healer Alpha | 100 | 100 | 96 | 92 | 87 | 95.0%
Mistral Large 3 | 99 | 98 | 95 | 94 | 89 | 94.9%
Gemini 2.5 Pro | 100 | 100 | 97 | 91 | 87 | 94.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 99 | 89 | 85 | 94.6%
Ministral 3 8B | 99 | 97 | 96 | 94 | 86 | 94.4%
Qwen 2.5 72B | 98 | 96 | 94 | 92 | 91 | 94.3%
Gemma 3 27B | 100 | 100 | 93 | 92 | 85 | 94.1%
Ministral 8B | 100 | 97 | 95 | 92 | 86 | 94.0%
Mistral Large 2 | 100 | 93 | 93 | 92 | 90 | 93.7%
Nemotron 3 Nano | 100 | 100 | 100 | 97 | 70 | 93.4%
Llama 3.1 8B | 100 | 98 | 93 | 88 | 87 | 93.3%
Gemini 3 Flash (Preview) | 99 | 99 | 94 | 88 | 86 | 93.2%
Z.AI GLM 4.6 | 100 | 99 | 94 | 86 | 86 | 93.1%
Inception Mercury | 100 | 100 | 100 | 89 | 75 | 92.8%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 89 | 73 | 92.5%
Gemini 2.5 Flash Lite | 94 | 93 | 92 | 89 | 87 | 91.0%
Mistral NeMO | 98 | 93 | 85 | 85 | 78 | 87.9%
ByteDance Seed 2.0 Mini | 94 | 89 | 69 | 67 | 66 | 77.0%

genre

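Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg ▼
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 99.9%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 99.9%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 99.9%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 99 | 99.9%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 99 | 99 | 99.6%
Grok 4 | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 98 | 99.5%
o4 Mini High | 100 | 100 | 100 | 100 | 98 | 99.5%
Claude 3.7 Sonnet | 100 | 100 | 100 | 99 | 99 | 99.5%
GPT-5.4 | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 99 | 98 | 99.3%
Qwen 3.5 27B | 100 | 100 | 100 | 98 | 98 | 99.1%
Mistral Large 3 | 100 | 100 | 100 | 100 | 95 | 99.1%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 98 | 97 | 99.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 96 | 99.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 95 | 99.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 99 | 96 | 99.0%
GPT-5.4 (Reasoning) | 100 | 100 | 99 | 98 | 97 | 98.9%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 97 | 97 | 98.8%
Z.AI GLM 5 | 100 | 100 | 100 | 97 | 96 | 98.7%
GPT-5.1 | 100 | 100 | 99 | 97 | 97 | 98.6%
Mistral Large 2 | 100 | 100 | 100 | 97 | 96 | 98.6%
Ministral 3B | 100 | 100 | 98 | 97 | 97 | 98.6%
Qwen 3 32B | 100 | 100 | 100 | 98 | 95 | 98.6%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 97 | 96 | 98.6%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 99 | 99 | 95 | 98.6%
Gemma 3 27B | 100 | 100 | 100 | 100 | 93 | 98.5%
Mistral Small Creative | 100 | 100 | 98 | 97 | 97 | 98.5%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 92 | 98.4%
Claude Opus 4.6 | 100 | 99 | 99 | 99 | 94 | 98.3%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 99 | 93 | 98.3%
Claude Sonnet 4 | 100 | 100 | 100 | 99 | 93 | 98.3%
Claude Opus 4.5 | 100 | 100 | 100 | 97 | 93 | 98.1%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 90 | 98.1%
GPT-4.1 | 100 | 100 | 100 | 100 | 90 | 98.1%
Claude 3 Haiku | 100 | 100 | 99 | 99 | 93 | 98.1%
Nemotron 3 Nano | 100 | 100 | 98 | 98 | 94 | 98.1%
Mistral Large | 100 | 100 | 100 | 99 | 91 | 98.1%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 90 | 98.0%
GPT-5.2 | 100 | 99 | 99 | 96 | 96 | 98.0%
GPT-5 | 100 | 100 | 100 | 97 | 93 | 97.9%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 95 | 94 | 97.9%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 97 | 92 | 97.8%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 89 | 97.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 95 | 94 | 97.7%
Mistral NeMO | 100 | 100 | 100 | 98 | 90 | 97.7%
ByteDance Seed 1.6 | 100 | 100 | 100 | 96 | 91 | 97.5%
Gemini 2.5 Flash | 100 | 100 | 99 | 96 | 92 | 97.3%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 98 | 89 | 97.3%
Qwen 3.5 9B | 100 | 100 | 100 | 96 | 90 | 97.2%
Claude 3.5 Sonnet | 100 | 100 | 100 | 98 | 87 | 97.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 99 | 99 | 95 | 91 | 97.0%
Gemma 3 4B | 99 | 99 | 99 | 94 | 93 | 96.9%
Rocinante 12B | 100 | 100 | 97 | 94 | 94 | 96.8%
Nemotron 3 Super | 100 | 100 | 98 | 93 | 93 | 96.8%
Gemini 2.5 Flash Lite | 100 | 100 | 98 | 94 | 92 | 96.8%
Hermes 3 70B | 100 | 100 | 98 | 93 | 93 | 96.6%
Mistral Medium 3.1 | 100 | 100 | 100 | 94 | 89 | 96.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 99 | 93 | 92 | 96.6%
GPT-5.4 Mini | 100 | 100 | 98 | 93 | 91 | 96.5%
LFM2 24B | 100 | 99 | 98 | 94 | 92 | 96.5%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 97 | 85 | 96.4%
GPT-5 Nano | 100 | 99 | 97 | 95 | 91 | 96.4%
Grok 4.20 (Beta) | 100 | 100 | 99 | 93 | 90 | 96.4%
Gemini 3.1 Pro (Preview) | 100 | 100 | 97 | 95 | 89 | 96.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 98 | 94 | 89 | 96.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 96 | 85 | 96.2%
Ministral 3 3B | 100 | 100 | 100 | 93 | 88 | 96.1%
Z.AI GLM 4.5 | 100 | 100 | 100 | 96 | 84 | 96.1%
Stealth: Healer Alpha | 100 | 99 | 98 | 94 | 89 | 96.1%
Qwen 3.5 Flash | 100 | 98 | 97 | 95 | 89 | 95.8%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 96 | 93 | 89 | 95.6%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 92 | 86 | 95.6%
GPT-4o, May 13th (temp=0) | 100 | 100 | 98 | 92 | 87 | 95.4%
Ministral 3 8B | 100 | 100 | 97 | 91 | 89 | 95.4%
GPT-5.4 Nano | 100 | 100 | 94 | 93 | 90 | 95.4%
Inception Mercury | 100 | 100 | 100 | 100 | 77 | 95.4%
DeepSeek V3.2 | 100 | 99 | 97 | 95 | 86 | 95.3%
Llama 3.1 8B | 100 | 100 | 100 | 95 | 82 | 95.3%
Hermes 3 405B | 100 | 99 | 95 | 92 | 90 | 95.2%
Stealth: Aurora Alpha | 100 | 100 | 96 | 92 | 89 | 95.2%
Ministral 3 14B | 98 | 96 | 95 | 94 | 92 | 95.1%
Ministral 8B | 98 | 98 | 94 | 93 | 92 | 94.7%
GPT-5.4 Mini (Reasoning, Low) | 98 | 97 | 94 | 94 | 91 | 94.7%
Gemini 2.5 Pro | 100 | 98 | 93 | 92 | 90 | 94.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 88 | 81 | 93.9%
Inception Mercury 2 | 96 | 96 | 94 | 93 | 89 | 93.7%
Gemini 3 Flash (Preview) | 97 | 97 | 96 | 90 | 89 | 93.7%
Z.AI GLM 4.6 | 96 | 94 | 93 | 91 | 91 | 92.9%
Aion 2.0 | 98 | 93 | 92 | 89 | 89 | 92.2%
Mistral Small 3.2 24B | 100 | 100 | 100 | 82 | 79 | 92.2%
GPT-5 Mini | 100 | 96 | 93 | 89 | 83 | 92.2%
Claude Sonnet 4.6 | 98 | 95 | 91 | 88 | 86 | 91.4%
GPT-5.4 Mini (Reasoning) | 99 | 95 | 91 | 89 | 83 | 91.4%
Stealth: Hunter Alpha | 100 | 98 | 98 | 83 | 79 | 91.3%
MiniMax M2.7 | 100 | 94 | 89 | 89 | 79 | 90.3%
Gemma 3 12B | 93 | 92 | 92 | 91 | 83 | 89.9%
Llama 3.1 Nemotron 70B | 100 | 97 | 93 | 84 | 75 | 89.8%
Gemini 3.1 Flash Lite (Preview) | 96 | 92 | 90 | 89 | 82 | 89.8%
Qwen 3.5 35B | 96 | 91 | 91 | 85 | 78 | 88.3%
DeepSeek V3.1 | 94 | 94 | 90 | 90 | 72 | 87.8%
ByteDance Seed 2.0 Lite | 98 | 91 | 88 | 84 | 75 | 87.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 92 | 92 | 84 | 68 | 87.0%
Qwen 2.5 72B | 94 | 90 | 84 | 82 | 80 | 86.0%
WizardLM 2 8x22b | 100 | 99 | 97 | 73 | 60 | 86.0%
Llama 3.1 70B | 96 | 85 | 83 | 78 | 74 | 83.1%
ByteDance Seed 2.0 Mini | 91 | 88 | 85 | 83 | 66 | 82.8%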
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
GPT-5.4 (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
GPT-5.4 (Reasoning, Low)1001001001009999.9%
Grok 4 Fast1001001001009999.9%
Ministral 3 3B1001001001009799.5%
Grok 4.20 (Beta)1001001001009799.5%
Claude Opus 4.5100100100989899.4%
GPT-5.4 Mini (Reasoning)10010099999999.4%
Claude Opus 41001001001009699.3%
Claude Haiku 4.510010099999899.2%
GPT-4o Mini (temp=1)1001001001009599.1%
Mistral Small Creative10010099999799.0%
GPT-4o, Aug. 6th (temp=1)1001001001009599.0%
Gemma 3 27B10010099999799.0%
Grok 4.20 (Beta, Reasoning)10010099999799.0%
GPT-5.4100100100989798.9%
DeepSeek V3 (2024-12-26)10010099989898.9%
GPT-5.110010099989798.8%
Cohere Command R+ (Aug. 2024)100100100999498.6%
Mistral Small 4 (Reasoning)100100100979698.6%
Writer: Palmyra X510010099989598.4%
GPT-4.1100100100969598.3%
Qwen3 235B A22B Instruct 2507100100100969598.2%
Qwen 3.5 397B A17B10010099989498.1%
GPT-51009897979798.0%
MoonshotAI: Kimi K2.510010099969497.8%
Qwen 3.5 35B10010098979497.8%
Claude Opus 4.610010097969697.8%
GPT-5.4 Nano (Reasoning)1009997979697.8%
Mistral Small 410010099989197.8%
Ministral 3 8B10010099989197.7%
Z.AI GLM 5 Turbo100100100959397.6%
Mistral Medium 3.1100100100998897.4%
DeepSeek V3 (2025-03-24)100100100949397.4%
GPT-5.210010097959597.4%
Claude 3.5 Sonnet10010099949397.3%
Ministral 3B10010099949397.2%
o4 Mini10010099988897.2%
Claude 3.7 Sonnet100100100978997.1%
Claude Sonnet 4.6999997969497.1%
DeepSeek-V2 Chat100100100959097.0%
GPT-5.4 Mini (Reasoning, Low)100100100978897.0%
Mistral Small 3.2 24B100100100959096.9%
Gemini 2.5 Flash100100100968996.8%
Claude Opus 4.6 (Reasoning)10010099939296.7%
DeepSeek V3.21009995949496.6%
Gemma 3 12B10010095959396.5%
Qwen 3.5 Plus (2026-02-15)1009797959396.4%
Hermes 3 405B1009998978896.4%
Aion 2.01009896959396.4%
GPT-4.1 Nano10010096959096.3%
Gemini 3 Pro (Preview)999796969396.2%
Gemini 2.5 Flash Lite (Reasoning)1009797969196.1%
GPT-4.1 Mini10010095949296.1%
Gemini 2.5 Flash (Reasoning)1009796959296.1%
Stealth: Hunter Alpha10010098928995.8%
GPT-4o, Aug. 6th (temp=0)100100100948595.8%
GPT-5.4 Mini1009896949195.7%
GPT-4o Mini (temp=0)10010096929095.7%
Arcee AI: Trinity Large (Preview)10010094949195.6%
Claude Sonnet 4.6 (Reasoning)1009796958895.4%
Arcee AI: Trinity Mini999996929095.3%
MiniMax M2.510010096948795.3%
LFM2 24B10010097948595.1%
Qwen 3.5 Flash1009997918895.1%
Mistral Large1009796919195.1%
Mistral Large 2999895939095.0%
Qwen 3.5 27B1009895929095.0%
Llama 3.1 8B10010096908994.9%
Z.AI GLM 4.7999994929194.9%
GPT-4o, May 13th (temp=1)10010098968194.9%
Claude Sonnet 41009694939194.8%
Grok 4100100100947994.6%
Mistral Large 3100100100987494.5%
GPT-5.4 Nano (Reasoning, Low)1009595948994.4%
Llama 3.1 70B1009998928394.4%
WizardLM 2 8x22b10010099878694.4%
Ministral 8B1009494939094.3%
Gemini 3.1 Pro (Preview)1009893909094.3%
Claude 3 Haiku100100100947794.2%
Hermes 3 70B1009895928694.1%
ByteDance Seed 1.6 Flash10010092898994.0%
Ministral 3 14B10010097878694.0%
Qwen 3 32B989895958393.9%
Z.AI GLM 5999794928793.7%
GPT-4o, May 13th (temp=0)989794918793.4%
Gemma 3 4B1009892898592.9%
MiniMax M2.71009392908992.7%
Mistral NeMO1009292918992.6%
GPT-5.4 Nano1009592918592.6%
ByteDance Seed 2.0 Lite10010088878692.2%
DeepSeek V3.1989492898892.2%
Z.AI GLM 4.51009595937892.2%
Inception Mercury 2989892878692.2%
Gemini 2.5 Flash Lite949391919192.0%
Qwen 2.5 72B10010091888191.9%
GPT-5 Mini989893927891.7%
Z.AI GLM 4.61009490878491.0%
Llama 3.1 Nemotron 70B989792888091.0%
Gemini 2.5 Pro979793868291.0%
Qwen 3.5 9B979693907990.9%
Qwen 3.5 122B979589898590.9%
GPT-5 Nano999191888490.8%
Rocinante 12B10010095897090.7%
Stealth: Healer Alpha1009987868190.7%
Gemini 3 Flash (Preview)969491888089.8%
Z.AI GLM 4.7 Flash959388878489.6%
Nemotron 3 Super988989887988.7%
Gemini 3.1 Flash Lite (Preview)929089888388.7%
Nemotron 3 Nano969695787287.2%
Gemini 3 Flash (Preview, Reasoning)999086807886.6%
Stealth: Aurora Alpha898887868186.2%
Inception Mercury958987755981.1%
ByteDance Seed 2.0 Mini858177686575.3%
ByteDance Seed 1.6918970704172.5%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
GPT-5.1100100100100100100.0%
o4 Mini High100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Mistral NeMO10010010010010099.9%
Mistral Medium 3.11001001001009999.8%
Claude 3.7 Sonnet1001001001009899.7%
GPT-5.4 (Reasoning)1001001001009999.7%
GPT-5.41001001001009999.6%
Gemini 2.5 Flash1001001001009899.6%
Mistral Large 2100100100999999.6%
Claude Sonnet 4.51001001001009799.4%
Rocinante 12B1001001001009799.2%
DeepSeek V3 (2025-03-24)100100100999799.2%
Qwen3 235B A22B Instruct 25071001001001009699.2%
Gemma 3 4B1001001001009699.2%
GPT-4.1100100100989899.2%
Mistral Small 3.2 24B1001001001009599.0%
GPT-5.4 Nano100100100989798.9%
GPT-4o Mini (temp=1)1001001001009598.9%
Claude 3.5 Sonnet100100999798.9%
Mistral Small 4 (Reasoning)10010099999698.9%
DeepSeek-V2 Chat1001001001009498.8%
Qwen 2.5 72B1001001001009498.7%
Arcee AI: Trinity Large (Preview)10010099979798.7%
Mistral Small 4100100100999598.7%
GPT-4o, May 13th (temp=1)100100100999498.7%
GPT-5.4 Mini (Reasoning, Low)100100100979798.7%
GPT-5.21009999989898.6%
Gemini 2.5 Flash (Reasoning)100100100989598.6%
Qwen 3 32B1001001001009398.6%
Grok 4100100100999498.6%
Hermes 3 405B100100100979598.5%
Qwen 3.5 9B10010099989598.5%
Ministral 8B10010099989598.4%
Writer: Palmyra X51001001001009298.3%
Grok 4.20 (Beta)10010098989598.3%
Claude Opus 41001001001009198.2%
o4 Mini100100100999298.2%
DeepSeek V3 (2024-12-26)100100100979498.1%
Llama 3.1 70B10010099969698.1%
GPT-5 Nano100100100979397.9%
Z.AI GLM 4.6100100100959497.7%
Stealth: Hunter Alpha10010098969497.7%
Hermes 3 70B100100100969297.7%
Claude 3 Haiku10010099989197.7%
ByteDance Seed 1.6 Flash1001001001008897.6%
Gemini 2.5 Flash Lite10010097969497.5%
Gemini 2.5 Flash Lite (Reasoning)1001001001008897.4%
GPT-5.4 Nano (Reasoning, Low)100100100969197.4%
Claude Haiku 4.51001001001008797.4%
Ministral 3 14B10010098959497.4%
Gemini 3.1 Flash Lite (Preview)10010098959497.3%
Claude Sonnet 4.6 (Reasoning)100100100939397.2%
Claude 3.5 Haiku10010099959297.1%
Qwen 3.5 Flash100100100949197.1%
Z.AI GLM 4.7 Flash10010096969397.0%
DeepSeek V3.21009998959297.0%
MiniMax M2.71009996959597.0%
Gemini 3.1 Pro (Preview)10010097969296.9%
Z.AI GLM 4.51001001001008596.9%
Ministral 3B10010098969096.8%
DeepSeek V3.11009999978996.7%
Mistral Small Creative10010099968896.7%
Inception Mercury 21009999968996.7%
Gemini 3 Pro (Preview)10010096969196.7%
GPT-5.4 (Reasoning, Low)1009897959496.7%
GPT-5.4 Mini10010098978896.6%
Qwen 3.5 122B10010096949396.5%
Cohere Command R+ (Aug. 2024)10010099998596.4%
Qwen 3.5 397B A17B10010097968996.3%
Z.AI GLM 4.71009896959296.1%
Mistral Large10010096968996.1%
LFM2 24B10010099958696.0%
Mistral Large 310010095939296.0%
GPT-5.4 Nano (Reasoning)10010098968596.0%
Gemini 2.5 Pro1009896949295.9%
MiniMax M2.510010098948695.6%
Claude Opus 4.6 (Reasoning)1009795939395.5%
Claude Opus 4.610010097958695.5%
Ministral 3 8B10010099908995.5%
Gemma 3 12B10010097918995.4%
GPT-4.1 Nano1009999918795.2%
Z.AI GLM 5 Turbo10010097958395.1%
MoonshotAI: Kimi K2.510010099948295.1%
Gemini 3 Flash (Preview)1009897938795.1%
GPT-5 Mini1009997928795.0%
Qwen 3.5 27B10010094928895.0%
GPT-51009694939194.9%
Llama 3.1 Nemotron 70B999898918994.8%
Gemini 3 Flash (Preview, Reasoning)1009694939294.8%
Z.AI GLM 510010097898894.7%
GPT-5.4 Mini (Reasoning)1009995918894.5%
ByteDance Seed 2.0 Mini999797908794.1%
Llama 3.1 8B10010095938294.1%
Stealth: Aurora Alpha1009692919093.9%
ByteDance Seed 1.610010092918593.7%
Grok 4.20 (Beta, Reasoning)979594928993.6%
Claude Opus 4.5969595948993.6%
Nemotron 3 Nano1009594928593.2%
Qwen 3.5 35B1009793918493.2%
Claude Sonnet 410010092908292.9%
Qwen 3.5 Plus (2026-02-15)10010096907892.6%
Gemma 3 27B999897858492.5%
GPT-4o, Aug. 6th (temp=1)100100100887492.4%
GPT-4.1 Mini1009998917392.3%
Aion 2.0979593908692.2%
Ministral 3 3B100100100877392.2%
ByteDance Seed 2.0 Lite10010092878091.8%
Claude Sonnet 4.6989292898791.7%
Stealth: Healer Alpha999795907290.6%
Inception Mercury10010099856990.5%
Nemotron 3 Super1008987828187.8%
Arcee AI: Trinity Mini1009191817186.9%
WizardLM 2 8x22b1001001001003286.4%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Grok 4.1 Fast1001001001009899.7%
Llama 3.1 8B1001001001009899.5%
Hermes 3 405B1001001001009699.3%
LFM2 24B1001001001009699.2%
DeepSeek V3 (2025-03-24)10010099989798.7%
o4 Mini High100100100989598.4%
o4 Mini1009898979597.8%
GPT-4o Mini (temp=1)10010099949497.3%
Qwen3 235B A22B Instruct 25071009998969497.3%
GPT-4o, May 13th (temp=1)10010096959597.1%
Gemma 3 12B100100100939196.8%
Grok 410010098988596.3%
Mistral Small 410010099958796.3%
Writer: Palmyra X51009696959396.0%
Arcee AI: Trinity Large (Preview)10010097919095.6%
Claude 3.7 Sonnet1009898938895.6%
Gemma 3 4B100100100958295.4%
GPT-5 Nano1009895948995.0%
GPT-4.110010098948394.9%
Claude 3.5 Sonnet989893939294.8%
Claude Opus 41009898928694.8%
Claude Sonnet 4.5989795949094.7%
GPT-4o, May 13th (temp=0)1009695948994.7%
Gemini 2.5 Flash Lite10010099987694.6%
Hermes 3 70B10010099888594.3%
Grok 4 Fast10010093918794.3%
GPT-5.11009797908794.2%
Stealth: Hunter Alpha1009693919094.0%
Claude 3 Haiku10010094908693.9%
Llama 3.1 70B1009897908593.9%
Qwen 3.5 9B1009797918493.8%
Qwen 3.5 122B969694948993.8%
Stealth: Aurora Alpha10010095928393.8%
Z.AI GLM 5999797908593.7%
Mistral Small Creative10010092908593.4%
Ministral 3 3B10010094898493.3%
GPT-51009595938393.2%
GPT-4.1 Mini1009995908293.2%
Mistral Small 4 (Reasoning)1009695908492.9%
Mistral Large10010096868392.8%
Gemma 3 27B989894888592.7%
Stealth: Healer Alpha989695888692.7%
DeepSeek V3.1969492919092.5%
Z.AI GLM 4.71009391898892.2%
DeepSeek V3 (2024-12-26)969594918592.1%
GPT-5.41009392898692.1%
Gemini 3 Pro (Preview)1009890888492.0%
Claude Sonnet 4100100100857491.8%
Gemini 2.5 Flash1009593858491.6%
GPT-4o, Aug. 6th (temp=1)10010092858091.2%
GPT-5.4 (Reasoning)929292908991.1%
MiniMax M2.5959390898791.0%
Mistral NeMO979592878390.9%
Ministral 3 8B1009291908190.9%
Z.AI GLM 4.5959491898590.9%
Qwen 3 32B999791877990.8%
GPT-4.1 Nano1009794857890.8%
Mistral Medium 3.11009897827690.6%
Ministral 3B1009491868290.5%
Rocinante 12B1009492907690.4%
GPT-5.2999088878790.3%
Gemini 2.5 Pro969695877690.1%
Gemini 3 Flash (Preview, Reasoning)979290868590.0%
Qwen 3.5 Flash979189898389.9%
Gemini 2.5 Flash Lite (Reasoning)989291848489.8%
Llama 3.1 Nemotron 70B1009796797789.7%
ByteDance Seed 1.6 Flash999592877589.6%
Grok 4.20 (Beta, Reasoning)989490848289.5%
Qwen 3.5 35B969590848289.3%
Qwen 3.5 27B949390898289.3%
DeepSeek-V2 Chat999389848189.3%
GPT-4o Mini (temp=0)928989878688.7%
Gemini 2.5 Flash (Reasoning)969088858488.6%
Claude Sonnet 4.6919089878588.6%
Mistral Large 3929188888388.5%
WizardLM 2 8x22b959490887688.4%
Cohere Command R+ (Aug. 2024)1009684818088.0%
Claude Sonnet 4.6 (Reasoning)999590807788.0%
Gemini 3.1 Pro (Preview)989889777687.8%
GPT-5.4 Nano (Reasoning, Low)929089848487.6%
Claude 3.5 Haiku1009488817587.6%
Gemini 3.1 Flash Lite (Preview)959286848187.5%
Z.AI GLM 5 Turbo929190887587.3%
Z.AI GLM 4.6939389867587.2%
Qwen 3.5 Plus (2026-02-15)1008685848187.2%
Claude Opus 4.5989791787287.1%
Z.AI GLM 4.7 Flash959088837987.0%
Gemini 3 Flash (Preview)928787868286.7%
Arcee AI: Trinity Mini1009895786386.7%
Claude Haiku 4.51009881797686.7%
MoonshotAI: Kimi K2.5969392837186.7%
GPT-5.4 Nano908986858186.2%
Mistral Small 3.2 24B1009282797986.1%
Ministral 3 14B939393836986.0%
GPT-5.4 Mini (Reasoning, Low)918785858186.0%
GPT-5.4 (Reasoning, Low)968785837985.9%
MiniMax M2.7938986847685.7%
Inception Mercury1009690875485.5%
GPT-5.4 Nano (Reasoning)938986867385.5%
ByteDance Seed 2.0 Lite988786827485.5%
DeepSeek V3.21009382807185.4%
Aion 2.0918887817884.9%
Claude Opus 4.6968984787684.9%
Grok 4.20 (Beta)988784787884.8%
GPT-4o, Aug. 6th (temp=0)909088866784.4%
Qwen 2.5 72B898985857284.2%
GPT-5 Mini949183787484.1%
Qwen 3.5 397B A17B959484767084.0%
Mistral Large 2969284836383.6%
GPT-5.4 Mini858584828083.0%
ByteDance Seed 1.610010078726482.8%
Claude Opus 4.6 (Reasoning)898783817382.7%
Ministral 8B978484776781.8%
GPT-5.4 Mini (Reasoning)968579737180.9%
Inception Mercury 2938580775978.8%
Nemotron 3 Nano988580724676.1%
Nemotron 3 Super867171675570.0%
ByteDance Seed 2.0 Mini736765616065.1%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
o4 Mini High100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
GPT-5.4100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Small 4100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Qwen 3 32B100100100100100100.0%
GPT-5.110010010010010099.9%
GPT-4o, May 13th (temp=1)1001001001009999.8%
Grok 4.1 Fast1001001001009999.8%
Hermes 3 405B1001001001009899.6%
GPT-4o, Aug. 6th (temp=0)1001001001009799.5%
GPT-4.11001001001009799.4%
Mistral Medium 3.1100100100999899.3%
Qwen 3.5 122B100100100999899.3%
Claude 3.5 Sonnet10010099999899.3%
Grok 4.20 (Beta)100100100999799.2%
Mistral Large 2100100100999799.2%
Ministral 3B10010099999899.2%
GPT-4.1 Mini100100100999799.2%
GPT-510010099999899.2%
DeepSeek-V2 Chat1001001001009699.2%
Mistral Small 4 (Reasoning)100100100989899.2%
o4 Mini1001001001009599.1%
Gemini 3 Pro (Preview)10010099999899.1%
Claude Opus 4.61009999989899.0%
Z.AI GLM 5 Turbo100100100989799.0%
GPT-4o Mini (temp=1)100100100999699.0%
MoonshotAI: Kimi K2.5100100100999699.0%
Writer: Palmyra X5100100100989798.9%
Gemini 2.5 Flash1001001001009598.9%
GPT-5.4 Mini (Reasoning)10010099989798.9%
LFM2 24B1001001001009498.9%
GPT-5.4 Nano (Reasoning, Low)100100100989698.8%
Llama 3.1 70B1009998989898.8%
Mistral Large 31001001001009498.7%
Claude Opus 41001001001009498.7%
Grok 410010099989798.7%
MiniMax M2.51001001001009498.7%
Mistral Large10010099989698.6%
GPT-5.21009998989798.5%
Gemini 2.5 Flash (Reasoning)100100100979598.5%
Qwen 3.5 9B100100100999498.5%
Qwen3 235B A22B Instruct 2507100100100979698.4%
Gemma 3 4B100100100999398.4%
GPT-5.4 Mini (Reasoning, Low)100100100969698.4%
Claude Sonnet 41001001001009298.4%
Qwen 3.5 Flash10010099999598.4%
Claude Sonnet 4.5100100100979498.4%
DeepSeek V3 (2024-12-26)100100100979598.3%
Mistral Small Creative100100100969598.3%
Claude Sonnet 4.61001001001009298.3%
Ministral 3 8B100100100989398.3%
Grok 4.20 (Beta, Reasoning)100100100979498.2%
GPT-5.4 Nano (Reasoning)100100100969698.2%
GPT-5.4 (Reasoning)10010099969597.9%
GPT-4.1 Nano10010099979497.9%
Ministral 8B1001001001009097.9%
Gemini 2.5 Flash Lite (Reasoning)100100100999097.8%
GPT-5.4 Mini10010098969597.8%
GPT-5.4 (Reasoning, Low)10010099979297.7%
Hermes 3 70B100100100989097.6%
MiniMax M2.7100100100979097.6%
GPT-5.4 Nano1009998979497.6%
GPT-4o Mini (temp=0)100100100949497.5%
Stealth: Aurora Alpha10010098969397.5%
Ministral 3 14B100100100959397.5%
Aion 2.010010097979397.5%
Claude 3.7 Sonnet10010096969497.4%
Z.AI GLM 4.710010097969497.4%
ByteDance Seed 1.6 Flash100100100949397.4%
Llama 3.1 8B10010099998997.4%
Claude 3.5 Haiku1001001001008797.4%
Qwen 3.5 35B1009997969497.3%
GPT-5 Nano10010099998897.2%
Qwen 2.5 72B1001001001008797.2%
Arcee AI: Trinity Mini10010098969297.2%
Gemini 3 Flash (Preview)10010098979097.1%
GPT-4o, Aug. 6th (temp=1)1009999988997.0%
Qwen 3.5 397B A17B1009897959497.0%
Z.AI GLM 4.7 Flash10010098949397.0%
Rocinante 12B1001001001008596.9%
Claude Opus 4.6 (Reasoning)1009797969696.9%
Gemini 3 Flash (Preview, Reasoning)10010098949296.9%
Gemini 3.1 Pro (Preview)10010099988896.8%
Stealth: Healer Alpha10010099939096.6%
Mistral NeMO100100100928996.2%
Arcee AI: Trinity Large (Preview)10010097929296.1%
Llama 3.1 Nemotron 70B10010097958796.0%
Stealth: Hunter Alpha10010096948995.9%
Gemini 2.5 Flash Lite10010096939095.9%
Nemotron 3 Super100100100928795.9%
Inception Mercury 210010097938995.7%
Claude Haiku 4.5100100100978195.6%
DeepSeek V3.210010098909095.5%
Mistral Small 3.2 24B100100100938495.4%
Qwen 3.5 Plus (2026-02-15)100100100958295.4%
Z.AI GLM 51009898968495.3%
Qwen 3.5 27B10010098938695.3%
Claude Sonnet 4.6 (Reasoning)10010097968495.2%
Gemini 2.5 Pro10010096968495.2%
Cohere Command R+ (Aug. 2024)100100100987795.1%
WizardLM 2 8x22b10010094919094.9%
Z.AI GLM 4.61009998898894.9%
DeepSeek V3.11009998958294.8%
ByteDance Seed 1.610010092919094.5%
Nemotron 3 Nano989794939094.4%
GPT-5 Mini989898928394.0%
Gemini 3.1 Flash Lite (Preview)999393928492.1%
Z.AI GLM 4.51009793927791.7%
ByteDance Seed 2.0 Lite989893868391.5%
ByteDance Seed 2.0 Mini1009891828290.7%
Inception Mercury10010098615582.7%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
o4 Mini High1001001001009899.5%
Grok 4.1 Fast1001001001009899.5%
Grok 4 Fast1001001001009699.3%
Writer: Palmyra X5100100100999799.2%
GPT-5.1100100100989799.0%
o4 Mini1001001001009599.0%
DeepSeek V3 (2025-03-24)1001001001009599.0%
Ministral 3 3B1009999989898.9%
GPT-5.4 Mini (Reasoning)1009999989898.8%
GPT-5.4 (Reasoning, Low)100100100989698.8%
Qwen3 235B A22B Instruct 2507100100100979798.8%
GPT-510010099989698.6%
GPT-4.110010099989598.4%
DeepSeek V3 (2024-12-26)100100100969698.4%
Claude 3.7 Sonnet1009999989498.0%
Mistral Small Creative100100100989298.0%
Hermes 3 405B100100100969397.9%
DeepSeek V3.110010099969497.8%
Mistral Medium 3.11009997969597.7%
Ministral 3 14B10010099989197.6%
Gemma 3 12B1009999989197.5%
Gemma 3 27B100100100959397.4%
DeepSeek-V2 Chat1009998959597.3%
Llama 3.1 8B1001001001008697.2%
GPT-5.21009898989297.2%
Mistral Small 4 (Reasoning)1009998979297.2%
Z.AI GLM 4.5100100100949297.1%
Grok 4100100100959197.1%
Qwen 3.5 122B10010099988997.0%
LFM2 24B100100100968896.8%
GPT-5.4 Mini989797969496.5%
Ministral 3B10010096949296.5%
GPT-4o, May 13th (temp=1)10010098939296.5%
Qwen 3 32B10010095949396.4%
GPT-5.41009898949296.4%
Gemma 3 4B999996969296.4%
Claude 3 Haiku10010098939196.4%
Gemini 3.1 Pro (Preview)999797959496.4%
Arcee AI: Trinity Large (Preview)10010097939196.4%
Claude Opus 410010097949096.3%
Gemini 2.5 Flash Lite (Reasoning)989897969396.2%
Qwen 3.5 397B A17B10010098939096.1%
GPT-5.4 Mini (Reasoning, Low)10010097958996.1%
GPT-4o, May 13th (temp=0)10010099948796.0%
GPT-4.1 Nano10010097958896.0%
Rocinante 12B1009896969096.0%
GPT-5.4 (Reasoning)1009796949295.8%
Claude Haiku 4.510010097929095.8%
GPT-5.4 Nano (Reasoning)1009795949395.7%
Mistral Small 410010094929195.5%
GPT-4o, Aug. 6th (temp=1)1009993939295.4%
WizardLM 2 8x22b100100100987995.3%
Grok 4.20 (Beta)1009795949095.3%
Mistral Small 3.2 24B10010096918895.1%
Mistral Large 210010098928595.0%
GPT-4o, Aug. 6th (temp=0)999594949295.0%
Claude Sonnet 4.61009797918994.9%
GPT-4o Mini (temp=1)1009695948994.9%
Z.AI GLM 51009694939094.8%
Gemini 2.5 Pro1009997958294.7%
GPT-4.1 Mini10010097958294.7%
Gemini 2.5 Flash Lite10010093918994.7%
Stealth: Hunter Alpha989896938694.4%
Gemini 2.5 Flash (Reasoning)989695939094.4%
Hermes 3 70B1009897918594.3%
Nemotron 3 Super989694929194.2%
Stealth: Healer Alpha1009494948893.9%
Qwen 3.5 Plus (2026-02-15)1009994898793.6%
GPT-5 Nano1009494908993.5%
MiniMax M2.51009693908893.4%
GPT-5.4 Nano (Reasoning, Low)1009893918493.3%
Mistral Large1009691898993.2%
Z.AI GLM 4.71009695898693.2%
Ministral 3 8B10010089898793.1%
Inception Mercury 2979795898893.1%
Mistral NeMO989792908993.0%
Ministral 8B1009696888493.0%
ByteDance Seed 1.6 Flash969592929093.0%
Qwen 3.5 27B959493929092.8%
Z.AI GLM 4.6979693908792.7%
Gemini 3 Pro (Preview)979791918792.6%
Llama 3.1 Nemotron 70B1009796888192.4%
Claude Sonnet 4.6 (Reasoning)10010091898292.3%
Claude Opus 4.61009796858292.2%
Mistral Large 3999594898492.1%
MiniMax M2.71009590898692.1%
ByteDance Seed 1.6999696937792.1%
Gemini 3 Flash (Preview)1009894927592.0%
Qwen 3.5 9B969494918491.7%
GPT-5.4 Nano979392908591.4%
Qwen 3.5 Flash989691878591.3%
DeepSeek V3.2989595888091.2%
Z.AI GLM 4.7 Flash989292898691.2%
Z.AI GLM 5 Turbo999291888691.1%
Claude Opus 4.5979290898891.1%
Claude Sonnet 4.5979491878591.0%
Arcee AI: Trinity Mini989793917390.5%
Inception Mercury100100100916290.5%
Qwen 3.5 35B989291868590.2%
MoonshotAI: Kimi K2.51009897916590.2%
Grok 4.20 (Beta, Reasoning)1009090868490.0%
Nemotron 3 Nano1009794827589.5%
Aion 2.0959592857989.3%
Cohere Command R+ (Aug. 2024)969292897688.8%
Gemini 2.5 Flash1009492867288.8%
Claude 3.5 Sonnet989389887588.5%
Qwen 2.5 72B1009792807488.5%
Stealth: Aurora Alpha1009490807888.4%
Claude Opus 4.6 (Reasoning)949189897988.4%
GPT-5 Mini968988878188.3%
Claude Sonnet 4969487828288.0%
Claude 3.5 Haiku989696826487.3%
Gemini 3.1 Flash Lite (Preview)1009588797487.1%
GPT-4o Mini (temp=0)929088848086.9%
Llama 3.1 70B1009983757586.4%
Gemini 3 Flash (Preview, Reasoning)969485817385.8%
ByteDance Seed 2.0 Mini898686848285.4%
ByteDance Seed 2.0 Lite988988875383.0%

Novelcrafter Default Prompt

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 99.9%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 99 | 99.8%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 99 | 99.8%
Rocinante 12B | 100 | 100 | 100 | 100 | 99 | 99.7%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 99 | 99.7%
Mistral Large | 100 | 100 | 100 | 100 | 98 | 99.6%
Mistral NeMO | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 99 | 98 | 99.6%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 97 | 99.5%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 97 | 99.5%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 97 | 99.4%
Inception Mercury 2 | 100 | 100 | 100 | 99 | 98 | 99.4%
Mistral Small 4 | 100 | 100 | 100 | 100 | 97 | 99.4%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 98 | 99.4%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.4%
Mistral Small Creative | 100 | 100 | 100 | 100 | 96 | 99.2%
Ministral 3 3B | 100 | 100 | 100 | 99 | 98 | 99.2%
Qwen 3 32B | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 96 | 99.2%
Stealth: Healer Alpha | 100 | 100 | 100 | 99 | 97 | 99.1%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 99 | 96 | 99.1%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 96 | 99.1%
Qwen 3.5 35B | 100 | 100 | 100 | 98 | 98 | 99.1%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 96 | 99.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 95 | 99.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 99 | 96 | 99.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 95 | 99.0%
Ministral 8B | 100 | 100 | 100 | 98 | 97 | 99.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 94 | 98.9%
Claude Sonnet 4 | 100 | 100 | 100 | 99 | 95 | 98.7%
GPT-4.1 Nano | 100 | 100 | 100 | 99 | 94 | 98.7%
Claude 3.7 Sonnet | 100 | 100 | 100 | 97 | 96 | 98.7%
Mistral Large 2 | 100 | 100 | 98 | 98 | 97 | 98.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 98 | 98 | 97 | 98.6%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 93 | 98.5%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 99 | 93 | 98.4%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 92 | 98.4%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 97 | 95 | 98.4%
Mistral Large 3 | 100 | 100 | 99 | 97 | 96 | 98.4%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 96 | 95 | 98.3%
Claude Opus 4.6 | 100 | 100 | 99 | 98 | 95 | 98.3%
Z.AI GLM 4.7 | 100 | 100 | 100 | 98 | 93 | 98.2%
MiniMax M2.7 | 100 | 100 | 100 | 96 | 94 | 98.2%
Gemma 3 12B | 100 | 100 | 100 | 97 | 94 | 98.1%
GPT-5 Nano | 100 | 100 | 99 | 98 | 93 | 98.1%
GPT-5 Mini | 100 | 100 | 98 | 97 | 95 | 98.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 97 | 93 | 98.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 96 | 94 | 98.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 90 | 97.9%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 89 | 97.8%
Z.AI GLM 4.6 | 100 | 100 | 99 | 96 | 94 | 97.8%
Ministral 3 8B | 100 | 100 | 100 | 99 | 90 | 97.8%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 99 | 95 | 94 | 97.8%
GPT-5 | 100 | 100 | 100 | 96 | 93 | 97.7%
Grok 4.20 (Beta) | 100 | 100 | 99 | 99 | 91 | 97.6%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 96 | 92 | 97.5%
Gemini 3 Pro (Preview) | 100 | 100 | 98 | 97 | 93 | 97.5%
Ministral 3B | 100 | 100 | 97 | 97 | 93 | 97.3%
DeepSeek V3.2 | 100 | 99 | 98 | 97 | 93 | 97.3%
Z.AI GLM 5 | 100 | 100 | 100 | 98 | 88 | 97.3%
Arcee AI: Trinity Mini | 100 | 100 | 97 | 97 | 92 | 97.3%
Gemma 3 27B | 100 | 100 | 96 | 96 | 95 | 97.2%
Claude Opus 4 | 100 | 100 | 100 | 96 | 88 | 96.8%
DeepSeek-V2 Chat | 100 | 100 | 99 | 93 | 92 | 96.7%
Ministral 3 14B | 100 | 100 | 99 | 95 | 90 | 96.7%
Inception Mercury | 100 | 100 | 100 | 94 | 89 | 96.6%
Aion 2.0 | 100 | 99 | 98 | 95 | 91 | 96.6%
Hermes 3 70B | 100 | 100 | 99 | 94 | 89 | 96.4%
Claude 3 Haiku | 100 | 100 | 96 | 95 | 91 | 96.3%
WizardLM 2 8x22b | 100 | 100 | 95 | 95 | 90 | 95.9%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 97 | 93 | 89 | 95.8%
Gemini 3 Flash (Preview, Reasoning) | 100 | 98 | 97 | 92 | 90 | 95.5%
Gemma 3 4B | 100 | 100 | 94 | 92 | 90 | 95.3%
Gemini 3 Flash (Preview) | 100 | 100 | 96 | 95 | 84 | 94.9%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 96 | 90 | 86 | 94.3%
Z.AI GLM 4.7 Flash | 98 | 95 | 95 | 93 | 88 | 93.7%
Qwen 2.5 72B | 100 | 94 | 92 | 92 | 90 | 93.6%
ByteDance Seed 1.6 | 100 | 100 | 97 | 96 | 74 | 93.3%
Stealth: Hunter Alpha | 98 | 97 | 96 | 89 | 86 | 93.1%
Llama 3.1 Nemotron 70B | 100 | 96 | 94 | 92 | 82 | 92.7%
ByteDance Seed 2.0 Mini | 100 | 100 | 95 | 93 | 74 | 92.3%
Llama 3.1 8B | 100 | 96 | 95 | 85 | 83 | 92.0%
ByteDance Seed 2.0 Lite | 100 | 95 | 95 | 84 | 84 | 91.5%
Gemini 2.5 Pro | 96 | 96 | 95 | 89 | 81 | 91.2%
DeepSeek V3.1 | 95 | 91 | 86 | 79 | 76 | 85.4%
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 99.9%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 99.9%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-4.1 | 100 | 100 | 100 | 100 | 98 | 99.6%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemma 3 27B | 100 | 100 | 100 | 99 | 99 | 99.6%
Mistral Large 3 | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 98 | 99.5%
Mistral Large | 100 | 100 | 100 | 100 | 98 | 99.5%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 98 | 99.5%
Ministral 8B | 100 | 100 | 100 | 100 | 98 | 99.5%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 97 | 99.4%
Claude Opus 4 | 100 | 100 | 100 | 100 | 97 | 99.4%
Inception Mercury 2 | 100 | 100 | 100 | 98 | 98 | 99.4%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 97 | 99.3%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 97 | 99.3%
Mistral Medium 3.1 | 100 | 100 | 99 | 99 | 98 | 99.3%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 96 | 99.3%
Z.AI GLM 4.5 | 100 | 100 | 100 | 98 | 97 | 99.2%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 98 | 98 | 99.1%
Claude 3.7 Sonnet | 100 | 100 | 100 | 99 | 97 | 99.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 96 | 99.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 94 | 98.8%
Writer: Palmyra X5 | 100 | 100 | 100 | 99 | 95 | 98.8%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 97 | 96 | 98.6%
MiniMax M2.7 | 100 | 100 | 100 | 98 | 96 | 98.6%
Gemini 2.5 Pro | 100 | 100 | 100 | 99 | 94 | 98.6%
Qwen 3.5 35B | 100 | 100 | 99 | 97 | 96 | 98.5%
Gemma 3 4B | 100 | 100 | 99 | 98 | 96 | 98.5%
Claude Opus 4.5 | 100 | 100 | 100 | 97 | 95 | 98.5%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 97 | 95 | 98.4%
Rocinante 12B | 100 | 100 | 100 | 97 | 95 | 98.4%
Mistral Small Creative | 100 | 100 | 100 | 98 | 95 | 98.4%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 98 | 94 | 98.4%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 92 | 98.3%
Ministral 3 3B | 100 | 100 | 100 | 100 | 92 | 98.3%
Ministral 3 14B | 100 | 100 | 100 | 97 | 95 | 98.3%
Qwen 3.5 122B | 100 | 100 | 100 | 99 | 92 | 98.3%
Stealth: Hunter Alpha | 100 | 100 | 100 | 97 | 95 | 98.3%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 97 | 94 | 98.2%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 98 | 92 | 98.1%
Claude Opus 4.6 (Reasoning) | 100 | 99 | 99 | 98 | 95 | 98.1%
Z.AI GLM 5 | 100 | 99 | 98 | 97 | 97 | 98.1%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 96 | 94 | 98.1%
Mistral Small 4 | 100 | 100 | 100 | 96 | 94 | 98.0%
GPT-5 Mini | 100 | 100 | 99 | 99 | 92 | 98.0%
Qwen 3.5 27B | 100 | 100 | 99 | 97 | 93 | 97.8%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 97 | 92 | 97.8%
Qwen 2.5 72B | 100 | 100 | 97 | 96 | 96 | 97.7%
Ministral 3B | 100 | 100 | 98 | 95 | 94 | 97.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 99 | 98 | 90 | 97.2%
Mistral Large 2 | 100 | 100 | 100 | 93 | 93 | 97.2%
Stealth: Healer Alpha | 100 | 100 | 98 | 96 | 92 | 97.2%
DeepSeek-V2 Chat | 100 | 100 | 97 | 94 | 94 | 97.1%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 95 | 91 | 97.1%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 98 | 94 | 93 | 97.0%
ByteDance Seed 1.6 | 100 | 100 | 98 | 97 | 89 | 96.9%
WizardLM 2 8x22b | 100 | 100 | 98 | 93 | 92 | 96.8%
Nemotron 3 Nano | 100 | 100 | 99 | 98 | 87 | 96.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 98 | 93 | 92 | 96.6%
GPT-5 Nano | 100 | 99 | 95 | 95 | 94 | 96.5%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 82 | 96.3%
Gemini 2.5 Flash | 100 | 100 | 96 | 93 | 92 | 96.2%
Gemini 3 Flash (Preview, Reasoning) | 100 | 96 | 96 | 95 | 94 | 96.2%
GPT-4o, May 13th (temp=0) | 100 | 100 | 99 | 95 | 87 | 96.1%
Inception Mercury | 100 | 100 | 98 | 97 | 85 | 96.0%
Qwen 3 32B | 100 | 100 | 100 | 94 | 86 | 96.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 91 | 88 | 95.8%
GPT-4o, May 13th (temp=1) | 100 | 100 | 98 | 97 | 84 | 95.8%
Nemotron 3 Super | 100 | 100 | 100 | 90 | 88 | 95.7%
Gemini 2.5 Flash Lite | 100 | 99 | 97 | 95 | 87 | 95.6%
Z.AI GLM 4.6 | 100 | 99 | 96 | 94 | 89 | 95.5%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 95 | 83 | 95.5%
Hermes 3 405B | 100 | 100 | 96 | 92 | 89 | 95.5%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 91 | 85 | 95.3%
Aion 2.0 | 100 | 100 | 94 | 91 | 91 | 95.1%
Gemini 3 Flash (Preview) | 97 | 97 | 96 | 94 | 91 | 95.1%
DeepSeek V3.2 | 100 | 100 | 94 | 92 | 89 | 95.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 97 | 78 | 95.0%
MiniMax M2.5 | 100 | 100 | 96 | 88 | 87 | 94.3%
Z.AI GLM 4.7 | 100 | 100 | 92 | 90 | 89 | 94.2%
Mistral NeMO | 100 | 99 | 97 | 90 | 85 | 94.0%
Claude 3 Haiku | 100 | 99 | 94 | 89 | 78 | 92.0%
Hermes 3 70B | 100 | 98 | 93 | 89 | 78 | 91.5%
Llama 3.1 8B | 100 | 100 | 99 | 96 | 59 | 91.0%
Llama 3.1 Nemotron 70B | 99 | 90 | 89 | 89 | 87 | 90.8%
ByteDance Seed 2.0 Mini | 100 | 96 | 93 | 88 | 76 | 90.5%
Mistral Small 3.2 24B | 100 | 100 | 100 | 94 | 57 | 90.3%
DeepSeek V3.1 | 100 | 100 | 93 | 88 | 61 | 88.5%
Llama 3.1 70B | 98 | 96 | 91 | 86 | 62 | 86.5%
Model #1 #2 #3 #4 #5 Avg ▼
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100.0%
Qwen 3.5 122B 100 100 100 100 100 100.0%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100.0%
Qwen 3.5 27B 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 100.0%
o4 Mini High 100 100 100 100 100 100.0%
GPT-5.2 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100.0%
GPT-4.1 100 100 100 100 100 100.0%
o4 Mini 100 100 100 100 100 100.0%
Grok 4 100 100 100 100 100 100.0%
Qwen 3.5 35B 100 100 100 100 100 100.0%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 100 100.0%
Qwen 3.5 Flash 100 100 100 100 100 100.0%
Z.AI GLM 4.5 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100.0%
Grok 4.20 (Beta) 100 100 100 100 100 100.0%
Inception Mercury 2 100 100 100 100 100 100.0%
Stealth: Aurora Alpha 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=0) 100 100 100 100 100 100.0%
Mistral Small 4 (Reasoning) 100 100 100 100 100 100.0%
DeepSeek V3 (2025-03-24) 100 100 100 100 100 100.0%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 100.0%
GPT-5.4 Nano (Reasoning, Low) 100 100 100 100 100 100.0%
Mistral Medium 3.1 100 100 100 100 100 100.0%
GPT-5.4 Nano 100 100 100 100 100 100.0%
GPT-5 100 100 100 100 100 100.0%
GPT-4o Mini (temp=1) 100 100 100 100 100 100.0%
LFM2 24B 100 100 100 100 100 100.0%
Claude Opus 4 100 100 100 100 100 100.0%
Mistral Small Creative 100 100 100 100 100 100.0%
Mistral Small 3.2 24B 100 100 100 100 100 99.9%
DeepSeek-V2 Chat 100 100 100 100 100 99.9%
Mistral Large 2 100 100 100 100 99 99.9%
Stealth: Hunter Alpha 100 100 100 100 99 99.9%
Qwen 3.5 9B 100 100 100 100 100 99.8%
GPT-4.1 Mini 100 100 100 100 99 99.8%
MoonshotAI: Kimi K2.5 100 100 100 100 99 99.7%
GPT-4o Mini (temp=0) 100 100 100 100 99 99.7%
Z.AI GLM 4.6 100 100 100 100 98 99.6%
Arcee AI: Trinity Mini 100 100 100 100 98 99.6%
Qwen3 235B A22B Instruct 2507 100 100 100 99 99 99.5%
GPT-4o, Aug. 6th (temp=1) 100 100 100 99 98 99.5%
Nemotron 3 Nano 100 100 100 100 98 99.5%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 100 97 99.4%
Gemini 2.5 Flash 100 100 100 100 97 99.4%
Gemma 3 4B 100 100 100 100 97 99.4%
Z.AI GLM 5 Turbo 100 100 100 99 97 99.2%
Claude Opus 4.5 100 100 100 98 98 99.2%
Claude Sonnet 4 100 100 100 98 98 99.2%
Z.AI GLM 4.7 100 100 100 98 97 99.1%
DeepSeek V3 (2024-12-26) 100 100 99 98 98 99.0%
GPT-4o, May 13th (temp=1) 100 100 100 100 94 98.9%
Ministral 3 3B 100 100 100 99 95 98.8%
Gemini 3 Flash (Preview) 100 100 100 100 94 98.8%
Ministral 8B 100 100 99 98 97 98.8%
GPT-4.1 Nano 100 100 100 99 95 98.8%
MiniMax M2.7 100 100 100 100 94 98.7%
Mistral Small 4 100 100 100 99 94 98.7%
Hermes 3 405B 100 100 98 98 97 98.7%
Stealth: Healer Alpha 100 100 100 100 94 98.7%
ByteDance Seed 1.6 Flash 100 100 100 98 96 98.7%
Writer: Palmyra X5 100 100 100 99 94 98.7%
GPT-5.4 Mini 100 100 100 97 96 98.6%
Mistral NeMO 100 100 100 99 94 98.4%
Aion 2.0 100 100 100 98 94 98.4%
Claude Sonnet 4.5 100 100 100 100 92 98.4%
Claude 3.7 Sonnet 100 100 100 97 94 98.4%
MiniMax M2.5 100 100 100 100 91 98.2%
Gemma 3 27B 100 100 100 96 95 98.1%
Ministral 3 8B 100 100 99 97 94 98.0%
Ministral 3 14B 100 100 100 97 92 97.8%
Nemotron 3 Super 100 100 99 95 95 97.8%
Gemini 3 Pro (Preview) 99 99 98 98 94 97.6%
Qwen 2.5 72B 100 100 100 94 93 97.5%
GPT-5 Nano 100 100 98 96 94 97.5%
Mistral Large 100 100 100 96 91 97.3%
Qwen 3 32B 100 100 96 95 95 97.2%
GPT-5 Mini 100 100 97 96 91 97.0%
DeepSeek V3.2 100 99 98 97 90 96.9%
WizardLM 2 8x22b 100 100 100 99 86 96.9%
Claude 3.5 Sonnet 100 100 100 96 89 96.8%
Claude Sonnet 4.6 100 100 99 95 89 96.7%
Arcee AI: Trinity Large (Preview) 100 98 97 96 91 96.6%
Rocinante 12B 100 100 100 95 86 96.2%
Gemini 2.5 Flash Lite 100 100 100 94 87 96.1%
Hermes 3 70B 100 100 100 91 89 95.8%
Claude Opus 4.6 98 98 97 95 91 95.6%
Ministral 3B 100 100 99 93 86 95.6%
Z.AI GLM 4.7 Flash 100 100 99 90 89 95.6%
Llama 3.1 8B 100 100 100 100 78 95.5%
Gemini 3 Flash (Preview, Reasoning) 100 100 97 96 84 95.5%
Z.AI GLM 5 100 97 97 95 88 95.4%
Gemma 3 12B 100 100 98 97 82 95.3%
Mistral Large 3 100 100 98 96 82 95.3%
Gemini 2.5 Pro 100 100 96 90 89 95.0%
Claude Opus 4.6 (Reasoning) 100 96 95 94 89 94.8%
Claude Haiku 4.5 100 96 94 93 91 94.7%
Claude 3.5 Haiku 100 100 100 90 83 94.7%
ByteDance Seed 1.6 100 98 96 91 88 94.3%
DeepSeek V3.1 100 100 94 89 87 94.0%
Gemini 3.1 Flash Lite (Preview) 100 100 94 92 84 93.9%
Cohere Command R+ (Aug. 2024) 100 98 96 90 85 93.8%
ByteDance Seed 2.0 Mini 100 100 95 87 85 93.5%
Llama 3.1 Nemotron 70B 100 100 98 96 73 93.5%
Claude 3 Haiku 100 100 97 90 78 93.0%
Llama 3.1 70B 100 94 93 88 82 91.5%
Inception Mercury 100 100 90 86 78 90.8%
ByteDance Seed 2.0 Lite 99 93 91 88 82 90.6%
Claude Sonnet 4.6 (Reasoning) 97 96 92 92 73 90.0%
Model #1 #2 #3 #4 #5 Avg ▼
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100.0%
o4 Mini 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 100 100.0%
LFM2 24B 100 100 100 100 100 100.0%
Qwen 3.5 Flash 100 100 100 100 99 99.8%
GPT-5.2 100 100 100 100 99 99.8%
Claude Opus 4 100 100 100 100 99 99.8%
o4 Mini High 100 100 100 100 99 99.8%
GPT-5 100 100 100 100 98 99.7%
GPT-4o, Aug. 6th (temp=1) 100 100 100 100 98 99.7%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 98 99.5%
Qwen 3.5 397B A17B 100 100 100 100 98 99.5%
GPT-4o Mini (temp=1) 100 100 100 99 99 99.5%
GPT-4.1 100 100 100 100 97 99.4%
MoonshotAI: Kimi K2.5 100 100 100 100 97 99.4%
Qwen 3.5 27B 100 100 100 100 97 99.3%
GPT-4o, May 13th (temp=1) 100 100 100 100 95 99.0%
Mistral Small 4 (Reasoning) 100 100 99 98 97 99.0%
Mistral NeMO 100 100 99 99 96 98.9%
Grok 4 100 100 100 98 97 98.8%
Qwen 3 32B 100 100 100 98 95 98.6%
Gemini 2.5 Flash (Reasoning) 100 99 99 98 97 98.5%
GPT-4.1 Mini 100 100 100 100 93 98.5%
GPT-5.4 Nano (Reasoning) 100 100 100 98 94 98.5%
Qwen 3.5 9B 100 100 100 98 94 98.4%
Inception Mercury 2 100 100 99 99 94 98.4%
Nemotron 3 Super 100 100 100 97 94 98.2%
GPT-5.4 Mini 100 100 98 96 95 97.9%
Stealth: Aurora Alpha 100 100 100 96 93 97.8%
Qwen3 235B A22B Instruct 2507 100 100 100 97 91 97.6%
Qwen 3.5 122B 100 99 96 96 96 97.5%
Qwen 3.5 35B 100 100 100 98 90 97.5%
GPT-5.4 Nano (Reasoning, Low) 100 99 98 96 93 97.4%
GPT-4o Mini (temp=0) 100 100 100 100 86 97.2%
Z.AI GLM 5 100 100 97 95 94 97.1%
Grok 4.20 (Beta) 100 100 100 100 86 97.1%
GPT-5 Nano 100 100 96 95 93 97.0%
Claude 3 Haiku 100 100 99 95 92 97.0%
Llama 3.1 Nemotron 70B 100 100 96 95 94 96.9%
GPT-5.4 Nano 99 98 98 97 93 96.9%
GPT-5.4 Mini (Reasoning, Low) 100 100 99 96 89 96.7%
GPT-5.4 (Reasoning, Low) 100 100 97 94 93 96.7%
Gemini 2.5 Flash Lite 100 100 97 94 92 96.6%
Qwen 3.5 Plus (2026-02-15) 100 97 95 95 95 96.6%
Gemma 3 4B 100 100 96 94 92 96.4%
GPT-5.4 (Reasoning) 100 100 98 95 89 96.4%
GPT-5.4 100 100 98 92 91 96.2%
Hermes 3 405B 100 100 100 95 86 96.2%
MiniMax M2.7 100 100 100 97 82 95.9%
Mistral Small 4 100 95 95 95 94 95.8%
Gemma 3 12B 100 100 98 97 84 95.7%
ByteDance Seed 1.6 Flash 100 99 95 94 90 95.6%
Gemma 3 27B 100 98 96 93 91 95.6%
Mistral Large 100 99 96 94 88 95.5%
Rocinante 12B 100 100 100 91 86 95.4%
Writer: Palmyra X5 100 100 94 92 91 95.4%
Gemini 3.1 Flash Lite (Preview) 100 100 100 96 81 95.3%
Ministral 3 3B 100 99 93 92 92 95.1%
Z.AI GLM 4.5 100 100 100 90 85 95.1%
Mistral Small 3.2 24B 100 100 100 99 76 95.0%
Ministral 3 8B 100 99 94 92 90 94.9%
Qwen 2.5 72B 99 98 97 92 88 94.6%
Gemini 3 Flash (Preview) 99 98 96 94 86 94.4%
Ministral 3 14B 100 100 96 93 82 94.3%
DeepSeek V3 (2024-12-26) 100 99 97 91 83 94.1%
Gemini 3 Flash (Preview, Reasoning) 98 94 93 92 92 93.9%
Claude 3.5 Haiku 100 94 94 94 87 93.9%
Claude 3.7 Sonnet 100 100 95 91 83 93.9%
Gemini 2.5 Flash 98 96 93 91 91 93.8%
Cohere Command R+ (Aug. 2024) 100 97 93 91 87 93.6%
MiniMax M2.5 100 96 96 89 87 93.6%
Gemini 3 Pro (Preview) 100 99 93 91 84 93.6%
GPT-4.1 Nano 100 100 95 90 82 93.5%
Hermes 3 70B 100 100 95 91 81 93.5%
Mistral Medium 3.1 97 95 94 93 88 93.4%
DeepSeek V3 (2025-03-24) 100 99 95 91 80 93.0%
Claude Haiku 4.5 96 96 94 94 85 93.0%
Z.AI GLM 5 Turbo 100 100 95 91 79 92.9%
Llama 3.1 70B 100 100 97 88 78 92.5%
WizardLM 2 8x22b 100 95 93 92 80 92.2%
DeepSeek V3.2 100 99 94 85 83 92.1%
Llama 3.1 8B 100 95 94 89 81 92.0%
Ministral 3B 100 96 91 90 82 91.7%
GPT-5.4 Mini (Reasoning) 98 93 91 90 86 91.7%
DeepSeek V3.1 100 96 96 89 78 91.7%
Arcee AI: Trinity Mini 100 100 87 86 85 91.7%
Gemini 2.5 Flash Lite (Reasoning) 100 100 95 83 81 91.6%
Arcee AI: Trinity Large (Preview) 100 94 89 89 87 91.5%
Claude Opus 4.5 98 95 95 91 79 91.5%
GPT-5 Mini 99 94 90 88 86 91.3%
Mistral Large 2 100 99 92 85 81 91.3%
Z.AI GLM 4.7 Flash 100 98 91 84 82 91.2%
Ministral 8B 100 94 91 86 84 91.0%
Claude Sonnet 4 100 90 88 87 87 90.5%
DeepSeek-V2 Chat 100 97 89 89 78 90.5%
Claude Opus 4.6 (Reasoning) 96 95 93 88 81 90.4%
Claude Sonnet 4.5 97 96 90 85 82 89.9%
Stealth: Hunter Alpha 94 90 88 88 88 89.7%
Nemotron 3 Nano 100 97 89 86 75 89.4%
Claude 3.5 Sonnet 100 97 90 86 73 89.1%
Claude Opus 4.6 95 92 92 84 81 88.8%
Mistral Small Creative 94 92 87 85 84 88.5%
Gemini 2.5 Pro 99 91 88 86 77 88.5%
ByteDance Seed 1.6 100 91 89 87 75 88.4%
ByteDance Seed 2.0 Lite 100 91 85 83 82 88.1%
ByteDance Seed 2.0 Mini 98 92 89 82 79 87.8%
Mistral Large 3 100 88 85 83 81 87.4%
Stealth: Healer Alpha 94 92 90 84 72 86.2%
Claude Sonnet 4.6 96 92 90 76 74 85.5%
GPT-4o, May 13th (temp=0) 95 86 85 83 78 85.5%
Z.AI GLM 4.7 96 89 87 84 71 85.3%
Z.AI GLM 4.6 96 94 82 77 75 84.8%
GPT-4o, Aug. 6th (temp=0) 100 98 86 70 66 83.8%
Inception Mercury 90 86 82 81 79 83.8%
Claude Sonnet 4.6 (Reasoning) 95 87 83 80 73 83.6%
Aion 2.0 93 85 84 75 70 81.2%
Model #1 #2 #3 #4 #5 Avg ▼
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100.0%
Qwen 3.5 122B 100 100 100 100 100 100.0%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.5 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 100.0%
o4 Mini High 100 100 100 100 100 100.0%
GPT-5.2 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100.0%
GPT-4.1 100 100 100 100 100 100.0%
o4 Mini 100 100 100 100 100 100.0%
Grok 4 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 100 100.0%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 100 100 100.0%
DeepSeek-V2 Chat 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100.0%
Claude 3.5 Sonnet 100 100 100 100 100 100.0%
Inception Mercury 2 100 100 100 100 100 100.0%
Claude 3.5 Haiku 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=1) 100 100 100 100 100 100.0%
GPT-5.4 Mini 100 100 100 100 100 100.0%
Mistral Small 4 (Reasoning) 100 100 100 100 100 100.0%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 100.0%
Gemini 2.5 Flash 100 100 100 100 100 100.0%
Mistral Large 100 100 100 100 100 100.0%
GPT-5.4 Nano (Reasoning, Low) 100 100 100 100 100 100.0%
Gemma 3 12B 100 100 100 100 100 100.0%
Nemotron 3 Nano 100 100 100 100 100 100.0%
GPT-5.4 Nano 100 100 100 100 100 100.0%
WizardLM 2 8x22b 100 100 100 100 100 100.0%
Gemma 3 4B 100 100 100 100 100 100.0%
Llama 3.1 8B 100 100 100 100 100 100.0%
LFM2 24B 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100.0%
GPT-4o Mini (temp=1) 100 100 100 100 100 99.9%
Ministral 3 14B 100 100 100 100 100 99.9%
Qwen 3.5 9B 100 100 100 100 100 99.9%
GPT-5 100 100 100 100 99 99.9%
Qwen 3.5 35B 100 100 100 100 100 99.9%
DeepSeek V3 (2025-03-24) 100 100 100 100 99 99.9%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 99 99.8%
GPT-4o Mini (temp=0) 100 100 100 100 99 99.8%
GPT-5 Nano 100 100 100 100 99 99.8%
Stealth: Aurora Alpha 100 100 100 100 99 99.8%
Mistral Medium 3.1 100 100 100 100 99 99.7%
Qwen 3.5 27B 100 100 100 100 98 99.7%
Ministral 3B 100 100 100 100 98 99.7%
Claude Opus 4.5 100 100 100 100 98 99.6%
Claude Sonnet 4 100 100 100 100 98 99.6%
Gemini 3 Flash (Preview) 100 100 100 100 98 99.6%
Hermes 3 405B 100 100 100 100 98 99.6%
GPT-4o, May 13th (temp=1) 100 100 100 99 99 99.6%
Hermes 3 70B 100 100 100 100 98 99.6%
Stealth: Hunter Alpha 100 100 100 100 98 99.6%
MiniMax M2.5 100 100 100 100 98 99.6%
Grok 4.20 (Beta) 100 100 100 100 98 99.6%
GPT-4.1 Mini 100 100 100 100 98 99.6%
Z.AI GLM 5 100 100 100 99 99 99.6%
Writer: Palmyra X5 100 100 100 100 98 99.5%
Z.AI GLM 4.5 100 100 100 100 97 99.5%
Ministral 3 3B 100 100 100 99 97 99.3%
Claude 3.7 Sonnet 100 100 100 100 96 99.3%
GPT-5 Mini 100 100 100 100 96 99.3%
DeepSeek V3 (2024-12-26) 100 100 100 100 96 99.3%
Cohere Command R+ (Aug. 2024) 100 100 100 99 98 99.3%
ByteDance Seed 1.6 Flash 100 100 100 99 97 99.2%
Z.AI GLM 5 Turbo 100 100 100 100 96 99.1%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 95 99.0%
Mistral Small Creative 100 100 100 100 95 99.0%
Stealth: Healer Alpha 100 100 100 100 95 99.0%
Qwen 3.5 Flash 100 100 100 100 95 98.9%
Claude Opus 4.6 100 100 100 99 96 98.9%
Claude Opus 4 100 100 100 97 97 98.9%
Gemma 3 27B 100 100 100 97 97 98.9%
Mistral Large 3 100 100 100 100 94 98.7%
Ministral 8B 100 100 100 100 93 98.7%
Mistral Small 3.2 24B 100 100 99 99 95 98.6%
Claude Sonnet 4.6 (Reasoning) 100 100 100 99 95 98.6%
Qwen3 235B A22B Instruct 2507 100 100 100 100 93 98.6%
Arcee AI: Trinity Mini 100 100 99 99 95 98.5%
Claude Haiku 4.5 100 100 100 100 93 98.5%
Claude Opus 4.6 (Reasoning) 100 99 99 98 97 98.4%
MiniMax M2.7 100 100 100 98 94 98.4%
Claude Sonnet 4.6 100 100 99 98 94 98.2%
GPT-4o, May 13th (temp=0) 100 100 100 99 91 98.1%
Gemini 2.5 Flash Lite 100 99 99 96 96 98.1%
Gemini 3 Pro (Preview) 100 100 97 97 95 98.0%
Nemotron 3 Super 100 100 100 100 89 97.9%
Aion 2.0 100 100 99 98 92 97.8%
Mistral Large 2 100 100 100 98 91 97.7%
Qwen 2.5 72B 100 99 98 97 93 97.3%
Z.AI GLM 4.7 Flash 100 100 96 95 95 97.2%
Mistral NeMO 100 100 100 94 92 97.2%
GPT-4.1 Nano 100 100 100 95 91 97.2%
Z.AI GLM 4.7 100 100 100 96 90 97.1%
Mistral Small 4 100 100 100 100 85 97.0%
Z.AI GLM 4.6 100 100 98 95 91 96.8%
Ministral 3 8B 100 100 100 95 88 96.7%
DeepSeek V3.1 100 98 97 95 93 96.6%
DeepSeek V3.2 100 100 100 96 84 96.1%
Qwen 3 32B 100 100 100 93 86 95.9%
Inception Mercury 100 100 100 100 79 95.9%
Gemini 2.5 Pro 100 100 97 92 90 95.7%
ByteDance Seed 2.0 Lite 100 100 96 90 87 94.6%
Llama 3.1 70B 100 100 100 98 74 94.4%
Llama 3.1 Nemotron 70B 100 100 100 100 71 94.3%
Rocinante 12B 100 100 94 93 83 94.2%
Arcee AI: Trinity Large (Preview) 97 96 95 93 88 93.6%
GPT-4o, Aug. 6th (temp=0) 100 100 93 87 87 93.2%
Claude 3 Haiku 100 99 89 88 84 92.0%
ByteDance Seed 1.6 100 92 91 88 82 90.7%
ByteDance Seed 2.0 Mini 95 92 88 85 83 88.7%
Model #1 #2 #3 #4 #5 Avg ▼
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.5 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 100.0%
o4 Mini High 100 100 100 100 100 100.0%
GPT-5.2 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100.0%
o4 Mini 100 100 100 100 100 100.0%
Grok 4 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 100 100.0%
Qwen 3.5 9B 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 100 100.0%
GPT-5.4 Mini 100 100 100 100 100 100.0%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 100.0%
GPT-5.4 Nano (Reasoning, Low) 100 100 100 100 100 100.0%
GPT-4o Mini (temp=1) 100 100 100 100 100 100.0%
Gemma 3 27B 100 100 100 100 100 100.0%
Nemotron 3 Nano 100 100 100 100 100 100.0%
Mistral Small 4 100 100 100 100 100 100.0%
Qwen 3.5 35B 100 100 100 100 100 100.0%
Z.AI GLM 4.5 100 100 100 100 100 100.0%
Claude Opus 4.6 (Reasoning) 100 100 100 100 99 99.9%
GPT-5 100 100 100 100 100 99.9%
GPT-4o, Aug. 6th (temp=1) 100 100 100 100 99 99.8%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 99 99.8%
Qwen 3 32B 100 100 100 100 99 99.8%
GPT-5.4 100 100 100 100 99 99.7%
Stealth: Aurora Alpha 100 100 100 99 99 99.7%
GPT-4o Mini (temp=0) 100 100 100 100 98 99.6%
Grok 4.20 (Beta) 100 100 100 100 98 99.6%
GPT-4o, May 13th (temp=1) 100 100 100 100 98 99.5%
Inception Mercury 2 100 100 100 100 97 99.5%
Gemma 3 12B 100 100 100 100 98 99.5%
GPT-4.1 100 100 100 100 97 99.4%
Qwen 3.5 27B 100 100 100 100 97 99.4%
Mistral Small 4 (Reasoning) 100 100 100 99 98 99.3%
GPT-5 Mini 100 100 100 100 96 99.2%
Mistral Large 2 100 100 100 100 95 99.0%
Mistral Large 100 100 100 100 95 99.0%
Gemini 2.5 Flash (Reasoning) 100 100 100 98 97 99.0%
Writer: Palmyra X5 100 100 100 100 95 99.0%
DeepSeek V3 (2025-03-24) 100 100 99 99 96 98.7%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 93 98.6%
Z.AI GLM 5 Turbo 100 100 100 99 94 98.6%
Qwen 3.5 122B 100 100 100 98 95 98.6%
GPT-5.4 Nano 100 100 100 99 94 98.5%
LFM2 24B 100 100 100 100 92 98.4%
GPT-4.1 Nano 100 100 100 97 94 98.3%
Gemini 2.5 Flash Lite 100 100 100 100 91 98.3%
Qwen3 235B A22B Instruct 2507 100 100 100 96 95 98.1%
MiniMax M2.7 100 100 100 100 90 98.1%
DeepSeek V3 (2024-12-26) 100 100 100 96 95 98.0%
Ministral 3 8B 100 100 100 100 90 98.0%
GPT-4o, Aug. 6th (temp=0) 100 100 97 97 96 98.0%
Mistral Medium 3.1 100 100 99 96 95 98.0%
Gemini 3 Pro (Preview) 100 100 98 97 95 98.0%
Gemini 3 Flash (Preview, Reasoning) 100 100 98 96 95 97.9%
Claude 3.5 Haiku 100 100 100 100 89 97.9%
Ministral 3B 100 100 100 95 94 97.8%
GPT-5 Nano 100 100 98 97 95 97.7%
Mistral Large 3 100 100 100 95 93 97.7%
Mistral NeMO 100 100 100 94 94 97.6%
Gemma 3 4B 100 98 98 96 94 97.4%
Cohere Command R+ (Aug. 2024) 100 100 100 94 93 97.4%
GPT-4.1 Mini 100 100 100 95 92 97.3%
Qwen 3.5 Flash 100 100 100 96 91 97.3%
Llama 3.1 Nemotron 70B 100 97 97 96 96 97.3%
ByteDance Seed 1.6 Flash 100 100 100 94 92 97.2%
DeepSeek-V2 Chat 100 100 100 96 90 97.2%
WizardLM 2 8x22b 100 100 100 100 85 97.0%
Z.AI GLM 5 100 100 99 98 88 96.9%
Z.AI GLM 4.7 100 100 100 96 88 96.9%
Nemotron 3 Super 100 100 99 93 92 96.7%
Ministral 8B 100 100 100 100 83 96.6%
Claude 3.7 Sonnet 100 100 100 96 87 96.6%
ByteDance Seed 2.0 Lite 100 96 96 96 95 96.6%
Claude Opus 4 100 100 97 96 90 96.5%
Claude Opus 4.5 100 100 100 92 91 96.4%
Claude Opus 4.6 100 100 97 95 89 96.4%
Claude Sonnet 4.6 (Reasoning) 100 100 99 96 86 96.3%
Stealth: Hunter Alpha 100 100 100 96 86 96.3%
Gemini 2.5 Pro 100 100 98 97 86 96.2%
MiniMax M2.5 100 100 98 95 87 96.0%
DeepSeek V3.2 100 100 97 93 90 95.9%
Mistral Small Creative 100 100 96 92 91 95.7%
Ministral 3 3B 100 100 99 94 85 95.6%
Inception Mercury 100 100 100 100 78 95.5%
Claude Haiku 4.5 99 99 96 93 90 95.5%
Arcee AI: Trinity Mini 100 100 96 94 88 95.5%
Gemini 2.5 Flash Lite (Reasoning) 100 99 97 92 87 94.9%
DeepSeek V3.1 99 99 96 90 90 94.8%
Ministral 3 14B 100 100 93 90 90 94.7%
Llama 3.1 8B 100 100 93 91 89 94.7%
Claude 3.5 Sonnet 100 100 96 90 86 94.5%
GPT-4o, May 13th (temp=0) 100 100 96 93 83 94.5%
Mistral Small 3.2 24B 100 100 99 86 86 94.3%
Gemini 2.5 Flash 100 100 100 100 70 94.0%
Rocinante 12B 100 100 96 87 87 94.0%
Gemini 3 Flash (Preview) 100 98 96 93 82 93.8%
Claude 3 Haiku 100 100 100 85 84 93.6%
Claude Sonnet 4 100 100 95 94 78 93.5%
Llama 3.1 70B 100 100 91 90 85 93.3%
Hermes 3 405B 100 99 95 91 82 93.3%
Z.AI GLM 4.7 Flash 100 97 94 88 87 93.2%
Arcee AI: Trinity Large (Preview) 100 99 96 89 82 93.1%
Stealth: Healer Alpha 100 96 91 91 87 93.0%
Qwen 2.5 72B 100 98 94 93 80 92.7%
Hermes 3 70B 100 95 94 91 84 92.7%
Aion 2.0 100 99 93 92 78 92.4%
Z.AI GLM 4.6 100 93 91 90 87 92.0%
ByteDance Seed 2.0 Mini 99 98 91 89 81 91.6%
ByteDance Seed 1.6 94 92 89 89 87 90.1%
Claude Sonnet 4.6 100 93 89 89 78 89.7%