Past progressive (was/were + -ing) overuse

Test: Bad Writing Habits

Avg. Score: 88.1%
Scenarios: 18

Overall Performance

Rank, Model, Score, Avg. Cost, Avg. Time, Stability (sorted by rank, ascending)
1GPT-4.1 Mini99.9%$0.002719.0s99%
2Inception Mercury 299.7%$0.00327.0s96%
3Grok 4 Fast100.0%$0.001724.1s99%
4LFM2 24B99.9%$0.000228.4s98%
5Grok 4.1 Fast100.0%$0.001837.8s100%
6o4 Mini99.8%$0.01525.7s97%
7GPT-5.4 Nano99.2%$0.005726.3s93%
8GPT-4o, Aug. 6th (temp=1)99.7%$0.01824.4s95%
9GPT-5.4 Nano (Reasoning, Low)99.0%$0.005520.6s90%
10GPT-4.199.8%$0.01844.7s96%
11o4 Mini High100.0%$0.02547.2s100%
12Stealth: Aurora Alpha98.1%$0.00009.8s82%
13GPT-5 Mini99.7%$0.010057.4s96%
14GPT-5.4 Mini (Reasoning, Low)98.2%$0.01516.8s88%
15GPT-5 Nano99.6%$0.00421.4m95%
16GPT-5.4 Mini (Reasoning)98.6%$0.02228.1s88%
17GPT-4o Mini (temp=1)97.7%$0.001234.8s77%
18GPT-4o Mini (temp=0)97.9%$0.001234.8s76%
19GPT-5.4 Nano (Reasoning)96.8%$0.006124.5s76%
20Nemotron 3 Nano98.1%$0.00101.1m81%
21Nemotron 3 Super98.5%$0.00001.4m83%
22GPT-5.4100.0%$0.0491.4m100%
23Grok 4.20 (Beta)95.8%$0.01815.8s73%
24GPT-4o, Aug. 6th (temp=0)97.0%$0.02322.7s73%
25Grok 4100.0%$0.0481.7m100%
26GPT-5.2100.0%$0.0561.5m100%
27GPT-5.4 Mini96.2%$0.01516.8s68%
28Llama 3.1 Nemotron 70B95.4%$0.003831.7s67%
29GPT-4o, May 13th (temp=1)96.9%$0.03314.4s71%
30GPT-5.1100.0%$0.0541.8m99%
31Claude 3 Haiku92.7%$0.002514.9s63%
32Mistral Small 4 (Reasoning)92.9%$0.002230.2s66%
33ByteDance Seed 1.6 Flash93.9%$0.001327.3s61%
34DeepSeek V3 (2025-03-24)94.5%$0.001439.4s63%
35Qwen 3.5 Plus (2026-02-15)94.4%$0.006031.5s62%
36Writer: Palmyra X594.3%$0.01122.0s62%
37Qwen3 235B A22B Instruct 250794.8%$0.001159.2s65%
38Grok 4.20 (Beta, Reasoning)96.3%$0.03934.0s71%
39GPT-5.4 (Reasoning, Low)99.0%$0.0551.4m87%
40GPT-4o, May 13th (temp=0)93.9%$0.03514.1s67%
41Qwen 2.5 72B91.6%$0.001036.7s62%
42Gemini 2.5 Flash (Reasoning)92.8%$0.01121.5s57%
43Hermes 3 405B93.3%$0.003253.2s62%
44Gemini 2.5 Flash Lite91.3%$0.00099.5s52%
45Gemini 2.5 Flash92.0%$0.005210.6s53%
46DeepSeek-V2 Chat92.9%$0.002153.3s61%
47Arcee AI: Trinity Mini90.5%$0.00039.2s51%
48DeepSeek V3 (2024-12-26)92.6%$0.002154.6s58%
49Hermes 3 70B93.9%$0.00101.2m58%
50Z.AI GLM 4.591.5%$0.005142.1s56%
51Mistral Small 490.1%$0.001418.2s49%
52GPT-5100.0%$0.0652.8m100%
53Claude 3.7 Sonnet95.0%$0.04246.7s64%
54Cohere Command R+ (Aug. 2024)93.1%$0.02052.5s58%
55Claude Sonnet 4.592.9%$0.03538.1s60%
56Gemini 2.5 Flash Lite (Reasoning)89.7%$0.002830.8s49%
57Mistral Medium 3.188.6%$0.004836.5s50%
58Mistral NeMO86.3%$0.000510.1s45%
59Llama 3.1 8B92.1%$0.00031.3m51%
60Claude 3.5 Sonnet92.6%$0.04835.5s59%
61GPT-5.4 (Reasoning)99.4%$0.0892.6m93%
62Qwen 3 32B89.1%$0.001554.6s45%
63GPT-4.1 Nano84.8%$0.000713.3s39%
64Claude Sonnet 489.6%$0.03243.7s52%
65Mistral Large 385.0%$0.003330.3s42%
66Mistral Large 285.7%$0.01329.4s45%
67MoonshotAI: Kimi K2.595.7%$0.0193.2m73%
68ByteDance Seed 2.0 Lite92.8%$0.0122.2m59%
69Arcee AI: Trinity Large (Preview)85.7%$0.000043.6s43%
70ByteDance Seed 1.694.0%$0.0132.5m60%
71Gemma 3 4B82.4%$0.000220.0s41%
72Llama 3.1 70B84.5%$0.001529.4s38%
73Qwen 3.5 9B88.8%$0.00111.4m43%
74Qwen 3.5 Flash85.9%$0.002547.5s39%
75Rocinante 12B84.6%$0.001438.4s38%
76WizardLM 2 8x22b89.1%$0.00261.8m48%
77Gemma 3 12B83.0%$0.000441.3s36%
78Mistral Large83.3%$0.01430.9s37%
79Mistral Small Creative80.5%$0.00079.1s31%
80Aion 2.084.8%$0.00641.3m43%
81Inception Mercury83.3%$0.01117.6s29%
82MiniMax M2.783.8%$0.00401.1m38%
83Qwen 3.5 397B A17B91.3%$0.0143.0m56%
84Z.AI GLM 5 Turbo79.1%$0.008133.2s34%
85Gemini 3.1 Flash Lite (Preview)75.4%$0.00308.4s31%
86Ministral 3 3B77.7%$0.000511.1s24%
87Qwen 3.5 122B85.5%$0.0251.1m34%
88Stealth: Hunter Alpha78.2%$0.000055.0s34%
89Stealth: Healer Alpha75.7%$0.000023.7s30%
90DeepSeek V3.281.5%$0.00141.9m39%
91Z.AI GLM 581.3%$0.00841.2m31%
92Claude Opus 4.584.5%$0.07053.4s45%
93Gemini 3.1 Pro (Preview)92.7%$0.1071.8m57%
94Qwen 3.5 35B78.9%$0.0181.0m31%
95Claude Haiku 4.573.6%$0.01121.6s26%
96Claude 3.5 Haiku73.4%$0.003510.8s19%
97Claude Opus 4.685.4%$0.0781.2m44%
98Ministral 8B69.4%$0.000410.4s23%
99Gemma 3 27B74.1%$0.000652.6s25%
100Gemini 2.5 Pro75.2%$0.03636.2s34%
101Gemini 3 Flash (Preview, Reasoning)72.7%$0.01230.1s26%
102Ministral 3 14B69.3%$0.000711.7s21%
103ByteDance Seed 2.0 Mini90.9%$0.00454.9m56%
104MiniMax M2.573.6%$0.00341.3m27%
105Qwen 3.5 27B80.2%$0.0201.6m26%
106Ministral 3 8B68.2%$0.000819.6s18%
107Ministral 3B66.8%$0.00018.1s16%
108Claude Opus 4.6 (Reasoning)84.8%$0.0881.4m39%
109Gemini 3 Flash (Preview)63.0%$0.007819.6s16%
110DeepSeek V3.169.5%$0.00201.8m24%
111Gemini 3 Pro (Preview)71.9%$0.05554.4s28%
112Claude Sonnet 4.667.1%$0.03139.3s18%
113Claude Opus 491.7%$0.2091.4m55%
114Z.AI GLM 4.661.6%$0.006551.5s14%
115Claude Sonnet 4.6 (Reasoning)69.4%$0.0601.2m20%
116Z.AI GLM 4.760.9%$0.0101.4m12%
117Z.AI GLM 4.7 Flash51.6%$0.00171.2m8%
118Mistral Small 3.2 24B80.5%$0.00695.7m30%
Avg. Score: 88.08%

Individual Scenarios

Detailed Writing Rules

Model, #1, #2, #3, #4, #5, Avg (sorted by Avg, descending)
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
MiniMax M2.5100100100100100100.0%
GPT-4.1100100100100100100.0%
Gemini 2.5 Pro100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Opus 4100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Hermes 3 70B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Mistral NeMO100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B100100100100100100.0%
ByteDance Seed 2.0 Lite1001001001009899.7%
Qwen 3.5 35B1001001001009899.6%
Mistral Small 41001001001009899.6%
GPT-4o, May 13th (temp=1)1001001001009799.4%
MiniMax M2.71001001001009699.3%
DeepSeek V3.11001001001009498.8%
Mistral Medium 3.11001001001009498.8%
Qwen 3.5 9B1001001001009398.7%
Inception Mercury1001001001009398.6%
Stealth: Healer Alpha1001001001009097.9%
Aion 2.01001001001008997.8%
ByteDance Seed 1.6 Flash1001001001008296.4%
GPT-4.1 Nano1001001001007895.6%
Claude Sonnet 41001001001007394.5%
Cohere Command R+ (Aug. 2024)1001001001007194.2%
Claude Sonnet 4.61001001001007194.2%
Claude Sonnet 4.6 (Reasoning)1001001001007094.0%
Gemini 2.5 Flash (Reasoning)1001001001007094.0%
Mistral Large 21001001001007094.0%
Qwen 3.5 27B1001001001006993.7%
Stealth: Hunter Alpha1001001001006693.2%
Gemma 3 27B100100100867792.7%
Claude Sonnet 4.5100100100827992.2%
Mistral Large 3100100100797891.3%
Mistral Small Creative100100100807591.0%
DeepSeek V3.21001001001005090.0%
Z.AI GLM 4.61001001001004889.6%
Qwen 2.5 72B100100100747389.4%
Qwen 3.5 Flash1001001001004589.0%
Gemma 3 4B1001001001004589.0%
Hermes 3 405B1001001001003386.7%
Gemini 3 Flash (Preview, Reasoning)1001001001002785.5%
WizardLM 2 8x22b1001001001001282.4%
Gemini 3.1 Flash Lite (Preview)100100100100681.2%
Claude 3.5 Haiku100100100100080.0%
Arcee AI: Trinity Large (Preview)100100100100080.0%
Arcee AI: Trinity Mini100100100100080.0%
Llama 3.1 8B100100100100080.0%
Gemini 3 Pro (Preview)10010010089077.8%
Mistral Large10010097672377.4%
Z.AI GLM 4.7100100100443174.9%
Mistral Small 3.2 24B10010086444274.5%
Qwen 3 32B10010010068073.6%
Ministral 3B10010010050070.0%
Ministral 3 8B10010010028867.1%
Ministral 3 14B1001001002060.4%
Z.AI GLM 4.7 Flash100100295046.8%
Gemini 3 Flash (Preview)93842120043.5%
Ministral 3 3B1008900037.8%
Ministral 8B6352505034.0%
Model, #1, #2, #3, #4, #5, Avg (sorted by Avg, descending)
Gemini 3.1 Pro (Preview)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.1100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
Qwen 3 32B100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
Gemma 3 12B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Hermes 3 70B100100100100100100.0%
LFM2 24B100100100100100100.0%
Mistral Small 410010010010010099.9%
Claude Opus 4.6 (Reasoning)1001001001009899.6%
Z.AI GLM 51001001001009899.6%
ByteDance Seed 2.0 Lite1001001001009599.1%
DeepSeek V3 (2024-12-26)1001001001009398.6%
Claude Opus 41001001001009198.1%
Cohere Command R+ (Aug. 2024)1001001001008897.6%
MiniMax M2.71001001001008697.3%
Claude Opus 4.61001001001008596.9%
ByteDance Seed 1.61001001001008296.5%
Gemini 2.5 Flash Lite (Reasoning)100100100938996.4%
Mistral Large 3100100100938996.4%
Claude Opus 4.5100100100958395.7%
Claude 3 Haiku1001001001007895.6%
GPT-4o, Aug. 6th (temp=1)1001001001007795.5%
Claude Sonnet 4.6 (Reasoning)1001001001007795.3%
Inception Mercury1001001001007695.2%
GPT-4o Mini (temp=1)1001001001007294.5%
DeepSeek V3 (2025-03-24)1001001001006993.9%
GPT-5 Nano10010099898093.5%
Ministral 3 3B100100100946792.2%
Llama 3.1 Nemotron 70B100100100857592.1%
ByteDance Seed 2.0 Mini1001001001006091.9%
Mistral Small 3.2 24B1001001001005991.8%
Qwen 3.5 35B100100100965890.8%
Z.AI GLM 4.5100100100806989.9%
Ministral 3B100100100894887.5%
Gemini 3 Flash (Preview, Reasoning)1001001001003787.3%
GPT-4o, May 13th (temp=0)100100100874686.7%
Claude Sonnet 4.61001001001002584.9%
Mistral Small Creative1001001001002084.0%
Stealth: Aurora Alpha100100100724683.5%
GPT-4o, Aug. 6th (temp=0)1001001001001783.5%
Z.AI GLM 5 Turbo100100100912583.1%
Aion 2.010010097615382.1%
Gemini 2.5 Pro10010092594980.0%
Claude Sonnet 4.5100100100100080.0%
Mistral Large100100100100080.0%
Ministral 3 14B100100100100080.0%
GPT-4.1 Nano100100100643479.6%
Arcee AI: Trinity Mini100100100554179.3%
Claude 3.7 Sonnet10010010095079.0%
Gemini 3 Pro (Preview)10010010090178.3%
Llama 3.1 70B10010086544877.7%
Mistral Large 210010010088077.6%
Arcee AI: Trinity Large (Preview)10010010087077.3%
Gemma 3 4B10010092562374.2%
WizardLM 2 8x22b100100100482173.8%
Claude 3.5 Haiku10010010061072.2%
Qwen 2.5 72B10010069662572.2%
DeepSeek V3.21001009748068.9%
Hermes 3 405B10010010037067.5%
Rocinante 12B10010010028666.9%
MiniMax M2.510010064343266.1%
Writer: Palmyra X51001008041064.3%
Claude Sonnet 41001007246063.6%
Stealth: Healer Alpha10010048461662.1%
Gemini 3 Flash (Preview)1001006933060.4%
Z.AI GLM 4.71001001000060.0%
Llama 3.1 8B1001001000060.0%
Gemini 2.5 Flash Lite100100800056.0%
Mistral NeMO1006664262155.3%
Z.AI GLM 4.6100746630054.1%
Ministral 8B10080710050.1%
Gemma 3 27B73736914045.8%
DeepSeek V3.110068330040.2%
Claude Haiku 4.510088101039.5%
Stealth: Hunter Alpha10081100038.4%
Gemini 3.1 Flash Lite (Preview)804623221437.2%
Ministral 3 8B10040284034.4%
Z.AI GLM 4.7 Flash10037290033.3%
Model, #1, #2, #3, #4, #5, Avg (sorted by Avg, descending)
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
MiniMax M2.5100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Opus 4100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Mistral Large100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Inception Mercury100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Hermes 3 70B100100100100100100.0%
GPT-4.1 Nano100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
LFM2 24B100100100100100100.0%
Mistral Large 21001001001009899.6%
Aion 2.01001001001009598.9%
Z.AI GLM 4.5100100100979698.5%
Hermes 3 405B1001001001009298.5%
Qwen 3.5 35B1001001001009098.0%
Gemini 2.5 Flash Lite (Reasoning)1001001001008997.8%
Mistral Medium 3.11001001001008797.4%
Ministral 3B1001001001008496.8%
Gemini 2.5 Flash1001001001008096.0%
Mistral Small 41001001001007795.3%
Ministral 3 14B1001001001006793.5%
Gemma 3 12B100100100976592.3%
Qwen 3.5 Flash1001001001006192.1%
Z.AI GLM 51001001001005390.7%
Stealth: Hunter Alpha100100100866590.1%
Gemini 3.1 Flash Lite (Preview)1001001001004488.8%
DeepSeek V3.2100100100855688.2%
Gemini 3 Flash (Preview, Reasoning)100100100964588.2%
Nemotron 3 Super100100100974087.4%
Mistral Small Creative1001001001003687.2%
Mistral Small 4 (Reasoning)1001001001003687.2%
Gemma 3 27B10010084757386.3%
Arcee AI: Trinity Large (Preview)10010094854685.1%
Qwen3 235B A22B Instruct 25071001001001002184.3%
Claude 3.7 Sonnet100100100952584.0%
MiniMax M2.71001001001001583.0%
Claude Sonnet 4.510010092774582.9%
Gemini 2.5 Pro10010096803682.4%
Stealth: Healer Alpha100100100100480.8%
Llama 3.1 70B100100100100080.0%
Cohere Command R+ (Aug. 2024)100100100100080.0%
Arcee AI: Trinity Mini10010010093078.7%
Gemini 3 Flash (Preview)1001008969071.5%
Ministral 3 8B10010072502870.2%
Claude Haiku 4.510010010049069.9%
Z.AI GLM 5 Turbo100878675069.6%
Gemini 3 Pro (Preview)1001008457068.2%
Ministral 8B1001009132064.7%
Claude Sonnet 4.6 (Reasoning)1001008029061.7%
DeepSeek V3.11001007015056.9%
Rocinante 12B100665722049.1%
Z.AI GLM 4.795824720048.7%
Z.AI GLM 4.7 Flash10071480043.7%
Model, #1, #2, #3, #4, #5, Avg (sorted by Avg, descending)
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.1100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Hermes 3 70B100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B100100100100100100.0%
Claude Opus 4.61001001001009699.3%
Claude 3 Haiku1001001001009498.8%
Llama 3.1 70B1001001001009298.5%
Cohere Command R+ (Aug. 2024)1001001001009298.5%
Mistral Large1001001001008296.3%
Qwen 3 32B100100100968496.1%
Claude 3.7 Sonnet1001001001008096.0%
ByteDance Seed 1.61001001001006893.7%
Mistral Large 3100100100868093.0%
Gemini 2.5 Flash Lite1001001001006492.8%
Llama 3.1 8B1001001001005891.6%
Z.AI GLM 4.6100100100827691.5%
Gemma 3 4B100100100807691.1%
Qwen 3.5 Flash100100100757489.8%
Qwen 3.5 35B1001001001004989.8%
Gemini 2.5 Flash (Reasoning)1001001001004789.4%
Claude Opus 41001001001003486.8%
Mistral Large 210010091727186.8%
Claude Sonnet 4.51001001001003186.1%
Mistral NeMO1001001001002985.8%
Claude Sonnet 4.6 (Reasoning)1001001001002885.6%
Gemini 2.5 Flash1001001001002384.6%
Gemini 3 Flash (Preview, Reasoning)100100100814184.6%
Qwen 3.5 27B1001001001002084.0%
DeepSeek V3.21001001001001983.7%
Qwen 2.5 72B100100100793783.2%
Stealth: Hunter Alpha1001001001001583.0%
GPT-4o, May 13th (temp=1)1001001001001583.0%
Aion 2.01001001001001482.7%
Gemma 3 12B100100100100881.5%
Mistral Small 4 (Reasoning)10010090813681.4%
MiniMax M2.7100100100100781.4%
ByteDance Seed 2.0 Lite100100100100480.8%
Inception Mercury100100100100080.0%
ByteDance Seed 1.6 Flash100100100100080.0%
Mistral Small Creative100100100100080.0%
Ministral 3 3B100100100100080.0%
DeepSeek V3.1100100100802079.9%
GPT-4o Mini (temp=0)1001009694478.8%
DeepSeek V3 (2024-12-26)10010010091078.1%
Claude Haiku 4.5100100100671776.7%
GPT-4o, May 13th (temp=0)10010099434176.7%
Mistral Small 410010010079075.8%
Arcee AI: Trinity Mini1001009271072.6%
Z.AI GLM 5 Turbo10010010060072.0%
Mistral Small 3.2 24B10010010053070.6%
MiniMax M2.510010010040068.1%
Claude Opus 4.51001008648568.0%
Ministral 3B10010010039067.9%
Gemini 3.1 Flash Lite (Preview)100100100151565.9%
Qwen3 235B A22B Instruct 25071001009229064.3%
Claude 3.5 Haiku10010010015063.0%
Claude Sonnet 4.61009454511162.1%
Arcee AI: Trinity Large (Preview)10010037362860.1%
Z.AI GLM 51001001000060.0%
Stealth: Healer Alpha1001001000060.0%
Gemini 2.5 Flash Lite (Reasoning)1001001000060.0%
Gemma 3 27B1001001000060.0%
Gemini 3 Pro (Preview)1001005624055.9%
GPT-4.1 Nano100100735055.8%
Ministral 3 14B10088640050.4%
Gemini 2.5 Pro1001002611047.4%
Gemini 3 Flash (Preview)100684619046.6%
Ministral 8B10083330043.3%
Ministral 3 8B10062290038.2%
Z.AI GLM 4.71004700029.4%
Z.AI GLM 4.7 Flash622100016.7%
Model, #1, #2, #3, #4, #5, Avg (sorted by Avg, descending)
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3.1100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Inception Mercury100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Hermes 3 70B100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Ministral 3 3B100100100100100100.0%
LFM2 24B100100100100100100.0%
ByteDance Seed 2.0 Lite1001001001009999.8%
Nemotron 3 Super1001001001009799.3%
Z.AI GLM 4.71001001001009699.3%
Claude Sonnet 4.51001001001009398.7%
Qwen 3.5 Flash1001001001008897.6%
Z.AI GLM 4.51001001001008797.4%
Cohere Command R+ (Aug. 2024)1001001001008496.8%
DeepSeek V3.21001001001008496.7%
Claude Opus 4.5100100100968095.2%
Claude Sonnet 4.6 (Reasoning)1001001001007595.0%
Claude Haiku 4.5100100100908494.8%
Z.AI GLM 51001001001007394.7%
Arcee AI: Trinity Mini1001001001007194.1%
Gemma 3 27B1001001001007094.0%
Hermes 3 405B1001001001006793.3%
Gemini 3.1 Flash Lite (Preview)100100100848092.9%
Stealth: Hunter Alpha1001001001005991.8%
Claude 3 Haiku1001001001005791.4%
Gemini 2.5 Flash (Reasoning)1001001001005490.9%
Gemma 3 4B100100100757590.1%
Gemini 3 Flash (Preview, Reasoning)1001001001005090.0%
Gemini 3 Pro (Preview)100100100905889.4%
Ministral 8B100100100915388.9%
ByteDance Seed 1.6 Flash100100100934888.1%
Gemini 2.5 Flash Lite1001001001003486.9%
GPT-4.1 Nano1001001001003086.0%
Mistral Large 3100100100913885.9%
Gemini 3 Flash (Preview)1001001001002985.8%
Mistral Medium 3.1100100100983185.8%
Arcee AI: Trinity Large (Preview)100100100715084.1%
Ministral 3 14B1001001001001583.0%
Aion 2.0100100100833082.6%
Mistral NeMO100100100931581.6%
Ministral 3 8B1008884736281.4%
Gemini 2.5 Pro100100100644080.8%
Claude Sonnet 4.6100100100100080.0%
Stealth: Healer Alpha100100100100080.0%
Mistral Small 4100100100100080.0%
Llama 3.1 8B100100100100080.0%
MiniMax M2.5100100100652678.2%
Gemma 3 12B10010010086478.0%
Ministral 3B1001008579072.8%
Z.AI GLM 4.7 Flash10010087531170.0%
Gemini 2.5 Flash Lite (Reasoning)10010086312668.7%
Z.AI GLM 5 Turbo10010010031667.3%
MiniMax M2.710010010016063.1%
Mistral Small 3.2 24B1001001006061.2%
Rocinante 12B1001001004060.8%
Z.AI GLM 4.61001001000060.0%
Llama 3.1 70B1001001000060.0%
Model, #1, #2, #3, #4, #5, Avg (sorted by Avg, descending)
Gemini 3.1 Pro (Preview)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.1100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
GPT-5.4100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Inception Mercury100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
LFM2 24B100100100100100100.0%
Gemini 2.5 Flash Lite1001001001009899.6%
MiniMax M2.51001001001009298.3%
ByteDance Seed 1.6 Flash1001001001008897.7%
Claude Sonnet 4.61001001001008797.4%
Rocinante 12B1001001001008797.4%
ByteDance Seed 2.0 Mini1001001001008597.0%
Ministral 8B1001001001008496.8%
Claude Opus 4.6 (Reasoning)100100100948495.5%
Claude Opus 4.61001001001007795.3%
Mistral Medium 3.1100100100978095.2%
Qwen 3.5 Flash1001001001007093.9%
WizardLM 2 8x22b1001001001006593.1%
Z.AI GLM 4.7 Flash1001001001004989.9%
Arcee AI: Trinity Large (Preview)1001001001004689.2%
Qwen 3 32B1001001001004689.2%
Llama 3.1 Nemotron 70B1001001001004589.0%
Claude 3.5 Sonnet1001001001004388.6%
Aion 2.01001001001004188.3%
DeepSeek V3.2100100100845587.9%
Mistral Large 31001001001003887.6%
DeepSeek V3 (2025-03-24)1001001001003787.5%
ByteDance Seed 2.0 Lite1001001001003687.3%
Claude Opus 4.5100100100954087.1%
Mistral Small 4 (Reasoning)100100100973286.0%
GPT-4o, May 13th (temp=0)10010085686884.3%
Z.AI GLM 5 Turbo1001001001002184.2%
GPT-4o Mini (temp=1)1001001001001983.8%
Gemini 2.5 Flash (Reasoning)1001001001001182.2%
Qwen 2.5 72B100100100575381.8%
Mistral NeMO10010092674180.1%
Qwen 3.5 27B100100100100080.0%
MiniMax M2.7100100100100080.0%
Claude Sonnet 4.5100100100100080.0%
DeepSeek-V2 Chat100100100100080.0%
Gemini 2.5 Flash100100100100080.0%
Mistral Small Creative100100100100080.0%
Arcee AI: Trinity Mini100100100100080.0%
Gemini 3 Flash (Preview, Reasoning)100100100702679.3%
Claude Sonnet 410010010095078.9%
Hermes 3 70B100100100682578.5%
Z.AI GLM 5100100100642577.7%
GPT-4.1 Nano1001009593077.7%
Gemini 3 Flash (Preview)1001009981877.6%
GPT-4o, Aug. 6th (temp=0)100100100771077.2%
Claude Haiku 4.5100100100523377.0%
Stealth: Healer Alpha100958784073.3%
Gemini 2.5 Flash Lite (Reasoning)10010010060071.9%
Mistral Large 210010093471971.8%
Claude Sonnet 4.6 (Reasoning)1001009463071.3%
Stealth: Hunter Alpha10010065513570.3%
Ministral 3 8B10010010037067.5%
Llama 3.1 70B10010010035066.9%
DeepSeek V3.110010058512166.0%
Claude 3.5 Haiku100877461064.5%
Gemini 3 Pro (Preview)1007759422660.9%
Mistral Small 41001007128059.9%
Ministral 3 14B1001004746058.6%
Gemini 2.5 Pro1005342383253.1%
Gemini 3.1 Flash Lite (Preview)1001002617048.6%
Gemma 3 12B10065488044.2%
Gemma 3 4B10094270044.2%
Ministral 3 3B100100150043.0%
Mistral Small 3.2 24B10010000040.0%
Ministral 3B10010000040.0%
Mistral Large10069100035.8%
Z.AI GLM 4.610045250034.1%
Gemma 3 27B10057110033.6%
Z.AI GLM 4.7891085022.5%

genre

Model, #1, #2, #3, #4, #5, Avg (sorted by Avg, descending)
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.1100100100100100100.0%
GPT-5100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
GPT-5.4100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Mistral Small 4100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Arcee AI: Trinity Mini100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Claude Opus 4.6 (Reasoning)1001001001009799.4%
DeepSeek V3.21001001001009398.6%
LFM2 24B1001001001009198.3%
GPT-5.4 Nano (Reasoning, Low)100100100969498.1%
Claude Opus 4.61001001001009098.0%
Qwen 3 32B1001001001008697.3%
Gemini 2.5 Flash (Reasoning)1001001001008597.0%
GPT-5.4 Nano1001001001008597.0%
Hermes 3 70B1001001001008496.7%
Llama 3.1 8B1001001001007995.8%
GPT-4o, May 13th (temp=1)1001001001007895.6%
GPT-5.4 Mini (Reasoning, Low)100100100948495.6%
ByteDance Seed 1.6 Flash1001001001007795.4%
GPT-4.1 Nano1001001001007795.3%
Z.AI GLM 5 Turbo1001001001007695.2%
MoonshotAI: Kimi K2.51001001001007494.8%
Qwen 3.5 9B1001001001007494.7%
Qwen 3.5 397B A17B1001001001006592.9%
Aion 2.01001001001006592.9%
Qwen 3.5 Plus (2026-02-15)1001001001006392.6%
Gemini 3.1 Pro (Preview)10010093917191.0%
DeepSeek V3 (2024-12-26)1001001001005390.6%
MiniMax M2.710010094817890.5%
Gemma 3 4B100100100975490.1%
ByteDance Seed 2.0 Lite1001001001004689.2%
Gemini 2.5 Pro10010099836088.3%
Gemini 3.1 Flash Lite (Preview)100100100736988.3%
Claude Sonnet 41001001001003787.3%
Writer: Palmyra X51001001001003587.1%
Claude Haiku 4.5100100100864987.0%
Z.AI GLM 4.51001001001003386.7%
Claude 3 Haiku1001001001002685.1%
Mistral Small 4 (Reasoning)1001001001002384.5%
Gemini 2.5 Flash100100100961782.7%
Mistral Medium 3.1100100100981081.8%
Gemini 3 Flash (Preview)10010082735381.6%
GPT-4o, May 13th (temp=0)100100100792881.3%
Stealth: Healer Alpha100100100654081.0%
GPT-5.4 Nano (Reasoning)100100100614180.4%
Qwen 3.5 122B100100100100080.0%
Mistral Large100100100100080.0%
Mistral Small 3.2 24B100100100100080.0%
Z.AI GLM 4.610010083595479.2%
Claude Opus 4.510010010093078.6%
Qwen 3.5 27B1001009594077.8%
Mistral Large 310010010077075.5%
ByteDance Seed 2.0 Mini100100100541273.2%
Gemini 3 Pro (Preview)10010096501872.8%
Rocinante 12B10010010063072.6%
Claude 3.5 Sonnet100100100441872.4%
Arcee AI: Trinity Large (Preview)10010010058071.6%
Gemma 3 27B10010068553170.9%
Stealth: Hunter Alpha1001009047067.3%
MiniMax M2.5100100100211166.2%
Ministral 3 14B10010010027065.4%
DeepSeek V3.110010010026065.3%
Qwen 3.5 35B1007159532762.1%
Claude Sonnet 4.61001001000060.0%
Claude 3.5 Haiku1001001000060.0%
Ministral 3B1001001000060.0%
Gemini 3 Flash (Preview, Reasoning)100836352059.6%
Llama 3.1 70B1001007720059.3%
Ministral 8B1001004840057.6%
Qwen 3.5 Flash100100730054.6%
Inception Mercury10079770051.1%
Gemma 3 12B1001003210248.7%
Mistral Small Creative10076542046.4%
Z.AI GLM 4.7 Flash100603416041.9%
Z.AI GLM 4.7422550014.5%
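Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 99 | 99.8%
Mistral Large | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 92 | 98.4%
Stealth: Aurora Alpha | 100 | 100 | 100 | 99 | 89 | 97.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 88 | 97.6%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 88 | 97.5%
Claude 3.7 Sonnet | 100 | 100 | 100 | 94 | 93 | 97.4%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 86 | 97.1%
Mistral Small 4 | 100 | 100 | 99 | 97 | 89 | 97.1%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 84 | 96.7%
GPT-4.1 | 100 | 100 | 100 | 100 | 83 | 96.6%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 94 | 87 | 96.2%
Hermes 3 70B | 100 | 100 | 100 | 100 | 81 | 96.2%
GPT-5 Mini | 100 | 100 | 100 | 100 | 80 | 96.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 79 | 95.9%
Inception Mercury 2 | 100 | 100 | 100 | 89 | 86 | 94.9%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 72 | 94.4%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 66 | 93.3%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 88 | 78 | 93.3%
Qwen 2.5 72B | 100 | 100 | 100 | 84 | 81 | 92.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 92 | 71 | 92.6%
Claude Sonnet 4.5 | 100 | 100 | 98 | 97 | 67 | 92.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 88 | 70 | 91.6%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 51 | 90.1%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 50 | 90.0%
Claude Opus 4.5 | 100 | 100 | 85 | 78 | 68 | 86.1%
Gemma 3 12B | 100 | 100 | 91 | 82 | 50 | 84.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 19 | 83.7%
Mistral Large 3 | 100 | 100 | 100 | 86 | 29 | 83.1%
Nemotron 3 Nano | 100 | 100 | 100 | 64 | 48 | 82.3%
Mistral Medium 3.1 | 100 | 100 | 100 | 95 | 16 | 82.1%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 67 | 42 | 81.8%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 5 | 81.1%
DeepSeek-V2 Chat | 100 | 100 | 100 | 77 | 28 | 80.9%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 4 (Reasoning) | 100 | 96 | 88 | 77 | 36 | 79.5%
Qwen 3 32B | 100 | 100 | 100 | 72 | 22 | 78.7%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 96 | 47 | 46 | 77.8%
Ministral 3 8B | 100 | 100 | 100 | 74 | 13 | 77.4%
Grok 4.20 (Beta, Reasoning) | 100 | 95 | 90 | 51 | 45 | 76.0%
Gemini 2.5 Flash | 100 | 100 | 76 | 72 | 26 | 74.7%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 86 | 64 | 23 | 74.6%
Claude 3.5 Sonnet | 100 | 100 | 94 | 56 | 21 | 74.3%
Writer: Palmyra X5 | 100 | 100 | 100 | 54 | 17 | 74.2%
Qwen 3.5 397B A17B | 100 | 92 | 72 | 53 | 52 | 74.0%
Z.AI GLM 4.6 | 100 | 100 | 87 | 48 | 19 | 70.8%
DeepSeek V3.2 | 100 | 79 | 79 | 69 | 25 | 70.5%
ByteDance Seed 2.0 Lite | 100 | 100 | 87 | 62 | 0 | 69.9%
ByteDance Seed 1.6 Flash | 100 | 100 | 80 | 62 | 7 | 69.9%
Z.AI GLM 4.5 | 100 | 94 | 91 | 48 | 7 | 68.1%
Mistral Small Creative | 100 | 100 | 100 | 31 | 0 | 66.1%
Stealth: Hunter Alpha | 100 | 82 | 74 | 47 | 18 | 64.2%
Rocinante 12B | 100 | 100 | 100 | 14 | 0 | 62.8%
Qwen 3.5 Plus (2026-02-15) | 100 | 87 | 78 | 46 | 1 | 62.6%
Gemini 2.5 Pro | 100 | 81 | 75 | 48 | 7 | 62.2%
Aion 2.0 | 100 | 100 | 94 | 15 | 0 | 61.7%
Gemini 2.5 Flash (Reasoning) | 100 | 96 | 82 | 9 | 2 | 57.8%
Gemini 2.5 Flash Lite | 100 | 100 | 89 | 0 | 0 | 57.8%
Mistral Large 2 | 100 | 100 | 53 | 28 | 2 | 56.7%
MiniMax M2.5 | 100 | 88 | 72 | 21 | 0 | 56.1%
Claude Sonnet 4 | 100 | 83 | 53 | 29 | 14 | 55.9%
Claude Sonnet 4.6 | 100 | 100 | 46 | 32 | 0 | 55.5%
DeepSeek V3.1 | 100 | 100 | 38 | 37 | 0 | 55.0%
Claude Haiku 4.5 | 100 | 98 | 71 | 0 | 0 | 53.8%
ByteDance Seed 1.6 | 100 | 90 | 79 | 0 | 0 | 53.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 85 | 72 | 10 | 0 | 53.4%
Gemma 3 4B | 100 | 100 | 39 | 23 | 0 | 52.6%
Ministral 3 3B | 100 | 74 | 68 | 5 | 0 | 49.3%
GPT-4.1 Nano | 100 | 100 | 43 | 0 | 0 | 48.7%
Z.AI GLM 5 Turbo | 100 | 60 | 48 | 20 | 0 | 45.6%
Ministral 8B | 100 | 100 | 13 | 0 | 0 | 42.5%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 12 | 0 | 0 | 42.4%
Claude Opus 4.6 (Reasoning) | 79 | 75 | 38 | 0 | 0 | 38.4%
Ministral 3 14B | 100 | 31 | 31 | 27 | 0 | 37.9%
Qwen 3.5 9B | 100 | 72 | 11 | 0 | 0 | 36.6%
Z.AI GLM 5 | 100 | 67 | 4 | 0 | 0 | 34.1%
Z.AI GLM 4.7 | 100 | 57 | 8 | 0 | 0 | 32.9%
Claude Opus 4.6 | 89 | 62 | 13 | 0 | 0 | 32.8%
MiniMax M2.7 | 90 | 24 | 23 | 15 | 9 | 32.4%
Stealth: Healer Alpha | 69 | 50 | 29 | 12 | 0 | 32.1%
Inception Mercury | 100 | 54 | 0 | 0 | 0 | 30.8%
Qwen 3.5 Flash | 100 | 49 | 0 | 0 | 0 | 29.8%
Gemini 3.1 Flash Lite (Preview) | 73 | 40 | 28 | 0 | 0 | 28.2%
Gemini 3 Flash (Preview) | 59 | 46 | 26 | 0 | 0 | 26.3%
Gemini 3 Pro (Preview) | 100 | 29 | 0 | 0 | 0 | 25.7%
Z.AI GLM 4.7 Flash | 77 | 17 | 0 | 0 | 0 | 18.9%
Gemini 3.1 Pro (Preview) | 69 | 12 | 6 | 0 | 0 | 17.4%
Claude 3.5 Haiku | 41 | 41 | 4 | 0 | 0 | 17.3%
Qwen 3.5 122B | 63 | 0 | 0 | 0 | 0 | 12.5%
Gemma 3 27B | 50 | 8 | 3 | 0 | 0 | 12.3%
Ministral 3B | 61 | 0 | 0 | 0 | 0 | 12.2%
Qwen 3.5 27B | 54 | 0 | 0 | 0 | 0 | 10.9%
Qwen 3.5 35B | 47 | 5 | 0 | 0 | 0 | 10.4%
Gemini 3 Flash (Preview, Reasoning) | 30 | 0 | 0 | 0 | 0 | 6.0%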
Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 99.9%
GPT-5.1 | 100 | 100 | 100 | 100 | 97 | 99.3%
Qwen 3 32B | 100 | 100 | 100 | 100 | 93 | 98.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 92 | 98.5%
Gemma 3 27B | 100 | 100 | 100 | 100 | 90 | 97.9%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 87 | 97.4%
Gemma 3 12B | 100 | 100 | 100 | 100 | 84 | 96.7%
Mistral NeMO | 100 | 100 | 100 | 100 | 84 | 96.7%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 79 | 95.8%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 72 | 94.5%
Claude 3.7 Sonnet | 100 | 100 | 100 | 98 | 73 | 94.4%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Flash | 100 | 100 | 98 | 84 | 83 | 93.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 65 | 93.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 65 | 93.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 64 | 92.8%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 97 | 66 | 92.7%
Rocinante 12B | 100 | 100 | 100 | 100 | 60 | 92.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 58 | 91.6%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 58 | 91.6%
Aion 2.0 | 100 | 100 | 100 | 100 | 58 | 91.5%
Claude Haiku 4.5 | 100 | 100 | 90 | 84 | 79 | 90.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 53 | 90.5%
Hermes 3 405B | 100 | 100 | 100 | 100 | 51 | 90.1%
Ministral 8B | 100 | 100 | 100 | 100 | 48 | 89.7%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 44 | 88.8%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 43 | 88.6%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 43 | 88.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 43 | 88.6%
Stealth: Healer Alpha | 100 | 100 | 100 | 79 | 62 | 88.2%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 38 | 87.6%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 34 | 86.8%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 31 | 86.2%
Claude Opus 4.6 | 100 | 100 | 100 | 76 | 53 | 85.9%
Mistral Large 3 | 100 | 100 | 100 | 100 | 29 | 85.8%
Qwen 3.5 122B | 100 | 100 | 94 | 87 | 36 | 83.5%
Gemini 3 Pro (Preview) | 100 | 100 | 91 | 64 | 61 | 83.1%
Qwen 3.5 27B | 100 | 100 | 100 | 98 | 16 | 82.9%
MoonshotAI: Kimi K2.5 | 100 | 100 | 75 | 69 | 69 | 82.7%
Claude Opus 4 | 100 | 100 | 100 | 67 | 42 | 81.8%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 6 | 81.2%
Gemini 2.5 Pro | 100 | 100 | 84 | 66 | 56 | 81.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 89 | 16 | 81.0%
Mistral Large | 100 | 100 | 100 | 64 | 37 | 80.1%
Ministral 3 3B | 100 | 100 | 100 | 65 | 35 | 80.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 99 | 0 | 79.9%
DeepSeek V3.2 | 100 | 100 | 100 | 58 | 40 | 79.7%
ByteDance Seed 2.0 Lite | 100 | 100 | 96 | 95 | 2 | 78.5%
Mistral Small 4 | 100 | 100 | 100 | 53 | 35 | 77.8%
Z.AI GLM 5 Turbo | 100 | 100 | 96 | 65 | 27 | 77.5%
MiniMax M2.7 | 100 | 95 | 81 | 70 | 40 | 77.2%
Mistral Medium 3.1 | 100 | 100 | 100 | 59 | 18 | 75.5%
Z.AI GLM 5 | 100 | 100 | 90 | 82 | 3 | 74.9%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 91 | 82 | 0 | 74.7%
Ministral 3B | 100 | 79 | 66 | 64 | 62 | 74.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 61 | 0 | 72.1%
Qwen 3.5 35B | 100 | 100 | 100 | 56 | 0 | 71.3%
Mistral Large 2 | 100 | 100 | 86 | 33 | 17 | 67.2%
Gemini 3 Flash (Preview) | 100 | 100 | 64 | 33 | 20 | 63.4%
MiniMax M2.5 | 100 | 96 | 63 | 53 | 0 | 62.6%
DeepSeek V3.1 | 100 | 100 | 41 | 26 | 18 | 57.1%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 72 | 11 | 0 | 56.7%
Gemini 3 Flash (Preview, Reasoning) | 100 | 85 | 46 | 35 | 8 | 54.9%
Ministral 3 8B | 100 | 100 | 33 | 0 | 0 | 46.5%
Qwen 3.5 9B | 100 | 53 | 23 | 15 | 12 | 40.4%
Claude Sonnet 4.6 | 85 | 46 | 0 | 0 | 0 | 26.0%
Z.AI GLM 4.6 | 59 | 38 | 25 | 3 | 0 | 25.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 98 | 99.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 95 | 99.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 89 | 97.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 84 | 96.8%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 80 | 96.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 79 | 95.8%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 75 | 94.9%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 74 | 94.8%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 69 | 93.9%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 96 | 73 | 93.9%
Mistral NeMO | 100 | 100 | 100 | 88 | 80 | 93.6%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 68 | 93.6%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 92 | 74 | 93.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 63 | 92.6%
Gemma 3 4B | 100 | 100 | 100 | 100 | 60 | 92.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 53 | 90.6%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 52 | 90.4%
ByteDance Seed 1.6 | 100 | 100 | 100 | 96 | 46 | 88.4%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 80 | 62 | 88.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 41 | 88.1%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 92 | 37 | 85.8%
Stealth: Aurora Alpha | 100 | 100 | 100 | 80 | 43 | 84.6%
GPT-5.4 Mini | 100 | 100 | 100 | 84 | 37 | 84.4%
Gemma 3 27B | 100 | 100 | 100 | 97 | 22 | 83.9%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 83 | 36 | 83.8%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 3.1 Pro (Preview) | 100 | 92 | 76 | 65 | 59 | 78.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 99 | 70 | 13 | 76.3%
Mistral Medium 3.1 | 100 | 100 | 100 | 62 | 17 | 75.8%
Claude Opus 4.5 | 100 | 98 | 70 | 59 | 50 | 75.5%
Mistral Small Creative | 100 | 100 | 100 | 59 | 14 | 74.6%
Z.AI GLM 4.5 | 100 | 100 | 96 | 48 | 21 | 73.1%
Claude 3.5 Sonnet | 100 | 100 | 83 | 79 | 0 | 72.4%
Claude 3 Haiku | 100 | 100 | 100 | 40 | 18 | 71.6%
Stealth: Healer Alpha | 100 | 100 | 100 | 57 | 0 | 71.4%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 56 | 0 | 71.3%
Grok 4.20 (Beta) | 100 | 83 | 77 | 46 | 46 | 70.6%
Gemma 3 12B | 100 | 100 | 100 | 52 | 0 | 70.4%
Mistral Large 3 | 100 | 100 | 73 | 40 | 36 | 69.8%
Qwen 3.5 Flash | 100 | 100 | 100 | 37 | 0 | 67.5%
WizardLM 2 8x22b | 100 | 100 | 73 | 64 | 0 | 67.5%
DeepSeek V3.2 | 100 | 90 | 88 | 54 | 0 | 66.4%
GPT-4.1 Nano | 100 | 100 | 77 | 55 | 0 | 66.3%
Mistral Small 3.2 24B | 100 | 89 | 78 | 31 | 30 | 65.6%
Aion 2.0 | 100 | 100 | 100 | 25 | 0 | 65.1%
Claude Opus 4.6 | 100 | 100 | 61 | 43 | 17 | 64.0%
Mistral Small 4 | 100 | 79 | 72 | 63 | 0 | 62.6%
Claude Opus 4 | 100 | 100 | 100 | 6 | 4 | 62.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 86 | 23 | 0 | 61.7%
Mistral Large | 100 | 100 | 53 | 47 | 0 | 60.0%
Qwen 3.5 9B | 100 | 100 | 100 | 0 | 0 | 60.0%
Inception Mercury | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3 3B | 100 | 100 | 82 | 8 | 0 | 58.0%
Claude Sonnet 4 | 100 | 80 | 68 | 41 | 0 | 57.8%
Z.AI GLM 5 | 100 | 100 | 53 | 7 | 0 | 52.2%
Mistral Large 2 | 100 | 81 | 52 | 23 | 0 | 51.2%
Ministral 8B | 100 | 86 | 53 | 17 | 0 | 51.1%
Gemini 3 Pro (Preview) | 100 | 79 | 45 | 31 | 0 | 50.9%
Ministral 3 14B | 100 | 77 | 75 | 0 | 0 | 50.5%
Z.AI GLM 4.7 | 100 | 68 | 63 | 5 | 0 | 47.1%
DeepSeek V3.1 | 100 | 99 | 26 | 11 | 0 | 47.1%
Gemini 3 Flash (Preview, Reasoning) | 100 | 85 | 32 | 11 | 0 | 45.8%
Claude Haiku 4.5 | 100 | 100 | 26 | 0 | 0 | 45.1%
MiniMax M2.5 | 100 | 72 | 52 | 0 | 0 | 44.9%
Stealth: Hunter Alpha | 100 | 96 | 24 | 0 | 0 | 44.0%
Claude Sonnet 4.6 | 100 | 49 | 42 | 28 | 0 | 43.9%
Gemini 3.1 Flash Lite (Preview) | 78 | 45 | 45 | 41 | 0 | 41.6%
Ministral 3B | 100 | 79 | 29 | 0 | 0 | 41.6%
Ministral 3 8B | 100 | 100 | 0 | 0 | 0 | 40.0%
Qwen 3.5 122B | 100 | 50 | 49 | 0 | 0 | 39.9%
Qwen 3.5 397B A17B | 77 | 75 | 14 | 14 | 0 | 35.9%
Z.AI GLM 4.7 Flash | 100 | 68 | 0 | 0 | 0 | 33.6%
Z.AI GLM 5 Turbo | 62 | 39 | 34 | 0 | 0 | 27.0%
Gemini 2.5 Pro | 96 | 19 | 12 | 0 | 0 | 25.4%
Qwen 3.5 27B | 87 | 39 | 0 | 0 | 0 | 25.2%
Qwen 3.5 35B | 56 | 51 | 0 | 0 | 0 | 21.4%
Claude Sonnet 4.6 (Reasoning) | 100 | 0 | 0 | 0 | 0 | 20.0%
Z.AI GLM 4.6 | 58 | 0 | 0 | 0 | 0 | 11.6%
Gemini 3 Flash (Preview) | 1 | 0 | 0 | 0 | 0 | 0.2%
Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 97 | 99.4%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 93 | 98.6%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 92 | 98.4%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 91 | 98.1%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 88 | 97.7%
Stealth: Hunter Alpha | 100 | 100 | 100 | 99 | 89 | 97.7%
o4 Mini | 100 | 100 | 100 | 100 | 86 | 97.3%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 83 | 96.6%
GPT-5.4 Nano | 100 | 100 | 100 | 91 | 91 | 96.6%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 98 | 82 | 96.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 71 | 94.1%
Gemma 3 27B | 100 | 100 | 100 | 100 | 70 | 94.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 14B | 100 | 100 | 98 | 90 | 77 | 93.2%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 63 | 92.6%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 63 | 92.5%
Mistral Large | 100 | 100 | 100 | 94 | 66 | 91.9%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 57 | 91.3%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 56 | 91.1%
Claude 3.5 Sonnet | 100 | 100 | 100 | 87 | 65 | 90.5%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 52 | 90.4%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 47 | 89.3%
Mistral Large 2 | 100 | 100 | 100 | 100 | 44 | 88.8%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 43 | 88.6%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 96 | 46 | 88.5%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 41 | 88.3%
Qwen 3.5 35B | 100 | 100 | 100 | 79 | 60 | 87.9%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 37 | 87.5%
Mistral Medium 3.1 | 100 | 100 | 100 | 87 | 50 | 87.3%
Gemini 2.5 Pro | 100 | 100 | 100 | 70 | 65 | 87.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 27 | 85.5%
Inception Mercury | 100 | 100 | 100 | 100 | 26 | 85.1%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 25 | 84.9%
Rocinante 12B | 100 | 100 | 100 | 100 | 25 | 84.9%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 79 | 45 | 84.7%
Gemini 3 Flash (Preview, Reasoning) | 100 | 94 | 82 | 77 | 63 | 83.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 15 | 83.0%
Claude Opus 4.5 | 100 | 100 | 93 | 63 | 58 | 82.9%
Aion 2.0 | 100 | 100 | 97 | 83 | 27 | 81.6%
Claude Opus 4 | 100 | 100 | 100 | 73 | 34 | 81.4%
Hermes 3 70B | 100 | 100 | 100 | 100 | 6 | 81.2%
Z.AI GLM 5 | 100 | 100 | 100 | 95 | 9 | 80.8%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 88 | 16 | 80.7%
DeepSeek-V2 Chat | 100 | 100 | 97 | 55 | 51 | 80.6%
Claude Opus 4.6 | 100 | 100 | 99 | 99 | 4 | 80.4%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 8B | 100 | 100 | 100 | 94 | 0 | 78.8%
DeepSeek V3.2 | 100 | 100 | 100 | 91 | 0 | 78.3%
Z.AI GLM 4.7 | 100 | 100 | 100 | 88 | 0 | 77.7%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 95 | 77 | 12 | 76.8%
Mistral Large 3 | 100 | 100 | 100 | 84 | 0 | 76.7%
MiniMax M2.7 | 100 | 100 | 100 | 63 | 18 | 76.1%
Z.AI GLM 4.6 | 100 | 100 | 100 | 53 | 25 | 75.6%
Ministral 8B | 100 | 100 | 100 | 71 | 0 | 74.2%
DeepSeek V3 (2024-12-26) | 100 | 83 | 71 | 68 | 37 | 71.8%
Z.AI GLM 4.7 Flash | 100 | 100 | 81 | 63 | 2 | 69.4%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 64 | 36 | 33 | 66.7%
Mistral Small Creative | 100 | 100 | 100 | 19 | 0 | 63.7%
Claude Opus 4.6 (Reasoning) | 100 | 78 | 67 | 46 | 17 | 61.5%
Gemma 3 4B | 100 | 100 | 73 | 25 | 0 | 59.5%
Claude Haiku 4.5 | 100 | 100 | 46 | 41 | 0 | 57.4%
Qwen 3.5 27B | 100 | 100 | 36 | 28 | 0 | 52.8%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 44 | 13 | 0 | 51.5%
Gemini 3 Pro (Preview) | 100 | 53 | 37 | 23 | 2 | 43.0%
Ministral 3B | 100 | 85 | 0 | 0 | 0 | 37.0%
Claude Sonnet 4.6 | 82 | 80 | 2 | 0 | 0 | 32.9%
MiniMax M2.5 | 86 | 45 | 26 | 0 | 0 | 31.5%
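Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 98 | 99.6%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 96 | 99.3%
Grok 4 Fast | 100 | 100 | 100 | 100 | 96 | 99.2%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 88 | 97.7%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 87 | 97.5%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 80 | 96.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 79 | 95.8%
Mistral Large 2 | 100 | 100 | 100 | 94 | 80 | 94.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 94 | 74 | 93.7%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 67 | 93.5%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 65 | 93.1%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 84 | 81 | 93.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 57 | 91.3%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 84 | 72 | 91.2%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 54 | 90.8%
Claude 3 Haiku | 100 | 100 | 100 | 81 | 72 | 90.5%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 95 | 86 | 69 | 89.9%
Rocinante 12B | 100 | 100 | 100 | 100 | 48 | 89.7%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 47 | 89.5%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 74 | 73 | 89.5%
DeepSeek V3.2 | 100 | 100 | 96 | 89 | 59 | 88.9%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 43 | 88.5%
Hermes 3 405B | 100 | 100 | 100 | 100 | 40 | 88.1%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 33 | 86.7%
Qwen 2.5 72B | 100 | 100 | 100 | 86 | 45 | 86.2%
Nemotron 3 Super | 100 | 100 | 100 | 75 | 53 | 85.6%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 72 | 42 | 82.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 88 | 78 | 35 | 80.2%
Mistral Small 4 | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 0 | 80.0%
Inception Mercury | 100 | 100 | 100 | 95 | 0 | 79.1%
Claude Opus 4 | 100 | 100 | 100 | 94 | 0 | 78.8%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 76 | 69 | 48 | 78.5%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 91 | 0 | 78.1%
Gemma 3 4B | 100 | 100 | 88 | 66 | 30 | 76.8%
MiniMax M2.7 | 100 | 100 | 100 | 44 | 37 | 76.4%
Mistral Small 3.2 24B | 100 | 100 | 100 | 44 | 31 | 74.8%
Z.AI GLM 4.5 | 100 | 100 | 84 | 78 | 0 | 72.5%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 61 | 0 | 72.2%
Grok 4.20 (Beta) | 100 | 89 | 74 | 60 | 21 | 68.9%
Qwen 3.5 9B | 100 | 100 | 74 | 69 | 0 | 68.7%
Llama 3.1 70B | 100 | 100 | 85 | 37 | 16 | 67.7%
Qwen 3.5 397B A17B | 100 | 97 | 76 | 44 | 20 | 67.4%
Gemini 2.5 Flash Lite | 100 | 100 | 53 | 53 | 21 | 65.5%
Mistral NeMO | 100 | 100 | 72 | 42 | 0 | 63.0%
GPT-5.4 Mini | 100 | 100 | 90 | 21 | 0 | 62.1%
Gemini 3 Flash (Preview) | 100 | 100 | 66 | 41 | 0 | 61.2%
Qwen 3.5 Flash | 100 | 100 | 100 | 2 | 0 | 60.3%
Claude Opus 4.6 (Reasoning) | 100 | 90 | 83 | 27 | 0 | 59.9%
Mistral Small Creative | 100 | 100 | 94 | 0 | 0 | 58.8%
Mistral Large | 100 | 72 | 49 | 46 | 23 | 57.9%
ByteDance Seed 2.0 Mini | 100 | 94 | 53 | 41 | 0 | 57.8%
MiniMax M2.5 | 100 | 80 | 57 | 44 | 0 | 56.2%
Ministral 8B | 100 | 71 | 56 | 32 | 19 | 55.6%
Mistral Large 3 | 100 | 75 | 62 | 39 | 1 | 55.3%
Stealth: Healer Alpha | 100 | 81 | 56 | 22 | 14 | 54.6%
Claude Opus 4.5 | 100 | 100 | 55 | 2 | 0 | 51.5%
Gemini 2.5 Flash | 100 | 100 | 49 | 0 | 0 | 49.9%
Gemma 3 12B | 100 | 70 | 63 | 14 | 0 | 49.3%
Claude Sonnet 4.6 | 100 | 100 | 40 | 0 | 0 | 48.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 95 | 39 | 5 | 0 | 47.9%
Claude 3.5 Haiku | 100 | 100 | 33 | 0 | 0 | 46.7%
DeepSeek V3.1 | 100 | 84 | 29 | 15 | 0 | 45.5%
Qwen 3.5 35B | 82 | 77 | 52 | 14 | 0 | 45.0%
Stealth: Hunter Alpha | 100 | 62 | 41 | 17 | 0 | 44.0%
Z.AI GLM 5 Turbo | 100 | 76 | 31 | 13 | 0 | 44.0%
Gemini 2.5 Pro | 100 | 98 | 8 | 0 | 0 | 41.3%
Z.AI GLM 5 | 100 | 92 | 10 | 2 | 0 | 40.8%
Qwen 3 32B | 100 | 86 | 11 | 0 | 0 | 39.4%
Gemini 3 Pro (Preview) | 100 | 50 | 38 | 9 | 0 | 39.4%
Claude Opus 4.6 | 91 | 53 | 51 | 0 | 0 | 39.1%
Ministral 3B | 100 | 66 | 25 | 0 | 0 | 38.1%
Qwen 3.5 27B | 100 | 59 | 26 | 0 | 0 | 37.1%
Aion 2.0 | 91 | 72 | 21 | 0 | 0 | 36.9%
Gemini 3 Flash (Preview, Reasoning) | 100 | 48 | 26 | 0 | 0 | 34.9%
Claude Sonnet 4.6 (Reasoning) | 79 | 48 | 41 | 0 | 0 | 33.6%
Ministral 3 14B | 59 | 36 | 31 | 22 | 3 | 30.4%
Claude Haiku 4.5 | 100 | 23 | 21 | 0 | 0 | 28.7%
Z.AI GLM 4.6 | 100 | 33 | 0 | 0 | 0 | 26.5%
Ministral 3 8B | 64 | 60 | 7 | 0 | 0 | 26.3%
Z.AI GLM 4.7 Flash | 67 | 56 | 0 | 0 | 0 | 24.7%
Ministral 3 3B | 47 | 43 | 29 | 0 | 0 | 23.8%
Qwen 3.5 122B | 88 | 28 | 0 | 0 | 0 | 23.3%
Z.AI GLM 4.7 | 60 | 27 | 8 | 0 | 0 | 18.9%
Gemma 3 27B | 61 | 18 | 5 | 0 | 0 | 16.8%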

Novelcrafter Default Prompt

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 99.9%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude Opus 4 | 100 | 100 | 100 | 100 | 97 | 99.5%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 95 | 99.1%
Mistral Large 3 | 100 | 100 | 100 | 100 | 95 | 98.9%
GPT-5 Nano | 100 | 100 | 100 | 100 | 94 | 98.8%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 92 | 98.5%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 91 | 98.1%
Mistral Small Creative | 100 | 100 | 100 | 100 | 87 | 97.3%
Mistral Large 2 | 100 | 100 | 100 | 100 | 86 | 97.3%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 83 | 96.7%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 83 | 96.6%
Gemma 3 4B | 100 | 100 | 100 | 100 | 80 | 96.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 66 | 93.2%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 65 | 93.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 64 | 92.9%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 63 | 92.6%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 55 | 91.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 54 | 90.9%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 52 | 90.4%
Claude Opus 4.6 | 100 | 100 | 96 | 87 | 61 | 89.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 41 | 88.3%
Claude Opus 4.5 | 100 | 100 | 100 | 94 | 44 | 87.6%
Rocinante 12B | 100 | 100 | 100 | 95 | 40 | 87.1%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 34 | 86.9%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 28 | 85.6%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 25 | 85.1%
Gemma 3 12B | 100 | 100 | 100 | 100 | 25 | 84.9%
Mistral Medium 3.1 | 100 | 100 | 100 | 71 | 41 | 82.4%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 11 | 82.2%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 96 | 0 | 79.2%
Ministral 3 3B | 100 | 100 | 100 | 89 | 0 | 77.8%
Ministral 3 14B | 100 | 94 | 77 | 71 | 40 | 76.4%
Claude Haiku 4.5 | 100 | 100 | 78 | 68 | 0 | 69.1%
Z.AI GLM 4.7 Flash | 100 | 100 | 52 | 0 | 0 | 50.4%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 95 | 99.1%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 86 | 97.1%
Mistral NeMO | 100 | 100 | 100 | 100 | 83 | 96.5%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 82 | 96.5%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 80 | 96.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 80 | 96.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 79 | 95.8%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 74 | 94.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 74 | 94.7%
Gemini 2.5 Flash Lite | 100 | 100 | 97 | 88 | 88 | 94.5%
Mistral Small 4 (Reasoning) | 100 | 100 | 97 | 93 | 81 | 94.3%
Mistral Small 4 | 100 | 100 | 100 | 98 | 72 | 94.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 87 | 82 | 93.8%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 67 | 93.5%
Aion 2.0 | 100 | 100 | 100 | 87 | 77 | 92.9%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 59 | 91.8%
Hermes 3 405B | 100 | 100 | 100 | 100 | 58 | 91.6%
Llama 3.1 70B | 100 | 100 | 100 | 95 | 62 | 91.4%
Ministral 8B | 100 | 100 | 100 | 100 | 57 | 91.3%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 56 | 91.2%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 83 | 72 | 90.9%
Ministral 3 8B | 100 | 100 | 100 | 100 | 53 | 90.7%
Gemma 3 12B | 100 | 100 | 100 | 79 | 74 | 90.6%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 51 | 90.1%
Claude Opus 4.6 | 100 | 100 | 99 | 91 | 60 | 90.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 50 | 90.0%
Rocinante 12B | 100 | 100 | 100 | 89 | 58 | 89.4%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 44 | 88.8%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 81 | 63 | 88.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 84 | 59 | 88.6%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 97 | 44 | 88.2%
Qwen 3 32B | 100 | 100 | 100 | 100 | 38 | 87.7%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 36 | 87.3%
GPT-4.1 Nano | 100 | 100 | 100 | 85 | 50 | 87.0%
Mistral Large | 100 | 100 | 100 | 100 | 28 | 85.5%
Qwen 3.5 35B | 100 | 100 | 100 | 72 | 53 | 85.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 88 | 34 | 84.4%
DeepSeek-V2 Chat | 100 | 100 | 73 | 69 | 69 | 82.2%
Ministral 3 3B | 100 | 100 | 100 | 100 | 7 | 81.4%
Claude Haiku 4.5 | 100 | 100 | 84 | 68 | 52 | 80.8%
Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3B | 100 | 100 | 100 | 99 | 0 | 79.8%
ByteDance Seed 1.6 | 100 | 100 | 100 | 73 | 21 | 78.9%
Claude Opus 4 | 100 | 100 | 100 | 79 | 12 | 78.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 84 | 0 | 76.7%
Claude Opus 4.5 | 100 | 100 | 100 | 50 | 32 | 76.5%
Qwen 2.5 72B | 100 | 100 | 100 | 80 | 0 | 76.0%
Ministral 3 14B | 100 | 100 | 100 | 63 | 0 | 72.5%
Gemma 3 27B | 100 | 100 | 100 | 50 | 0 | 69.9%
Claude 3 Haiku | 100 | 87 | 73 | 60 | 27 | 69.3%
Z.AI GLM 5 | 100 | 100 | 80 | 32 | 32 | 68.6%
Gemma 3 4B | 100 | 100 | 100 | 34 | 0 | 66.8%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 29 | 0 | 65.8%
Stealth: Hunter Alpha | 100 | 100 | 77 | 49 | 0 | 65.3%
Stealth: Healer Alpha | 100 | 76 | 71 | 69 | 0 | 63.1%
Gemini 3 Flash (Preview) | 100 | 100 | 73 | 42 | 0 | 63.0%
Z.AI GLM 4.7 Flash | 100 | 80 | 72 | 52 | 0 | 60.8%
MiniMax M2.5 | 100 | 100 | 75 | 0 | 0 | 55.0%
DeepSeek V3.1 | 100 | 77 | 66 | 7 | 0 | 50.0%
WizardLM 2 8x22b | 100 | 86 | 57 | 6 | 0 | 49.7%
Claude 3.5 Haiku | 100 | 100 | 29 | 0 | 0 | 45.8%
DeepSeek V3.2 | 100 | 48 | 32 | 24 | 0 | 40.8%
Z.AI GLM 4.7 | 100 | 67 | 0 | 0 | 0 | 33.3%
Z.AI GLM 4.6 | 74 | 59 | 28 | 0 | 0 | 32.3%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 98 | 99.7%
GPT-5 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 91 | 98.2%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 88 | 97.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 88 | 97.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 87 | 97.4%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 85 | 97.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 82 | 96.4%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 80 | 96.1%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 70 | 94.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 89 | 79 | 93.7%
Gemma 3 4B | 100 | 100 | 100 | 100 | 68 | 93.5%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 65 | 93.1%
Aion 2.0 | 100 | 100 | 100 | 84 | 80 | 92.6%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 87 | 76 | 92.5%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 61 | 92.2%
Claude Haiku 4.5 | 100 | 100 | 100 | 82 | 78 | 92.1%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 60 | 91.9%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 44 | 88.8%
Mistral Small Creative | 100 | 100 | 100 | 98 | 36 | 86.8%
Mistral Large 2 | 100 | 100 | 100 | 87 | 44 | 86.2%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 25 | 84.9%
MiniMax M2.5 | 100 | 100 | 100 | 99 | 20 | 83.9%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 7 | 81.4%
Mistral Large 3 | 100 | 100 | 100 | 100 | 0 | 80.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 98 | 0 | 79.6%
Gemini 2.5 Pro | 100 | 100 | 100 | 64 | 19 | 76.6%
Gemma 3 27B | 100 | 100 | 100 | 61 | 21 | 76.4%
Mistral Large | 100 | 100 | 100 | 77 | 0 | 75.5%
Mistral Medium 3.1 | 100 | 100 | 100 | 38 | 35 | 74.8%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 63 | 0 | 72.7%
Claude 3.5 Haiku | 100 | 100 | 100 | 37 | 15 | 70.4%
Ministral 3 8B | 100 | 100 | 100 | 47 | 0 | 69.4%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 45 | 0 | 69.0%
Ministral 3 14B | 100 | 100 | 100 | 30 | 11 | 68.2%
Ministral 3B | 100 | 100 | 100 | 38 | 0 | 67.7%
Ministral 8B | 100 | 100 | 100 | 0 | 0 | 60.0%
Z.AI GLM 4.6 | 100 | 81 | 76 | 0 | 0 | 51.3%
Claude Sonnet 4.6 | 100 | 54 | 0 | 0 | 0 | 30.8%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 97 | 99.5%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 93 | 98.7%
Claude Opus 4 | 100 | 100 | 100 | 100 | 92 | 98.3%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 91 | 98.1%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 82 | 96.4%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 82 | 96.4%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 82 | 96.4%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 79 | 95.8%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 92 | 83 | 94.9%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 71 | 94.1%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 69 | 93.7%
Gemma 3 4B | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 85 | 81 | 93.1%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 62 | 92.4%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 60 | 92.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 58 | 91.6%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 58 | 91.6%
Stealth: Healer Alpha | 100 | 100 | 89 | 87 | 77 | 90.6%
Claude Opus 4.6 | 100 | 100 | 100 | 98 | 52 | 90.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 48 | 89.7%
Ministral 3 14B | 100 | 100 | 100 | 79 | 69 | 89.6%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 41 | 88.3%
Ministral 8B | 100 | 100 | 100 | 100 | 39 | 87.9%
Stealth: Hunter Alpha | 100 | 100 | 100 | 76 | 62 | 87.6%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 33 | 86.7%
WizardLM 2 8x22b | 100 | 100 | 100 | 65 | 64 | 85.9%
Hermes 3 70B | 100 | 100 | 100 | 100 | 29 | 85.8%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 25 | 85.1%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 23 | 84.7%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 20 | 84.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 12 | 82.4%
Claude 3 Haiku | 100 | 100 | 100 | 85 | 22 | 81.5%
Writer: Palmyra X5 | 100 | 100 | 100 | 92 | 15 | 81.3%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 0 | 80.0%
Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 54 | 39 | 78.7%
Hermes 3 405B | 100 | 100 | 100 | 64 | 25 | 77.7%
Qwen 3 32B | 100 | 100 | 92 | 89 | 0 | 76.3%
Ministral 3B | 100 | 100 | 87 | 74 | 17 | 75.7%
Mistral Large 3 | 100 | 100 | 84 | 78 | 0 | 72.3%
Z.AI GLM 4.7 | 100 | 100 | 84 | 76 | 0 | 72.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 35 | 25 | 72.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 57 | 0 | 71.4%
Mistral NeMO | 100 | 93 | 89 | 57 | 0 | 67.9%
DeepSeek V3.1 | 100 | 84 | 81 | 55 | 0 | 64.0%
DeepSeek V3.2 | 100 | 91 | 29 | 14 | 0 | 46.7%
Z.AI GLM 4.6 | 100 | 83 | 21 | 0 | 0 | 40.9%
Z.AI GLM 4.7 Flash | 0 | 0 | 0 | 0 | 0 | 0.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 97 | 99.5%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 93 | 98.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 89 | 97.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 86 | 97.1%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 86 | 97.1%
Claude Opus 4 | 100 | 100 | 100 | 100 | 84 | 96.9%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 83 | 96.6%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 83 | 96.6%
Stealth: Hunter Alpha | 100 | 100 | 100 | 96 | 83 | 95.8%
Aion 2.0 | 100 | 100 | 100 | 100 | 77 | 95.3%
Ministral 3 3B | 100 | 100 | 100 | 100 | 76 | 95.3%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 75 | 95.1%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 73 | 94.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 88 | 82 | 94.1%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 69 | 93.8%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 63 | 92.5%
Ministral 8B | 100 | 100 | 100 | 100 | 57 | 91.4%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 46 | 89.2%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 45 | 89.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 44 | 88.8%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 75 | 67 | 88.5%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 37 | 87.5%
Mistral Small Creative | 100 | 100 | 100 | 89 | 45 | 86.8%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 15 | 83.0%
Ministral 3B | 100 | 100 | 100 | 88 | 25 | 82.6%
Claude Opus 4.5 | 100 | 100 | 100 | 99 | 5 | 80.8%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3.1 | 100 | 100 | 98 | 57 | 31 | 77.1%
Llama 3.1 8B | 100 | 100 | 100 | 58 | 20 | 75.6%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 77 | 0 | 75.4%
MiniMax M2.5 | 100 | 100 | 72 | 57 | 15 | 68.8%
Claude Sonnet 4.6 (Reasoning) | 99 | 89 | 59 | 39 | 10 | 59.1%
Claude Sonnet 4.6 | 100 | 94 | 48 | 0 | 0 | 48.5%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 92 | 98.5%
Ministral 3B | 100 | 100 | 100 | 100 | 91 | 98.1%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 88 | 97.7%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 88 | 97.5%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 84 | 96.7%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 82 | 96.3%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 75 | 95.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 74 | 94.8%
Aion 2.0 | 100 | 100 | 97 | 94 | 81 | 94.5%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 71 | 94.1%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 68 | 93.6%
Llama 3.1 8B | 100 | 100 | 100 | 85 | 81 | 93.2%
Gemma 3 12B | 100 | 100 | 100 | 100 | 64 | 92.8%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 63 | 92.6%
Claude Opus 4 | 100 | 100 | 96 | 86 | 80 | 92.5%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 61 | 92.1%
Hermes 3 405B | 100 | 100 | 100 | 100 | 56 | 91.2%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 83 | 68 | 90.2%
Rocinante 12B | 100 | 100 | 100 | 75 | 74 | 89.9%
Z.AI GLM 4.7 | 100 | 98 | 84 | 81 | 80 | 88.6%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 42 | 88.4%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 41 | 88.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 76 | 64 | 87.9%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 89 | 46 | 86.8%
Qwen 2.5 72B | 100 | 100 | 100 | 89 | 43 | 86.5%
DeepSeek V3.1 | 100 | 100 | 100 | 68 | 60 | 85.7%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 69 | 54 | 84.7%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 22 | 84.4%
Mistral Small 4 (Reasoning) | 100 | 100 | 97 | 67 | 56 | 83.9%
Claude 3 Haiku | 100 | 100 | 81 | 75 | 63 | 83.8%
Gemma 3 4B | 100 | 100 | 100 | 59 | 53 | 82.5%
DeepSeek V3.2 | 100 | 100 | 100 | 79 | 32 | 82.2%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 55 | 53 | 81.7%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 4 | 80.8%
Mistral Large | 100 | 100 | 100 | 87 | 17 | 80.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large 3 | 100 | 100 | 100 | 94 | 0 | 78.8%
Claude Opus 4.5 | 100 | 100 | 86 | 67 | 24 | 75.4%
Gemini 2.5 Pro | 100 | 84 | 82 | 61 | 48 | 75.1%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 73 | 0 | 74.5%
Mistral NeMO | 100 | 88 | 85 | 58 | 31 | 72.4%
Hermes 3 70B | 100 | 100 | 100 | 61 | 0 | 72.2%
Mistral Medium 3.1 | 100 | 100 | 100 | 58 | 0 | 71.6%
Mistral Large 2 | 100 | 100 | 81 | 74 | 0 | 71.0%
Z.AI GLM 4.7 Flash | 100 | 97 | 88 | 69 | 0 | 70.8%
Z.AI GLM 4.5 | 100 | 100 | 100 | 50 | 0 | 69.9%
Ministral 8B | 100 | 100 | 89 | 57 | 0 | 69.2%
Qwen 3 32B | 100 | 100 | 100 | 36 | 0 | 67.2%
Mistral Small Creative | 100 | 100 | 74 | 40 | 20 | 66.9%
Mistral Small 3.2 24B | 100 | 100 | 100 | 30 | 0 | 66.0%
Llama 3.1 70B | 100 | 100 | 97 | 29 | 0 | 65.3%
DeepSeek-V2 Chat | 100 | 100 | 62 | 41 | 22 | 65.1%
Claude Sonnet 4.6 | 100 | 77 | 66 | 52 | 30 | 64.9%
Inception Mercury | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3 8B | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3 14B | 100 | 100 | 92 | 0 | 0 | 58.3%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 47 | 41 | 0 | 57.6%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 81 | 0 | 0 | 56.3%
WizardLM 2 8x22b | 100 | 100 | 53 | 28 | 0 | 56.1%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 51 | 14 | 0 | 52.9%
Gemini 3 Flash (Preview) | 100 | 63 | 55 | 43 | 0 | 52.0%
Stealth: Healer Alpha | 100 | 100 | 53 | 0 | 0 | 50.5%