Adverbs in dialogue tags

Test: Bad Writing Habits

Avg. Score: 89.8%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Inception Mercury 2 | 100.0% | $0.0032 | 7.0s | 100%
2 | Gemini 3.1 Flash Lite (Preview) | 98.9% | $0.0030 | 8.4s | 89%
3 | ByteDance Seed 1.6 Flash | 99.2% | $0.0013 | 27.3s | 88%
4 | Grok 4.1 Fast | 98.7% | $0.0018 | 37.8s | 89%
5 | Stealth: Aurora Alpha | 98.6% | $0.0000 | 9.8s | 78%
6 | Ministral 8B | 97.6% | $0.0004 | 10.4s | 76%
7 | Gemini 3 Flash (Preview) | 98.4% | $0.0078 | 19.6s | 78%
8 | GPT-5.4 Nano | 97.2% | $0.0057 | 26.3s | 81%
9 | Ministral 3 3B | 96.8% | $0.0005 | 11.1s | 70%
10 | Qwen 3.5 Flash | 98.1% | $0.0025 | 47.5s | 77%
11 | Gemini 3 Flash (Preview, Reasoning) | 97.8% | $0.012 | 30.1s | 77%
12 | GPT-5.4 Mini | 97.8% | $0.015 | 16.8s | 74%
13 | Qwen 2.5 72B | 97.7% | $0.0010 | 36.7s | 71%
14 | Mistral NeMO | 95.6% | $0.0005 | 10.1s | 64%
15 | Qwen 3.5 35B | 98.5% | $0.018 | 1.0m | 78%
16 | o4 Mini | 95.8% | $0.015 | 25.7s | 71%
17 | GPT-5.4 Mini (Reasoning, Low) | 96.8% | $0.015 | 16.8s | 66%
18 | Z.AI GLM 4.6 | 96.6% | $0.0065 | 51.5s | 73%
19 | Stealth: Healer Alpha | 95.0% | $0.0000 | 23.7s | 64%
20 | Ministral 3B | 94.3% | $0.0001 | 8.1s | 61%
21 | Qwen 3.5 27B | 98.9% | $0.020 | 1.6m | 85%
22 | Qwen 3.5 Plus (2026-02-15) | 94.6% | $0.0060 | 31.5s | 65%
23 | Mistral Small 4 (Reasoning) | 92.5% | $0.0022 | 30.2s | 65%
24 | Z.AI GLM 5 Turbo | 95.2% | $0.0081 | 33.2s | 63%
25 | Nemotron 3 Nano | 96.2% | $0.0010 | 1.1m | 67%
26 | Stealth: Hunter Alpha | 95.4% | $0.0000 | 55.0s | 64%
27 | GPT-5.4 Nano (Reasoning) | 94.2% | $0.0061 | 24.5s | 61%
28 | GPT-5.4 Nano (Reasoning, Low) | 94.1% | $0.0055 | 20.6s | 59%
29 | Mistral Small Creative | 93.2% | $0.0007 | 9.1s | 53%
30 | GPT-4o Mini (temp=0) | 93.4% | $0.0012 | 34.8s | 60%
31 | GPT-5.4 Mini (Reasoning) | 95.6% | $0.022 | 28.1s | 62%
32 | Mistral Medium 3.1 | 93.4% | $0.0048 | 36.5s | 58%
33 | Grok 4 Fast | 92.2% | $0.0017 | 24.1s | 55%
34 | Nemotron 3 Super | 95.4% | $0.0000 | 1.4m | 66%
35 | Qwen 3.5 122B | 96.2% | $0.025 | 1.1m | 69%
36 | WizardLM 2 8x22b | 94.7% | $0.0026 | 1.8m | 73%
37 | Qwen3 235B A22B Instruct 2507 | 93.5% | $0.0011 | 59.2s | 59%
38 | GPT-4o, May 13th (temp=0) | 94.0% | $0.035 | 14.1s | 61%
39 | Ministral 3 8B | 90.8% | $0.0008 | 19.6s | 52%
40 | Gemini 2.5 Pro | 94.8% | $0.036 | 36.2s | 65%
41 | Writer: Palmyra X5 | 92.3% | $0.011 | 22.0s | 54%
42 | GPT-5.4 | 98.4% | $0.049 | 1.4m | 79%
43 | Gemini 2.5 Flash Lite | 89.1% | $0.0009 | 9.5s | 50%
44 | Cohere Command R+ (Aug. 2024) | 93.8% | $0.020 | 52.5s | 62%
45 | Gemini 2.5 Flash | 89.8% | $0.0052 | 10.6s | 49%
46 | Qwen 3.5 9B | 94.6% | $0.0011 | 1.4m | 59%
47 | GPT-4.1 | 93.4% | $0.018 | 44.7s | 58%
48 | Gemma 3 4B | 89.6% | $0.0002 | 20.0s | 49%
49 | Qwen 3 32B | 92.7% | $0.0015 | 54.6s | 54%
50 | LFM2 24B | 89.1% | $0.0002 | 28.4s | 51%
51 | Z.AI GLM 4.5 | 89.3% | $0.0051 | 42.1s | 56%
52 | Inception Mercury | 92.2% | $0.011 | 17.6s | 46%
53 | Mistral Large | 92.0% | $0.014 | 30.9s | 50%
54 | GPT-4o, Aug. 6th (temp=0) | 90.5% | $0.023 | 22.7s | 54%
55 | Qwen 3.5 397B A17B | 98.9% | $0.014 | 3.0m | 79%
56 | Ministral 3 14B | 88.8% | $0.0007 | 11.7s | 43%
57 | Aion 2.0 | 92.8% | $0.0064 | 1.3m | 55%
58 | o4 Mini High | 91.8% | $0.025 | 47.2s | 56%
59 | Mistral Large 2 | 90.1% | $0.013 | 29.4s | 47%
60 | MiniMax M2.7 | 90.6% | $0.0040 | 1.1m | 53%
61 | Z.AI GLM 4.7 | 92.0% | $0.010 | 1.4m | 57%
62 | Gemma 3 27B | 87.8% | $0.0006 | 52.6s | 49%
63 | Mistral Large 3 | 88.9% | $0.0033 | 30.3s | 41%
64 | Claude Sonnet 4.5 | 91.7% | $0.035 | 38.1s | 52%
65 | Z.AI GLM 4.7 Flash | 89.9% | $0.0017 | 1.2m | 49%
66 | DeepSeek V3.2 | 91.7% | $0.0014 | 1.9m | 56%
67 | Grok 4.20 (Beta) | 85.6% | $0.018 | 15.8s | 45%
68 | GPT-5.2 | 95.2% | $0.056 | 1.5m | 65%
69 | Mistral Small 4 | 84.2% | $0.0014 | 18.2s | 38%
70 | Gemini 3 Pro (Preview) | 93.1% | $0.055 | 54.4s | 57%
71 | Gemini 2.5 Flash Lite (Reasoning) | 85.4% | $0.0028 | 30.8s | 39%
72 | Llama 3.1 8B | 89.1% | $0.0003 | 1.3m | 44%
73 | Gemini 2.5 Flash (Reasoning) | 85.8% | $0.011 | 21.5s | 39%
74 | Claude 3.5 Haiku | 85.9% | $0.0035 | 10.8s | 31%
75 | Z.AI GLM 5 | 87.7% | $0.0084 | 1.2m | 48%
76 | Claude 3 Haiku | 83.8% | $0.0025 | 14.9s | 35%
77 | ByteDance Seed 1.6 | 94.4% | $0.013 | 2.5m | 59%
78 | Claude Sonnet 4.6 | 89.0% | $0.031 | 39.3s | 45%
79 | DeepSeek V3 (2025-03-24) | 85.0% | $0.0014 | 39.4s | 36%
80 | GPT-5 Mini | 86.1% | $0.0100 | 57.4s | 41%
81 | GPT-5.4 (Reasoning, Low) | 93.0% | $0.055 | 1.4m | 55%
82 | Llama 3.1 Nemotron 70B | 83.2% | $0.0038 | 31.7s | 34%
83 | Gemma 3 12B | 82.5% | $0.0004 | 41.3s | 36%
84 | Llama 3.1 70B | 83.2% | $0.0015 | 29.4s | 32%
85 | Arcee AI: Trinity Mini | 80.1% | $0.0003 | 9.2s | 30%
86 | Claude Opus 4.6 | 93.4% | $0.078 | 1.2m | 59%
87 | MiniMax M2.5 | 82.7% | $0.0034 | 1.3m | 42%
88 | Grok 4 | 92.2% | $0.048 | 1.7m | 53%
89 | GPT-4o Mini (temp=1) | 78.2% | $0.0012 | 34.8s | 36%
90 | DeepSeek V3 (2024-12-26) | 82.4% | $0.0021 | 54.6s | 34%
91 | Claude Opus 4.6 (Reasoning) | 93.8% | $0.088 | 1.4m | 60%
92 | GPT-5.1 | 90.8% | $0.054 | 1.8m | 58%
93 | ByteDance Seed 2.0 Lite | 90.8% | $0.012 | 2.2m | 45%
94 | Claude Sonnet 4.6 (Reasoning) | 90.1% | $0.060 | 1.2m | 48%
95 | GPT-4o, May 13th (temp=1) | 81.2% | $0.033 | 14.4s | 33%
96 | Rocinante 12B | 79.5% | $0.0014 | 38.4s | 28%
97 | DeepSeek-V2 Chat | 80.8% | $0.0021 | 53.3s | 30%
98 | DeepSeek V3.1 | 84.8% | $0.0020 | 1.8m | 37%
99 | Claude Sonnet 4 | 82.1% | $0.032 | 43.7s | 35%
100 | GPT-5.4 (Reasoning) | 95.4% | $0.089 | 2.6m | 68%
101 | Claude Opus 4.5 | 87.5% | $0.070 | 53.4s | 44%
102 | MoonshotAI: Kimi K2.5 | 91.8% | $0.019 | 3.2m | 52%
103 | GPT-4o, Aug. 6th (temp=1) | 77.0% | $0.018 | 24.4s | 30%
104 | Grok 4.20 (Beta, Reasoning) | 83.6% | $0.039 | 34.0s | 30%
105 | GPT-5 | 94.3% | $0.065 | 2.8m | 62%
106 | Claude 3.5 Sonnet | 82.7% | $0.048 | 35.4s | 34%
107 | Gemini 3.1 Pro (Preview) | 93.3% | $0.107 | 1.8m | 59%
108 | Arcee AI: Trinity Large (Preview) | 75.3% | $0.0000 | 43.6s | 22%
109 | Claude 3.7 Sonnet | 78.0% | $0.042 | 46.7s | 34%
110 | Claude Haiku 4.5 | 72.0% | $0.011 | 21.6s | 22%
111 | GPT-4.1 Mini | 67.0% | $0.0027 | 19.0s | 20%
112 | ByteDance Seed 2.0 Mini | 92.9% | $0.0045 | 4.9m | 51%
113 | Hermes 3 405B | 70.7% | $0.0032 | 53.2s | 18%
114 | Mistral Small 3.2 24B | 92.8% | $0.0069 | 5.7m | 50%
115 | GPT-5 Nano | 65.9% | $0.0042 | 1.4m | 18%
116 | Hermes 3 70B | 63.7% | $0.0010 | 1.2m | 13%
117 | Claude Opus 4 | 88.7% | $0.209 | 1.4m | 50%
118 | GPT-4.1 Nano | 46.6% | $0.0007 | 13.3s | 4%
Average score: 89.77%

Individual Scenarios

Detailed Writing Rules

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 92 | 98.4%
Aion 2.0 | 100 | 100 | 100 | 100 | 89 | 97.8%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 89 | 97.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 89 | 97.8%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 89 | 97.8%
LFM2 24B | 100 | 100 | 100 | 100 | 89 | 97.8%
o4 Mini | 100 | 100 | 100 | 100 | 86 | 97.1%
Mistral Large | 100 | 100 | 100 | 100 | 86 | 97.1%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 85 | 96.9%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 82 | 96.5%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 82 | 96.5%
WizardLM 2 8x22b | 100 | 100 | 100 | 92 | 84 | 95.2%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 75 | 95.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 75 | 95.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 70 | 93.9%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 70 | 93.9%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large 3 | 100 | 100 | 100 | 100 | 65 | 93.0%
Claude Opus 4 | 100 | 100 | 100 | 86 | 75 | 92.3%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 57 | 91.4%
Ministral 3B | 100 | 100 | 100 | 100 | 57 | 91.4%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 52 | 90.4%
Qwen 3 32B | 100 | 100 | 100 | 100 | 50 | 90.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 42 | 88.4%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 33 | 86.7%
Claude 3.7 Sonnet | 100 | 100 | 100 | 70 | 64 | 86.6%
GPT-5 Nano | 100 | 100 | 100 | 87 | 44 | 86.1%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 85 | 40 | 84.9%
GPT-5 Mini | 100 | 100 | 100 | 100 | 24 | 84.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 82 | 40 | 84.5%
Grok 4.20 (Beta) | 100 | 100 | 100 | 78 | 40 | 83.5%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 67 | 46 | 82.6%
Claude Sonnet 4 | 100 | 100 | 100 | 57 | 52 | 81.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 89 | 0 | 77.8%
MiniMax M2.5 | 100 | 97 | 87 | 57 | 42 | 76.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 57 | 18 | 75.1%
ByteDance Seed 1.6 | 100 | 100 | 100 | 75 | 0 | 75.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 42 | 29 | 74.1%
Z.AI GLM 5 | 100 | 100 | 100 | 57 | 10 | 73.3%
Mistral Small 4 | 100 | 100 | 71 | 50 | 29 | 69.9%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 18 | 0 | 63.6%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 0 | 0 | 60.0%
Hermes 3 405B | 100 | 100 | 89 | 0 | 0 | 57.8%
Hermes 3 70B | 100 | 100 | 33 | 0 | 0 | 46.7%
GPT-4.1 Nano | 100 | 100 | 33 | 0 | 0 | 46.7%
Rocinante 12B | 100 | 52 | 0 | 0 | 0 | 30.4%
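
The Avg column in these tables appears to be the arithmetic mean of the five numbered columns. The sketch below checks that reading against three rows from the table above, read as five scores followed by the listed average. The function name `column_mean` is hypothetical, and the rule itself is inferred from the numbers, not stated by the page.

```python
from statistics import mean

# Rows transcribed from the table above: (model, five column scores, listed Avg).
rows = [
    ("Gemini 2.5 Flash", [100, 100, 100, 100, 92], 98.4),
    ("WizardLM 2 8x22b", [100, 100, 100, 92, 84], 95.2),
    ("Rocinante 12B", [100, 52, 0, 0, 0], 30.4),
]


def column_mean(scores):
    """Arithmetic mean of the per-column scores, rounded to one decimal."""
    return round(mean(scores), 1)


for model, scores, listed in rows:
    print(f"{model}: computed {column_mean(scores):.1f}%, listed {listed:.1f}%")
```

Some rows differ from this mean by 0.1 to 0.2 points (for example Claude Opus 4: 100, 100, 100, 86, 75 gives 92.2 against a listed 92.3), which would be consistent with the displayed column scores being rounded to whole numbers before display.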
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 95 | 98.9%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 89 | 97.8%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 89 | 97.8%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 89 | 97.8%
GPT-5.4 | 100 | 100 | 100 | 100 | 89 | 97.8%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 89 | 97.8%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 89 | 97.8%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 82 | 96.5%
GPT-5.1 | 100 | 100 | 100 | 100 | 75 | 95.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 57 | 91.4%
Gemma 3 12B | 100 | 100 | 100 | 100 | 57 | 91.4%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 57 | 91.4%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 57 | 91.4%
Grok 4.20 (Beta) | 100 | 100 | 100 | 89 | 62 | 90.2%
Llama 3.1 8B | 100 | 100 | 100 | 75 | 75 | 90.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 33 | 86.7%
Grok 4 | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 33 | 86.7%
Claude 3.7 Sonnet | 100 | 100 | 89 | 82 | 50 | 84.2%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 18 | 83.6%
GPT-4.1 | 100 | 100 | 100 | 100 | 0 | 80.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 0 | 80.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 95 | 0 | 78.9%
Ministral 3 8B | 100 | 100 | 100 | 89 | 0 | 77.8%
Writer: Palmyra X5 | 100 | 100 | 100 | 82 | 0 | 76.5%
Mistral Large | 100 | 100 | 100 | 82 | 0 | 76.5%
Hermes 3 405B | 100 | 100 | 100 | 75 | 0 | 75.0%
Gemma 3 4B | 100 | 100 | 89 | 75 | 0 | 72.8%
Gemini 2.5 Flash | 100 | 100 | 100 | 57 | 0 | 71.4%
GPT-5 Nano | 100 | 100 | 100 | 26 | 0 | 65.2%
GPT-4.1 Mini | 100 | 100 | 67 | 57 | 0 | 64.8%
GPT-5 Mini | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude Opus 4.6 | 100 | 100 | 100 | 0 | 0 | 60.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 0 | 0 | 60.0%
MiniMax M2.5 | 100 | 100 | 100 | 0 | 0 | 60.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Small 4 | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3 14B | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude 3 Haiku | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude Sonnet 4 | 100 | 100 | 67 | 0 | 0 | 53.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 33 | 18 | 0 | 50.3%
GPT-4o Mini (temp=1) | 67 | 33 | 33 | 33 | 0 | 33.3%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 97 | 99.3%
GPT-5 Nano | 100 | 100 | 100 | 100 | 97 | 99.3%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 95 | 98.9%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 93 | 98.6%
Mistral Small Creative | 100 | 100 | 100 | 100 | 93 | 98.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 92 | 98.4%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 91 | 98.2%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 90 | 98.1%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 89 | 97.8%
Mistral NeMO | 100 | 100 | 100 | 100 | 89 | 97.8%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 87 | 97.5%
Aion 2.0 | 100 | 100 | 100 | 100 | 87 | 97.4%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 85 | 96.9%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 85 | 96.9%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 82 | 96.5%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 82 | 96.5%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 80 | 96.0%
Llama 3.1 70B | 100 | 100 | 100 | 89 | 89 | 95.6%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 75 | 95.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 75 | 95.0%
Z.AI GLM 4.6 | 100 | 100 | 97 | 97 | 80 | 95.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 86 | 86 | 94.3%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 71 | 94.2%
Claude Opus 4 | 100 | 100 | 100 | 100 | 65 | 93.0%
Grok 4 | 100 | 100 | 100 | 100 | 62 | 92.4%
GPT-4o Mini (temp=0) | 100 | 100 | 92 | 89 | 79 | 91.9%
Grok 4.20 (Beta) | 100 | 100 | 100 | 84 | 73 | 91.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 92 | 64 | 91.3%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 43 | 88.6%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 42 | 88.4%
Claude 3.7 Sonnet | 100 | 100 | 90 | 82 | 69 | 88.2%
Mistral Large 3 | 100 | 100 | 100 | 100 | 38 | 87.6%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 38 | 87.6%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 33 | 86.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 33 | 86.7%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 75 | 57 | 86.4%
GPT-4.1 Mini | 100 | 100 | 97 | 71 | 62 | 85.9%
Gemma 3 12B | 100 | 100 | 100 | 82 | 46 | 85.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 67 | 57 | 84.8%
o4 Mini High | 100 | 100 | 100 | 100 | 21 | 84.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 93 | 63 | 60 | 83.1%
Ministral 3 8B | 100 | 100 | 100 | 65 | 43 | 81.6%
DeepSeek V3.1 | 100 | 100 | 100 | 79 | 29 | 81.5%
Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 4 | 100 | 100 | 100 | 98 | 0 | 79.7%
Ministral 3 14B | 100 | 100 | 100 | 67 | 31 | 79.4%
Z.AI GLM 4.5 | 100 | 100 | 77 | 53 | 51 | 76.1%
Gemma 3 4B | 100 | 100 | 100 | 80 | 0 | 76.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 39 | 36 | 75.0%
Claude 3 Haiku | 100 | 100 | 100 | 36 | 33 | 73.9%
Hermes 3 405B | 100 | 100 | 100 | 67 | 0 | 73.3%
Gemma 3 27B | 100 | 95 | 95 | 48 | 18 | 71.2%
LFM2 24B | 100 | 82 | 71 | 60 | 33 | 69.4%
GPT-4o Mini (temp=1) | 100 | 93 | 93 | 57 | 0 | 68.7%
Hermes 3 70B | 100 | 100 | 89 | 46 | 0 | 67.0%
MiniMax M2.5 | 100 | 89 | 73 | 39 | 0 | 60.2%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 50 | 33 | 0 | 56.7%
Llama 3.1 Nemotron 70B | 100 | 100 | 75 | 0 | 0 | 55.0%
Claude Haiku 4.5 | 100 | 71 | 40 | 3 | 0 | 42.8%
GPT-4.1 Nano | 40 | 24 | 0 | 0 | 0 | 12.7%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 92 | 98.4%
Rocinante 12B | 100 | 100 | 100 | 100 | 89 | 97.8%
LFM2 24B | 100 | 100 | 100 | 100 | 82 | 96.5%
Ministral 3 3B | 100 | 100 | 100 | 100 | 82 | 96.5%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 80 | 96.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 80 | 96.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 79 | 95.8%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 79 | 95.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 78 | 95.5%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 75 | 95.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 75 | 95.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 97 | 77 | 94.6%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 70 | 93.9%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 64 | 92.9%
GPT-5 Nano | 100 | 100 | 100 | 100 | 62 | 92.4%
MiniMax M2.5 | 100 | 100 | 100 | 85 | 73 | 91.5%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 57 | 91.4%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 75 | 75 | 90.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 50 | 90.0%
Grok 4 Fast | 100 | 100 | 100 | 79 | 71 | 90.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 33 | 86.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 31 | 86.1%
Mistral Medium 3.1 | 100 | 100 | 100 | 78 | 52 | 85.9%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 29 | 85.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 25 | 84.9%
Claude 3.5 Sonnet | 100 | 100 | 100 | 86 | 33 | 83.8%
Mistral Large | 100 | 100 | 100 | 100 | 18 | 83.6%
Hermes 3 70B | 100 | 100 | 100 | 100 | 18 | 83.6%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 89 | 29 | 83.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 89 | 57 | 57 | 80.6%
Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.1%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 95 | 0 | 78.9%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 46 | 37 | 76.6%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 40 | 0 | 68.0%
Hermes 3 405B | 100 | 100 | 71 | 57 | 0 | 65.6%
GPT-4.1 Mini | 100 | 79 | 72 | 57 | 14 | 64.4%
Claude 3.5 Haiku | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-4.1 Nano | 71 | 67 | 0 | 0 | 0 | 27.5%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 97 | 99.5%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 97 | 99.5%
Grok 4 | 100 | 100 | 100 | 100 | 95 | 98.9%
Grok 4 Fast | 100 | 100 | 100 | 100 | 95 | 98.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 95 | 98.9%
Ministral 3 14B | 100 | 100 | 100 | 100 | 89 | 97.8%
WizardLM 2 8x22b | 100 | 100 | 98 | 95 | 95 | 97.5%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 87 | 97.4%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 86 | 97.3%
LFM2 24B | 100 | 100 | 100 | 100 | 84 | 96.8%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 79 | 95.8%
o4 Mini High | 100 | 100 | 100 | 100 | 78 | 95.5%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 98 | 79 | 95.4%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 75 | 95.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 75 | 95.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 75 | 95.0%
Mistral Large 2 | 100 | 100 | 100 | 86 | 82 | 93.6%
GPT-5 Mini | 100 | 100 | 100 | 98 | 69 | 93.4%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 65 | 93.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 60 | 92.1%
Claude Haiku 4.5 | 100 | 100 | 97 | 84 | 78 | 91.8%
Z.AI GLM 4.5 | 100 | 100 | 100 | 95 | 63 | 91.6%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 52 | 90.4%
Mistral Medium 3.1 | 100 | 100 | 100 | 97 | 55 | 90.2%
Gemma 3 12B | 100 | 100 | 100 | 100 | 43 | 88.6%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 42 | 88.4%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 42 | 88.4%
MiniMax M2.7 | 100 | 100 | 100 | 89 | 44 | 86.6%
Ministral 3 8B | 100 | 100 | 100 | 100 | 26 | 85.2%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 89 | 18 | 81.4%
GPT-4o, May 13th (temp=0) | 100 | 100 | 97 | 85 | 24 | 81.1%
GPT-4.1 Mini | 100 | 100 | 97 | 55 | 50 | 80.4%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 89 | 13 | 80.3%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 0 | 80.0%
Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o Mini (temp=1) | 100 | 95 | 89 | 85 | 28 | 79.2%
Gemma 3 4B | 100 | 100 | 97 | 57 | 33 | 77.6%
Rocinante 12B | 100 | 100 | 100 | 82 | 0 | 76.5%
GPT-5 Nano | 100 | 100 | 93 | 68 | 15 | 75.2%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 67 | 57 | 52 | 75.1%
Claude 3 Haiku | 100 | 100 | 100 | 54 | 18 | 74.4%
Mistral Small 4 | 100 | 100 | 100 | 71 | 0 | 74.2%
Claude 3.7 Sonnet | 100 | 100 | 100 | 43 | 26 | 73.8%
Hermes 3 70B | 100 | 100 | 100 | 13 | 0 | 62.5%
Gemma 3 27B | 100 | 100 | 57 | 46 | 5 | 61.6%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 70 | 0 | 0 | 53.9%
DeepSeek-V2 Chat | 100 | 79 | 62 | 8 | 0 | 49.8%
Llama 3.1 70B | 100 | 75 | 67 | 0 | 0 | 48.3%
Llama 3.1 Nemotron 70B | 100 | 75 | 33 | 18 | 0 | 45.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 0 | 0 | 0 | 40.0%
GPT-4.1 Nano | 62 | 18 | 13 | 0 | 0 | 18.6%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 97 | 99.5%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 95 | 98.9%
Gemma 3 27B | 100 | 100 | 100 | 100 | 95 | 98.9%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 95 | 98.9%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 92 | 98.4%
o4 Mini | 100 | 100 | 100 | 100 | 89 | 97.8%
Claude Opus 4 | 100 | 100 | 100 | 100 | 89 | 97.8%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 82 | 96.5%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 75 | 95.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 71 | 94.2%
MiniMax M2.5 | 100 | 100 | 100 | 92 | 75 | 93.4%
o4 Mini High | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Super | 100 | 100 | 100 | 89 | 75 | 92.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 62 | 92.4%
Grok 4 Fast | 100 | 100 | 100 | 100 | 57 | 91.4%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 52 | 90.4%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 33 | 86.7%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 33 | 86.7%
Claude Sonnet 4 | 100 | 100 | 100 | 92 | 33 | 85.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 18 | 83.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 10 | 81.9%
DeepSeek V3.1 | 100 | 100 | 100 | 57 | 46 | 80.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 89 | 14 | 80.6%
GPT-4.1 | 100 | 100 | 100 | 100 | 0 | 80.0%
Grok 4 | 100 | 100 | 100 | 100 | 0 | 80.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 79 | 0 | 75.8%
GPT-4.1 Mini | 100 | 100 | 100 | 57 | 18 | 75.1%
Gemma 3 12B | 100 | 100 | 100 | 52 | 18 | 74.0%
Mistral Small 4 | 100 | 100 | 100 | 67 | 0 | 73.3%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 52 | 13 | 72.9%
LFM2 24B | 100 | 100 | 100 | 57 | 0 | 71.4%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 82 | 67 | 0 | 69.8%
GPT-4.1 Nano | 100 | 100 | 100 | 33 | 0 | 66.7%
Rocinante 12B | 100 | 100 | 100 | 33 | 0 | 66.7%
Claude 3.7 Sonnet | 100 | 100 | 100 | 8 | 0 | 61.5%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 0 | 0 | 60.0%
Inception Mercury | 100 | 100 | 100 | 0 | 0 | 60.0%
Aion 2.0 | 100 | 100 | 95 | 0 | 0 | 58.9%
Claude 3.5 Sonnet | 100 | 100 | 82 | 0 | 0 | 56.5%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 18 | 0 | 0 | 43.6%

Genre

Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 97 | 99.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 95 | 98.9%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 95 | 98.9%
Ministral 8B | 100 | 100 | 100 | 100 | 95 | 98.9%
o4 Mini High | 100 | 100 | 100 | 100 | 89 | 97.8%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 89 | 97.8%
GPT-4.1 | 100 | 100 | 100 | 100 | 89 | 97.8%
Grok 4 | 100 | 100 | 100 | 100 | 89 | 97.8%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 89 | 97.8%
Grok 4 Fast | 100 | 100 | 100 | 100 | 86 | 97.1%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 82 | 96.5%
Mistral Large 3 | 100 | 100 | 100 | 100 | 82 | 96.5%
GPT-5.2 | 100 | 100 | 100 | 100 | 77 | 95.4%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 75 | 95.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 75 | 95.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 75 | 95.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 75 | 95.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3.2 | 100 | 100 | 100 | 89 | 75 | 92.8%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 57 | 91.4%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 57 | 91.4%
Ministral 3 14B | 100 | 100 | 100 | 100 | 57 | 91.4%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 75 | 75 | 90.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 50 | 90.0%
GPT-4.1 Mini | 100 | 100 | 100 | 82 | 57 | 87.9%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 89 | 50 | 87.8%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 38 | 87.6%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 33 | 86.7%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 33 | 86.7%
Gemma 3 12B | 100 | 100 | 100 | 100 | 33 | 86.7%
Claude Haiku 4.5 | 100 | 100 | 89 | 82 | 57 | 85.7%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 95 | 33 | 85.6%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 70 | 57 | 85.3%
Claude Opus 4 | 100 | 100 | 100 | 85 | 40 | 84.9%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 82 | 38 | 84.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 18 | 83.6%
Qwen 3.5 9B | 100 | 100 | 100 | 75 | 42 | 83.4%
LFM2 24B | 100 | 100 | 75 | 70 | 67 | 82.2%
Claude Opus 4.6 | 100 | 100 | 78 | 70 | 64 | 82.2%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 89 | 18 | 81.4%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 70 | 33 | 80.6%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 0 | 80.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
MiniMax M2.7 | 100 | 100 | 100 | 95 | 0 | 78.9%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 95 | 0 | 78.9%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 57 | 33 | 78.1%
Z.AI GLM 4.5 | 100 | 100 | 95 | 60 | 26 | 76.3%
GPT-5 | 100 | 100 | 97 | 80 | 0 | 75.5%
MiniMax M2.5 | 100 | 95 | 82 | 67 | 33 | 75.4%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 57 | 18 | 75.1%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 67 | 0 | 73.3%
Mistral Small 4 | 100 | 89 | 86 | 64 | 26 | 72.9%
Claude Sonnet 4.6 | 100 | 100 | 100 | 57 | 0 | 71.4%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 42 | 13 | 70.9%
GPT-5.1 | 100 | 100 | 67 | 55 | 0 | 64.4%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 71 | 50 | 0 | 64.2%
Z.AI GLM 4.7 | 100 | 95 | 64 | 57 | 0 | 63.1%
Hermes 3 405B | 100 | 100 | 79 | 24 | 0 | 60.5%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 70B | 100 | 100 | 100 | 0 | 0 | 60.0%
Hermes 3 70B | 100 | 100 | 89 | 0 | 0 | 57.8%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 80 | 8 | 0 | 57.5%
GPT-5 Nano | 100 | 89 | 42 | 39 | 0 | 53.9%
GPT-4.1 Nano | 75 | 75 | 75 | 18 | 18 | 52.3%
Claude Sonnet 4.6 (Reasoning) | 100 | 89 | 13 | 0 | 0 | 40.3%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 89 | 97.8%
Claude Opus 4 | 100 | 100 | 100 | 100 | 75 | 95.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano | 100 | 100 | 100 | 89 | 62 | 90.2%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 46 | 89.2%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 75 | 57 | 86.4%
GPT-5 Mini | 100 | 100 | 100 | 75 | 46 | 84.2%
Qwen 3 32B | 100 | 100 | 100 | 75 | 33 | 81.7%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 0 | 80.0%
o4 Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 0 | 80.0%
Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 5 | 100 | 100 | 100 | 89 | 0 | 77.8%
Claude Sonnet 4 | 100 | 100 | 100 | 89 | 0 | 77.8%
GPT-5.4 | 100 | 100 | 95 | 89 | 0 | 76.7%
GPT-5.2 | 100 | 100 | 100 | 75 | 0 | 75.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 75 | 0 | 75.0%
GPT-5.4 Mini | 100 | 100 | 100 | 75 | 0 | 75.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 75 | 0 | 75.0%
Hermes 3 70B | 100 | 100 | 100 | 33 | 33 | 73.3%
Claude 3.7 Sonnet | 100 | 100 | 82 | 75 | 0 | 71.5%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 57 | 0 | 71.4%
Writer: Palmyra X5 | 100 | 100 | 100 | 57 | 0 | 71.4%
GPT-5 | 100 | 100 | 100 | 33 | 10 | 68.6%
Grok 4.20 (Beta) | 100 | 100 | 100 | 33 | 0 | 66.7%
Nemotron 3 Nano | 100 | 100 | 100 | 33 | 0 | 66.7%
GPT-5.1 | 100 | 100 | 67 | 64 | 0 | 66.1%
GPT-5.4 (Reasoning) | 100 | 95 | 75 | 46 | 0 | 63.2%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 0 | 0 | 60.0%
MiniMax M2.5 | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 0 | 0 | 60.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 0 | 0 | 60.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-4.1 Mini | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-5 Nano | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Large | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 8B | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Medium 3.1 | 100 | 100 | 95 | 0 | 0 | 58.9%
o4 Mini High | 100 | 100 | 57 | 33 | 0 | 58.1%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 75 | 0 | 0 | 55.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 75 | 0 | 0 | 55.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 57 | 0 | 0 | 51.4%
GPT-5.4 (Reasoning, Low) | 100 | 57 | 46 | 0 | 0 | 40.7%
Mistral Large 3 | 100 | 100 | 0 | 0 | 0 | 40.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 0 | 0 | 0 | 40.0%
Mistral Large 2 | 100 | 100 | 0 | 0 | 0 | 40.0%
Ministral 3 14B | 100 | 100 | 0 | 0 | 0 | 40.0%
Grok 4.20 (Beta, Reasoning) | 100 | 0 | 0 | 0 | 0 | 20.0%
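Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 97 | 99.4%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-5.4 | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 91 | 98.2%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 91 | 98.2%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 89 | 97.8%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 89 | 97.8%
GPT-4.1 | 100 | 100 | 100 | 100 | 87 | 97.4%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 86 | 97.1%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 84 | 96.8%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 95 | 85 | 95.9%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 79 | 95.8%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 75 | 95.0%
GPT-5 | 100 | 100 | 100 | 100 | 73 | 94.6%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 72 | 94.5%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 71 | 94.2%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 87 | 82 | 93.9%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 70 | 93.9%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 70 | 93.9%
Mistral Medium 3.1 | 100 | 100 | 100 | 89 | 80 | 93.8%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 69 | 93.8%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 8B | 100 | 100 | 100 | 100 | 62 | 92.4%
Ministral 8B | 100 | 100 | 100 | 100 | 57 | 91.4%
Claude Sonnet 4.5 | 100 | 100 | 87 | 85 | 82 | 90.8%
Claude 3 Haiku | 100 | 100 | 95 | 82 | 75 | 90.4%
GPT-5.4 Nano | 100 | 100 | 90 | 89 | 72 | 90.3%
Claude Opus 4.6 | 100 | 100 | 100 | 89 | 57 | 89.2%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 42 | 88.4%
o4 Mini High | 100 | 100 | 100 | 80 | 62 | 88.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 79 | 54 | 86.5%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 31 | 86.2%
Z.AI GLM 4.5 | 100 | 100 | 100 | 87 | 44 | 86.2%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 31 | 86.1%
Grok 4.1 Fast | 100 | 97 | 89 | 85 | 57 | 85.6%
Z.AI GLM 4.7 | 100 | 100 | 100 | 89 | 36 | 85.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 95 | 71 | 57 | 84.6%
WizardLM 2 8x22b | 100 | 100 | 100 | 75 | 46 | 84.2%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 18 | 83.6%
Mistral Large 3 | 100 | 100 | 89 | 78 | 50 | 83.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 75 | 38 | 82.6%
Nemotron 3 Super | 100 | 100 | 80 | 75 | 52 | 81.4%
Nemotron 3 Nano | 100 | 100 | 95 | 86 | 22 | 80.5%
Gemini 2.5 Pro | 100 | 100 | 100 | 71 | 30 | 80.2%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5.2 | 100 | 100 | 100 | 72 | 27 | 79.7%
MiniMax M2.5 | 100 | 100 | 100 | 95 | 0 | 78.9%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 89 | 75 | 29 | 78.5%
o4 Mini | 100 | 100 | 100 | 62 | 28 | 77.9%
Ministral 3 3B | 100 | 100 | 100 | 52 | 33 | 77.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 82 | 0 | 76.5%
Qwen 3.5 122B | 100 | 100 | 80 | 54 | 46 | 76.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 77 | 0 | 75.4%
Stealth: Healer Alpha | 100 | 100 | 82 | 80 | 10 | 74.4%
GPT-5.1 | 100 | 100 | 80 | 71 | 14 | 73.2%
Ministral 3 14B | 100 | 90 | 78 | 55 | 37 | 72.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 89 | 57 | 55 | 50 | 70.1%
DeepSeek V3.2 | 100 | 100 | 80 | 51 | 18 | 69.8%
MiniMax M2.7 | 100 | 100 | 100 | 48 | 0 | 69.7%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 27 | 21 | 69.6%
Mistral Large 2 | 100 | 100 | 100 | 46 | 0 | 69.2%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 95 | 50 | 0 | 68.9%
GPT-5.4 Nano (Reasoning, Low) | 100 | 91 | 71 | 38 | 33 | 66.6%
Mistral Small Creative | 100 | 100 | 72 | 60 | 0 | 66.5%
Claude Sonnet 4 | 100 | 100 | 71 | 40 | 15 | 65.2%
Rocinante 12B | 100 | 100 | 100 | 24 | 0 | 64.7%
Hermes 3 70B | 100 | 100 | 95 | 29 | 0 | 64.7%
Z.AI GLM 5 | 100 | 100 | 75 | 33 | 14 | 64.5%
Claude Opus 4 | 92 | 82 | 75 | 39 | 33 | 64.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 70 | 49 | 0 | 63.8%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 68 | 50 | 0 | 63.7%
Grok 4 Fast | 100 | 85 | 70 | 54 | 0 | 61.6%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-5 Nano | 100 | 93 | 67 | 39 | 0 | 59.6%
Qwen 3 32B | 100 | 100 | 97 | 0 | 0 | 59.5%
Stealth: Hunter Alpha | 100 | 100 | 54 | 42 | 0 | 59.2%
Hermes 3 405B | 100 | 100 | 95 | 0 | 0 | 58.9%
Mistral Small 4 | 100 | 100 | 48 | 31 | 14 | 58.7%
Grok 4 | 100 | 100 | 82 | 0 | 0 | 56.5%
GPT-5 Mini | 100 | 100 | 75 | 0 | 0 | 55.0%
Aion 2.0 | 71 | 67 | 65 | 40 | 31 | 54.7%
GPT-4o Mini (temp=1) | 100 | 89 | 57 | 25 | 0 | 54.1%
Gemma 3 12B | 100 | 100 | 46 | 18 | 0 | 52.9%
Claude Haiku 4.5 | 100 | 100 | 50 | 14 | 0 | 52.8%
Gemma 3 27B | 89 | 78 | 43 | 31 | 22 | 52.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 33 | 18 | 0 | 50.3%
Claude 3.5 Sonnet | 85 | 67 | 46 | 0 | | 49.4%
GPT-4o, Aug. 6th (temp=1) | 89 | 67 | 60 | 18 | 0 | 46.8%
GPT-4o, May 13th (temp=1) | 100 | 100 | 24 | 6 | 0 | 46.0%
Gemini 3 Pro (Preview) | 100 | 57 | 33 | 21 | 0 | 42.4%
Grok 4.20 (Beta) | 95 | 67 | 49 | 0 | 0 | 42.1%
Arcee AI: Trinity Mini | 100 | 100 | 8 | 0 | 0 | 41.5%
Claude Opus 4.5 | 100 | 62 | 40 | 0 | 0 | 40.4%
Claude 3.5 Haiku | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemma 3 4B | 100 | 64 | 28 | 0 | 0 | 38.2%
DeepSeek-V2 Chat | 100 | 57 | 8 | 0 | 0 | 33.0%
LFM2 24B | 100 | 29 | 14 | 0 | 0 | 28.5%
DeepSeek V3.1 | 78 | 18 | 0 | 0 | 0 | 19.1%
Claude 3.7 Sonnet | 37 | 28 | 26 | 3 | 0 | 18.8%
GPT-4.1 Mini | 33 | 0 | 0 | 0 | 0 | 6.7%
GPT-4.1 Nano | 6 | 0 | 0 | 0 | 0 | 1.3%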
Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 97 | 99.5%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 95 | 98.9%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 95 | 98.9%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 95 | 98.9%
GPT-5.4 | 100 | 100 | 100 | 100 | 94 | 98.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 93 | 98.6%
GPT-5 | 100 | 100 | 100 | 100 | 89 | 97.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 89 | 97.8%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 87 | 97.5%
Gemma 3 27B | 100 | 100 | 100 | 100 | 86 | 97.1%
Ministral 8B | 100 | 100 | 100 | 100 | 82 | 96.5%
o4 Mini High | 100 | 100 | 100 | 100 | 79 | 95.8%
MiniMax M2.5 | 100 | 100 | 100 | 89 | 87 | 95.1%
Mistral Large | 100 | 100 | 100 | 100 | 75 | 95.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 95 | 79 | 94.7%
Claude Sonnet 4.6 | 100 | 100 | 100 | 98 | 75 | 94.7%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 70 | 93.9%
Z.AI GLM 4.7 | 100 | 100 | 100 | 95 | 73 | 93.6%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small Creative | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3B | 100 | 100 | 100 | 100 | 67 | 93.3%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 95 | 67 | 92.3%
Z.AI GLM 5 | 100 | 100 | 100 | 87 | 71 | 91.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 82 | 75 | 91.5%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 57 | 91.4%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 57 | 91.4%
Mistral Large 2 | 100 | 100 | 100 | 86 | 67 | 90.5%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 46 | 89.2%
Gemma 3 12B | 100 | 100 | 100 | 100 | 46 | 89.2%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 46 | 89.2%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 95 | 72 | 68 | 87.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 33 | 86.7%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 33 | 86.7%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 82 | 50 | 86.5%
GPT-5 Mini | 100 | 100 | 100 | 79 | 52 | 86.1%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 29 | 85.7%
GPT-5.1 | 100 | 100 | 100 | 73 | 54 | 85.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 92 | 33 | 85.0%
Claude Opus 4.5 | 100 | 100 | 100 | 82 | 37 | 83.8%
Claude Opus 4 | 100 | 100 | 100 | 100 | 18 | 83.6%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 92 | 22 | 82.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 64 | 50 | 82.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 71 | 42 | 82.6%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 10 | 81.9%
Aion 2.0 | 100 | 100 | 100 | 100 | 10 | 81.9%
Claude Opus 4.6 | 100 | 100 | 100 | 79 | 29 | 81.5%
MiniMax M2.7 | 100 | 100 | 100 | 95 | 8 | 80.5%
Z.AI GLM 4.5 | 100 | 100 | 86 | 70 | 46 | 80.3%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
Rocinante 12B | 100 | 100 | 100 | 57 | 33 | 78.1%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 95 | 95 | 0 | 77.9%
GPT-4o Mini (temp=1) | 100 | 100 | 93 | 82 | 6 | 76.3%
Gemini 3.1 Pro (Preview) | 100 | 100 | 92 | 57 | 30 | 75.8%
Writer: Palmyra X5 | 100 | 100 | 100 | 79 | 0 | 75.8%
DeepSeek V3.2 | 100 | 100 | 79 | 78 | 13 | 73.8%
GPT-4.1 | 100 | 100 | 89 | 75 | 0 | 72.8%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 82 | 64 | 10 | 71.1%
GPT-5 Nano | 100 | 100 | 57 | 46 | 13 | 63.2%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 6 | 4 | 62.1%
Grok 4.20 (Beta) | 100 | 86 | 71 | 52 | 0 | 61.7%
Claude 3.5 Haiku | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude 3 Haiku | 100 | 100 | 57 | 33 | 0 | 58.1%
Claude 3.7 Sonnet | 100 | 90 | 57 | 29 | 0 | 55.2%
Hermes 3 405B | 100 | 57 | 50 | 46 | 0 | 50.7%
Claude 3.5 Sonnet | 100 | 62 | 57 | 0 | 0 | 43.8%
Claude Haiku 4.5 | 97 | 67 | 24 | 0 | 0 | 37.5%
GPT-4.1 Mini | 100 | 57 | 0 | 0 | 0 | 31.4%
GPT-4.1 Nano | 82 | 18 | 0 | 0 | 0 | 20.1%
Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 97 | 99.5%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 97 | 99.5%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 97 | 99.5%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 95 | 98.9%
Grok 4 Fast | 100 | 100 | 100 | 100 | 89 | 97.8%
Ministral 3 14B | 100 | 100 | 100 | 100 | 87 | 97.4%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 86 | 97.1%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 86 | 97.1%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 82 | 96.5%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 82 | 96.5%
Mistral Small 4 | 100 | 100 | 100 | 100 | 80 | 96.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 80 | 95.9%
GPT-5.2 | 100 | 100 | 100 | 100 | 79 | 95.8%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 75 | 95.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 75 | 95.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 75 | 95.0%
Grok 4 | 100 | 100 | 100 | 89 | 85 | 94.7%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 72 | 94.5%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 72 | 94.5%
Ministral 3B | 100 | 100 | 100 | 100 | 71 | 94.2%
MiniMax M2.5 | 100 | 100 | 100 | 90 | 80 | 94.1%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 89 | 79 | 93.5%
Claude 3.5 Sonnet | 100 | 100 | 100 | 92 | 75 | 93.4%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 3 12B | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3.1 Pro (Preview) | 100 | 100 | 95 | 89 | 82 | 93.2%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 82 | 82 | 92.9%
LFM2 24B | 100 | 100 | 100 | 82 | 82 | 92.9%
Ministral 8B | 100 | 100 | 100 | 98 | 67 | 92.9%
o4 Mini High | 100 | 100 | 100 | 100 | 64 | 92.7%
Writer: Palmyra X5 | 100 | 100 | 100 | 98 | 64 | 92.5%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 62 | 92.4%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 84 | 74 | 91.5%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 97 | 60 | 91.2%
Z.AI GLM 4.7 | 100 | 100 | 100 | 93 | 60 | 90.5%
GPT-4.1 | 100 | 100 | 95 | 82 | 75 | 90.4%
Claude Opus 4.6 | 100 | 100 | 100 | 75 | 75 | 90.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 50 | 90.0%
o4 Mini | 100 | 100 | 87 | 82 | 79 | 89.6%
GPT-5 | 100 | 100 | 98 | 75 | 73 | 89.2%
Rocinante 12B | 100 | 100 | 100 | 89 | 57 | 89.2%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 43 | 88.6%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 42 | 88.4%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 42 | 88.4%
DeepSeek V3 (2025-03-24) | 100 | 100 | 95 | 75 | 71 | 88.1%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 95 | 44 | 87.7%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 93 | 37 | 86.1%
DeepSeek V3.2 | 100 | 100 | 100 | 78 | 50 | 85.5%
Gemini 2.5 Flash | 100 | 100 | 100 | 89 | 33 | 84.4%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 95 | 89 | 33 | 83.4%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 13 | 82.5%
Qwen 3.5 9B | 100 | 100 | 100 | 95 | 14 | 81.7%
Grok 4.20 (Beta) | 98 | 93 | 93 | 81 | 44 | 81.7%
Stealth: Healer Alpha | 100 | 100 | 100 | 85 | 22 | 81.4%
Stealth: Hunter Alpha | 100 | 100 | 100 | 54 | 50 | 80.7%
Z.AI GLM 4.5 | 100 | 100 | 98 | 98 | 6 | 80.4%
GPT-5.1 | 100 | 91 | 82 | 77 | 50 | 80.1%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 3 27B | 100 | 100 | 91 | 60 | 49 | 79.9%
Claude 3 Haiku | 100 | 100 | 100 | 97 | 0 | 79.5%
DeepSeek V3.1 | 100 | 100 | 100 | 92 | 0 | 78.4%
Mistral Small 4 (Reasoning) | 100 | 96 | 78 | 67 | 51 | 78.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 75 | 0 | 75.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 67 | 0 | 73.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 46 | 6 | 70.3%
GPT-4o, May 13th (temp=0) | 100 | 100 | 82 | 42 | 18 | 68.5%
Hermes 3 70B | 100 | 100 | 100 | 40 | 0 | 68.0%
Arcee AI: Trinity Mini | 100 | 75 | 75 | 67 | 18 | 67.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 92 | 64 | 57 | 18 | 66.2%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 48 | 42 | 33 | 64.8%
Z.AI GLM 5 | 100 | 82 | 73 | 46 | 15 | 63.3%
Claude Sonnet 4.6 | 100 | 100 | 57 | 52 | 0 | 61.8%
Llama 3.1 8B | 100 | 100 | 62 | 44 | 0 | 61.3%
Mistral Small 3.2 24B | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-5 Mini | 100 | 100 | 37 | 36 | 21 | 58.8%
Claude Sonnet 4.5 | 100 | 100 | 88 | 4 | 0 | 58.3%
Claude Sonnet 4 | 100 | 100 | 57 | 25 | 0 | 56.3%
Z.AI GLM 4.7 Flash | 100 | 89 | 70 | 14 | 0 | 54.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 67 | 0 | 0 | 53.3%
DeepSeek-V2 Chat | 100 | 72 | 50 | 41 | 0 | 52.6%
Gemini 2.5 Flash Lite | 100 | 85 | 50 | 18 | 0 | 50.6%
GPT-4o Mini (temp=1) | 100 | 57 | 46 | 40 | 0 | 48.7%
Llama 3.1 70B | 100 | 82 | 18 | 0 | 0 | 40.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 0 | 0 | 0 | 40.0%
Hermes 3 405B | 100 | 54 | 42 | 0 | 0 | 39.2%
GPT-4.1 Mini | 100 | 57 | 18 | 0 | 0 | 35.1%
Claude Haiku 4.5 | 95 | 50 | 30 | 0 | 0 | 34.9%
GPT-5 Nano | 79 | 48 | 33 | 0 | 0 | 32.0%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 95 | 98.9%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 95 | 98.9%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 89 | 97.8%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 89 | 97.8%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 89 | 97.8%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 89 | 97.8%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 85 | 96.9%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 82 | 96.5%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 82 | 96.5%
Gemma 3 12B | 100 | 100 | 100 | 100 | 82 | 96.5%
LFM2 24B | 100 | 100 | 100 | 100 | 82 | 96.5%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 79 | 95.8%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 75 | 95.0%
o4 Mini | 100 | 100 | 100 | 100 | 75 | 95.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 75 | 95.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 75 | 95.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 75 | 95.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 71 | 94.2%
Qwen 3 32B | 100 | 100 | 100 | 95 | 75 | 93.9%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 57 | 91.4%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 57 | 91.4%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 57 | 91.4%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 46 | 89.2%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 46 | 89.2%
Mistral Large 2 | 100 | 100 | 100 | 100 | 46 | 89.2%
Rocinante 12B | 100 | 100 | 100 | 100 | 46 | 89.2%
Z.AI GLM 4.5 | 100 | 100 | 100 | 85 | 57 | 88.4%
GPT-4.1 | 100 | 100 | 100 | 75 | 67 | 88.3%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 40 | 88.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 33 | 86.7%
Hermes 3 70B | 100 | 100 | 100 | 100 | 33 | 86.7%
Writer: Palmyra X5 | 100 | 100 | 100 | 92 | 40 | 86.4%
Z.AI GLM 4.7 | 100 | 100 | 89 | 75 | 57 | 84.2%
GPT-5.4 Nano | 100 | 100 | 100 | 57 | 57 | 82.9%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 82 | 18 | 80.1%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Grok 4 | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 0 | 80.0%
Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 67 | 33 | 80.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 95 | 0 | 78.9%
GPT-5.2 | 100 | 100 | 100 | 82 | 0 | 76.5%
Qwen 3.5 9B | 100 | 100 | 100 | 82 | 0 | 76.5%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 46 | 33 | 75.9%
Ministral 3 8B | 100 | 100 | 100 | 75 | 0 | 75.0%
GPT-5 | 100 | 93 | 91 | 82 | 0 | 73.2%
o4 Mini High | 100 | 100 | 100 | 57 | 0 | 71.4%
Claude 3.5 Sonnet | 100 | 100 | 100 | 57 | 0 | 71.4%
DeepSeek V3.1 | 100 | 100 | 100 | 57 | 0 | 71.4%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 57 | 0 | 71.4%
GPT-5.1 | 100 | 85 | 70 | 67 | 33 | 70.8%
Hermes 3 405B | 100 | 100 | 100 | 46 | 0 | 69.2%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 33 | 0 | 66.7%
Claude Opus 4.5 | 100 | 100 | 54 | 46 | 24 | 64.7%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 82 | 40 | 0 | 64.5%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 18 | 0 | 63.6%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-4.1 Mini | 100 | 100 | 100 | 0 | 0 | 60.0%
Grok 4.20 (Beta) | 100 | 67 | 67 | 57 | 0 | 58.1%
GPT-4.1 Nano | 100 | 100 | 57 | 0 | 0 | 51.4%
GPT-5 Mini | 100 | 100 | 18 | 0 | 0 | 43.6%
DeepSeek V3 (2025-03-24) | 100 | 100 | 0 | 0 | 0 | 40.0%
Claude Opus 4 | 79 | 57 | 18 | 0 | 0 | 30.8%
GPT-5 Nano | 100 | 0 | 0 | 0 | 0 | 20.0%

Novelcrafter Default Prompt

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 97 | 99.3%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 95 | 98.9%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 89 | 97.8%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 86 | 97.1%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 82 | 96.5%
Gemma 3 4B | 100 | 100 | 100 | 100 | 82 | 96.5%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 82 | 96.5%
Grok 4.20 (Beta) | 100 | 100 | 100 | 85 | 85 | 93.8%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 52 | 90.4%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 52 | 90.4%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 52 | 90.4%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 46 | 89.2%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 46 | 89.2%
Rocinante 12B | 100 | 100 | 100 | 100 | 42 | 88.4%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 89 | 52 | 88.1%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 82 | 82 | 67 | 86.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 89 | 33 | 84.4%
Claude Opus 4 | 100 | 100 | 100 | 100 | 18 | 83.6%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 13 | 82.5%
Mistral Small 4 | 100 | 100 | 100 | 82 | 30 | 82.4%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 80 | 24 | 80.7%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 405B | 100 | 100 | 100 | 95 | 0 | 78.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 92 | 0 | 78.4%
Claude 3.7 Sonnet | 100 | 100 | 97 | 52 | 40 | 77.7%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 33 | 33 | 73.3%
Claude Haiku 4.5 | 100 | 100 | 89 | 67 | 0 | 71.1%
GPT-4.1 Nano | 100 | 89 | 89 | 0 | 0 | 55.6%
Claude 3.5 Sonnet | 100 | 100 | 75 | 0 | 0 | 55.0%
Claude Sonnet 4.6 | 100 | 100 | 18 | 0 | 0 | 43.6%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 13 | 0 | 0 | 42.5%
GPT-5 Nano | 100 | 62 | 29 | 0 | 0 | 38.3%
Hermes 3 70B | 100 | 26 | 0 | 0 | 0 | 25.2%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 95 | 98.9%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 89 | 97.8%
Mistral Small 4 | 100 | 100 | 100 | 100 | 89 | 97.8%
Gemma 3 4B | 100 | 100 | 100 | 100 | 89 | 97.8%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 75 | 95.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 75 | 95.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 75 | 95.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 57 | 91.4%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 57 | 91.4%
Claude Opus 4 | 100 | 100 | 100 | 100 | 57 | 91.4%
Gemma 3 27B | 100 | 100 | 100 | 100 | 46 | 89.2%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 33 | 86.7%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 33 | 86.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 33 | 86.7%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 33 | 86.7%
Claude 3.7 Sonnet | 100 | 100 | 100 | 75 | 57 | 86.4%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 18 | 83.6%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 89 | 18 | 81.4%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
o4 Mini High | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 0 | 80.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 0 | 80.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 75 | 0 | 75.0%
Grok 4 | 100 | 100 | 89 | 57 | 18 | 72.8%
Mistral Large 2 | 100 | 100 | 100 | 57 | 0 | 71.4%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 57 | 0 | 71.4%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 57 | 0 | 71.4%
Mistral Large | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Small Creative | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3 14B | 100 | 100 | 100 | 0 | 0 | 60.0%
Hermes 3 70B | 100 | 100 | 46 | 33 | 0 | 55.9%
Mistral Large 3 | 100 | 100 | 0 | 0 | 0 | 40.0%
GPT-5 Nano | 100 | 0 | 0 | 0 | 0 | 20.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
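Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 98 | 99.7%
GPT-5 Mini | 100 | 100 | 100 | 100 | 97 | 99.4%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 95 | 98.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 98 | 95 | 98.6%
Grok 4 Fast | 100 | 100 | 100 | 100 | 93 | 98.6%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 91 | 98.2%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 86 | 97.1%
Qwen 3 32B | 100 | 100 | 100 | 100 | 82 | 96.5%
GPT-5 Nano | 100 | 100 | 100 | 100 | 75 | 95.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 75 | 95.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 70 | 93.9%
Mistral Large | 100 | 100 | 100 | 100 | 64 | 92.9%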
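Gemma 3 4B | 100 | 100 | 89 | 86 | 82 | 91.4%
Z.AI GLM 5 | 100 | 100 | 100 | 77 | 73 | 90.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 97 | 52 | 89.9%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 49 | 89.8%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 93 | 52 | 88.9%
GPT-4o Mini (temp=1) | 100 | 100 | 97 | 93 | 48 | 87.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 38 | 87.6%
Claude 3.5 Sonnet | 100 | 100 | 100 | 89 | 33 | 84.4%
Claude Sonnet 4 | 100 | 100 | 100 | 64 | 57 | 84.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 18 | 83.6%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 14 | 82.8%
Z.AI GLM 4.7 Flash | 100 | 97 | 92 | 75 | 46 | 82.1%
Claude Opus 4 | 100 | 100 | 100 | 97 | 13 | 82.0%
Llama 3.1 8B | 100 | 100 | 100 | 89 | 13 | 80.3%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash Lite | 100 | 100 | 92 | 57 | 50 | 79.8%
DeepSeek V3.1 | 100 | 100 | 100 | 72 | 26 | 79.7%
Llama 3.1 70B | 100 | 100 | 100 | 89 | 0 | 77.8%
Ministral 3B | 100 | 100 | 100 | 62 | 26 | 77.6%
LFM2 24B | 100 | 100 | 100 | 67 | 18 | 77.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 42 | 33 | 75.1%
Mistral Small 3.2 24B | 100 | 100 | 100 | 67 | 0 | 73.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 97 | 50 | 18 | 72.9%
MiniMax M2.7 | 100 | 100 | 67 | 55 | 41 | 72.6%
Ministral 3 8B | 100 | 100 | 92 | 71 | 0 | 72.6%
Hermes 3 405B | 100 | 100 | 100 | 62 | 0 | 72.4%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 60 | 0 | 72.1%
Mistral NeMO | 100 | 100 | 100 | 33 | 26 | 71.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 40 | 0 | 68.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 85 | 52 | 0 | 67.3%
Claude 3.5 Haiku | 100 | 100 | 100 | 33 | 0 | 66.7%
Rocinante 12B | 100 | 100 | 46 | 46 | 40 | 66.5%
Claude 3.7 Sonnet | 100 | 100 | 74 | 54 | 0 | 65.6%
Claude 3 Haiku | 100 | 100 | 100 | 18 | 10 | 65.5%
Gemini 2.5 Flash | 95 | 75 | 67 | 64 | 0 | 60.0%
Hermes 3 70B | 100 | 100 | 97 | 0 | 0 | 59.5%
MiniMax M2.5 | 93 | 67 | 63 | 62 | 8 | 58.5%
DeepSeek V3 (2024-12-26) | 100 | 91 | 62 | 22 | 0 | 55.0%
Arcee AI: Trinity Mini | 100 | 62 | 42 | 33 | 0 | 47.5%
GPT-4o Mini (temp=0) | 100 | 80 | 18 | 18 | 10 | 45.1%
Claude Haiku 4.5 | 100 | 64 | 43 | 0 | 0 | 41.4%
Gemma 3 12B | 100 | 43 | 21 | 21 | 0 | 37.2%
GPT-4.1 Mini | 82 | 54 | 43 | 5 | 0 | 36.8%
Arcee AI: Trinity Large (Preview) | 100 | 50 | 0 | 0 | 0 | 30.0%
GPT-4.1 Nano | 100 | 42 | 0 | 0 | 0 | 28.4%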
Model | #1 | #2 | #3 | #4 | #5 | Avg
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
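Qwen 3.5 122B | 100 | 100 | 100 | 100 | 99 | 99.7%
LFM2 24B | 100 | 100 | 100 | 100 | 97 | 99.5%
GPT-5 Mini | 100 | 100 | 100 | 100 | 95 | 98.9%
GPT-4.1 Mini | 100 | 100 | 100 | 97 | 95 | 98.4%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 89 | 97.8%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 89 | 97.8%
Mistral Small 4 | 100 | 100 | 100 | 100 | 89 | 97.8%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 87 | 97.4%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 82 | 96.5%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 82 | 96.5%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 80 | 96.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 75 | 95.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 97 | 75 | 94.5%
Gemma 3 4B | 100 | 100 | 100 | 97 | 72 | 94.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 62 | 92.4%
WizardLM 2 8x22b | 100 | 100 | 100 | 87 | 71 | 91.6%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 57 | 91.4%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 57 | 91.4%
Mistral Large | 100 | 100 | 100 | 100 | 52 | 90.4%
Z.AI GLM 4.6 | 100 | 100 | 100 | 86 | 62 | 89.6%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 46 | 89.2%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 79 | 67 | 89.1%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 40 | 88.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 40 | 88.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 33 | 86.7%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 33 | 86.7%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 21 | 84.3%
GPT-4.1 Nano | 100 | 100 | 89 | 89 | 29 | 81.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 0 | 80.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 8B | 100 | 100 | 95 | 67 | 33 | 78.9%
GPT-5 Nano | 100 | 100 | 75 | 65 | 52 | 78.3%
Gemma 3 27B | 100 | 100 | 95 | 42 | 5 | 68.3%
Ministral 3B | 100 | 100 | 82 | 18 | 0 | 60.1%
Claude Haiku 4.5 | 100 | 100 | 67 | 26 | 0 | 58.6%
Hermes 3 70B | 100 | 100 | 0 | 0 | 0 | 40.0%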
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
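Qwen 3.5 35B | 100 | 100 | 100 | 100 | 97 | 99.4%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 95 | 98.9%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 95 | 98.9%
Mistral Small 4 | 100 | 100 | 100 | 100 | 93 | 98.6%
Grok 4 Fast | 100 | 100 | 100 | 100 | 92 | 98.4%
Mistral Large 2 | 100 | 100 | 100 | 100 | 92 | 98.4%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 92 | 98.4%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 89 | 97.8%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 89 | 97.8%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 89 | 97.8%
Gemma 3 4B | 100 | 100 | 100 | 100 | 89 | 97.8%
Ministral 8B | 100 | 100 | 100 | 100 | 86 | 97.1%
Claude Opus 4 | 100 | 100 | 100 | 100 | 85 | 97.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 97 | 85 | 96.2%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 80 | 96.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 78 | 95.6%
o4 Mini High | 100 | 100 | 100 | 100 | 75 | 95.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 73 | 94.6%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 70 | 94.0%
o4 Mini | 100 | 100 | 100 | 100 | 54 | 90.7%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 81 | 71 | 90.3%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 93 | 49 | 88.5%
Ministral 3 3B | 100 | 100 | 100 | 100 | 40 | 88.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 31 | 86.1%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 30 | 86.0%
GPT-5 Mini | 100 | 100 | 90 | 70 | 68 | 85.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 75 | 52 | 85.4%
MiniMax M2.7 | 100 | 100 | 82 | 65 | 60 | 81.4%
Gemma 3 27B | 100 | 100 | 100 | 87 | 18 | 81.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
MiniMax M2.5 | 100 | 100 | 100 | 57 | 39 | 79.2%
Z.AI GLM 5 | 100 | 100 | 100 | 57 | 36 | 78.6%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 71 | 18 | 77.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 46 | 40 | 77.2%
GPT-4.1 Mini | 100 | 100 | 100 | 52 | 29 | 76.1%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 75 | 0 | 75.0%
Claude Opus 4.5 | 100 | 100 | 100 | 38 | 25 | 72.6%
Hermes 3 405B | 100 | 100 | 100 | 52 | 0 | 70.4%
Gemini 2.5 Flash Lite | 100 | 100 | 75 | 57 | 10 | 68.3%
Qwen 3 32B | 100 | 100 | 100 | 33 | 0 | 66.7%
Claude Sonnet 4 | 100 | 100 | 100 | 18 | 0 | 63.6%
Rocinante 12B | 100 | 100 | 100 | 18 | 0 | 63.6%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 13 | 0 | 62.5%
GPT-4o, May 13th (temp=1) | 100 | 100 | 95 | 0 | 0 | 58.9%
Gemma 3 12B | 100 | 100 | 43 | 33 | 0 | 55.3%
GPT-4.1 Nano | 100 | 79 | 71 | 0 | 0 | 50.0%
Claude 3 Haiku | 100 | 100 | 33 | 14 | 0 | 49.5%
Arcee AI: Trinity Mini | 100 | 89 | 40 | 0 | 0 | 45.8%
Hermes 3 70B | 100 | 57 | 46 | 0 | 0 | 40.7%
Llama 3.1 8B | 100 | 89 | 0 | 0 | 0 | 37.8%
Llama 3.1 70B | 75 | 57 | 18 | 18 | 0 | 33.7%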
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
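MiniMax M2.7 | 100 | 100 | 100 | 100 | 97 | 99.5%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 95 | 98.9%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 95 | 98.9%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 92 | 98.4%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 89 | 97.8%
o4 Mini | 100 | 100 | 100 | 100 | 82 | 96.5%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 79 | 95.8%
GPT-4.1 | 100 | 100 | 100 | 100 | 75 | 95.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 75 | 95.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 75 | 95.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 75 | 95.0%
LFM2 24B | 100 | 100 | 100 | 100 | 75 | 95.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 89 | 70 | 91.7%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 89 | 67 | 91.1%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 46 | 89.2%
Grok 4 Fast | 100 | 100 | 100 | 100 | 33 | 86.7%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 18 | 83.6%
Gemma 3 27B | 100 | 100 | 100 | 95 | 10 | 80.9%
Z.AI GLM 4.7 | 100 | 100 | 100 | 57 | 46 | 80.7%
Llama 3.1 Nemotron 70B | 100 | 100 | 89 | 57 | 57 | 80.6%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 5 | 100 | 100 | 100 | 89 | 0 | 77.8%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 67 | 18 | 77.0%
GPT-4.1 Mini | 100 | 100 | 100 | 67 | 0 | 73.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 67 | 0 | 73.3%
GPT-5 Nano | 100 | 100 | 79 | 50 | 33 | 72.4%
Hermes 3 70B | 100 | 100 | 100 | 38 | 18 | 71.2%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 33 | 0 | 66.7%
GPT-4.1 Nano | 100 | 100 | 75 | 57 | 0 | 66.4%
Hermes 3 405B | 100 | 100 | 100 | 0 | 0 | 60.0%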