Cliché density

Test: Bad Writing Habits

Avg. Score: 93.2%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Grok 4.3 | 100.0% | $0.0069 | 30.5s | 100%
2 | Gemini 3.1 Flash Lite | 98.5% | $0.0030 | 12.1s | 86%
3 | Gemini 3.1 Flash Lite (Preview) | 98.9% | $0.0030 | 8.4s | 84%
4 | Gemma 4 26B | 99.6% | $0.0009 | 55.1s | 93%
5 | Xiaomi MIMO v2.5 | 99.3% | $0.0054 | 31.8s | 90%
6 | Gemini 3.1 Flash Lite (Reasoning) | 98.9% | $0.0030 | 11.9s | 84%
7 | Gemini 3 Flash (Preview) | 98.9% | $0.0078 | 19.6s | 88%
8 | Z.AI GLM 5 Turbo | 99.3% | $0.0081 | 33.2s | 90%
9 | ByteDance Seed 1.6 Flash | 98.5% | $0.0013 | 27.3s | 86%
10 | Gemini 3 Flash (Preview, Reasoning) | 99.3% | $0.012 | 30.1s | 90%
11 | Claude Sonnet 4.5 | 100.0% | $0.035 | 38.1s | 100%
12 | Gemini 3.5 Flash (Reasoning, Minimal) | 98.9% | $0.018 | 12.0s | 88%
13 | Ministral 3 14B | 97.4% | $0.0007 | 11.7s | 82%
14 | Claude Haiku 4.5 | 98.9% | $0.011 | 21.6s | 84%
15 | Grok 4.20 | 98.9% | $0.0093 | 45.7s | 88%
16 | GPT-5 Mini | 99.3% | $0.0100 | 57.4s | 90%
17 | Mistral Large 2 | 98.5% | $0.013 | 29.4s | 86%
18 | Mistral Small 4 | 97.0% | $0.0014 | 18.2s | 81%
19 | DeepSeek V3 (2025-03-24) | 98.5% | $0.0014 | 39.4s | 83%
20 | Qwen 3 32B | 98.5% | $0.0015 | 54.6s | 86%
21 | DeepSeek V4 Flash (Reasoning) | 97.4% | $0.0007 | 31.1s | 82%
22 | Claude Sonnet 4.6 | 99.6% | $0.031 | 39.3s | 93%
23 | Writer: Palmyra X5 | 97.8% | $0.011 | 22.0s | 83%
24 | GPT-4.1 Mini | 96.3% | $0.0027 | 19.0s | 79%
25 | Qwen 3.6 Flash | 98.5% | $0.010 | 41.4s | 83%
26 | DeepSeek V4 Flash | 98.1% | $0.0006 | 31.6s | 77%
27 | GPT-4o, Aug. 6th (temp=1) | 97.8% | $0.018 | 24.4s | 83%
28 | Gemma 4 31B | 99.3% | $0.0010 | 1.6m | 90%
29 | LFM2 24B | 96.3% | $0.0002 | 28.4s | 79%
30 | Qwen3 235B A22B Instruct 2507 | 97.8% | $0.0011 | 59.2s | 83%
31 | Arcee AI: Trinity Mini | 95.6% | $0.0003 | 9.2s | 75%
32 | Claude Sonnet 4 | 99.3% | $0.032 | 43.7s | 90%
33 | Stealth: Healer Alpha | 96.3% | $0.0000 | 23.7s | 77%
34 | MiniMax M2.5 | 98.5% | $0.0034 | 1.3m | 86%
35 | Grok 4.20 (Beta) | 97.0% | $0.018 | 15.8s | 81%
36 | MiniMax M2.7 | 98.1% | $0.0040 | 1.1m | 85%
37 | DeepSeek V4 Pro | 98.5% | $0.0048 | 1.3m | 86%
38 | Claude 3.5 Sonnet | 99.6% | $0.048 | 35.5s | 93%
39 | Z.AI GLM 5 | 98.5% | $0.0084 | 1.2m | 86%
40 | Mistral Large 3 | 96.7% | $0.0033 | 30.3s | 78%
41 | Claude Opus 4.7 | 100.0% | $0.069 | 30.4s | 100%
42 | GPT-5.4 Mini | 96.3% | $0.015 | 16.8s | 79%
43 | Gemma 4 31B (Reasoning) | 99.6% | $0.0014 | 2.2m | 93%
44 | GPT-4.1 | 97.8% | $0.018 | 44.7s | 83%
45 | Gemma 4 26B (Reasoning) | 99.3% | $0.0013 | 2.0m | 90%
46 | Qwen 3.5 9B | 97.8% | $0.0011 | 1.4m | 83%
47 | Grok 4.1 Fast | 95.9% | $0.0018 | 37.8s | 76%
48 | Grok 4.3 (Reasoning) | 100.0% | $0.021 | 2.3m | 100%
49 | Mistral Small 4 (Reasoning) | 96.3% | $0.0022 | 30.2s | 73%
50 | GPT-5.4 Nano (Reasoning) | 94.8% | $0.0061 | 24.5s | 76%
51 | Qwen 3.5 Flash | 96.3% | $0.0025 | 47.5s | 77%
52 | GPT-5.4 Mini (Reasoning) | 96.7% | $0.022 | 28.1s | 80%
53 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.060 | 1.2m | 100%
54 | Grok 4.20 (Beta, Reasoning) | 98.1% | $0.039 | 34.0s | 85%
55 | Mistral Medium 3.1 | 97.0% | $0.0048 | 36.5s | 73%
56 | Xiaomi MIMO v2.5 Pro | 97.4% | $0.0085 | 53.5s | 77%
57 | Grok 4 Fast | 94.8% | $0.0017 | 24.1s | 72%
58 | Mistral Large | 95.9% | $0.014 | 30.9s | 76%
59 | Qwen 3.6 35B | 97.0% | $0.0083 | 1.0m | 79%
60 | Gemini 2.5 Flash Lite (Reasoning) | 94.4% | $0.0028 | 30.8s | 73%
61 | Z.AI GLM 5.1 | 98.1% | $0.014 | 1.5m | 85%
62 | Aion 2.0 | 96.7% | $0.0064 | 1.3m | 80%
63 | Grok 4.20 (Reasoning) | 98.1% | $0.018 | 1.5m | 85%
64 | Z.AI GLM 4.5 Air | 95.6% | $0.0029 | 58.2s | 75%
65 | Claude Opus 4.7 (Reasoning) | 99.6% | $0.076 | 32.0s | 93%
66 | Z.AI GLM 4.7 Flash | 96.7% | $0.0017 | 1.2m | 76%
67 | Qwen 3.5 Plus (2026-02-15) | 94.1% | $0.0060 | 31.5s | 73%
68 | Z.AI GLM 4.5 | 94.1% | $0.0051 | 42.1s | 75%
69 | GPT-5.4 Mini (Reasoning, Low) | 94.4% | $0.015 | 16.8s | 71%
70 | Rocinante 12B | 95.6% | $0.0014 | 38.4s | 68%
71 | Gemma 3 4B | 93.3% | $0.0002 | 20.0s | 65%
72 | DeepSeek V3.1 | 96.7% | $0.0020 | 1.8m | 80%
73 | ByteDance Seed 2.0 Lite | 98.5% | $0.012 | 2.2m | 86%
74 | Qwen 3.5 35B | 96.7% | $0.018 | 1.0m | 76%
75 | Z.AI GLM 4.7 | 96.7% | $0.010 | 1.4m | 78%
76 | DeepSeek V3.2 | 96.7% | $0.0014 | 1.9m | 80%
77 | o4 Mini | 94.1% | $0.015 | 25.7s | 69%
78 | Z.AI GLM 4.6 | 94.8% | $0.0065 | 51.5s | 70%
79 | Mistral Small Creative | 92.2% | $0.0007 | 9.1s | 61%
80 | Gemma 3 27B | 93.7% | $0.0006 | 52.6s | 69%
81 | Gemini 2.5 Pro | 96.3% | $0.036 | 36.2s | 75%
82 | GPT-5 Nano | 95.6% | $0.0042 | 1.4m | 73%
83 | Stealth: Hunter Alpha | 94.4% | $0.0000 | 55.0s | 65%
84 | GPT-5.4 | 98.5% | $0.049 | 1.4m | 86%
85 | Ministral 8B | 91.1% | $0.0004 | 10.4s | 59%
86 | o4 Mini High | 95.2% | $0.025 | 47.2s | 71%
87 | Qwen 3.5 122B | 95.2% | $0.025 | 1.1m | 75%
88 | Gemini 3 Pro (Preview) | 97.0% | $0.055 | 54.4s | 81%
89 | GPT-5.4 Nano (Reasoning, Low) | 90.4% | $0.0055 | 20.6s | 64%
90 | Claude Opus 4.6 | 99.3% | $0.078 | 1.2m | 90%
91 | GPT-5.4 (Reasoning, Low) | 98.1% | $0.055 | 1.4m | 85%
92 | Gemini 2.5 Flash Lite | 91.1% | $0.0009 | 9.5s | 57%
93 | Ministral 3B | 90.7% | $0.0001 | 8.1s | 57%
94 | Hermes 3 405B | 92.6% | $0.0032 | 53.2s | 64%
95 | Claude Opus 4.5 | 97.8% | $0.070 | 53.4s | 83%
96 | MoonshotAI: Kimi K2.5 | 99.3% | $0.019 | 3.2m | 90%
97 | Claude Opus 4.6 (Reasoning) | 99.6% | $0.088 | 1.4m | 93%
98 | Gemini 3.5 Flash (Reasoning) | 97.0% | $0.071 | 37.6s | 79%
99 | Nemotron 3 Super | 92.2% | $0.0000 | 1.4m | 67%
100 | Gemini 2.5 Flash (Reasoning) | 90.0% | $0.011 | 21.5s | 60%
101 | Gemma 3 12B | 90.7% | $0.0004 | 41.3s | 58%
102 | GPT-4o Mini (temp=1) | 88.9% | $0.0012 | 34.8s | 59%
103 | GPT-5.4 Nano | 87.0% | $0.0057 | 26.3s | 61%
104 | Grok 4 | 96.3% | $0.048 | 1.7m | 79%
105 | Ministral 3 8B | 90.7% | $0.0008 | 19.6s | 49%
106 | WizardLM 2 8x22b | 93.3% | $0.0026 | 1.8m | 65%
107 | ByteDance Seed 1.6 | 95.9% | $0.013 | 2.5m | 74%
108 | GPT-5.1 | 96.7% | $0.054 | 1.8m | 80%
109 | Qwen 3.5 Plus (2026-04-20) | 93.3% | $0.017 | 1.8m | 70%
110 | Qwen 3.5 27B | 93.0% | $0.020 | 1.6m | 68%
111 | GPT-4.1 Nano | 86.3% | $0.0007 | 13.3s | 51%
112 | DeepSeek V4 Pro (Reasoning) | 97.8% | $0.015 | 3.1m | 76%
113 | Qwen 3.6 27B | 94.4% | $0.025 | 2.3m | 75%
114 | GPT-5 | 99.3% | $0.065 | 2.8m | 90%
115 | Claude 3.7 Sonnet | 91.5% | $0.042 | 46.7s | 63%
116 | Gemini 3.1 Pro (Preview) | 99.3% | $0.107 | 1.8m | 90%
117 | Arcee AI: Trinity Large (Preview) | 86.7% | $0.0000 | 43.6s | 48%
118 | Gemini 2.5 Flash | 84.4% | $0.0052 | 10.6s | 46%
119 | GPT-4o, May 13th (temp=1) | 87.0% | $0.033 | 14.4s | 54%
120 | Qwen3.7 Max | 96.7% | $0.068 | 2.3m | 80%
121 | GPT-5.4 (Reasoning) | 98.9% | $0.089 | 2.6m | 88%
122 | Ministral 3 3B | 85.6% | $0.0005 | 11.1s | 40%
123 | DeepSeek-V2 Chat | 86.3% | $0.0021 | 53.3s | 48%
124 | Cohere Command R+ (Aug. 2024) | 86.7% | $0.020 | 52.5s | 55%
125 | Llama 3.1 Nemotron 70B | 84.4% | $0.0038 | 31.7s | 46%
126 | Qwen3.6 Max Preview | 97.8% | $0.050 | 3.5m | 83%
127 | Hermes 3 70B | 87.0% | $0.0010 | 1.2m | 49%
128 | DeepSeek V3 (2024-12-26) | 86.3% | $0.0021 | 54.6s | 46%
129 | Llama 3.1 8B | 85.9% | $0.0003 | 1.3m | 48%
130 | Qwen 3.5 397B A17B | 93.0% | $0.014 | 3.0m | 65%
131 | GPT-5.5 (Reasoning) | 99.3% | $0.142 | 1.8m | 90%
132 | GPT-5.5 (Reasoning, Low) | 98.1% | $0.139 | 1.8m | 85%
133 | ByteDance Seed 2.0 Mini | 96.3% | $0.0045 | 4.9m | 71%
134 | Mistral NeMO | 78.1% | $0.0005 | 10.1s | 36%
135 | GPT-5.2 | 88.5% | $0.056 | 1.5m | 59%
136 | GPT-5.5 | 97.0% | $0.139 | 1.7m | 81%
137 | Nemotron 3 Nano | 80.4% | $0.0010 | 1.1m | 44%
138 | MoonshotAI: Kimi K2.6 | 100.0% | $0.058 | 6.5m | 100%
139 | Claude Opus 4 | 99.3% | $0.209 | 1.4m | 90%
140 | GPT-OSS 120B | 77.4% | $0.0015 | 1.8m | 41%
141 | Llama 3.1 70B | 70.4% | $0.0015 | 29.4s | 24%
142 | Stealth: Aurora Alpha | 66.3% | $0.0000 | 9.8s | 23%
143 | Inception Mercury 2 | 64.1% | $0.0032 | 7.0s | 24%
144 | Claude 3 Haiku | 64.4% | $0.0025 | 14.9s | 21%
145 | GPT-4o, Aug. 6th (temp=0) | 68.1% | $0.023 | 22.7s | 24%
146 | GPT-4o Mini (temp=0) | 63.7% | $0.0012 | 34.8s | 18%
147 | Qwen 2.5 72B | 61.1% | $0.0010 | 36.7s | 16%
148 | Inception Mercury | 63.0% | $0.011 | 17.6s | 11%
149 | GPT-4o, May 13th (temp=0) | 53.0% | $0.035 | 14.1s | 18%
150 | Mistral Small 3.2 24B | 52.4% | $0.0068 | 5.6m | 9%

Individual Scenarios

Detailed Writing Rules

Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
o4 Mini High | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 67 | 93.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4 Fast | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 14B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
LFM2 24B | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 33 | 86.7%
Qwen 3.5 9B | 100 | 100 | 100 | 67 | 67 | 86.7%
Mistral Large 3 | 100 | 100 | 100 | 67 | 67 | 86.7%
Grok 4.20 (Beta) | 100 | 100 | 100 | 67 | 67 | 86.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 33 | 86.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 33 | 86.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-5.4 Nano | 100 | 100 | 100 | 67 | 67 | 86.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 33 | 86.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 33 | 86.7%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 67 | 0 | 73.3%
Claude 3 Haiku | 100 | 67 | 67 | 67 | 67 | 73.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 67 | 67 | 0 | 66.7%
GPT-4o, May 13th (temp=0) | 100 | 67 | 67 | 33 | 33 | 60.0%
Inception Mercury 2 | 100 | 67 | 67 | 33 | 33 | 60.0%
Mistral NeMO | 100 | 67 | 67 | 67 | 0 | 60.0%
GPT-OSS 120B | 100 | 100 | 67 | 0 | 0 | 53.3%
Inception Mercury | 100 | 67 | 33 | 0 | 0 | 40.0%
Stealth: Aurora Alpha | 67 | 33 | 33 | 33 | 0 | 33.3%
Mistral Small 3.2 24B | 33 | 0 | 0 | 0 | 0 | 6.7%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 67 | 93.3%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 4 31B | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large 2 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3 32B | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Hermes 3 70B | 100 | 100 | 100 | 100 | 67 | 93.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 33 | 86.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 67 | 67 | 86.7%
Llama 3.1 8B | 100 | 100 | 100 | 67 | 67 | 86.7%
Rocinante 12B | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 67 | 33 | 80.0%
Inception Mercury 2 | 100 | 100 | 100 | 67 | 33 | 80.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 67 | 33 | 80.0%
Inception Mercury | 100 | 100 | 67 | 67 | 33 | 73.3%
Mistral NeMO | 100 | 100 | 100 | 33 | 33 | 73.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 33 | 0 | 66.7%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 33 | 0 | 66.7%
GPT-4.1 Nano | 100 | 100 | 67 | 67 | 0 | 66.7%
Claude 3 Haiku | 100 | 100 | 67 | 67 | 0 | 66.7%
Qwen 2.5 72B | 100 | 67 | 67 | 33 | 33 | 60.0%
Mistral Small 3.2 24B | 100 | 100 | 33 | 0 | 0 | 46.7%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
o4 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3 32B | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 14B | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 67 | 93.3%
LFM2 24B | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5.2 | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen 3.6 27B | 100 | 100 | 100 | 67 | 67 | 86.7%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen 3.5 Flash | 100 | 100 | 100 | 67 | 67 | 86.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 67 | 67 | 86.7%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 67 | 67 | 86.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 33 | 86.7%
Gemma 3 27B | 100 | 100 | 100 | 67 | 67 | 86.7%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-4.1 Nano | 100 | 100 | 100 | 67 | 67 | 86.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 33 | 86.7%
Ministral 8B | 100 | 100 | 100 | 67 | 67 | 86.7%
Arcee AI: Trinity Mini | 100 | 100 | 67 | 67 | 67 | 80.0%
Z.AI GLM 4.6 | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 67 | 33 | 80.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 67 | 33 | 80.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-OSS 120B | 100 | 100 | 67 | 67 | 33 | 73.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 67 | 67 | 67 | 67 | 73.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 33 | 33 | 73.3%
Ministral 3B | 100 | 100 | 100 | 33 | 33 | 73.3%
Qwen 3.5 397B A17B | 100 | 100 | 67 | 67 | 0 | 66.7%
o4 Mini High | 100 | 100 | 67 | 33 | 33 | 66.7%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 33 | 0 | 66.7%
Nemotron 3 Nano | 100 | 100 | 67 | 67 | 0 | 66.7%
Gemma 3 4B | 100 | 100 | 67 | 67 | 0 | 66.7%
Stealth: Aurora Alpha | 67 | 67 | 67 | 67 | 33 | 60.0%
Mistral Small 3.2 24B | 100 | 100 | 67 | 0 | 0 | 53.3%
Qwen 2.5 72B | 100 | 67 | 67 | 33 | 0 | 53.3%
Ministral 3 3B | 100 | 100 | 33 | 33 | 0 | 53.3%
Llama 3.1 70B | 100 | 33 | 33 | 33 | 0 | 40.0%
Inception Mercury 2 | 67 | 33 | 33 | 33 | 0 | 33.3%
Inception Mercury | 33 | 33 | 0 | 0 | 0 | 13.3%
GPT-4o Mini (temp=0) | 33 | 33 | 0 | 0 | 0 | 13.3%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.2 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 67 | 93.3%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 67 | 93.3%
o4 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large 2 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 67 | 93.3%
Hermes 3 70B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
LFM2 24B | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 33 | 86.7%
Grok 4 | 100 | 100 | 100 | 67 | 67 | 86.7%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 67 | 67 | 86.7%
Nemotron 3 Super | 100 | 100 | 100 | 67 | 67 | 86.7%
Grok 4.20 (Beta) | 100 | 100 | 100 | 67 | 67 | 86.7%
Stealth: Aurora Alpha | 100 | 100 | 100 | 67 | 67 | 86.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-5 Nano | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5.4 Nano | 100 | 100 | 100 | 67 | 67 | 86.7%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 67 | 67 | 86.7%
Grok 4.1 Fast | 100 | 100 | 67 | 67 | 67 | 80.0%
o4 Mini High | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 67 | 33 | 80.0%
Claude 3.7 Sonnet | 100 | 100 | 67 | 67 | 67 | 80.0%
Grok 4 Fast | 100 | 100 | 67 | 67 | 33 | 73.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 67 | 67 | 67 | 67 | 73.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 67 | 67 | 33 | 73.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 67 | 67 | 33 | 73.3%
Inception Mercury 2 | 100 | 100 | 67 | 67 | 0 | 66.7%
Llama 3.1 70B | 100 | 67 | 67 | 67 | 33 | 66.7%
Llama 3.1 8B | 100 | 100 | 100 | 33 | 0 | 66.7%
Qwen 2.5 72B | 100 | 33 | 33 | 33 | 0 | 40.0%
GPT-4o Mini (temp=0) | 67 | 67 | 33 | 33 | 0 | 40.0%
Claude 3 Haiku | 100 | 67 | 33 | 0 | 0 | 40.0%
GPT-4o, May 13th (temp=0) | 67 | 67 | 33 | 0 | 0 | 33.3%
Inception Mercury | 67 | 67 | 0 | 0 | 0 | 26.7%
Mistral Small 3.2 24B | 100 | 33 | 0 | 0 | 0 | 26.7%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Qwen3.6 Max Preview100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
GPT-5.1100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
GPT-5100100100100100100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
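Qwen3.7 Max | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
o4 Mini High | 100 | 100 | 100 | 100 | 67 | 93.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 67 | 93.3%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Opus 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Hermes 3 405B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large 2 | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.20 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 14B | 100 | 100 | 100 | 100 | 67 | 93.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3B | 100 | 100 | 100 | 100 | 67 | 93.3%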
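Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 33 | 86.7%
Qwen 3.5 27B | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen 3.6 27B | 100 | 100 | 100 | 67 | 67 | 86.7%
MiniMax M2.7 | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 33 | 86.7%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 33 | 86.7%
Qwen 3.5 Flash | 100 | 100 | 100 | 67 | 67 | 86.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 67 | 67 | 86.7%
Mistral Large | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5.4 Nano | 100 | 100 | 100 | 67 | 67 | 86.7%
Gemma 3 4B | 100 | 100 | 100 | 67 | 67 | 86.7%
Mistral NeMO | 100 | 100 | 100 | 100 | 33 | 86.7%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 33 | 86.7%
LFM2 24B | 100 | 100 | 100 | 67 | 67 | 86.7%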
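Qwen 3.6 Flash | 100 | 100 | 100 | 67 | 33 | 80.0%
Qwen 3.6 35B | 100 | 100 | 100 | 67 | 33 | 80.0%
o4 Mini | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-OSS 120B | 100 | 100 | 100 | 67 | 33 | 80.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 67 | 33 | 80.0%
Mistral Large 3 | 100 | 100 | 100 | 67 | 33 | 80.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 67 | 33 | 80.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 67 | 33 | 80.0%
Hermes 3 70B | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-5.2 | 100 | 100 | 67 | 67 | 33 | 73.3%
DeepSeek-V2 Chat | 100 | 100 | 67 | 67 | 33 | 73.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 33 | 33 | 73.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 67 | 67 | 33 | 73.3%
Inception Mercury | 100 | 100 | 100 | 67 | 0 | 73.3%
Qwen 2.5 72B | 100 | 100 | 100 | 67 | 0 | 73.3%
Mistral Small Creative | 100 | 100 | 100 | 67 | 0 | 73.3%
Ministral 3 3B | 100 | 100 | 100 | 67 | 0 | 73.3%
Inception Mercury 2 | 100 | 100 | 67 | 33 | 33 | 66.7%
Gemma 3 12B | 100 | 100 | 67 | 33 | 33 | 66.7%
Gemma 3 27B | 100 | 100 | 67 | 33 | 33 | 66.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 67 | 33 | 33 | 66.7%
GPT-4.1 Nano | 100 | 100 | 100 | 33 | 0 | 66.7%
Ministral 8B | 100 | 67 | 67 | 67 | 33 | 66.7%
Gemini 2.5 Flash (Reasoning) | 100 | 67 | 67 | 33 | 33 | 60.0%
Llama 3.1 70B | 100 | 67 | 67 | 33 | 33 | 60.0%
Llama 3.1 Nemotron 70B | 100 | 67 | 67 | 33 | 33 | 60.0%
Ministral 3 8B | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude 3 Haiku | 100 | 67 | 67 | 67 | 0 | 60.0%
Stealth: Aurora Alpha | 100 | 67 | 67 | 33 | 0 | 53.3%
Nemotron 3 Nano | 100 | 67 | 67 | 33 | 0 | 53.3%
Gemini 2.5 Flash Lite | 100 | 67 | 33 | 33 | 0 | 46.7%
GPT-4o, May 13th (temp=0) | 67 | 33 | 0 | 0 | 0 | 20.0%
Mistral Small 3.2 24B | 67 | 0 | 0 | 0 | 0 | 13.3%
GPT-4o Mini (temp=0) | 67 | 0 | 0 | 0 | 0 | 13.3%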
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
o4 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 67 | 93.3%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4 Fast | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3 32B | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.20 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 14B | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 3B | 100 | 100 | 100 | 100 | 67 | 93.3%
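GPT-5.2 | 100 | 100 | 100 | 67 | 67 | 86.7%
Gemma 3 27B | 100 | 100 | 100 | 100 | 33 | 86.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 67 | 67 | 86.7%
Ministral 3 8B | 100 | 100 | 100 | 100 | 33 | 86.7%
Ministral 3B | 100 | 100 | 100 | 100 | 33 | 86.7%
Gemma 3 12B | 100 | 100 | 100 | 100 | 33 | 86.7%
Ministral 8B | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-4.1 Nano | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-OSS 120B | 100 | 67 | 67 | 67 | 67 | 73.3%
Inception Mercury | 100 | 100 | 67 | 67 | 33 | 73.3%
Nemotron 3 Nano | 100 | 100 | 100 | 33 | 0 | 66.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 67 | 67 | 0 | 66.7%
Mistral Small 3.2 24B | 100 | 100 | 67 | 33 | 0 | 60.0%
GPT-4o Mini (temp=0) | 100 | 100 | 67 | 33 | 0 | 60.0%
Claude 3 Haiku | 100 | 67 | 67 | 67 | 0 | 60.0%
Llama 3.1 8B | 100 | 67 | 67 | 67 | 0 | 60.0%
Stealth: Aurora Alpha | 100 | 67 | 67 | 33 | 0 | 53.3%
Inception Mercury 2 | 100 | 67 | 67 | 0 | 0 | 46.7%
Llama 3.1 70B | 100 | 67 | 67 | 0 | 0 | 46.7%
GPT-4o, May 13th (temp=0) | 67 | 67 | 67 | 0 | 0 | 40.0%
Qwen 2.5 72B | 100 | 33 | 33 | 0 | 0 | 33.3%
Mistral NeMO | 100 | 33 | 0 | 0 | 0 | 26.7%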

genre

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.2 | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 67 | 93.3%
o4 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4 Fast | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large 3 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Hermes 3 405B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3 32B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small Creative | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 14B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 33 | 86.7%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 67 | 67 | 86.7%
Hermes 3 70B | 100 | 100 | 100 | 100 | 33 | 86.7%
Ministral 8B | 100 | 100 | 100 | 67 | 67 | 86.7%
Ministral 3B | 100 | 100 | 100 | 67 | 67 | 86.7%
LFM2 24B | 100 | 100 | 100 | 67 | 67 | 86.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-4o Mini (temp=0) | 100 | 100 | 67 | 67 | 67 | 80.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 67 | 33 | 80.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-OSS 120B | 100 | 100 | 67 | 67 | 67 | 80.0%
Qwen 2.5 72B | 100 | 100 | 100 | 67 | 33 | 80.0%
Mistral NeMO | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-5.5 | 100 | 67 | 67 | 67 | 67 | 73.3%
Inception Mercury 2 | 100 | 100 | 100 | 67 | 0 | 73.3%
DeepSeek V3 (2024-12-26) | 100 | 67 | 67 | 67 | 67 | 73.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 67 | 0 | 73.3%
Nemotron 3 Nano | 100 | 100 | 67 | 67 | 33 | 73.3%
Ministral 3 8B | 100 | 100 | 100 | 67 | 0 | 73.3%
Ministral 3 3B | 100 | 100 | 100 | 67 | 0 | 73.3%
Stealth: Aurora Alpha | 100 | 67 | 67 | 67 | 33 | 66.7%
Inception Mercury | 100 | 100 | 100 | 33 | 0 | 66.7%
Claude 3 Haiku | 100 | 100 | 67 | 33 | 33 | 66.7%
GPT-4o, May 13th (temp=0) | 100 | 100 | 67 | 33 | 0 | 60.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 67 | 93.3%
o4 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 67 | 93.3%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 3 12B | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Hermes 3 70B | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3B | 100 | 100 | 100 | 100 | 67 | 93.3%
LFM2 24B | 100 | 100 | 100 | 100 | 67 | 93.3%
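GPT-5.1 | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5 | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 67 | 67 | 86.7%
Mistral Small 3.2 24B | 100 | 100 | 100 | 67 | 67 | 86.7%
Mistral Small Creative | 100 | 100 | 100 | 100 | 33 | 86.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 67 | 67 | 86.7%
Llama 3.1 8B | 100 | 100 | 100 | 67 | 67 | 86.7%
Nemotron 3 Nano | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-OSS 120B | 100 | 100 | 100 | 67 | 33 | 80.0%
Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o Mini (temp=0) | 100 | 100 | 67 | 67 | 67 | 80.0%
Qwen 2.5 72B | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 67 | 67 | 33 | 73.3%
Ministral 3 3B | 100 | 100 | 100 | 67 | 0 | 73.3%
Mistral NeMO | 100 | 100 | 100 | 33 | 33 | 73.3%
Llama 3.1 70B | 100 | 67 | 67 | 67 | 33 | 66.7%
GPT-4.1 Nano | 100 | 67 | 67 | 33 | 33 | 60.0%
Stealth: Aurora Alpha | 100 | 67 | 67 | 33 | 0 | 53.3%
Claude 3 Haiku | 100 | 67 | 67 | 33 | 0 | 53.3%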
Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 67 | 93.3%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 3 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small Creative | 100 | 100 | 100 | 100 | 67 | 93.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3B | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
Aion 2.0 | 100 | 100 | 100 | 67 | 67 | 86.7%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen 3.5 35B | 100 | 100 | 100 | 67 | 67 | 86.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
Nemotron 3 Super | 100 | 100 | 100 | 67 | 67 | 86.7%
Stealth: Aurora Alpha | 100 | 100 | 100 | 67 | 67 | 86.7%
DeepSeek V3.1 | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 33 | 86.7%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 33 | 86.7%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 67 | 67 | 86.7%
WizardLM 2 8x22b | 100 | 100 | 100 | 67 | 67 | 86.7%
Gemma 3 4B | 100 | 100 | 100 | 67 | 67 | 86.7%
Ministral 3 3B | 100 | 100 | 100 | 100 | 33 | 86.7%
Rocinante 12B | 100 | 100 | 100 | 67 | 67 | 86.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 33 | 86.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 33 | 86.7%
Qwen 3.5 27B | 100 | 100 | 67 | 67 | 67 | 80.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 67 | 33 | 80.0%
Inception Mercury 2 | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 67 | 33 | 80.0%
Hermes 3 405B | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 67 | 67 | 67 | 80.0%
Nemotron 3 Nano | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3 Haiku | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 67 | 67 | 33 | 73.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 67 | 0 | 73.3%
GPT-5.4 Nano | 100 | 67 | 67 | 67 | 67 | 73.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 33 | 33 | 73.3%
Llama 3.1 8B | 100 | 100 | 100 | 67 | 0 | 73.3%
GPT-5.2 | 67 | 67 | 67 | 67 | 67 | 66.7%
Stealth: Hunter Alpha | 100 | 100 | 67 | 67 | 0 | 66.7%
Gemma 3 12B | 100 | 67 | 67 | 67 | 33 | 66.7%
GPT-4o Mini (temp=0) | 100 | 100 | 67 | 67 | 0 | 66.7%
Hermes 3 70B | 100 | 100 | 67 | 67 | 0 | 66.7%
Inception Mercury | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash | 67 | 67 | 33 | 33 | 0 | 40.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
o4 Mini High | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4 Fast | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 3 12B | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Hermes 3 70B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral NeMO | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5.2 | 100 | 100 | 100 | 100 | 33 | 86.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 67 | 67 | 86.7%
Grok 4.20 (Beta) | 100 | 100 | 100 | 67 | 67 | 86.7%
Hermes 3 405B | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5.4 Mini | 100 | 100 | 100 | 67 | 67 | 86.7%
DeepSeek V3.1 | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
Gemini 2.5 Flash | 100 | 100 | 100 | 67 | 67 | 86.7%
Writer: Palmyra X5 | 100 | 100 | 100 | 67 | 67 | 86.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 67 | 67 | 86.7%
Mistral Small Creative | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5 Nano | 100 | 100 | 100 | 100 | 33 | 86.7%
ByteDance Seed 1.6 | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 67 | 67 | 67 | 80.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 67 | 33 | 80.0%
Nemotron 3 Super | 100 | 100 | 100 | 67 | 33 | 80.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 67 | 67 | 67 | 80.0%
Nemotron 3 Nano | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-5.4 Nano | 100 | 100 | 67 | 67 | 67 | 80.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 67 | 67 | 67 | 80.0%
Ministral 3 3B | 100 | 100 | 100 | 67 | 33 | 80.0%
WizardLM 2 8x22b | 100 | 100 | 67 | 67 | 33 | 73.3%
Llama 3.1 70B | 100 | 100 | 67 | 67 | 33 | 73.3%
GPT-4o Mini (temp=0) | 100 | 67 | 67 | 67 | 67 | 73.3%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 67 | 0 | 73.3%
Ministral 8B | 100 | 100 | 100 | 33 | 33 | 73.3%
Llama 3.1 8B | 100 | 100 | 67 | 67 | 0 | 66.7%
Inception Mercury | 100 | 100 | 67 | 33 | 0 | 60.0%
GPT-OSS 120B | 100 | 67 | 33 | 33 | 33 | 53.3%
Inception Mercury 2 | 67 | 67 | 67 | 67 | 0 | 53.3%
Stealth: Aurora Alpha | 100 | 67 | 67 | 33 | 0 | 53.3%
Qwen 2.5 72B | 100 | 67 | 67 | 33 | 0 | 53.3%
Claude 3 Haiku | 100 | 67 | 33 | 33 | 0 | 46.7%
Mistral Small 3.2 24B | 100 | 33 | 33 | 0 | 0 | 33.3%
GPT-4o, May 13th (temp=0) | 67 | 33 | 0 | 0 | 0 | 20.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 67 | 93.3%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large 3 | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large 2 | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 67 | 93.3%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 3 4B | 100 | 100 | 100 | 100 | 67 | 93.3%
Rocinante 12B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.1 | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5.2 | 100 | 100 | 100 | 100 | 33 | 86.7%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 33 | 86.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
Claude 3.7 Sonnet | 100 | 100 | 100 | 67 | 67 | 86.7%
Mistral Large | 100 | 100 | 100 | 100 | 33 | 86.7%
Mistral Small 3.2 24B | 100 | 100 | 100 | 67 | 67 | 86.7%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 67 | 67 | 86.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 67 | 67 | 86.7%
Ministral 3 3B | 100 | 100 | 100 | 67 | 67 | 86.7%
Llama 3.1 8B | 100 | 100 | 100 | 67 | 67 | 86.7%
Ministral 3B | 100 | 100 | 100 | 100 | 33 | 86.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 67 | 67 | 67 | 80.0%
Z.AI GLM 5 | 100 | 100 | 67 | 67 | 67 | 80.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3.6 27B | 100 | 100 | 67 | 67 | 67 | 80.0%
Claude Opus 4.5 | 100 | 100 | 67 | 67 | 67 | 80.0%
Z.AI GLM 4.7 | 100 | 100 | 67 | 67 | 67 | 80.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 67 | 33 | 80.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 67 | 33 | 80.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 67 | 33 | 80.0%
Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 3 27B | 100 | 100 | 67 | 67 | 67 | 80.0%
Mistral Small Creative | 100 | 100 | 67 | 67 | 67 | 80.0%
Claude 3 Haiku | 100 | 100 | 100 | 67 | 33 | 80.0%
Ministral 3 8B | 100 | 100 | 67 | 67 | 33 | 73.3%
WizardLM 2 8x22b | 100 | 100 | 67 | 67 | 33 | 73.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 67 | 0 | 73.3%
Gemma 3 12B | 100 | 100 | 100 | 67 | 0 | 73.3%
Llama 3.1 70B | 100 | 100 | 67 | 67 | 33 | 73.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 67 | 0 | 73.3%
Stealth: Aurora Alpha | 100 | 100 | 67 | 33 | 33 | 66.7%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 67 | 33 | 33 | 66.7%
GPT-4o Mini (temp=1) | 100 | 100 | 67 | 33 | 33 | 66.7%
Ministral 8B | 100 | 100 | 67 | 67 | 0 | 66.7%
GPT-4o, May 13th (temp=0) | 100 | 67 | 67 | 33 | 33 | 60.0%
DeepSeek-V2 Chat | 100 | 67 | 67 | 33 | 0 | 53.3%
Gemini 2.5 Flash | 67 | 67 | 67 | 67 | 0 | 53.3%
Mistral NeMO | 100 | 67 | 33 | 33 | 33 | 53.3%
Inception Mercury 2 | 67 | 67 | 33 | 33 | 33 | 46.7%
Qwen 2.5 72B | 67 | 67 | 33 | 0 | 0 | 33.3%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.2 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 4 31B | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large 3 | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 3 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small Creative | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 14B | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 8B | 100 | 100 | 100 | 100 | 67 | 93.3%
LFM2 24B | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen3.7 Max | 100 | 100 | 100 | 67 | 67 | 86.7%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 33 | 86.7%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 33 | 86.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 33 | 86.7%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 67 | 67 | 86.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-4.1 Nano | 100 | 100 | 100 | 67 | 67 | 86.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 67 | 67 | 86.7%
Gemma 3 12B | 100 | 100 | 100 | 100 | 33 | 86.7%
Ministral 3B | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-4o Mini (temp=0) | 100 | 100 | 67 | 67 | 67 | 80.0%
Nemotron 3 Nano | 100 | 100 | 67 | 67 | 67 | 80.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 70B | 100 | 100 | 100 | 67 | 33 | 80.0%
Inception Mercury 2 | 100 | 67 | 67 | 67 | 67 | 73.3%
Stealth: Aurora Alpha | 100 | 100 | 67 | 67 | 33 | 73.3%
Llama 3.1 70B | 100 | 100 | 100 | 33 | 33 | 73.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 67 | 0 | 73.3%
Ministral 3 3B | 100 | 100 | 100 | 67 | 0 | 73.3%
GPT-OSS 120B | 100 | 100 | 67 | 67 | 0 | 66.7%
Mistral NeMO | 100 | 100 | 100 | 33 | 0 | 66.7%
Inception Mercury | 100 | 100 | 67 | 33 | 0 | 60.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 67 | 67 | 0 | 0 | 46.7%
Mistral Small 3.2 24B | 100 | 67 | 67 | 0 | 0 | 46.7%
GPT-4o, May 13th (temp=0) | 100 | 33 | 33 | 0 | 0 | 33.3%
Qwen 2.5 72B | 100 | 33 | 33 | 0 | 0 | 33.3%
Claude 3 Haiku | 67 | 67 | 33 | 0 | 0 | 33.3%

Novelcrafter Default Prompt

Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.20 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen 3.6 27B | 100 | 100 | 100 | 67 | 67 | 86.7%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 33 | 86.7%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 33 | 86.7%
Hermes 3 405B | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 67 | 67 | 86.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5.4 Nano | 100 | 100 | 100 | 67 | 67 | 86.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 33 | 86.7%
Gemma 3 4B | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen3.6 Max Preview | 100 | 100 | 67 | 67 | 67 | 80.0%
Qwen 2.5 72B | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-5.2 | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-OSS 120B | 100 | 100 | 100 | 67 | 33 | 80.0%
Inception Mercury 2 | 100 | 100 | 67 | 67 | 67 | 80.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 67 | 33 | 80.0%
Llama 3.1 70B | 100 | 100 | 100 | 67 | 33 | 80.0%
Hermes 3 70B | 100 | 100 | 67 | 67 | 67 | 80.0%
Claude 3 Haiku | 100 | 100 | 100 | 67 | 33 | 80.0%
Stealth: Aurora Alpha | 100 | 67 | 67 | 67 | 67 | 73.3%
Qwen 3.5 27B | 100 | 100 | 67 | 33 | 33 | 66.7%
GPT-4o, May 13th (temp=0) | 100 | 67 | 67 | 67 | 33 | 66.7%
Inception Mercury | 100 | 100 | 67 | 67 | 0 | 66.7%
Mistral NeMO | 100 | 100 | 67 | 67 | 0 | 66.7%
Nemotron 3 Nano | 100 | 100 | 67 | 33 | 0 | 60.0%
Mistral Small 3.2 24B | 100 | 100 | 33 | 0 | 0 | 46.7%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
o4 Mini High | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 67 | 93.3%
Hermes 3 405B | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 67 | 93.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral NeMO | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 122B | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 33 | 86.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 67 | 67 | 86.7%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 33 | 86.7%
Qwen 3.5 397B A17B | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-5.2 | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-OSS 120B | 100 | 100 | 100 | 67 | 33 | 80.0%
Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4.1 Nano | 100 | 100 | 100 | 67 | 33 | 80.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3 Haiku | 100 | 100 | 67 | 67 | 33 | 73.3%
Nemotron 3 Nano | 100 | 100 | 100 | 67 | 0 | 73.3%
Qwen 2.5 72B | 100 | 100 | 100 | 67 | 0 | 73.3%
Stealth: Aurora Alpha | 100 | 100 | 67 | 67 | 0 | 66.7%
Mistral Small 3.2 24B | 100 | 100 | 67 | 0 | 66.7%
GPT-4o, Aug. 6th (temp=0) | 100 | 67 | 33 | 33 | 33 | 53.3%
GPT-4o, May 13th (temp=0) | 100 | 67 | 33 | 0 | 0 | 40.0%
Inception Mercury 2 | 67 | 67 | 33 | 0 | 0 | 33.3%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 4 26B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large | 100 | 100 | 100 | 100 | 67 | 93.3%
Inception Mercury | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 3 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
Hermes 3 70B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 3 4B | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 67 | 93.3%
Rocinante 12B | 100 | 100 | 100 | 100 | 67 | 93.3%
Aion 2.0 | 100 | 100 | 100 | 67 | 67 | 86.7%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
Z.AI GLM 4.7 | 100 | 100 | 100 | 67 | 67 | 86.7%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 33 | 86.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 33 | 86.7%
Inception Mercury 2 | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5 Nano | 100 | 100 | 100 | 100 | 33 | 86.7%
DeepSeek V3.2 | 100 | 100 | 100 | 67 | 67 | 86.7%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 67 | 67 | 86.7%
Mistral NeMO | 100 | 100 | 100 | 67 | 67 | 86.7%
Ministral 8B | 100 | 100 | 100 | 100 | 33 | 86.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 67 | 67 | 67 | 80.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5.4 Nano | 100 | 100 | 67 | 67 | 67 | 80.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-OSS 120B | 100 | 100 | 100 | 33 | 33 | 73.3%
Stealth: Aurora Alpha | 100 | 100 | 100 | 67 | 0 | 73.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 67 | 67 | 33 | 73.3%
GPT-4o Mini (temp=1) | 100 | 100 | 67 | 67 | 33 | 73.3%
Nemotron 3 Nano | 100 | 100 | 67 | 67 | 33 | 73.3%
Qwen 2.5 72B | 100 | 100 | 100 | 67 | 0 | 73.3%
GPT-5.2 | 100 | 100 | 67 | 33 | 33 | 66.7%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 67 | 33 | 33 | 66.7%
Gemini 2.5 Flash | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash (Reasoning) | 100 | 67 | 33 | 33 | 33 | 53.3%
Llama 3.1 70B | 100 | 67 | 67 | 33 | 0 | 53.3%
Mistral Small 3.2 24B | 100 | 100 | 0 | 0 | 0 | 40.0%
Claude 3 Haiku | 100 | 67 | 0 | 0 | 0 | 33.3%
GPT-4o Mini (temp=0) | 67 | 0 | 0 | 0 | 0 | 13.3%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large 3 | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Hermes 3 405B | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral NeMO | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3B | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4 | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-OSS 120B | 100 | 100 | 100 | 67 | 67 | 86.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 67 | 67 | 86.7%
Grok 4 Fast | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 67 | 67 | 86.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 67 | 67 | 86.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 67 | 67 | 86.7%
o4 Mini | 100 | 100 | 100 | 100 | 33 | 86.7%
Hermes 3 70B | 100 | 100 | 100 | 100 | 33 | 86.7%
Grok 4.1 Fast100100100673380.0%
Stealth: Aurora Alpha100100100673380.0%
Claude 3.7 Sonnet100100100673380.0%
GPT-4o, Aug. 6th (temp=0)100100100673380.0%
GPT-5.4 Mini10010067676780.0%
GPT-4o Mini (temp=1)10010067676780.0%
GPT-5.4 Nano100100100673380.0%
DeepSeek V3 (2024-12-26)10010067673373.3%
Mistral Small 3.2 24B10010067673373.3%
GPT-4o Mini (temp=0)10010067673373.3%
Qwen 2.5 72B10010067673373.3%
Llama 3.1 Nemotron 70B100100100333373.3%
Claude 3 Haiku1006767676773.3%
Inception Mercury 21006767673366.7%
Llama 3.1 70B10010067333366.7%
DeepSeek-V2 Chat100676767060.0%
Llama 3.1 8B1006767333360.0%
Inception Mercury100100330046.7%
GPT-4o, May 13th (temp=0)10067330040.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
o4 Mini High | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 67 | 93.3%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Opus 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4 Fast | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 3 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 14B | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 8B | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 3B | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 8B | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen 3.6 27B | 100 | 100 | 100 | 67 | 67 | 86.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 67 | 67 | 86.7%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 67 | 67 | 86.7%
GPT-5 Nano | 100 | 100 | 100 | 67 | 67 | 86.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 33 | 86.7%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 67 | 67 | 86.7%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 67 | 67 | 86.7%
Hermes 3 70B | 100 | 100 | 100 | 100 | 33 | 86.7%
Rocinante 12B | 100 | 100 | 100 | 100 | 33 | 86.7%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 33 | 86.7%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 33 | 86.7%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 33 | 86.7%
Qwen 3.5 397B A17B | 100 | 100 | 67 | 67 | 67 | 80.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 67 | 67 | 67 | 80.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 67 | 33 | 80.0%
o4 Mini | 100 | 100 | 100 | 67 | 33 | 80.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 67 | 33 | 80.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.5 Air | 100 | 100 | 67 | 67 | 67 | 80.0%
Hermes 3 405B | 100 | 100 | 100 | 67 | 33 | 80.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 67 | 33 | 80.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small Creative | 100 | 100 | 67 | 67 | 67 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 67 | 33 | 80.0%
Gemma 3 4B | 100 | 100 | 100 | 67 | 33 | 80.0%
Mistral NeMO | 100 | 100 | 67 | 67 | 67 | 80.0%
Qwen 3.5 122B | 100 | 100 | 67 | 67 | 33 | 73.3%
Z.AI GLM 4.7 | 100 | 100 | 67 | 67 | 33 | 73.3%
GPT-OSS 120B | 100 | 100 | 67 | 67 | 33 | 73.3%
Claude 3 Haiku | 100 | 100 | 100 | 67 | 0 | 73.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 67 | 67 | 33 | 73.3%
Z.AI GLM 4.6 | 100 | 100 | 67 | 33 | 33 | 66.7%
GPT-4o, May 13th (temp=0) | 100 | 100 | 67 | 33 | 33 | 66.7%
Nemotron 3 Super | 100 | 67 | 67 | 67 | 33 | 66.7%
Inception Mercury 2 | 100 | 100 | 67 | 33 | 33 | 66.7%
Mistral Small 3.2 24B | 100 | 100 | 100 | 33 | 0 | 66.7%
Gemma 3 12B | 100 | 67 | 67 | 67 | 33 | 66.7%
GPT-5.4 Nano | 100 | 100 | 67 | 33 | 33 | 66.7%
GPT-4o Mini (temp=0) | 100 | 67 | 67 | 33 | 33 | 60.0%
Llama 3.1 70B | 100 | 100 | 67 | 33 | 0 | 60.0%
Gemini 2.5 Flash | 67 | 67 | 67 | 67 | 0 | 53.3%
Ministral 3B | 100 | 100 | 67 | 0 | 0 | 53.3%
Qwen 2.5 72B | 100 | 67 | 33 | 33 | 0 | 46.7%
Gemini 2.5 Flash Lite | 67 | 67 | 33 | 33 | 0 | 40.0%
GPT-4o, Aug. 6th (temp=0) | 67 | 67 | 33 | 0 | 0 | 33.3%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 67 | 93.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 67 | 93.3%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Grok 4 Fast | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Large | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 67 | 93.3%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 3 27B | 100 | 100 | 100 | 100 | 67 | 93.3%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Small Creative | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 8B | 100 | 100 | 100 | 100 | 67 | 93.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 67 | 93.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 3B | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 8B | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3B | 100 | 100 | 100 | 100 | 67 | 93.3%
LFM2 24B | 100 | 100 | 100 | 100 | 67 | 93.3%
Rocinante 12B | 100 | 100 | 100 | 100 | 67 | 93.3%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen 3.5 27B | 100 | 100 | 100 | 67 | 67 | 86.7%
Grok 4 | 100 | 100 | 100 | 67 | 67 | 86.7%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 33 | 86.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 33 | 86.7%
DeepSeek V3.2 | 100 | 100 | 100 | 67 | 67 | 86.7%
Gemini 2.5 Flash | 100 | 100 | 100 | 67 | 67 | 86.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 33 | 86.7%
Gemma 3 4B | 100 | 100 | 100 | 100 | 33 | 86.7%
o4 Mini | 100 | 100 | 67 | 67 | 67 | 80.0%
GPT-4.1 Mini | 100 | 100 | 67 | 67 | 67 | 80.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 67 | 0 | 73.3%
GPT-5.4 Nano | 100 | 100 | 100 | 33 | 33 | 73.3%
Mistral NeMO | 100 | 100 | 100 | 67 | 0 | 73.3%
GPT-4o, May 13th (temp=1) | 100 | 67 | 67 | 67 | 33 | 66.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 33 | 0 | 66.7%
Claude 3 Haiku | 100 | 67 | 67 | 67 | 33 | 66.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 0 | 0 | 60.0%
Hermes 3 70B | 100 | 100 | 33 | 33 | 0 | 53.3%
Inception Mercury 2 | 100 | 67 | 33 | 33 | 0 | 46.7%
GPT-4o, Aug. 6th (temp=0) | 67 | 67 | 33 | 33 | 33 | 46.7%
Llama 3.1 70B | 100 | 67 | 33 | 33 | 0 | 46.7%
GPT-4o Mini (temp=0) | 100 | 67 | 33 | 33 | 0 | 46.7%
Llama 3.1 Nemotron 70B | 67 | 67 | 33 | 33 | 33 | 46.7%
Inception Mercury | 100 | 67 | 33 | 0 | 0 | 40.0%
Mistral Small 3.2 24B | 100 | 67 | 33 | 0 | 0 | 40.0%
Stealth: Aurora Alpha | 100 | 33 | 33 | 0 | 0 | 33.3%
Qwen 2.5 72B | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=0) | 33 | 33 | 0 | 0 | 0 | 13.3%