Sentence length variance

Test: Bad Writing Habits

Avg. Score
99.4%
Scenarios
18

Overall Performance

Rank ▲ Model Score Avg. Cost Avg. Time Stability
1Ministral 3B100.0%$0.00018.1s100%
2Mistral Small Creative100.0%$0.00079.1s100%
3Ministral 8B100.0%$0.000410.4s100%
4Gemini 2.5 Flash Lite100.0%$0.00099.5s100%
5Ministral 3 3B100.0%$0.000511.1s100%
6Ministral 3 14B100.0%$0.000711.7s100%
7Gemini 3.1 Flash Lite (Preview)100.0%$0.00308.4s99%
8Gemma 3 4B100.0%$0.000220.0s100%
9Mistral Small 4100.0%$0.001418.2s100%
10Ministral 3 8B100.0%$0.000819.6s100%
11Gemini 3.1 Flash Lite (Reasoning)100.0%$0.003011.9s100%
12Stealth: Healer Alpha100.0%$0.000023.7s100%
13GPT-4.1 Mini100.0%$0.002719.0s100%
14Grok 4 Fast100.0%$0.001724.1s100%
15LFM2 24B100.0%$0.000228.4s100%
16ByteDance Seed 1.6 Flash100.0%$0.001327.3s100%
17GPT-5.4 Nano (Reasoning, Low)100.0%$0.005520.6s100%
18DeepSeek V4 Flash (Reasoning)100.0%$0.000731.1s100%
19DeepSeek V4 Flash100.0%$0.000631.6s100%
20Gemini 3 Flash (Preview)100.0%$0.007819.6s100%
21Mistral Small 4 (Reasoning)100.0%$0.002230.2s100%
22GPT-5.4 Nano (Reasoning)100.0%$0.006124.5s100%
23Gemini 2.5 Flash Lite (Reasoning)100.0%$0.002830.8s100%
24Mistral Large 3100.0%$0.003330.3s100%
25GPT-5.4 Nano100.0%$0.005726.3s100%
26Grok 4.1 Fast100.0%$0.001837.8s100%
27Xiaomi MIMO v2.5100.0%$0.005431.8s100%
28DeepSeek V3 (2025-03-24)100.0%$0.001439.4s100%
29Gemma 3 12B100.0%$0.000441.3s100%
30Gemini 2.5 Flash (Reasoning)100.0%$0.01121.5s100%
31Qwen 3.5 Plus (2026-02-15)100.0%$0.006031.5s100%
32Claude Haiku 4.5100.0%$0.01121.6s100%
33Writer: Palmyra X5100.0%$0.01122.0s100%
34Grok 4.3100.0%$0.006930.5s100%
35Arcee AI: Trinity Mini99.8%$0.00039.2s97%
36GPT-5.4 Mini (Reasoning, Low)100.0%$0.01516.8s100%
37GPT-5.4 Mini100.0%$0.01516.8s100%
38Mistral Medium 3.1100.0%$0.004836.5s100%
39Grok 4.20 (Beta)100.0%$0.01815.8s100%
40Z.AI GLM 5 Turbo100.0%$0.008133.2s100%
41Z.AI GLM 4.5100.0%$0.005142.1s100%
42Qwen 3.5 Flash100.0%$0.002547.5s100%
43o4 Mini100.0%$0.01525.7s100%
44Gemini 3 Flash (Preview, Reasoning)100.0%$0.01230.1s100%
45Mistral Large 2100.0%$0.01329.4s100%
46Gemma 3 27B100.0%$0.000652.6s100%
47Stealth: Hunter Alpha100.0%$0.000055.0s100%
48Mistral Large100.0%$0.01430.9s100%
49Gemini 3.1 Flash Lite99.7%$0.003012.1s97%
50Gemma 4 26B100.0%$0.000955.1s100%
51Qwen 3 32B100.0%$0.001554.6s100%
52Rocinante 12B99.9%$0.001438.4s98%
53GPT-4o Mini (temp=0)99.8%$0.001234.8s98%
54DeepSeek-V2 Chat100.0%$0.002153.3s100%
55Qwen 3.6 Flash100.0%$0.01041.4s100%
56Qwen3 235B A22B Instruct 2507100.0%$0.001159.2s100%
57Grok 4.20100.0%$0.009345.7s100%
58Z.AI GLM 4.6100.0%$0.006551.5s100%
59Z.AI GLM 4.5 Air100.0%$0.002958.2s100%
60Gemini 2.5 Flash99.7%$0.005210.6s96%
61GPT-5.4 Mini (Reasoning)100.0%$0.02228.1s100%
62Xiaomi MIMO v2.5 Pro100.0%$0.008553.5s100%
63GPT-4o Mini (temp=1)99.8%$0.001234.8s97%
64MiniMax M2.7100.0%$0.00401.1m100%
65Z.AI GLM 4.7 Flash100.0%$0.00171.2m100%
66GPT-5 Mini100.0%$0.010057.4s100%
67Arcee AI: Trinity Large (Preview)99.8%$0.000043.6s97%
68Qwen 3.6 35B100.0%$0.00831.0m100%
69GPT-4.1100.0%$0.01844.7s100%
70GPT-4.1 Nano99.5%$0.000713.3s95%
71DeepSeek V3 (2024-12-26)99.9%$0.002154.6s98%
72MiniMax M2.5100.0%$0.00341.3m100%
73DeepSeek V4 Pro100.0%$0.00481.3m100%
74Z.AI GLM 5100.0%$0.00841.2m100%
75Aion 2.0100.0%$0.00641.3m100%
76Mistral NeMO99.4%$0.000510.1s93%
77o4 Mini High100.0%$0.02547.2s100%
78Claude Sonnet 4.6100.0%$0.03139.3s100%
79Gemma 4 31B100.0%$0.00101.6m100%
80Claude Sonnet 4.5100.0%$0.03538.1s100%
81Gemini 2.5 Pro100.0%$0.03636.2s100%
82Z.AI GLM 4.7100.0%$0.0101.4m100%
83Qwen 3.5 122B100.0%$0.0251.1m100%
84Grok 4.20 (Beta, Reasoning)100.0%$0.03934.0s99%
85DeepSeek V3.1100.0%$0.00201.8m100%
86DeepSeek V3.2100.0%$0.00141.9m100%
87GPT-5 Nano99.9%$0.00421.4m98%
88Z.AI GLM 5.1100.0%$0.0141.5m100%
89Claude Sonnet 499.9%$0.03243.7s99%
90Grok 4.20 (Reasoning)100.0%$0.0181.5m100%
91Gemma 4 26B (Reasoning)100.0%$0.00132.0m100%
92Claude 3.7 Sonnet100.0%$0.04246.7s100%
93GPT-4o, May 13th (temp=1)99.6%$0.03314.4s96%
94Gemma 4 31B (Reasoning)100.0%$0.00142.2m100%
95Qwen 3.5 Plus (2026-04-20)100.0%$0.0171.8m100%
96Gemini 3 Pro (Preview)100.0%$0.05554.4s100%
97Claude Opus 4.7100.0%$0.06930.4s100%
98Qwen 3.5 35B99.5%$0.0181.0m93%
99Cohere Command R+ (Aug. 2024)99.2%$0.02052.5s94%
100Claude Opus 4.7 (Reasoning)100.0%$0.07632.0s100%
101ByteDance Seed 1.6100.0%$0.0132.5m100%
102GPT-5.4100.0%$0.0491.4m100%
103Claude Sonnet 4.6 (Reasoning)100.0%$0.0601.2m100%
104Qwen 2.5 72B98.3%$0.001036.7s90%
105ByteDance Seed 2.0 Lite99.9%$0.0122.2m97%
106Claude Opus 4.5100.0%$0.07053.4s100%
107GPT-5.4 (Reasoning, Low)100.0%$0.0551.4m100%
108Claude 3 Haiku98.2%$0.002514.9s88%
109Nemotron 3 Super99.1%$0.00001.4m92%
110Grok 4100.0%$0.0481.7m100%
111GPT-5.2100.0%$0.0561.5m100%
112Qwen 3.5 397B A17B100.0%$0.0143.0m100%
113GPT-5.1100.0%$0.0541.8m100%
114Claude 3.5 Sonnet99.1%$0.04835.5s94%
115DeepSeek V4 Pro (Reasoning)100.0%$0.0153.1m100%
116Llama 3.1 Nemotron 70B98.0%$0.003831.7s88%
117Claude Opus 4.6100.0%$0.0781.2m100%
118Hermes 3 70B98.4%$0.00101.2m90%
119MoonshotAI: Kimi K2.5100.0%$0.0193.2m100%
120Hermes 3 405B98.1%$0.003253.2s88%
121GPT-OSS 120B98.6%$0.00151.8m90%
122Claude Opus 4.6 (Reasoning)100.0%$0.0881.4m100%
123Qwen 3.6 27B99.5%$0.0252.3m92%
124GPT-4o, May 13th (temp=0)97.9%$0.03514.1s85%
125Llama 3.1 8B97.7%$0.00031.3m85%
126Inception Mercury 296.4%$0.00327.0s82%
127GPT-5100.0%$0.0652.8m100%
128Qwen3.6 Max Preview100.0%$0.0503.5m100%
129Gemini 3.1 Pro (Preview)100.0%$0.1071.8m100%
130Grok 4.3 (Reasoning)98.7%$0.0212.3m89%
131GPT-4o, Aug. 6th (temp=1)96.3%$0.01824.4s84%
132GPT-5.4 (Reasoning)100.0%$0.0892.6m100%
133Stealth: Aurora Alpha94.8%$0.00009.8s83%
134Qwen 3.5 9B97.9%$0.00111.4m78%
135GPT-5.5100.0%$0.1391.7m100%
136GPT-5.5 (Reasoning, Low)100.0%$0.1391.8m100%
137Qwen 3.5 27B98.7%$0.0201.6m79%
138Llama 3.1 70B94.9%$0.001529.4s81%
139GPT-5.5 (Reasoning)100.0%$0.1421.8m100%
140WizardLM 2 8x22b97.0%$0.00261.8m81%
141ByteDance Seed 2.0 Mini99.5%$0.00454.9m93%
142Claude Opus 4100.0%$0.2091.4m100%
143MoonshotAI: Kimi K2.6100.0%$0.0586.5m100%
144Nemotron 3 Nano94.0%$0.00101.1m72%
145GPT-4o, Aug. 6th (temp=0)93.1%$0.02322.7s73%
146Inception Mercury88.0%$0.01117.6s62%
147Mistral Small 3.2 24B90.6%$0.00695.7m57%
99.44%

Individual Scenarios

Detailed Writing Rules

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Qwen3.6 Max Preview100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Grok 4.20 (Reasoning)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
Qwen 3.6 Flash100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100.0%
Claude Opus 4.7100100100100100100.0%
Qwen 3.6 27B100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Aion 2.0100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
MiniMax M2.7100100100100100100.0%
GPT-5.5100100100100100100.0%
Qwen 3.6 35B100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
GPT-4.1100100100100100100.0%
Gemini 2.5 Pro100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Gemma 4 31B100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
Stealth: Healer Alpha100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Gemma 4 26B100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
Xiaomi MIMO v2.5100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Z.AI GLM 4.5 Air100100100100100100.0%
Hermes 3 405B100100100100100100.0%
DeepSeek V4 Pro100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3.1100100100100100100.0%
DeepSeek V3.2100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V4 Flash100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Grok 4.20100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Inception Mercury100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
Grok 4.3100100100100100100.0%
Gemma 3 12B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Hermes 3 70B100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Arcee AI: Trinity Mini100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Ministral 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B100100100100100100.0%
GPT-5 Nano100100100100100100.0%
Gemini 3.1 Pro (Preview)1001001001009899.6%
Nemotron 3 Nano1001001001009899.5%
Mistral NeMO1001001001009799.4%
Llama 3.1 Nemotron 70B1001001001009498.8%
Cohere Command R+ (Aug. 2024)1001001001009398.5%
Llama 3.1 70B1001001001008897.6%
GPT-4o Mini (temp=1)1001001001008897.5%
ByteDance Seed 2.0 Lite1001001001008797.5%
GPT-4.1 Nano1001001001008697.1%
GPT-OSS 120B1001001001008396.5%
Inception Mercury 2100100100978095.5%
GPT-4o, Aug. 6th (temp=1)100100100897693.0%
Llama 3.1 8B100100100807390.7%
Stealth: Aurora Alpha10010096896589.9%
Mistral Small 3.2 24B100100100766688.5%
GPT-4o, Aug. 6th (temp=0)1009085827285.9%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Qwen3.6 Max Preview100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Grok 4.20 (Reasoning)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
Qwen 3.6 Flash100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100.0%
Claude Opus 4.7100100100100100100.0%
Qwen 3.6 27B100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Aion 2.0100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
MiniMax M2.7100100100100100100.0%
GPT-5.5100100100100100100.0%
Qwen 3.6 35B100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
GPT-4.1100100100100100100.0%
Gemini 2.5 Pro100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Gemma 4 31B100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
Stealth: Healer Alpha100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Gemma 4 26B100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
Mistral Large 3100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
Xiaomi MIMO v2.5100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Z.AI GLM 4.5 Air100100100100100100.0%
DeepSeek V4 Pro100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3.1100100100100100100.0%
DeepSeek V3.2100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V4 Flash100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Grok 4.20100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Grok 4.3100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
GPT-4.1 Nano100100100100100100.0%
Ministral 3 8B100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Ministral 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B100100100100100100.0%
Mistral NeMO100100100100100100.0%
GPT-4o Mini (temp=0)1001001001009899.7%
Claude 3 Haiku1001001001009298.5%
Llama 3.1 8B1001001001009198.2%
GPT-4o, May 13th (temp=1)1001001001009198.1%
Arcee AI: Trinity Mini1001001001008997.8%
Llama 3.1 Nemotron 70B10010098979397.6%
GPT-OSS 120B100100100949397.5%
Hermes 3 405B1001001001008496.8%
Qwen 2.5 72B10010095928895.0%
GPT-4o, May 13th (temp=0)1001001001007394.7%
GPT-4o, Aug. 6th (temp=1)100100100888294.0%
Mistral Small 3.2 24B1001001001006693.3%
Llama 3.1 70B100100100848293.2%
Stealth: Aurora Alpha10010093858091.6%
Hermes 3 70B100100100847291.2%
Inception Mercury 2100100100935890.2%
GPT-4o, Aug. 6th (temp=0)999588605679.5%
Inception Mercury1007863636173.0%
Nemotron 3 Nano10010062412665.7%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Qwen3.6 Max Preview100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Grok 4.20 (Reasoning)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
Qwen 3.6 Flash100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100.0%
Claude Opus 4.7100100100100100100.0%
Qwen 3.6 27B100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Aion 2.0100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
MiniMax M2.7100100100100100100.0%
GPT-5.5100100100100100100.0%
Qwen 3.6 35B100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
GPT-4.1100100100100100100.0%
Gemini 2.5 Pro100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Gemma 4 31B100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100.0%
GPT-OSS 120B100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
Stealth: Healer Alpha100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Gemma 4 26B100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
Xiaomi MIMO v2.5100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Z.AI GLM 4.5 Air100100100100100100.0%
Hermes 3 405B100100100100100100.0%
DeepSeek V4 Pro100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3.1100100100100100100.0%
DeepSeek V3.2100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V4 Flash100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Grok 4.20100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Grok 4.3100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Mistral Small 4100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Hermes 3 70B100100100100100100.0%
Ministral 3 14B100100100100100100.0%
GPT-4.1 Nano100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Arcee AI: Trinity Mini100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Ministral 8B100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B100100100100100100.0%
Cohere Command R+ (Aug. 2024)1001001001009598.9%
Inception Mercury 21001001001009298.5%
GPT-4o, Aug. 6th (temp=0)1001001001009198.3%
Inception Mercury10010088837388.8%
Mistral Small 3.2 24B1001001001002084.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Qwen3.6 Max Preview100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Grok 4.20 (Reasoning)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
Qwen 3.6 Flash100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100.0%
Claude Opus 4.7100100100100100100.0%
Qwen 3.6 27B100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Aion 2.0100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
MiniMax M2.7100100100100100100.0%
GPT-5.5100100100100100100.0%
Qwen 3.6 35B100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
GPT-4.1100100100100100100.0%
Gemini 2.5 Pro100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Gemma 4 31B100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100.0%
GPT-OSS 120B100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
Stealth: Healer Alpha100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Gemma 4 26B100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
Xiaomi MIMO v2.5100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Z.AI GLM 4.5 Air100100100100100100.0%
Hermes 3 405B100100100100100100.0%
DeepSeek V4 Pro100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3.1100100100100100100.0%
DeepSeek V3.2100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V4 Flash100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Grok 4.20100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Inception Mercury100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Grok 4.3100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
GPT-4.1 Nano100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Arcee AI: Trinity Mini100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Ministral 8B100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B100100100100100100.0%
Stealth: Aurora Alpha1001001001009799.4%
Hermes 3 70B1001001001009699.1%
Inception Mercury 21001001001008997.7%
Qwen 3.5 9B1001001001007394.7%
Nemotron 3 Nano100100100947193.0%
Qwen 3.5 27B100100100100080.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Qwen3.6 Max Preview100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Grok 4.20 (Reasoning)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
Qwen 3.6 Flash100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100.0%
Claude Opus 4.7100100100100100100.0%
Qwen 3.6 27B100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Aion 2.0100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
MiniMax M2.7100100100100100100.0%
GPT-5.5100100100100100100.0%
Qwen 3.6 35B100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
GPT-4.1100100100100100100.0%
Gemini 2.5 Pro100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Gemma 4 31B100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100.0%
GPT-OSS 120B100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
Stealth: Healer Alpha100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Gemma 4 26B100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
Xiaomi MIMO v2.5100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Z.AI GLM 4.5 Air100100100100100100.0%
Hermes 3 405B100100100100100100.0%
DeepSeek V4 Pro100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3.1100100100100100100.0%
DeepSeek V3.2100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V4 Flash100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Grok 4.20100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Grok 4.3100100100100100100.0%
Gemma 3 12B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Hermes 3 70B100100100100100100.0%
Ministral 3 14B100100100100100100.0%
GPT-4.1 Nano100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Arcee AI: Trinity Mini100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Ministral 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)1001001001009999.7%
Llama 3.1 Nemotron 70B1001001001009899.6%
Llama 3.1 70B1001001001009699.1%
Llama 3.1 8B1001001001009398.7%
Stealth: Aurora Alpha1001001001009098.1%
Inception Mercury 21001001001008697.1%
Inception Mercury100100100977995.1%
Qwen 3.5 35B1001001001006993.8%
Nemotron 3 Nano100100100807891.7%
Mistral Small 3.2 24B100100100383173.7%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Qwen3.6 Max Preview100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Grok 4.20 (Reasoning)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
Qwen 3.6 Flash100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100.0%
Claude Opus 4.7100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Aion 2.0100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
MiniMax M2.7100100100100100100.0%
GPT-5.5100100100100100100.0%
Qwen 3.6 35B100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
GPT-4.1100100100100100100.0%
Gemini 2.5 Pro100100100100100100.0%
o4 Mini100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
Gemma 4 31B100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100.0%
GPT-OSS 120B100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
Stealth: Healer Alpha100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Gemma 4 26B100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
Xiaomi MIMO v2.5100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Z.AI GLM 4.5 Air100100100100100100.0%
DeepSeek V4 Pro100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3.1100100100100100100.0%
DeepSeek V3.2100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V4 Flash100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Grok 4.20100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Grok 4.3100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Mistral Small 4100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Hermes 3 70B100100100100100100.0%
Ministral 3 14B100100100100100100.0%
GPT-4.1 Nano100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Ministral 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B100100100100100100.0%
Grok 41001001001009999.8%
Nemotron 3 Super1001001001009799.3%
Qwen 3.5 9B1001001001009599.1%
GPT-4o Mini (temp=0)1001001001009599.0%
Arcee AI: Trinity Mini1001001001009198.3%
GPT-4o, Aug. 6th (temp=0)1001001001009198.2%
Llama 3.1 Nemotron 70B100100100998897.4%
Qwen 3.5 27B1001001001008697.2%
Qwen 2.5 72B1001001001008496.8%
Stealth: Aurora Alpha10010097948895.8%
Inception Mercury 210010094949195.7%
Grok 4.3 (Reasoning)100100100908394.6%
ByteDance Seed 2.0 Mini1001001001007094.1%
Hermes 3 405B1001001001006793.5%
Mistral NeMO100100100947193.1%
Inception Mercury100100100897592.8%
Llama 3.1 8B1001001001005891.6%
Llama 3.1 70B100100100876891.1%
Qwen 3.6 27B100100100926190.5%
Mistral Small 3.2 24B100100100672979.0%

genre

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Qwen3.6 Max Preview100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Grok 4.20 (Reasoning)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
Qwen 3.6 Flash100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100.0%
Claude Opus 4.7100100100100100100.0%
Qwen 3.6 27B100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Aion 2.0100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
MiniMax M2.7100100100100100100.0%
GPT-5.5100100100100100100.0%
Qwen 3.6 35B100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
GPT-4.1100100100100100100.0%
Gemini 2.5 Pro100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Gemma 4 31B100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100.0%
GPT-OSS 120B100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
Stealth: Healer Alpha100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Gemma 4 26B100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
Xiaomi MIMO v2.5100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
GPT-5.4100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Z.AI GLM 4.5 Air100100100100100100.0%
DeepSeek V4 Pro100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3.1100100100100100100.0%
DeepSeek V3.2100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V4 Flash100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Grok 4.20100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Grok 4.3100100100100100100.0%
Gemma 3 12B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Hermes 3 70B100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Arcee AI: Trinity Mini100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Ministral 8B100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B100100100100100100.0%
Nemotron 3 Nano1001001001009398.7%
Claude Sonnet 4100100100989698.6%
Mistral NeMO1001001001009398.6%
Hermes 3 405B1001001001008797.4%
Claude 3.5 Sonnet1001001001008797.3%
GPT-4.1 Nano1001001001007995.7%
GPT-4o, Aug. 6th (temp=0)10010096928794.8%
Inception Mercury 210010097968094.6%
Qwen 2.5 72B100100100868393.8%
Inception Mercury1001001001005691.2%
Llama 3.1 Nemotron 70B100100100757489.8%
GPT-4o, Aug. 6th (temp=1)1009889885986.7%
Stealth: Aurora Alpha1009687816786.1%
Llama 3.1 70B1009479716481.7%
WizardLM 2 8x22b908279774474.3%
Mistral Small 3.2 24B966649482857.2%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Qwen3.6 Max Preview100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5.1100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
Grok 4.3 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.7 (Reasoning)100100100100100100.0%
GPT-5.5 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
MoonshotAI: Kimi K2.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Qwen 3.5 Plus (2026-04-20)100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Grok 4.20 (Reasoning)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
Qwen 3.6 Flash100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
DeepSeek V4 Pro (Reasoning)100100100100100100.0%
Claude Opus 4.7100100100100100100.0%
Qwen 3.6 27B100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Aion 2.0100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
MiniMax M2.7100100100100100100.0%
GPT-5.5100100100100100100.0%
Qwen 3.6 35B100100100100100100.0%
DeepSeek V4 Flash (Reasoning)100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
GPT-4.1100100100100100100.0%
Gemini 2.5 Pro100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
Xiaomi MIMO v2.5 Pro100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Gemma 4 31B100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100.0%
GPT-OSS 120B100100100100100100.0%
Gemini 3.1 Flash Lite (Reasoning)100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
Stealth: Healer Alpha100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Gemma 4 26B100100100100100100.0%
Gemini 3.1 Flash Lite100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
Mistral Large 3100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
Xiaomi MIMO v2.5100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
GPT-5.4100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Z.AI GLM 4.5 Air100100100100100100.0%
DeepSeek V4 Pro100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3.1100100100100100100.0%
DeepSeek V3.2100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V4 Flash100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Grok 4.20100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Grok 4.3100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Mistral Small 4100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
GPT-4.1 Nano100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Arcee AI: Trinity Mini100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Ministral 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B100100100100100100.0%
Arcee AI: Trinity Large (Preview)10010010010010099.9%
Inception Mercury 21001001001009999.7%
Qwen 2.5 72B1001001001009198.3%
Hermes 3 70B1001001001009198.2%
Claude 3.5 Sonnet1001001001009198.2%
GPT-4o Mini (temp=0)10010099969598.0%
DeepSeek V3 (2024-12-26)1001001001009097.9%
Hermes 3 405B10010096968695.6%
GPT-4o, May 13th (temp=0)10010099928495.1%
Llama 3.1 8B1001001001007394.6%
Llama 3.1 70B1001001001006993.8%
Llama 3.1 Nemotron 70B1001001001006893.6%
Mistral Small 3.2 24B100100100927092.3%
Cohere Command R+ (Aug. 2024)