Matches sentence count

Test: Write N of X

Avg. Score: 87.8%
Scenarios: 5

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 3 Flash (Preview) | 100.0% | $0.0019 | 3.2s | 100%
2 | GPT-5.4 Mini (Reasoning, Low) | 99.9% | $0.0021 | 2.2s | 99%
3 | Mistral Large 3 | 100.0% | $0.0013 | 6.2s | 100%
4 | Mistral Small 3.2 24B | 100.0% | $0.0003 | 8.7s | 100%
5 | GPT-5.4 Mini (Reasoning) | 100.0% | $0.0033 | 4.9s | 100%
6 | Llama 3.1 Nemotron 70B | 100.0% | $0.0007 | 12.6s | 100%
7 | Grok 4.20 (Beta) | 99.4% | $0.0023 | 1.4s | 94%
8 | GPT-5 Mini | 100.0% | $0.0022 | 10.8s | 100%
9 | Qwen 3.5 Plus (2026-02-15) | 100.0% | $0.0014 | 14.5s | 100%
10 | Gemini 3.1 Flash Lite (Preview) | 98.6% | $0.0008 | 2.0s | 89%
11 | Grok 4.20 (Beta, Reasoning) | 100.0% | $0.0071 | 4.8s | 100%
12 | GPT-5.4 | 100.0% | $0.0065 | 7.1s | 100%
13 | GPT-5.4 (Reasoning, Low) | 100.0% | $0.0073 | 6.7s | 100%
14 | Z.AI GLM 5 Turbo | 100.0% | $0.0051 | 13.8s | 100%
15 | GPT-5.2 | 99.8% | $0.0081 | 8.8s | 99%
16 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.0080 | 13.6s | 100%
17 | Inception Mercury 2 | 97.0% | $0.0009 | 1.3s | 81%
18 | GPT-5.4 Mini | 98.2% | $0.0020 | 2.2s | 80%
19 | GPT-5 | 100.0% | $0.0095 | 16.2s | 100%
20 | o4 Mini High | 99.9% | $0.0081 | 18.2s | 99%
21 | Z.AI GLM 4.7 Flash | 100.0% | $0.0009 | 35.2s | 100%
22 | GPT-5.1 | 99.8% | $0.0095 | 14.1s | 98%
23 | Stealth: Aurora Alpha | 97.4% |  | 5.1s | 84%
24 | GPT-5.4 (Reasoning) | 100.0% | $0.012 | 12.0s | 100%
25 | Inception Mercury | 96.8% | $0.0004 | 1.6s | 73%
26 | MoonshotAI: Kimi K2.5 | 100.0% | $0.0045 | 31.5s | 100%
27 | Mistral Small Creative | 96.8% | $0.0003 | 2.6s | 72%
28 | Qwen 3.5 Flash | 100.0% | $0.0024 | 39.4s | 100%
29 | ByteDance Seed 1.6 | 100.0% | $0.0036 | 36.9s | 100%
30 | Mistral Medium 3.1 | 97.2% | $0.0014 | 8.1s | 71%
31 | Nemotron 3 Super | 98.0% | $0.0000 | 14.0s | 72%
32 | Gemini 2.5 Flash Lite (Reasoning) | 96.5% | $0.0009 | 10.8s | 71%
33 | DeepSeek V3 (2025-03-24) | 98.0% | $0.0006 | 15.7s | 72%
34 | GPT-5.4 Nano (Reasoning) | 95.8% | $0.0010 | 3.9s | 65%
35 | o4 Mini | 98.4% | $0.0063 | 14.9s | 80%
36 | Mistral Small 4 (Reasoning) | 96.1% | $0.0010 | 9.0s | 68%
37 | Gemini 2.5 Pro | 99.4% | $0.016 | 13.2s | 94%
38 | Grok 4.1 Fast | 95.6% | $0.0005 | 5.6s | 64%
39 | DeepSeek-V2 Chat | 97.9% | $0.0003 | 20.5s | 72%
40 | Qwen 3.5 35B | 100.0% | $0.011 | 34.9s | 100%
41 | Claude Opus 4.6 | 99.9% | $0.021 | 12.9s | 99%
42 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.020 | 15.2s | 100%
43 | Gemma 3 27B | 95.4% | $0.0003 | 12.6s | 69%
44 | GPT-5.4 Nano (Reasoning, Low) | 93.2% | $0.0007 | 2.5s | 65%
45 | Gemma 3 4B | 95.0% | $0.0001 | 5.0s | 61%
46 | Qwen 3 32B | 95.8% | $0.0005 | 16.3s | 69%
47 | Z.AI GLM 5 | 100.0% | $0.0074 | 48.6s | 100%
48 | GPT-5 Nano | 98.0% | $0.0010 | 26.0s | 72%
49 | Ministral 3 14B | 92.1% | $0.0004 | 3.7s | 57%
50 | Gemini 3 Pro (Preview) | 100.0% | $0.025 | 16.6s | 100%
51 | ByteDance Seed 2.0 Lite | 100.0% | $0.0055 | 59.0s | 100%
52 | GPT-5.4 Nano | 93.5% | $0.0007 | 3.3s | 53%
53 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.029 | 14.3s | 100%
54 | ByteDance Seed 2.0 Mini | 100.0% | $0.0014 | 1.3m | 100%
55 | Qwen 3.5 9B | 100.0% | $0.0010 | 1.3m | 100%
56 | MiniMax M2.7 | 95.5% | $0.0029 | 34.0s | 71%
57 | Nemotron 3 Nano | 92.8% | $0.0004 | 21.8s | 58%
58 | Claude Opus 4.5 | 95.9% | $0.017 | 8.5s | 67%
59 | Z.AI GLM 4.6 | 97.7% | $0.0030 | 47.4s | 72%
60 | Stealth: Healer Alpha | 90.5% | $0.0000 | 7.8s | 45%
61 | Claude Sonnet 4.6 | 94.7% | $0.011 | 8.7s | 59%
62 | Grok 4 Fast | 89.9% | $0.0005 | 4.2s | 40%
63 | Claude Sonnet 4.5 | 92.0% | $0.010 | 7.5s | 56%
64 | DeepSeek V3.2 | 92.1% | $0.0005 | 21.2s | 49%
65 | Z.AI GLM 4.7 | 100.0% | $0.0040 | 1.5m | 100%
66 | Gemini 3.1 Pro (Preview) | 100.0% | $0.034 | 29.2s | 100%
67 | Mistral Small 4 | 85.8% | $0.0005 | 4.2s | 39%
68 | Qwen 3.5 122B | 100.0% | $0.023 | 55.5s | 100%
69 | DeepSeek V3 (2024-12-26) | 88.6% | $0.0009 | 12.2s | 41%
70 | Llama 3.1 8B | 85.6% | $0.0004 | 1.7s | 33%
71 | Stealth: Hunter Alpha | 87.9% | $0.0000 | 19.7s | 42%
72 | Gemini 2.5 Flash (Reasoning) | 85.2% | $0.0037 | 7.0s | 42%
73 | GPT-4.1 | 86.0% | $0.0041 | 5.8s | 32%
74 | Aion 2.0 | 89.1% | $0.0031 | 27.9s | 39%
75 | Claude 3.5 Sonnet | 93.2% | $0.010 | 38.4s | 54%
76 | DeepSeek V3.1 | 83.3% | $0.0009 | 6.9s | 27%
77 | ByteDance Seed 1.6 Flash | 82.3% | $0.0005 | 8.3s | 26%
78 | GPT-4.1 Mini | 79.4% | $0.0008 | 5.1s | 28%
79 | Llama 3.1 70B | 80.5% | $0.0017 | 3.1s | 26%
80 | Grok 4 | 88.0% | $0.011 | 15.0s | 35%
81 | Gemma 3 12B | 80.1% | $0.0002 | 9.4s | 23%
82 | Qwen3 235B A22B Instruct 2507 | 80.6% | $0.0003 | 10.5s | 22%
83 | Claude Sonnet 4 | 84.2% | $0.012 | 10.0s | 31%
84 | Hermes 3 405B | 80.0% | $0.0000 | 18.7s | 24%
85 | Writer: Palmyra X5 | 79.9% | $0.0032 | 8.9s | 20%
86 | Ministral 3 8B | 74.1% | $0.0003 | 3.0s | 19%
87 | Ministral 3 3B | 74.8% | $0.0002 | 1.6s | 15%
88 | Gemini 2.5 Flash Lite | 73.3% | $0.0003 | 1.7s | 18%
89 | GPT-4.1 Nano | 75.2% | $0.0003 | 6.8s | 19%
90 | Claude 3.5 Haiku | 76.8% | $0.0025 | 6.2s | 20%
91 | Qwen 3.5 397B A17B | 100.0% | $0.017 | 2.0m | 100%
92 | Z.AI GLM 4.5 | 74.8% | $0.0012 | 8.5s | 16%
93 | GPT-4o, Aug. 6th (temp=1) | 74.2% | $0.0060 | 3.2s | 18%
94 | GPT-4o, Aug. 6th (temp=0) | 74.0% | $0.0058 | 3.1s | 18%
95 | Qwen 3.5 27B | 92.5% | $0.014 | 1.0m | 49%
96 | MiniMax M2.5 | 75.6% | $0.0014 | 19.1s | 15%
97 | Claude Haiku 4.5 | 70.3% | $0.0031 | 4.3s | 14%
98 | Gemini 2.5 Flash | 64.8% | $0.0011 | 2.2s | 16%
99 | Qwen 2.5 72B | 66.7% | $0.0007 | 6.3s | 11%
100 | Ministral 3B | 62.2% | $0.0001 | 2.2s | 9%
101 | GPT-4o Mini (temp=0) | 79.5% | $0.0004 | 53.1s | 20%
102 | GPT-4o Mini (temp=1) | 79.6% | $0.0004 | 54.0s | 20%
103 | Ministral 8B | 60.4% | $0.0002 | 2.8s | 9%
104 | LFM2 24B | 63.9% | $0.0001 | 9.8s | 4%
105 | WizardLM 2 8x22b | 71.5% | $0.0017 | 38.3s | 14%
106 | Claude 3.7 Sonnet | 66.3% | $0.010 | 8.4s | 12%
107 | GPT-4o, May 13th (temp=0) | 71.8% | $0.011 | 26.0s | 17%
108 | Arcee AI: Trinity Large (Preview) | 57.2% | $0.0000 | 5.1s | 5%
109 | Hermes 3 70B | 58.1% | $0.0007 | 6.7s | 6%
110 | GPT-4o, May 13th (temp=1) | 74.0% | $0.011 | 33.0s | 17%
111 | Cohere Command R+ (Aug. 2024) | 60.4% | $0.0059 | 4.3s | 6%
112 | Mistral Large 2 | 58.8% | $0.0045 | 4.2s | 4%
113 | Claude 3 Haiku | 61.9% | $0.0008 | 36.4s | 9%
114 | Mistral NeMO | 47.6% | $0.0003 | 3.6s | 3%
115 | Rocinante 12B | 53.1% | $0.0006 | 24.0s | 5%
116 | Arcee AI: Trinity Mini | 44.0% | $0.0002 | 4.6s | 0%
117 | Claude Opus 4 | 80.2% | $0.054 | 18.2s | 24%
118 | Mistral Large | 63.8% | $0.020 | 52.3s | 12%
Average score: 87.81%

Individual Scenarios

sentences

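Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 0 | 0 | 78.9%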
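Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.9%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.7%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.7%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 77 | 2 | 87.6%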
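Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.9%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.9%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.9%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.9%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.9%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.7%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.7%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.6%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.5%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.5%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 99.4%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 99.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 99.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 98.9%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 98.7%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 98.4%
Qwen 2.5 72B | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 92 | 92 | 92 | 96.7%
Mistral Small 4 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 92 | 77 | 96.3%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 54 | 95.2%
Cohere Command R+ (Aug. 2024) | 100 | 98 | 98 | 98 | 98 | 98 | 92 | 92 | 77 | 77 | 93.1%
Mistral Large 2 | 100 | 100 | 98 | 98 | 98 | 92 | 92 | 92 | 77 | 77 | 92.7%
Mistral Large | 100 | 98 | 98 | 92 | 92 | 92 | 92 | 77 | 77 | 77 | 89.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 2 | 89.4%
WizardLM 2 8x22b | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 77 | 9 | 87.8%
Ministral 3 8B | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 92 | 54 | 27 | 86.3%
Arcee AI: Trinity Large (Preview) | 100 | 98 | 92 | 92 | 77 | 77 | 77 | 77 | 77 | 77 | 84.7%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 92 | 54 | 0 | 83.6%
Hermes 3 405B | 98 | 98 | 92 | 92 | 92 | 92 | 77 | 77 | 77 | 2 | 80.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 92 | 92 | 77 | 54 | 54 | 27 | 79.6%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 77 | 77 | 77 | 54 | 54 | 27 | 76.7%
Claude 3.7 Sonnet | 98 | 92 | 92 | 92 | 77 | 77 | 54 | 54 | 54 | 27 | 71.8%
Ministral 3B | 100 | 100 | 98 | 98 | 92 | 77 | 77 | 27 | 0 | 0 | 67.1%
Rocinante 12B | 100 | 100 | 92 | 92 | 77 | 27 | 2 | 0 | 0 | 0 | 49.1%
Ministral 8B | 92 | 77 | 77 | 77 | 54 | 27 | 27 | 9 | 0 | 0 | 44.2%
Gemini 2.5 Flash | 98 | 92 | 77 | 54 | 27 | 27 | 9 | 2 | 2 | 0 | 38.9%
Mistral NeMO | 77 | 77 | 54 | 54 | 54 | 27 | 9 | 9 | 9 | 9 | 37.9%
Arcee AI: Trinity Mini | 100 | 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19.8%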
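Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.9%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.6%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 99.4%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 99.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 99.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 98.7%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 98.7%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 92 | 98.1%
GPT-4o Mini (temp=0) | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 92 | 92 | 97.5%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 77 | 97.2%
Claude Sonnet 4.5 | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 92 | 92 | 92 | 96.9%
Inception Mercury | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 92 | 77 | 96.3%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 92 | 77 | 95.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 54 | 95.2%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 77 | 77 | 94.5%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 54 | 94.4%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 54 | 93.8%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 98 | 92 | 92 | 92 | 92 | 77 | 77 | 92.2%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 9 | 90.4%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 9 | 90.3%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 2 | 90.1%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 77 | 27 | 90.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 0 | 89.7%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 0 | 88.9%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 54 | 27 | 88.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 77 | 2 | 87.7%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 77 | 0 | 86.6%
MiniMax M2.7 | 100 | 100 | 100 | 98 | 98 | 92 | 92 | 92 | 77 | 9 | 86.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 54 | 9 | 85.0%
Mistral Small Creative | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 92 | 54 | 9 | 84.2%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 98 | 92 | 92 | 77 | 54 | 27 | 84.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 92 | 27 | 27 | 83.6%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 77 | 27 | 27 | 83.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 27 | 27 | 9 | 76.3%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 54 | 2 | 0 | 75.3%
Ministral 3 8B | 100 | 100 | 100 | 100 | 92 | 77 | 77 | 77 | 9 | 9 | 74.2%
Llama 3.1 70B | 100 | 100 | 98 | 92 | 92 | 92 | 77 | 54 | 27 | 2 | 73.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 98 | 92 | 92 | 92 | 77 | 27 | 27 | 9 | 71.6%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 98 | 92 | 77 | 77 | 77 | 54 | 27 | 9 | 71.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 98 | 92 | 77 | 54 | 54 | 27 | 2 | 70.4%
GPT-4o, Aug. 6th (temp=0) | 100 | 98 | 98 | 98 | 92 | 77 | 54 | 27 | 27 | 27 | 70.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
WizardLM 2 8x22b | 100 | 100 | 98 | 98 | 92 | 92 | 77 | 27 | 9 | 2 | 69.7%
GPT-4.1 Mini | 100 | 100 | 92 | 92 | 92 | 77 | 54 | 27 | 27 | 0 | 66.2%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 98 | 77 | 77 | 9 | 0 | 0 | 66.2%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 92 | 54 | 2 | 2 | 0 | 64.9%
Ministral 3 14B | 100 | 92 | 92 | 77 | 77 | 54 | 54 | 54 | 9 | 0 | 60.9%
GPT-4o, May 13th (temp=0) | 98 | 92 | 92 | 92 | 54 | 54 | 27 | 27 | 27 | 27 | 59.2%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 98 | 92 | 0 | 0 | 0 | 0 | 59.1%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 27 | 2 | 0 | 0 | 0 | 52.9%
Mistral Small 4 | 100 | 100 | 98 | 92 | 54 | 54 | 27 | 0 | 0 | 0 | 52.5%
Claude Haiku 4.5 | 98 | 92 | 77 | 77 | 77 | 27 | 27 | 27 | 9 | 0 | 51.4%
Gemini 2.5 Flash | 92 | 92 | 54 | 54 | 54 | 27 | 27 | 9 | 9 | 0 | 41.8%
Rocinante 12B | 100 | 100 | 92 | 54 | 54 | 0 | 0 | 0 | 0 | 0 | 39.9%
Qwen 2.5 72B | 98 | 98 | 54 | 54 | 27 | 27 | 9 | 0 | 0 | 0 | 36.8%
Ministral 3B | 100 | 98 | 77 | 27 | 27 | 9 | 0 | 0 | 0 | 0 | 34.0%
GPT-4.1 Nano | 100 | 77 | 54 | 54 | 9 | 9 | 2 | 0 | 0 | 0 | 30.4%
Claude 3 Haiku | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Mistral Large | 92 | 54 | 54 | 27 | 27 | 27 | 9 | 2 | 0 | 0 | 29.2%
Ministral 8B | 100 | 98 | 54 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 27.9%
Gemini 2.5 Flash Lite | 100 | 100 | 54 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 25.5%
LFM2 24B | 100 | 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19.8%
Hermes 3 70B | 100 | 9 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11.1%
Cohere Command R+ (Aug. 2024) | 77 | 9 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 9.0%
Mistral Large 2 | 9 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.2%
Arcee AI: Trinity Large (Preview) | 9 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.1%
Claude 3.7 Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%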
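Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.7%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.6%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 77 | 97.2%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 77 | 97.1%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 77 | 77 | 94.4%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 77 | 77 | 77 | 93.2%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 77 | 54 | 92.3%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 77 | 54 | 92.3%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 27 | 91.9%
GPT-5.4 Mini | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 92 | 27 | 91.2%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 54 | 54 | 90.4%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 89.9%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 0 | 89.8%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 77 | 9 | 87.8%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 92 | 77 | 0 | 86.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 77 | 77 | 0 | 85.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 98 | 92 | 92 | 77 | 77 | 0 | 83.7%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 77 | 27 | 9 | 81.2%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 77 | 27 | 0 | 80.5%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 77 | 27 | 0 | 80.3%
Claude Opus 4.5 | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 54 | 54 | 0 | 79.4%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 54 | 9 | 0 | 75.5%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 98 | 98 | 77 | 77 | 54 | 27 | 9 | 74.1%
Claude Sonnet 4.6 | 100 | 100 | 98 | 98 | 98 | 92 | 92 | 54 | 0 | 0 | 73.3%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 27 | 0 | 0 | 72.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 0 | 0 | 0 | 69.8%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 98 | 92 | 0 | 0 | 0 | 69.1%
Claude Sonnet 4.5 | 98 | 98 | 98 | 98 | 77 | 77 | 54 | 27 | 9 | 0 | 63.8%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 27 | 0 | 0 | 0 | 62.7%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 59.9%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 77 | 2 | 0 | 0 | 0 | 57.9%
Gemini 2.5 Flash Lite | 100 | 98 | 98 | 92 | 77 | 54 | 54 | 2 | 0 | 0 | 57.5%
Gemini 2.5 Flash (Reasoning) | 100 | 98 | 92 | 77 | 54 | 54 | 54 | 9 | 9 | 0 | 54.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Stealth: Hunter Alpha | 100 | 92 | 77 | 77 | 77 | 54 | 9 | 9 | 0 | 0 | 49.6%
Aion 2.0 | 100 | 100 | 100 | 100 | 54 | 2 | 0 | 0 | 0 | 0 | 45.5%
GPT-4.1 Nano | 100 | 100 | 100 | 77 | 77 | 0 | 0 | 0 | 0 | 0 | 45.5%
Gemini 2.5 Flash | 100 | 100 | 98 | 98 | 27 | 9 | 0 | 0 | 0 | 0 | 43.3%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Ministral 8B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
GPT-4.1 | 100 | 100 | 98 | 98 | 0 | 0 | 0 | 0 | 0 | 0 | 39.7%
GPT-4.1 Mini | 100 | 100 | 54 | 27 | 27 | 0 | 0 | 0 | 0 | 0 | 30.8%
Claude Sonnet 4 | 100 | 92 | 77 | 27 | 9 | 0 | 0 | 0 | 0 | 0 | 30.6%
Llama 3.1 70B | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
DeepSeek V3.1 | 98 | 92 | 77 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 27.0%
Gemma 3 12B | 98 | 98 | 54 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
ByteDance Seed 1.6 Flash | 100 | 98 | 9 | 9 | 9 | 0 | 0 | 0 | 0 | 0 | 22.6%
MiniMax M2.5 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Writer: Palmyra X5 | 100 | 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19.8%
Rocinante 12B | 92 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.1%
Ministral 3 8B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Ministral 3B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Ministral 3 3B | 92 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.6%
Z.AI GLM 4.5 | 77 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.9%
Claude Opus 4 | 54 | 9 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.4%
Claude 3 Haiku | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.7%
Claude 3.5 Haiku | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Haiku 4.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Cohere Command R+ (Aug. 2024) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 2.5 72B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Large (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Hermes 3 70B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%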