Forbidden words eliminated

Test: Text Replacement

Avg. Score: 94.3%
Scenarios: 2

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 2.5 Flash Lite | 100.0% | $0.0003 | 1.5s | 100%
2 | Mistral Small 4 | 100.0% | $0.0004 | 2.9s | 100%
3 | Gemini 3.1 Flash Lite (Preview) | 100.0% | $0.0009 | 1.7s | 100%
4 | Inception Mercury 2 | 100.0% | $0.0010 | 1.4s | 100%
5 | Gemini 3.1 Flash Lite (Reasoning) | 100.0% | $0.0009 | 1.8s | 100%
6 | Gemini 3.1 Flash Lite | 100.0% | $0.0009 | 2.7s | 100%
7 | Gemma 3 4B | 100.0% | $0.0001 | 5.9s | 100%
8 | Mistral Small 3.2 24B | 100.0% | $0.0002 | 5.5s | 100%
9 | Gemini 2.5 Flash | 100.0% | $0.0014 | 2.1s | 100%
10 | Grok 4 Fast | 100.0% | $0.0007 | 5.7s | 100%
11 | Mistral Medium 3.1 | 100.0% | $0.0012 | 4.9s | 100%
12 | Gemini 3 Flash (Preview) | 100.0% | $0.0018 | 3.5s | 100%
13 | GPT-4.1 Mini | 100.0% | $0.0010 | 6.3s | 100%
14 | GPT-4o Mini (temp=1) | 100.0% | $0.0004 | 9.0s | 100%
15 | GPT-4o Mini (temp=0) | 100.0% | $0.0004 | 9.3s | 100%
16 | Grok 4.20 | 100.0% | $0.0019 | 4.2s | 100%
17 | Mistral Large 3 | 100.0% | $0.0010 | 7.2s | 100%
18 | Qwen 3.5 Plus (2026-02-15) | 100.0% | $0.0015 | 7.1s | 100%
19 | Grok 4.20 (Beta) | 100.0% | $0.0032 | 1.7s | 100%
20 | Claude Haiku 4.5 | 100.0% | $0.0033 | 2.7s | 100%
21 | Gemini 2.5 Flash Lite (Reasoning) | 100.0% | $0.0012 | 10.6s | 100%
22 | Qwen3 235B A22B Instruct 2507 | 100.0% | $0.0003 | 14.7s | 100%
23 | Gemma 3 27B | 100.0% | $0.0002 | 15.6s | 100%
24 | DeepSeek V3 (2024-12-26) | 100.0% | $0.0007 | 15.6s | 100%
25 | DeepSeek-V2 Chat | 100.0% | $0.0007 | 16.0s | 100%
26 | Grok 4.1 Fast | 100.0% | $0.0011 | 14.8s | 100%
27 | Llama 3.1 Nemotron 70B | 100.0% | $0.0014 | 15.1s | 100%
28 | Gemma 4 26B | 100.0% | $0.0002 | 19.4s | 100%
29 | DeepSeek V4 Pro | 100.0% | $0.0011 | 17.3s | 100%
30 | Gemini 3.5 Flash (Reasoning, Minimal) | 100.0% | $0.0054 | 2.6s | 100%
31 | Mistral Large | 100.0% | $0.0042 | 7.0s | 100%
32 | Mistral Large 2 | 100.0% | $0.0042 | 7.1s | 100%
33 | GPT-4.1 | 100.0% | $0.0052 | 4.2s | 100%
34 | Gemma 4 31B | 100.0% | $0.0003 | 22.0s | 100%
35 | Gemini 2.5 Flash (Reasoning) | 100.0% | $0.0049 | 7.8s | 100%
36 | GPT-4o, Aug. 6th (temp=0) | 100.0% | $0.0064 | 2.5s | 100%
37 | GPT-4o, Aug. 6th (temp=1) | 100.0% | $0.0064 | 2.6s | 100%
38 | Z.AI GLM 4.5 Air | 100.0% | $0.0011 | 21.6s | 100%
39 | Hermes 3 405B | 100.0% | $0.0011 | 23.2s | 100%
40 | DeepSeek V4 Flash (Reasoning) | 100.0% | $0.0005 | 25.5s | 100%
41 | Xiaomi MIMO v2.5 | 100.0% | $0.0038 | 15.4s | 100%
42 | Mistral Small 4 (Reasoning) | 100.0% | $0.0021 | 22.4s | 100%
43 | Z.AI GLM 4.5 | 100.0% | $0.0027 | 21.2s | 100%
44 | Llama 3.1 70B | 100.0% | $0.0005 | 30.4s | 100%
45 | GPT-5.2 | 100.0% | $0.0079 | 5.5s | 100%
46 | Z.AI GLM 5 Turbo | 100.0% | $0.0062 | 13.5s | 100%
47 | Nemotron 3 Super | 100.0% | $0.0000 | 36.7s | 100%
48 | DeepSeek V3.1 | 100.0% | $0.0007 | 35.8s | 100%
49 | Mistral Small Creative | 97.6% | $0.0002 | 3.0s | 83%
50 | Xiaomi MIMO v2.5 Pro | 100.0% | $0.0048 | 22.2s | 100%
51 | GPT-4o, May 13th (temp=0) | 100.0% | $0.010 | 2.6s | 100%
52 | WizardLM 2 8x22b | 100.0% | $0.0007 | 36.8s | 100%
53 | Claude Sonnet 4.6 | 100.0% | $0.010 | 4.4s | 100%
54 | GPT-4o, May 13th (temp=1) | 100.0% | $0.010 | 3.2s | 100%
55 | Claude Sonnet 4.5 | 100.0% | $0.010 | 4.6s | 100%
56 | Claude Sonnet 4 | 100.0% | $0.010 | 5.5s | 100%
57 | Claude 3.7 Sonnet | 100.0% | $0.010 | 5.9s | 100%
58 | DeepSeek V3 (2025-03-24) | 100.0% | $0.0005 | 40.7s | 100%
59 | Gemma 3 12B | 97.6% | $0.0001 | 8.1s | 83%
60 | GPT-5 Mini | 100.0% | $0.0047 | 27.4s | 100%
61 | DeepSeek V3.2 | 100.0% | $0.0004 | 44.9s | 100%
62 | Stealth: Healer Alpha | 97.6% | $0.0000 | 10.9s | 83%
63 | Qwen 2.5 72B | 97.6% | $0.0003 | 10.1s | 83%
64 | Qwen 3.6 Flash | 100.0% | $0.0070 | 21.9s | 100%
65 | Grok 4.3 | 97.6% | $0.0020 | 4.8s | 83%
66 | Hermes 3 70B | 100.0% | $0.0007 | 45.2s | 100%
67 | Qwen 3.6 35B | 100.0% | $0.0054 | 29.2s | 100%
68 | ByteDance Seed 1.6 Flash | 97.6% | $0.0007 | 11.7s | 83%
69 | Z.AI GLM 4.6 | 100.0% | $0.0039 | 35.8s | 100%
70 | ByteDance Seed 1.6 | 100.0% | $0.0036 | 40.0s | 100%
71 | GPT-5 Nano | 100.0% | $0.0019 | 46.4s | 100%
72 | Qwen 3.5 Flash | 100.0% | $0.0028 | 48.6s | 100%
73 | Grok 4.20 (Reasoning) | 100.0% | $0.0082 | 30.3s | 100%
74 | GPT-5.4 (Reasoning) | 100.0% | $0.014 | 11.3s | 100%
75 | Nemotron 3 Nano | 100.0% | $0.0009 | 56.9s | 100%
76 | Claude Opus 4.5 | 100.0% | $0.017 | 5.4s | 100%
77 | Claude Opus 4.6 | 100.0% | $0.017 | 5.4s | 100%
78 | o4 Mini | 100.0% | $0.013 | 20.7s | 100%
79 | Mistral NeMO | 92.9% | $0.0002 | 3.0s | 73%
80 | GPT-5.4 Mini (Reasoning) | 97.6% | $0.0076 | 9.3s | 83%
81 | GPT-5.5 (Reasoning, Low) | 100.0% | $0.019 | 5.6s | 100%
82 | Grok 4.20 (Beta, Reasoning) | 100.0% | $0.017 | 11.3s | 100%
83 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.014 | 23.7s | 100%
84 | Aion 2.0 | 100.0% | $0.0043 | 58.1s | 100%
85 | Z.AI GLM 5.1 | 100.0% | $0.0083 | 44.3s | 100%
86 | GPT-5.1 | 97.6% | $0.0087 | 8.2s | 83%
87 | ByteDance Seed 2.0 Lite | 100.0% | $0.0051 | 57.6s | 100%
88 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.019 | 12.0s | 100%
89 | GPT-5.5 (Reasoning) | 100.0% | $0.021 | 6.8s | 100%
90 | MiniMax M2.5 | 97.6% | $0.0016 | 39.6s | 83%
91 | Claude 3.5 Sonnet | 100.0% | $0.020 | 9.7s | 100%
92 | Gemini 3.5 Flash (Reasoning) | 100.0% | $0.021 | 9.0s | 100%
93 | Writer: Palmyra X5 | 92.9% | $0.0034 | 8.3s | 73%
94 | Claude Opus 4.7 (Reasoning) | 100.0% | $0.023 | 3.8s | 100%
95 | Claude Opus 4.7 | 100.0% | $0.023 | 4.2s | 100%
96 | Cydonia 24B V4.1 | 90.5% | $0.0004 | 12.3s | 70%
97 | Qwen 3.5 35B | 100.0% | $0.013 | 40.1s | 100%
98 | GPT-OSS 120B | 92.9% | $0.0004 | 23.5s | 73%
99 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.023 | 9.1s | 100%
100 | GPT-5.4 Nano (Reasoning) | 88.1% | $0.0010 | 4.3s | 68%
101 | Qwen 3 32B | 95.2% | $0.0010 | 36.9s | 77%
102 | Cohere Command R+ (Aug. 2024) | 97.6% | $0.0066 | 33.5s | 83%
103 | Z.AI GLM 4.7 Flash | 100.0% | $0.0014 | 1.5m | 100%
104 | GPT-5.4 Mini (Reasoning, Low) | 88.1% | $0.0029 | 3.3s | 68%
105 | Grok 4.3 (Reasoning) | 100.0% | $0.011 | 55.1s | 100%
106 | Qwen 3.6 27B | 100.0% | $0.013 | 51.7s | 100%
107 | Grok 4 | 100.0% | $0.020 | 26.3s | 100%
108 | GPT-5 | 100.0% | $0.020 | 27.9s | 100%
109 | Qwen 3.5 27B | 100.0% | $0.013 | 54.9s | 100%
110 | GPT-5.5 | 97.6% | $0.018 | 4.7s | 83%
111 | Z.AI GLM 5 | 100.0% | $0.0085 | 1.2m | 100%
112 | MiniMax M2.7 | 95.2% | $0.0031 | 44.2s | 77%
113 | Qwen 3.5 Plus (2026-04-20) | 100.0% | $0.011 | 1.2m | 100%
114 | o4 Mini High | 100.0% | $0.022 | 34.0s | 100%
115 | GPT-5.4 (Reasoning, Low) | 90.5% | $0.0092 | 5.5s | 70%
116 | Qwen 3.5 122B | 100.0% | $0.018 | 48.0s | 100%
117 | DeepSeek V4 Flash | 90.5% | $0.0002 | 9.0s | 47%
118 | Qwen 3.5 9B | 100.0% | $0.0014 | 2.0m | 100%
119 | GPT-5.4 | 85.7% | $0.0089 | 6.0s | 67%
120 | Stealth: Hunter Alpha | 92.9% | $0.0000 | 25.3s | 48%
121 | Qwen3.7 Max | 100.0% | $0.025 | 45.2s | 100%
122 | Gemma 4 31B (Reasoning) | 100.0% | $0.0009 | 2.2m | 100%
123 | Gemini 3.1 Pro (Preview) | 100.0% | $0.030 | 28.5s | 100%
124 | Ministral 3 14B | 81.0% | $0.0002 | 3.9s | 49%
125 | Z.AI GLM 4.7 | 100.0% | $0.0064 | 1.9m | 100%
126 | Skyfall 36B V2 | 85.7% | $0.0007 | 9.2s | 45%
127 | Gemini 3 Pro (Preview) | 100.0% | $0.033 | 22.3s | 100%
128 | Inception Mercury | 81.0% | $0.0004 | 3.7s | 45%
129 | GPT-5.4 Mini | 81.0% | $0.0027 | 2.2s | 49%
130 | MoonshotAI: Kimi K2.5 | 100.0% | $0.010 | 1.8m | 100%
131 | Gemini 2.5 Pro | 100.0% | $0.036 | 27.8s | 100%
132 | ByteDance Seed 2.0 Mini | 100.0% | $0.0024 | 2.6m | 100%
133 | Gemma 4 26B (Reasoning) | 100.0% | $0.0018 | 2.6m | 100%
134 | Claude 3 Haiku | 81.0% | $0.0008 | 4.2s | 30%
135 | MoonshotAI: Kimi K2.6 | 100.0% | $0.016 | 2.0m | 100%
136 | Claude Opus 4 | 100.0% | $0.050 | 7.9s | 100%
137 | Qwen 3.5 397B A17B | 100.0% | $0.0093 | 2.6m | 100%
138 | Ministral 8B | 76.2% | $0.0001 | 3.2s | 23%
139 | Ministral 3 8B | 71.4% | $0.0002 | 3.0s | 24%
140 | DeepSeek V4 Pro (Reasoning) | 100.0% | $0.010 | 3.0m | 100%
141 | Qwen3.6 Max Preview | 100.0% | $0.031 | 1.8m | 100%
142 | Rocinante 12B | 64.3% | $0.0003 | 7.7s | 21%
143 | Arcee AI: Trinity Mini | 69.0% | $0.0010 | 34.9s | 18%
144 | Llama 3.1 8B | 42.9% | $0.0001 | 6.2s | 18%
145 | GPT-4.1 Nano | 45.2% | $0.0003 | 3.8s | 11%
146 | GPT-5.4 Nano | 35.7% | $0.0007 | 3.1s | 16%
147 | GPT-5.4 Nano (Reasoning, Low) | 38.1% | $0.0008 | 3.3s | 7%
148 | LFM2 24B | 40.5% | $0.0001 | 10.6s | 5%
149 | Arcee AI: Trinity Large (Preview) | 40.5% | $0.0000 | 22.8s | 12%
150 | Ministral 3B | 16.7% | $0.0000 | 1.8s | 11%
151 | Ministral 3 3B | 19.0% | $0.0001 | 1.9s | 0%
Average: 94.29%

Individual Scenarios

Generic Prompt

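Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 67 | 67 | 67 | 85.7%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 67 | 67 | 67 | 85.7%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Inception Mercury | 100 | 100 | 100 | 100 | 67 | 33 | 33 | 76.2%
GPT-4.1 Nano | 100 | 100 | 100 | 67 | 67 | 67 | 0 | 71.4%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 67 | 0 | 0 | 66.7%
Rocinante 12B | 100 | 100 | 67 | 67 | 67 | 33 | 33 | 66.7%
Ministral 3 14B | 67 | 67 | 67 | 67 | 67 | 67 | 33 | 61.9%
Llama 3.1 8B | 100 | 67 | 67 | 67 | 67 | 33 | 33 | 61.9%
Ministral 8B | 100 | 100 | 100 | 33 | 33 | 0 | 0 | 52.4%
Ministral 3 8B | 67 | 67 | 67 | 67 | 33 | 0 | 0 | 42.9%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 33 | 33 | 0 | 0 | 0 | 38.1%
GPT-5.4 Nano | 67 | 67 | 33 | 33 | 33 | 0 | 0 | 33.3%
Arcee AI: Trinity Large (Preview) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 14.3%
Ministral 3B | 33 | 33 | 0 | 0 | 0 | 0 | 0 | 9.5%
Ministral 3 3B | 33 | 0 | 0 | 0 | 0 | 0 | 0 | 4.8%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%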

Specific Prompt

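Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 67 | 67 | 67 | 85.7%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 67 | 67 | 67 | 85.7%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 67 | 67 | 67 | 85.7%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 67 | 33 | 85.7%
Mistral NeMO | 100 | 100 | 100 | 100 | 67 | 67 | 67 | 85.7%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 67 | 67 | 67 | 67 | 81.0%
GPT-5.4 | 100 | 100 | 100 | 67 | 67 | 67 | 67 | 81.0%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 100 | 67 | 0 | 81.0%
LFM2 24B | 100 | 100 | 100 | 67 | 67 | 67 | 67 | 81.0%
Arcee AI: Trinity Large (Preview) | 100 | 67 | 67 | 67 | 67 | 67 | 33 | 66.7%
GPT-5.4 Mini | 67 | 67 | 67 | 67 | 67 | 67 | 33 | 61.9%
Rocinante 12B | 100 | 100 | 100 | 67 | 67 | 0 | 0 | 61.9%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 33 | 0 | 0 | 0 | 47.6%
GPT-5.4 Nano (Reasoning, Low) | 100 | 67 | 67 | 33 | 0 | 0 | 0 | 38.1%
GPT-5.4 Nano | 67 | 67 | 67 | 33 | 33 | 0 | 0 | 38.1%
Ministral 3 3B | 67 | 67 | 33 | 33 | 33 | 0 | 0 | 33.3%
Ministral 3B | 33 | 33 | 33 | 33 | 33 | 0 | 0 | 23.8%
Llama 3.1 8B | 67 | 67 | 33 | 0 | 0 | 0 | 0 | 23.8%
GPT-4.1 Nano | 67 | 33 | 33 | 0 | 0 | 0 | 0 | 19.0%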
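
Score aggregation (inferred)

The figures above are consistent with each scenario's Avg being the plain mean of its seven runs, and the overall Score being the mean of the two scenario averages. The source does not state this; it is an inference from the numbers. The sketch below assumes that the displayed run values 67 and 33 are rounded two-thirds and one-third, and uses the Mistral NeMO rows as a worked case. The function and variable names are illustrative only.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Assumption: run scores come in thirds, shown rounded as 0, 33, 67, 100.
THIRD = 100 / 3

# Mistral NeMO, as listed in the two scenario tables.
generic_runs = [100.0] * 7                       # Generic Prompt: 100 x 7
specific_runs = [100.0] * 4 + [2 * THIRD] * 3    # Specific Prompt: 100 x 4, 67 x 3

generic_avg = mean(generic_runs)
specific_avg = mean(specific_runs)

# Assumption: the overall Score weights both scenarios equally.
overall = mean([generic_avg, specific_avg])

print(f"Generic {generic_avg:.1f}%  Specific {specific_avg:.1f}%  Overall {overall:.1f}%")
```

Under these assumptions the result matches the listed 85.7% Specific Prompt average and 92.9% overall Score for Mistral NeMO. How the Stability column is computed cannot be recovered from the data shown here.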