No hallucinated or fabricated content

Test: Text Replacement

Avg. Score: 96.7%
Scenarios: 2

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 2.5 Flash Lite | 100.0% | $0.0004 | 2.5s | 100%
2 | Ministral 8B | 100.0% | $0.0002 | 3.9s | 100%
3 | Ministral 3 8B | 100.0% | $0.0002 | 4.2s | 100%
4 | Mistral Small Creative | 100.0% | $0.0003 | 4.1s | 100%
5 | GPT-4.1 Nano | 100.0% | $0.0004 | 5.0s | 100%
6 | Gemini 3.1 Flash Lite (Preview) | 100.0% | $0.0013 | 2.4s | 100%
7 | Gemini 3.1 Flash Lite | 100.0% | $0.0013 | 2.4s | 100%
8 | Mistral Small 4 | 100.0% | $0.0006 | 4.5s | 100%
9 | Ministral 3 14B | 100.0% | $0.0003 | 5.7s | 100%
10 | GPT-5.4 Nano | 100.0% | $0.0011 | 3.8s | 100%
11 | Mistral Small 3.2 24B | 100.0% | $0.0003 | 6.9s | 100%
12 | Gemini 2.5 Flash | 100.0% | $0.0021 | 3.0s | 100%
13 | Gemma 3 4B | 100.0% | $0.0001 | 10.1s | 100%
14 | GPT-5.4 Nano (Reasoning, Low) | 100.0% | $0.0014 | 7.4s | 100%
15 | Mistral Medium 3.1 | 100.0% | $0.0018 | 6.3s | 100%
16 | DeepSeek V4 Flash | 100.0% | $0.0003 | 10.6s | 100%
17 | Gemini 3 Flash (Preview) | 100.0% | $0.0027 | 4.3s | 100%
18 | Gemini 3.1 Flash Lite (Reasoning) | 100.0% | $0.0013 | 8.2s | 100%
19 | Arcee AI: Trinity Mini | 100.0% | $0.0003 | 11.2s | 100%
20 | Gemma 3 12B | 100.0% | $0.0001 | 12.8s | 100%
21 | GPT-5.4 Mini | 100.0% | $0.0040 | 3.2s | 100%
22 | GPT-4o Mini (temp=1) | 100.0% | $0.0006 | 12.4s | 100%
23 | Grok 4.3 | 100.0% | $0.0030 | 6.0s | 100%
24 | Grok 4.20 | 100.0% | $0.0029 | 6.5s | 100%
25 | GPT-4.1 Mini | 100.0% | $0.0015 | 10.3s | 100%
26 | Mistral Large 3 | 100.0% | $0.0016 | 10.5s | 100%
27 | Grok 4.20 (Beta) | 100.0% | $0.0047 | 2.5s | 100%
28 | GPT-4o Mini (temp=0) | 100.0% | $0.0006 | 13.8s | 100%
29 | Grok 4 Fast | 100.0% | $0.0014 | 11.7s | 100%
30 | LFM2 24B | 100.0% | $0.0001 | 15.5s | 100%
31 | Qwen 3.5 Plus (2026-02-15) | 100.0% | $0.0022 | 10.2s | 100%
32 | Qwen 2.5 72B | 100.0% | $0.0004 | 16.3s | 100%
33 | Claude Haiku 4.5 | 100.0% | $0.0051 | 5.8s | 100%
34 | GPT-5.4 Mini (Reasoning, Low) | 100.0% | $0.0052 | 7.9s | 100%
35 | Stealth: Healer Alpha | 100.0% | $0.0000 | 24.2s | 100%
36 | Gemma 3 27B | 100.0% | $0.0003 | 24.0s | 100%
37 | Gemini 3.5 Flash (Reasoning, Minimal) | 100.0% | $0.0081 | 3.5s | 100%
38 | Xiaomi MIMO v2.5 | 100.0% | $0.0040 | 15.0s | 100%
39 | DeepSeek V3 (2024-12-26) | 100.0% | $0.0012 | 23.1s | 100%
40 | GPT-4.1 | 100.0% | $0.0076 | 6.4s | 100%
41 | Grok 4.1 Fast | 100.0% | $0.0017 | 22.5s | 100%
42 | GPT-4o, Aug. 6th (temp=1) | 100.0% | $0.0086 | 4.1s | 100%
43 | Mistral Large | 100.0% | $0.0062 | 10.4s | 100%
44 | Mistral Large 2 | 100.0% | $0.0062 | 10.5s | 100%
45 | DeepSeek V4 Pro | 100.0% | $0.0020 | 22.3s | 100%
46 | Gemma 4 26B | 100.0% | $0.0003 | 27.1s | 100%
47 | GPT-4o, Aug. 6th (temp=0) | 100.0% | $0.0091 | 3.9s | 100%
48 | Stealth: Hunter Alpha | 100.0% | $0.0000 | 29.8s | 100%
49 | Arcee AI: Trinity Large (Preview) | 100.0% | $0.0000 | 31.5s | 100%
50 | GPT-5.4 Nano (Reasoning) | 100.0% | $0.0034 | 22.3s | 100%
51 | Hermes 3 405B | 100.0% | $0.0017 | 31.7s | 100%
52 | DeepSeek V3.1 | 100.0% | $0.0009 | 35.4s | 100%
53 | GPT-5.4 | 100.0% | $0.013 | 8.3s | 100%
54 | Gemma 4 31B | 100.0% | $0.0004 | 44.2s | 100%
55 | GPT-4o, May 13th (temp=1) | 100.0% | $0.015 | 4.7s | 100%
56 | Qwen 3 32B | 100.0% | $0.0010 | 43.4s | 100%
57 | GPT-4o, May 13th (temp=0) | 100.0% | $0.015 | 4.6s | 100%
58 | Claude Sonnet 4.6 | 100.0% | $0.015 | 7.5s | 100%
59 | Claude Sonnet 4.5 | 100.0% | $0.015 | 7.2s | 100%
60 | GPT-OSS 120B | 100.0% | $0.0010 | 46.5s | 100%
61 | Claude 3.7 Sonnet | 100.0% | $0.015 | 8.5s | 100%
62 | Claude Sonnet 4 | 100.0% | $0.015 | 9.4s | 100%
63 | Xiaomi MIMO v2.5 Pro | 100.0% | $0.0073 | 32.3s | 100%
64 | DeepSeek V3.2 | 100.0% | $0.0007 | 53.2s | 100%
65 | Gemini 2.5 Flash (Reasoning) | 100.0% | $0.014 | 22.4s | 100%
66 | GPT-5.4 (Reasoning, Low) | 100.0% | $0.020 | 14.1s | 100%
67 | Z.AI GLM 4.5 | 100.0% | $0.0063 | 52.4s | 100%
68 | GPT-5 Mini | 100.0% | $0.0082 | 50.8s | 100%
69 | Claude Opus 4.5 | 100.0% | $0.026 | 8.1s | 100%
70 | Claude Opus 4.6 | 100.0% | $0.026 | 8.7s | 100%
71 | GPT-5.5 | 100.0% | $0.026 | 6.8s | 100%
72 | Z.AI GLM 4.6 | 100.0% | $0.0078 | 57.8s | 100%
73 | MiniMax M2.5 | 100.0% | $0.0021 | 1.3m | 100%
74 | Qwen 3.6 Flash | 100.0% | $0.014 | 45.2s | 100%
75 | GPT-5.4 Mini (Reasoning) | 100.0% | $0.018 | 37.8s | 100%
76 | GPT-5.2 | 100.0% | $0.025 | 22.2s | 100%
77 | Qwen 3.6 35B | 100.0% | $0.011 | 1.0m | 100%
78 | Claude 3.5 Sonnet | 100.0% | $0.030 | 13.8s | 100%
79 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.022 | 36.9s | 100%
80 | GPT-5.5 (Reasoning, Low) | 100.0% | $0.032 | 11.5s | 100%
81 | Claude Opus 4.7 | 100.0% | $0.036 | 7.3s | 100%
82 | ByteDance Seed 1.6 | 100.0% | $0.0077 | 1.4m | 100%
83 | Z.AI GLM 4.5 Air | 100.0% | $0.0059 | 1.5m | 100%
84 | o4 Mini | 100.0% | $0.026 | 39.2s | 100%
85 | Z.AI GLM 5 Turbo | 100.0% | $0.022 | 50.4s | 100%
86 | GPT-5.1 | 100.0% | $0.029 | 34.1s | 100%
87 | Grok 4.20 (Reasoning) | 100.0% | $0.018 | 1.1m | 100%
88 | ByteDance Seed 2.0 Lite | 100.0% | $0.0085 | 1.6m | 100%
89 | Grok 4.20 (Beta, Reasoning) | 100.0% | $0.040 | 26.6s | 100%
90 | Grok 4.3 (Reasoning) | 100.0% | $0.019 | 1.5m | 100%
91 | Nemotron 3 Super | 100.0% | $0.0000 | 2.5m | 100%
92 | Grok 4 | 100.0% | $0.039 | 47.6s | 100%
93 | Llama 3.1 Nemotron 70B | 96.4% | $0.0021 | 20.8s | 74%
94 | Qwen 3.5 Plus (2026-04-20) | 100.0% | $0.020 | 2.1m | 100%
95 | Claude Opus 4.7 (Reasoning) | 100.0% | $0.064 | 13.9s | 100%
96 | Qwen 3.6 27B | 100.0% | $0.027 | 1.9m | 100%
97 | GPT-5.4 (Reasoning) | 100.0% | $0.052 | 47.2s | 100%
98 | Gemini 2.5 Pro | 100.0% | $0.056 | 41.9s | 100%
99 | Mistral Small 4 (Reasoning) | 96.4% | $0.0036 | 34.3s | 74%
100 | GPT-5.5 (Reasoning) | 100.0% | $0.068 | 25.1s | 100%
101 | GPT-5 | 100.0% | $0.050 | 1.3m | 100%
102 | Gemma 4 26B (Reasoning) | 100.0% | $0.0042 | 3.4m | 100%
103 | Gemini 3 Pro (Preview) | 100.0% | $0.064 | 43.7s | 100%
104 | Claude Opus 4 | 100.0% | $0.077 | 13.4s | 100%
105 | MiniMax M2.7 | 100.0% | $0.014 | 3.1m | 100%
106 | Z.AI GLM 4.7 | 100.0% | $0.017 | 3.0m | 100%
107 | Inception Mercury | 92.9% | $0.0005 | 3.8s | 65%
108 | Qwen 3.5 122B | 100.0% | $0.041 | 1.9m | 100%
109 | o4 Mini High | 100.0% | $0.056 | 1.4m | 100%
110 | Gemini 3.5 Flash (Reasoning) | 100.0% | $0.077 | 33.0s | 100%
111 | Llama 3.1 70B | 92.9% | $0.0007 | 18.9s | 65%
112 | Gemini 2.5 Flash Lite (Reasoning) | 95.2% | $0.0033 | 32.9s | 66%
113 | DeepSeek V4 Pro (Reasoning) | 100.0% | $0.013 | 3.5m | 100%
114 | Writer: Palmyra X5 | 92.9% | $0.0050 | 13.7s | 65%
115 | Qwen 3.5 397B A17B | 100.0% | $0.011 | 3.7m | 100%
116 | ByteDance Seed 2.0 Mini | 100.0% | $0.0036 | 4.0m | 100%
117 | Z.AI GLM 5 | 100.0% | $0.023 | 3.2m | 100%
118 | Qwen 3.5 27B | 100.0% | $0.037 | 2.9m | 100%
119 | Ministral 3 3B | 89.3% | $0.0002 | 2.9s | 59%
120 | DeepSeek V4 Flash (Reasoning) | 96.4% | $0.0013 | 2.3m | 74%
121 | Cydonia 24B V4.1 | 91.1% | $0.0006 | 18.0s | 55%
122 | DeepSeek-V2 Chat | 90.5% | $0.0011 | 21.5s | 53%
123 | Qwen3.7 Max | 100.0% | $0.072 | 2.1m | 100%
124 | Z.AI GLM 5.1 | 100.0% | $0.044 | 3.4m | 100%
125 | Ministral 3B | 85.7% | $0.0001 | 2.9s | 55%
126 | Qwen 3.5 9B | 96.4% | $0.0020 | 2.9m | 74%
127 | ByteDance Seed 1.6 Flash | 89.9% | $0.0012 | 21.9s | 50%
128 | Qwen3 235B A22B Instruct 2507 | 85.7% | $0.0004 | 19.2s | 55%
129 | DeepSeek V3 (2025-03-24) | 88.1% | $0.0008 | 42.6s | 54%
130 | Qwen 3.5 35B | 92.9% | $0.025 | 1.3m | 65%
131 | Z.AI GLM 4.7 Flash | 92.9% | $0.0034 | 2.4m | 65%
132 | Gemma 4 31B (Reasoning) | 100.0% | $0.0027 | 6.3m | 100%
133 | Qwen 3.5 Flash | 92.9% | $0.0050 | 1.4m | 49%
134 | Aion 2.0 | 92.9% | $0.0069 | 1.5m | 49%
135 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.129 | 1.1m | 100%
136 | Qwen3.6 Max Preview | 100.0% | $0.066 | 4.0m | 100%
137 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.123 | 1.5m | 100%
138 | Inception Mercury 2 | 81.5% | $0.0027 | 3.7s | 41%
139 | Llama 3.1 8B | 77.4% | $0.0001 | 15.3s | 47%
140 | GPT-5 Nano | 88.1% | $0.0052 | 1.9m | 54%
141 | Mistral NeMO | 76.2% | $0.0002 | 3.1s | 44%
142 | MoonshotAI: Kimi K2.5 | 100.0% | $0.029 | 6.5m | 100%
143 | WizardLM 2 8x22b | 85.7% | $0.0014 | 1.5m | 41%
144 | Gemini 3.1 Pro (Preview) | 100.0% | $0.168 | 2.6m | 100%
145 | MoonshotAI: Kimi K2.6 | 100.0% | $0.072 | 7.6m | 100%
146 | Nemotron 3 Nano | 82.1% | $0.0034 | 4.0m | 42%
147 | Claude 3 Haiku | 62.5% | $0.0013 | 6.8s | 16%
148 | Skyfall 36B V2 | 56.1% | $0.0010 | 13.0s | 19%
149 | Rocinante 12B | 39.0% | $0.0005 | 10.0s | 8%
150 | Hermes 3 70B | 60.8% | $0.0022 | 2.8m | 6%
151 | Cohere Command R+ (Aug. 2024) | 37.9% | $0.0092 | 18.4s | 7%
Average score: 96.68%

Individual Scenarios

Generic Prompt

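Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 85.7%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 85.7%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 85.7%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 85.7%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 85.7%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 81.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 78.6%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 71.5%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 71.4%
Ministral 3B | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 71.4%
Llama 3.1 8B | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 64.3%
Mistral NeMO | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 52.4%
Skyfall 36B V2 | 50 | 50 | 50 | 33 | 25 | 17 | 10 | 33.6%
Claude 3 Haiku | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25.0%
Cohere Command R+ (Aug. 2024) | 20 | 20 | 20 | 14 | 13 | 11 | 8 | 15.1%
Rocinante 12B | 25 | 25 | 20 | 11 | 8 | 6 | 6 | 14.4%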

Specific Prompt

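Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 90.5%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 90.5%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 1 | 85.8%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.8%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 83.3%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 83.3%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 50 | 25 | 82.1%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 50 | 25 | 82.1%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 81.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 33 | 25 | 79.8%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 78.6%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 33 | 17 | 78.6%
Rocinante 12B | 100 | 100 | 100 | 50 | 50 | 33 | 11 | 63.5%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 50 | 25 | 25 | 25 | 60.7%
Hermes 3 70B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 28.7%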