Possessive traps preserved

Test: Text Replacement

Avg. Score: 97.4%
Scenarios: 2

Overall Performance

Rank ▲ | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 2.5 Flash Lite | 100.0% | $0.0002 | 1.3s | 100%
2 | Mistral NeMO | 100.0% | $0.0001 | 2.2s | 100%
3 | Ministral 8B | 100.0% | $0.0001 | 2.4s | 100%
4 | Mistral Small Creative | 100.0% | $0.0002 | 2.3s | 100%
5 | Ministral 3 8B | 100.0% | $0.0001 | 3.1s | 100%
6 | GPT-4.1 Nano | 100.0% | $0.0002 | 3.0s | 100%
7 | Mistral Small 4 | 100.0% | $0.0003 | 2.5s | 100%
8 | Ministral 3 14B | 100.0% | $0.0002 | 3.2s | 100%
9 | Mistral Small 3.2 24B | 100.0% | $0.0002 | 3.6s | 100%
10 | Gemini 3.1 Flash Lite (Preview) | 100.0% | $0.0007 | 1.6s | 100%
11 | Gemini 3.1 Flash Lite (Reasoning) | 100.0% | $0.0007 | 1.6s | 100%
12 | GPT-5.4 Nano (Reasoning, Low) | 100.0% | $0.0006 | 2.4s | 100%
13 | Gemini 3.1 Flash Lite | 100.0% | $0.0007 | 1.7s | 100%
14 | GPT-5.4 Nano | 100.0% | $0.0006 | 2.7s | 100%
15 | Gemma 3 4B | 100.0% | $0.0001 | 5.4s | 100%
16 | GPT-5.4 Nano (Reasoning) | 100.0% | $0.0007 | 3.1s | 100%
17 | DeepSeek V4 Flash | 100.0% | $0.0001 | 6.3s | 100%
18 | Llama 3.1 8B | 100.0% | $0.0000 | 6.9s | 100%
19 | Gemma 3 12B | 100.0% | $0.0001 | 6.6s | 100%
20 | Gemini 2.5 Flash | 100.0% | $0.0012 | 1.7s | 100%
21 | GPT-4o Mini (temp=1) | 100.0% | $0.0003 | 7.3s | 100%
22 | GPT-4.1 Mini | 100.0% | $0.0008 | 4.9s | 100%
23 | Qwen 2.5 72B | 100.0% | $0.0002 | 8.1s | 100%
24 | GPT-4o Mini (temp=0) | 100.0% | $0.0003 | 7.7s | 100%
25 | Grok 4 Fast | 100.0% | $0.0007 | 6.0s | 100%
26 | Inception Mercury | 100.0% | $0.0003 | 8.1s | 100%
27 | Mistral Medium 3.1 | 100.0% | $0.0010 | 4.9s | 100%
28 | Skyfall 36B V2 | 100.0% | $0.0005 | 7.2s | 100%
29 | Gemini 3 Flash (Preview) | 100.0% | $0.0015 | 2.7s | 100%
30 | Arcee AI: Trinity Mini | 100.0% | $0.0002 | 8.7s | 100%
31 | Mistral Large 3 | 100.0% | $0.0009 | 5.8s | 100%
32 | Stealth: Healer Alpha | 100.0% | $0.0000 | 10.6s | 100%
33 | Grok 4.20 | 100.0% | $0.0016 | 3.4s | 100%
34 | Qwen 3.5 Plus (2026-02-15) | 100.0% | $0.0012 | 5.4s | 100%
35 | GPT-5.4 Mini | 100.0% | $0.0021 | 1.9s | 100%
36 | Qwen3 235B A22B Instruct 2507 | 100.0% | $0.0003 | 10.9s | 100%
37 | Gemini 2.5 Flash Lite (Reasoning) | 100.0% | $0.0011 | 7.2s | 100%
38 | Llama 3.1 70B | 100.0% | $0.0004 | 12.3s | 100%
39 | Gemma 3 27B | 100.0% | $0.0001 | 14.0s | 100%
40 | ByteDance Seed 1.6 Flash | 100.0% | $0.0006 | 12.0s | 100%
41 | Stealth: Hunter Alpha | 100.0% | $0.0000 | 15.2s | 100%
42 | Claude Haiku 4.5 | 100.0% | $0.0027 | 2.2s | 100%
43 | Llama 3.1 Nemotron 70B | 100.0% | $0.0011 | 10.9s | 100%
44 | DeepSeek-V2 Chat | 100.0% | $0.0006 | 13.3s | 100%
45 | DeepSeek V3 (2024-12-26) | 100.0% | $0.0006 | 13.5s | 100%
46 | Arcee AI: Trinity Large (Preview) | 100.0% | $0.0000 | 19.6s | 100%
47 | Gemma 4 26B | 100.0% | $0.0002 | 19.0s | 100%
48 | Hermes 3 405B | 100.0% | $0.0009 | 16.2s | 100%
49 | DeepSeek V3 (2025-03-24) | 100.0% | $0.0005 | 18.8s | 100%
50 | Mistral Large 2 | 100.0% | $0.0034 | 5.8s | 100%
51 | Mistral Large | 100.0% | $0.0034 | 6.0s | 100%
52 | GPT-4.1 | 100.0% | $0.0041 | 3.4s | 100%
53 | Gemini 3.5 Flash (Reasoning, Minimal) | 100.0% | $0.0044 | 2.2s | 100%
54 | GPT-5.4 Mini (Reasoning) | 100.0% | $0.0037 | 5.7s | 100%
55 | Gemma 4 31B | 100.0% | $0.0002 | 22.7s | 100%
56 | DeepSeek V4 Flash (Reasoning) | 100.0% | $0.0005 | 23.7s | 100%
57 | GPT-4o, Aug. 6th (temp=1) | 100.0% | $0.0051 | 2.2s | 100%
58 | GPT-4o, Aug. 6th (temp=0) | 100.0% | $0.0051 | 2.3s | 100%
59 | Writer: Palmyra X5 | 100.0% | $0.0027 | 14.3s | 100%
60 | Z.AI GLM 4.5 Air | 100.0% | $0.0012 | 23.3s | 100%
61 | DeepSeek V4 Pro | 100.0% | $0.0014 | 23.8s | 100%
62 | Xiaomi MIMO v2.5 | 100.0% | $0.0040 | 15.1s | 100%
63 | WizardLM 2 8x22b | 100.0% | $0.0006 | 33.4s | 100%
64 | Gemini 2.5 Flash (Reasoning) | 100.0% | $0.0045 | 15.2s | 100%
65 | GPT-5.4 | 100.0% | $0.0071 | 4.5s | 100%
66 | DeepSeek V3.2 | 100.0% | $0.0003 | 37.5s | 100%
67 | Nemotron 3 Super | 100.0% | $0.0000 | 39.1s | 100%
68 | GPT-5.4 (Reasoning, Low) | 100.0% | $0.0074 | 4.6s | 100%
69 | GPT-4o, May 13th (temp=1) | 100.0% | $0.0083 | 2.3s | 100%
70 | GPT-5.2 | 100.0% | $0.0076 | 5.6s | 100%
71 | Claude Sonnet 4.6 | 100.0% | $0.0081 | 3.6s | 100%
72 | GPT-4o, May 13th (temp=0) | 100.0% | $0.0083 | 3.0s | 100%
73 | Claude Sonnet 4.5 | 100.0% | $0.0081 | 4.0s | 100%
74 | Claude 3.7 Sonnet | 100.0% | $0.0081 | 4.6s | 100%
75 | Xiaomi MIMO v2.5 Pro | 100.0% | $0.0047 | 21.0s | 100%
76 | Grok 4.20 (Reasoning) | 100.0% | $0.0052 | 18.8s | 100%
77 | Claude Sonnet 4 | 100.0% | $0.0081 | 4.9s | 100%
78 | GPT-OSS 120B | 100.0% | $0.0010 | 43.4s | 100%
79 | GPT-5 Mini | 100.0% | $0.0042 | 30.5s | 100%
80 | Qwen 3.6 35B | 100.0% | $0.0050 | 28.5s | 100%
81 | MiniMax M2.5 | 100.0% | $0.0015 | 45.7s | 100%
82 | Qwen 3.6 Flash | 100.0% | $0.0072 | 21.7s | 100%
83 | Z.AI GLM 4.6 | 100.0% | $0.0038 | 38.3s | 100%
84 | ByteDance Seed 1.6 | 100.0% | $0.0038 | 39.9s | 100%
85 | ByteDance Seed 2.0 Lite | 100.0% | $0.0038 | 43.6s | 100%
86 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.0096 | 15.6s | 100%
87 | DeepSeek V4 Pro (Reasoning) | 100.0% | $0.0020 | 52.4s | 100%
88 | Z.AI GLM 5 Turbo | 100.0% | $0.0088 | 20.0s | 100%
89 | Aion 2.0 | 100.0% | $0.0036 | 45.3s | 100%
90 | GPT-5 Nano | 100.0% | $0.0024 | 54.7s | 100%
91 | Grok 4.20 (Beta, Reasoning) | 100.0% | $0.012 | 7.3s | 100%
92 | GPT-5.1 | 100.0% | $0.012 | 11.2s | 100%
93 | o4 Mini | 100.0% | $0.011 | 17.5s | 100%
94 | Claude Opus 4.5 | 100.0% | $0.014 | 4.7s | 100%
95 | Claude Opus 4.6 | 100.0% | $0.014 | 4.9s | 100%
96 | GPT-5.5 | 100.0% | $0.014 | 4.0s | 100%
97 | Inception Mercury 2 | 96.4% | $0.0015 | 2.1s | 74%
98 | Cydonia 24B V4.1 | 96.4% | $0.0003 | 9.0s | 74%
99 | Z.AI GLM 4.7 Flash | 100.0% | $0.0014 | 1.1m | 100%
100 | MiniMax M2.7 | 100.0% | $0.0045 | 56.4s | 100%
101 | GPT-5.4 (Reasoning) | 100.0% | $0.014 | 13.0s | 100%
102 | GPT-5.4 Mini (Reasoning, Low) | 96.4% | $0.0026 | 3.4s | 74%
103 | Claude Opus 4.7 (Reasoning) | 100.0% | $0.016 | 3.3s | 100%
104 | Claude Opus 4.7 | 100.0% | $0.016 | 3.5s | 100%
105 | ByteDance Seed 2.0 Mini | 100.0% | $0.0014 | 1.3m | 100%
106 | Claude 3.5 Sonnet | 100.0% | $0.016 | 7.1s | 100%
107 | Mistral Small 4 (Reasoning) | 96.4% | $0.0016 | 13.6s | 74%
108 | GPT-5.5 (Reasoning, Low) | 100.0% | $0.018 | 6.2s | 100%
109 | Z.AI GLM 5 | 100.0% | $0.0077 | 56.2s | 100%
110 | Z.AI GLM 4.7 | 100.0% | $0.0055 | 1.1m | 100%
111 | Qwen 3.5 27B | 100.0% | $0.011 | 46.8s | 100%
112 | o4 Mini High | 100.0% | $0.016 | 26.9s | 100%
113 | Z.AI GLM 5.1 | 100.0% | $0.011 | 54.4s | 100%
114 | Gemma 4 26B (Reasoning) | 100.0% | $0.0012 | 1.7m | 100%
115 | Gemma 4 31B (Reasoning) | 100.0% | $0.0009 | 1.8m | 100%
116 | GPT-5.5 (Reasoning) | 100.0% | $0.022 | 8.3s | 100%
117 | Grok 4 | 100.0% | $0.019 | 23.4s | 100%
118 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.022 | 9.5s | 100%
119 | Qwen 3.5 Plus (2026-04-20) | 100.0% | $0.011 | 1.1m | 100%
120 | Grok 4.3 (Reasoning) | 96.4% | $0.0067 | 28.8s | 74%
121 | Qwen 3.6 27B | 100.0% | $0.014 | 58.3s | 100%
122 | Qwen 3.5 122B | 100.0% | $0.018 | 41.8s | 100%
123 | Gemini 2.5 Pro | 100.0% | $0.025 | 17.5s | 100%
124 | Qwen 3.5 Flash | 96.4% | $0.0030 | 1.0m | 74%
125 | Grok 4.3 | 92.9% | $0.0016 | 3.5s | 48%
126 | Grok 4.20 (Beta) | 92.9% | $0.0027 | 1.6s | 48%
127 | Grok 4.1 Fast | 92.9% | $0.0007 | 11.8s | 48%
128 | MoonshotAI: Kimi K2.5 | 100.0% | $0.011 | 1.6m | 100%
129 | Gemini 3.5 Flash (Reasoning) | 100.0% | $0.029 | 12.2s | 100%
130 | GPT-5 | 100.0% | $0.025 | 41.3s | 100%
131 | Ministral 3B | 82.1% | $0.0000 | 1.5s | 52%
132 | Gemini 3 Pro (Preview) | 100.0% | $0.031 | 20.3s | 100%
133 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.031 | 20.5s | 100%
134 | Qwen 3.5 397B A17B | 100.0% | $0.0095 | 2.4m | 100%
135 | Qwen3.7 Max | 100.0% | $0.030 | 51.4s | 100%
136 | Gemini 3.1 Pro (Preview) | 100.0% | $0.034 | 31.5s | 100%
137 | Claude Opus 4 | 100.0% | $0.041 | 6.4s | 100%
138 | Qwen 3.5 35B | 92.9% | $0.012 | 51.3s | 65%
139 | MoonshotAI: Kimi K2.6 | 100.0% | $0.017 | 2.3m | 100%
140 | Qwen 3 32B | 85.7% | $0.0005 | 25.5s | 30%
141 | Ministral 3 3B | 75.0% | $0.0001 | 1.5s | 38%
142 | Qwen3.6 Max Preview | 100.0% | $0.030 | 1.6m | 100%
143 | DeepSeek V3.1 | 82.1% | $0.0005 | 33.2s | 28%
144 | Qwen 3.5 9B | 92.9% | $0.0013 | 1.9m | 48%
145 | Rocinante 12B | 78.6% | $0.0003 | 6.5s | 18%
146 | Claude 3 Haiku | 64.3% | $0.0007 | 3.6s | 4%
147 | Cohere Command R+ (Aug. 2024) | 64.3% | $0.0054 | 14.3s | 15%
148 | Z.AI GLM 4.5 | 57.1% | $0.0019 | 20.2s | 1%
149 | LFM2 24B | 50.0% | $0.0001 | 9.0s | 0%
150 | Nemotron 3 Nano | 67.9% | $0.0017 | 2.2m | 19%
151 | Hermes 3 70B | 57.1% | $0.0023 | 3.3m | 1%

Individual Scenarios

Generic Prompt

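Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 85.7%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 71.4%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 64.3%
Ministral 3B | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 64.3%
Ministral 3 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Claude 3 Haiku | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 28.6%
Cohere Command R+ (Aug. 2024) | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 28.6%
Z.AI GLM 4.5 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 14.3%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%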

Specific Prompt

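Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 71.4%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 64.3%
Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 57.1%
Hermes 3 70B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 14.3%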
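
The page does not state how its averages are derived, but the figures are consistent with each scenario's Avg being the arithmetic mean of the seven run scores, and each model's overall Score being the mean of its two scenario averages. A minimal Python sketch of that reading, using Nemotron 3 Nano's run scores from the two scenario tables (the function names are hypothetical, not part of the benchmark):

```python
from statistics import mean

def scenario_average(runs):
    """Arithmetic mean of the per-run scores for one scenario (assumed formula)."""
    return mean(runs)

def overall_score(scenario_averages):
    """Mean of the per-scenario averages (assumed formula)."""
    return mean(scenario_averages)

# Run scores for Nemotron 3 Nano, copied from the scenario tables.
generic = [100, 100, 100, 100, 50, 0, 0]      # Generic Prompt
specific = [100, 100, 100, 100, 50, 50, 0]    # Specific Prompt

generic_avg = scenario_average(generic)
specific_avg = scenario_average(specific)
overall = overall_score([generic_avg, specific_avg])

# prints "64.3% 71.4% 67.9%"
print(f"{generic_avg:.1f}% {specific_avg:.1f}% {overall:.1f}%")
```

These values match the 64.3%, 71.4%, and 67.9% listed for that model. The Stability column is not reproduced here because nothing on the page indicates how it is calculated.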