Structural similarity to original

Test: Text Replacement

Avg. Score: 98.3%
Scenarios: 4

Overall Performance

Rank ▲ | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 2.5 Flash Lite | 100.0% | $0.0003 | 2.0s | 100%
2 | Ministral 8B | 100.0% | $0.0001 | 3.6s | 100%
3 | Ministral 3 8B | 100.0% | $0.0002 | 3.6s | 100%
4 | Mistral Small Creative | 100.0% | $0.0003 | 3.6s | 100%
5 | Mistral Small 4 | 100.0% | $0.0005 | 3.7s | 100%
6 | Gemini 3.1 Flash Lite (Preview) | 100.0% | $0.0011 | 2.0s | 100%
7 | GPT-4.1 Nano | 100.0% | $0.0003 | 4.4s | 100%
8 | Ministral 3 14B | 100.0% | $0.0003 | 4.8s | 100%
9 | Gemini 3.1 Flash Lite | 100.0% | $0.0011 | 2.6s | 100%
10 | GPT-5.4 Nano | 100.0% | $0.0009 | 3.5s | 100%
11 | Mistral Small 3.2 24B | 100.0% | $0.0003 | 6.2s | 100%
12 | Gemini 2.5 Flash | 100.0% | $0.0018 | 2.5s | 100%
13 | Gemini 3.1 Flash Lite (Reasoning) | 100.0% | $0.0011 | 5.0s | 100%
14 | Gemma 3 4B | 100.0% | $0.0001 | 8.0s | 100%
15 | GPT-5.4 Nano (Reasoning, Low) | 100.0% | $0.0011 | 5.4s | 100%
16 | Mistral Medium 3.1 | 100.0% | $0.0015 | 5.6s | 100%
17 | Gemini 3 Flash (Preview) | 100.0% | $0.0022 | 3.9s | 100%
18 | DeepSeek V4 Flash | 100.0% | $0.0002 | 9.8s | 100%
19 | Gemma 3 12B | 100.0% | $0.0001 | 10.4s | 100%
20 | Grok 4 Fast | 100.0% | $0.0011 | 8.7s | 100%
21 | GPT-4.1 Mini | 100.0% | $0.0013 | 8.3s | 100%
22 | GPT-4o Mini (temp=1) | 100.0% | $0.0005 | 10.7s | 100%
23 | Grok 4.20 | 100.0% | $0.0024 | 5.3s | 100%
24 | GPT-5.4 Mini | 100.0% | $0.0033 | 2.7s | 100%
25 | Grok 4.3 | 100.0% | $0.0025 | 5.4s | 100%
26 | Mistral Large 3 | 100.0% | $0.0013 | 8.8s | 100%
27 | GPT-4o Mini (temp=0) | 100.0% | $0.0005 | 11.6s | 100%
28 | Grok 4.20 (Beta) | 100.0% | $0.0040 | 2.1s | 100%
29 | Qwen 3.5 Plus (2026-02-15) | 100.0% | $0.0018 | 8.7s | 100%
30 | Qwen 2.5 72B | 100.0% | $0.0003 | 13.2s | 100%
31 | Claude Haiku 4.5 | 100.0% | $0.0042 | 4.2s | 100%
32 | Cydonia 24B V4.1 | 100.0% | $0.0005 | 15.2s | 100%
33 | GPT-5.4 Mini (Reasoning, Low) | 100.0% | $0.0040 | 5.6s | 100%
34 | Stealth: Healer Alpha | 100.0% | $0.0000 | 17.6s | 100%
35 | Qwen3 235B A22B Instruct 2507 | 100.0% | $0.0004 | 17.0s | 100%
36 | GPT-5.4 Nano (Reasoning) | 100.0% | $0.0022 | 13.3s | 100%
37 | Gemma 3 27B | 100.0% | $0.0002 | 19.8s | 100%
38 | Gemini 3.5 Flash (Reasoning, Minimal) | 100.0% | $0.0067 | 3.0s | 100%
39 | Grok 4.1 Fast | 100.0% | $0.0014 | 18.6s | 100%
40 | Writer: Palmyra X5 | 100.0% | $0.0042 | 11.0s | 100%
41 | Mistral Large | 100.0% | $0.0052 | 8.7s | 100%
42 | GPT-4.1 | 100.0% | $0.0064 | 5.3s | 100%
43 | Mistral Large 2 | 100.0% | $0.0052 | 8.8s | 100%
44 | Gemma 4 26B | 100.0% | $0.0003 | 23.3s | 100%
45 | DeepSeek V4 Pro | 100.0% | $0.0016 | 19.8s | 100%
46 | Xiaomi MIMO v2.5 | 100.0% | $0.0039 | 15.2s | 100%
47 | Arcee AI: Trinity Large (Preview) | 100.0% | $0.0000 | 27.1s | 100%
48 | Gemini 2.5 Flash Lite (Reasoning) | 100.0% | $0.0022 | 21.7s | 100%
49 | Hermes 3 405B | 100.0% | $0.0014 | 27.5s | 100%
50 | Gemma 4 31B | 100.0% | $0.0004 | 33.1s | 100%
51 | GPT-5.4 | 100.0% | $0.011 | 7.2s | 100%
52 | GPT-4o, May 13th (temp=0) | 100.0% | $0.013 | 3.6s | 100%
53 | GPT-4o, May 13th (temp=1) | 100.0% | $0.013 | 4.0s | 100%
54 | Claude Sonnet 4.6 | 100.0% | $0.013 | 6.0s | 100%
55 | Claude Sonnet 4.5 | 100.0% | $0.013 | 5.9s | 100%
56 | Gemini 2.5 Flash (Reasoning) | 100.0% | $0.0096 | 15.1s | 100%
57 | Claude 3.7 Sonnet | 100.0% | $0.013 | 7.2s | 100%
58 | Claude Sonnet 4 | 100.0% | $0.013 | 7.4s | 100%
59 | Xiaomi MIMO v2.5 Pro | 100.0% | $0.0061 | 27.2s | 100%
60 | DeepSeek V3.2 | 100.0% | $0.0005 | 49.1s | 100%
61 | GPT-5.4 (Reasoning, Low) | 100.0% | $0.015 | 9.8s | 100%
62 | GPT-5 Mini | 100.0% | $0.0064 | 39.1s | 100%
63 | GPT-5.2 | 100.0% | $0.016 | 13.8s | 100%
64 | MiniMax M2.5 | 100.0% | $0.0019 | 57.8s | 100%
65 | Z.AI GLM 4.6 | 100.0% | $0.0059 | 46.8s | 100%
66 | Z.AI GLM 4.5 Air | 100.0% | $0.0035 | 56.7s | 100%
67 | Claude Opus 4.5 | 100.0% | $0.021 | 6.7s | 100%
68 | Claude Opus 4.6 | 100.0% | $0.021 | 7.1s | 100%
69 | GPT-5.5 | 100.0% | $0.022 | 5.8s | 100%
70 | Z.AI GLM 5 Turbo | 100.0% | $0.014 | 32.0s | 100%
71 | GPT-5.1 | 100.0% | $0.019 | 21.1s | 100%
72 | DeepSeek V3 (2024-12-26) | 98.8% | $0.0010 | 19.4s | 88%
73 | ByteDance Seed 1.6 | 100.0% | $0.0057 | 1.0m | 100%
74 | GPT-5.5 (Reasoning, Low) | 100.0% | $0.025 | 8.6s | 100%
75 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.018 | 30.3s | 100%
76 | Stealth: Hunter Alpha | 98.8% | $0.0000 | 27.6s | 88%
77 | Claude 3.5 Sonnet | 100.0% | $0.025 | 11.8s | 100%
78 | Grok 4.20 (Reasoning) | 100.0% | $0.013 | 47.9s | 100%
79 | Ministral 3B | 97.6% | $0.0001 | 2.3s | 83%
80 | Claude Opus 4.7 | 100.0% | $0.030 | 5.7s | 100%
81 | Mistral Small 4 (Reasoning) | 98.8% | $0.0028 | 28.3s | 88%
82 | Qwen 3 32B | 98.8% | $0.0010 | 40.2s | 88%
83 | DeepSeek V3 (2025-03-24) | 98.8% | $0.0007 | 41.6s | 88%
84 | Grok 4.20 (Beta, Reasoning) | 100.0% | $0.029 | 19.0s | 100%
85 | ByteDance Seed 1.6 Flash | 97.6% | $0.0009 | 16.8s | 83%
86 | Z.AI GLM 4.5 | 98.8% | $0.0045 | 36.8s | 88%
87 | DeepSeek-V2 Chat | 97.6% | $0.0009 | 18.8s | 83%
88 | Llama 3.1 70B | 97.6% | $0.0006 | 24.7s | 83%
89 | Ministral 3 3B | 96.4% | $0.0001 | 2.4s | 79%
90 | GPT-5.4 Mini (Reasoning) | 98.8% | $0.013 | 23.6s | 88%
91 | Qwen 3.6 Flash | 98.8% | $0.011 | 33.6s | 88%
92 | GPT-OSS 120B | 97.6% | $0.0007 | 35.0s | 83%
93 | Grok 4 | 100.0% | $0.030 | 36.9s | 100%
94 | GPT-5.4 (Reasoning) | 100.0% | $0.033 | 29.2s | 100%
95 | Claude Opus 4.7 (Reasoning) | 100.0% | $0.044 | 8.9s | 100%
96 | GPT-5.5 (Reasoning) | 100.0% | $0.044 | 15.9s | 100%
97 | Qwen 3.5 Plus (2026-04-20) | 100.0% | $0.016 | 1.6m | 100%
98 | ByteDance Seed 2.0 Lite | 98.8% | $0.0068 | 1.3m | 88%
99 | GPT-5 | 100.0% | $0.035 | 51.8s | 100%
100 | Qwen 3.6 35B | 97.6% | $0.0082 | 44.9s | 83%
101 | Inception Mercury 2 | 94.0% | $0.0018 | 2.5s | 74%
102 | Llama 3.1 Nemotron 70B | 95.0% | $0.0017 | 18.0s | 76%
103 | Llama 3.1 8B | 94.0% | $0.0001 | 10.7s | 74%
104 | Gemini 2.5 Pro | 100.0% | $0.046 | 34.9s | 100%
105 | DeepSeek V3.1 | 95.2% | $0.0008 | 35.6s | 77%
106 | o4 Mini | 97.6% | $0.019 | 30.0s | 83%
107 | Qwen 3.5 35B | 98.8% | $0.019 | 1.0m | 88%
108 | Gemini 3 Pro (Preview) | 100.0% | $0.049 | 33.0s | 100%
109 | Grok 4.3 (Reasoning) | 98.8% | $0.015 | 1.2m | 88%
110 | Cohere Command R+ (Aug. 2024) | 95.4% | $0.0079 | 25.9s | 78%
111 | Mistral NeMO | 93.2% | $0.0002 | 3.1s | 70%
112 | Z.AI GLM 5 | 100.0% | $0.016 | 2.2m | 100%
113 | Z.AI GLM 4.7 | 100.0% | $0.011 | 2.5m | 100%
114 | Qwen 3.5 27B | 100.0% | $0.025 | 1.9m | 100%
115 | Gemma 4 26B (Reasoning) | 100.0% | $0.0030 | 3.0m | 100%
116 | GPT-4o, Aug. 6th (temp=0) | 96.4% | $0.0078 | 3.2s | 63%
117 | Claude Opus 4 | 100.0% | $0.063 | 10.6s | 100%
118 | Z.AI GLM 5.1 | 100.0% | $0.026 | 2.1m | 100%
119 | Inception Mercury | 93.3% | $0.0005 | 3.7s | 63%
120 | DeepSeek V4 Flash (Reasoning) | 95.2% | $0.0009 | 1.3m | 77%
121 | Qwen 3.5 397B A17B | 100.0% | $0.010 | 3.1m | 100%
122 | LFM2 24B | 90.9% | $0.0001 | 13.0s | 68%
123 | GPT-5 Nano | 95.2% | $0.0035 | 1.3m | 77%
124 | Qwen 3.5 122B | 98.8% | $0.030 | 1.4m | 88%
125 | Qwen 3.6 27B | 97.6% | $0.020 | 1.4m | 83%
126 | MiniMax M2.7 | 97.6% | $0.0084 | 1.9m | 83%
127 | o4 Mini High | 98.8% | $0.039 | 58.8s | 88%
128 | Qwen3.7 Max | 100.0% | $0.049 | 1.4m | 100%
129 | DeepSeek V4 Pro (Reasoning) | 100.0% | $0.012 | 3.2m | 100%
130 | Z.AI GLM 4.7 Flash | 96.4% | $0.0024 | 2.0m | 79%
131 | Skyfall 36B V2 | 91.1% | $0.0008 | 11.1s | 62%
132 | Qwen 3.5 Flash | 96.4% | $0.0039 | 1.1m | 63%
133 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.071 | 50.0s | 100%
134 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.076 | 37.6s | 100%
135 | Gemma 4 31B (Reasoning) | 100.0% | $0.0018 | 4.2m | 100%
136 | Aion 2.0 | 95.2% | $0.0056 | 1.2m | 61%
137 | GPT-4o, Aug. 6th (temp=1) | 92.9% | $0.0075 | 3.4s | 48%
138 | Nemotron 3 Super | 90.5% | $0.0000 | 1.6m | 70%
139 | WizardLM 2 8x22b | 93.1% | $0.0010 | 1.1m | 50%
140 | Qwen3.6 Max Preview | 100.0% | $0.049 | 2.9m | 100%
141 | ByteDance Seed 2.0 Mini | 96.4% | $0.0030 | 3.3m | 79%
142 | Gemini 3.5 Flash (Reasoning) | 96.4% | $0.049 | 21.0s | 63%
143 | Arcee AI: Trinity Mini | 85.7% | $0.0006 | 23.0s | 55%
144 | MoonshotAI: Kimi K2.5 | 98.8% | $0.020 | 4.2m | 88%
145 | Gemini 3.1 Pro (Preview) | 100.0% | $0.099 | 1.5m | 100%
146 | Qwen 3.5 9B | 91.7% | $0.0017 | 2.4m | 58%
147 | Claude 3 Haiku | 81.0% | $0.0011 | 5.5s | 42%
148 | Nemotron 3 Nano | 87.0% | $0.0022 | 2.5m | 64%
149 | MoonshotAI: Kimi K2.6 | 98.8% | $0.044 | 4.8m | 88%
150 | Rocinante 12B | 72.8% | $0.0004 | 8.8s | 34%
151 | Hermes 3 70B | 78.6% | $0.0014 | 1.8m | 18%
Overall average score: 98.28%

Individual Scenarios

Generic Prompt

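Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 67 | 61 | 89.7%
LFM2 24B | 100 | 100 | 100 | 67 | 67 | 67 | 67 | 81.0%
Claude 3 Haiku | 100 | 100 | 67 | 67 | 67 | 67 | 36 | 71.7%
Arcee AI: Trinity Mini | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 66.7%
Rocinante 12B | 67 | 67 | 67 | 67 | 67 | 51 | 0 | 55.0%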
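Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 97 | 94 | 92 | 97.5%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
GPT-5 Nano | 100 | 100 | 100 | 100 | 67 | 67 | 67 | 85.7%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 67 | 67 | 19 | 78.9%
Rocinante 12B | 100 | 100 | 100 | 85 | 64 | 54 | 45 | 78.3%
Mistral NeMO | 100 | 100 | 100 | 67 | 67 | 58 | 52 | 77.6%
Nemotron 3 Super | 100 | 100 | 67 | 67 | 67 | 67 | 67 | 76.2%
Nemotron 3 Nano | 100 | 100 | 67 | 67 | 67 | 67 | 46 | 73.2%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 7 | 0 | 72.5%
Claude 3 Haiku | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 66.7%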

Specific Prompt

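Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 67 | 67 | 67 | 85.7%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 67 | 67 | 67 | 85.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 97 | 67 | 64 | 61 | 84.1%
LFM2 24B | 100 | 100 | 100 | 100 | 65 | 65 | 49 | 82.8%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 67 | 67 | 67 | 67 | 81.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 67 | 67 | 0 | 76.2%
Rocinante 12B | 95 | 67 | 67 | 67 | 65 | 39 | 38 | 62.6%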
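Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 95.2%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 90.5%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 67 | 57 | 89.1%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 67 | 67 | 67 | 85.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Ministral 3 3B | 100 | 100 | 100 | 100 | 67 | 67 | 67 | 85.7%
Inception Mercury | 100 | 100 | 100 | 100 | 67 | 67 | 13 | 78.1%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 67 | 67 | 0 | 76.2%
Inception Mercury 2 | 100 | 100 | 67 | 67 | 67 | 67 | 67 | 76.2%
Hermes 3 70B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 28.6%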