Dialogue content preserved

Test: Text Replacement

Avg. Score: 94.6%
Scenarios: 8

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Qwen 2.5 72B | 99.6% | $0.0003 | 12.0s | 96%
2 | Gemini 3.1 Flash Lite (Preview) | 98.2% | $0.0011 | 1.9s | 92%
3 | Grok 4.20 | 98.8% | $0.0022 | 5.0s | 93%
4 | Gemini 3 Flash (Preview) | 98.6% | $0.0021 | 3.5s | 93%
5 | DeepSeek V4 Flash | 98.2% | $0.0002 | 8.7s | 92%
6 | Gemini 3.1 Flash Lite | 97.5% | $0.0011 | 2.5s | 91%
7 | Gemini 3.1 Flash Lite (Reasoning) | 97.5% | $0.0011 | 4.2s | 91%
8 | Gemini 2.5 Flash Lite | 97.0% | $0.0003 | 1.9s | 91%
9 | Xiaomi MIMO v2.5 | 98.9% | $0.0036 | 14.2s | 94%
10 | Qwen 3.5 Plus (2026-02-15) | 97.9% | $0.0017 | 7.8s | 92%
11 | Writer: Palmyra X5 | 98.6% | $0.0040 | 11.7s | 93%
12 | Mistral Large 3 | 97.5% | $0.0012 | 8.5s | 91%
13 | Gemma 4 31B | 99.5% | $0.0004 | 35.2s | 95%
14 | Claude Haiku 4.5 | 97.5% | $0.0040 | 3.8s | 91%
15 | Gemma 4 26B | 97.9% | $0.0003 | 17.2s | 92%
16 | Gemini 3.5 Flash (Reasoning, Minimal) | 98.2% | $0.0064 | 2.9s | 91%
17 | Cydonia 24B V4.1 | 97.5% | $0.0005 | 13.7s | 91%
18 | Grok 4.20 (Beta) | 98.0% | $0.0037 | 2.0s | 88%
19 | Gemma 3 4B | 96.8% | $0.0001 | 6.9s | 88%
20 | Qwen3 235B A22B Instruct 2507 | 98.0% | $0.0004 | 15.1s | 88%
21 | Mistral Large | 97.5% | $0.0049 | 8.3s | 91%
22 | Mistral Large 2 | 97.5% | $0.0049 | 8.4s | 91%
23 | Claude Opus 4.5 | 100.0% | $0.020 | 6.0s | 100%
24 | Claude Opus 4.6 | 100.0% | $0.020 | 6.6s | 100%
25 | Arcee AI: Trinity Large (Preview) | 97.9% | $0.0000 | 26.5s | 92%
26 | Claude Sonnet 4.6 | 98.8% | $0.012 | 5.4s | 93%
27 | Claude Sonnet 4.5 | 98.7% | $0.012 | 5.4s | 93%
28 | DeepSeek V4 Pro | 97.5% | $0.0015 | 21.7s | 91%
29 | Stealth: Healer Alpha | 97.5% | $0.0000 | 18.7s | 87%
30 | GPT-5.4 Mini (Reasoning, Low) | 95.9% | $0.0037 | 4.5s | 89%
31 | Xiaomi MIMO v2.5 Pro | 97.7% | $0.0043 | 19.3s | 91%
32 | Grok 4 Fast | 96.4% | $0.0010 | 8.5s | 86%
33 | Claude Sonnet 4 | 97.9% | $0.012 | 6.9s | 92%
34 | Mistral Small 4 | 95.9% | $0.0005 | 3.9s | 82%
35 | Claude 3.7 Sonnet | 97.5% | $0.012 | 6.5s | 91%
36 | GPT-4o, May 13th (temp=1) | 97.1% | $0.012 | 4.0s | 91%
37 | GPT-4o, May 13th (temp=0) | 97.0% | $0.012 | 3.8s | 91%
38 | Gemini 2.5 Flash | 95.2% | $0.0017 | 2.4s | 83%
39 | GPT-4.1 Mini | 94.8% | $0.0012 | 8.2s | 86%
40 | Grok 4.3 | 95.4% | $0.0023 | 5.3s | 83%
41 | GPT-5.4 Nano (Reasoning) | 94.8% | $0.0020 | 11.3s | 86%
42 | GPT-5.4 Mini | 95.0% | $0.0031 | 2.5s | 83%
43 | GPT-5.5 | 98.8% | $0.021 | 5.4s | 93%
44 | GPT-4.1 | 96.6% | $0.0060 | 4.8s | 82%
45 | Mistral Small 3.2 24B | 97.3% | $0.0002 | 5.5s | 73%
46 | Gemini 3 Flash (Preview, Reasoning) | 98.9% | $0.015 | 24.9s | 93%
47 | Stealth: Hunter Alpha | 96.1% | $0.0000 | 19.9s | 81%
48 | Grok 4.1 Fast | 95.7% | $0.0012 | 15.5s | 80%
49 | Grok 4.20 (Reasoning) | 98.6% | $0.011 | 36.3s | 91%
50 | Mistral Medium 3.1 | 93.8% | $0.0014 | 6.3s | 80%
51 | Hermes 3 405B | 95.9% | $0.0013 | 25.3s | 82%
52 | GPT-5.4 Mini (Reasoning) | 96.4% | $0.010 | 17.8s | 87%
53 | Claude Opus 4.7 | 98.9% | $0.028 | 5.2s | 94%
54 | GPT-5.4 | 95.9% | $0.010 | 6.5s | 83%
55 | Gemma 3 12B | 93.9% | $0.0001 | 10.2s | 78%
56 | GPT-5.4 (Reasoning, Low) | 96.2% | $0.013 | 8.2s | 84%
57 | Ministral 3 14B | 93.2% | $0.0003 | 4.5s | 77%
58 | Z.AI GLM 5 Turbo | 98.2% | $0.014 | 31.1s | 89%
59 | Z.AI GLM 4.5 Air | 96.4% | $0.0028 | 51.0s | 90%
60 | DeepSeek V4 Flash (Reasoning) | 98.2% | $0.0008 | 58.4s | 86%
61 | Claude 3.5 Sonnet | 97.5% | $0.024 | 10.6s | 91%
62 | GPT-5.4 Nano (Reasoning, Low) | 92.9% | $0.0009 | 5.5s | 76%
63 | GPT-4o, Aug. 6th (temp=0) | 96.3% | $0.0074 | 3.0s | 73%
64 | Llama 3.1 8B | 93.0% | $0.0001 | 11.5s | 76%
65 | Grok 4.20 (Beta, Reasoning) | 97.9% | $0.022 | 13.6s | 89%
66 | GPT-5.4 Nano | 91.8% | $0.0009 | 3.3s | 76%
67 | Ministral 3 3B | 92.0% | $0.0001 | 2.3s | 74%
68 | Claude Opus 4.7 (Reasoning) | 99.8% | $0.038 | 7.8s | 97%
69 | Qwen 3.6 35B | 96.2% | $0.0079 | 42.1s | 87%
70 | Mistral Small Creative | 91.6% | $0.0002 | 3.3s | 74%
71 | GPT-5.5 (Reasoning, Low) | 97.9% | $0.026 | 8.9s | 88%
72 | DeepSeek V3.2 | 96.4% | $0.0005 | 52.3s | 82%
73 | GPT-5 Mini | 96.8% | $0.0074 | 46.2s | 86%
74 | Gemma 3 27B | 93.0% | $0.0002 | 18.5s | 76%
75 | Z.AI GLM 4.6 | 96.1% | $0.0059 | 50.0s | 86%
76 | GPT-5.1 | 97.7% | $0.021 | 22.6s | 86%
77 | Ministral 3B | 90.9% | $0.0001 | 2.4s | 72%
78 | Llama 3.1 70B | 93.0% | $0.0006 | 27.8s | 76%
79 | MiniMax M2.5 | 96.1% | $0.0019 | 1.2m | 88%
80 | DeepSeek V3 (2024-12-26) | 94.6% | $0.0009 | 17.7s | 67%
81 | ByteDance Seed 2.0 Lite | 97.3% | $0.0058 | 1.1m | 85%
82 | Llama 3.1 Nemotron 70B | 91.4% | $0.0016 | 16.4s | 75%
83 | LFM2 24B | 92.1% | $0.0001 | 13.3s | 70%
84 | Z.AI GLM 4.5 | 93.0% | $0.0044 | 36.1s | 80%
85 | GPT-5.2 | 93.4% | $0.013 | 10.9s | 78%
86 | Gemini 2.5 Flash (Reasoning) | 92.9% | $0.0087 | 15.8s | 76%
87 | ByteDance Seed 1.6 | 95.5% | $0.0057 | 1.0m | 84%
88 | Grok 4 | 99.1% | $0.033 | 41.7s | 94%
89 | Qwen 3.5 Flash | 96.3% | $0.0037 | 1.1m | 79%
90 | Gemini 3.5 Flash (Reasoning) | 100.0% | $0.048 | 20.7s | 100%
91 | GPT-4o Mini (temp=1) | 89.3% | $0.0005 | 10.5s | 70%
92 | Ministral 3 8B | 88.4% | $0.0002 | 3.6s | 69%
93 | Ministral 8B | 87.9% | $0.0001 | 3.7s | 71%
94 | GPT-4o, Aug. 6th (temp=1) | 93.7% | $0.0073 | 3.1s | 63%
95 | Grok 4.3 (Reasoning) | 95.9% | $0.012 | 59.8s | 85%
96 | Gemini 2.5 Pro | 98.8% | $0.039 | 28.9s | 93%
97 | DeepSeek-V2 Chat | 93.4% | $0.0009 | 19.1s | 61%
98 | GPT-4o Mini (temp=0) | 88.7% | $0.0004 | 10.7s | 69%
99 | Qwen 3.6 Flash | 95.2% | $0.011 | 31.9s | 71%
100 | GPT-OSS 120B | 92.5% | $0.0009 | 52.9s | 75%
101 | Arcee AI: Trinity Mini | 92.7% | $0.0002 | 7.9s | 55%
102 | GPT-4.1 Nano | 88.0% | $0.0003 | 4.1s | 66%
103 | Qwen 3.5 35B | 96.8% | $0.019 | 58.8s | 85%
104 | Qwen 3 32B | 93.0% | $0.0009 | 43.2s | 68%
105 | Z.AI GLM 5 | 99.3% | $0.014 | 1.9m | 95%
106 | GPT-5.4 (Reasoning) | 96.2% | $0.028 | 25.1s | 79%
107 | Gemini 3 Pro (Preview) | 100.0% | $0.053 | 35.9s | 100%
108 | Inception Mercury 2 | 88.4% | $0.0022 | 3.1s | 59%
109 | ByteDance Seed 1.6 Flash | 88.9% | $0.0009 | 17.2s | 61%
110 | Aion 2.0 | 95.2% | $0.0051 | 1.1m | 69%
111 | Mistral NeMO | 90.2% | $0.0002 | 2.8s | 50%
112 | WizardLM 2 8x22b | 92.1% | $0.0009 | 51.4s | 65%
113 | Gemma 4 26B (Reasoning) | 99.3% | $0.0023 | 2.8m | 95%
114 | Mistral Small 4 (Reasoning) | 89.1% | $0.0024 | 22.3s | 61%
115 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.063 | 29.0s | 100%
116 | GPT-5.5 (Reasoning) | 96.3% | $0.042 | 16.0s | 81%
117 | DeepSeek V3 (2025-03-24) | 92.1% | $0.0007 | 40.7s | 56%
118 | Gemini 2.5 Flash Lite (Reasoning) | 88.4% | $0.0030 | 26.8s | 62%
119 | Qwen 3.5 Plus (2026-04-20) | 96.4% | $0.015 | 1.6m | 83%
120 | Z.AI GLM 5.1 | 99.3% | $0.025 | 1.9m | 94%
121 | Inception Mercury | 85.7% | $0.0005 | 4.2s | 55%
122 | Z.AI GLM 4.7 | 98.8% | $0.011 | 2.4m | 91%
123 | GPT-5 Nano | 90.9% | $0.0035 | 1.3m | 75%
124 | Claude Opus 4 | 97.3% | $0.060 | 9.4s | 91%
125 | Qwen 3.5 27B | 98.8% | $0.025 | 1.9m | 93%
126 | Skyfall 36B V2 | 86.8% | $0.0008 | 10.6s | 53%
127 | GPT-5 | 98.0% | $0.039 | 59.9s | 86%
128 | Qwen 3.5 122B | 98.0% | $0.035 | 1.4m | 90%
129 | o4 Mini High | 95.9% | $0.034 | 52.1s | 79%
130 | Gemma 4 31B (Reasoning) | 100.0% | $0.0018 | 3.6m | 100%
131 | Claude Sonnet 4.6 (Reasoning) | 99.6% | $0.064 | 41.7s | 96%
132 | Qwen3.7 Max | 99.5% | $0.047 | 1.4m | 95%
133 | o4 Mini | 91.6% | $0.019 | 29.9s | 62%
134 | DeepSeek V4 Pro (Reasoning) | 98.4% | $0.011 | 2.7m | 86%
135 | Qwen 3.5 397B A17B | 98.8% | $0.011 | 3.0m | 91%
136 | ByteDance Seed 2.0 Mini | 93.7% | $0.0026 | 2.8m | 83%
137 | DeepSeek V3.1 | 85.0% | $0.0007 | 37.9s | 41%
138 | Z.AI GLM 4.7 Flash | 87.3% | $0.0023 | 1.6m | 55%
139 | Qwen3.6 Max Preview | 99.5% | $0.047 | 2.6m | 95%
140 | Qwen 3.5 9B | 89.6% | $0.0016 | 2.4m | 63%
141 | Qwen 3.6 27B | 89.5% | $0.019 | 1.4m | 57%
142 | Gemini 3.1 Pro (Preview) | 100.0% | $0.086 | 1.3m | 100%
143 | MiniMax M2.7 | 91.4% | $0.0092 | 2.0m | 55%
144 | Nemotron 3 Super | 83.9% | $0.0000 | 1.5m | 44%
145 | MoonshotAI: Kimi K2.5 | 96.2% | $0.018 | 3.8m | 87%
146 | MoonshotAI: Kimi K2.6 | 98.9% | $0.035 | 3.6m | 94%
147 | Cohere Command R+ (Aug. 2024) | 72.0% | $0.0076 | 33.3s | 31%
148 | Rocinante 12B | 65.0% | $0.0004 | 9.2s | 16%
149 | Claude 3 Haiku | 64.8% | $0.0010 | 5.1s | 7%
150 | Nemotron 3 Nano | 73.0% | $0.0026 | 3.2m | 40%
151 | Hermes 3 70B | 63.9% | $0.0016 | 2.1m | 10%
94.61%

Individual Scenarios

Generic Prompt

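Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 97.1%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 70 | 95.7%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 70 | 95.7%
Gemma 3 4B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Ministral 3 3B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
GPT-OSS 120B | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
Ministral 3B | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
MiniMax M2.5 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Grok 4.3 | 100 | 100 | 100 | 90 | 90 | 90 | 80 | 92.9%
Z.AI GLM 4.5 Air | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
GPT-5.4 Mini | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Z.AI GLM 4.6 | 100 | 100 | 90 | 90 | 90 | 90 | 80 | 91.4%
Qwen 3 32B | 100 | 100 | 100 | 90 | 90 | 90 | 70 | 91.4%
GPT-4o Mini (temp=1) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Inception Mercury 2 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Inception Mercury | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-4o Mini (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 90 | 40 | 90.0%
Qwen 3.5 9B | 100 | 100 | 100 | 90 | 90 | 80 | 60 | 88.6%
Skyfall 36B V2 | 100 | 100 | 100 | 80 | 80 | 80 | 80 | 88.6%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 90 | 80 | 40 | 87.1%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 90 | 80 | 70 | 70 | 87.1%
GPT-5.2 | 100 | 100 | 100 | 90 | 80 | 70 | 70 | 87.1%
Grok 4.1 Fast | 100 | 100 | 90 | 90 | 90 | 80 | 50 | 85.7%
ByteDance Seed 2.0 Mini | 100 | 100 | 90 | 80 | 80 | 80 | 70 | 85.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 70 | 70 | 60 | 85.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 10 | 85.7%
Mistral NeMO | 100 | 100 | 80 | 80 | 80 | 80 | 80 | 85.7%
Qwen 3.6 27B | 100 | 100 | 90 | 90 | 90 | 70 | 40 | 82.9%
Llama 3.1 8B | 100 | 100 | 90 | 80 | 70 | 70 | 70 | 82.9%
Llama 3.1 70B | 100 | 80 | 80 | 80 | 80 | 80 | 80 | 82.9%
Z.AI GLM 4.5 | 90 | 90 | 90 | 80 | 70 | 70 | 70 | 80.0%
Nemotron 3 Super | 100 | 100 | 90 | 90 | 80 | 80 | 10 | 78.6%
GPT-5 Nano | 90 | 90 | 80 | 70 | 70 | 70 | 70 | 77.1%
Llama 3.1 Nemotron 70B | 90 | 80 | 80 | 80 | 70 | 70 | 70 | 77.1%
Ministral 8B | 80 | 80 | 80 | 70 | 70 | 60 | 60 | 71.4%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 71.4%
Ministral 3 8B | 80 | 70 | 70 | 70 | 70 | 60 | 50 | 67.1%
Nemotron 3 Nano | 100 | 100 | 90 | 80 | 60 | 10 | 0 | 62.9%
Cohere Command R+ (Aug. 2024) | 100 | 50 | 20 | 20 | 20 | 10 | 10 | 32.9%
Claude 3 Haiku | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 28.6%
Rocinante 12B | 90 | 20 | 20 | 20 | 10 | 10 | 0 | 24.3%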
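Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Ministral 3 3B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 60 | 94.3%
Claude Opus 4.7 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Ministral 3B | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 50 | 40 | 84.3%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 80 | 0 | 82.9%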
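Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
GPT-5.1 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 95.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 95.7%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
GPT-5.4 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
GPT-5 Nano | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
ByteDance Seed 1.6 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
MiniMax M2.7 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
Stealth: Healer Alpha | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
Grok 4 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
Claude Haiku 4.5 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
Hermes 3 405B | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
GPT-4o, May 13th (temp=0) | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Gemma 3 4B | 100 | 100 | 100 | 90 | 90 | 90 | 80 | 92.9%
MoonshotAI: Kimi K2.5 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Qwen 3.6 35B | 100 | 100 | 100 | 90 | 90 | 90 | 80 | 92.9%
MiniMax M2.5 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
GPT-OSS 120B | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Gemma 4 26B | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
GPT-5.2 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Z.AI GLM 4.6 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Z.AI GLM 4.5 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
GPT-4o, May 13th (temp=1) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Writer: Palmyra X5 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Cydonia 24B V4.1 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Arcee AI: Trinity Large (Preview) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
ByteDance Seed 1.6 Flash | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
GPT-4.1 Nano | 100 | 100 | 100 | 90 | 90 | 80 | 80 | 91.4%
GPT-5.4 Mini (Reasoning) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 90 | 40 | 90.0%
Claude Sonnet 4 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude Sonnet 4.5 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude Opus 4 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Xiaomi MIMO v2.5 Pro | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Qwen 3.5 Plus (2026-02-15) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-5.4 Mini (Reasoning, Low) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Large 3 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude 3.5 Sonnet | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude 3.7 Sonnet | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-4.1 Mini | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
DeepSeek V4 Pro | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-5.4 Mini | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Large 2 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
DeepSeek V4 Flash | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-5.4 Nano (Reasoning) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemini 2.5 Flash Lite | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemini 2.5 Flash | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Large | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-4o Mini (temp=1) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-4o Mini (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Small 4 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
LFM2 24B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Grok 4.3 | 100 | 100 | 100 | 90 | 90 | 80 | 70 | 90.0%
Hermes 3 70B | 100 | 100 | 90 | 90 | 90 | 80 | 80 | 90.0%
ByteDance Seed 2.0 Mini | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6%
GPT-5.4 Nano (Reasoning, Low) | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6%
Gemma 3 12B | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6%
Gemma 3 27B | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6%
Gemini 2.5 Flash (Reasoning) | 100 | 90 | 90 | 90 | 90 | 80 | 80 | 88.6%
Inception Mercury 2 | 100 | 100 | 90 | 90 | 90 | 90 | 60 | 88.6%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 90 | 90 | 90 | 80 | 70 | 88.6%
GPT-5.4 Nano | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6%
Llama 3.1 8B | 100 | 100 | 100 | 90 | 80 | 70 | 70 | 87.1%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 87.1%
Llama 3.1 70B | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 87.1%
Stealth: Hunter Alpha | 100 | 100 | 90 | 90 | 90 | 90 | 40 | 85.7%
Mistral Small 4 (Reasoning) | 90 | 90 | 90 | 90 | 90 | 80 | 70 | 85.7%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 90 | 90 | 90 | 0 | 81.4%
Llama 3.1 Nemotron 70B | 90 | 90 | 80 | 80 | 80 | 70 | 60 | 78.6%
GPT-4o, Aug. 6th (temp=1) | 90 | 90 | 90 | 90 | 90 | 90 | 0 | 77.1%
o4 Mini | 100 | 90 | 90 | 90 | 90 | 40 | 30 | 75.7%
WizardLM 2 8x22b | 100 | 90 | 90 | 90 | 90 | 40 | 0 | 71.4%
Mistral Medium 3.1 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
Ministral 3 14B | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
Ministral 3 8B | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
Ministral 8B | 70 | 70 | 70 | 70 | 70 | 70 | 60 | 68.6%
Qwen 3.6 27B | 100 | 100 | 100 | 90 | 90 | 0 | 0 | 68.6%
Ministral 3 3B | 80 | 70 | 70 | 70 | 60 | 60 | 60 | 67.1%
Nemotron 3 Nano | 90 | 90 | 90 | 90 | 40 | 40 | 10 | 64.3%
Mistral Small Creative | 70 | 70 | 70 | 60 | 60 | 60 | 60 | 64.3%
Ministral 3B | 70 | 70 | 70 | 70 | 60 | 60 | 50 | 64.3%
Nemotron 3 Super | 100 | 100 | 100 | 90 | 30 | 30 | 0 | 64.3%
Skyfall 36B V2 | 90 | 80 | 70 | 60 | 50 | 0 | 0 | 50.0%
Mistral NeMO | 80 | 80 | 70 | 20 | 0 | 0 | 0 | 35.7%
Rocinante 12B | 60 | 30 | 20 | 20 | 10 | 0 | 0 | 20.0%
Cohere Command R+ (Aug. 2024) | 30 | 20 | 10 | 10 | 10 | 10 | 10 | 14.3%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%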
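Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 97.1%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 97.1%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 95.7%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 95.7%
Gemma 4 31B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
Ministral 3 3B | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 90 | 90 | 70 | 92.9%
GPT-4o, May 13th (temp=1) | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Llama 3.1 8B | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Qwen 3.5 27B | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Gemini 2.5 Pro | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 90 | 90 | 80 | 80 | 91.4%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 90 | 80 | 70 | 91.4%
GPT-4.1 Nano | 100 | 100 | 90 | 90 | 90 | 90 | 80 | 91.4%
Claude Sonnet 4.6 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 90 | 90 | 50 | 90.0%
GPT-5.5 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
MiniMax M2.5 | 100 | 100 | 90 | 90 | 90 | 90 | 70 | 90.0%
Claude Opus 4 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemini 3.1 Flash Lite (Reasoning) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 90 | 70 | 70 | 90.0%
Gemini 3.1 Flash Lite (Preview) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemini 3.1 Flash Lite | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-4o, May 13th (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemini 3 Flash (Preview) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude Haiku 4.5 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Z.AI GLM 4.5 Air | 100 | 90 | 90 | 90 | 90 | 90 | 80 | 90.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 90 | 90 | 90 | 90 | 90 | 80 | 90.0%
GPT-4o, Aug. 6th (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Grok 4.20 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Llama 3.1 70B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Medium 3.1 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Llama 3.1 Nemotron 70B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
LFM2 24B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Qwen 3.5 397B A17B | 100 | 100 | 90 | 90 | 90 | 90 | 70 | 90.0%
Qwen 3.5 122B | 100 | 100 | 100 | 90 | 80 | 80 | 80 | 90.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 90 | 90 | 80 | 70 | 90.0%
Ministral 3B | 100 | 100 | 90 | 90 | 90 | 80 | 80 | 90.0%
GPT-5.4 Mini (Reasoning, Low) | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6%
GPT-5.4 Nano (Reasoning) | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6%
GPT-5 | 100 | 100 | 100 | 100 | 80 | 70 | 70 | 88.6%
Mistral Small Creative | 100 | 90 | 90 | 90 | 90 | 90 | 70 | 88.6%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 80 | 80 | 80 | 80 | 88.6%
Z.AI GLM 5 Turbo | 100 | 100 | 90 | 90 | 80 | 80 | 70 | 87.1%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 70 | 70 | 70 | 87.1%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 90 | 90 | 80 | 80 | 70 | 87.1%
Gemini 3.5 Flash (Reasoning, Minimal) | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 87.1%
Mistral Small 4 (Reasoning) | 100 | 100 | 90 | 90 | 80 | 80 | 70 | 87.1%
Z.AI GLM 4.6 | 100 | 100 | 100 | 80 | 80 | 80 | 70 | 87.1%
Grok 4.20 (Beta, Reasoning) | 100 | 90 | 90 | 80 | 80 | 80 | 80 | 85.7%
MoonshotAI: Kimi K2.5 | 100 | 90 | 90 | 90 | 90 | 70 | 70 | 85.7%
Grok 4.20 (Beta) | 90 | 90 | 90 | 90 | 90 | 90 | 60 | 85.7%
Hermes 3 70B | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 85.7%
GPT-5.1 | 100 | 100 | 100 | 80 | 80 | 70 | 70 | 85.7%
Inception Mercury 2 | 90 | 90 | 90 | 90 | 90 | 80 | 70 | 85.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
GPT-5.5 (Reasoning, Low) | 90 | 90 | 90 | 90 | 80 | 80 | 70 | 84.3%
Qwen 3.6 35B | 90 | 90 | 90 | 90 | 80 | 80 | 70 | 84.3%
ByteDance Seed 2.0 Mini | 100 | 90 | 90 | 80 | 80 | 80 | 70 | 84.3%
GPT-OSS 120B | 90 | 90 | 90 | 90 | 90 | 70 | 70 | 84.3%
GPT-5.4 Mini | 100 | 100 | 90 | 90 | 90 | 70 | 50 | 84.3%
Grok 4.1 Fast | 100 | 100 | 90 | 80 | 70 | 70 | 70 | 82.9%
Qwen 3.5 35B | 100 | 90 | 90 | 80 | 80 | 70 | 70 | 82.9%
Grok 4 Fast | 100 | 90 | 90 | 80 | 80 | 70 | 70 | 82.9%
Qwen 3.5 Flash | 90 | 90 | 90 | 90 | 80 | 70 | 70 | 82.9%
Grok 4.3 | 90 | 90 | 90 | 90 | 80 | 80 | 60 | 82.9%
GPT-5 Mini | 90 | 90 | 90 | 80 | 80 | 70 | 70 | 81.4%
DeepSeek-V2 Chat | 90 | 90 | 90 | 90 | 70 | 70 | 70 | 81.4%
GPT-4.1 Mini | 90 | 90 | 80 | 80 | 80 | 80 | 70 | 81.4%
GPT-5.4 Nano (Reasoning, Low) | 90 | 90 | 90 | 80 | 80 | 70 | 70 | 81.4%
o4 Mini High | 90 | 90 | 80 | 80 | 80 | 70 | 70 | 80.0%
Aion 2.0 | 100 | 90 | 90 | 80 | 80 | 70 | 50 | 80.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 90 | 80 | 50 | 40 | 80.0%
Gemini 2.5 Flash | 90 | 90 | 80 | 80 | 80 | 80 | 60 | 80.0%
Qwen 3 32B | 90 | 90 | 90 | 80 | 70 | 70 | 60 | 78.6%
ByteDance Seed 1.6 | 90 | 80 | 80 | 80 | 80 | 70 | 70 | 78.6%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 80 | 70 | 0 | 78.6%
ByteDance Seed 1.6 Flash | 80 | 80 | 80 | 80 | 80 | 80 | 70 | 78.6%
Qwen 3.5 Plus (2026-04-20) | 90 | 90 | 80 | 70 | 70 | 70 | 70 | 77.1%
Qwen 3.6 27B | 100 | 90 | 90 | 70 | 70 | 70 | 50 | 77.1%
GPT-5.4 | 100 | 80 | 80 | 70 | 70 | 70 | 70 | 77.1%
Z.AI GLM 4.5 | 90 | 80 | 80 | 80 | 70 | 70 | 70 | 77.1%
Z.AI GLM 4.7 Flash | 90 | 90 | 80 | 80 | 80 | 60 | 60 | 77.1%
DeepSeek V3 (2025-03-24) | 90 | 90 | 90 | 90 | 90 | 70 | 10 | 75.7%
o4 Mini | 90 | 90 | 80 | 80 | 70 | 70 | 50 | 75.7%
DeepSeek V3 (2024-12-26) | 90 | 90 | 90 | 90 | 70 | 50 | 50 | 75.7%
Hermes 3 405B | 80 | 80 | 80 | 80 | 80 | 70 | 60 | 75.7%
DeepSeek V3.2 | 90 | 90 | 80 | 70 | 70 | 70 | 60 | 75.7%
GPT-4.1 | 90 | 80 | 70 | 70 | 70 | 70 | 70 | 74.3%
GPT-5.4 Nano | 80 | 80 | 80 | 70 | 70 | 70 | 70 | 74.3%
WizardLM 2 8x22b | 90 | 90 | 80 | 70 | 70 | 60 | 60 | 74.3%
GPT-5.5 (Reasoning) | 80 | 80 | 80 | 80 | 80 | 70 | 50 | 74.3%
GPT-5.4 (Reasoning) | 100 | 80 | 70 | 70 | 70 | 70 | 50 | 72.9%
Gemini 2.5 Flash (Reasoning) | 90 | 90 | 80 | 70 | 70 | 50 | 50 | 71.4%
GPT-4o Mini (temp=1) | 80 | 70 | 70 | 70 | 70 | 70 | 70 | 71.4%
Rocinante 12B | 100 | 100 | 100 | 100 | 80 | 20 | 0 | 71.4%
GPT-5.2 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
Gemma 3 12B | 80 | 80 | 80 | 80 | 70 | 50 | 50 | 70.0%
GPT-4o Mini (temp=0) | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
Qwen 3.6 Flash | 90 | 90 | 80 | 80 | 70 | 70 | 0 | 68.6%
GPT-5 Nano | 80 | 70 | 70 | 70 | 70 | 60 | 50 | 67.1%
Inception Mercury | 80 | 70 | 70 | 70 | 70 | 60 | 50 | 67.1%
Gemma 3 27B | 90 | 80 | 80 | 60 | 60 | 50 | 50 | 67.1%
Qwen 3.5 9B | 90 | 70 | 70 | 70 | 50 | 50 | 50 | 64.3%
Gemini 2.5 Flash Lite (Reasoning) | 90 | 80 | 70 | 60 | 50 | 50 | 50 | 64.3%
Nemotron 3 Super | 70 | 70 | 70 | 70 | 70 | 70 | 10 | 61.4%
Nemotron 3 Nano | 80 | 80 | 70 | 70 | 70 | 50 | 10 | 61.4%
DeepSeek V3.1 | 70 | 70 | 50 | 50 | 50 | 50 | 0 | 48.6%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%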

Specific Prompt

Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Qwen3.7 Max 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
Qwen3.6 Max Preview 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5.1 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 Turbo 100 100 100 100 100 100 100 100.0%
Gemini 3.5 Flash (Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Opus 4.7 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5 Mini 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.6 100 100 100 100 100 100 100 100.0%
GPT-5 100 100 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100 100 100.0%
Gemma 4 31B (Reasoning) 100 100 100 100 100 100 100 100.0%
Qwen 3.5 122B 100 100 100 100 100 100 100 100.0%
Gemma 4 26B (Reasoning) 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Reasoning) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.5 100 100 100 100 100 100 100 100.0%
Qwen 3.5 27B 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 100 100 100 100.0%
o4 Mini High 100 100 100 100 100 100 100 100.0%
GPT-5.2 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Pro (Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Opus 4.7 100 100 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100 100 100.0%
Aion 2.0 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.6 100 100 100 100 100 100 100 100.0%
GPT-5.5 100 100 100 100 100 100 100 100.0%
Gemini 3 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4 100 100 100 100 100 100 100 100.0%
MiniMax M2.5 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 100 100 100 100 100 100 100 100.0%
GPT-4.1 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100 100 100.0%
Grok 4 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100 100 100.0%
Claude Opus 4 100 100 100 100 100 100 100 100.0%
Xiaomi MIMO v2.5 Pro 100 100 100 100 100 100 100 100.0%
Stealth: Hunter Alpha 100 100 100 100 100 100 100 100.0%
Gemma 4 31B 100 100 100 100 100 100 100 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite (Reasoning) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.5 100 100 100 100 100 100 100 100.0%
Qwen 3.5 9B 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 100 100 100.0%
Gemma 4 26B 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
Mistral Large 3 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview) 100 100 100 100 100 100 100 100.0%
Claude Haiku 4.5 100 100 100 100 100 100 100 100.0%
Xiaomi MIMO v2.5 100 100 100 100 100 100 100 100.0%
DeepSeek-V2 Chat 100 100 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100 100 100.0%
Claude 3.5 Sonnet 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta) 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=1) 100 100 100 100 100 100 100 100.0%
Claude 3.7 Sonnet 100 100 100 100 100 100 100 100.0%
Hermes 3 405B 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Pro 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=1) 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=0) 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini 100 100 100 100 100 100 100 100.0%
Mistral Large 2 100 100 100 100 100 100 100 100.0%
DeepSeek V3.2 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Flash 100 100 100 100 100 100 100 100.0%
Grok 4.20 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash Lite 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash 100 100 100 100 100 100 100 100.0%
Mistral Large 100 100 100 100 100 100 100 100.0%
Qwen3 235B A22B Instruct 2507 100 100 100 100 100 100 100 100.0%
Writer: Palmyra X5 100 100 100 100 100 100 100 100.0%
Grok 4.3 100 100 100 100 100 100 100 100.0%
Mistral Small 3.2 24B 100 100 100 100 100 100 100 100.0%
Gemma 3 12B 100 100 100 100 100 100 100 100.0%
Mistral Medium 3.1 100 100 100 100 100 100 100 100.0%
Mistral Small 4 100 100 100 100 100 100 100 100.0%
Qwen 2.5 72B 100 100 100 100 100 100 100 100.0%
Cydonia 24B V4.1 100 100 100 100 100 100 100 100.0%
Llama 3.1 Nemotron 70B 100 100 100 100 100 100 100 100.0%
Arcee AI: Trinity Large (Preview) 100 100 100 100 100 100 100 100.0%
Mistral Small Creative 100 100 100 100 100 100 100 100.0%
Ministral 3 14B 100 100 100 100 100 100 100 100.0%
Ministral 3 8B 100 100 100 100 100 100 100 100.0%
Claude 3 Haiku 100 100 100 100 100 100 100 100.0%
WizardLM 2 8x22b 100 100 100 100 100 100 100 100.0%
Arcee AI: Trinity Mini 100 100 100 100 100 100 100 100.0%
Cohere Command R+ (Aug. 2024) 100 100 100 100 100 100 100 100.0%
Gemma 3 4B 100 100 100 100 100 100 100 100.0%
Ministral 3 3B 100 100 100 100 100 100 100 100.0%
Mistral NeMO 100 100 100 100 100 100 100 100.0%
Ministral 3B 100 100 100 100 100 100 100 100.0%
Rocinante 12B 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-04-20) 100 100 100 100 100 100 90 98.6%
ByteDance Seed 1.6 100 100 100 100 100 100 90 98.6%
DeepSeek V4 Flash (Reasoning) 100 100 100 100 100 100 90 98.6%
Qwen 3.5 35B 100 100 100 100 100 100 90 98.6%
ByteDance Seed 2.0 Mini 100 100 100 100 100 100 90 98.6%
Grok 4 Fast 100 100 100 100 100 100 90 98.6%
Stealth: Healer Alpha 100 100 100 100 100 100 90 98.6%
ByteDance Seed 2.0 Lite 100 100 100 100 100 100 90 98.6%
Z.AI GLM 4.5 Air 100 100 100 100 100 100 90 98.6%
Gemma 3 27B 100 100 100 100 100 100 90 98.6%
Llama 3.1 8B 100 100 100 100 100 100 90 98.6%
Qwen 3.6 Flash 100 100 100 100 100 90 90 97.1%
Qwen 3.6 35B 100 100 100 100 100 90 90 97.1%
GPT-OSS 120B 100 100 100 100 100 90 90 97.1%
GPT-4.1 Mini 100 100 100 100 100 90 90 97.1%
GPT-5 Nano 100 100 100 100 100 90 90 97.1%
GPT-5.4 Nano (Reasoning, Low) 100 100 100 100 100 90 90 97.1%
Grok 4.3 (Reasoning) 100 100 100 100 100 90 90 97.1%
Ministral 8B 100 100 100 100 100 100 80 97.1%
Qwen 3.5 Flash 100 100 100 100 100 90 80 95.7%
Qwen 3 32B 100 100 100 100 100 100 70 95.7%
Qwen 3.6 27B 100 100 100 100 100 90 70 94.3%
Llama 3.1 70B 100 100 100 100 100 90 70 94.3%
GPT-5.4 Nano 100 100 100 100 90 90 80 94.3%
GPT-5.4 Nano (Reasoning) 100 100 100 90 90 90 80 92.9%
GPT-4o Mini (temp=1) 100 100 90 90 90 90 90 92.9%
Skyfall 36B V2 100 100 90 90 90 90 80 91.4%
Inception Mercury 2 100 100 100 90 90 80 80 91.4%
GPT-4o Mini (temp=0) 90 90 90 90 90 90 90 90.0%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 90 80 80 70 88.6%
DeepSeek V3.1 100 100 100 100 100 100 20 88.6%
Nemotron 3 Super 100 100 100 100 100 100 10 87.1%
o4 Mini 100 100 100 100 100 100 0 85.7%
DeepSeek V3 (2024-12-26) 100 100 100 100 100 100 0 85.7%
DeepSeek V3 (2025-03-24) 100 100 100 100 100 100 0 85.7%
Gemini 2.5 Flash (Reasoning) 100 100 90 90 80 70 70 85.7%
Mistral Small 4 (Reasoning) 100 100 100 90 80 70 50 84.3%
Inception Mercury 100 100 90 80 80 70 70 84.3%
LFM2 24B 90 90 90 90 90 70 20 77.1%
Nemotron 3 Nano 90 90 90 90 60 50 40 72.9%
ByteDance Seed 1.6 Flash 100 80 80 80 80 60 10 70.0%
MiniMax M2.7 100 100 100 80 70 10 10 67.1%
Z.AI GLM 4.7 Flash 100 90 90 80 70 30 10 67.1%
GPT-4.1 Nano 90 80 60 60 60 60 60 67.1%
Hermes 3 70B 90 0 0 0 0 0 0 12.9%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Qwen3.7 Max 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
Qwen3.6 Max Preview 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5.1 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 Turbo 100 100 100 100 100 100 100 100.0%
Gemini 3.5 Flash (Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
Grok 4.3 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Opus 4.7 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.6 100 100 100 100 100 100 100 100.0%
GPT-5 100 100 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100 100 100.0%
Gemma 4 31B (Reasoning) 100 100 100 100 100 100 100 100.0%
Qwen 3.5 122B 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-04-20) 100 100 100 100 100 100 100 100.0%
Gemma 4 26B (Reasoning) 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Reasoning) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.5 100 100 100 100 100 100 100 100.0%
Qwen 3.5 27B 100 100 100 100 100 100 100 100.0%
ByteDance Seed 1.6 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 100 100 100.0%
o4 Mini High 100 100 100 100 100 100 100 100.0%
GPT-5.2 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Pro (Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Opus 4.7 100 100 100 100 100 100 100 100.0%
Qwen 3.6 27B 100 100 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100 100 100.0%
Aion 2.0 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.6 100 100 100 100 100 100 100 100.0%
MiniMax M2.7 100 100 100 100 100 100 100 100.0%
GPT-5.5 100 100 100 100 100 100 100 100.0%
Qwen 3.6 35B 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Flash (Reasoning) 100 100 100 100 100 100 100 100.0%
Gemini 3 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4 100 100 100 100 100 100 100 100.0%
MiniMax M2.5 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 100 100 100 100 100 100 100 100.0%
GPT-4.1 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100 100 100.0%
o4 Mini 100 100 100 100 100 100 100 100.0%
Grok 4 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100 100 100.0%
Qwen 3.5 35B 100 100 100 100 100 100 100 100.0%
Claude Opus 4 100 100 100 100 100 100 100 100.0%
Xiaomi MIMO v2.5 Pro 100 100 100 100 100 100 100 100.0%
Stealth: Hunter Alpha 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Mini 100 100 100 100 100 100 100 100.0%
Gemma 4 31B 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 100 100 100 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) 100 100 100 100 100 100 100 100.0%
GPT-OSS 120B 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite (Reasoning) 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Flash 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.5 100 100 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100 100 100.0%
Stealth: Healer Alpha 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 100 100 100.0%
Gemma 4 26B 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
Mistral Large 3 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview) 100 100 100 100 100 100 100 100.0%
Claude Haiku 4.5 100 100 100 100 100 100 100 100.0%
Xiaomi MIMO v2.5 100 100 100 100 100 100 100 100.0%
DeepSeek-V2 Chat 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 Flash 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Lite 100 100 100 100 100 100 100 100.0%
Nemotron 3 Super 100 100 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100 100 100.0%
Claude 3.5 Sonnet 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta) 100 100 100 100 100 100 100 100.0%
Inception Mercury 2 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=1) 100 100 100 100 100 100 100 100.0%
DeepSeek V3 (2024-12-26) 100 100 100 100 100 100 100 100.0%
Claude 3.7 Sonnet 100 100 100 100 100 100 100 100.0%
GPT-4.1 Mini 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.5 Air 100 100 100 100 100 100 100 100.0%
Hermes 3 405B 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Pro 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=1) 100 100 100 100 100 100 100 100.0%
GPT-5 Nano 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=0) 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini 100 100 100 100 100 100 100 100.0%
Mistral Large 2 100 100 100 100 100 100 100 100.0%
Mistral Small 4 (Reasoning) 100 100 100 100 100 100 100 100.0%
DeepSeek V3.2 100 100 100 100 100 100 100 100.0%
Qwen 3 32B 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Flash 100 100 100 100 100 100 100 100.0%
DeepSeek V3 (2025-03-24) 100 100 100 100 100 100 100 100.0%
Grok 4.20 100 100 100 100 100 100 100 100.0%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash Lite 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash 100 100 100 100 100 100 100 100.0%
Mistral Large 100 100 100 100 100 100 100 100.0%
Qwen3 235B A22B Instruct 2507 100 100 100 100 100 100 100 100.0%
Writer: Palmyra X5 100 100 100 100 100 100 100 100.0%
GPT-5.4 Nano (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
GPT-4o Mini (temp=1) 100 100 100 100 100 100 100 100.0%
Grok 4.3 100 100 100 100 100 100 100 100.0%
Gemma 3 12B 100 100 100 100 100 100 100 100.0%
Llama 3.1 70B 100 100 100 100 100 100 100 100.0%
GPT-4o Mini (temp=0) 100 100 100 100 100 100 100 100.0%
Gemma 3 27B 100 100 100 100 100 100 100 100.0%
Mistral Medium 3.1 100 100 100 100 100 100 100 100.0%
Mistral Small 4 100 100 100 100 100 100 100 100.0%
Qwen 2.5 72B 100 100 100 100 100 100 100 100.0%
Cydonia 24B V4.1 100 100 100 100 100 100 100 100.0%
Llama 3.1 Nemotron 70B 100 100 100 100 100 100 100 100.0%
GPT-5.4 Nano 100 100 100 100 100 100 100 100.0%
Arcee AI: Trinity Large (Preview) 100 100 100 100 100 100 100 100.0%
ByteDance Seed 1.6 Flash 100 100 100 100 100 100 100 100.0%
Mistral Small Creative 100 100 100 100 100 100 100 100.0%
Ministral 3 14B 100 100 100 100 100 100 100 100.0%
GPT-4.1 Nano 100 100 100 100 100 100 100 100.0%
Ministral 3 8B 100 100 100 100 100 100 100 100.0%
Claude 3 Haiku 100 100 100 100 100 100 100 100.0%
WizardLM 2 8x22b 100 100 100 100 100 100 100 100.0%
Arcee AI: Trinity Mini 100 100 100 100 100 100 100 100.0%
Cohere Command R+ (Aug. 2024) 100 100 100 100 100 100 100 100.0%
Gemma 3 4B 100 100 100 100 100 100 100 100.0%
Ministral 3 3B 100 100 100 100 100 100 100 100.0%
Skyfall 36B V2 100 100 100 100 100 100 100 100.0%
Mistral NeMO 100 100 100 100 100 100 100 100.0%
Ministral 8B 100 100 100 100 100 100 100 100.0%
Llama 3.1 8B 100 100 100 100 100 100 100 100.0%
Ministral 3B 100 100 100 100 100 100 100 100.0%
LFM2 24B 100 100 100 100 100 100 100 100.0%
GPT-5 Mini 100 100 100 100 100 100 90 98.6%
Qwen 3.6 Flash 100 100 100 100 100 100 90 98.6%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 100 100 90 98.6%
Qwen 3.5 9B 100 100 100 100 100 100 60 94.3%
DeepSeek V3.1 100 100 100 100 100 100 50 92.9%
Inception Mercury 100 100 100 100 100 100 10 87.1%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 100 100 100 0 85.7%
Mistral Small 3.2 24B 100 100 100 100 100 100 0 85.7%
Rocinante 12B 100 100 100 100 100 70 0 81.4%
Nemotron 3 Nano 100 100 90 70 40 40 40 68.6%
Hermes 3 70B 100 100 100 90 0 0 0 55.7%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Qwen3.7 Max 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
Qwen3.6 Max Preview 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5.1 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 Turbo 100 100 100 100 100 100 100 100.0%
Gemini 3.5 Flash (Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Opus 4.7 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.6 100 100 100 100 100 100 100 100.0%
GPT-5 100 100 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100 100 100.0%
Gemma 4 31B (Reasoning) 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-04-20) 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Reasoning) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100 100 100.0%
Qwen 3.5 27B 100 100 100 100 100 100 100 100.0%
ByteDance Seed 1.6 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Pro (Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Opus 4.7 100 100 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100 100 100.0%
GPT-5.5 100 100 100 100 100 100 100 100.0%
Qwen 3.6 35B 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Flash (Reasoning) 100 100 100 100 100 100 100 100.0%
Gemini 3 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 100 100 100 100 100 100 100 100.0%
GPT-4.1 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100 100 100.0%
o4 Mini 100 100 100 100 100 100 100 100.0%
Grok 4 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100 100 100.0%
Claude Opus 4 100 100 100 100 100 100 100 100.0%
Gemma 4 31B 100 100 100 100 100 100 100 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) 100 100 100 100 100 100 100 100.0%
Stealth: Healer Alpha 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview) 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta) 100 100 100 100 100 100 100 100.0%
Hermes 3 405B 100 100 100 100 100 100 100 100.0%
DeepSeek V3.2 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Flash 100 100 100 100 100 100 100 100.0%
Grok 4.20 100 100 100 100 100 100 100 100.0%
GPT-4o Mini (temp=0) 100 100 100 100 100 100 100 100.0%
Qwen 2.5 72B 100 100 100 100 100 100 100 100.0%
Gemma 3 4B 100 100 100 100 100 100 100 100.0%
Mistral NeMO 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100 90 98.6%
Qwen 3.5 122B 100 100 100 100 100 100 90 98.6%
Qwen 3.6 Flash 100 100 100 100 100 100 90 98.6%
o4 Mini High 100 100 100 100 100 100 90 98.6%
GPT-5.2 100 100 100 100 100 100 90 98.6%
Z.AI GLM 4.6 100 100 100 100 100 100 90 98.6%
Qwen 3.5 35B 100 100 100 100 100 100 90 98.6%
Grok 4 Fast 100 100 100 100 100 100 90 98.6%
DeepSeek V3 (2024-12-26) 100 100 100 100 100 100 90 98.6%
Z.AI GLM 4.5 Air 100 100 100 100 100 100 90 98.6%
GPT-4o, Aug. 6th (temp=0) 100 100 100 100 100 100 90 98.6%
Qwen3 235B A22B Instruct 2507 100 100 100 100 100 100 90 98.6%
Gemma 4 26B (Reasoning) 100 100 100 100 100 90 90 97.1%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 100 90 90 97.1%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 90 90 97.1%
Grok 4.3 100 100 100 100 100 90 90 97.1%
ByteDance Seed 1.6 Flash 100 100 100 100 100 90 90 97.1%
WizardLM 2 8x22b 100 100 100 100 100 90 90 97.1%
GPT-5 Mini 100 100 100 100 100 90 90 97.1%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 100 100 90 90 97.1%
Xiaomi MIMO v2.5 100 100 100 100 100 90 90 97.1%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 90 90 97.1%
Writer: Palmyra X5 100 100 100 100 100 90 90 97.1%
Gemini 2.5 Flash 100 100 100 100 90 90 90 95.7%
MoonshotAI: Kimi K2.5 100 100 100 100 90 90 90 95.7%
Z.AI GLM 4.5 100 100 100 100 90 90 90 95.7%
GPT-5.4 100 100 100 100 90 90 90 95.7%
GPT-4o Mini (temp=1) 100 100 100 100 90 90 90 95.7%
Qwen 3.6 27B 100 100 100 100 90 90 90 95.7%
Xiaomi MIMO v2.5 Pro 100 100 100 100 90 90 90 95.7%
Claude Haiku 4.5 100 100 100 100 90 90 90 95.7%
GPT-5.4 Mini (Reasoning) 100 100 100 90 90 90 90 94.3%
GPT-5 Nano 100 100 100 100 90 90 80 94.3%
GPT-5.4 Mini 100 100 100 90 90 90 90 94.3%
Skyfall 36B V2 100 100 100 100 90 90 80 94.3%
Qwen 3.5 Plus (2026-02-15) 100 100 90 90 90 90 90 92.9%
Mistral Small 3.2 24B 100 100 90 90 90 90 90 92.9%
Arcee AI: Trinity Mini 100 100 90 90 90 90 90 92.9%
GPT-5.4 (Reasoning, Low) 100 100 90 90 90 90 90 92.9%
Claude Sonnet 4 100 100 90 90 90 90 90 92.9%
MiniMax M2.5 100 100 90 90 90 90 90 92.9%
ByteDance Seed 2.0 Mini 100 100 90 90 90 90 90 92.9%
GPT-4o, May 13th (temp=0) 100 100 90 90 90 90 90 92.9%
ByteDance Seed 2.0 Lite 100 100 100 100 100 90 60 92.9%
GPT-4o, May 13th (temp=1) 100 100 90 90 90 90 90 92.9%
GPT-5.4 Nano (Reasoning, Low) 100 100 90 90 90 90 90 92.9%
Gemma 3 12B 100 100 90 90 90 90 90 92.9%
GPT-5.4 Mini (Reasoning, Low) 100 90 90 90 90 90 90 91.4%
Grok 4.3 (Reasoning) 100 90 90 90 90 90 90 91.4%
Stealth: Hunter Alpha 100 90 90 90 90 90 90 91.4%
Qwen 3.5 Flash 100 100 100 100 100 100 40 91.4%
Cydonia 24B V4.1 100 90 90 90 90 90 90 91.4%
GPT-5.4 Nano 100 90 90 90 90 90 90 91.4%
Arcee AI: Trinity Large (Preview) 100 90 90 90 90 90 90 91.4%
Gemini 3.1 Flash Lite (Reasoning) 90 90 90 90 90 90 90 90.0%
Gemma 4 26B 90 90 90 90 90 90 90 90.0%
Gemini 3.1 Flash Lite 90 90 90 90 90 90 90 90.0%
Mistral Large 3 90 90 90 90 90 90 90 90.0%
Claude 3.5 Sonnet 90 90 90 90 90 90 90 90.0%
Claude 3.7 Sonnet 90 90 90 90 90 90 90 90.0%
GPT-4.1 Mini 90 90 90 90 90 90 90 90.0%
DeepSeek V4 Pro 90 90 90 90 90 90 90 90.0%
Mistral Large 2 90 90 90 90 90 90 90 90.0%
Gemini 2.5 Flash Lite 90 90 90 90 90 90 90 90.0%
Mistral Large 90 90 90 90 90 90 90 90.0%
Llama 3.1 70B 90 90 90 90 90 90 90 90.0%
Gemma 3 27B 90 90 90 90 90 90 90 90.0%
Mistral Medium 3.1 90 90 90 90 90 90 90 90.0%
Claude 3 Haiku 90 90 90 90 90 90 90 90.0%
LFM2 24B 90 90 90 90 90 90 90 90.0%
Mistral Small 4 90 90 90 90 90 90 80 88.6%
Ministral 3 3B 90 90 90 90 90 90 80 88.6%
Ministral 3B 100 90 90 90 90 80 80 88.6%
DeepSeek V3.1 100 100 100 100 100 100 10 87.1%
Z.AI GLM 4.7 Flash 100 100 100 100 90 90 20 85.7%
Qwen 3 32B 100 100 100 100 100 100 0 85.7%
Llama 3.1 Nemotron 70B 90 90 90 90 80 80 80 85.7%
Aion 2.0 100 100 100 100 100 90 0 84.3%
GPT-4o, Aug. 6th (temp=1) 100 100 100 100 90 90 0 82.9%
DeepSeek V3 (2025-03-24) 100 100 100 100 100 80 0 82.9%
Llama 3.1 8B 100 100 90 90 80 80 40 82.9%
GPT-OSS 120B 100 100 100 100 90 40 40 81.4%
Nemotron 3 Super 100 100 100 100 90 40 40 81.4%
Nemotron 3 Nano 100 100 100 90 90 50 40 81.4%
GPT-4.1 Nano 90 90 80 80 80 70 70 80.0%
Mistral Small Creative 80 80 80 80 80 80 80 80.0%
Mistral Small 4 (Reasoning) 100 90 90 90 90 90 0 78.6%
MiniMax M2.7 100 100 100 100 100 40 0 77.1%
Ministral 3 14B 80 80 80 80 80 70 70 77.1%
Inception Mercury 100 100 100 100 40 40 40 74.3%
Qwen 3.5 9B 100 100 100 100 40 40 40 74.3%
Cohere Command R+ (Aug. 2024) 100 100 70 70 60 60 40 71.4%
Ministral 3 8B 70 70 70 70 70 70 70 70.0%
Ministral 8B 70 70 70 70 70 70 70 70.0%
DeepSeek-V2 Chat 100 100 100 90 90 0 0 68.6%
Inception Mercury 2 100 100 40 40 40 40 40 57.1%
Rocinante 12B 100 90 90 60 40 0 0 54.3%
Hermes 3 70B 90 80 0 0 0 0 0 24.3%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Qwen3.7 Max 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
Qwen3.6 Max Preview 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5.1 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 Turbo 100 100 100 100 100 100 100 100.0%
Gemini 3.5 Flash (Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
Grok 4.3 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Opus 4.7 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5 Mini 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.6 100 100 100 100 100 100 100 100.0%
GPT-5 100 100 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100 100 100.0%
Gemma 4 31B (Reasoning) 100 100 100 100 100 100 100 100.0%
Qwen 3.5 122B 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-04-20) 100 100 100 100 100 100 100 100.0%
Gemma 4 26B (Reasoning) 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100 100 100.0%
Qwen 3.5 27B 100 100 100 100 100 100 100 100.0%
Qwen 3.6 Flash 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 100 100 100 100.0%
o4 Mini High 100 100 100 100 100 100 100 100.0%
GPT-5.2 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Pro (Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Opus 4.7 100 100 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100 100 100.0%
Aion 2.0 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.6 100 100 100 100 100 100 100 100.0%
GPT-5.5 100 100 100 100 100 100 100 100.0%
Qwen 3.6 35B 100 100 100 100 100 100 100 100.0%
Gemini 3 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4 100 100 100 100 100 100 100 100.0%
MiniMax M2.5 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 100 100 100 100 100 100 100 100.0%
GPT-4.1 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100 100 100.0%
Grok 4 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100 100 100.0%
Qwen 3.5 35B 100 100 100 100 100 100 100 100.0%
Claude Opus 4 100 100 100 100 100 100 100 100.0%
Xiaomi MIMO v2.5 Pro 100 100 100 100 100 100 100 100.0%
Stealth: Hunter Alpha 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Mini 100 100 100 100 100 100 100 100.0%
Gemma 4 31B 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 100 100 100 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite (Reasoning) 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Flash 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.5 100 100 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 100 100 100 100.0%
Qwen 3.5 9B 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 100 100 100.0%
Gemma 4 26B 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite 100 100 100 100 100 100 100 100.0%
Mistral Large 3 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview) 100 100 100 100 100 100 100 100.0%
Claude Haiku 4.5 100 100 100 100 100 100 100 100.0%
Xiaomi MIMO v2.5 100 100 100 100 100 100 100 100.0%
DeepSeek-V2 Chat 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Lite 100 100 100 100 100 100 100 100.0%
Nemotron 3 Super 100 100 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100 100 100.0%
Claude 3.5 Sonnet 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta) 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=1) 100 100 100 100 100 100 100 100.0%
DeepSeek V3 (2024-12-26) 100 100 100 100 100 100 100 100.0%
Claude 3.7 Sonnet 100 100 100 100 100 100 100 100.0%
GPT-4.1 Mini 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Pro 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=1) 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=0) 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini 100 100 100 100 100 100 100 100.0%
Mistral Large 2 100 100 100 100 100 100 100 100.0%
DeepSeek V3.2 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Flash 100 100 100 100 100 100 100 100.0%
DeepSeek V3 (2025-03-24) 100 100 100 100 100 100 100 100.0%
Grok 4.20 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash Lite 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash 100 100 100 100 100 100 100 100.0%
Mistral Large 100 100 100 100 100 100 100 100.0%
Qwen3 235B A22B Instruct 2507 100 100 100 100 100 100 100 100.0%
Writer: Palmyra X5 100 100 100 100 100 100 100 100.0%
Grok 4.3 100 100 100 100 100 100 100 100.0%
Mistral Small 3.2 24B 100 100 100 100 100 100 100 100.0%
Gemma 3 12B 100 100 100 100 100 100 100 100.0%
Llama 3.1 70B 100 100 100 100 100 100 100 100.0%
Gemma 3 27B 100 100 100 100 100 100 100 100.0%
Mistral Medium 3.1 100 100 100 100 100 100 100 100.0%
Mistral Small 4 100 100 100 100 100 100 100 100.0%
Qwen 2.5 72B 100 100 100 100 100 100 100 100.0%
Cydonia 24B V4.1 100 100 100 100 100 100 100 100.0%
Llama 3.1 Nemotron 70B 100 100 100 100 100 100 100 100.0%
Arcee AI: Trinity Large (Preview) 100 100 100 100 100 100 100 100.0%
Mistral Small Creative 100 100 100 100 100 100 100 100.0%
Ministral 3 14B 100 100 100 100 100 100 100 100.0%
Ministral 3 8B 100 100 100 100 100 100 100 100.0%
Claude 3 Haiku 100 100 100 100 100 100 100 100.0%
WizardLM 2 8x22b 100 100 100 100 100 100 100 100.0%
Mistral NeMO 100 100 100 100 100 100 100 100.0%
Llama 3.1 8B 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Reasoning) 100 100 100 100 100 100 90 98.6%
ByteDance Seed 1.6 100 100 100 100 100 100 90 98.6%
Grok 4.1 Fast 100 100 100 100 100 100 90 98.6%
MiniMax M2.7 100 100 100 100 100 100 90 98.6%
Stealth: Healer Alpha 100 100 100 100 100 100 90 98.6%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 100 100 90 98.6%
Ministral 8B 100 100 100 100 100 100 90 98.6%
Ministral 3B 100 100 100 100 100 100 90 98.6%
MoonshotAI: Kimi K2.5 100 100 100 100 100 90 90 97.1%
o4 Mini 100 100 100 100 100 90 90 97.1%
Z.AI GLM 4.5 Air 100 100 100 100 100 90 90 97.1%
Hermes 3 405B 100 100 100 100 100 90 90 97.1%
Arcee AI: Trinity Mini 100 100 100 100 100 90 90 97.1%
Qwen 3.6 27B 100 100 100 100 100 100 80 97.1%
DeepSeek V4 Flash (Reasoning) 100 100 100 100 100 90 90 97.1%
Mistral Small 4 (Reasoning) 100 100 100 100 100 90 90 97.1%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 100 70 95.7%
GPT-5 Nano 100 100 100 100 90 90 90 95.7%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 100 100 90 70 94.3%
Inception Mercury 2 100 100 100 100 90 90 80 94.3%
Ministral 3 3B 100 100 100 90 90 90 90 94.3%
Qwen 3 32B 100 100 100 90 90 90 80 92.9%
Skyfall 36B V2 100 100 90 90 90 90 80 91.4%
GPT-OSS 120B 100 90 90 90 90 90 80 90.0%
LFM2 24B 90 90 90 90 90 90 90 90.0%
DeepSeek V3.1 100 100 100 100 100 100 30 90.0%
GPT-5.4 Nano (Reasoning, Low) 100 90 90 90 90 80 80 88.6%
Nemotron 3 Nano 100 100 90 90 90 80 70 88.6%
ByteDance Seed 1.6 Flash 100 90 90 90 80 80 80 87.1%
Gemma 3 4B 90 90 90 90 80 80 80 85.7%
GPT-5.4 Nano 100 100 90 90 80 70 70 85.7%
Rocinante 12B 100 100 100 100 100 100 0 85.7%
Inception Mercury 100 90 80 80 80 80 80 84.3%
Cohere Command R+ (Aug. 2024) 100 100 90 80 60 60 50 77.1%
Z.AI GLM 4.7 Flash 100 90 90 80 80 80 20 77.1%
GPT-4.1 Nano 80 80 80 70 70 70 70 74.3%
GPT-4o Mini (temp=1) 80 80 70 70 70 70 70 72.9%
GPT-4o Mini (temp=0) 70 70 70 70 70 70 70 70.0%
Hermes 3 70B 100 100 100 0 0 0 0 42.9%