Dialogue content preserved

Test: Text Replacement

Avg. Score: 93.8%
Scenarios: 8

Overall Performance

| Rank ▲ | Model | Score | Avg. Cost | Avg. Time | Stability |
| --- | --- | --- | --- | --- | --- |
| 1 | Qwen 2.5 72B | 99.6% | $0.0003 | 12.0s | 96% |
| 2 | Gemini 3.1 Flash Lite (Preview) | 98.2% | $0.0011 | 1.9s | 92% |
| 3 | Gemini 3 Flash (Preview) | 98.6% | $0.0021 | 3.5s | 93% |
| 4 | Gemini 2.5 Flash Lite | 97.0% | $0.0003 | 1.9s | 91% |
| 5 | Qwen 3.5 Plus (2026-02-15) | 97.9% | $0.0017 | 7.8s | 92% |
| 6 | Writer: Palmyra X5 | 98.6% | $0.0040 | 11.7s | 93% |
| 7 | Mistral Large 3 | 97.5% | $0.0012 | 8.5s | 91% |
| 8 | Claude Haiku 4.5 | 97.5% | $0.0040 | 3.8s | 91% |
| 9 | Grok 4.20 (Beta) | 98.0% | $0.0037 | 2.0s | 88% |
| 10 | Gemma 3 4B | 96.8% | $0.0001 | 6.9s | 88% |
| 11 | Qwen3 235B A22B Instruct 2507 | 98.0% | $0.0004 | 15.1s | 88% |
| 12 | Mistral Large | 97.5% | $0.0049 | 8.3s | 91% |
| 13 | Mistral Large 2 | 97.5% | $0.0049 | 8.4s | 91% |
| 14 | Claude Opus 4.5 | 100.0% | $0.020 | 6.0s | 100% |
| 15 | Claude Opus 4.6 | 100.0% | $0.020 | 6.6s | 100% |
| 16 | Arcee AI: Trinity Large (Preview) | 97.9% | $0.0000 | 26.5s | 92% |
| 17 | Claude Sonnet 4.6 | 98.8% | $0.012 | 5.4s | 93% |
| 18 | Claude Sonnet 4.5 | 98.7% | $0.012 | 5.4s | 93% |
| 19 | Stealth: Healer Alpha | 97.5% | $0.0000 | 18.7s | 87% |
| 20 | GPT-5.4 Mini (Reasoning, Low) | 95.9% | $0.0037 | 4.5s | 89% |
| 21 | Grok 4 Fast | 96.4% | $0.0010 | 8.5s | 86% |
| 22 | Claude Sonnet 4 | 97.9% | $0.012 | 6.9s | 92% |
| 23 | Mistral Small 4 | 95.9% | $0.0005 | 3.9s | 82% |
| 24 | Claude 3.7 Sonnet | 97.5% | $0.012 | 6.5s | 91% |
| 25 | GPT-4o, May 13th (temp=1) | 97.1% | $0.012 | 4.0s | 91% |
| 26 | GPT-4o, May 13th (temp=0) | 97.0% | $0.012 | 3.8s | 91% |
| 27 | Gemini 2.5 Flash | 95.2% | $0.0017 | 2.4s | 83% |
| 28 | GPT-4.1 Mini | 94.8% | $0.0012 | 8.2s | 86% |
| 29 | GPT-5.4 Nano (Reasoning) | 94.8% | $0.0020 | 11.3s | 86% |
| 30 | GPT-5.4 Mini | 95.0% | $0.0031 | 2.5s | 83% |
| 31 | GPT-4.1 | 96.6% | $0.0060 | 4.8s | 82% |
| 32 | Mistral Small 3.2 24B | 97.3% | $0.0002 | 5.5s | 73% |
| 33 | Gemini 3 Flash (Preview, Reasoning) | 98.9% | $0.015 | 24.9s | 93% |
| 34 | Stealth: Hunter Alpha | 96.1% | $0.0000 | 19.9s | 81% |
| 35 | Grok 4.1 Fast | 95.7% | $0.0012 | 15.5s | 80% |
| 36 | Mistral Medium 3.1 | 93.8% | $0.0014 | 6.3s | 80% |
| 37 | Hermes 3 405B | 95.9% | $0.0013 | 25.3s | 82% |
| 38 | GPT-5.4 Mini (Reasoning) | 96.4% | $0.010 | 17.8s | 87% |
| 39 | GPT-5.4 | 95.9% | $0.010 | 6.5s | 83% |
| 40 | Gemma 3 12B | 93.9% | $0.0001 | 10.2s | 78% |
| 41 | GPT-5.4 (Reasoning, Low) | 96.2% | $0.013 | 8.2s | 84% |
| 42 | Ministral 3 14B | 93.2% | $0.0003 | 4.5s | 77% |
| 43 | Z.AI GLM 5 Turbo | 98.2% | $0.014 | 31.1s | 89% |
| 44 | Claude 3.5 Sonnet | 97.5% | $0.024 | 10.6s | 91% |
| 45 | GPT-5.4 Nano (Reasoning, Low) | 92.9% | $0.0009 | 5.5s | 76% |
| 46 | GPT-4o, Aug. 6th (temp=0) | 96.3% | $0.0074 | 3.0s | 73% |
| 47 | Llama 3.1 8B | 93.0% | $0.0001 | 11.5s | 76% |
| 48 | Grok 4.20 (Beta, Reasoning) | 97.9% | $0.022 | 13.6s | 89% |
| 49 | GPT-5.4 Nano | 91.8% | $0.0009 | 3.3s | 76% |
| 50 | Ministral 3 3B | 92.0% | $0.0001 | 2.3s | 74% |
| 51 | Mistral Small Creative | 91.6% | $0.0002 | 3.3s | 74% |
| 52 | DeepSeek V3.2 | 96.4% | $0.0005 | 52.3s | 82% |
| 53 | GPT-5 Mini | 96.8% | $0.0074 | 46.2s | 86% |
| 54 | Gemma 3 27B | 93.0% | $0.0002 | 18.5s | 76% |
| 55 | Z.AI GLM 4.6 | 96.1% | $0.0059 | 50.0s | 86% |
| 56 | GPT-5.1 | 97.7% | $0.021 | 22.6s | 86% |
| 57 | Ministral 3B | 90.9% | $0.0001 | 2.4s | 72% |
| 58 | Llama 3.1 70B | 93.0% | $0.0006 | 27.8s | 76% |
| 59 | MiniMax M2.5 | 96.1% | $0.0019 | 1.2m | 88% |
| 60 | DeepSeek V3 (2024-12-26) | 94.6% | $0.0009 | 17.7s | 67% |
| 61 | ByteDance Seed 2.0 Lite | 97.3% | $0.0058 | 1.1m | 85% |
| 62 | Llama 3.1 Nemotron 70B | 91.4% | $0.0016 | 16.4s | 75% |
| 63 | LFM2 24B | 92.1% | $0.0001 | 13.3s | 70% |
| 64 | Z.AI GLM 4.5 | 93.0% | $0.0044 | 36.1s | 80% |
| 65 | GPT-5.2 | 93.4% | $0.013 | 10.9s | 78% |
| 66 | Gemini 2.5 Flash (Reasoning) | 92.9% | $0.0087 | 15.8s | 76% |
| 67 | ByteDance Seed 1.6 | 95.5% | $0.0057 | 1.0m | 84% |
| 68 | Grok 4 | 99.1% | $0.033 | 41.7s | 94% |
| 69 | Qwen 3.5 Flash | 96.3% | $0.0037 | 1.1m | 79% |
| 70 | GPT-4o Mini (temp=1) | 89.3% | $0.0005 | 10.5s | 70% |
| 71 | Ministral 3 8B | 88.4% | $0.0002 | 3.6s | 69% |
| 72 | Ministral 8B | 87.9% | $0.0001 | 3.7s | 71% |
| 73 | GPT-4o, Aug. 6th (temp=1) | 93.7% | $0.0073 | 3.1s | 63% |
| 74 | Gemini 2.5 Pro | 98.8% | $0.039 | 28.9s | 93% |
| 75 | DeepSeek-V2 Chat | 93.4% | $0.0009 | 19.1s | 61% |
| 76 | GPT-4o Mini (temp=0) | 88.7% | $0.0004 | 10.7s | 69% |
| 77 | Arcee AI: Trinity Mini | 92.7% | $0.0002 | 7.9s | 55% |
| 78 | GPT-4.1 Nano | 88.0% | $0.0003 | 4.1s | 66% |
| 79 | Qwen 3.5 35B | 96.8% | $0.019 | 58.8s | 85% |
| 80 | Qwen 3 32B | 93.0% | $0.0009 | 43.2s | 68% |
| 81 | Z.AI GLM 5 | 99.3% | $0.014 | 1.9m | 95% |
| 82 | GPT-5.4 (Reasoning) | 96.2% | $0.028 | 25.1s | 79% |
| 83 | Gemini 3 Pro (Preview) | 100.0% | $0.053 | 35.9s | 100% |
| 84 | Inception Mercury 2 | 88.4% | $0.0022 | 3.1s | 59% |
| 85 | ByteDance Seed 1.6 Flash | 88.9% | $0.0009 | 17.2s | 61% |
| 86 | Aion 2.0 | 95.2% | $0.0051 | 1.1m | 69% |
| 87 | Mistral NeMO | 90.2% | $0.0002 | 2.8s | 50% |
| 88 | WizardLM 2 8x22b | 92.1% | $0.0009 | 51.4s | 65% |
| 89 | Mistral Small 4 (Reasoning) | 89.1% | $0.0024 | 22.3s | 61% |
| 90 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.063 | 29.0s | 100% |
| 91 | DeepSeek V3 (2025-03-24) | 92.1% | $0.0007 | 40.7s | 56% |
| 92 | Gemini 2.5 Flash Lite (Reasoning) | 88.4% | $0.0030 | 26.8s | 62% |
| 93 | Inception Mercury | 85.7% | $0.0005 | 4.2s | 55% |
| 94 | Z.AI GLM 4.7 | 98.8% | $0.011 | 2.4m | 91% |
| 95 | GPT-5 Nano | 90.9% | $0.0035 | 1.3m | 75% |
| 96 | Claude Opus 4 | 97.3% | $0.060 | 9.4s | 91% |
| 97 | Qwen 3.5 27B | 98.8% | $0.025 | 1.9m | 93% |
| 98 | GPT-5 | 98.0% | $0.039 | 59.9s | 86% |
| 99 | Qwen 3.5 122B | 98.0% | $0.035 | 1.4m | 90% |
| 100 | o4 Mini High | 95.9% | $0.034 | 52.1s | 79% |
| 101 | Claude Sonnet 4.6 (Reasoning) | 99.6% | $0.064 | 41.7s | 96% |
| 102 | o4 Mini | 91.6% | $0.019 | 29.9s | 62% |
| 103 | Qwen 3.5 397B A17B | 98.8% | $0.011 | 3.0m | 91% |
| 104 | ByteDance Seed 2.0 Mini | 93.7% | $0.0026 | 2.8m | 83% |
| 105 | DeepSeek V3.1 | 85.0% | $0.0007 | 37.9s | 41% |
| 106 | Z.AI GLM 4.7 Flash | 87.3% | $0.0023 | 1.6m | 55% |
| 107 | Qwen 3.5 9B | 89.6% | $0.0016 | 2.4m | 63% |
| 108 | Gemini 3.1 Pro (Preview) | 100.0% | $0.086 | 1.3m | 100% |
| 109 | MiniMax M2.7 | 91.4% | $0.0092 | 2.0m | 55% |
| 110 | Nemotron 3 Super | 83.9% | $0.0000 | 1.5m | 44% |
| 111 | MoonshotAI: Kimi K2.5 | 96.2% | $0.018 | 3.8m | 87% |
| 112 | Cohere Command R+ (Aug. 2024) | 72.0% | $0.0076 | 33.3s | 31% |
| 113 | Rocinante 12B | 65.0% | $0.0004 | 9.2s | 16% |
| 114 | Claude 3 Haiku | 64.8% | $0.0010 | 5.1s | 7% |
| 115 | Nemotron 3 Nano | 73.0% | $0.0026 | 3.2m | 40% |
| 116 | Hermes 3 70B | 63.9% | $0.0016 | 2.1m | 10% |

Individual Scenarios

Generic Prompt

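| Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 97.1% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 70 | 95.7% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 70 | 95.7% |
| Gemma 3 4B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7% |
| Ministral 3 3B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3% |
| Ministral 3B | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3% |
| MiniMax M2.5 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9% |
| GPT-5.4 Mini | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4% |
| Z.AI GLM 4.6 | 100 | 100 | 90 | 90 | 90 | 90 | 80 | 91.4% |
| Qwen 3 32B | 100 | 100 | 100 | 90 | 90 | 90 | 70 | 91.4% |
| GPT-4o Mini (temp=1) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4% |
| Inception Mercury 2 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Inception Mercury | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| GPT-4o Mini (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 90 | 40 | 90.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 90 | 90 | 80 | 60 | 88.6% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 90 | 80 | 70 | 70 | 87.1% |
| GPT-5.2 | 100 | 100 | 100 | 90 | 80 | 70 | 70 | 87.1% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 90 | 80 | 40 | 87.1% |
| Grok 4.1 Fast | 100 | 100 | 90 | 90 | 90 | 80 | 50 | 85.7% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 90 | 80 | 80 | 80 | 70 | 85.7% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 70 | 70 | 60 | 85.7% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 10 | 85.7% |
| Mistral NeMO | 100 | 100 | 80 | 80 | 80 | 80 | 80 | 85.7% |
| Llama 3.1 70B | 100 | 80 | 80 | 80 | 80 | 80 | 80 | 82.9% |
| Llama 3.1 8B | 100 | 100 | 90 | 80 | 70 | 70 | 70 | 82.9% |
| Z.AI GLM 4.5 | 90 | 90 | 90 | 80 | 70 | 70 | 70 | 80.0% |
| Nemotron 3 Super | 100 | 100 | 90 | 90 | 80 | 80 | 10 | 78.6% |
| GPT-5 Nano | 90 | 90 | 80 | 70 | 70 | 70 | 70 | 77.1% |
| Llama 3.1 Nemotron 70B | 90 | 80 | 80 | 80 | 70 | 70 | 70 | 77.1% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 71.4% |
| Ministral 8B | 80 | 80 | 80 | 70 | 70 | 60 | 60 | 71.4% |
| Ministral 3 8B | 80 | 70 | 70 | 70 | 70 | 60 | 50 | 67.1% |
| Nemotron 3 Nano | 100 | 100 | 90 | 80 | 60 | 10 | 0 | 62.9% |
| Cohere Command R+ (Aug. 2024) | 100 | 50 | 20 | 20 | 20 | 10 | 10 | 32.9% |
| Claude 3 Haiku | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 28.6% |
| Rocinante 12B | 90 | 20 | 20 | 20 | 10 | 10 | 0 | 24.3% |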
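| Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Ministral 3 3B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 60 | 94.3% |
| Ministral 3B | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7% |
| Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 50 | 40 | 84.3% |
| Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 80 | 0 | 82.9% |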
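| Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 95.7% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 95.7% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3% |
| Grok 4 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3% |
| MiniMax M2.7 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9% |
| MiniMax M2.5 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9% |
| Gemma 3 4B | 100 | 100 | 100 | 90 | 90 | 90 | 80 | 92.9% |
| GPT-5.2 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4% |
| Z.AI GLM 4.6 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4% |
| Z.AI GLM 4.5 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4% |
| GPT-4o, May 13th (temp=1) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4% |
| Writer: Palmyra X5 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4% |
| Arcee AI: Trinity Large (Preview) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4% |
| ByteDance Seed 1.6 Flash | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4% |
| GPT-4.1 Nano | 100 | 100 | 100 | 90 | 90 | 80 | 80 | 91.4% |
| GPT-5.4 Mini (Reasoning) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Claude Sonnet 4 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Claude Sonnet 4.5 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Claude Opus 4 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Qwen 3.5 Plus (2026-02-15) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| GPT-5.4 Mini (Reasoning, Low) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Mistral Large 3 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Claude 3.5 Sonnet | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Claude 3.7 Sonnet | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| GPT-4.1 Mini | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| GPT-5.4 Mini | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Mistral Large 2 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| GPT-5.4 Nano (Reasoning) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Gemini 2.5 Flash Lite | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Gemini 2.5 Flash | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Mistral Large | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| GPT-4o Mini (temp=1) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| GPT-4o Mini (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Mistral Small 4 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| LFM2 24B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 90 | 40 | 90.0% |
| Hermes 3 70B | 100 | 100 | 90 | 90 | 90 | 80 | 80 | 90.0% |
| ByteDance Seed 2.0 Mini | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6% |
| Gemini 2.5 Flash (Reasoning) | 100 | 90 | 90 | 90 | 90 | 80 | 80 | 88.6% |
| Inception Mercury 2 | 100 | 100 | 90 | 90 | 90 | 90 | 60 | 88.6% |
| GPT-5.4 Nano (Reasoning, Low) | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6% |
| Gemma 3 12B | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6% |
| GPT-5.4 Nano | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 90 | 90 | 90 | 80 | 70 | 88.6% |
| Gemma 3 27B | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6% |
| Llama 3.1 70B | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 87.1% |
| Llama 3.1 8B | 100 | 100 | 100 | 90 | 80 | 70 | 70 | 87.1% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 87.1% |
| Stealth: Hunter Alpha | 100 | 100 | 90 | 90 | 90 | 90 | 40 | 85.7% |
| Mistral Small 4 (Reasoning) | 90 | 90 | 90 | 90 | 90 | 80 | 70 | 85.7% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 90 | 90 | 90 | 0 | 81.4% |
| Llama 3.1 Nemotron 70B | 90 | 90 | 80 | 80 | 80 | 70 | 60 | 78.6% |
| GPT-4o, Aug. 6th (temp=1) | 90 | 90 | 90 | 90 | 90 | 90 | 0 | 77.1% |
| o4 Mini | 100 | 90 | 90 | 90 | 90 | 40 | 30 | 75.7% |
| WizardLM 2 8x22b | 100 | 90 | 90 | 90 | 90 | 40 | 0 | 71.4% |
| Mistral Medium 3.1 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0% |
| Ministral 3 14B | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0% |
| Ministral 3 8B | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0% |
| Ministral 8B | 70 | 70 | 70 | 70 | 70 | 70 | 60 | 68.6% |
| Ministral 3 3B | 80 | 70 | 70 | 70 | 60 | 60 | 60 | 67.1% |
| Nemotron 3 Nano | 90 | 90 | 90 | 90 | 40 | 40 | 10 | 64.3% |
| Nemotron 3 Super | 100 | 100 | 100 | 90 | 30 | 30 | 0 | 64.3% |
| Mistral Small Creative | 70 | 70 | 70 | 60 | 60 | 60 | 60 | 64.3% |
| Ministral 3B | 70 | 70 | 70 | 70 | 60 | 60 | 50 | 64.3% |
| Mistral NeMO | 80 | 80 | 70 | 20 | 0 | 0 | 0 | 35.7% |
| Rocinante 12B | 60 | 30 | 20 | 20 | 10 | 0 | 0 | 20.0% |
| Cohere Command R+ (Aug. 2024) | 30 | 20 | 10 | 10 | 10 | 10 | 10 | 14.3% |
| Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |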
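| Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 97.1% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 95.7% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3% |
| Ministral 3 3B | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 90 | 90 | 70 | 92.9% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9% |
| Llama 3.1 8B | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9% |
| Gemini 2.5 Pro | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4% |
| Qwen 3.5 27B | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4% |
| GPT-4.1 Nano | 100 | 100 | 90 | 90 | 90 | 90 | 80 | 91.4% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 90 | 80 | 70 | 91.4% |
| Claude Sonnet 4.6 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| MiniMax M2.5 | 100 | 100 | 90 | 90 | 90 | 90 | 70 | 90.0% |
| Claude Opus 4 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Gemini 3.1 Flash Lite (Preview) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| GPT-4o, May 13th (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Gemini 3 Flash (Preview) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Claude Haiku 4.5 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 90 | 90 | 90 | 90 | 90 | 80 | 90.0% |
| GPT-4o, Aug. 6th (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Llama 3.1 70B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Mistral Medium 3.1 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Llama 3.1 Nemotron 70B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Ministral 3B | 100 | 100 | 90 | 90 | 90 | 80 | 80 | 90.0% |
| LFM2 24B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 90 | 90 | 90 | 90 | 70 | 90.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 90 | 80 | 80 | 80 | 90.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 90 | 90 | 80 | 70 | 90.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 90 | 70 | 70 | 90.0% |
| GPT-5.4 Nano (Reasoning) | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6% |
| Mistral Small Creative | 100 | 90 | 90 | 90 | 90 | 90 | 70 | 88.6% |
| GPT-5 | 100 | 100 | 100 | 100 | 80 | 70 | 70 | 88.6% |
| GPT-5.4 Mini (Reasoning, Low) | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 80 | 80 | 80 | 80 | 88.6% |
| Z.AI GLM 5 Turbo | 100 | 100 | 90 | 90 | 80 | 80 | 70 | 87.1% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 90 | 90 | 80 | 80 | 70 | 87.1% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 80 | 80 | 80 | 70 | 87.1% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 90 | 90 | 80 | 80 | 70 | 87.1% |
| MoonshotAI: Kimi K2.5 | 100 | 90 | 90 | 90 | 90 | 70 | 70 | 85.7% |
| GPT-5.1 | 100 | 100 | 100 | 80 | 80 | 70 | 70 | 85.7% |
| Grok 4.20 (Beta, Reasoning) | 100 | 90 | 90 | 80 | 80 | 80 | 80 | 85.7% |
| Grok 4.20 (Beta) | 90 | 90 | 90 | 90 | 90 | 90 | 60 | 85.7% |
| Inception Mercury 2 | 90 | 90 | 90 | 90 | 90 | 80 | 70 | 85.7% |
| Hermes 3 70B | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 85.7% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7% |
| ByteDance Seed 2.0 Mini | 100 | 90 | 90 | 80 | 80 | 80 | 70 | 84.3% |
| GPT-5.4 Mini | 100 | 100 | 90 | 90 | 90 | 70 | 50 | 84.3% |
| Qwen 3.5 Flash | 90 | 90 | 90 | 90 | 80 | 70 | 70 | 82.9% |
| Grok 4.1 Fast | 100 | 100 | 90 | 80 | 70 | 70 | 70 | 82.9% |
| Grok 4 Fast | 100 | 90 | 90 | 80 | 80 | 70 | 70 | 82.9% |
| Qwen 3.5 35B | 100 | 90 | 90 | 80 | 80 | 70 | 70 | 82.9% |
| GPT-5 Mini | 90 | 90 | 90 | 80 | 80 | 70 | 70 | 81.4% |
| DeepSeek-V2 Chat | 90 | 90 | 90 | 90 | 70 | 70 | 70 | 81.4% |
| GPT-5.4 Nano (Reasoning, Low) | 90 | 90 | 90 | 80 | 80 | 70 | 70 | 81.4% |
| GPT-4.1 Mini | 90 | 90 | 80 | 80 | 80 | 80 | 70 | 81.4% |
| Aion 2.0 | 100 | 90 | 90 | 80 | 80 | 70 | 50 | 80.0% |
| o4 Mini High | 90 | 90 | 80 | 80 | 80 | 70 | 70 | 80.0% |
| Gemini 2.5 Flash | 90 | 90 | 80 | 80 | 80 | 80 | 60 | 80.0% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 90 | 80 | 50 | 40 | 80.0% |
| ByteDance Seed 1.6 | 90 | 80 | 80 | 80 | 80 | 70 | 70 | 78.6% |
| ByteDance Seed 1.6 Flash | 80 | 80 | 80 | 80 | 80 | 80 | 70 | 78.6% |
| Qwen 3 32B | 90 | 90 | 90 | 80 | 70 | 70 | 60 | 78.6% |
| Z.AI GLM 4.5 | 90 | 80 | 80 | 80 | 70 | 70 | 70 | 77.1% |
| Z.AI GLM 4.7 Flash | 90 | 90 | 80 | 80 | 80 | 60 | 60 | 77.1% |
| GPT-5.4 | 100 | 80 | 80 | 70 | 70 | 70 | 70 | 77.1% |
| o4 Mini | 90 | 90 | 80 | 80 | 70 | 70 | 50 | 75.7% |
| DeepSeek V3 (2025-03-24) | 90 | 90 | 90 | 90 | 90 | 70 | 10 | 75.7% |
| DeepSeek V3 (2024-12-26) | 90 | 90 | 90 | 90 | 70 | 50 | 50 | 75.7% |
| Hermes 3 405B | 80 | 80 | 80 | 80 | 80 | 70 | 60 | 75.7% |
| DeepSeek V3.2 | 90 | 90 | 80 | 70 | 70 | 70 | 60 | 75.7% |
| GPT-4.1 | 90 | 80 | 70 | 70 | 70 | 70 | 70 | 74.3% |
| GPT-5.4 Nano | 80 | 80 | 80 | 70 | 70 | 70 | 70 | 74.3% |
| WizardLM 2 8x22b | 90 | 90 | 80 | 70 | 70 | 60 | 60 | 74.3% |
| GPT-5.4 (Reasoning) | 100 | 80 | 70 | 70 | 70 | 70 | 50 | 72.9% |
| Gemini 2.5 Flash (Reasoning) | 90 | 90 | 80 | 70 | 70 | 50 | 50 | 71.4% |
| GPT-4o Mini (temp=1) | 80 | 70 | 70 | 70 | 70 | 70 | 70 | 71.4% |
| Rocinante 12B | 100 | 100 | 100 | 100 | 80 | 20 | 0 | 71.4% |
| GPT-5.2 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0% |
| GPT-4o Mini (temp=0) | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0% |
| Gemma 3 12B | 80 | 80 | 80 | 80 | 70 | 50 | 50 | 70.0% |
| Inception Mercury | 80 | 70 | 70 | 70 | 70 | 60 | 50 | 67.1% |
| Gemma 3 27B | 90 | 80 | 80 | 60 | 60 | 50 | 50 | 67.1% |
| GPT-5 Nano | 80 | 70 | 70 | 70 | 70 | 60 | 50 | 67.1% |
| Qwen 3.5 9B | 90 | 70 | 70 | 70 | 50 | 50 | 50 | 64.3% |
| Gemini 2.5 Flash Lite (Reasoning) | 90 | 80 | 70 | 60 | 50 | 50 | 50 | 64.3% |
| Nemotron 3 Super | 70 | 70 | 70 | 70 | 70 | 70 | 10 | 61.4% |
| Nemotron 3 Nano | 80 | 80 | 70 | 70 | 70 | 50 | 10 | 61.4% |
| DeepSeek V3.1 | 70 | 70 | 50 | 50 | 50 | 50 | 0 | 48.6% |
| Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |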
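
The Avg column in the scenario tables appears to be the arithmetic mean of the seven per-column scores (# 1 through # 7), rounded to one decimal place. This is inferred from the numbers shown, not stated by the page. For example, the DeepSeek V3.1 row directly above (70, 70, 50, 50, 50, 50, 0) gives 340 / 7 ≈ 48.6, and the Rocinante 12B row (100, 100, 100, 100, 80, 20, 0) gives 500 / 7 ≈ 71.4. A minimal Python sketch of that check, where `row_average` is a hypothetical helper name:

```python
def row_average(scores):
    """Mean of the per-column scores, rounded to one decimal place.

    Assumption: the Avg column is a plain arithmetic mean; the page
    does not document how it is computed.
    """
    return round(sum(scores) / len(scores), 1)


# Rows copied from the table directly above.
rows = {
    "DeepSeek V3.1": [70, 70, 50, 50, 50, 50, 0],
    "Rocinante 12B": [100, 100, 100, 100, 80, 20, 0],
}

for model, scores in rows.items():
    print(f"{model}: {row_average(scores):.1f}%")
```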

Specific Prompt

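| Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 97.1% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 95.7% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 70 | 95.7% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 90 | 70 | 94.3% |
| GPT-4o Mini (temp=1) | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 90 | 90 | 90 | 80 | 92.9% |
| Inception Mercury 2 | 100 | 100 | 100 | 90 | 90 | 80 | 80 | 91.4% |
| GPT-4o Mini (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 90 | 80 | 80 | 70 | 88.6% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 20 | 88.6% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 87.1% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 90 | 90 | 80 | 70 | 70 | 85.7% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 90 | 80 | 70 | 50 | 84.3% |
| Inception Mercury | 100 | 100 | 90 | 80 | 80 | 70 | 70 | 84.3% |
| LFM2 24B | 90 | 90 | 90 | 90 | 90 | 70 | 20 | 77.1% |
| Nemotron 3 Nano | 90 | 90 | 90 | 90 | 60 | 50 | 40 | 72.9% |
| ByteDance Seed 1.6 Flash | 100 | 80 | 80 | 80 | 80 | 60 | 10 | 70.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 80 | 70 | 10 | 10 | 67.1% |
| Z.AI GLM 4.7 Flash | 100 | 90 | 90 | 80 | 70 | 30 | 10 | 67.1% |
| GPT-4.1 Nano | 90 | 80 | 60 | 60 | 60 | 60 | 60 | 67.1% |
| Hermes 3 70B | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 12.9% |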
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100100100.0%
GPT-5.1100100100100100100100100.0%
Claude Opus 4.6100100100100100100100100.0%
GPT-5100100100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100100100.0%
Qwen 3.5 122B100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100.0%
Claude Sonnet 4.6100100100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100100100.0%
Qwen 3.5 27B100100100100100100100100.0%
ByteDance Seed 1.6100100100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100100100.0%
o4 Mini High100100100100100100100100.0%
GPT-5.2100100100100100100100100.0%
Claude Opus 4.5100100100100100100100100.0%
Grok 4.1 Fast100100100100100100100100.0%
Aion 2.0100100100100100100100100.0%
Z.AI GLM 4.6100100100100100100100100.0%
MiniMax M2.7100100100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100100100.0%
Claude Sonnet 4100100100100100100100100.0%
MiniMax M2.5100100100100100100100100.0%
Z.AI GLM 4.7100100100100100100100100.0%
GPT-4.1100100100100100100100100.0%
Gemini 2.5 Pro100100100100100100100100.0%
o4 Mini100100100100100100100100.0%
Grok 4100100100100100100100100.0%
Claude Sonnet 4.5100100100100100100100100.0%
Qwen 3.5 35B100100100100100100100100.0%
Claude Opus 4100100100100100100100100.0%
Stealth: Hunter Alpha100100100100100100100100.0%
ByteDance Seed 2.0 Mini 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Flash 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.5 100 100 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100 100 100.0%
Stealth: Healer Alpha 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
Mistral Large 3 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview) 100 100 100 100 100 100 100 100.0%
Claude Haiku 4.5 100 100 100 100 100 100 100 100.0%
DeepSeek-V2 Chat 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 Flash 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Lite 100 100 100 100 100 100 100 100.0%
Nemotron 3 Super 100 100 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100 100 100.0%
Claude 3.5 Sonnet 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta) 100 100 100 100 100 100 100 100.0%
Inception Mercury 2 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=1) 100 100 100 100 100 100 100 100.0%
DeepSeek V3 (2024-12-26) 100 100 100 100 100 100 100 100.0%
Claude 3.7 Sonnet 100 100 100 100 100 100 100 100.0%
GPT-4.1 Mini 100 100 100 100 100 100 100 100.0%
Hermes 3 405B 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=1) 100 100 100 100 100 100 100 100.0%
GPT-5 Nano 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=0) 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini 100 100 100 100 100 100 100 100.0%
Mistral Large 2 100 100 100 100 100 100 100 100.0%
Mistral Small 4 (Reasoning) 100 100 100 100 100 100 100 100.0%
DeepSeek V3.2 100 100 100 100 100 100 100 100.0%
Qwen 3 32B 100 100 100 100 100 100 100 100.0%
DeepSeek V3 (2025-03-24) 100 100 100 100 100 100 100 100.0%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash Lite 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash 100 100 100 100 100 100 100 100.0%
Mistral Large 100 100 100 100 100 100 100 100.0%
Qwen3 235B A22B Instruct 2507 100 100 100 100 100 100 100 100.0%
Writer: Palmyra X5 100 100 100 100 100 100 100 100.0%
GPT-5.4 Nano (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
GPT-4o Mini (temp=1) 100 100 100 100 100 100 100 100.0%
Gemma 3 12B 100 100 100 100 100 100 100 100.0%
Llama 3.1 70B 100 100 100 100 100 100 100 100.0%
GPT-4o Mini (temp=0) 100 100 100 100 100 100 100 100.0%
Gemma 3 27B 100 100 100 100 100 100 100 100.0%
Mistral Medium 3.1 100 100 100 100 100 100 100 100.0%
Mistral Small 4 100 100 100 100 100 100 100 100.0%
Qwen 2.5 72B 100 100 100 100 100 100 100 100.0%
Llama 3.1 Nemotron 70B 100 100 100 100 100 100 100 100.0%
GPT-5.4 Nano 100 100 100 100 100 100 100 100.0%
Arcee AI: Trinity Large (Preview) 100 100 100 100 100 100 100 100.0%
ByteDance Seed 1.6 Flash 100 100 100 100 100 100 100 100.0%
Mistral Small Creative 100 100 100 100 100 100 100 100.0%
Ministral 3 14B 100 100 100 100 100 100 100 100.0%
GPT-4.1 Nano 100 100 100 100 100 100 100 100.0%
Ministral 3 8B 100 100 100 100 100 100 100 100.0%
Claude 3 Haiku 100 100 100 100 100 100 100 100.0%
WizardLM 2 8x22b 100 100 100 100 100 100 100 100.0%
Arcee AI: Trinity Mini 100 100 100 100 100 100 100 100.0%
Cohere Command R+ (Aug. 2024) 100 100 100 100 100 100 100 100.0%
Gemma 3 4B 100 100 100 100 100 100 100 100.0%
Ministral 3 3B 100 100 100 100 100 100 100 100.0%
Mistral NeMO 100 100 100 100 100 100 100 100.0%
Ministral 8B 100 100 100 100 100 100 100 100.0%
Llama 3.1 8B 100 100 100 100 100 100 100 100.0%
Ministral 3B 100 100 100 100 100 100 100 100.0%
LFM2 24B 100 100 100 100 100 100 100 100.0%
GPT-5 Mini 100 100 100 100 100 100 90 98.6%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 100 100 90 98.6%
Qwen 3.5 9B 100 100 100 100 100 100 60 94.3%
DeepSeek V3.1 100 100 100 100 100 100 50 92.9%
Inception Mercury 100 100 100 100 100 100 10 87.1%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 100 100 100 0 85.7%
Mistral Small 3.2 24B 100 100 100 100 100 100 0 85.7%
Rocinante 12B 100 100 100 100 100 70 0 81.4%
Nemotron 3 Nano 100 100 90 70 40 40 40 68.6%
Hermes 3 70B 100 100 100 90 0 0 0 55.7%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 Turbo 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100.0%
GPT-5 100 100 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100 100 100.0%
Qwen 3.5 27B 100 100 100 100 100 100 100 100.0%
ByteDance Seed 1.6 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100 100 100.0%
Gemini 3 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 100 100 100 100 100 100 100 100.0%
GPT-4.1 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100 100 100.0%
o4 Mini 100 100 100 100 100 100 100 100.0%
Grok 4 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100 100 100.0%
Claude Opus 4 100 100 100 100 100 100 100 100.0%
Stealth: Healer Alpha 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview) 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta) 100 100 100 100 100 100 100 100.0%
Hermes 3 405B 100 100 100 100 100 100 100 100.0%
DeepSeek V3.2 100 100 100 100 100 100 100 100.0%
GPT-4o Mini (temp=0) 100 100 100 100 100 100 100 100.0%
Qwen 2.5 72B 100 100 100 100 100 100 100 100.0%
Gemma 3 4B 100 100 100 100 100 100 100 100.0%
Mistral NeMO 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100 90 98.6%
Qwen 3.5 122B 100 100 100 100 100 100 90 98.6%
o4 Mini High 100 100 100 100 100 100 90 98.6%
GPT-5.2 100 100 100 100 100 100 90 98.6%
Z.AI GLM 4.6 100 100 100 100 100 100 90 98.6%
Qwen 3.5 35B 100 100 100 100 100 100 90 98.6%
Grok 4 Fast 100 100 100 100 100 100 90 98.6%
DeepSeek V3 (2024-12-26) 100 100 100 100 100 100 90 98.6%
GPT-4o, Aug. 6th (temp=0) 100 100 100 100 100 100 90 98.6%
Qwen3 235B A22B Instruct 2507 100 100 100 100 100 100 90 98.6%
GPT-5 Mini 100 100 100 100 100 90 90 97.1%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 100 100 90 90 97.1%
Writer: Palmyra X5 100 100 100 100 100 90 90 97.1%
ByteDance Seed 1.6 Flash 100 100 100 100 100 90 90 97.1%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 100 90 90 97.1%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 90 90 97.1%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 90 90 97.1%
WizardLM 2 8x22b 100 100 100 100 100 90 90 97.1%
MoonshotAI: Kimi K2.5 100 100 100 100 90 90 90 95.7%
Z.AI GLM 4.5 100 100 100 100 90 90 90 95.7%
Claude Haiku 4.5 100 100 100 100 90 90 90 95.7%
GPT-5.4 100 100 100 100 90 90 90 95.7%
Gemini 2.5 Flash 100 100 100 100 90 90 90 95.7%
GPT-4o Mini (temp=1) 100 100 100 100 90 90 90 95.7%
GPT-5.4 Mini (Reasoning) 100 100 100 90 90 90 90 94.3%
GPT-5 Nano 100 100 100 100 90 90 80 94.3%
GPT-5.4 Mini 100 100 100 90 90 90 90 94.3%
Claude Sonnet 4 100 100 90 90 90 90 90 92.9%
ByteDance Seed 2.0 Mini 100 100 90 90 90 90 90 92.9%
GPT-4o, May 13th (temp=0) 100 100 90 90 90 90 90 92.9%
GPT-4o, May 13th (temp=1) 100 100 90 90 90 90 90 92.9%
Mistral Small 3.2 24B 100 100 90 90 90 90 90 92.9%
GPT-5.4 (Reasoning, Low) 100 100 90 90 90 90 90 92.9%
MiniMax M2.5 100 100 90 90 90 90 90 92.9%
Qwen 3.5 Plus (2026-02-15) 100 100 90 90 90 90 90 92.9%
ByteDance Seed 2.0 Lite 100 100 100 100 100 90 60 92.9%
GPT-5.4 Nano (Reasoning, Low) 100 100 90 90 90 90 90 92.9%
Gemma 3 12B 100 100 90 90 90 90 90 92.9%
Arcee AI: Trinity Mini 100 100 90 90 90 90 90 92.9%
Stealth: Hunter Alpha 100 90 90 90 90 90 90 91.4%
Qwen 3.5 Flash 100 100 100 100 100 100 40 91.4%
GPT-5.4 Mini (Reasoning, Low) 100 90 90 90 90 90 90 91.4%
GPT-5.4 Nano 100 90 90 90 90 90 90 91.4%
Arcee AI: Trinity Large (Preview) 100 90 90 90 90 90 90 91.4%
Mistral Large 3 90 90 90 90 90 90 90 90.0%
Claude 3.5 Sonnet 90 90 90 90 90 90 90 90.0%
Claude 3.7 Sonnet 90 90 90 90 90 90 90 90.0%
GPT-4.1 Mini 90 90 90 90 90 90 90 90.0%
Mistral Large 2 90 90 90 90 90 90 90 90.0%
Gemini 2.5 Flash Lite 90 90 90 90 90 90 90 90.0%
Mistral Large 90 90 90 90 90 90 90 90.0%
Llama 3.1 70B 90 90 90 90 90 90 90 90.0%
Gemma 3 27B 90 90 90 90 90 90 90 90.0%
Mistral Medium 3.1 90 90 90 90 90 90 90 90.0%
Claude 3 Haiku 90 90 90 90 90 90 90 90.0%
LFM2 24B 90 90 90 90 90 90 90 90.0%
Ministral 3 3B 90 90 90 90 90 90 80 88.6%
Mistral Small 4 90 90 90 90 90 90 80 88.6%
Ministral 3B 100 90 90 90 90 80 80 88.6%
DeepSeek V3.1 100 100 100 100 100 100 10 87.1%
Z.AI GLM 4.7 Flash 100 100 100 100 90 90 20 85.7%
Qwen 3 32B 100 100 100 100 100 100 0 85.7%
Llama 3.1 Nemotron 70B 90 90 90 90 80 80 80 85.7%
Aion 2.0 100 100 100 100 100 90 0 84.3%
GPT-4o, Aug. 6th (temp=1) 100 100 100 100 90 90 0 82.9%
DeepSeek V3 (2025-03-24) 100 100 100 100 100 80 0 82.9%
Llama 3.1 8B 100 100 90 90 80 80 40 82.9%
Nemotron 3 Super 100 100 100 100 90 40 40 81.4%
Nemotron 3 Nano 100 100 100 90 90 50 40 81.4%
GPT-4.1 Nano 90 90 80 80 80 70 70 80.0%
Mistral Small Creative 80 80 80 80 80 80 80 80.0%
Mistral Small 4 (Reasoning) 100 90 90 90 90 90 0 78.6%
MiniMax M2.7 100 100 100 100 100 40 0 77.1%
Ministral 3 14B 80 80 80 80 80 70 70 77.1%
Qwen 3.5 9B 100 100 100 100 40 40 40 74.3%
Inception Mercury 100 100 100 100 40 40 40 74.3%
Cohere Command R+ (Aug. 2024) 100 100 70 70 60 60 40 71.4%
Ministral 3 8B 70 70 70 70 70 70 70 70.0%
Ministral 8B 70 70 70 70 70 70 70 70.0%
DeepSeek-V2 Chat 100 100 100 90 90 0 0 68.6%
Inception Mercury 2 100 100 40 40 40 40 40 57.1%
Rocinante 12B 100 90 90 60 40 0 0 54.3%
Hermes 3 70B 90 80 0 0 0 0 0 24.3%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 Turbo 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5 Mini 100 100 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100.0%
GPT-5 100 100 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100 100 100.0%
Qwen 3.5 122B 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100 100 100.0%
Qwen 3.5 27B 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 100 100 100 100.0%
o4 Mini High 100 100 100 100 100 100 100 100.0%
GPT-5.2 100 100 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100 100 100.0%
Aion 2.0 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.6 100 100 100 100 100 100 100 100.0%
Gemini 3 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4 100 100 100 100 100 100 100 100.0%
MiniMax M2.5 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 100 100 100 100 100 100 100 100.0%
GPT-4.1 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100 100 100.0%
Grok 4 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100 100 100.0%
Qwen 3.5 35B 100 100 100 100 100 100 100 100.0%
Claude Opus 4 100 100 100 100 100 100 100 100.0%
Stealth: Hunter Alpha 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Mini 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Flash 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.5 100 100 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 100 100 100 100.0%
Qwen 3.5 9B 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 100 100 100.0%
Mistral Large 3 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview) 100 100 100 100 100 100 100 100.0%
Claude Haiku 4.5 100 100 100 100 100 100 100 100.0%
DeepSeek-V2 Chat 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Lite 100 100 100 100 100 100 100 100.0%
Nemotron 3 Super 100 100 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100 100 100.0%
Claude 3.5 Sonnet 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta) 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=1) 100 100 100 100 100 100 100 100.0%
DeepSeek V3 (2024-12-26) 100 100 100 100 100 100 100 100.0%
Claude 3.7 Sonnet 100 100 100 100 100 100 100 100.0%
GPT-4.1 Mini 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=1) 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=0) 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini 100 100 100 100 100 100 100 100.0%
Mistral Large 2 100 100 100 100 100 100 100 100.0%
DeepSeek V3.2 100 100 100 100 100 100 100 100.0%
DeepSeek V3 (2025-03-24) 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash Lite 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash 100 100 100 100 100 100 100 100.0%
Mistral Large 100 100 100 100 100 100 100 100.0%
Qwen3 235B A22B Instruct 2507 100 100 100 100 100 100 100 100.0%
Writer: Palmyra X5 100 100 100 100 100 100 100 100.0%
Mistral Small 3.2 24B 100 100 100 100 100 100 100 100.0%
Gemma 3 12B 100 100 100 100 100 100 100 100.0%
Llama 3.1 70B 100 100 100 100 100 100 100 100.0%
Gemma 3 27B 100 100 100 100 100 100 100 100.0%
Mistral Medium 3.1 100 100 100 100 100 100 100 100.0%
Mistral Small 4 100 100 100 100 100 100 100 100.0%
Qwen 2.5 72B 100 100 100 100 100 100 100 100.0%
Llama 3.1 Nemotron 70B 100 100 100 100 100 100 100 100.0%
Arcee AI: Trinity Large (Preview) 100 100 100 100 100 100 100 100.0%
Mistral Small Creative 100 100 100 100 100 100 100 100.0%
Ministral 3 14B 100 100 100 100 100 100 100 100.0%
Ministral 3 8B 100 100 100 100 100 100 100 100.0%
Claude 3 Haiku 100 100 100 100 100 100 100 100.0%
WizardLM 2 8x22b 100 100 100 100 100 100 100 100.0%
Mistral NeMO 100 100 100 100 100 100 100 100.0%
Llama 3.1 8B 100 100 100 100 100 100 100 100.0%
ByteDance Seed 1.6 100 100 100 100 100 100 90 98.6%
Grok 4.1 Fast 100 100 100 100 100 100 90 98.6%
MiniMax M2.7 100 100 100 100 100 100 90 98.6%
Stealth: Healer Alpha 100 100 100 100 100 100 90 98.6%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 100 100 90 98.6%
Ministral 8B 100 100 100 100 100 100 90 98.6%
Ministral 3B 100 100 100 100 100 100 90 98.6%
MoonshotAI: Kimi K2.5 100 100 100 100 100 90 90 97.1%
o4 Mini 100 100 100 100 100 90 90 97.1%
Mistral Small 4 (Reasoning) 100 100 100 100 100 90 90 97.1%
Hermes 3 405B 100 100 100 100 100 90 90 97.1%
Arcee AI: Trinity Mini 100 100 100 100 100 90 90 97.1%
GPT-5 Nano 100 100 100 100 90 90 90 95.7%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 100 70 95.7%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 100 100 90 70 94.3%
Inception Mercury 2 100 100 100 100 90 90 80 94.3%
Ministral 3 3B 100 100 100 90 90 90 90 94.3%
Qwen 3 32B 100 100 100 90 90 90 80 92.9%
LFM2 24B 90 90 90 90 90 90 90 90.0%
DeepSeek V3.1 100 100 100 100 100 100 30 90.0%
GPT-5.4 Nano (Reasoning, Low) 100 90 90 90 90 80 80 88.6%
Nemotron 3 Nano 100 100 90 90 90 80 70 88.6%
ByteDance Seed 1.6 Flash 100 90 90 90 80 80 80 87.1%
GPT-5.4 Nano 100 100 90 90 80 70 70 85.7%
Gemma 3 4B 90 90 90 90 80 80 80 85.7%
Rocinante 12B 100 100 100 100 100 100 0 85.7%
Inception Mercury 100 90 80 80 80 80 80 84.3%
Z.AI GLM 4.7 Flash 100 90 90 80 80 80 20 77.1%
Cohere Command R+ (Aug. 2024) 100 100 90 80 60 60 50 77.1%
GPT-4.1 Nano 80 80 80 70 70 70 70 74.3%
GPT-4o Mini (temp=1) 80 80 70 70 70 70 70 72.9%
GPT-4o Mini (temp=0) 70 70 70 70 70 70 70 70.0%
Hermes 3 70B 100 100 100 0 0 0 0 42.9%