Non-passive narration preserved

Test: Text Replacement

Avg. Score: 87.1%
Scenarios: 2

Overall Performance

| Rank | Model | Score | Avg. Cost | Avg. Time | Stability |
|---|---|---|---|---|---|
| 1 | Claude Haiku 4.5 | 100.0% | $0.0051 | 5.8s | 100% |
| 2 | Gemma 4 26B | 100.0% | $0.0003 | 27.1s | 100% |
| 3 | Claude Sonnet 4.6 | 100.0% | $0.015 | 7.5s | 100% |
| 4 | Claude Sonnet 4 | 100.0% | $0.015 | 9.4s | 100% |
| 5 | Qwen 3.5 Plus (2026-02-15) | 99.1% | $0.0022 | 10.2s | 94% |
| 6 | Gemini 2.5 Flash Lite | 98.2% | $0.0004 | 2.5s | 91% |
| 7 | Claude Opus 4.5 | 100.0% | $0.026 | 8.1s | 100% |
| 8 | Claude Opus 4.6 | 100.0% | $0.026 | 8.7s | 100% |
| 9 | Gemma 4 31B | 99.1% | $0.0004 | 44.2s | 94% |
| 10 | Gemini 2.5 Flash | 95.5% | $0.0021 | 3.0s | 88% |
| 11 | Grok 4.20 (Beta) | 95.5% | $0.0047 | 2.5s | 88% |
| 12 | Claude Opus 4.7 | 100.0% | $0.036 | 7.3s | 100% |
| 13 | Xiaomi MIMO v2.5 | 94.6% | $0.0040 | 15.0s | 88% |
| 14 | Claude Sonnet 4.5 | 96.4% | $0.015 | 7.2s | 89% |
| 15 | Qwen 2.5 72B | 94.6% | $0.0004 | 16.3s | 84% |
| 16 | Mistral Small 3.2 24B | 93.8% | $0.0003 | 6.9s | 82% |
| 17 | Gemini 3 Flash (Preview) | 93.8% | $0.0027 | 4.3s | 82% |
| 18 | Grok 4.20 | 93.8% | $0.0029 | 6.5s | 82% |
| 19 | Mistral Large 3 | 93.8% | $0.0016 | 10.5s | 82% |
| 20 | DeepSeek V3.2 | 95.5% | $0.0007 | 53.2s | 88% |
| 21 | Grok 4 Fast | 93.8% | $0.0014 | 11.7s | 82% |
| 22 | Writer: Palmyra X5 | 93.8% | $0.0050 | 13.7s | 84% |
| 23 | MiniMax M2.5 | 97.3% | $0.0021 | 1.3m | 90% |
| 24 | Gemini 3.5 Flash (Reasoning, Minimal) | 93.8% | $0.0081 | 3.5s | 82% |
| 25 | Ministral 3 14B | 92.9% | $0.0003 | 5.7s | 79% |
| 26 | Grok 4.1 Fast | 93.8% | $0.0017 | 22.5s | 82% |
| 27 | Mistral Large | 93.8% | $0.0062 | 10.4s | 82% |
| 28 | GPT-4o Mini (temp=1) | 87.5% | $0.0006 | 12.4s | 88% |
| 29 | Mistral Large 2 | 93.8% | $0.0062 | 10.5s | 82% |
| 30 | GPT-4o Mini (temp=0) | 87.5% | $0.0006 | 13.8s | 88% |
| 31 | LFM2 24B | 87.5% | $0.0001 | 15.5s | 88% |
| 32 | ByteDance Seed 1.6 Flash | 92.9% | $0.0012 | 21.9s | 82% |
| 33 | DeepSeek V4 Pro | 92.9% | $0.0020 | 22.3s | 82% |
| 34 | Gemma 3 12B | 92.9% | $0.0001 | 12.8s | 77% |
| 35 | Gemini 3.1 Flash Lite (Preview) | 89.3% | $0.0013 | 2.4s | 80% |
| 36 | Gemini 3.1 Flash Lite | 89.3% | $0.0013 | 2.4s | 80% |
| 37 | Gemini 3.1 Flash Lite (Reasoning) | 88.4% | $0.0013 | 8.2s | 82% |
| 38 | GPT-4.1 Mini | 88.4% | $0.0015 | 10.3s | 82% |
| 39 | Hermes 3 405B | 87.5% | $0.0017 | 31.7s | 88% |
| 40 | Cydonia 24B V4.1 | 92.0% | $0.0006 | 18.0s | 78% |
| 41 | Xiaomi MIMO v2.5 Pro | 93.8% | $0.0073 | 32.3s | 82% |
| 42 | GPT-5.4 Mini (Reasoning, Low) | 91.1% | $0.0052 | 7.9s | 78% |
| 43 | DeepSeek V3 (2024-12-26) | 91.1% | $0.0012 | 23.1s | 78% |
| 44 | Mistral Small Creative | 91.1% | $0.0003 | 4.1s | 73% |
| 45 | GPT-5.5 (Reasoning, Low) | 94.6% | $0.032 | 11.5s | 88% |
| 46 | GPT-5.4 Mini | 87.5% | $0.0040 | 3.2s | 79% |
| 47 | GPT-5.4 (Reasoning, Low) | 93.8% | $0.020 | 14.1s | 82% |
| 48 | Arcee AI: Trinity Large (Preview) | 89.3% | $0.0000 | 31.5s | 80% |
| 49 | Gemma 3 27B | 86.6% | $0.0003 | 24.0s | 82% |
| 50 | GPT-5 Mini | 93.8% | $0.0082 | 50.8s | 82% |
| 51 | Gemini 2.5 Flash Lite (Reasoning) | 92.0% | $0.0033 | 32.9s | 77% |
| 52 | GPT-5.5 | 93.8% | $0.026 | 6.8s | 82% |
| 53 | Z.AI GLM 4.6 | 93.8% | $0.0078 | 57.8s | 82% |
| 54 | Mistral Small 4 | 84.8% | $0.0006 | 4.5s | 79% |
| 55 | Qwen3 235B A22B Instruct 2507 | 91.1% | $0.0004 | 19.2s | 73% |
| 56 | Stealth: Healer Alpha | 90.2% | $0.0000 | 24.2s | 74% |
| 57 | Arcee AI: Trinity Mini | 86.6% | $0.0003 | 11.2s | 75% |
| 58 | GPT-5.4 | 91.1% | $0.013 | 8.3s | 75% |
| 59 | Claude 3.7 Sonnet | 86.6% | $0.015 | 8.5s | 82% |
| 60 | Claude Opus 4.7 (Reasoning) | 99.1% | $0.064 | 13.9s | 94% |
| 61 | Gemini 2.5 Flash (Reasoning) | 90.2% | $0.014 | 22.4s | 79% |
| 62 | GPT-4o, May 13th (temp=0) | 90.2% | $0.015 | 4.6s | 75% |
| 63 | Gemma 3 4B | 88.4% | $0.0001 | 10.1s | 70% |
| 64 | Z.AI GLM 4.5 | 92.0% | $0.0063 | 52.4s | 77% |
| 65 | Z.AI GLM 5 Turbo | 93.8% | $0.022 | 50.4s | 82% |
| 66 | Mistral Medium 3.1 | 82.1% | $0.0018 | 6.3s | 77% |
| 67 | Grok 4 | 95.5% | $0.039 | 47.6s | 88% |
| 68 | Grok 4.20 (Reasoning) | 93.8% | $0.018 | 1.1m | 82% |
| 69 | ByteDance Seed 1.6 | 87.5% | $0.0077 | 1.4m | 88% |
| 70 | Qwen 3.6 Flash | 89.3% | $0.014 | 45.2s | 80% |
| 71 | GPT-5.4 Mini (Reasoning) | 89.3% | $0.018 | 37.8s | 80% |
| 72 | Gemini 3 Flash (Preview, Reasoning) | 88.4% | $0.022 | 36.9s | 82% |
| 73 | Grok 4.20 (Beta, Reasoning) | 93.8% | $0.040 | 26.6s | 82% |
| 74 | GPT-4o, May 13th (temp=1) | 85.7% | $0.015 | 4.7s | 74% |
| 75 | GPT-5.2 | 88.4% | $0.025 | 22.2s | 77% |
| 76 | Z.AI GLM 4.5 Air | 91.1% | $0.0059 | 1.5m | 78% |
| 77 | GPT-5.1 | 92.0% | $0.029 | 34.1s | 77% |
| 78 | DeepSeek V4 Flash (Reasoning) | 92.0% | $0.0013 | 2.3m | 77% |
| 79 | DeepSeek V4 Flash | 92.0% | $0.0003 | 10.6s | 49% |
| 80 | Ministral 3 8B | 83.9% | $0.0002 | 4.2s | 57% |
| 81 | GPT-4.1 | 80.4% | $0.0076 | 6.4s | 66% |
| 82 | Ministral 8B | 83.0% | $0.0002 | 3.9s | 57% |
| 83 | Stealth: Hunter Alpha | 87.5% | $0.0000 | 29.8s | 55% |
| 84 | Qwen 3 32B | 87.5% | $0.0010 | 43.4s | 58% |
| 85 | GPT-5.4 Nano | 80.4% | $0.0011 | 3.8s | 59% |
| 86 | GPT-5.5 (Reasoning) | 93.8% | $0.068 | 25.1s | 82% |
| 87 | GPT-5.4 Nano (Reasoning) | 79.5% | $0.0034 | 22.3s | 63% |
| 88 | DeepSeek V3.1 | 88.4% | $0.0009 | 35.4s | 52% |
| 89 | Gemma 4 26B (Reasoning) | 93.8% | $0.0042 | 3.4m | 82% |
| 90 | Qwen 3.5 Plus (2026-04-20) | 88.4% | $0.020 | 2.1m | 82% |
| 91 | Mistral Small 4 (Reasoning) | 86.6% | $0.0036 | 34.3s | 55% |
| 92 | GPT-5.4 (Reasoning) | 92.9% | $0.052 | 47.2s | 77% |
| 93 | Grok 4.3 | 87.5% | $0.0030 | 6.0s | 47% |
| 94 | Grok 4.3 (Reasoning) | 91.1% | $0.019 | 1.5m | 70% |
| 95 | Gemini 2.5 Pro | 92.9% | $0.056 | 41.9s | 77% |
| 96 | Llama 3.1 70B | 77.7% | $0.0007 | 18.9s | 61% |
| 97 | Claude 3.5 Sonnet | 81.3% | $0.030 | 13.8s | 71% |
| 98 | Qwen 3.5 35B | 89.3% | $0.025 | 1.3m | 71% |
| 99 | ByteDance Seed 2.0 Lite | 85.7% | $0.0085 | 1.6m | 69% |
| 100 | Qwen 3.5 Flash | 87.5% | $0.0050 | 1.4m | 61% |
| 101 | GPT-5.4 Nano (Reasoning, Low) | 77.7% | $0.0014 | 7.4s | 56% |
| 102 | GPT-5 | 92.0% | $0.050 | 1.3m | 77% |
| 103 | Qwen 3.6 35B | 84.8% | $0.011 | 1.0m | 61% |
| 104 | GPT-OSS 120B | 84.8% | $0.0010 | 46.5s | 52% |
| 105 | Gemini 3 Pro (Preview) | 89.3% | $0.064 | 43.7s | 80% |
| 106 | Claude Opus 4 | 91.1% | $0.077 | 13.4s | 78% |
| 107 | Z.AI GLM 4.7 | 92.0% | $0.017 | 3.0m | 77% |
| 108 | Qwen 3.5 122B | 90.2% | $0.041 | 1.9m | 79% |
| 109 | DeepSeek V4 Pro (Reasoning) | 92.0% | $0.013 | 3.5m | 77% |
| 110 | Gemini 3.5 Flash (Reasoning) | 89.3% | $0.077 | 33.0s | 80% |
| 111 | GPT-4o, Aug. 6th (temp=0) | 81.3% | $0.0091 | 3.9s | 46% |
| 112 | Qwen 3.5 397B A17B | 90.2% | $0.011 | 3.7m | 79% |
| 113 | DeepSeek V3 (2025-03-24) | 82.1% | $0.0008 | 42.6s | 46% |
| 114 | Z.AI GLM 5 | 91.1% | $0.023 | 3.2m | 78% |
| 115 | Qwen 3.5 27B | 91.1% | $0.037 | 2.9m | 78% |
| 116 | Inception Mercury | 75.9% | $0.0005 | 3.8s | 41% |
| 117 | ByteDance Seed 2.0 Mini | 89.3% | $0.0036 | 4.0m | 74% |
| 118 | Llama 3.1 8B | 75.9% | $0.0001 | 15.3s | 42% |
| 119 | GPT-4.1 Nano | 75.0% | $0.0004 | 5.0s | 39% |
| 120 | Aion 2.0 | 85.7% | $0.0069 | 1.5m | 45% |
| 121 | Z.AI GLM 4.7 Flash | 84.8% | $0.0034 | 2.4m | 55% |
| 122 | WizardLM 2 8x22b | 82.1% | $0.0014 | 1.5m | 45% |
| 123 | Z.AI GLM 5.1 | 92.9% | $0.044 | 3.4m | 77% |
| 124 | o4 Mini | 83.0% | $0.026 | 39.2s | 44% |
| 125 | GPT-5 Nano | 78.6% | $0.0052 | 1.9m | 52% |
| 126 | DeepSeek-V2 Chat | 75.9% | $0.0011 | 21.5s | 33% |
| 127 | Qwen3.7 Max | 88.4% | $0.072 | 2.1m | 77% |
| 128 | Llama 3.1 Nemotron 70B | 64.3% | $0.0021 | 20.8s | 47% |
| 129 | Ministral 3B | 71.4% | $0.0001 | 2.9s | 32% |
| 130 | Inception Mercury 2 | 67.9% | $0.0027 | 3.7s | 38% |
| 131 | GPT-4o, Aug. 6th (temp=1) | 71.4% | $0.0086 | 4.1s | 31% |
| 132 | Gemma 4 31B (Reasoning) | 92.9% | $0.0027 | 6.3m | 77% |
| 133 | Claude Sonnet 4.6 (Reasoning) | 93.8% | $0.123 | 1.5m | 82% |
| 134 | o4 Mini High | 83.9% | $0.056 | 1.4m | 54% |
| 135 | MiniMax M2.7 | 85.7% | $0.014 | 3.1m | 49% |
| 136 | Skyfall 36B V2 | 69.6% | $0.0010 | 13.0s | 26% |
| 137 | Claude Opus 4.6 (Reasoning) | 91.1% | $0.129 | 1.1m | 78% |
| 138 | Qwen3.6 Max Preview | 92.0% | $0.066 | 4.0m | 77% |
| 139 | Qwen 3.5 9B | 76.8% | $0.0020 | 2.9m | 45% |
| 140 | Ministral 3 3B | 60.7% | $0.0002 | 2.9s | 25% |
| 141 | Qwen 3.6 27B | 82.1% | $0.027 | 1.9m | 32% |
| 142 | MoonshotAI: Kimi K2.5 | 91.1% | $0.029 | 6.5m | 70% |
| 143 | Mistral NeMO | 53.6% | $0.0002 | 3.1s | 14% |
| 144 | Gemini 3.1 Pro (Preview) | 93.8% | $0.168 | 2.6m | 82% |
| 145 | Nemotron 3 Super | 61.6% | $0.0000 | 2.5m | 24% |
| 146 | MoonshotAI: Kimi K2.6 | 92.9% | $0.072 | 7.6m | 77% |
| 147 | Claude 3 Haiku | 42.0% | $0.0013 | 6.8s | 6% |
| 148 | Nemotron 3 Nano | 63.4% | $0.0034 | 4.0m | 27% |
| 149 | Cohere Command R+ (Aug. 2024) | 36.6% | $0.0092 | 18.4s | 17% |
| 150 | Hermes 3 70B | 57.1% | $0.0022 | 2.8m | 10% |
| 151 | Rocinante 12B | 28.6% | $0.0005 | 10.0s | 7% |

Individual Scenarios

Generic Prompt

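| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9% |
| MiniMax M2.7 | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9% |
| DeepSeek V3.2 | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9% |
| Grok 4 | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| Grok 4.20 (Beta) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| Gemini 2.5 Flash | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| Qwen 2.5 72B | 100 | 100 | 100 | 88 | 88 | 88 | 75 | 91.1% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| GPT-5.5 (Reasoning, Low) | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 89.3% |
| Xiaomi MIMO v2.5 Pro | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 89.3% |
| GPT-4o, May 13th (temp=1) | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 89.3% |
| DeepSeek V3 (2024-12-26) | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 89.3% |
| GPT-5.4 Mini | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 89.3% |
| Claude Opus 4.6 (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Qwen3.6 Max Preview | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemini 3.1 Pro (Preview) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Z.AI GLM 5.1 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Z.AI GLM 5 Turbo | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemini 3.5 Flash (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Claude Sonnet 4.6 (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-5.4 (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-5.5 (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-5 Mini | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-5.1 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| MoonshotAI: Kimi K2.6 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-5 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Qwen 3.5 397B A17B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemma 4 31B (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Qwen 3.5 122B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Qwen 3.5 Plus (2026-04-20) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemma 4 26B (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Grok 4.20 (Beta, Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-5.4 (Reasoning, Low) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Grok 4.20 (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Z.AI GLM 5 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| MoonshotAI: Kimi K2.5 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Qwen 3.5 27B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| ByteDance Seed 1.6 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Qwen 3.6 Flash | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-5.4 Mini (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemini 3 Flash (Preview, Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| DeepSeek V4 Pro (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Grok 4.1 Fast | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Aion 2.0 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-5.5 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| DeepSeek V4 Flash (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemini 3 Pro (Preview) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Z.AI GLM 4.7 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemini 2.5 Pro | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Claude Opus 4 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemini 2.5 Flash (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemini 3.5 Flash (Reasoning, Minimal) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-OSS 120B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemini 3.1 Flash Lite (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Qwen 3.5 Flash | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Z.AI GLM 4.5 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Grok 4 Fast | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemini 3.1 Flash Lite (Preview) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemini 3.1 Flash Lite | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemini 2.5 Flash Lite (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Mistral Large 3 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-4o, May 13th (temp=0) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemini 3 Flash (Preview) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| DeepSeek-V2 Chat | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Z.AI GLM 4.7 Flash | 100 | 88 | 88 | 88 | 88 | 88 | 75 | 87.5% |
| ByteDance Seed 2.0 Lite | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-4.1 Mini | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Z.AI GLM 4.5 Air | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Hermes 3 405B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Mistral Large 2 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Mistral Small 4 (Reasoning) | 100 | 88 | 88 | 88 | 88 | 88 | 75 | 87.5% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 88 | 25 | 87.5% |
| DeepSeek V3 (2025-03-24) | 100 | 88 | 88 | 88 | 88 | 88 | 75 | 87.5% |
| Grok 4.20 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Mistral Large | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Writer: Palmyra X5 | 100 | 88 | 88 | 88 | 88 | 88 | 75 | 87.5% |
| GPT-4o Mini (temp=1) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Mistral Small 3.2 24B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemma 3 12B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-4o Mini (temp=0) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| LFM2 24B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Qwen3.7 Max | 88 | 88 | 88 | 88 | 88 | 88 | 75 | 85.7% |
| GPT-5.2 | 88 | 88 | 88 | 88 | 88 | 88 | 75 | 85.7% |
| Claude 3.7 Sonnet | 88 | 88 | 88 | 88 | 88 | 88 | 75 | 85.7% |
| DeepSeek V4 Pro | 100 | 88 | 88 | 88 | 88 | 75 | 75 | 85.7% |
| Gemma 3 27B | 88 | 88 | 88 | 88 | 88 | 88 | 75 | 85.7% |
| ByteDance Seed 1.6 Flash | 100 | 88 | 88 | 88 | 88 | 75 | 75 | 85.7% |
| Hermes 3 70B | 100 | 100 | 88 | 88 | 88 | 75 | 63 | 85.7% |
| Ministral 3 14B | 88 | 88 | 88 | 88 | 88 | 88 | 75 | 85.7% |
| GPT-4.1 | 88 | 88 | 88 | 88 | 88 | 75 | 75 | 83.9% |
| Qwen 3.5 35B | 88 | 88 | 88 | 88 | 88 | 88 | 63 | 83.9% |
| ByteDance Seed 2.0 Mini | 88 | 88 | 88 | 88 | 88 | 75 | 75 | 83.9% |
| Qwen 3 32B | 88 | 88 | 88 | 88 | 88 | 88 | 63 | 83.9% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 88 | 0 | 83.9% |
| Mistral Small 4 | 88 | 88 | 88 | 88 | 88 | 75 | 75 | 83.9% |
| Cydonia 24B V4.1 | 100 | 88 | 88 | 88 | 88 | 75 | 63 | 83.9% |
| Grok 4.3 (Reasoning) | 88 | 88 | 88 | 88 | 88 | 88 | 50 | 82.1% |
| GPT-5.4 | 88 | 88 | 88 | 88 | 75 | 75 | 75 | 82.1% |
| Qwen3 235B A22B Instruct 2507 | 88 | 88 | 88 | 88 | 88 | 75 | 63 | 82.1% |
| Mistral Small Creative | 88 | 88 | 88 | 88 | 88 | 75 | 63 | 82.1% |
| Arcee AI: Trinity Mini | 88 | 88 | 88 | 88 | 75 | 75 | 75 | 82.1% |
| Qwen 3.5 9B | 88 | 88 | 88 | 88 | 88 | 75 | 50 | 80.4% |
| Stealth: Healer Alpha | 88 | 88 | 88 | 75 | 75 | 75 | 75 | 80.4% |
| Inception Mercury 2 | 88 | 88 | 88 | 88 | 88 | 63 | 63 | 80.4% |
| Stealth: Hunter Alpha | 88 | 88 | 88 | 88 | 88 | 88 | 25 | 78.6% |
| Inception Mercury | 88 | 88 | 75 | 75 | 75 | 75 | 75 | 78.6% |
| Qwen 3.6 35B | 88 | 88 | 88 | 88 | 88 | 50 | 50 | 76.8% |
| Grok 4.3 | 100 | 100 | 100 | 88 | 88 | 63 | 0 | 76.8% |
| Llama 3.1 70B | 88 | 88 | 88 | 75 | 75 | 75 | 50 | 76.8% |
| Mistral Medium 3.1 | 88 | 75 | 75 | 75 | 75 | 75 | 75 | 76.8% |
| Gemma 3 4B | 88 | 88 | 75 | 75 | 75 | 75 | 63 | 76.8% |
| Claude 3.5 Sonnet | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75.0% |
| GPT-4o, Aug. 6th (temp=0) | 88 | 88 | 88 | 88 | 88 | 88 | 0 | 75.0% |
| o4 Mini High | 88 | 88 | 88 | 88 | 75 | 63 | 25 | 73.2% |
| WizardLM 2 8x22b | 100 | 100 | 88 | 88 | 88 | 38 | 13 | 73.2% |
| Llama 3.1 8B | 100 | 100 | 88 | 75 | 63 | 50 | 38 | 73.2% |
| GPT-4o, Aug. 6th (temp=1) | 88 | 88 | 88 | 88 | 88 | 63 | 0 | 71.4% |
| o4 Mini | 88 | 88 | 88 | 88 | 75 | 25 | 25 | 67.9% |
| GPT-5.4 Nano (Reasoning) | 88 | 75 | 63 | 63 | 63 | 63 | 63 | 67.9% |
| GPT-5.4 Nano | 75 | 75 | 75 | 63 | 63 | 63 | 63 | 67.9% |
| Ministral 3 8B | 75 | 75 | 75 | 75 | 63 | 63 | 50 | 67.9% |
| GPT-5 Nano | 88 | 88 | 75 | 75 | 50 | 50 | 38 | 66.1% |
| Ministral 8B | 75 | 75 | 63 | 63 | 63 | 63 | 63 | 66.1% |
| Qwen 3.6 27B | 100 | 88 | 88 | 88 | 88 | 0 | 0 | 64.3% |
| GPT-5.4 Nano (Reasoning, Low) | 75 | 75 | 75 | 63 | 63 | 50 | 50 | 64.3% |
| GPT-4.1 Nano | 100 | 88 | 75 | 63 | 50 | 38 | 0 | 58.9% |
| Llama 3.1 Nemotron 70B | 88 | 75 | 63 | 63 | 38 | 38 | 25 | 55.4% |
| Nemotron 3 Super | 88 | 88 | 50 | 50 | 50 | 50 | 0 | 53.6% |
| Ministral 3B | 63 | 63 | 50 | 50 | 50 | 38 | 38 | 50.0% |
| Skyfall 36B V2 | 88 | 75 | 75 | 75 | 25 | 0 | 0 | 48.2% |
| Nemotron 3 Nano | 88 | 75 | 50 | 38 | 25 | 25 | 13 | 44.6% |
| Ministral 3 3B | 63 | 63 | 50 | 25 | 25 | 25 | 13 | 37.5% |
| Mistral NeMO | 38 | 25 | 25 | 13 | 13 | 0 | 0 | 16.1% |
| Cohere Command R+ (Aug. 2024) | 50 | 13 | 13 | 0 | 0 | 0 | 0 | 10.7% |
| Rocinante 12B | 38 | 13 | 13 | 0 | 0 | 0 | 0 | 8.9% |
| Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |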

Specific Prompt

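| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg |
|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 63 | 94.6% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| o4 Mini High | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9% |
| Qwen 3.5 122B | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9% |
| Qwen 3.6 35B | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 88 | 88 | 75 | 92.9% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9% |
| GPT-5.4 Nano | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9% |
| Ministral 3B | 100 | 100 | 100 | 100 | 100 | 88 | 63 | 92.9% |
| Qwen3.7 Max | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| Gemini 3.5 Flash (Reasoning) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| Qwen 3.6 Flash | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| GPT-5.2 | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| Gemini 3 Pro (Preview) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| Gemini 3.1 Flash Lite | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 88 | 88 | 63 | 91.1% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 38 | 91.1% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| GPT-4.1 Nano | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| WizardLM 2 8x22b | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| Arcee AI: Trinity Mini | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| Skyfall 36B V2 | 100 | 100 | 100 | 100 | 88 | 88 | 63 | 91.1% |
| Mistral NeMO | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 89.3% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 89.3% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 89.3% |
| GPT-4.1 Mini | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 89.3% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 88 | 38 | 89.3% |
| ByteDance Seed 1.6 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 88 | 88 | 38 | 87.5% |
| Claude 3.5 Sonnet | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Claude 3.7 Sonnet | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Hermes 3 405B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 88 | 75 | 75 | 75 | 87.5% |
| GPT-4o Mini (temp=1) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-4o Mini (temp=0) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemma 3 27B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Mistral Medium 3.1 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Arcee AI: Trinity Large (Preview) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| LFM2 24B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| GPT-5.4 Mini | 88 | 88 | 88 | 88 | 88 | 88 | 75 | 85.7% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 88 | 88 | 25 | 85.7% |
| Mistral Small 4 | 88 | 88 | 88 | 88 | 88 | 88 | 75 | 85.7% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 88 | 0 | 83.9% |
| ByteDance Seed 2.0 Lite | 100 | 88 | 88 | 88 | 88 | 88 | 50 | 83.9% |
| Claude 3 Haiku | 100 | 88 | 88 | 88 | 75 | 75 | 75 | 83.9% |
| Ministral 3 3B | 100 | 100 | 100 | 100 | 63 | 63 | 63 | 83.9% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 38 | 38 | 82.1% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 88 | 50 | 38 | 82.1% |
| GPT-4o, May 13th (temp=1) | 100 | 88 | 88 | 75 | 75 | 75 | 75 | 82.1% |
| Nemotron 3 Nano | 100 | 100 | 100 | 88 | 88 | 63 | 38 | 82.1% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 88 | 50 | 13 | 78.6% |
| Llama 3.1 70B | 88 | 88 | 75 | 75 | 75 | 75 | 75 | 78.6% |
| Llama 3.1 8B | 100 | 100 | 100 | 88 | 75 | 63 | 25 | 78.6% |
| GPT-4.1 | 88 | 75 | 75 | 75 | 75 | 75 | 75 | 76.8% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 88 | 88 | 88 | 75 | 0 | 76.8% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 38 | 38 | 38 | 73.2% |
| Inception Mercury | 100 | 100 | 100 | 100 | 38 | 38 | 38 | 73.2% |
| Llama 3.1 Nemotron 70B | 88 | 75 | 75 | 75 | 75 | 75 | 50 | 73.2% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 75 | 75 | 75 | 75 | 0 | 71.4% |
| Nemotron 3 Super | 100 | 100 | 88 | 75 | 63 | 38 | 25 | 69.6% |
| DeepSeek-V2 Chat | 100 | 88 | 88 | 88 | 88 | 0 | 0 | 64.3% |
| Cohere Command R+ (Aug. 2024) | 88 | 75 | 63 | 63 | 63 | 50 | 38 | 62.5% |
| Inception Mercury 2 | 100 | 100 | 38 | 38 | 38 | 38 | 38 | 55.4% |
| Rocinante 12B | 88 | 75 | 75 | 50 | 25 | 25 | 0 | 48.2% |
| Hermes 3 70B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 28.6% |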
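
The page does not state how the numbers are aggregated, so what follows is a minimal sketch under assumptions rather than the benchmark's documented method. It assumes (1) each scenario's Avg is the plain mean of its seven run scores, (2) a displayed 88 stands for an unrounded 87.5, and (3) the overall Score is the mean of the two scenario averages. The function names `scenario_avg` and `overall_score` are hypothetical. Under these assumptions the figures listed for MiniMax M2.5 are reproduced.

```python
from statistics import mean


def scenario_avg(runs):
    """Mean of the per-run scores for one scenario (assumed aggregation)."""
    return mean(runs)


def overall_score(generic_runs, specific_runs):
    """Mean of the two scenario averages (assumed aggregation)."""
    return mean([scenario_avg(generic_runs), scenario_avg(specific_runs)])


# MiniMax M2.5 run scores as listed, with each displayed 88 taken as 87.5
# (an assumption inferred from rows of seven 88s averaging 87.5%).
minimax_generic = [100, 100, 100, 100, 100, 100, 87.5]
minimax_specific = [100, 100, 100, 100, 100, 87.5, 87.5]

print(f"Generic:  {scenario_avg(minimax_generic):.1f}%")    # listed as 98.2%
print(f"Specific: {scenario_avg(minimax_specific):.1f}%")   # listed as 96.4%
print(f"Overall:  {overall_score(minimax_generic, minimax_specific):.1f}%")  # listed as 97.3%
```

The Rank column does not follow Score alone, so rank presumably also weighs cost, time, or stability; this sketch does not attempt to model that.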