Mara pronouns preserved (coreference test)

Test: Text Replacement

Avg. Score: 84.6%
Scenarios: 2

Overall Performance

Rank ▲ | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Mistral Small Creative | 100.0% | $0.0002 | 3.2s | 100%
2 | Gemini 3.1 Flash Lite (Reasoning) | 100.0% | $0.0011 | 2.0s | 100%
3 | Gemini 3.1 Flash Lite | 100.0% | $0.0011 | 3.9s | 100%
4 | Gemini 2.5 Flash | 100.0% | $0.0017 | 2.4s | 100%
5 | Grok 4 Fast | 100.0% | $0.0008 | 6.0s | 100%
6 | Gemini 3 Flash (Preview) | 100.0% | $0.0021 | 3.6s | 100%
7 | GPT-4.1 Mini | 100.0% | $0.0012 | 7.1s | 100%
8 | Mistral Medium 3.1 | 100.0% | $0.0015 | 6.5s | 100%
9 | Gemma 4 26B | 100.0% | $0.0003 | 13.5s | 100%
10 | Stealth: Healer Alpha | 100.0% | $0.0000 | 14.7s | 100%
11 | Gemini 3.1 Flash Lite (Preview) | 99.4% | $0.0011 | 1.9s | 95%
12 | Claude Haiku 4.5 | 100.0% | $0.0042 | 3.3s | 100%
13 | Llama 3.1 Nemotron 70B | 100.0% | $0.0016 | 16.3s | 100%
14 | Gemini 3.5 Flash (Reasoning, Minimal) | 100.0% | $0.0064 | 2.9s | 100%
15 | GPT-4.1 | 100.0% | $0.0061 | 4.7s | 100%
16 | GPT-4o, Aug. 6th (temp=0) | 100.0% | $0.0076 | 2.8s | 100%
17 | GPT-4o, Aug. 6th (temp=1) | 100.0% | $0.0076 | 3.2s | 100%
18 | Gemma 4 31B | 100.0% | $0.0004 | 27.2s | 100%
19 | Hermes 3 405B | 100.0% | $0.0014 | 25.1s | 100%
20 | DeepSeek V4 Flash (Reasoning) | 100.0% | $0.0005 | 28.0s | 100%
21 | DeepSeek V4 Pro | 100.0% | $0.0019 | 25.2s | 100%
22 | Qwen 3.5 Plus (2026-02-15) | 98.7% | $0.0017 | 7.7s | 91%
23 | Grok 4.1 Fast | 98.7% | $0.0009 | 10.9s | 91%
24 | Qwen 3 32B | 100.0% | $0.0006 | 33.7s | 100%
25 | Grok 4.20 (Reasoning) | 100.0% | $0.0060 | 18.4s | 100%
26 | GPT-5.4 | 100.0% | $0.010 | 6.1s | 100%
27 | Stealth: Hunter Alpha | 98.7% | $0.0000 | 17.4s | 91%
28 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.0083 | 13.3s | 100%
29 | GPT-5.4 (Reasoning, Low) | 100.0% | $0.011 | 6.7s | 100%
30 | GPT-4o, May 13th (temp=1) | 100.0% | $0.012 | 4.0s | 100%
31 | GPT-4o, May 13th (temp=0) | 100.0% | $0.012 | 4.1s | 100%
32 | Claude Sonnet 4.6 | 100.0% | $0.013 | 5.1s | 100%
33 | Claude Sonnet 4.5 | 100.0% | $0.013 | 5.2s | 100%
34 | Claude 3.7 Sonnet | 100.0% | $0.013 | 6.4s | 100%
35 | Claude Sonnet 4 | 100.0% | $0.013 | 6.5s | 100%
36 | Z.AI GLM 4.5 | 100.0% | $0.0041 | 32.4s | 100%
37 | Grok 4.20 (Beta, Reasoning) | 100.0% | $0.013 | 6.4s | 100%
38 | Qwen 3.6 Flash | 100.0% | $0.0075 | 22.4s | 100%
39 | Qwen 3.6 35B | 100.0% | $0.0053 | 30.7s | 100%
40 | Gemini 2.5 Flash (Reasoning) | 98.7% | $0.0059 | 8.8s | 91%
41 | ByteDance Seed 1.6 | 100.0% | $0.0035 | 37.7s | 100%
42 | Grok 4.3 (Reasoning) | 100.0% | $0.0076 | 32.7s | 100%
43 | Aion 2.0 | 100.0% | $0.0038 | 47.6s | 100%
44 | o4 Mini | 100.0% | $0.014 | 21.5s | 100%
45 | ByteDance Seed 2.0 Lite | 100.0% | $0.0045 | 52.2s | 100%
46 | GPT-5.1 | 100.0% | $0.017 | 16.5s | 100%
47 | GPT-5.5 | 100.0% | $0.021 | 5.3s | 100%
48 | Claude Opus 4.5 | 100.0% | $0.021 | 5.8s | 100%
49 | Claude Opus 4.6 | 100.0% | $0.021 | 6.5s | 100%
50 | Gemma 4 31B (Reasoning) | 100.0% | $0.0010 | 1.2m | 100%
51 | Z.AI GLM 5 | 100.0% | $0.0074 | 54.4s | 100%
52 | Z.AI GLM 4.7 | 100.0% | $0.0059 | 1.0m | 100%
53 | GPT-5.5 (Reasoning, Low) | 100.0% | $0.025 | 6.6s | 100%
54 | Qwen 3.5 27B | 100.0% | $0.012 | 49.5s | 100%
55 | Claude 3.5 Sonnet | 100.0% | $0.025 | 10.7s | 100%
56 | MiniMax M2.7 | 100.0% | $0.0065 | 1.1m | 100%
57 | Claude Opus 4.7 | 100.0% | $0.028 | 5.1s | 100%
58 | Inception Mercury 2 | 87.0% | $0.0019 | 2.6s | 68%
59 | Gemma 3 27B | 89.0% | $0.0002 | 18.1s | 72%
60 | Z.AI GLM 4.6 | 98.7% | $0.0050 | 1.0m | 91%
61 | Gemini 3.5 Flash (Reasoning) | 100.0% | $0.029 | 12.0s | 100%
62 | o4 Mini High | 100.0% | $0.022 | 33.0s | 100%
63 | GPT-5.5 (Reasoning) | 100.0% | $0.031 | 10.1s | 100%
64 | Qwen 3.6 27B | 100.0% | $0.014 | 1.0m | 100%
65 | Gemini 2.5 Pro | 100.0% | $0.028 | 20.4s | 100%
66 | Qwen 3.5 122B | 100.0% | $0.020 | 46.9s | 100%
67 | Z.AI GLM 4.5 Air | 92.2% | $0.0019 | 44.5s | 78%
68 | Grok 4 | 100.0% | $0.025 | 30.9s | 100%
69 | GPT-OSS 120B | 90.9% | $0.0011 | 40.9s | 74%
70 | Z.AI GLM 5.1 | 100.0% | $0.013 | 1.2m | 100%
71 | Gemini 3 Pro (Preview) | 100.0% | $0.031 | 21.3s | 100%
72 | DeepSeek V4 Pro (Reasoning) | 100.0% | $0.0080 | 1.7m | 100%
73 | MoonshotAI: Kimi K2.6 | 100.0% | $0.015 | 1.5m | 100%
74 | ByteDance Seed 1.6 Flash | 90.3% | $0.0007 | 12.3s | 47%
75 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.038 | 20.1s | 100%
76 | MoonshotAI: Kimi K2.5 | 100.0% | $0.0085 | 1.8m | 100%
77 | Ministral 3 3B | 80.5% | $0.0001 | 2.2s | 50%
78 | Claude Opus 4.7 (Reasoning) | 100.0% | $0.043 | 8.1s | 100%
79 | Gemma 4 26B (Reasoning) | 100.0% | $0.0017 | 2.2m | 100%
80 | Xiaomi MIMO v2.5 | 92.9% | $0.0043 | 17.1s | 48%
81 | GPT-5.4 Nano (Reasoning, Low) | 86.4% | $0.0009 | 8.4s | 44%
82 | Qwen3 235B A22B Instruct 2507 | 81.8% | $0.0005 | 16.2s | 52%
83 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.043 | 16.8s | 100%
84 | Grok 4.20 | 88.3% | $0.0022 | 4.8s | 41%
85 | GPT-5.2 | 92.9% | $0.010 | 7.1s | 48%
86 | Llama 3.1 70B | 92.9% | $0.0005 | 36.2s | 48%
87 | Arcee AI: Trinity Mini | 63.6% | $0.0002 | 7.1s | 64%
88 | Gemini 3.1 Pro (Preview) | 100.0% | $0.038 | 36.2s | 100%
89 | Qwen3.7 Max | 100.0% | $0.032 | 54.8s | 100%
90 | GPT-5 | 100.0% | $0.035 | 49.7s | 100%
91 | Z.AI GLM 5 Turbo | 92.9% | $0.0095 | 19.7s | 48%
92 | Claude 3 Haiku | 78.6% | $0.0010 | 5.0s | 42%
93 | Ministral 3B | 77.3% | $0.0001 | 2.3s | 41%
94 | GPT-5.4 Nano | 81.8% | $0.0009 | 3.5s | 37%
95 | Gemma 3 4B | 77.3% | $0.0001 | 6.5s | 42%
96 | MiniMax M2.5 | 95.5% | $0.0017 | 1.6m | 72%
97 | GPT-5 Mini | 92.2% | $0.0057 | 32.9s | 49%
98 | Inception Mercury | 69.5% | $0.0005 | 5.2s | 41%
99 | Xiaomi MIMO v2.5 Pro | 85.7% | $0.0029 | 13.0s | 30%
100 | Mistral Small 4 (Reasoning) | 84.4% | $0.0019 | 16.2s | 30%
101 | GPT-5.4 (Reasoning) | 92.9% | $0.020 | 16.0s | 48%
102 | GPT-5.4 Mini (Reasoning) | 85.7% | $0.0071 | 6.2s | 30%
103 | Writer: Palmyra X5 | 77.3% | $0.0040 | 13.0s | 37%
104 | Claude Opus 4 | 100.0% | $0.063 | 8.9s | 100%
105 | Qwen3.6 Max Preview | 100.0% | $0.033 | 1.7m | 100%
106 | Cydonia 24B V4.1 | 73.4% | $0.0005 | 13.6s | 27%
107 | Qwen 2.5 72B | 78.6% | $0.0003 | 11.5s | 18%
108 | Grok 4.20 (Beta) | 75.3% | $0.0038 | 2.0s | 21%
109 | GPT-5.4 Nano (Reasoning) | 68.8% | $0.0010 | 3.7s | 20%
110 | Qwen 3.5 Plus (2026-04-20) | 92.9% | $0.012 | 1.2m | 48%
111 | Nemotron 3 Super | 82.5% | $0.0000 | 57.7s | 31%
112 | GPT-5.4 Mini (Reasoning, Low) | 72.1% | $0.0036 | 2.9s | 19%
113 | GPT-5.4 Mini | 55.8% | $0.0031 | 2.3s | 27%
114 | Mistral Small 4 | 61.7% | $0.0004 | 3.4s | 15%
115 | Qwen 3.5 9B | 89.6% | $0.0014 | 1.9m | 45%
116 | DeepSeek-V2 Chat | 71.4% | $0.0009 | 17.7s | 10%
117 | Ministral 3 14B | 61.0% | $0.0003 | 4.5s | 13%
118 | Mistral Large | 71.4% | $0.0050 | 8.4s | 10%
119 | DeepSeek V3 (2025-03-24) | 68.8% | $0.0007 | 26.6s | 11%
120 | GPT-5 Nano | 74.0% | $0.0029 | 1.0m | 24%
121 | Mistral Small 3.2 24B | 56.5% | $0.0002 | 5.5s | 6%
122 | Qwen 3.5 35B | 80.5% | $0.016 | 52.7s | 30%
123 | DeepSeek V3 (2024-12-26) | 64.3% | $0.0009 | 18.3s | 4%
124 | Ministral 3 8B | 52.6% | $0.0002 | 3.7s | 3%
125 | Qwen 3.5 Flash | 71.4% | $0.0033 | 1.0m | 17%
126 | Skyfall 36B V2 | 48.1% | $0.0008 | 10.8s | 12%
127 | Llama 3.1 8B | 54.5% | $0.0001 | 10.1s | 3%
128 | Ministral 8B | 50.6% | $0.0001 | 3.9s | 1%
129 | Gemini 2.5 Flash Lite | 50.0% | $0.0003 | 1.9s | 0%
130 | Mistral NeMO | 50.0% | $0.0002 | 2.7s | 0%
131 | Rocinante 12B | 50.0% | $0.0004 | 9.0s | 3%
132 | DeepSeek V4 Flash | 50.0% | $0.0002 | 8.7s | 0%
133 | Arcee AI: Trinity Large (Preview) | 54.5% | $0.0000 | 27.1s | 3%
134 | Mistral Large 3 | 50.0% | $0.0012 | 8.4s | 0%
135 | Grok 4.3 | 50.0% | $0.0024 | 5.9s | 0%
136 | Z.AI GLM 4.7 Flash | 63.0% | $0.0016 | 1.1m | 15%
137 | Mistral Large 2 | 50.0% | $0.0050 | 8.4s | 0%
138 | Qwen 3.5 397B A17B | 85.7% | $0.0075 | 2.2m | 30%
139 | GPT-4.1 Nano | 40.3% | $0.0003 | 4.2s | 0%
140 | WizardLM 2 8x22b | 57.1% | $0.0008 | 40.0s | 1%
141 | GPT-4o Mini (temp=0) | 31.8% | $0.0005 | 10.8s | 12%
142 | GPT-4o Mini (temp=1) | 31.8% | $0.0005 | 11.0s | 12%
143 | Gemini 2.5 Flash Lite (Reasoning) | 44.2% | $0.0023 | 15.5s | 2%
144 | Gemma 3 12B | 29.2% | $0.0001 | 9.6s | 8%
145 | DeepSeek V3.2 | 50.0% | $0.0006 | 38.9s | 0%
146 | DeepSeek V3.1 | 46.1% | $0.0006 | 43.9s | 1%
147 | Hermes 3 70B | 35.7% | $0.0004 | 21.2s | 0%
148 | Cohere Command R+ (Aug. 2024) | 50.0% | $0.0079 | 32.6s | 0%
149 | LFM2 24B | 6.5% | $0.0001 | 14.7s | 0%
150 | ByteDance Seed 2.0 Mini | 50.0% | $0.0020 | 2.0m | 0%
151 | Nemotron 3 Nano | 57.8% | $0.0022 | 3.2m | 17%
Avg. Score: 84.58%

Individual Scenarios

Generic Prompt

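Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 82 | 97.4%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 82 | 97.4%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 82 | 82 | 82 | 92.2%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 91 | 45 | 90.9%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 91 | 0 | 84.4%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 45 | 0 | 77.9%
Gemma 3 27B | 91 | 91 | 73 | 73 | 73 | 73 | 73 | 77.9%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 36 | 0 | 76.6%
Inception Mercury | 100 | 82 | 82 | 82 | 82 | 73 | 27 | 75.3%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 71.4%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 71.4%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 71.4%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 82 | 0 | 0 | 68.8%
Nemotron 3 Super | 100 | 100 | 91 | 82 | 82 | 0 | 0 | 64.9%
Qwen3 235B A22B Instruct 2507 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 63.6%
GPT-5.4 Nano | 100 | 100 | 100 | 73 | 36 | 27 | 9 | 63.6%
Arcee AI: Trinity Mini | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 63.6%
Qwen 3.5 35B | 100 | 100 | 100 | 64 | 64 | 0 | 0 | 61.0%
Ministral 3 3B | 64 | 64 | 64 | 64 | 64 | 55 | 55 | 61.0%
GPT-5 Nano | 100 | 100 | 100 | 91 | 9 | 9 | 0 | 58.4%
Ministral 3B | 64 | 64 | 64 | 55 | 55 | 55 | 55 | 58.4%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 57.1%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 57.1%
Claude 3 Haiku | 64 | 64 | 64 | 64 | 64 | 64 | 18 | 57.1%
Nemotron 3 Nano | 100 | 100 | 82 | 55 | 45 | 0 | 0 | 54.5%
Writer: Palmyra X5 | 64 | 64 | 64 | 64 | 64 | 64 | 0 | 54.5%
Gemma 3 4B | 55 | 55 | 55 | 55 | 55 | 55 | 55 | 54.5%
Grok 4.20 (Beta) | 100 | 100 | 100 | 27 | 27 | 0 | 0 | 50.6%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 73 | 55 | 0 | 0 | 0 | 46.8%
Cydonia 24B V4.1 | 100 | 91 | 64 | 27 | 27 | 18 | 0 | 46.8%
Qwen 3.5 Flash | 100 | 100 | 64 | 36 | 0 | 0 | 0 | 42.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 42.9%
Mistral Large | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 42.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 73 | 0 | 0 | 0 | 0 | 39.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 64 | 0 | 0 | 0 | 0 | 37.7%
GPT-5.4 Mini | 36 | 36 | 36 | 36 | 36 | 27 | 0 | 29.9%
DeepSeek V3 (2024-12-26) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 28.6%
Hermes 3 70B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 28.6%
Mistral Small 3.2 24B | 27 | 27 | 27 | 27 | 27 | 27 | 27 | 27.3%
Mistral Small 4 | 45 | 45 | 18 | 18 | 18 | 18 | 0 | 23.4%
Ministral 3 14B | 55 | 55 | 18 | 18 | 9 | 0 | 0 | 22.1%
Skyfall 36B V2 | 36 | 27 | 27 | 27 | 18 | 0 | 0 | 19.5%
Rocinante 12B | 100 | 27 | 0 | 0 | 0 | 0 | 0 | 18.2%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 14.3%
WizardLM 2 8x22b | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 14.3%
Arcee AI: Trinity Large (Preview) | 64 | 0 | 0 | 0 | 0 | 0 | 0 | 9.1%
Llama 3.1 8B | 64 | 0 | 0 | 0 | 0 | 0 | 0 | 9.1%
Ministral 3 8B | 18 | 18 | 0 | 0 | 0 | 0 | 0 | 5.2%
Ministral 8B | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 1.3%
ByteDance Seed 2.0 Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V4 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash Lite | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Cohere Command R+ (Aug. 2024) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%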

Specific Prompt

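Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 98.7%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 82 | 97.4%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 82 | 97.4%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 82 | 97.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 82 | 97.4%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 73 | 96.1%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 82 | 82 | 94.8%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 64 | 94.8%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 55 | 93.5%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 91 | 55 | 92.2%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 91 | 82 | 82 | 82 | 90.9%
GPT-5 Nano | 100 | 100 | 100 | 82 | 82 | 82 | 82 | 89.6%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 82 | 82 | 73 | 73 | 87.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 82 | 82 | 82 | 82 | 82 | 87.0%
Z.AI GLM 4.7 Flash | 100 | 91 | 91 | 82 | 82 | 82 | 82 | 87.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
GPT-OSS 120B | 82 | 82 | 82 | 82 | 82 | 82 | 82 | 81.8%
Inception Mercury 2 | 82 | 82 | 82 | 82 | 82 | 82 | 82 | 81.8%
GPT-5.4 Mini | 82 | 82 | 82 | 82 | 82 | 82 | 82 | 81.8%
Rocinante 12B | 100 | 100 | 100 | 100 | 91 | 82 | 0 | 81.8%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 82 | 82 | 0 | 80.5%
Skyfall 36B V2 | 100 | 100 | 100 | 100 | 45 | 45 | 45 | 76.6%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 73 | 45 | 0 | 74.0%
Inception Mercury | 100 | 82 | 82 | 73 | 55 | 36 | 18 | 63.6%
GPT-4o Mini (temp=1) | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 63.6%
GPT-4o Mini (temp=0) | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 63.6%
Arcee AI: Trinity Mini | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 63.6%
Nemotron 3 Nano | 100 | 91 | 82 | 64 | 45 | 45 | 0 | 61.0%
Gemma 3 12B | 100 | 55 | 55 | 55 | 55 | 45 | 45 | 58.4%
Hermes 3 70B | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 42.9%
LFM2 24B | 18 | 18 | 18 | 18 | 9 | 9 | 0 | 13.0%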
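
How the columns relate (a sketch, not the benchmark's documented method): each scenario's Avg looks like the mean of its seven run scores, and the overall Score looks like the mean of the two scenario averages. The per-run values all fall on multiples of 1/11 (9, 18, 27, ..., 91, 100), so the Python snippet below assumes 11 checks per run. That count, and the names `CHECKS_PER_RUN`, `run_score`, `scenario_avg` and `overall_score`, are assumptions made for illustration; the source does not state how scores are computed. The example uses the Inception Mercury 2 rows.

```python
from statistics import mean

# Assumption: 11 checks per run, inferred from the score steps; not stated in the source.
CHECKS_PER_RUN = 11


def run_score(passed: int) -> float:
    """Score of one run as a percentage of checks passed."""
    return 100.0 * passed / CHECKS_PER_RUN


def scenario_avg(passed_per_run) -> float:
    """Mean score over the runs of one scenario."""
    return mean(run_score(p) for p in passed_per_run)


def overall_score(scenario_avgs) -> float:
    """Mean of the per-scenario averages."""
    return mean(scenario_avgs)


# Inception Mercury 2: Generic runs 100,100,100,100,82,82,82; Specific runs all 82.
generic = scenario_avg([11, 11, 11, 11, 9, 9, 9])
specific = scenario_avg([9] * 7)
print(f"{generic:.1f} {specific:.1f} {overall_score([generic, specific]):.1f}")
# prints: 92.2 81.8 87.0
```

These three figures match the Generic Prompt, Specific Prompt and Overall Performance rows for that model. Averaging the rounded per-run values instead (82 rather than 81.82) would give 92.3 for the Generic scenario, which is why the sketch works from check counts.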