Name replacement accuracy

Test: Text Replacement

Avg. Score: 96.2%
Scenarios: 14

Overall Performance

| Rank ▲ | Model | Score | Avg. Cost | Avg. Time | Stability |
| --- | --- | --- | --- | --- | --- |
| 1 | Gemini 3.1 Flash Lite (Reasoning) | 99.4% | $0.0009 | 2.6s | 98% |
| 2 | Gemini 3.1 Flash Lite (Preview) | 99.3% | $0.0009 | 1.7s | 97% |
| 3 | Gemini 3.1 Flash Lite | 99.4% | $0.0009 | 2.8s | 97% |
| 4 | Grok 4 Fast | 99.6% | $0.0007 | 5.8s | 98% |
| 5 | Claude Haiku 4.5 | 99.7% | $0.0034 | 2.8s | 98% |
| 6 | Grok 4.1 Fast | 99.5% | $0.0008 | 10.6s | 98% |
| 7 | DeepSeek V4 Flash | 98.9% | $0.0002 | 7.6s | 96% |
| 8 | GPT-4.1 Mini | 98.9% | $0.0010 | 6.6s | 95% |
| 9 | Gemma 4 26B | 99.6% | $0.0002 | 15.4s | 98% |
| 10 | Qwen 3.5 Plus (2026-02-15) | 99.2% | $0.0015 | 6.7s | 95% |
| 11 | Mistral Large 3 | 98.7% | $0.0011 | 7.4s | 96% |
| 12 | Gemini 3 Flash (Preview) | 98.4% | $0.0018 | 3.2s | 94% |
| 13 | Stealth: Hunter Alpha | 99.4% | $0.0000 | 17.2s | 97% |
| 14 | GPT-4.1 | 99.4% | $0.0052 | 4.1s | 97% |
| 15 | Inception Mercury 2 | 98.3% | $0.0016 | 2.2s | 92% |
| 16 | GPT-4o Mini (temp=1) | 98.2% | $0.0004 | 9.2s | 94% |
| 17 | Qwen 2.5 72B | 98.4% | $0.0003 | 10.3s | 94% |
| 18 | Gemini 2.5 Flash Lite | 98.1% | $0.0003 | 1.6s | 88% |
| 19 | Stealth: Healer Alpha | 98.9% | $0.0000 | 13.4s | 94% |
| 20 | Grok 4.20 | 98.2% | $0.0019 | 4.2s | 93% |
| 21 | GPT-4o Mini (temp=0) | 98.0% | $0.0004 | 9.2s | 93% |
| 22 | Mistral Large | 98.8% | $0.0042 | 7.2s | 96% |
| 23 | Mistral Large 2 | 98.8% | $0.0042 | 7.2s | 96% |
| 24 | Gemma 3 27B | 98.6% | $0.0002 | 16.0s | 94% |
| 25 | DeepSeek V4 Pro | 99.4% | $0.0012 | 21.5s | 98% |
| 26 | GPT-4o, Aug. 6th (temp=0) | 99.0% | $0.0064 | 2.6s | 94% |
| 27 | DeepSeek-V2 Chat | 98.4% | $0.0008 | 16.5s | 94% |
| 28 | Gemini 2.5 Flash (Reasoning) | 99.1% | $0.0054 | 10.7s | 97% |
| 29 | GPT-5.4 Nano (Reasoning) | 96.9% | $0.0011 | 5.1s | 90% |
| 30 | Gemma 4 31B | 99.7% | $0.0003 | 29.4s | 99% |
| 31 | Claude Sonnet 4 | 99.8% | $0.010 | 5.7s | 99% |
| 32 | GPT-4o, May 13th (temp=0) | 99.7% | $0.010 | 3.5s | 98% |
| 33 | Claude Sonnet 4.5 | 99.6% | $0.010 | 4.6s | 98% |
| 34 | GPT-5.4 | 99.2% | $0.0089 | 5.4s | 97% |
| 35 | Gemini 2.5 Flash | 98.7% | $0.0014 | 2.1s | 82% |
| 36 | GPT-4o, Aug. 6th (temp=1) | 98.7% | $0.0064 | 2.6s | 92% |
| 37 | GPT-5.4 (Reasoning, Low) | 99.4% | $0.0094 | 5.6s | 97% |
| 38 | Grok 4.20 (Beta) | 97.6% | $0.0032 | 1.8s | 87% |
| 39 | GPT-5.4 Nano | 95.9% | $0.0007 | 3.1s | 87% |
| 40 | Llama 3.1 8B | 96.5% | $0.0000 | 9.3s | 89% |
| 41 | Mistral Medium 3.1 | 96.2% | $0.0012 | 6.0s | 89% |
| 42 | Z.AI GLM 4.5 | 99.7% | $0.0031 | 26.1s | 98% |
| 43 | Claude 3.7 Sonnet | 99.4% | $0.010 | 5.5s | 97% |
| 44 | GPT-4o, May 13th (temp=1) | 99.2% | $0.010 | 3.3s | 96% |
| 45 | Claude Sonnet 4.6 | 99.2% | $0.010 | 4.4s | 96% |
| 46 | GPT-5.2 | 98.8% | $0.0087 | 6.0s | 94% |
| 47 | Llama 3.1 Nemotron 70B | 97.7% | $0.0014 | 13.9s | 89% |
| 48 | Mistral Small 3.2 24B | 97.7% | $0.0002 | 4.7s | 80% |
| 49 | Grok 4.20 (Reasoning) | 99.6% | $0.0064 | 21.1s | 98% |
| 50 | GPT-5.4 Mini (Reasoning) | 98.4% | $0.0053 | 7.2s | 88% |
| 51 | DeepSeek V4 Flash (Reasoning) | 99.5% | $0.0005 | 36.8s | 98% |
| 52 | Z.AI GLM 5 Turbo | 99.7% | $0.0083 | 18.8s | 98% |
| 53 | Gemma 3 12B | 97.3% | $0.0001 | 8.6s | 82% |
| 54 | Llama 3.1 70B | 98.0% | $0.0005 | 23.9s | 91% |
| 55 | GPT-5.4 Nano (Reasoning, Low) | 95.3% | $0.0007 | 3.7s | 83% |
| 56 | Gemini 3 Flash (Preview, Reasoning) | 99.5% | $0.0097 | 17.2s | 98% |
| 57 | GPT-5.4 Mini | 96.1% | $0.0027 | 2.1s | 82% |
| 58 | Qwen 3.6 35B | 99.4% | $0.0055 | 29.6s | 98% |
| 59 | Grok 4.20 (Beta, Reasoning) | 99.5% | $0.014 | 7.8s | 98% |
| 60 | GPT-5.4 Mini (Reasoning, Low) | 96.1% | $0.0029 | 3.3s | 82% |
| 61 | Grok 4.3 | 96.3% | $0.0020 | 4.5s | 80% |
| 62 | Xiaomi MIMO v2.5 | 98.2% | $0.0033 | 13.1s | 83% |
| 63 | GPT-5.1 | 99.5% | $0.013 | 13.7s | 98% |
| 64 | GPT-5 Mini | 99.3% | $0.0052 | 33.1s | 97% |
| 65 | GPT-5.5 | 99.7% | $0.018 | 4.7s | 99% |
| 66 | DeepSeek V3 (2024-12-26) | 97.6% | $0.0007 | 15.0s | 80% |
| 67 | DeepSeek V3.2 | 98.9% | $0.0004 | 44.5s | 96% |
| 68 | Claude Opus 4.6 | 99.5% | $0.017 | 5.6s | 98% |
| 69 | Z.AI GLM 4.5 Air | 98.3% | $0.0015 | 30.2s | 89% |
| 70 | Claude Opus 4.5 | 99.3% | $0.017 | 5.2s | 97% |
| 71 | Z.AI GLM 4.6 | 99.7% | $0.0045 | 42.4s | 98% |
| 72 | GPT-OSS 120B | 98.9% | $0.0008 | 48.6s | 97% |
| 73 | ByteDance Seed 2.0 Lite | 99.7% | $0.0040 | 44.8s | 98% |
| 74 | Mistral Small 4 | 95.2% | $0.0004 | 3.2s | 72% |
| 75 | ByteDance Seed 1.6 | 99.4% | $0.0040 | 43.2s | 98% |
| 76 | Inception Mercury | 94.0% | $0.0004 | 4.5s | 76% |
| 77 | GPT-5.4 (Reasoning) | 99.2% | $0.016 | 13.3s | 97% |
| 78 | ByteDance Seed 1.6 Flash | 95.5% | $0.0007 | 12.2s | 76% |
| 79 | Grok 4.3 (Reasoning) | 99.3% | $0.0079 | 35.9s | 97% |
| 80 | GPT-5.5 (Reasoning, Low) | 99.7% | $0.021 | 6.7s | 99% |
| 81 | Aion 2.0 | 99.5% | $0.0038 | 48.1s | 97% |
| 82 | Qwen3 235B A22B Instruct 2507 | 95.6% | $0.0003 | 14.8s | 75% |
| 83 | GPT-5 Nano | 98.8% | $0.0023 | 50.9s | 96% |
| 84 | Claude 3.5 Sonnet | 99.3% | $0.020 | 9.0s | 97% |
| 85 | Ministral 3 14B | 93.4% | $0.0002 | 3.9s | 70% |
| 86 | Qwen 3.5 Flash | 98.7% | $0.0027 | 48.6s | 93% |
| 87 | Mistral Small Creative | 92.7% | $0.0002 | 2.9s | 71% |
| 88 | Hermes 3 405B | 97.3% | $0.0012 | 22.3s | 74% |
| 89 | Xiaomi MIMO v2.5 Pro | 97.4% | $0.0035 | 15.5s | 72% |
| 90 | Gemini 2.5 Flash Lite (Reasoning) | 96.3% | $0.0018 | 16.2s | 71% |
| 91 | Qwen 3.6 Flash | 98.2% | $0.0074 | 21.9s | 80% |
| 92 | Claude Opus 4.7 | 98.5% | $0.023 | 4.3s | 95% |
| 93 | Writer: Palmyra X5 | 94.3% | $0.0034 | 11.1s | 73% |
| 94 | Claude Opus 4.7 (Reasoning) | 98.7% | $0.025 | 4.7s | 95% |
| 95 | Mistral Small 4 (Reasoning) | 94.1% | $0.0017 | 14.8s | 69% |
| 96 | GPT-5.5 (Reasoning) | 99.4% | $0.027 | 9.8s | 97% |
| 97 | Qwen 3.5 35B | 99.0% | $0.013 | 43.1s | 96% |
| 98 | Z.AI GLM 5 | 99.5% | $0.0083 | 60.0s | 97% |
| 99 | Qwen 3 32B | 95.3% | $0.0008 | 32.0s | 72% |
| 100 | o4 Mini | 97.7% | $0.013 | 21.3s | 80% |
| 101 | Grok 4 | 99.6% | $0.023 | 28.7s | 98% |
| 102 | Gemini 2.5 Pro | 99.6% | $0.027 | 19.7s | 98% |
| 103 | WizardLM 2 8x22b | 96.2% | $0.0007 | 37.2s | 70% |
| 104 | o4 Mini High | 99.1% | $0.021 | 33.0s | 96% |
| 105 | Claude Opus 4.6 (Reasoning) | 99.6% | $0.030 | 12.5s | 98% |
| 106 | MiniMax M2.5 | 97.7% | $0.0018 | 59.0s | 80% |
| 107 | Arcee AI: Trinity Mini | 88.5% | $0.0002 | 6.5s | 58% |
| 108 | GPT-4.1 Nano | 89.9% | $0.0003 | 3.6s | 52% |
| 109 | Qwen 3.5 Plus (2026-04-20) | 99.4% | $0.011 | 1.2m | 98% |
| 110 | Z.AI GLM 5.1 | 99.5% | $0.013 | 1.1m | 98% |
| 111 | Claude Sonnet 4.6 (Reasoning) | 99.6% | $0.032 | 19.3s | 98% |
| 112 | Qwen 3.5 27B | 99.5% | $0.015 | 1.1m | 98% |
| 113 | Z.AI GLM 4.7 | 99.7% | $0.0070 | 1.5m | 97% |
| 114 | Nemotron 3 Super | 95.2% | $0.0000 | 51.1s | 66% |
| 115 | Ministral 3 8B | 85.7% | $0.0002 | 3.3s | 52% |
| 116 | Mistral NeMO | 85.5% | $0.0002 | 2.5s | 51% |
| 117 | DeepSeek V4 Pro (Reasoning) | 99.7% | $0.0064 | 1.7m | 98% |
| 118 | DeepSeek V3 (2025-03-24) | 92.5% | $0.0006 | 35.4s | 58% |
| 119 | Gemma 4 31B (Reasoning) | 99.4% | $0.0011 | 1.9m | 98% |
| 120 | GPT-5 | 99.8% | $0.029 | 44.3s | 99% |
| 121 | Gemma 4 26B (Reasoning) | 99.5% | $0.0013 | 1.9m | 98% |
| 122 | Ministral 8B | 84.8% | $0.0001 | 3.3s | 50% |
| 123 | Qwen 3.5 122B | 99.6% | $0.025 | 59.3s | 98% |
| 124 | ByteDance Seed 2.0 Mini | 97.6% | $0.0017 | 1.7m | 91% |
| 125 | Qwen 3.6 27B | 97.6% | $0.014 | 58.6s | 81% |
| 126 | Gemini 3 Pro (Preview) | 99.5% | $0.038 | 26.2s | 98% |
| 127 | MiniMax M2.7 | 96.7% | $0.0058 | 1.2m | 74% |
| 128 | Z.AI GLM 4.7 Flash | 94.0% | $0.0015 | 1.1m | 68% |
| 129 | Claude Opus 4 | 99.8% | $0.051 | 7.7s | 98% |
| 130 | Arcee AI: Trinity Large (Preview) | 88.0% | $0.0000 | 23.8s | 42% |
| 131 | DeepSeek V3.1 | 88.5% | $0.0006 | 34.3s | 46% |
| 132 | Qwen 3.5 9B | 96.8% | $0.0012 | 1.8m | 78% |
| 133 | Ministral 3 3B | 78.9% | $0.0001 | 2.0s | 43% |
| 134 | MoonshotAI: Kimi K2.5 | 99.6% | $0.011 | 2.1m | 98% |
| 135 | Ministral 3B | 78.4% | $0.0000 | 2.0s | 39% |
| 136 | Qwen 3.5 397B A17B | 99.4% | $0.0089 | 2.2m | 97% |
| 137 | MoonshotAI: Kimi K2.6 | 99.6% | $0.018 | 1.9m | 98% |
| 138 | Gemini 3.1 Pro (Preview) | 98.8% | $0.045 | 42.6s | 87% |
| 139 | LFM2 24B | 76.9% | $0.0001 | 11.9s | 30% |
| 140 | Qwen3.6 Max Preview | 99.7% | $0.033 | 1.8m | 99% |
| 141 | Gemma 3 4B | 73.9% | $0.0001 | 6.0s | 24% |
| 142 | Rocinante 12B | 63.8% | $0.0004 | 8.0s | 17% |
| 143 | Cohere Command R+ (Aug. 2024) | 71.4% | $0.0067 | 27.2s | 23% |
| 144 | Nemotron 3 Nano | 83.4% | $0.0016 | 2.0m | 46% |
| 145 | Claude 3 Haiku | 61.2% | $0.0009 | 4.3s | 4% |
| 146 | Hermes 3 70B | 67.7% | $0.0015 | 2.0m | 10% |

Average score: 96.23%

Individual Scenarios

Generic Prompt

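| Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 98.8% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 96.4% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 96.4% |
| Rocinante 12B | 100 | 100 | 100 | 100 | 92 | 83 | 75 | 92.9% |
| Ministral 3 3B | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 91.7% |
| Ministral 3B | 100 | 100 | 92 | 92 | 92 | 92 | 58 | 89.3% |
| Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 8 | 86.9% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 92 | 8 | 85.7% |
| Arcee AI: Trinity Mini | 100 | 100 | 92 | 92 | 92 | 83 | 0 | 79.8% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 8 | 0 | 0 | 0 | 0 | 29.8% |
| Claude 3 Haiku | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 28.6% |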
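
The Avg column appears to be the arithmetic mean of the seven run columns (# 1 to # 7). That is an inference from the numbers, not something the benchmark states. The minimal Python sketch below recomputes it for three rows of the table above; the helper name `run_average` is hypothetical. Per-run scores are shown as whole numbers, so for other rows a recomputed mean may differ from the listed Avg by a few tenths.

```python
def run_average(scores):
    """Mean of the per-run scores, rounded to one decimal like the Avg column.

    Assumption: Avg is a plain arithmetic mean over the run columns.
    """
    return round(sum(scores) / len(scores), 1)


# (per-run scores, Avg as listed) for three rows of the Generic Prompt table
rows = {
    "Claude 3 Haiku": ([100, 100, 0, 0, 0, 0, 0], 28.6),
    "DeepSeek V3.1": ([100, 100, 100, 100, 100, 92, 8], 85.7),
    "Mistral NeMO": ([100, 100, 100, 100, 100, 100, 8], 86.9),
}

for model, (scores, listed) in rows.items():
    print(f"{model}: recomputed {run_average(scores)}%, listed {listed}%")
```

For these three rows the recomputed mean matches the listed value.
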
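
| Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 98 | 98 | 98 | 96 | 98.7% |
| GPT-5.2 | 100 | 100 | 100 | 98 | 98 | 98 | 96 | 98.7% |
| GPT-5.5 (Reasoning) | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7% |
| ByteDance Seed 1.6 | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7% |
| GPT-5 Nano | 100 | 100 | 100 | 98 | 98 | 98 | 96 | 98.7% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 98 | 98 | 98 | 95 | 98.5% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 98 | 98 | 98 | 98 | 96 | 98.5% |
| Claude Opus 4 | 100 | 100 | 100 | 98 | 98 | 96 | 96 | 98.5% |
| GPT-5.5 (Reasoning, Low) | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.5% |
| Z.AI GLM 4.7 | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.5% |
| GPT-5.4 Mini | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.2% |
| Claude Opus 4.6 (Reasoning) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2% |
| Qwen3.6 Max Preview | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2% |
| GPT-5.4 (Reasoning) | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.2% |
| Claude Opus 4.7 (Reasoning) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2% |
| Claude Opus 4.6 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2% |
| GPT-5 | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.2% |
| Gemma 4 31B (Reasoning) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2% |
| GPT-5.4 Mini (Reasoning) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2% |
| Claude Sonnet 4 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2% |
| Grok 4 | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.2% |
| Gemma 4 31B | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2% |
| Qwen 3.5 122B | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 98.0% |
| Grok 4.20 (Beta, Reasoning) | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 98.0% |
| Grok 4.20 (Reasoning) | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 98.0% |
| Qwen 3.5 27B | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 98.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 98 | 98 | 96 | 93 | 98.0% |
| Z.AI GLM 4.6 | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 98.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 98 | 98 | 96 | 96 | 95 | 97.7% |
| Claude Haiku 4.5 | 100 | 98 | 98 | 98 | 96 | 96 | 96 | 97.7% |
| Aion 2.0 | 98 | 98 | 98 | 98 | 98 | 96 | 96 | 97.7% |
| MiniMax M2.5 | 98 | 98 | 98 | 98 | 98 | 96 | 96 | 97.7% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 98 | 98 | 96 | 96 | 96 | 96 | 97.5% |
| GPT-5.1 | 98 | 98 | 98 | 98 | 98 | 96 | 95 | 97.5% |
| DeepSeek V4 Flash (Reasoning) | 100 | 98 | 98 | 96 | 96 | 96 | 96 | 97.5% |
| Gemini 2.5 Pro | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.5% |
| GPT-5.4 | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.5% |
| MoonshotAI: Kimi K2.5 | 100 | 98 | 96 | 96 | 96 | 96 | 96 | 97.2% |
| Claude Opus 4.7 | 98 | 98 | 98 | 98 | 96 | 96 | 95 | 97.2% |
| Gemini 3 Pro (Preview) | 98 | 98 | 98 | 96 | 96 | 96 | 96 | 97.2% |
| Claude Sonnet 4.5 | 98 | 98 | 98 | 96 | 96 | 96 | 96 | 97.2% |
| DeepSeek V4 Pro | 98 | 98 | 98 | 96 | 96 | 96 | 96 | 97.2% |
| Qwen 3.6 35B | 98 | 98 | 98 | 96 | 96 | 96 | 96 | 97.2% |
| GPT-OSS 120B | 98 | 98 | 98 | 96 | 96 | 96 | 95 | 97.0% |
| DeepSeek V3.1 | 98 | 98 | 96 | 96 | 96 | 96 | 96 | 97.0% |
| Gemma 4 26B (Reasoning) | 98 | 98 | 96 | 96 | 96 | 96 | 95 | 96.7% |
| Qwen 3.5 35B | 98 | 98 | 96 | 96 | 96 | 96 | 95 | 96.7% |
| Stealth: Hunter Alpha | 98 | 96 | 96 | 96 | 96 | 96 | 96 | 96.7% |
| Qwen 3.5 Flash | 98 | 96 | 96 | 96 | 96 | 96 | 96 | 96.7% |
| Gemma 4 26B | 98 | 96 | 96 | 96 | 96 | 96 | 96 | 96.7% |
| GPT-4.1 Mini | 98 | 98 | 98 | 96 | 96 | 96 | 93 | 96.7% |
| Z.AI GLM 4.5 Air | 98 | 96 | 96 | 96 | 96 | 96 | 96 | 96.7% |
| Hermes 3 405B | 98 | 96 | 96 | 96 | 96 | 96 | 96 | 96.7% |
| Xiaomi MIMO v2.5 Pro | 98 | 98 | 96 | 96 | 96 | 96 | 95 | 96.7% |
| Gemini 3.1 Pro (Preview) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| Z.AI GLM 5.1 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| MoonshotAI: Kimi K2.6 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| Qwen 3.5 397B A17B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| Claude Sonnet 4.6 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| Qwen 3.6 Flash | 98 | 96 | 96 | 96 | 96 | 96 | 95 | 96.5% |
| Claude Opus 4.5 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| MiniMax M2.7 | 98 | 98 | 96 | 96 | 96 | 96 | 93 | 96.5% |
| Qwen 3.5 Plus (2026-02-15) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| GPT-5.4 Mini (Reasoning, Low) | 98 | 98 | 96 | 96 | 96 | 95 | 95 | 96.5% |
| GPT-4o, May 13th (temp=0) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| ByteDance Seed 2.0 Lite | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| Claude 3.5 Sonnet | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| Claude 3.7 Sonnet | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| GPT-4o, Aug. 6th (temp=0) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| DeepSeek V3.2 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| GPT-4o Mini (temp=1) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| GPT-4o Mini (temp=0) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5% |
| Gemini 3 Flash (Preview, Reasoning) | 98 | 96 | 96 | 96 | 96 | 96 | 93 | 96.2% |
| Gemini 3.1 Flash Lite (Reasoning) | 96 | 96 | 96 | 96 | 96 | 96 | 95 | 96.2% |
| GPT-5.4 Nano (Reasoning, Low) | 96 | 96 | 96 | 96 | 96 | 96 | 95 | 96.2% |
| GPT-5.4 Nano | 96 | 96 | 96 | 96 | 96 | 96 | 95 | 96.2% |
| Hermes 3 70B | 96 | 96 | 96 | 96 | 96 | 96 | 95 | 96.2% |
| GPT-5 Mini | 98 | 98 | 96 | 96 | 96 | 95 | 93 | 96.2% |
| Gemma 3 27B | 98 | 96 | 96 | 96 | 95 | 95 | 95 | 96.0% |
| Z.AI GLM 5 | 96 | 96 | 96 | 96 | 96 | 96 | 93 | 96.0% |
| Xiaomi MIMO v2.5 | 100 | 96 | 96 | 96 | 96 | 95 | 91 | 96.0% |
| GPT-4.1 | 96 | 96 | 96 | 96 | 95 | 95 | 95 | 95.7% |
| Stealth: Healer Alpha | 98 | 96 | 96 | 96 | 96 | 93 | 93 | 95.7% |
| Gemini 2.5 Flash Lite | 96 | 96 | 96 | 96 | 95 | 95 | 95 | 95.7% |
| Gemini 3.1 Flash Lite | 96 | 96 | 96 | 96 | 95 | 95 | 93 | 95.5% |
| Z.AI GLM 4.7 Flash | 98 | 96 | 96 | 95 | 95 | 95 | 93 | 95.5% |
| GPT-5.4 Nano (Reasoning) | 96 | 96 | 96 | 96 | 96 | 95 | 91 | 95.5% |
| Qwen 2.5 72B | 96 | 96 | 96 | 95 | 95 | 95 | 95 | 95.5% |
| GPT-4o, May 13th (temp=1) | 98 | 96 | 96 | 95 | 95 | 93 | 93 | 95.2% |
| Gemini 3 Flash (Preview) | 96 | 96 | 96 | 95 | 95 | 95 | 93 | 95.2% |
| DeepSeek V4 Flash | 96 | 96 | 96 | 96 | 96 | 95 | 88 | 95.0% |
| WizardLM 2 8x22b | 100 | 98 | 96 | 96 | 93 | 91 | 89 | 95.0% |
| Gemini 2.5 Flash (Reasoning) | 96 | 96 | 95 | 95 | 95 | 93 | 93 | 94.7% |
| Gemini 3.1 Flash Lite (Preview) | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 94.7% |
| GPT-4o, Aug. 6th (temp=1) | 98 | 96 | 95 | 95 | 95 | 91 | 91 | 94.5% |
| o4 Mini High | 98 | 96 | 95 | 95 | 95 | 91 | 89 | 94.2% |
| Inception Mercury | 98 | 96 | 96 | 95 | 93 | 93 | 82 | 93.5% |
| Mistral Small 3.2 24B | 95 | 95 | 93 | 93 | 93 | 93 | 93 | 93.5% |
| GPT-4.1 Nano | 96 | 95 | 95 | 93 | 93 | 91 | 89 | 93.2% |
| Mistral Large 3 | 93 | 93 | 93 | 93 | 93 | 93 | 93 | 93.0% |
| Mistral Large 2 | 93 | 93 | 93 | 93 | 93 | 93 | 93 | 93.0% |
| Mistral Large | 93 | 93 | 93 | 93 | 93 | 93 | 93 | 93.0% |
| Inception Mercury 2 | 100 | 98 | 98 | 96 | 96 | 95 | 63 | 92.5% |
| DeepSeek V3 (2024-12-26) | 93 | 93 | 93 | 93 | 93 | 91 | 89 | 92.2% |
| o4 Mini | 95 | 95 | 93 | 93 | 91 | 89 | 88 | 92.0% |
| DeepSeek-V2 Chat | 93 | 93 | 93 | 91 | 91 | 91 | 91 | 92.0% |
| LFM2 24B | 93 | 93 | 93 | 93 | 91 | 89 | 89 | 91.7% |
| DeepSeek V3 (2025-03-24) | 93 | 93 | 91 | 91 | 91 | 91 | 89 | 91.5% |
| Qwen 3.5 9B | 98 | 98 | 98 | 96 | 96 | 96 | 53 | 91.0% |
| Grok 4.20 (Beta) | 96 | 91 | 91 | 89 | 89 | 88 | 88 | 90.5% |
| Mistral Medium 3.1 | 91 | 91 | 88 | 88 | 88 | 88 | 88 | 88.7% |
| ByteDance Seed 2.0 Mini | 95 | 93 | 93 | 89 | 86 | 82 | 75 | 87.7% |
| Grok 4.20 | 91 | 89 | 89 | 88 | 88 | 86 | 79 | 87.2% |
| Grok 4.3 | 100 | 100 | 98 | 98 | 98 | 67 | 49 | 87.2% |
| Qwen 3 32B | 96 | 93 | 93 | 91 | 84 | 77 | 68 | 86.2% |
| Llama 3.1 70B | 93 | 89 | 84 | 84 | 84 | 82 | 81 | 85.5% |
| Gemini 2.5 Flash | 98 | 98 | 98 | 98 | 98 | 95 | 11 | 85.2% |
| Llama 3.1 Nemotron 70B | 86 | 86 | 86 | 86 | 84 | 82 | 82 | 84.7% |
| Qwen 3.6 27B | 100 | 98 | 98 | 98 | 98 | 72 | 26 | 84.5% |
| Mistral Small Creative | 93 | 82 | 82 | 82 | 82 | 82 | 81 | 83.7% |
| Mistral Small 4 (Reasoning) | 98 | 96 | 93 | 93 | 93 | 89 | 23 | 83.7% |
| Gemma 3 12B | 96 | 96 | 95 | 95 | 95 | 95 | 11 | 83.2% |
| Llama 3.1 8B | 89 | 88 | 88 | 84 | 79 | 77 | 75 | 83.0% |
| ByteDance Seed 1.6 Flash | 93 | 93 | 88 | 84 | 81 | 81 | 26 | 77.9% |
| Mistral Small 4 | 95 | 93 | 93 | 93 | 89 | 40 | 30 | 76.2% |
| Nemotron 3 Super | 100 | 100 | 100 | 98 | 96 | 18 | 11 | 74.7% |
| Arcee AI: Trinity Large (Preview) | 96 | 96 | 96 | 96 | 96 | 11 | 11 | 71.9% |
| Nemotron 3 Nano | 98 | 98 | 96 | 93 | 89 | 14 | 12 | 71.7% |
| Ministral 3 3B | 89 | 84 | 70 | 70 | 67 | 63 | 58 | 71.7% |
| Mistral NeMO | 77 | 75 | 72 | 70 | 70 | 68 | 67 | 71.4% |
| Qwen3 235B A22B Instruct 2507 | 82 | 81 | 79 | 67 | 65 | 60 | 58 | 70.2% |
| Ministral 3B | 89 | 89 | 79 | 63 | 58 | 49 | 47 | 67.9% |
| Writer: Palmyra X5 | 74 | 67 | 65 | 61 | 61 | 60 | 58 | 63.7% |
| Ministral 3 14B | 63 | 63 | 63 | 63 | 61 | 58 | 56 | 61.2% |
| Arcee AI: Trinity Mini | 96 | 82 | 79 | 79 | 77 | 7 | 4 | 60.7% |
| Ministral 8B | 53 | 51 | 47 | 44 | 44 | 44 | 42 | 46.4% |
| Ministral 3 8B | 54 | 49 | 46 | 44 | 44 | 42 | 40 | 45.6% |
| Claude 3 Haiku | 93 | 89 | 0 | 0 | 0 | 0 | 0 | 26.1% |
| Cohere Command R+ (Aug. 2024) | 89 | 16 | 12 | 4 | 2 | 0 | 0 | 17.5% |
| Rocinante 12B | 93 | 12 | 9 | 4 | 2 | 0 | 0 | 17.0% |
| Gemma 3 4B | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 10.5% |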
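
| Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.3% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.3% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.3% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.0% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 98 | 98 | 95 | 98.7% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 98 | 98 | 95 | 98.7% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 98.7% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 98.7% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98.7% |
| Inception Mercury 2 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98.7% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 98 | 98 | 95 | 98.7% |
| Qwen 2.5 72B | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98.7% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.3% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 98 | 98 | 98 | 95 | 98.3% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.3% |
| DeepSeek V4 Pro | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.3% |
| Grok 4.20 | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.0% |
| Mistral Small 4 | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.0% |
| Claude Opus 4.7 (Reasoning) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 97.7% |
| Gemma 4 31B (Reasoning) | 100 | 98 | 98 | 98 | 98 | 98 | 95 | 97.7% |
| GPT-4.1 Mini | 98 | 98 | 98 | 98 | 98 | 95 | 95 | 97.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 98 | 74 | 96.0% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 98 | 98 | 95 | 95 | 86 | 96.0% |
| Mistral Medium 3.1 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95.3% |
| Mistral NeMO | 98 | 95 | 95 | 95 | 95 | 93 | 93 | 95.0% |
| GPT-5.4 Nano (Reasoning) | 98 | 95 | 95 | 95 | 93 | 93 | 93 | 94.7% |
| Inception Mercury | 98 | 98 | 95 | 95 | 95 | 91 | 88 | 94.4% |
| GPT-5.4 Nano (Reasoning, Low) | 98 | 95 | 95 | 95 | 93 | 93 | 91 | 94.4% |
| Llama 3.1 8B | 100 | 95 | 95 | 95 | 91 | 91 | 91 | 94.0% |
| Gemma 3 12B | 98 | 93 | 93 | 93 | 93 | 93 | 93 | 93.7% |
| Ministral 3 14B | 95 | 95 | 93 | 93 | 93 | 93 | 93 | 93.7% |
| GPT-5.4 Nano | 95 | 95 | 95 | 95 | 93 | 88 | 88 | 93.0% |
| GPT-4.1 Nano | 93 | 93 | 93 | 91 | 91 | 91 | 91 | 91.7% |
| Mistral Small Creative | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 90.7% |
| Arcee AI: Trinity Mini | 100 | 95 | 95 | 93 | 93 | 93 | 60 | 90.0% |
| Qwen 3.5 9B | 100 | 100 | 98 | 98 | 98 | 98 | 33 | 89.0% |
| Ministral 3 8B | 93 | 93 | 93 | 93 | 88 | 88 | 72 | 88.7% |
| Ministral 8B | 95 | 93 | 93 | 93 | 88 | 72 | 67 | 86.0% |
| MiniMax M2.5 | 100 | 100 | 100 | 98 | 98 | 98 | 2 | 85.0% |
| Qwen 3 32B | 98 | 98 | 98 | 98 | 98 | 12 | 12 | 73.1% |
| Nemotron 3 Nano | 100 | 95 | 95 | 95 | 47 | 44 | 28 | 72.1% |
| GPT-5.4 Mini (Reasoning, Low) | 74 | 70 | 67 | 67 | 65 | 60 | 51 | 65.1% |
| GPT-5.4 Mini | 70 | 65 | 65 | 65 | 63 | 63 | 63 | 64.8% |
| Arcee AI: Trinity Large (Preview) | 100 | 95 | 53 | 51 | 51 | 51 | 2 | 57.8% |
| Rocinante 12B | 95 | 86 | 74 | 40 | 21 | 16 | 2 | 47.8% |
| Gemma 3 4B | 67 | 35 | 35 | 33 | 33 | 30 | 28 | 37.2% |
| Ministral 3B | 37 | 37 | 37 | 37 | 37 | 28 | 26 | 34.2% |
| Ministral 3 3B | 37 | 37 | 33 | 33 | 30 | 28 | 28 | 32.2% |
| Cohere Command R+ (Aug. 2024) | 60 | 37 | 37 | 33 | 28 | 16 | 14 | 32.2% |
| Claude 3 Haiku | 93 | 93 | 0 | 0 | 0 | 0 | 0 | 26.6% |
| LFM2 24B | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 3.3% |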
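
| Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | Avg ▼ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4% |
| Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 96.4% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9% |
| GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 88 | 63 | 92.9% |
| Gemini 3 Flash (Preview) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| GPT-4o Mini (temp=1) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| Qwen 2.5 72B | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| GPT-5.4 Nano | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 88 | 38 | 89.3% |
| GPT-5.4 Nano (Reasoning) | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 89.3% |
| GPT-4o Mini (temp=0) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Mistral Medium 3.1 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Mistral Small Creative | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Ministral 3 14B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Arcee AI: Trinity Mini | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Gemma 3 4B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5% |
| Ministral 3 3B | 63 | 50 | 50 | 50 | 38 | 38 | 38 | 46.4% |
| Ministral 3B | 88 | 63 | 50 | 38 | 38 | 25 | 25 | 46.4% |
| Mistral NeMO | 75 | 50 | 38 | 38 | 38 | 25 | 25 | 41.1% |
| Ministral 3 8B | 38 | 38 | 38 | 38 | 38 | 38 | 25 | 35.7% |
| Cohere Command R+ (Aug. 2024) | 75 | 50 | 50 | 25 | 25 | 13 | 13 | 35.7% |
| Rocinante 12B | 88 | 50 | 50 | 25 | 13 | 13 | 0 | 33.9% |
| Ministral 8B | 38 | 38 | 38 | 38 | 25 | 25 | 13 | 30.4% |
| LFM2 24B | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25.0% |
| Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |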
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
Qwen3.6 Max Preview 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.6 100 100 100 100 100 100 100 100.0%
GPT-5 100 100 100 100 100 100 100 100.0%
Gemma 4 26B (Reasoning) 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Reasoning) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.5 100 100 100 100 100 100 100 100.0%
ByteDance Seed 1.6 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Pro (Reasoning) 100 100 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100 100 100.0%
Aion 2.0 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.6 100 100 100 100 100 100 100 100.0%
MiniMax M2.7 100 100 100 100 100 100 100 100.0%
GPT-5.5 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Flash (Reasoning) 100 100 100 100 100 100 100 100.0%
Gemini 3 Pro (Preview) 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 100 100 100 100 100 100 100 100.0%
GPT-4.1 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100 100 100.0%
Grok 4 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100 100 100.0%
Claude Opus 4 100 100 100 100 100 100 100 100.0%
Stealth: Hunter Alpha 100 100 100 100 100 100 100 100.0%
Gemma 4 31B 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite (Reasoning) 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.5 100 100 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 100 100 100.0%
Gemma 4 26B 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview) 100 100 100 100 100 100 100 100.0%
Claude Haiku 4.5 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Lite 100 100 100 100 100 100 100 100.0%
Claude 3.5 Sonnet 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=1) 100 100 100 100 100 100 100 100.0%
Claude 3.7 Sonnet 100 100 100 100 100 100 100 100.0%
GPT-4.1 Mini 100 100 100 100 100 100 100 100.0%
Hermes 3 405B 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Pro 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=0) 100 100 100 100 100 100 100 100.0%
Qwen 3 32B 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash 100 100 100 100 100 100 100 100.0%
Qwen3 235B A22B Instruct 2507 100 100 100 100 100 100 100 100.0%
Gemma 3 27B 100 100 100 100 100 100 100 100.0%
Llama 3.1 Nemotron 70B 100 100 100 100 100 100 100 100.0%
Mistral Small Creative 100 100 100 100 100 100 100 100.0%
Gemma 3 4B 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5.1 100 100 100 100 100 100 97 99.6%
Claude Opus 4.7 (Reasoning) 100 100 100 100 100 100 97 99.6%
GPT-5.5 (Reasoning) 100 100 100 100 100 100 97 99.6%
o4 Mini High 100 100 100 100 100 100 97 99.6%
Mistral Medium 3.1 100 100 100 100 100 100 97 99.6%
Z.AI GLM 5 Turbo 100 100 100 100 100 100 95 99.2%
Qwen 3.5 122B 100 100 100 100 100 97 97 99.2%
Qwen 3.6 27B 100 100 100 100 100 97 97 99.2%
o4 Mini 100 100 100 100 100 97 97 99.2%
Z.AI GLM 4.5 Air 100 100 100 100 100 97 97 99.2%
Writer: Palmyra X5 100 100 100 100 100 100 95 99.2%
Llama 3.1 70B 100 100 100 100 100 100 95 99.2%
Claude 3 Haiku 100 100 100 100 100 100 95 99.2%
Arcee AI: Trinity Mini 100 100 100 100 100 97 97 99.2%
Ministral 3 3B 100 100 100 100 100 97 97 99.2%
GPT-5.4 (Reasoning) 100 100 100 100 100 97 95 98.8%
Grok 4.3 (Reasoning) 100 100 100 100 97 97 97 98.8%
GPT-5.4 100 100 100 100 97 97 97 98.8%
GPT-4o, Aug. 6th (temp=1) 100 100 100 100 97 97 97 98.8%
Qwen 3.5 397B A17B 100 100 100 100 100 95 95 98.5%
GPT-5.2 100 100 100 100 97 97 95 98.5%
Qwen 3.6 35B 100 100 100 97 97 97 97 98.5%
Grok 4.20 100 100 100 100 100 95 95 98.5%
Gemma 4 31B (Reasoning) 100 100 100 97 97 97 97 98.5%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 95 95 98.5%
Xiaomi MIMO v2.5 100 100 100 100 97 97 95 98.5%
GPT-5 Mini 100 100 100 97 97 97 95 98.1%
Qwen 3.5 Plus (2026-04-20) 100 100 100 97 97 97 95 98.1%
Qwen 3.5 9B 100 100 100 97 97 97 95 98.1%
Ministral 3B 100 100 100 97 97 97 95 98.1%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 95 95 95 97.7%
Qwen 3.6 Flash 100 100 97 97 97 97 95 97.7%
Stealth: Healer Alpha 100 97 97 97 97 97 97 97.7%
Qwen 2.5 72B 100 100 100 100 95 95 95 97.7%
Inception Mercury 100 100 100 97 97 95 95 97.7%
Qwen 3.5 27B 97 97 97 97 97 97 97 97.3%
MiniMax M2.5 100 97 97 97 97 97 95 97.3%
Gemini 2.5 Flash (Reasoning) 97 97 97 97 97 97 97 97.3%
GPT-OSS 120B 97 97 97 97 97 97 97 97.3%
Inception Mercury 2 97 97 97 97 97 97 97 97.3%
GPT-5.4 Nano (Reasoning, Low) 100 97 97 97 97 97 95 97.3%
DeepSeek-V2 Chat 100 100 100 95 95 95 95 96.9%
Grok 4.20 (Beta) 100 100 100 95 95 95 95 96.9%
Mistral Large 100 100 100 95 95 95 95 96.9%
Claude Opus 4.7 100 97 97 97 95 95 95 96.5%
DeepSeek V3 (2025-03-24) 100 100 100 95 95 95 92 96.5%
Qwen 3.5 35B 100 100 97 97 95 92 92 96.1%
DeepSeek V3 (2024-12-26) 100 100 95 95 95 95 95 96.1%
Hermes 3 70B 100 100 95 95 95 95 95 96.1%
GPT-5 Nano 100 100 97 97 92 92 92 95.8%
GPT-5.4 Nano (Reasoning) 100 97 97 95 95 95 92 95.8%
ByteDance Seed 1.6 Flash 97 97 97 97 97 95 89 95.8%
Qwen 3.5 Flash 100 100 97 95 92 92 92 95.4%
GPT-5.4 Nano 97 97 97 97 95 92 92 95.4%
WizardLM 2 8x22b 100 95 95 95 95 95 95 95.4%
Ministral 3 14B 97 97 95 95 95 95 95 95.4%
Gemini 2.5 Flash Lite (Reasoning) 100 95 95 95 95 92 92 94.6%
Mistral Large 3 95 95 95 95 95 95 95 94.6%
GPT-5.4 Mini 95 95 95 95 95 95 95 94.6%
Mistral Large 2 95 95 95 95 95 95 95 94.6%
DeepSeek V3.2 95 95 95 95 95 95 95 94.6%
DeepSeek V4 Flash 95 95 95 95 95 95 95 94.6%
GPT-4o Mini (temp=1) 95 95 95 95 95 95 95 94.6%
Mistral Small 3.2 24B 95 95 95 95 95 95 95 94.6%
Gemma 3 12B 95 95 95 95 95 95 95 94.6%
GPT-4o Mini (temp=0) 95 95 95 95 95 95 95 94.6%
Mistral Small 4 95 95 95 95 95 95 95 94.6%
Ministral 3 8B 95 95 95 95 95 95 95 94.6%
Mistral NeMO 95 95 95 95 95 95 95 94.6%
Grok 4.3 95 95 95 95 95 95 92 94.2%
Ministral 8B 95 95 95 95 95 95 92 94.2%
Z.AI GLM 4.7 Flash 100 97 95 95 95 86 86 93.4%
ByteDance Seed 2.0 Mini 95 95 95 95 95 92 86 93.1%
Nemotron 3 Super 97 97 95 95 92 92 78 92.3%
Cohere Command R+ (Aug. 2024) 95 95 95 95 95 95 78 92.3%
Llama 3.1 8B 95 95 95 95 95 92 81 92.3%
Mistral Small 4 (Reasoning) 100 100 97 97 95 92 49 90.0%
LFM2 24B 86 86 86 86 86 86 86 86.5%
Gemini 2.5 Flash Lite 86 86 86 86 86 86 51 81.5%
DeepSeek V3.1 95 95 95 95 95 95 0 81.1%
Nemotron 3 Nano 97 97 95 95 95 41 30 78.4%
Xiaomi MIMO v2.5 Pro 100 100 100 100 100 0 0 71.4%
Rocinante 12B 100 92 89 86 51 51 0 67.2%
Arcee AI: Trinity Large (Preview) 100 0 0 0 0 0 0 14.3%
GPT-4.1 Nano 51 3 3 3 3 3 0 9.3%