Name replacement accuracy

Test: Text Replacement

Avg. Score: 95.5%
Scenarios: 14

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 3.1 Flash Lite (Preview) | 99.3% | $0.0009 | 1.7s | 97%
2 | Grok 4 Fast | 99.6% | $0.0007 | 5.8s | 98%
3 | Claude Haiku 4.5 | 99.7% | $0.0034 | 2.8s | 98%
4 | Grok 4.1 Fast | 99.5% | $0.0008 | 10.6s | 98%
5 | GPT-4.1 Mini | 98.9% | $0.0010 | 6.6s | 95%
6 | Qwen 3.5 Plus (2026-02-15) | 99.2% | $0.0015 | 6.7s | 95%
7 | Mistral Large 3 | 98.7% | $0.0011 | 7.4s | 96%
8 | Gemini 3 Flash (Preview) | 98.4% | $0.0018 | 3.2s | 94%
9 | Stealth: Hunter Alpha | 99.4% | $0.0000 | 17.2s | 97%
10 | GPT-4.1 | 99.4% | $0.0052 | 4.1s | 97%
11 | Inception Mercury 2 | 98.3% | $0.0016 | 2.2s | 92%
12 | GPT-4o Mini (temp=1) | 98.2% | $0.0004 | 9.2s | 94%
13 | Qwen 2.5 72B | 98.4% | $0.0003 | 10.3s | 94%
14 | Gemini 2.5 Flash Lite | 98.1% | $0.0003 | 1.6s | 88%
15 | Stealth: Healer Alpha | 98.9% | $0.0000 | 13.4s | 94%
16 | GPT-4o Mini (temp=0) | 98.0% | $0.0004 | 9.2s | 93%
17 | Mistral Large | 98.8% | $0.0042 | 7.2s | 96%
18 | Mistral Large 2 | 98.8% | $0.0042 | 7.2s | 96%
19 | Gemma 3 27B | 98.6% | $0.0002 | 16.0s | 94%
20 | GPT-4o, Aug. 6th (temp=0) | 99.0% | $0.0064 | 2.6s | 94%
21 | DeepSeek-V2 Chat | 98.4% | $0.0008 | 16.5s | 94%
22 | Gemini 2.5 Flash (Reasoning) | 99.1% | $0.0054 | 10.7s | 97%
23 | GPT-5.4 Nano (Reasoning) | 96.9% | $0.0011 | 5.1s | 90%
24 | Claude Sonnet 4 | 99.8% | $0.010 | 5.7s | 99%
25 | GPT-4o, May 13th (temp=0) | 99.7% | $0.010 | 3.5s | 98%
26 | Claude Sonnet 4.5 | 99.6% | $0.010 | 4.6s | 98%
27 | GPT-5.4 | 99.2% | $0.0089 | 5.4s | 97%
28 | Gemini 2.5 Flash | 98.7% | $0.0014 | 2.1s | 82%
29 | GPT-4o, Aug. 6th (temp=1) | 98.7% | $0.0064 | 2.6s | 92%
30 | GPT-5.4 (Reasoning, Low) | 99.4% | $0.0094 | 5.6s | 97%
31 | Grok 4.20 (Beta) | 97.6% | $0.0032 | 1.8s | 87%
32 | GPT-5.4 Nano | 95.9% | $0.0007 | 3.1s | 87%
33 | Llama 3.1 8B | 96.5% | $0.0000 | 9.3s | 89%
34 | Mistral Medium 3.1 | 96.2% | $0.0012 | 6.0s | 89%
35 | Z.AI GLM 4.5 | 99.7% | $0.0031 | 26.1s | 98%
36 | Claude 3.7 Sonnet | 99.4% | $0.010 | 5.5s | 97%
37 | GPT-4o, May 13th (temp=1) | 99.2% | $0.010 | 3.3s | 96%
38 | Claude Sonnet 4.6 | 99.2% | $0.010 | 4.4s | 96%
39 | GPT-5.2 | 98.8% | $0.0087 | 6.0s | 94%
40 | Llama 3.1 Nemotron 70B | 97.7% | $0.0014 | 13.9s | 89%
41 | Mistral Small 3.2 24B | 97.7% | $0.0002 | 4.7s | 80%
42 | GPT-5.4 Mini (Reasoning) | 98.4% | $0.0053 | 7.2s | 88%
43 | Z.AI GLM 5 Turbo | 99.7% | $0.0083 | 18.8s | 98%
44 | Gemma 3 12B | 97.3% | $0.0001 | 8.6s | 82%
45 | Llama 3.1 70B | 98.0% | $0.0005 | 23.9s | 91%
46 | GPT-5.4 Nano (Reasoning, Low) | 95.3% | $0.0007 | 3.7s | 83%
47 | Gemini 3 Flash (Preview, Reasoning) | 99.5% | $0.0097 | 17.2s | 98%
48 | GPT-5.4 Mini | 96.1% | $0.0027 | 2.1s | 82%
49 | Grok 4.20 (Beta, Reasoning) | 99.5% | $0.014 | 7.8s | 98%
50 | GPT-5.4 Mini (Reasoning, Low) | 96.1% | $0.0029 | 3.3s | 82%
51 | GPT-5.1 | 99.5% | $0.013 | 13.7s | 98%
52 | GPT-5 Mini | 99.3% | $0.0052 | 33.1s | 97%
53 | DeepSeek V3 (2024-12-26) | 97.6% | $0.0007 | 15.0s | 80%
54 | DeepSeek V3.2 | 98.9% | $0.0004 | 44.5s | 96%
55 | Claude Opus 4.6 | 99.5% | $0.017 | 5.6s | 98%
56 | Claude Opus 4.5 | 99.3% | $0.017 | 5.2s | 97%
57 | Z.AI GLM 4.6 | 99.7% | $0.0045 | 42.4s | 98%
58 | ByteDance Seed 2.0 Lite | 99.7% | $0.0040 | 44.8s | 98%
59 | Mistral Small 4 | 95.2% | $0.0004 | 3.2s | 72%
60 | ByteDance Seed 1.6 | 99.4% | $0.0040 | 43.2s | 98%
61 | Inception Mercury | 94.0% | $0.0004 | 4.5s | 76%
62 | GPT-5.4 (Reasoning) | 99.2% | $0.016 | 13.3s | 97%
63 | ByteDance Seed 1.6 Flash | 95.5% | $0.0007 | 12.2s | 76%
64 | Aion 2.0 | 99.5% | $0.0038 | 48.1s | 97%
65 | Qwen3 235B A22B Instruct 2507 | 95.6% | $0.0003 | 14.8s | 75%
66 | GPT-5 Nano | 98.8% | $0.0023 | 50.9s | 96%
67 | Claude 3.5 Sonnet | 99.3% | $0.020 | 9.0s | 97%
68 | Ministral 3 14B | 93.4% | $0.0002 | 3.9s | 70%
69 | Qwen 3.5 Flash | 98.7% | $0.0027 | 48.6s | 93%
70 | Mistral Small Creative | 92.7% | $0.0002 | 2.9s | 71%
71 | Hermes 3 405B | 97.3% | $0.0012 | 22.3s | 74%
72 | Gemini 2.5 Flash Lite (Reasoning) | 96.3% | $0.0018 | 16.2s | 71%
73 | Writer: Palmyra X5 | 94.3% | $0.0034 | 11.1s | 73%
74 | Mistral Small 4 (Reasoning) | 94.1% | $0.0017 | 14.8s | 69%
75 | Qwen 3.5 35B | 99.0% | $0.013 | 43.1s | 96%
76 | Z.AI GLM 5 | 99.5% | $0.0083 | 60.0s | 97%
77 | Qwen 3 32B | 95.3% | $0.0008 | 32.0s | 72%
78 | o4 Mini | 97.7% | $0.013 | 21.3s | 80%
79 | Grok 4 | 99.6% | $0.023 | 28.7s | 98%
80 | Gemini 2.5 Pro | 99.6% | $0.027 | 19.7s | 98%
81 | WizardLM 2 8x22b | 96.2% | $0.0007 | 37.2s | 70%
82 | o4 Mini High | 99.1% | $0.021 | 33.0s | 96%
83 | Claude Opus 4.6 (Reasoning) | 99.6% | $0.030 | 12.5s | 98%
84 | MiniMax M2.5 | 97.7% | $0.0018 | 59.0s | 80%
85 | Arcee AI: Trinity Mini | 88.5% | $0.0002 | 6.5s | 58%
86 | GPT-4.1 Nano | 89.9% | $0.0003 | 3.6s | 52%
87 | Claude Sonnet 4.6 (Reasoning) | 99.6% | $0.032 | 19.3s | 98%
88 | Qwen 3.5 27B | 99.5% | $0.015 | 1.1m | 98%
89 | Z.AI GLM 4.7 | 99.7% | $0.0070 | 1.5m | 97%
90 | Nemotron 3 Super | 95.2% | $0.0000 | 51.1s | 66%
91 | Ministral 3 8B | 85.7% | $0.0002 | 3.3s | 52%
92 | Mistral NeMO | 85.5% | $0.0002 | 2.5s | 51%
93 | DeepSeek V3 (2025-03-24) | 92.5% | $0.0006 | 35.4s | 58%
94 | GPT-5 | 99.8% | $0.029 | 44.3s | 99%
95 | Ministral 8B | 84.8% | $0.0001 | 3.3s | 50%
96 | Qwen 3.5 122B | 99.6% | $0.025 | 59.3s | 98%
97 | ByteDance Seed 2.0 Mini | 97.6% | $0.0017 | 1.7m | 91%
98 | Gemini 3 Pro (Preview) | 99.5% | $0.038 | 26.2s | 98%
99 | MiniMax M2.7 | 96.7% | $0.0058 | 1.2m | 74%
100 | Z.AI GLM 4.7 Flash | 94.0% | $0.0015 | 1.1m | 68%
101 | Claude Opus 4 | 99.8% | $0.051 | 7.7s | 98%
102 | Arcee AI: Trinity Large (Preview) | 88.0% | $0.0000 | 23.8s | 42%
103 | DeepSeek V3.1 | 88.5% | $0.0006 | 34.3s | 46%
104 | Qwen 3.5 9B | 96.8% | $0.0012 | 1.8m | 78%
105 | Ministral 3 3B | 78.9% | $0.0001 | 2.0s | 43%
106 | MoonshotAI: Kimi K2.5 | 99.6% | $0.011 | 2.1m | 98%
107 | Ministral 3B | 78.4% | $0.0000 | 2.0s | 39%
108 | Qwen 3.5 397B A17B | 99.4% | $0.0089 | 2.2m | 97%
109 | Gemini 3.1 Pro (Preview) | 98.8% | $0.045 | 42.6s | 87%
110 | LFM2 24B | 76.9% | $0.0001 | 11.9s | 30%
111 | Gemma 3 4B | 73.9% | $0.0001 | 6.0s | 24%
112 | Rocinante 12B | 63.8% | $0.0004 | 8.0s | 17%
113 | Cohere Command R+ (Aug. 2024) | 71.4% | $0.0067 | 27.2s | 23%
114 | Nemotron 3 Nano | 83.4% | $0.0016 | 2.0m | 46%
115 | Claude 3 Haiku | 61.2% | $0.0009 | 4.3s | 4%
116 | Hermes 3 70B | 67.7% | $0.0015 | 2.0m | 10%

Individual Scenarios

Generic Prompt

Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100100100.0%
GPT-5 Mini100100100100100100100100.0%
GPT-5.1100100100100100100100100.0%
Claude Opus 4.6100100100100100100100100.0%
GPT-5100100100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100100100.0%
Qwen 3.5 122B100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100.0%
Claude Sonnet 4.6100100100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100100100.0%
Qwen 3.5 27B100100100100100100100100.0%
ByteDance Seed 1.6100100100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100100100.0%
o4 Mini High100100100100100100100100.0%
GPT-5.2100100100100100100100100.0%
Claude Opus 4.5100100100100100100100100.0%
Grok 4.1 Fast100100100100100100100100.0%
Aion 2.0100100100100100100100100.0%
Z.AI GLM 4.6100100100100100100100100.0%
MiniMax M2.7100100100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100100100.0%
Claude Sonnet 4100100100100100100100100.0%
MiniMax M2.5100100100100100100100100.0%
Z.AI GLM 4.7100100100100100100100100.0%
GPT-4.1100100100100100100100100.0%
Gemini 2.5 Pro100100100100100100100100.0%
o4 Mini100100100100100100100100.0%
Grok 4100100100100100100100100.0%
Claude Sonnet 4.5100100100100100100100100.0%
Qwen 3.5 35B100100100100100100100100.0%
Claude Opus 4100100100100100100100100.0%
Stealth: Hunter Alpha100100100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100100100.0%
Qwen 3.5 Flash100100100100100100100100.0%
Z.AI GLM 4.5100100100100100100100100.0%
Grok 4 Fast100100100100100100100100.0%
Qwen 3.5 9B100100100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100100100.0%
Stealth: Healer Alpha100100100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100100100.0%
Mistral Large 3100100100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100100100.0%
Claude Haiku 4.5100100100100100100100100.0%
DeepSeek-V2 Chat100100100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100100100.0%
Nemotron 3 Super100100100100100100100100.0%
GPT-5.4100100100100100100100100.0%
Claude 3.5 Sonnet100100100100100100100100.0%
Grok 4.20 (Beta)100100100100100100100100.0%
Inception Mercury 2100100100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100100100.0%
Claude 3.7 Sonnet100100100100100100100100.0%
GPT-4.1 Mini100100100100100100100100.0%
Hermes 3 405B100100100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100100100.0%
GPT-5 Nano100100100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100100100.0%
GPT-5.4 Mini100100100100100100100100.0%
Mistral Large 2100100100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100100100.0%
DeepSeek V3.2100100100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100100100.0%
Gemini 2.5 Flash100100100100100100100100.0%
Mistral Large100100100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100100100.0%
Writer: Palmyra X5100100100100100100100100.0%
Inception Mercury100100100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100100100.0%
Mistral Small 3.2 24B100100100100100100100100.0%
Gemma 3 12B100100100100100100100100.0%
Llama 3.1 70B100100100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100100100.0%
Gemma 3 27B100100100100100100100100.0%
Mistral Medium 3.1100100100100100100100100.0%
Nemotron 3 Nano100100100100100100100100.0%
Mistral Small 4100100100100100100100100.0%
Qwen 2.5 72B100100100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100100100.0%
GPT-5.4 Nano100100100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100100100.0%
Mistral Small Creative100100100100100100100100.0%
Hermes 3 70B100100100100100100100100.0%
Ministral 3 14B100100100100100100100100.0%
GPT-4.1 Nano100100100100100100100100.0%
Ministral 3 8B100100100100100100100100.0%
WizardLM 2 8x22b100100100100100100100100.0%
Gemma 3 4B100100100100100100100100.0%
Ministral 8B100100100100100100100100.0%
Llama 3.1 8B100100100100100100100100.0%
LFM2 24B100100100100100100100100.0%
ByteDance Seed 2.0 Mini1001001001001001009298.8%
Qwen 3 32B1001001001001001007596.4%
DeepSeek V3 (2025-03-24)1001001001001001007596.4%
Rocinante 12B10010010010092837592.9%
Ministral 3 3B9292929292929291.7%
Ministral 3B100100929292925889.3%
Mistral NeMO100100100100100100886.9%
DeepSeek V3.110010010010010092885.7%
Arcee AI: Trinity Mini10010092929283079.8%
Cohere Command R+ (Aug. 2024)1001008000029.8%
Claude 3 Haiku1001000000028.6%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Grok 4.1 Fast1001001001001001009899.7%
Z.AI GLM 5 Turbo100100100100100989899.5%
Grok 4 Fast100100100100100989899.5%
Z.AI GLM 4.510010010010098989899.2%
GPT-5.21001001009898989698.7%
GPT-5.4 (Reasoning, Low)1001001009898989698.7%
ByteDance Seed 1.6100100989898989898.7%
GPT-5 Nano1001001009898989698.7%
Z.AI GLM 4.710098989898989898.5%
Claude Opus 41001001009898969698.5%
GPT-5.4 (Reasoning)10098989898989698.2%
GPT-510098989898989698.2%
Grok 410098989898989698.2%
GPT-5.4 Mini10098989898989698.2%
Claude Opus 4.6 (Reasoning)9898989898989898.2%
Claude Opus 4.69898989898989898.2%
GPT-5.4 Mini (Reasoning)9898989898989898.2%
Claude Sonnet 49898989898989898.2%
Qwen 3.5 27B9898989898989698.0%
Z.AI GLM 4.69898989898989698.0%
Gemini 2.5 Flash Lite (Reasoning)1001001009898969398.0%
Qwen 3.5 122B9898989898989698.0%
Grok 4.20 (Beta, Reasoning)9898989898989698.0%
Aion 2.09898989898969697.7%
MiniMax M2.59898989898969697.7%
Claude Haiku 4.510098989896969697.7%
Gemini 2.5 Pro9898989896969697.5%
Claude Sonnet 4.6 (Reasoning)10098989696969697.5%
GPT-5.19898989898969597.5%
GPT-5.49898989896969697.5%
MoonshotAI: Kimi K2.510098969696969697.2%
Claude Sonnet 4.59898989696969697.2%
Gemini 3 Pro (Preview)9898989696969697.2%
DeepSeek V3.19898969696969697.0%
Qwen 3.5 35B9898969696969596.7%
Hermes 3 405B9896969696969696.7%
Stealth: Hunter Alpha9896969696969696.7%
Qwen 3.5 Flash9896969696969696.7%
GPT-4.1 Mini9898989696969396.7%
Gemini 3.1 Pro (Preview)9696969696969696.5%
Qwen 3.5 397B A17B9696969696969696.5%
Claude Sonnet 4.69696969696969696.5%
Claude Opus 4.59696969696969696.5%
Qwen 3.5 Plus (2026-02-15)9696969696969696.5%
GPT-4o, May 13th (temp=0)9696969696969696.5%
ByteDance Seed 2.0 Lite9696969696969696.5%
Claude 3.5 Sonnet9696969696969696.5%
Claude 3.7 Sonnet9696969696969696.5%
GPT-4o, Aug. 6th (temp=0)9696969696969696.5%
DeepSeek V3.29696969696969696.5%
GPT-4o Mini (temp=1)9696969696969696.5%
GPT-4o Mini (temp=0)9696969696969696.5%
MiniMax M2.79898969696969396.5%
GPT-5.4 Mini (Reasoning, Low)9898969696959596.5%
GPT-5 Mini9898969696959396.2%
Gemini 3 Flash (Preview, Reasoning)9896969696969396.2%
GPT-5.4 Nano (Reasoning, Low)9696969696969596.2%
GPT-5.4 Nano9696969696969596.2%
Hermes 3 70B9696969696969596.2%
Gemma 3 27B9896969695959596.0%
Z.AI GLM 59696969696969396.0%
Gemini 2.5 Flash Lite9696969695959595.7%
GPT-4.19696969695959595.7%
Stealth: Healer Alpha9896969696939395.7%
GPT-5.4 Nano (Reasoning)9696969696959195.5%
Qwen 2.5 72B9696969595959595.5%
Z.AI GLM 4.7 Flash9896969595959395.5%
GPT-4o, May 13th (temp=1)9896969595939395.2%
Gemini 3 Flash (Preview)9696969595959395.2%
WizardLM 2 8x22b10098969693918995.0%
Gemini 2.5 Flash (Reasoning)9696959595939394.7%
Gemini 3.1 Flash Lite (Preview)9595959595959594.7%
GPT-4o, Aug. 6th (temp=1)9896959595919194.5%
o4 Mini High9896959595918994.2%
Inception Mercury9896969593938293.5%
Mistral Small 3.2 24B9595939393939393.5%
GPT-4.1 Nano9695959393918993.2%
Mistral Large 39393939393939393.0%
Mistral Large 29393939393939393.0%
Mistral Large9393939393939393.0%
Inception Mercury 210098989696956392.5%
DeepSeek V3 (2024-12-26)9393939393918992.2%
o4 Mini9595939391898892.0%
DeepSeek-V2 Chat9393939191919192.0%
LFM2 24B9393939391898991.7%
DeepSeek V3 (2025-03-24)9393919191918991.5%
Qwen 3.5 9B9898989696965391.0%
Grok 4.20 (Beta)9691918989888890.5%
Mistral Medium 3.19191888888888888.7%
ByteDance Seed 2.0 Mini9593938986827587.7%
Qwen 3 32B9693939184776886.2%
Llama 3.1 70B9389848484828185.5%
Gemini 2.5 Flash9898989898951185.2%
Llama 3.1 Nemotron 70B8686868684828284.7%
Mistral Small Creative9382828282828183.7%
Mistral Small 4 (Reasoning)9896939393892383.7%
Gemma 3 12B9696959595951183.2%
Llama 3.1 8B8988888479777583.0%
ByteDance Seed 1.6 Flash9393888481812677.9%
Mistral Small 49593939389403076.2%
Nemotron 3 Super1001001009896181174.7%
Arcee AI: Trinity Large (Preview)9696969696111171.9%
Nemotron 3 Nano9898969389141271.7%
Ministral 3 3B8984707067635871.7%
Mistral NeMO7775727070686771.4%
Qwen3 235B A22B Instruct 25078281796765605870.2%
Ministral 3B8989796358494767.9%
Writer: Palmyra X57467656161605863.7%
Ministral 3 14B6363636361585661.2%
Arcee AI: Trinity Mini96827979777460.7%
Ministral 8B5351474444444246.4%
Ministral 3 8B5449464444424045.6%
Claude 3 Haiku93890000026.1%
Cohere Command R+ (Aug. 2024)891612420017.5%
Rocinante 12B93129420017.0%
Gemma 3 4B1111111111111110.5%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100.0%
GPT-5.1100100100100100100100100.0%
Claude Opus 4.6100100100100100100100100.0%
GPT-5100100100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100100100.0%
Qwen 3.5 122B100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100100100.0%
Qwen 3.5 27B100100100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100100100.0%
GPT-5.2100100100100100100100100.0%
Claude Opus 4.5100100100100100100100100.0%
Grok 4.1 Fast100100100100100100100100.0%
Z.AI GLM 4.6100100100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100100100.0%
Claude Sonnet 4100100100100100100100100.0%
Z.AI GLM 4.7100100100100100100100100.0%
Gemini 2.5 Pro100100100100100100100100.0%
Grok 4100100100100100100100100.0%
Claude Sonnet 4.5100100100100100100100100.0%
Claude Opus 4100100100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100100100.0%
Z.AI GLM 4.5100100100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100100100.0%
Mistral Large 3100100100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100100100.0%
Claude Haiku 4.5100100100100100100100100.0%
DeepSeek-V2 Chat100100100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100100100.0%
Nemotron 3 Super100100100100100100100100.0%
Claude 3.5 Sonnet100100100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100100100.0%
Claude 3.7 Sonnet100100100100100100100100.0%
Hermes 3 405B100100100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100100100.0%
Mistral Large 2100100100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100100100.0%
Gemini 2.5 Flash100100100100100100100100.0%
Mistral Large100100100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100100100.0%
Writer: Palmyra X5100100100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100100100.0%
Mistral Small 3.2 24B100100100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100100100.0%
Hermes 3 70B100100100100100100100100.0%
WizardLM 2 8x22b100100100100100100100100.0%
ByteDance Seed 1.61001001001001001009899.7%
o4 Mini High1001001001001001009899.7%
MiniMax M2.71001001001001001009899.7%
GPT-4.11001001001001001009899.7%
Grok 4 Fast1001001001001001009899.7%
Z.AI GLM 4.7 Flash1001001001001001009899.7%
DeepSeek V3.21001001001001001009899.7%
DeepSeek V3 (2025-03-24)1001001001001001009899.7%
Llama 3.1 70B1001001001001001009899.7%
ByteDance Seed 1.6 Flash1001001001001001009899.7%
Claude Sonnet 4.6100100100100100989899.3%
GPT-5.4 Mini (Reasoning)100100100100100989899.3%
Aion 2.0100100100100100989899.3%
o4 Mini100100100100100989899.3%
Qwen 3.5 35B1001001001001001009599.3%
DeepSeek V3.11001001001001001009599.3%
GPT-5.4 (Reasoning, Low)10010010010098989899.0%
ByteDance Seed 2.0 Mini10010010010098989899.0%
GPT-5.410010010010098989899.0%
GPT-5 Nano10010010010098989899.0%
GPT-5 Mini10010010010098989598.7%
Stealth: Hunter Alpha1001001001001001009198.7%
Qwen 3.5 Flash100100100100100959598.7%
Grok 4.20 (Beta)1001001009898989898.7%
Inception Mercury 21001001009898989898.7%
Gemma 3 27B10010010010098989598.7%
Qwen 2.5 72B1001001009898989898.7%
Claude Sonnet 4.6 (Reasoning)100100989898989898.3%
GPT-5.4 (Reasoning)1001001009898989598.3%
Gemini 2.5 Flash Lite (Reasoning)100100989898989898.3%
Mistral Small 410098989898989898.0%
GPT-4.1 Mini9898989898959597.0%
Stealth: Healer Alpha100100100100100987496.0%
Mistral Small 4 (Reasoning)100100989895958696.0%
Mistral Medium 3.19595959595959595.3%
Mistral NeMO9895959595939395.0%
GPT-5.4 Nano (Reasoning)9895959593939394.7%
Inception Mercury9898959595918894.4%
GPT-5.4 Nano (Reasoning, Low)9895959593939194.4%
Llama 3.1 8B10095959591919194.0%
Ministral 3 14B9595939393939393.7%
Gemma 3 12B9893939393939393.7%
GPT-5.4 Nano9595959593888893.0%
GPT-4.1 Nano9393939191919191.7%
Mistral Small Creative9191919191919190.7%
Arcee AI: Trinity Mini10095959393936090.0%
Qwen 3.5 9B100100989898983389.0%
Ministral 3 8B9393939388887288.7%
Ministral 8B9593939388726786.0%
MiniMax M2.5100100100989898285.0%
Qwen 3 32B9898989898121273.1%
Nemotron 3 Nano10095959547442872.1%
GPT-5.4 Mini (Reasoning, Low)7470676765605165.1%
GPT-5.4 Mini7065656563636364.8%
Arcee AI: Trinity Large (Preview)1009553515151257.8%
Rocinante 12B958674402116247.8%
Gemma 3 4B6735353333302837.2%
Ministral 3B3737373737282634.2%
Ministral 3 3B3737333330282832.2%
Cohere Command R+ (Aug. 2024)6037373328161432.2%
Claude 3 Haiku93930000026.6%
LFM2 24B55522223.3%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100100100.0%
GPT-5 Mini100100100100100100100100.0%
GPT-5.1100100100100100100100100.0%
Claude Opus 4.6100100100100100100100100.0%
GPT-5100100100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100100100.0%
Qwen 3.5 122B100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100100100.0%
Qwen 3.5 27B100100100100100100100100.0%
ByteDance Seed 1.6100100100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100100100.0%
o4 Mini High100100100100100100100100.0%
GPT-5.2100100100100100100100100.0%
Claude Opus 4.5100100100100100100100100.0%
Grok 4.1 Fast100100100100100100100100.0%
Aion 2.0100100100100100100100100.0%
Z.AI GLM 4.6100100100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100100100.0%
Claude Sonnet 4100100100100100100100100.0%
MiniMax M2.5100100100100100100100100.0%
GPT-4.1100100100100100100100100.0%
Gemini 2.5 Pro100100100100100100100100.0%
o4 Mini100100100100100100100100.0%
Grok 4100100100100100100100100.0%
Claude Sonnet 4.5100100100100100100100100.0%
Qwen 3.5 35B100100100100100100100100.0%
Claude Opus 4100100100100100100100100.0%
Stealth: Hunter Alpha100100100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100100100.0%
Qwen 3.5 Flash100100100100100100100100.0%
Z.AI GLM 4.5100100100100100100100100.0%
Grok 4 Fast100100100100100100100100.0%
Qwen 3.5 9B100100100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100100100.0%
Stealth: Healer Alpha100100100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100100100.0%
Mistral Large 3100100100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100100100.0%
Claude Haiku 4.5100100100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100100100.0%
Nemotron 3 Super100100100100100100100100.0%
GPT-5.4100100100100100100100100.0%
Claude 3.5 Sonnet100100100100100100100100.0%
Inception Mercury 2100100100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100100100.0%
Claude 3.7 Sonnet100100100100100100100100.0%
GPT-4.1 Mini100100100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100100100.0%
GPT-5 Nano100100100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100100100.0%
GPT-5.4 Mini100100100100100100100100.0%
Mistral Large 2100100100100100100100100.0%
DeepSeek V3.2100100100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100100100.0%
Gemini 2.5 Flash100100100100100100100100.0%
Mistral Large100100100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100100100.0%
Writer: Palmyra X5100100100100100100100100.0%
Inception Mercury100100100100100100100100.0%
Mistral Small 3.2 24B100100100100100100100100.0%
Gemma 3 12B100100100100100100100100.0%
Llama 3.1 70B100100100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100100100.0%
WizardLM 2 8x22b100100100100100100100100.0%
MiniMax M2.71001001001001001008898.2%
Z.AI GLM 4.71001001001001001008898.2%
DeepSeek-V2 Chat1001001001001001008898.2%
Grok 4.20 (Beta)1001001001001001008898.2%
GPT-4o, May 13th (temp=1)1001001001001001008898.2%
DeepSeek V3.11001001001001001008898.2%
ByteDance Seed 1.6 Flash1001001001001001008898.2%
Claude Sonnet 4.6100100100100100888896.4%
Z.AI GLM 4.7 Flash100100100100100888896.4%
Hermes 3 405B100100100100100888896.4%
Mistral Small 4 (Reasoning)100100100100100888896.4%
Qwen 3 32B100100100100100888896.4%
Nemotron 3 Nano100100100100100888896.4%
Mistral Small 4100100100100100888896.4%
Llama 3.1 8B1001001001001001007596.4%
ByteDance Seed 2.0 Mini10010010010088888894.6%
Gemma 3 27B10010010010088888894.6%
Arcee AI: Trinity Large (Preview)10010010010088888894.6%
Hermes 3 70B10010010010088888894.6%
GPT-5.4 Nano (Reasoning, Low)1001001008888888892.9%
GPT-4.1 Nano100100100100100886392.9%
Gemini 3 Flash (Preview)100100888888888891.1%
GPT-4o Mini (temp=1)100100888888888891.1%
Qwen 2.5 72B100100888888888891.1%
GPT-5.4 Nano100100888888888891.1%
DeepSeek V3 (2025-03-24)100100100100100883889.3%
GPT-5.4 Nano (Reasoning)10088888888888889.3%
GPT-4o Mini (temp=0)8888888888888887.5%
Mistral Medium 3.18888888888888887.5%
Mistral Small Creative8888888888888887.5%
Ministral 3 14B8888888888888887.5%
Arcee AI: Trinity Mini8888888888888887.5%
Gemma 3 4B8888888888888887.5%
Ministral 3 3B6350505038383846.4%
Ministral 3B8863503838252546.4%
Mistral NeMO7550383838252541.1%
Ministral 3 8B3838383838382535.7%
Cohere Command R+ (Aug. 2024)7550502525131335.7%
Rocinante 12B885050251313033.9%
Ministral 8B3838383825251330.4%
LFM2 24B2525252525252525.0%
Claude 3 Haiku00000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100100100.0%
GPT-5.1100100100100100100100100.0%
Claude Opus 4.6100100100100100100100100.0%
GPT-5100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100.0%
Claude Sonnet 4.6100100100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100100100.0%
ByteDance Seed 1.6100100100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100100100.0%
Claude Opus 4.5100100100100100100100100.0%
Grok 4.1 Fast100100100100100100100100.0%
Aion 2.0100100100100100100100100.0%
Z.AI GLM 4.6100100100100100100100100.0%
MiniMax M2.7100100100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100100100.0%
Claude Sonnet 4100100100100100100100100.0%
Z.AI GLM 4.7100100100100100100100100.0%
GPT-4.1100100100100100100100100.0%
Gemini 2.5 Pro100100100100100100100100.0%
Grok 4100100100100100100100100.0%
Claude Sonnet 4.5100100100100100100100100.0%
Claude Opus 4100100100100100100100100.0%
Stealth: Hunter Alpha100100100100100100100100.0%
Z.AI GLM 4.5100100100100100100100100.0%
Grok 4 Fast100100100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100100100.0%
Claude Haiku 4.5100100100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100100100.0%
Claude 3.5 Sonnet100100100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100100100.0%
Claude 3.7 Sonnet100100100100100100100100.0%
GPT-4.1 Mini100100100100100100100100.0%
Hermes 3 405B100100100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100100100.0%
Qwen 3 32B100100100100100100100100.0%
Gemini 2.5 Flash100100100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100100100.0%
Gemma 3 27B100100100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100100100.0%
Mistral Small Creative100100100100100100100100.0%
Gemma 3 4B100100100100100100100100.0%
o4 Mini High1001001001001001009799.6%
Mistral Medium 3.11001001001001001009799.6%
Z.AI GLM 5 Turbo1001001001001001009599.2%
Qwen 3.5 122B100100100100100979799.2%
o4 Mini100100100100100979799.2%
Writer: Palmyra X51001001001001001009599.2%
Llama 3.1 70B1001001001001001009599.2%
Claude 3 Haiku1001001001001001009599.2%
Arcee AI: Trinity Mini100100100100100979799.2%
Ministral 3 3B100100100100100979799.2%
GPT-5.4 (Reasoning)100100100100100979598.8%
GPT-5.410010010010097979798.8%
GPT-4o, Aug. 6th (temp=1)10010010010097979798.8%
Qwen 3.5 397B A17B100100100100100959598.5%
GPT-5.210010010010097979598.5%
GPT-5.4 Mini (Reasoning)100100100100100959598.5%
GPT-5 Mini1001001009797979598.1%
Ministral 3B1001001009797979598.1%
Qwen 3.5 9B1001001009797979598.1%
Stealth: Healer Alpha10097979797979797.7%
GPT-5.4 Mini (Reasoning, Low)10010010010095959597.7%
Inception Mercury1001001009797959597.7%
Qwen 2.5 72B10010010010095959597.7%
MiniMax M2.510097979797979597.3%
Qwen 3.5 27B9797979797979797.3%
Gemini 2.5 Flash (Reasoning)9797979797979797.3%
Inception Mercury 29797979797979797.3%
GPT-5.4 Nano (Reasoning, Low)10097979797979597.3%
Grok 4.20 (Beta)1001001009595959596.9%
Mistral Large1001001009595959596.9%
DeepSeek-V2 Chat1001001009595959596.9%
DeepSeek V3 (2025-03-24)1001001009595959296.5%
DeepSeek V3 (2024-12-26)100100959595959596.1%
Hermes 3 70B100100959595959596.1%
Qwen 3.5 35B100100979795929296.1%
GPT-5 Nano100100979792929295.8%
GPT-5.4 Nano (Reasoning)10097979595959295.8%
ByteDance Seed 1.6 Flash9797979797958995.8%
Qwen 3.5 Flash100100979592929295.4%
GPT-5.4 Nano9797979795929295.4%
WizardLM 2 8x22b10095959595959595.4%
Ministral 3 14B9797959595959595.4%
Gemini 2.5 Flash Lite (Reasoning)10095959595929294.6%
Mistral Large 39595959595959594.6%
GPT-5.4 Mini9595959595959594.6%
Mistral Large 29595959595959594.6%
DeepSeek V3.29595959595959594.6%
GPT-4o Mini (temp=1)9595959595959594.6%
Mistral Small 3.2 24B9595959595959594.6%
Gemma 3 12B9595959595959594.6%
GPT-4o Mini (temp=0)9595959595959594.6%
Mistral Small 49595959595959594.6%
Ministral 3 8B9595959595959594.6%
Mistral NeMO9595959595959594.6%
Ministral 8B9595959595959294.2%
Z.AI GLM 4.7 Flash10097959595868693.4%
ByteDance Seed 2.0 Mini9595959595928693.1%
Nemotron 3 Super9797959592927892.3%
Cohere Command R+ (Aug. 2024)9595959595957892.3%
Llama 3.1 8B9595959595928192.3%
Mistral Small 4 (Reasoning)100100979795924990.0%
LFM2 24B8686868686868686.5%
Gemini 2.5 Flash Lite8686868686865181.5%
DeepSeek V3.1959595959595081.1%
Nemotron 3 Nano9797959595413078.4%
Rocinante 12B1009289865151067.2%
Arcee AI: Trinity Large (Preview)10000000014.3%
GPT-4.1 Nano513333309.3%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100100100.0%
GPT-5.1100100100100100100100100.0%
GPT-5100100100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100100100.0%
Qwen 3.5 122B100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100.0%
Claude Sonnet 4.6100100100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100100100.0%
Qwen 3.5 27B100100100100100100100100.0%
ByteDance Seed 1.6100100100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100100100.0%
GPT-5.2100100100100100100100100.0%
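Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 98.9%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 96 | 96 | 98.9%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 96 | 92 | 98.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.3%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 96 | 88 | 97.7%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 96 | 88 | 97.7%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 96 | 96 | 92 | 97.7%
o4 Mini | 100 | 100 | 100 | 96 | 96 | 92 | 92 | 96.6%
DeepSeek-V2 Chat | 100 | 100 | 100 | 96 | 96 | 92 | 92 | 96.6%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.6%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 92 | 84 | 96.6%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 76 | 96.6%
Nemotron 3 Nano | 100 | 100 | 100 | 96 | 96 | 96 | 88 | 96.6%
Claude Opus 4.6 (Reasoning) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.0%
Claude Opus 4.6 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.0%
Claude Opus 4.5 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 96 | 96 | 92 | 88 | 96.0%
Ministral 3 14B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 96 | 88 | 84 | 95.4%
Hermes 3 70B | 100 | 100 | 100 | 100 | 96 | 96 | 72 | 94.9%
Gemma 3 4B | 96 | 96 | 96 | 96 | 96 | 92 | 92 | 94.9%
Arcee AI: Trinity Mini | 96 | 96 | 92 | 92 | 92 | 92 | 88 | 92.6%
Llama 3.1 70B | 100 | 96 | 96 | 96 | 92 | 84 | 80 | 92.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 92 | 88 | 76 | 72 | 89.7%
GPT-4o, Aug. 6th (temp=0) | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 89.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 20 | 88.6%
Llama 3.1 Nemotron 70B | 96 | 96 | 92 | 88 | 88 | 84 | 64 | 86.9%
Mistral Medium 3.1 | 88 | 88 | 84 | 84 | 84 | 84 | 84 | 85.1%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 92 | 88 | 88 | 16 | 83.4%
Mistral Small 4 | 100 | 100 | 100 | 100 | 88 | 48 | 8 | 77.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 96 | 84 | 20 | 20 | 74.3%
Qwen3 235B A22B Instruct 2507 | 96 | 96 | 96 | 60 | 56 | 44 | 44 | 70.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 20 | 16 | 0 | 62.3%
Writer: Palmyra X5 | 80 | 76 | 72 | 60 | 52 | 52 | 40 | 61.7%
Ministral 3 8B | 52 | 52 | 48 | 48 | 48 | 44 | 40 | 47.4%
Mistral Small Creative | 76 | 56 | 48 | 40 | 40 | 36 | 32 | 46.9%
Ministral 8B | 52 | 52 | 48 | 44 | 40 | 40 | 36 | 44.6%
Mistral NeMO | 76 | 72 | 64 | 36 | 32 | 4 | 0 | 40.6%
LFM2 24B | 96 | 96 | 8 | 8 | 8 | 8 | 4 | 32.6%
Rocinante 12B | 100 | 68 | 20 | 12 | 12 | 8 | 0 | 31.4%
Ministral 3 3B | 52 | 48 | 48 | 44 | 8 | 8 | 8 | 30.9%
Ministral 3B | 88 | 60 | 16 | 8 | 8 | 8 | 4 | 27.4%
Cohere Command R+ (Aug. 2024) | 92 | 20 | 16 | 8 | 8 | 4 | 4 | 21.7%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%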
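Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 98 | 96 | 99.2%
GPT-5 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
GPT-5.1 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98.9%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 98 | 94 | 98.9%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98.9%
Grok 4 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98.9%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98.9%
Qwen 3.5 27B | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7%
Llama 3.1 70B | 100 | 100 | 100 | 98 | 98 | 98 | 96 | 98.7%
Qwen 2.5 72B | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7%
Writer: Palmyra X5 | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7%
Z.AI GLM 5 | 100 | 100 | 98 | 98 | 98 | 98 | 96 | 98.4%
MoonshotAI: Kimi K2.5 | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.4%
Claude Opus 4.5 | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.4%
Qwen 3.5 35B | 100 | 100 | 98 | 98 | 98 | 98 | 96 | 98.4%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 98 | 98 | 98 | 94 | 98.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.4%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 98 | 98 | 96 | 96 | 98.4%
Z.AI GLM 4.6 | 100 | 100 | 98 | 98 | 98 | 98 | 96 | 98.4%
Stealth: Hunter Alpha | 100 | 100 | 98 | 98 | 98 | 98 | 96 | 98.4%
Z.AI GLM 4.5 | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.1%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 96 | 96 | 94 | 98.1%
Hermes 3 405B | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.1%
Z.AI GLM 5 Turbo | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.1%
Claude Opus 4.6 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Aion 2.0 | 100 | 100 | 100 | 100 | 98 | 96 | 92 | 98.1%
Gemini 3 Pro (Preview) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Claude Opus 4 | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.1%
Gemini 3.1 Flash Lite (Preview) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Mistral Large 3 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Claude 3.7 Sonnet | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Mistral Large 2 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Mistral Large | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Mistral Small 3.2 24B | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Arcee AI: Trinity Large (Preview) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Mistral Small Creative | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.1%
Hermes 3 70B | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.1%
LFM2 24B | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
GPT-5 Mini | 100 | 100 | 98 | 98 | 98 | 96 | 94 | 97.8%
Qwen 3.5 397B A17B | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 97.8%
Gemini 2.5 Pro | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 97.8%
Gemini 2.5 Flash Lite | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 97.8%
Gemini 3 Flash (Preview, Reasoning) | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 97.8%
ByteDance Seed 1.6 | 100 | 98 | 98 | 98 | 98 | 96 | 94 | 97.6%
Grok 4 Fast | 100 | 98 | 98 | 98 | 98 | 96 | 94 | 97.6%
GPT-4.1 Nano | 100 | 98 | 98 | 98 | 98 | 96 | 94 | 97.6%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 98 | 98 | 96 | 94 | 94 | 97.3%
Grok 4.1 Fast | 100 | 100 | 98 | 98 | 96 | 94 | 94 | 97.3%
Qwen 3.5 122B | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.3%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 98 | 98 | 96 | 94 | 94 | 97.3%
MiniMax M2.7 | 100 | 98 | 98 | 96 | 96 | 96 | 96 | 97.3%
Claude 3.5 Sonnet | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.3%
GPT-5 Nano | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.3%
GPT-5.4 Mini | 100 | 98 | 98 | 98 | 96 | 96 | 94 | 97.3%
Mistral Medium 3.1 | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.3%
o4 Mini High | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.3%
GPT-5.4 (Reasoning, Low) | 100 | 98 | 98 | 98 | 98 | 94 | 92 | 97.0%
Mistral Small 4 | 100 | 98 | 96 | 96 | 96 | 96 | 96 | 97.0%
MiniMax M2.5 | 100 | 98 | 98 | 96 | 96 | 96 | 94 | 97.0%
GPT-5.4 Mini (Reasoning, Low) | 98 | 98 | 98 | 98 | 98 | 96 | 92 | 97.0%
Grok 4.20 (Beta) | 98 | 98 | 98 | 96 | 96 | 96 | 96 | 97.0%
GPT-5.4 | 100 | 98 | 96 | 96 | 96 | 96 | 94 | 96.8%
GPT-5.4 Nano (Reasoning) | 98 | 98 | 98 | 96 | 96 | 96 | 94 | 96.8%
Gemini 2.5 Flash | 98 | 98 | 98 | 96 | 96 | 96 | 94 | 96.8%
GPT-4.1 | 100 | 98 | 96 | 96 | 96 | 96 | 94 | 96.8%
Gemini 2.5 Flash (Reasoning) | 98 | 98 | 98 | 98 | 96 | 94 | 94 | 96.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 96 | 96 | 96 | 94 | 92 | 96.5%
DeepSeek-V2 Chat | 98 | 98 | 98 | 96 | 96 | 94 | 94 | 96.5%
Inception Mercury 2 | 100 | 100 | 96 | 96 | 96 | 94 | 92 | 96.5%
GPT-4o Mini (temp=1) | 98 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
o4 Mini | 100 | 100 | 96 | 96 | 94 | 94 | 92 | 96.2%
Gemini 3.1 Pro (Preview) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.2%
GPT-5.2 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.2%
GPT-4o Mini (temp=0) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.2%
Ministral 3 14B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.2%
GPT-5.4 Nano (Reasoning, Low) | 98 | 96 | 96 | 96 | 96 | 94 | 94 | 96.0%
WizardLM 2 8x22b | 98 | 96 | 96 | 96 | 96 | 94 | 94 | 96.0%
DeepSeek V3.2 | 98 | 98 | 98 | 96 | 94 | 94 | 92 | 96.0%
GPT-4.1 Mini | 96 | 96 | 96 | 96 | 96 | 96 | 94 | 96.0%
Gemma 3 12B | 96 | 96 | 96 | 96 | 96 | 94 | 94 | 95.7%
ByteDance Seed 1.6 Flash | 96 | 96 | 96 | 96 | 96 | 94 | 94 | 95.7%
Llama 3.1 8B | 98 | 98 | 98 | 96 | 96 | 92 | 91 | 95.7%
DeepSeek V3 (2024-12-26) | 98 | 96 | 96 | 96 | 94 | 92 | 92 | 95.1%
GPT-5.4 (Reasoning) | 96 | 96 | 96 | 94 | 94 | 94 | 92 | 94.9%
Z.AI GLM 4.7 Flash | 96 | 94 | 94 | 94 | 94 | 94 | 92 | 94.3%
GPT-5.4 Nano | 94 | 94 | 94 | 94 | 94 | 94 | 94 | 94.3%
Ministral 3 8B | 94 | 94 | 94 | 94 | 94 | 94 | 94 | 94.3%
Gemma 3 27B | 98 | 94 | 94 | 94 | 92 | 92 | 91 | 93.8%
Mistral NeMO | 96 | 96 | 94 | 94 | 94 | 92 | 89 | 93.8%
Ministral 8B | 96 | 94 | 94 | 94 | 92 | 92 | 92 | 93.8%
Qwen 3 32B | 98 | 98 | 98 | 98 | 96 | 94 | 70 | 93.3%
Inception Mercury | 96 | 94 | 92 | 91 | 91 | 89 | 87 | 91.4%
Qwen 3.5 9B | 98 | 96 | 94 | 92 | 92 | 92 | 70 | 90.8%
Nemotron 3 Nano | 98 | 96 | 96 | 96 | 96 | 94 | 17 | 84.9%
Nemotron 3 Super | 98 | 98 | 96 | 96 | 96 | 94 | 15 | 84.9%
DeepSeek V3 (2025-03-24) | 98 | 98 | 98 | 96 | 96 | 96 | 11 | 84.9%
Arcee AI: Trinity Mini | 100 | 98 | 98 | 98 | 96 | 96 | 0 | 83.8%
Cohere Command R+ (Aug. 2024) | 98 | 92 | 89 | 87 | 75 | 74 | 60 | 82.2%
DeepSeek V3.1 | 94 | 92 | 92 | 92 | 91 | 38 | 6 | 72.2%
Rocinante 12B | 96 | 96 | 96 | 92 | 74 | 8 | 0 | 66.0%
Ministral 3B | 91 | 64 | 55 | 47 | 43 | 42 | 42 | 54.7%
Ministral 3 3B | 74 | 58 | 57 | 55 | 47 | 45 | 42 | 53.9%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%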

Specific Prompt

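Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 98.8%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 98.8%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 96.4%
Arcee AI: Trinity Mini | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 91.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 92 | 42 | 33 | 81.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 71.4%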
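Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.5%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 98 | 96 | 99.2%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 98 | 96 | 99.2%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 98 | 96 | 99.2%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 99.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 99.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 96 | 96 | 99.0%
o4 Mini High | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 99.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 99.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7%
Grok 4.1 Fast | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7%
Grok 4 Fast | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7%
GPT-5.1 | 100 | 100 | 100 | 98 | 98 | 98 | 95 | 98.5%
Qwen 3.5 35B | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.5%
Stealth: Hunter Alpha | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.5%
GPT-4o, May 13th (temp=1) | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.5%
GPT-4.1 Mini | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.5%
Mistral Small 4 (Reasoning) | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.5%
Qwen 3.5 9B | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.2%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 98 | 98 | 98 | 96 | 96 | 98.2%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 98 | 98 | 98 | 98 | 95 | 98.2%
GPT-5.4 (Reasoning, Low) | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.2%
ByteDance Seed 1.6 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
GPT-5.2 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Gemini 3 Pro (Preview) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Mistral Large 3 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
DeepSeek-V2 Chat | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Mistral Large 2 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
DeepSeek V3.2 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Gemini 2.5 Flash Lite | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Mistral Large | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Qwen3 235B A22B Instruct 2507 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
GPT-4o Mini (temp=1) | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.2%
WizardLM 2 8x22b | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Gemini 3.1 Flash Lite (Preview) | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 98.0%
Claude Haiku 4.5 | 100 | 100 | 98 | 98 | 96 | 96 | 96 | 98.0%
GPT-5 Nano | 100 | 100 | 98 | 98 | 98 | 96 | 95 | 98.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 98 | 96 | 96 | 96 | 96 | 97.7%
Mistral Medium 3.1 | 98 | 98 | 98 | 98 | 98 | 96 | 96 | 97.7%
Grok 4.20 (Beta, Reasoning) | 98 | 98 | 98 | 98 | 98 | 96 | 96 | 97.7%
GPT-4.1 Nano | 100 | 98 | 98 | 98 | 96 | 96 | 96 | 97.7%
Claude Sonnet 4.5 | 100 | 98 | 98 | 96 | 96 | 96 | 96 | 97.5%
Writer: Palmyra X5 | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.5%
Gemma 3 27B | 100 | 98 | 98 | 96 | 96 | 96 | 96 | 97.5%
GPT-5.4 | 98 | 98 | 98 | 96 | 96 | 96 | 95 | 97.0%
Llama 3.1 70B | 100 | 98 | 96 | 96 | 96 | 96 | 93 | 96.7%
Arcee AI: Trinity Large (Preview) | 98 | 96 | 96 | 96 | 96 | 96 | 96 | 96.7%
MiniMax M2.5 | 98 | 96 | 96 | 96 | 96 | 96 | 95 | 96.5%
Claude 3.5 Sonnet | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Claude 3.7 Sonnet | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Hermes 3 405B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Mistral Small 3.2 24B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Qwen 2.5 72B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Llama 3.1 Nemotron 70B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Mistral Small Creative | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Ministral 3 14B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Ministral 3 8B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Llama 3.1 8B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Qwen 3.5 Plus (2026-02-15) | 98 | 96 | 96 | 96 | 96 | 96 | 95 | 96.5%
Claude Sonnet 4.6 | 96 | 96 | 96 | 96 | 96 | 96 | 95 | 96.2%
Mistral NeMO | 96 | 96 | 96 | 96 | 96 | 96 | 95 | 96.2%
Ministral 8B | 96 | 96 | 96 | 96 | 96 | 96 | 93 | 96.0%
Cohere Command R+ (Aug. 2024) | 98 | 96 | 96 | 96 | 95 | 93 | 93 | 95.5%
Inception Mercury 2 | 98 | 96 | 96 | 95 | 95 | 93 | 93 | 95.2%
Gemma 3 12B | 96 | 95 | 95 | 95 | 95 | 95 | 95 | 95.0%
Claude 3 Haiku | 96 | 96 | 95 | 95 | 95 | 95 | 93 | 95.0%
Gemini 3 Flash (Preview) | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 94.7%
Mistral Small 4 | 96 | 96 | 96 | 96 | 95 | 91 | 91 | 94.7%
Qwen 3.5 Flash | 98 | 98 | 98 | 98 | 98 | 98 | 74 | 94.7%
Qwen 3 32B | 100 | 98 | 98 | 98 | 98 | 98 | 72 | 94.7%
GPT-5.4 Mini | 98 | 96 | 95 | 93 | 93 | 93 | 93 | 94.5%
Rocinante 12B | 96 | 96 | 95 | 95 | 93 | 93 | 84 | 93.2%
GPT-5.4 Mini (Reasoning, Low) | 98 | 96 | 95 | 93 | 91 | 89 | 88 | 93.0%
GPT-5.4 Nano | 96 | 96 | 95 | 95 | 89 | 88 | 82 | 91.7%
Ministral 3B | 91 | 91 | 91 | 91 | 91 | 91 | 88 | 90.7%
Ministral 3 3B | 91 | 91 | 91 | 91 | 88 | 88 | 88 | 89.7%
Gemma 3 4B | 93 | 89 | 89 | 89 | 89 | 88 | 86 | 89.2%
Grok 4.20 (Beta) | 98 | 98 | 98 | 96 | 95 | 89 | 42 | 88.2%
GPT-5.4 Nano (Reasoning, Low) | 98 | 95 | 95 | 93 | 89 | 86 | 58 | 87.7%
DeepSeek V3.1 | 98 | 98 | 98 | 98 | 98 | 96 | 16 | 86.2%
Nemotron 3 Super | 100 | 98 | 98 | 98 | 96 | 96 | 14 | 86.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 98 | 2 | 85.7%
ByteDance Seed 1.6 Flash | 98 | 98 | 98 | 96 | 96 | 95 | 18 | 85.7%
DeepSeek V3 (2024-12-26) | 98 | 98 | 98 | 98 | 98 | 96 | 0 | 84.0%
DeepSeek V3 (2025-03-24) | 98 | 98 | 98 | 96 | 96 | 96 | 0 | 83.5%
Inception Mercury | 95 | 95 | 88 | 86 | 81 | 70 | 46 | 79.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 96 | 91 | 91 | 25 | 19 | 74.7%
Arcee AI: Trinity Mini | 75 | 75 | 70 | 70 | 70 | 70 | 68 | 71.4%
MiniMax M2.7 | 98 | 98 | 98 | 96 | 72 | 18 | 18 | 71.2%
Nemotron 3 Nano | 96 | 95 | 95 | 95 | 39 | 37 | 28 | 69.2%
LFM2 24B | 82 | 82 | 82 | 61 | 61 | 60 | 12 | 63.2%
Hermes 3 70B | 95 | 0 | 0 | 0 | 0 | 0 | 0 | 13.5%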
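Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 93 | 99.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 98 | 95 | 99.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 98 | 95 | 99.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 98 | 98 | 95 | 98.7%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 98 | 93 | 98.7%
GPT-5 Nano | 100 | 100 | 100 | 100 | 98 | 98 | 95 | 98.7%
Grok 4.1 Fast | 100 | 100 | 100 | 98 | 98 | 98 | 95 | 98.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.3%
Qwen 3 32B | 100 | 100 | 100 | 98 | 98 | 98 | 95 | 98.3%
Arcee AI: Trinity Mini | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 98 | 98 | 98 | 95 | 98.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 98 | 98 | 91 | 98.0%
Qwen 3.5 35B | 100 | 100 | 98 | 98 | 98 | 98 | 93 | 97.7%
Llama 3.1 8B | 98 | 98 | 98 | 98 | 98 | 95 | 95 | 97.0%
Ministral 3 14B | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 96.7%
GPT-4.1 Nano | 98 | 98 | 98 | 98 | 95 | 95 | 95 | 96.7%
Mistral Small Creative | 95 | 95 | 95 | 95 | 95 | 93 | 93 | 94.7%
Mistral Small 4 (Reasoning) | 100 | 98 | 98 | 98 | 98 | 93 | 77 | 94.4%
Mistral NeMO | 98 | 98 | 98 | 98 | 98 | 88 | 81 | 94.0%
Inception Mercury | 95 | 95 | 95 | 93 | 88 | 88 | 86 | 91.7%
ByteDance Seed 1.6 Flash | 100 | 98 | 95 | 95 | 95 | 88 | 65 | 91.0%
Ministral 3 3B | 95 | 91 | 91 | 91 | 91 | 91 | 88 | 91.0%
GPT-5.4 Nano (Reasoning) | 98 | 93 | 93 | 93 | 91 | 91 | 74 | 90.4%
Ministral 3B | 91 | 91 | 91 | 91 | 88 | 88 | 88 | 89.7%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 93 | 91 | 84 | 47 | 87.7%
GPT-5.4 Nano | 93 | 91 | 88 | 88 | 86 | 84 | 74 | 86.4%
Nemotron 3 Nano | 100 | 98 | 98 | 93 | 77 | 63 | 37 | 80.7%
GPT-5.4 Nano (Reasoning, Low) | 86 | 81 | 77 | 74 | 72 | 72 | 72 | 76.4%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 44 | 9 | 7 | 65.8%
Rocinante 12B | 47 | 42 | 28 | 14 | 0 | 0 | 0 | 18.6%
Gemma 3 4B | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16.3%
Hermes 3 70B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 14.3%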
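Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 38 | 91.1%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 38 | 91.1%
GPT-4.1 Nano | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5%
Mistral NeMO | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5%
LFM2 24B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5%
Hermes 3 70B | 100 | 100 | 100 | 63 | 0 | 0 | 0 | 51.8%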
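Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 99.2%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 99.2%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.2%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 99.2%
Mistral Large 2 | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 98.8%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 98.8%
Inception Mercury 2 | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 98.5%
GPT-5 Nano | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 98.5%
MiniMax M2.7 | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 98.5%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 98.5%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 97 | 89 | 98.1%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 97 | 97 | 92 | 98.1%
Mistral Large | 100 | 100 | 97 | 97 | 97 | 97 | 97 | 98.1%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 97 | 89 | 98.1%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 97 | 95 | 92 | 97.7%
Mistral Large 3 | 97 | 97 | 97 | 97 | 97 | 97 | 97 | 97.3%
Nemotron 3 Super | 100 | 100 | 97 | 97 | 97 | 97 | 92 | 97.3%
Mistral Small 4 (Reasoning) | 100 | 100 | 97 | 97 | 97 | 95 | 95 | 97.3%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 78 | 96.9%
MiniMax M2.5 | 97 | 97 | 97 | 97 | 97 | 97 | 95 | 96.9%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 89 | 89 | 86 | 95.0%
Z.AI GLM 4.7 Flash | 97 | 97 | 97 | 97 | 97 | 92 | 86 | 95.0%
GPT-5.2 | 100 | 100 | 100 | 89 | 89 | 89 | 78 | 92.3%
LFM2 24B | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 91.9%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 97 | 97 | 35 | 90.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 92 | 35 | 89.6%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 97 | 97 | 97 | 0 | 84.6%
Rocinante 12B | 100 | 100 | 100 | 100 | 97 | 89 | 0 | 83.8%
Inception Mercury | 95 | 95 | 92 | 92 | 78 | 73 | 19 | 77.6%
Nemotron 3 Nano | 97 | 97 | 86 | 65 | 30 | 30 | 30 | 62.2%
Hermes 3 70B | 100 | 100 | 100 | 19 | 0 | 0 | 0 | 45.6%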
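Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 96 | 96 | 98.9%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 98.9%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 96 | 96 | 96 | 96 | 97.7%
GPT-5.4 Nano | 100 | 100 | 100 | 96 | 96 | 96 | 96 | 97.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 97.1%
LFM2 24B | 100 | 96 | 96 | 96 | 96 | 96 | 96 | 96.6%
Claude 3 Haiku | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.0%
Arcee AI: Trinity Mini | 96 | 96 | 96 | 96 | 96 | 96 | 92 | 95.4%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 48 | 92.6%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 40 | 91.4%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 96 | 60 | 52 | 86.9%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 16 | 4 | 74.3%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 16 | 0 | 73.7%
Rocinante 12B | 100 | 100 | 100 | 96 | 20 | 16 | 4 | 62.3%
Hermes 3 70B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 28.6%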
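Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.5%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 98 | 96 | 99.2%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 98 | 96 | 99.2%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 94 | 99.2%
GPT-5 Nano | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
Gemma 3 4B | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
Ministral 3B | 100 | 100 | 100 | 100 | 98 | 98 | 96 | 98.9%
Grok 4.1 Fast | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98.9%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 98 | 98 | 96 | 98.9%
GPT-5.1 | 100 | 100 | 100 | 100 | 98 | 96 | 96 | 98.7%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 94 | 94 | 98.4%
MiniMax M2.5 | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.4%
Mistral Small 4 | 100 | 100 | 100 | 98 | 98 | 96 | 96 | 98.4%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 98 | 98 | 98 | 98 | 96 | 98.4%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 98.1%
GPT-4o Mini (temp=1) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
GPT-4o Mini (temp=0) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
ByteDance Seed 1.6 Flash | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.1%
Ministral 3 14B | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Grok 4 | 100 | 98 | 98 | 98 | 98 | 96 | 96 | 97.8%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 96 | 96 | 96 | 96 | 97.8%
GPT-4.1 Nano | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 97.8%
Inception Mercury 2 | 100 | 100 | 98 | 98 | 98 | 96 | 92 | 97.6%
GPT-5.4 Nano | 100 | 100 | 98 | 96 | 96 | 96 | 96 | 97.6%
Ministral 3 3B | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.3%
ByteDance Seed 1.6 | 98 | 98 | 98 | 96 | 96 | 96 | 94 | 96.8%
Ministral 3 8B | 100 | 96 | 96 | 96 | 96 | 96 | 96 | 96.8%
Ministral 8B | 98 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Nemotron 3 Nano | 98 | 98 | 96 | 96 | 96 | 94 | 94 | 96.2%
Arcee AI: Trinity Mini | 98 | 98 | 96 | 96 | 96 | 96 | 92 | 96.2%
Mistral NeMO | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.2%
Inception Mercury | 96 | 94 | 92 | 91 | 89 | 85 | 85 | 90.3%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 26 | 89.5%
Z.AI GLM 4.7 Flash | 98 | 96 | 96 | 96 | 96 | 91 | 17 | 84.4%
Rocinante 12B | 96 | 96 | 96 | 96 | 96 | 94 | 2 | 82.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 98 | 98 | 66 | 0 | 80.3%
Hermes 3 70B | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 42.9%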