Structural validity

Test: Codex Extraction

Avg. Score: 96.8%
Scenarios: 4

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Inception Mercury | 99.9% | $0.0006 | 3.9s | 99%
2 | Inception Mercury 2 | 100.0% | $0.0022 | 3.5s | 100%
3 | Gemini 3.1 Flash Lite (Preview) | 99.9% | $0.0017 | 2.0s | 99%
4 | Gemini 2.5 Flash | 99.8% | $0.0023 | 2.5s | 99%
5 | Ministral 3 8B | 99.6% | $0.0006 | 3.3s | 98%
6 | Mistral Small 3.2 24B | 99.5% | $0.0005 | 4.4s | 98%
7 | Grok 4.20 (Beta) | 100.0% | $0.0049 | 2.0s | 100%
8 | GPT-5.4 Mini (Reasoning, Low) | 100.0% | $0.0042 | 3.7s | 100%
9 | Ministral 3 3B | 99.1% | $0.0004 | 1.8s | 97%
10 | GPT-5.4 Mini | 99.6% | $0.0031 | 2.2s | 98%
11 | GPT-4.1 Mini | 99.6% | $0.0015 | 6.6s | 98%
12 | Grok 4 Fast | 99.7% | $0.0012 | 8.7s | 98%
13 | DeepSeek V3 (2024-12-26) | 100.0% | $0.0017 | 13.5s | 100%
14 | Ministral 3B | 98.3% | $0.0002 | 1.8s | 95%
15 | Claude Haiku 4.5 | 100.0% | $0.0073 | 4.5s | 100%
16 | DeepSeek-V2 Chat | 99.9% | $0.0019 | 14.8s | 99%
17 | Qwen 3.5 Plus (2026-02-15) | 99.8% | $0.0030 | 10.6s | 99%
18 | Gemma 3 27B | 99.4% | $0.0005 | 14.5s | 97%
19 | GPT-4o, Aug. 6th (temp=1) | 99.8% | $0.0092 | 3.8s | 99%
20 | Stealth: Healer Alpha | 100.0% | $0.0000 | 24.6s | 100%
21 | Grok 4.1 Fast | 100.0% | $0.0017 | 22.1s | 100%
22 | Mistral Medium 3.1 | 98.6% | $0.0026 | 5.8s | 95%
23 | GPT-4o Mini (temp=1) | 98.0% | $0.0006 | 8.0s | 95%
24 | GPT-4o, Aug. 6th (temp=0) | 99.7% | $0.0100 | 4.0s | 99%
25 | Gemini 2.5 Flash (Reasoning) | 100.0% | $0.0082 | 11.8s | 100%
26 | MiniMax M2.7 | 99.9% | $0.0022 | 21.7s | 99%
27 | GPT-4o Mini (temp=0) | 98.3% | $0.0005 | 7.9s | 93%
28 | Gemini 3 Flash (Preview) | 98.2% | $0.0027 | 3.9s | 93%
29 | Mistral Large 3 | 98.3% | $0.0027 | 8.2s | 95%
30 | DeepSeek V3 (2025-03-24) | 100.0% | $0.0014 | 27.0s | 100%
31 | Z.AI GLM 4.5 | 99.5% | $0.0028 | 16.8s | 96%
32 | Mistral Small 4 (Reasoning) | 98.7% | $0.0019 | 15.6s | 95%
33 | Claude 3 Haiku | 97.3% | $0.0017 | 5.5s | 92%
34 | Z.AI GLM 5 Turbo | 99.8% | $0.0068 | 16.0s | 98%
35 | GPT-5.4 Nano | 98.1% | $0.0011 | 3.8s | 89%
36 | Hermes 3 405B | 98.9% | $0.0040 | 17.8s | 97%
37 | Stealth: Hunter Alpha | 100.0% | $0.0000 | 36.7s | 100%
38 | MiniMax M2.5 | 99.7% | $0.0023 | 30.1s | 99%
39 | Qwen 2.5 72B | 97.0% | $0.0008 | 11.0s | 92%
40 | WizardLM 2 8x22b | 99.9% | $0.0026 | 31.3s | 99%
41 | DeepSeek V3.2 | 100.0% | $0.0011 | 36.7s | 100%
42 | Mistral Small 4 | 96.4% | $0.0008 | 3.2s | 89%
43 | Mistral Large 2 | 98.9% | $0.011 | 8.3s | 96%
44 | Gemma 3 4B | 95.9% | $0.0002 | 6.7s | 90%
45 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.0096 | 22.2s | 100%
46 | Arcee AI: Trinity Mini | 95.8% | $0.0003 | 5.8s | 90%
47 | Gemini 2.5 Flash Lite (Reasoning) | 97.7% | $0.0021 | 15.6s | 92%
48 | GPT-5.4 | 98.2% | $0.011 | 6.4s | 95%
49 | GPT-5.4 (Reasoning, Low) | 99.8% | $0.017 | 10.4s | 98%
50 | o4 Mini | 100.0% | $0.014 | 21.5s | 100%
51 | Claude Sonnet 4.5 | 100.0% | $0.022 | 6.6s | 100%
52 | Writer: Palmyra X5 | 96.7% | $0.0052 | 7.5s | 90%
53 | Z.AI GLM 4.6 | 100.0% | $0.0057 | 38.8s | 100%
54 | Claude 3.7 Sonnet | 100.0% | $0.021 | 7.8s | 100%
55 | Claude Sonnet 4 | 100.0% | $0.022 | 7.9s | 100%
56 | Claude Sonnet 4.6 | 100.0% | $0.022 | 7.4s | 100%
57 | GPT-5.4 Nano (Reasoning, Low) | 96.6% | $0.0011 | 3.7s | 83%
58 | Qwen 3 32B | 98.9% | $0.0010 | 37.9s | 96%
59 | GPT-5.2 | 100.0% | $0.018 | 16.8s | 100%
60 | Grok 4.20 (Beta, Reasoning) | 100.0% | $0.020 | 12.7s | 100%
61 | Ministral 8B | 96.0% | $0.0004 | 3.8s | 83%
62 | GPT-4.1 Nano | 94.1% | $0.0003 | 2.8s | 85%
63 | Hermes 3 70B | 97.1% | $0.0012 | 15.8s | 87%
64 | Qwen 3.5 Flash | 100.0% | $0.0031 | 49.5s | 100%
65 | GPT-4.1 | 96.6% | $0.0073 | 4.7s | 87%
66 | GPT-5.4 Mini (Reasoning) | 100.0% | $0.017 | 25.6s | 100%
67 | Gemini 2.5 Flash Lite | 94.2% | $0.0005 | 1.9s | 82%
68 | Mistral Small Creative | 94.3% | $0.0006 | 3.9s | 83%
69 | GPT-5 Mini | 100.0% | $0.0070 | 45.5s | 100%
70 | Qwen3 235B A22B Instruct 2507 | 96.2% | $0.0007 | 19.3s | 85%
71 | ByteDance Seed 1.6 Flash | 97.3% | $0.0011 | 39.3s | 93%
72 | GPT-4o, May 13th (temp=0) | 99.5% | $0.025 | 4.3s | 95%
73 | Cohere Command R+ (Aug. 2024) | 97.7% | $0.014 | 17.9s | 94%
74 | GPT-5.4 Nano (Reasoning) | 97.3% | $0.0023 | 12.2s | 81%
75 | Nemotron 3 Super | 99.8% | $0.0000 | 1.0m | 99%
76 | Z.AI GLM 5 | 100.0% | $0.0091 | 51.4s | 100%
77 | GPT-4o, May 13th (temp=1) | 98.4% | $0.025 | 4.0s | 94%
78 | Qwen 3.5 35B | 99.9% | $0.015 | 47.3s | 99%
79 | Claude Opus 4.5 | 100.0% | $0.037 | 7.7s | 100%
80 | Claude Opus 4.6 | 100.0% | $0.037 | 8.1s | 100%
81 | o4 Mini High | 100.0% | $0.025 | 40.5s | 100%
82 | Gemini 2.5 Pro | 100.0% | $0.034 | 22.8s | 100%
83 | Aion 2.0 | 100.0% | $0.0080 | 1.2m | 100%
84 | Ministral 3 14B | 90.7% | $0.0009 | 6.0s | 75%
85 | ByteDance Seed 1.6 | 100.0% | $0.0073 | 1.3m | 100%
86 | Z.AI GLM 4.7 Flash | 98.7% | $0.0019 | 1.2m | 94%
87 | Claude 3.5 Sonnet | 100.0% | $0.043 | 13.9s | 100%
88 | GPT-5 Nano | 99.5% | $0.0043 | 1.3m | 96%
89 | Grok 4 | 100.0% | $0.031 | 37.2s | 100%
90 | Nemotron 3 Nano | 99.9% | $0.0013 | 1.6m | 99%
91 | ByteDance Seed 2.0 Lite | 99.6% | $0.0078 | 1.4m | 98%
92 | GPT-5.1 | 100.0% | $0.035 | 43.6s | 100%
93 | Z.AI GLM 4.7 | 100.0% | $0.0098 | 1.7m | 100%
94 | DeepSeek V3.1 | 95.5% | $0.0012 | 26.1s | 61%
95 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.055 | 21.4s | 100%
96 | Gemma 3 12B | 89.8% | $0.0003 | 14.1s | 59%
97 | Qwen 3.5 397B A17B | 100.0% | $0.012 | 1.8m | 100%
98 | Llama 3.1 70B | 90.6% | $0.0016 | 16.0s | 59%
99 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.052 | 36.0s | 100%
100 | GPT-5.4 (Reasoning) | 100.0% | $0.044 | 53.6s | 100%
101 | Gemini 3 Pro (Preview) | 100.0% | $0.055 | 35.6s | 100%
102 | Mistral Large | 93.4% | $0.011 | 7.6s | 56%
103 | Qwen 3.5 9B | 95.6% | $0.0013 | 1.5m | 79%
104 | Arcee AI: Trinity Large (Preview) | 89.5% | $0.0000 | 20.5s | 46%
105 | GPT-5 | 100.0% | $0.049 | 1.3m | 100%
106 | MoonshotAI: Kimi K2.5 | 100.0% | $0.013 | 2.4m | 100%
107 | Llama 3.1 Nemotron 70B | 81.0% | $0.0050 | 17.3s | 55%
108 | Llama 3.1 8B | 78.9% | $0.0001 | 10.0s | 48%
109 | Gemini 3.1 Pro (Preview) | 99.5% | $0.065 | 1.0m | 96%
110 | Qwen 3.5 27B | 95.0% | $0.021 | 1.6m | 70%
111 | ByteDance Seed 2.0 Mini | 98.2% | $0.0034 | 3.4m | 95%
112 | Claude Opus 4 | 100.0% | $0.110 | 13.5s | 100%
113 | Rocinante 12B | 51.3% | $0.0013 | 25.6s | 16%
114 | Qwen 3.5 122B | 100.0% | $0.079 | 3.6m | 100%
115 | LFM2 24B | 43.2% | $0.0002 | 12.5s | 9%
116 | Mistral NeMO | 30.0% | $0.0006 | 1.4s | 0%
Avg. Score: 96.78%

Individual Scenarios

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 99 | 99.8%
Grok 4 | 100 | 100 | 100 | 100 | 99 | 99.8%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 99 | 99.8%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 99 | 99.8%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 99 | 99.8%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.7%
Gemma 3 27B | 100 | 100 | 100 | 100 | 98 | 99.6%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 98 | 99.6%
DeepSeek-V2 Chat | 100 | 100 | 100 | 99 | 99 | 99.6%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 98 | 99.6%
Inception Mercury | 100 | 100 | 100 | 99 | 99 | 99.6%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 99 | 99 | 99.6%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 97 | 99.4%
Qwen 3 32B | 100 | 100 | 100 | 99 | 98 | 99.3%
Gemini 2.5 Flash Lite | 100 | 100 | 99 | 99 | 98 | 99.2%
Mistral Small 3.2 24B | 100 | 99 | 99 | 99 | 99 | 99.2%
Ministral 3 3B | 100 | 100 | 99 | 98 | 98 | 99.0%
GPT-4o, Aug. 6th (temp=0) | 99 | 99 | 99 | 99 | 99 | 98.9%
Qwen 2.5 72B | 100 | 100 | 99 | 99 | 95 | 98.5%
Ministral 3 8B | 100 | 100 | 98 | 98 | 96 | 98.4%
Ministral 8B | 100 | 99 | 99 | 98 | 96 | 98.3%
GPT-4o Mini (temp=1) | 100 | 99 | 99 | 97 | 96 | 98.1%
Mistral Large 2 | 99 | 99 | 99 | 99 | 94 | 98.1%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 90 | 98.0%
LFM2 24B | 100 | 100 | 99 | 97 | 94 | 97.9%
Mistral Small 4 | 100 | 100 | 98 | 97 | 94 | 97.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 98 | 91 | 97.8%
Ministral 3B | 100 | 98 | 98 | 97 | 96 | 97.8%
Hermes 3 405B | 99 | 99 | 99 | 97 | 95 | 97.7%
Llama 3.1 70B | 100 | 99 | 99 | 96 | 94 | 97.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 99 | 88 | 97.5%
Gemma 3 12B | 100 | 100 | 99 | 99 | 89 | 97.5%
GPT-4o Mini (temp=0) | 100 | 100 | 96 | 96 | 94 | 97.4%
Writer: Palmyra X5 | 100 | 98 | 98 | 96 | 95 | 97.4%
Mistral Small Creative | 99 | 96 | 96 | 96 | 96 | 96.8%
ByteDance Seed 2.0 Mini | 99 | 98 | 96 | 95 | 94 | 96.6%
Llama 3.1 Nemotron 70B | 98 | 98 | 97 | 95 | 94 | 96.3%
Claude 3 Haiku | 100 | 98 | 96 | 95 | 92 | 96.3%
ByteDance Seed 1.6 Flash | 99 | 99 | 96 | 94 | 93 | 96.2%
Cohere Command R+ (Aug. 2024) | 98 | 98 | 97 | 96 | 93 | 96.2%
Arcee AI: Trinity Mini | 99 | 99 | 95 | 95 | 92 | 95.9%
Qwen3 235B A22B Instruct 2507 | 96 | 96 | 96 | 96 | 95 | 95.6%
Hermes 3 70B | 100 | 99 | 99 | 98 | 83 | 95.6%
Gemma 3 4B | 100 | 96 | 96 | 95 | 90 | 95.4%
Mistral Large 3 | 98 | 98 | 93 | 93 | 93 | 95.2%
Ministral 3 14B | 99 | 96 | 95 | 92 | 92 | 94.9%
GPT-4.1 Nano | 100 | 98 | 98 | 92 | 80 | 93.6%
Qwen 3.5 9B | 100 | 100 | 100 | 99 | 53 | 90.4%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 10 | 82.0%
Arcee AI: Trinity Large (Preview) | 99 | 99 | 99 | 87 | 10 | 78.8%
Mistral Large | 99 | 98 | 93 | 93 | 0 | 76.7%
Llama 3.1 8B | 94 | 91 | 88 | 84 | 25 | 76.5%
Rocinante 12B | 96 | 89 | 79 | 0 | 0 | 52.8%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 98 | 99.6%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 98 | 99.5%
Ministral 3 3B | 100 | 100 | 100 | 100 | 98 | 99.5%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.3%
Mistral Large 2 | 100 | 100 | 100 | 98 | 98 | 99.2%
MiniMax M2.5 | 100 | 100 | 100 | 98 | 98 | 99.2%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 96 | 99.2%
Qwen 3 32B | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 96 | 99.2%
Ministral 3B | 100 | 100 | 100 | 98 | 98 | 99.1%
Hermes 3 70B | 100 | 100 | 100 | 98 | 98 | 99.0%
Hermes 3 405B | 100 | 100 | 100 | 98 | 96 | 98.7%
Mistral Small 3.2 24B | 100 | 100 | 98 | 98 | 98 | 98.7%
Mistral Large | 100 | 98 | 98 | 98 | 98 | 98.5%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 98 | 98 | 96 | 98.5%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 97 | 94 | 98.3%
Mistral Large 3 | 98 | 98 | 98 | 98 | 98 | 98.1%
GPT-4.1 | 100 | 100 | 100 | 96 | 93 | 98.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 97 | 93 | 97.9%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 98 | 98 | 96 | 96 | 97.9%
GPT-5.4 | 100 | 100 | 96 | 96 | 93 | 97.1%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 85 | 97.0%
Qwen 2.5 72B | 100 | 100 | 96 | 95 | 93 | 97.0%
ByteDance Seed 2.0 Mini | 100 | 98 | 96 | 96 | 94 | 97.0%
Writer: Palmyra X5 | 100 | 100 | 96 | 95 | 93 | 96.9%
Gemini 2.5 Flash Lite | 100 | 98 | 98 | 96 | 90 | 96.5%
Qwen 3.5 9B | 100 | 100 | 96 | 96 | 89 | 96.4%
ByteDance Seed 1.6 Flash | 100 | 98 | 96 | 94 | 92 | 96.1%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 97 | 94 | 90 | 96.1%
Mistral Small Creative | 97 | 96 | 96 | 94 | 91 | 94.9%
Mistral Medium 3.1 | 96 | 94 | 94 | 94 | 94 | 94.5%
Mistral Small 4 | 100 | 100 | 96 | 91 | 83 | 94.2%
Arcee AI: Trinity Mini | 98 | 97 | 93 | 92 | 90 | 93.8%
GPT-4.1 Nano | 98 | 94 | 93 | 93 | 89 | 93.4%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 96 | 95 | 75 | 93.3%
Gemini 3 Flash (Preview) | 100 | 93 | 93 | 89 | 89 | 92.9%
Gemma 3 4B | 96 | 93 | 91 | 91 | 90 | 92.3%
Llama 3.1 70B | 98 | 94 | 92 | 90 | 86 | 91.9%
Ministral 3 14B | 92 | 91 | 89 | 89 | 88 | 90.0%
Llama 3.1 8B | 93 | 88 | 88 | 82 | 75 | 85.0%
Arcee AI: Trinity Large (Preview) | 100 | 98 | 98 | 94 | 10 | 80.0%
Gemma 3 12B | 90 | 89 | 88 | 88 | 10 | 73.0%
Llama 3.1 Nemotron 70B | 91 | 84 | 79 | 77 | 31 | 72.5%
Rocinante 12B | 85 | 83 | 79 | 69 | 10 | 65.3%
LFM2 24B | 25 | 25 | 25 | 25 | 25 | 25.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 98 | 99.6%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 97 | 99.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 97 | 99.3%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 97 | 99.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 95 | 99.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 98 | 97 | 99.0%
Ministral 3 3B | 100 | 100 | 100 | 98 | 97 | 99.0%
GPT-4.1 Mini | 100 | 100 | 100 | 98 | 97 | 98.9%
GPT-4.1 | 100 | 100 | 100 | 100 | 93 | 98.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 92 | 98.3%
Mistral Large 2 | 100 | 100 | 100 | 97 | 94 | 98.2%
Mistral Large | 100 | 100 | 100 | 97 | 94 | 98.2%
Ministral 3B | 100 | 100 | 98 | 97 | 96 | 98.2%
Qwen 3 32B | 100 | 100 | 100 | 98 | 93 | 98.2%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 96 | 95 | 98.2%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 90 | 98.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 98 | 98 | 95 | 98.0%
Gemma 3 27B | 100 | 100 | 98 | 97 | 95 | 97.9%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 89 | 97.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 89 | 97.9%
Mistral Small 4 (Reasoning) | 100 | 100 | 97 | 94 | 94 | 97.0%
Qwen 3.5 9B | 100 | 100 | 100 | 94 | 91 | 96.9%
Gemma 3 4B | 98 | 97 | 96 | 96 | 95 | 96.6%
Mistral Small 4 | 100 | 100 | 100 | 98 | 84 | 96.5%
GPT-4o, May 13th (temp=1) | 100 | 98 | 96 | 96 | 89 | 96.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 93 | 86 | 95.8%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 96 | 93 | 89 | 95.7%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 76 | 95.2%
Hermes 3 70B | 100 | 100 | 98 | 98 | 78 | 94.7%
Claude 3 Haiku | 100 | 98 | 96 | 90 | 88 | 94.2%
Qwen 2.5 72B | 98 | 96 | 93 | 92 | 88 | 93.5%
Gemma 3 12B | 97 | 94 | 94 | 91 | 90 | 93.1%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 64 | 92.8%
Writer: Palmyra X5 | 96 | 96 | 95 | 90 | 86 | 92.5%
Gemini 2.5 Flash Lite | 100 | 100 | 98 | 93 | 71 | 92.5%
GPT-4.1 Nano | 96 | 93 | 91 | 89 | 88 | 91.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 94 | 56 | 90.0%
Mistral Small Creative | 100 | 90 | 81 | 78 | 78 | 85.5%
Ministral 3 14B | 88 | 86 | 85 | 70 | 65 | 78.7%
Llama 3.1 70B | 95 | 94 | 93 | 91 | 10 | 76.7%
Llama 3.1 Nemotron 70B | 77 | 73 | 69 | 67 | 50 | 67.3%
Llama 3.1 8B | 88 | 81 | 70 | 69 | 10 | 63.4%
Mistral NeMO | 100 | 100 | 100 | 0 | 0 | 60.0%
Rocinante 12B | 92 | 83 | 60 | 50 | 0 | 57.1%
LFM2 24B | 25 | 25 | 25 | 25 | 25 | 25.0%

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 98 | 99.5%
Hermes 3 405B | 100 | 100 | 100 | 100 | 98 | 99.5%
Gemma 3 4B | 100 | 100 | 100 | 100 | 98 | 99.5%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 96 | 99.3%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
Ministral 3 14B | 100 | 100 | 100 | 98 | 98 | 99.2%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 95 | 99.1%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 95 | 99.1%
Hermes 3 70B | 100 | 100 | 100 | 98 | 98 | 99.1%
Qwen 3 32B | 100 | 100 | 100 | 100 | 95 | 99.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 94 | 98.9%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 94 | 98.8%
Ministral 3 3B | 100 | 100 | 100 | 100 | 94 | 98.8%
Grok 4 Fast | 100 | 100 | 98 | 98 | 98 | 98.7%
ByteDance Seed 1.6 Flash | 100 | 100 | 98 | 98 | 98 | 98.7%
ByteDance Seed 2.0 Lite | 100 | 100 | 98 | 98 | 98 | 98.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 96 | 95 | 98.3%
Ministral 3B | 100 | 100 | 100 | 96 | 95 | 98.3%
GPT-5 Nano | 100 | 100 | 100 | 100 | 91 | 98.2%
GPT-4.1 Nano | 100 | 100 | 100 | 96 | 94 | 98.1%
GPT-5.4 Nano | 100 | 100 | 100 | 97 | 93 | 98.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 96 | 94 | 97.9%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 88 | 97.7%
Mistral Small 4 | 100 | 100 | 100 | 100 | 85 | 96.9%
GPT-4o Mini (temp=1) | 100 | 100 | 94 | 94 | 94 | 96.7%
Llama 3.1 70B | 100 | 98 | 97 | 95 | 92 | 96.4%
GPT-5.4 | 100 | 96 | 96 | 96 | 92 | 95.8%
Gemma 3 12B | 97 | 97 | 96 | 96 | 92 | 95.4%
Arcee AI: Trinity Mini | 100 | 100 | 97 | 90 | 89 | 95.2%
Llama 3.1 8B | 98 | 95 | 91 | 85 | 85 | 90.7%
GPT-4.1 | 100 | 100 | 91 | 83 | 75 | 89.8%
Gemini 2.5 Flash Lite | 100 | 93 | 90 | 85 | 75 | 88.6%
Llama 3.1 Nemotron 70B | 95 | 93 | 92 | 82 | 77 | 87.9%
Ministral 8B | 100 | 100 | 79 | 77 | 73 | 85.8%
Qwen 3.5 27B | 100 | 100 | 100 | 50 | 50 | 80.0%
Mistral NeMO | 100 | 100 | 100 | 0 | 0 | 60.0%
Rocinante 12B | 83 | 67 | 0 | 0 | 0 | 30.0%
LFM2 24B | 25 | 25 | 25 | 25 | 25 | 25.0%