Structural validity

Test: Codex Extraction

Avg. Score: 97.2%
Scenarios: 4

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 3.1 Flash Lite (Reasoning) | 100.0% | $0.0018 | 2.0s | 100%
2 | Inception Mercury | 99.9% | $0.0006 | 3.9s | 99%
3 | Inception Mercury 2 | 100.0% | $0.0022 | 3.5s | 100%
4 | Gemini 3.1 Flash Lite (Preview) | 99.9% | $0.0017 | 2.0s | 99%
5 | Gemini 2.5 Flash | 99.8% | $0.0023 | 2.5s | 99%
6 | Ministral 3 8B | 99.6% | $0.0006 | 3.3s | 98%
7 | Mistral Small 3.2 24B | 99.5% | $0.0005 | 4.4s | 98%
8 | Grok 4.20 (Beta) | 100.0% | $0.0049 | 2.0s | 100%
9 | GPT-5.4 Mini (Reasoning, Low) | 100.0% | $0.0042 | 3.7s | 100%
10 | Gemini 3.1 Flash Lite | 99.8% | $0.0017 | 5.1s | 99%
11 | Ministral 3 3B | 99.1% | $0.0004 | 1.8s | 97%
12 | GPT-5.4 Mini | 99.6% | $0.0031 | 2.2s | 98%
13 | Grok 4.20 | 100.0% | $0.0048 | 4.9s | 100%
14 | GPT-4.1 Mini | 99.6% | $0.0015 | 6.6s | 98%
15 | Grok 4 Fast | 99.7% | $0.0012 | 8.7s | 98%
16 | DeepSeek V3 (2024-12-26) | 100.0% | $0.0017 | 13.5s | 100%
17 | Ministral 3B | 98.3% | $0.0002 | 1.8s | 95%
18 | Claude Haiku 4.5 | 100.0% | $0.0073 | 4.5s | 100%
19 | DeepSeek-V2 Chat | 99.9% | $0.0019 | 14.8s | 99%
20 | Qwen 3.5 Plus (2026-02-15) | 99.8% | $0.0030 | 10.6s | 99%
21 | Xiaomi MIMO v2.5 | 99.8% | $0.0034 | 13.4s | 99%
22 | Gemma 3 27B | 99.4% | $0.0005 | 14.5s | 97%
23 | Gemini 3.5 Flash (Reasoning, Minimal) | 100.0% | $0.010 | 3.1s | 100%
24 | GPT-4o, Aug. 6th (temp=1) | 99.8% | $0.0092 | 3.8s | 99%
25 | Stealth: Healer Alpha | 100.0% | $0.0000 | 24.6s | 100%
26 | DeepSeek V4 Pro | 99.7% | $0.0021 | 16.0s | 98%
27 | Grok 4.1 Fast | 100.0% | $0.0017 | 22.1s | 100%
28 | Mistral Medium 3.1 | 98.6% | $0.0026 | 5.8s | 95%
29 | GPT-4o Mini (temp=1) | 98.0% | $0.0006 | 8.0s | 95%
30 | GPT-4o, Aug. 6th (temp=0) | 99.7% | $0.0100 | 4.0s | 99%
31 | DeepSeek V4 Flash | 99.1% | $0.0003 | 7.7s | 93%
32 | Gemma 4 31B | 100.0% | $0.0008 | 25.9s | 100%
33 | Gemini 2.5 Flash (Reasoning) | 100.0% | $0.0082 | 11.8s | 100%
34 | Xiaomi MIMO v2.5 Pro | 100.0% | $0.0048 | 18.7s | 100%
35 | MiniMax M2.7 | 99.9% | $0.0022 | 21.7s | 99%
36 | GPT-4o Mini (temp=0) | 98.3% | $0.0005 | 7.9s | 93%
37 | Gemini 3 Flash (Preview) | 98.2% | $0.0027 | 3.9s | 93%
38 | Mistral Large 3 | 98.3% | $0.0027 | 8.2s | 95%
39 | DeepSeek V3 (2025-03-24) | 100.0% | $0.0014 | 27.0s | 100%
40 | Z.AI GLM 4.5 | 99.5% | $0.0028 | 16.8s | 96%
41 | Mistral Small 4 (Reasoning) | 98.7% | $0.0019 | 15.6s | 95%
42 | Claude 3 Haiku | 97.3% | $0.0017 | 5.5s | 92%
43 | Z.AI GLM 5 Turbo | 99.8% | $0.0068 | 16.0s | 98%
44 | GPT-5.4 Nano | 98.1% | $0.0011 | 3.8s | 89%
45 | Hermes 3 405B | 98.9% | $0.0040 | 17.8s | 97%
46 | Stealth: Hunter Alpha | 100.0% | $0.0000 | 36.7s | 100%
47 | MiniMax M2.5 | 99.7% | $0.0023 | 30.1s | 99%
48 | DeepSeek V4 Flash (Reasoning) | 100.0% | $0.0009 | 37.0s | 100%
49 | Qwen 2.5 72B | 97.0% | $0.0008 | 11.0s | 92%
50 | WizardLM 2 8x22b | 99.9% | $0.0026 | 31.3s | 99%
51 | DeepSeek V3.2 | 100.0% | $0.0011 | 36.7s | 100%
52 | Mistral Small 4 | 96.4% | $0.0008 | 3.2s | 89%
53 | Mistral Large 2 | 98.9% | $0.011 | 8.3s | 96%
54 | Gemma 3 4B | 95.9% | $0.0002 | 6.7s | 90%
55 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.0096 | 22.2s | 100%
56 | Arcee AI: Trinity Mini | 95.8% | $0.0003 | 5.8s | 90%
57 | Grok 4.3 | 98.7% | $0.0056 | 4.3s | 89%
58 | Gemini 2.5 Flash Lite (Reasoning) | 97.7% | $0.0021 | 15.6s | 92%
59 | GPT-5.4 | 98.2% | $0.011 | 6.4s | 95%
60 | GPT-5.4 (Reasoning, Low) | 99.8% | $0.017 | 10.4s | 98%
61 | o4 Mini | 100.0% | $0.014 | 21.5s | 100%
62 | Claude Sonnet 4.5 | 100.0% | $0.022 | 6.6s | 100%
63 | Writer: Palmyra X5 | 96.7% | $0.0052 | 7.5s | 90%
64 | Z.AI GLM 4.6 | 100.0% | $0.0057 | 38.8s | 100%
65 | Claude 3.7 Sonnet | 100.0% | $0.021 | 7.8s | 100%
66 | Claude Sonnet 4 | 100.0% | $0.022 | 7.9s | 100%
67 | Claude Sonnet 4.6 | 100.0% | $0.022 | 7.4s | 100%
68 | GPT-5.4 Nano (Reasoning, Low) | 96.6% | $0.0011 | 3.7s | 83%
69 | Qwen 3 32B | 98.9% | $0.0010 | 37.9s | 96%
70 | GPT-5.2 | 100.0% | $0.018 | 16.8s | 100%
71 | Qwen 3.6 35B | 100.0% | $0.0072 | 37.6s | 100%
72 | Qwen 3.6 Flash | 99.8% | $0.0096 | 30.2s | 99%
73 | Grok 4.20 (Beta, Reasoning) | 100.0% | $0.020 | 12.7s | 100%
74 | Ministral 8B | 96.0% | $0.0004 | 3.8s | 83%
75 | Z.AI GLM 4.5 Air | 99.1% | $0.0023 | 40.1s | 97%
76 | GPT-4.1 Nano | 94.1% | $0.0003 | 2.8s | 85%
77 | Hermes 3 70B | 97.1% | $0.0012 | 15.8s | 87%
78 | Qwen 3.5 Flash | 100.0% | $0.0031 | 49.5s | 100%
79 | GPT-4.1 | 96.6% | $0.0073 | 4.7s | 87%
80 | GPT-5.5 | 100.0% | $0.027 | 6.0s | 100%
81 | GPT-5.4 Mini (Reasoning) | 100.0% | $0.017 | 25.6s | 100%
82 | Gemini 2.5 Flash Lite | 94.2% | $0.0005 | 1.9s | 82%
83 | Mistral Small Creative | 94.3% | $0.0006 | 3.9s | 83%
84 | GPT-OSS 120B | 100.0% | $0.0011 | 57.1s | 100%
85 | GPT-5 Mini | 100.0% | $0.0070 | 45.5s | 100%
86 | Grok 4.20 (Reasoning) | 100.0% | $0.012 | 37.2s | 100%
87 | Qwen3 235B A22B Instruct 2507 | 96.2% | $0.0007 | 19.3s | 85%
88 | ByteDance Seed 1.6 Flash | 97.3% | $0.0011 | 39.3s | 93%
89 | GPT-4o, May 13th (temp=0) | 99.5% | $0.025 | 4.3s | 95%
90 | Cohere Command R+ (Aug. 2024) | 97.7% | $0.014 | 17.9s | 94%
91 | GPT-5.4 Nano (Reasoning) | 97.3% | $0.0023 | 12.2s | 81%
92 | Nemotron 3 Super | 99.8% | $0.0000 | 1.0m | 99%
93 | Z.AI GLM 5 | 100.0% | $0.0091 | 51.4s | 100%
94 | GPT-4o, May 13th (temp=1) | 98.4% | $0.025 | 4.0s | 94%
95 | Qwen 3.5 35B | 99.9% | $0.015 | 47.3s | 99%
96 | Claude Opus 4.5 | 100.0% | $0.037 | 7.7s | 100%
97 | Claude Opus 4.6 | 100.0% | $0.037 | 8.1s | 100%
98 | GPT-5.5 (Reasoning, Low) | 100.0% | $0.038 | 13.3s | 100%
99 | o4 Mini High | 100.0% | $0.025 | 40.5s | 100%
100 | Gemini 2.5 Pro | 100.0% | $0.034 | 22.8s | 100%
101 | Aion 2.0 | 100.0% | $0.0080 | 1.2m | 100%
102 | Ministral 3 14B | 90.7% | $0.0009 | 6.0s | 75%
103 | ByteDance Seed 1.6 | 100.0% | $0.0073 | 1.3m | 100%
104 | Z.AI GLM 4.7 Flash | 98.7% | $0.0019 | 1.2m | 94%
105 | Gemini 3.5 Flash (Reasoning) | 100.0% | $0.040 | 16.4s | 100%
106 | Z.AI GLM 5.1 | 100.0% | $0.015 | 1.1m | 100%
107 | Claude Opus 4.7 (Reasoning) | 100.0% | $0.046 | 5.9s | 100%
108 | Claude Opus 4.7 | 100.0% | $0.047 | 6.1s | 100%
109 | Claude 3.5 Sonnet | 100.0% | $0.043 | 13.9s | 100%
110 | GPT-5 Nano | 99.5% | $0.0043 | 1.3m | 96%
111 | Grok 4 | 100.0% | $0.031 | 37.2s | 100%
112 | Nemotron 3 Nano | 99.9% | $0.0013 | 1.6m | 99%
113 | ByteDance Seed 2.0 Lite | 99.6% | $0.0078 | 1.4m | 98%
114 | GPT-5.1 | 100.0% | $0.035 | 43.6s | 100%
115 | Qwen 3.5 Plus (2026-04-20) | 100.0% | $0.015 | 1.4m | 100%
116 | Grok 4.3 (Reasoning) | 100.0% | $0.018 | 1.3m | 100%
117 | Z.AI GLM 4.7 | 100.0% | $0.0098 | 1.7m | 100%
118 | Cydonia 24B V4.1 | 92.3% | $0.0010 | 12.2s | 60%
119 | DeepSeek V3.1 | 95.5% | $0.0012 | 26.1s | 61%
120 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.055 | 21.4s | 100%
121 | Gemma 4 31B (Reasoning) | 100.0% | $0.0016 | 2.2m | 100%
122 | Gemma 3 12B | 89.8% | $0.0003 | 14.1s | 59%
123 | Qwen 3.5 397B A17B | 100.0% | $0.012 | 1.8m | 100%
124 | GPT-5.5 (Reasoning) | 100.0% | $0.056 | 25.8s | 100%
125 | Llama 3.1 70B | 90.6% | $0.0016 | 16.0s | 59%
126 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.052 | 36.0s | 100%
127 | GPT-5.4 (Reasoning) | 100.0% | $0.044 | 53.6s | 100%
128 | Gemini 3 Pro (Preview) | 100.0% | $0.055 | 35.6s | 100%
129 | Mistral Large | 93.4% | $0.011 | 7.6s | 56%
130 | Qwen3.7 Max | 100.0% | $0.041 | 1.1m | 100%
131 | Qwen 3.5 9B | 95.6% | $0.0013 | 1.5m | 79%
132 | DeepSeek V4 Pro (Reasoning) | 100.0% | $0.0093 | 2.3m | 100%
133 | Qwen 3.6 27B | 97.4% | $0.018 | 1.2m | 78%
134 | Arcee AI: Trinity Large (Preview) | 89.5% | $0.0000 | 20.5s | 46%
135 | GPT-5 | 100.0% | $0.049 | 1.3m | 100%
136 | MoonshotAI: Kimi K2.5 | 100.0% | $0.013 | 2.4m | 100%
137 | Llama 3.1 Nemotron 70B | 81.0% | $0.0050 | 17.3s | 55%
138 | Llama 3.1 8B | 78.9% | $0.0001 | 10.0s | 48%
139 | MoonshotAI: Kimi K2.6 | 100.0% | $0.026 | 2.4m | 100%
140 | Gemma 4 26B | 85.7% | $0.0006 | 18.7s | 36%
141 | Gemini 3.1 Pro (Preview) | 99.5% | $0.065 | 1.0m | 96%
142 | Qwen 3.5 27B | 95.0% | $0.021 | 1.6m | 70%
143 | Skyfall 36B V2 | 78.4% | $0.0018 | 12.7s | 36%
144 | ByteDance Seed 2.0 Mini | 98.2% | $0.0034 | 3.4m | 95%
145 | Claude Opus 4 | 100.0% | $0.110 | 13.5s | 100%
146 | Qwen3.6 Max Preview | 100.0% | $0.045 | 2.4m | 100%
147 | Gemma 4 26B (Reasoning) | 95.5% | $0.0023 | 2.6m | 61%
148 | Rocinante 12B | 51.3% | $0.0013 | 25.6s | 16%
149 | Qwen 3.5 122B | 100.0% | $0.079 | 3.6m | 100%
150 | LFM2 24B | 43.2% | $0.0002 | 12.5s | 9%
151 | Mistral NeMO | 30.0% | $0.0006 | 1.4s | 0%
Average score: 97.17%

Individual Scenarios

Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 99 | 99.8%
Grok 4 | 100 | 100 | 100 | 100 | 99 | 99.8%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 99 | 99.8%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 99 | 99.8%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 99 | 99.8%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 99 | 99.8%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 99 | 99.8%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.7%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemma 3 27B | 100 | 100 | 100 | 100 | 98 | 99.6%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 98 | 99.6%
DeepSeek-V2 Chat | 100 | 100 | 100 | 99 | 99 | 99.6%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 98 | 99.6%
Inception Mercury | 100 | 100 | 100 | 99 | 99 | 99.6%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 99 | 99 | 99.6%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 97 | 99.4%
Qwen 3 32B | 100 | 100 | 100 | 99 | 98 | 99.3%
Gemini 2.5 Flash Lite | 100 | 100 | 99 | 99 | 98 | 99.2%
Xiaomi MIMO v2.5 | 100 | 99 | 99 | 99 | 99 | 99.2%
Mistral Small 3.2 24B | 100 | 99 | 99 | 99 | 99 | 99.2%
Ministral 3 3B | 100 | 100 | 99 | 98 | 98 | 99.0%
GPT-4o, Aug. 6th (temp=0) | 99 | 99 | 99 | 99 | 99 | 98.9%
Qwen 2.5 72B | 100 | 100 | 99 | 99 | 95 | 98.5%
Ministral 3 8B | 100 | 100 | 98 | 98 | 96 | 98.4%
Ministral 8B | 100 | 99 | 99 | 98 | 96 | 98.3%
GPT-4o Mini (temp=1) | 100 | 99 | 99 | 97 | 96 | 98.1%
Mistral Large 2 | 99 | 99 | 99 | 99 | 94 | 98.1%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 90 | 98.0%
LFM2 24B | 100 | 100 | 99 | 97 | 94 | 97.9%
Mistral Small 4 | 100 | 100 | 98 | 97 | 94 | 97.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 98 | 91 | 97.8%
Ministral 3B | 100 | 98 | 98 | 97 | 96 | 97.8%
Hermes 3 405B | 99 | 99 | 99 | 97 | 95 | 97.7%
Llama 3.1 70B | 100 | 99 | 99 | 96 | 94 | 97.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 99 | 88 | 97.5%
Gemma 3 12B | 100 | 100 | 99 | 99 | 89 | 97.5%
GPT-4o Mini (temp=0) | 100 | 100 | 96 | 96 | 94 | 97.4%
Writer: Palmyra X5 | 100 | 98 | 98 | 96 | 95 | 97.4%
Mistral Small Creative | 99 | 96 | 96 | 96 | 96 | 96.8%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 83 | 96.7%
ByteDance Seed 2.0 Mini | 99 | 98 | 96 | 95 | 94 | 96.6%
Llama 3.1 Nemotron 70B | 98 | 98 | 97 | 95 | 94 | 96.3%
Claude 3 Haiku | 100 | 98 | 96 | 95 | 92 | 96.3%
ByteDance Seed 1.6 Flash | 99 | 99 | 96 | 94 | 93 | 96.2%
Cohere Command R+ (Aug. 2024) | 98 | 98 | 97 | 96 | 93 | 96.2%
Arcee AI: Trinity Mini | 99 | 99 | 95 | 95 | 92 | 95.9%
Qwen3 235B A22B Instruct 2507 | 96 | 96 | 96 | 96 | 95 | 95.6%
Hermes 3 70B | 100 | 99 | 99 | 98 | 83 | 95.6%
Gemma 3 4B | 100 | 96 | 96 | 95 | 90 | 95.4%
Mistral Large 3 | 98 | 98 | 93 | 93 | 93 | 95.2%
Ministral 3 14B | 99 | 96 | 95 | 92 | 92 | 94.9%
GPT-4.1 Nano | 100 | 98 | 98 | 92 | 80 | 93.6%
Cydonia 24B V4.1 | 96 | 96 | 95 | 94 | 78 | 91.7%
Qwen 3.5 9B | 100 | 100 | 100 | 99 | 53 | 90.4%
Skyfall 36B V2 | 100 | 94 | 90 | 86 | 60 | 85.9%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 10 | 82.0%
Arcee AI: Trinity Large (Preview) | 99 | 99 | 99 | 87 | 10 | 78.8%
Mistral Large | 99 | 98 | 93 | 93 | 0 | 76.7%
Llama 3.1 8B | 94 | 91 | 88 | 84 | 25 | 76.5%
Gemma 4 26B | 100 | 100 | 90 | 10 | 10 | 62.1%
Rocinante 12B | 96 | 89 | 79 | 0 | 0 | 52.8%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%

Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 98 | 99.6%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 98 | 99.6%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 98 | 99.6%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 98 | 99.5%
Ministral 3 3B | 100 | 100 | 100 | 100 | 98 | 99.5%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.3%
Mistral Large 2 | 100 | 100 | 100 | 98 | 98 | 99.2%
MiniMax M2.5 | 100 | 100 | 100 | 98 | 98 | 99.2%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 96 | 99.2%
Qwen 3 32B | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 96 | 99.2%
Ministral 3B | 100 | 100 | 100 | 98 | 98 | 99.1%
Hermes 3 70B | 100 | 100 | 100 | 98 | 98 | 99.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 98 | 96 | 98.9%
Hermes 3 405B | 100 | 100 | 100 | 98 | 96 | 98.7%
Mistral Small 3.2 24B | 100 | 100 | 98 | 98 | 98 | 98.7%
Gemma 4 26B | 100 | 100 | 100 | 96 | 96 | 98.6%
Mistral Large | 100 | 98 | 98 | 98 | 98 | 98.5%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 98 | 98 | 96 | 98.5%
Cydonia 24B V4.1 | 100 | 100 | 98 | 98 | 96 | 98.4%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 97 | 94 | 98.3%
Mistral Large 3 | 98 | 98 | 98 | 98 | 98 | 98.1%
GPT-4.1 | 100 | 100 | 100 | 96 | 93 | 98.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 97 | 93 | 97.9%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 98 | 98 | 96 | 96 | 97.9%
Z.AI GLM 4.5 Air | 100 | 98 | 98 | 96 | 95 | 97.3%
GPT-5.4 | 100 | 100 | 96 | 96 | 93 | 97.1%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 85 | 97.0%
Qwen 2.5 72B | 100 | 100 | 96 | 95 | 93 | 97.0%
ByteDance Seed 2.0 Mini | 100 | 98 | 96 | 96 | 94 | 97.0%
Writer: Palmyra X5 | 100 | 100 | 96 | 95 | 93 | 96.9%
Gemini 2.5 Flash Lite | 100 | 98 | 98 | 96 | 90 | 96.5%
Qwen 3.5 9B | 100 | 100 | 96 | 96 | 89 | 96.4%
ByteDance Seed 1.6 Flash | 100 | 98 | 96 | 94 | 92 | 96.1%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 97 | 94 | 90 | 96.1%
Grok 4.3 | 100 | 100 | 100 | 100 | 75 | 95.0%
Mistral Small Creative | 97 | 96 | 96 | 94 | 91 | 94.9%
Mistral Medium 3.1 | 96 | 94 | 94 | 94 | 94 | 94.5%
Skyfall 36B V2 | 98 | 98 | 97 | 95 | 83 | 94.4%
Mistral Small 4 | 100 | 100 | 96 | 91 | 83 | 94.2%
Arcee AI: Trinity Mini | 98 | 97 | 93 | 92 | 90 | 93.8%
GPT-4.1 Nano | 98 | 94 | 93 | 93 | 89 | 93.4%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 96 | 95 | 75 | 93.3%
Gemini 3 Flash (Preview) | 100 | 93 | 93 | 89 | 89 | 92.9%
Gemma 3 4B | 96 | 93 | 91 | 91 | 90 | 92.3%
Llama 3.1 70B | 98 | 94 | 92 | 90 | 86 | 91.9%
Ministral 3 14B | 92 | 91 | 89 | 89 | 88 | 90.0%
Llama 3.1 8B | 93 | 88 | 88 | 82 | 75 | 85.0%
Arcee AI: Trinity Large (Preview) | 100 | 98 | 98 | 94 | 10 | 80.0%
Gemma 3 12B | 90 | 89 | 88 | 88 | 10 | 73.0%
Llama 3.1 Nemotron 70B | 91 | 84 | 79 | 77 | 31 | 72.5%
Rocinante 12B | 85 | 83 | 79 | 69 | 10 | 65.3%
LFM2 24B | 25 | 25 | 25 | 25 | 25 | 25.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%

Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 98 | 99.6%
Hermes 3 405B | 100 | 100 | 100 | 100 | 98 | 99.6%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 97 | 99.4%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 97 | 99.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 97 | 99.3%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 97 | 99.3%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 97 | 99.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 95 | 99.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 98 | 97 | 99.0%
Ministral 3 3B | 100 | 100 | 100 | 98 | 97 | 99.0%
GPT-4.1 Mini | 100 | 100 | 100 | 98 | 97 | 98.9%
Cydonia 24B V4.1 | 100 | 100 | 100 | 99 | 95 | 98.7%
GPT-4.1 | 100 | 100 | 100 | 100 | 93 | 98.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 92 | 98.3%
Mistral Large 2 | 100 | 100 | 100 | 97 | 94 | 98.2%
Mistral Large | 100 | 100 | 100 | 97 | 94 | 98.2%
Ministral 3B | 100 | 100 | 98 | 97 | 96 | 98.2%
Qwen 3 32B | 100 | 100 | 100 | 98 | 93 | 98.2%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 96 | 95 | 98.2%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 90 | 98.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 98 | 98 | 95 | 98.0%
Gemma 3 27B | 100 | 100 | 98 | 97 | 95 | 97.9%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 89 | 97.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 89 | 97.9%
Mistral Small 4 (Reasoning) | 100 | 100 | 97 | 94 | 94 | 97.0%
Qwen 3.5 9B | 100 | 100 | 100 | 94 | 91 | 96.9%
Gemma 3 4B | 98 | 97 | 96 | 96 | 95 | 96.6%
Mistral Small 4 | 100 | 100 | 100 | 98 | 84 | 96.5%
GPT-4o, May 13th (temp=1) | 100 | 98 | 96 | 96 | 89 | 96.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 93 | 86 | 95.8%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 96 | 93 | 89 | 95.7%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 76 | 95.2%
Hermes 3 70B | 100 | 100 | 98 | 98 | 78 | 94.7%
Claude 3 Haiku | 100 | 98 | 96 | 90 | 88 | 94.2%
Qwen 2.5 72B | 98 | 96 | 93 | 92 | 88 | 93.5%
Gemma 3 12B | 97 | 94 | 94 | 91 | 90 | 93.1%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 64 | 92.8%
Writer: Palmyra X5 | 96 | 96 | 95 | 90 | 86 | 92.5%
Gemini 2.5 Flash Lite | 100 | 100 | 98 | 93 | 71 | 92.5%
GPT-4.1 Nano | 96 | 93 | 91 | 89 | 88 | 91.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 94 | 56 | 90.0%
Mistral Small Creative | 100 | 90 | 81 | 78 | 78 | 85.5%
Gemma 4 26B | 100 | 100 | 100 | 100 | 10 | 82.0%
Ministral 3 14B | 88 | 86 | 85 | 70 | 65 | 78.7%
Llama 3.1 70B | 95 | 94 | 93 | 91 | 10 | 76.7%
Llama 3.1 Nemotron 70B | 77 | 73 | 69 | 67 | 50 | 67.3%
Llama 3.1 8B | 88 | 81 | 70 | 69 | 10 | 63.4%
Mistral NeMO | 100 | 100 | 100 | 0 | 0 | 60.0%
Skyfall 36B V2 | 100 | 88 | 80 | 10 | 10 | 57.7%
Rocinante 12B | 92 | 83 | 60 | 50 | 0 | 57.1%
LFM2 24B | 25 | 25 | 25 | 25 | 25 | 25.0%

Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 98 | 99.5%
Hermes 3 405B | 100 | 100 | 100 | 100 | 98 | 99.5%
Gemma 3 4B | 100 | 100 | 100 | 100 | 98 | 99.5%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 97 | 99.4%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 96 | 99.3%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
Ministral 3 14B | 100 | 100 | 100 | 98 | 98 | 99.2%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 95 | 99.1%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 95 | 99.1%
Hermes 3 70B | 100 | 100 | 100 | 98 | 98 | 99.1%
Qwen 3 32B | 100 | 100 | 100 | 100 | 95 | 99.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 94 | 98.9%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 94 | 98.8%
Ministral 3 3B | 100 | 100 | 100 | 100 | 94 | 98.8%
Grok 4 Fast | 100 | 100 | 98 | 98 | 98 | 98.7%
ByteDance Seed 1.6 Flash | 100 | 100 | 98 | 98 | 98 | 98.7%
ByteDance Seed 2.0 Lite | 100 | 100 | 98 | 98 | 98 | 98.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 96 | 95 | 98.3%
Ministral 3B | 100 | 100 | 100 | 96 | 95 | 98.3%
GPT-5 Nano | 100 | 100 | 100 | 100 | 91 | 98.2%
GPT-4.1 Nano | 100 | 100 | 100 | 96 | 94 | 98.1%
GPT-5.4 Nano | 100 | 100 | 100 | 97 | 93 | 98.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 96 | 94 | 97.9%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 88 | 97.7%
Mistral Small 4 | 100 | 100 | 100 | 100 | 85 | 96.9%
GPT-4o Mini (temp=1) | 100 | 100 | 94 | 94 | 94 | 96.7%
Llama 3.1 70B | 100 | 98 | 97 | 95 | 92 | 96.4%
GPT-5.4 | 100 | 96 | 96 | 96 | 92 | 95.8%
Gemma 3 12B | 97 | 97 | 96 | 96 | 92 | 95.4%
Arcee AI: Trinity Mini | 100 | 100 | 97 | 90 | 89 | 95.2%
Llama 3.1 8B | 98 | 95 | 91 | 85 | 85 | 90.7%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 50 | 90.0%
GPT-4.1 | 100 | 100 | 91 | 83 | 75 | 89.8%
Gemini 2.5 Flash Lite | 100 | 93 | 90 | 85 | 75 | 88.6%
Llama 3.1 Nemotron 70B | 95 | 93 | 92 | 82 | 77 | 87.9%
Ministral 8B | 100 | 100 | 79 | 77 | 73 | 85.8%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 10 | 82.0%
Cydonia 24B V4.1 | 100 | 100 | 100 | 92 | 10 | 80.3%
Qwen 3.5 27B | 100 | 100 | 100 | 50 | 50 | 80.0%
Skyfall 36B V2 | 100 | 90 | 90 | 88 | 10 | 75.5%
Mistral NeMO | 100 | 100 | 100 | 0 | 0 | 60.0%
Rocinante 12B | 83 | 67 | 0 | 0 | 0 | 30.0%
LFM2 24B | 25 | 25 | 25 | 25 | 25 | 25.0%