Structural validity

Test: Codex Extraction

Avg. Score: 96.6%
Scenarios: 4

Overall Performance

| Rank | Model | Score | Avg. Cost | Avg. Time | Stability |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Flash | 99.8% | $0.0023 | 2.5s | 99% |
| 2 | Ministral 3 8B | 99.6% | $0.0006 | 3.3s | 98% |
| 3 | Mistral Small 3.2 24B | 99.5% | $0.0005 | 4.4s | 98% |
| 4 | Ministral 3 3B | 99.1% | $0.0004 | 1.8s | 97% |
| 5 | GPT-4.1 Mini | 99.6% | $0.0015 | 6.6s | 98% |
| 6 | Ministral 3B | 98.3% | $0.0002 | 1.8s | 95% |
| 7 | Grok 4 Fast | 99.7% | $0.0012 | 8.7s | 98% |
| 8 | Claude Haiku 4.5 | 100.0% | $0.0073 | 4.5s | 100% |
| 9 | DeepSeek V3 (2024-12-26) | 100.0% | $0.0017 | 13.5s | 100% |
| 10 | Qwen 3.5 Plus (2026-02-15) | 99.8% | $0.0030 | 10.6s | 99% |
| 11 | GPT-4o, Aug. 6th (temp=1) | 99.8% | $0.0092 | 3.8s | 99% |
| 12 | DeepSeek-V2 Chat | 99.9% | $0.0019 | 14.8s | 99% |
| 13 | Mistral Medium 3.1 | 98.6% | $0.0026 | 5.8s | 95% |
| 14 | GPT-4o, Aug. 6th (temp=0) | 99.7% | $0.0100 | 4.0s | 99% |
| 15 | Gemma 3 27B | 99.4% | $0.0005 | 14.5s | 97% |
| 16 | GPT-4o Mini (temp=1) | 98.0% | $0.0006 | 8.0s | 95% |
| 17 | Gemini 3 Flash (Preview) | 98.2% | $0.0027 | 3.9s | 93% |
| 18 | GPT-4o Mini (temp=0) | 98.3% | $0.0005 | 7.9s | 93% |
| 19 | Gemini 2.5 Flash (Reasoning) | 100.0% | $0.0082 | 11.8s | 100% |
| 20 | Mistral Large 3 | 98.3% | $0.0027 | 8.2s | 95% |
| 21 | Grok 4.1 Fast | 100.0% | $0.0017 | 22.1s | 100% |
| 22 | Claude 3 Haiku | 97.3% | $0.0017 | 5.5s | 92% |
| 23 | Z.AI GLM 4.5 | 99.5% | $0.0028 | 16.8s | 96% |
| 24 | DeepSeek V3 (2025-03-24) | 100.0% | $0.0014 | 27.0s | 100% |
| 25 | Gemma 3 4B | 95.9% | $0.0002 | 6.7s | 90% |
| 26 | Hermes 3 405B | 98.9% | $0.0040 | 17.8s | 97% |
| 27 | Qwen 2.5 72B | 97.0% | $0.0008 | 11.0s | 92% |
| 28 | Arcee AI: Trinity Mini | 95.8% | $0.0003 | 5.8s | 90% |
| 29 | Mistral Large 2 | 98.9% | $0.011 | 8.3s | 96% |
| 30 | Gemini 2.5 Flash Lite (Reasoning) | 97.7% | $0.0021 | 15.6s | 92% |
| 31 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.0096 | 22.2s | 100% |
| 32 | Claude Sonnet 4.5 | 100.0% | $0.022 | 6.6s | 100% |
| 33 | Minimax M2.5 | 99.7% | $0.0023 | 30.1s | 99% |
| 34 | Writer: Palmyra X5 | 96.7% | $0.0052 | 7.5s | 90% |
| 35 | Claude 3.7 Sonnet | 100.0% | $0.021 | 7.8s | 100% |
| 36 | WizardLM 2 8x22b | 99.9% | $0.0026 | 31.3s | 99% |
| 37 | Claude Sonnet 4 | 100.0% | $0.022 | 7.9s | 100% |
| 38 | Claude Sonnet 4.6 | 100.0% | $0.022 | 7.4s | 100% |
| 39 | Ministral 8B | 96.0% | $0.0004 | 3.8s | 83% |
| 40 | GPT-4.1 Nano | 94.1% | $0.0003 | 2.8s | 85% |
| 41 | DeepSeek V3.2 | 100.0% | $0.0011 | 36.7s | 100% |
| 42 | o4 Mini | 100.0% | $0.014 | 21.5s | 100% |
| 43 | Gemini 2.5 Flash Lite | 94.2% | $0.0005 | 1.9s | 82% |
| 44 | GPT-4.1 | 96.6% | $0.0073 | 4.7s | 87% |
| 45 | GPT-5.2 | 100.0% | $0.018 | 16.8s | 100% |
| 46 | Mistral Small Creative | 94.3% | $0.0006 | 3.9s | 83% |
| 47 | Hermes 3 70B | 97.1% | $0.0012 | 15.8s | 87% |
| 48 | GPT-4o, May 13th (temp=0) | 99.5% | $0.025 | 4.3s | 95% |
| 49 | Z.AI GLM 4.6 | 100.0% | $0.0057 | 38.8s | 100% |
| 50 | GPT-4o, May 13th (temp=1) | 98.4% | $0.025 | 4.0s | 94% |
| 51 | Cohere Command R+ (Aug. 2024) | 97.7% | $0.014 | 17.9s | 94% |
| 52 | GPT-5 Mini | 100.0% | $0.0070 | 45.5s | 100% |
| 53 | Claude Opus 4.5 | 100.0% | $0.037 | 7.7s | 100% |
| 54 | ByteDance Seed 1.6 Flash | 97.3% | $0.0011 | 39.3s | 93% |
| 55 | Claude Opus 4.6 | 100.0% | $0.037 | 8.1s | 100% |
| 56 | Ministral 3 14B | 90.7% | $0.0009 | 6.0s | 75% |
| 57 | Z.AI GLM 5 | 100.0% | $0.0091 | 51.4s | 100% |
| 58 | Gemini 2.5 Pro | 100.0% | $0.034 | 22.8s | 100% |
| 59 | Claude 3.5 Sonnet | 100.0% | $0.043 | 13.9s | 100% |
| 60 | o4 Mini High | 100.0% | $0.025 | 40.5s | 100% |
| 61 | Grok 4 | 100.0% | $0.031 | 37.2s | 100% |
| 62 | Aion 2.0 | 100.0% | $0.0080 | 1.2m | 100% |
| 63 | ByteDance Seed 1.6 | 100.0% | $0.0073 | 1.3m | 100% |
| 64 | Z.AI GLM 4.7 Flash | 98.7% | $0.0019 | 1.2m | 94% |
| 65 | GPT-5.1 | 100.0% | $0.035 | 43.6s | 100% |
| 66 | GPT-5 Nano | 99.5% | $0.0043 | 1.3m | 96% |
| 67 | DeepSeek V3.1 | 95.5% | $0.0012 | 26.1s | 61% |
| 68 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.055 | 21.4s | 100% |
| 69 | Gemma 3 12B | 89.8% | $0.0003 | 14.1s | 59% |
| 70 | Llama 3.1 70B | 90.6% | $0.0016 | 16.0s | 59% |
| 71 | Mistral Large | 93.4% | $0.011 | 7.6s | 56% |
| 72 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.052 | 36.0s | 100% |
| 73 | Gemini 3 Pro (Preview) | 100.0% | $0.055 | 35.6s | 100% |
| 74 | Z.AI GLM 4.7 | 100.0% | $0.0098 | 1.7m | 100% |
| 75 | Arcee AI: Trinity Large (Preview) | 89.5% | $0.0000 | 20.5s | 46% |
| 76 | Qwen 3.5 397B A17B | 100.0% | $0.012 | 1.8m | 100% |
| 77 | Llama 3.1 Nemotron 70B | 81.0% | $0.0050 | 17.3s | 55% |
| 78 | Llama 3.1 8B | 78.9% | $0.0001 | 10.0s | 48% |
| 79 | GPT-5 | 100.0% | $0.049 | 1.3m | 100% |
| 80 | Gemini 3.1 Pro (Preview) | 99.5% | $0.065 | 1.0m | 96% |
| 81 | Claude Opus 4 | 100.0% | $0.110 | 13.5s | 100% |
| 82 | MoonshotAI: Kimi K2.5 | 100.0% | $0.013 | 2.4m | 100% |
| 83 | Rocinante 12B | 51.3% | $0.0013 | 25.6s | 16% |
| 84 | Mistral NeMO | 30.0% | $0.0006 | 1.4s | 0% |
Avg. Score: 96.62%

Individual Scenarios

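| Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 99 | 99.8% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 99 | 99.8% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 99 | 99.8% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 99 | 99.8% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 99 | 99.8% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 98 | 99.6% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 98 | 99.6% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 99 | 99 | 99.6% |
| Minimax M2.5 | 100 | 100 | 100 | 100 | 98 | 99.6% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 99 | 99 | 99.6% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 97 | 99.4% |
| Gemini 2.5 Flash Lite | 100 | 100 | 99 | 99 | 98 | 99.2% |
| Mistral Small 3.2 24B | 100 | 99 | 99 | 99 | 99 | 99.2% |
| Ministral 3 3B | 100 | 100 | 99 | 98 | 98 | 99.0% |
| GPT-4o, Aug. 6th (temp=0) | 99 | 99 | 99 | 99 | 99 | 98.9% |
| Qwen 2.5 72B | 100 | 100 | 99 | 99 | 95 | 98.5% |
| Ministral 3 8B | 100 | 100 | 98 | 98 | 96 | 98.4% |
| Ministral 8B | 100 | 99 | 99 | 98 | 96 | 98.3% |
| GPT-4o Mini (temp=1) | 100 | 99 | 99 | 97 | 96 | 98.1% |
| Mistral Large 2 | 99 | 99 | 99 | 99 | 94 | 98.1% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 90 | 98.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 98 | 91 | 97.8% |
| Ministral 3B | 100 | 98 | 98 | 97 | 96 | 97.8% |
| Hermes 3 405B | 99 | 99 | 99 | 97 | 95 | 97.7% |
| Llama 3.1 70B | 100 | 99 | 99 | 96 | 94 | 97.5% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 99 | 88 | 97.5% |
| Gemma 3 12B | 100 | 100 | 99 | 99 | 89 | 97.5% |
| GPT-4o Mini (temp=0) | 100 | 100 | 96 | 96 | 94 | 97.4% |
| Writer: Palmyra X5 | 100 | 98 | 98 | 96 | 95 | 97.4% |
| Mistral Small Creative | 99 | 96 | 96 | 96 | 96 | 96.8% |
| Llama 3.1 Nemotron 70B | 98 | 98 | 97 | 95 | 94 | 96.3% |
| Claude 3 Haiku | 100 | 98 | 96 | 95 | 92 | 96.3% |
| ByteDance Seed 1.6 Flash | 99 | 99 | 96 | 94 | 93 | 96.2% |
| Cohere Command R+ (Aug. 2024) | 98 | 98 | 97 | 96 | 93 | 96.2% |
| Arcee AI: Trinity Mini | 99 | 99 | 95 | 95 | 92 | 95.9% |
| Hermes 3 70B | 100 | 99 | 99 | 98 | 83 | 95.6% |
| Gemma 3 4B | 100 | 96 | 96 | 95 | 90 | 95.4% |
| Mistral Large 3 | 98 | 98 | 93 | 93 | 93 | 95.2% |
| Ministral 3 14B | 99 | 96 | 95 | 92 | 92 | 94.9% |
| GPT-4.1 Nano | 100 | 98 | 98 | 92 | 80 | 93.6% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 10 | 82.0% |
| Arcee AI: Trinity Large (Preview) | 99 | 99 | 99 | 87 | 10 | 78.8% |
| Mistral Large | 99 | 98 | 93 | 93 | 0 | 76.7% |
| Llama 3.1 8B | 94 | 91 | 88 | 84 | 25 | 76.5% |
| Rocinante 12B | 96 | 89 | 79 | 0 | 0 | 52.8% |
| Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0% |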
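
| Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 98 | 99.6% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 98 | 99.6% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 98 | 99.6% |
| Claude 3 Haiku | 100 | 100 | 100 | 100 | 98 | 99.6% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 98 | 99.5% |
| Ministral 3 3B | 100 | 100 | 100 | 100 | 98 | 99.5% |
| Mistral Large 2 | 100 | 100 | 100 | 98 | 98 | 99.2% |
| Minimax M2.5 | 100 | 100 | 100 | 98 | 98 | 99.2% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 96 | 99.2% |
| Ministral 3B | 100 | 100 | 100 | 98 | 98 | 99.1% |
| Hermes 3 70B | 100 | 100 | 100 | 98 | 98 | 99.0% |
| Hermes 3 405B | 100 | 100 | 100 | 98 | 96 | 98.7% |
| Mistral Small 3.2 24B | 100 | 100 | 98 | 98 | 98 | 98.7% |
| Mistral Large | 100 | 98 | 98 | 98 | 98 | 98.5% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 98 | 98 | 96 | 98.5% |
| GPT-4o Mini (temp=1) | 100 | 100 | 100 | 97 | 94 | 98.3% |
| Mistral Large 3 | 98 | 98 | 98 | 98 | 98 | 98.1% |
| GPT-4.1 | 100 | 100 | 100 | 96 | 93 | 98.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 98 | 98 | 96 | 96 | 97.9% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 85 | 97.0% |
| Qwen 2.5 72B | 100 | 100 | 96 | 95 | 93 | 97.0% |
| Writer: Palmyra X5 | 100 | 100 | 96 | 95 | 93 | 96.9% |
| Gemini 2.5 Flash Lite | 100 | 98 | 98 | 96 | 90 | 96.5% |
| ByteDance Seed 1.6 Flash | 100 | 98 | 96 | 94 | 92 | 96.1% |
| Mistral Small Creative | 97 | 96 | 96 | 94 | 91 | 94.9% |
| Mistral Medium 3.1 | 96 | 94 | 94 | 94 | 94 | 94.5% |
| Arcee AI: Trinity Mini | 98 | 97 | 93 | 92 | 90 | 93.8% |
| GPT-4.1 Nano | 98 | 94 | 93 | 93 | 89 | 93.4% |
| Gemini 3 Flash (Preview) | 100 | 93 | 93 | 89 | 89 | 92.9% |
| Gemma 3 4B | 96 | 93 | 91 | 91 | 90 | 92.3% |
| Llama 3.1 70B | 98 | 94 | 92 | 90 | 86 | 91.9% |
| Ministral 3 14B | 92 | 91 | 89 | 89 | 88 | 90.0% |
| Llama 3.1 8B | 93 | 88 | 88 | 82 | 75 | 85.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 98 | 98 | 94 | 10 | 80.0% |
| Gemma 3 12B | 90 | 89 | 88 | 88 | 10 | 73.0% |
| Llama 3.1 Nemotron 70B | 91 | 84 | 79 | 77 | 31 | 72.5% |
| Rocinante 12B | 85 | 83 | 79 | 69 | 10 | 65.3% |
| Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0% |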
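
| Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 98 | 99.6% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 97 | 99.3% |
| GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 95 | 99.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 98 | 97 | 99.0% |
| Ministral 3 3B | 100 | 100 | 100 | 98 | 97 | 99.0% |
| GPT-4.1 Mini | 100 | 100 | 100 | 98 | 97 | 98.9% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 93 | 98.7% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 92 | 98.3% |
| Mistral Large 2 | 100 | 100 | 100 | 97 | 94 | 98.2% |
| Mistral Large | 100 | 100 | 100 | 97 | 94 | 98.2% |
| Ministral 3B | 100 | 100 | 98 | 97 | 96 | 98.2% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 96 | 95 | 98.2% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 90 | 98.0% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 98 | 98 | 95 | 98.0% |
| Gemma 3 27B | 100 | 100 | 98 | 97 | 95 | 97.9% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 89 | 97.9% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 89 | 97.9% |
| Gemma 3 4B | 98 | 97 | 96 | 96 | 95 | 96.6% |
| GPT-4o, May 13th (temp=1) | 100 | 98 | 96 | 96 | 89 | 96.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 93 | 86 | 95.8% |
| Hermes 3 70B | 100 | 100 | 98 | 98 | 78 | 94.7% |
| Claude 3 Haiku | 100 | 98 | 96 | 90 | 88 | 94.2% |
| Qwen 2.5 72B | 98 | 96 | 93 | 92 | 88 | 93.5% |
| Gemma 3 12B | 97 | 94 | 94 | 91 | 90 | 93.1% |
| Writer: Palmyra X5 | 96 | 96 | 95 | 90 | 86 | 92.5% |
| Gemini 2.5 Flash Lite | 100 | 100 | 98 | 93 | 71 | 92.5% |
| GPT-4.1 Nano | 96 | 93 | 91 | 89 | 88 | 91.3% |
| Mistral Small Creative | 100 | 90 | 81 | 78 | 78 | 85.5% |
| Ministral 3 14B | 88 | 86 | 85 | 70 | 65 | 78.7% |
| Llama 3.1 70B | 95 | 94 | 93 | 91 | 10 | 76.7% |
| Llama 3.1 Nemotron 70B | 77 | 73 | 69 | 67 | 50 | 67.3% |
| Llama 3.1 8B | 88 | 81 | 70 | 69 | 10 | 63.4% |
| Mistral NeMO | 100 | 100 | 100 | 0 | 0 | 60.0% |
| Rocinante 12B | 92 | 83 | 60 | 50 | 0 | 57.1% |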
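
| Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.6% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 98 | 99.5% |
| Gemma 3 4B | 100 | 100 | 100 | 100 | 98 | 99.5% |
| Claude 3 Haiku | 100 | 100 | 100 | 100 | 96 | 99.3% |
| Ministral 3 14B | 100 | 100 | 100 | 98 | 98 | 99.2% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 95 | 99.1% |
| Hermes 3 70B | 100 | 100 | 100 | 98 | 98 | 99.1% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 94 | 98.9% |
| Ministral 3 3B | 100 | 100 | 100 | 100 | 94 | 98.8% |
| Grok 4 Fast | 100 | 100 | 98 | 98 | 98 | 98.7% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 98 | 98 | 98 | 98.7% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 96 | 95 | 98.3% |
| Ministral 3B | 100 | 100 | 100 | 96 | 95 | 98.3% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 91 | 98.2% |
| GPT-4.1 Nano | 100 | 100 | 100 | 96 | 94 | 98.1% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 96 | 94 | 97.9% |
| GPT-4o Mini (temp=1) | 100 | 100 | 94 | 94 | 94 | 96.7% |
| Llama 3.1 70B | 100 | 98 | 97 | 95 | 92 | 96.4% |
| Gemma 3 12B | 97 | 97 | 96 | 96 | 92 | 95.4% |
| Arcee AI: Trinity Mini | 100 | 100 | 97 | 90 | 89 | 95.2% |
| Llama 3.1 8B | 98 | 95 | 91 | 85 | 85 | 90.7% |
| GPT-4.1 | 100 | 100 | 91 | 83 | 75 | 89.8% |
| Gemini 2.5 Flash Lite | 100 | 93 | 90 | 85 | 75 | 88.6% |
| Llama 3.1 Nemotron 70B | 95 | 93 | 92 | 82 | 77 | 87.9% |
| Ministral 8B | 100 | 100 | 79 | 77 | 73 | 85.8% |
| Mistral NeMO | 100 | 100 | 100 | 0 | 0 | 60.0% |
| Rocinante 12B | 83 | 67 | 0 | 0 | 0 | 30.0% |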