Precision

Test: Codex Extraction

Avg. Score: 95.7%
Scenarios: 4

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 3.1 Flash Lite (Preview) | 100.0% | $0.0017 | 2.0s | 100%
2 | Ministral 3 8B | 99.5% | $0.0006 | 3.3s | 97%
3 | Gemini 3 Flash (Preview) | 99.7% | $0.0027 | 3.9s | 97%
4 | Qwen 3.5 Plus (2026-02-15) | 100.0% | $0.0030 | 10.6s | 100%
5 | GPT-4.1 Mini | 99.5% | $0.0015 | 6.6s | 97%
6 | LFM2 24B | 99.5% | $0.0002 | 12.5s | 97%
7 | GPT-4o, Aug. 6th (temp=1) | 100.0% | $0.0092 | 3.8s | 100%
8 | Gemini 2.5 Flash Lite | 98.0% | $0.0005 | 1.9s | 94%
9 | GPT-4o, Aug. 6th (temp=0) | 100.0% | $0.0100 | 4.0s | 100%
10 | Ministral 8B | 98.2% | $0.0004 | 3.8s | 94%
11 | Mistral Medium 3.1 | 99.2% | $0.0026 | 5.8s | 95%
12 | Grok 4.1 Fast | 100.0% | $0.0017 | 22.1s | 100%
13 | Gemini 2.5 Flash | 98.5% | $0.0023 | 2.5s | 93%
14 | Ministral 3B | 98.1% | $0.0002 | 1.8s | 91%
15 | Mistral Small 3.2 24B | 98.3% | $0.0005 | 4.4s | 92%
16 | Mistral Small Creative | 97.7% | $0.0006 | 3.9s | 92%
17 | Z.AI GLM 4.5 | 99.6% | $0.0028 | 16.8s | 97%
18 | Grok 4 Fast | 98.2% | $0.0012 | 8.7s | 94%
19 | Grok 4.20 (Beta) | 98.1% | $0.0049 | 2.0s | 94%
20 | DeepSeek V3 (2024-12-26) | 98.8% | $0.0017 | 13.5s | 94%
21 | Hermes 3 405B | 99.6% | $0.0040 | 17.8s | 97%
22 | Ministral 3 3B | 96.4% | $0.0004 | 1.8s | 90%
23 | Z.AI GLM 5 Turbo | 99.6% | $0.0068 | 16.0s | 98%
24 | Qwen 2.5 72B | 98.1% | $0.0008 | 11.0s | 92%
25 | GPT-5.4 Mini | 97.3% | $0.0031 | 2.2s | 91%
26 | DeepSeek-V2 Chat | 98.4% | $0.0019 | 14.8s | 94%
27 | GPT-4o Mini (temp=1) | 97.1% | $0.0006 | 8.0s | 91%
28 | GPT-4o Mini (temp=0) | 97.2% | $0.0005 | 7.9s | 90%
29 | DeepSeek V3 (2025-03-24) | 99.6% | $0.0014 | 27.0s | 96%
30 | MiniMax M2.7 | 99.4% | $0.0022 | 21.7s | 95%
31 | Claude 3 Haiku | 97.6% | $0.0017 | 5.5s | 89%
32 | Writer: Palmyra X5 | 97.0% | $0.0052 | 7.5s | 92%
33 | Inception Mercury 2 | 95.4% | $0.0022 | 3.5s | 89%
34 | Qwen3 235B A22B Instruct 2507 | 97.7% | $0.0007 | 19.3s | 92%
35 | Hermes 3 70B | 96.9% | $0.0012 | 15.8s | 91%
36 | Claude Haiku 4.5 | 97.5% | $0.0073 | 4.5s | 91%
37 | GPT-5.4 Mini (Reasoning, Low) | 95.9% | $0.0042 | 3.7s | 88%
38 | Llama 3.1 Nemotron 70B | 97.3% | $0.0050 | 17.3s | 93%
39 | Claude 3.7 Sonnet | 100.0% | $0.021 | 7.8s | 100%
40 | Claude Sonnet 4.5 | 100.0% | $0.022 | 6.6s | 100%
41 | Claude Sonnet 4 | 100.0% | $0.022 | 7.9s | 100%
42 | Arcee AI: Trinity Mini | 94.7% | $0.0003 | 5.8s | 87%
43 | Gemini 2.5 Flash Lite (Reasoning) | 96.9% | $0.0021 | 15.6s | 89%
44 | Grok 4.20 (Beta, Reasoning) | 100.0% | $0.020 | 12.7s | 100%
45 | MiniMax M2.5 | 98.3% | $0.0023 | 30.1s | 94%
46 | GPT-4o, May 13th (temp=0) | 100.0% | $0.025 | 4.3s | 100%
47 | Gemma 3 27B | 96.0% | $0.0005 | 14.5s | 87%
48 | Gemini 3 Flash (Preview, Reasoning) | 98.9% | $0.0096 | 22.2s | 95%
49 | Stealth: Healer Alpha | 96.2% | $0.0000 | 24.6s | 90%
50 | Claude Sonnet 4.6 | 99.7% | $0.022 | 7.4s | 97%
51 | Mistral Large 3 | 95.0% | $0.0027 | 8.2s | 86%
52 | DeepSeek V3.2 | 98.8% | $0.0011 | 36.7s | 92%
53 | GPT-5.4 (Reasoning, Low) | 98.6% | $0.017 | 10.4s | 94%
54 | GPT-5.4 Nano (Reasoning) | 95.1% | $0.0023 | 12.2s | 87%
55 | GPT-4.1 | 95.4% | $0.0073 | 4.7s | 87%
56 | WizardLM 2 8x22b | 97.4% | $0.0026 | 31.3s | 92%
57 | Z.AI GLM 4.6 | 99.3% | $0.0057 | 38.8s | 95%
58 | o4 Mini | 99.2% | $0.014 | 21.5s | 95%
59 | Ministral 3 14B | 94.2% | $0.0009 | 6.0s | 82%
60 | GPT-5.4 | 95.4% | $0.011 | 6.4s | 88%
61 | Qwen 3 32B | 97.8% | $0.0010 | 37.9s | 90%
62 | GPT-5.4 Mini (Reasoning) | 99.4% | $0.017 | 25.6s | 96%
63 | Stealth: Hunter Alpha | 96.5% | $0.0000 | 36.7s | 90%
64 | Mistral Small 4 | 92.9% | $0.0008 | 3.2s | 79%
65 | Mistral Large 2 | 95.0% | $0.011 | 8.3s | 86%
66 | Mistral Small 4 (Reasoning) | 93.8% | $0.0019 | 15.6s | 82%
67 | GPT-4o, May 13th (temp=1) | 97.3% | $0.025 | 4.0s | 92%
68 | Claude Opus 4.5 | 100.0% | $0.037 | 7.7s | 100%
69 | Z.AI GLM 5 | 98.8% | $0.0091 | 51.4s | 94%
70 | o4 Mini High | 100.0% | $0.025 | 40.5s | 100%
71 | Qwen 3.5 Flash | 96.4% | $0.0031 | 49.5s | 89%
72 | GPT-5.2 | 95.9% | $0.018 | 16.8s | 87%
73 | Gemma 3 4B | 88.1% | $0.0002 | 6.7s | 76%
74 | Claude Opus 4.6 | 99.1% | $0.037 | 8.1s | 96%
75 | Nemotron 3 Super | 96.5% | $0.0000 | 1.0m | 90%
76 | Grok 4 | 100.0% | $0.031 | 37.2s | 100%
77 | Claude 3.5 Sonnet | 100.0% | $0.043 | 13.9s | 100%
78 | Inception Mercury | 88.6% | $0.0006 | 3.9s | 73%
79 | GPT-5 Mini | 95.9% | $0.0070 | 45.5s | 88%
80 | Nemotron 3 Nano | 99.8% | $0.0013 | 1.6m | 98%
81 | Qwen 3.5 35B | 97.0% | $0.015 | 47.3s | 91%
82 | Aion 2.0 | 98.8% | $0.0080 | 1.2m | 94%
83 | ByteDance Seed 1.6 | 97.9% | $0.0073 | 1.3m | 93%
84 | Gemini 2.5 Pro | 97.4% | $0.034 | 22.8s | 92%
85 | ByteDance Seed 2.0 Lite | 98.8% | $0.0078 | 1.4m | 93%
86 | Gemini 2.5 Flash (Reasoning) | 90.8% | $0.0082 | 11.8s | 71%
87 | Z.AI GLM 4.7 Flash | 95.5% | $0.0019 | 1.2m | 87%
88 | GPT-4.1 Nano | 86.8% | $0.0003 | 2.8s | 63%
89 | GPT-5.4 Nano (Reasoning, Low) | 83.4% | $0.0011 | 3.7s | 68%
90 | GPT-5 Nano | 96.4% | $0.0043 | 1.3m | 86%
91 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.055 | 21.4s | 100%
92 | Cohere Command R+ (Aug. 2024) | 91.3% | $0.014 | 17.9s | 71%
93 | Z.AI GLM 4.7 | 99.1% | $0.0098 | 1.7m | 95%
94 | Qwen 3.5 9B | 95.1% | $0.0013 | 1.5m | 87%
95 | GPT-5.4 Nano | 82.2% | $0.0011 | 3.8s | 64%
96 | Llama 3.1 70B | 91.7% | $0.0016 | 16.0s | 56%
97 | Gemini 3 Pro (Preview) | 100.0% | $0.055 | 35.6s | 100%
98 | DeepSeek V3.1 | 92.6% | $0.0012 | 26.1s | 57%
99 | GPT-5.1 | 96.4% | $0.035 | 43.6s | 90%
100 | GPT-5.4 (Reasoning) | 99.7% | $0.044 | 53.6s | 97%
101 | Claude Sonnet 4.6 (Reasoning) | 98.8% | $0.052 | 36.0s | 94%
102 | ByteDance Seed 1.6 Flash | 86.3% | $0.0011 | 39.3s | 66%
103 | Qwen 3.5 397B A17B | 97.6% | $0.012 | 1.8m | 94%
104 | Mistral Large | 89.6% | $0.011 | 7.6s | 53%
105 | Llama 3.1 8B | 84.5% | $0.0001 | 10.0s | 51%
106 | Arcee AI: Trinity Large (Preview) | 89.3% | $0.0000 | 20.5s | 40%
107 | GPT-5 | 98.8% | $0.049 | 1.3m | 96%
108 | Gemma 3 12B | 79.7% | $0.0003 | 14.1s | 47%
109 | MoonshotAI: Kimi K2.5 | 98.3% | $0.013 | 2.4m | 94%
110 | Gemini 3.1 Pro (Preview) | 98.9% | $0.065 | 1.0m | 95%
111 | Claude Opus 4 | 100.0% | $0.110 | 13.5s | 100%
112 | ByteDance Seed 2.0 Mini | 97.8% | $0.0034 | 3.4m | 92%
113 | Qwen 3.5 27B | 88.9% | $0.021 | 1.6m | 41%
114 | Rocinante 12B | 60.4% | $0.0013 | 25.6s | 9%
115 | Qwen 3.5 122B | 97.0% | $0.079 | 3.6m | 93%
116 | Mistral NeMO | 28.2% | $0.0006 | 1.4s | 0%

Average score: 95.73%

Individual Scenarios

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 96 | 99.3%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 96 | 99.3%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 96 | 99.3%
Grok 4 Fast | 100 | 100 | 100 | 100 | 96 | 99.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 96 | 99.3%
Mistral Small Creative | 100 | 100 | 100 | 100 | 96 | 99.3%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
Ministral 3 8B | 100 | 100 | 100 | 100 | 96 | 99.2%
Qwen 3 32B | 100 | 100 | 100 | 100 | 95 | 99.1%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 95 | 99.1%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 95 | 99.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 95 | 99.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 93 | 98.6%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 96 | 96 | 98.5%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 96 | 96 | 98.5%
Qwen 3.5 122B | 100 | 100 | 100 | 96 | 96 | 98.5%
Gemini 2.5 Pro | 100 | 100 | 100 | 96 | 96 | 98.5%
Qwen 3.5 Flash | 100 | 100 | 100 | 96 | 96 | 98.5%
Z.AI GLM 4.7 | 100 | 100 | 100 | 96 | 96 | 98.5%
Ministral 3 3B | 100 | 100 | 100 | 96 | 96 | 98.5%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 96 | 96 | 98.5%
MiniMax M2.5 | 100 | 100 | 100 | 96 | 96 | 98.4%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 91 | 98.3%
GPT-5.2 | 100 | 100 | 100 | 100 | 90 | 97.9%
Aion 2.0 | 100 | 100 | 100 | 100 | 90 | 97.9%
LFM2 24B | 100 | 100 | 100 | 95 | 95 | 97.9%
Qwen 3.5 9B | 100 | 100 | 100 | 96 | 93 | 97.8%
Ministral 3B | 100 | 100 | 100 | 96 | 93 | 97.8%
ByteDance Seed 2.0 Mini | 100 | 100 | 96 | 96 | 96 | 97.8%
Gemma 3 27B | 100 | 100 | 96 | 96 | 96 | 97.8%
Stealth: Hunter Alpha | 100 | 100 | 96 | 96 | 96 | 97.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 88 | 97.7%
Mistral Small 4 | 100 | 100 | 100 | 96 | 92 | 97.7%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 96 | 90 | 97.2%
Llama 3.1 70B | 100 | 100 | 96 | 95 | 94 | 97.1%
GPT-5.1 | 100 | 100 | 96 | 96 | 93 | 97.1%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 96 | 89 | 97.1%
Llama 3.1 8B | 100 | 100 | 95 | 95 | 95 | 97.0%
Stealth: Healer Alpha | 100 | 100 | 96 | 96 | 93 | 97.0%
Llama 3.1 Nemotron 70B | 100 | 96 | 96 | 96 | 96 | 96.9%
Gemma 3 12B | 100 | 100 | 96 | 96 | 92 | 96.9%
Nemotron 3 Super | 100 | 96 | 96 | 96 | 96 | 96.8%
GPT-5 | 100 | 96 | 96 | 96 | 93 | 96.3%
Gemini 2.5 Flash (Reasoning) | 100 | 96 | 96 | 96 | 93 | 96.3%
GPT-5.4 | 100 | 96 | 96 | 96 | 93 | 96.3%
Qwen 3.5 35B | 100 | 100 | 93 | 93 | 93 | 95.7%
Ministral 3 14B | 100 | 100 | 96 | 93 | 90 | 95.7%
Gemma 3 4B | 100 | 95 | 95 | 95 | 91 | 95.5%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 96 | 92 | 90 | 95.5%
GPT-5.4 Nano (Reasoning) | 96 | 96 | 96 | 93 | 93 | 94.8%
GPT-4.1 | 96 | 96 | 93 | 93 | 90 | 93.6%
GPT-5 Nano | 100 | 100 | 100 | 96 | 69 | 93.0%
GPT-4.1 Nano | 100 | 95 | 90 | 90 | 83 | 91.7%
Mistral Small 4 (Reasoning) | 100 | 93 | 90 | 90 | 84 | 91.2%
Inception Mercury | 100 | 95 | 84 | 81 | 81 | 88.5%
GPT-5.4 Nano (Reasoning, Low) | 92 | 90 | 90 | 89 | 76 | 87.4%
ByteDance Seed 1.6 Flash | 93 | 93 | 84 | 83 | 81 | 86.7%
GPT-5.4 Nano | 93 | 89 | 87 | 84 | 73 | 84.9%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Rocinante 12B | 100 | 100 | 92 | 0 | 0 | 58.3%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-5.2 | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-4.1 | 100 | 100 | 100 | 100 | 93 | 98.7%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-5.4 | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 93 | 98.7%
Mistral Large | 100 | 100 | 100 | 100 | 93 | 98.7%
Mistral Small Creative | 100 | 100 | 100 | 100 | 93 | 98.7%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 93 | 98.6%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 93 | 98.6%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 93 | 98.6%
Mistral Small 4 | 100 | 100 | 100 | 100 | 93 | 98.6%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 92 | 98.5%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 92 | 98.5%
Ministral 3 3B | 100 | 100 | 100 | 100 | 90 | 98.0%
GPT-5 Mini | 100 | 100 | 100 | 93 | 93 | 97.3%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 93 | 93 | 97.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 93 | 93 | 97.2%
GPT-5.4 Nano | 100 | 100 | 100 | 93 | 93 | 97.2%
Ministral 3 14B | 100 | 100 | 100 | 93 | 93 | 97.2%
Ministral 8B | 100 | 100 | 100 | 93 | 93 | 97.2%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 82 | 96.5%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 81 | 96.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 91 | 90 | 96.2%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 93 | 88 | 96.2%
Inception Mercury | 100 | 100 | 90 | 87 | 85 | 92.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 93 | 93 | 92 | 81 | 92.0%
Cohere Command R+ (Aug. 2024) | 93 | 92 | 92 | 91 | 85 | 90.5%
Gemini 2.5 Flash (Reasoning) | 93 | 93 | 93 | 88 | 82 | 90.0%
Llama 3.1 8B | 100 | 91 | 86 | 79 | 76 | 86.3%
Gemma 3 4B | 91 | 91 | 85 | 83 | 79 | 85.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 3 12B | 93 | 93 | 93 | 93 | 0 | 74.5%
Rocinante 12B | 100 | 85 | 83 | 78 | 0 | 69.1%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5 | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 94 | 98.8%
Aion 2.0 | 100 | 100 | 100 | 100 | 94 | 98.8%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 94 | 98.8%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 94 | 98.8%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 93 | 98.7%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 93 | 98.7%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 93 | 98.7%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 93 | 98.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 93 | 98.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 93 | 98.7%
Ministral 3 8B | 100 | 100 | 100 | 100 | 93 | 98.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 93 | 98.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 93 | 98.6%
Hermes 3 405B | 100 | 100 | 100 | 100 | 93 | 98.6%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 91 | 98.2%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 88 | 97.6%
Qwen 3.5 122B | 100 | 100 | 100 | 94 | 94 | 97.5%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 94 | 94 | 97.5%
GPT-5.2 | 100 | 100 | 100 | 94 | 94 | 97.5%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 88 | 97.5%
Mistral Small Creative | 100 | 100 | 100 | 94 | 94 | 97.5%
Ministral 3 14B | 100 | 100 | 100 | 94 | 94 | 97.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 94 | 93 | 97.4%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 93 | 93 | 97.3%
Grok 4.20 (Beta) | 100 | 100 | 100 | 93 | 93 | 97.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 94 | 93 | 97.3%
Writer: Palmyra X5 | 100 | 100 | 100 | 93 | 93 | 97.2%
Qwen 2.5 72B | 100 | 100 | 100 | 93 | 92 | 97.0%
Ministral 8B | 100 | 100 | 100 | 93 | 92 | 97.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 93 | 91 | 96.8%
ByteDance Seed 1.6 | 100 | 100 | 100 | 94 | 88 | 96.4%
GPT-5.4 | 100 | 100 | 100 | 94 | 88 | 96.4%
Claude Opus 4.6 | 100 | 100 | 94 | 94 | 94 | 96.3%
Qwen 3.5 27B | 100 | 100 | 94 | 94 | 94 | 96.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 94 | 88 | 96.3%
Ministral 3B | 100 | 100 | 100 | 100 | 81 | 96.3%
GPT-4.1 | 100 | 100 | 100 | 93 | 88 | 96.2%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 93 | 88 | 96.2%
Qwen 3 32B | 100 | 100 | 100 | 100 | 79 | 95.8%
WizardLM 2 8x22b | 100 | 100 | 93 | 93 | 88 | 94.9%
Stealth: Hunter Alpha | 100 | 94 | 93 | 93 | 93 | 94.8%
Gemini 2.5 Flash | 100 | 100 | 94 | 88 | 88 | 94.0%
Qwen 3.5 35B | 100 | 100 | 94 | 94 | 82 | 94.0%
Llama 3.1 Nemotron 70B | 100 | 93 | 92 | 92 | 92 | 94.0%
GPT-5 Mini | 100 | 94 | 93 | 93 | 88 | 93.7%
GPT-5.1 | 94 | 94 | 94 | 94 | 93 | 93.7%
Qwen 3.5 397B A17B | 94 | 94 | 93 | 93 | 93 | 93.5%
Inception Mercury 2 | 100 | 93 | 92 | 92 | 88 | 93.0%
Hermes 3 70B | 100 | 92 | 92 | 91 | 90 | 93.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 88 | 88 | 88 | 92.8%
GPT-5 Nano | 94 | 94 | 94 | 93 | 88 | 92.6%
Z.AI GLM 4.7 Flash | 94 | 93 | 93 | 93 | 88 | 92.2%
GPT-5.4 Mini | 100 | 94 | 94 | 88 | 83 | 91.8%
GPT-5.4 Mini (Reasoning, Low) | 100 | 94 | 88 | 88 | 88 | 91.7%
Qwen 3.5 Flash | 94 | 94 | 93 | 88 | 83 | 90.5%
GPT-5.4 Nano (Reasoning) | 94 | 93 | 88 | 88 | 88 | 90.1%
Mistral Small 4 | 100 | 93 | 93 | 87 | 74 | 89.3%
Qwen 3.5 9B | 94 | 94 | 93 | 83 | 82 | 89.3%
Ministral 3 3B | 93 | 93 | 87 | 87 | 87 | 89.2%
Gemma 3 12B | 93 | 88 | 88 | 88 | 88 | 88.8%
Mistral Large 3 | 88 | 88 | 88 | 88 | 88 | 88.2%
Mistral Large 2 | 88 | 88 | 88 | 88 | 88 | 88.2%
Mistral Large | 88 | 88 | 88 | 88 | 88 | 88.2%
Inception Mercury | 100 | 93 | 93 | 87 | 67 | 87.9%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 86 | 47 | 86.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 79 | 47 | 85.2%
Gemma 3 4B | 86 | 80 | 80 | 79 | 73 | 79.5%
Llama 3.1 70B | 100 | 93 | 92 | 91 | 0 | 75.1%
Rocinante 12B | 100 | 100 | 87 | 85 | 0 | 74.3%
ByteDance Seed 1.6 Flash | 100 | 74 | 68 | 68 | 52 | 72.5%
GPT-5.4 Nano (Reasoning, Low) | 83 | 72 | 71 | 67 | 65 | 71.8%
Llama 3.1 8B | 100 | 100 | 80 | 69 | 0 | 69.8%
GPT-5.4 Nano | 81 | 72 | 71 | 61 | 58 | 68.7%
GPT-4.1 Nano | 83 | 72 | 70 | 67 | 41 | 66.6%
Mistral NeMO | 100 | 90 | 83 | 0 | 0 | 54.7%

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 92 | 98.3%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 92 | 98.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 92 | 98.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 92 | 98.3%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 92 | 98.3%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 92 | 98.3%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 92 | 98.3%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 92 | 98.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 92 | 98.3%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 92 | 98.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 92 | 98.3%
Ministral 8B | 100 | 100 | 100 | 100 | 92 | 98.3%
Ministral 3B | 100 | 100 | 100 | 100 | 92 | 98.3%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 90 | 98.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 85 | 96.9%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 83 | 96.7%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 92 | 92 | 96.7%
ByteDance Seed 1.6 | 100 | 100 | 100 | 92 | 92 | 96.7%
o4 Mini | 100 | 100 | 100 | 92 | 92 | 96.7%
Grok 4.20 (Beta) | 100 | 100 | 100 | 92 | 92 | 96.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 92 | 92 | 96.7%
Mistral Medium 3.1 | 100 | 100 | 100 | 92 | 92 | 96.7%
Qwen 3 32B | 100 | 100 | 100 | 92 | 91 | 96.5%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 92 | 85 | 95.3%
Qwen 2.5 72B | 100 | 100 | 100 | 92 | 85 | 95.3%
Mistral Small Creative | 100 | 100 | 100 | 92 | 85 | 95.3%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 92 | 92 | 92 | 95.0%
GPT-5.1 | 100 | 100 | 92 | 92 | 92 | 95.0%
Z.AI GLM 5 | 100 | 100 | 92 | 92 | 92 | 95.0%
Gemini 2.5 Pro | 100 | 100 | 92 | 92 | 92 | 95.0%
Grok 4 Fast | 100 | 100 | 92 | 92 | 92 | 95.0%
DeepSeek-V2 Chat | 100 | 100 | 92 | 92 | 92 | 95.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 92 | 92 | 92 | 95.0%
MiniMax M2.5 | 100 | 100 | 92 | 92 | 91 | 94.8%
WizardLM 2 8x22b | 100 | 100 | 92 | 92 | 91 | 94.8%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 91 | 91 | 91 | 94.5%
Hermes 3 70B | 100 | 100 | 91 | 91 | 91 | 94.5%
Llama 3.1 70B | 100 | 100 | 92 | 91 | 90 | 94.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 92 | 90 | 90 | 94.3%
ByteDance Seed 2.0 Mini | 100 | 100 | 92 | 92 | 85 | 93.6%
Qwen 3.5 9B | 100 | 100 | 92 | 91 | 85 | 93.4%
GPT-5 Mini | 100 | 92 | 92 | 92 | 92 | 93.3%
Qwen 3.5 122B | 100 | 92 | 92 | 92 | 92 | 93.3%
GPT-4.1 | 100 | 92 | 92 | 92 | 92 | 93.3%
Stealth: Hunter Alpha | 100 | 92 | 92 | 92 | 92 | 93.3%
Mistral Small 3.2 24B | 100 | 100 | 92 | 90 | 85 | 93.3%
GPT-4.1 Nano | 100 | 100 | 100 | 89 | 75 | 92.8%
Cohere Command R+ (Aug. 2024) | 100 | 92 | 91 | 91 | 90 | 92.7%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 91 | 85 | 85 | 92.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 92 | 85 | 83 | 91.9%
Mistral Large 3 | 92 | 92 | 92 | 92 | 92 | 91.7%
GPT-4o, May 13th (temp=1) | 92 | 92 | 92 | 92 | 92 | 91.7%
Mistral Large 2 | 92 | 92 | 92 | 92 | 92 | 91.7%
DeepSeek V3.1 | 92 | 92 | 92 | 92 | 92 | 91.7%
Mistral Large | 92 | 92 | 92 | 92 | 92 | 91.7%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 85 | 73 | 91.6%
Gemma 3 4B | 100 | 90 | 90 | 89 | 89 | 91.6%
GPT-4o Mini (temp=1) | 100 | 89 | 89 | 89 | 89 | 91.1%
Claude 3 Haiku | 100 | 100 | 92 | 85 | 79 | 91.0%
Writer: Palmyra X5 | 91 | 91 | 91 | 91 | 91 | 90.9%
Stealth: Healer Alpha | 100 | 92 | 92 | 85 | 85 | 90.5%
Inception Mercury 2 | 100 | 92 | 92 | 85 | 85 | 90.5%
Nemotron 3 Super | 92 | 92 | 92 | 92 | 85 | 90.3%
GPT-5.4 | 92 | 92 | 92 | 92 | 85 | 90.3%
Claude Haiku 4.5 | 92 | 92 | 92 | 92 | 83 | 90.0%
GPT-5.2 | 100 | 100 | 85 | 85 | 79 | 89.6%
GPT-4o Mini (temp=0) | 89 | 89 | 89 | 89 | 89 | 88.9%
ByteDance Seed 1.6 Flash | 100 | 100 | 91 | 79 | 69 | 87.6%
Arcee AI: Trinity Mini | 100 | 89 | 83 | 82 | 80 | 86.8%
Gemma 3 27B | 92 | 92 | 85 | 85 | 79 | 86.2%
Ministral 3 14B | 92 | 92 | 85 | 85 | 79 | 86.2%
Mistral Small 4 | 92 | 89 | 85 | 85 | 80 | 86.0%
Inception Mercury | 100 | 92 | 90 | 75 | 71 | 85.6%
Llama 3.1 8B | 90 | 90 | 82 | 82 | 80 | 84.7%
GPT-5.4 Nano (Reasoning, Low) | 92 | 85 | 85 | 79 | 73 | 82.6%
GPT-5.4 Nano | 85 | 85 | 73 | 73 | 73 | 77.8%
Qwen 3.5 27B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemma 3 12B | 67 | 60 | 56 | 56 | 56 | 58.8%
Mistral NeMO | 100 | 100 | 90 | 0 | 0 | 58.0%
Rocinante 12B | 100 | 100 | 0 | 0 | 0 | 40.0%
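
The figures in these tables are consistent with a two-level average: each table's Avg looks like the mean of its five numbered columns, and a model's overall Score looks like the mean of its four per-table averages. The page does not state this, so treat it as an inference from the numbers. Below is a minimal Python sketch that checks it against two rows; the function names are hypothetical. Because the page shows rounded values, an average recomputed from the displayed integers can differ from the listed Avg by a few tenths.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def scenario_average(column_scores):
    """Assumed rule: Avg = mean of the five numbered columns, to one decimal."""
    return round(mean(column_scores), 1)


def overall_score(scenario_averages):
    """Assumed rule: overall Score = mean of the four per-table Avg values."""
    return round(mean(scenario_averages), 1)


# Rocinante 12B, last table: columns 100, 100, 0, 0, 0 (listed Avg: 40.0%)
print(scenario_average([100, 100, 0, 0, 0]))

# Rocinante 12B, Avg in each of the four tables (listed overall Score: 60.4%)
print(overall_score([58.3, 69.1, 74.3, 40.0]))

# Mistral NeMO, Avg in each of the four tables (listed overall Score: 28.2%)
print(overall_score([0.0, 0.0, 54.7, 58.0]))
```

Both overall scores recomputed this way match the values listed in the Overall Performance table.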