Precision

Test: Codex Extraction

Avg. Score: 96.2%
Scenarios: 4

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 3.1 Flash Lite (Preview) | 100.0% | $0.0017 | 2.0s | 100%
2 | Gemini 3.1 Flash Lite (Reasoning) | 100.0% | $0.0018 | 2.0s | 100%
3 | Gemini 3.1 Flash Lite | 100.0% | $0.0017 | 5.1s | 100%
4 | Ministral 3 8B | 99.5% | $0.0006 | 3.3s | 97%
5 | DeepSeek V4 Flash | 99.7% | $0.0003 | 7.7s | 97%
6 | Gemini 3 Flash (Preview) | 99.7% | $0.0027 | 3.9s | 97%
7 | Qwen 3.5 Plus (2026-02-15) | 100.0% | $0.0030 | 10.6s | 100%
8 | GPT-4.1 Mini | 99.5% | $0.0015 | 6.6s | 97%
9 | LFM2 24B | 99.5% | $0.0002 | 12.5s | 97%
10 | GPT-4o, Aug. 6th (temp=1) | 100.0% | $0.0092 | 3.8s | 100%
11 | Gemini 2.5 Flash Lite | 98.0% | $0.0005 | 1.9s | 94%
12 | Gemini 3.5 Flash (Reasoning, Minimal) | 100.0% | $0.010 | 3.1s | 100%
13 | GPT-4o, Aug. 6th (temp=0) | 100.0% | $0.0100 | 4.0s | 100%
14 | Ministral 8B | 98.2% | $0.0004 | 3.8s | 94%
15 | Mistral Medium 3.1 | 99.2% | $0.0026 | 5.8s | 95%
16 | Grok 4.1 Fast | 100.0% | $0.0017 | 22.1s | 100%
17 | Gemini 2.5 Flash | 98.5% | $0.0023 | 2.5s | 93%
18 | Ministral 3B | 98.1% | $0.0002 | 1.8s | 91%
19 | Gemma 4 31B | 100.0% | $0.0008 | 25.9s | 100%
20 | Xiaomi MIMO v2.5 Pro | 100.0% | $0.0048 | 18.7s | 100%
21 | Mistral Small 3.2 24B | 98.3% | $0.0005 | 4.4s | 92%
22 | Grok 4.3 | 99.2% | $0.0056 | 4.3s | 95%
23 | Mistral Small Creative | 97.7% | $0.0006 | 3.9s | 92%
24 | Z.AI GLM 4.5 | 99.6% | $0.0028 | 16.8s | 97%
25 | Grok 4 Fast | 98.2% | $0.0012 | 8.7s | 94%
26 | Grok 4.20 (Beta) | 98.1% | $0.0049 | 2.0s | 94%
27 | Xiaomi MIMO v2.5 | 99.3% | $0.0034 | 13.4s | 95%
28 | DeepSeek V3 (2024-12-26) | 98.8% | $0.0017 | 13.5s | 94%
29 | DeepSeek V4 Pro | 99.0% | $0.0021 | 16.0s | 95%
30 | Hermes 3 405B | 99.6% | $0.0040 | 17.8s | 97%
31 | Ministral 3 3B | 96.4% | $0.0004 | 1.8s | 90%
32 | Z.AI GLM 5 Turbo | 99.6% | $0.0068 | 16.0s | 98%
33 | Qwen 2.5 72B | 98.1% | $0.0008 | 11.0s | 92%
34 | GPT-5.4 Mini | 97.3% | $0.0031 | 2.2s | 91%
35 | DeepSeek-V2 Chat | 98.4% | $0.0019 | 14.8s | 94%
36 | GPT-4o Mini (temp=1) | 97.1% | $0.0006 | 8.0s | 91%
37 | GPT-4o Mini (temp=0) | 97.2% | $0.0005 | 7.9s | 90%
38 | DeepSeek V3 (2025-03-24) | 99.6% | $0.0014 | 27.0s | 96%
39 | MiniMax M2.7 | 99.4% | $0.0022 | 21.7s | 95%
40 | Grok 4.20 | 97.5% | $0.0048 | 4.9s | 92%
41 | Claude 3 Haiku | 97.6% | $0.0017 | 5.5s | 89%
42 | Writer: Palmyra X5 | 97.0% | $0.0052 | 7.5s | 92%
43 | Inception Mercury 2 | 95.4% | $0.0022 | 3.5s | 89%
44 | Qwen3 235B A22B Instruct 2507 | 97.7% | $0.0007 | 19.3s | 92%
45 | Hermes 3 70B | 96.9% | $0.0012 | 15.8s | 91%
46 | Claude Haiku 4.5 | 97.5% | $0.0073 | 4.5s | 91%
47 | GPT-5.4 Mini (Reasoning, Low) | 95.9% | $0.0042 | 3.7s | 88%
48 | Llama 3.1 Nemotron 70B | 97.3% | $0.0050 | 17.3s | 93%
49 | Claude 3.7 Sonnet | 100.0% | $0.021 | 7.8s | 100%
50 | Claude Sonnet 4.5 | 100.0% | $0.022 | 6.6s | 100%
51 | DeepSeek V4 Flash (Reasoning) | 99.2% | $0.0009 | 37.0s | 96%
52 | Claude Sonnet 4 | 100.0% | $0.022 | 7.9s | 100%
53 | Arcee AI: Trinity Mini | 94.7% | $0.0003 | 5.8s | 87%
54 | Gemini 2.5 Flash Lite (Reasoning) | 96.9% | $0.0021 | 15.6s | 89%
55 | Grok 4.20 (Beta, Reasoning) | 100.0% | $0.020 | 12.7s | 100%
56 | MiniMax M2.5 | 98.3% | $0.0023 | 30.1s | 94%
57 | GPT-4o, May 13th (temp=0) | 100.0% | $0.025 | 4.3s | 100%
58 | Gemma 3 27B | 96.0% | $0.0005 | 14.5s | 87%
59 | Gemini 3 Flash (Preview, Reasoning) | 98.9% | $0.0096 | 22.2s | 95%
60 | Stealth: Healer Alpha | 96.2% | $0.0000 | 24.6s | 90%
61 | Claude Sonnet 4.6 | 99.7% | $0.022 | 7.4s | 97%
62 | Mistral Large 3 | 95.0% | $0.0027 | 8.2s | 86%
63 | DeepSeek V3.2 | 98.8% | $0.0011 | 36.7s | 92%
64 | GPT-5.4 (Reasoning, Low) | 98.6% | $0.017 | 10.4s | 94%
65 | GPT-5.4 Nano (Reasoning) | 95.1% | $0.0023 | 12.2s | 87%
66 | GPT-4.1 | 95.4% | $0.0073 | 4.7s | 87%
67 | WizardLM 2 8x22b | 97.4% | $0.0026 | 31.3s | 92%
68 | Grok 4.20 (Reasoning) | 100.0% | $0.012 | 37.2s | 100%
69 | Z.AI GLM 4.6 | 99.3% | $0.0057 | 38.8s | 95%
70 | o4 Mini | 99.2% | $0.014 | 21.5s | 95%
71 | Ministral 3 14B | 94.2% | $0.0009 | 6.0s | 82%
72 | GPT-5.4 | 95.4% | $0.011 | 6.4s | 88%
73 | Qwen 3 32B | 97.8% | $0.0010 | 37.9s | 90%
74 | GPT-5.4 Mini (Reasoning) | 99.4% | $0.017 | 25.6s | 96%
75 | Qwen 3.6 Flash | 97.6% | $0.0096 | 30.2s | 94%
76 | Stealth: Hunter Alpha | 96.5% | $0.0000 | 36.7s | 90%
77 | Mistral Small 4 | 92.9% | $0.0008 | 3.2s | 79%
78 | GPT-OSS 120B | 99.0% | $0.0011 | 57.1s | 95%
79 | Mistral Large 2 | 95.0% | $0.011 | 8.3s | 86%
80 | Mistral Small 4 (Reasoning) | 93.8% | $0.0019 | 15.6s | 82%
81 | Qwen 3.6 35B | 97.6% | $0.0072 | 37.6s | 92%
82 | GPT-5.5 | 98.6% | $0.027 | 6.0s | 93%
83 | GPT-4o, May 13th (temp=1) | 97.3% | $0.025 | 4.0s | 92%
84 | Claude Opus 4.5 | 100.0% | $0.037 | 7.7s | 100%
85 | Z.AI GLM 5 | 98.8% | $0.0091 | 51.4s | 94%
86 | o4 Mini High | 100.0% | $0.025 | 40.5s | 100%
87 | Qwen 3.5 Flash | 96.4% | $0.0031 | 49.5s | 89%
88 | GPT-5.2 | 95.9% | $0.018 | 16.8s | 87%
89 | Gemma 3 4B | 88.1% | $0.0002 | 6.7s | 76%
90 | Claude Opus 4.6 | 99.1% | $0.037 | 8.1s | 96%
91 | Gemini 3.5 Flash (Reasoning) | 100.0% | $0.040 | 16.4s | 100%
92 | Z.AI GLM 5.1 | 100.0% | $0.015 | 1.1m | 100%
93 | Nemotron 3 Super | 96.5% | $0.0000 | 1.0m | 90%
94 | Grok 4 | 100.0% | $0.031 | 37.2s | 100%
95 | Claude 3.5 Sonnet | 100.0% | $0.043 | 13.9s | 100%
96 | Inception Mercury | 88.6% | $0.0006 | 3.9s | 73%
97 | GPT-5 Mini | 95.9% | $0.0070 | 45.5s | 88%
98 | Z.AI GLM 4.5 Air | 93.9% | $0.0023 | 40.1s | 81%
99 | Nemotron 3 Nano | 99.8% | $0.0013 | 1.6m | 98%
100 | Claude Opus 4.7 (Reasoning) | 99.7% | $0.046 | 5.9s | 97%
101 | Qwen 3.5 35B | 97.0% | $0.015 | 47.3s | 91%
102 | Aion 2.0 | 98.8% | $0.0080 | 1.2m | 94%
103 | GPT-5.5 (Reasoning, Low) | 98.2% | $0.038 | 13.3s | 94%
104 | Claude Opus 4.7 | 99.4% | $0.047 | 6.1s | 96%
105 | ByteDance Seed 1.6 | 97.9% | $0.0073 | 1.3m | 93%
106 | Gemini 2.5 Pro | 97.4% | $0.034 | 22.8s | 92%
107 | ByteDance Seed 2.0 Lite | 98.8% | $0.0078 | 1.4m | 93%
108 | Gemini 2.5 Flash (Reasoning) | 90.8% | $0.0082 | 11.8s | 71%
109 | Z.AI GLM 4.7 Flash | 95.5% | $0.0019 | 1.2m | 87%
110 | GPT-4.1 Nano | 86.8% | $0.0003 | 2.8s | 63%
111 | GPT-5.4 Nano (Reasoning, Low) | 83.4% | $0.0011 | 3.7s | 68%
112 | Grok 4.3 (Reasoning) | 99.6% | $0.018 | 1.3m | 96%
113 | Qwen 3.6 27B | 98.3% | $0.018 | 1.2m | 92%
114 | Qwen 3.5 Plus (2026-04-20) | 98.6% | $0.015 | 1.4m | 94%
115 | GPT-5 Nano | 96.4% | $0.0043 | 1.3m | 86%
116 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.055 | 21.4s | 100%
117 | Cohere Command R+ (Aug. 2024) | 91.3% | $0.014 | 17.9s | 71%
118 | Z.AI GLM 4.7 | 99.1% | $0.0098 | 1.7m | 95%
119 | Gemma 4 31B (Reasoning) | 100.0% | $0.0016 | 2.2m | 100%
120 | Qwen 3.5 9B | 95.1% | $0.0013 | 1.5m | 87%
121 | GPT-5.4 Nano | 82.2% | $0.0011 | 3.8s | 64%
122 | Llama 3.1 70B | 91.7% | $0.0016 | 16.0s | 56%
123 | Cydonia 24B V4.1 | 89.8% | $0.0010 | 12.2s | 55%
124 | Gemini 3 Pro (Preview) | 100.0% | $0.055 | 35.6s | 100%
125 | DeepSeek V3.1 | 92.6% | $0.0012 | 26.1s | 57%
126 | GPT-5.1 | 96.4% | $0.035 | 43.6s | 90%
127 | GPT-5.4 (Reasoning) | 99.7% | $0.044 | 53.6s | 97%
128 | Qwen3.7 Max | 99.8% | $0.041 | 1.1m | 98%
129 | Claude Sonnet 4.6 (Reasoning) | 98.8% | $0.052 | 36.0s | 94%
130 | ByteDance Seed 1.6 Flash | 86.3% | $0.0011 | 39.3s | 66%
131 | Qwen 3.5 397B A17B | 97.6% | $0.012 | 1.8m | 94%
132 | DeepSeek V4 Pro (Reasoning) | 100.0% | $0.0093 | 2.3m | 100%
133 | Mistral Large | 89.6% | $0.011 | 7.6s | 53%
134 | Llama 3.1 8B | 84.5% | $0.0001 | 10.0s | 51%
135 | GPT-5.5 (Reasoning) | 96.6% | $0.056 | 25.8s | 90%
136 | Arcee AI: Trinity Large (Preview) | 89.3% | $0.0000 | 20.5s | 40%
137 | GPT-5 | 98.8% | $0.049 | 1.3m | 96%
138 | Gemma 3 12B | 79.7% | $0.0003 | 14.1s | 47%
139 | MoonshotAI: Kimi K2.5 | 98.3% | $0.013 | 2.4m | 94%
140 | MoonshotAI: Kimi K2.6 | 99.7% | $0.026 | 2.4m | 97%
141 | Gemini 3.1 Pro (Preview) | 98.9% | $0.065 | 1.0m | 95%
142 | Gemma 4 26B | 84.7% | $0.0006 | 18.7s | 29%
143 | Claude Opus 4 | 100.0% | $0.110 | 13.5s | 100%
144 | ByteDance Seed 2.0 Mini | 97.8% | $0.0034 | 3.4m | 92%
145 | Qwen3.6 Max Preview | 100.0% | $0.045 | 2.4m | 100%
146 | Skyfall 36B V2 | 77.6% | $0.0018 | 12.7s | 28%
147 | Gemma 4 26B (Reasoning) | 94.7% | $0.0023 | 2.6m | 56%
148 | Qwen 3.5 27B | 88.9% | $0.021 | 1.6m | 41%
149 | Rocinante 12B | 60.4% | $0.0013 | 25.6s | 9%
150 | Qwen 3.5 122B | 97.0% | $0.079 | 3.6m | 93%
151 | Mistral NeMO | 28.2% | $0.0006 | 1.4s | 0%
Average score: 96.17%

Individual Scenarios

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-5 Mini | 100 | 100 | 100 | 100 | 96 | 99.3%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 96 | 99.3%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 96 | 99.3%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 96 | 99.3%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.3%
Grok 4 Fast | 100 | 100 | 100 | 100 | 96 | 99.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 96 | 99.3%
Mistral Small Creative | 100 | 100 | 100 | 100 | 96 | 99.3%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
Ministral 3 8B | 100 | 100 | 100 | 100 | 96 | 99.2%
Qwen 3 32B | 100 | 100 | 100 | 100 | 95 | 99.1%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 95 | 99.1%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 95 | 99.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 95 | 99.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 93 | 98.6%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 93 | 98.6%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 96 | 96 | 98.5%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 96 | 96 | 98.5%
Qwen 3.5 122B | 100 | 100 | 100 | 96 | 96 | 98.5%
Gemini 2.5 Pro | 100 | 100 | 100 | 96 | 96 | 98.5%
Qwen 3.5 Flash | 100 | 100 | 100 | 96 | 96 | 98.5%
Z.AI GLM 4.7 | 100 | 100 | 100 | 96 | 96 | 98.5%
Ministral 3 3B | 100 | 100 | 100 | 96 | 96 | 98.5%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 96 | 96 | 98.5%
MiniMax M2.5 | 100 | 100 | 100 | 96 | 96 | 98.4%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 91 | 98.3%
GPT-5.2 | 100 | 100 | 100 | 100 | 90 | 97.9%
Aion 2.0 | 100 | 100 | 100 | 100 | 90 | 97.9%
LFM2 24B | 100 | 100 | 100 | 95 | 95 | 97.9%
Qwen 3.5 9B | 100 | 100 | 100 | 96 | 93 | 97.8%
Ministral 3B | 100 | 100 | 100 | 96 | 93 | 97.8%
ByteDance Seed 2.0 Mini | 100 | 100 | 96 | 96 | 96 | 97.8%
Gemma 3 27B | 100 | 100 | 96 | 96 | 96 | 97.8%
Stealth: Hunter Alpha | 100 | 100 | 96 | 96 | 96 | 97.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 88 | 97.7%
Mistral Small 4 | 100 | 100 | 100 | 96 | 92 | 97.7%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 96 | 90 | 97.2%
Llama 3.1 70B | 100 | 100 | 96 | 95 | 94 | 97.1%
GPT-5.1 | 100 | 100 | 96 | 96 | 93 | 97.1%
Qwen 3.6 Flash | 100 | 100 | 96 | 96 | 93 | 97.1%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 96 | 89 | 97.1%
Llama 3.1 8B | 100 | 100 | 95 | 95 | 95 | 97.0%
Stealth: Healer Alpha | 100 | 100 | 96 | 96 | 93 | 97.0%
Llama 3.1 Nemotron 70B | 100 | 96 | 96 | 96 | 96 | 96.9%
Gemma 3 12B | 100 | 100 | 96 | 96 | 92 | 96.9%
Nemotron 3 Super | 100 | 96 | 96 | 96 | 96 | 96.8%
Skyfall 36B V2 | 100 | 100 | 100 | 92 | 90 | 96.5%
GPT-5 | 100 | 96 | 96 | 96 | 93 | 96.3%
Gemini 2.5 Flash (Reasoning) | 100 | 96 | 96 | 96 | 93 | 96.3%
GPT-5.4 | 100 | 96 | 96 | 96 | 93 | 96.3%
Qwen 3.6 27B | 100 | 100 | 100 | 96 | 85 | 96.3%
Qwen 3.5 35B | 100 | 100 | 93 | 93 | 93 | 95.7%
Ministral 3 14B | 100 | 100 | 96 | 93 | 90 | 95.7%
Cydonia 24B V4.1 | 100 | 96 | 96 | 96 | 90 | 95.5%
Gemma 3 4B | 100 | 95 | 95 | 95 | 91 | 95.5%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 96 | 92 | 90 | 95.5%
GPT-5.4 Nano (Reasoning) | 96 | 96 | 96 | 93 | 93 | 94.8%
Qwen 3.6 35B | 100 | 100 | 93 | 90 | 89 | 94.3%
GPT-4.1 | 96 | 96 | 93 | 93 | 90 | 93.6%
GPT-5 Nano | 100 | 100 | 100 | 96 | 69 | 93.0%
GPT-4.1 Nano | 100 | 95 | 90 | 90 | 83 | 91.7%
Mistral Small 4 (Reasoning) | 100 | 93 | 90 | 90 | 84 | 91.2%
Inception Mercury | 100 | 95 | 84 | 81 | 81 | 88.5%
GPT-5.4 Nano (Reasoning, Low) | 92 | 90 | 90 | 89 | 76 | 87.4%
ByteDance Seed 1.6 Flash | 93 | 93 | 84 | 83 | 81 | 86.7%
GPT-5.4 Nano | 93 | 89 | 87 | 84 | 73 | 84.9%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 4 26B | 100 | 100 | 100 | 0 | 0 | 60.0%
Rocinante 12B | 100 | 100 | 92 | 0 | 0 | 58.3%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cydonia 24B V4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 93 | 98.7%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 93 | 98.7%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-5.2 | 100 | 100 | 100 | 100 | 93 | 98.7%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 93 | 98.7%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-4.1 | 100 | 100 | 100 | 100 | 93 | 98.7%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 93 | 98.7%
Gemma 4 26B | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-5.4 | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 93 | 98.7%
Mistral Large | 100 | 100 | 100 | 100 | 93 | 98.7%
Mistral Small Creative | 100 | 100 | 100 | 100 | 93 | 98.7%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 93 | 98.6%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 93 | 98.6%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 93 | 98.6%
Mistral Small 4 | 100 | 100 | 100 | 100 | 93 | 98.6%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 92 | 98.5%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 92 | 98.5%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 92 | 98.5%
Ministral 3 3B | 100 | 100 | 100 | 100 | 90 | 98.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 88 | 97.5%
GPT-5 Mini | 100 | 100 | 100 | 93 | 93 | 97.3%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 93 | 93 | 97.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 93 | 93 | 97.2%
GPT-5.4 Nano | 100 | 100 | 100 | 93 | 93 | 97.2%
Ministral 3 14B | 100 | 100 | 100 | 93 | 93 | 97.2%
Ministral 8B | 100 | 100 | 100 | 93 | 93 | 97.2%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 82 | 96.5%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 81 | 96.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 91 | 90 | 96.2%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 93 | 88 | 96.2%
GPT-5.5 (Reasoning) | 100 | 93 | 93 | 93 | 93 | 94.7%
Skyfall 36B V2 | 100 | 100 | 100 | 83 | 81 | 92.9%
Inception Mercury | 100 | 100 | 90 | 87 | 85 | 92.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 93 | 93 | 92 | 81 | 92.0%
Cohere Command R+ (Aug. 2024) | 93 | 92 | 92 | 91 | 85 | 90.5%
Gemini 2.5 Flash (Reasoning) | 93 | 93 | 93 | 88 | 82 | 90.0%
Llama 3.1 8B | 100 | 91 | 86 | 79 | 76 | 86.3%
Gemma 3 4B | 91 | 91 | 85 | 83 | 79 | 85.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 3 12B | 93 | 93 | 93 | 93 | 0 | 74.5%
Rocinante 12B | 100 | 85 | 83 | 78 | 0 | 69.1%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5 | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 94 | 98.8%
Aion 2.0 | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.5 | 100 | 100 | 100 | 100 | 94 | 98.8%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.8%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 94 | 98.8%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 94 | 98.8%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 94 | 98.8%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 93 | 98.7%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 93 | 98.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 93 | 98.7%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 93 | 98.7%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 93 | 98.7%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 93 | 98.7%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 93 | 98.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 93 | 98.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 93 | 98.7%
Ministral 3 8B | 100 | 100 | 100 | 100 | 93 | 98.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 93 | 98.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 93 | 98.6%
Hermes 3 405B | 100 | 100 | 100 | 100 | 93 | 98.6%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 91 | 98.2%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 88 | 97.6%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 94 | 94 | 97.5%
Qwen 3.5 122B | 100 | 100 | 100 | 94 | 94 | 97.5%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 94 | 94 | 97.5%
GPT-5.2 | 100 | 100 | 100 | 94 | 94 | 97.5%
Claude Opus 4.7 | 100 | 100 | 100 | 94 | 94 | 97.5%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 88 | 97.5%
Mistral Small Creative | 100 | 100 | 100 | 94 | 94 | 97.5%
Ministral 3 14B | 100 | 100 | 100 | 94 | 94 | 97.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 94 | 93 | 97.4%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 93 | 93 | 97.3%
Grok 4.20 (Beta) | 100 | 100 | 100 | 93 | 93 | 97.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 94 | 93 | 97.3%
Writer: Palmyra X5 | 100 | 100 | 100 | 93 | 93 | 97.2%
Qwen 2.5 72B | 100 | 100 | 100 | 93 | 92 | 97.0%
Ministral 8B | 100 | 100 | 100 | 93 | 92 | 97.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 93 | 91 | 96.8%
ByteDance Seed 1.6 | 100 | 100 | 100 | 94 | 88 | 96.4%
GPT-5.4 | 100 | 100 | 100 | 94 | 88 | 96.4%
Qwen 3.6 35B | 100 | 100 | 100 | 93 | 88 | 96.3%
Claude Opus 4.6 | 100 | 100 | 94 | 94 | 94 | 96.3%
Qwen 3.5 27B | 100 | 100 | 94 | 94 | 94 | 96.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 94 | 88 | 96.3%
Ministral 3B | 100 | 100 | 100 | 100 | 81 | 96.3%
GPT-4.1 | 100 | 100 | 100 | 93 | 88 | 96.2%
DeepSeek V4 Pro | 100 | 100 | 94 | 94 | 93 | 96.2%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 93 | 88 | 96.2%
GPT-OSS 120B | 100 | 100 | 93 | 93 | 93 | 95.8%
Qwen 3 32B | 100 | 100 | 100 | 100 | 79 | 95.8%
GPT-5.5 (Reasoning) | 100 | 94 | 94 | 94 | 94 | 95.0%
Qwen 3.6 Flash | 100 | 94 | 94 | 94 | 94 | 95.0%
WizardLM 2 8x22b | 100 | 100 | 93 | 93 | 88 | 94.9%
Stealth: Hunter Alpha | 100 | 94 | 93 | 93 | 93 | 94.8%
Grok 4.20 | 100 | 93 | 93 | 93 | 93 | 94.7%
Gemini 2.5 Flash | 100 | 100 | 94 | 88 | 88 | 94.0%
Qwen 3.5 35B | 100 | 100 | 94 | 94 | 82 | 94.0%
Cydonia 24B V4.1 | 100 | 100 | 93 | 88 | 88 | 94.0%
Llama 3.1 Nemotron 70B | 100 | 93 | 92 | 92 | 92 | 94.0%
GPT-5 Mini | 100 | 94 | 93 | 93 | 88 | 93.7%
GPT-5.1 | 94 | 94 | 94 | 94 | 93 | 93.7%
Qwen 3.5 397B A17B | 94 | 94 | 93 | 93 | 93 | 93.5%
Inception Mercury 2 | 100 | 93 | 92 | 92 | 88 | 93.0%
Hermes 3 70B | 100 | 92 | 92 | 91 | 90 | 93.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 88 | 88 | 88 | 92.8%
GPT-5 Nano | 94 | 94 | 94 | 93 | 88 | 92.6%
Z.AI GLM 4.7 Flash | 94 | 93 | 93 | 93 | 88 | 92.2%
GPT-5.4 Mini | 100 | 94 | 94 | 88 | 83 | 91.8%
GPT-5.4 Mini (Reasoning, Low) | 100 | 94 | 88 | 88 | 88 | 91.7%
Z.AI GLM 4.5 Air | 100 | 94 | 94 | 88 | 83 | 91.7%
Qwen 3.5 Flash | 94 | 94 | 93 | 88 | 83 | 90.5%
GPT-5.4 Nano (Reasoning) | 94 | 93 | 88 | 88 | 88 | 90.1%
Mistral Small 4 | 100 | 93 | 93 | 87 | 74 | 89.3%
Qwen 3.5 9B | 94 | 94 | 93 | 83 | 82 | 89.3%
Ministral 3 3B | 93 | 93 | 87 | 87 | 87 | 89.2%
Gemma 3 12B | 93 | 88 | 88 | 88 | 88 | 88.8%
Mistral Large 3 | 88 | 88 | 88 | 88 | 88 | 88.2%
Mistral Large 2 | 88 | 88 | 88 | 88 | 88 | 88.2%
Mistral Large | 88 | 88 | 88 | 88 | 88 | 88.2%
Inception Mercury | 100 | 93 | 93 | 87 | 67 | 87.9%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 86 | 47 | 86.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 79 | 47 | 85.2%
Gemma 4 26B | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 3 4B | 86 | 80 | 80 | 79 | 73 | 79.5%
Llama 3.1 70B | 100 | 93 | 92 | 91 | 0 | 75.1%
Rocinante 12B | 100 | 100 | 87 | 85 | 0 | 74.3%
ByteDance Seed 1.6 Flash | 100 | 74 | 68 | 68 | 52 | 72.5%
GPT-5.4 Nano (Reasoning, Low) | 83 | 72 | 71 | 67 | 65 | 71.8%
Llama 3.1 8B | 100 | 100 | 80 | 69 | 0 | 69.8%
GPT-5.4 Nano | 81 | 72 | 71 | 61 | 58 | 68.7%
GPT-4.1 Nano | 83 | 72 | 70 | 67 | 41 | 66.6%
Mistral NeMO | 100 | 90 | 83 | 0 | 0 | 54.7%
Skyfall 36B V2 | 93 | 74 | 48 | 0 | 0 | 43.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 92 | 98.3%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 92 | 98.3%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 92 | 98.3%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 92 | 98.3%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 92 | 98.3%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 92 | 98.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 92 | 98.3%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 92 | 98.3%
GPT-5.5 | 100 | 100 | 100 | 100 | 92 | 98.3%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 92 | 98.3%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 92 | 98.3%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 92 | 98.3%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 92 | 98.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 92 | 98.3%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 92 | 98.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 92 | 98.3%
Ministral 8B | 100 | 100 | 100 | 100 | 92 | 98.3%
Ministral 3B | 100 | 100 | 100 | 100 | 92 | 98.3%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 90 | 98.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 85 | 96.9%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 92 | 92 | 96.7%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 92 | 92 | 96.7%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 92 | 92 | 96.7%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 92 | 92 | 96.7%
ByteDance Seed 1.6 | 100 | 100 | 100 | 92 | 92 | 96.7%
o4 Mini | 100 | 100 | 100 | 92 | 92 | 96.7%
Grok 4.20 (Beta) | 100 | 100 | 100 | 92 | 92 | 96.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 92 | 92 | 96.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 83 | 96.7%
Grok 4.3 | 100 | 100 | 100 | 92 | 92 | 96.7%
Mistral Medium 3.1 | 100 | 100 | 100 | 92 | 92 | 96.7%
Qwen 3 32B | 100 | 100 | 100 | 92 | 91 | 96.5%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 92 | 85 | 95.3%
Grok 4.20 | 100 | 100 | 100 | 92 | 85 | 95.3%
Qwen 2.5 72B | 100 | 100 | 100 | 92 | 85 | 95.3%
Mistral Small Creative | 100 | 100 | 100 | 92 | 85 | 95.3%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 92 | 92 | 92 | 95.0%
GPT-5.1 | 100 | 100 | 92 | 92 | 92 | 95.0%
Z.AI GLM 5 | 100 | 100 | 92 | 92 | 92 | 95.0%
Gemini 2.5 Pro | 100 | 100 | 92 | 92 | 92 | 95.0%
Grok 4 Fast | 100 | 100 | 92 | 92 | 92 | 95.0%
DeepSeek-V2 Chat | 100 | 100 | 92 | 92 | 92 | 95.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 92 | 92 | 92 | 95.0%
MiniMax M2.5 | 100 | 100 | 92 | 92 | 91 | 94.8%
WizardLM 2 8x22b | 100 | 100 | 92 | 92 | 91 | 94.8%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 91 | 91 | 91 | 94.5%
Hermes 3 70B | 100 | 100 | 91 | 91 | 91 | 94.5%
Llama 3.1 70B | 100 | 100 | 92 | 91 | 90 | 94.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 92 | 90 | 90 | 94.3%
ByteDance Seed 2.0 Mini | 100 | 100 | 92 | 92 | 85 | 93.6%
Qwen 3.5 9B | 100 | 100 | 92 | 91 | 85 | 93.4%
GPT-5 Mini | 100 | 92 | 92 | 92 | 92 | 93.3%
Qwen 3.5 122B | 100 | 92 | 92 | 92 | 92 | 93.3%
GPT-4.1 | 100 | 92 | 92 | 92 | 92 | 93.3%
Stealth: Hunter Alpha | 100 | 92 | 92 | 92 | 92 | 93.3%
Mistral Small 3.2 24B | 100 | 100 | 92 | 90 | 85 | 93.3%
GPT-4.1 Nano | 100 | 100 | 100 | 89 | 75 | 92.8%
Cohere Command R+ (Aug. 2024) | 100 | 92 | 91 | 91 | 90 | 92.7%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 91 | 85 | 85 | 92.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 92 | 85 | 83 | 91.9%
Mistral Large 3 | 92 | 92 | 92 | 92 | 92 | 91.7%
GPT-4o, May 13th (temp=1) | 92 | 92 | 92 | 92 | 92 | 91.7%
Mistral Large 2 | 92 | 92 | 92 | 92 | 92 | 91.7%
DeepSeek V3.1 | 92 | 92 | 92 | 92 | 92 | 91.7%
Mistral Large | 92 | 92 | 92 | 92 | 92 | 91.7%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 85 | 73 | 91.6%
Gemma 3 4B | 100 | 90 | 90 | 89 | 89 | 91.6%
GPT-4o Mini (temp=1) | 100 | 89 | 89 | 89 | 89 | 91.1%
Claude 3 Haiku | 100 | 100 | 92 | 85 | 79 | 91.0%
Writer: Palmyra X5 | 91 | 91 | 91 | 91 | 91 | 90.9%
Stealth: Healer Alpha | 100 | 92 | 92 | 85 | 85 | 90.5%
Inception Mercury 2 | 100 | 92 | 92 | 85 | 85 | 90.5%
Nemotron 3 Super | 92 | 92 | 92 | 92 | 85 | 90.3%
GPT-5.4 | 92 | 92 | 92 | 92 | 85 | 90.3%
Claude Haiku 4.5 | 92 | 92 | 92 | 92 | 83 | 90.0%
GPT-5.2 | 100 | 100 | 85 | 85 | 79 | 89.6%
GPT-4o Mini (temp=0) | 89 | 89 | 89 | 89 | 89 | 88.9%
ByteDance Seed 1.6 Flash | 100 | 100 | 91 | 79 | 69 | 87.6%
Z.AI GLM 4.5 Air | 100 | 91 | 90 | 85 | 69 | 86.9%
Arcee AI: Trinity Mini | 100 | 89 | 83 | 82 | 80 | 86.8%
Gemma 3 27B | 92 | 92 | 85 | 85 | 79 | 86.2%
Ministral 3 14B | 92 | 92 | 85 | 85 | 79 | 86.2%
Mistral Small 4 | 92 | 89 | 85 | 85 | 80 | 86.0%
Inception Mercury | 100 | 92 | 90 | 75 | 71 | 85.6%
Llama 3.1 8B | 90 | 90 | 82 | 82 | 80 | 84.7%
GPT-5.4 Nano (Reasoning, Low) | 92 | 85 | 85 | 79 | 73 | 82.6%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Skyfall 36B V2 | 100 | 100 | 100 | 90 | 0 | 78.0%
GPT-5.4 Nano | 85 | 85 | 73 | 73 | 73 | 77.8%
Cydonia 24B V4.1 | 100 | 85 | 85 | 79 | 0 | 69.6%
Qwen 3.5 27B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemma 3 12B | 67 | 60 | 56 | 56 | 56 | 58.8%
Mistral NeMO | 100 | 100 | 90 | 0 | 0 | 58.0%
Rocinante 12B | 100 | 100 | 0 | 0 | 0 | 40.0%