Recall

Test: Codex Extraction

Avg. Score: 93.3%
Scenarios: 4

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 3.1 Flash Lite (Preview) | 99.7% | $0.0017 | 2.0s | 98%
2 | Gemini 3.1 Flash Lite (Reasoning) | 98.9% | $0.0018 | 2.0s | 97%
3 | Gemini 3 Flash (Preview) | 99.7% | $0.0027 | 3.9s | 97%
4 | Qwen 3.5 Plus (2026-02-15) | 100.0% | $0.0030 | 10.6s | 100%
5 | Gemini 3.1 Flash Lite | 99.1% | $0.0017 | 5.1s | 97%
6 | Mistral Small Creative | 98.6% | $0.0006 | 3.9s | 95%
7 | Mistral Medium 3.1 | 98.5% | $0.0026 | 5.8s | 96%
8 | Ministral 3 8B | 97.3% | $0.0006 | 3.3s | 94%
9 | GPT-5.4 Mini | 97.6% | $0.0031 | 2.2s | 94%
10 | Gemini 3.5 Flash (Reasoning, Minimal) | 99.7% | $0.010 | 3.1s | 98%
11 | Gemma 4 31B | 99.8% | $0.0008 | 25.9s | 98%
12 | DeepSeek V4 Pro | 98.3% | $0.0021 | 16.0s | 96%
13 | Z.AI GLM 4.5 | 98.4% | $0.0028 | 16.8s | 96%
14 | Grok 4 Fast | 97.3% | $0.0012 | 8.7s | 92%
15 | Grok 4.1 Fast | 99.0% | $0.0017 | 22.1s | 96%
16 | GPT-5.4 Mini (Reasoning, Low) | 97.2% | $0.0042 | 3.7s | 92%
17 | Ministral 3 14B | 96.7% | $0.0009 | 6.0s | 90%
18 | Gemini 2.5 Flash Lite | 95.1% | $0.0005 | 1.9s | 88%
19 | Mistral Large 3 | 96.5% | $0.0027 | 8.2s | 91%
20 | MiniMax M2.7 | 98.2% | $0.0022 | 21.7s | 95%
21 | Ministral 8B | 95.1% | $0.0004 | 3.8s | 88%
22 | Claude Haiku 4.5 | 97.5% | $0.0073 | 4.5s | 92%
23 | Gemma 3 27B | 96.8% | $0.0005 | 14.5s | 91%
24 | Xiaomi MIMO v2.5 | 97.7% | $0.0034 | 13.4s | 91%
25 | DeepSeek V4 Flash (Reasoning) | 99.4% | $0.0009 | 37.0s | 97%
26 | GPT-4.1 | 97.0% | $0.0073 | 4.7s | 91%
27 | Gemini 2.5 Flash Lite (Reasoning) | 97.0% | $0.0021 | 15.6s | 91%
28 | GPT-5.4 | 97.5% | $0.011 | 6.4s | 93%
29 | Grok 4.20 | 96.2% | $0.0048 | 4.9s | 89%
30 | Claude 3.7 Sonnet | 100.0% | $0.021 | 7.8s | 100%
31 | Claude Sonnet 4.5 | 100.0% | $0.022 | 6.6s | 100%
32 | Mistral Small 4 (Reasoning) | 96.1% | $0.0019 | 15.6s | 91%
33 | Claude Sonnet 4 | 100.0% | $0.022 | 7.9s | 100%
34 | Mistral Small 3.2 24B | 94.3% | $0.0005 | 4.4s | 86%
35 | GPT-5.4 Nano (Reasoning) | 95.9% | $0.0023 | 12.2s | 89%
36 | GPT-5.4 (Reasoning, Low) | 99.3% | $0.017 | 10.4s | 97%
37 | Gemini 3 Flash (Preview, Reasoning) | 99.1% | $0.0096 | 22.2s | 96%
38 | Grok 4.20 (Beta) | 95.3% | $0.0049 | 2.0s | 87%
39 | Z.AI GLM 5 Turbo | 97.7% | $0.0068 | 16.0s | 92%
40 | Claude Sonnet 4.6 | 99.8% | $0.022 | 7.4s | 99%
41 | Grok 4.20 (Beta, Reasoning) | 99.3% | $0.020 | 12.7s | 97%
42 | Qwen 3.6 Flash | 98.5% | $0.0096 | 30.2s | 96%
43 | GPT-4o, Aug. 6th (temp=0) | 95.6% | $0.0100 | 4.0s | 87%
44 | Mistral Large 2 | 96.3% | $0.011 | 8.3s | 89%
45 | DeepSeek V3.2 | 97.6% | $0.0011 | 36.7s | 91%
46 | GPT-4o, May 13th (temp=0) | 98.5% | $0.025 | 4.3s | 97%
47 | Grok 4.20 (Reasoning) | 99.8% | $0.012 | 37.2s | 98%
48 | DeepSeek V3 (2025-03-24) | 96.4% | $0.0014 | 27.0s | 87%
49 | GPT-5.5 | 99.2% | $0.027 | 6.0s | 96%
50 | Stealth: Healer Alpha | 94.7% | $0.0000 | 24.6s | 86%
51 | Z.AI GLM 4.6 | 98.1% | $0.0057 | 38.8s | 93%
52 | DeepSeek-V2 Chat | 94.3% | $0.0019 | 14.8s | 83%
53 | GPT-4o, Aug. 6th (temp=1) | 94.4% | $0.0092 | 3.8s | 85%
54 | Inception Mercury 2 | 92.5% | $0.0022 | 3.5s | 80%
55 | GPT-5.4 Mini (Reasoning) | 99.0% | $0.017 | 25.6s | 95%
56 | o4 Mini | 97.7% | $0.014 | 21.5s | 92%
57 | Qwen 3.6 35B | 97.9% | $0.0072 | 37.6s | 93%
58 | MiniMax M2.5 | 95.3% | $0.0023 | 30.1s | 87%
59 | Gemini 2.5 Flash | 95.8% | $0.0023 | 2.5s | 74%
60 | GPT-5.2 | 97.6% | $0.018 | 16.8s | 92%
61 | Xiaomi MIMO v2.5 Pro | 96.3% | $0.0048 | 18.7s | 82%
62 | Qwen 2.5 72B | 91.5% | $0.0008 | 11.0s | 81%
63 | Claude 3 Haiku | 91.3% | $0.0017 | 5.5s | 80%
64 | Stealth: Hunter Alpha | 94.8% | $0.0000 | 36.7s | 88%
65 | Z.AI GLM 5 | 99.2% | $0.0091 | 51.4s | 97%
66 | Ministral 3B | 90.7% | $0.0002 | 1.8s | 76%
67 | Mistral Small 4 | 90.7% | $0.0008 | 3.2s | 77%
68 | DeepSeek V3 (2024-12-26) | 92.8% | $0.0017 | 13.5s | 80%
69 | WizardLM 2 8x22b | 94.7% | $0.0026 | 31.3s | 86%
70 | GPT-5.4 Nano (Reasoning, Low) | 90.0% | $0.0011 | 3.7s | 78%
71 | Hermes 3 405B | 93.0% | $0.0040 | 17.8s | 83%
72 | GPT-4.1 Mini | 90.0% | $0.0015 | 6.6s | 78%
73 | GPT-5 Mini | 97.3% | $0.0070 | 45.5s | 92%
74 | Qwen 3.5 Flash | 96.9% | $0.0031 | 49.5s | 91%
75 | Claude Opus 4.5 | 99.3% | $0.037 | 7.7s | 97%
76 | GPT-4o, May 13th (temp=1) | 95.4% | $0.025 | 4.0s | 89%
77 | Claude Opus 4.6 | 99.0% | $0.037 | 8.1s | 97%
78 | Qwen 3.5 35B | 98.4% | $0.015 | 47.3s | 95%
79 | Qwen 3 32B | 91.8% | $0.0010 | 37.9s | 86%
80 | GPT-5.4 Nano | 87.7% | $0.0011 | 3.8s | 75%
81 | Z.AI GLM 5.1 | 100.0% | $0.015 | 1.1m | 100%
82 | Writer: Palmyra X5 | 88.9% | $0.0052 | 7.5s | 78%
83 | Nemotron 3 Super | 96.1% | $0.0000 | 1.0m | 90%
84 | GPT-5.5 (Reasoning, Low) | 99.0% | $0.038 | 13.3s | 97%
85 | Gemini 3.5 Flash (Reasoning) | 99.8% | $0.040 | 16.4s | 98%
86 | Claude Opus 4.7 (Reasoning) | 99.8% | $0.046 | 5.9s | 99%
87 | Claude Opus 4.7 | 99.7% | $0.047 | 6.1s | 98%
88 | DeepSeek V4 Flash | 94.7% | $0.0003 | 7.7s | 62%
89 | o4 Mini High | 98.4% | $0.025 | 40.5s | 95%
90 | Gemini 2.5 Pro | 98.4% | $0.034 | 22.8s | 95%
91 | Grok 4 | 99.3% | $0.031 | 37.2s | 97%
92 | Claude 3.5 Sonnet | 99.2% | $0.043 | 13.9s | 96%
93 | Aion 2.0 | 98.5% | $0.0080 | 1.2m | 93%
94 | Z.AI GLM 4.5 Air | 92.5% | $0.0023 | 40.1s | 80%
95 | Ministral 3 3B | 85.2% | $0.0004 | 1.8s | 69%
96 | ByteDance Seed 2.0 Lite | 98.6% | $0.0078 | 1.4m | 95%
97 | GPT-OSS 120B | 93.9% | $0.0011 | 57.1s | 83%
98 | GPT-5 Nano | 97.2% | $0.0043 | 1.3m | 91%
99 | Grok 4.3 | 91.6% | $0.0056 | 4.3s | 64%
100 | Inception Mercury | 82.8% | $0.0006 | 3.9s | 70%
101 | ByteDance Seed 1.6 | 97.2% | $0.0073 | 1.3m | 91%
102 | Qwen 3.5 9B | 97.0% | $0.0013 | 1.5m | 91%
103 | Qwen 3.5 Plus (2026-04-20) | 99.1% | $0.015 | 1.4m | 97%
104 | Arcee AI: Trinity Mini | 79.7% | $0.0003 | 5.8s | 69%
105 | GPT-5.1 | 97.9% | $0.035 | 43.6s | 94%
106 | Gemini 2.5 Flash (Reasoning) | 90.4% | $0.0082 | 11.8s | 65%
107 | Z.AI GLM 4.7 Flash | 93.2% | $0.0019 | 1.2m | 83%
108 | Claude Opus 4.6 (Reasoning) | 99.5% | $0.055 | 21.4s | 98%
109 | Gemma 4 31B (Reasoning) | 99.8% | $0.0016 | 2.2m | 99%
110 | Gemma 3 4B | 79.2% | $0.0002 | 6.7s | 66%
111 | Grok 4.3 (Reasoning) | 97.9% | $0.018 | 1.3m | 93%
112 | Qwen3 235B A22B Instruct 2507 | 87.5% | $0.0007 | 19.3s | 62%
113 | ByteDance Seed 1.6 Flash | 86.4% | $0.0011 | 39.3s | 72%
114 | Z.AI GLM 4.7 | 98.3% | $0.0098 | 1.7m | 93%
115 | Hermes 3 70B | 81.2% | $0.0012 | 15.8s | 67%
116 | GPT-4o Mini (temp=0) | 76.6% | $0.0005 | 7.9s | 68%
117 | GPT-5.4 (Reasoning) | 99.8% | $0.044 | 53.6s | 99%
118 | Qwen3.7 Max | 99.9% | $0.041 | 1.1m | 99%
119 | Cydonia 24B V4.1 | 87.8% | $0.0010 | 12.2s | 54%
120 | Claude Sonnet 4.6 (Reasoning) | 99.0% | $0.052 | 36.0s | 96%
121 | DeepSeek V3.1 | 91.3% | $0.0012 | 26.1s | 55%
122 | Mistral Large | 90.8% | $0.011 | 7.6s | 55%
123 | GPT-5.5 (Reasoning) | 98.2% | $0.056 | 25.8s | 95%
124 | Qwen 3.5 397B A17B | 97.6% | $0.012 | 1.8m | 93%
125 | Gemini 3 Pro (Preview) | 99.1% | $0.055 | 35.6s | 94%
126 | GPT-4o Mini (temp=1) | 75.3% | $0.0006 | 8.0s | 62%
127 | Cohere Command R+ (Aug. 2024) | 81.8% | $0.014 | 17.9s | 64%
128 | Llama 3.1 70B | 81.1% | $0.0016 | 16.0s | 52%
129 | DeepSeek V4 Pro (Reasoning) | 98.3% | $0.0093 | 2.3m | 93%
130 | Llama 3.1 Nemotron 70B | 84.5% | $0.0050 | 17.3s | 50%
131 | GPT-5 | 99.4% | $0.049 | 1.3m | 98%
132 | Gemma 3 12B | 78.3% | $0.0003 | 14.1s | 47%
133 | Arcee AI: Trinity Large (Preview) | 84.8% | $0.0000 | 20.5s | 41%
134 | MoonshotAI: Kimi K2.5 | 98.1% | $0.013 | 2.4m | 93%
135 | Nemotron 3 Nano | 84.3% | $0.0013 | 1.6m | 75%
136 | GPT-4.1 Nano | 68.6% | $0.0003 | 2.8s | 50%
137 | MoonshotAI: Kimi K2.6 | 99.5% | $0.026 | 2.4m | 97%
138 | Qwen 3.6 27B | 92.3% | $0.018 | 1.2m | 61%
139 | Gemma 4 26B | 83.8% | $0.0006 | 18.7s | 29%
140 | Claude Opus 4 | 100.0% | $0.110 | 13.5s | 100%
141 | ByteDance Seed 2.0 Mini | 97.4% | $0.0034 | 3.4m | 93%
142 | Llama 3.1 8B | 66.4% | $0.0001 | 10.0s | 39%
143 | Qwen3.6 Max Preview | 99.1% | $0.045 | 2.4m | 94%
144 | Gemma 4 26B (Reasoning) | 94.2% | $0.0023 | 2.6m | 56%
145 | Gemini 3.1 Pro (Preview) | 95.2% | $0.065 | 1.0m | 67%
146 | Skyfall 36B V2 | 59.4% | $0.0018 | 12.7s | 25%
147 | Qwen 3.5 27B | 89.2% | $0.021 | 1.6m | 40%
148 | LFM2 24B | 35.7% | $0.0002 | 12.5s | 14%
149 | Rocinante 12B | 40.3% | $0.0013 | 25.6s | 13%
150 | Qwen 3.5 122B | 98.0% | $0.079 | 3.6m | 95%
151 | Mistral NeMO | 23.9% | $0.0006 | 1.4s | 0%
Avg. Score: 93.27%

Individual Scenarios

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-5 Mini | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 98 | 99.6%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 98 | 99.6%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.6%
Grok 4 Fast | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.6%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 98 | 99.5%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 98 | 99.5%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 98 | 99.5%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 98 | 98 | 99.2%
Qwen 3.5 122B | 100 | 100 | 100 | 98 | 98 | 99.2%
Gemini 2.5 Pro | 100 | 100 | 100 | 98 | 98 | 99.2%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 96 | 99.2%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 95 | 99.1%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 98 | 98 | 99.1%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 98 | 98 | 99.1%
GPT-5.2 | 100 | 100 | 100 | 100 | 94 | 98.8%
ByteDance Seed 2.0 Mini | 100 | 100 | 98 | 98 | 98 | 98.8%
Z.AI GLM 4.7 | 100 | 100 | 100 | 98 | 96 | 98.8%
Qwen 3.5 9B | 100 | 100 | 100 | 98 | 95 | 98.7%
GPT-5 Nano | 100 | 100 | 100 | 100 | 93 | 98.7%
Grok 4.20 (Beta) | 100 | 100 | 100 | 98 | 95 | 98.6%
GPT-5.1 | 100 | 100 | 98 | 98 | 96 | 98.5%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 98 | 94 | 98.5%
Aion 2.0 | 100 | 100 | 100 | 98 | 94 | 98.4%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 96 | 96 | 98.3%
Stealth: Hunter Alpha | 100 | 100 | 100 | 98 | 93 | 98.3%
Qwen 3.5 Flash | 100 | 100 | 98 | 98 | 95 | 98.3%
Z.AI GLM 4.5 | 100 | 100 | 100 | 96 | 95 | 98.3%
o4 Mini High | 100 | 100 | 100 | 95 | 95 | 98.1%
Ministral 3 8B | 100 | 100 | 100 | 95 | 95 | 98.1%
Gemini 2.5 Flash (Reasoning) | 100 | 98 | 98 | 98 | 96 | 98.1%
GPT-5.4 | 100 | 98 | 98 | 98 | 96 | 98.1%
GPT-5 | 100 | 98 | 98 | 98 | 96 | 98.1%
Qwen 3.5 35B | 100 | 100 | 96 | 96 | 96 | 97.7%
Claude 3 Haiku | 100 | 100 | 98 | 95 | 95 | 97.7%
Qwen 3.6 Flash | 100 | 100 | 98 | 96 | 93 | 97.5%
Ministral 3 14B | 100 | 100 | 98 | 96 | 93 | 97.5%
Gemma 3 27B | 98 | 98 | 98 | 98 | 95 | 97.5%
o4 Mini | 100 | 100 | 95 | 95 | 95 | 97.2%
DeepSeek V4 Pro | 100 | 100 | 95 | 95 | 95 | 97.2%
GPT-5.4 Mini | 100 | 100 | 95 | 95 | 93 | 96.8%
Ministral 3 3B | 100 | 98 | 98 | 95 | 93 | 96.8%
GPT-4o, May 13th (temp=0) | 98 | 98 | 98 | 95 | 95 | 96.7%
GPT-4.1 | 98 | 98 | 96 | 96 | 94 | 96.5%
Mistral Small 4 (Reasoning) | 100 | 98 | 96 | 94 | 94 | 96.5%
Gemini 2.5 Flash Lite | 100 | 98 | 95 | 95 | 93 | 96.4%
Grok 4.3 | 100 | 98 | 98 | 95 | 91 | 96.3%
Ministral 3B | 100 | 100 | 100 | 93 | 88 | 96.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 95 | 93 | 92 | 96.2%
Z.AI GLM 4.7 Flash | 100 | 100 | 95 | 94 | 91 | 96.1%
Mistral Small 4 | 100 | 98 | 95 | 95 | 91 | 96.0%
MiniMax M2.5 | 100 | 95 | 95 | 95 | 93 | 95.9%
Stealth: Healer Alpha | 100 | 96 | 96 | 93 | 93 | 95.7%
Qwen 3.6 27B | 100 | 100 | 98 | 93 | 87 | 95.6%
Qwen 3.6 35B | 100 | 100 | 96 | 94 | 87 | 95.5%
GPT-4o, May 13th (temp=1) | 98 | 98 | 98 | 98 | 87 | 95.5%
GPT-5.4 Nano (Reasoning) | 98 | 96 | 96 | 94 | 94 | 95.4%
Llama 3.1 Nemotron 70B | 98 | 96 | 96 | 95 | 91 | 95.2%
DeepSeek-V2 Chat | 100 | 98 | 93 | 93 | 91 | 94.9%
Ministral 8B | 98 | 95 | 95 | 95 | 91 | 94.9%
Nemotron 3 Super | 98 | 96 | 93 | 93 | 93 | 94.7%
GPT-4o, Aug. 6th (temp=0) | 100 | 93 | 93 | 93 | 93 | 94.4%
Cydonia 24B V4.1 | 100 | 96 | 95 | 91 | 89 | 94.3%
WizardLM 2 8x22b | 100 | 100 | 95 | 95 | 79 | 94.0%
GPT-5.4 Nano | 100 | 96 | 94 | 93 | 86 | 93.8%
GPT-4.1 Mini | 100 | 95 | 93 | 91 | 88 | 93.6%
Mistral Small 3.2 24B | 95 | 95 | 95 | 91 | 91 | 93.5%
DeepSeek V3 (2024-12-26) | 100 | 95 | 93 | 91 | 88 | 93.5%
Qwen3 235B A22B Instruct 2507 | 95 | 95 | 93 | 93 | 88 | 93.0%
GPT-4o, Aug. 6th (temp=1) | 95 | 93 | 93 | 91 | 91 | 92.6%
Qwen 2.5 72B | 98 | 95 | 93 | 88 | 88 | 92.6%
Gemma 3 12B | 95 | 93 | 91 | 91 | 89 | 91.9%
ByteDance Seed 1.6 Flash | 96 | 94 | 90 | 88 | 86 | 90.9%
GPT-OSS 120B | 93 | 91 | 91 | 88 | 86 | 89.8%
Hermes 3 405B | 95 | 95 | 91 | 84 | 84 | 89.8%
Cohere Command R+ (Aug. 2024) | 96 | 91 | 91 | 86 | 84 | 89.6%
Writer: Palmyra X5 | 95 | 95 | 93 | 91 | 70 | 88.8%
Qwen 3 32B | 93 | 91 | 86 | 86 | 86 | 88.5%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 40 | 87.9%
Inception Mercury 2 | 91 | 91 | 86 | 84 | 82 | 86.7%
Gemma 3 4B | 88 | 86 | 86 | 85 | 82 | 85.5%
Llama 3.1 70B | 98 | 96 | 81 | 79 | 70 | 84.9%
Gemini 3.1 Pro (Preview) | 100 | 100 | 98 | 98 | 23 | 83.9%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 14 | 82.8%
Nemotron 3 Nano | 84 | 84 | 84 | 82 | 79 | 82.4%
Inception Mercury | 86 | 84 | 81 | 79 | 78 | 81.6%
GPT-4o Mini (temp=0) | 86 | 84 | 79 | 79 | 74 | 80.5%
Arcee AI: Trinity Mini | 86 | 86 | 79 | 77 | 72 | 80.1%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 70B | 88 | 81 | 79 | 77 | 72 | 79.5%
GPT-4o Mini (temp=1) | 81 | 79 | 79 | 67 | 65 | 74.5%
LFM2 24B | 77 | 77 | 75 | 75 | 65 | 73.6%
Arcee AI: Trinity Large (Preview) | 95 | 95 | 88 | 88 | 0 | 73.5%
GPT-4.1 Nano | 85 | 77 | 75 | 68 | 58 | 72.8%
Skyfall 36B V2 | 100 | 89 | 88 | 38 | 30 | 69.1%
Llama 3.1 8B | 84 | 79 | 77 | 75 | 5 | 64.0%
Gemma 4 26B | 100 | 100 | 79 | 0 | 0 | 55.8%
Rocinante 12B | 87 | 86 | 74 | 0 | 0 | 49.5%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%
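
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.3%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 97 | 99.3%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 97 | 99.3%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.3%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.3%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 97 | 99.3%
o4 Mini | 100 | 100 | 100 | 100 | 97 | 99.3%
Gemma 4 31B | 100 | 100 | 100 | 100 | 97 | 99.3%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 97 | 99.3%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 97 | 99.3%
Gemma 3 27B | 100 | 100 | 100 | 100 | 97 | 99.3%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 96 | 99.3%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-5.2 | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-4.1 | 100 | 100 | 100 | 100 | 96 | 99.3%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 96 | 99.3%
Gemma 4 26B | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-5.4 | 100 | 100 | 100 | 100 | 96 | 99.3%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 97 | 97 | 98.6%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 97 | 97 | 98.6%
Qwen 3.6 35B | 100 | 100 | 100 | 97 | 97 | 98.6%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 97 | 97 | 98.6%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 97 | 97 | 98.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 97 | 97 | 98.6%
Mistral Small Creative | 100 | 100 | 100 | 97 | 97 | 98.6%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 97 | 96 | 98.6%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 96 | 96 | 98.6%
GPT-5.5 | 100 | 100 | 100 | 100 | 93 | 98.6%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 96 | 96 | 98.6%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 97 | 97 | 97 | 97.9%
Claude Opus 4.6 | 100 | 100 | 97 | 97 | 97 | 97.9%
Grok 4 | 100 | 100 | 97 | 97 | 97 | 97.9%
Z.AI GLM 4.5 | 100 | 100 | 97 | 97 | 97 | 97.9%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 90 | 97.9%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 90 | 97.9%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 90 | 97.9%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 90 | 97.9%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 97 | 97 | 96 | 97.9%
GPT-5 Mini | 100 | 100 | 97 | 96 | 96 | 97.9%
Claude Opus 4.5 | 100 | 97 | 97 | 97 | 97 | 97.2%
Gemini 3.1 Flash Lite | 100 | 97 | 97 | 97 | 97 | 97.2%
Mistral Large 3 | 100 | 97 | 97 | 97 | 97 | 97.2%
Nemotron 3 Super | 100 | 100 | 100 | 97 | 90 | 97.2%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 86 | 97.2%
Aion 2.0 | 100 | 100 | 100 | 100 | 86 | 97.2%
ByteDance Seed 2.0 Mini | 100 | 97 | 97 | 97 | 97 | 97.2%
Claude 3.5 Sonnet | 100 | 100 | 97 | 97 | 93 | 97.2%
Ministral 3 14B | 100 | 100 | 97 | 96 | 93 | 97.2%
GPT-5.5 (Reasoning) | 100 | 96 | 96 | 96 | 96 | 97.1%
Qwen3.6 Max Preview | 100 | 100 | 100 | 97 | 86 | 96.6%
Grok 4.1 Fast | 100 | 97 | 97 | 97 | 93 | 96.6%
Z.AI GLM 4.6 | 100 | 100 | 100 | 97 | 86 | 96.6%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 97 | 86 | 96.6%
Gemini 3.1 Flash Lite (Reasoning) | 97 | 97 | 97 | 97 | 97 | 96.6%
Claude Haiku 4.5 | 97 | 97 | 97 | 97 | 97 | 96.6%
GPT-5 Nano | 97 | 97 | 97 | 97 | 97 | 96.6%
DeepSeek V4 Flash | 100 | 100 | 100 | 97 | 86 | 96.6%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 90 | 90 | 95.9%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 90 | 90 | 95.9%
Mistral Large 2 | 100 | 100 | 100 | 90 | 90 | 95.9%
Mistral Medium 3.1 | 97 | 97 | 97 | 97 | 93 | 95.9%
Ministral 3 8B | 100 | 100 | 93 | 93 | 93 | 95.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 93 | 93 | 93 | 95.8%
Gemini 2.5 Flash Lite | 100 | 100 | 97 | 96 | 86 | 95.8%
Qwen 3.5 Flash | 100 | 97 | 97 | 97 | 89 | 95.8%
MiniMax M2.7 | 100 | 97 | 93 | 93 | 93 | 95.2%
Grok 4 Fast | 97 | 97 | 97 | 93 | 93 | 95.2%
ByteDance Seed 1.6 | 100 | 97 | 97 | 96 | 86 | 95.1%
GPT-5.4 Nano | 97 | 97 | 96 | 93 | 93 | 95.1%
Ministral 8B | 97 | 97 | 96 | 93 | 93 | 95.1%
DeepSeek V3.1 | 100 | 100 | 97 | 90 | 86 | 94.5%
DeepSeek V3.2 | 100 | 100 | 93 | 90 | 90 | 94.5%
WizardLM 2 8x22b | 100 | 97 | 97 | 90 | 90 | 94.5%
Qwen 3.6 27B | 100 | 100 | 96 | 90 | 86 | 94.5%
Mistral Large | 100 | 97 | 96 | 90 | 90 | 94.5%
Gemini 2.5 Flash (Reasoning) | 96 | 96 | 96 | 93 | 89 | 94.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 97 | 97 | 90 | 86 | 93.8%
GPT-OSS 120B | 100 | 100 | 100 | 86 | 83 | 93.8%
DeepSeek V4 Pro (Reasoning) | 97 | 97 | 97 | 90 | 86 | 93.1%
Z.AI GLM 5 Turbo | 100 | 97 | 97 | 86 | 86 | 93.1%
Grok 4.3 (Reasoning) | 97 | 97 | 97 | 90 | 86 | 93.1%
Xiaomi MIMO v2.5 Pro | 93 | 93 | 93 | 93 | 93 | 93.1%
Xiaomi MIMO v2.5 | 100 | 97 | 97 | 86 | 86 | 93.1%
Mistral Small 4 | 100 | 93 | 93 | 90 | 90 | 93.1%
GPT-5.4 Nano (Reasoning, Low) | 97 | 96 | 96 | 90 | 86 | 93.0%
Qwen 3 32B | 97 | 97 | 93 | 93 | 83 | 92.4%
MiniMax M2.5 | 97 | 97 | 93 | 86 | 86 | 91.7%
Stealth: Hunter Alpha | 97 | 97 | 93 | 90 | 83 | 91.7%
Grok 4.20 | 100 | 100 | 86 | 86 | 86 | 91.7%
Cydonia 24B V4.1 | 97 | 93 | 93 | 93 | 83 | 91.7%
Stealth: Healer Alpha | 100 | 97 | 90 | 86 | 83 | 91.0%
Mistral Small 3.2 24B | 93 | 90 | 90 | 90 | 90 | 90.3%
Z.AI GLM 4.5 Air | 100 | 97 | 90 | 83 | 83 | 90.3%
ByteDance Seed 1.6 Flash | 100 | 90 | 90 | 86 | 83 | 89.6%
Claude 3 Haiku | 93 | 93 | 86 | 86 | 86 | 89.0%
Hermes 3 405B | 93 | 93 | 90 | 83 | 83 | 88.3%
Grok 4.20 (Beta) | 97 | 90 | 86 | 86 | 83 | 88.3%
DeepSeek V3 (2025-03-24) | 100 | 90 | 86 | 83 | 79 | 87.6%
DeepSeek-V2 Chat | 90 | 90 | 90 | 90 | 72 | 86.2%
Qwen 2.5 72B | 93 | 86 | 86 | 83 | 79 | 85.5%
Llama 3.1 70B | 86 | 86 | 86 | 86 | 79 | 84.8%
GPT-4.1 Mini | 93 | 83 | 83 | 83 | 79 | 84.1%
Writer: Palmyra X5 | 93 | 83 | 83 | 79 | 79 | 83.4%
DeepSeek V3 (2024-12-26) | 90 | 90 | 83 | 76 | 72 | 82.1%
Inception Mercury | 93 | 89 | 83 | 76 | 69 | 81.9%
Nemotron 3 Nano | 90 | 83 | 79 | 79 | 76 | 81.4%
Cohere Command R+ (Aug. 2024) | 93 | 86 | 76 | 69 | 69 | 78.5%
Hermes 3 70B | 83 | 83 | 79 | 72 | 72 | 77.9%
Ministral 3B | 86 | 79 | 76 | 76 | 72 | 77.9%
Skyfall 36B V2 | 97 | 79 | 72 | 72 | 69 | 77.8%
Llama 3.1 8B | 83 | 82 | 76 | 76 | 72 | 77.8%
Gemma 3 12B | 96 | 96 | 93 | 93 | 0 | 75.8%
Gemma 3 4B | 76 | 76 | 76 | 76 | 72 | 75.0%
Grok 4.3 | 90 | 90 | 86 | 86 | 21 | 74.5%
Arcee AI: Trinity Mini | 79 | 76 | 72 | 72 | 72 | 74.5%
Llama 3.1 Nemotron 70B | 86 | 86 | 83 | 79 | 34 | 73.8%
GPT-4o Mini (temp=0) | 76 | 76 | 76 | 69 | 69 | 73.1%
Ministral 3 3B | 79 | 72 | 72 | 72 | 69 | 73.1%
GPT-4.1 Nano | 86 | 86 | 69 | 66 | 59 | 73.1%
Qwen3 235B A22B Instruct 2507 | 93 | 83 | 83 | 79 | 21 | 71.7%
Arcee AI: Trinity Large (Preview) | 97 | 90 | 86 | 83 | 0 | 71.0%
GPT-4o Mini (temp=1) | 72 | 72 | 69 | 69 | 66 | 69.7%
Rocinante 12B | 72 | 72 | 48 | 34 | 0 | 45.4%
LFM2 24B | 21 | 21 | 21 | 21 | 21 | 20.7%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%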
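
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 97 | 99.3%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 97 | 99.3%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.3%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.3%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.3%
GPT-5 | 100 | 100 | 100 | 100 | 97 | 99.3%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.3%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 97 | 99.3%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 97 | 99.3%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 97 | 99.3%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 97 | 99.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 97 | 99.3%
GPT-5.5 | 100 | 100 | 100 | 100 | 97 | 99.3%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.3%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 97 | 99.3%
Grok 4 | 100 | 100 | 100 | 100 | 97 | 99.3%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 97 | 99.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 97 | 99.3%
Ministral 3 14B | 100 | 100 | 100 | 100 | 97 | 99.3%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 97 | 97 | 98.7%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 93 | 98.7%
Qwen 3.5 122B | 100 | 100 | 100 | 97 | 97 | 98.7%
GPT-5.2 | 100 | 100 | 100 | 97 | 97 | 98.7%
Claude Opus 4.7 | 100 | 100 | 100 | 97 | 97 | 98.7%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 97 | 97 | 98.7%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 93 | 98.7%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 93 | 98.7%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 93 | 98.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 97 | 97 | 98.7%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 97 | 97 | 98.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 93 | 98.7%
Gemma 3 27B | 100 | 100 | 100 | 100 | 93 | 98.7%
Mistral Small Creative | 100 | 100 | 100 | 97 | 97 | 98.7%
Claude Opus 4.6 | 100 | 100 | 97 | 97 | 97 | 98.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 97 | 97 | 97 | 98.0%
Qwen 3.5 27B | 100 | 100 | 97 | 97 | 97 | 98.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 97 | 97 | 97 | 98.0%
MiniMax M2.7 | 100 | 100 | 100 | 97 | 93 | 98.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 90 | 98.0%
Nemotron 3 Super | 100 | 100 | 100 | 97 | 93 | 98.0%
GPT-5.4 | 100 | 100 | 100 | 97 | 93 | 98.0%
GPT-5.5 (Reasoning) | 100 | 97 | 97 | 97 | 97 | 97.3%
MoonshotAI: Kimi K2.5 | 100 | 100 | 97 | 97 | 93 | 97.3%
Qwen 3.6 Flash | 100 | 97 | 97 | 97 | 97 | 97.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 93 | 93 | 97.3%
o4 Mini High | 100 | 97 | 97 | 97 | 97 | 97.3%
Qwen 3.6 27B | 100 | 100 | 97 | 97 | 93 | 97.3%
Z.AI GLM 4.6 | 100 | 100 | 97 | 97 | 93 | 97.3%
Qwen 3.6 35B | 100 | 100 | 100 | 93 | 93 | 97.3%
MiniMax M2.5 | 100 | 100 | 100 | 97 | 90 | 97.3%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 93 | 93 | 97.3%
Z.AI GLM 4.5 | 100 | 97 | 97 | 97 | 97 | 97.3%
Grok 4 Fast | 100 | 100 | 100 | 97 | 90 | 97.3%
Grok 4.3 | 100 | 97 | 97 | 97 | 97 | 97.3%
GPT-4o, May 13th (temp=0) | 100 | 97 | 97 | 97 | 97 | 97.3%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 93 | 90 | 96.7%
Qwen 3.5 35B | 100 | 100 | 97 | 97 | 90 | 96.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 97 | 87 | 96.7%
DeepSeek V4 Pro | 100 | 97 | 97 | 97 | 93 | 96.7%
GPT-5.4 Mini | 100 | 97 | 97 | 97 | 93 | 96.7%
Gemini 2.5 Flash | 100 | 100 | 97 | 93 | 93 | 96.7%
GPT-4.1 | 100 | 100 | 97 | 93 | 90 | 96.0%
GPT-5.1 | 97 | 97 | 97 | 97 | 93 | 96.0%
o4 Mini | 97 | 97 | 97 | 97 | 93 | 96.0%
Grok 4.20 (Beta) | 97 | 97 | 97 | 97 | 93 | 96.0%
Grok 4.20 | 97 | 97 | 97 | 97 | 93 | 96.0%
GPT-5 Mini | 100 | 97 | 93 | 93 | 93 | 95.3%
ByteDance Seed 1.6 | 100 | 97 | 97 | 93 | 90 | 95.3%
GPT-5.4 Mini (Reasoning, Low) | 100 | 97 | 93 | 93 | 93 | 95.3%
GPT-5 Nano | 97 | 97 | 97 | 93 | 93 | 95.3%
DeepSeek V3.1 | 100 | 97 | 97 | 93 | 90 | 95.3%
Ministral 3 8B | 100 | 97 | 97 | 93 | 90 | 95.3%
Stealth: Hunter Alpha | 97 | 97 | 93 | 93 | 93 | 94.7%
GPT-4o, May 13th (temp=1) | 97 | 97 | 93 | 93 | 93 | 94.7%
Qwen3 235B A22B Instruct 2507 | 97 | 97 | 97 | 93 | 90 | 94.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 97 | 97 | 90 | 87 | 94.0%
Hermes 3 405B | 97 | 97 | 97 | 93 | 87 | 94.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 97 | 93 | 90 | 90 | 94.0%
GPT-4o, Aug. 6th (temp=0) | 97 | 93 | 93 | 93 | 93 | 94.0%
Mistral Large 2 | 97 | 93 | 93 | 93 | 93 | 94.0%
Qwen 3 32B | 97 | 97 | 93 | 93 | 90 | 94.0%
Gemini 2.5 Flash Lite | 97 | 97 | 97 | 93 | 87 | 94.0%
WizardLM 2 8x22b | 97 | 97 | 97 | 93 | 87 | 94.0%
Mistral Small 4 (Reasoning) | 97 | 97 | 93 | 93 | 90 | 94.0%
Qwen 3.5 Flash | 97 | 97 | 93 | 90 | 90 | 93.3%
Mistral Large 3 | 93 | 93 | 93 | 93 | 93 | 93.3%
Z.AI GLM 4.5 Air | 97 | 97 | 93 | 90 | 90 | 93.3%
Mistral Large | 93 | 93 | 93 | 93 | 93 | 93.3%
Cydonia 24B V4.1 | 100 | 93 | 93 | 90 | 90 | 93.3%
Qwen 3.5 397B A17B | 97 | 97 | 90 | 90 | 90 | 92.7%
Writer: Palmyra X5 | 97 | 97 | 93 | 90 | 87 | 92.7%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 60 | 92.0%
GPT-OSS 120B | 97 | 93 | 93 | 90 | 87 | 92.0%
Qwen 3.5 9B | 97 | 97 | 93 | 90 | 83 | 92.0%
GPT-4.1 Mini | 97 | 93 | 93 | 90 | 87 | 92.0%
GPT-5.4 Nano (Reasoning) | 97 | 93 | 93 | 90 | 87 | 92.0%
Z.AI GLM 4.7 Flash | 97 | 93 | 93 | 90 | 83 | 91.3%
Ministral 8B | 97 | 93 | 90 | 90 | 87 | 91.3%
Mistral Small 4 | 97 | 93 | 93 | 87 | 83 | 90.7%
Qwen 2.5 72B | 97 | 93 | 90 | 87 | 87 | 90.7%
Ministral 3B | 97 | 93 | 93 | 87 | 83 | 90.7%
Inception Mercury 2 | 100 | 93 | 87 | 87 | 80 | 89.3%
Gemma 3 12B | 93 | 90 | 87 | 87 | 87 | 88.7%
Ministral 3 3B | 93 | 90 | 87 | 87 | 83 | 88.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 87 | 43 | 86.0%
GPT-4o Mini (temp=1) | 90 | 87 | 87 | 80 | 80 | 84.7%
Claude 3 Haiku | 93 | 90 | 87 | 77 | 73 | 84.0%
Arcee AI: Trinity Mini | 93 | 87 | 83 | 83 | 73 | 84.0%
Nemotron 3 Nano | 90 | 87 | 87 | 80 | 77 | 84.0%
Inception Mercury | 90 | 87 | 87 | 83 | 70 | 83.3%
GPT-4o Mini (temp=0) | 87 | 80 | 80 | 80 | 80 | 81.3%
GPT-5.4 Nano (Reasoning, Low) | 90 | 83 | 80 | 77 | 73 | 80.7%
Gemma 4 26B | 100 | 100 | 100 | 100 | 0 | 80.0%
ByteDance Seed 1.6 Flash | 93 | 80 | 77 | 73 | 57 | 76.0%
GPT-5.4 Nano | 83 | 80 | 77 | 70 | 67 | 75.3%
Hermes 3 70B | 87 | 83 | 77 | 67 | 60 | 74.7%
Gemma 3 4B | 80 | 77 | 73 | 73 | 63 | 73.3%
Cohere Command R+ (Aug. 2024) | 93 | 77 | 77 | 67 | 43 | 71.3%
Llama 3.1 Nemotron 70B | 90 | 83 | 83 | 83 | 10 | 70.0%
Llama 3.1 70B | 87 | 83 | 80 | 70 | 0 | 64.0%
GPT-4.1 Nano | 77 | 77 | 73 | 67 | 20 | 62.7%
Llama 3.1 8B | 73 | 60 | 57 | 33 | 0 | 44.7%
Rocinante 12B | 87 | 80 | 30 | 20 | 0 | 43.3%
Mistral NeMO | 73 | 73 | 70 | 0 | 0 | 43.3%
Skyfall 36B V2 | 97 | 77 | 37 | 0 | 0 | 42.0%
LFM2 24B | 20 | 20 | 20 | 20 | 20 | 20.0%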

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 95 | 99.1%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 95 | 99.1%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 95 | 99.1%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 95 | 99.1%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 95 | 99.1%
Aion 2.0 | 100 | 100 | 100 | 100 | 95 | 99.1%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 95 | 99.1%
GPT-5.5 | 100 | 100 | 100 | 100 | 95 | 99.1%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 95 | 99.1%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 95 | 99.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 95 | 99.1%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 95 | 99.1%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 95 | 99.1%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 95 | 99.1%
Ministral 8B | 100 | 100 | 100 | 100 | 95 | 99.1%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 95 | 95 | 98.2%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 95 | 95 | 98.2%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 95 | 95 | 98.2%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 95 | 95 | 98.2%
ByteDance Seed 1.6 | 100 | 100 | 100 | 95 | 95 | 98.2%
o4 Mini | 100 | 100 | 100 | 95 | 95 | 98.2%
Grok 4.20 (Beta) | 100 | 100 | 100 | 95 | 95 | 98.2%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 91 | 98.2%
Grok 4.3 | 100 | 100 | 100 | 95 | 95 | 98.2%
Mistral Medium 3.1 | 100 | 100 | 100 | 95 | 95 | 98.2%
o4 Mini High | 100 | 100 | 100 | 100 | 90 | 98.1%
GPT-5 Nano | 100 | 100 | 100 | 100 | 90 | 98.1%
Ministral 3B | 100 | 100 | 100 | 100 | 90 | 98.1%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 95 | 95 | 95 | 97.3%
GPT-5.1 | 100 | 100 | 95 | 95 | 95 | 97.3%
Z.AI GLM 5 | 100 | 100 | 95 | 95 | 95 | 97.3%
Gemini 2.5 Pro | 100 | 100 | 95 | 95 | 95 | 97.3%
Grok 4 Fast | 100 | 100 | 95 | 95 | 95 | 97.3%
DeepSeek-V2 Chat | 100 | 100 | 95 | 95 | 95 | 97.3%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 95 | 91 | 97.3%
Grok 4.20 | 100 | 100 | 100 | 95 | 91 | 97.3%
Qwen 2.5 72B | 100 | 100 | 100 | 95 | 91 | 97.3%
Mistral Small Creative | 100 | 100 | 100 | 95 | 91 | 97.3%
Qwen 3.5 9B | 100 | 100 | 100 | 95 | 91 | 97.2%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 86 | 97.2%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 86 | 97.2%
GPT-5 Mini | 100 | 95 | 95 | 95 | 95 | 96.4%
Qwen 3.5 122B | 100 | 95 | 95 | 95 | 95 | 96.4%
GPT-4.1 | 100 | 95 | 95 | 95 | 95 | 96.4%
ByteDance Seed 2.0 Mini | 100 | 100 | 95 | 95 | 91 | 96.4%
MiniMax M2.5 | 100 | 100 | 95 | 95 | 91 | 96.3%
WizardLM 2 8x22b | 100 | 100 | 95 | 95 | 91 | 96.3%
Mistral Large 3 | 95 | 95 | 95 | 95 | 95 | 95.5%
GPT-4o, May 13th (temp=1) | 95 | 95 | 95 | 95 | 95 | 95.5%
Mistral Large 2 | 95 | 95 | 95 | 95 | 95 | 95.5%
DeepSeek V3.1 | 95 | 95 | 95 | 95 | 95 | 95.5%
Mistral Large | 95 | 95 | 95 | 95 | 95 | 95.5%
Mistral Small 4 (Reasoning) | 95 | 95 | 95 | 95 | 95 | 95.4%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 95 | 90 | 90 | 95.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 90 | 86 | 95.2%
Nemotron 3 Super | 95 | 95 | 95 | 95 | 91 | 94.5%
GPT-5.4 | 95 | 95 | 95 | 95 | 91 | 94.5%
Inception Mercury 2 | 100 | 95 | 95 | 91 | 91 | 94.5%
Claude 3 Haiku | 100 | 100 | 95 | 91 | 86 | 94.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 95 | 91 | 86 | 94.5%
Stealth: Hunter Alpha | 95 | 95 | 95 | 95 | 90 | 94.5%
Gemini 2.5 Flash Lite | 100 | 95 | 95 | 90 | 90 | 94.3%
GPT-5.2 | 100 | 100 | 91 | 91 | 86 | 93.6%
Stealth: Healer Alpha | 95 | 95 | 95 | 91 | 91 | 93.6%
Claude Haiku 4.5 | 95 | 95 | 95 | 95 | 86 | 93.6%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 91 | 91 | 86 | 93.5%
Mistral Small 3.2 24B | 100 | 95 | 91 | 90 | 90 | 93.5%
Ministral 3 14B | 95 | 95 | 91 | 91 | 91 | 92.7%
Hermes 3 70B | 100 | 91 | 91 | 91 | 90 | 92.5%
Qwen 3 32B | 95 | 95 | 91 | 90 | 90 | 92.5%
Gemma 3 27B | 95 | 95 | 91 | 91 | 86 | 91.8%
Writer: Palmyra X5 | 91 | 91 | 91 | 91 | 91 | 90.7%
Qwen3 235B A22B Instruct 2507 | 91 | 91 | 91 | 90 | 90 | 90.6%
Llama 3.1 70B | 95 | 91 | 90 | 90 | 86 | 90.6%
GPT-4.1 Mini | 100 | 100 | 90 | 86 | 76 | 90.5%
GPT-5.4 Nano (Reasoning, Low) | 95 | 95 | 91 | 86 | 82 | 90.0%
Z.AI GLM 4.7 Flash | 100 | 95 | 86 | 86 | 81 | 89.7%
Nemotron 3 Nano | 95 | 90 | 90 | 86 | 86 | 89.5%
ByteDance Seed 1.6 Flash | 95 | 95 | 91 | 86 | 77 | 89.0%
Cohere Command R+ (Aug. 2024) | 95 | 91 | 90 | 86 | 76 | 87.8%
Z.AI GLM 4.5 Air | 91 | 91 | 90 | 86 | 77 | 87.1%
GPT-5.4 Nano | 95 | 91 | 82 | 82 | 82 | 86.4%
Inception Mercury | 95 | 95 | 86 | 77 | 67 | 84.2%
Mistral Small 4 | 95 | 91 | 91 | 76 | 62 | 83.2%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 91 | 82 | 43 | 83.1%
Gemma 3 4B | 90 | 86 | 86 | 76 | 76 | 83.0%
Ministral 3 3B | 90 | 81 | 81 | 81 | 81 | 82.9%
Qwen 3.6 27B | 100 | 100 | 100 | 95 | 14 | 81.9%
Arcee AI: Trinity Mini | 90 | 81 | 81 | 76 | 72 | 80.3%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 8B | 86 | 81 | 81 | 77 | 72 | 79.3%
GPT-4o Mini (temp=1) | 76 | 72 | 72 | 72 | 72 | 72.6%
Cydonia 24B V4.1 | 91 | 91 | 90 | 86 | 0 | 71.7%
GPT-4o Mini (temp=0) | 72 | 72 | 72 | 72 | 72 | 71.6%
GPT-4.1 Nano | 81 | 76 | 67 | 57 | 48 | 65.8%
Qwen 3.5 27B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemma 3 12B | 68 | 59 | 54 | 54 | 49 | 56.7%
Mistral NeMO | 90 | 90 | 81 | 0 | 0 | 52.4%
Skyfall 36B V2 | 86 | 62 | 48 | 48 | 0 | 48.6%
LFM2 24B | 29 | 29 | 29 | 29 | 29 | 28.6%
Rocinante 12B | 62 | 52 | 0 | 0 | 0 | 22.9%