Recall

Test: Codex Extraction

Avg. Score: 92.4%
Scenarios: 4

Overall Performance

| Rank | Model | Score | Avg. Cost | Avg. Time | Stability |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Flash Lite (Preview) | 99.7% | $0.0017 | 2.0s | 98% |
| 2 | Gemini 3 Flash (Preview) | 99.7% | $0.0027 | 3.9s | 97% |
| 3 | Qwen 3.5 Plus (2026-02-15) | 100.0% | $0.0030 | 10.6s | 100% |
| 4 | Mistral Small Creative | 98.6% | $0.0006 | 3.9s | 95% |
| 5 | Mistral Medium 3.1 | 98.5% | $0.0026 | 5.8s | 96% |
| 6 | Ministral 3 8B | 97.3% | $0.0006 | 3.3s | 94% |
| 7 | GPT-5.4 Mini | 97.6% | $0.0031 | 2.2s | 94% |
| 8 | Z.AI GLM 4.5 | 98.4% | $0.0028 | 16.8s | 96% |
| 9 | Grok 4 Fast | 97.3% | $0.0012 | 8.7s | 92% |
| 10 | Grok 4.1 Fast | 99.0% | $0.0017 | 22.1s | 96% |
| 11 | GPT-5.4 Mini (Reasoning, Low) | 97.2% | $0.0042 | 3.7s | 92% |
| 12 | Ministral 3 14B | 96.7% | $0.0009 | 6.0s | 90% |
| 13 | Gemini 2.5 Flash Lite | 95.1% | $0.0005 | 1.9s | 88% |
| 14 | Mistral Large 3 | 96.5% | $0.0027 | 8.2s | 91% |
| 15 | MiniMax M2.7 | 98.2% | $0.0022 | 21.7s | 95% |
| 16 | Ministral 8B | 95.1% | $0.0004 | 3.8s | 88% |
| 17 | Claude Haiku 4.5 | 97.5% | $0.0073 | 4.5s | 92% |
| 18 | Gemma 3 27B | 96.8% | $0.0005 | 14.5s | 91% |
| 19 | GPT-4.1 | 97.0% | $0.0073 | 4.7s | 91% |
| 20 | Gemini 2.5 Flash Lite (Reasoning) | 97.0% | $0.0021 | 15.6s | 91% |
| 21 | GPT-5.4 | 97.5% | $0.011 | 6.4s | 93% |
| 22 | Claude 3.7 Sonnet | 100.0% | $0.021 | 7.8s | 100% |
| 23 | Claude Sonnet 4.5 | 100.0% | $0.022 | 6.6s | 100% |
| 24 | Mistral Small 4 (Reasoning) | 96.1% | $0.0019 | 15.6s | 91% |
| 25 | Claude Sonnet 4 | 100.0% | $0.022 | 7.9s | 100% |
| 26 | Mistral Small 3.2 24B | 94.3% | $0.0005 | 4.4s | 86% |
| 27 | GPT-5.4 Nano (Reasoning) | 95.9% | $0.0023 | 12.2s | 89% |
| 28 | GPT-5.4 (Reasoning, Low) | 99.3% | $0.017 | 10.4s | 97% |
| 29 | Gemini 3 Flash (Preview, Reasoning) | 99.1% | $0.0096 | 22.2s | 96% |
| 30 | Grok 4.20 (Beta) | 95.3% | $0.0049 | 2.0s | 87% |
| 31 | Z.AI GLM 5 Turbo | 97.7% | $0.0068 | 16.0s | 92% |
| 32 | Claude Sonnet 4.6 | 99.8% | $0.022 | 7.4s | 99% |
| 33 | Grok 4.20 (Beta, Reasoning) | 99.3% | $0.020 | 12.7s | 97% |
| 34 | GPT-4o, Aug. 6th (temp=0) | 95.6% | $0.0100 | 4.0s | 87% |
| 35 | Mistral Large 2 | 96.3% | $0.011 | 8.3s | 89% |
| 36 | DeepSeek V3.2 | 97.6% | $0.0011 | 36.7s | 91% |
| 37 | GPT-4o, May 13th (temp=0) | 98.5% | $0.025 | 4.3s | 97% |
| 38 | DeepSeek V3 (2025-03-24) | 96.4% | $0.0014 | 27.0s | 87% |
| 39 | Stealth: Healer Alpha | 94.7% | $0.0000 | 24.6s | 86% |
| 40 | Z.AI GLM 4.6 | 98.1% | $0.0057 | 38.8s | 93% |
| 41 | DeepSeek-V2 Chat | 94.3% | $0.0019 | 14.8s | 83% |
| 42 | GPT-4o, Aug. 6th (temp=1) | 94.4% | $0.0092 | 3.8s | 85% |
| 43 | Inception Mercury 2 | 92.5% | $0.0022 | 3.5s | 80% |
| 44 | GPT-5.4 Mini (Reasoning) | 99.0% | $0.017 | 25.6s | 95% |
| 45 | o4 Mini | 97.7% | $0.014 | 21.5s | 92% |
| 46 | MiniMax M2.5 | 95.3% | $0.0023 | 30.1s | 87% |
| 47 | Gemini 2.5 Flash | 95.8% | $0.0023 | 2.5s | 74% |
| 48 | GPT-5.2 | 97.6% | $0.018 | 16.8s | 92% |
| 49 | Qwen 2.5 72B | 91.5% | $0.0008 | 11.0s | 81% |
| 50 | Claude 3 Haiku | 91.3% | $0.0017 | 5.5s | 80% |
| 51 | Stealth: Hunter Alpha | 94.8% | $0.0000 | 36.7s | 88% |
| 52 | Z.AI GLM 5 | 99.2% | $0.0091 | 51.4s | 97% |
| 53 | Ministral 3B | 90.7% | $0.0002 | 1.8s | 76% |
| 54 | Mistral Small 4 | 90.7% | $0.0008 | 3.2s | 77% |
| 55 | DeepSeek V3 (2024-12-26) | 92.8% | $0.0017 | 13.5s | 80% |
| 56 | WizardLM 2 8x22b | 94.7% | $0.0026 | 31.3s | 86% |
| 57 | GPT-5.4 Nano (Reasoning, Low) | 90.0% | $0.0011 | 3.7s | 78% |
| 58 | Hermes 3 405B | 93.0% | $0.0040 | 17.8s | 83% |
| 59 | GPT-4.1 Mini | 90.0% | $0.0015 | 6.6s | 78% |
| 60 | GPT-5 Mini | 97.3% | $0.0070 | 45.5s | 92% |
| 61 | Qwen 3.5 Flash | 96.9% | $0.0031 | 49.5s | 91% |
| 62 | Claude Opus 4.5 | 99.3% | $0.037 | 7.7s | 97% |
| 63 | GPT-4o, May 13th (temp=1) | 95.4% | $0.025 | 4.0s | 89% |
| 64 | Claude Opus 4.6 | 99.0% | $0.037 | 8.1s | 97% |
| 65 | Qwen 3.5 35B | 98.4% | $0.015 | 47.3s | 95% |
| 66 | Qwen 3 32B | 91.8% | $0.0010 | 37.9s | 86% |
| 67 | GPT-5.4 Nano | 87.7% | $0.0011 | 3.8s | 75% |
| 68 | Writer: Palmyra X5 | 88.9% | $0.0052 | 7.5s | 78% |
| 69 | Nemotron 3 Super | 96.1% | $0.0000 | 1.0m | 90% |
| 70 | o4 Mini High | 98.4% | $0.025 | 40.5s | 95% |
| 71 | Gemini 2.5 Pro | 98.4% | $0.034 | 22.8s | 95% |
| 72 | Grok 4 | 99.3% | $0.031 | 37.2s | 97% |
| 73 | Claude 3.5 Sonnet | 99.2% | $0.043 | 13.9s | 96% |
| 74 | Aion 2.0 | 98.5% | $0.0080 | 1.2m | 93% |
| 75 | Ministral 3 3B | 85.2% | $0.0004 | 1.8s | 69% |
| 76 | ByteDance Seed 2.0 Lite | 98.6% | $0.0078 | 1.4m | 95% |
| 77 | GPT-5 Nano | 97.2% | $0.0043 | 1.3m | 91% |
| 78 | Inception Mercury | 82.8% | $0.0006 | 3.9s | 70% |
| 79 | ByteDance Seed 1.6 | 97.2% | $0.0073 | 1.3m | 91% |
| 80 | Qwen 3.5 9B | 97.0% | $0.0013 | 1.5m | 91% |
| 81 | Arcee AI: Trinity Mini | 79.7% | $0.0003 | 5.8s | 69% |
| 82 | GPT-5.1 | 97.9% | $0.035 | 43.6s | 94% |
| 83 | Gemini 2.5 Flash (Reasoning) | 90.4% | $0.0082 | 11.8s | 65% |
| 84 | Z.AI GLM 4.7 Flash | 93.2% | $0.0019 | 1.2m | 83% |
| 85 | Claude Opus 4.6 (Reasoning) | 99.5% | $0.055 | 21.4s | 98% |
| 86 | Gemma 3 4B | 79.2% | $0.0002 | 6.7s | 66% |
| 87 | Qwen3 235B A22B Instruct 2507 | 87.5% | $0.0007 | 19.3s | 62% |
| 88 | ByteDance Seed 1.6 Flash | 86.4% | $0.0011 | 39.3s | 72% |
| 89 | Z.AI GLM 4.7 | 98.3% | $0.0098 | 1.7m | 93% |
| 90 | Hermes 3 70B | 81.2% | $0.0012 | 15.8s | 67% |
| 91 | GPT-4o Mini (temp=0) | 76.6% | $0.0005 | 7.9s | 68% |
| 92 | GPT-5.4 (Reasoning) | 99.8% | $0.044 | 53.6s | 99% |
| 93 | Claude Sonnet 4.6 (Reasoning) | 99.0% | $0.052 | 36.0s | 96% |
| 94 | DeepSeek V3.1 | 91.3% | $0.0012 | 26.1s | 55% |
| 95 | Mistral Large | 90.8% | $0.011 | 7.6s | 55% |
| 96 | Qwen 3.5 397B A17B | 97.6% | $0.012 | 1.8m | 93% |
| 97 | Gemini 3 Pro (Preview) | 99.1% | $0.055 | 35.6s | 94% |
| 98 | GPT-4o Mini (temp=1) | 75.3% | $0.0006 | 8.0s | 62% |
| 99 | Cohere Command R+ (Aug. 2024) | 81.8% | $0.014 | 17.9s | 64% |
| 100 | Llama 3.1 70B | 81.1% | $0.0016 | 16.0s | 52% |
| 101 | Llama 3.1 Nemotron 70B | 84.5% | $0.0050 | 17.3s | 50% |
| 102 | GPT-5 | 99.4% | $0.049 | 1.3m | 98% |
| 103 | Gemma 3 12B | 78.3% | $0.0003 | 14.1s | 47% |
| 104 | Arcee AI: Trinity Large (Preview) | 84.8% | $0.0000 | 20.5s | 41% |
| 105 | MoonshotAI: Kimi K2.5 | 98.1% | $0.013 | 2.4m | 93% |
| 106 | Nemotron 3 Nano | 84.3% | $0.0013 | 1.6m | 75% |
| 107 | GPT-4.1 Nano | 68.6% | $0.0003 | 2.8s | 50% |
| 108 | Claude Opus 4 | 100.0% | $0.110 | 13.5s | 100% |
| 109 | ByteDance Seed 2.0 Mini | 97.4% | $0.0034 | 3.4m | 93% |
| 110 | Llama 3.1 8B | 66.4% | $0.0001 | 10.0s | 39% |
| 111 | Gemini 3.1 Pro (Preview) | 95.2% | $0.065 | 1.0m | 67% |
| 112 | Qwen 3.5 27B | 89.2% | $0.021 | 1.6m | 40% |
| 113 | LFM2 24B | 35.7% | $0.0002 | 12.5s | 14% |
| 114 | Rocinante 12B | 40.3% | $0.0013 | 25.6s | 13% |
| 115 | Qwen 3.5 122B | 98.0% | $0.079 | 3.6m | 95% |
| 116 | Mistral NeMO | 23.9% | $0.0006 | 1.4s | 0% |

Average score: 92.44%

Individual Scenarios

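Scenario 1

| Model | #1 | #2 | #3 | #4 | #5 | Avg |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 98 | 99.6% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 98 | 99.6% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 98 | 99.6% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 98 | 99.6% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.6% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 98 | 99.5% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 98 | 99.5% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 98 | 99.5% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 98 | 98 | 99.2% |
| Qwen 3.5 122B | 100 | 100 | 100 | 98 | 98 | 99.2% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 98 | 98 | 99.2% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 98 | 98 | 99.1% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 94 | 98.8% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 98 | 98 | 98 | 98.8% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 98 | 96 | 98.8% |
| Qwen 3.5 9B | 100 | 100 | 100 | 98 | 95 | 98.7% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 93 | 98.7% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 98 | 95 | 98.6% |
| GPT-5.1 | 100 | 100 | 98 | 98 | 96 | 98.5% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 98 | 94 | 98.5% |
| Aion 2.0 | 100 | 100 | 100 | 98 | 94 | 98.4% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 96 | 96 | 98.3% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 98 | 93 | 98.3% |
| Qwen 3.5 Flash | 100 | 100 | 98 | 98 | 95 | 98.3% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 96 | 95 | 98.3% |
| o4 Mini High | 100 | 100 | 100 | 95 | 95 | 98.1% |
| Ministral 3 8B | 100 | 100 | 100 | 95 | 95 | 98.1% |
| GPT-5 | 100 | 98 | 98 | 98 | 96 | 98.1% |
| Gemini 2.5 Flash (Reasoning) | 100 | 98 | 98 | 98 | 96 | 98.1% |
| GPT-5.4 | 100 | 98 | 98 | 98 | 96 | 98.1% |
| Qwen 3.5 35B | 100 | 100 | 96 | 96 | 96 | 97.7% |
| Claude 3 Haiku | 100 | 100 | 98 | 95 | 95 | 97.7% |
| Ministral 3 14B | 100 | 100 | 98 | 96 | 93 | 97.5% |
| Gemma 3 27B | 98 | 98 | 98 | 98 | 95 | 97.5% |
| o4 Mini | 100 | 100 | 95 | 95 | 95 | 97.2% |
| GPT-5.4 Mini | 100 | 100 | 95 | 95 | 93 | 96.8% |
| Ministral 3 3B | 100 | 98 | 98 | 95 | 93 | 96.8% |
| GPT-4o, May 13th (temp=0) | 98 | 98 | 98 | 95 | 95 | 96.7% |
| GPT-4.1 | 98 | 98 | 96 | 96 | 94 | 96.5% |
| Mistral Small 4 (Reasoning) | 100 | 98 | 96 | 94 | 94 | 96.5% |
| Gemini 2.5 Flash Lite | 100 | 98 | 95 | 95 | 93 | 96.4% |
| Ministral 3B | 100 | 100 | 100 | 93 | 88 | 96.3% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 95 | 93 | 92 | 96.2% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 95 | 94 | 91 | 96.1% |
| Mistral Small 4 | 100 | 98 | 95 | 95 | 91 | 96.0% |
| MiniMax M2.5 | 100 | 95 | 95 | 95 | 93 | 95.9% |
| Stealth: Healer Alpha | 100 | 96 | 96 | 93 | 93 | 95.7% |
| GPT-4o, May 13th (temp=1) | 98 | 98 | 98 | 98 | 87 | 95.5% |
| GPT-5.4 Nano (Reasoning) | 98 | 96 | 96 | 94 | 94 | 95.4% |
| Llama 3.1 Nemotron 70B | 98 | 96 | 96 | 95 | 91 | 95.2% |
| DeepSeek-V2 Chat | 100 | 98 | 93 | 93 | 91 | 94.9% |
| Ministral 8B | 98 | 95 | 95 | 95 | 91 | 94.9% |
| Nemotron 3 Super | 98 | 96 | 93 | 93 | 93 | 94.7% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 93 | 93 | 93 | 93 | 94.4% |
| WizardLM 2 8x22b | 100 | 100 | 95 | 95 | 79 | 94.0% |
| GPT-5.4 Nano | 100 | 96 | 94 | 93 | 86 | 93.8% |
| GPT-4.1 Mini | 100 | 95 | 93 | 91 | 88 | 93.6% |
| Mistral Small 3.2 24B | 95 | 95 | 95 | 91 | 91 | 93.5% |
| DeepSeek V3 (2024-12-26) | 100 | 95 | 93 | 91 | 88 | 93.5% |
| Qwen3 235B A22B Instruct 2507 | 95 | 95 | 93 | 93 | 88 | 93.0% |
| GPT-4o, Aug. 6th (temp=1) | 95 | 93 | 93 | 91 | 91 | 92.6% |
| Qwen 2.5 72B | 98 | 95 | 93 | 88 | 88 | 92.6% |
| Gemma 3 12B | 95 | 93 | 91 | 91 | 89 | 91.9% |
| ByteDance Seed 1.6 Flash | 96 | 94 | 90 | 88 | 86 | 90.9% |
| Hermes 3 405B | 95 | 95 | 91 | 84 | 84 | 89.8% |
| Cohere Command R+ (Aug. 2024) | 96 | 91 | 91 | 86 | 84 | 89.6% |
| Writer: Palmyra X5 | 95 | 95 | 93 | 91 | 70 | 88.8% |
| Qwen 3 32B | 93 | 91 | 86 | 86 | 86 | 88.5% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 40 | 87.9% |
| Inception Mercury 2 | 91 | 91 | 86 | 84 | 82 | 86.7% |
| Gemma 3 4B | 88 | 86 | 86 | 85 | 82 | 85.5% |
| Llama 3.1 70B | 98 | 96 | 81 | 79 | 70 | 84.9% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 98 | 98 | 23 | 83.9% |
| Nemotron 3 Nano | 84 | 84 | 84 | 82 | 79 | 82.4% |
| Inception Mercury | 86 | 84 | 81 | 79 | 78 | 81.6% |
| GPT-4o Mini (temp=0) | 86 | 84 | 79 | 79 | 74 | 80.5% |
| Arcee AI: Trinity Mini | 86 | 86 | 79 | 77 | 72 | 80.1% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 0 | 80.0% |
| Hermes 3 70B | 88 | 81 | 79 | 77 | 72 | 79.5% |
| GPT-4o Mini (temp=1) | 81 | 79 | 79 | 67 | 65 | 74.5% |
| LFM2 24B | 77 | 77 | 75 | 75 | 65 | 73.6% |
| Arcee AI: Trinity Large (Preview) | 95 | 95 | 88 | 88 | 0 | 73.5% |
| GPT-4.1 Nano | 85 | 77 | 75 | 68 | 58 | 72.8% |
| Llama 3.1 8B | 84 | 79 | 77 | 75 | 5 | 64.0% |
| Rocinante 12B | 87 | 86 | 74 | 0 | 0 | 49.5% |
| Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0% |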
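
Scenario 2

| Model | #1 | #2 | #3 | #4 | #5 | Avg |
|---|---|---|---|---|---|---|
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 97 | 99.3% |
| o4 Mini | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Inception Mercury 2 | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 97 | 99.3% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 96 | 99.3% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 96 | 99.3% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 96 | 99.3% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 96 | 99.3% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 97 | 97 | 98.6% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 97 | 97 | 98.6% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 97 | 97 | 98.6% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 97 | 97 | 98.6% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 97 | 97 | 98.6% |
| Mistral Small Creative | 100 | 100 | 100 | 97 | 97 | 98.6% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 96 | 96 | 98.6% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 96 | 96 | 98.6% |
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 97 | 97 | 97 | 97.9% |
| Claude Opus 4.6 | 100 | 100 | 97 | 97 | 97 | 97.9% |
| Grok 4 | 100 | 100 | 97 | 97 | 97 | 97.9% |
| Z.AI GLM 4.5 | 100 | 100 | 97 | 97 | 97 | 97.9% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 90 | 97.9% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 90 | 97.9% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 90 | 97.9% |
| GPT-5.4 Mini | 100 | 100 | 100 | 100 | 90 | 97.9% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 97 | 97 | 96 | 97.9% |
| GPT-5 Mini | 100 | 100 | 97 | 96 | 96 | 97.9% |
| Claude Opus 4.5 | 100 | 97 | 97 | 97 | 97 | 97.2% |
| Mistral Large 3 | 100 | 97 | 97 | 97 | 97 | 97.2% |
| Nemotron 3 Super | 100 | 100 | 100 | 97 | 90 | 97.2% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 86 | 97.2% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 86 | 97.2% |
| ByteDance Seed 2.0 Mini | 100 | 97 | 97 | 97 | 97 | 97.2% |
| Claude 3.5 Sonnet | 100 | 100 | 97 | 97 | 93 | 97.2% |
| Ministral 3 14B | 100 | 100 | 97 | 96 | 93 | 97.2% |
| Grok 4.1 Fast | 100 | 97 | 97 | 97 | 93 | 96.6% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 97 | 86 | 96.6% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 97 | 86 | 96.6% |
| Claude Haiku 4.5 | 97 | 97 | 97 | 97 | 97 | 96.6% |
| GPT-5 Nano | 97 | 97 | 97 | 97 | 97 | 96.6% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 90 | 90 | 95.9% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 90 | 90 | 95.9% |
| Mistral Large 2 | 100 | 100 | 100 | 90 | 90 | 95.9% |
| Mistral Medium 3.1 | 97 | 97 | 97 | 97 | 93 | 95.9% |
| Ministral 3 8B | 100 | 100 | 93 | 93 | 93 | 95.9% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 93 | 93 | 93 | 95.8% |
| Gemini 2.5 Flash Lite | 100 | 100 | 97 | 96 | 86 | 95.8% |
| Qwen 3.5 Flash | 100 | 97 | 97 | 97 | 89 | 95.8% |
| MiniMax M2.7 | 100 | 97 | 93 | 93 | 93 | 95.2% |
| Grok 4 Fast | 97 | 97 | 97 | 93 | 93 | 95.2% |
| ByteDance Seed 1.6 | 100 | 97 | 97 | 96 | 86 | 95.1% |
| GPT-5.4 Nano | 97 | 97 | 96 | 93 | 93 | 95.1% |
| Ministral 8B | 97 | 97 | 96 | 93 | 93 | 95.1% |
| DeepSeek V3.1 | 100 | 100 | 97 | 90 | 86 | 94.5% |
| WizardLM 2 8x22b | 100 | 97 | 97 | 90 | 90 | 94.5% |
| DeepSeek V3.2 | 100 | 100 | 93 | 90 | 90 | 94.5% |
| Mistral Large | 100 | 97 | 96 | 90 | 90 | 94.5% |
| Gemini 2.5 Flash (Reasoning) | 96 | 96 | 96 | 93 | 89 | 94.3% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 97 | 97 | 90 | 86 | 93.8% |
| Z.AI GLM 5 Turbo | 100 | 97 | 97 | 86 | 86 | 93.1% |
| Mistral Small 4 | 100 | 93 | 93 | 90 | 90 | 93.1% |
| GPT-5.4 Nano (Reasoning, Low) | 97 | 96 | 96 | 90 | 86 | 93.0% |
| Qwen 3 32B | 97 | 97 | 93 | 93 | 83 | 92.4% |
| MiniMax M2.5 | 97 | 97 | 93 | 86 | 86 | 91.7% |
| Stealth: Hunter Alpha | 97 | 97 | 93 | 90 | 83 | 91.7% |
| Stealth: Healer Alpha | 100 | 97 | 90 | 86 | 83 | 91.0% |
| Mistral Small 3.2 24B | 93 | 90 | 90 | 90 | 90 | 90.3% |
| ByteDance Seed 1.6 Flash | 100 | 90 | 90 | 86 | 83 | 89.6% |
| Claude 3 Haiku | 93 | 93 | 86 | 86 | 86 | 89.0% |
| Hermes 3 405B | 93 | 93 | 90 | 83 | 83 | 88.3% |
| Grok 4.20 (Beta) | 97 | 90 | 86 | 86 | 83 | 88.3% |
| DeepSeek V3 (2025-03-24) | 100 | 90 | 86 | 83 | 79 | 87.6% |
| DeepSeek-V2 Chat | 90 | 90 | 90 | 90 | 72 | 86.2% |
| Qwen 2.5 72B | 93 | 86 | 86 | 83 | 79 | 85.5% |
| Llama 3.1 70B | 86 | 86 | 86 | 86 | 79 | 84.8% |
| GPT-4.1 Mini | 93 | 83 | 83 | 83 | 79 | 84.1% |
| Writer: Palmyra X5 | 93 | 83 | 83 | 79 | 79 | 83.4% |
| DeepSeek V3 (2024-12-26) | 90 | 90 | 83 | 76 | 72 | 82.1% |
| Inception Mercury | 93 | 89 | 83 | 76 | 69 | 81.9% |
| Nemotron 3 Nano | 90 | 83 | 79 | 79 | 76 | 81.4% |
| Cohere Command R+ (Aug. 2024) | 93 | 86 | 76 | 69 | 69 | 78.5% |
| Ministral 3B | 86 | 79 | 76 | 76 | 72 | 77.9% |
| Hermes 3 70B | 83 | 83 | 79 | 72 | 72 | 77.9% |
| Llama 3.1 8B | 83 | 82 | 76 | 76 | 72 | 77.8% |
| Gemma 3 12B | 96 | 96 | 93 | 93 | 0 | 75.8% |
| Gemma 3 4B | 76 | 76 | 76 | 76 | 72 | 75.0% |
| Arcee AI: Trinity Mini | 79 | 76 | 72 | 72 | 72 | 74.5% |
| Llama 3.1 Nemotron 70B | 86 | 86 | 83 | 79 | 34 | 73.8% |
| GPT-4o Mini (temp=0) | 76 | 76 | 76 | 69 | 69 | 73.1% |
| Ministral 3 3B | 79 | 72 | 72 | 72 | 69 | 73.1% |
| GPT-4.1 Nano | 86 | 86 | 69 | 66 | 59 | 73.1% |
| Qwen3 235B A22B Instruct 2507 | 93 | 83 | 83 | 79 | 21 | 71.7% |
| Arcee AI: Trinity Large (Preview) | 97 | 90 | 86 | 83 | 0 | 71.0% |
| GPT-4o Mini (temp=1) | 72 | 72 | 69 | 69 | 66 | 69.7% |
| Rocinante 12B | 72 | 72 | 48 | 34 | 0 | 45.4% |
| LFM2 24B | 21 | 21 | 21 | 21 | 21 | 20.7% |
| Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0% |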
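
Scenario 3

| Model | #1 | #2 | #3 | #4 | #5 | Avg |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 97 | 99.3% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.3% |
| GPT-5 | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Grok 4 | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Ministral 3 14B | 100 | 100 | 100 | 100 | 97 | 99.3% |
| Qwen 3.5 122B | 100 | 100 | 100 | 97 | 97 | 98.7% |
| GPT-5.2 | 100 | 100 | 100 | 97 | 97 | 98.7% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 93 | 98.7% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 93 | 98.7% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 97 | 97 | 98.7% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 97 | 97 | 98.7% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 93 | 98.7% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 93 | 98.7% |
| Mistral Small Creative | 100 | 100 | 100 | 97 | 97 | 98.7% |
| Claude Opus 4.6 | 100 | 100 | 97 | 97 | 97 | 98.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 97 | 97 | 97 | 98.0% |
| Qwen 3.5 27B | 100 | 100 | 97 | 97 | 97 | 98.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 97 | 97 | 97 | 98.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 97 | 93 | 98.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 90 | 98.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 97 | 93 | 98.0% |
| GPT-5.4 | 100 | 100 | 100 | 97 | 93 | 98.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 97 | 97 | 93 | 97.3% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 93 | 93 | 97.3% |
| o4 Mini High | 100 | 97 | 97 | 97 | 97 | 97.3% |
| Z.AI GLM 4.6 | 100 | 100 | 97 | 97 | 93 | 97.3% |
| MiniMax M2.5 | 100 | 100 | 100 | 97 | 90 | 97.3% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 93 | 93 | 97.3% |
| Grok 4 Fast | 100 | 100 | 100 | 97 | 90 | 97.3% |
| Z.AI GLM 4.5 | 100 | 97 | 97 | 97 | 97 | 97.3% |
| GPT-4o, May 13th (temp=0) | 100 | 97 | 97 | 97 | 97 | 97.3% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 97 | 87 | 96.7% |
| GPT-5.4 Mini | 100 | 97 | 97 | 97 | 93 | 96.7% |
| Gemini 2.5 Flash | 100 | 100 | 97 | 93 | 93 | 96.7% |
| Qwen 3.5 35B | 100 | 100 | 97 | 97 | 90 | 96.7% |
| GPT-5.1 | 97 | 97 | 97 | 97 | 93 | 96.0% |
| GPT-4.1 | 100 | 100 | 97 | 93 | 90 | 96.0% |
| o4 Mini | 97 | 97 | 97 | 97 | 93 | 96.0% |
| Grok 4.20 (Beta) | 97 | 97 | 97 | 97 | 93 | 96.0% |
| GPT-5 Mini | 100 | 97 | 93 | 93 | 93 | 95.3% |
| ByteDance Seed 1.6 | 100 | 97 | 97 | 93 | 90 | 95.3% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 97 | 93 | 93 | 93 | 95.3% |
| GPT-5 Nano | 97 | 97 | 97 | 93 | 93 | 95.3% |
| DeepSeek V3.1 | 100 | 97 | 97 | 93 | 90 | 95.3% |
| Ministral 3 8B | 100 | 97 | 97 | 93 | 90 | 95.3% |
| Stealth: Hunter Alpha | 97 | 97 | 93 | 93 | 93 | 94.7% |
| GPT-4o, May 13th (temp=1) | 97 | 97 | 93 | 93 | 93 | 94.7% |
| Qwen3 235B A22B Instruct 2507 | 97 | 97 | 97 | 93 | 90 | 94.7% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 97 | 97 | 90 | 87 | 94.0% |
| Hermes 3 405B | 97 | 97 | 97 | 93 | 87 | 94.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 97 | 93 | 90 | 90 | 94.0% |
| GPT-4o, Aug. 6th (temp=0) | 97 | 93 | 93 | 93 | 93 | 94.0% |
| Mistral Large 2 | 97 | 93 | 93 | 93 | 93 | 94.0% |
| Mistral Small 4 (Reasoning) | 97 | 97 | 93 | 93 | 90 | 94.0% |
| Qwen 3 32B | 97 | 97 | 93 | 93 | 90 | 94.0% |
| Gemini 2.5 Flash Lite | 97 | 97 | 97 | 93 | 87 | 94.0% |
| WizardLM 2 8x22b | 97 | 97 | 97 | 93 | 87 | 94.0% |
| Qwen 3.5 Flash | 97 | 97 | 93 | 90 | 90 | 93.3% |
| Mistral Large 3 | 93 | 93 | 93 | 93 | 93 | 93.3% |
| Mistral Large | 93 | 93 | 93 | 93 | 93 | 93.3% |
| Writer: Palmyra X5 | 97 | 97 | 93 | 90 | 87 | 92.7% |
| Qwen 3.5 397B A17B | 97 | 97 | 90 | 90 | 90 | 92.7% |
| GPT-5.4 Nano (Reasoning) | 97 | 93 | 93 | 90 | 87 | 92.0% |
| Qwen 3.5 9B | 97 | 97 | 93 | 90 | 83 | 92.0% |
| GPT-4.1 Mini | 97 | 93 | 93 | 90 | 87 | 92.0% |
| Z.AI GLM 4.7 Flash | 97 | 93 | 93 | 90 | 83 | 91.3% |
| Ministral 8B | 97 | 93 | 90 | 90 | 87 | 91.3% |
| Mistral Small 4 | 97 | 93 | 93 | 87 | 83 | 90.7% |
| Ministral 3B | 97 | 93 | 93 | 87 | 83 | 90.7% |
| Qwen 2.5 72B | 97 | 93 | 90 | 87 | 87 | 90.7% |
| Inception Mercury 2 | 100 | 93 | 87 | 87 | 80 | 89.3% |
| Gemma 3 12B | 93 | 90 | 87 | 87 | 87 | 88.7% |
| Ministral 3 3B | 93 | 90 | 87 | 87 | 83 | 88.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 87 | 43 | 86.0% |
| GPT-4o Mini (temp=1) | 90 | 87 | 87 | 80 | 80 | 84.7% |
| Nemotron 3 Nano | 90 | 87 | 87 | 80 | 77 | 84.0% |
| Arcee AI: Trinity Mini | 93 | 87 | 83 | 83 | 73 | 84.0% |
| Claude 3 Haiku | 93 | 90 | 87 | 77 | 73 | 84.0% |
| Inception Mercury | 90 | 87 | 87 | 83 | 70 | 83.3% |
| GPT-4o Mini (temp=0) | 87 | 80 | 80 | 80 | 80 | 81.3% |
| GPT-5.4 Nano (Reasoning, Low) | 90 | 83 | 80 | 77 | 73 | 80.7% |
| ByteDance Seed 1.6 Flash | 93 | 80 | 77 | 73 | 57 | 76.0% |
| GPT-5.4 Nano | 83 | 80 | 77 | 70 | 67 | 75.3% |
| Hermes 3 70B | 87 | 83 | 77 | 67 | 60 | 74.7% |
| Gemma 3 4B | 80 | 77 | 73 | 73 | 63 | 73.3% |
| Cohere Command R+ (Aug. 2024) | 93 | 77 | 77 | 67 | 43 | 71.3% |
| Llama 3.1 Nemotron 70B | 90 | 83 | 83 | 83 | 10 | 70.0% |
| Llama 3.1 70B | 87 | 83 | 80 | 70 | 0 | 64.0% |
| GPT-4.1 Nano | 77 | 77 | 73 | 67 | 20 | 62.7% |
| Llama 3.1 8B | 73 | 60 | 57 | 33 | 0 | 44.7% |
| Rocinante 12B | 87 | 80 | 30 | 20 | 0 | 43.3% |
| Mistral NeMO | 73 | 73 | 70 | 0 | 0 | 43.3% |
| LFM2 24B | 20 | 20 | 20 | 20 | 20 | 20.0% |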
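
Scenario 4

| Model | #1 | #2 | #3 | #4 | #5 | Avg |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 95 | 99.1% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 95 | 99.1% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 95 | 99.1% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 95 | 99.1% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 95 | 99.1% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 95 | 99.1% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 95 | 99.1% |
| GPT-5.4 Mini | 100 | 100 | 100 | 100 | 95 | 99.1% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 95 | 99.1% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 95 | 99.1% |
| Ministral 8B | 100 | 100 | 100 | 100 | 95 | 99.1% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 95 | 95 | 98.2% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 95 | 95 | 98.2% |
| o4 Mini | 100 | 100 | 100 | 95 | 95 | 98.2% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 95 | 95 | 98.2% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 91 | 98.2% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 95 | 95 | 98.2% |
| o4 Mini High | 100 | 100 | 100 | 100 | 90 | 98.1% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 90 | 98.1% |
| Ministral 3B | 100 | 100 | 100 | 100 | 90 | 98.1% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 95 | 95 | 95 | 97.3% |
| GPT-5.1 | 100 | 100 | 95 | 95 | 95 | 97.3% |
| Z.AI GLM 5 | 100 | 100 | 95 | 95 | 95 | 97.3% |
| Gemini 2.5 Pro | 100 | 100 | 95 | 95 | 95 | 97.3% |
| Grok 4 Fast | 100 | 100 | 95 | 95 | 95 | 97.3% |
| DeepSeek-V2 Chat | 100 | 100 | 95 | 95 | 95 | 97.3% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 95 | 91 | 97.3% |
| Qwen 2.5 72B | 100 | 100 | 100 | 95 | 91 | 97.3% |
| Mistral Small Creative | 100 | 100 | 100 | 95 | 91 | 97.3% |
| Qwen 3.5 9B | 100 | 100 | 100 | 95 | 91 | 97.2% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 86 | 97.2% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 86 | 97.2% |
| GPT-5 Mini | 100 | 95 | 95 | 95 | 95 | 96.4% |
| Qwen 3.5 122B | 100 | 95 | 95 | 95 | 95 | 96.4% |
| GPT-4.1 | 100 | 95 | 95 | 95 | 95 | 96.4% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 95 | 95 | 91 | 96.4% |
| MiniMax M2.5 | 100 | 100 | 95 | 95 | 91 | 96.3% |
| WizardLM 2 8x22b | 100 | 100 | 95 | 95 | 91 | 96.3% |
| Mistral Large 3 | 95 | 95 | 95 | 95 | 95 | 95.5% |
| GPT-4o, May 13th (temp=1) | 95 | 95 | 95 | 95 | 95 | 95.5% |
| Mistral Large 2 | 95 | 95 | 95 | 95 | 95 | 95.5% |
| DeepSeek V3.1 | 95 | 95 | 95 | 95 | 95 | 95.5% |
| Mistral Large | 95 | 95 | 95 | 95 | 95 | 95.5% |
| Mistral Small 4 (Reasoning) | 95 | 95 | 95 | 95 | 95 | 95.4% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 95 | 90 | 90 | 95.3% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 90 | 86 | 95.2% |
| Nemotron 3 Super | 95 | 95 | 95 | 95 | 91 | 94.5% |
| GPT-5.4 | 95 | 95 | 95 | 95 | 91 | 94.5% |
| Claude 3 Haiku | 100 | 100 | 95 | 91 | 86 | 94.5% |
| Inception Mercury 2 | 100 | 95 | 95 | 91 | 91 | 94.5% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 95 | 91 | 86 | 94.5% |
| Stealth: Hunter Alpha | 95 | 95 | 95 | 95 | 90 | 94.5% |
| Gemini 2.5 Flash Lite | 100 | 95 | 95 | 90 | 90 | 94.3% |
| GPT-5.2 | 100 | 100 | 91 | 91 | 86 | 93.6% |
| Stealth: Healer Alpha | 95 | 95 | 95 | 91 | 91 | 93.6% |
| Claude Haiku 4.5 | 95 | 95 | 95 | 95 | 86 | 93.6% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 91 | 91 | 86 | 93.5% |
| Mistral Small 3.2 24B | 100 | 95 | 91 | 90 | 90 | 93.5% |
| Ministral 3 14B | 95 | 95 | 91 | 91 | 91 | 92.7% |
| Hermes 3 70B | 100 | 91 | 91 | 91 | 90 | 92.5% |
| Qwen 3 32B | 95 | 95 | 91 | 90 | 90 | 92.5% |
| Gemma 3 27B | 95 | 95 | 91 | 91 | 86 | 91.8% |
| Writer: Palmyra X5 | 91 | 91 | 91 | 91 | 91 | 90.7% |
| Qwen3 235B A22B Instruct 2507 | 91 | 91 | 91 | 90 | 90 | 90.6% |
| Llama 3.1 70B | 95 | 91 | 90 | 90 | 86 | 90.6% |
| GPT-4.1 Mini | 100 | 100 | 90 | 86 | 76 | 90.5% |
| GPT-5.4 Nano (Reasoning, Low) | 95 | 95 | 91 | 86 | 82 | 90.0% |
| Z.AI GLM 4.7 Flash | 100 | 95 | 86 | 86 | 81 | 89.7% |
| Nemotron 3 Nano | 95 | 90 | 90 | 86 | 86 | 89.5% |
| ByteDance Seed 1.6 Flash | 95 | 95 | 91 | 86 | 77 | 89.0% |
| Cohere Command R+ (Aug. 2024) | 95 | 91 | 90 | 86 | 76 | 87.8% |
| GPT-5.4 Nano | 95 | 91 | 82 | 82 | 82 | 86.4% |
| Inception Mercury | 95 | 95 | 86 | 77 | 67 | 84.2% |
| Mistral Small 4 | 95 | 91 | 91 | 76 | 62 | 83.2% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 91 | 82 | 43 | 83.1% |
| Gemma 3 4B | 90 | 86 | 86 | 76 | 76 | 83.0% |
| Ministral 3 3B | 90 | 81 | 81 | 81 | 81 | 82.9% |
| Arcee AI: Trinity Mini | 90 | 81 | 81 | 76 | 72 | 80.3% |
| Llama 3.1 8B | 86 | 81 | 81 | 77 | 72 | 79.3% |
| GPT-4o Mini (temp=1) | 76 | 72 | 72 | 72 | 72 | 72.6% |
| GPT-4o Mini (temp=0) | 72 | 72 | 72 | 72 | 72 | 71.6% |
| GPT-4.1 Nano | 81 | 76 | 67 | 57 | 48 | 65.8% |
| Qwen 3.5 27B | 100 | 100 | 100 | 0 | 0 | 60.0% |
| Gemma 3 12B | 68 | 59 | 54 | 54 | 49 | 56.7% |
| Mistral NeMO | 90 | 90 | 81 | 0 | 0 | 52.4% |
| LFM2 24B | 29 | 29 | 29 | 29 | 29 | 28.6% |
| Rocinante 12B | 62 | 52 | 0 | 0 | 0 | 22.9% |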