Accuracy

Test: Codex Extraction

Avg. Score: 81.4%
Scenarios: 4

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 3 Flash (Preview) | 90.9% | $0.0027 | 3.9s | 85%
2 | Z.AI GLM 5 Turbo | 92.2% | $0.0068 | 16.0s | 87%
3 | Xiaomi MIMO v2.5 | 90.7% | $0.0034 | 13.4s | 84%
4 | Grok 4 Fast | 89.1% | $0.0012 | 8.7s | 80%
5 | Gemini 3.5 Flash (Reasoning, Minimal) | 91.8% | $0.010 | 3.1s | 82%
6 | Qwen 3.5 Plus (2026-02-15) | 90.4% | $0.0030 | 10.6s | 81%
7 | Gemini 3 Flash (Preview, Reasoning) | 93.5% | $0.0096 | 22.2s | 86%
8 | DeepSeek V4 Flash | 87.1% | $0.0003 | 7.7s | 80%
9 | Mistral Small Creative | 87.4% | $0.0006 | 3.9s | 77%
10 | Stealth: Healer Alpha | 89.3% | $0.0000 | 24.6s | 83%
11 | Grok 4.20 (Beta) | 87.2% | $0.0049 | 2.0s | 78%
12 | Stealth: Hunter Alpha | 90.8% | $0.0000 | 36.7s | 84%
13 | Mistral Medium 3.1 | 87.8% | $0.0026 | 5.8s | 76%
14 | Claude Haiku 4.5 | 86.8% | $0.0073 | 4.5s | 80%
15 | Grok 4.1 Fast | 89.5% | $0.0017 | 22.1s | 79%
16 | Mistral Large 3 | 87.8% | $0.0027 | 8.2s | 76%
17 | Grok 4.20 | 86.4% | $0.0048 | 4.9s | 77%
18 | Xiaomi MIMO v2.5 Pro | 89.4% | $0.0048 | 18.7s | 80%
19 | Claude Sonnet 4.6 | 91.4% | $0.022 | 7.4s | 86%
20 | DeepSeek-V2 Chat | 86.5% | $0.0019 | 14.8s | 76%
21 | DeepSeek V4 Flash (Reasoning) | 89.7% | $0.0009 | 37.0s | 81%
22 | Gemini 2.5 Flash | 82.8% | $0.0023 | 2.5s | 75%
23 | Z.AI GLM 4.5 | 86.9% | $0.0028 | 16.8s | 77%
24 | Gemini 2.5 Flash (Reasoning) | 86.6% | $0.0082 | 11.8s | 79%
25 | Mistral Large 2 | 88.0% | $0.011 | 8.3s | 78%
26 | Claude Opus 4.6 | 95.1% | $0.037 | 8.1s | 90%
27 | DeepSeek V4 Pro | 84.7% | $0.0021 | 16.0s | 76%
28 | DeepSeek V3 (2024-12-26) | 84.0% | $0.0017 | 13.5s | 76%
29 | Grok 4.20 (Reasoning) | 91.5% | $0.012 | 37.2s | 86%
30 | Claude Opus 4.5 | 94.9% | $0.037 | 7.7s | 89%
31 | Grok 4.20 (Beta, Reasoning) | 90.1% | $0.020 | 12.7s | 80%
32 | Gemma 4 31B | 85.5% | $0.0008 | 25.9s | 75%
33 | Gemini 2.5 Flash Lite (Reasoning) | 84.3% | $0.0021 | 15.6s | 72%
34 | MiniMax M2.7 | 85.8% | $0.0022 | 21.7s | 73%
35 | Gemini 3.1 Flash Lite (Reasoning) | 79.8% | $0.0018 | 2.0s | 71%
36 | DeepSeek V3.2 | 86.4% | $0.0011 | 36.7s | 76%
37 | Writer: Palmyra X5 | 80.9% | $0.0052 | 7.5s | 74%
38 | Qwen 3.5 Flash | 88.5% | $0.0031 | 49.5s | 81%
39 | Z.AI GLM 4.6 | 88.3% | $0.0057 | 38.8s | 78%
40 | Mistral Small 3.2 24B | 78.9% | $0.0005 | 4.4s | 70%
41 | Inception Mercury 2 | 80.2% | $0.0022 | 3.5s | 70%
42 | Z.AI GLM 5 | 91.1% | $0.0091 | 51.4s | 83%
43 | Ministral 3 14B | 80.7% | $0.0009 | 6.0s | 68%
44 | Gemini 3.1 Flash Lite (Preview) | 78.5% | $0.0017 | 2.0s | 69%
45 | GPT-5.4 Mini (Reasoning) | 88.3% | $0.017 | 25.6s | 80%
46 | MiniMax M2.5 | 84.9% | $0.0023 | 30.1s | 73%
47 | Claude Sonnet 4.5 | 87.9% | $0.022 | 6.6s | 75%
48 | Grok 4 | 94.3% | $0.031 | 37.2s | 88%
49 | Ministral 3 8B | 79.7% | $0.0006 | 3.3s | 65%
50 | Nemotron 3 Super | 89.5% | $0.0000 | 1.0m | 78%
51 | Claude 3.7 Sonnet | 86.3% | $0.021 | 7.8s | 76%
52 | Aion 2.0 | 92.6% | $0.0080 | 1.2m | 86%
53 | GPT-5 Mini | 86.9% | $0.0070 | 45.5s | 80%
54 | DeepSeek V3 (2025-03-24) | 84.7% | $0.0014 | 27.0s | 70%
55 | Qwen3 235B A22B Instruct 2507 | 80.2% | $0.0007 | 19.3s | 71%
56 | Hermes 3 405B | 83.4% | $0.0040 | 17.8s | 69%
57 | Grok 4.3 | 83.5% | $0.0056 | 4.3s | 64%
58 | Gemini 2.5 Pro | 92.4% | $0.034 | 22.8s | 85%
59 | Ministral 8B | 77.5% | $0.0004 | 3.8s | 67%
60 | Gemini 3.1 Flash Lite | 78.0% | $0.0017 | 5.1s | 68%
61 | GPT-4o, Aug. 6th (temp=0) | 77.9% | $0.0100 | 4.0s | 74%
62 | Gemini 2.5 Flash Lite | 77.6% | $0.0005 | 1.9s | 65%
63 | Claude Sonnet 4 | 86.7% | $0.022 | 7.9s | 74%
64 | Gemini 3.5 Flash (Reasoning) | 93.4% | $0.040 | 16.4s | 85%
65 | GPT-5.4 Nano (Reasoning) | 79.8% | $0.0023 | 12.2s | 68%
66 | Z.AI GLM 5.1 | 92.6% | $0.015 | 1.1m | 84%
67 | Qwen 2.5 72B | 77.2% | $0.0008 | 11.0s | 67%
68 | GPT-5.4 (Reasoning, Low) | 83.1% | $0.017 | 10.4s | 73%
69 | GPT-5.5 | 85.8% | $0.027 | 6.0s | 76%
70 | WizardLM 2 8x22b | 82.5% | $0.0026 | 31.3s | 70%
71 | GPT-5.4 Mini (Reasoning, Low) | 76.5% | $0.0042 | 3.7s | 67%
72 | Qwen 3.5 35B | 87.5% | $0.015 | 47.3s | 80%
73 | GPT-5.4 Mini | 75.4% | $0.0031 | 2.2s | 65%
74 | Qwen 3.6 Flash | 81.8% | $0.0096 | 30.2s | 75%
75 | GPT-5.2 | 83.0% | $0.018 | 16.8s | 74%
76 | GPT-5.4 | 79.7% | $0.011 | 6.4s | 67%
77 | o4 Mini | 83.8% | $0.014 | 21.5s | 72%
78 | o4 Mini High | 89.0% | $0.025 | 40.5s | 81%
79 | GPT-4.1 | 75.6% | $0.0073 | 4.7s | 66%
80 | GPT-5.1 | 92.1% | $0.035 | 43.6s | 86%
81 | Qwen 3.6 35B | 81.3% | $0.0072 | 37.6s | 72%
82 | Claude Opus 4.6 (Reasoning) | 94.5% | $0.055 | 21.4s | 89%
83 | GPT-4o, Aug. 6th (temp=1) | 74.4% | $0.0092 | 3.8s | 66%
84 | Claude Opus 4.7 | 89.5% | $0.047 | 6.1s | 80%
85 | GPT-4o, May 13th (temp=0) | 79.8% | $0.025 | 4.3s | 72%
86 | Claude Opus 4.7 (Reasoning) | 89.1% | $0.046 | 5.9s | 79%
87 | GPT-4.1 Mini | 73.5% | $0.0015 | 6.6s | 60%
88 | Gemma 3 4B | 70.1% | $0.0002 | 6.7s | 63%
89 | DeepSeek V3.1 | 85.4% | $0.0012 | 26.1s | 54%
90 | Z.AI GLM 4.5 Air | 79.4% | $0.0023 | 40.1s | 67%
91 | Claude 3 Haiku | 70.8% | $0.0017 | 5.5s | 62%
92 | Hermes 3 70B | 76.6% | $0.0012 | 15.8s | 59%
93 | GPT-OSS 120B | 81.1% | $0.0011 | 57.1s | 71%
94 | Grok 4.3 (Reasoning) | 89.5% | $0.018 | 1.3m | 85%
95 | Arcee AI: Trinity Mini | 69.4% | $0.0003 | 5.8s | 61%
96 | GPT-5.5 (Reasoning, Low) | 86.1% | $0.038 | 13.3s | 76%
97 | Qwen 3 32B | 75.9% | $0.0010 | 37.9s | 68%
98 | Mistral Small 4 | 69.0% | $0.0008 | 3.2s | 60%
99 | ByteDance Seed 2.0 Lite | 87.1% | $0.0078 | 1.4m | 77%
100 | Mistral Large | 83.3% | $0.011 | 7.6s | 52%
101 | Claude Sonnet 4.6 (Reasoning) | 91.8% | $0.052 | 36.0s | 86%
102 | Mistral Small 4 (Reasoning) | 72.6% | $0.0019 | 15.6s | 59%
103 | Z.AI GLM 4.7 | 90.0% | $0.0098 | 1.7m | 81%
104 | Ministral 3B | 65.4% | $0.0002 | 1.8s | 59%
105 | Qwen 3.5 Plus (2026-04-20) | 88.0% | $0.015 | 1.4m | 79%
106 | GPT-4o, May 13th (temp=1) | 76.2% | $0.025 | 4.0s | 67%
107 | Inception Mercury | 66.9% | $0.0006 | 3.9s | 57%
108 | Claude 3.5 Sonnet | 83.1% | $0.043 | 13.9s | 76%
109 | Qwen 3.5 397B A17B | 91.3% | $0.012 | 1.8m | 82%
110 | Cydonia 24B V4.1 | 74.6% | $0.0010 | 12.2s | 50%
111 | Gemma 3 27B | 67.8% | $0.0005 | 14.5s | 58%
112 | Ministral 3 3B | 63.6% | $0.0004 | 1.8s | 57%
113 | GPT-4o Mini (temp=0) | 66.1% | $0.0005 | 7.9s | 57%
114 | GPT-4o Mini (temp=1) | 66.7% | $0.0006 | 8.0s | 56%
115 | GPT-5.4 Nano | 64.7% | $0.0011 | 3.8s | 56%
116 | ByteDance Seed 1.6 Flash | 72.1% | $0.0011 | 39.3s | 61%
117 | Llama 3.1 70B | 71.9% | $0.0016 | 16.0s | 51%
118 | GPT-5.4 Nano (Reasoning, Low) | 65.0% | $0.0011 | 3.7s | 53%
119 | ByteDance Seed 1.6 | 81.1% | $0.0073 | 1.3m | 70%
120 | Nemotron 3 Nano | 82.0% | $0.0013 | 1.6m | 72%
121 | GPT-5 | 94.0% | $0.049 | 1.3m | 89%
122 | Qwen 3.6 27B | 83.5% | $0.018 | 1.2m | 72%
123 | Gemini 3 Pro (Preview) | 90.1% | $0.055 | 35.6s | 80%
124 | Qwen 3.5 9B | 82.0% | $0.0013 | 1.5m | 68%
125 | Gemma 4 31B (Reasoning) | 87.6% | $0.0016 | 2.2m | 78%
126 | Qwen3.7 Max | 89.0% | $0.041 | 1.1m | 79%
127 | GPT-5.5 (Reasoning) | 86.8% | $0.056 | 25.8s | 78%
128 | DeepSeek V4 Pro (Reasoning) | 90.0% | $0.0093 | 2.3m | 82%
129 | GPT-5.4 (Reasoning) | 87.0% | $0.044 | 53.6s | 77%
130 | Llama 3.1 Nemotron 70B | 69.9% | $0.0050 | 17.3s | 49%
131 | Arcee AI: Trinity Large (Preview) | 72.5% | $0.0000 | 20.5s | 39%
132 | Gemma 3 12B | 63.0% | $0.0003 | 14.1s | 46%
133 | MoonshotAI: Kimi K2.5 | 88.9% | $0.013 | 2.4m | 81%
134 | Cohere Command R+ (Aug. 2024) | 66.1% | $0.014 | 17.9s | 51%
135 | GPT-5 Nano | 71.8% | $0.0043 | 1.3m | 59%
136 | Gemma 4 26B | 68.1% | $0.0006 | 18.7s | 33%
137 | MoonshotAI: Kimi K2.6 | 87.9% | $0.026 | 2.4m | 83%
138 | GPT-4.1 Nano | 51.8% | $0.0003 | 2.8s | 42%
139 | Z.AI GLM 4.7 Flash | 69.1% | $0.0019 | 1.2m | 52%
140 | Gemini 3.1 Pro (Preview) | 88.3% | $0.065 | 1.0m | 78%
141 | Llama 3.1 8B | 56.2% | $0.0001 | 10.0s | 37%
142 | Claude Opus 4 | 91.4% | $0.110 | 13.5s | 84%
143 | Qwen3.6 Max Preview | 90.7% | $0.045 | 2.4m | 84%
144 | Skyfall 36B V2 | 56.4% | $0.0018 | 12.7s | 30%
145 | ByteDance Seed 2.0 Mini | 84.4% | $0.0034 | 3.4m | 77%
146 | Gemma 4 26B (Reasoning) | 81.7% | $0.0023 | 2.6m | 52%
147 | Qwen 3.5 27B | 81.4% | $0.021 | 1.6m | 41%
148 | Rocinante 12B | 39.8% | $0.0013 | 25.6s | 20%
149 | Qwen 3.5 122B | 90.4% | $0.079 | 3.6m | 83%
150 | Mistral NeMO | 22.4% | $0.0006 | 1.4s | 0%
151 | LFM2 24B | 17.5% | $0.0002 | 12.5s | 0%
Avg. Score: 81.37%

Individual Scenarios

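Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3 Pro (Preview) | 96 | 96 | 95 | 95 | 95 | 95.2%
Gemini 3.5 Flash (Reasoning) | 98 | 94 | 94 | 94 | 93 | 94.7%
Gemini 3 Flash (Preview, Reasoning) | 97 | 96 | 95 | 93 | 92 | 94.3%
Aion 2.0 | 98 | 94 | 93 | 92 | 90 | 93.6%
Claude Opus 4.5 | 95 | 93 | 93 | 93 | 93 | 93.1%
Claude Opus 4.6 | 93 | 93 | 92 | 92 | 92 | 92.2%
Claude Opus 4 | 92 | 92 | 92 | 92 | 91 | 91.7%
Qwen3.6 Max Preview | 96 | 92 | 92 | 90 | 89 | 91.7%
Claude Opus 4.6 (Reasoning) | 96 | 91 | 91 | 91 | 89 | 91.5%
Nemotron 3 Super | 96 | 92 | 91 | 89 | 89 | 91.4%
Gemini 3 Flash (Preview) | 94 | 93 | 91 | 90 | 89 | 91.3%
Stealth: Hunter Alpha | 96 | 95 | 90 | 89 | 86 | 91.3%
o4 Mini High | 94 | 94 | 90 | 89 | 88 | 91.2%
GPT-5 | 93 | 93 | 91 | 90 | 88 | 91.1%
GPT-5.4 (Reasoning) | 94 | 92 | 90 | 90 | 89 | 90.9%
Z.AI GLM 5 | 93 | 93 | 91 | 88 | 88 | 90.5%
Claude Opus 4.7 | 91 | 91 | 90 | 89 | 88 | 89.9%
Grok 4 | 92 | 91 | 90 | 90 | 86 | 89.9%
Z.AI GLM 5.1 | 93 | 92 | 90 | 89 | 85 | 89.8%
Claude Sonnet 4.6 (Reasoning) | 92 | 90 | 89 | 89 | 88 | 89.7%
Qwen3.7 Max | 95 | 92 | 91 | 90 | 80 | 89.7%
Claude Opus 4.7 (Reasoning) | 91 | 90 | 90 | 89 | 89 | 89.7%
Grok 4.3 (Reasoning) | 95 | 92 | 90 | 85 | 84 | 89.5%
Gemini 2.5 Pro | 92 | 91 | 89 | 88 | 86 | 89.4%
Xiaomi MIMO v2.5 | 92 | 91 | 89 | 88 | 87 | 89.3%
Gemini 3.5 Flash (Reasoning, Minimal) | 92 | 91 | 89 | 88 | 85 | 89.2%
Gemini 3.1 Pro (Preview) | 92 | 91 | 90 | 89 | 82 | 88.9%
Grok 4.20 (Reasoning) | 94 | 89 | 88 | 87 | 85 | 88.5%
Stealth: Healer Alpha | 93 | 88 | 88 | 88 | 85 | 88.2%
Z.AI GLM 5 Turbo | 89 | 89 | 88 | 87 | 87 | 87.9%
Qwen 3.5 27B | 92 | 89 | 88 | 86 | 84 | 87.8%
DeepSeek V4 Flash (Reasoning) | 89 | 89 | 88 | 88 | 85 | 87.8%
GPT-5.4 Mini (Reasoning) | 89 | 88 | 88 | 88 | 86 | 87.6%
Qwen 3.5 Plus (2026-02-15) | 89 | 88 | 88 | 88 | 86 | 87.4%
GPT-5.1 | 89 | 88 | 87 | 87 | 85 | 87.3%
Gemma 4 26B (Reasoning) | 91 | 91 | 88 | 84 | 82 | 87.3%
Claude Sonnet 4.6 | 90 | 89 | 87 | 85 | 85 | 87.2%
GPT-5.5 (Reasoning) | 89 | 88 | 86 | 86 | 86 | 86.9%
DeepSeek-V2 Chat | 91 | 90 | 87 | 86 | 81 | 86.9%
DeepSeek V4 Pro (Reasoning) | 92 | 88 | 86 | 85 | 82 | 86.8%
Mistral Medium 3.1 | 87 | 87 | 87 | 86 | 85 | 86.7%
Qwen 3.5 122B | 94 | 91 | 87 | 81 | 80 | 86.7%
Qwen 3.5 Plus (2026-04-20) | 92 | 88 | 88 | 87 | 79 | 86.7%
Qwen 3.5 35B | 92 | 92 | 85 | 83 | 82 | 86.7%
Gemini 2.5 Flash (Reasoning) | 90 | 89 | 87 | 85 | 83 | 86.6%
GPT-5.5 | 88 | 88 | 86 | 86 | 86 | 86.4%
Xiaomi MIMO v2.5 Pro | 92 | 88 | 86 | 84 | 82 | 86.3%
DeepSeek V4 Flash | 92 | 91 | 91 | 87 | 70 | 86.1%
GPT-5.4 (Reasoning, Low) | 88 | 87 | 87 | 87 | 81 | 86.1%
Grok 4.20 (Beta, Reasoning) | 93 | 88 | 87 | 84 | 78 | 86.0%
MoonshotAI: Kimi K2.5 | 91 | 88 | 87 | 86 | 78 | 86.0%
Nemotron 3 Nano | 94 | 91 | 87 | 84 | 73 | 85.9%
Z.AI GLM 4.7 | 91 | 91 | 85 | 83 | 80 | 85.8%
Mistral Small Creative | 89 | 87 | 85 | 84 | 83 | 85.6%
GPT-5.5 (Reasoning, Low) | 87 | 86 | 85 | 85 | 85 | 85.6%
Grok 4.20 | 89 | 86 | 85 | 84 | 83 | 85.5%
MoonshotAI: Kimi K2.6 | 89 | 87 | 84 | 84 | 83 | 85.5%
Qwen 3.5 397B A17B | 89 | 86 | 85 | 84 | 83 | 85.4%
Z.AI GLM 4.5 | 88 | 86 | 85 | 85 | 82 | 85.4%
Qwen 3.5 Flash | 89 | 89 | 86 | 82 | 80 | 85.3%
Grok 4 Fast | 88 | 86 | 86 | 85 | 81 | 85.3%
GPT-5.2 | 88 | 87 | 85 | 84 | 81 | 85.1%
Gemini 2.5 Flash | 88 | 88 | 85 | 83 | 82 | 85.1%
Claude Haiku 4.5 | 88 | 85 | 84 | 84 | 83 | 84.8%
Claude 3.7 Sonnet | 89 | 84 | 84 | 83 | 83 | 84.6%
Gemini 2.5 Flash Lite (Reasoning) | 92 | 88 | 81 | 80 | 79 | 84.2%
Mistral Large 2 | 86 | 86 | 85 | 84 | 79 | 84.1%
o4 Mini | 93 | 89 | 87 | 82 | 68 | 84.1%
Z.AI GLM 4.6 | 87 | 87 | 86 | 81 | 80 | 84.0%
Inception Mercury 2 | 89 | 87 | 85 | 81 | 77 | 83.9%
Gemma 4 31B | 86 | 84 | 84 | 84 | 83 | 83.9%
GPT-5 Mini | 89 | 83 | 83 | 83 | 80 | 83.7%
Grok 4.20 (Beta) | 85 | 84 | 84 | 83 | 83 | 83.7%
Claude Sonnet 4 | 91 | 88 | 85 | 78 | 76 | 83.6%
DeepSeek V3.2 | 89 | 84 | 83 | 81 | 79 | 83.2%
Grok 4.3 | 90 | 86 | 82 | 80 | 77 | 83.0%
Mistral Large 3 | 87 | 85 | 80 | 80 | 80 | 82.7%
GPT-5.4 | 88 | 83 | 82 | 82 | 79 | 82.6%
Hermes 3 405B | 88 | 85 | 83 | 80 | 77 | 82.6%
Qwen 3.6 27B | 89 | 86 | 83 | 81 | 73 | 82.3%
GPT-OSS 120B | 95 | 85 | 80 | 76 | 75 | 82.0%
GPT-5.4 Mini (Reasoning, Low) | 86 | 83 | 82 | 80 | 79 | 81.8%
WizardLM 2 8x22b | 88 | 85 | 84 | 80 | 71 | 81.6%
Gemma 4 31B (Reasoning) | 83 | 82 | 82 | 82 | 78 | 81.5%
Mistral Small 3.2 24B | 85 | 84 | 83 | 81 | 74 | 81.3%
Grok 4.1 Fast | 88 | 88 | 80 | 76 | 73 | 80.9%
GPT-5.4 Nano (Reasoning) | 88 | 84 | 81 | 76 | 75 | 80.8%
Claude 3.5 Sonnet | 85 | 84 | 82 | 77 | 75 | 80.7%
Claude Sonnet 4.5 | 82 | 81 | 81 | 80 | 78 | 80.5%
Qwen 3.6 Flash | 86 | 82 | 80 | 78 | 76 | 80.4%
ByteDance Seed 2.0 Lite | 85 | 84 | 83 | 77 | 73 | 80.2%
MiniMax M2.7 | 84 | 82 | 80 | 78 | 75 | 80.0%
DeepSeek V3 (2024-12-26) | 84 | 82 | 78 | 77 | 77 | 79.8%
Qwen 3.6 35B | 84 | 81 | 80 | 79 | 75 | 79.7%
GPT-4o, Aug. 6th (temp=0) | 81 | 81 | 81 | 81 | 76 | 79.6%
Hermes 3 70B | 85 | 83 | 82 | 79 | 68 | 79.2%
ByteDance Seed 2.0 Mini | 83 | 82 | 78 | 78 | 75 | 79.1%
DeepSeek V4 Pro | 87 | 83 | 77 | 75 | 70 | 78.6%
Gemini 3.1 Flash Lite (Reasoning) | 81 | 80 | 80 | 75 | 75 | 78.4%
Qwen3 235B A22B Instruct 2507 | 78 | 78 | 78 | 77 | 76 | 77.4%
Writer: Palmyra X5 | 84 | 77 | 76 | 75 | 75 | 77.3%
GPT-4o, May 13th (temp=0) | 80 | 78 | 78 | 77 | 73 | 77.2%
Qwen 3.5 9B | 86 | 85 | 83 | 80 | 52 | 77.2%
MiniMax M2.5 | 84 | 82 | 80 | 75 | 65 | 77.1%
Ministral 3 14B | 79 | 77 | 77 | 76 | 74 | 76.7%
Llama 3.1 Nemotron 70B | 81 | 80 | 78 | 76 | 66 | 76.1%
ByteDance Seed 1.6 | 83 | 77 | 74 | 74 | 72 | 76.0%
Qwen 3 32B | 82 | 82 | 79 | 69 | 69 | 76.0%
Gemini 3.1 Flash Lite (Preview) | 78 | 78 | 77 | 74 | 74 | 76.0%
DeepSeek V3 (2025-03-24) | 83 | 83 | 74 | 70 | 67 | 75.3%
Gemini 3.1 Flash Lite | 82 | 78 | 74 | 72 | 70 | 75.2%
Llama 3.1 70B | 79 | 78 | 74 | 73 | 70 | 74.6%
Cydonia 24B V4.1 | 85 | 81 | 74 | 69 | 65 | 74.6%
GPT-5.4 Mini | 80 | 77 | 73 | 71 | 71 | 74.3%
GPT-4.1 Mini | 79 | 77 | 77 | 70 | 68 | 74.1%
GPT-4.1 | 76 | 74 | 74 | 74 | 69 | 73.3%
GPT-4o, Aug. 6th (temp=1) | 77 | 76 | 74 | 71 | 69 | 73.3%
Z.AI GLM 4.5 Air | 78 | 78 | 76 | 68 | 67 | 73.3%
Gemini 2.5 Flash Lite | 78 | 76 | 70 | 69 | 65 | 71.7%
DeepSeek V3.1 | 93 | 93 | 86 | 84 | 0 | 71.3%
GPT-4o Mini (temp=1) | 75 | 74 | 74 | 70 | 63 | 71.2%
Mistral Small 4 (Reasoning) | 79 | 78 | 70 | 69 | 60 | 71.0%
Ministral 8B | 73 | 72 | 72 | 69 | 68 | 71.0%
GPT-4o, May 13th (temp=1) | 76 | 74 | 71 | 69 | 65 | 70.9%
Ministral 3 8B | 73 | 73 | 71 | 69 | 68 | 70.8%
GPT-5 Nano | 79 | 78 | 77 | 72 | 47 | 70.3%
Mistral Small 4 | 76 | 72 | 71 | 70 | 61 | 70.1%
GPT-4o Mini (temp=0) | 75 | 72 | 70 | 69 | 64 | 70.1%
Qwen 2.5 72B | 76 | 72 | 72 | 66 | 65 | 70.0%
Cohere Command R+ (Aug. 2024) | 85 | 72 | 67 | 64 | 62 | 69.9%
LFM2 24B | 74 | 74 | 73 | 73 | 54 | 69.9%
Gemma 3 4B | 73 | 72 | 71 | 67 | 65 | 69.6%
Arcee AI: Trinity Mini | 75 | 71 | 69 | 67 | 65 | 69.3%
Claude 3 Haiku | 71 | 69 | 66 | 65 | 64 | 67.0%
ByteDance Seed 1.6 Flash | 78 | 77 | 70 | 63 | 45 | 67.0%
Mistral Large | 87 | 83 | 82 | 80 | 0 | 66.4%
Skyfall 36B V2 | 92 | 82 | 62 | 48 | 46 | 66.2%
GPT-5.4 Nano (Reasoning, Low) | 75 | 70 | 64 | 62 | 57 | 65.8%
GPT-5.4 Nano | 73 | 71 | 70 | 69 | 44 | 65.4%
Inception Mercury | 71 | 65 | 64 | 61 | 60 | 64.4%
Ministral 3B | 69 | 65 | 63 | 62 | 59 | 63.7%
Gemma 3 12B | 68 | 67 | 67 | 64 | 53 | 63.7%
Ministral 3 3B | 66 | 63 | 61 | 61 | 61 | 62.2%
Z.AI GLM 4.7 Flash | 80 | 61 | 60 | 59 | 42 | 60.5%
Gemma 3 27B | 68 | 67 | 58 | 58 | 47 | 59.4%
Arcee AI: Trinity Large (Preview) | 74 | 72 | 68 | 62 | 0 | 55.4%
Llama 3.1 8B | 76 | 72 | 66 | 56 | 0 | 54.1%
Gemma 4 26B | 84 | 84 | 75 | 0 | 0 | 48.8%
GPT-4.1 Nano | 58 | 48 | 46 | 39 | 32 | 44.6%
Rocinante 12B | 77 | 71 | 68 | 0 | 0 | 43.1%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%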
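
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 | 99 | 99 | 99 | 99 | 99 | 98.6%
Claude Opus 4.6 (Reasoning) | 99 | 99 | 99 | 97 | 93 | 97.3%
Claude Opus 4.5 | 99 | 99 | 96 | 96 | 96 | 97.1%
Claude Opus 4.7 | 100 | 96 | 95 | 92 | 92 | 95.1%
GPT-5 | 96 | 95 | 95 | 95 | 95 | 94.9%
Claude Opus 4.7 (Reasoning) | 96 | 95 | 95 | 95 | 92 | 94.5%
Grok 4 | 97 | 94 | 94 | 94 | 91 | 94.2%
GPT-5.1 | 96 | 95 | 95 | 91 | 91 | 93.5%
Gemini 3.1 Pro (Preview) | 96 | 96 | 95 | 91 | 89 | 93.3%
Gemini 3 Pro (Preview) | 95 | 94 | 93 | 91 | 88 | 92.3%
Z.AI GLM 5 Turbo | 96 | 92 | 92 | 91 | 90 | 92.1%
Claude Sonnet 4.6 | 92 | 92 | 92 | 92 | 91 | 92.1%
Grok 4.20 (Reasoning) | 94 | 93 | 92 | 91 | 90 | 92.0%
Gemini 3 Flash (Preview, Reasoning) | 96 | 93 | 92 | 91 | 86 | 91.7%
Aion 2.0 | 95 | 92 | 92 | 90 | 89 | 91.5%
Z.AI GLM 4.7 | 98 | 94 | 90 | 88 | 86 | 91.3%
Qwen 3.5 Flash | 96 | 93 | 92 | 89 | 86 | 91.3%
Claude Sonnet 4.6 (Reasoning) | 92 | 92 | 91 | 91 | 90 | 91.2%
Mistral Large 2 | 94 | 91 | 91 | 90 | 88 | 90.9%
Gemini 3.5 Flash (Reasoning) | 97 | 93 | 91 | 87 | 85 | 90.8%
Mistral Large 3 | 92 | 92 | 92 | 92 | 86 | 90.8%
Qwen 3.5 27B | 95 | 94 | 90 | 89 | 85 | 90.6%
GPT-5.4 (Reasoning) | 92 | 91 | 90 | 90 | 89 | 90.4%
Qwen 3.5 122B | 93 | 91 | 91 | 90 | 87 | 90.3%
Xiaomi MIMO v2.5 | 93 | 92 | 91 | 91 | 85 | 90.1%
Gemini 2.5 Pro | 93 | 92 | 90 | 90 | 85 | 90.1%
Z.AI GLM 4.6 | 96 | 94 | 92 | 84 | 83 | 89.8%
Qwen 3.5 397B A17B | 96 | 92 | 91 | 87 | 82 | 89.5%
Grok 4.20 (Beta, Reasoning) | 96 | 89 | 88 | 88 | 86 | 89.4%
DeepSeek V3.1 | 90 | 90 | 89 | 89 | 89 | 89.4%
DeepSeek V4 Flash (Reasoning) | 93 | 91 | 90 | 87 | 87 | 89.3%
Gemma 4 31B | 96 | 93 | 87 | 86 | 85 | 89.2%
GPT-5.5 (Reasoning, Low) | 92 | 91 | 88 | 88 | 87 | 89.2%
Mistral Large | 92 | 90 | 90 | 88 | 86 | 89.1%
DeepSeek V4 Pro (Reasoning) | 92 | 89 | 89 | 88 | 87 | 89.1%
Claude Opus 4 | 95 | 91 | 90 | 86 | 84 | 89.0%
Stealth: Healer Alpha | 91 | 91 | 90 | 87 | 85 | 88.8%
GPT-5.5 (Reasoning) | 92 | 92 | 91 | 85 | 83 | 88.7%
Stealth: Hunter Alpha | 93 | 90 | 89 | 87 | 83 | 88.6%
Qwen3.7 Max | 89 | 89 | 89 | 88 | 86 | 88.5%
Grok 4.3 (Reasoning) | 91 | 90 | 90 | 89 | 82 | 88.5%
DeepSeek V4 Flash | 94 | 92 | 89 | 85 | 82 | 88.4%
DeepSeek-V2 Chat | 91 | 91 | 91 | 88 | 81 | 88.4%
DeepSeek V3.2 | 91 | 91 | 89 | 87 | 84 | 88.3%
Qwen3.6 Max Preview | 93 | 92 | 89 | 86 | 79 | 88.1%
Gemini 3.1 Flash Lite (Reasoning) | 91 | 89 | 89 | 88 | 83 | 88.0%
GPT-5 Mini | 92 | 91 | 88 | 85 | 84 | 87.9%
Xiaomi MIMO v2.5 Pro | 93 | 93 | 85 | 85 | 84 | 87.9%
MoonshotAI: Kimi K2.6 | 91 | 89 | 89 | 88 | 80 | 87.7%
Gemma 4 31B (Reasoning) | 95 | 89 | 85 | 85 | 84 | 87.7%
ByteDance Seed 2.0 Mini | 89 | 89 | 88 | 88 | 83 | 87.5%
Qwen 3.5 Plus (2026-02-15) | 91 | 90 | 88 | 85 | 84 | 87.5%
Grok 4.1 Fast | 90 | 90 | 88 | 86 | 83 | 87.4%
Grok 4 Fast | 93 | 91 | 89 | 85 | 80 | 87.4%
Gemini 3 Flash (Preview) | 91 | 89 | 86 | 86 | 85 | 87.3%
Claude 3.5 Sonnet | 90 | 89 | 88 | 86 | 83 | 87.2%
Gemini 3.5 Flash (Reasoning, Minimal) | 90 | 90 | 89 | 88 | 79 | 87.2%
Qwen 3.5 35B | 91 | 88 | 87 | 86 | 85 | 87.2%
Z.AI GLM 5.1 | 91 | 89 | 88 | 84 | 83 | 87.0%
ByteDance Seed 2.0 Lite | 91 | 89 | 88 | 83 | 83 | 86.8%
DeepSeek V3 (2025-03-24) | 88 | 87 | 87 | 86 | 85 | 86.6%
GPT-5.5 | 91 | 87 | 87 | 86 | 82 | 86.5%
Qwen 3.5 Plus (2026-04-20) | 88 | 87 | 86 | 86 | 84 | 86.2%
MiniMax M2.7 | 89 | 87 | 87 | 84 | 83 | 85.9%
MoonshotAI: Kimi K2.5 | 89 | 89 | 89 | 83 | 79 | 85.8%
Gemini 3.1 Flash Lite | 89 | 88 | 86 | 84 | 81 | 85.7%
Gemini 2.5 Flash (Reasoning) | 89 | 87 | 85 | 85 | 82 | 85.6%
DeepSeek V3 (2024-12-26) | 91 | 89 | 88 | 82 | 78 | 85.6%
Grok 4.20 | 87 | 86 | 85 | 85 | 85 | 85.5%
MiniMax M2.5 | 89 | 87 | 86 | 84 | 80 | 85.3%
GPT-5.4 Mini (Reasoning) | 90 | 88 | 88 | 81 | 80 | 85.3%
o4 Mini High | 90 | 88 | 87 | 82 | 79 | 85.3%
Claude Haiku 4.5 | 86 | 86 | 84 | 84 | 84 | 85.2%
Claude Sonnet 4.5 | 86 | 86 | 85 | 85 | 83 | 85.1%
Qwen 3.5 9B | 93 | 88 | 88 | 86 | 70 | 85.1%
GPT-OSS 120B | 87 | 87 | 85 | 84 | 82 | 84.9%
Gemini 3.1 Flash Lite (Preview) | 88 | 87 | 86 | 84 | 80 | 84.8%
Grok 4.20 (Beta) | 87 | 86 | 85 | 84 | 82 | 84.8%
Z.AI GLM 4.5 | 93 | 85 | 83 | 82 | 79 | 84.5%
Gemini 2.5 Flash Lite (Reasoning) | 90 | 88 | 84 | 82 | 76 | 84.0%
Mistral Small Creative | 85 | 85 | 84 | 83 | 83 | 84.0%
Z.AI GLM 5 | 87 | 85 | 84 | 82 | 81 | 83.7%
GPT-5.2 | 89 | 85 | 82 | 81 | 80 | 83.6%
WizardLM 2 8x22b | 88 | 88 | 83 | 81 | 76 | 83.2%
GPT-5.4 | 85 | 84 | 84 | 81 | 81 | 83.1%
Nemotron 3 Super | 86 | 86 | 82 | 81 | 80 | 83.0%
GPT-5.4 (Reasoning, Low) | 85 | 85 | 84 | 83 | 78 | 82.9%
GPT-5 Nano | 88 | 86 | 82 | 81 | 78 | 82.9%
Nemotron 3 Nano | 90 | 85 | 81 | 81 | 77 | 82.7%
Z.AI GLM 4.5 Air | 92 | 86 | 80 | 79 | 75 | 82.5%
DeepSeek V4 Pro | 86 | 85 | 83 | 79 | 79 | 82.2%
Qwen 3.6 35B | 86 | 84 | 83 | 82 | 76 | 82.2%
Mistral Small 3.2 24B | 84 | 84 | 84 | 81 | 77 | 82.2%
Claude 3.7 Sonnet | 85 | 82 | 82 | 81 | 81 | 82.1%
ByteDance Seed 1.6 | 86 | 83 | 81 | 80 | 79 | 82.0%
Mistral Medium 3.1 | 84 | 84 | 82 | 80 | 80 | 81.9%
Writer: Palmyra X5 | 85 | 84 | 82 | 79 | 78 | 81.7%
Ministral 3 8B | 82 | 82 | 82 | 82 | 79 | 81.6%
Cydonia 24B V4.1 | 89 | 85 | 83 | 81 | 69 | 81.3%
Claude Sonnet 4 | 86 | 81 | 81 | 80 | 78 | 81.2%
GPT-4o, May 13th (temp=0) | 83 | 83 | 82 | 80 | 78 | 81.1%
Ministral 3 14B | 84 | 82 | 82 | 81 | 77 | 81.1%
Qwen 3.6 Flash | 84 | 84 | 81 | 79 | 78 | 81.1%
Gemini 2.5 Flash | 84 | 84 | 82 | 79 | 76 | 80.8%
Qwen 2.5 72B | 82 | 82 | 81 | 80 | 76 | 80.3%
Qwen3 235B A22B Instruct 2507 | 90 | 84 | 82 | 80 | 64 | 80.2%
Ministral 8B | 82 | 81 | 81 | 79 | 77 | 80.1%
GPT-4o, Aug. 6th (temp=0) | 83 | 80 | 80 | 80 | 77 | 80.0%
Qwen 3.6 27B | 88 | 85 | 83 | 78 | 66 | 80.0%
o4 Mini | 87 | 85 | 83 | 75 | 70 | 79.9%
Inception Mercury 2 | 91 | 79 | 77 | 76 | 75 | 79.6%
Gemini 2.5 Flash Lite | 87 | 82 | 81 | 80 | 67 | 79.3%
GPT-4.1 Mini | 86 | 79 | 79 | 77 | 76 | 79.2%
GPT-4o, May 13th (temp=1) | 84 | 82 | 81 | 80 | 68 | 78.7%
Gemma 4 26B (Reasoning) | 89 | 89 | 86 | 78 | 49 | 78.1%
GPT-4o, Aug. 6th (temp=1) | 84 | 82 | 80 | 79 | 65 | 78.0%
Hermes 3 405B | 82 | 79 | 77 | 77 | 74 | 78.0%
GPT-4.1 | 81 | 80 | 80 | 76 | 72 | 77.9%
Gemma 4 26B | 78 | 78 | 77 | 77 | 77 | 77.7%
GPT-5.4 Mini (Reasoning, Low) | 82 | 80 | 76 | 76 | 73 | 77.7%
Llama 3.1 70B | 80 | 80 | 80 | 79 | 66 | 77.0%
GPT-5.4 Mini | 79 | 79 | 79 | 75 | 70 | 76.6%
Qwen 3 32B | 85 | 79 | 76 | 75 | 68 | 76.6%
Hermes 3 70B | 81 | 77 | 74 | 74 | 72 | 75.4%
Claude 3 Haiku | 81 | 79 | 74 | 70 | 68 | 74.5%
Z.AI GLM 4.7 Flash | 82 | 81 | 70 | 68 | 66 | 73.4%
Mistral Small 4 (Reasoning) | 77 | 76 | 73 | 71 | 68 | 73.1%
GPT-5.4 Nano (Reasoning) | 83 | 82 | 72 | 63 | 63 | 72.5%
GPT-4o Mini (temp=1) | 83 | 76 | 69 | 67 | 65 | 71.8%
Grok 4.3 | 86 | 83 | 79 | 73 | 35 | 71.2%
ByteDance Seed 1.6 Flash | 82 | 78 | 67 | 67 | 62 | 71.1%
Gemma 3 4B | 73 | 72 | 72 | 70 | 68 | 71.1%
Arcee AI: Trinity Mini | 78 | 74 | 69 | 68 | 61 | 69.8%
Mistral Small 4 | 81 | 70 | 68 | 68 | 57 | 68.8%
Inception Mercury | 75 | 74 | 65 | 63 | 58 | 67.0%
Skyfall 36B V2 | 74 | 71 | 69 | 64 | 56 | 66.9%
Arcee AI: Trinity Large (Preview) | 86 | 86 | 82 | 78 | 0 | 66.5%
GPT-4o Mini (temp=0) | 73 | 70 | 67 | 62 | 60 | 66.3%
Cohere Command R+ (Aug. 2024) | 72 | 66 | 65 | 63 | 58 | 64.8%
GPT-5.4 Nano | 72 | 64 | 61 | 61 | 58 | 63.1%
Llama 3.1 8B | 70 | 68 | 61 | 60 | 56 | 62.8%
Ministral 3B | 66 | 63 | 63 | 62 | 60 | 62.8%
Gemma 3 27B | 72 | 70 | 62 | 57 | 53 | 62.7%
Ministral 3 3B | 63 | 63 | 62 | 61 | 61 | 61.6%
GPT-5.4 Nano (Reasoning, Low) | 71 | 66 | 64 | 55 | 48 | 60.9%
GPT-4.1 Nano | 66 | 57 | 56 | 54 | 53 | 57.3%
Llama 3.1 Nemotron 70B | 73 | 71 | 69 | 66 | 0 | 55.9%
Rocinante 12B | 76 | 71 | 70 | 60 | 0 | 55.4%
Gemma 3 12B | 71 | 71 | 67 | 65 | 0 | 54.9%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%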
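
Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.5 Flash (Reasoning, Minimal) | 99 | 99 | 98 | 96 | 95 | 97.7%
Z.AI GLM 5.1 | 98 | 97 | 97 | 94 | 93 | 95.7%
Grok 4 | 99 | 96 | 95 | 93 | 92 | 95.2%
Gemini 3.5 Flash (Reasoning) | 98 | 97 | 96 | 93 | 92 | 95.0%
Claude Opus 4.6 (Reasoning) | 96 | 95 | 95 | 95 | 95 | 94.9%
Z.AI GLM 5 Turbo | 96 | 95 | 94 | 94 | 93 | 94.5%
Claude Opus 4.6 | 95 | 95 | 95 | 95 | 93 | 94.3%
Gemini 3.1 Pro (Preview) | 96 | 95 | 94 | 94 | 92 | 94.3%
Gemini 3 Flash (Preview, Reasoning) | 98 | 95 | 94 | 93 | 92 | 94.3%
Claude Opus 4.7 (Reasoning) | 95 | 95 | 95 | 93 | 92 | 94.1%
Z.AI GLM 5 | 96 | 95 | 95 | 92 | 91 | 93.9%
Grok 4.1 Fast | 96 | 94 | 94 | 93 | 92 | 93.7%
Gemini 3 Flash (Preview) | 95 | 95 | 93 | 93 | 91 | 93.6%
Claude Opus 4.7 | 95 | 94 | 93 | 93 | 92 | 93.4%
GPT-5.5 | 95 | 95 | 93 | 93 | 90 | 93.2%
Qwen 3.5 397B A17B | 95 | 95 | 93 | 92 | 89 | 92.9%
Claude Opus 4.5 | 94 | 94 | 94 | 94 | 88 | 92.9%
GPT-5 | 94 | 93 | 93 | 92 | 92 | 92.7%
GPT-5.5 (Reasoning, Low) | 96 | 93 | 92 | 91 | 91 | 92.7%
Qwen3.7 Max | 97 | 94 | 94 | 91 | 87 | 92.6%
DeepSeek V4 Pro (Reasoning) | 97 | 93 | 93 | 91 | 88 | 92.4%
Gemini 2.5 Pro | 94 | 93 | 93 | 91 | 90 | 92.2%
GPT-5.1 | 95 | 92 | 92 | 92 | 88 | 92.0%
GPT-5.4 (Reasoning) | 96 | 92 | 92 | 90 | 89 | 91.8%
Gemma 4 31B | 94 | 92 | 92 | 91 | 90 | 91.7%
Gemini 3 Pro (Preview) | 94 | 94 | 92 | 91 | 88 | 91.6%
Xiaomi MIMO v2.5 Pro | 97 | 93 | 92 | 89 | 87 | 91.4%
Claude Sonnet 4.6 | 93 | 91 | 91 | 91 | 91 | 91.4%
GPT-5.5 (Reasoning) | 93 | 93 | 92 | 90 | 89 | 91.4%
MoonshotAI: Kimi K2.5 | 95 | 93 | 90 | 89 | 89 | 91.2%
GPT-5.4 (Reasoning, Low) | 95 | 91 | 90 | 90 | 89 | 90.9%
Grok 4.20 (Reasoning) | 93 | 92 | 92 | 91 | 86 | 90.8%
Claude Sonnet 4.6 (Reasoning) | 92 | 91 | 91 | 90 | 90 | 90.7%
GPT-5 Mini | 93 | 90 | 90 | 90 | 89 | 90.6%
Stealth: Hunter Alpha | 94 | 91 | 90 | 89 | 88 | 90.3%
Aion 2.0 | 92 | 91 | 91 | 91 | 86 | 90.3%
Grok 4 Fast | 94 | 92 | 90 | 88 | 87 | 90.1%
Qwen 3.5 122B | 93 | 92 | 90 | 90 | 86 | 89.9%
Claude Haiku 4.5 | 92 | 90 | 90 | 90 | 88 | 89.7%
Gemma 4 26B (Reasoning) | 93 | 91 | 91 | 89 | 84 | 89.5%
Grok 4.20 (Beta, Reasoning) | 94 | 92 | 88 | 88 | 85 | 89.5%
Qwen 3.5 27B | 92 | 91 | 90 | 89 | 86 | 89.5%
DeepSeek V4 Pro | 93 | 90 | 89 | 88 | 87 | 89.4%
Grok 4.3 (Reasoning) | 91 | 91 | 90 | 88 | 86 | 89.4%
Gemma 4 31B (Reasoning) | 91 | 90 | 89 | 89 | 87 | 89.3%
Z.AI GLM 4.6 | 94 | 92 | 90 | 87 | 83 | 89.3%
ByteDance Seed 2.0 Lite | 93 | 92 | 90 | 86 | 85 | 89.2%
o4 Mini High | 92 | 90 | 88 | 88 | 87 | 89.0%
Qwen 3.5 Plus (2026-02-15) | 91 | 89 | 89 | 88 | 87 | 88.9%
Z.AI GLM 4.7 | 95 | 91 | 89 | 85 | 84 | 88.8%
Qwen3.6 Max Preview | 94 | 90 | 89 | 87 | 84 | 88.8%
GPT-5.4 | 92 | 88 | 88 | 88 | 86 | 88.5%
Stealth: Healer Alpha | 93 | 91 | 89 | 85 | 84 | 88.5%
Claude Sonnet 4.5 | 93 | 90 | 87 | 87 | 85 | 88.4%
Claude Opus 4 | 92 | 90 | 88 | 86 | 86 | 88.4%
Xiaomi MIMO v2.5 | 93 | 91 | 89 | 85 | 81 | 87.9%
GPT-5.2 | 92 | 92 | 88 | 87 | 80 | 87.8%
Gemini 2.5 Flash | 90 | 89 | 89 | 88 | 83 | 87.7%
Grok 4.3 | 90 | 89 | 87 | 86 | 85 | 87.6%
MoonshotAI: Kimi K2.6 | 90 | 88 | 87 | 85 | 85 | 87.1%
DeepSeek V3.1 | 89 | 88 | 88 | 85 | 84 | 87.0%
GPT-5.4 Mini (Reasoning) | 90 | 87 | 87 | 87 | 84 | 87.0%
MiniMax M2.5 | 92 | 87 | 86 | 86 | 85 | 87.0%
Claude 3.5 Sonnet | 88 | 88 | 87 | 86 | 85 | 87.0%
Grok 4.20 (Beta) | 89 | 88 | 87 | 85 | 85 | 86.9%
Grok 4.20 | 92 | 89 | 85 | 83 | 83 | 86.5%
o4 Mini | 90 | 88 | 88 | 83 | 82 | 86.2%
Claude Sonnet 4 | 91 | 86 | 86 | 85 | 83 | 86.2%
Nemotron 3 Super | 91 | 89 | 84 | 84 | 83 | 86.1%
Qwen 3.5 Flash | 93 | 87 | 87 | 82 | 81 | 85.8%
Gemini 2.5 Flash (Reasoning) | 89 | 88 | 84 | 84 | 83 | 85.7%
DeepSeek V4 Flash (Reasoning) | 88 | 88 | 85 | 84 | 83 | 85.7%
Writer: Palmyra X5 | 89 | 89 | 87 | 81 | 80 | 85.4%
Qwen 3.5 Plus (2026-04-20) | 93 | 86 | 85 | 84 | 78 | 85.3%
Claude 3.7 Sonnet | 90 | 88 | 85 | 84 | 79 | 85.2%
Qwen 3.5 35B | 87 | 87 | 84 | 84 | 83 | 85.1%
DeepSeek V3.2 | 87 | 87 | 85 | 84 | 81 | 85.1%
Mistral Medium 3.1 | 87 | 86 | 84 | 84 | 84 | 85.0%
Mistral Small Creative | 88 | 88 | 84 | 83 | 81 | 84.7%
Qwen 3.6 27B | 89 | 85 | 85 | 82 | 80 | 84.2%
DeepSeek V4 Flash | 86 | 85 | 84 | 83 | 81 | 83.8%
Z.AI GLM 4.5 | 87 | 85 | 84 | 83 | 79 | 83.4%
Qwen3 235B A22B Instruct 2507 | 88 | 87 | 87 | 80 | 73 | 83.0%
Qwen 3.5 9B | 88 | 87 | 85 | 78 | 75 | 82.7%
MiniMax M2.7 | 92 | 83 | 81 | 80 | 77 | 82.6%
ByteDance Seed 2.0 Mini | 85 | 85 | 83 | 81 | 79 | 82.5%
Mistral Large 2 | 87 | 82 | 81 | 81 | 81 | 82.4%
Qwen 3.6 35B | 90 | 83 | 80 | 79 | 78 | 82.1%
Mistral Large | 83 | 83 | 82 | 81 | 81 | 82.0%
Qwen 3.6 Flash | 87 | 85 | 83 | 80 | 75 | 81.7%
DeepSeek V3 (2024-12-26) | 89 | 88 | 78 | 77 | 76 | 81.7%
Mistral Large 3 | 83 | 81 | 81 | 81 | 81 | 81.6%
GPT-4.1 | 86 | 84 | 83 | 81 | 71 | 81.1%
Arcee AI: Trinity Large (Preview) | 88 | 83 | 79 | 77 | 76 | 80.6%
DeepSeek V3 (2025-03-24) | 84 | 82 | 79 | 79 | 78 | 80.5%
DeepSeek-V2 Chat | 89 | 88 | 83 | 71 | 68 | 79.9%
GPT-4o, May 13th (temp=0) | 81 | 80 | 79 | 79 | 79 | 79.9%
Gemini 3.1 Flash Lite (Preview) | 85 | 81 | 78 | 77 | 76 | 79.5%
Gemini 3.1 Flash Lite (Reasoning) | 83 | 80 | 80 | 78 | 76 | 79.3%
ByteDance Seed 1.6 Flash | 83 | 83 | 81 | 76 | 74 | 79.3%
GPT-4o, May 13th (temp=1) | 83 | 82 | 80 | 76 | 74 | 79.1%
ByteDance Seed 1.6 | 88 | 80 | 77 | 76 | 72 | 78.6%
Hermes 3 405B | 82 | 79 | 79 | 77 | 77 | 78.5%
Z.AI GLM 4.5 Air | 85 | 82 | 80 | 76 | 70 | 78.4%
Nemotron 3 Nano | 80 | 80 | 79 | 77 | 75 | 78.2%
GPT-4o, Aug. 6th (temp=0) | 78 | 78 | 78 | 78 | 78 | 78.1%
GPT-4.1 Mini | 86 | 81 | 78 | 72 | 71 | 77.5%
GPT-5.4 Nano (Reasoning) | 86 | 81 | 78 | 75 | 67 | 77.3%
GPT-4o, Aug. 6th (temp=1) | 81 | 78 | 77 | 76 | 75 | 77.3%
Gemini 3.1 Flash Lite | 85 | 76 | 76 | 75 | 74 | 77.0%
Gemini 2.5 Flash Lite (Reasoning) | 80 | 79 | 79 | 76 | 69 | 76.9%
WizardLM 2 8x22b | 86 | 79 | 75 | 73 | 66 | 75.8%
Mistral Small 3.2 24B | 77 | 77 | 77 | 75 | 70 | 75.3%
Inception Mercury 2 | 82 | 80 | 77 | 74 | 63 | 75.3%
GPT-5.4 Mini (Reasoning, Low) | 78 | 76 | 76 | 75 | 71 | 75.2%
GPT-4o Mini (temp=0) | 76 | 76 | 76 | 76 | 71 | 75.0%
GPT-OSS 120B | 79 | 75 | 75 | 73 | 71 | 74.7%
GPT-5.4 Mini | 86 | 74 | 74 | 70 | 68 | 74.3%
Ministral 3 8B | 79 | 76 | 75 | 72 | 70 | 74.3%
Qwen 2.5 72B | 77 | 75 | 74 | 74 | 71 | 74.2%
Gemma 3 27B | 79 | 79 | 71 | 71 | 70 | 74.1%
Ministral 3 14B | 77 | 75 | 75 | 72 | 72 | 74.1%
Ministral 8B | 77 | 73 | 72 | 71 | 70 | 72.8%
Cydonia 24B V4.1 | 82 | 77 | 74 | 67 | 64 | 72.8%
Claude 3 Haiku | 82 | 75 | 75 | 66 | 66 | 72.6%
GPT-5 Nano | 81 | 79 | 73 | 67 | 63 | 72.4%
Qwen 3 32B | 76 | 73 | 72 | 71 | 67 | 71.9%
Gemini 2.5 Flash Lite | 75 | 74 | 71 | 70 | 69 | 71.8%
Mistral Small 4 (Reasoning) | 80 | 78 | 72 | 66 | 63 | 71.7%
Hermes 3 70B | 81 | 73 | 73 | 68 | 62 | 71.6%
Mistral Small 4 | 73 | 72 | 71 | 67 | 63 | 69.4%
GPT-4o Mini (temp=1) | 76 | 71 | 69 | 67 | 64 | 69.4%
Z.AI GLM 4.7 Flash | 84 | 75 | 73 | 69 | 42 | 68.5%
Llama 3.1 Nemotron 70B | 76 | 68 | 67 | 65 | 65 | 68.3%
Ministral 3 3B | 74 | 73 | 71 | 64 | 59 | 68.1%
Ministral 3B | 74 | 70 | 70 | 65 | 56 | 67.3%
Arcee AI: Trinity Mini | 77 | 71 | 69 | 61 | 58 | 67.0%
Gemma 3 12B | 67 | 67 | 62 | 61 | 60 | 63.6%
Gemma 3 4B | 75 | 63 | 61 | 59 | 55 | 62.4%
Inception Mercury | 66 | 63 | 62 | 61 | 60 | 62.4%
GPT-5.4 Nano (Reasoning, Low) | 69 | 69 | 64 | 58 | 52 | 62.3%
Gemma 4 26B | 81 | 79 | 75 | 75 | 0 | 62.0%
GPT-5.4 Nano | 69 | 68 | 65 | 59 | 48 | 61.8%
Cohere Command R+ (Aug. 2024) | 69 | 65 | 60 | 58 | 53 | 61.0%
Llama 3.1 70B | 80 | 79 | 69 | 63 | 0 | 58.1%
GPT-4.1 Nano | 53 | 53 | 51 | 48 | 45 | 50.1%
Llama 3.1 8B | 66 | 66 | 56 | 55 | 0 | 48.3%
Mistral NeMO | 82 | 72 | 72 | 0 | 0 | 45.3%
Rocinante 12B | 72 | 60 | 56 | 16 | 0 | 40.9%
Skyfall 36B V2 | 78 | 71 | 54 | 0 | 0 | 40.6%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%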
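
Model | #1 | #2 | #3 | #4 | #5 | Avg
Z.AI GLM 5.1 | 100 | 98 | 98 | 97 | 97 | 98.0%
Gemini 2.5 Pro | 98 | 98 | 98 | 98 | 96 | 97.9%
Qwen 3.5 Plus (2026-02-15) | 98 | 98 | 98 | 98 | 96 | 97.9%
Grok 4 | 98 | 98 | 98 | 97 | 96 | 97.7%
Claude Sonnet 4.5 | 98 | 98 | 98 | 96 | 96 | 97.5%
Mistral Medium 3.1 | 98 | 98 | 97 | 97 | 96 | 97.5%
GPT-5 | 98 | 97 | 97 | 97 | 97 | 97.3%
Qwen 3.5 397B A17B | 98 | 98 | 98 | 97 | 94 | 97.3%
Nemotron 3 Super | 98 | 98 | 97 | 97 | 96 | 97.2%
Claude Opus 4.5 | 98 | 98 | 98 | 94 | 94 | 96.7%
DeepSeek V3 (2025-03-24) | 99 | 97 | 97 | 95 | 94 | 96.5%
Claude Opus 4 | 98 | 98 | 98 | 94 | 93 | 96.5%
Mistral Large 3 | 98 | 98 | 95 | 95 | 95 | 96.3%
Z.AI GLM 5 | 97 | 97 | 97 | 95 | 94 | 96.2%
Grok 4.1 Fast | 98 | 96 | 96 | 96 | 94 | 96.0%
Claude Sonnet 4 | 96 | 96 | 96 | 96 | 96 | 96.0%
DeepSeek V4 Flash (Reasoning) | 98 | 96 | 96 | 95 | 94 | 95.9%
Claude Sonnet 4.6 (Reasoning) | 98 | 96 | 96 | 94 | 94 | 95.7%
GPT-5.1 | 98 | 96 | 96 | 95 | 92 | 95.6%
Mistral Large | 98 | 98 | 95 | 95 | 91 | 95.6%
Claude Opus 4.6 | 95 | 95 | 95 | 95 | 95 | 95.3%
Xiaomi MIMO v2.5 | 98 | 96 | 96 | 94 | 92 | 95.3%
Grok 4.20 (Beta, Reasoning) | 98 | 97 | 96 | 94 | 90 | 95.2%
Mistral Small Creative | 95 | 95 | 95 | 95 | 95 | 95.2%
MiniMax M2.7 | 98 | 98 | 96 | 94 | 88 | 94.9%
Aion 2.0 | 98 | 98 | 97 | 91 | 90 | 94.9%
Claude Sonnet 4.6 | 96 | 96 | 96 | 93 | 93 | 94.8%
Qwen 3.5 122B | 98 | 96 | 94 | 92 | 92 | 94.7%
Grok 4.20 (Reasoning) | 98 | 98 | 92 | 92 | 92 | 94.6%
Hermes 3 405B | 95 | 95 | 95 | 94 | 93 | 94.6%
Mistral Large 2 | 98 | 95 | 95 | 92 | 92 | 94.6%
Qwen3.6 Max Preview | 94 | 94 | 94 | 94 | 94 | 94.4%
Z.AI GLM 4.5 | 95 | 95 | 95 | 95 | 92 | 94.3%
Z.AI GLM 5 Turbo | 96 | 96 | 95 | 93 | 92 | 94.2%
Claude Opus 4.6 (Reasoning) | 94 | 94 | 94 | 94 | 94 | 94.1%
DeepSeek V3.1 | 97 | 97 | 96 | 91 | 89 | 93.9%
Gemini 3 Flash (Preview, Reasoning) | 98 | 97 | 97 | 97 | 79 | 93.9%
Z.AI GLM 4.7 | 98 | 94 | 94 | 93 | 90 | 93.9%
Qwen 3.5 Plus (2026-04-20) | 97 | 97 | 95 | 91 | 90 | 93.9%
Grok 4 Fast | 97 | 95 | 94 | 94 | 88 | 93.7%
Grok 4.20 (Beta) | 96 | 95 | 94 | 92 | 90 | 93.7%
GPT-5.4 Mini (Reasoning) | 98 | 95 | 94 | 93 | 87 | 93.5%
Gemini 3.5 Flash (Reasoning, Minimal) | 97 | 96 | 96 | 92 | 85 | 93.2%
Gemini 3.5 Flash (Reasoning) | 98 | 97 | 97 | 95 | 78 | 93.2%
Claude 3.7 Sonnet | 94 | 94 | 94 | 92 | 92 | 93.2%
Stealth: Hunter Alpha | 97 | 95 | 94 | 92 | 87 | 93.1%
MoonshotAI: Kimi K2.5 | 96 | 94 | 92 | 90 | 90 | 92.4%
Ministral 3 8B | 94 | 94 | 94 | 91 | 90 | 92.3%
Grok 4.3 | 97 | 94 | 93 | 89 | 87 | 92.3%
Gemini 2.5 Flash Lite (Reasoning) | 96 | 94 | 93 | 92 | 87 | 92.3%
ByteDance Seed 2.0 Lite | 96 | 94 | 94 | 90 | 86 | 92.0%
Gemma 4 31B (Reasoning) | 98 | 98 | 94 | 92 | 77 | 91.9%
Stealth: Healer Alpha | 96 | 95 | 92 | 88 | 87 | 91.9%
Xiaomi MIMO v2.5 Pro | 100 | 96 | 96 | 92 | 75 | 91.9%
DeepSeek V4 Pro (Reasoning) | 96 | 95 | 95 | 94 | 79 | 91.8%
Gemini 3 Flash (Preview) | 94 | 94 | 92 | 92 | 85 | 91.5%
Qwen 3.5 Flash | 97 | 92 | 90 | 90 | 88 | 91.5%
MoonshotAI: Kimi K2.6 | 93 | 92 | 91 | 91 | 90 | 91.3%
Qwen 3.5 35B | 95 | 92 | 92 | 92 | 86 | 91.2%
Ministral 3 14B | 95 | 92 | 92 | 88 | 88 | 91.0%
DeepSeek-V2 Chat | 96 | 92 | 92 | 90 | 84 | 90.9%
Grok 4.3 (Reasoning) | 93 | 91 | 91 | 91 | 88 | 90.7%
o4 Mini High | 97 | 95 | 94 | 90 | 78 | 90.7%
DeepSeek V4 Flash | 92 | 91 | 90 | 89 | 89 | 90.2%
MiniMax M2.5 | 96 | 96 | 95 | 89 | 75 | 90.2%
Z.AI GLM 4.6 | 96 | 96 | 89 | 85 | 84 | 90.1%
WizardLM 2 8x22b | 97 | 95 | 93 | 87 | 75 | 89.4%
DeepSeek V3.2 | 98 | 95 | 92 | 89 | 70 | 88.9%
DeepSeek V3 (2024-12-26) | 92 | 90 | 90 | 87 | 86 | 88.9%
DeepSeek V4 Pro | 92 | 90 | 89 | 89 | 84 | 88.8%
GPT-5.4 Nano (Reasoning) | 91 | 91 | 88 | 88 | 86 | 88.7%
Gemini 2.5 Flash (Reasoning) | 98 | 91 | 90 | 88 | 75 | 88.4%
ByteDance Seed 2.0 Mini | 94 | 89 | 88 | 87 | 85 | 88.4%
Grok 4.20 | 96 | 96 | 88 | 87 | 73 | 88.2%
ByteDance Seed 1.6 | 98 | 92 | 86 | 81 | 81 | 87.7%
Claude Haiku 4.5 | 94 | 90 | 90 | 90 | 74 | 87.7%
Gemini 2.5 Flash Lite | 92 | 88 | 87 | 86 | 85 | 87.5%
Arcee AI: Trinity Large (Preview) | 95 | 92 | 87 | 85 | 78 | 87.5%
Qwen 3.6 27B | 94 | 93 | 90 | 90 | 70 | 87.5%
Ministral 8B | 95 | 87 | 83 | 83 | 82 | 86.1%
GPT-5 Mini | 93 | 92 | 92 | 79 | 73 | 85.6%
Qwen3.7 Max | 97 | 96 | 78 | 78 | 77 | 85.3%
o4 Mini | 96 | 94 | 93 | 72 | 69 | 84.8%
Qwen 2.5 72B | 90 | 87 | 83 | 81 | 81 | 84.4%
Qwen 3.6 Flash | 92 | 87 | 83 | 80 | 78 | 83.9%
Gemma 4 26B | 86 | 86 | 84 | 82 | 82 | 83.8%
Z.AI GLM 4.5 Air | 93 | 90 | 83 | 78 | 73 | 83.5%
Qwen 3.5 9B | 93 | 93 | 85 | 80 | 64 | 82.8%
GPT-OSS 120B | 88 | 87 | 81 | 80 | 79 | 82.8%
Inception Mercury 2 | 88 | 86 | 81 | 78 | 76 | 82.0%
Gemini 3 Pro (Preview) | 99 | 78 | 78 | 77 | 75 | 81.4%
Qwen 3.6 35B | 90 | 88 | 87 | 77 | 64 | 81.4%
Nemotron 3 Nano | 89 | 82 | 80 | 79 | 77 | 81.3%
GPT-4o, May 13th (temp=0) | 96 | 79 | 77 | 77 | 76 | 81.1%
GPT-5.5 (Reasoning) | 95 | 78 | 77 | 76 | 75 | 80.1%
Hermes 3 70B | 100 | 91 | 88 | 73 | 48 | 80.0%
Qwen3 235B A22B Instruct 2507 | 81 | 81 | 81 | 79 | 77 | 79.9%
Claude Opus 4.7 | 79 | 79 | 79 | 79 | 79 | 79.4%
Llama 3.1 Nemotron 70B | 82 | 82 | 79 | 79 | 75 | 79.3%
Writer: Palmyra X5 | 81 | 81 | 79 | 79 | 75 | 79.3%
Qwen 3 32B | 84 | 81 | 80 | 76 | 74 | 78.9%
Claude Opus 4.7 (Reasoning) | 79 | 79 | 77 | 77 | 77 | 77.9%
Llama 3.1 70B | 82 | 81 | 78 | 77 | 70 | 77.9%
Gemini 2.5 Flash | 78 | 78 | 78 | 77 | 77 | 77.7%
Claude 3.5 Sonnet | 79 | 77 | 77 | 77 | 77 | 77.5%
Gemma 3 4B | 81 | 78 | 76 | 76 | 75 | 77.3%
GPT-5.5 | 78 | 78 | 78 | 76 | 76 | 77.2%
Gemma 4 31B | 79 | 79 | 76 | 76 | 75 | 77.1%
GPT-5.5 (Reasoning, Low) | 79 | 78 | 77 | 76 | 75 | 77.1%
Gemini 3.1 Pro (Preview) | 78 | 78 | 78 | 75 | 75 | 76.9%
Mistral Small 3.2 24B | 89 | 81 | 74 | 72 | 68 | 76.8%
GPT-5.4 Mini | 92 | 81 | 71 | 70 | 68 | 76.6%
GPT-4o, May 13th (temp=1) | 89 | 78 | 77 | 75 | 62 | 76.3%
GPT-5.2 | 77 | 76 | 75 | 75 | 73 | 75.3%
Gemma 3 27B | 79 | 77 | 76 | 72 | 72 | 75.0%
GPT-5.4 (Reasoning) | 76 | 76 | 75 | 74 | 74 | 74.9%
Mistral Small 4 (Reasoning) | 90 | 88 | 81 | 66 | 47 | 74.3%
Gemini 3.1 Flash Lite | 78 | 77 | 73 | 73 | 70 | 74.2%
Inception Mercury | 84 | 77 | 73 | 69 | 67 | 73.8%
GPT-4o, Aug. 6th (temp=0) | 74 | 74 | 74 | 74 | 72 | 73.8%
Z.AI GLM 4.7 Flash | 92 | 78 | 70 | 68 | 61 | 73.7%
Gemini 3.1 Flash Lite (Preview) | 77 | 77 | 76 | 71 | 67 | 73.5%
Gemini 3.1 Flash Lite (Reasoning) | 77 | 77 | 73 | 70 | 70 | 73.5%
GPT-5.4 (Reasoning, Low) | 75 | 74 | 73 | 72 | 70 | 72.6%
Gemma 4 26B (Reasoning) | 92 | 92 | 90 | 85 | 0 | 71.8%
Arcee AI: Trinity Mini | 79 | 74 | 71 | 70 | 63 | 71.5%
GPT-5.4 Mini (Reasoning, Low) | 85 | 73 | 73 | 68 | 57 | 71.4%
ByteDance Seed 1.6 Flash | 81 | 76 | 70 | 67 | 62 | 71.2%
GPT-5.4 Nano (Reasoning, Low) | 81 | 80 | 73 | 63 | 59 | 71.2%
GPT-4.1 | 77 | 74 | 72 | 65 | 62 | 70.0%
Gemma 3 12B | 79 | 75 | 71 | 71 | 54 | 69.9%
Cydonia 24B V4.1 | 91 | 90 | 87 | 81 | 0 | 69.8%
GPT-4o, Aug. 6th (temp=1) | 81 | 70 | 65 | 65 | 64 | 68.9%
Claude 3 Haiku | 75 | 71 | 69 | 68 | 62 | 68.9%
Cohere Command R+ (Aug. 2024) | 88 | 81 | 71 | 67 | 36 | 68.6%
GPT-5.4 Nano | 77 | 73 | 70 | 63 | 60 | 68.5%
Ministral 3B | 72 | 70 | 68 | 67 | 63 | 68.0%
Mistral Small 4 | 84 | 70 | 65 | 63 | 55 | 67.6%
GPT-5.4 | 67 | 65 | 64 | 64 | 62 | 64.5%
GPT-4.1 Mini | 77 | 73 | 70 | 63 | 34 | 63.1%
Ministral 3 3B | 65 | 64 | 63 | 62 | 58 | 62.6%
GPT-5 Nano | 78 | 70 | 58 | 56 | 47 | 61.7%
Llama 3.1 8B | 70 | 63 | 63 | 60 | 42 | 59.6%
Qwen 3.5 27B | 98 | 96 | 94 | 0 | 0 | 57.7%
GPT-4.1 Nano | 79 | 66 | 53 | 41 | 37 | 55.1%
GPT-4o Mini (temp=1) | 64 | 54 | 54 | 50 | 50 | 54.5%
GPT-4o Mini (temp=0) | 54 | 54 | 54 | 54 | 50 | 52.9%
Skyfall 36B V2 | 74 | 63 | 63 | 58 | 0 | 51.8%
Mistral NeMO | 81 | 80 | 61 | 0 | 0 | 44.3%
Rocinante 12B | 69 | 29 | 0 | 0 | 0 | 19.7%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%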