Accuracy

Test: Codex Extraction

Avg. Score: 80.3%
Scenarios: 4

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 3 Flash (Preview) | 90.9% | $0.0027 | 3.9s | 85%
2 | Z.AI GLM 5 Turbo | 92.2% | $0.0068 | 16.0s | 87%
3 | Grok 4 Fast | 89.1% | $0.0012 | 8.7s | 80%
4 | Qwen 3.5 Plus (2026-02-15) | 90.4% | $0.0030 | 10.6s | 81%
5 | Gemini 3 Flash (Preview, Reasoning) | 93.5% | $0.0096 | 22.2s | 86%
6 | Mistral Small Creative | 87.4% | $0.0006 | 3.9s | 77%
7 | Stealth: Healer Alpha | 89.3% | $0.0000 | 24.6s | 83%
8 | Grok 4.20 (Beta) | 87.2% | $0.0049 | 2.0s | 78%
9 | Stealth: Hunter Alpha | 90.8% | $0.0000 | 36.7s | 84%
10 | Mistral Medium 3.1 | 87.8% | $0.0026 | 5.8s | 76%
11 | Claude Haiku 4.5 | 86.8% | $0.0073 | 4.5s | 80%
12 | Grok 4.1 Fast | 89.5% | $0.0017 | 22.1s | 79%
13 | Mistral Large 3 | 87.8% | $0.0027 | 8.2s | 76%
14 | Claude Sonnet 4.6 | 91.4% | $0.022 | 7.4s | 86%
15 | DeepSeek-V2 Chat | 86.5% | $0.0019 | 14.8s | 76%
16 | Gemini 2.5 Flash | 82.8% | $0.0023 | 2.5s | 75%
17 | Z.AI GLM 4.5 | 86.9% | $0.0028 | 16.8s | 77%
18 | Gemini 2.5 Flash (Reasoning) | 86.6% | $0.0082 | 11.8s | 79%
19 | Mistral Large 2 | 88.0% | $0.011 | 8.3s | 78%
20 | Claude Opus 4.6 | 95.1% | $0.037 | 8.1s | 90%
21 | DeepSeek V3 (2024-12-26) | 84.0% | $0.0017 | 13.5s | 76%
22 | Claude Opus 4.5 | 94.9% | $0.037 | 7.7s | 89%
23 | Grok 4.20 (Beta, Reasoning) | 90.1% | $0.020 | 12.7s | 80%
24 | Gemini 2.5 Flash Lite (Reasoning) | 84.3% | $0.0021 | 15.6s | 72%
25 | MiniMax M2.7 | 85.8% | $0.0022 | 21.7s | 73%
26 | DeepSeek V3.2 | 86.4% | $0.0011 | 36.7s | 76%
27 | Writer: Palmyra X5 | 80.9% | $0.0052 | 7.5s | 74%
28 | Qwen 3.5 Flash | 88.5% | $0.0031 | 49.5s | 81%
29 | Z.AI GLM 4.6 | 88.3% | $0.0057 | 38.8s | 78%
30 | Mistral Small 3.2 24B | 78.9% | $0.0005 | 4.4s | 70%
31 | Inception Mercury 2 | 80.2% | $0.0022 | 3.5s | 70%
32 | Z.AI GLM 5 | 91.1% | $0.0091 | 51.4s | 83%
33 | Ministral 3 14B | 80.7% | $0.0009 | 6.0s | 68%
34 | Gemini 3.1 Flash Lite (Preview) | 78.5% | $0.0017 | 2.0s | 69%
35 | GPT-5.4 Mini (Reasoning) | 88.3% | $0.017 | 25.6s | 80%
36 | MiniMax M2.5 | 84.9% | $0.0023 | 30.1s | 73%
37 | Claude Sonnet 4.5 | 87.9% | $0.022 | 6.6s | 75%
38 | Grok 4 | 94.3% | $0.031 | 37.2s | 88%
39 | Ministral 3 8B | 79.7% | $0.0006 | 3.3s | 65%
40 | Nemotron 3 Super | 89.5% | $0.0000 | 1.0m | 78%
41 | Claude 3.7 Sonnet | 86.3% | $0.021 | 7.8s | 76%
42 | Aion 2.0 | 92.6% | $0.0080 | 1.2m | 86%
43 | GPT-5 Mini | 86.9% | $0.0070 | 45.5s | 80%
44 | DeepSeek V3 (2025-03-24) | 84.7% | $0.0014 | 27.0s | 70%
45 | Qwen3 235B A22B Instruct 2507 | 80.2% | $0.0007 | 19.3s | 71%
46 | Hermes 3 405B | 83.4% | $0.0040 | 17.8s | 69%
47 | Gemini 2.5 Pro | 92.4% | $0.034 | 22.8s | 85%
48 | Ministral 8B | 77.5% | $0.0004 | 3.8s | 67%
49 | GPT-4o, Aug. 6th (temp=0) | 77.9% | $0.0100 | 4.0s | 74%
50 | Gemini 2.5 Flash Lite | 77.6% | $0.0005 | 1.9s | 65%
51 | Claude Sonnet 4 | 86.7% | $0.022 | 7.9s | 74%
52 | GPT-5.4 Nano (Reasoning) | 79.8% | $0.0023 | 12.2s | 68%
53 | Qwen 2.5 72B | 77.2% | $0.0008 | 11.0s | 67%
54 | GPT-5.4 (Reasoning, Low) | 83.1% | $0.017 | 10.4s | 73%
55 | WizardLM 2 8x22b | 82.5% | $0.0026 | 31.3s | 70%
56 | GPT-5.4 Mini (Reasoning, Low) | 76.5% | $0.0042 | 3.7s | 67%
57 | Qwen 3.5 35B | 87.5% | $0.015 | 47.3s | 80%
58 | GPT-5.4 Mini | 75.4% | $0.0031 | 2.2s | 65%
59 | GPT-5.2 | 83.0% | $0.018 | 16.8s | 74%
60 | GPT-5.4 | 79.7% | $0.011 | 6.4s | 67%
61 | o4 Mini | 83.8% | $0.014 | 21.5s | 72%
62 | o4 Mini High | 89.0% | $0.025 | 40.5s | 81%
63 | GPT-4.1 | 75.6% | $0.0073 | 4.7s | 66%
64 | GPT-5.1 | 92.1% | $0.035 | 43.6s | 86%
65 | Claude Opus 4.6 (Reasoning) | 94.5% | $0.055 | 21.4s | 89%
66 | GPT-4o, Aug. 6th (temp=1) | 74.4% | $0.0092 | 3.8s | 66%
67 | GPT-4o, May 13th (temp=0) | 79.8% | $0.025 | 4.3s | 72%
68 | GPT-4.1 Mini | 73.5% | $0.0015 | 6.6s | 60%
69 | Gemma 3 4B | 70.1% | $0.0002 | 6.7s | 63%
70 | DeepSeek V3.1 | 85.4% | $0.0012 | 26.1s | 54%
71 | Claude 3 Haiku | 70.8% | $0.0017 | 5.5s | 62%
72 | Hermes 3 70B | 76.6% | $0.0012 | 15.8s | 59%
73 | Arcee AI: Trinity Mini | 69.4% | $0.0003 | 5.8s | 61%
74 | Qwen 3 32B | 75.9% | $0.0010 | 37.9s | 68%
75 | Mistral Small 4 | 69.0% | $0.0008 | 3.2s | 60%
76 | ByteDance Seed 2.0 Lite | 87.1% | $0.0078 | 1.4m | 77%
77 | Mistral Large | 83.3% | $0.011 | 7.6s | 52%
78 | Claude Sonnet 4.6 (Reasoning) | 91.8% | $0.052 | 36.0s | 86%
79 | Mistral Small 4 (Reasoning) | 72.6% | $0.0019 | 15.6s | 59%
80 | Z.AI GLM 4.7 | 90.0% | $0.0098 | 1.7m | 81%
81 | Ministral 3B | 65.4% | $0.0002 | 1.8s | 59%
82 | GPT-4o, May 13th (temp=1) | 76.2% | $0.025 | 4.0s | 67%
83 | Inception Mercury | 66.9% | $0.0006 | 3.9s | 57%
84 | Claude 3.5 Sonnet | 83.1% | $0.043 | 13.9s | 76%
85 | Qwen 3.5 397B A17B | 91.3% | $0.012 | 1.8m | 82%
86 | Gemma 3 27B | 67.8% | $0.0005 | 14.5s | 58%
87 | Ministral 3 3B | 63.6% | $0.0004 | 1.8s | 57%
88 | GPT-4o Mini (temp=0) | 66.1% | $0.0005 | 7.9s | 57%
89 | GPT-4o Mini (temp=1) | 66.7% | $0.0006 | 8.0s | 56%
90 | GPT-5.4 Nano | 64.7% | $0.0011 | 3.8s | 56%
91 | ByteDance Seed 1.6 Flash | 72.1% | $0.0011 | 39.3s | 61%
92 | Llama 3.1 70B | 71.9% | $0.0016 | 16.0s | 51%
93 | GPT-5.4 Nano (Reasoning, Low) | 65.0% | $0.0011 | 3.7s | 53%
94 | ByteDance Seed 1.6 | 81.1% | $0.0073 | 1.3m | 70%
95 | Nemotron 3 Nano | 82.0% | $0.0013 | 1.6m | 72%
96 | GPT-5 | 94.0% | $0.049 | 1.3m | 89%
97 | Gemini 3 Pro (Preview) | 90.1% | $0.055 | 35.6s | 80%
98 | Qwen 3.5 9B | 82.0% | $0.0013 | 1.5m | 68%
99 | GPT-5.4 (Reasoning) | 87.0% | $0.044 | 53.6s | 77%
100 | Llama 3.1 Nemotron 70B | 69.9% | $0.0050 | 17.3s | 49%
101 | Arcee AI: Trinity Large (Preview) | 72.5% | $0.0000 | 20.5s | 39%
102 | Gemma 3 12B | 63.0% | $0.0003 | 14.1s | 46%
103 | MoonshotAI: Kimi K2.5 | 88.9% | $0.013 | 2.4m | 81%
104 | Cohere Command R+ (Aug. 2024) | 66.1% | $0.014 | 17.9s | 51%
105 | GPT-5 Nano | 71.8% | $0.0043 | 1.3m | 59%
106 | GPT-4.1 Nano | 51.8% | $0.0003 | 2.8s | 42%
107 | Z.AI GLM 4.7 Flash | 69.1% | $0.0019 | 1.2m | 52%
108 | Gemini 3.1 Pro (Preview) | 88.3% | $0.065 | 1.0m | 78%
109 | Llama 3.1 8B | 56.2% | $0.0001 | 10.0s | 37%
110 | Claude Opus 4 | 91.4% | $0.110 | 13.5s | 84%
111 | ByteDance Seed 2.0 Mini | 84.4% | $0.0034 | 3.4m | 77%
112 | Qwen 3.5 27B | 81.4% | $0.021 | 1.6m | 41%
113 | Rocinante 12B | 39.8% | $0.0013 | 25.6s | 20%
114 | Qwen 3.5 122B | 90.4% | $0.079 | 3.6m | 83%
115 | Mistral NeMO | 22.4% | $0.0006 | 1.4s | 0%
116 | LFM2 24B | 17.5% | $0.0002 | 12.5s | 0%

Individual Scenarios

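Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3 Pro (Preview) | 96 | 96 | 95 | 95 | 95 | 95.2%
Gemini 3 Flash (Preview, Reasoning) | 97 | 96 | 95 | 93 | 92 | 94.3%
Aion 2.0 | 98 | 94 | 93 | 92 | 90 | 93.6%
Claude Opus 4.5 | 95 | 93 | 93 | 93 | 93 | 93.1%
Claude Opus 4.6 | 93 | 93 | 92 | 92 | 92 | 92.2%
Claude Opus 4 | 92 | 92 | 92 | 92 | 91 | 91.7%
Claude Opus 4.6 (Reasoning) | 96 | 91 | 91 | 91 | 89 | 91.5%
Nemotron 3 Super | 96 | 92 | 91 | 89 | 89 | 91.4%
Gemini 3 Flash (Preview) | 94 | 93 | 91 | 90 | 89 | 91.3%
Stealth: Hunter Alpha | 96 | 95 | 90 | 89 | 86 | 91.3%
o4 Mini High | 94 | 94 | 90 | 89 | 88 | 91.2%
GPT-5 | 93 | 93 | 91 | 90 | 88 | 91.1%
GPT-5.4 (Reasoning) | 94 | 92 | 90 | 90 | 89 | 90.9%
Z.AI GLM 5 | 93 | 93 | 91 | 88 | 88 | 90.5%
Grok 4 | 92 | 91 | 90 | 90 | 86 | 89.9%
Claude Sonnet 4.6 (Reasoning) | 92 | 90 | 89 | 89 | 88 | 89.7%
Gemini 2.5 Pro | 92 | 91 | 89 | 88 | 86 | 89.4%
Gemini 3.1 Pro (Preview) | 92 | 91 | 90 | 89 | 82 | 88.9%
Stealth: Healer Alpha | 93 | 88 | 88 | 88 | 85 | 88.2%
Z.AI GLM 5 Turbo | 89 | 89 | 88 | 87 | 87 | 87.9%
Qwen 3.5 27B | 92 | 89 | 88 | 86 | 84 | 87.8%
GPT-5.4 Mini (Reasoning) | 89 | 88 | 88 | 88 | 86 | 87.6%
Qwen 3.5 Plus (2026-02-15) | 89 | 88 | 88 | 88 | 86 | 87.4%
GPT-5.1 | 89 | 88 | 87 | 87 | 85 | 87.3%
Claude Sonnet 4.6 | 90 | 89 | 87 | 85 | 85 | 87.2%
DeepSeek-V2 Chat | 91 | 90 | 87 | 86 | 81 | 86.9%
Mistral Medium 3.1 | 87 | 87 | 87 | 86 | 85 | 86.7%
Qwen 3.5 122B | 94 | 91 | 87 | 81 | 80 | 86.7%
Qwen 3.5 35B | 92 | 92 | 85 | 83 | 82 | 86.7%
Gemini 2.5 Flash (Reasoning) | 90 | 89 | 87 | 85 | 83 | 86.6%
GPT-5.4 (Reasoning, Low) | 88 | 87 | 87 | 87 | 81 | 86.1%
Grok 4.20 (Beta, Reasoning) | 93 | 88 | 87 | 84 | 78 | 86.0%
MoonshotAI: Kimi K2.5 | 91 | 88 | 87 | 86 | 78 | 86.0%
Nemotron 3 Nano | 94 | 91 | 87 | 84 | 73 | 85.9%
Z.AI GLM 4.7 | 91 | 91 | 85 | 83 | 80 | 85.8%
Mistral Small Creative | 89 | 87 | 85 | 84 | 83 | 85.6%
Qwen 3.5 397B A17B | 89 | 86 | 85 | 84 | 83 | 85.4%
Z.AI GLM 4.5 | 88 | 86 | 85 | 85 | 82 | 85.4%
Qwen 3.5 Flash | 89 | 89 | 86 | 82 | 80 | 85.3%
Grok 4 Fast | 88 | 86 | 86 | 85 | 81 | 85.3%
GPT-5.2 | 88 | 87 | 85 | 84 | 81 | 85.1%
Gemini 2.5 Flash | 88 | 88 | 85 | 83 | 82 | 85.1%
Claude Haiku 4.5 | 88 | 85 | 84 | 84 | 83 | 84.8%
Claude 3.7 Sonnet | 89 | 84 | 84 | 83 | 83 | 84.6%
Gemini 2.5 Flash Lite (Reasoning) | 92 | 88 | 81 | 80 | 79 | 84.2%
Mistral Large 2 | 86 | 86 | 85 | 84 | 79 | 84.1%
o4 Mini | 93 | 89 | 87 | 82 | 68 | 84.1%
Z.AI GLM 4.6 | 87 | 87 | 86 | 81 | 80 | 84.0%
Inception Mercury 2 | 89 | 87 | 85 | 81 | 77 | 83.9%
GPT-5 Mini | 89 | 83 | 83 | 83 | 80 | 83.7%
Grok 4.20 (Beta) | 85 | 84 | 84 | 83 | 83 | 83.7%
Claude Sonnet 4 | 91 | 88 | 85 | 78 | 76 | 83.6%
DeepSeek V3.2 | 89 | 84 | 83 | 81 | 79 | 83.2%
Mistral Large 3 | 87 | 85 | 80 | 80 | 80 | 82.7%
GPT-5.4 | 88 | 83 | 82 | 82 | 79 | 82.6%
Hermes 3 405B | 88 | 85 | 83 | 80 | 77 | 82.6%
GPT-5.4 Mini (Reasoning, Low) | 86 | 83 | 82 | 80 | 79 | 81.8%
WizardLM 2 8x22b | 88 | 85 | 84 | 80 | 71 | 81.6%
Mistral Small 3.2 24B | 85 | 84 | 83 | 81 | 74 | 81.3%
Grok 4.1 Fast | 88 | 88 | 80 | 76 | 73 | 80.9%
GPT-5.4 Nano (Reasoning) | 88 | 84 | 81 | 76 | 75 | 80.8%
Claude 3.5 Sonnet | 85 | 84 | 82 | 77 | 75 | 80.7%
Claude Sonnet 4.5 | 82 | 81 | 81 | 80 | 78 | 80.5%
ByteDance Seed 2.0 Lite | 85 | 84 | 83 | 77 | 73 | 80.2%
MiniMax M2.7 | 84 | 82 | 80 | 78 | 75 | 80.0%
DeepSeek V3 (2024-12-26) | 84 | 82 | 78 | 77 | 77 | 79.8%
GPT-4o, Aug. 6th (temp=0) | 81 | 81 | 81 | 81 | 76 | 79.6%
Hermes 3 70B | 85 | 83 | 82 | 79 | 68 | 79.2%
ByteDance Seed 2.0 Mini | 83 | 82 | 78 | 78 | 75 | 79.1%
Qwen3 235B A22B Instruct 2507 | 78 | 78 | 78 | 77 | 76 | 77.4%
Writer: Palmyra X5 | 84 | 77 | 76 | 75 | 75 | 77.3%
GPT-4o, May 13th (temp=0) | 80 | 78 | 78 | 77 | 73 | 77.2%
Qwen 3.5 9B | 86 | 85 | 83 | 80 | 52 | 77.2%
MiniMax M2.5 | 84 | 82 | 80 | 75 | 65 | 77.1%
Ministral 3 14B | 79 | 77 | 77 | 76 | 74 | 76.7%
Llama 3.1 Nemotron 70B | 81 | 80 | 78 | 76 | 66 | 76.1%
ByteDance Seed 1.6 | 83 | 77 | 74 | 74 | 72 | 76.0%
Qwen 3 32B | 82 | 82 | 79 | 69 | 69 | 76.0%
Gemini 3.1 Flash Lite (Preview) | 78 | 78 | 77 | 74 | 74 | 76.0%
DeepSeek V3 (2025-03-24) | 83 | 83 | 74 | 70 | 67 | 75.3%
Llama 3.1 70B | 79 | 78 | 74 | 73 | 70 | 74.6%
GPT-5.4 Mini | 80 | 77 | 73 | 71 | 71 | 74.3%
GPT-4.1 Mini | 79 | 77 | 77 | 70 | 68 | 74.1%
GPT-4.1 | 76 | 74 | 74 | 74 | 69 | 73.3%
GPT-4o, Aug. 6th (temp=1) | 77 | 76 | 74 | 71 | 69 | 73.3%
Gemini 2.5 Flash Lite | 78 | 76 | 70 | 69 | 65 | 71.7%
DeepSeek V3.1 | 93 | 93 | 86 | 84 | 0 | 71.3%
GPT-4o Mini (temp=1) | 75 | 74 | 74 | 70 | 63 | 71.2%
Mistral Small 4 (Reasoning) | 79 | 78 | 70 | 69 | 60 | 71.0%
Ministral 8B | 73 | 72 | 72 | 69 | 68 | 71.0%
GPT-4o, May 13th (temp=1) | 76 | 74 | 71 | 69 | 65 | 70.9%
Ministral 3 8B | 73 | 73 | 71 | 69 | 68 | 70.8%
GPT-5 Nano | 79 | 78 | 77 | 72 | 47 | 70.3%
Mistral Small 4 | 76 | 72 | 71 | 70 | 61 | 70.1%
GPT-4o Mini (temp=0) | 75 | 72 | 70 | 69 | 64 | 70.1%
Qwen 2.5 72B | 76 | 72 | 72 | 66 | 65 | 70.0%
Cohere Command R+ (Aug. 2024) | 85 | 72 | 67 | 64 | 62 | 69.9%
LFM2 24B | 74 | 74 | 73 | 73 | 54 | 69.9%
Gemma 3 4B | 73 | 72 | 71 | 67 | 65 | 69.6%
Arcee AI: Trinity Mini | 75 | 71 | 69 | 67 | 65 | 69.3%
Claude 3 Haiku | 71 | 69 | 66 | 65 | 64 | 67.0%
ByteDance Seed 1.6 Flash | 78 | 77 | 70 | 63 | 45 | 67.0%
Mistral Large | 87 | 83 | 82 | 80 | 0 | 66.4%
GPT-5.4 Nano (Reasoning, Low) | 75 | 70 | 64 | 62 | 57 | 65.8%
GPT-5.4 Nano | 73 | 71 | 70 | 69 | 44 | 65.4%
Inception Mercury | 71 | 65 | 64 | 61 | 60 | 64.4%
Ministral 3B | 69 | 65 | 63 | 62 | 59 | 63.7%
Gemma 3 12B | 68 | 67 | 67 | 64 | 53 | 63.7%
Ministral 3 3B | 66 | 63 | 61 | 61 | 61 | 62.2%
Z.AI GLM 4.7 Flash | 80 | 61 | 60 | 59 | 42 | 60.5%
Gemma 3 27B | 68 | 67 | 58 | 58 | 47 | 59.4%
Arcee AI: Trinity Large (Preview) | 74 | 72 | 68 | 62 | 0 | 55.4%
Llama 3.1 8B | 76 | 72 | 66 | 56 | 0 | 54.1%
GPT-4.1 Nano | 58 | 48 | 46 | 39 | 32 | 44.6%
Rocinante 12B | 77 | 71 | 68 | 0 | 0 | 43.1%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%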
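
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 | 99 | 99 | 99 | 99 | 99 | 98.6%
Claude Opus 4.6 (Reasoning) | 99 | 99 | 99 | 97 | 93 | 97.3%
Claude Opus 4.5 | 99 | 99 | 96 | 96 | 96 | 97.1%
GPT-5 | 96 | 95 | 95 | 95 | 95 | 94.9%
Grok 4 | 97 | 94 | 94 | 94 | 91 | 94.2%
GPT-5.1 | 96 | 95 | 95 | 91 | 91 | 93.5%
Gemini 3.1 Pro (Preview) | 96 | 96 | 95 | 91 | 89 | 93.3%
Gemini 3 Pro (Preview) | 95 | 94 | 93 | 91 | 88 | 92.3%
Z.AI GLM 5 Turbo | 96 | 92 | 92 | 91 | 90 | 92.1%
Claude Sonnet 4.6 | 92 | 92 | 92 | 92 | 91 | 92.1%
Gemini 3 Flash (Preview, Reasoning) | 96 | 93 | 92 | 91 | 86 | 91.7%
Aion 2.0 | 95 | 92 | 92 | 90 | 89 | 91.5%
Z.AI GLM 4.7 | 98 | 94 | 90 | 88 | 86 | 91.3%
Qwen 3.5 Flash | 96 | 93 | 92 | 89 | 86 | 91.3%
Claude Sonnet 4.6 (Reasoning) | 92 | 92 | 91 | 91 | 90 | 91.2%
Mistral Large 2 | 94 | 91 | 91 | 90 | 88 | 90.9%
Mistral Large 3 | 92 | 92 | 92 | 92 | 86 | 90.8%
Qwen 3.5 27B | 95 | 94 | 90 | 89 | 85 | 90.6%
GPT-5.4 (Reasoning) | 92 | 91 | 90 | 90 | 89 | 90.4%
Qwen 3.5 122B | 93 | 91 | 91 | 90 | 87 | 90.3%
Gemini 2.5 Pro | 93 | 92 | 90 | 90 | 85 | 90.1%
Z.AI GLM 4.6 | 96 | 94 | 92 | 84 | 83 | 89.8%
Qwen 3.5 397B A17B | 96 | 92 | 91 | 87 | 82 | 89.5%
Grok 4.20 (Beta, Reasoning) | 96 | 89 | 88 | 88 | 86 | 89.4%
DeepSeek V3.1 | 90 | 90 | 89 | 89 | 89 | 89.4%
Mistral Large | 92 | 90 | 90 | 88 | 86 | 89.1%
Claude Opus 4 | 95 | 91 | 90 | 86 | 84 | 89.0%
Stealth: Healer Alpha | 91 | 91 | 90 | 87 | 85 | 88.8%
Stealth: Hunter Alpha | 93 | 90 | 89 | 87 | 83 | 88.6%
DeepSeek-V2 Chat | 91 | 91 | 91 | 88 | 81 | 88.4%
DeepSeek V3.2 | 91 | 91 | 89 | 87 | 84 | 88.3%
GPT-5 Mini | 92 | 91 | 88 | 85 | 84 | 87.9%
ByteDance Seed 2.0 Mini | 89 | 89 | 88 | 88 | 83 | 87.5%
Qwen 3.5 Plus (2026-02-15) | 91 | 90 | 88 | 85 | 84 | 87.5%
Grok 4.1 Fast | 90 | 90 | 88 | 86 | 83 | 87.4%
Grok 4 Fast | 93 | 91 | 89 | 85 | 80 | 87.4%
Gemini 3 Flash (Preview) | 91 | 89 | 86 | 86 | 85 | 87.3%
Claude 3.5 Sonnet | 90 | 89 | 88 | 86 | 83 | 87.2%
Qwen 3.5 35B | 91 | 88 | 87 | 86 | 85 | 87.2%
ByteDance Seed 2.0 Lite | 91 | 89 | 88 | 83 | 83 | 86.8%
DeepSeek V3 (2025-03-24) | 88 | 87 | 87 | 86 | 85 | 86.6%
MiniMax M2.7 | 89 | 87 | 87 | 84 | 83 | 85.9%
MoonshotAI: Kimi K2.5 | 89 | 89 | 89 | 83 | 79 | 85.8%
Gemini 2.5 Flash (Reasoning) | 89 | 87 | 85 | 85 | 82 | 85.6%
DeepSeek V3 (2024-12-26) | 91 | 89 | 88 | 82 | 78 | 85.6%
MiniMax M2.5 | 89 | 87 | 86 | 84 | 80 | 85.3%
GPT-5.4 Mini (Reasoning) | 90 | 88 | 88 | 81 | 80 | 85.3%
o4 Mini High | 90 | 88 | 87 | 82 | 79 | 85.3%
Claude Haiku 4.5 | 86 | 86 | 84 | 84 | 84 | 85.2%
Claude Sonnet 4.5 | 86 | 86 | 85 | 85 | 83 | 85.1%
Qwen 3.5 9B | 93 | 88 | 88 | 86 | 70 | 85.1%
Gemini 3.1 Flash Lite (Preview) | 88 | 87 | 86 | 84 | 80 | 84.8%
Grok 4.20 (Beta) | 87 | 86 | 85 | 84 | 82 | 84.8%
Z.AI GLM 4.5 | 93 | 85 | 83 | 82 | 79 | 84.5%
Gemini 2.5 Flash Lite (Reasoning) | 90 | 88 | 84 | 82 | 76 | 84.0%
Mistral Small Creative | 85 | 85 | 84 | 83 | 83 | 84.0%
Z.AI GLM 5 | 87 | 85 | 84 | 82 | 81 | 83.7%
GPT-5.2 | 89 | 85 | 82 | 81 | 80 | 83.6%
WizardLM 2 8x22b | 88 | 88 | 83 | 81 | 76 | 83.2%
GPT-5.4 | 85 | 84 | 84 | 81 | 81 | 83.1%
Nemotron 3 Super | 86 | 86 | 82 | 81 | 80 | 83.0%
GPT-5.4 (Reasoning, Low) | 85 | 85 | 84 | 83 | 78 | 82.9%
GPT-5 Nano | 88 | 86 | 82 | 81 | 78 | 82.9%
Nemotron 3 Nano | 90 | 85 | 81 | 81 | 77 | 82.7%
Mistral Small 3.2 24B | 84 | 84 | 84 | 81 | 77 | 82.2%
Claude 3.7 Sonnet | 85 | 82 | 82 | 81 | 81 | 82.1%
ByteDance Seed 1.6 | 86 | 83 | 81 | 80 | 79 | 82.0%
Mistral Medium 3.1 | 84 | 84 | 82 | 80 | 80 | 81.9%
Writer: Palmyra X5 | 85 | 84 | 82 | 79 | 78 | 81.7%
Ministral 3 8B | 82 | 82 | 82 | 82 | 79 | 81.6%
Claude Sonnet 4 | 86 | 81 | 81 | 80 | 78 | 81.2%
GPT-4o, May 13th (temp=0) | 83 | 83 | 82 | 80 | 78 | 81.1%
Ministral 3 14B | 84 | 82 | 82 | 81 | 77 | 81.1%
Gemini 2.5 Flash | 84 | 84 | 82 | 79 | 76 | 80.8%
Qwen 2.5 72B | 82 | 82 | 81 | 80 | 76 | 80.3%
Qwen3 235B A22B Instruct 2507 | 90 | 84 | 82 | 80 | 64 | 80.2%
Ministral 8B | 82 | 81 | 81 | 79 | 77 | 80.1%
GPT-4o, Aug. 6th (temp=0) | 83 | 80 | 80 | 80 | 77 | 80.0%
o4 Mini | 87 | 85 | 83 | 75 | 70 | 79.9%
Inception Mercury 2 | 91 | 79 | 77 | 76 | 75 | 79.6%
Gemini 2.5 Flash Lite | 87 | 82 | 81 | 80 | 67 | 79.3%
GPT-4.1 Mini | 86 | 79 | 79 | 77 | 76 | 79.2%
GPT-4o, May 13th (temp=1) | 84 | 82 | 81 | 80 | 68 | 78.7%
GPT-4o, Aug. 6th (temp=1) | 84 | 82 | 80 | 79 | 65 | 78.0%
Hermes 3 405B | 82 | 79 | 77 | 77 | 74 | 78.0%
GPT-4.1 | 81 | 80 | 80 | 76 | 72 | 77.9%
GPT-5.4 Mini (Reasoning, Low) | 82 | 80 | 76 | 76 | 73 | 77.7%
Llama 3.1 70B | 80 | 80 | 80 | 79 | 66 | 77.0%
GPT-5.4 Mini | 79 | 79 | 79 | 75 | 70 | 76.6%
Qwen 3 32B | 85 | 79 | 76 | 75 | 68 | 76.6%
Hermes 3 70B | 81 | 77 | 74 | 74 | 72 | 75.4%
Claude 3 Haiku | 81 | 79 | 74 | 70 | 68 | 74.5%
Z.AI GLM 4.7 Flash | 82 | 81 | 70 | 68 | 66 | 73.4%
Mistral Small 4 (Reasoning) | 77 | 76 | 73 | 71 | 68 | 73.1%
GPT-5.4 Nano (Reasoning) | 83 | 82 | 72 | 63 | 63 | 72.5%
GPT-4o Mini (temp=1) | 83 | 76 | 69 | 67 | 65 | 71.8%
ByteDance Seed 1.6 Flash | 82 | 78 | 67 | 67 | 62 | 71.1%
Gemma 3 4B | 73 | 72 | 72 | 70 | 68 | 71.1%
Arcee AI: Trinity Mini | 78 | 74 | 69 | 68 | 61 | 69.8%
Mistral Small 4 | 81 | 70 | 68 | 68 | 57 | 68.8%
Inception Mercury | 75 | 74 | 65 | 63 | 58 | 67.0%
Arcee AI: Trinity Large (Preview) | 86 | 86 | 82 | 78 | 0 | 66.5%
GPT-4o Mini (temp=0) | 73 | 70 | 67 | 62 | 60 | 66.3%
Cohere Command R+ (Aug. 2024) | 72 | 66 | 65 | 63 | 58 | 64.8%
GPT-5.4 Nano | 72 | 64 | 61 | 61 | 58 | 63.1%
Llama 3.1 8B | 70 | 68 | 61 | 60 | 56 | 62.8%
Ministral 3B | 66 | 63 | 63 | 62 | 60 | 62.8%
Gemma 3 27B | 72 | 70 | 62 | 57 | 53 | 62.7%
Ministral 3 3B | 63 | 63 | 62 | 61 | 61 | 61.6%
GPT-5.4 Nano (Reasoning, Low) | 71 | 66 | 64 | 55 | 48 | 60.9%
GPT-4.1 Nano | 66 | 57 | 56 | 54 | 53 | 57.3%
Llama 3.1 Nemotron 70B | 73 | 71 | 69 | 66 | 0 | 55.9%
Rocinante 12B | 76 | 71 | 70 | 60 | 0 | 55.4%
Gemma 3 12B | 71 | 71 | 67 | 65 | 0 | 54.9%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%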
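
Model | #1 | #2 | #3 | #4 | #5 | Avg
Grok 4 | 99 | 96 | 95 | 93 | 92 | 95.2%
Claude Opus 4.6 (Reasoning) | 96 | 95 | 95 | 95 | 95 | 94.9%
Z.AI GLM 5 Turbo | 96 | 95 | 94 | 94 | 93 | 94.5%
Claude Opus 4.6 | 95 | 95 | 95 | 95 | 93 | 94.3%
Gemini 3.1 Pro (Preview) | 96 | 95 | 94 | 94 | 92 | 94.3%
Gemini 3 Flash (Preview, Reasoning) | 98 | 95 | 94 | 93 | 92 | 94.3%
Z.AI GLM 5 | 96 | 95 | 95 | 92 | 91 | 93.9%
Grok 4.1 Fast | 96 | 94 | 94 | 93 | 92 | 93.7%
Gemini 3 Flash (Preview) | 95 | 95 | 93 | 93 | 91 | 93.6%
Qwen 3.5 397B A17B | 95 | 95 | 93 | 92 | 89 | 92.9%
Claude Opus 4.5 | 94 | 94 | 94 | 94 | 88 | 92.9%
GPT-5 | 94 | 93 | 93 | 92 | 92 | 92.7%
Gemini 2.5 Pro | 94 | 93 | 93 | 91 | 90 | 92.2%
GPT-5.1 | 95 | 92 | 92 | 92 | 88 | 92.0%
GPT-5.4 (Reasoning) | 96 | 92 | 92 | 90 | 89 | 91.8%
Gemini 3 Pro (Preview) | 94 | 94 | 92 | 91 | 88 | 91.6%
Claude Sonnet 4.6 | 93 | 91 | 91 | 91 | 91 | 91.4%
MoonshotAI: Kimi K2.5 | 95 | 93 | 90 | 89 | 89 | 91.2%
GPT-5.4 (Reasoning, Low) | 95 | 91 | 90 | 90 | 89 | 90.9%
Claude Sonnet 4.6 (Reasoning) | 92 | 91 | 91 | 90 | 90 | 90.7%
GPT-5 Mini | 93 | 90 | 90 | 90 | 89 | 90.6%
Stealth: Hunter Alpha | 94 | 91 | 90 | 89 | 88 | 90.3%
Aion 2.0 | 92 | 91 | 91 | 91 | 86 | 90.3%
Grok 4 Fast | 94 | 92 | 90 | 88 | 87 | 90.1%
Qwen 3.5 122B | 93 | 92 | 90 | 90 | 86 | 89.9%
Claude Haiku 4.5 | 92 | 90 | 90 | 90 | 88 | 89.7%
Grok 4.20 (Beta, Reasoning) | 94 | 92 | 88 | 88 | 85 | 89.5%
Qwen 3.5 27B | 92 | 91 | 90 | 89 | 86 | 89.5%
Z.AI GLM 4.6 | 94 | 92 | 90 | 87 | 83 | 89.3%
ByteDance Seed 2.0 Lite | 93 | 92 | 90 | 86 | 85 | 89.2%
o4 Mini High | 92 | 90 | 88 | 88 | 87 | 89.0%
Qwen 3.5 Plus (2026-02-15) | 91 | 89 | 89 | 88 | 87 | 88.9%
Z.AI GLM 4.7 | 95 | 91 | 89 | 85 | 84 | 88.8%
GPT-5.4 | 92 | 88 | 88 | 88 | 86 | 88.5%
Stealth: Healer Alpha | 93 | 91 | 89 | 85 | 84 | 88.5%
Claude Sonnet 4.5 | 93 | 90 | 87 | 87 | 85 | 88.4%
Claude Opus 4 | 92 | 90 | 88 | 86 | 86 | 88.4%
GPT-5.2 | 92 | 92 | 88 | 87 | 80 | 87.8%
Gemini 2.5 Flash | 90 | 89 | 89 | 88 | 83 | 87.7%
DeepSeek V3.1 | 89 | 88 | 88 | 85 | 84 | 87.0%
GPT-5.4 Mini (Reasoning) | 90 | 87 | 87 | 87 | 84 | 87.0%
MiniMax M2.5 | 92 | 87 | 86 | 86 | 85 | 87.0%
Claude 3.5 Sonnet | 88 | 88 | 87 | 86 | 85 | 87.0%
Grok 4.20 (Beta) | 89 | 88 | 87 | 85 | 85 | 86.9%
o4 Mini | 90 | 88 | 88 | 83 | 82 | 86.2%
Claude Sonnet 4 | 91 | 86 | 86 | 85 | 83 | 86.2%
Nemotron 3 Super | 91 | 89 | 84 | 84 | 83 | 86.1%
Qwen 3.5 Flash | 93 | 87 | 87 | 82 | 81 | 85.8%
Gemini 2.5 Flash (Reasoning) | 89 | 88 | 84 | 84 | 83 | 85.7%
Writer: Palmyra X5 | 89 | 89 | 87 | 81 | 80 | 85.4%
Claude 3.7 Sonnet | 90 | 88 | 85 | 84 | 79 | 85.2%
Qwen 3.5 35B | 87 | 87 | 84 | 84 | 83 | 85.1%
DeepSeek V3.2 | 87 | 87 | 85 | 84 | 81 | 85.1%
Mistral Medium 3.1 | 87 | 86 | 84 | 84 | 84 | 85.0%
Mistral Small Creative | 88 | 88 | 84 | 83 | 81 | 84.7%
Z.AI GLM 4.5 | 87 | 85 | 84 | 83 | 79 | 83.4%
Qwen3 235B A22B Instruct 2507 | 88 | 87 | 87 | 80 | 73 | 83.0%
Qwen 3.5 9B | 88 | 87 | 85 | 78 | 75 | 82.7%
MiniMax M2.7 | 92 | 83 | 81 | 80 | 77 | 82.6%
ByteDance Seed 2.0 Mini | 85 | 85 | 83 | 81 | 79 | 82.5%
Mistral Large 2 | 87 | 82 | 81 | 81 | 81 | 82.4%
Mistral Large | 83 | 83 | 82 | 81 | 81 | 82.0%
DeepSeek V3 (2024-12-26) | 89 | 88 | 78 | 77 | 76 | 81.7%
Mistral Large 3 | 83 | 81 | 81 | 81 | 81 | 81.6%
GPT-4.1 | 86 | 84 | 83 | 81 | 71 | 81.1%
Arcee AI: Trinity Large (Preview) | 88 | 83 | 79 | 77 | 76 | 80.6%
DeepSeek V3 (2025-03-24) | 84 | 82 | 79 | 79 | 78 | 80.5%
DeepSeek-V2 Chat | 89 | 88 | 83 | 71 | 68 | 79.9%
GPT-4o, May 13th (temp=0) | 81 | 80 | 79 | 79 | 79 | 79.9%
Gemini 3.1 Flash Lite (Preview) | 85 | 81 | 78 | 77 | 76 | 79.5%
ByteDance Seed 1.6 Flash | 83 | 83 | 81 | 76 | 74 | 79.3%
GPT-4o, May 13th (temp=1) | 83 | 82 | 80 | 76 | 74 | 79.1%
ByteDance Seed 1.6 | 88 | 80 | 77 | 76 | 72 | 78.6%
Hermes 3 405B | 82 | 79 | 79 | 77 | 77 | 78.5%
Nemotron 3 Nano | 80 | 80 | 79 | 77 | 75 | 78.2%
GPT-4o, Aug. 6th (temp=0) | 78 | 78 | 78 | 78 | 78 | 78.1%
GPT-4.1 Mini | 86 | 81 | 78 | 72 | 71 | 77.5%
GPT-5.4 Nano (Reasoning) | 86 | 81 | 78 | 75 | 67 | 77.3%
GPT-4o, Aug. 6th (temp=1) | 81 | 78 | 77 | 76 | 75 | 77.3%
Gemini 2.5 Flash Lite (Reasoning) | 80 | 79 | 79 | 76 | 69 | 76.9%
WizardLM 2 8x22b | 86 | 79 | 75 | 73 | 66 | 75.8%
Mistral Small 3.2 24B | 77 | 77 | 77 | 75 | 70 | 75.3%
Inception Mercury 2 | 82 | 80 | 77 | 74 | 63 | 75.3%
GPT-5.4 Mini (Reasoning, Low) | 78 | 76 | 76 | 75 | 71 | 75.2%
GPT-4o Mini (temp=0) | 76 | 76 | 76 | 76 | 71 | 75.0%
GPT-5.4 Mini | 86 | 74 | 74 | 70 | 68 | 74.3%
Ministral 3 8B | 79 | 76 | 75 | 72 | 70 | 74.3%
Qwen 2.5 72B | 77 | 75 | 74 | 74 | 71 | 74.2%
Gemma 3 27B | 79 | 79 | 71 | 71 | 70 | 74.1%
Ministral 3 14B | 77 | 75 | 75 | 72 | 72 | 74.1%
Ministral 8B | 77 | 73 | 72 | 71 | 70 | 72.8%
Claude 3 Haiku | 82 | 75 | 75 | 66 | 66 | 72.6%
GPT-5 Nano | 81 | 79 | 73 | 67 | 63 | 72.4%
Qwen 3 32B | 76 | 73 | 72 | 71 | 67 | 71.9%
Gemini 2.5 Flash Lite | 75 | 74 | 71 | 70 | 69 | 71.8%
Mistral Small 4 (Reasoning) | 80 | 78 | 72 | 66 | 63 | 71.7%
Hermes 3 70B | 81 | 73 | 73 | 68 | 62 | 71.6%
Mistral Small 4 | 73 | 72 | 71 | 67 | 63 | 69.4%
GPT-4o Mini (temp=1) | 76 | 71 | 69 | 67 | 64 | 69.4%
Z.AI GLM 4.7 Flash | 84 | 75 | 73 | 69 | 42 | 68.5%
Llama 3.1 Nemotron 70B | 76 | 68 | 67 | 65 | 65 | 68.3%
Ministral 3 3B | 74 | 73 | 71 | 64 | 59 | 68.1%
Ministral 3B | 74 | 70 | 70 | 65 | 56 | 67.3%
Arcee AI: Trinity Mini | 77 | 71 | 69 | 61 | 58 | 67.0%
Gemma 3 12B | 67 | 67 | 62 | 61 | 60 | 63.6%
Gemma 3 4B | 75 | 63 | 61 | 59 | 55 | 62.4%
Inception Mercury | 66 | 63 | 62 | 61 | 60 | 62.4%
GPT-5.4 Nano (Reasoning, Low) | 69 | 69 | 64 | 58 | 52 | 62.3%
GPT-5.4 Nano | 69 | 68 | 65 | 59 | 48 | 61.8%
Cohere Command R+ (Aug. 2024) | 69 | 65 | 60 | 58 | 53 | 61.0%
Llama 3.1 70B | 80 | 79 | 69 | 63 | 0 | 58.1%
GPT-4.1 Nano | 53 | 53 | 51 | 48 | 45 | 50.1%
Llama 3.1 8B | 66 | 66 | 56 | 55 | 0 | 48.3%
Mistral NeMO | 82 | 72 | 72 | 0 | 0 | 45.3%
Rocinante 12B | 72 | 60 | 56 | 16 | 0 | 40.9%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%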
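
Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 2.5 Pro | 98 | 98 | 98 | 98 | 96 | 97.9%
Qwen 3.5 Plus (2026-02-15) | 98 | 98 | 98 | 98 | 96 | 97.9%
Grok 4 | 98 | 98 | 98 | 97 | 96 | 97.7%
Claude Sonnet 4.5 | 98 | 98 | 98 | 96 | 96 | 97.5%
Mistral Medium 3.1 | 98 | 98 | 97 | 97 | 96 | 97.5%
GPT-5 | 98 | 97 | 97 | 97 | 97 | 97.3%
Qwen 3.5 397B A17B | 98 | 98 | 98 | 97 | 94 | 97.3%
Nemotron 3 Super | 98 | 98 | 97 | 97 | 96 | 97.2%
Claude Opus 4.5 | 98 | 98 | 98 | 94 | 94 | 96.7%
DeepSeek V3 (2025-03-24) | 99 | 97 | 97 | 95 | 94 | 96.5%
Claude Opus 4 | 98 | 98 | 98 | 94 | 93 | 96.5%
Mistral Large 3 | 98 | 98 | 95 | 95 | 95 | 96.3%
Z.AI GLM 5 | 97 | 97 | 97 | 95 | 94 | 96.2%
Grok 4.1 Fast | 98 | 96 | 96 | 96 | 94 | 96.0%
Claude Sonnet 4 | 96 | 96 | 96 | 96 | 96 | 96.0%
Claude Sonnet 4.6 (Reasoning) | 98 | 96 | 96 | 94 | 94 | 95.7%
GPT-5.1 | 98 | 96 | 96 | 95 | 92 | 95.6%
Mistral Large | 98 | 98 | 95 | 95 | 91 | 95.6%
Claude Opus 4.6 | 95 | 95 | 95 | 95 | 95 | 95.3%
Grok 4.20 (Beta, Reasoning) | 98 | 97 | 96 | 94 | 90 | 95.2%
Mistral Small Creative | 95 | 95 | 95 | 95 | 95 | 95.2%
MiniMax M2.7 | 98 | 98 | 96 | 94 | 88 | 94.9%
Aion 2.0 | 98 | 98 | 97 | 91 | 90 | 94.9%
Claude Sonnet 4.6 | 96 | 96 | 96 | 93 | 93 | 94.8%
Qwen 3.5 122B | 98 | 96 | 94 | 92 | 92 | 94.7%
Hermes 3 405B | 95 | 95 | 95 | 94 | 93 | 94.6%
Mistral Large 2 | 98 | 95 | 95 | 92 | 92 | 94.6%
Z.AI GLM 4.5 | 95 | 95 | 95 | 95 | 92 | 94.3%
Z.AI GLM 5 Turbo | 96 | 96 | 95 | 93 | 92 | 94.2%
Claude Opus 4.6 (Reasoning) | 94 | 94 | 94 | 94 | 94 | 94.1%
DeepSeek V3.1 | 97 | 97 | 96 | 91 | 89 | 93.9%
Gemini 3 Flash (Preview, Reasoning) | 98 | 97 | 97 | 97 | 79 | 93.9%
Z.AI GLM 4.7 | 98 | 94 | 94 | 93 | 90 | 93.9%
Grok 4 Fast | 97 | 95 | 94 | 94 | 88 | 93.7%
Grok 4.20 (Beta) | 96 | 95 | 94 | 92 | 90 | 93.7%
GPT-5.4 Mini (Reasoning) | 98 | 95 | 94 | 93 | 87 | 93.5%
Claude 3.7 Sonnet | 94 | 94 | 94 | 92 | 92 | 93.2%
Stealth: Hunter Alpha | 97 | 95 | 94 | 92 | 87 | 93.1%
MoonshotAI: Kimi K2.5 | 96 | 94 | 92 | 90 | 90 | 92.4%
Ministral 3 8B | 94 | 94 | 94 | 91 | 90 | 92.3%
Gemini 2.5 Flash Lite (Reasoning) | 96 | 94 | 93 | 92 | 87 | 92.3%
ByteDance Seed 2.0 Lite | 96 | 94 | 94 | 90 | 86 | 92.0%
Stealth: Healer Alpha | 96 | 95 | 92 | 88 | 87 | 91.9%
Gemini 3 Flash (Preview) | 94 | 94 | 92 | 92 | 85 | 91.5%
Qwen 3.5 Flash | 97 | 92 | 90 | 90 | 88 | 91.5%
Qwen 3.5 35B | 95 | 92 | 92 | 92 | 86 | 91.2%
Ministral 3 14B | 95 | 92 | 92 | 88 | 88 | 91.0%
DeepSeek-V2 Chat | 96 | 92 | 92 | 90 | 84 | 90.9%
o4 Mini High | 97 | 95 | 94 | 90 | 78 | 90.7%
MiniMax M2.5 | 96 | 96 | 95 | 89 | 75 | 90.2%
Z.AI GLM 4.6 | 96 | 96 | 89 | 85 | 84 | 90.1%
WizardLM 2 8x22b | 97 | 95 | 93 | 87 | 75 | 89.4%
DeepSeek V3.2 | 98 | 95 | 92 | 89 | 70 | 88.9%
DeepSeek V3 (2024-12-26) | 92 | 90 | 90 | 87 | 86 | 88.9%
GPT-5.4 Nano (Reasoning) | 91 | 91 | 88 | 88 | 86 | 88.7%
Gemini 2.5 Flash (Reasoning) | 98 | 91 | 90 | 88 | 75 | 88.4%
ByteDance Seed 2.0 Mini | 94 | 89 | 88 | 87 | 85 | 88.4%
ByteDance Seed 1.6 | 98 | 92 | 86 | 81 | 81 | 87.7%
Claude Haiku 4.5 | 94 | 90 | 90 | 90 | 74 | 87.7%
Gemini 2.5 Flash Lite | 92 | 88 | 87 | 86 | 85 | 87.5%
Arcee AI: Trinity Large (Preview) | 95 | 92 | 87 | 85 | 78 | 87.5%
Ministral 8B | 95 | 87 | 83 | 83 | 82 | 86.1%
GPT-5 Mini | 93 | 92 | 92 | 79 | 73 | 85.6%
o4 Mini | 96 | 94 | 93 | 72 | 69 | 84.8%
Qwen 2.5 72B | 90 | 87 | 83 | 81 | 81 | 84.4%
Qwen 3.5 9B | 93 | 93 | 85 | 80 | 64 | 82.8%
Inception Mercury 2 | 88 | 86 | 81 | 78 | 76 | 82.0%
Gemini 3 Pro (Preview) | 99 | 78 | 78 | 77 | 75 | 81.4%
Nemotron 3 Nano | 89 | 82 | 80 | 79 | 77 | 81.3%
GPT-4o, May 13th (temp=0) | 96 | 79 | 77 | 77 | 76 | 81.1%
Hermes 3 70B | 100 | 91 | 88 | 73 | 48 | 80.0%
Qwen3 235B A22B Instruct 2507 | 81 | 81 | 81 | 79 | 77 | 79.9%
Llama 3.1 Nemotron 70B | 82 | 82 | 79 | 79 | 75 | 79.3%
Writer: Palmyra X5 | 81 | 81 | 79 | 79 | 75 | 79.3%
Qwen 3 32B | 84 | 81 | 80 | 76 | 74 | 78.9%
Llama 3.1 70B | 82 | 81 | 78 | 77 | 70 | 77.9%
Gemini 2.5 Flash | 78 | 78 | 78 | 77 | 77 | 77.7%
Claude 3.5 Sonnet | 79 | 77 | 77 | 77 | 77 | 77.5%
Gemma 3 4B | 81 | 78 | 76 | 76 | 75 | 77.3%
Gemini 3.1 Pro (Preview) | 78 | 78 | 78 | 75 | 75 | 76.9%
Mistral Small 3.2 24B | 89 | 81 | 74 | 72 | 68 | 76.8%
GPT-5.4 Mini | 92 | 81 | 71 | 70 | 68 | 76.6%
GPT-4o, May 13th (temp=1) | 89 | 78 | 77 | 75 | 62 | 76.3%
GPT-5.2 | 77 | 76 | 75 | 75 | 73 | 75.3%
Gemma 3 27B | 79 | 77 | 76 | 72 | 72 | 75.0%
GPT-5.4 (Reasoning) | 76 | 76 | 75 | 74 | 74 | 74.9%
Mistral Small 4 (Reasoning) | 90 | 88 | 81 | 66 | 47 | 74.3%
Inception Mercury | 84 | 77 | 73 | 69 | 67 | 73.8%
GPT-4o, Aug. 6th (temp=0) | 74 | 74 | 74 | 74 | 72 | 73.8%
Z.AI GLM 4.7 Flash | 92 | 78 | 70 | 68 | 61 | 73.7%
Gemini 3.1 Flash Lite (Preview) | 77 | 77 | 76 | 71 | 67 | 73.5%
GPT-5.4 (Reasoning, Low) | 75 | 74 | 73 | 72 | 70 | 72.6%
Arcee AI: Trinity Mini | 79 | 74 | 71 | 70 | 63 | 71.5%
GPT-5.4 Mini (Reasoning, Low) | 85 | 73 | 73 | 68 | 57 | 71.4%
ByteDance Seed 1.6 Flash | 81 | 76 | 70 | 67 | 62 | 71.2%
GPT-5.4 Nano (Reasoning, Low) | 81 | 80 | 73 | 63 | 59 | 71.2%
GPT-4.1 | 77 | 74 | 72 | 65 | 62 | 70.0%
Gemma 3 12B | 79 | 75 | 71 | 71 | 54 | 69.9%
GPT-4o, Aug. 6th (temp=1) | 81 | 70 | 65 | 65 | 64 | 68.9%
Claude 3 Haiku | 75 | 71 | 69 | 68 | 62 | 68.9%
Cohere Command R+ (Aug. 2024) | 88 | 81 | 71 | 67 | 36 | 68.6%
GPT-5.4 Nano | 77 | 73 | 70 | 63 | 60 | 68.5%
Ministral 3B | 72 | 70 | 68 | 67 | 63 | 68.0%
Mistral Small 4 | 84 | 70 | 65 | 63 | 55 | 67.6%
GPT-5.4 | 67 | 65 | 64 | 64 | 62 | 64.5%
GPT-4.1 Mini | 77 | 73 | 70 | 63 | 34 | 63.1%
Ministral 3 3B | 65 | 64 | 63 | 62 | 58 | 62.6%
GPT-5 Nano | 78 | 70 | 58 | 56 | 47 | 61.7%
Llama 3.1 8B | 70 | 63 | 63 | 60 | 42 | 59.6%
Qwen 3.5 27B | 98 | 96 | 94 | 0 | 0 | 57.7%
GPT-4.1 Nano | 79 | 66 | 53 | 41 | 37 | 55.1%
GPT-4o Mini (temp=1) | 64 | 54 | 54 | 50 | 50 | 54.5%
GPT-4o Mini (temp=0) | 54 | 54 | 54 | 54 | 50 | 52.9%
Mistral NeMO | 81 | 80 | 61 | 0 | 0 | 44.3%
Rocinante 12B | 69 | 29 | 0 | 0 | 0 | 19.7%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%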
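
The page does not say how the overall Score is derived from the per-scenario results. The listed numbers look consistent with an unweighted mean of the four per-scenario Avg values, and the sketch below checks that assumption for a handful of models. The names `scenario_avgs`, `overall_score`, and `check_mean_hypothesis` are hypothetical, and the 0.1-point tolerance is an assumption meant to absorb rounding in the displayed values; this is a sanity check, not the site's actual scoring code.

```python
# Per-scenario "Avg" values copied from the four scenario tables, in page order.
scenario_avgs = {
    "Claude Opus 4.6": [92.2, 98.6, 94.3, 95.3],
    "Gemini 3 Pro (Preview)": [95.2, 92.3, 91.6, 81.4],
    "Grok 4": [89.9, 94.2, 95.2, 97.7],
    "Mistral NeMO": [0.0, 0.0, 45.3, 44.3],
    "LFM2 24B": [69.9, 0.0, 0.0, 0.0],
}

# "Score" values copied from the Overall Performance table.
overall_score = {
    "Claude Opus 4.6": 95.1,
    "Gemini 3 Pro (Preview)": 90.1,
    "Grok 4": 94.3,
    "Mistral NeMO": 22.4,
    "LFM2 24B": 17.5,
}


def mean(values):
    """Unweighted arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def check_mean_hypothesis(tolerance=0.1):
    """Compare mean(scenario averages) with the listed overall score.

    Returns {model: (computed_mean, listed_score, within_tolerance)}.
    The tolerance is an assumption covering rounding of displayed values.
    """
    results = {}
    for model, avgs in scenario_avgs.items():
        computed = mean(avgs)
        listed = overall_score[model]
        within = abs(computed - listed) <= tolerance + 1e-9
        results[model] = (computed, listed, within)
    return results


if __name__ == "__main__":
    for model, (computed, listed, within) in check_mean_hypothesis().items():
        status = "consistent" if within else "differs"
        print(f"{model}: mean {computed:.3f} vs listed {listed:.1f} ({status})")
```

If the assumption holds, a single failed scenario pulls the overall Score down by a quarter of that scenario's weight, which would explain why LFM2 24B lands at 17.5% despite a 69.9% average in one scenario.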