Accuracy (recall)

Test: Codex Violation Detection

Avg. Score: 73.3%
Scenarios: 8

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 2.5 Flash (Reasoning) | 94.0% | $0.0079 | 12.5s | 86%
2 | Grok 4.1 Fast | 93.8% | $0.0021 | 21.1s | 84%
3 | GPT-5.4 | 95.9% | $0.013 | 8.8s | 85%
4 | Z.AI GLM 5 Turbo | 94.8% | $0.0072 | 17.1s | 85%
5 | Gemini 3 Flash (Preview, Reasoning) | 94.7% | $0.011 | 18.0s | 86%
6 | DeepSeek V4 Flash (Reasoning) | 94.4% | $0.0010 | 40.3s | 82%
7 | GPT-5.5 | 96.7% | $0.030 | 7.7s | 90%
8 | Grok 4 Fast | 92.4% | $0.0018 | 12.8s | 72%
9 | Grok 4.20 (Beta, Reasoning) | 96.1% | $0.026 | 16.8s | 89%
10 | Gemma 4 31B | 92.5% | $0.0009 | 37.9s | 78%
11 | Grok 4.20 (Reasoning) | 95.8% | $0.015 | 45.4s | 90%
12 | GPT-5.2 | 95.7% | $0.024 | 24.2s | 87%
13 | Gemini 3 Flash (Preview) | 88.8% | $0.0031 | 4.5s | 69%
14 | Claude Opus 4.5 | 98.0% | $0.041 | 9.7s | 91%
15 | Gemini 2.5 Pro | 97.4% | $0.035 | 23.2s | 91%
16 | GPT-5.4 (Reasoning, Low) | 92.4% | $0.020 | 13.5s | 79%
17 | Claude Sonnet 4.5 | 93.6% | $0.024 | 8.9s | 80%
18 | Xiaomi MIMO v2.5 Pro | 92.5% | $0.0091 | 37.0s | 80%
19 | Gemini 2.5 Flash | 82.5% | $0.0025 | 2.8s | 69%
20 | Claude Opus 4.6 | 97.3% | $0.040 | 10.2s | 88%
21 | Stealth: Healer Alpha | 89.0% | $0.0000 | 21.8s | 67%
22 | GPT-5.5 (Reasoning, Low) | 95.5% | $0.040 | 14.2s | 88%
23 | Qwen 3.6 Flash | 92.4% | $0.011 | 29.9s | 72%
24 | Stealth: Hunter Alpha | 89.5% | $0.0000 | 44.0s | 71%
25 | Qwen 3.5 Flash | 92.8% | $0.0038 | 1.0m | 77%
26 | Xiaomi MIMO v2.5 | 89.2% | $0.0058 | 21.8s | 65%
27 | Inception Mercury 2 | 82.2% | $0.0030 | 4.6s | 62%
28 | Claude Sonnet 4 | 90.3% | $0.023 | 9.0s | 72%
29 | Qwen 3.6 35B | 92.1% | $0.013 | 51.0s | 78%
30 | Qwen 3.5 35B | 93.3% | $0.017 | 54.8s | 81%
31 | ByteDance Seed 1.6 | 91.9% | $0.0067 | 1.0m | 77%
32 | Z.AI GLM 5 | 92.1% | $0.013 | 56.4s | 79%
33 | Gemini 3.1 Flash Lite | 81.1% | $0.0018 | 3.2s | 59%
34 | GPT-5 Mini | 91.2% | $0.0092 | 51.7s | 74%
35 | GPT-5.4 Mini (Reasoning, Low) | 83.0% | $0.0055 | 6.7s | 61%
36 | Gemini 3.1 Flash Lite (Preview) | 81.9% | $0.0019 | 2.2s | 57%
37 | o4 Mini | 88.9% | $0.019 | 28.1s | 74%
38 | Z.AI GLM 4.7 | 91.3% | $0.0091 | 1.0m | 77%
39 | Gemini 2.5 Flash Lite (Reasoning) | 81.1% | $0.0020 | 17.6s | 62%
40 | Z.AI GLM 5.1 | 95.5% | $0.017 | 1.6m | 88%
41 | ByteDance Seed 2.0 Lite | 89.8% | $0.0067 | 1.1m | 73%
42 | Qwen 3.5 Plus (2026-04-20) | 93.4% | $0.015 | 1.4m | 83%
43 | Claude Sonnet 4.6 | 82.9% | $0.024 | 9.9s | 70%
44 | Gemini 3.1 Flash Lite (Reasoning) | 79.3% | $0.0018 | 3.7s | 52%
45 | Grok 4.3 (Reasoning) | 94.3% | $0.019 | 1.4m | 81%
46 | GPT-5.4 Mini (Reasoning) | 87.9% | $0.020 | 28.6s | 68%
47 | Qwen 3.5 27B | 95.6% | $0.021 | 1.7m | 89%
48 | DeepSeek V4 Flash | 78.1% | $0.0003 | 9.3s | 53%
49 | Mistral Large 3 | 75.5% | $0.0030 | 10.2s | 58%
50 | Qwen 3.5 122B | 92.7% | $0.026 | 1.2m | 81%
51 | Mistral Large | 79.3% | $0.012 | 9.1s | 59%
52 | MiniMax M2.7 | 79.1% | $0.0022 | 29.4s | 59%
53 | MoonshotAI: Kimi K2.5 | 92.9% | $0.015 | 1.5m | 79%
54 | Qwen 3.5 Plus (2026-02-15) | 85.0% | $0.0041 | 34.2s | 55%
55 | DeepSeek V4 Pro | 81.7% | $0.0031 | 25.6s | 53%
56 | GPT-5.1 | 92.7% | $0.037 | 52.8s | 79%
57 | Mistral Large 2 | 75.9% | $0.012 | 8.8s | 57%
58 | o4 Mini High | 90.7% | $0.033 | 51.3s | 76%
59 | Grok 4.20 | 77.0% | $0.0053 | 6.4s | 50%
60 | Aion 2.0 | 91.7% | $0.0084 | 1.3m | 64%
61 | Claude Opus 4.7 | 88.3% | $0.052 | 7.6s | 76%
62 | Claude Opus 4.7 (Reasoning) | 92.8% | $0.067 | 11.4s | 82%
63 | Mistral Medium 3.1 | 75.1% | $0.0029 | 7.7s | 46%
64 | Grok 4.20 (Beta) | 74.8% | $0.0059 | 2.5s | 47%
65 | Gemini 3.1 Pro (Preview) | 98.3% | $0.068 | 52.3s | 93%
66 | GPT-5.4 (Reasoning) | 91.7% | $0.042 | 43.1s | 75%
67 | MiniMax M2.5 | 74.6% | $0.0020 | 25.5s | 52%
68 | Claude Opus 4.6 (Reasoning) | 97.3% | $0.076 | 32.0s | 90%
69 | GPT-5.5 (Reasoning) | 95.6% | $0.071 | 30.3s | 88%
70 | DeepSeek V3.2 | 75.0% | $0.0013 | 17.2s | 45%
71 | Gemma 4 31B (Reasoning) | 94.4% | $0.0017 | 2.8m | 85%
72 | Z.AI GLM 4.5 | 75.6% | $0.0026 | 18.5s | 45%
73 | GPT-4.1 | 77.7% | $0.0081 | 10.0s | 44%
74 | Z.AI GLM 4.6 | 86.1% | $0.0049 | 1.3m | 59%
75 | Mistral Small 4 (Reasoning) | 70.9% | $0.0020 | 16.3s | 47%
76 | ByteDance Seed 1.6 Flash | 67.9% | $0.0009 | 12.0s | 46%
77 | Gemma 4 26B (Reasoning) | 90.0% | $0.0022 | 2.0m | 65%
78 | Gemini 3 Pro (Preview) | 91.0% | $0.050 | 34.0s | 70%
79 | Z.AI GLM 4.5 Air | 77.1% | $0.0025 | 39.8s | 46%
80 | DeepSeek-V2 Chat | 71.1% | $0.0020 | 13.1s | 41%
81 | Grok 4 | 94.1% | $0.051 | 1.2m | 80%
82 | GPT-4.1 Mini | 63.7% | $0.0015 | 5.8s | 43%
83 | Claude Haiku 4.5 | 66.8% | $0.0078 | 5.8s | 45%
84 | DeepSeek V3 (2024-12-26) | 69.9% | $0.0019 | 15.6s | 41%
85 | GPT-OSS 120B | 81.0% | $0.0020 | 1.5m | 59%
86 | Qwen 3.6 27B | 89.5% | $0.022 | 1.4m | 63%
87 | Qwen3 235B A22B Instruct 2507 | 67.3% | $0.0007 | 21.1s | 43%
88 | Writer: Palmyra X5 | 67.2% | $0.0062 | 12.1s | 44%
89 | Gemma 4 26B | 75.0% | $0.0006 | 25.0s | 35%
90 | Stealth: Aurora Alpha | 77.5% | - | 6.2s | 46%
91 | ByteDance Seed 2.0 Mini | 90.8% | $0.0029 | 2.7m | 73%
92 | Claude Sonnet 4.6 (Reasoning) | 94.9% | $0.076 | 51.3s | 84%
93 | Claude 3.7 Sonnet | 77.4% | $0.023 | 10.2s | 42%
94 | Grok 4.3 | 64.3% | $0.0061 | 6.1s | 39%
95 | Inception Mercury | 63.1% | $0.0005 | 9.5s | 34%
96 | DeepSeek V3 (2025-03-24) | 68.9% | $0.0015 | 22.9s | 34%
97 | GPT-4o, Aug. 6th (temp=1) | 65.4% | $0.011 | 3.6s | 36%
98 | Mistral Small Creative | 56.8% | $0.0006 | 4.3s | 36%
99 | GPT-4o, Aug. 6th (temp=0) | 67.9% | $0.015 | 5.0s | 36%
100 | Claude 3.5 Sonnet | 79.2% | $0.042 | 10.5s | 47%
101 | Qwen3.6 Max Preview | 96.4% | $0.050 | 2.5m | 90%
102 | GPT-5 | 93.3% | $0.061 | 1.6m | 81%
103 | Z.AI GLM 4.7 Flash | 68.2% | $0.0018 | 1.1m | 45%
104 | DeepSeek V3.1 | 66.1% | $0.0014 | 31.1s | 33%
105 | Nemotron 3 Super | 83.1% | $0.0000 | 2.3m | 55%
106 | Qwen 3 32B | 64.1% | $0.0010 | 27.8s | 31%
107 | Qwen 3.5 9B | 84.7% | $0.0020 | 2.4m | 55%
108 | Mistral Small 3.2 24B | 53.8% | $0.0006 | 10.4s | 32%
109 | GPT-5.4 Nano (Reasoning) | 67.5% | $0.0035 | 16.6s | 23%
110 | GPT-4o, May 13th (temp=0) | 71.6% | $0.030 | 5.4s | 36%
111 | Gemma 3 27B | 53.2% | $0.0005 | 13.3s | 33%
112 | Qwen 3.5 397B A17B | 93.7% | $0.026 | 2.9m | 75%
113 | GPT-5.4 Mini | 54.5% | $0.0033 | 3.1s | 29%
114 | DeepSeek V4 Pro (Reasoning) | 91.5% | $0.014 | 2.8m | 60%
115 | Gemini 2.5 Flash Lite | 47.9% | $0.0005 | 2.3s | 24%
116 | GPT-4o, May 13th (temp=1) | 58.1% | $0.026 | 3.7s | 34%
117 | Ministral 3 14B | 46.4% | $0.0010 | 6.3s | 25%
118 | Hermes 3 405B | 56.2% | $0.0044 | 17.2s | 20%
119 | Llama 3.1 Nemotron 70B | 55.8% | $0.0055 | 18.5s | 19%
120 | GPT-5 Nano | 67.2% | $0.0049 | 1.9m | 42%
121 | Ministral 3 8B | 39.9% | $0.0007 | 4.8s | 20%
122 | Llama 3.1 70B | 51.2% | $0.0021 | 24.3s | 18%
123 | Qwen 2.5 72B | 42.3% | $0.0008 | 13.9s | 18%
124 | Ministral 8B | 34.7% | $0.0005 | 5.6s | 18%
125 | Arcee AI: Trinity Mini | 33.3% | $0.0004 | 10.3s | 18%
126 | Claude Opus 4 | 84.4% | $0.116 | 15.8s | 65%
127 | Mistral Small 4 | 30.7% | $0.0008 | 4.3s | 18%
128 | Ministral 3 3B | 28.6% | $0.0005 | 3.3s | 16%
129 | Gemma 3 12B | 35.9% | $0.0003 | 12.0s | 11%
130 | MoonshotAI: Kimi K2.6 | 94.7% | $0.038 | 4.1m | 75%
131 | GPT-4o Mini (temp=0) | 31.3% | $0.0006 | 25.0s | 19%
132 | GPT-4o Mini (temp=1) | 29.4% | $0.0006 | 7.4s | 13%
133 | Ministral 3B | 25.0% | $0.0002 | 2.9s | 14%
134 | Cohere Command R+ (Aug. 2024) | 32.3% | $0.014 | 10.0s | 15%
135 | GPT-5.4 Nano (Reasoning, Low) | 32.9% | $0.0013 | 6.1s | 2%
136 | Hermes 3 70B | 28.5% | $0.0013 | 29.2s | 14%
137 | Claude 3 Haiku | 19.4% | $0.0015 | 3.6s | 12%
138 | GPT-5.4 Nano | 21.3% | $0.0008 | 2.9s | 4%
139 | Mistral NeMO | 20.6% | $0.0007 | 13.3s | 7%
140 | Nemotron 3 Nano | 64.5% | $0.0031 | 3.5m | 37%
141 | Llama 3.1 8B | 16.1% | $0.0002 | 16.1s | 0%
142 | WizardLM 2 8x22b | 17.1% | $0.0036 | 15.7s | 0%
143 | Rocinante 12B | 6.4% | $0.0009 | 6.0s | 0%
144 | GPT-4.1 Nano | 2.6% | $0.0004 | 3.9s | 0%
145 | Gemma 3 4B | 3.2% | $0.0002 | 12.5s | 0%
146 | Arcee AI: Trinity Large (Preview) | 46.2% | $0.0000 | 3.1m | 24%
147 | LFM2 24B | 0.0% | $0.0008 | 1.9m | 0%
Avg. Score: 73.28%

Individual Scenarios

matrix

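Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Grok 4 | 100 | 98 | 98 | 97 | 97 | 95 | 95 | 94 | 94 | 91 | 96.1%
Qwen3.6 Max Preview | 97 | 97 | 97 | 94 | 94 | 94 | 94 | 94 | 91 | 91 | 94.2%
Claude Sonnet 4.6 (Reasoning) | 97 | 95 | 95 | 94 | 94 | 94 | 94 | 92 | 92 | 88 | 93.6%
GPT-5.5 | 97 | 97 | 95 | 94 | 94 | 94 | 92 | 92 | 89 | 89 | 93.5%
Grok 4.20 (Reasoning) | 100 | 97 | 97 | 94 | 94 | 92 | 92 | 91 | 91 | 86 | 93.5%
Qwen 3.5 397B A17B | 100 | 100 | 97 | 95 | 94 | 94 | 89 | 89 | 88 | 85 | 93.2%
GPT-5.5 (Reasoning) | 97 | 97 | 94 | 94 | 94 | 92 | 92 | 89 | 89 | 89 | 92.9%
GPT-5.5 (Reasoning, Low) | 95 | 95 | 95 | 95 | 92 | 92 | 92 | 92 | 91 | 83 | 92.6%
GPT-5.4 | 97 | 94 | 94 | 92 | 92 | 92 | 91 | 91 | 91 | 91 | 92.6%
Gemini 2.5 Pro | 100 | 95 | 92 | 92 | 92 | 89 | 88 | 88 | 88 | 86 | 91.2%
MoonshotAI: Kimi K2.6 | 100 | 98 | 95 | 94 | 94 | 89 | 88 | 88 | 86 | 79 | 91.2%
GPT-5.2 | 97 | 97 | 94 | 92 | 92 | 91 | 89 | 88 | 86 | 85 | 91.2%
DeepSeek V4 Flash (Reasoning) | 95 | 95 | 94 | 92 | 92 | 92 | 91 | 89 | 85 | 82 | 90.9%
Qwen 3.5 27B | 94 | 92 | 92 | 92 | 91 | 91 | 91 | 91 | 88 | 86 | 90.9%
Grok 4 Fast | 100 | 100 | 97 | 92 | 91 | 89 | 86 | 86 | 83 | 82 | 90.8%
GPT-5 | 94 | 94 | 92 | 92 | 91 | 91 | 88 | 88 | 88 | 88 | 90.6%
Grok 4.1 Fast | 100 | 97 | 94 | 91 | 91 | 91 | 89 | 89 | 82 | 82 | 90.6%
Grok 4.20 (Beta, Reasoning) | 97 | 97 | 94 | 92 | 89 | 89 | 88 | 86 | 86 | 82 | 90.2%
Claude Opus 4.6 (Reasoning) | 94 | 92 | 91 | 91 | 91 | 91 | 91 | 88 | 86 | 85 | 90.0%
Gemini 3.1 Pro (Preview) | 94 | 94 | 91 | 91 | 91 | 89 | 88 | 88 | 88 | 86 | 90.0%
GPT-5.4 (Reasoning, Low) | 94 | 94 | 92 | 92 | 91 | 89 | 88 | 88 | 86 | 83 | 89.8%
Z.AI GLM 5.1 | 95 | 95 | 92 | 91 | 91 | 91 | 89 | 85 | 85 | 82 | 89.7%
GPT-5.4 (Reasoning) | 94 | 94 | 92 | 91 | 91 | 91 | 86 | 85 | 85 | 82 | 89.1%
Claude Opus 4.5 | 92 | 92 | 91 | 91 | 89 | 89 | 88 | 88 | 85 | 83 | 88.9%
Grok 4.3 (Reasoning) | 100 | 91 | 88 | 88 | 88 | 88 | 88 | 88 | 85 | 80 | 88.3%
Claude Opus 4.7 (Reasoning) | 91 | 91 | 91 | 89 | 88 | 88 | 88 | 86 | 85 | 83 | 88.0%
Z.AI GLM 5 Turbo | 92 | 92 | 91 | 89 | 89 | 86 | 86 | 83 | 83 | 83 | 87.7%
Nemotron 3 Super | 94 | 91 | 89 | 89 | 89 | 88 | 86 | 83 | 83 | 80 | 87.4%
Claude Opus 4.7 | 92 | 91 | 91 | 89 | 89 | 86 | 85 | 83 | 83 | 80 | 87.1%
GPT-5 Mini | 91 | 91 | 88 | 88 | 86 | 86 | 85 | 85 | 85 | 85 | 87.0%
Gemini 2.5 Flash (Reasoning) | 91 | 89 | 89 | 89 | 88 | 85 | 85 | 85 | 85 | 77 | 86.4%
ByteDance Seed 1.6 | 94 | 91 | 91 | 89 | 88 | 85 | 83 | 82 | 80 | 79 | 86.2%
Z.AI GLM 4.6 | 94 | 92 | 92 | 88 | 88 | 86 | 83 | 82 | 77 | 77 | 86.1%
Qwen 3.6 35B | 97 | 97 | 91 | 91 | 85 | 85 | 82 | 80 | 77 | 74 | 85.9%
Qwen 3.5 Plus (2026-04-20) | 97 | 91 | 89 | 89 | 88 | 85 | 85 | 82 | 77 | 74 | 85.8%
o4 Mini High | 94 | 91 | 91 | 88 | 85 | 83 | 82 | 82 | 82 | 79 | 85.6%
GPT-5.1 | 94 | 91 | 88 | 85 | 85 | 85 | 82 | 82 | 80 | 80 | 85.2%
Qwen 3.6 Flash | 89 | 88 | 88 | 88 | 88 | 85 | 85 | 80 | 79 | 76 | 84.5%
Gemma 4 31B (Reasoning) | 92 | 91 | 89 | 89 | 88 | 88 | 83 | 79 | 74 | 68 | 84.2%
Qwen 3.5 122B | 89 | 88 | 86 | 86 | 83 | 82 | 82 | 80 | 79 | 79 | 83.5%
Z.AI GLM 5 | 97 | 85 | 85 | 83 | 82 | 82 | 82 | 79 | 79 | 77 | 83.0%
Gemini 3 Flash (Preview, Reasoning) | 89 | 88 | 85 | 83 | 83 | 82 | 82 | 82 | 82 | 74 | 83.0%
Qwen 3.5 35B | 92 | 91 | 89 | 85 | 85 | 83 | 79 | 79 | 76 | 71 | 83.0%
Claude Opus 4.6 | 85 | 85 | 85 | 85 | 83 | 83 | 80 | 80 | 80 | 80 | 82.7%
Aion 2.0 | 100 | 97 | 94 | 92 | 91 | 89 | 88 | 88 | 88 | 0 | 82.7%
Xiaomi MIMO v2.5 Pro | 92 | 88 | 86 | 85 | 83 | 83 | 80 | 79 | 79 | 70 | 82.6%
Qwen 3.6 27B | 92 | 91 | 85 | 85 | 85 | 83 | 77 | 76 | 76 | 73 | 82.3%
Z.AI GLM 4.7 | 94 | 88 | 88 | 83 | 83 | 82 | 79 | 79 | 74 | 73 | 82.3%
Gemma 4 31B | 86 | 86 | 85 | 83 | 82 | 82 | 80 | 80 | 79 | 79 | 82.3%
Claude Opus 4 | 92 | 89 | 85 | 83 | 82 | 82 | 79 | 77 | 77 | 70 | 81.7%
Qwen 3.5 Flash | 94 | 91 | 86 | 85 | 82 | 82 | 79 | 77 | 76 | 62 | 81.4%
MoonshotAI: Kimi K2.5 | 91 | 86 | 86 | 86 | 82 | 82 | 80 | 80 | 73 | 65 | 81.2%
Gemma 4 26B (Reasoning) | 88 | 85 | 80 | 80 | 79 | 79 | 77 | 76 | 74 | 74 | 79.2%
Claude Sonnet 4.6 | 86 | 82 | 82 | 82 | 80 | 77 | 77 | 76 | 76 | 74 | 79.2%
Claude Sonnet 4 | 83 | 82 | 80 | 80 | 80 | 79 | 79 | 77 | 73 | 71 | 78.5%
Gemini 3 Flash (Preview) | 85 | 82 | 82 | 80 | 79 | 76 | 76 | 76 | 74 | 74 | 78.3%
Claude Sonnet 4.5 | 85 | 83 | 82 | 80 | 79 | 77 | 77 | 77 | 71 | 70 | 78.2%
o4 Mini | 88 | 86 | 82 | 82 | 79 | 76 | 76 | 73 | 70 | 67 | 77.7%
Stealth: Hunter Alpha | 88 | 83 | 83 | 82 | 80 | 77 | 76 | 73 | 68 | 59 | 77.0%
GPT-5.4 Mini (Reasoning) | 83 | 82 | 82 | 80 | 80 | 74 | 73 | 73 | 71 | 70 | 76.8%
ByteDance Seed 2.0 Lite | 85 | 85 | 83 | 80 | 79 | 77 | 76 | 68 | 65 | 62 | 76.1%
Stealth: Healer Alpha | 94 | 91 | 91 | 82 | 79 | 76 | 74 | 62 | 58 | 42 | 74.8%
ByteDance Seed 2.0 Mini | 83 | 82 | 76 | 74 | 74 | 74 | 74 | 73 | 73 | 64 | 74.7%
Qwen 3.5 9B | 85 | 82 | 79 | 77 | 77 | 76 | 73 | 71 | 62 | 62 | 74.4%
Xiaomi MIMO v2.5 | 89 | 85 | 85 | 82 | 79 | 79 | 76 | 70 | 70 | 0 | 71.4%
DeepSeek V4 Pro (Reasoning) | 97 | 95 | 94 | 92 | 88 | 85 | 82 | 76 | 0 | 0 | 70.9%
Gemini 2.5 Flash | 79 | 77 | 76 | 76 | 74 | 73 | 70 | 64 | 61 | 59 | 70.8%
Gemini 3 Pro (Preview) | 83 | 82 | 80 | 76 | 76 | 76 | 74 | 73 | 67 | 15 | 70.2%
Inception Mercury 2 | 79 | 77 | 74 | 71 | 71 | 70 | 70 | 70 | 61 | 58 | 70.0%
Gemini 2.5 Flash Lite (Reasoning) | 79 | 77 | 76 | 73 | 73 | 68 | 67 | 67 | 62 | 55 | 69.5%
GPT-5.4 Mini (Reasoning, Low) | 77 | 77 | 71 | 70 | 70 | 68 | 67 | 64 | 62 | 59 | 68.5%
DeepSeek V4 Pro | 77 | 76 | 71 | 71 | 68 | 68 | 67 | 64 | 56 | 53 | 67.1%
GPT-OSS 120B | 76 | 76 | 73 | 70 | 67 | 67 | 62 | 61 | 56 | 53 | 65.9%
Qwen 3.5 Plus (2026-02-15) | 92 | 85 | 83 | 82 | 79 | 77 | 76 | 39 | 38 | 0 | 65.2%
Mistral Large | 79 | 70 | 67 | 67 | 65 | 65 | 62 | 61 | 59 | 58 | 65.2%
Mistral Large 3 | 73 | 73 | 70 | 68 | 68 | 64 | 61 | 61 | 56 | 52 | 64.4%
GPT-4.1 | 77 | 74 | 70 | 67 | 65 | 64 | 62 | 58 | 56 | 45 | 63.8%
Gemma 4 26B | 70 | 67 | 67 | 65 | 65 | 64 | 64 | 59 | 58 | 56 | 63.3%
MiniMax M2.7 | 77 | 71 | 68 | 64 | 64 | 64 | 59 | 56 | 55 | 48 | 62.6%
Gemini 3.1 Flash Lite | 70 | 67 | 62 | 62 | 62 | 58 | 58 | 58 | 56 | 48 | 60.0%
DeepSeek V4 Flash | 79 | 73 | 67 | 67 | 64 | 64 | 59 | 55 | 38 | 32 | 59.5%
Gemini 3.1 Flash Lite (Preview) | 67 | 65 | 65 | 62 | 59 | 59 | 56 | 56 | 53 | 52 | 59.4%
Stealth: Aurora Alpha | 68 | 67 | 65 | 65 | 59 | 58 | 58 | 58 | 55 | 42 | 59.4%
Mistral Large 2 | 67 | 67 | 67 | 61 | 59 | 58 | 56 | 56 | 53 | 47 | 58.9%
Gemini 3.1 Flash Lite (Reasoning) | 65 | 65 | 62 | 61 | 61 | 61 | 61 | 59 | 53 | 32 | 57.9%
GPT-5.4 Nano (Reasoning) | 82 | 80 | 76 | 74 | 70 | 68 | 64 | 59 | 0 | 0 | 57.3%
DeepSeek V3.2 | 94 | 65 | 59 | 59 | 52 | 47 | 47 | 47 | 44 | 39 | 55.3%
MiniMax M2.5 | 73 | 71 | 64 | 61 | 53 | 53 | 50 | 42 | 41 | 36 | 54.4%
Z.AI GLM 4.5 | 76 | 67 | 55 | 53 | 53 | 50 | 45 | 45 | 45 | 36 | 52.6%
Z.AI GLM 4.5 Air | 80 | 71 | 68 | 68 | 67 | 56 | 52 | 32 | 23 | 3 | 52.0%
Grok 4.20 | 73 | 61 | 55 | 53 | 52 | 50 | 47 | 47 | 39 | 35 | 51.1%
Claude Haiku 4.5 | 67 | 58 | 50 | 50 | 50 | 50 | 48 | 47 | 47 | 42 | 50.9%
DeepSeek-V2 Chat | 61 | 58 | 56 | 53 | 50 | 50 | 47 | 44 | 39 | 33 | 49.1%
Mistral Medium 3.1 | 62 | 56 | 53 | 52 | 52 | 48 | 48 | 48 | 47 | 21 | 48.8%
DeepSeek V3 (2024-12-26) | 67 | 64 | 59 | 58 | 58 | 56 | 48 | 35 | 24 | 17 | 48.5%
GPT-5 Nano | 58 | 58 | 56 | 52 | 50 | 48 | 45 | 44 | 35 | 35 | 48.0%
Grok 4.20 (Beta) | 61 | 58 | 56 | 55 | 50 | 48 | 39 | 39 | 38 | 36 | 48.0%
Qwen3 235B A22B Instruct 2507 | 58 | 55 | 55 | 55 | 55 | 44 | 44 | 42 | 36 | 33 | 47.6%
DeepSeek V3.1 | 71 | 67 | 64 | 62 | 58 | 48 | 41 | 36 | 24 | 0 | 47.1%
Mistral Small 4 (Reasoning) | 53 | 50 | 50 | 47 | 47 | 44 | 44 | 41 | 39 | 29 | 44.4%
Claude 3.7 Sonnet | 61 | 59 | 48 | 44 | 39 | 39 | 38 | 38 | 35 | 30 | 43.2%
Writer: Palmyra X5 | 56 | 56 | 53 | 47 | 44 | 44 | 41 | 39 | 38 | 3 | 42.1%
Grok 4.3 | 73 | 61 | 61 | 55 | 50 | 47 | 42 | 21 | 0 | 0 | 40.9%
Z.AI GLM 4.7 Flash | 53 | 47 | 45 | 42 | 41 | 38 | 33 | 33 | 32 | 29 | 39.4%
DeepSeek V3 (2025-03-24) | 68 | 62 | 48 | 47 | 45 | 33 | 32 | 30 | 23 | 0 | 38.9%
ByteDance Seed 1.6 Flash | 59 | 41 | 39 | 39 | 38 | 38 | 35 | 32 | 29 | 23 | 37.3%
Mistral Small Creative | 47 | 41 | 41 | 41 | 38 | 35 | 33 | 33 | 29 | 5 | 34.2%
Claude 3.5 Sonnet | 38 | 36 | 35 | 33 | 33 | 32 | 32 | 32 | 30 | 30 | 33.2%
Qwen 3 32B | 44 | 41 | 39 | 39 | 38 | 32 | 26 | 18 | 17 | 12 | 30.6%
GPT-4.1 Mini | 53 | 44 | 33 | 33 | 30 | 29 | 26 | 21 | 20 | 14 | 30.3%
Nemotron 3 Nano | 39 | 38 | 36 | 32 | 30 | 29 | 29 | 27 | 14 | 3 | 27.7%
Inception Mercury | 52 | 38 | 35 | 35 | 29 | 23 | 20 | 20 | 17 | 2 | 26.8%
GPT-5.4 Nano (Reasoning, Low) | 59 | 48 | 47 | 41 | 39 | 32 | 0 | 0 | 0 | 0 | 26.7%
GPT-5.4 Mini | 38 | 32 | 32 | 29 | 29 | 29 | 23 | 18 | 15 | 14 | 25.8%
Mistral Small 3.2 24B | 44 | 39 | 39 | 35 | 24 | 21 | 18 | 18 | 9 | 0 | 24.8%
GPT-4o, Aug. 6th (temp=0) | 44 | 35 | 29 | 29 | 29 | 29 | 27 | 23 | 0 | 0 | 24.4%
GPT-4o, May 13th (temp=1) | 36 | 30 | 29 | 29 | 29 | 26 | 24 | 18 | 17 | 6 | 24.4%
GPT-4o, Aug. 6th (temp=1) | 35 | 32 | 29 | 27 | 24 | 21 | 20 | 15 | 14 | 14 | 23.0%
Ministral 3 8B | 35 | 30 | 29 | 24 | 21 | 21 | 20 | 18 | 14 | 3 | 21.5%
Ministral 3 14B | 32 | 32 | 27 | 24 | 24 | 21 | 14 | 6 | 0 | 0 | 18.0%
GPT-4o, May 13th (temp=0) | 39 | 35 | 33 | 33 | 32 | 0 | 0 | 0 | 0 | 0 | 17.3%
Qwen 2.5 72B | 27 | 26 | 20 | 17 | 15 | 15 | 15 | 12 | 9 | 8 | 16.4%
Ministral 8B | 33 | 32 | 24 | 23 | 23 | 12 | 2 | 2 | 2 | 2 | 15.3%
Gemma 3 27B | 36 | 26 | 15 | 15 | 14 | 12 | 11 | 9 | 9 | 0 | 14.7%
Mistral Small 4 | 29 | 20 | 18 | 12 | 12 | 11 | 9 | 2 | 0 | 0 | 11.2%
Gemini 2.5 Flash Lite | 32 | 30 | 26 | 18 | 5 | 0 | 0 | 0 | 0 | 0 | 11.1%
Hermes 3 405B | 27 | 18 | 11 | 11 | 8 | 8 | 6 | 6 | 5 | 2 | 10.0%
Hermes 3 70B | 14 | 11 | 11 | 8 | 8 | 8 | 5 | 2 | 0 | 0 | 6.4%
Llama 3.1 Nemotron 70B | 12 | 12 | 11 | 8 | 6 | 5 | 3 | 0 | 0 | 0 | 5.6%
Arcee AI: Trinity Mini | 15 | 9 | 6 | 6 | 5 | 3 | 3 | 2 | 0 | 0 | 4.8%
Cohere Command R+ (Aug. 2024) | 12 | 12 | 9 | 6 | 5 | 3 | 0 | 0 | 0 | 0 | 4.7%
Gemma 3 12B | 11 | 9 | 8 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 3.8%
GPT-5.4 Nano | 23 | 9 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3.8%
GPT-4o Mini (temp=1) | 9 | 6 | 6 | 5 | 2 | 0 | 0 | 0 | 0 | 0 | 2.7%
Llama 3.1 70B | 12 | 6 | 5 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 2.7%
Claude 3 Haiku | 11 | 8 | 3 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | 2.6%
WizardLM 2 8x22b | 9 | 6 | 6 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 2.4%
Mistral NeMO | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.1%
GPT-4o Mini (temp=0) | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.8%
Llama 3.1 8B | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.8%
Ministral 3 3B | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.8%
GPT-4.1 Nano | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3%
Ministral 3B | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3%
Gemma 3 4B | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2%
Arcee AI: Trinity Large (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Rocinante 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%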
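
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 98.9%
GPT-5.4 | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 97 | 92 | 97.9%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 97 | 92 | 97.9%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 92 | 97.1%
Grok 4 | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 96.8%
MoonshotAI: Kimi K2.6 | 100 | 100 | 97 | 97 | 97 | 97 | 95 | 95 | 95 | 92 | 96.6%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 92 | 92 | 87 | 96.6%
Z.AI GLM 5.1 | 100 | 100 | 100 | 97 | 95 | 95 | 95 | 95 | 95 | 92 | 96.3%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 97 | 97 | 95 | 95 | 92 | 92 | 92 | 96.1%
Claude Sonnet 4.6 (Reasoning) | 100 | 97 | 97 | 97 | 97 | 97 | 97 | 92 | 92 | 92 | 96.1%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 92 | 87 | 95.8%
DeepSeek V4 Pro (Reasoning) | 100 | 97 | 97 | 95 | 95 | 95 | 95 | 95 | 92 | 87 | 94.7%
GPT-5.5 (Reasoning) | 97 | 97 | 95 | 95 | 95 | 95 | 95 | 92 | 92 | 92 | 94.5%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 97 | 97 | 95 | 92 | 89 | 84 | 84 | 93.9%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 92 | 92 | 92 | 92 | 87 | 84 | 93.9%
Qwen3.6 Max Preview | 100 | 95 | 95 | 95 | 95 | 92 | 92 | 92 | 92 | 89 | 93.7%
Qwen 3.6 35B | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 89 | 89 | 76 | 93.4%
Grok 4.3 (Reasoning) | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 89 | 89 | 84 | 93.2%
GPT-5.2 | 97 | 97 | 95 | 95 | 95 | 92 | 89 | 89 | 89 | 87 | 92.6%
Qwen 3.5 35B | 100 | 97 | 97 | 95 | 92 | 92 | 89 | 89 | 87 | 87 | 92.6%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 95 | 95 | 95 | 92 | 87 | 87 | 87 | 87 | 92.4%
Grok 4.20 (Beta) | 97 | 97 | 97 | 97 | 97 | 92 | 89 | 89 | 87 | 79 | 92.4%
Qwen 3.6 Flash | 95 | 95 | 95 | 95 | 95 | 95 | 89 | 89 | 87 | 87 | 92.1%
ByteDance Seed 1.6 | 100 | 95 | 95 | 95 | 92 | 89 | 89 | 89 | 89 | 84 | 91.8%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 95 | 92 | 92 | 92 | 92 | 89 | 84 | 82 | 91.8%
Gemini 3 Flash (Preview, Reasoning) | 100 | 97 | 95 | 95 | 92 | 92 | 89 | 87 | 87 | 84 | 91.8%
GPT-5.4 (Reasoning) | 95 | 95 | 95 | 92 | 92 | 92 | 92 | 89 | 87 | 87 | 91.6%
Grok 4.20 (Beta, Reasoning) | 95 | 95 | 95 | 95 | 95 | 89 | 89 | 89 | 89 | 84 | 91.6%
Grok 4.20 (Reasoning) | 95 | 95 | 95 | 95 | 89 | 89 | 89 | 89 | 89 | 84 | 91.1%
Qwen 3.5 Flash | 100 | 95 | 92 | 92 | 89 | 89 | 89 | 89 | 87 | 87 | 91.1%
Gemma 4 31B (Reasoning) | 100 | 97 | 95 | 95 | 95 | 95 | 92 | 84 | 79 | 76 | 90.8%
Claude Opus 4.7 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 79 | 90.8%
Grok 4.20 | 97 | 97 | 97 | 97 | 97 | 89 | 89 | 89 | 82 | 71 | 90.8%
Qwen 3.5 122B | 100 | 95 | 92 | 92 | 92 | 89 | 89 | 87 | 87 | 82 | 90.5%
Gemini 2.5 Flash (Reasoning) | 100 | 95 | 92 | 89 | 89 | 89 | 89 | 89 | 87 | 84 | 90.5%
Qwen 3.5 27B | 92 | 92 | 92 | 92 | 89 | 89 | 89 | 89 | 89 | 87 | 90.3%
Qwen 3.5 Plus (2026-04-20) | 100 | 95 | 89 | 89 | 89 | 89 | 89 | 87 | 87 | 84 | 90.0%
Grok 4 Fast | 95 | 95 | 95 | 89 | 89 | 89 | 89 | 89 | 89 | 79 | 90.0%
GPT-5 | 95 | 95 | 92 | 89 | 89 | 89 | 89 | 89 | 87 | 82 | 89.7%
Qwen 3.5 397B A17B | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 84 | 84 | 84 | 89.7%
Grok 4.1 Fast | 95 | 95 | 95 | 92 | 92 | 89 | 89 | 87 | 84 | 79 | 89.7%
Gemma 4 31B | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 84 | 84 | 84 | 89.7%
Claude Opus 4.7 (Reasoning) | 95 | 95 | 95 | 89 | 89 | 89 | 89 | 89 | 82 | 82 | 89.5%
GPT-5.4 (Reasoning, Low) | 95 | 95 | 89 | 89 | 89 | 89 | 89 | 89 | 84 | 84 | 89.5%
Qwen 3.6 27B | 100 | 95 | 95 | 95 | 92 | 89 | 89 | 89 | 84 | 63 | 89.2%
Aion 2.0 | 95 | 95 | 89 | 89 | 89 | 89 | 89 | 87 | 84 | 76 | 88.4%
GPT-5.4 Mini (Reasoning) | 95 | 95 | 89 | 89 | 89 | 89 | 84 | 84 | 84 | 84 | 88.4%
Z.AI GLM 4.7 | 100 | 92 | 92 | 92 | 89 | 87 | 84 | 84 | 84 | 74 | 87.9%
Z.AI GLM 5 | 100 | 100 | 95 | 92 | 89 | 82 | 82 | 79 | 79 | 79 | 87.6%
Stealth: Healer Alpha | 100 | 92 | 89 | 89 | 89 | 87 | 84 | 84 | 82 | 79 | 87.6%
Stealth: Hunter Alpha | 100 | 95 | 95 | 92 | 89 | 87 | 87 | 84 | 82 | 66 | 87.6%
Xiaomi MIMO v2.5 | 95 | 95 | 89 | 89 | 89 | 87 | 84 | 84 | 82 | 82 | 87.6%
GPT-5.1 | 92 | 92 | 92 | 89 | 89 | 89 | 84 | 82 | 82 | 82 | 87.4%
Gemini 3 Flash (Preview) | 92 | 92 | 92 | 92 | 92 | 84 | 84 | 84 | 84 | 76 | 87.4%
o4 Mini High | 95 | 95 | 87 | 87 | 87 | 87 | 84 | 84 | 82 | 82 | 86.8%
Gemini 3 Pro (Preview) | 100 | 95 | 92 | 92 | 89 | 89 | 82 | 79 | 76 | 74 | 86.8%
Z.AI GLM 4.6 | 97 | 92 | 92 | 89 | 89 | 89 | 87 | 84 | 74 | 71 | 86.6%
ByteDance Seed 2.0 Lite | 95 | 89 | 89 | 89 | 89 | 84 | 82 | 82 | 82 | 82 | 86.3%
GPT-4.1 | 100 | 100 | 95 | 92 | 92 | 84 | 84 | 76 | 71 | 66 | 86.1%
Nemotron 3 Super | 89 | 89 | 89 | 84 | 84 | 84 | 84 | 84 | 82 | 79 | 85.0%
o4 Mini | 95 | 89 | 89 | 87 | 87 | 87 | 82 | 82 | 79 | 74 | 85.0%
Qwen 3.5 9B | 95 | 95 | 89 | 89 | 84 | 84 | 82 | 76 | 74 | 74 | 84.2%
DeepSeek V4 Flash | 95 | 92 | 92 | 92 | 87 | 87 | 79 | 79 | 74 | 66 | 84.2%
Gemma 4 26B (Reasoning) | 95 | 92 | 87 | 84 | 84 | 84 | 79 | 79 | 79 | 76 | 83.9%
Gemini 3.1 Flash Lite (Preview) | 97 | 89 | 82 | 82 | 82 | 82 | 82 | 82 | 82 | 79 | 83.7%
ByteDance Seed 2.0 Mini | 95 | 95 | 92 | 84 | 82 | 82 | 82 | 79 | 76 | 68 | 83.4%
Gemini 2.5 Flash | 92 | 92 | 89 | 89 | 87 | 82 | 82 | 71 | 71 | 68 | 82.4%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 92 | 84 | 84 | 84 | 79 | 76 | 76 | 74 | 71 | 82.1%
DeepSeek V4 Pro | 100 | 87 | 87 | 84 | 84 | 79 | 79 | 74 | 74 | 74 | 82.1%
Claude Opus 4 | 95 | 92 | 87 | 84 | 79 | 79 | 79 | 79 | 71 | 68 | 81.3%
GPT-5.4 Mini (Reasoning, Low) | 89 | 89 | 87 | 84 | 82 | 79 | 79 | 76 | 74 | 74 | 81.3%
Qwen 3.5 Plus (2026-02-15) | 92 | 92 | 84 | 84 | 84 | 84 | 79 | 79 | 68 | 63 | 81.1%
Inception Mercury 2 | 84 | 84 | 84 | 82 | 79 | 79 | 79 | 79 | 76 | 68 | 79.5%
Writer: Palmyra X5 | 89 | 89 | 87 | 84 | 82 | 79 | 74 | 71 | 68 | 66 | 78.9%
Stealth: Aurora Alpha | 89 | 89 | 84 | 76 | 76 | 76 | 76 | 76 | 76 | 66 | 78.7%
GPT-5 Mini | 89 | 89 | 87 | 87 | 87 | 87 | 84 | 79 | 53 | 45 | 78.7%
GPT-OSS 120B | 84 | 82 | 79 | 79 | 79 | 79 | 79 | 76 | 76 | 68 | 78.2%
Gemini 3.1 Flash Lite | 89 | 84 | 82 | 79 | 79 | 76 | 74 | 74 | 74 | 68 | 77.9%
Z.AI GLM 4.5 | 87 | 87 | 82 | 82 | 82 | 79 | 76 | 74 | 63 | 63 | 77.4%
MiniMax M2.5 | 89 | 89 | 82 | 82 | 79 | 76 | 74 | 71 | 68 | 61 | 77.1%
Gemini 2.5 Flash Lite (Reasoning) | 84 | 84 | 82 | 82 | 79 | 76 | 76 | 74 | 68 | 63 | 76.8%
GPT-5.4 Nano (Reasoning) | 84 | 84 | 82 | 79 | 76 | 76 | 74 | 71 | 71 | 63 | 76.1%
Claude 3.5 Sonnet | 82 | 82 | 82 | 82 | 74 | 74 | 74 | 66 | 66 | 61 | 73.9%
MiniMax M2.7 | 89 | 89 | 84 | 79 | 76 | 71 | 66 | 63 | 61 | 58 | 73.7%
Claude 3.7 Sonnet | 76 | 74 | 74 | 74 | 74 | 74 | 74 | 71 | 71 | 71 | 73.2%
Mistral Medium 3.1 | 76 | 76 | 74 | 74 | 74 | 74 | 74 | 68 | 66 | 63 | 71.8%
Mistral Large | 87 | 84 | 79 | 71 | 68 | 68 | 66 | 66 | 61 | 61 | 71.1%
GPT-4o, May 13th (temp=0) | 79 | 79 | 74 | 74 | 74 | 68 | 68 | 68 | 63 | 61 | 70.8%
DeepSeek V3 (2024-12-26) | 84 | 82 | 79 | 71 | 68 | 68 | 66 | 63 | 63 | 61 | 70.5%
Mistral Large 2 | 82 | 79 | 79 | 71 | 66 | 66 | 66 | 66 | 63 | 63 | 70.0%
DeepSeek-V2 Chat | 89 | 76 | 76 | 74 | 68 | 68 | 63 | 63 | 61 | 58 | 69.7%
DeepSeek V3.2 | 95 | 84 | 82 | 76 | 74 | 71 | 71 | 63 | 61 | 18 | 69.5%
Qwen3 235B A22B Instruct 2507 | 84 | 82 | 79 | 76 | 74 | 68 | 66 | 63 | 58 | 45 | 69.5%
Grok 4.3 | 82 | 76 | 74 | 68 | 68 | 68 | 66 | 63 | 61 | 58 | 68.4%
Z.AI GLM 4.7 Flash | 79 | 74 | 74 | 71 | 68 | 68 | 63 | 63 | 61 | 55 | 67.6%
Mistral Large 3 | 74 | 71 | 71 | 68 | 68 | 68 | 66 | 63 | 63 | 61 | 67.4%
Mistral Small 4 (Reasoning) | 79 | 76 | 74 | 68 | 63 | 63 | 63 | 58 | 55 | 53 | 65.3%
GPT-4.1 Mini | 76 | 74 | 74 | 71 | 66 | 66 | 63 | 61 | 50 | 34 | 63.4%
Z.AI GLM 4.5 Air | 87 | 82 | 79 | 79 | 74 | 66 | 66 | 53 | 32 | 0 | 61.6%
DeepSeek V3 (2025-03-24) | 89 | 89 | 74 | 74 | 68 | 66 | 63 | 58 | 29 | 0 | 61.1%
Nemotron 3 Nano | 79 | 74 | 66 | 63 | 63 | 61 | 58 | 47 | 45 | 37 | 59.2%
Claude Haiku 4.5 | 87 | 79 | 79 | 79 | 71 | 71 | 63 | 61 | 0 | 0 | 58.9%
ByteDance Seed 1.6 Flash | 74 | 68 | 66 | 63 | 61 | 61 | 61 | 55 | 50 | 29 | 58.7%
GPT-4o, Aug. 6th (temp=0) | 74 | 61 | 61 | 58 | 58 | 58 | 55 | 55 | 55 | 53 | 58.7%
Mistral Small 3.2 24B | 68 | 63 | 63 | 61 | 58 | 55 | 55 | 55 | 55 | 53 | 58.7%
GPT-5 Nano | 82 | 82 | 76 | 74 | 71 | 68 | 66 | 61 | 0 | 0 | 57.9%
DeepSeek V3.1 | 82 | 82 | 82 | 79 | 71 | 71 | 55 | 42 | 0 | 0 | 56.3%
Gemini 2.5 Flash Lite | 76 | 68 | 61 | 61 | 58 | 50 | 47 | 47 | 45 | 39 | 55.3%
Qwen 3 32B | 76 | 74 | 63 | 61 | 58 | 53 | 53 | 50 | 37 | 26 | 55.0%
GPT-4o, Aug. 6th (temp=1) | 68 | 68 | 66 | 63 | 61 | 58 | 55 | 47 | 45 | 0 | 53.2%
Inception Mercury | 84 | 66 | 66 | 53 | 53 | 50 | 45 | 42 | 42 | 29 | 52.9%
GPT-4o, May 13th (temp=1) | 74 | 71 | 68 | 53 | 53 | 53 | 45 | 45 | 32 | 32 | 52.4%
GPT-5.4 Mini | 87 | 82 | 79 | 74 | 71 | 66 | 50 | 0 | 0 | 0 | 50.8%
Mistral Small Creative | 63 | 53 | 53 | 50 | 47 | 47 | 45 | 39 | 39 | 21 | 45.8%
Arcee AI: Trinity Large (Preview) | 76 | 61 | 58 | 50 | 50 | 42 | 39 | 37 | 32 | 0 | 44.5%
Ministral 3 8B | 53 | 50 | 47 | 47 | 45 | 45 | 42 | 39 | 39 | 0 | 40.8%
Gemma 3 27B | 47 | 45 | 42 | 39 | 39 | 37 | 37 | 37 | 37 | 37 | 39.7%
Ministral 8B | 50 | 45 | 42 | 42 | 39 | 37 | 37 | 37 | 34 | 34 | 39.7%
Ministral 3 14B | 50 | 47 | 45 | 42 | 39 | 37 | 37 | 34 | 32 | 24 | 38.7%
Hermes 3 405B | 50 | 50 | 47 | 45 | 42 | 37 | 34 | 29 | 29 | 21 | 38.4%
Ministral 3B | 53 | 50 | 45 | 39 | 39 | 26 | 24 | 18 | 13 | 8 | 31.6%
GPT-5.4 Nano (Reasoning, Low) | 84 | 79 | 79 | 71 | 0 | 0 | 0 | 0 | 0 | 0 | 31.3%
Mistral Small 4 | 58 | 47 | 34 | 32 | 32 | 29 | 29 | 24 | 13 | 0 | 29.7%
GPT-4o Mini (temp=0) | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 11 | 29.5%
Gemma 3 12B | 37 | 34 | 34 | 29 | 29 | 26 | 26 | 24 | 21 | 11 | 27.1%
Arcee AI: Trinity Mini | 37 | 37 | 37 | 37 | 34 | 26 | 24 | 16 | 11 | 8 | 26.6%
Qwen 2.5 72B | 42 | 39 | 39 | 37 | 34 | 32 | 21 | 8 | 3 | 0 | 25.5%
Ministral 3 3B | 50 | 39 | 39 | 32 | 29 | 26 | 16 | 16 | 5 | 0 | 25.3%
Gemma 4 26B | 79 | 79 | 79 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 23.7%
Llama 3.1 Nemotron 70B | 39 | 32 | 29 | 26 | 26 | 24 | 21 | 18 | 16 | 0 | 23.2%
Llama 3.1 70B | 47 | 34 | 32 | 26 | 21 | 16 | 16 | 13 | 3 | 0 | 20.8%
GPT-5.4 Nano | 53 | 45 | 37 | 32 | 29 | 0 | 0 | 0 | 0 | 0 | 19.5%
Hermes 3 70B | 34 | 34 | 26 | 26 | 18 | 16 | 16 | 11 | 3 | 0 | 18.4%
Cohere Command R+ (Aug. 2024) | 39 | 34 | 32 | 26 | 24 | 8 | 5 | 5 | 0 | 0 | 17.4%
GPT-4o Mini (temp=1) | 24 | 21 | 18 | 16 | 13 | 11 | 11 | 5 | 0 | 0 | 11.8%
Claude 3 Haiku | 26 | 18 | 11 | 8 | 5 | 5 | 5 | 3 | 0 | 0 | 8.2%
Mistral NeMO | 18 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.9%
Llama 3.1 8B | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5%
GPT-4.1 Nano | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3%
Rocinante 12B | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%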
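
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 98.5%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 85 | 97.5%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 97.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 97.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 96.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 85 | 95.5%
GPT-5.2 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95.5%
GPT-5 Mini | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 85 | 95.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 90 | 85 | 85 | 94.5%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 90 | 90 | 70 | 94.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 94.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 85 | 85 | 93.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 85 | 85 | 80 | 93.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 80 | 93.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 85 | 85 | 75 | 92.5%
GPT-5.1 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 85 | 70 | 91.5%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 80 | 65 | 91.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 95 | 90 | 90 | 90 | 90 | 80 | 80 | 91.5%
Aion 2.0 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 85 | 75 | 91.0%
Gemma 4 31B | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 91.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 90 | 90 | 85 | 85 | 85 | 75 | 91.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 85 | 85 | 85 | 91.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 95 | 90 | 90 | 90 | 90 | 90 | 90 | 70 | 90.5%
GPT-5.5 (Reasoning) | 95 | 95 | 95 | 95 | 95 | 95 | 90 | 80 | 80 | 80 | 90.0%
Claude Opus 4.7 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 70 | 70 | 70 | 90.0%
GPT-5 | 100 | 100 | 95 | 95 | 90 | 90 | 90 | 85 | 85 | 65 | 89.5%
Qwen 3.5 Plus (2026-04-20) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 75 | 89.5%
GPT-5.5 | 100 | 100 | 100 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 89.5%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 95 | 90 | 90 | 90 | 85 | 85 | 80 | 75 | 89.0%
GPT-5.5 (Reasoning, Low) | 95 | 95 | 95 | 95 | 95 | 95 | 90 | 80 | 75 | 70 | 88.5%
Z.AI GLM 5 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 75 | 70 | 88.5%
Qwen 3.6 Flash | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 75 | 75 | 88.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 95 | 95 | 90 | 90 | 85 | 85 | 70 | 70 | 88.0%
Grok 4 Fast | 100 | 100 | 95 | 95 | 85 | 85 | 85 | 80 | 80 | 75 | 88.0%
o4 Mini | 90 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 85 | 80 | 87.5%
Gemini 3 Pro (Preview) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 75 | 87.5%
Qwen 3.6 27B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 85 | 80 | 75 | 87.0%
Grok 4.1 Fast | 100 | 100 | 95 | 95 | 90 | 85 | 85 | 80 | 75 | 65 | 87.0%
Claude Sonnet 4.6 | 100 | 90 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 87.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 90 | 90 | 90 | 75 | 75 | 75 | 70 | 86.5%
MoonshotAI: Kimi K2.6 | 100 | 100 | 95 | 95 | 90 | 85 | 85 | 70 | 70 | 70 | 86.0%
ByteDance Seed 1.6 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 75 | 75 | 75 | 85.5%
o4 Mini High | 95 | 90 | 90 | 90 | 90 | 85 | 85 | 80 | 80 | 70 | 85.5%
ByteDance Seed 2.0 Mini | 100 | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 75 | 65 | 85.0%
Qwen 3.5 35B | 100 | 100 | 90 | 90 | 90 | 90 | 75 | 75 | 70 | 65 | 84.5%
Xiaomi MIMO v2.5 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 75 | 75 | 75 | 84.5%
Qwen 3.6 35B | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 75 | 75 | 70 | 84.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 85 | 70 | 70 | 70 | 70 | 70 | 83.5%
Gemma 4 26B (Reasoning) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 85 | 0 | 81.5%
Grok 4 | 100 | 95 | 90 | 90 | 90 | 85 | 85 | 65 | 60 | 55 | 81.5%
ByteDance Seed 2.0 Lite | 90 | 90 | 90 | 85 | 80 | 80 | 75 | 75 | 75 | 65 | 80.5%
Claude Sonnet 4.5 | 100 | 90 | 85 | 85 | 85 | 75 | 75 | 70 | 70 | 70 | 80.5%
Z.AI GLM 4.6 | 90 | 90 | 85 | 85 | 85 | 80 | 75 | 75 | 75 | 60 | 80.0%
Z.AI GLM 4.7 | 100 | 90 | 85 | 80 | 80 | 75 | 75 | 75 | 70 | 70 | 80.0%
MoonshotAI: Kimi K2.5 | 90 | 90 | 90 | 90 | 90 | 85 | 80 | 65 | 60 | 60 | 80.0%
Claude Haiku 4.5 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80.0%
GPT-OSS 120B | 90 | 90 | 90 | 80 | 80 | 75 | 75 | 75 | 70 | 65 | 79.0%
Z.AI GLM 4.5 Air | 90 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 75 | 60 | 78.5%
Stealth: Healer Alpha | 90 | 90 | 90 | 90 | 85 | 85 | 75 | 65 | 65 | 45 | 78.0%
DeepSeek V4 Flash | 90 | 90 | 85 | 80 | 75 | 75 | 75 | 70 | 70 | 70 | 78.0%
Claude Sonnet 4 | 100 | 85 | 85 | 85 | 70 | 70 | 70 | 70 | 70 | 70 | 77.5%
MiniMax M2.7 | 90 | 80 | 80 | 80 | 80 | 80 | 75 | 75 | 75 | 55 | 77.0%
Gemma 4 26B | 80 | 80 | 80 | 80 | 75 | 75 | 75 | 75 | 75 | 75 | 77.0%
Claude 3.7 Sonnet | 90 | 90 | 80 | 75 | 75 | 75 | 75 | 75 | 65 | 65 | 76.5%
Gemini 3 Flash (Preview) | 90 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 70 | 76.0%
Gemini 2.5 Flash | 90 | 90 | 80 | 80 | 75 | 75 | 70 | 70 | 70 | 55 | 75.5%
Inception Mercury 2 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 70 | 74.5%
GPT-5.4 (Reasoning, Low) | 100 | 75 | 70 | 70 | 70 | 65 | 65 | 65 | 65 | 65 | 71.0%
GPT-5.4 (Reasoning) | 85 | 85 | 80 | 70 | 65 | 65 | 65 | 65 | 65 | 65 | 71.0%
Gemini 3.1 Flash Lite (Preview) | 75 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.5%
Gemini 3.1 Flash Lite | 75 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.5%
Mistral Large | 90 | 80 | 75 | 75 | 75 | 65 | 65 | 60 | 60 | 60 | 70.5%
Claude 3.5 Sonnet | 80 | 75 | 70 | 70 | 70 | 70 | 70 | 65 | 65 | 65 | 70.0%
Grok 4.20 | 80 | 80 | 75 | 70 | 70 | 70 | 70 | 65 | 60 | 60 | 70.0%
MiniMax M2.5 | 80 | 80 | 75 | 75 | 70 | 70 | 65 | 65 | 60 | 60 | 70.0%
Claude Opus 4 | 90 | 75 | 75 | 75 | 75 | 65 | 65 | 60 | 60 | 60 | 70.0%
Gemini 3.1 Flash Lite (Reasoning) | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 65 | 69.5%
Z.AI GLM 4.5 | 75 | 75 | 75 | 75 | 75 | 70 | 70 | 65 | 60 | 55 | 69.5%
GPT-5.4 Mini (Reasoning) | 75 | 75 | 75 | 70 | 70 | 70 | 70 | 70 | 65 | 50 | 69.0%
Nemotron 3 Super | 100 | 95 | 90 | 70 | 70 | 70 | 60 | 55 | 45 | 35 | 69.0%
ByteDance Seed 1.6 Flash | 90 | 80 | 75 | 70 | 70 | 65 | 65 | 60 | 55 | 55 | 68.5%
Qwen 3.5 9B | 90 | 80 | 80 | 80 | 75 | 75 | 70 | 65 | 65 | 0 | 68.0%
Grok 4.20 (Beta) | 80 | 80 | 75 | 75 | 70 | 70 | 70 | 55 | 55 | 45 | 67.5%
GPT-5.4 Mini (Reasoning, Low) | 80 | 75 | 75 | 75 | 70 | 65 | 65 | 55 | 55 | 55 | 67.0%
Gemini 2.5 Flash Lite (Reasoning) | 75 | 75 | 75 | 70 | 70 | 65 | 65 | 60 | 55 | 55 | 66.5%
Mistral Medium 3.1 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 70 | 70 | 0 | 66.5%
GPT-4.1 Mini | 70 | 70 | 70 | 70 | 70 | 70 | 65 | 65 | 65 | 50 | 66.5%
Mistral Large 3 | 85 | 75 | 65 | 65 | 65 | 60 | 60 | 60 | 60 | 60 | 65.5%
GPT-4o, May 13th (temp=0) | 70 | 70 | 70 | 65 | 65 | 65 | 65 | 65 | 65 | 50 | 65.0%
GPT-5.4 Nano (Reasoning) | 85 | 80 | 75 | 75 | 75 | 70 | 65 | 65 | 60 | 0 | 65.0%
Mistral Large 2 | 80 | 75 | 65 | 65 | 65 | 65 | 65 | 60 | 50 | 50 | 64.0%
Llama 3.1 Nemotron 70B | 75 | 65 | 65 | 65 | 65 | 65 | 60 | 60 | 60 | 60 | 64.0%
Stealth: Aurora Alpha | 85 | 80 | 75 | 75 | 75 | 75 | 75 | 75 | 10 | 10 | 63.5%
Mistral Small 3.2 24B | 75 | 75 | 75 | 70 | 65 | 65 | 55 | 55 | 50 | 50 | 63.5%
GPT-4.1 | 75 | 65 | 65 | 65 | 65 | 65 | 65 | 60 | 55 | 50 | 63.0%
GPT-4o, Aug. 6th (temp=0) | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 60 | 60 | 55 | 63.0%
Z.AI GLM 4.7 Flash | 70 | 70 | 70 | 70 | 65 | 65 | 60 | 55 | 55 | 45 | 62.5%
Mistral Small 4 (Reasoning) | 80 | 75 | 70 | 65 | 65 | 65 | 55 | 55 | 50 | 40 | 62.0%
DeepSeek-V2 Chat | 80 | 70 | 70 | 70 | 70 | 60 | 55 | 50 | 45 | 45 | 61.5%
DeepSeek V4 Pro | 85 | 85 | 70 | 70 | 70 | 60 | 55 | 55 | 55 | 0 | 60.5%
GPT-4o, Aug. 6th (temp=1) | 70 | 70 | 65 | 65 | 60 | 55 | 55 | 55 | 55 | 50 | 60.0%
Grok 4.3 | 80 | 70 | 70 | 70 | 65 | 65 | 55 | 55 | 45 | 20 | 59.5%
Qwen 3 32B | 80 | 80 | 65 | 65 | 65 | 55 | 55 | 45 | 45 | 35 | 59.0%
Writer: Palmyra X5 | 70 | 70 | 60 | 60 | 60 | 55 | 55 | 55 | 45 | 45 | 57.5%
Llama 3.1 70B | 70 | 65 | 60 | 55 | 55 | 55 | 55 | 55 | 55 | 50 | 57.5%
Mistral Small Creative | 75 | 70 | 70 | 60 | 55 | 55 | 55 | 50 | 45 | 40 | 57.5%
Ministral 3 14B | 80 | 80 | 75 | 75 | 65 | 65 | 65 | 60 | 5 | 5 | 57.5%
GPT-4o, May 13th (temp=1) | 70 | 65 | 60 | 60 | 60 | 60 | 55 | 50 | 50 | 40 | 57.0%
GPT-5 Nano | 75 | 65 | 65 | 60 | 60 | 55 | 55 | 45 | 45 | 45 | 57.0%
Qwen3 235B A22B Instruct 2507 | 65 | 65 | 65 | 60 | 60 | 55 | 50 | 50 | 45 | 40 | 55.5%
DeepSeek V3 (2024-12-26) | 70 | 65 | 60 | 55 | 55 | 55 | 55 | 45 | 45 | 45 | 55.0%
DeepSeek V3 (2025-03-24) | 70 | 70 | 60 | 55 | 55 | 55 | 50 | 45 | 45 | 45 | 55.0%
Gemini 2.5 Flash Lite | 75 | 60 | 60 | 55 | 55 | 55 | 55 | 50 | 40 | 35 | 54.0%
Qwen 2.5 72B | 70 | 60 | 60 | 60 | 55 | 55 | 50 | 50 | 45 | 35 | 54.0%
Nemotron 3 Nano | 65 | 60 | 60 | 60 | 55 | 55 | 55 | 45 | 45 | 40 | 54.0%
Gemma 3 27B | 70 | 65 | 55 | 55 | 55 | 50 | 50 | 45 | 45 | 45 | 53.5%
DeepSeek V3.2 | 65 | 65 | 65 | 55 | 55 | 55 | 45 | 40 | 35 | 30 | 51.0%
Inception Mercury | 65 | 65 | 60 | 55 | 55 | 50 | 50 | 50 | 30 | 15 | 49.5%
Arcee AI: Trinity Large (Preview) | 70 | 55 | 55 | 55 | 55 | 50 | 50 | 45 | 35 | 20 | 49.0%
GPT-5.4 Mini | 65 | 60 | 55 | 55 | 55 | 50 | 45 | 45 | 45 | 5 | 48.0%
Hermes 3 405B | 65 | 55 | 50 | 45 | 45 | 45 | 45 | 40 | 40 | 40 | 47.0%
DeepSeek V3.1 | 70 | 55 | 55 | 55 | 50 | 45 | 40 | 40 | 40 | 0 | 45.0%
Ministral 3 3B | 50 | 45 | 45 | 40 | 40 | 40 | 35 | 35 | 30 | 30 | 39.0%
Ministral 3 8B | 55 | 50 | 45 | 45 | 40 | 35 | 30 | 25 | 20 | 5 | 35.0%
Ministral 8B | 65 | 45 | 45 | 45 | 35 | 30 | 30 | 25 | 0 | 0 | 32.0%
Gemma 3 12B | 50 | 35 | 35 | 35 | 25 | 25 | 25 | 20 | 20 | 15 | 28.5%
Ministral 3B | 55 | 40 | 30 | 30 | 30 | 30 | 30 | 15 | 15 | 5 | 28.0%
GPT-5.4 Nano (Reasoning, Low) | 75 | 75 | 65 | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 27.5%
GPT-4o Mini (temp=1) | 35 | 35 | 35 | 35 | 35 | 25 | 25 | 20 | 10 | 10 | 26.5%
Hermes 3 70B | 45 | 40 | 35 | 35 | 30 | 30 | 25 | 20 | 5 | 0 | 26.5%
GPT-4o Mini (temp=0) | 35 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 26.0%
Arcee AI: Trinity Mini | 45 | 35 | 30 | 30 | 25 | 20 | 20 | 20 | 20 | 10 | 25.5%
Mistral Small 4 | 40 | 35 | 35 | 30 | 30 | 25 | 20 | 20 | 15 | 0 | 25.0%
Cohere Command R+ (Aug. 2024) | 50 | 45 | 40 | 40 | 35 | 30 | 0 | 0 | 0 | 0 | 24.0%
Mistral NeMO | 50 | 40 | 40 | 35 | 25 | 25 | 0 | 0 | 0 | 0 | 21.5%
Claude 3 Haiku | 30 | 30 | 20 | 20 | 20 | 20 | 20 | 20 | 15 | 5 | 20.0%
Llama 3.1 8B | 35 | 30 | 30 | 30 | 30 | 20 | 15 | 10 | 0 | 0 | 20.0%
GPT-5.4 Nano | 35 | 20 | 20 | 15 | 15 | 15 | 0 | 0 | 0 | 0 | 12.0%
Gemma 3 4B | 30 | 25 | 20 | 10 | 10 | 5 | 0 | 0 | 0 | 0 | 10.0%
Rocinante 12B | 50 | 45 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.5%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%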
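
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.5%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.5%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 98.5%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 98.5%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 98.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 97.5%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 97.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 97.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 90 | 90 | 96.5%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 85 | 96.5%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 96.5%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 85 | 80 | 95.5%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 85 | 94.0%
o4 Mini | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 93.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 85 | 85 | 92.5%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 85 | 85 | 92.5%
Grok 4 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 92.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 90 | 90 | 85 | 85 | 85 | 85 | 92.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 90 | 90 | 85 | 85 | 85 | 85 | 92.0%
Gemma 4 31B | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 91.0%
Llama 3.1 70B | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 91.0%
Llama 3.1 Nemotron 70B | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 91.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 85 | 85 | 85 | 91.0%
GPT-OSS 120B | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 91.0%
Gemma 4 26B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 90.0%
Inception Mercury 2 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemini 2.5 Flash | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Z.AI GLM 4.5 Air | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 75 | 89.5%
Mistral Large | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 85 | 89.5%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 0 | 89.5%
Claude Opus 4.7 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 75 | 75 | 89.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 0 | 89.0%
Mistral Large 2 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 89.0%
DeepSeek V4 Flash | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 89.0%
Hermes 3 405B | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 85 | 80 | 60 | 88.5%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 0 | 88.0%
Inception Mercury | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 75 | 75 | 88.0%
Mistral Large 3 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 85 | 85 | 85 | 87.5%
Gemini 2.5 Flash Lite (Reasoning) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 65 | 86.5%
MiniMax M2.7 | 100 | 100 | 90 | 90 | 90 | 85 | 85 | 75 | 75 | 75 | 86.5%
Nemotron 3 Nano | 100 | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 80 | 80 | 86.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 90 | 85 | 85 | 85 | 85 | 65 | 65 | 86.0%
Gemma 3 12B | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 75 | 85.5%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 80 | 10 | 85.0%
MiniMax M2.5 | 100 | 100 | 90 | 90 | 90 | 85 | 75 | 75 | 75 | 70 | 85.0%
Claude Sonnet 4.6 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85.0%
Grok 4.3 | 100 | 95 | 90 | 90 | 90 | 85 | 80 | 75 | 75 | 65 | 84.5%
DeepSeek V4 Pro | 100 | 100 | 95 | 95 | 95 | 90 | 90 | 90 | 85 | 0 | 84.0%
GPT-4.1 Mini | 90 | 90 | 90 | 90 | 85 | 80 | 80 | 75 | 75 | 75 | 83.0%
ByteDance Seed 1.6 Flash | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 80 | 75 | 65 | 82.0%
Stealth: Aurora Alpha | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 0 | 81.0%
Qwen 3 32B | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 75 | 70 | 50 | 80.5%
Mistral Small 3.2 24B | 90 | 90 | 90 | 80 | 75 | 75 | 75 | 75 | 75 | 75 | 80.0%
Qwen3 235B A22B Instruct 2507 | 85 | 85 | 85 | 85 | 80 | 75 | 75 | 75 | 75 | 65 | 78.5%
GPT-5 Nano | 90 | 90 | 80 | 80 | 80 | 80 | 70 | 70 | 70 | 70 | 78.0%
Mistral Small 4 (Reasoning) | 90 | 90 | 80 | 80 | 80 | 75 | 75 | 75 | 70 | 65 | 78.0%
Qwen 2.5 72B | 90 | 90 | 90 | 80 | 80 | 80 | 80 | 80 | 70 | 35 | 77.5%
Z.AI GLM 4.7 Flash | 100 | 90 | 90 | 80 | 75 | 75 | 75 | 65 | 60 | 55 | 76.5%
Arcee AI: Trinity Large (Preview) | 90 | 85 | 85 | 80 | 75 | 70 | 70 | 70 | 70 | 65 | 76.0%
Writer: Palmyra X5 | 85 | 85 | 85 | 85 | 75 | 75 | 70 | 65 | 65 | 65 | 75.5%
Gemini 2.5 Flash Lite | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 70 | 74.5%
GPT-5.4 Mini | 90 | 85 | 80 | 75 | 75 | 75 | 75 | 65 | 65 | 40 | 72.5%
Mistral Small Creative | 80 | 80 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 65 | 71.5%
Cohere Command R+ (Aug. 2024) | 90 | 90 | 80 | 70 | 70 | 70 | 65 | 60 | 60 | 55 | 71.0%
Claude Haiku 4.5 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 60 | 60 | 60 | 70.5%
Gemma 3 27B | 75 | 75 | 70 | 65 | 65 | 65 | 65 | 55 | 55 | 55 | 64.5%
Ministral 3 8B | 70 | 70 | 65 | 65 | 60 | 60 | 60 | 50 | 50 | 35 | 58.5%
Ministral 3 14B | 85 | 75 | 65 | 60 | 60 | 60 | 60 | 55 | 45 | 5 | 57.0%
Ministral 8B | 75 | 75 | 65 | 65 | 60 | 60 | 55 | 55 | 35 | 20 | 56.5%
GPT-4o Mini (temp=0) | 60 | 60 | 60 | 60 | 60 | 60 | 50 | 50 | 50 | 50 | 56.0%
Ministral 3 3B | 65 | 65 | 65 | 65 | 60 | 60 | 55 | 55 | 40 | 30 | 56.0%
GPT-5.4 Nano | 90 | 80 | 75 | 75 | 70 | 60 | 55 | 40 | 0 | 0 | 54.5%
Hermes 3 70B | 80 | 75 | 65 | 60 | 60 | 60 | 50 | 45 | 40 | 0 | 53.5%
GPT-4o Mini (temp=1) | 60 | 60 | 60 | 50 | 50 | 50 | 50 | 50 | 50 | 35 | 51.5%
Ministral 3B | 65 | 60 | 55 | 50 | 50 | 50 | 50 | 45 | 40 | 35 | 50.0%
Llama 3.1 8B | 75 | 70 | 65 | 60 | 60 | 40 | 40 | 35 | 30 | 20 | 49.5%
Mistral Small 4 | 70 | 65 | 65 | 60 | 55 | 50 | 50 | 45 | 30 | 0 | 49.0%
Mistral NeMO | 65 | 55 | 55 | 45 | 45 | 45 | 45 | 45 | 45 | 45 | 49.0%
Arcee AI: Trinity Mini | 60 | 60 | 55 | 50 | 45 | 45 | 40 | 35 | 35 | 35 | 46.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 90 | 80 | 80 | 75 | 0 | 0 | 0 | 0 | 0 | 42.5%
Claude 3 Haiku | 45 | 45 | 45 | 40 | 35 | 35 | 35 | 35 | 20 | 15 | 35.0%
Rocinante 12B | 55 | 40 | 35 | 15 | 15 | 15 | 5 | 5 | 5 | 0 | 19.0%
GPT-4.1 Nano | 20 | 20 | 15 | 10 | 10 | 10 | 10 | 10 | 0 | 0 | 10.5%
WizardLM 2 8x22b | 10 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.0%
Gemma 3 4B | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%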

tiers

Model #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Avg ▼
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen3.6 Max Preview 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5.1 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.3 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.7 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5 Mini 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.6 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 31B (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 122B 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-04-20) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 26B (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 100 100 100 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.5 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 1.6 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.6 Flash 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
o4 Mini High 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.2 100 100 100 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Pro (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100 100 100 100 100 100.0%
Aion 2.0 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.5 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.6 35B 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Pro (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4.1 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100 100 100 100 100 100.0%
o4 Mini 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 35B 100 100 100 100 100 100 100 100 100 100 100.0%
Xiaomi MIMO v2.5 Pro 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Mini 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 31B 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Flash 100 100 100 100 100 100 100 100 100 100 100.0%
Stealth: Healer Alpha 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 26B 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Lite 100 100 100 100 100 100 100 100 100 100 100.0%
Nemotron 3 Super 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100 100 100 100 100 100.0%
Claude 3.5 Sonnet 100 100 100 100 100 100 100 100 100 100 100.0%
Claude 3.7 Sonnet 100 100 100 100 100 100 100 100 100 100 100.0%
Hermes 3 405B 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3 32B 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 Turbo 100 100 100 100 100 100 100 100 100 90 99.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100 100 100 100 90 99.0%
Qwen 3.5 27B 100 100 100 100 100 100 100 100 100 90 99.0%
Qwen 3.6 27B 100 100 100 100 100 100 100 100 100 90 99.0%
DeepSeek V4 Flash (Reasoning) 100 100 100 100 100 100 100 100 100 90 99.0%
Z.AI GLM 4.5 100 100 100 100 100 100 100 100 100 90 99.0%
Grok 4 Fast 100 100 100 100 100 100 100 100 100 90 99.0%
Xiaomi MIMO v2.5 100 100 100 100 100 100 100 100 100 90 99.0%
GPT-5.5 (Reasoning, Low) 100 100 100 100 100 100 100 100 90 90 98.0%
Grok 4 100 100 100 100 100 100 100 90 90 90 97.0%
Claude Sonnet 4.5 100 100 100 100 100 100 100 100 100 70 97.0%
Qwen 3.5 9B 100 100 100 100 100 100 100 100 90 80 97.0%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 100 100 90 90 90 97.0%
Claude Opus 4.7 100 100 100 100 100 100 100 90 90 80 96.0%
Stealth: Hunter Alpha 100 100 100 100 100 100 90 90 90 90 96.0%
Gemini 3.1 Flash Lite (Reasoning) 100 100 100 100 100 100 90 90 90 90 96.0%
GPT-5 100 100 100 100 100 100 90 90 90 90 96.0%
Z.AI GLM 4.6 100 100 100 100 100 100 100 100 90 70 96.0%
DeepSeek V3 (2025-03-24) 100 100 100 100 100 100 100 100 80 80 96.0%
DeepSeek V3.2 100 100 100 100 100 90 90 90 90 90 95.0%
Gemini 3.1 Flash Lite 100 100 100 100 100 90 90 90 90 90 95.0%
DeepSeek V4 Pro 100 100 100 100 100 100 100 90 90 70 95.0%
DeepSeek V4 Flash 100 100 100 100 100 100 100 80 80 80 94.0%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 100 100 100 100 80 80 80 94.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100 100 100 100 30 93.0%
Mistral Large 100 100 100 90 90 90 90 90 90 90 93.0%
GPT-OSS 120B 100 100 100 100 100 100 80 80 80 80 92.0%
DeepSeek-V2 Chat 100 100 100 100 100 100 90 80 80 70 92.0%
Llama 3.1 Nemotron 70B 100 100 100 100 100 100 100 100 90 30 92.0%
Llama 3.1 70B 100 100 100 100 100 100 90 80 70 70 91.0%
Mistral Large 3 90 90 90 90 90 90 90 90 90 90 90.0%
Mistral Large 2 90 90 90 90 90 90 90 90 90 90 90.0%
DeepSeek V3 (2024-12-26) 100 100 100 100 100 100 80 80 80 60 90.0%
Z.AI GLM 4.5 Air 100 100 100 100 100 100 100 80 70 50 90.0%
Nemotron 3 Nano 100 100 100 100 100 100 80 80 80 60 90.0%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 100 100 100 80 0 88.0%
Mistral Small 4 (Reasoning) 100 100 90 90 90 80 80 80 80 70 86.0%
Inception Mercury 2 100 100 100 80 80 80 80 80 80 80 86.0%
Claude Opus 4 100 100 100 100 100 70 70 70 70 70 85.0%
Z.AI GLM 4.7 Flash 100 100 100 100 100 80 80 70 70 50 85.0%
Stealth: Aurora Alpha 100 100 100 100 100 80 80 80 80 30 85.0%
DeepSeek V3.1 100 100 100 100 100 100 90 80 70 0 84.0%
MiniMax M2.7 100 100 100 80 80 80 80 80 70 70 84.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 90 90 70 60 60 60 83.0%
ByteDance Seed 1.6 Flash 100 80 80 80 80 80 80 80 80 70 81.0%
Gemini 2.5 Flash 100 80 80 80 80 80 80 80 70 70 80.0%
Inception Mercury 100 80 80 80 80 80 80 80 60 60 78.0%
MiniMax M2.5 100 100 90 80 70 70 70 50 50 50 73.0%
GPT-5 Nano 100 100 100 80 80 80 80 50 50 0 72.0%
Mistral Medium 3.1 100 70 70 70 70 70 70 70 60 60 71.0%
Claude Sonnet 4.6 70 70 70 70 70 70 70 70 70 70 70.0%
Gemma 3 27B 70 70 70 70 70 70 70 70 70 70 70.0%
GPT-4o, Aug. 6th (temp=0) 90 90 90 60 60 60 60 60 60 60 69.0%
GPT-4o, May 13th (temp=1) 90 90 70 70 60 60 60 60 60 60 68.0%
GPT-4.1 Mini 80 80 70 70 70 70 70 70 70 30 68.0%
Qwen 2.5 72B 80 80 80 80 80 80 50 50 50 50 68.0%
GPT-4o, Aug. 6th (temp=1) 90 90 70 60 60 60 60 60 60 60 67.0%
Grok 4.20 100 90 80 70 60 60 60 50 50 40 66.0%
Grok 4.20 (Beta) 80 70 70 60 60 60 60 60 60 50 63.0%
Grok 4.3 70 70 70 70 70 70 60 60 50 40 63.0%
Arcee AI: Trinity Mini 100 80 80 80 60 60 50 50 30 30 62.0%
Writer: Palmyra X5 70 70 70 70 70 70 60 60 60 0 60.0%
GPT-4o Mini (temp=1) 80 80 80 80 50 50 50 50 40 40 60.0%
Qwen3 235B A22B Instruct 2507 70 70 70 70 70 60 60 60 60 0 59.0%
Arcee AI: Trinity Large (Preview) 70 70 70 70 60 60 50 50 40 40 58.0%
Claude Haiku 4.5 70 70 70 70 60 60 50 40 40 40 57.0%
GPT-5.4 Mini 90 70 70 60 60 60 60 40 40 10 56.0%
GPT-4o Mini (temp=0) 80 80 50 50 50 50 50 50 50 50 56.0%
Gemma 3 12B 80 50 50 50 50 50 50 50 50 50 53.0%
Mistral Small Creative 60 60 60 60 60 60 50 50 40 30 53.0%
GPT-5.4 Nano (Reasoning, Low) 100 100 100 80 30 20 0 0 0 0 43.0%
Llama 3.1 8B 70 70 70 70 40 30 30 20 20 0 42.0%
Gemini 2.5 Flash Lite 50 50 40 40 40 40 40 40 40 40 42.0%
Cohere Command R+ (Aug. 2024) 80 60 60 60 50 40 20 20 20 0 41.0%
Ministral 3 3B 70 70 60 50 40 40 40 30 0 0 40.0%
Mistral Small 3.2 24B 60 50 50 40 40 40 30 30 30 10 38.0%
Mistral NeMO 70 50 50 50 40 40 30 30 10 0 37.0%
Claude 3 Haiku 60 60 40 40 40 30 20 20 10 10 33.0%
GPT-5.4 Nano 60 60 50 50 40 30 30 0 0 0 32.0%
Ministral 3 14B 50 50 40 40 40 20 20 20 20 10 31.0%
Ministral 3B 70 70 60 40 20 20 10 10 0 0 30.0%
Mistral Small 4 50 50 40 40 40 30 30 20 0 0 30.0%
Hermes 3 70B 60 50 50 30 30 20 20 10 0 0 27.0%
GPT-4.1 Nano 20 20 20 20 0 0 0 0 0 0 8.0%
Ministral 8B 50 10 10 0 0 0 0 0 0 0 7.0%
Rocinante 12B 40 20 10 0 0 0 0 0 0 0 7.0%
Gemma 3 4B 20 20 20 0 0 0 0 0 0 0 6.0%
Ministral 3 8B 10 0 0 0 0 0 0 0 0 0 1.0%
WizardLM 2 8x22b 0 0 0 0 0 0 0 0 0 0 0.0%
LFM2 24B 0 0 0 0 0 0 0 0 0 0 0.0%
Model #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Avg ▼
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen3.6 Max Preview 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 Turbo 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning, Low) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 31B (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 26B (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.5 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 27B 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Pro (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.6 27B 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Pro (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 35B 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Mini 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 31B 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 26B 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Lite 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100 100 100 100 100 100.0%
Stealth: Aurora Alpha 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 1.6 100 100 100 100 100 100 100 100 100 92 99.2%
GPT-5.2 100 100 100 100 100 100 100 100 100 92 99.2%
GPT-5.5 100 100 100 100 100 100 100 100 100 92 99.2%
Claude Opus 4 100 100 100 100 100 100 100 100 100 92 99.2%
Z.AI GLM 5.1 100 100 100 100 100 100 100 100 100 92 99.2%
MoonshotAI: Kimi K2.6 100 100 100 100 100 100 100 100 100 92 99.2%
Qwen 3.5 Plus (2026-04-20) 100 100 100 100 100 100 100 100 100 92 99.2%
Aion 2.0 100 100 100 100 100 100 100 100 100 92 99.2%
Gemini 2.5 Pro 100 100 100 100 100 100 100 100 100 92 99.2%
Grok 4 100 100 100 100 100 100 100 100 100 92 99.2%
Qwen 3.5 122B 100 100 100 100 100 100 100 100 92 92 98.3%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100 100 100 92 92 98.3%
DeepSeek V4 Flash (Reasoning) 100 100 100 100 100 100 100 100 92 92 98.3%
Inception Mercury 2 100 100 100 100 100 100 100 100 92 92 98.3%
GPT-5.4 (Reasoning) 100 100 100 100 100 100 100 92 92 92 97.5%
Z.AI GLM 5 100 100 100 100 100 100 100 100 100 75 97.5%
Qwen 3.5 9B 100 100 100 100 100 100 100 100 92 83 97.5%
Xiaomi MIMO v2.5 100 100 100 100 100 100 100 100 100 75 97.5%
Claude 3.5 Sonnet 100 100 100 100 100 100 100 100 83 83 96.7%
Mistral Medium 3.1 100 100 100 100 100 100 100 100 83 83 96.7%
Grok 4.20 (Reasoning) 100 100 100 100 100 100 92 92 92 92 96.7%
DeepSeek V4 Pro 100 100 100 100 100 100 100 100 92 75 96.7%
GPT-5.1 100 100 100 100 92 92 92 92 92 92 95.0%
GPT-5 100 100 100 100 92 92 92 92 92 92 95.0%
Stealth: Healer Alpha 100 100 100 100 100 100 100 100 75 75 95.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 100 100 100 83 83 83 95.0%
Qwen3 235B A22B Instruct 2507 100 100 100 100 100 100 92 92 92 75 95.0%
MiniMax M2.7 100 100 100 100 100 100 100 83 83 67 93.3%
MiniMax M2.5 100 100 100 100 100 100 100 83 75 75 93.3%
Grok 4.3 (Reasoning) 100 100 100 100 100 100 100 100 92 42 93.3%
o4 Mini 100 100 100 100 100 100 92 92 75 75 93.3%
Grok 4 Fast 100 100 100 100 100 100 100 100 100 33 93.3%
Claude Opus 4.7 (Reasoning) 100 100 100 100 100 100 100 75 75 75 92.5%
GPT-5 Mini 100 92 92 92 92 92 92 92 92 92 92.5%
o4 Mini High 100 100 100 100 100 92 92 92 75 75 92.5%
Llama 3.1 Nemotron 70B 100 100 100 100 100 92 83 83 83 83 92.5%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 100 83 83 83 67 91.7%
Qwen 3.6 35B 100 100 100 100 100 100 92 75 75 75 91.7%
Qwen 3.5 Flash 100 100 100 100 100 100 100 92 75 50 91.7%
Qwen 3.6 Flash 100 100 100 100 100 100 100 100 100 0 90.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100 75 75 75 75 90.0%
Writer: Palmyra X5 100 100 100 100 100 100 75 75 75 75 90.0%
Z.AI GLM 4.5 100 100 100 100 100 92 75 75 75 75 89.2%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 100 100 75 67 50 89.2%
Gemini 2.5 Flash Lite (Reasoning) 100 100 92 92 92 83 83 83 83 83 89.2%
GPT-OSS 120B 100 100 100 100 100 92 92 75 75 58 89.2%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 75 75 75 75 75 87.5%
Xiaomi MIMO v2.5 Pro 100 100 100 100 100 75 75 75 75 75 87.5%
Z.AI GLM 4.5 Air 100 100 100 100 100 100 83 75 75 42 87.5%
Z.AI GLM 4.6 100 100 100 100 100 100 92 92 75 8 86.7%
Mistral Small 4 (Reasoning) 100 100 100 100 100 100 92 75 58 42 86.7%
DeepSeek V3 (2025-03-24) 100 100 100 92 83 83 83 75 75 75 86.7%
Mistral Small Creative 100 100 100 92 92 83 83 83 75 58 86.7%
Stealth: Hunter Alpha 100 100 100 100 83 75 75 75 75 75 85.8%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 83 83 83 67 67 67 85.0%
Inception Mercury 100 100 100 100 83 75 75 75 75 58 84.2%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 92 92 83 75 0 84.2%
ByteDance Seed 1.6 Flash 100 92 83 83 83 83 83 83 83 67 84.2%
Gemini 3.1 Flash Lite 100 100 100 83 83 83 75 75 75 50 82.5%
GPT-4.1 Mini 92 92 92 92 92 75