Accuracy (recall)

Test: Codex Violation Detection

Avg. Score: 73.7%
Scenarios: 8

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 2.5 Flash (Reasoning) | 94.0% | $0.0079 | 12.5s | 86%
2 | Grok 4.1 Fast | 93.8% | $0.0021 | 21.1s | 84%
3 | GPT-5.4 | 95.9% | $0.013 | 8.8s | 85%
4 | Z.AI GLM 5 Turbo | 94.8% | $0.0072 | 17.1s | 85%
5 | Gemini 3 Flash (Preview, Reasoning) | 94.7% | $0.011 | 18.0s | 86%
6 | DeepSeek V4 Flash (Reasoning) | 94.4% | $0.0010 | 40.3s | 82%
7 | GPT-5.5 | 96.7% | $0.030 | 7.7s | 90%
8 | Grok 4 Fast | 92.4% | $0.0018 | 12.8s | 72%
9 | Grok 4.20 (Beta, Reasoning) | 96.1% | $0.026 | 16.8s | 89%
10 | Gemma 4 31B | 92.5% | $0.0009 | 37.9s | 78%
11 | Grok 4.20 (Reasoning) | 95.8% | $0.015 | 45.4s | 90%
12 | GPT-5.2 | 95.7% | $0.024 | 24.2s | 87%
13 | Gemini 3.5 Flash (Reasoning, Minimal) | 90.7% | $0.011 | 3.6s | 73%
14 | Gemini 3 Flash (Preview) | 88.8% | $0.0031 | 4.5s | 69%
15 | Claude Opus 4.5 | 98.0% | $0.041 | 9.7s | 91%
16 | Gemini 2.5 Pro | 97.4% | $0.035 | 23.2s | 91%
17 | GPT-5.4 (Reasoning, Low) | 92.4% | $0.020 | 13.5s | 79%
18 | Claude Sonnet 4.5 | 93.6% | $0.024 | 8.9s | 80%
19 | Xiaomi MIMO v2.5 Pro | 92.5% | $0.0091 | 37.0s | 80%
20 | Gemini 2.5 Flash | 82.5% | $0.0025 | 2.8s | 69%
21 | Claude Opus 4.6 | 97.3% | $0.040 | 10.2s | 88%
22 | Stealth: Healer Alpha | 89.0% | $0.0000 | 21.8s | 67%
23 | GPT-5.5 (Reasoning, Low) | 95.5% | $0.040 | 14.2s | 88%
24 | Qwen 3.6 Flash | 92.4% | $0.011 | 29.9s | 72%
25 | Stealth: Hunter Alpha | 89.5% | $0.0000 | 44.0s | 71%
26 | Qwen 3.5 Flash | 92.8% | $0.0038 | 1.0m | 77%
27 | Xiaomi MIMO v2.5 | 89.2% | $0.0058 | 21.8s | 65%
28 | Inception Mercury 2 | 82.2% | $0.0030 | 4.6s | 62%
29 | Claude Sonnet 4 | 90.3% | $0.023 | 9.0s | 72%
30 | Qwen 3.6 35B | 92.1% | $0.013 | 51.0s | 78%
31 | Qwen 3.5 35B | 93.3% | $0.017 | 54.8s | 81%
32 | ByteDance Seed 1.6 | 91.9% | $0.0067 | 1.0m | 77%
33 | Z.AI GLM 5 | 92.1% | $0.013 | 56.4s | 79%
34 | Gemini 3.1 Flash Lite | 81.1% | $0.0018 | 3.2s | 59%
35 | GPT-5 Mini | 91.2% | $0.0092 | 51.7s | 74%
36 | Gemini 3.5 Flash (Reasoning) | 94.8% | $0.045 | 18.1s | 86%
37 | GPT-5.4 Mini (Reasoning, Low) | 83.0% | $0.0055 | 6.7s | 61%
38 | Gemini 3.1 Flash Lite (Preview) | 81.9% | $0.0019 | 2.2s | 57%
39 | o4 Mini | 88.9% | $0.019 | 28.1s | 74%
40 | Z.AI GLM 4.7 | 91.3% | $0.0091 | 1.0m | 77%
41 | Gemini 2.5 Flash Lite (Reasoning) | 81.1% | $0.0020 | 17.6s | 62%
42 | Z.AI GLM 5.1 | 95.5% | $0.017 | 1.6m | 88%
43 | ByteDance Seed 2.0 Lite | 89.8% | $0.0067 | 1.1m | 73%
44 | Qwen 3.5 Plus (2026-04-20) | 93.4% | $0.015 | 1.4m | 83%
45 | Claude Sonnet 4.6 | 82.9% | $0.024 | 9.9s | 70%
46 | Gemini 3.1 Flash Lite (Reasoning) | 79.3% | $0.0018 | 3.7s | 52%
47 | Grok 4.3 (Reasoning) | 94.3% | $0.019 | 1.4m | 81%
48 | GPT-5.4 Mini (Reasoning) | 87.9% | $0.020 | 28.6s | 68%
49 | Qwen 3.5 27B | 95.6% | $0.021 | 1.7m | 89%
50 | DeepSeek V4 Flash | 78.1% | $0.0003 | 9.3s | 53%
51 | Mistral Large 3 | 75.5% | $0.0030 | 10.2s | 58%
52 | Qwen 3.5 122B | 92.7% | $0.026 | 1.2m | 81%
53 | Mistral Large | 79.3% | $0.012 | 9.1s | 59%
54 | MiniMax M2.7 | 79.1% | $0.0022 | 29.4s | 59%
55 | MoonshotAI: Kimi K2.5 | 92.9% | $0.015 | 1.5m | 79%
56 | Qwen 3.5 Plus (2026-02-15) | 85.0% | $0.0041 | 34.2s | 55%
57 | DeepSeek V4 Pro | 81.7% | $0.0031 | 25.6s | 53%
58 | GPT-5.1 | 92.7% | $0.037 | 52.8s | 79%
59 | Mistral Large 2 | 75.9% | $0.012 | 8.8s | 57%
60 | Qwen3.7 Max | 96.3% | $0.045 | 1.1m | 85%
61 | o4 Mini High | 90.7% | $0.033 | 51.3s | 76%
62 | Grok 4.20 | 77.0% | $0.0053 | 6.4s | 50%
63 | Aion 2.0 | 91.7% | $0.0084 | 1.3m | 64%
64 | Claude Opus 4.7 | 88.3% | $0.052 | 7.6s | 76%
65 | Claude Opus 4.7 (Reasoning) | 92.8% | $0.067 | 11.4s | 82%
66 | Mistral Medium 3.1 | 75.1% | $0.0029 | 7.7s | 46%
67 | Grok 4.20 (Beta) | 74.8% | $0.0059 | 2.5s | 47%
68 | Gemini 3.1 Pro (Preview) | 98.3% | $0.068 | 52.3s | 93%
69 | GPT-5.4 (Reasoning) | 91.7% | $0.042 | 43.1s | 75%
70 | MiniMax M2.5 | 74.6% | $0.0020 | 25.5s | 52%
71 | Claude Opus 4.6 (Reasoning) | 97.3% | $0.076 | 32.0s | 90%
72 | GPT-5.5 (Reasoning) | 95.6% | $0.071 | 30.3s | 88%
73 | DeepSeek V3.2 | 75.0% | $0.0013 | 17.2s | 45%
74 | Gemma 4 31B (Reasoning) | 94.4% | $0.0017 | 2.8m | 85%
75 | Z.AI GLM 4.5 | 75.6% | $0.0026 | 18.5s | 45%
76 | GPT-4.1 | 77.7% | $0.0081 | 10.0s | 44%
77 | Z.AI GLM 4.6 | 86.1% | $0.0049 | 1.3m | 59%
78 | Mistral Small 4 (Reasoning) | 70.9% | $0.0020 | 16.3s | 47%
79 | ByteDance Seed 1.6 Flash | 67.9% | $0.0009 | 12.0s | 46%
80 | Gemma 4 26B (Reasoning) | 90.0% | $0.0022 | 2.0m | 65%
81 | Gemini 3 Pro (Preview) | 91.0% | $0.050 | 34.0s | 70%
82 | Z.AI GLM 4.5 Air | 77.1% | $0.0025 | 39.8s | 46%
83 | DeepSeek-V2 Chat | 71.1% | $0.0020 | 13.1s | 41%
84 | Grok 4 | 94.1% | $0.051 | 1.2m | 80%
85 | GPT-4.1 Mini | 63.7% | $0.0015 | 5.8s | 43%
86 | Claude Haiku 4.5 | 66.8% | $0.0078 | 5.8s | 45%
87 | DeepSeek V3 (2024-12-26) | 69.9% | $0.0019 | 15.6s | 41%
88 | GPT-OSS 120B | 81.0% | $0.0020 | 1.5m | 59%
89 | Qwen 3.6 27B | 89.5% | $0.022 | 1.4m | 63%
90 | Qwen3 235B A22B Instruct 2507 | 67.3% | $0.0007 | 21.1s | 43%
91 | Writer: Palmyra X5 | 67.2% | $0.0062 | 12.1s | 44%
92 | Gemma 4 26B | 75.0% | $0.0006 | 25.0s | 35%
93 | Stealth: Aurora Alpha | 77.5% | | 6.2s | 46%
94 | ByteDance Seed 2.0 Mini | 90.8% | $0.0029 | 2.7m | 73%
95 | Claude Sonnet 4.6 (Reasoning) | 94.9% | $0.076 | 51.3s | 84%
96 | Claude 3.7 Sonnet | 77.4% | $0.023 | 10.2s | 42%
97 | Grok 4.3 | 64.3% | $0.0061 | 6.1s | 39%
98 | Inception Mercury | 63.1% | $0.0005 | 9.5s | 34%
99 | DeepSeek V3 (2025-03-24) | 68.9% | $0.0015 | 22.9s | 34%
100 | GPT-4o, Aug. 6th (temp=1) | 65.4% | $0.011 | 3.6s | 36%
101 | Mistral Small Creative | 56.8% | $0.0006 | 4.3s | 36%
102 | GPT-4o, Aug. 6th (temp=0) | 67.9% | $0.015 | 5.0s | 36%
103 | Claude 3.5 Sonnet | 79.2% | $0.042 | 10.5s | 47%
104 | Qwen3.6 Max Preview | 96.4% | $0.050 | 2.5m | 90%
105 | GPT-5 | 93.3% | $0.061 | 1.6m | 81%
106 | Z.AI GLM 4.7 Flash | 68.2% | $0.0018 | 1.1m | 45%
107 | DeepSeek V3.1 | 66.1% | $0.0014 | 31.1s | 33%
108 | Nemotron 3 Super | 83.1% | $0.0000 | 2.3m | 55%
109 | Qwen 3 32B | 64.1% | $0.0010 | 27.8s | 31%
110 | Qwen 3.5 9B | 84.7% | $0.0020 | 2.4m | 55%
111 | Mistral Small 3.2 24B | 53.8% | $0.0006 | 10.4s | 32%
112 | GPT-5.4 Nano (Reasoning) | 67.5% | $0.0035 | 16.6s | 23%
113 | GPT-4o, May 13th (temp=0) | 71.6% | $0.030 | 5.4s | 36%
114 | Gemma 3 27B | 53.2% | $0.0005 | 13.3s | 33%
115 | Qwen 3.5 397B A17B | 93.7% | $0.026 | 2.9m | 75%
116 | GPT-5.4 Mini | 54.5% | $0.0033 | 3.1s | 29%
117 | DeepSeek V4 Pro (Reasoning) | 91.5% | $0.014 | 2.8m | 60%
118 | Gemini 2.5 Flash Lite | 47.9% | $0.0005 | 2.3s | 24%
119 | GPT-4o, May 13th (temp=1) | 58.1% | $0.026 | 3.7s | 34%
120 | Ministral 3 14B | 46.4% | $0.0010 | 6.3s | 25%
121 | Hermes 3 405B | 56.2% | $0.0044 | 17.2s | 20%
122 | Llama 3.1 Nemotron 70B | 55.8% | $0.0055 | 18.5s | 19%
123 | GPT-5 Nano | 67.2% | $0.0049 | 1.9m | 42%
124 | Ministral 3 8B | 39.9% | $0.0007 | 4.8s | 20%
125 | Llama 3.1 70B | 51.2% | $0.0021 | 24.3s | 18%
126 | Qwen 2.5 72B | 42.3% | $0.0008 | 13.9s | 18%
127 | Ministral 8B | 34.7% | $0.0005 | 5.6s | 18%
128 | Arcee AI: Trinity Mini | 33.3% | $0.0004 | 10.3s | 18%
129 | Claude Opus 4 | 84.4% | $0.116 | 15.8s | 65%
130 | Mistral Small 4 | 30.7% | $0.0008 | 4.3s | 18%
131 | Ministral 3 3B | 28.6% | $0.0005 | 3.3s | 16%
132 | Gemma 3 12B | 35.9% | $0.0003 | 12.0s | 11%
133 | MoonshotAI: Kimi K2.6 | 94.7% | $0.038 | 4.1m | 75%
134 | GPT-4o Mini (temp=0) | 31.3% | $0.0006 | 25.0s | 19%
135 | GPT-4o Mini (temp=1) | 29.4% | $0.0006 | 7.4s | 13%
136 | Ministral 3B | 25.0% | $0.0002 | 2.9s | 14%
137 | Cohere Command R+ (Aug. 2024) | 32.3% | $0.014 | 10.0s | 15%
138 | GPT-5.4 Nano (Reasoning, Low) | 32.9% | $0.0013 | 6.1s | 2%
139 | Hermes 3 70B | 28.5% | $0.0013 | 29.2s | 14%
140 | Claude 3 Haiku | 19.4% | $0.0015 | 3.6s | 12%
141 | GPT-5.4 Nano | 21.3% | $0.0008 | 2.9s | 4%
142 | Mistral NeMO | 20.6% | $0.0007 | 13.3s | 7%
143 | Nemotron 3 Nano | 64.5% | $0.0031 | 3.5m | 37%
144 | Llama 3.1 8B | 16.1% | $0.0002 | 16.1s | 0%
145 | WizardLM 2 8x22b | 17.1% | $0.0036 | 15.7s | 0%
146 | Rocinante 12B | 6.4% | $0.0009 | 6.0s | 0%
147 | GPT-4.1 Nano | 2.6% | $0.0004 | 3.9s | 0%
148 | Gemma 3 4B | 3.2% | $0.0002 | 12.5s | 0%
149 | Arcee AI: Trinity Large (Preview) | 46.2% | $0.0000 | 3.1m | 24%
150 | LFM2 24B | 0.0% | $0.0008 | 1.9m | 0%
73.69%

Individual Scenarios

matrix

Model | #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 | Avg
Qwen3.7 Max | 98 98 98 98 97 95 95 95 92 92 | 96.2%
Grok 4 | 100 98 98 97 97 95 95 94 94 91 | 96.1%
Qwen3.6 Max Preview | 97 97 97 94 94 94 94 94 91 91 | 94.2%
Claude Sonnet 4.6 (Reasoning) | 97 95 95 94 94 94 94 92 92 88 | 93.6%
GPT-5.5 | 97 97 95 94 94 94 92 92 89 89 | 93.5%
Grok 4.20 (Reasoning) | 100 97 97 94 94 92 92 91 91 86 | 93.5%
Qwen 3.5 397B A17B | 100 100 97 95 94 94 89 89 88 85 | 93.2%
GPT-5.5 (Reasoning) | 97 97 94 94 94 92 92 89 89 89 | 92.9%
GPT-5.5 (Reasoning, Low) | 95 95 95 95 92 92 92 92 91 83 | 92.6%
GPT-5.4 | 97 94 94 92 92 92 91 91 91 91 | 92.6%
MoonshotAI: Kimi K2.6 | 100 98 95 94 94 89 88 88 86 79 | 91.2%
GPT-5.2 | 97 97 94 92 92 91 89 88 86 85 | 91.2%
Gemini 2.5 Pro | 100 95 92 92 92 89 88 88 88 86 | 91.2%
Qwen 3.5 27B | 94 92 92 92 91 91 91 91 88 86 | 90.9%
DeepSeek V4 Flash (Reasoning) | 95 95 94 92 92 92 91 89 85 82 | 90.9%
Grok 4 Fast | 100 100 97 92 91 89 86 86 83 82 | 90.8%
GPT-5 | 94 94 92 92 91 91 88 88 88 88 | 90.6%
Grok 4.1 Fast | 100 97 94 91 91 91 89 89 82 82 | 90.6%
Grok 4.20 (Beta, Reasoning) | 97 97 94 92 89 89 88 86 86 82 | 90.2%
Gemini 3.1 Pro (Preview) | 94 94 91 91 91 89 88 88 88 86 | 90.0%
Claude Opus 4.6 (Reasoning) | 94 92 91 91 91 91 91 88 86 85 | 90.0%
GPT-5.4 (Reasoning, Low) | 94 94 92 92 91 89 88 88 86 83 | 89.8%
Z.AI GLM 5.1 | 95 95 92 91 91 91 89 85 85 82 | 89.7%
GPT-5.4 (Reasoning) | 94 94 92 91 91 91 86 85 85 82 | 89.1%
Claude Opus 4.5 | 92 92 91 91 89 89 88 88 85 83 | 88.9%
Grok 4.3 (Reasoning) | 100 91 88 88 88 88 88 88 85 80 | 88.3%
Claude Opus 4.7 (Reasoning) | 91 91 91 89 88 88 88 86 85 83 | 88.0%
Z.AI GLM 5 Turbo | 92 92 91 89 89 86 86 83 83 83 | 87.7%
Nemotron 3 Super | 94 91 89 89 89 88 86 83 83 80 | 87.4%
Claude Opus 4.7 | 92 91 91 89 89 86 85 83 83 80 | 87.1%
GPT-5 Mini | 91 91 88 88 86 86 85 85 85 85 | 87.0%
Gemini 3.5 Flash (Reasoning) | 92 91 91 88 86 86 85 85 82 79 | 86.5%
Gemini 2.5 Flash (Reasoning) | 91 89 89 89 88 85 85 85 85 77 | 86.4%
ByteDance Seed 1.6 | 94 91 91 89 88 85 83 82 80 79 | 86.2%
Z.AI GLM 4.6 | 94 92 92 88 88 86 83 82 77 77 | 86.1%
Qwen 3.6 35B | 97 97 91 91 85 85 82 80 77 74 | 85.9%
Qwen 3.5 Plus (2026-04-20) | 97 91 89 89 88 85 85 82 77 74 | 85.8%
o4 Mini High | 94 91 91 88 85 83 82 82 82 79 | 85.6%
GPT-5.1 | 94 91 88 85 85 85 82 82 80 80 | 85.2%
Qwen 3.6 Flash | 89 88 88 88 88 85 85 80 79 76 | 84.5%
Gemma 4 31B (Reasoning) | 92 91 89 89 88 88 83 79 74 68 | 84.2%
Qwen 3.5 122B | 89 88 86 86 83 82 82 80 79 79 | 83.5%
Qwen 3.5 35B | 92 91 89 85 85 83 79 79 76 71 | 83.0%
Z.AI GLM 5 | 97 85 85 83 82 82 82 79 79 77 | 83.0%
Gemini 3 Flash (Preview, Reasoning) | 89 88 85 83 83 82 82 82 82 74 | 83.0%
Claude Opus 4.6 | 85 85 85 85 83 83 80 80 80 80 | 82.7%
Aion 2.0 | 100 97 94 92 91 89 88 88 88 0 | 82.7%
Xiaomi MIMO v2.5 Pro | 92 88 86 85 83 83 80 79 79 70 | 82.6%
Z.AI GLM 4.7 | 94 88 88 83 83 82 79 79 74 73 | 82.3%
Qwen 3.6 27B | 92 91 85 85 85 83 77 76 76 73 | 82.3%
Gemma 4 31B | 86 86 85 83 82 82 80 80 79 79 | 82.3%
Claude Opus 4 | 92 89 85 83 82 82 79 77 77 70 | 81.7%
Qwen 3.5 Flash | 94 91 86 85 82 82 79 77 76 62 | 81.4%
MoonshotAI: Kimi K2.5 | 91 86 86 86 82 82 80 80 73 65 | 81.2%
Claude Sonnet 4.6 | 86 82 82 82 80 77 77 76 76 74 | 79.2%
Gemma 4 26B (Reasoning) | 88 85 80 80 79 79 77 76 74 74 | 79.2%
Claude Sonnet 4 | 83 82 80 80 80 79 79 77 73 71 | 78.5%
Gemini 3 Flash (Preview) | 85 82 82 80 79 76 76 76 74 74 | 78.3%
Claude Sonnet 4.5 | 85 83 82 80 79 77 77 77 71 70 | 78.2%
Gemini 3.5 Flash (Reasoning, Minimal) | 85 82 82 80 79 79 77 74 71 70 | 77.9%
o4 Mini | 88 86 82 82 79 76 76 73 70 67 | 77.7%
Stealth: Hunter Alpha | 88 83 83 82 80 77 76 73 68 59 | 77.0%
GPT-5.4 Mini (Reasoning) | 83 82 82 80 80 74 73 73 71 70 | 76.8%
ByteDance Seed 2.0 Lite | 85 85 83 80 79 77 76 68 65 62 | 76.1%
Stealth: Healer Alpha | 94 91 91 82 79 76 74 62 58 42 | 74.8%
ByteDance Seed 2.0 Mini | 83 82 76 74 74 74 74 73 73 64 | 74.7%
Qwen 3.5 9B | 85 82 79 77 77 76 73 71 62 62 | 74.4%
Xiaomi MIMO v2.5 | 89 85 85 82 79 79 76 70 70 0 | 71.4%
DeepSeek V4 Pro (Reasoning) | 97 95 94 92 88 85 82 76 0 0 | 70.9%
Gemini 2.5 Flash | 79 77 76 76 74 73 70 64 61 59 | 70.8%
Gemini 3 Pro (Preview) | 83 82 80 76 76 76 74 73 67 15 | 70.2%
Inception Mercury 2 | 79 77 74 71 71 70 70 70 61 58 | 70.0%
Gemini 2.5 Flash Lite (Reasoning) | 79 77 76 73 73 68 67 67 62 55 | 69.5%
GPT-5.4 Mini (Reasoning, Low) | 77 77 71 70 70 68 67 64 62 59 | 68.5%
DeepSeek V4 Pro | 77 76 71 71 68 68 67 64 56 53 | 67.1%
GPT-OSS 120B | 76 76 73 70 67 67 62 61 56 53 | 65.9%
Qwen 3.5 Plus (2026-02-15) | 92 85 83 82 79 77 76 39 38 0 | 65.2%
Mistral Large | 79 70 67 67 65 65 62 61 59 58 | 65.2%
Mistral Large 3 | 73 73 70 68 68 64 61 61 56 52 | 64.4%
GPT-4.1 | 77 74 70 67 65 64 62 58 56 45 | 63.8%
Gemma 4 26B | 70 67 67 65 65 64 64 59 58 56 | 63.3%
MiniMax M2.7 | 77 71 68 64 64 64 59 56 55 48 | 62.6%
Gemini 3.1 Flash Lite | 70 67 62 62 62 58 58 58 56 48 | 60.0%
DeepSeek V4 Flash | 79 73 67 67 64 64 59 55 38 32 | 59.5%
Stealth: Aurora Alpha | 68 67 65 65 59 58 58 58 55 42 | 59.4%
Gemini 3.1 Flash Lite (Preview) | 67 65 65 62 59 59 56 56 53 52 | 59.4%
Mistral Large 2 | 67 67 67 61 59 58 56 56 53 47 | 58.9%
Gemini 3.1 Flash Lite (Reasoning) | 65 65 62 61 61 61 61 59 53 32 | 57.9%
GPT-5.4 Nano (Reasoning) | 82 80 76 74 70 68 64 59 0 0 | 57.3%
DeepSeek V3.2 | 94 65 59 59 52 47 47 47 44 39 | 55.3%
MiniMax M2.5 | 73 71 64 61 53 53 50 42 41 36 | 54.4%
Z.AI GLM 4.5 | 76 67 55 53 53 50 45 45 45 36 | 52.6%
Z.AI GLM 4.5 Air | 80 71 68 68 67 56 52 32 23 3 | 52.0%
Grok 4.20 | 73 61 55 53 52 50 47 47 39 35 | 51.1%
Claude Haiku 4.5 | 67 58 50 50 50 50 48 47 47 42 | 50.9%
DeepSeek-V2 Chat | 61 58 56 53 50 50 47 44 39 33 | 49.1%
Mistral Medium 3.1 | 62 56 53 52 52 48 48 48 47 21 | 48.8%
DeepSeek V3 (2024-12-26) | 67 64 59 58 58 56 48 35 24 17 | 48.5%
Grok 4.20 (Beta) | 61 58 56 55 50 48 39 39 38 36 | 48.0%
GPT-5 Nano | 58 58 56 52 50 48 45 44 35 35 | 48.0%
Qwen3 235B A22B Instruct 2507 | 58 55 55 55 55 44 44 42 36 33 | 47.6%
DeepSeek V3.1 | 71 67 64 62 58 48 41 36 24 0 | 47.1%
Mistral Small 4 (Reasoning) | 53 50 50 47 47 44 44 41 39 29 | 44.4%
Claude 3.7 Sonnet | 61 59 48 44 39 39 38 38 35 30 | 43.2%
Writer: Palmyra X5 | 56 56 53 47 44 44 41 39 38 3 | 42.1%
Grok 4.3 | 73 61 61 55 50 47 42 21 0 0 | 40.9%
Z.AI GLM 4.7 Flash | 53 47 45 42 41 38 33 33 32 29 | 39.4%
DeepSeek V3 (2025-03-24) | 68 62 48 47 45 33 32 30 23 0 | 38.9%
ByteDance Seed 1.6 Flash | 59 41 39 39 38 38 35 32 29 23 | 37.3%
Mistral Small Creative | 47 41 41 41 38 35 33 33 29 5 | 34.2%
Claude 3.5 Sonnet | 38 36 35 33 33 32 32 32 30 30 | 33.2%
Qwen 3 32B | 44 41 39 39 38 32 26 18 17 12 | 30.6%
GPT-4.1 Mini | 53 44 33 33 30 29 26 21 20 14 | 30.3%
Nemotron 3 Nano | 39 38 36 32 30 29 29 27 14 3 | 27.7%
Inception Mercury | 52 38 35 35 29 23 20 20 17 2 | 26.8%
GPT-5.4 Nano (Reasoning, Low) | 59 48 47 41 39 32 0 0 0 0 | 26.7%
GPT-5.4 Mini | 38 32 32 29 29 29 23 18 15 14 | 25.8%
Mistral Small 3.2 24B | 44 39 39 35 24 21 18 18 9 0 | 24.8%
GPT-4o, May 13th (temp=1) | 36 30 29 29 29 26 24 18 17 6 | 24.4%
GPT-4o, Aug. 6th (temp=0) | 44 35 29 29 29 29 27 23 0 0 | 24.4%
GPT-4o, Aug. 6th (temp=1) | 35 32 29 27 24 21 20 15 14 14 | 23.0%
Ministral 3 8B | 35 30 29 24 21 21 20 18 14 3 | 21.5%
Ministral 3 14B | 32 32 27 24 24 21 14 6 0 0 | 18.0%
GPT-4o, May 13th (temp=0) | 39 35 33 33 32 0 0 0 0 0 | 17.3%
Qwen 2.5 72B | 27 26 20 17 15 15 15 12 9 8 | 16.4%
Ministral 8B | 33 32 24 23 23 12 2 2 2 2 | 15.3%
Gemma 3 27B | 36 26 15 15 14 12 11 9 9 0 | 14.7%
Mistral Small 4 | 29 20 18 12 12 11 9 2 0 0 | 11.2%
Gemini 2.5 Flash Lite | 32 30 26 18 5 0 0 0 0 0 | 11.1%
Hermes 3 405B | 27 18 11 11 8 8 6 6 5 2 | 10.0%
Hermes 3 70B | 14 11 11 8 8 8 5 2 0 0 | 6.4%
Llama 3.1 Nemotron 70B | 12 12 11 8 6 5 3 0 0 0 | 5.6%
Arcee AI: Trinity Mini | 15 9 6 6 5 3 3 2 0 0 | 4.8%
Cohere Command R+ (Aug. 2024) | 12 12 9 6 5 3 0 0 0 0 | 4.7%
Gemma 3 12B | 11 9 8 6 5 0 0 0 0 0 | 3.8%
GPT-5.4 Nano | 23 9 3 3 0 0 0 0 0 0 | 3.8%
GPT-4o Mini (temp=1) | 9 6 6 5 2 0 0 0 0 0 | 2.7%
Llama 3.1 70B | 12 6 5 3 2 0 0 0 0 0 | 2.7%
Claude 3 Haiku | 11 8 3 2 2 2 0 0 0 0 | 2.6%
WizardLM 2 8x22b | 9 6 6 2 2 0 0 0 0 0 | 2.4%
Mistral NeMO | 11 0 0 0 0 0 0 0 0 0 | 1.1%
GPT-4o Mini (temp=0) | 8 0 0 0 0 0 0 0 0 0 | 0.8%
Llama 3.1 8B | 5 3 0 0 0 0 0 0 0 0 | 0.8%
Ministral 3 3B | 5 3 0 0 0 0 0 0 0 0 | 0.8%
GPT-4.1 Nano | 3 0 0 0 0 0 0 0 0 0 | 0.3%
Ministral 3B | 3 0 0 0 0 0 0 0 0 0 | 0.3%
Gemma 3 4B | 2 0 0 0 0 0 0 0 0 0 | 0.2%
Arcee AI: Trinity Large (Preview) | 0 0 0 0 0 0 0 0 0 0 | 0.0%
LFM2 24B | 0 0 0 0 0 0 0 0 0 0 | 0.0%
Rocinante 12B | 0 0 0 0 0 0 0 0 0 0 | 0.0%
Model | #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 | Avg
Claude Opus 4.5 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
GPT-5.5 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Claude Sonnet 4.5 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Gemini 2.5 Pro | 100 100 100 100 100 100 100 100 100 92 | 99.2%
Gemini 3.1 Pro (Preview) | 100 100 100 100 100 100 97 97 97 97 | 98.9%
Claude Opus 4.6 | 100 100 100 100 97 97 97 97 97 92 | 97.9%
GPT-5.4 | 100 100 100 100 97 97 97 97 97 92 | 97.9%
GPT-5.5 (Reasoning, Low) | 100 100 100 100 100 95 95 95 95 92 | 97.1%
Grok 4 | 100 100 100 100 95 95 95 95 95 95 | 96.8%
Z.AI GLM 5 Turbo | 100 100 100 100 100 97 97 92 92 87 | 96.6%
MoonshotAI: Kimi K2.6 | 100 100 97 97 97 97 95 95 95 92 | 96.6%
Z.AI GLM 5.1 | 100 100 100 97 95 95 95 95 95 92 | 96.3%
Claude Sonnet 4.6 (Reasoning) | 100 97 97 97 97 97 97 92 92 92 | 96.1%
MoonshotAI: Kimi K2.5 | 100 100 100 97 97 95 95 92 92 92 | 96.1%
Claude Sonnet 4.6 | 100 100 100 100 95 95 95 95 92 87 | 95.8%
Qwen3.7 Max | 100 100 100 95 95 95 95 95 92 92 | 95.8%
Gemini 3.5 Flash (Reasoning) | 100 95 95 95 95 95 95 95 92 92 | 94.7%
DeepSeek V4 Pro (Reasoning) | 100 97 97 95 95 95 95 95 92 87 | 94.7%
GPT-5.5 (Reasoning) | 97 97 95 95 95 95 95 92 92 92 | 94.5%
DeepSeek V4 Flash (Reasoning) | 100 100 100 97 97 95 92 89 84 84 | 93.9%
Claude Sonnet 4 | 100 100 100 100 92 92 92 92 87 84 | 93.9%
Qwen3.6 Max Preview | 100 95 95 95 95 92 92 92 92 89 | 93.7%
Qwen 3.6 35B | 100 100 100 95 95 95 95 89 89 76 | 93.4%
Grok 4.3 (Reasoning) | 100 95 95 95 95 95 95 89 89 84 | 93.2%
Qwen 3.5 35B | 100 97 97 95 92 92 89 89 87 87 | 92.6%
GPT-5.2 | 97 97 95 95 95 92 89 89 89 87 | 92.6%
Grok 4.20 (Beta) | 97 97 97 97 97 92 89 89 87 79 | 92.4%
Claude Opus 4.6 (Reasoning) | 100 100 95 95 95 92 87 87 87 87 | 92.4%
Qwen 3.6 Flash | 95 95 95 95 95 95 89 89 87 87 | 92.1%
ByteDance Seed 1.6 | 100 95 95 95 92 89 89 89 89 84 | 91.8%
Gemini 3 Flash (Preview, Reasoning) | 100 97 95 95 92 92 89 87 87 84 | 91.8%
Xiaomi MIMO v2.5 Pro | 100 100 95 92 92 92 92 89 84 82 | 91.8%
GPT-5.4 (Reasoning) | 95 95 95 92 92 92 92 89 87 87 | 91.6%
Grok 4.20 (Beta, Reasoning) | 95 95 95 95 95 89 89 89 89 84 | 91.6%
Qwen 3.5 Flash | 100 95 92 92 89 89 89 89 87 87 | 91.1%
Grok 4.20 (Reasoning) | 95 95 95 95 89 89 89 89 89 84 | 91.1%
Gemma 4 31B (Reasoning) | 100 97 95 95 95 95 92 84 79 76 | 90.8%
Claude Opus 4.7 | 92 92 92 92 92 92 92 92 92 79 | 90.8%
Grok 4.20 | 97 97 97 97 97 89 89 89 82 71 | 90.8%
Qwen 3.5 122B | 100 95 92 92 92 89 89 87 87 82 | 90.5%
Gemini 2.5 Flash (Reasoning) | 100 95 92 89 89 89 89 89 87 84 | 90.5%
Qwen 3.5 27B | 92 92 92 92 89 89 89 89 89 87 | 90.3%
Qwen 3.5 Plus (2026-04-20) | 100 95 89 89 89 89 89 87 87 84 | 90.0%
Grok 4 Fast | 95 95 95 89 89 89 89 89 89 79 | 90.0%
GPT-5 | 95 95 92 89 89 89 89 89 87 82 | 89.7%
Grok 4.1 Fast | 95 95 95 92 92 89 89 87 84 79 | 89.7%
Qwen 3.5 397B A17B | 92 92 92 92 92 92 92 84 84 84 | 89.7%
Gemma 4 31B | 92 92 92 92 92 92 92 84 84 84 | 89.7%
GPT-5.4 (Reasoning, Low) | 95 95 89 89 89 89 89 89 84 84 | 89.5%
Claude Opus 4.7 (Reasoning) | 95 95 95 89 89 89 89 89 82 82 | 89.5%
Qwen 3.6 27B | 100 95 95 95 92 89 89 89 84 63 | 89.2%
Gemini 3.5 Flash (Reasoning, Minimal) | 92 92 92 92 92 92 87 84 84 84 | 89.2%
GPT-5.4 Mini (Reasoning) | 95 95 89 89 89 89 84 84 84 84 | 88.4%
Aion 2.0 | 95 95 89 89 89 89 89 87 84 76 | 88.4%
Z.AI GLM 4.7 | 100 92 92 92 89 87 84 84 84 74 | 87.9%
Z.AI GLM 5 | 100 100 95 92 89 82 82 79 79 79 | 87.6%
Stealth: Hunter Alpha | 100 95 95 92 89 87 87 84 82 66 | 87.6%
Stealth: Healer Alpha | 100 92 89 89 89 87 84 84 82 79 | 87.6%
Xiaomi MIMO v2.5 | 95 95 89 89 89 87 84 84 82 82 | 87.6%
GPT-5.1 | 92 92 92 89 89 89 84 82 82 82 | 87.4%
Gemini 3 Flash (Preview) | 92 92 92 92 92 84 84 84 84 76 | 87.4%
o4 Mini High | 95 95 87 87 87 87 84 84 82 82 | 86.8%
Gemini 3 Pro (Preview) | 100 95 92 92 89 89 82 79 76 74 | 86.8%
Z.AI GLM 4.6 | 97 92 92 89 89 89 87 84 74 71 | 86.6%
ByteDance Seed 2.0 Lite | 95 89 89 89 89 84 82 82 82 82 | 86.3%
GPT-4.1 | 100 100 95 92 92 84 84 76 71 66 | 86.1%
o4 Mini | 95 89 89 87 87 87 82 82 79 74 | 85.0%
Nemotron 3 Super | 89 89 89 84 84 84 84 84 82 79 | 85.0%
Qwen 3.5 9B | 95 95 89 89 84 84 82 76 74 74 | 84.2%
DeepSeek V4 Flash | 95 92 92 92 87 87 79 79 74 66 | 84.2%
Gemma 4 26B (Reasoning) | 95 92 87 84 84 84 79 79 79 76 | 83.9%
Gemini 3.1 Flash Lite (Preview) | 97 89 82 82 82 82 82 82 82 79 | 83.7%
ByteDance Seed 2.0 Mini | 95 95 92 84 82 82 82 79 76 68 | 83.4%
Gemini 2.5 Flash | 92 92 89 89 87 82 82 71 71 68 | 82.4%
Gemini 3.1 Flash Lite (Reasoning) | 100 92 84 84 84 79 76 76 74 71 | 82.1%
DeepSeek V4 Pro | 100 87 87 84 84 79 79 74 74 74 | 82.1%
Claude Opus 4 | 95 92 87 84 79 79 79 79 71 68 | 81.3%
GPT-5.4 Mini (Reasoning, Low) | 89 89 87 84 82 79 79 76 74 74 | 81.3%
Qwen 3.5 Plus (2026-02-15) | 92 92 84 84 84 84 79 79 68 63 | 81.1%
Inception Mercury 2 | 84 84 84 82 79 79 79 79 76 68 | 79.5%
Writer: Palmyra X5 | 89 89 87 84 82 79 74 71 68 66 | 78.9%
Stealth: Aurora Alpha | 89 89 84 76 76 76 76 76 76 66 | 78.7%
GPT-5 Mini | 89 89 87 87 87 87 84 79 53 45 | 78.7%
GPT-OSS 120B | 84 82 79 79 79 79 79 76 76 68 | 78.2%
Gemini 3.1 Flash Lite | 89 84 82 79 79 76 74 74 74 68 | 77.9%
Z.AI GLM 4.5 | 87 87 82 82 82 79 76 74 63 63 | 77.4%
MiniMax M2.5 | 89 89 82 82 79 76 74 71 68 61 | 77.1%
Gemini 2.5 Flash Lite (Reasoning) | 84 84 82 82 79 76 76 74 68 63 | 76.8%
GPT-5.4 Nano (Reasoning) | 84 84 82 79 76 76 74 71 71 63 | 76.1%
Claude 3.5 Sonnet | 82 82 82 82 74 74 74 66 66 61 | 73.9%
MiniMax M2.7 | 89 89 84 79 76 71 66 63 61 58 | 73.7%
Claude 3.7 Sonnet | 76 74 74 74 74 74 74 71 71 71 | 73.2%
Mistral Medium 3.1 | 76 76 74 74 74 74 74 68 66 63 | 71.8%
Mistral Large | 87 84 79 71 68 68 66 66 61 61 | 71.1%
GPT-4o, May 13th (temp=0) | 79 79 74 74 74 68 68 68 63 61 | 70.8%
DeepSeek V3 (2024-12-26) | 84 82 79 71 68 68 66 63 63 61 | 70.5%
Mistral Large 2 | 82 79 79 71 66 66 66 66 63 63 | 70.0%
DeepSeek-V2 Chat | 89 76 76 74 68 68 63 63 61 58 | 69.7%
Qwen3 235B A22B Instruct 2507 | 84 82 79 76 74 68 66 63 58 45 | 69.5%
DeepSeek V3.2 | 95 84 82 76 74 71 71 63 61 18 | 69.5%
Grok 4.3 | 82 76 74 68 68 68 66 63 61 58 | 68.4%
Z.AI GLM 4.7 Flash | 79 74 74 71 68 68 63 63 61 55 | 67.6%
Mistral Large 3 | 74 71 71 68 68 68 66 63 63 61 | 67.4%
Mistral Small 4 (Reasoning) | 79 76 74 68 63 63 63 58 55 53 | 65.3%
GPT-4.1 Mini | 76 74 74 71 66 66 63 61 50 34 | 63.4%
Z.AI GLM 4.5 Air | 87 82 79 79 74 66 66 53 32 0 | 61.6%
DeepSeek V3 (2025-03-24) | 89 89 74 74 68 66 63 58 29 0 | 61.1%
Nemotron 3 Nano | 79 74 66 63 63 61 58 47 45 37 | 59.2%
Claude Haiku 4.5 | 87 79 79 79 71 71 63 61 0 0 | 58.9%
GPT-4o, Aug. 6th (temp=0) | 74 61 61 58 58 58 55 55 55 53 | 58.7%
Mistral Small 3.2 24B | 68 63 63 61 58 55 55 55 55 53 | 58.7%
ByteDance Seed 1.6 Flash | 74 68 66 63 61 61 61 55 50 29 | 58.7%
GPT-5 Nano | 82 82 76 74 71 68 66 61 0 0 | 57.9%
DeepSeek V3.1 | 82 82 82 79 71 71 55 42 0 0 | 56.3%
Gemini 2.5 Flash Lite | 76 68 61 61 58 50 47 47 45 39 | 55.3%
Qwen 3 32B | 76 74 63 61 58 53 53 50 37 26 | 55.0%
GPT-4o, Aug. 6th (temp=1) | 68 68 66 63 61 58 55 47 45 0 | 53.2%
Inception Mercury | 84 66 66 53 53 50 45 42 42 29 | 52.9%
GPT-4o, May 13th (temp=1) | 74 71 68 53 53 53 45 45 32 32 | 52.4%
GPT-5.4 Mini | 87 82 79 74 71 66 50 0 0 0 | 50.8%
Mistral Small Creative | 63 53 53 50 47 47 45 39 39 21 | 45.8%
Arcee AI: Trinity Large (Preview) | 76 61 58 50 50 42 39 37 32 0 | 44.5%
Ministral 3 8B | 53 50 47 47 45 45 42 39 39 0 | 40.8%
Ministral 8B | 50 45 42 42 39 37 37 37 34 34 | 39.7%
Gemma 3 27B | 47 45 42 39 39 37 37 37 37 37 | 39.7%
Ministral 3 14B | 50 47 45 42 39 37 37 34 32 24 | 38.7%
Hermes 3 405B | 50 50 47 45 42 37 34 29 29 21 | 38.4%
Ministral 3B | 53 50 45 39 39 26 24 18 13 8 | 31.6%
GPT-5.4 Nano (Reasoning, Low) | 84 79 79 71 0 0 0 0 0 0 | 31.3%
Mistral Small 4 | 58 47 34 32 32 29 29 24 13 0 | 29.7%
GPT-4o Mini (temp=0) | 32 32 32 32 32 32 32 32 32 11 | 29.5%
Gemma 3 12B | 37 34 34 29 29 26 26 24 21 11 | 27.1%
Arcee AI: Trinity Mini | 37 37 37 37 34 26 24 16 11 8 | 26.6%
Qwen 2.5 72B | 42 39 39 37 34 32 21 8 3 0 | 25.5%
Ministral 3 3B | 50 39 39 32 29 26 16 16 5 0 | 25.3%
Gemma 4 26B | 79 79 79 0 0 0 0 0 0 0 | 23.7%
Llama 3.1 Nemotron 70B | 39 32 29 26 26 24 21 18 16 0 | 23.2%
Llama 3.1 70B | 47 34 32 26 21 16 16 13 3 0 | 20.8%
GPT-5.4 Nano | 53 45 37 32 29 0 0 0 0 0 | 19.5%
Hermes 3 70B | 34 34 26 26 18 16 16 11 3 0 | 18.4%
Cohere Command R+ (Aug. 2024) | 39 34 32 26 24 8 5 5 0 0 | 17.4%
GPT-4o Mini (temp=1) | 24 21 18 16 13 11 11 5 0 0 | 11.8%
Claude 3 Haiku | 26 18 11 8 5 5 5 3 0 0 | 8.2%
Mistral NeMO | 18 11 0 0 0 0 0 0 0 0 | 2.9%
Llama 3.1 8B | 5 0 0 0 0 0 0 0 0 0 | 0.5%
GPT-4.1 Nano | 3 0 0 0 0 0 0 0 0 0 | 0.3%
Rocinante 12B | 3 0 0 0 0 0 0 0 0 0 | 0.3%
WizardLM 2 8x22b | 0 0 0 0 0 0 0 0 0 0 | 0.0%
Gemma 3 4B | 0 0 0 0 0 0 0 0 0 0 | 0.0%
LFM2 24B | 0 0 0 0 0 0 0 0 0 0 | 0.0%
Model | #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 | Avg
Grok 4.20 (Beta, Reasoning) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Claude Opus 4.5 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 100 100 100 100 100 100 100 95 90 | 98.5%
Claude Opus 4.6 | 100 100 100 100 100 100 100 100 90 85 | 97.5%
Claude Opus 4.6 (Reasoning) | 100 100 100 100 100 100 100 100 85 85 | 97.0%
Grok 4.20 (Reasoning) | 100 100 100 100 95 95 95 95 95 95 | 97.0%
Gemini 2.5 Pro | 100 100 100 100 100 100 100 90 90 80 | 96.0%
Gemini 3.5 Flash (Reasoning) | 100 100 100 100 100 100 90 90 90 85 | 95.5%
Qwen 3.5 397B A17B | 100 100 100 100 100 100 90 90 90 85 | 95.5%
GPT-5.2 | 100 95 95 95 95 95 95 95 95 95 | 95.5%
GPT-5 Mini | 100 100 95 95 95 95 95 95 95 85 | 95.0%
Qwen 3.5 27B | 100 100 100 100 100 90 90 90 90 90 | 95.0%
Gemini 3 Flash (Preview, Reasoning) | 100 100 100 100 100 95 90 90 85 85 | 94.5%
Qwen3.6 Max Preview | 100 100 100 100 90 90 90 90 90 90 | 94.0%
Grok 4.3 (Reasoning) | 100 100 100 100 100 95 95 90 90 70 | 94.0%
Claude Opus 4.7 (Reasoning) | 100 100 100 100 90 90 90 90 85 85 | 93.0%
Gemma 4 31B (Reasoning) | 100 100 100 100 100 90 90 85 85 80 | 93.0%
Qwen 3.5 Flash | 100 100 100 100 90 90 90 90 90 80 | 93.0%
Z.AI GLM 5.1 | 100 100 100 100 100 90 90 85 85 75 | 92.5%
GPT-5.1 | 100 100 100 100 90 90 90 90 85 70 | 91.5%
Qwen 3.5 122B | 100 100 100 100 100 90 90 90 80 65 | 91.5%
Gemini 2.5 Flash (Reasoning) | 100 100 100 95 90 90 90 90 80 80 | 91.5%
Aion 2.0 | 100 100 100 90 90 90 90 90 85 75 | 91.0%
Stealth: Hunter Alpha | 100 100 100 100 90 90 85 85 85 75 | 91.0%
Gemma 4 31B | 100 100 90 90 90 90 90 90 90 80 | 91.0%
Qwen 3.5 Plus (2026-02-15) | 100 100 100 100 85 85 85 85 85 85 | 91.0%
DeepSeek V4 Pro (Reasoning) | 100 100 95 90 90 90 90 90 90 70 | 90.5%
GPT-5.5 (Reasoning) | 95 95 95 95 95 95 90 80 80 80 | 90.0%
Claude Opus 4.7 | 90 90 90 90 90 90 90 90 90 90 | 90.0%
Claude Sonnet 4.6 (Reasoning) | 100 100 100 100 100 100 90 70 70 70 | 90.0%
GPT-5 | 100 100 95 95 90 90 90 85 85 65 | 89.5%
Qwen 3.5 Plus (2026-04-20) | 100 90 90 90 90 90 90 90 90 75 | 89.5%
GPT-5.5 | 100 100 100 85 85 85 85 85 85 85 | 89.5%
DeepSeek V4 Flash (Reasoning) | 100 100 95 90 90 90 85 85 80 75 | 89.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 100 90 90 90 90 90 90 75 75 | 89.0%
GPT-5.5 (Reasoning, Low) | 95 95 95 95 95 95 90 80 75 70 | 88.5%
Z.AI GLM 5 | 100 100 90 90 90 90 90 90 75 70 | 88.5%
Qwen 3.6 Flash | 100 90 90 90 90 90 90 90 75 75 | 88.0%
Xiaomi MIMO v2.5 Pro | 100 100 95 95 90 90 85 85 70 70 | 88.0%
Qwen3.7 Max | 100 100 100 100 100 90 90 80 60 60 | 88.0%
Grok 4 Fast | 100 100 95 95 85 85 85 80 80 75 | 88.0%
Gemini 3 Pro (Preview) | 100 90 90 90 90 90 90 80 80 75 | 87.5%
o4 Mini | 90 90 90 90 90 90 85 85 85 80 | 87.5%
Qwen 3.6 27B | 90 90 90 90 90 90 90 85 80 75 | 87.0%
Grok 4.1 Fast | 100 100 95 95 90 85 85 80 75 65 | 87.0%
Claude Sonnet 4.6 | 100 90 85 85 85 85 85 85 85 85 | 87.0%
Z.AI GLM 5 Turbo | 100 100 100 90 90 90 75 75 75 70 | 86.5%
MoonshotAI: Kimi K2.6 | 100 100 95 95 90 85 85 70 70 70 | 86.0%
ByteDance Seed 1.6 | 90 90 90 90 90 90 90 75 75 75 | 85.5%
o4 Mini High | 95 90 90 90 90 85 85 80 80 70 | 85.5%
ByteDance Seed 2.0 Mini | 100 90 90 90 90 90 80 80 75 65 | 85.0%
Xiaomi MIMO v2.5 | 90 90 90 90 90 85 85 75 75 75 | 84.5%
Qwen 3.5 35B | 100 100 90 90 90 90 75 75 70 65 | 84.5%
Qwen 3.6 35B | 90 90 90 90 90 90 80 75 75 70 | 84.0%
GPT-5.4 | 100 100 100 100 85 70 70 70 70 70 | 83.5%
Gemma 4 26B (Reasoning) | 100 90 90 90 90 90 90 90 85 0 | 81.5%
Grok 4 | 100 95 90 90 90 85 85 65 60 55 | 81.5%
Claude Sonnet 4.5 | 100 90 85 85 85 75 75 70 70 70 | 80.5%
ByteDance Seed 2.0 Lite | 90 90 90 85 80 80 75 75 75 65 | 80.5%
MoonshotAI: Kimi K2.5 | 90 90 90 90 90 85 80 65 60 60 | 80.0%
Z.AI GLM 4.7 | 100 90 85 80 80 75 75 75 70 70 | 80.0%
Z.AI GLM 4.6 | 90 90 85 85 85 80 75 75 75 60 | 80.0%
Claude Haiku 4.5 | 80 80 80 80 80 80 80 80 80 80 | 80.0%
GPT-OSS 120B | 90 90 90 80 80 75 75 75 70 65 | 79.0%
Z.AI GLM 4.5 Air | 90 80 80 80 80 80 80 80 75 60 | 78.5%
Stealth: Healer Alpha | 90 90 90 90 85 85 75 65 65 45 | 78.0%
DeepSeek V4 Flash | 90 90 85 80 75 75 75 70 70 70 | 78.0%
Claude Sonnet 4 | 100 85 85 85 70 70 70 70 70 70 | 77.5%
Gemma 4 26B | 80 80 80 80 75 75 75 75 75 75 | 77.0%
MiniMax M2.7 | 90 80 80 80 80 80 75 75 75 55 | 77.0%
Claude 3.7 Sonnet | 90 90 80 75 75 75 75 75 65 65 | 76.5%
Gemini 3 Flash (Preview) | 90 75 75 75 75 75 75 75 75 70 | 76.0%
Gemini 2.5 Flash | 90 90 80 80 75 75 70 70 70 55 | 75.5%
Inception Mercury 2 | 75 75 75 75 75 75 75 75 75 70 | 74.5%
GPT-5.4 (Reasoning) | 85 85 80 70 65 65 65 65 65 65 | 71.0%
GPT-5.4 (Reasoning, Low) | 100 75 70 70 70 65 65 65 65 65 | 71.0%
Gemini 3.1 Flash Lite (Preview) | 75 70 70 70 70 70 70 70 70 70 | 70.5%
Gemini 3.1 Flash Lite | 75 70 70 70 70 70 70 70 70 70 | 70.5%
Mistral Large | 90 80 75 75 75 65 65 60 60 60 | 70.5%
Claude 3.5 Sonnet | 80 75 70 70 70 70 70 65 65 65 | 70.0%
MiniMax M2.5 | 80 80 75 75 70 70 65 65 60 60 | 70.0%
Claude Opus 4 | 90 75 75 75 75 65 65 60 60 60 | 70.0%
Grok 4.20 | 80 80 75 70 70 70 70 65 60 60 | 70.0%
Gemini 3.1 Flash Lite (Reasoning) | 70 70 70 70 70 70 70 70 70 65 | 69.5%
Z.AI GLM 4.5 | 75 75 75 75 75 70 70 65 60 55 | 69.5%
Nemotron 3 Super | 100 95 90 70 70 70 60 55 45 35 | 69.0%
GPT-5.4 Mini (Reasoning) | 75 75 75 70 70 70 70 70 65 50 | 69.0%
ByteDance Seed 1.6 Flash | 90 80 75 70 70 65 65 60 55 55 | 68.5%
Qwen 3.5 9B | 90 80 80 80 75 75 70 65 65 0 | 68.0%
Grok 4.20 (Beta) | 80 80 75 75 70 70 70 55 55 45 | 67.5%
GPT-5.4 Mini (Reasoning, Low) | 80 75 75 75 70 65 65 55 55 55 | 67.0%
Mistral Medium 3.1 | 75 75 75 75 75 75 75 70 70 0 | 66.5%
GPT-4.1 Mini | 70 70 70 70 70 70 65 65 65 50 | 66.5%
Gemini 2.5 Flash Lite (Reasoning) | 75 75 75 70 70 65 65 60 55 55 | 66.5%
Mistral Large 3 | 85 75 65 65 65 60 60 60 60 60 | 65.5%
GPT-4o, May 13th (temp=0) | 70 70 70 65 65 65 65 65 65 50 | 65.0%
GPT-5.4 Nano (Reasoning) | 85 80 75 75 75 70 65 65 60 0 | 65.0%
Llama 3.1 Nemotron 70B | 75 65 65 65 65 65 60 60 60 60 | 64.0%
Mistral Large 2 | 80 75 65 65 65 65 65 60 50 50 | 64.0%
Stealth: Aurora Alpha | 85 80 75 75 75 75 75 75 10 10 | 63.5%
Mistral Small 3.2 24B | 75 75 75 70 65 65 55 55 50 50 | 63.5%
GPT-4o, Aug. 6th (temp=0) | 65 65 65 65 65 65 65 60 60 55 | 63.0%
GPT-4.1 | 75 65 65 65 65 65 65 60 55 50 | 63.0%
Z.AI GLM 4.7 Flash | 70 70 70 70 65 65 60 55 55 45 | 62.5%
Mistral Small 4 (Reasoning) | 80 75 70 65 65 65 55 55 50 40 | 62.0%
DeepSeek-V2 Chat | 80 70 70 70 70 60 55 50 45 45 | 61.5%
DeepSeek V4 Pro | 85 85 70 70 70 60 55 55 55 0 | 60.5%
GPT-4o, Aug. 6th (temp=1) | 70 70 65 65 60 55 55 55 55 50 | 60.0%
Grok 4.3 | 80 70 70 70 65 65 55 55 45 20 | 59.5%
Qwen 3 32B | 80 80 65 65 65 55 55 45 45 35 | 59.0%
Mistral Small Creative | 75 70 70 60 55 55 55 50 45 40 | 57.5%
Writer: Palmyra X5 | 70 70 60 60 60 55 55 55 45 45 | 57.5%
Llama 3.1 70B | 70 65 60 55 55 55 55 55 55 50 | 57.5%
Ministral 3 14B | 80 80 75 75 65 65 65 60 5 5 | 57.5%
GPT-4o, May 13th (temp=1) | 70 65 60 60 60 60 55 50 50 40 | 57.0%
GPT-5 Nano | 75 65 65 60 60 55 55 45 45 45 | 57.0%
Qwen3 235B A22B Instruct 2507 | 65 65 65 60 60 55 50 50 45 40 | 55.5%
DeepSeek V3 (2025-03-24) | 70 70 60 55 55 55 50 45 45 45 | 55.0%
DeepSeek V3 (2024-12-26) | 70 65 60 55 55 55 55 45 45 45 | 55.0%
Nemotron 3 Nano | 65 60 60 60 55 55 55 45 45 40 | 54.0%
Qwen 2.5 72B | 70 60 60 60 55 55 50 50 45 35 | 54.0%
Gemini 2.5 Flash Lite | 75 60 60 55 55 55 55 50 40 35 | 54.0%
Gemma 3 27B | 70 65 55 55 55 50 50 45 45 45 | 53.5%
DeepSeek V3.2 | 65 65 65 55 55 55 45 40 35 30 | 51.0%
Inception Mercury | 65 65 60 55 55 50 50 50 30 15 | 49.5%
Arcee AI: Trinity Large (Preview) | 70 55 55 55 55 50 50 45 35 20 | 49.0%
GPT-5.4 Mini | 65 60 55 55 55 50 45 45 45 5 | 48.0%
Hermes 3 405B | 65 55 50 45 45 45 45 40 40 40 | 47.0%
DeepSeek V3.1 | 70 55 55 55 50 45 40 40 40 0 | 45.0%
Ministral 3 3B | 50 45 45 40 40 40 35 35 30 30 | 39.0%
Ministral 3 8B | 55 50 45 45 40 35 30 25 20 5 | 35.0%
Ministral 8B | 65 45 45 45 35 30 30 25 0 0 | 32.0%
Gemma 3 12B | 50 35 35 35 25 25 25 20 20 15 | 28.5%
Ministral 3B | 55 40 30 30 30 30 30 15 15 5 | 28.0%
GPT-5.4 Nano (Reasoning, Low) | 75 75 65 60 0 0 0 0 0 0 | 27.5%
GPT-4o Mini (temp=1) | 35 35 35 35 35 25 25 20 10 10 | 26.5%
Hermes 3 70B | 45 40 35 35 30 30 25 20 5 0 | 26.5%
GPT-4o Mini (temp=0) | 35 25 25 25 25 25 25 25 25 25 | 26.0%
Arcee AI: Trinity Mini | 45 35 30 30 25 20 20 20 20 10 | 25.5%
Mistral Small 4 | 40 35 35 30 30 25 20 20 15 0 | 25.0%
Cohere Command R+ (Aug. 2024) | 50 45 40 40 35 30 0 0 0 0 | 24.0%
Mistral NeMO | 50 40 40 35 25 25 0 0 0 0 | 21.5%
Claude 3 Haiku | 30 30 20 20 20 20 20 20 15 5 | 20.0%
Llama 3.1 8B | 35 30 30 30 30 20 15 10 0 0 | 20.0%
GPT-5.4 Nano | 35 20 20 15 15 15 0 0 0 0 | 12.0%
Gemma 3 4B | 30 25 20 10 10 5 0 0 0 0 | 10.0%
Rocinante 12B | 50 45 0 0 0 0 0 0 0 0 | 9.5%
GPT-4.1 Nano | 0 0 0 0 0 0 0 0 0 0 | 0.0%
WizardLM 2 8x22b | 0 0 0 0 0 0 0 0 0 0 | 0.0%
LFM2 24B | 0 0 0 0 0 0 0 0 0 0 | 0.0%
Model | #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 | Avg
Qwen3.7 Max | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Qwen3.6 Max Preview | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Z.AI GLM 5.1 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
GPT-5.4 (Reasoning) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
GPT-5.5 (Reasoning) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
GPT-5 Mini | 100 100 100 100 100 100 100 100 100 100 | 100.0%
GPT-5.1 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Claude Opus 4.6 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
GPT-5 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Qwen 3.5 122B | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Z.AI GLM 5 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Qwen 3.5 27B | 100 100 100 100 100 100 100 100 100 100 | 100.0%
ByteDance Seed 1.6 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Qwen 3.6 Flash | 100 100 100 100 100 100 100 100 100 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
GPT-5.2 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Claude Opus 4.5 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
GPT-5.5 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Qwen 3.6 35B | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Claude Sonnet 4 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Claude Sonnet 4.5 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Stealth: Hunter Alpha | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Stealth: Healer Alpha | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Gemini 3 Flash (Preview) | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Nemotron 3 Super | 100 100 100 100 100 100 100 100 100 100 | 100.0%
GPT-5.4 | 100 100 100 100 100 100 100 100 100 100 | 100.0%
Claude 3.7 Sonnet | 100 100 100 100 100 100 100 100 100 100 | 100.0%
GPT-4.1 | 100 100 100 100 100 100 100 100 100 95 | 99.5%
Gemini 3.1 Flash Lite | 100 100 100 100 100 100 100 100 100 95 | 99.5%
Claude Opus 4.6 (Reasoning) | 100 100 100 100 100 100 100 100 100 90 | 99.0%
Z.AI GLM 5 Turbo | 100 100 100 100 100 100 100 100 100 90 | 99.0%
Grok 4.3 (Reasoning) | 100 100 100 100 100 100 100 100 100 90 | 99.0%
GPT-5.5 (Reasoning, Low) | 100 100 100 100 100 100 100 100 100 90 | 99.0%
MoonshotAI: Kimi K2.6 | 100 100 100 100 100 100 100 100 100 90 | 99.0%
Qwen 3.5 397B A17B | 100 100 100 100 100 100 100 100 100 90 | 99.0%
Grok 4.20 (Reasoning) | 100 100 100 100 100 100 100 100 100 90 | 99.0%
Z.AI GLM 4.7 | 100 100 100 100 100 100 100 100 100 90 | 99.0%
ByteDance Seed 2.0 Mini | 100 100 100 100 100 100 100 100 100 90 | 99.0%
Qwen 3.5 Flash | 100 100 100 100 100 100 100 100 100 90 | 99.0%
DeepSeek V3 (2024-12-26) | 100 100 100 100 100 100 100 100 100 90 | 99.0%
Gemini 2.5 Pro | 100 100 100 100 100 100 100 100 95 90 | 98.5%
MoonshotAI: Kimi K2.5 | 100 100 100 100 100 100 100 100 95 90 | 98.5%
Gemma 4 31B (Reasoning) | 100 100 100 100 100 100 100 100 90 90 | 98.0%
Grok 4.1 Fast | 100 100 100 100 100 100 100 100 90 90 | 98.0%
Grok 4 Fast | 100 100 100 100 100 100 100 100 90 90 | 98.0%
Xiaomi MIMO v2.5 | 100 100 100 100 100 100 100 100 90 90 | 98.0%
GPT-4o, Aug. 6th (temp=0) | 100 100 100 100 100 100 100 100 90 90 | 98.0%
DeepSeek V3 (2025-03-24) | 100 100 100 100 100 100 100 100 100 80 | 98.0%
DeepSeek V4 Flash (Reasoning) | 100 100 100 100 100 95 95 95 95 95 | 97.5%
Gemma 4 26B (Reasoning) | 100 100 100 100 100 100 100 90 90 90 | 97.0%
DeepSeek V4 Pro (Reasoning) | 100 100 100 100 100 100 100 90 90 90 | 97.0%
Gemini 3 Pro (Preview) | 100 100 100 100 100 100 100 90 90 90 | 97.0%
Qwen 3.5 35B | 100 100 100 100 100 100 100 90 90 90 | 97.0%
GPT-5.4 Mini (Reasoning, Low) | 100 100 100 100 100 100 100 90 90 90 | 97.0%
Qwen 3.5 9B | 100 100 100 100 100 100 100 90 90 90 | 97.0%
GPT-4o, May 13th (temp=0) | 100 100 100 100 100 100 100 100 85 85 | 97.0%
Claude 3.5 Sonnet | 100 100 100 100 100 100 100 100 85 85 | 97.0%
Z.AI GLM 4.6 | 100 100 100 100 100 100 95 90 90 90 | 96.5%
Qwen 3.5 Plus (2026-02-15) | 100 100 100 95 95 95 95 95 95 95 | 96.5%
DeepSeek-V2 Chat | 100 100 100 100 100 100 100 90 90 85 | 96.5%
DeepSeek V3.2 | 100 100 100 100 100 100 100 90 85 80 | 95.5%
o4 Mini High | 100 100 100 100 100 90 90 90 90 90 | 95.0%
Z.AI GLM 4.5 | 100 100 100 100 100 90 90 90 90 90 | 95.0%
Mistral Medium 3.1 | 100 100 100 100 100 100 85 85 85 85 | 94.0%
o4 Mini | 100 100 100 90 90 90 90 90 90 90 | 93.0%
Grok 4.20 (Beta) | 100 100 100 100 100 85 85 85 85 85 | 92.5%
Grok 4.20 | 100 100 100 100 100 85 85 85 85 85 | 92.5%
Grok 4 | 100 100 90 90 90 90 90 90 90 90 | 92.0%
GPT-4o, Aug. 6th (temp=1) | 100 100 100 100 90 90 85 85 85 85 | 92.0%
DeepSeek V3.1 | 100 100 100 100 90 90 85 85 85 85 | 92.0%
Gemma 4 31B | 100 90 90 90 90 90 90 90 90 90 | 91.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 100 100 100 100 100 100 100 55 55 | 91.0%
GPT-OSS 120B | 100 100 90 90 90 90 90 90 90 80 | 91.0%
Llama 3.1 70B | 100 90 90 90 90 90 90 90 90 90 | 91.0%
Llama 3.1 Nemotron 70B | 100 90 90 90 90 90 90 90 90 90 | 91.0%
Claude Opus 4 | 100 100 100 100 85 85 85 85 85 85 | 91.0%
Gemma 4 26B | 90 90 90 90 90 90 90 90 90 90 | 90.0%
ByteDance Seed 2.0 Lite | 100 100 90 90 90 90 90 90 80 80 | 90.0%
Inception Mercury 2 | 90 90 90 90 90 90 90 90 90 90 | 90.0%
Gemini 2.5 Flash | 90 90 90 90 90 90 90 90 90 90 | 90.0%
Z.AI GLM 4.5 Air | 100 100 90 90 90 90 90 90 80 75 | 89.5%
Mistral Large | 100 90 90 90 90 90 90 85 85 85 | 89.5%
Gemini 3.1 Flash Lite (Reasoning) | 100 100 100 100 100 100 100 100 95 0 | 89.5%
Mistral Large 2 | 90 90 90 90 90 90 90 90 85 85 | 89.0%
Claude Opus 4.7 | 100 100 90 90 90 90 90 90 75 75 | 89.0%
Aion 2.0 | 100 100 100 100 100 100 100 100 90 0 | 89.0%
DeepSeek V4 Flash | 90 90 90 90 90 90 90 90 85 85 | 89.0%
Hermes 3 405B | 100 100 100 90 90 90 90 85 80 60 | 88.5%
GPT-5.4 Nano (Reasoning) | 100 100 100 100 100 100 100 90 90 0 | 88.0%
Inception Mercury | 100 90 90 90 90 90 90 90 75 75 | 88.0%
Mistral Large 3 | 90 90 90 90 90 85 85 85 85 85 | 87.5%
Gemini 2.5 Flash Lite (Reasoning) | 90 90 90 90 90 90 90 90 80 65 | 86.5%
MiniMax M2.7 | 100 100 90 90 90 85 85 75 75 75 | 86.5%
GPT-4o, May 13th (temp=1) | 100 100 100 90 85 85 85 85 65 65 | 86.0%
Nemotron 3 Nano | 100 90 90 90 90 80 80 80 80 80 | 86.0%
Gemma 3 12B | 90 90 90 90 90 90 80 80 80 75 | 85.5%
Qwen 3.6 27B | 100 100 100 100 90 90 90 90 80 10 | 85.0%
MiniMax M2.5 | 100 100 90 90 90 85 75 75 75 70 | 85.0%
Claude Sonnet 4.6 | 85 85 85 85 85 85 85 85 85 85 | 85.0%
Grok 4.3 | 100 95 90 90 90 85 80 75 75 65 | 84.5%
DeepSeek V4 Pro | 100 100 95 95 95 90 90 90 85 0 | 84.0%
GPT-4.1 Mini | 90 90 90 90 85 80 80 75 75 75 | 83.0%
ByteDance Seed 1.6 Flash | 90 90 90 90 80 80 80 80 75 65 | 82.0%
Stealth: Aurora Alpha | 90 90 90 90 90 90 90 90 90 0 | 81.0%
Qwen 3 32B | 90 90 90 90 90 80 80 75 70 50 | 80.5%
Mistral Small 3.2 24B | 90 90 90 80 75 75 75 75 75 75 | 80.0%
Qwen3 235B A22B Instruct 2507 | 85 85 85 85 80 75 75 75 75 65 | 78.5%
GPT-5 Nano | 90 90 80 80 80 80 70 70 70 70 | 78.0%
Mistral Small 4 (Reasoning) | 90 90 80 80 80 75 75 75 70 65 | 78.0%
Qwen 2.5 72B | 90 90 90 80 80 80 80 80 70 35 | 77.5%
Z.AI GLM 4.7 Flash | 100 90 90 80 75 75 75 65 60 55 | 76.5%
Arcee AI: Trinity Large (Preview) | 90 85 85 80 75 70 70 70 70 65 | 76.0%
Writer: Palmyra X5 | 85 85 85 85 75 75 70 65 65 65 | 75.5%
Gemini 2.5 Flash Lite | 75 75 75 75 75 75 75 75 75 70 | 74.5%
GPT-5.4 Mini | 90 85 80 75 75 75 75 65 65 40 | 72.5%
Mistral Small Creative | 80 80 70 70 70 70 70 70 70 65 | 71.5%
Cohere Command R+ (Aug. 2024) | 90 90 80 70 70 70 65 60 60 55 | 71.0%
Claude Haiku 4.5 | 75 75 75 75 75 75 75 60 60 60 | 70.5%
Gemma 3 27B | 75 75 70 65 65 65 65 55 55 55 | 64.5%
Ministral 3 8B | 70 70 65 65 60 60 60 50 50 35 | 58.5%
Ministral 3 14B | 85 75 65 60 60 60 60 55 45 5 | 57.0%
Ministral 8B | 75 75 65 65 60 60 55 55 35 20 | 56.5%
GPT-4o Mini (temp=0) | 60 60 60 60 60 60 50 50 50 50 | 56.0%
Ministral 3 3B | 65 65 65 65 60 60 55 55 40 30 | 56.0%
GPT-5.4 Nano | 90 80 75 75 70 60 55 40 0 0 | 54.5%
Hermes 3 70B | 80 75 65 60 60 60 50 45 40 0 | 53.5%
GPT-4o Mini (temp=1) | 60 60 60 50 50 50 50 50 50 35 | 51.5%
Ministral 3B | 65 60 55 50 50 50 50 45 40 35 | 50.0%
Llama 3.1 8B | 75 70 65 60 60 40 40 35 30 20 | 49.5%
Mistral NeMO | 65 55 55 45 45 45 45 45 45 45 | 49.0%
Mistral Small 4 | 70 65 65 60 55 50 50 45 30 0 | 49.0%
Arcee AI: Trinity Mini | 60 60 55 50 45 45 40 35 35 35 | 46.0%
GPT-5.4 Nano (Reasoning, Low) | 100 90 80 80 75 0 0 0 0 0 | 42.5%
Claude 3 Haiku | 45 45 45 40 35 35 35 35 20 15 | 35.0%
Rocinante 12B | 55 40 35 15 15 15 5 5 5 0 | 19.0%
GPT-4.1 Nano | 20 20 15 10 10 10 10 10 0 0 | 10.5%
WizardLM 2 8x22b | 10 10 0 0 0 0 0 0 0 0 | 2.0%
Gemma 3 4B | 5 0 0 0 0 0 0 0 0 0 | 0.5%
LFM2 24B | 0 0 0 0 0 0 0 0 0 0 | 0.0%

tiers

Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Qwen3.7 Max 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen3.6 Max Preview 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5.1 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3.5 Flash (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.3 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.7 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5 Mini 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.6 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 31B (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 122B 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-04-20) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 26B (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 100 100 100 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.5 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 1.6 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.6 Flash 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
o4 Mini High 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.2 100 100 100 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Pro (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100 100 100 100 100 100.0%
Aion 2.0 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.5 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.6 35B 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Pro (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4.1 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100 100 100 100 100 100.0%
o4 Mini 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 35B 100 100 100 100 100 100 100 100 100 100 100.0%
Xiaomi MIMO v2.5 Pro 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Mini 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 31B 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Flash 100 100 100 100 100 100 100 100 100 100 100.0%
Stealth: Healer Alpha 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 26B 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Lite 100 100 100 100 100 100 100 100 100 100 100.0%
Nemotron 3 Super 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100 100 100 100 100 100.0%
Claude 3.5 Sonnet 100 100 100 100 100 100 100 100 100 100 100.0%
Claude 3.7 Sonnet 100 100 100 100 100 100 100 100 100 100 100.0%
Hermes 3 405B 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3 32B 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 Turbo 100 100 100 100 100 100 100 100 100 90 99.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100 100 100 100 90 99.0%
Qwen 3.5 27B 100 100 100 100 100 100 100 100 100 90 99.0%
Qwen 3.6 27B 100 100 100 100 100 100 100 100 100 90 99.0%
DeepSeek V4 Flash (Reasoning) 100 100 100 100 100 100 100 100 100 90 99.0%
Z.AI GLM 4.5 100 100 100 100 100 100 100 100 100 90 99.0%
Grok 4 Fast 100 100 100 100 100 100 100 100 100 90 99.0%
Xiaomi MIMO v2.5 100 100 100 100 100 100 100 100 100 90 99.0%
GPT-5.5 (Reasoning, Low) 100 100 100 100 100 100 100 100 90 90 98.0%
Grok 4 100 100 100 100 100 100 100 90 90 90 97.0%
Claude Sonnet 4.5 100 100 100 100 100 100 100 100 100 70 97.0%
Qwen 3.5 9B 100 100 100 100 100 100 100 100 90 80 97.0%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 100 100 90 90 90 97.0%
GPT-5 100 100 100 100 100 100 90 90 90 90 96.0%
Claude Opus 4.7 100 100 100 100 100 100 100 90 90 80 96.0%
Z.AI GLM 4.6 100 100 100 100 100 100 100 100 90 70 96.0%
DeepSeek V3 (2025-03-24) 100 100 100 100 100 100 100 100 80 80 96.0%
Stealth: Hunter Alpha 100 100 100 100 100 100 90 90 90 90 96.0%
Gemini 3.1 Flash Lite (Reasoning) 100 100 100 100 100 100 90 90 90 90 96.0%
Gemini 3.1 Flash Lite 100 100 100 100 100 90 90 90 90 90 95.0%
DeepSeek V4 Pro 100 100 100 100 100 100 100 90 90 70 95.0%
DeepSeek V3.2 100 100 100 100 100 90 90 90 90 90 95.0%
DeepSeek V4 Flash 100 100 100 100 100 100 100 80 80 80 94.0%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 100 100 100 100 80 80 80 94.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100 100 100 100 30 93.0%
Mistral Large 100 100 100 90 90 90 90 90 90 90 93.0%
GPT-OSS 120B 100 100 100 100 100 100 80 80 80 80 92.0%
DeepSeek-V2 Chat 100 100 100 100 100 100 90 80 80 70 92.0%
Llama 3.1 Nemotron 70B 100 100 100 100 100 100 100 100 90 30 92.0%
Llama 3.1 70B 100 100 100 100 100 100 90 80 70 70 91.0%
Mistral Large 3 90 90 90 90 90 90 90 90 90 90 90.0%
Mistral Large 2 90 90 90 90 90 90 90 90 90 90 90.0%
DeepSeek V3 (2024-12-26) 100 100 100 100 100 100 80 80 80 60 90.0%
Z.AI GLM 4.5 Air 100 100 100 100 100 100 100 80 70 50 90.0%
Nemotron 3 Nano 100 100 100 100 100 100 80 80 80 60 90.0%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 100 100 100 80 0 88.0%
Inception Mercury 2 100 100 100 80 80 80 80 80 80 80 86.0%
Mistral Small 4 (Reasoning) 100 100 90 90 90 80 80 80 80 70 86.0%
Claude Opus 4 100 100 100 100 100 70 70 70 70 70 85.0%
Z.AI GLM 4.7 Flash 100 100 100 100 100 80 80 70 70 50 85.0%
Stealth: Aurora Alpha 100 100 100 100 100 80 80 80 80 30 85.0%
MiniMax M2.7 100 100 100 80 80 80 80 80 70 70 84.0%
DeepSeek V3.1 100 100 100 100 100 100 90 80 70 0 84.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 90 90 70 60 60 60 83.0%
ByteDance Seed 1.6 Flash 100 80 80 80 80 80 80 80 80 70 81.0%
Gemini 2.5 Flash 100 80 80 80 80 80 80 80 70 70 80.0%
Inception Mercury 100 80 80 80 80 80 80 80 60 60 78.0%
MiniMax M2.5 100 100 90 80 70 70 70 50 50 50 73.0%
GPT-5 Nano 100 100 100 80 80 80 80 50 50 0 72.0%
Mistral Medium 3.1 100 70 70 70 70 70 70 70 60 60 71.0%
Claude Sonnet 4.6 70 70 70 70 70 70 70 70 70 70 70.0%
Gemma 3 27B 70 70 70 70 70 70 70 70 70 70 70.0%
GPT-4o, Aug. 6th (temp=0) 90 90 90 60 60 60 60 60 60 60 69.0%
GPT-4o, May 13th (temp=1) 90 90 70 70 60 60 60 60 60 60 68.0%
GPT-4.1 Mini 80 80 70 70 70 70 70 70 70 30 68.0%
Qwen 2.5 72B 80 80 80 80 80 80 50 50 50 50 68.0%
GPT-4o, Aug. 6th (temp=1) 90 90 70 60 60 60 60 60 60 60 67.0%
Grok 4.20 100 90 80 70 60 60 60 50 50 40 66.0%
Grok 4.20 (Beta) 80 70 70 60 60 60 60 60 60 50 63.0%
Grok 4.3 70 70 70 70 70 70 60 60 50 40 63.0%
Arcee AI: Trinity Mini 100 80 80 80 60 60 50 50 30 30 62.0%
Writer: Palmyra X5 70 70 70 70 70 70 60 60 60 0 60.0%
GPT-4o Mini (temp=1) 80 80 80 80 50 50 50 50 40 40 60.0%
Qwen3 235B A22B Instruct 2507 70 70 70 70 70 60 60 60 60 0 59.0%
Arcee AI: Trinity Large (Preview) 70 70 70 70 60 60 50 50 40 40 58.0%
Claude Haiku 4.5 70 70 70 70 60 60 50 40 40 40 57.0%
GPT-5.4 Mini 90 70 70 60 60 60 60 40 40 10 56.0%
GPT-4o Mini (temp=0) 80 80 50 50 50 50 50 50 50 50 56.0%
Gemma 3 12B 80 50 50 50 50 50 50 50 50 50 53.0%
Mistral Small Creative 60 60 60 60 60 60 50 50 40 30 53.0%
GPT-5.4 Nano (Reasoning, Low) 100 100 100 80 30 20 0 0 0 0 43.0%
Llama 3.1 8B 70 70 70 70 40 30 30 20 20 0 42.0%
Gemini 2.5 Flash Lite 50 50 40 40 40 40 40 40 40 40 42.0%
Cohere Command R+ (Aug. 2024) 80 60 60 60 50 40 20 20 20 0 41.0%
Ministral 3 3B 70 70 60 50 40 40 40 30 0 0 40.0%
Mistral Small 3.2 24B 60 50 50 40 40 40 30 30 30 10 38.0%
Mistral NeMO 70 50 50 50 40 40 30 30 10 0 37.0%
Claude 3 Haiku 60 60 40 40 40 30 20 20 10 10 33.0%
GPT-5.4 Nano 60 60 50 50 40 30 30 0 0 0 32.0%
Ministral 3 14B 50 50 40 40 40 20 20 20 20 10 31.0%
Ministral 3B 70 70 60 40 20 20 10 10 0 0 30.0%
Mistral Small 4 50 50 40 40 40 30 30 20 0 0 30.0%
Hermes 3 70B 60 50 50 30 30 20 20 10 0 0 27.0%
GPT-4.1 Nano 20 20 20 20 0 0 0 0 0 0 8.0%
Ministral 8B 50 10 10 0 0 0 0 0 0 0 7.0%
Rocinante 12B 40 20 10 0 0 0 0 0 0 0 7.0%
Gemma 3 4B 20 20 20 0 0 0 0 0 0 0 6.0%
Ministral 3 8B 10 0 0 0 0 0 0 0 0 0 1.0%
WizardLM 2 8x22b 0 0 0 0 0 0 0 0 0 0 0.0%
LFM2 24B 0 0 0 0 0 0 0 0 0 0 0.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Qwen3.7 Max 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen3.6 Max Preview 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 Turbo 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning, Low) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 31B (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 26B (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.5 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 27B 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
DeepSeek V4 Pro (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.6 27B 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Pro (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 35B 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Mini 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 31B 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 26B 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Lite 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100 100 100 100 100 100.0%
Stealth: Aurora Alpha 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5.1 100 100 100 100 100 100 100 100 100 92 99.2%
Aion 2.0 100 100 100 100 100 100 100 100 100 92 99.2%
MoonshotAI: Kimi K2.6 100 100 100 100 100 100 100 100 100 92 99.2%
Qwen 3.5 Plus (2026-04-20) 100 100 100 100 100 100 100 100 100 92 99.2%
ByteDance Seed 1.6 100 100 100 100 100 100 100 100 100 92 99.2%
GPT-5.2 100 100 100 100 100 100 100 100 100 92 99.2%
GPT-5.5 100 100 100 100 100 100 100 100 100 92 99.2%
Gemini 2.5 Pro 100 100 100 100 100 100 100 100 100 92 99.2%
Grok 4 100 100 100 100 100 100 100 100 100 92 99.2%
Claude Opus 4 100 100 100 100 100 100 100 100 100 92 99.2%
Qwen 3.5 122B 100 100 100 100 100 100 100 100 92 92 98.3%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100 100 100 92 92 98.3%
DeepSeek V4 Flash (Reasoning) 100 100 100 100 100 100 100 100 92 92 98.3%
Inception Mercury 2 100 100 100 100 100 100 100 100 92 92 98.3%
Gemini 3.5 Flash (Reasoning) 100 100 100 100 100 100 100 100 100 75 97.5%
GPT-5.4 (Reasoning) 100 100 100 100 100 100 100 92 92 92 97.5%
Z.AI GLM 5 100 100 100 100 100 100 100 100 100 75 97.5%
Qwen 3.5 9B 100 100 100 100 100 100 100 100 92 83 97.5%
Xiaomi MIMO v2.5 100 100 100 100 100 100 100 100 100 75 97.5%
Grok 4.20 (Reasoning) 100 100 100 100 100 100 92 92 92 92 96.7%
Mistral Medium 3.1 100 100 100 100 100 100 100 100 83 83 96.7%
Claude 3.5 Sonnet 100 100 100 100 100 100 100 100 83 83 96.7%
DeepSeek V4 Pro 100 100 100 100 100 100 100 100 92 75 96.7%
GPT-5.1 100 100 100 100 92 92 92 92 92 92 95.0%
Stealth: Healer Alpha 100 100 100 100 100 100 100 100 75 75 95.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 100 100 100 83 83 83 95.0%
Qwen3 235B A22B Instruct 2507 100 100 100 100 100 100 92 92 92 75 95.0%
GPT-5 100 100 100 100 92 92 92 92 92 92 95.0%
MiniMax M2.7 100 100 100 100 100 100 100 83 83 67 93.3%
MiniMax M2.5 100 100 100 100 100 100 100 83 75 75 93.3%
Grok 4 Fast 100 100 100 100 100 100 100 100 100 33 93.3%
Grok 4.3 (Reasoning) 100 100 100 100 100 100 100 100 92 42 93.3%
o4 Mini 100 100 100 100 100 100 92 92 75 75 93.3%
Claude Opus 4.7 (Reasoning) 100 100 100 100 100 100 100 75 75 75 92.5%
GPT-5 Mini 100 92 92 92 92 92 92 92 92 92 92.5%
Llama 3.1 Nemotron 70B 100 100 100 100 100 92 83 83 83 83 92.5%
o4 Mini High 100 100 100 100 100 92 92 92 75 75 92.5%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 100 83 83 83 67 91.7%
Qwen 3.6 35B 100 100 100 100 100 100 92 75 75 75 91.7%
Qwen 3.5 Flash 100 100 100 100 100 100 100 92 75 50 91.7%
Qwen 3.6 Flash 100 100 100 100 100 100 100 100 100 0 90.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100 75 75 75 75 90.0%
Writer: Palmyra X5 100 100 100 100 100 100 75 75 75 75 90.0%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 100 100 75 67 50 89.2%
GPT-OSS 120B 100 100 100 100 100 92 92 75 75 58 89.2%
Z.AI GLM 4.5 100 100 100 100 100 92 75 75 75 75 89.2%
Gemini 2.5 Flash Lite (Reasoning) 100 100 92 92 92 83 83 83 83 83 89.2%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 75 75 75 75 75 87.5%
Xiaomi MIMO v2.5 Pro 100 100 100 100 100 75 75 75 75 75 87.5%
Z.AI GLM 4.5 Air 100 100 100 100 100 100 83 75 75 42 87.5%
DeepSeek V3 (2025-03-24) 100 100 100 92 83 83 83 75 75 75 86.7%
Mistral Small Creative 100 100 100 92 92 83 83 83 75 58 86.7%
Z.AI GLM 4.6 100 100 100 100 100 100 92 92 75 8 86.7%
Mistral Small 4 (Reasoning) 100 100 100 100 100 100 92 75 58 42 86.7%
Stealth: Hunter Alpha 100 100 100 100 83 75 75 75 75 75 85.8%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 83 83 83 67 67 67 85.0%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 92 92 83 75 0 84.2%
Inception Mercury 100 100 100 100 83 75 75 75 75 58 84.2%
ByteDance Seed 1.6 Flash 100 92 83 83 83 83 83 83 83 67 84.2%
Gemini 3.1 Flash Lite 100 100 100 83 83 83 75 75 75 50 82.5%
GPT-4.1 Mini 92 92 92 92 92 75 75 75 75 67 82.5%
Qwen 3 32B 100 100 100 83 83 83 83 83 67 42 82.5%
Mistral Large 100 100 100 92 75 75 75 75 67 67 82.5%
GPT-4o, Aug. 6th (temp=1) 100 83 83 83 83 83 83 83 67 67 81.7%
Gemini 2.5 Flash 92 92 92 92 83 75 75 75 75 67 81.7%
GPT-4o, Aug. 6th (temp=0) 83 83 83 83 83 83 83 83 83 67 81.7%
GPT-5 Nano 100 100 83 75 75 75 75 75 75 75 80.8%
GPT-4.1 100 100 75 75 75 75 75 75 75 75 80.0%
DeepSeek V3.2 100 100 100 92 92 92 92 58 50 25 80.0%
Gemma 3 27B 100 100 92 75 75 75 75 75 75 50 79.2%
DeepSeek V4 Flash 100 92 92 92 75 75 75 75 58 58 79.2%
Claude Opus 4.7 92 92 75 75 75 75 75 75 75 75 78.3%
Gemini 3.1 Flash Lite (Reasoning) 100 100 100 83 83 83 58 58 58 42 76.7%
Ministral 3 8B 83 83 83 75 75 75 75 75 67 67 75.8%
Nemotron 3 Super 83 75 75 75 75 75 75 75 75 75 75.8%
Claude Sonnet 4.6 75 75 75 75 75 75 75 75 75 75 75.0%
Claude Haiku 4.5 75 75 75 75 75 75 75 75 75 75 75.0%
DeepSeek-V2 Chat 100 100 83 75 75 75 67 58 58 58 75.0%
Grok 4.20 92 92 92 75 75 75 75 75 50 42 74.2%
Llama 3.1 70B 83 83 83 83 83 83 83 83 75 0 74.2%
Mistral Large 3 92 75 75 75 75 75 67 67 67 67 73.3%
Mistral Small 3.2 24B 100 92 83 83 75 75 75 58 50 42 73.3%
Grok 4.20 (Beta) 92 83 75 75 75 75 75 67 50 50 71.7%
Z.AI GLM 4.7 Flash 100 83 67 67 67 67 67 67 67 58 70.8%
Mistral Large 2 100 75 75 75 75 67 67 67 50 50 70.0%
Claude 3.7 Sonnet 100 100 100 100 100 100 92 0 0 0 69.2%
DeepSeek V3.1 92 92 83 75 75 75 67 58 42 33 69.2%
Hermes 3 405B 100 83 75 75 75 75 75 67 58 0 68.3%
Gemma 3 12B 92 75 75 75 75 75 67 67 50 33 68.3%
Ministral 8B 100 83 75 75 75 67 58 50 50 42 67.5%
DeepSeek V3 (2024-12-26) 92 83 75 75 75 67 58 58 42 42 66.7%
Nemotron 3 Nano 83 83 75 75 75 75 67 58 33 25 65.0%
WizardLM 2 8x22b 100 92 83 75 75 67 58 50 50 0 65.0%
Grok 4.3 100 92 83 75 75 67 58 50 50 0 65.0%
Gemini 2.5 Flash Lite 83 83 75 67 67 58 58 58 50 42 64.2%
GPT-4o, May 13th (temp=1) 100 83 83 83 83 67 67 17 17 17 61.7%
Arcee AI: Trinity Large (Preview) 75 75 67 58 50 50 50 50 50 42 56.7%
GPT-4o Mini (temp=1) 67 67 67 67 50 50 50 50 50 42 55.8%
Arcee AI: Trinity Mini 83 83 67 50 50 50 50 50 33 33 55.0%
Qwen 2.5 72B 83 67 67 67 67 67 50 25 25 25 54.2%
GPT-5.4 Mini 75 67 58 50 50 50 50 50 50 25 52.5%
Hermes 3 70B 75 75 67 67 67 50 50 42 25 0 51.7%
GPT-4o Mini (temp=0) 50 50 50 50 50 50 50 50 50 50 50.0%
Cohere Command R+ (Aug. 2024) 75 67 67 58 58 42 33 33 8 0 44.2%
Claude 3 Haiku 58 58 58 58 42 33 25 25 25 25 40.8%
Ministral 3 14B 50 50 50 50 42 42 33 33 25 17 39.2%
Mistral NeMO 67 67 42 42 33 33 33 25 25 17 38.3%
Mistral Small 4 75 50 50 42 33 17 8 0 0 0 27.5%
Ministral 3B 42 42 33 33 8 8 0 0 0 0 16.7%
GPT-5.4 Nano (Reasoning, Low) 83 33 25 17 0 0 0 0 0 0 15.8%
Ministral 3 3B 75 25 17 17 17 8 0 0 0 0 15.8%
Rocinante 12B 42 42 25 25 17 0 0 0 0 0 15.0%
Gemma 3 4B 25 25 17 8 8 0 0 0 0 0 8.3%
GPT-5.4 Nano 42 17 8 0 0 0 0 0 0 0 6.7%
GPT-4.1 Nano 17 0 0 0 0 0 0 0 0 0 1.7%
Llama 3.1 8B 0 0 0 0 0 0 0 0 0 0 0.0%
LFM2 24B 0 0 0 0 0 0 0 0 0 0 0.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100 95 95 99.1%
Grok 4.20 (Beta) 100 100 100 100 100 100 100 100 100 86 98.6%
Grok 4.20 100 100 100 100 100 100 100 100 91 91 98.2%
Gemini 2.5 Pro 100 100 100 95 95 95 95 95 95 95 96.8%
MoonshotAI: Kimi K2.6 100 100 100 100 100 95 95 95 86 86 95.9%
ByteDance Seed 2.0 Lite 100 100 100 100 95 95 95 95 91 86 95.9%
GPT-5.4 95 95 95 95 95 95 95 95 91 91 94.5%
Claude Sonnet 4.5 100 100 100 100 100 100 86 86 86 86 94.5%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 95 95 95 91 86 82 82 92.7%
Claude Sonnet 4.6 (Reasoning) 95 95 95 95 95 91 91 91 91 82 92.3%
GPT-5.5 95 91 91 91 91 91 91 91 91 91 91.4%
Xiaomi MIMO v2.5 Pro 100 100 95 95 91 91 91 86 82 82 91.4%
Gemini 3 Pro (Preview) 100 95 95 91 91 91 86 86 86 86 90.9%
Qwen 3.5 Flash 100 95 95 95 95 91 91 86 82 77 90.9%
GPT-5.2 95 95 95 95 91 91 86 86 86 86 90.9%
Grok 4 100 100 91 91 91 91 91 91 82 77 90.5%
Qwen 3.5 35B 100 95 95 95 91 91 91 86 86 73 90.5%
Z.AI GLM 5 Turbo 100 100 95 95 95 91 86 86 77 77 90.5%
GPT-5 91 91 91 91 91 91 91 91 91 86 90.5%
Gemma 4 31B (Reasoning) 100 95 95 95 91 86 86 86 86 82 90.5%
DeepSeek V4 Pro (Reasoning) 100 100 95 95 91 91 91 86 82 73 90.5%
Qwen3.7 Max 100 95 91 91 91 91 86 86 86 86 90.5%
Qwen 3.5 27B 100 100 100 86 86 86 86 86 86 82 90.0%
MoonshotAI: Kimi K2.5 95 95 95 91 91 86 86 86 86 82 89.5%
Qwen3.6 Max Preview 100 100 95 95 86 86 86 82 82 77 89.1%
GPT-5.5 (Reasoning, Low) 95 95 95 86 86 86 86 86 86 86 89.1%
DeepSeek V4 Flash (Reasoning) 100 100 95 95 95 86 86 82 77 73 89.1%
GPT-5.1 91 91 91 91 91 91 91 86 86 82 89.1%
GPT-5.4 (Reasoning, Low) 95 91 91 91 91 86 86 86 86 86 89.1%
Grok 4.20 (Reasoning) 100 91 91 91 91 91 91 82 82 82 89.1%
Claude 3.5 Sonnet 95 95 95 95 95 91 86 82 77 73 88.6%
Z.AI GLM 5.1 95 91 91 91 91 91 86 86 82 82 88.6%
Grok 4.20 (Beta, Reasoning) 100 91 91 91 91 91 91 82 82 77 88.6%
Aion 2.0 100 100 95 91 91 86 82 82 82 77 88.6%
ByteDance Seed 2.0 Mini 100 91 91 91 86 86 86 86 86 82 88.6%
Grok 4.3 (Reasoning) 100 100 91 91 91 91 82 82 82 68 87.7%
Gemini 2.5 Flash 100 86 86 86 86 86 86 86 86 86 87.7%
GPT-5.5 (Reasoning) 95 86 86 86 86 86 86 86 86 86 87.3%
Qwen 3.6 27B 100 91 91 91 91 86 82 82 82 77 87.3%
Gemini 2.5 Flash (Reasoning) 100 95 91 91 91 86 82 82 77 77 87.3%
GPT-5.4 (Reasoning) 91 91 91 86 86 86 86 86 82 82 86.8%
Mistral Medium 3.1 95 95 86 86 86 86 86 86 82 77 86.8%
Qwen 3.6 Flash 100 95 95 91 86 86 82 82 73 73 86.4%
Z.AI GLM 4.7 95 95 86 86 86 86 82 82 82 82 86.4%
Gemma 4 31B 86 86 86 86 86 86 86 86 86 86 86.4%
Gemini 3 Flash (Preview) 86 86 86 86 86 86 86 86 86 86 86.4%
Z.AI GLM 5 100 91 91 91 91 86 82 82 73 73 85.9%
Stealth: Hunter Alpha 100 91 91 91 86 86 82 82 77 73 85.9%
Gemini 3.5 Flash (Reasoning, Minimal) 86 86 86 86 86 86 86 86 86 82 85.9%
Gemini 3.1 Flash Lite (Reasoning) 95 95 91 82 82 82 82 82 82 82 85.5%
Gemini 3.5 Flash (Reasoning) 91 91 91 91 86 86 82 82 82 73 85.5%
Claude Sonnet 4 86 86 86 86 86 86 86 86 82 82 85.5%
Grok 4.1 Fast 91 91 91 91 91 86 82 82 73 73 85.0%
Claude Sonnet 4.6 100 100 91 91 86 82 82 77 68 68 84.5%
o4 Mini High 91 91 91 91 91 82 77 77 77 73 84.1%
Qwen 3.5 122B 91 91 91 86 86 82 82 82 77 68 83.6%
Gemini 3.1 Flash Lite 95 95 82 82 82 82 82 82 82 73 83.6%
Claude Opus 4 95 95 95 82 82 82 82 73 73 73 83.2%
Qwen 3.5 Plus (2026-04-20) 95 91 91 91 86 86 77 73 68 68 82.7%
GPT-4o, Aug. 6th (temp=0) 86 86 86 86 86 86 77 77 77 77 82.7%
Claude Opus 4.7 91 82 82 82 82 82 82 82 77 77 81.8%
Qwen 3.6 35B 100 95 95 91 86 77 73 68 68 64 81.8%
Stealth: Healer Alpha 91 91 91 91 82 82 77 73 73 68 81.8%
Gemini 3.1 Flash Lite (Preview) 82 82 82 82 82 82 82 82 82 82 81.8%
GPT-5.4 Mini (Reasoning) 86 86 86 86 86 86 82 77 68 68 81.4%
DeepSeek V4 Pro 95 95 86 82 82 82 82 77 68 64 81.4%
Gemma 4 26B (Reasoning) 91 91 86 86 82 82 77 77 77 64 81.4%
Xiaomi MIMO v2.5 95 95 91 86 82 77 77 77 68 64 81.4%
Mistral Large 95 95 95 86 82 82 77 68 64 64 80.9%
Grok 4 Fast 100 95 91 91 91 91 82 82 82 0 80.5%
o4 Mini 91 82 82 82 82 82 77 77 77 73 80.5%
Claude Opus 4.7 (Reasoning) 100 82 82 77 77 77 77 77 77 68 79.5%
Grok 4.3 91 86 86 86 82 82 82 77 64 59 79.5%
Qwen 3.5 Plus (2026-02-15) 91 91 86 82 82 77 77 77 77 50 79.1%
ByteDance Seed 1.6 86 86 77 77 77 77 77 77 77 77 79.1%
GPT-5.4 Mini (Reasoning, Low) 86 86 82 82 82 82 77 77 68 64 78.6%
GPT-4o, Aug. 6th (temp=1) 91 91 91 91 86 77 73 68 59 59 78.6%
DeepSeek V4 Flash 100 91 91 91 86 77 73 73 50 50 78.2%
DeepSeek V3.2 100 91 91 91 77 73 73 68 64 55 78.2%
Mistral Large 2 86 86 86 82 82 82 77 73 64 59 77.7%
GPT-5 Mini 86 82 77 77 77 77 77 77 73 68 77.3%
Z.AI GLM 4.6 100 95 86 86 82 82 73 59 59 45 76.8%
Qwen 3.5 397B A17B 95 86 86 86 86 86 82 82 73 0 76.4%
Gemini 2.5 Flash Lite (Reasoning) 86 82 82 82 82 73 73 68 68 64 75.9%
GPT-5 Nano 91 82 82 82 77 77 73 68 64 55 75.0%
MiniMax M2.7 91 82 82 82 77 77 73 68 68 50 75.0%
Qwen 3.5 9B 95 91 91 86 77 77 77 68 68 9 74.1%
GPT-5.4 Mini 91 91 86 82 73 73 68 64 64 45 73.6%
Claude Haiku 4.5 91 91 91 82 82 77 73 68 68 0 72.3%
Gemma 4 26B 82 73 73 73 73 73 73 68 64 64 71.4%
Z.AI GLM 4.7 Flash 91 77 73 73 73 68 68 68 64 59 71.4%
Z.AI GLM 4.5 Air 91 91 73 68 68 68 68 68 59 59 71.4%
Writer: Palmyra X5 86 82 82 77 68 68 68 68 59 55 71.4%
Inception Mercury 95 82 82 77 73 73 68 64 55 45 71.4%
Mistral Large 3 91 77 77 77 68 64 64 64 64 64 70.9%
Inception Mercury 2 77 77 77 77 73 73 68 68 59 59 70.9%
Stealth: Aurora Alpha 73 73 73 73 73 73 73 68 68 64 70.9%
DeepSeek V3.1 95 95 86 82 82 77 64 45 41 41 70.9%
MiniMax M2.5 82 82 77 73 68 68 64 64 64 59 70.0%
Claude 3.7 Sonnet 100 100 86 86 77 64 59 50 45 32 70.0%
GPT-OSS 120B 82 77 77 73 73 68 68 59 59 55 69.1%
Z.AI GLM 4.5 91 73 68 68 68 68 64 64 59 59 68.2%
Qwen3 235B A22B Instruct 2507 91 86 86 82 77 68 59 59 50 23 68.2%
ByteDance Seed 1.6 Flash 82 73 73 73 73 73 64 59 59 45 67.3%
GPT-4o, May 13th (temp=0) 77 77 77 73 68 68 59 59 59 50 66.8%
Nemotron 3 Nano 82 82 77 73 73 59 59 59 59 45 66.8%
GPT-4.1 95 82 82 82 82 82 77 64 0 0 64.5%
Gemini 2.5 Flash Lite 100 95 91 77 73 73 64 55 0 0 62.7%
Mistral Small 4 (Reasoning) 82 73 68 68 64 59 59 59 50 45 62.7%
DeepSeek V3 (2024-12-26) 82 73 73 68 68 68 59 45 45 36 61.8%
Ministral 3 14B 82 82 77 68 68 45 45 45 41 41 59.5%
GPT-4.1 Mini 77 68 68 68 59 59 55 41 41 36 57.3%
Arcee AI: Trinity Large (Preview) 77 73 73 64 55 55 50 50 45 27 56.8%
Nemotron 3 Super 77 77 73 73 73 68 68 59 0 0 56.8%
Qwen 3 32B 77 73 73 68 64 55 50 50 45 9 56.4%
GPT-4o, May 13th (temp=1) 73 68 59 59 59 55 50 45 45 41 55.5%
Mistral Small Creative 64 64 64 59 55 55 50 50 50 41 55.0%
DeepSeek-V2 Chat 73 68 68 68 59 55 50 45 41 0 52.7%
Mistral Small 3.2 24B 64 59 59 59 55 50 50 45 45 41 52.7%
DeepSeek V3 (2025-03-24) 68 68 59 59 59 55 45 45 45 9 51.4%
Hermes 3 405B 82 73 68 64 59 50 36 27 23 23 50.5%
Gemma 3 27B 59 59 55 45 45 41 41 41 41 41 46.8%
Llama 3.1 Nemotron 70B 59 59 59 45 41 41 36 36 36 27 44.1%
GPT-5.4 Nano (Reasoning) 82 73 73 73 73 64 0 0 0 0 43.6%
Llama 3.1 70B 59 59 55 55 45 45 45 27 27 9 42.7%
Cohere Command R+ (Aug. 2024) 64 59 55 55 45 45 41 27 23 14 42.7%
Ministral 3 8B 73 64 45 41 41 32 32 23 18 14 38.2%
Ministral 8B 68 64 55 50 41 41 32 23 0 0 37.3%
Mistral Small 4 50 50 45 45 45 32 32 27 23 18 36.8%
WizardLM 2 8x22b 59 45 41 41 41 41 36 27 23 9 36.4%
GPT-5.4 Nano (Reasoning, Low) 77 73 73 64 50 0 0 0 0 0 33.6%
Arcee AI: Trinity Mini 45 45 45 45 41 36 36 18 14 9 33.6%
Hermes 3 70B 50 41 36 36 32 23 18 0 0 0 23.6%
GPT-5.4 Nano 64 45 45 45 14 0 0 0 0 0 21.4%
GPT-4o Mini (temp=0) 32 23 23 18 18 18 18 18 18 18 20.5%
GPT-4o Mini (temp=1) 50 32 32 23 23 18 14 5 0 0 19.5%
Gemma 3 12B 32 27 27 27 23 14 14 9 5 0 17.7%
Ministral 3B 32 32 27 23 23 14 14 9 5 0 17.7%
Claude 3 Haiku 41 27 23 18 14 14 9 9 5 0 15.9%
Ministral 3 3B 32 23 18 18 18 14 5 0 0 0 12.7%
Mistral NeMO 27 27 27 14 9 5 5 0 0 0 11.4%
Llama 3.1 8B 41 32 27 14 0 0 0 0 0 0 11.4%
Qwen 2.5 72B 73 9 9 0 0 0 0 0 0 0 9.1%
Gemma 3 4B 5 0 0 0 0 0 0 0 0 0 0.5%
Rocinante 12B 5 0 0 0 0 0 0 0 0 0 0.5%
GPT-4.1 Nano 0 0 0 0 0 0 0 0 0 0 0.0%
LFM2 24B 0 0 0 0 0 0 0 0 0 0 0.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Qwen3.7 Max 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen3.6 Max Preview 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.7 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-04-20) 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.20 (Reasoning) 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.5 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.6 35B 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 4 31B 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.5 (Reasoning, Low) 100 100 100 100 100 100 100 100 100 95 99.5%
Qwen 3.5 27B 100 100 100 100 100 100 100 100 100 95 99.5%
Z.AI GLM 5 Turbo 100 100 100 100 100 100 100 100 100 90 99.0%
Gemini 3.5 Flash (Reasoning) 100 100 100 100 100 100 100 100 100 90 99.0%
GPT-5 Mini 100 100 100 100 100 100 100 100 95 95 99.0%
Qwen 3.5 35B 100 100 100 100 100 100 100 100 100 90 99.0%
GPT-5.4 100 100 100 100 100 100 100 100 95 95 99.0%
Grok 4.3 (Reasoning) 100 100 100 100 100 100 100 100 100 85 98.5%
Gemma 4 31B (Reasoning) 100 100 100 100 100 100 100 100 100 85 98.5%
Qwen 3.6 Flash 100 100 100 100 100 100 100 100 100 85 98.5%
Gemini 2.5 Pro 100 100 100 100 100 100 100 100 100 85 98.5%
Claude Sonnet 4.5 100 100 100 100 100 100 100 100 100 85 98.5%
Xiaomi MIMO v2.5 Pro 100 100 100 100 100 100 100 100 100 85 98.5%
Z.AI GLM 5.1 100 100 100 100 100 100 100 100 90 90 98.0%
MoonshotAI: Kimi K2.5 100 100 100 100 100 100 100 100 95 85 98.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100 100 95 90 90 97.5%
DeepSeek V4 Flash (Reasoning) 100 100 100 100 100 100 100 100 90 85 97.5%
Gemma 4 26B (Reasoning) 100 100 100 100 100 100 100 90 90 90 97.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100 100 95 85 85 96.5%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 100 100 95 95 90 85 96.5%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 100 90 90 90 90 96.0%
o4 Mini High 100 100 100 100 100 95 95 90 90 90 96.0%
GPT-5.2 100 100 100 95 95 95 95 95 95 90 96.0%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 100 100 100 85 85 85 95.5%
Gemini 3 Pro (Preview) 100 100 100 100 100 100 100 85 85 85 95.5%
ByteDance Seed 2.0 Mini 100 100 100 100 100 100 90 90 90 85 95.5%
Qwen 3.5 Flash 100 100 100 100 100 100 100 85 85 85 95.5%
GPT-5 100 100 100 100 100 100 95 90 85 85 95.5%
Claude Opus 4.5 100 100 100 100 100 90 90 90 90 90 95.0%
Aion 2.0 100 100 100 100 100 90 90 90 90 90 95.0%
Z.AI GLM 4.7 100 100 100 100 100 100 100 90 85 75 95.0%
Stealth: Healer Alpha 100 100 100 100 100 100 100 90 80 80 95.0%
o4 Mini 100 100 100 100 100 90 90 90 90 85 94.5%
Qwen 3.5 122B 100 100 100 100 100 100 90 85 85 85 94.5%
Xiaomi MIMO v2.5 100 100 100 100 100 100 90 85 85 85 94.5%
Z.AI GLM 5 100 100 100 100 100 90 90 90 85 85 94.0%
GPT-5.1 100 100 100 100 95 90 90 90 90 80 93.5%
ByteDance Seed 1.6 100 100 100 100 100 100 90 85 85 75 93.5%
Claude Opus 4.7 100 100 100 100 100 100 85 85 85 80 93.5%
Gemini 3.5 Flash (Reasoning, Minimal) 100 100 100 100 100 85 85 85 85 85 92.5%
Stealth: Hunter Alpha 100 100 100 100 100 90 85 85 85 80 92.5%
Gemini 2.5 Flash 100 100 95 95 90 90 90 90 90 80 92.0%
Nemotron 3 Super 100 100 90 90 90 90 90 90 90 80 91.0%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 90 90 90 90 90 80 70 90.0%
ByteDance Seed 2.0 Lite 90 90 90 90 90 90 90 90 90 90 90.0%
MoonshotAI: Kimi K2.6 100 100 100 100 100 100 100 100 95 0 89.5%
DeepSeek V4 Pro (Reasoning) 100 100 100 100 100 100 95 95 95 0 88.5%
Inception Mercury 2 100 95 90 90 90 90 90 85 75 75 88.0%
Mistral Large 2 100 90 90 90 90 90 85 80 80 80 87.5%
Claude Sonnet 4.6 90 90 90 90 90 90 90 90 75 75 87.0%
Claude Sonnet 4 90 90 90 90 90 90 90 80 80 80 87.0%
Claude 3.7 Sonnet 90 90 90 90 90 90 90 90 75 75 87.0%
GPT-5.4 Mini (Reasoning, Low) 90 90 90 90 90 90 90 85 75 75 86.5%
Qwen 3.6 27B 100 100 100 100 100 100 100 90 60 15 86.5%
Z.AI GLM 4.5 Air 100 100 90 90 85 85 85 80 80 70 86.5%
DeepSeek V4 Pro 100 95 85 85 85 85 85 85 85 75 86.5%
Qwen 3.5 9B 100 100 100 100 90 90 90 90 85 10 85.5%
Mistral Large 3 100 90 90 90 90 85 80 75 75 75 85.0%
Qwen 3.5 Plus (2026-02-15) 100 100 85 85 85 85 85 80 70 70 84.5%
Claude Opus 4 100 100 90 85 85 85 85 75 70 65 84.0%
GPT-OSS 120B 100 100 90 90 90 80 80 75 75 60 84.0%
Mistral Small 4 (Reasoning) 100 90 90 90 90 80 80 80 80 45 82.5%
Gemini 3 Flash (Preview) 100 85 85 85 85 85 85 70 70 70 82.0%
Mistral Large 90 90 90 90 90 90 80 65 65 65 81.5%
Stealth: Aurora Alpha 90 90 90 90 90 85 80 75 75 50 81.5%
Z.AI GLM 4.6 100 100 100 100 90 85 85 75 55 15 80.5%
MiniMax M2.7 100 90 85 80 80 80 80 75 70 65 80.5%
Gemini 3.1 Flash Lite 85 85 85 85 85 85 80 70 70 70 80.0%
GPT-4o, May 13th (temp=0) 80 80 80 80 80 80 80 80 70 70 78.0%
Gemini 3.1 Flash Lite (Reasoning) 95 85 85 85 85 75 70 70 65 60 77.5%
DeepSeek V3.2 95 85 85 80 80 75 70 65 60 60 75.5%
Gemma 4 26B 85 85 85 85 85 85 85 85 70 0 75.0%
MiniMax M2.5 90 90 80 80 80 70 70 65 60 55 74.0%
Gemini 3.1 Flash Lite (Preview) 90 90 75 75 75 75 75 70 65 50 74.0%
Claude 3.5 Sonnet 90 80 75 75 75 75 75 65 65 65 74.0%
Grok 4.20 95 90 85 85 80 70 65 60 60 45 73.5%
DeepSeek-V2 Chat 90 90 85 85 85 80 70 55 40 40 72.0%
Z.AI GLM 4.7 Flash 90 90 85 85 80 70 70 65 45 40 72.0%
Ministral 3 14B 85 80 80 75 75 70 70 70 55 45 70.5%
Claude Haiku 4.5 90 90 85 75 75 75 75 70 60 0 69.5%
GPT-5 Nano 95 90 80 65 65 65 65 60 55 50 69.0%
GPT-4o, Aug. 6th (temp=1) 80 80 80 80 75 70 65 55 55 40 68.0%
DeepSeek V3 (2024-12-26) 90 85 85 70 65 65 60 55 55 45 67.5%
Nemotron 3 Nano 90 80 75 75 70 65 60 60 50 45 67.0%
GPT-4o, Aug. 6th (temp=0) 80 80 70 70 70 65 65 65 55 40 66.0%
Qwen3 235B A22B Instruct 2507 85 80 75 70 70 65 60 50 50 50 65.5%
Grok 4.20 (Beta) 80 80 80 75 65 60 55 55 50 50 65.0%
GPT-4.1 100 75 70 70 65 65 65 60 50 30 65.0%
Mistral Medium 3.1 90 85 75 70 70 65 55 55 45 40 65.0%
DeepSeek V3.1 90 85 75 70 70 65 65 60 35 30 64.5%
ByteDance Seed 1.6 Flash 80 80 80 70 70 65 60 60 45 35 64.5%
DeepSeek V3 (2025-03-24) 85 80 75 70 65 60 55 55 50 45 64.0%
DeepSeek V4 Flash 80 80 75 70 70 65 65 65 55 5 63.0%
Writer: Palmyra X5 75 75 70 65 65 65 60 60 50 40 62.5%
GPT-4o, May 13th (temp=1) 75 70 70 65 60 55 55 55 50 45 60.0%
GPT-4.1 Mini 70 65 65 65 60 60 55 55 50 45 59.0%
GPT-5.4 Mini 95 95 95 90 80 70 30 5 5 5 57.0%
Gemma 3 27B 65 60 60 60 55 55 55 55 55 50 57.0%
Z.AI GLM 4.5 80 75 75 75 60 55 50 45 15 10 54.0%
Inception Mercury 85 65 65 60 55 55 50 45 40 20 54.0%
Grok 4.3 75 70 70 60 55 55 55 50 45 0 53.5%
Mistral Small Creative 70 70 65 65 65 65 50 25 15 15 50.5%
Qwen 3 32B 90 70 60 60 60 55 50 25 20 0 49.0%
Ministral 3 8B 70 70 65 65 65 60 35 25 25 0 48.0%
Hermes 3 405B 55 50 50 50 50 50 45 40 40 35 46.5%
GPT-5.4 Nano (Reasoning, Low) 85 85 80 75 75 25 0 0 0 0 42.5%
Mistral Small 3.2 24B 55 50 50 50 50 35 35 35 35 0 39.5%
Ministral 3 3B 55 50 50 50 50 45 35 25 20 10 39.0%
GPT-5.4 Nano (Reasoning) 90 90 90 70 40 0 0 0 0 0 38.0%
Mistral Small 4 60 60 50 45 35 35 30 25 20 0 36.0%
Llama 3.1 Nemotron 70B 65 50 50 50 35 35 30 15 10 0 34.0%
Qwen 2.5 72B 50 45 45 40 35 35 35 25 15 10 33.5%
WizardLM 2 8x22b 50 50 45 35 30 30 30 15 15 10 31.0%
Llama 3.1 70B 60 50 40 40 35 30 20 15 10 0 30.0%
Arcee AI: Trinity Large (Preview) 65 60 50 50 35 30 0 0 0 0 29.0%
Ministral 3B 50 40 35 30 20 20 20 15 15 15 26.0%
Ministral 8B 55 55 45 40 25 0 0 0 0 0 22.0%
Hermes 3 70B 40 40 30 25 20 15 15 10 10 5 21.0%
GPT-5.4 Nano 70 55 35 20 20 5 0 0 0 0 20.5%
Gemini 2.5 Flash Lite 45 40 30 25 25 20 10 0 0 0 19.5%
Cohere Command R+ (Aug. 2024) 35 35 20 20 20 5 0 0 0 0 13.5%
Arcee AI: Trinity Mini 30 25 20 15 15 10 10 0 0 0 12.5%
GPT-4o Mini (temp=0) 30 25 20 20 5 5 5 5 5 0 12.0%
GPT-4o Mini (temp=1) 20 15 15 5 5 5 5 0 0 0 7.0%
Llama 3.1 8B 30 20 0 0 0 0 0 0 0 0 5.0%
Mistral NeMO 35 5 0 0 0 0 0 0 0 0 4.0%
Gemma 3 12B 15 5 5 5 5 0 0 0 0 0 3.5%
GPT-4.1 Nano 0 0 0 0 0 0 0 0 0 0 0.0%
Claude 3 Haiku 0 0 0 0 0 0 0 0 0 0 0.0%
Gemma 3 4B 0 0 0 0 0 0 0 0 0 0 0.0%
LFM2 24B 0 0 0 0 0 0 0 0 0 0 0.0%
Rocinante 12B 0 0 0 0 0 0 0 0 0 0 0.0%
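
The Avg column in these tables appears to be the arithmetic mean of the ten numbered score columns. The Python sketch below checks that against two rows taken from the tables above. The function names and the 0.5-point tolerance are illustrative assumptions, not part of the leaderboard; the tolerance is there because the per-column values look rounded to whole percentages, so a recomputed mean can differ slightly from the listed one.

```python
def row_average(scores):
    """Mean of the ten displayed scores, rounded to one decimal place."""
    if len(scores) != 10:
        raise ValueError("expected ten score columns")
    return round(sum(scores) / len(scores), 1)


def is_consistent(scores, listed_avg, tolerance=0.5):
    """True if the listed Avg matches the recomputed mean within `tolerance`.

    `tolerance` is an assumed allowance for per-column rounding.
    """
    return abs(sum(scores) / len(scores) - listed_avg) <= tolerance


# Claude Opus 4 row from the last table above (listed Avg: 84.0%).
claude_opus_4 = [100, 100, 90, 85, 85, 85, 85, 75, 70, 65]
print(row_average(claude_opus_4))

# Gemini 3.1 Pro (Preview) row from the fourth table (listed Avg: 99.1%).
# The displayed columns give a slightly different mean, presumably
# because each column is itself a rounded percentage.
gemini_31_pro = [100] * 8 + [95, 95]
print(row_average(gemini_31_pro), is_consistent(gemini_31_pro, 99.1))
```

The same check is a convenient way to confirm where one score ends and the next begins when a row has been copied out of the page without cell separators.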