Accuracy (recall)

Test: Codex Violation Detection

Avg. Score: 69.2%
Scenarios: 8

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 2.5 Flash (Reasoning) | 94.0% | $0.0079 | 12.5s | 86%
2 | GPT-5.4 | 95.9% | $0.013 | 8.8s | 85%
3 | Grok 4.1 Fast | 93.8% | $0.0021 | 21.1s | 84%
4 | Z.AI GLM 5 Turbo | 94.8% | $0.0072 | 17.1s | 85%
5 | Gemini 3 Flash (Preview, Reasoning) | 94.7% | $0.011 | 18.0s | 86%
6 | Grok 4 Fast | 92.4% | $0.0018 | 12.8s | 72%
7 | Grok 4.20 (Beta, Reasoning) | 96.1% | $0.026 | 16.8s | 89%
8 | Gemini 3 Flash (Preview) | 88.8% | $0.0031 | 4.5s | 69%
9 | Claude Opus 4.5 | 98.0% | $0.041 | 9.7s | 91%
10 | GPT-5.2 | 95.7% | $0.024 | 24.2s | 87%
11 | Claude Sonnet 4.5 | 93.6% | $0.024 | 8.9s | 80%
12 | GPT-5.4 (Reasoning, Low) | 92.4% | $0.020 | 13.5s | 79%
13 | Gemini 2.5 Pro | 97.4% | $0.035 | 23.2s | 91%
14 | Gemini 2.5 Flash | 82.5% | $0.0025 | 2.8s | 69%
15 | Claude Opus 4.6 | 97.3% | $0.040 | 10.2s | 88%
16 | Stealth: Healer Alpha | 89.0% | $0.0000 | 21.8s | 67%
17 | Stealth: Hunter Alpha | 89.5% | $0.0000 | 44.0s | 71%
18 | Inception Mercury 2 | 82.2% | $0.0030 | 4.6s | 62%
19 | Claude Sonnet 4 | 90.3% | $0.023 | 9.0s | 72%
20 | Qwen 3.5 Flash | 92.8% | $0.0038 | 1.0m | 77%
21 | GPT-5.4 Mini (Reasoning, Low) | 83.0% | $0.0055 | 6.7s | 61%
22 | Gemini 3.1 Flash Lite (Preview) | 81.9% | $0.0019 | 2.2s | 57%
23 | Qwen 3.5 35B | 93.3% | $0.017 | 54.8s | 81%
24 | ByteDance Seed 1.6 | 91.9% | $0.0067 | 1.0m | 77%
25 | Z.AI GLM 5 | 92.1% | $0.013 | 56.4s | 79%
26 | o4 Mini | 88.9% | $0.019 | 28.1s | 74%
27 | GPT-5 Mini | 91.2% | $0.0092 | 51.7s | 74%
28 | Gemini 2.5 Flash Lite (Reasoning) | 81.1% | $0.0020 | 17.6s | 62%
29 | Z.AI GLM 4.7 | 91.3% | $0.0091 | 1.0m | 77%
30 | Claude Sonnet 4.6 | 82.9% | $0.024 | 9.9s | 70%
31 | ByteDance Seed 2.0 Lite | 89.8% | $0.0067 | 1.1m | 73%
32 | GPT-5.4 Mini (Reasoning) | 87.9% | $0.020 | 28.6s | 68%
33 | Mistral Large 3 | 75.5% | $0.0030 | 10.2s | 58%
34 | Mistral Large | 79.3% | $0.012 | 9.1s | 59%
35 | MiniMax M2.7 | 79.1% | $0.0022 | 29.4s | 59%
36 | Qwen 3.5 Plus (2026-02-15) | 85.0% | $0.0041 | 34.2s | 55%
37 | Qwen 3.5 27B | 95.6% | $0.021 | 1.7m | 89%
38 | Qwen 3.5 122B | 92.7% | $0.026 | 1.2m | 81%
39 | Mistral Large 2 | 75.9% | $0.012 | 8.8s | 57%
40 | MoonshotAI: Kimi K2.5 | 92.9% | $0.015 | 1.5m | 79%
41 | GPT-5.1 | 92.7% | $0.037 | 52.8s | 79%
42 | o4 Mini High | 90.7% | $0.033 | 51.3s | 76%
43 | Grok 4.20 (Beta) | 74.8% | $0.0059 | 2.5s | 47%
44 | Mistral Medium 3.1 | 75.1% | $0.0029 | 7.7s | 46%
45 | Aion 2.0 | 91.7% | $0.0084 | 1.3m | 64%
46 | MiniMax M2.5 | 74.6% | $0.0020 | 25.5s | 52%
47 | GPT-5.4 (Reasoning) | 91.7% | $0.042 | 43.1s | 75%
48 | Gemini 3.1 Pro (Preview) | 98.3% | $0.068 | 52.3s | 93%
49 | DeepSeek V3.2 | 75.0% | $0.0013 | 17.2s | 45%
50 | Claude Opus 4.6 (Reasoning) | 97.3% | $0.076 | 32.0s | 90%
51 | GPT-4.1 | 77.7% | $0.0081 | 10.0s | 44%
52 | Z.AI GLM 4.5 | 75.6% | $0.0026 | 18.5s | 45%
53 | Mistral Small 4 (Reasoning) | 70.9% | $0.0020 | 16.3s | 47%
54 | ByteDance Seed 1.6 Flash | 67.9% | $0.0009 | 12.0s | 46%
55 | Z.AI GLM 4.6 | 86.1% | $0.0049 | 1.3m | 59%
56 | Gemini 3 Pro (Preview) | 91.0% | $0.050 | 34.0s | 70%
57 | DeepSeek-V2 Chat | 71.1% | $0.0020 | 13.1s | 41%
58 | GPT-4.1 Mini | 63.7% | $0.0015 | 5.8s | 43%
59 | Claude Haiku 4.5 | 66.8% | $0.0078 | 5.8s | 45%
60 | DeepSeek V3 (2024-12-26) | 69.9% | $0.0019 | 15.6s | 41%
61 | Grok 4 | 94.1% | $0.051 | 1.2m | 80%
62 | Writer: Palmyra X5 | 67.2% | $0.0062 | 12.1s | 44%
63 | Qwen3 235B A22B Instruct 2507 | 67.3% | $0.0007 | 21.1s | 43%
64 | Stealth: Aurora Alpha | 77.5% | | 6.2s | 46%
65 | Claude 3.7 Sonnet | 77.4% | $0.023 | 10.2s | 42%
66 | Claude Sonnet 4.6 (Reasoning) | 94.9% | $0.076 | 51.3s | 84%
67 | Inception Mercury | 63.1% | $0.0005 | 9.5s | 34%
68 | DeepSeek V3 (2025-03-24) | 68.9% | $0.0015 | 22.9s | 34%
69 | GPT-4o, Aug. 6th (temp=1) | 65.4% | $0.011 | 3.6s | 36%
70 | Mistral Small Creative | 56.8% | $0.0006 | 4.3s | 36%
71 | GPT-4o, Aug. 6th (temp=0) | 67.9% | $0.015 | 5.0s | 36%
72 | ByteDance Seed 2.0 Mini | 90.8% | $0.0029 | 2.7m | 73%
73 | Claude 3.5 Sonnet | 79.2% | $0.042 | 10.5s | 47%
74 | DeepSeek V3.1 | 66.1% | $0.0014 | 31.1s | 33%
75 | Z.AI GLM 4.7 Flash | 68.2% | $0.0018 | 1.1m | 45%
76 | Qwen 3 32B | 64.1% | $0.0010 | 27.8s | 31%
77 | Mistral Small 3.2 24B | 53.8% | $0.0006 | 10.4s | 32%
78 | GPT-5 | 93.3% | $0.061 | 1.6m | 81%
79 | GPT-4o, May 13th (temp=0) | 71.6% | $0.030 | 5.4s | 36%
80 | GPT-5.4 Nano (Reasoning) | 67.5% | $0.0035 | 16.6s | 23%
81 | Gemma 3 27B | 53.2% | $0.0005 | 13.3s | 33%
82 | GPT-5.4 Mini | 54.5% | $0.0033 | 3.1s | 29%
83 | Nemotron 3 Super | 83.1% | $0.0000 | 2.3m | 55%
84 | Qwen 3.5 9B | 84.7% | $0.0020 | 2.4m | 55%
85 | Gemini 2.5 Flash Lite | 47.9% | $0.0005 | 2.3s | 24%
86 | Claude 3.5 Haiku | 51.2% | $0.0049 | 6.2s | 25%
87 | GPT-4o, May 13th (temp=1) | 58.1% | $0.026 | 3.7s | 34%
88 | Qwen 3.5 397B A17B | 93.7% | $0.026 | 2.9m | 75%
89 | Ministral 3 14B | 46.4% | $0.0010 | 6.3s | 25%
90 | Hermes 3 405B | 56.2% | $0.0044 | 17.2s | 20%
91 | Llama 3.1 Nemotron 70B | 55.8% | $0.0055 | 18.5s | 19%
92 | Ministral 3 8B | 39.9% | $0.0007 | 4.8s | 20%
93 | Llama 3.1 70B | 51.2% | $0.0021 | 24.3s | 18%
94 | Qwen 2.5 72B | 42.3% | $0.0008 | 13.9s | 18%
95 | GPT-5 Nano | 67.2% | $0.0049 | 1.9m | 42%
96 | Ministral 8B | 34.7% | $0.0005 | 5.6s | 18%
97 | Arcee AI: Trinity Mini | 33.3% | $0.0004 | 10.3s | 18%
98 | Mistral Small 4 | 30.7% | $0.0008 | 4.3s | 18%
99 | Claude Opus 4 | 84.4% | $0.116 | 15.8s | 65%
100 | Ministral 3 3B | 28.6% | $0.0005 | 3.3s | 16%
101 | Gemma 3 12B | 35.9% | $0.0003 | 12.0s | 11%
102 | GPT-4o Mini (temp=1) | 29.4% | $0.0006 | 7.4s | 13%
103 | GPT-4o Mini (temp=0) | 31.3% | $0.0006 | 25.0s | 19%
104 | Ministral 3B | 25.0% | $0.0002 | 2.9s | 14%
105 | Cohere Command R+ (Aug. 2024) | 32.3% | $0.014 | 10.0s | 15%
106 | GPT-5.4 Nano (Reasoning, Low) | 32.9% | $0.0013 | 6.1s | 2%
107 | Claude 3 Haiku | 19.4% | $0.0015 | 3.6s | 12%
108 | Hermes 3 70B | 28.5% | $0.0013 | 29.2s | 14%
109 | GPT-5.4 Nano | 21.3% | $0.0008 | 2.9s | 4%
110 | Mistral NeMO | 20.6% | $0.0007 | 13.3s | 7%
111 | Llama 3.1 8B | 16.1% | $0.0002 | 16.1s | 0%
112 | WizardLM 2 8x22b | 17.1% | $0.0036 | 15.7s | 0%
113 | Rocinante 12B | 6.4% | $0.0009 | 6.0s | 0%
114 | Nemotron 3 Nano | 64.5% | $0.0031 | 3.5m | 37%
115 | GPT-4.1 Nano | 2.6% | $0.0004 | 3.9s | 0%
116 | Gemma 3 4B | 3.2% | $0.0002 | 12.5s | 0%
117 | Arcee AI: Trinity Large (Preview) | 46.2% | $0.0000 | 3.1m | 24%
118 | LFM2 24B | 0.0% | $0.0008 | 1.9m | 0%
Average score: 69.24%

Individual Scenarios

matrix

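Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Grok 4 | 100 | 98 | 98 | 97 | 97 | 95 | 95 | 94 | 94 | 91 | 96.1%
Claude Sonnet 4.6 (Reasoning) | 97 | 95 | 95 | 94 | 94 | 94 | 94 | 92 | 92 | 88 | 93.6%
Qwen 3.5 397B A17B | 100 | 100 | 97 | 95 | 94 | 94 | 89 | 89 | 88 | 85 | 93.2%
GPT-5.4 | 97 | 94 | 94 | 92 | 92 | 92 | 91 | 91 | 91 | 91 | 92.6%
Gemini 2.5 Pro | 100 | 95 | 92 | 92 | 92 | 89 | 88 | 88 | 88 | 86 | 91.2%
GPT-5.2 | 97 | 97 | 94 | 92 | 92 | 91 | 89 | 88 | 86 | 85 | 91.2%
Qwen 3.5 27B | 94 | 92 | 92 | 92 | 91 | 91 | 91 | 91 | 88 | 86 | 90.9%
Grok 4 Fast | 100 | 100 | 97 | 92 | 91 | 89 | 86 | 86 | 83 | 82 | 90.8%
Grok 4.1 Fast | 100 | 97 | 94 | 91 | 91 | 91 | 89 | 89 | 82 | 82 | 90.6%
GPT-5 | 94 | 94 | 92 | 92 | 91 | 91 | 88 | 88 | 88 | 88 | 90.6%
Grok 4.20 (Beta, Reasoning) | 97 | 97 | 94 | 92 | 89 | 89 | 88 | 86 | 86 | 82 | 90.2%
Claude Opus 4.6 (Reasoning) | 94 | 92 | 91 | 91 | 91 | 91 | 91 | 88 | 86 | 85 | 90.0%
Gemini 3.1 Pro (Preview) | 94 | 94 | 91 | 91 | 91 | 89 | 88 | 88 | 88 | 86 | 90.0%
GPT-5.4 (Reasoning, Low) | 94 | 94 | 92 | 92 | 91 | 89 | 88 | 88 | 86 | 83 | 89.8%
GPT-5.4 (Reasoning) | 94 | 94 | 92 | 91 | 91 | 91 | 86 | 85 | 85 | 82 | 89.1%
Claude Opus 4.5 | 92 | 92 | 91 | 91 | 89 | 89 | 88 | 88 | 85 | 83 | 88.9%
Z.AI GLM 5 Turbo | 92 | 92 | 91 | 89 | 89 | 86 | 86 | 83 | 83 | 83 | 87.7%
Nemotron 3 Super | 94 | 91 | 89 | 89 | 89 | 88 | 86 | 83 | 83 | 80 | 87.4%
GPT-5 Mini | 91 | 91 | 88 | 88 | 86 | 86 | 85 | 85 | 85 | 85 | 87.0%
Gemini 2.5 Flash (Reasoning) | 91 | 89 | 89 | 89 | 88 | 85 | 85 | 85 | 85 | 77 | 86.4%
ByteDance Seed 1.6 | 94 | 91 | 91 | 89 | 88 | 85 | 83 | 82 | 80 | 79 | 86.2%
Z.AI GLM 4.6 | 94 | 92 | 92 | 88 | 88 | 86 | 83 | 82 | 77 | 77 | 86.1%
o4 Mini High | 94 | 91 | 91 | 88 | 85 | 83 | 82 | 82 | 82 | 79 | 85.6%
GPT-5.1 | 94 | 91 | 88 | 85 | 85 | 85 | 82 | 82 | 80 | 80 | 85.2%
Qwen 3.5 122B | 89 | 88 | 86 | 86 | 83 | 82 | 82 | 80 | 79 | 79 | 83.5%
Z.AI GLM 5 | 97 | 85 | 85 | 83 | 82 | 82 | 82 | 79 | 79 | 77 | 83.0%
Gemini 3 Flash (Preview, Reasoning) | 89 | 88 | 85 | 83 | 83 | 82 | 82 | 82 | 82 | 74 | 83.0%
Qwen 3.5 35B | 92 | 91 | 89 | 85 | 85 | 83 | 79 | 79 | 76 | 71 | 83.0%
Aion 2.0 | 100 | 97 | 94 | 92 | 91 | 89 | 88 | 88 | 88 | 0 | 82.7%
Claude Opus 4.6 | 85 | 85 | 85 | 85 | 83 | 83 | 80 | 80 | 80 | 80 | 82.7%
Z.AI GLM 4.7 | 94 | 88 | 88 | 83 | 83 | 82 | 79 | 79 | 74 | 73 | 82.3%
Claude Opus 4 | 92 | 89 | 85 | 83 | 82 | 82 | 79 | 77 | 77 | 70 | 81.7%
Qwen 3.5 Flash | 94 | 91 | 86 | 85 | 82 | 82 | 79 | 77 | 76 | 62 | 81.4%
MoonshotAI: Kimi K2.5 | 91 | 86 | 86 | 86 | 82 | 82 | 80 | 80 | 73 | 65 | 81.2%
Claude Sonnet 4.6 | 86 | 82 | 82 | 82 | 80 | 77 | 77 | 76 | 76 | 74 | 79.2%
Claude Sonnet 4 | 83 | 82 | 80 | 80 | 80 | 79 | 79 | 77 | 73 | 71 | 78.5%
Gemini 3 Flash (Preview) | 85 | 82 | 82 | 80 | 79 | 76 | 76 | 76 | 74 | 74 | 78.3%
Claude Sonnet 4.5 | 85 | 83 | 82 | 80 | 79 | 77 | 77 | 77 | 71 | 70 | 78.2%
o4 Mini | 88 | 86 | 82 | 82 | 79 | 76 | 76 | 73 | 70 | 67 | 77.7%
Stealth: Hunter Alpha | 88 | 83 | 83 | 82 | 80 | 77 | 76 | 73 | 68 | 59 | 77.0%
GPT-5.4 Mini (Reasoning) | 83 | 82 | 82 | 80 | 80 | 74 | 73 | 73 | 71 | 70 | 76.8%
ByteDance Seed 2.0 Lite | 85 | 85 | 83 | 80 | 79 | 77 | 76 | 68 | 65 | 62 | 76.1%
Stealth: Healer Alpha | 94 | 91 | 91 | 82 | 79 | 76 | 74 | 62 | 58 | 42 | 74.8%
ByteDance Seed 2.0 Mini | 83 | 82 | 76 | 74 | 74 | 74 | 74 | 73 | 73 | 64 | 74.7%
Qwen 3.5 9B | 85 | 82 | 79 | 77 | 77 | 76 | 73 | 71 | 62 | 62 | 74.4%
Gemini 2.5 Flash | 79 | 77 | 76 | 76 | 74 | 73 | 70 | 64 | 61 | 59 | 70.8%
Gemini 3 Pro (Preview) | 83 | 82 | 80 | 76 | 76 | 76 | 74 | 73 | 67 | 15 | 70.2%
Inception Mercury 2 | 79 | 77 | 74 | 71 | 71 | 70 | 70 | 70 | 61 | 58 | 70.0%
Gemini 2.5 Flash Lite (Reasoning) | 79 | 77 | 76 | 73 | 73 | 68 | 67 | 67 | 62 | 55 | 69.5%
GPT-5.4 Mini (Reasoning, Low) | 77 | 77 | 71 | 70 | 70 | 68 | 67 | 64 | 62 | 59 | 68.5%
Qwen 3.5 Plus (2026-02-15) | 92 | 85 | 83 | 82 | 79 | 77 | 76 | 39 | 38 | 0 | 65.2%
Mistral Large | 79 | 70 | 67 | 67 | 65 | 65 | 62 | 61 | 59 | 58 | 65.2%
Mistral Large 3 | 73 | 73 | 70 | 68 | 68 | 64 | 61 | 61 | 56 | 52 | 64.4%
GPT-4.1 | 77 | 74 | 70 | 67 | 65 | 64 | 62 | 58 | 56 | 45 | 63.8%
MiniMax M2.7 | 77 | 71 | 68 | 64 | 64 | 64 | 59 | 56 | 55 | 48 | 62.6%
Stealth: Aurora Alpha | 68 | 67 | 65 | 65 | 59 | 58 | 58 | 58 | 55 | 42 | 59.4%
Gemini 3.1 Flash Lite (Preview) | 67 | 65 | 65 | 62 | 59 | 59 | 56 | 56 | 53 | 52 | 59.4%
Mistral Large 2 | 67 | 67 | 67 | 61 | 59 | 58 | 56 | 56 | 53 | 47 | 58.9%
GPT-5.4 Nano (Reasoning) | 82 | 80 | 76 | 74 | 70 | 68 | 64 | 59 | 0 | 0 | 57.3%
DeepSeek V3.2 | 94 | 65 | 59 | 59 | 52 | 47 | 47 | 47 | 44 | 39 | 55.3%
MiniMax M2.5 | 73 | 71 | 64 | 61 | 53 | 53 | 50 | 42 | 41 | 36 | 54.4%
Z.AI GLM 4.5 | 76 | 67 | 55 | 53 | 53 | 50 | 45 | 45 | 45 | 36 | 52.6%
Claude Haiku 4.5 | 67 | 58 | 50 | 50 | 50 | 50 | 48 | 47 | 47 | 42 | 50.9%
DeepSeek-V2 Chat | 61 | 58 | 56 | 53 | 50 | 50 | 47 | 44 | 39 | 33 | 49.1%
Mistral Medium 3.1 | 62 | 56 | 53 | 52 | 52 | 48 | 48 | 48 | 47 | 21 | 48.8%
DeepSeek V3 (2024-12-26) | 67 | 64 | 59 | 58 | 58 | 56 | 48 | 35 | 24 | 17 | 48.5%
Grok 4.20 (Beta) | 61 | 58 | 56 | 55 | 50 | 48 | 39 | 39 | 38 | 36 | 48.0%
GPT-5 Nano | 58 | 58 | 56 | 52 | 50 | 48 | 45 | 44 | 35 | 35 | 48.0%
Qwen3 235B A22B Instruct 2507 | 58 | 55 | 55 | 55 | 54 | 44 | 44 | 42 | 36 | 33 | 47.6%
DeepSeek V3.1 | 71 | 67 | 64 | 62 | 58 | 48 | 41 | 36 | 24 | 0 | 47.1%
Mistral Small 4 (Reasoning) | 53 | 50 | 50 | 47 | 47 | 44 | 44 | 41 | 39 | 29 | 44.4%
Claude 3.7 Sonnet | 61 | 59 | 48 | 44 | 39 | 39 | 38 | 38 | 35 | 30 | 43.2%
Writer: Palmyra X5 | 56 | 56 | 53 | 47 | 44 | 44 | 41 | 39 | 38 | 3 | 42.1%
Z.AI GLM 4.7 Flash | 53 | 47 | 45 | 42 | 41 | 38 | 33 | 33 | 32 | 29 | 39.4%
DeepSeek V3 (2025-03-24) | 68 | 62 | 48 | 47 | 45 | 33 | 32 | 30 | 23 | 0 | 38.9%
ByteDance Seed 1.6 Flash | 59 | 41 | 39 | 39 | 38 | 38 | 35 | 32 | 29 | 23 | 37.3%
Mistral Small Creative | 47 | 41 | 41 | 41 | 38 | 35 | 33 | 33 | 29 | 5 | 34.2%
Claude 3.5 Sonnet | 38 | 36 | 35 | 33 | 33 | 32 | 32 | 32 | 30 | 30 | 33.2%
Qwen 3 32B | 44 | 41 | 39 | 39 | 38 | 32 | 26 | 18 | 17 | 12 | 30.6%
GPT-4.1 Mini | 53 | 44 | 33 | 33 | 30 | 29 | 26 | 21 | 20 | 14 | 30.3%
Nemotron 3 Nano | 39 | 38 | 36 | 32 | 30 | 29 | 29 | 27 | 14 | 3 | 27.7%
Inception Mercury | 52 | 38 | 35 | 35 | 29 | 23 | 20 | 20 | 17 | 2 | 26.8%
GPT-5.4 Nano (Reasoning, Low) | 59 | 48 | 47 | 41 | 39 | 32 | 0 | 0 | 0 | 0 | 26.7%
GPT-5.4 Mini | 38 | 32 | 32 | 29 | 29 | 29 | 23 | 18 | 15 | 14 | 25.8%
Mistral Small 3.2 24B | 44 | 39 | 39 | 35 | 24 | 21 | 18 | 18 | 9 | 0 | 24.8%
GPT-4o, May 13th (temp=1) | 36 | 30 | 29 | 29 | 29 | 26 | 24 | 18 | 17 | 6 | 24.4%
GPT-4o, Aug. 6th (temp=0) | 44 | 35 | 29 | 29 | 29 | 29 | 27 | 23 | 0 | 0 | 24.4%
GPT-4o, Aug. 6th (temp=1) | 35 | 32 | 29 | 27 | 24 | 21 | 20 | 15 | 14 | 14 | 23.0%
Ministral 3 8B | 35 | 30 | 29 | 24 | 21 | 21 | 20 | 18 | 14 | 3 | 21.5%
Ministral 3 14B | 32 | 32 | 27 | 24 | 24 | 21 | 14 | 6 | 0 | 0 | 18.0%
GPT-4o, May 13th (temp=0) | 39 | 35 | 33 | 33 | 32 | 0 | 0 | 0 | 0 | 0 | 17.3%
Qwen 2.5 72B | 27 | 26 | 20 | 17 | 15 | 15 | 15 | 12 | 9 | 8 | 16.4%
Ministral 8B | 33 | 32 | 24 | 23 | 23 | 12 | 2 | 2 | 2 | 2 | 15.3%
Gemma 3 27B | 36 | 26 | 15 | 15 | 14 | 12 | 11 | 9 | 9 | 0 | 14.7%
Claude 3.5 Haiku | 24 | 18 | 15 | 15 | 14 | 12 | 11 | 9 | 9 | 0 | 12.7%
Mistral Small 4 | 29 | 20 | 18 | 12 | 12 | 11 | 9 | 2 | 0 | 0 | 11.2%
Gemini 2.5 Flash Lite | 32 | 30 | 26 | 18 | 5 | 0 | 0 | 0 | 0 | 0 | 11.1%
Hermes 3 405B | 27 | 18 | 11 | 11 | 8 | 8 | 6 | 6 | 5 | 2 | 10.0%
Hermes 3 70B | 14 | 11 | 11 | 8 | 8 | 8 | 5 | 2 | 0 | 0 | 6.4%
Llama 3.1 Nemotron 70B | 12 | 12 | 11 | 8 | 6 | 5 | 3 | 0 | 0 | 0 | 5.6%
Arcee AI: Trinity Mini | 15 | 9 | 6 | 6 | 5 | 3 | 3 | 2 | 0 | 0 | 4.8%
Cohere Command R+ (Aug. 2024) | 12 | 12 | 9 | 6 | 5 | 3 | 0 | 0 | 0 | 0 | 4.7%
Gemma 3 12B | 11 | 9 | 8 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 3.8%
GPT-5.4 Nano | 23 | 9 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3.8%
Llama 3.1 70B | 12 | 6 | 5 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 2.7%
GPT-4o Mini (temp=1) | 9 | 6 | 6 | 5 | 2 | 0 | 0 | 0 | 0 | 0 | 2.7%
Claude 3 Haiku | 11 | 8 | 3 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | 2.6%
WizardLM 2 8x22b | 9 | 6 | 6 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 2.4%
Mistral NeMO | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.1%
GPT-4o Mini (temp=0) | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.8%
Llama 3.1 8B | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.8%
Ministral 3 3B | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.8%
GPT-4.1 Nano | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3%
Ministral 3B | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3%
Gemma 3 4B | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2%
Arcee AI: Trinity Large (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Rocinante 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%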
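
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 98.9%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 97 | 92 | 97.9%
GPT-5.4 | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 97 | 92 | 97.9%
Grok 4 | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 96.8%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 92 | 92 | 87 | 96.6%
Claude Sonnet 4.6 (Reasoning) | 100 | 97 | 97 | 97 | 97 | 97 | 97 | 92 | 92 | 92 | 96.1%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 97 | 97 | 95 | 95 | 92 | 92 | 92 | 96.1%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 92 | 87 | 95.8%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 92 | 92 | 92 | 92 | 87 | 84 | 93.9%
GPT-5.2 | 97 | 97 | 95 | 95 | 95 | 92 | 89 | 89 | 89 | 87 | 92.6%
Qwen 3.5 35B | 100 | 97 | 97 | 95 | 92 | 92 | 89 | 89 | 87 | 87 | 92.6%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 95 | 95 | 95 | 92 | 87 | 87 | 87 | 87 | 92.4%
Grok 4.20 (Beta) | 97 | 97 | 97 | 97 | 97 | 92 | 89 | 89 | 87 | 79 | 92.4%
ByteDance Seed 1.6 | 100 | 95 | 95 | 95 | 92 | 89 | 89 | 89 | 89 | 84 | 91.8%
Gemini 3 Flash (Preview, Reasoning) | 100 | 97 | 95 | 95 | 92 | 92 | 89 | 87 | 87 | 84 | 91.8%
Grok 4.20 (Beta, Reasoning) | 95 | 95 | 95 | 95 | 95 | 89 | 89 | 89 | 89 | 84 | 91.6%
GPT-5.4 (Reasoning) | 95 | 95 | 95 | 92 | 92 | 92 | 92 | 89 | 87 | 87 | 91.6%
Qwen 3.5 Flash | 100 | 95 | 92 | 92 | 89 | 89 | 89 | 89 | 87 | 87 | 91.1%
Gemini 2.5 Flash (Reasoning) | 100 | 95 | 92 | 89 | 89 | 89 | 89 | 89 | 87 | 84 | 90.5%
Qwen 3.5 122B | 100 | 95 | 92 | 92 | 92 | 89 | 89 | 87 | 87 | 82 | 90.5%
Qwen 3.5 27B | 92 | 92 | 92 | 92 | 89 | 89 | 89 | 89 | 89 | 87 | 90.3%
Grok 4 Fast | 95 | 95 | 95 | 89 | 89 | 89 | 89 | 89 | 89 | 79 | 90.0%
GPT-5 | 95 | 95 | 92 | 89 | 89 | 89 | 89 | 89 | 87 | 82 | 89.7%
Qwen 3.5 397B A17B | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 84 | 84 | 84 | 89.7%
Grok 4.1 Fast | 95 | 95 | 95 | 92 | 92 | 89 | 89 | 87 | 84 | 79 | 89.7%
GPT-5.4 (Reasoning, Low) | 95 | 95 | 89 | 89 | 89 | 89 | 89 | 89 | 84 | 84 | 89.5%
Aion 2.0 | 95 | 95 | 89 | 89 | 89 | 89 | 89 | 87 | 84 | 76 | 88.4%
GPT-5.4 Mini (Reasoning) | 95 | 95 | 89 | 89 | 89 | 89 | 84 | 84 | 84 | 84 | 88.4%
Z.AI GLM 4.7 | 100 | 92 | 92 | 92 | 89 | 87 | 84 | 84 | 84 | 74 | 87.9%
Z.AI GLM 5 | 100 | 100 | 95 | 92 | 89 | 82 | 82 | 79 | 79 | 79 | 87.6%
Stealth: Hunter Alpha | 100 | 95 | 95 | 92 | 89 | 87 | 87 | 84 | 82 | 66 | 87.6%
Stealth: Healer Alpha | 100 | 92 | 89 | 89 | 89 | 87 | 84 | 84 | 82 | 79 | 87.6%
GPT-5.1 | 92 | 92 | 92 | 89 | 89 | 89 | 84 | 82 | 82 | 82 | 87.4%
Gemini 3 Flash (Preview) | 92 | 92 | 92 | 92 | 92 | 84 | 84 | 84 | 84 | 76 | 87.4%
o4 Mini High | 95 | 95 | 87 | 87 | 87 | 87 | 84 | 84 | 82 | 82 | 86.8%
Gemini 3 Pro (Preview) | 100 | 95 | 92 | 92 | 89 | 89 | 82 | 79 | 76 | 74 | 86.8%
Z.AI GLM 4.6 | 97 | 92 | 92 | 89 | 89 | 89 | 87 | 84 | 74 | 71 | 86.6%
ByteDance Seed 2.0 Lite | 95 | 89 | 89 | 89 | 89 | 84 | 82 | 82 | 82 | 82 | 86.3%
GPT-4.1 | 100 | 100 | 95 | 92 | 92 | 84 | 84 | 76 | 71 | 66 | 86.1%
o4 Mini | 95 | 89 | 89 | 87 | 87 | 87 | 82 | 82 | 79 | 74 | 85.0%
Nemotron 3 Super | 89 | 89 | 89 | 84 | 84 | 84 | 84 | 84 | 82 | 79 | 85.0%
Qwen 3.5 9B | 95 | 95 | 89 | 89 | 84 | 84 | 82 | 76 | 74 | 74 | 84.2%
Gemini 3.1 Flash Lite (Preview) | 97 | 89 | 82 | 82 | 82 | 82 | 82 | 82 | 82 | 79 | 83.7%
ByteDance Seed 2.0 Mini | 95 | 95 | 92 | 84 | 82 | 82 | 82 | 79 | 76 | 68 | 83.4%
Gemini 2.5 Flash | 92 | 92 | 89 | 89 | 87 | 82 | 82 | 71 | 71 | 68 | 82.4%
Claude Opus 4 | 95 | 92 | 87 | 84 | 79 | 79 | 79 | 79 | 71 | 68 | 81.3%
GPT-5.4 Mini (Reasoning, Low) | 89 | 89 | 87 | 84 | 82 | 79 | 79 | 76 | 74 | 74 | 81.3%
Qwen 3.5 Plus (2026-02-15) | 92 | 92 | 84 | 84 | 84 | 84 | 79 | 79 | 68 | 63 | 81.1%
Inception Mercury 2 | 84 | 84 | 84 | 82 | 79 | 79 | 79 | 79 | 76 | 68 | 79.5%
Writer: Palmyra X5 | 89 | 89 | 87 | 84 | 82 | 79 | 74 | 71 | 68 | 66 | 78.9%
Stealth: Aurora Alpha | 89 | 89 | 84 | 76 | 76 | 76 | 76 | 76 | 76 | 66 | 78.7%
GPT-5 Mini | 89 | 89 | 87 | 87 | 87 | 87 | 84 | 79 | 53 | 45 | 78.7%
Z.AI GLM 4.5 | 87 | 87 | 82 | 82 | 82 | 79 | 76 | 74 | 63 | 63 | 77.4%
MiniMax M2.5 | 89 | 89 | 82 | 82 | 79 | 76 | 74 | 71 | 68 | 61 | 77.1%
Gemini 2.5 Flash Lite (Reasoning) | 84 | 84 | 82 | 82 | 79 | 76 | 76 | 74 | 68 | 63 | 76.8%
GPT-5.4 Nano (Reasoning) | 84 | 84 | 82 | 79 | 76 | 76 | 74 | 71 | 71 | 63 | 76.1%
Claude 3.5 Sonnet | 82 | 82 | 82 | 82 | 74 | 74 | 74 | 66 | 66 | 61 | 73.9%
MiniMax M2.7 | 89 | 89 | 84 | 79 | 76 | 71 | 66 | 63 | 61 | 58 | 73.7%
Claude 3.7 Sonnet | 76 | 74 | 74 | 74 | 74 | 74 | 74 | 71 | 71 | 71 | 73.2%
Mistral Medium 3.1 | 76 | 76 | 74 | 74 | 74 | 74 | 74 | 68 | 66 | 63 | 71.8%
Mistral Large | 87 | 84 | 79 | 71 | 68 | 68 | 66 | 66 | 61 | 61 | 71.1%
GPT-4o, May 13th (temp=0) | 79 | 79 | 74 | 74 | 74 | 68 | 68 | 68 | 63 | 61 | 70.8%
DeepSeek V3 (2024-12-26) | 84 | 82 | 79 | 71 | 68 | 68 | 66 | 63 | 63 | 61 | 70.5%
Mistral Large 2 | 82 | 79 | 79 | 71 | 66 | 66 | 66 | 66 | 63 | 63 | 70.0%
DeepSeek-V2 Chat | 89 | 76 | 76 | 74 | 68 | 68 | 63 | 63 | 61 | 58 | 69.7%
DeepSeek V3.2 | 95 | 84 | 82 | 76 | 74 | 71 | 71 | 63 | 61 | 18 | 69.5%
Qwen3 235B A22B Instruct 2507 | 84 | 82 | 79 | 76 | 74 | 68 | 66 | 63 | 58 | 45 | 69.5%
Z.AI GLM 4.7 Flash | 79 | 74 | 74 | 71 | 68 | 68 | 63 | 63 | 61 | 55 | 67.6%
Mistral Large 3 | 74 | 71 | 71 | 68 | 68 | 68 | 66 | 63 | 63 | 61 | 67.4%
Mistral Small 4 (Reasoning) | 79 | 76 | 74 | 68 | 63 | 63 | 63 | 58 | 55 | 53 | 65.3%
GPT-4.1 Mini | 76 | 74 | 74 | 71 | 66 | 66 | 63 | 61 | 50 | 34 | 63.4%
DeepSeek V3 (2025-03-24) | 89 | 89 | 74 | 74 | 68 | 66 | 63 | 58 | 29 | 0 | 61.1%
Nemotron 3 Nano | 79 | 74 | 66 | 63 | 63 | 61 | 58 | 47 | 45 | 37 | 59.2%
Claude Haiku 4.5 | 87 | 79 | 79 | 79 | 71 | 71 | 63 | 61 | 0 | 0 | 58.9%
ByteDance Seed 1.6 Flash | 74 | 68 | 66 | 63 | 61 | 61 | 61 | 55 | 50 | 29 | 58.7%
GPT-4o, Aug. 6th (temp=0) | 74 | 61 | 61 | 58 | 58 | 58 | 55 | 55 | 55 | 53 | 58.7%
Mistral Small 3.2 24B | 68 | 63 | 63 | 61 | 58 | 55 | 55 | 55 | 55 | 53 | 58.7%
GPT-5 Nano | 82 | 82 | 76 | 74 | 71 | 68 | 66 | 61 | 0 | 0 | 57.9%
DeepSeek V3.1 | 82 | 82 | 82 | 79 | 71 | 71 | 55 | 42 | 0 | 0 | 56.3%
Gemini 2.5 Flash Lite | 76 | 68 | 61 | 61 | 58 | 50 | 47 | 47 | 45 | 39 | 55.3%
Qwen 3 32B | 76 | 74 | 63 | 61 | 58 | 53 | 53 | 50 | 37 | 26 | 55.0%
GPT-4o, Aug. 6th (temp=1) | 68 | 68 | 66 | 63 | 61 | 58 | 55 | 47 | 45 | 0 | 53.2%
Inception Mercury | 84 | 66 | 66 | 53 | 53 | 50 | 45 | 42 | 42 | 29 | 52.9%
GPT-4o, May 13th (temp=1) | 74 | 71 | 68 | 53 | 53 | 53 | 45 | 45 | 32 | 32 | 52.4%
GPT-5.4 Mini | 87 | 82 | 79 | 74 | 71 | 66 | 50 | 0 | 0 | 0 | 50.8%
Mistral Small Creative | 63 | 53 | 53 | 50 | 47 | 47 | 45 | 39 | 39 | 21 | 45.8%
Arcee AI: Trinity Large (Preview) | 76 | 61 | 58 | 50 | 50 | 42 | 39 | 37 | 32 | 0 | 44.5%
Ministral 3 8B | 53 | 50 | 47 | 47 | 45 | 45 | 42 | 39 | 39 | 0 | 40.8%
Ministral 8B | 50 | 45 | 42 | 42 | 39 | 37 | 37 | 37 | 34 | 34 | 39.7%
Gemma 3 27B | 47 | 45 | 42 | 39 | 39 | 37 | 37 | 37 | 37 | 37 | 39.7%
Ministral 3 14B | 50 | 47 | 45 | 42 | 39 | 37 | 37 | 34 | 32 | 24 | 38.7%
Hermes 3 405B | 50 | 50 | 47 | 45 | 42 | 37 | 34 | 29 | 29 | 21 | 38.4%
Claude 3.5 Haiku | 39 | 37 | 34 | 34 | 34 | 34 | 34 | 34 | 32 | 32 | 34.5%
Ministral 3B | 53 | 50 | 45 | 39 | 39 | 26 | 24 | 18 | 13 | 8 | 31.6%
GPT-5.4 Nano (Reasoning, Low) | 84 | 79 | 79 | 71 | 0 | 0 | 0 | 0 | 0 | 0 | 31.3%
Mistral Small 4 | 58 | 47 | 34 | 32 | 32 | 29 | 29 | 24 | 13 | 0 | 29.7%
GPT-4o Mini (temp=0) | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 11 | 29.5%
Gemma 3 12B | 37 | 34 | 34 | 29 | 29 | 26 | 26 | 24 | 21 | 11 | 27.1%
Arcee AI: Trinity Mini | 37 | 37 | 37 | 37 | 34 | 26 | 24 | 16 | 11 | 8 | 26.6%
Qwen 2.5 72B | 42 | 39 | 39 | 37 | 34 | 32 | 21 | 8 | 3 | 0 | 25.5%
Ministral 3 3B | 50 | 39 | 39 | 32 | 29 | 26 | 16 | 16 | 5 | 0 | 25.3%
Llama 3.1 Nemotron 70B | 39 | 32 | 29 | 26 | 26 | 24 | 21 | 18 | 16 | 0 | 23.2%
Llama 3.1 70B | 47 | 34 | 32 | 26 | 21 | 16 | 16 | 13 | 3 | 0 | 20.8%
GPT-5.4 Nano | 53 | 45 | 37 | 32 | 29 | 0 | 0 | 0 | 0 | 0 | 19.5%
Hermes 3 70B | 34 | 34 | 26 | 26 | 18 | 16 | 16 | 11 | 3 | 0 | 18.4%
Cohere Command R+ (Aug. 2024) | 39 | 34 | 32 | 26 | 24 | 8 | 5 | 5 | 0 | 0 | 17.4%
GPT-4o Mini (temp=1) | 24 | 21 | 18 | 16 | 13 | 11 | 11 | 5 | 0 | 0 | 11.8%
Claude 3 Haiku | 26 | 18 | 11 | 8 | 5 | 5 | 5 | 3 | 0 | 0 | 8.2%
Mistral NeMO | 18 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.9%
Llama 3.1 8B | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5%
GPT-4.1 Nano | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3%
Rocinante 12B | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%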
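
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 98.5%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 85 | 97.5%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 97.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 96.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 85 | 95.5%
GPT-5.2 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95.5%
GPT-5 Mini | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 85 | 95.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 90 | 85 | 85 | 94.5%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 80 | 93.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 80 | 65 | 91.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 95 | 90 | 90 | 90 | 90 | 80 | 80 | 91.5%
GPT-5.1 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 85 | 70 | 91.5%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 90 | 90 | 85 | 85 | 85 | 75 | 91.0%
Aion 2.0 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 85 | 75 | 91.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 85 | 85 | 85 | 91.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 70 | 70 | 70 | 90.0%
GPT-5 | 100 | 100 | 95 | 95 | 90 | 90 | 90 | 85 | 85 | 65 | 89.5%
Z.AI GLM 5 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 75 | 70 | 88.5%
Grok 4 Fast | 100 | 100 | 95 | 95 | 85 | 85 | 85 | 80 | 80 | 75 | 88.0%
Gemini 3 Pro (Preview) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 75 | 87.5%
o4 Mini | 90 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 85 | 80 | 87.5%
Grok 4.1 Fast | 100 | 100 | 95 | 95 | 90 | 85 | 85 | 80 | 75 | 65 | 87.0%
Claude Sonnet 4.6 | 100 | 90 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 87.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 90 | 90 | 90 | 75 | 75 | 75 | 70 | 86.5%
ByteDance Seed 1.6 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 75 | 75 | 75 | 85.5%
o4 Mini High | 95 | 90 | 90 | 90 | 90 | 85 | 85 | 80 | 80 | 70 | 85.5%
ByteDance Seed 2.0 Mini | 100 | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 75 | 65 | 85.0%
Qwen 3.5 35B | 100 | 100 | 90 | 90 | 90 | 90 | 75 | 75 | 70 | 65 | 84.5%
GPT-5.4 | 100 | 100 | 100 | 100 | 85 | 70 | 70 | 70 | 70 | 70 | 83.5%
Grok 4 | 100 | 95 | 90 | 90 | 90 | 85 | 85 | 65 | 60 | 55 | 81.5%
Claude Sonnet 4.5 | 100 | 90 | 85 | 85 | 85 | 75 | 75 | 70 | 70 | 70 | 80.5%
ByteDance Seed 2.0 Lite | 90 | 90 | 90 | 85 | 80 | 80 | 75 | 75 | 75 | 65 | 80.5%
MoonshotAI: Kimi K2.5 | 90 | 90 | 90 | 90 | 90 | 85 | 80 | 65 | 60 | 60 | 80.0%
Z.AI GLM 4.6 | 90 | 90 | 85 | 85 | 85 | 80 | 75 | 75 | 75 | 60 | 80.0%
Z.AI GLM 4.7 | 100 | 90 | 85 | 80 | 80 | 75 | 75 | 75 | 70 | 70 | 80.0%
Claude Haiku 4.5 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80.0%
Stealth: Healer Alpha | 90 | 90 | 90 | 90 | 85 | 85 | 75 | 65 | 65 | 45 | 78.0%
Claude Sonnet 4 | 100 | 85 | 85 | 85 | 70 | 70 | 70 | 70 | 70 | 70 | 77.5%
MiniMax M2.7 | 90 | 80 | 80 | 80 | 80 | 80 | 75 | 75 | 75 | 55 | 77.0%
Claude 3.7 Sonnet | 90 | 90 | 80 | 75 | 75 | 75 | 75 | 75 | 65 | 65 | 76.5%
Gemini 3 Flash (Preview) | 90 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 70 | 76.0%
Gemini 2.5 Flash | 90 | 90 | 80 | 80 | 75 | 75 | 70 | 70 | 70 | 55 | 75.5%
Inception Mercury 2 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 70 | 74.5%
GPT-5.4 (Reasoning) | 85 | 85 | 80 | 70 | 65 | 65 | 65 | 65 | 65 | 65 | 71.0%
GPT-5.4 (Reasoning, Low) | 100 | 75 | 70 | 70 | 70 | 65 | 65 | 65 | 65 | 65 | 71.0%
Gemini 3.1 Flash Lite (Preview) | 75 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.5%
Mistral Large | 90 | 80 | 75 | 75 | 75 | 65 | 65 | 60 | 60 | 60 | 70.5%
Claude 3.5 Sonnet | 80 | 75 | 70 | 70 | 70 | 70 | 70 | 65 | 65 | 65 | 70.0%
MiniMax M2.5 | 80 | 80 | 75 | 75 | 70 | 70 | 65 | 65 | 60 | 60 | 70.0%
Claude Opus 4 | 90 | 75 | 75 | 75 | 75 | 65 | 65 | 60 | 60 | 60 | 70.0%
Z.AI GLM 4.5 | 75 | 75 | 75 | 75 | 75 | 70 | 70 | 65 | 60 | 55 | 69.5%
GPT-5.4 Mini (Reasoning) | 75 | 75 | 75 | 70 | 70 | 70 | 70 | 70 | 65 | 50 | 69.0%
Nemotron 3 Super | 100 | 95 | 90 | 70 | 70 | 70 | 60 | 55 | 45 | 35 | 69.0%
ByteDance Seed 1.6 Flash | 90 | 80 | 75 | 70 | 70 | 65 | 65 | 60 | 55 | 55 | 68.5%
Qwen 3.5 9B | 90 | 80 | 80 | 80 | 75 | 75 | 70 | 65 | 65 | 0 | 68.0%
Grok 4.20 (Beta) | 80 | 80 | 75 | 75 | 70 | 70 | 70 | 55 | 55 | 45 | 67.5%
GPT-5.4 Mini (Reasoning, Low) | 80 | 75 | 75 | 75 | 70 | 65 | 65 | 55 | 55 | 55 | 67.0%
GPT-4.1 Mini | 70 | 70 | 70 | 70 | 70 | 70 | 65 | 65 | 65 | 50 | 66.5%
Mistral Medium 3.1 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 70 | 70 | 0 | 66.5%
Gemini 2.5 Flash Lite (Reasoning) | 75 | 75 | 75 | 70 | 70 | 65 | 65 | 60 | 55 | 55 | 66.5%
Mistral Large 3 | 85 | 75 | 65 | 65 | 65 | 60 | 60 | 60 | 60 | 60 | 65.5%
GPT-4o, May 13th (temp=0) | 70 | 70 | 70 | 65 | 65 | 65 | 65 | 65 | 65 | 50 | 65.0%
GPT-5.4 Nano (Reasoning) | 85 | 80 | 75 | 75 | 75 | 70 | 65 | 65 | 60 | 0 | 65.0%
Mistral Large 2 | 80 | 75 | 65 | 65 | 65 | 65 | 65 | 60 | 50 | 50 | 64.0%
Llama 3.1 Nemotron 70B | 75 | 65 | 65 | 65 | 65 | 65 | 60 | 60 | 60 | 60 | 64.0%
Stealth: Aurora Alpha | 85 | 80 | 75 | 75 | 75 | 75 | 75 | 75 | 10 | 10 | 63.5%
Mistral Small 3.2 24B | 75 | 75 | 75 | 70 | 65 | 65 | 55 | 55 | 50 | 50 | 63.5%
GPT-4.1 | 75 | 65 | 65 | 65 | 65 | 65 | 65 | 60 | 55 | 50 | 63.0%
GPT-4o, Aug. 6th (temp=0) | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 60 | 60 | 55 | 63.0%
Z.AI GLM 4.7 Flash | 70 | 70 | 70 | 70 | 65 | 65 | 60 | 55 | 55 | 45 | 62.5%
Mistral Small 4 (Reasoning) | 80 | 75 | 70 | 65 | 65 | 65 | 55 | 55 | 50 | 40 | 62.0%
DeepSeek-V2 Chat | 80 | 70 | 70 | 70 | 70 | 60 | 55 | 50 | 45 | 45 | 61.5%
GPT-4o, Aug. 6th (temp=1) | 70 | 70 | 65 | 65 | 60 | 55 | 55 | 55 | 55 | 50 | 60.0%
Qwen 3 32B | 80 | 80 | 65 | 65 | 65 | 55 | 55 | 45 | 45 | 35 | 59.0%
Writer: Palmyra X5 | 70 | 70 | 60 | 60 | 60 | 55 | 55 | 55 | 45 | 45 | 57.5%
Llama 3.1 70B | 70 | 65 | 60 | 55 | 55 | 55 | 55 | 55 | 55 | 50 | 57.5%
Mistral Small Creative | 75 | 70 | 70 | 60 | 55 | 55 | 55 | 50 | 45 | 40 | 57.5%
Ministral 3 14B | 80 | 80 | 75 | 75 | 65 | 65 | 65 | 60 | 5 | 5 | 57.5%
GPT-5 Nano | 75 | 65 | 65 | 60 | 60 | 55 | 55 | 45 | 45 | 45 | 57.0%
GPT-4o, May 13th (temp=1) | 70 | 65 | 60 | 60 | 60 | 60 | 55 | 50 | 50 | 40 | 57.0%
Qwen3 235B A22B Instruct 2507 | 65 | 65 | 65 | 60 | 60 | 55 | 50 | 50 | 45 | 40 | 55.5%
DeepSeek V3 (2024-12-26) | 70 | 65 | 60 | 55 | 55 | 55 | 55 | 45 | 45 | 45 | 55.0%
DeepSeek V3 (2025-03-24) | 70 | 70 | 60 | 55 | 55 | 55 | 50 | 45 | 45 | 45 | 55.0%
Gemini 2.5 Flash Lite | 75 | 60 | 60 | 55 | 55 | 55 | 55 | 50 | 40 | 35 | 54.0%
Nemotron 3 Nano | 65 | 60 | 60 | 60 | 55 | 55 | 55 | 45 | 45 | 40 | 54.0%
Qwen 2.5 72B | 70 | 60 | 60 | 60 | 55 | 55 | 50 | 50 | 45 | 35 | 54.0%
Gemma 3 27B | 70 | 65 | 55 | 55 | 55 | 50 | 50 | 45 | 45 | 45 | 53.5%
Claude 3.5 Haiku | 60 | 60 | 60 | 55 | 50 | 50 | 50 | 45 | 45 | 40 | 51.5%
DeepSeek V3.2 | 65 | 65 | 65 | 55 | 55 | 55 | 45 | 40 | 35 | 30 | 51.0%
Inception Mercury | 65 | 65 | 60 | 55 | 55 | 50 | 50 | 50 | 30 | 15 | 49.5%
Arcee AI: Trinity Large (Preview) | 70 | 55 | 55 | 55 | 55 | 50 | 50 | 45 | 35 | 20 | 49.0%
GPT-5.4 Mini | 65 | 60 | 55 | 55 | 55 | 50 | 45 | 45 | 45 | 5 | 48.0%
Hermes 3 405B | 65 | 55 | 50 | 45 | 45 | 45 | 45 | 40 | 40 | 40 | 47.0%
DeepSeek V3.1 | 70 | 55 | 55 | 55 | 50 | 45 | 40 | 40 | 40 | 0 | 45.0%
Ministral 3 3B | 50 | 45 | 45 | 40 | 40 | 40 | 35 | 35 | 30 | 30 | 39.0%
Ministral 3 8B | 55 | 50 | 45 | 45 | 40 | 35 | 30 | 25 | 20 | 5 | 35.0%
Ministral 8B | 65 | 45 | 45 | 45 | 35 | 30 | 30 | 25 | 0 | 0 | 32.0%
Gemma 3 12B | 50 | 35 | 35 | 35 | 25 | 25 | 25 | 20 | 20 | 15 | 28.5%
Ministral 3B | 55 | 40 | 30 | 30 | 30 | 30 | 30 | 15 | 15 | 5 | 28.0%
GPT-5.4 Nano (Reasoning, Low) | 75 | 75 | 65 | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 27.5%
GPT-4o Mini (temp=1) | 35 | 35 | 35 | 35 | 35 | 25 | 25 | 20 | 10 | 10 | 26.5%
Hermes 3 70B | 45 | 40 | 35 | 35 | 30 | 30 | 25 | 20 | 5 | 0 | 26.5%
GPT-4o Mini (temp=0) | 35 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 26.0%
Arcee AI: Trinity Mini | 45 | 35 | 30 | 30 | 25 | 20 | 20 | 20 | 20 | 10 | 25.5%
Mistral Small 4 | 40 | 35 | 35 | 30 | 30 | 25 | 20 | 20 | 15 | 0 | 25.0%
Cohere Command R+ (Aug. 2024) | 50 | 45 | 40 | 40 | 35 | 30 | 0 | 0 | 0 | 0 | 24.0%
Mistral NeMO | 50 | 40 | 40 | 35 | 25 | 25 | 0 | 0 | 0 | 0 | 21.5%
Claude 3 Haiku | 30 | 30 | 20 | 20 | 20 | 20 | 20 | 20 | 15 | 5 | 20.0%
Llama 3.1 8B | 35 | 30 | 30 | 30 | 30 | 20 | 15 | 10 | 0 | 0 | 20.0%
GPT-5.4 Nano | 35 | 20 | 20 | 15 | 15 | 15 | 0 | 0 | 0 | 0 | 12.0%
Gemma 3 4B | 30 | 25 | 20 | 10 | 10 | 5 | 0 | 0 | 0 | 0 | 10.0%
Rocinante 12B | 50 | 45 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.5%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%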
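
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.5%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 98.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 98.5%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 98.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 97.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 97.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 90 | 90 | 96.5%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 85 | 96.5%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 96.5%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 85 | 80 | 95.5%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 85 | 94.0%
o4 Mini | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 93.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 85 | 85 | 92.5%
Grok 4 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 92.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 90 | 90 | 85 | 85 | 85 | 85 | 92.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 90 | 90 | 85 | 85 | 85 | 85 | 92.0%
Llama 3.1 70B | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 91.0%
Llama 3.1 Nemotron 70B | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 91.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 85 | 85 | 85 | 91.0%
Inception Mercury 2 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemini 2.5 Flash | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 90.0%
Mistral Large | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 85 | 89.5%
Mistral Large 2 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 89.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 0 | 89.0%
Hermes 3 405B | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 85 | 80 | 60 | 88.5%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 0 | 88.0%
Inception Mercury | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 75 | 75 | 88.0%
Mistral Large 3 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 85 | 85 | 85 | 87.5%
Gemini 2.5 Flash Lite (Reasoning) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 65 | 86.5%
MiniMax M2.7 | 100 | 100 | 90 | 90 | 90 | 85 | 85 | 75 | 75 | 75 | 86.5%
Nemotron 3 Nano | 100 | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 80 | 80 | 86.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 90 | 85 | 85 | 85 | 85 | 65 | 65 | 86.0%
Gemma 3 12B | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 75 | 85.5%
MiniMax M2.5 | 100 | 100 | 90 | 90 | 90 | 85 | 75 | 75 | 75 | 70 | 85.0%
Claude Sonnet 4.6 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85.0%
GPT-4.1 Mini | 90 | 90 | 90 | 90 | 85 | 80 | 80 | 75 | 75 | 75 | 83.0%
ByteDance Seed 1.6 Flash | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 80 | 75 | 65 | 82.0%
Stealth: Aurora Alpha | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 0 | 81.0%
Qwen 3 32B | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 75 | 70 | 50 | 80.5%
Mistral Small 3.2 24B | 90 | 90 | 90 | 80 | 75 | 75 | 75 | 75 | 75 | 75 | 80.0%
Qwen3 235B A22B Instruct 2507 | 85 | 85 | 85 | 85 | 80 | 75 | 75 | 75 | 75 | 65 | 78.5%
GPT-5 Nano | 90 | 90 | 80 | 80 | 80 | 80 | 70 | 70 | 70 | 70 | 78.0%
Mistral Small 4 (Reasoning) | 90 | 90 | 80 | 80 | 80 | 75 | 75 | 75 | 70 | 65 | 78.0%
Qwen 2.5 72B | 90 | 90 | 90 | 80 | 80 | 80 | 80 | 80 | 70 | 35 | 77.5%
Z.AI GLM 4.7 Flash | 100 | 90 | 90 | 80 | 75 | 75 | 75 | 65 | 60 | 55 | 76.5%
Arcee AI: Trinity Large (Preview) | 90 | 85 | 85 | 80 | 75 | 70 | 70 | 70 | 70 | 65 | 76.0%
Writer: Palmyra X5 | 85 | 85 | 85 | 85 | 75 | 75 | 70 | 65 | 65 | 65 | 75.5%
Gemini 2.5 Flash Lite | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 70 | 74.5%
GPT-5.4 Mini | 90 | 85 | 80 | 75 | 75 | 75 | 75 | 65 | 65 | 40 | 72.5%
Mistral Small Creative | 80 | 80 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 65 | 71.5%
Cohere Command R+ (Aug. 2024) | 90 | 90 | 80 | 70 | 70 | 70 | 65 | 60 | 60 | 55 | 71.0%
Claude Haiku 4.5 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 60 | 60 | 60 | 70.5%
Gemma 3 27B | 75 | 75 | 70 | 65 | 65 | 65 | 65 | 55 | 55 | 55 | 64.5%
Ministral 3 8B | 70 | 70 | 65 | 65 | 60 | 60 | 60 | 50 | 50 | 35 | 58.5%
Ministral 3 14B | 85 | 75 | 65 | 60 | 60 | 60 | 60 | 55 | 45 | 5 | 57.0%
Ministral 8B | 75 | 75 | 65 | 65 | 60 | 60 | 55 | 55 | 35 | 20 | 56.5%
GPT-4o Mini (temp=0) | 60 | 60 | 60 | 60 | 60 | 60 | 50 | 50 | 50 | 50 | 56.0%
Ministral 3 3B | 65 | 65 | 65 | 65 | 60 | 60 | 55 | 55 | 40 | 30 | 56.0%
GPT-5.4 Nano | 90 | 80 | 75 | 75 | 70 | 60 | 55 | 40 | 0 | 0 | 54.5%
Claude 3.5 Haiku | 65 | 65 | 65 | 55 | 55 | 55 | 55 | 55 | 40 | 25 | 53.5%
Hermes 3 70B | 80 | 75 | 65 | 60 | 60 | 60 | 50 | 45 | 40 | 0 | 53.5%
GPT-4o Mini (temp=1) | 60 | 60 | 60 | 50 | 50 | 50 | 50 | 50 | 50 | 35 | 51.5%
Ministral 3B | 65 | 60 | 55 | 50 | 50 | 50 | 50 | 45 | 40 | 35 | 50.0%
Llama 3.1 8B | 75 | 70 | 65 | 60 | 60 | 40 | 40 | 35 | 30 | 20 | 49.5%
Mistral NeMO | 65 | 55 | 55 | 45 | 45 | 45 | 45 | 45 | 45 | 45 | 49.0%
Mistral Small 4 | 70 | 65 | 65 | 60 | 55 | 50 | 50 | 45 | 30 | 0 | 49.0%
Arcee AI: Trinity Mini | 60 | 60 | 55 | 50 | 45 | 45 | 40 | 35 | 35 | 35 | 46.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 90 | 80 | 80 | 75 | 0 | 0 | 0 | 0 | 0 | 42.5%
Claude 3 Haiku | 45 | 45 | 45 | 40 | 35 | 35 | 35 | 35 | 20 | 15 | 35.0%
Rocinante 12B | 55 | 40 | 35 | 15 | 15 | 15 | 5 | 5 | 5 | 0 | 19.0%
GPT-4.1 Nano | 20 | 20 | 15 | 10 | 10 | 10 | 10 | 10 | 0 | 0 | 10.5%
WizardLM 2 8x22b | 10 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.0%
Gemma 3 4B | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%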

tiers

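Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 97.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 70 | 97.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 96.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 96.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 70 | 96.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 96.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 80 | 94.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 70 | 70 | 94.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 30 | 93.0%
Mistral Large | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 93.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 80 | 70 | 92.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 30 | 92.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 70 | 70 | 91.0%
Mistral Large 3 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Large 2 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 80 | 60 | 90.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 80 | 60 | 90.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 0 | 88.0%
Inception Mercury 2 | 100 | 100 | 100 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 86.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 90 | 90 | 90 | 80 | 80 | 80 | 80 | 70 | 86.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 70 | 70 | 70 | 70 | 70 | 85.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 70 | 70 | 50 | 85.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 80 | 80 | 30 | 85.0%
MiniMax M2.7 | 100 | 100 | 100 | 80 | 80 | 80 | 80 | 80 | 70 | 70 | 84.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 70 | 0 | 84.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 90 | 90 | 70 | 60 | 60 | 60 | 83.0%
ByteDance Seed 1.6 Flash | 100 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 70 | 81.0%
Gemini 2.5 Flash | 100 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 70 | 70 | 80.0%
Inception Mercury | 100 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 60 | 60 | 78.0%
MiniMax M2.5 | 100 | 100 | 90 | 80 | 70 | 70 | 70 | 50 | 50 | 50 | 73.0%
GPT-5 Nano | 100 | 100 | 100 | 80 | 80 | 80 | 80 | 50 | 50 | 0 | 72.0%
Mistral Medium 3.1 | 100 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 60 | 60 | 71.0%
Claude Sonnet 4.6 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
Gemma 3 27B | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
GPT-4o, Aug. 6th (temp=0) | 90 | 90 | 90 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 69.0%
GPT-4o, May 13th (temp=1) | 90 | 90 | 70 | 70 | 60 | 60 | 60 | 60 | 60 | 60 | 68.0%
GPT-4.1 Mini | 80 | 80 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 30 | 68.0%
Qwen 2.5 72B | 80 | 80 | 80 | 80 | 80 | 80 | 50 | 50 | 50 | 50 | 68.0%
GPT-4o, Aug. 6th (temp=1) | 90 | 90 | 70 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 67.0%
Grok 4.20 (Beta) | 80 | 70 | 70 | 60 | 60 | 60 | 60 | 60 | 60 | 50 | 63.0%
Arcee AI: Trinity Mini | 100 | 80 | 80 | 80 | 60 | 60 | 50 | 50 | 30 | 30 | 62.0%
Writer: Palmyra X5 | 70 | 70 | 70 | 70 | 70 | 70 | 60 | 60 | 60 | 0 | 60.0%
GPT-4o Mini (temp=1) | 80 | 80 | 80 | 80 | 50 | 50 | 50 | 50 | 40 | 40 | 60.0%
Qwen3 235B A22B Instruct 2507 | 70 | 70 | 70 | 70 | 70 | 60 | 60 | 60 | 60 | 0 | 59.0%
Arcee AI: Trinity Large (Preview) | 70 | 70 | 70 | 70 | 60 | 60 | 50 | 50 | 40 | 40 | 58.0%
Claude Haiku 4.5 | 70 | 70 | 70 | 70 | 60 | 60 | 50 | 40 | 40 | 40 | 57.0%
GPT-5.4 Mini | 90 | 70 | 70 | 60 | 60 | 60 | 60 | 40 | 40 | 10 | 56.0%
GPT-4o Mini (temp=0) | 80 | 80 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 56.0%
Gemma 3 12B | 80 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 53.0%
Mistral Small Creative | 60 | 60 | 60 | 60 | 60 | 60 | 50 | 50 | 40 | 30 | 53.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 80 | 30 | 20 | 0 | 0 | 0 | 0 | 43.0%
Llama 3.1 8B | 70 | 70 | 70 | 70 | 40 | 30 | 30 | 20 | 20 | 0 | 42.0%
Gemini 2.5 Flash Lite | 50 | 50 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 42.0%
Cohere Command R+ (Aug. 2024) | 80 | 60 | 60 | 60 | 50 | 40 | 20 | 20 | 20 | 0 | 41.0%
Ministral 3 3B | 70 | 70 | 60 | 50 | 40 | 40 | 40 | 30 | 0 | 0 | 40.0%
Mistral Small 3.2 24B | 60 | 50 | 50 | 40 | 40 | 40 | 30 | 30 | 30 | 10 | 38.0%
Mistral NeMO | 70 | 50 | 50 | 50 | 40 | 40 | 30 | 30 | 10 | 0 | 37.0%
Claude 3 Haiku | 60 | 60 | 40 | 40 | 40 | 30 | 20 | 20 | 10 | 10 | 33.0%
GPT-5.4 Nano | 60 | 60 | 50 | 50 | 40 | 30 | 30 | 0 | 0 | 0 | 32.0%
Ministral 3 14B | 50 | 50 | 40 | 40 | 40 | 20 | 20 | 20 | 20 | 10 | 31.0%
Ministral 3B | 70 | 70 | 60 | 40 | 20 | 20 | 10 | 10 | 0 | 0 | 30.0%
Mistral Small 4 | 50 | 50 | 40 | 40 | 40 | 30 | 30 | 20 | 0 | 0 | 30.0%
Hermes 3 70B | 60 | 50 | 50 | 30 | 30 | 20 | 20 | 10 | 0 | 0 | 27.0%
GPT-4.1 Nano | 20 | 20 | 20 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 8.0%
Ministral 8B | 50 | 10 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.0%
Rocinante 12B | 40 | 20 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.0%
Gemma 3 4B | 20 | 20 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.0%
Ministral 3 8B | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%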
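Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%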
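Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 98.3%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 98.3%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 98.3%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 92 | 97.5%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 97.5%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 83 | 97.5%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 96.7%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 96.7%
GPT-5.1 | 100 | 100 | 100 | 100 | 92 | 92 | 92 | 92 | 92 | 92 | 95.0%
GPT-5 | 100 | 100 | 100 | 100 | 92 | 92 | 92 | 92 | 92 | 92 | 95.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 75 | 95.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 83 | 95.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 92 | 75 | 95.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 67 | 93.3%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 75 | 75 | 93.3%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 75 | 75 | 93.3%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 92 | 83 | 83 | 83 | 83 | 92.5%
GPT-5 Mini | 100 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92.5%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 92 | 75 | 75 | 92.5%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 75 | 50 | 91.7%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 83 | 67 | 91.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 75 | 75 | 75 | 90.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 75 | 75 | 75 | 90.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 92 | 75 | 75 | 75 | 75 | 89.2%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 67 | 50 | 89.2%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 92 | 92 | 92 | 83 | 83 | 83 | 83 | 83 | 89.2%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 75 | 75 | 75 | 75 | 75 | 87.5%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 92 | 83 | 83 | 83 | 75 | 75 | 75 | 86.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 75 | 8 | 86.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 75 | 58 | 42 | 86.7%
Mistral Small Creative | 100 | 100 | 100 | 92 | 92 | 83 | 83 | 83 | 75 | 58 | 86.7%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 83 | 75 | 75 | 75 | 75 | 75 | 85.8%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 83 | 83 | 83 | 67 | 67 | 67 | 85.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 83 | 75 | 0 | 84.2%
Inception Mercury | 100 | 100 | 100 | 100 | 83 | 75 | 75 | 75 | 75 | 58 | 84.2%
ByteDance Seed 1.6 Flash | 100 | 92 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 67 | 84.2%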
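GPT-4.1 Mini | 92 | 92 | 92 | 92 | 92 | 75 | 75 | 75 | 75 | 67 | 82.5%
Qwen 3 32B | 100 | 100 | 100 | 83 | 83 | 83 | 83 | 83 | 67 | 42 | 82.5%
Mistral Large | 100 | 100 | 100 | 92 | 75 | 75 | 75 | 75 | 67 | 67 | 82.5%
Gemini 2.5 Flash | 92 | 92 | 92 | 92 | 83 | 75 | 75 | 75 | 75 | 67 | 81.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 67 | 67 | 81.7%
GPT-4o, Aug. 6th (temp=0) | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 67 | 81.7%
GPT-5 Nano | 100 | 100 | 83 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 80.8%
GPT-4.1 | 100 | 100 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 80.0%
DeepSeek V3.2 | 100 | 100 | 100 | 92 | 92 | 92 | 92 | 58 | 50 | 25 | 80.0%
Gemma 3 27B | 100 | 100 | 92 | 75 | 75 | 75 | 75 | 75 | 75 | 50 | 79.2%
Ministral 3 8B | 83 | 83 | 83 | 75 | 75 | 75 | 75 | 75 | 67 | 67 | 75.8%
Nemotron 3 Super | 83 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75.8%
Claude Sonnet 4.6 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75.0%
Claude Haiku 4.5 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75.0%
DeepSeek-V2 Chat | 100 | 100 | 83 | 75 | 75 | 75 | 67 | 58 | 58 | 58 | 75.0%
Llama 3.1 70B | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 75 | 0 | 74.2%
Mistral Large 3 | 92 | 75 | 75 | 75 | 75 | 75 | 67 | 67 | 67 | 67 | 73.3%
Mistral Small 3.2 24B | 100 | 92 | 83 | 83 | 75 | 75 | 75 | 58 | 50 | 42 | 73.3%
Grok 4.20 (Beta) | 92 | 83 | 75 | 75 | 75 | 75 | 75 | 67 | 50 | 50 | 71.7%
Z.AI GLM 4.7 Flash | 100 | 83 | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 58 | 70.8%
Mistral Large 2 | 100 | 75 | 75 | 75 | 75 | 67 | 67 | 67 | 50 | 50 | 70.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 0 | 0 | 0 | 69.2%
DeepSeek V3.1 | 92 | 92 | 83 | 75 | 75 | 75 | 67 | 58 | 42 | 33 | 69.2%
Hermes 3 405B | 100 | 83 | 75 | 75 | 75 | 75 | 75 | 67 | 58 | 0 | 68.3%
Gemma 3 12B | 92 | 75 | 75 | 75 | 75 | 75 | 67 | 67 | 50 | 33 | 68.3%
Claude 3.5 Haiku | 83 | 83 | 83 | 83 | 67 | 67 | 67 | 67 | 50 | 25 | 67.5%
Ministral 8B | 100 | 83 | 75 | 75 | 75 | 67 | 58 | 50 | 50 | 42 | 67.5%
DeepSeek V3 (2024-12-26) | 92 | 83 | 75 | 75 | 75 | 67 | 58 | 58 | 42 | 42 | 66.7%
Nemotron 3 Nano | 83 | 83 | 75 | 75 | 75 | 75 | 67 | 58 | 33 | 25 | 65.0%
WizardLM 2 8x22b | 100 | 92 | 83 | 75 | 75 | 67 | 58 | 50 | 50 | 0 | 65.0%
Gemini 2.5 Flash Lite | 83 | 83 | 75 | 67 | 67 | 58 | 58 | 58 | 50 | 42 | 64.2%
GPT-4o, May 13th (temp=1) | 100 | 83 | 83 | 83 | 83 | 67 | 67 | 17 | 17 | 17 | 61.7%
Arcee AI: Trinity Large (Preview) | 75 | 75 | 67 | 58 | 50 | 50 | 50 | 50 | 50 | 42 | 56.7%
GPT-4o Mini (temp=1) | 67 | 67 | 67 | 67 | 50 | 50 | 50 | 50 | 50 | 42 | 55.8%
Arcee AI: Trinity Mini | 83 | 83 | 67 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 55.0%
Qwen 2.5 72B | 83 | 67 | 67 | 67 | 67 | 67 | 50 | 25 | 25 | 25 | 54.2%
GPT-5.4 Mini | 75 | 67 | 58 | 50 | 50 | 50 | 50 | 50 | 50 | 25 | 52.5%
Hermes 3 70B | 75 | 75 | 67 | 67 | 67 | 50 | 50 | 42 | 25 | 0 | 51.7%
GPT-4o Mini (temp=0) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Cohere Command R+ (Aug. 2024) | 75 | 67 | 67 | 58 | 58 | 42 | 33 | 33 | 8 | 0 | 44.2%
Claude 3 Haiku | 58 | 58 | 58 | 58 | 42 | 33 | 25 | 25 | 25 | 25 | 40.8%
Ministral 3 14B | 50 | 50 | 50 | 50 | 42 | 42 | 33 | 33 | 25 | 17 | 39.2%
Mistral NeMO | 67 | 67 | 42 | 42 | 33 | 33 | 33 | 25 | 25 | 17 | 38.3%
Mistral Small 4 | 75 | 50 | 50 | 42 | 33 | 17 | 8 | 0 | 0 | 0 | 27.5%
Ministral 3B | 42 | 42 | 33 | 33 | 8 | 8 | 0 | 0 | 0 | 0 | 16.7%
GPT-5.4 Nano (Reasoning, Low) | 83 | 33 | 25 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 15.8%
Ministral 3 3B | 75 | 25 | 17 | 17 | 17 | 8 | 0 | 0 | 0 | 0 | 15.8%
Rocinante 12B | 42 | 42 | 25 | 25 | 17 | 0 | 0 | 0 | 0 | 0 | 15.0%
Gemma 3 4B | 25 | 25 | 17 | 8 | 8 | 0 | 0 | 0 | 0 | 0 | 8.3%
GPT-5.4 Nano | 42 | 17 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.7%
GPT-4.1 Nano | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.7%
Llama 3.1 8B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%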
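Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 99.1%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 98.6%
Gemini 2.5 Pro | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 96.8%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 91 | 86 | 95.9%
GPT-5.4 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 91 | 91 | 94.5%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 86 | 86 | 86 | 94.5%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 95 | 95 | 95 | 91 | 86 | 82 | 82 | 92.7%
Claude Sonnet 4.6 (Reasoning) | 95 | 95 | 95 | 95 | 95 | 91 | 91 | 91 | 91 | 82 | 92.3%
GPT-5.2 | 95 | 95 | 95 | 95 | 91 | 91 | 86 | 86 | 86 | 86 | 90.9%
Gemini 3 Pro (Preview) | 100 | 95 | 95 | 91 | 91 | 91 | 86 | 86 | 86 | 86 | 90.9%
Qwen 3.5 Flash | 100 | 95 | 95 | 95 | 95 | 91 | 91 | 86 | 82 | 77 | 90.9%
Z.AI GLM 5 Turbo | 100 | 100 | 95 | 95 | 95 | 91 | 86 | 86 | 77 | 77 | 90.5%
Qwen 3.5 35B | 100 | 95 | 95 | 95 | 91 | 91 | 91 | 86 | 86 | 73 | 90.5%
GPT-5 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 86 | 90.5%
Grok 4 | 100 | 100 | 91 | 91 | 91 | 91 | 91 | 91 | 82 | 77 | 90.5%
Qwen 3.5 27B | 100 | 100 | 100 | 86 | 86 | 86 | 86 | 86 | 86 | 82 | 90.0%
MoonshotAI: Kimi K2.5 | 95 | 95 | 95 | 91 | 91 | 86 | 86 | 86 | 86 | 82 | 89.5%
GPT-5.1 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 86 | 86 | 82 | 89.1%
GPT-5.4 (Reasoning, Low) | 95 | 91 | 91 | 91 | 91 | 86 | 86 | 86 | 86 | 86 | 89.1%
Grok 4.20 (Beta, Reasoning) | 100 | 91 | 91 | 91 | 91 | 91 | 91 | 82 | 82 | 77 | 88.6%
Aion 2.0 | 100 | 100 | 95 | 91 | 91 | 86 | 82 | 82 | 82 | 77 | 88.6%
ByteDance Seed 2.0 Mini | 100 | 91 | 91 | 91 | 86 | 86 | 86 | 86 | 86 | 82 | 88.6%
Claude 3.5 Sonnet | 95 | 95 | 95 | 95 | 95 | 91 | 86 | 82 | 77 | 73 | 88.6%
Gemini 2.5 Flash | 100 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 87.7%
Gemini 2.5 Flash (Reasoning) | 100 | 95 | 91 | 91 | 91 | 86 | 82 | 82 | 77 | 77 | 87.3%
GPT-5.4 (Reasoning) | 91 | 91 | 91 | 86 | 86 | 86 | 86 | 86 | 82 | 82 | 86.8%
Mistral Medium 3.1 | 95 | 95 | 86 | 86 | 86 | 86 | 86 | 86 | 82 | 77 | 86.8%
Z.AI GLM 4.7 | 95 | 95 | 86 | 86 | 86 | 86 | 82 | 82 | 82 | 82 | 86.4%
Gemini 3 Flash (Preview) | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86.4%
Z.AI GLM 5 | 100 | 91 | 91 | 91 | 91 | 86 | 82 | 82 | 73 | 73 | 85.9%
Stealth: Hunter Alpha | 100 | 91 | 91 | 91 | 86 | 86 | 82 | 82 | 77 | 73 | 85.9%
Claude Sonnet 4 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 82 | 82 | 85.5%
Grok 4.1 Fast | 91 | 91 | 91 | 91 | 91 | 86 | 82 | 82 | 73 | 73 | 85.0%
Claude Sonnet 4.6 | 100 | 100 | 91 | 91 | 86 | 82 | 82 | 77 | 68 | 68 | 84.5%
o4 Mini High | 91 | 91 | 91 | 91 | 91 | 82 | 77 | 77 | 77 | 73 | 84.1%
Qwen 3.5 122B | 91 | 91 | 91 | 86 | 86 | 82 | 82 | 82 | 77 | 68 | 83.6%
Claude Opus 4 | 95 | 95 | 95 | 82 | 82 | 82 | 82 | 73 | 73 | 73 | 83.2%
GPT-4o, Aug. 6th (temp=0) | 86 | 86 | 86 | 86 | 86 | 86 | 77 | 77 | 77 | 77 | 82.7%
Stealth: Healer Alpha | 91 | 91 | 91 | 91 | 82 | 82 | 77 | 73 | 73 | 68 | 81.8%
Gemini 3.1 Flash Lite (Preview) | 82 | 82 | 82 | 82 | 82 | 82 | 82 | 82 | 82 | 82 | 81.8%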
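GPT-5.4 Mini (Reasoning) | 86 | 86 | 86 | 86 | 86 | 86 | 82 | 77 | 68 | 68 | 81.4%
Mistral Large | 95 | 95 | 95 | 86 | 82 | 82 | 77 | 68 | 64 | 64 | 80.9%
Grok 4 Fast | 100 | 95 | 91 | 91 | 91 | 91 | 82 | 82 | 82 | 0 | 80.5%
o4 Mini | 91 | 82 | 82 | 82 | 82 | 82 | 77 | 77 | 77 | 73 | 80.5%
ByteDance Seed 1.6 | 86 | 86 | 77 | 77 | 77 | 77 | 77 | 77 | 77 | 77 | 79.1%
Qwen 3.5 Plus (2026-02-15) | 91 | 91 | 86 | 82 | 82 | 77 | 77 | 77 | 77 | 50 | 79.1%
GPT-5.4 Mini (Reasoning, Low) | 86 | 86 | 82 | 82 | 82 | 82 | 77 | 77 | 68 | 64 | 78.6%
GPT-4o, Aug. 6th (temp=1) | 91 | 91 | 91 | 91 | 86 | 77 | 73 | 68 | 59 | 59 | 78.6%
DeepSeek V3.2 | 100 | 91 | 91 | 91 | 77 | 73 | 73 | 68 | 64 | 55 | 78.2%
Mistral Large 2 | 86 | 86 | 86 | 82 | 82 | 82 | 77 | 73 | 64 | 59 | 77.7%
GPT-5 Mini | 86 | 82 | 77 | 77 | 77 | 77 | 77 | 77 | 73 | 68 | 77.3%
Z.AI GLM 4.6 | 100 | 95 | 86 | 86 | 82 | 82 | 73 | 59 | 59 | 45 | 76.8%
Qwen 3.5 397B A17B | 95 | 86 | 86 | 86 | 86 | 86 | 82 | 82 | 73 | 0 | 76.4%
Gemini 2.5 Flash Lite (Reasoning) | 86 | 82 | 82 | 82 | 82 | 73 | 73 | 68 | 68 | 64 | 75.9%
GPT-5 Nano | 91 | 82 | 82 | 82 | 77 | 77 | 73 | 68 | 64 | 55 | 75.0%
MiniMax M2.7 | 91 | 82 | 82 | 82 | 77 | 77 | 73 | 68 | 68 | 50 | 75.0%
Qwen 3.5 9B | 95 | 91 | 91 | 86 | 77 | 77 | 77 | 68 | 68 | 9 | 74.1%
GPT-5.4 Mini | 91 | 91 | 86 | 82 | 73 | 73 | 68 | 64 | 64 | 45 | 73.6%
Claude Haiku 4.5 | 91 | 91 | 91 | 82 | 82 | 77 | 73 | 68 | 68 | 0 | 72.3%
Z.AI GLM 4.7 Flash | 91 | 77 | 73 | 73 | 73 | 68 | 68 | 68 | 64 | 59 | 71.4%
Writer: Palmyra X5 | 86 | 82 | 82 | 77 | 68 | 68 | 68 | 68 | 59 | 55 | 71.4%
Inception Mercury | 95 | 82 | 82 | 77 | 73 | 73 | 68 | 64 | 55 | 45 | 71.4%
Stealth: Aurora Alpha | 73 | 73 | 73 | 73 | 73 | 73 | 73 | 68 | 68 | 64 | 70.9%
Mistral Large 3 | 91 | 77 | 77 | 77 | 68 | 64 | 64 | 64 | 64 | 64 | 70.9%
Inception Mercury 2 | 77 | 77 | 77 | 77 | 73 | 73 | 68 | 68 | 59 | 59 | 70.9%
DeepSeek V3.1 | 95 | 95 | 86 | 82 | 82 | 77 | 64 | 45 | 41 | 41 | 70.9%
MiniMax M2.5 | 82 | 82 | 77 | 73 | 68 | 68 | 64 | 64 | 64 | 59 | 70.0%
Claude 3.7 Sonnet | 100 | 100 | 86 | 86 | 77 | 64 | 59 | 50 | 45 | 32 | 70.0%
Z.AI GLM 4.5 | 91 | 73 | 68 | 68 | 68 | 68 | 64 | 64 | 59 | 59 | 68.2%
Qwen3 235B A22B Instruct 2507 | 91 | 86 | 86 | 82 | 77 | 68 | 59 | 59 | 50 | 23 | 68.2%
ByteDance Seed 1.6 Flash | 82 | 73 | 73 | 73 | 73 | 73 | 64 | 59 | 59 | 45 | 67.3%
GPT-4o, May 13th (temp=0) | 77 | 77 | 77 | 73 | 68 | 68 | 59 | 59 | 59 | 50 | 66.8%
Nemotron 3 Nano | 82 | 82 | 77 | 73 | 73 | 59 | 59 | 59 | 59 | 45 | 66.8%
GPT-4.1 | 95 | 82 | 82 | 82 | 82 | 82 | 77 | 64 | 0 | 0 | 64.5%
Mistral Small 4 (Reasoning) | 82 | 73 | 68 | 68 | 64 | 59 | 59 | 59 | 50 | 45 | 62.7%
Gemini 2.5 Flash Lite | 100 | 95 | 91 | 77 | 73 | 73 | 64 | 55 | 0 | 0 | 62.7%
DeepSeek V3 (2024-12-26) | 82 | 73 | 73 | 68 | 68 | 68 | 59 | 45 | 45 | 36 | 61.8%
Ministral 3 14B | 82 | 82 | 77 | 68 | 68 | 45 | 45 | 45 | 41 | 41 | 59.5%
GPT-4.1 Mini | 77 | 68 | 68 | 68 | 59 | 59 | 55 | 41 | 41 | 36 | 57.3%
Arcee AI: Trinity Large (Preview) | 77 | 73 | 73 | 64 | 55 | 55 | 50 | 50 | 45 | 27 | 56.8%
Nemotron 3 Super | 77 | 77 | 73 | 73 | 73 | 68 | 68 | 59 | 0 | 0 | 56.8%
Qwen 3 32B | 77 | 73 | 73 | 68 | 64 | 55 | 50 | 50 | 45 | 9 | 56.4%
GPT-4o, May 13th (temp=1) | 73 | 68 | 59 | 59 | 59 | 55 | 50 | 45 | 45 | 41 | 55.5%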
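Mistral Small Creative | 64 | 64 | 64 | 59 | 55 | 55 | 50 | 50 | 50 | 41 | 55.0%
DeepSeek-V2 Chat | 73 | 68 | 68 | 68 | 59 | 55 | 50 | 45 | 41 | 0 | 52.7%
Mistral Small 3.2 24B | 64 | 59 | 59 | 59 | 55 | 50 | 50 | 45 | 45 | 41 | 52.7%
DeepSeek V3 (2025-03-24) | 68 | 68 | 59 | 59 | 59 | 55 | 45 | 45 | 45 | 9 | 51.4%
Claude 3.5 Haiku | 55 | 55 | 55 | 55 | 55 | 55 | 55 | 45 | 41 | 41 | 50.9%
Hermes 3 405B | 82 | 73 | 68 | 64 | 59 | 50 | 36 | 27 | 23 | 23 | 50.5%
Gemma 3 27B | 59 | 59 | 55 | 45 | 45 | 41 | 41 | 41 | 41 | 41 | 46.8%
Llama 3.1 Nemotron 70B | 59 | 59 | 59 | 45 | 41 | 41 | 36 | 36 | 36 | 27 | 44.1%
GPT-5.4 Nano (Reasoning) | 82 | 73 | 73 | 73 | 73 | 64 | 0 | 0 | 0 | 0 | 43.6%
Llama 3.1 70B | 59 | 59 | 55 | 55 | 45 | 45 | 45 | 27 | 27 | 9 | 42.7%
Cohere Command R+ (Aug. 2024) | 64 | 59 | 55 | 55 | 45 | 45 | 41 | 27 | 23 | 14 | 42.7%
Ministral 3 8B | 73 | 64 | 45 | 41 | 41 | 32 | 32 | 23 | 18 | 14 | 38.2%
Ministral 8B | 68 | 64 | 55 | 50 | 41 | 41 | 32 | 23 | 0 | 0 | 37.3%
Mistral Small 4 | 50 | 50 | 45 | 45 | 45 | 32 | 32 | 27 | 23 | 18 | 36.8%
WizardLM 2 8x22b | 59 | 45 | 41 | 41 | 41 | 41 | 36 | 27 | 23 | 9 | 36.4%
GPT-5.4 Nano (Reasoning, Low) | 77 | 73 | 73 | 64 | 50 | 0 | 0 | 0 | 0 | 0 | 33.6%
Arcee AI: Trinity Mini | 45 | 45 | 45 | 45 | 41 | 36 | 36 | 18 | 14 | 9 | 33.6%
Hermes 3 70B | 50 | 41 | 36 | 36 | 32 | 23 | 18 | 0 | 0 | 0 | 23.6%
GPT-5.4 Nano | 64 | 45 | 45 | 45 | 14 | 0 | 0 | 0 | 0 | 0 | 21.4%
GPT-4o Mini (temp=0) | 32 | 23 | 23 | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 20.5%
GPT-4o Mini (temp=1) | 50 | 32 | 32 | 23 | 23 | 18 | 14 | 5 | 0 | 0 | 19.5%
Ministral 3B | 32 | 32 | 27 | 23 | 23 | 14 | 14 | 9 | 5 | 0 | 17.7%
Gemma 3 12B | 32 | 27 | 27 | 27 | 23 | 14 | 14 | 9 | 5 | 0 | 17.7%
Claude 3 Haiku | 41 | 27 | 23 | 18 | 14 | 14 | 9 | 9 | 5 | 0 | 15.9%
Ministral 3 3B | 32 | 23 | 18 | 18 | 18 | 14 | 5 | 0 | 0 | 0 | 12.7%
Mistral NeMO | 27 | 27 | 27 | 14 | 9 | 5 | 5 | 0 | 0 | 0 | 11.4%
Llama 3.1 8B | 41 | 32 | 27 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 11.4%
Qwen 2.5 72B | 73 | 9 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.1%
Gemma 3 4B | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5%
Rocinante 12B | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%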
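Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.5%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 99.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 99.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 98.5%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 98.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 85 | 98.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 90 | 97.5%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 85 | 85 | 96.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 90 | 85 | 96.5%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 96.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 90 | 90 | 90 | 96.0%
GPT-5.2 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 90 | 96.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 85 | 85 | 95.5%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 95.5%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 95.5%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 85 | 95.5%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 95.5%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 85 | 75 | 95.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 80 | 95.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 85 | 94.5%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 85 | 85 | 85 | 94.5%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 85 | 85 | 94.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 95 | 90 | 90 | 90 | 90 | 80 | 93.5%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 85 | 85 | 75 | 93.5%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 90 | 85 | 85 | 85 | 80 | 92.5%
Gemini 2.5 Flash | 100 | 100 | 95 | 95 | 90 | 90 | 90 | 90 | 90 | 80 | 92.0%
Nemotron 3 Super | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 91.0%
ByteDance Seed 2.0 Lite | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 80 | 70 | 90.0%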
Inception Mercury 210095909090909085757588.0%
Mistral Large 210090909090908580808087.5%
Claude Sonnet 4.69090909090909090757587.0%
Claude Sonnet 49090909090909080808087.0%
Claude 3.7 Sonnet9090909090909090757587.0%
GPT-5.4 Mini (Reasoning, Low)9090909090909085757586.5%
Qwen 3.5 9B10010010010090909090851085.5%
Mistral Large 310090909090858075757585.0%
Qwen 3.5 Plus (2026-02-15)100100858585858580707084.5%
Claude Opus 4100100908585858575706584.0%
Mistral Small 4 (Reasoning)10090909090808080804582.5%
Gemini 3 Flash (Preview)10085858585858570707082.0%
Stealth: Aurora Alpha9090909090858075755081.5%
Mistral Large9090909090908065656581.5%
Z.AI GLM 4.610010010010090858575551580.5%
MiniMax M2.710090858080808075706580.5%
GPT-4o, May 13th (temp=0)8080808080808080707078.0%
DeepSeek V3.29585858080757065606075.5%
MiniMax M2.59090808080707065605574.0%
Gemini 3.1 Flash Lite (Preview)9090757575757570655074.0%
Claude 3.5 Sonnet9080757575757565656574.0%
DeepSeek-V2 Chat9090858585807055404072.0%
Z.AI GLM 4.7 Flash9090858580707065454072.0%
Ministral 3 14B8580807575707070554570.5%
Claude Haiku 4.5909085757575757060069.5%
GPT-5 Nano9590806565656560555069.0%
GPT-4o, Aug. 6th (temp=1)8080808075706555554068.0%
DeepSeek V3 (2024-12-26)9085857065656055554567.5%
Nemotron 3 Nano9080757570656060504567.0%
GPT-4o, Aug. 6th (temp=0)8080707070656565554066.0%
Qwen3 235B A22B Instruct 25078580757070656050505065.5%
GPT-4.110075707065656560503065.0%
Mistral Medium 3.19085757070655555454065.0%
Grok 4.20 (Beta)8080807565605555505065.0%
DeepSeek V3.19085757070656560353064.5%
ByteDance Seed 1.6 Flash8080807070656060453564.5%
DeepSeek V3 (2025-03-24)8580757065605555504564.0%
Writer: Palmyra X57575706565656060504062.5%
GPT-4o, May 13th (temp=1)7570706560555555504560.0%
GPT-4.1 Mini7065656560605555504559.0%
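GPT-5.4 Mini | 95 | 95 | 95 | 90 | 80 | 70 | 30 | 5 | 5 | 5 | 57.0%
Gemma 3 27B | 65 | 60 | 60 | 60 | 55 | 55 | 55 | 55 | 55 | 50 | 57.0%
Z.AI GLM 4.5 | 80 | 75 | 75 | 75 | 60 | 55 | 50 | 45 | 15 | 10 | 54.0%
Inception Mercury | 85 | 65 | 65 | 60 | 55 | 55 | 50 | 45 | 40 | 20 | 54.0%
Mistral Small Creative | 70 | 70 | 65 | 65 | 65 | 65 | 50 | 25 | 15 | 15 | 50.5%
Qwen 3 32B | 90 | 70 | 60 | 60 | 60 | 55 | 50 | 25 | 20 | 0 | 49.0%
Ministral 3 8B | 70 | 70 | 65 | 65 | 65 | 60 | 35 | 25 | 25 | 0 | 48.0%
Hermes 3 405B | 55 | 50 | 50 | 50 | 50 | 50 | 45 | 40 | 40 | 35 | 46.5%
Claude 3.5 Haiku | 60 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 40 | 0 | 45.0%
GPT-5.4 Nano (Reasoning, Low) | 85 | 85 | 80 | 75 | 75 | 25 | 0 | 0 | 0 | 0 | 42.5%
Mistral Small 3.2 24B | 55 | 50 | 50 | 50 | 50 | 35 | 35 | 35 | 35 | 0 | 39.5%
Ministral 3 3B | 55 | 50 | 50 | 50 | 50 | 45 | 35 | 25 | 20 | 10 | 39.0%
GPT-5.4 Nano (Reasoning) | 90 | 90 | 90 | 70 | 40 | 0 | 0 | 0 | 0 | 0 | 38.0%
Mistral Small 4 | 60 | 60 | 50 | 45 | 35 | 35 | 30 | 25 | 20 | 0 | 36.0%
Llama 3.1 Nemotron 70B | 65 | 50 | 50 | 50 | 35 | 35 | 30 | 15 | 10 | 0 | 34.0%
Qwen 2.5 72B | 50 | 45 | 45 | 40 | 35 | 35 | 35 | 25 | 15 | 10 | 33.5%
WizardLM 2 8x22b | 50 | 50 | 45 | 35 | 30 | 30 | 30 | 15 | 15 | 10 | 31.0%
Llama 3.1 70B | 60 | 50 | 40 | 40 | 35 | 30 | 20 | 15 | 10 | 0 | 30.0%
Arcee AI: Trinity Large (Preview) | 65 | 60 | 50 | 50 | 35 | 30 | 0 | 0 | 0 | 0 | 29.0%
Ministral 3B | 50 | 40 | 35 | 30 | 20 | 20 | 20 | 15 | 15 | 15 | 26.0%
Ministral 8B | 55 | 55 | 45 | 40 | 25 | 0 | 0 | 0 | 0 | 0 | 22.0%
Hermes 3 70B | 40 | 40 | 30 | 25 | 20 | 15 | 15 | 10 | 10 | 5 | 21.0%
GPT-5.4 Nano | 70 | 55 | 35 | 20 | 20 | 5 | 0 | 0 | 0 | 0 | 20.5%
Gemini 2.5 Flash Lite | 45 | 40 | 30 | 25 | 25 | 20 | 10 | 0 | 0 | 0 | 19.5%
Cohere Command R+ (Aug. 2024) | 35 | 35 | 20 | 20 | 20 | 5 | 0 | 0 | 0 | 0 | 13.5%
Arcee AI: Trinity Mini | 30 | 25 | 20 | 15 | 15 | 10 | 10 | 0 | 0 | 0 | 12.5%
GPT-4o Mini (temp=0) | 30 | 25 | 20 | 20 | 5 | 5 | 5 | 5 | 5 | 0 | 12.0%
GPT-4o Mini (temp=1) | 20 | 15 | 15 | 5 | 5 | 5 | 5 | 0 | 0 | 0 | 7.0%
Llama 3.1 8B | 30 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Mistral NeMO | 35 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.0%
Gemma 3 12B | 15 | 5 | 5 | 5 | 5 | 0 | 0 | 0 | 0 | 0 | 3.5%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Rocinante 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%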