Accuracy (recall)

Test: Codex Violation Detection

Avg. Score: 65.2%
Scenarios: 8

Overall Performance

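| Rank | Model | Score | Avg. Cost | Avg. Time | Stability |
| --- | --- | --- | --- | --- | --- |
| 1 | Grok 4.1 Fast | 93.8% | $0.0021 | 21.1s | 84% |
| 2 | Grok 4 Fast | 92.4% | $0.0018 | 12.8s | 72% |
| 3 | Gemini 3 Flash (Preview) | 88.8% | $0.0031 | 4.5s | 69% |
| 4 | Claude Opus 4.5 | 98.0% | $0.041 | 9.7s | 91% |
| 5 | GPT-5.2 | 95.7% | $0.024 | 24.2s | 87% |
| 6 | Claude Sonnet 4.5 | 93.6% | $0.024 | 8.9s | 80% |
| 7 | Gemini 2.5 Flash | 82.5% | $0.0025 | 2.8s | 69% |
| 8 | Gemini 2.5 Pro | 97.4% | $0.035 | 23.2s | 91% |
| 9 | Claude Opus 4.6 | 97.3% | $0.040 | 10.2s | 88% |
| 10 | Claude Sonnet 4 | 90.3% | $0.023 | 9.0s | 72% |
| 11 | o4 Mini | 88.9% | $0.019 | 28.1s | 74% |
| 12 | Z.AI GLM 5 | 92.1% | $0.013 | 56.4s | 79% |
| 13 | ByteDance Seed 1.6 | 91.9% | $0.0067 | 1.0m | 77% |
| 14 | GPT-5 Mini | 91.2% | $0.0092 | 51.7s | 74% |
| 15 | Z.AI GLM 4.7 | 91.3% | $0.0091 | 1.0m | 77% |
| 16 | Claude Sonnet 4.6 | 82.9% | $0.024 | 9.9s | 70% |
| 17 | Mistral Large 3 | 75.5% | $0.0030 | 10.2s | 58% |
| 18 | Mistral Large | 79.3% | $0.012 | 9.1s | 59% |
| 19 | Qwen 3.5 Plus (2026-02-15) | 85.0% | $0.0041 | 34.2s | 55% |
| 20 | Mistral Large 2 | 75.9% | $0.012 | 8.8s | 57% |
| 21 | Mistral Medium 3.1 | 75.1% | $0.0029 | 7.7s | 46% |
| 22 | GPT-5.1 | 92.7% | $0.037 | 52.8s | 79% |
| 23 | o4 Mini High | 90.7% | $0.033 | 51.3s | 76% |
| 24 | MoonshotAI: Kimi K2.5 | 92.9% | $0.015 | 1.5m | 79% |
| 25 | Minimax M2.5 | 74.6% | $0.0020 | 25.5s | 52% |
| 26 | DeepSeek V3.2 | 75.0% | $0.0013 | 17.2s | 45% |
| 27 | GPT-4.1 | 77.7% | $0.0081 | 10.0s | 44% |
| 28 | Gemini 3.1 Pro (Preview) | 98.3% | $0.068 | 52.3s | 93% |
| 29 | Z.AI GLM 4.5 | 75.6% | $0.0026 | 18.5s | 45% |
| 30 | ByteDance Seed 1.6 Flash | 67.9% | $0.0009 | 12.0s | 46% |
| 31 | DeepSeek-V2 Chat | 71.1% | $0.0020 | 13.1s | 41% |
| 32 | GPT-4.1 Mini | 63.7% | $0.0015 | 5.8s | 43% |
| 33 | Claude Haiku 4.5 | 66.8% | $0.0078 | 5.8s | 45% |
| 34 | Gemini 3 Pro (Preview) | 91.0% | $0.050 | 34.0s | 70% |
| 35 | DeepSeek V3 (2024-12-26) | 69.9% | $0.0019 | 15.6s | 41% |
| 36 | Z.AI GLM 4.6 | 86.1% | $0.0049 | 1.3m | 59% |
| 37 | Writer: Palmyra X5 | 67.2% | $0.0062 | 12.1s | 44% |
| 38 | Stealth: Aurora Alpha | 77.5% |  | 6.2s | 46% |
| 39 | Grok 4 | 94.1% | $0.051 | 1.2m | 80% |
| 40 | Claude 3.7 Sonnet | 77.4% | $0.023 | 10.2s | 42% |
| 41 | GPT-4o, Aug. 6th (temp=1) | 65.4% | $0.011 | 3.6s | 36% |
| 42 | Mistral Small Creative | 56.8% | $0.0006 | 4.3s | 36% |
| 43 | DeepSeek V3 (2025-03-24) | 68.9% | $0.0015 | 22.9s | 34% |
| 44 | GPT-4o, Aug. 6th (temp=0) | 67.9% | $0.015 | 5.0s | 36% |
| 45 | Claude 3.5 Sonnet | 79.2% | $0.042 | 10.5s | 47% |
| 46 | DeepSeek V3.1 | 66.1% | $0.0014 | 31.1s | 33% |
| 47 | GPT-4o, May 13th (temp=0) | 71.6% | $0.030 | 5.4s | 36% |
| 48 | Mistral Small 3.2 24B | 53.8% | $0.0006 | 10.4s | 32% |
| 49 | Gemma 3 27B | 53.2% | $0.0005 | 13.3s | 33% |
| 50 | Z.AI GLM 4.7 Flash | 68.2% | $0.0018 | 1.1m | 45% |
| 51 | GPT-5 | 93.3% | $0.061 | 1.6m | 81% |
| 52 | Gemini 2.5 Flash Lite | 47.9% | $0.0005 | 2.3s | 24% |
| 53 | Claude 3.5 Haiku | 51.2% | $0.0049 | 6.2s | 25% |
| 54 | GPT-4o, May 13th (temp=1) | 58.1% | $0.026 | 3.7s | 34% |
| 55 | Ministral 3 14B | 46.4% | $0.0010 | 6.3s | 25% |
| 56 | Hermes 3 405B | 56.2% | $0.0044 | 17.2s | 20% |
| 57 | Llama 3.1 Nemotron 70B | 55.8% | $0.0055 | 18.5s | 19% |
| 58 | Qwen 3.5 397B A17B | 93.7% | $0.026 | 2.9m | 75% |
| 59 | Ministral 3 8B | 39.9% | $0.0007 | 4.8s | 20% |
| 60 | Llama 3.1 70B | 51.2% | $0.0021 | 24.3s | 18% |
| 61 | Qwen 2.5 72B | 42.3% | $0.0008 | 13.9s | 18% |
| 62 | Ministral 8B | 34.7% | $0.0005 | 5.6s | 18% |
| 63 | Claude Opus 4 | 84.4% | $0.116 | 15.8s | 65% |
| 64 | Arcee AI: Trinity Mini | 33.3% | $0.0004 | 10.3s | 18% |
| 65 | GPT-5 Nano | 67.2% | $0.0049 | 1.9m | 42% |
| 66 | Ministral 3 3B | 28.6% | $0.0005 | 3.3s | 16% |
| 67 | Gemma 3 12B | 35.9% | $0.0003 | 12.0s | 11% |
| 68 | GPT-4o Mini (temp=1) | 29.4% | $0.0006 | 7.4s | 13% |
| 69 | Ministral 3B | 25.0% | $0.0002 | 2.9s | 14% |
| 70 | GPT-4o Mini (temp=0) | 31.3% | $0.0006 | 25.0s | 19% |
| 71 | Cohere Command R+ (Aug. 2024) | 32.3% | $0.014 | 10.0s | 15% |
| 72 | Claude 3 Haiku | 19.4% | $0.0015 | 3.6s | 12% |
| 73 | Hermes 3 70B | 28.5% | $0.0013 | 29.2s | 14% |
| 74 | Mistral NeMO | 20.6% | $0.0007 | 13.3s | 7% |
| 75 | Llama 3.1 8B | 16.1% | $0.0002 | 16.1s | 0% |
| 76 | WizardLM 2 8x22b | 17.1% | $0.0036 | 15.7s | 0% |
| 77 | Rocinante 12B | 6.4% | $0.0009 | 6.0s | 0% |
| 78 | GPT-4.1 Nano | 2.6% | $0.0004 | 3.9s | 0% |
| 79 | Gemma 3 4B | 3.2% | $0.0002 | 12.5s | 0% |
| 80 | Arcee AI: Trinity Large (Preview) | 46.2% | $0.0000 | 3.1m | 24% |

Average score: 65.23%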

Individual Scenarios

matrix

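| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4 | 100 | 98 | 98 | 97 | 97 | 95 | 95 | 94 | 94 | 91 | 96.1% |
| Qwen 3.5 397B A17B | 100 | 100 | 97 | 95 | 94 | 94 | 89 | 89 | 88 | 85 | 93.2% |
| Gemini 2.5 Pro | 100 | 95 | 92 | 92 | 92 | 89 | 88 | 88 | 88 | 86 | 91.2% |
| GPT-5.2 | 97 | 97 | 94 | 92 | 92 | 91 | 89 | 88 | 86 | 85 | 91.2% |
| Grok 4 Fast | 100 | 100 | 97 | 92 | 91 | 89 | 86 | 86 | 83 | 82 | 90.8% |
| Grok 4.1 Fast | 100 | 97 | 94 | 91 | 91 | 91 | 89 | 89 | 82 | 82 | 90.6% |
| GPT-5 | 94 | 94 | 92 | 92 | 91 | 91 | 88 | 88 | 88 | 88 | 90.6% |
| Gemini 3.1 Pro (Preview) | 94 | 94 | 91 | 91 | 91 | 89 | 88 | 88 | 88 | 86 | 90.0% |
| Claude Opus 4.5 | 92 | 92 | 91 | 91 | 89 | 89 | 88 | 88 | 85 | 83 | 88.9% |
| GPT-5 Mini | 91 | 91 | 88 | 88 | 86 | 86 | 85 | 85 | 85 | 85 | 87.0% |
| ByteDance Seed 1.6 | 94 | 91 | 91 | 89 | 88 | 85 | 83 | 82 | 80 | 79 | 86.2% |
| Z.AI GLM 4.6 | 94 | 92 | 92 | 88 | 88 | 86 | 83 | 82 | 77 | 77 | 86.1% |
| o4 Mini High | 94 | 91 | 91 | 88 | 85 | 83 | 82 | 82 | 82 | 79 | 85.6% |
| GPT-5.1 | 94 | 91 | 88 | 85 | 85 | 85 | 82 | 82 | 80 | 80 | 85.2% |
| Z.AI GLM 5 | 97 | 85 | 85 | 83 | 82 | 82 | 82 | 79 | 79 | 77 | 83.0% |
| Claude Opus 4.6 | 85 | 85 | 85 | 85 | 83 | 83 | 80 | 80 | 80 | 80 | 82.7% |
| Z.AI GLM 4.7 | 94 | 88 | 88 | 83 | 83 | 82 | 79 | 79 | 74 | 73 | 82.3% |
| Claude Opus 4 | 92 | 89 | 85 | 83 | 82 | 82 | 79 | 77 | 77 | 70 | 81.7% |
| MoonshotAI: Kimi K2.5 | 91 | 86 | 86 | 86 | 82 | 82 | 80 | 80 | 73 | 65 | 81.2% |
| Claude Sonnet 4.6 | 86 | 82 | 82 | 82 | 80 | 77 | 77 | 76 | 76 | 74 | 79.2% |
| Claude Sonnet 4 | 83 | 82 | 80 | 80 | 80 | 79 | 79 | 77 | 73 | 71 | 78.5% |
| Gemini 3 Flash (Preview) | 85 | 82 | 82 | 80 | 79 | 76 | 76 | 76 | 74 | 74 | 78.3% |
| Claude Sonnet 4.5 | 85 | 83 | 82 | 80 | 79 | 77 | 77 | 77 | 71 | 70 | 78.2% |
| o4 Mini | 88 | 86 | 82 | 82 | 79 | 76 | 76 | 73 | 70 | 67 | 77.7% |
| Gemini 2.5 Flash | 79 | 77 | 76 | 76 | 74 | 73 | 70 | 64 | 61 | 59 | 70.8% |
| Gemini 3 Pro (Preview) | 83 | 82 | 80 | 76 | 76 | 76 | 74 | 73 | 67 | 15 | 70.2% |
| Qwen 3.5 Plus (2026-02-15) | 92 | 85 | 83 | 82 | 79 | 77 | 76 | 39 | 38 | 0 | 65.2% |
| Mistral Large | 79 | 70 | 67 | 67 | 65 | 65 | 62 | 61 | 59 | 58 | 65.2% |
| Mistral Large 3 | 73 | 73 | 70 | 68 | 68 | 64 | 61 | 61 | 56 | 52 | 64.4% |
| GPT-4.1 | 77 | 74 | 70 | 67 | 65 | 64 | 62 | 58 | 56 | 45 | 63.8% |
| Stealth: Aurora Alpha | 68 | 67 | 65 | 65 | 59 | 58 | 58 | 58 | 55 | 42 | 59.4% |
| Mistral Large 2 | 67 | 67 | 67 | 61 | 59 | 58 | 56 | 56 | 53 | 47 | 58.9% |
| DeepSeek V3.2 | 94 | 65 | 59 | 59 | 52 | 47 | 47 | 47 | 44 | 39 | 55.3% |
| Minimax M2.5 | 73 | 71 | 64 | 61 | 53 | 53 | 50 | 42 | 41 | 36 | 54.4% |
| Z.AI GLM 4.5 | 76 | 67 | 55 | 53 | 53 | 50 | 45 | 45 | 45 | 36 | 52.6% |
| Claude Haiku 4.5 | 67 | 58 | 50 | 50 | 50 | 50 | 48 | 47 | 47 | 42 | 50.9% |
| DeepSeek-V2 Chat | 61 | 58 | 56 | 53 | 50 | 50 | 47 | 44 | 39 | 33 | 49.1% |
| Mistral Medium 3.1 | 62 | 56 | 53 | 52 | 52 | 48 | 48 | 48 | 47 | 21 | 48.8% |
| DeepSeek V3 (2024-12-26) | 67 | 64 | 59 | 58 | 58 | 56 | 48 | 35 | 24 | 17 | 48.5% |
| GPT-5 Nano | 58 | 58 | 56 | 52 | 50 | 48 | 45 | 44 | 35 | 35 | 48.0% |
| DeepSeek V3.1 | 71 | 67 | 64 | 62 | 58 | 48 | 41 | 36 | 24 | 0 | 47.1% |
| Claude 3.7 Sonnet | 61 | 59 | 48 | 44 | 39 | 39 | 38 | 38 | 35 | 30 | 43.2% |
| Writer: Palmyra X5 | 56 | 56 | 53 | 47 | 44 | 44 | 41 | 39 | 38 | 3 | 42.1% |
| Z.AI GLM 4.7 Flash | 53 | 47 | 45 | 42 | 41 | 38 | 33 | 33 | 32 | 29 | 39.4% |
| DeepSeek V3 (2025-03-24) | 68 | 62 | 48 | 47 | 45 | 33 | 32 | 30 | 23 | 0 | 38.9% |
| ByteDance Seed 1.6 Flash | 59 | 41 | 39 | 39 | 38 | 38 | 35 | 32 | 29 | 23 | 37.3% |
| Mistral Small Creative | 47 | 41 | 41 | 41 | 38 | 35 | 33 | 33 | 29 | 5 | 34.2% |
| Claude 3.5 Sonnet | 38 | 36 | 35 | 33 | 33 | 32 | 32 | 32 | 30 | 30 | 33.2% |
| GPT-4.1 Mini | 53 | 44 | 33 | 33 | 30 | 29 | 26 | 21 | 20 | 14 | 30.3% |
| Mistral Small 3.2 24B | 44 | 39 | 39 | 35 | 24 | 21 | 18 | 18 | 9 | 0 | 24.8% |
| GPT-4o, May 13th (temp=1) | 36 | 30 | 29 | 29 | 29 | 26 | 24 | 18 | 17 | 6 | 24.4% |
| GPT-4o, Aug. 6th (temp=0) | 44 | 35 | 29 | 29 | 29 | 29 | 27 | 23 | 0 | 0 | 24.4% |
| GPT-4o, Aug. 6th (temp=1) | 35 | 32 | 29 | 27 | 24 | 21 | 20 | 15 | 14 | 14 | 23.0% |
| Ministral 3 8B | 35 | 30 | 29 | 24 | 21 | 21 | 20 | 18 | 14 | 3 | 21.5% |
| Ministral 3 14B | 32 | 32 | 27 | 24 | 24 | 21 | 14 | 6 | 0 | 0 | 18.0% |
| GPT-4o, May 13th (temp=0) | 39 | 35 | 33 | 33 | 32 | 0 | 0 | 0 | 0 | 0 | 17.3% |
| Qwen 2.5 72B | 27 | 26 | 20 | 17 | 15 | 15 | 15 | 12 | 9 | 8 | 16.4% |
| Ministral 8B | 33 | 32 | 24 | 23 | 23 | 12 | 2 | 2 | 2 | 2 | 15.3% |
| Gemma 3 27B | 36 | 26 | 15 | 15 | 14 | 12 | 11 | 9 | 9 | 0 | 14.7% |
| Claude 3.5 Haiku | 24 | 18 | 15 | 15 | 14 | 12 | 11 | 9 | 9 | 0 | 12.7% |
| Gemini 2.5 Flash Lite | 32 | 30 | 26 | 18 | 5 | 0 | 0 | 0 | 0 | 0 | 11.1% |
| Hermes 3 405B | 27 | 18 | 11 | 11 | 8 | 8 | 6 | 6 | 5 | 2 | 10.0% |
| Hermes 3 70B | 14 | 11 | 11 | 8 | 8 | 8 | 5 | 2 | 0 | 0 | 6.4% |
| Llama 3.1 Nemotron 70B | 12 | 12 | 11 | 8 | 6 | 5 | 3 | 0 | 0 | 0 | 5.6% |
| Arcee AI: Trinity Mini | 15 | 9 | 6 | 6 | 5 | 3 | 3 | 2 | 0 | 0 | 4.8% |
| Cohere Command R+ (Aug. 2024) | 12 | 12 | 9 | 6 | 5 | 3 | 0 | 0 | 0 | 0 | 4.7% |
| Gemma 3 12B | 11 | 9 | 8 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 3.8% |
| GPT-4o Mini (temp=1) | 9 | 6 | 6 | 5 | 2 | 0 | 0 | 0 | 0 | 0 | 2.7% |
| Llama 3.1 70B | 12 | 6 | 5 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 2.7% |
| Claude 3 Haiku | 11 | 8 | 3 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | 2.6% |
| WizardLM 2 8x22b | 9 | 6 | 6 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 2.4% |
| Mistral NeMO | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.1% |
| GPT-4o Mini (temp=0) | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.8% |
| Llama 3.1 8B | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.8% |
| Ministral 3 3B | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.8% |
| GPT-4.1 Nano | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3% |
| Ministral 3B | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3% |
| Gemma 3 4B | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2% |
| Arcee AI: Trinity Large (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Rocinante 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |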
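
| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 98.9% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 97 | 92 | 97.9% |
| Grok 4 | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 96.8% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 97 | 97 | 95 | 95 | 92 | 92 | 92 | 96.1% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 92 | 87 | 95.8% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 92 | 92 | 92 | 92 | 87 | 84 | 93.9% |
| GPT-5.2 | 97 | 97 | 95 | 95 | 95 | 92 | 89 | 89 | 89 | 87 | 92.6% |
| ByteDance Seed 1.6 | 100 | 95 | 95 | 95 | 92 | 89 | 89 | 89 | 89 | 84 | 91.8% |
| Grok 4 Fast | 95 | 95 | 95 | 89 | 89 | 89 | 89 | 89 | 89 | 79 | 90.0% |
| Grok 4.1 Fast | 95 | 95 | 95 | 92 | 92 | 89 | 89 | 87 | 84 | 79 | 89.7% |
| GPT-5 | 95 | 95 | 92 | 89 | 89 | 89 | 89 | 89 | 87 | 82 | 89.7% |
| Qwen 3.5 397B A17B | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 84 | 84 | 84 | 89.7% |
| Z.AI GLM 4.7 | 100 | 92 | 92 | 92 | 89 | 87 | 84 | 84 | 84 | 74 | 87.9% |
| Z.AI GLM 5 | 100 | 100 | 95 | 92 | 89 | 82 | 82 | 79 | 79 | 79 | 87.6% |
| Gemini 3 Flash (Preview) | 92 | 92 | 92 | 92 | 92 | 84 | 84 | 84 | 84 | 76 | 87.4% |
| GPT-5.1 | 92 | 92 | 92 | 89 | 89 | 89 | 84 | 82 | 82 | 82 | 87.4% |
| o4 Mini High | 95 | 95 | 87 | 87 | 87 | 87 | 84 | 84 | 82 | 82 | 86.8% |
| Gemini 3 Pro (Preview) | 100 | 95 | 92 | 92 | 89 | 89 | 82 | 79 | 76 | 74 | 86.8% |
| Z.AI GLM 4.6 | 97 | 92 | 92 | 89 | 89 | 89 | 87 | 84 | 74 | 71 | 86.6% |
| GPT-4.1 | 100 | 100 | 95 | 92 | 92 | 84 | 84 | 76 | 71 | 66 | 86.1% |
| o4 Mini | 95 | 89 | 89 | 87 | 87 | 87 | 82 | 82 | 79 | 74 | 85.0% |
| Gemini 2.5 Flash | 92 | 92 | 89 | 89 | 87 | 82 | 82 | 71 | 71 | 68 | 82.4% |
| Claude Opus 4 | 95 | 92 | 87 | 84 | 79 | 79 | 79 | 79 | 71 | 68 | 81.3% |
| Qwen 3.5 Plus (2026-02-15) | 92 | 92 | 84 | 84 | 84 | 84 | 79 | 79 | 68 | 63 | 81.1% |
| Writer: Palmyra X5 | 89 | 89 | 87 | 84 | 82 | 79 | 74 | 71 | 68 | 66 | 78.9% |
| Stealth: Aurora Alpha | 89 | 89 | 84 | 76 | 76 | 76 | 76 | 76 | 76 | 66 | 78.7% |
| GPT-5 Mini | 89 | 89 | 87 | 87 | 87 | 87 | 84 | 79 | 53 | 45 | 78.7% |
| Z.AI GLM 4.5 | 87 | 87 | 82 | 82 | 82 | 79 | 76 | 74 | 63 | 63 | 77.4% |
| Minimax M2.5 | 89 | 89 | 82 | 82 | 79 | 76 | 74 | 71 | 68 | 61 | 77.1% |
| Claude 3.5 Sonnet | 82 | 82 | 82 | 82 | 74 | 74 | 74 | 66 | 66 | 61 | 73.9% |
| Claude 3.7 Sonnet | 76 | 74 | 74 | 74 | 74 | 74 | 74 | 71 | 71 | 71 | 73.2% |
| Mistral Medium 3.1 | 76 | 76 | 74 | 74 | 74 | 74 | 74 | 68 | 66 | 63 | 71.8% |
| Mistral Large | 87 | 84 | 79 | 71 | 68 | 68 | 66 | 66 | 61 | 61 | 71.1% |
| GPT-4o, May 13th (temp=0) | 79 | 79 | 74 | 74 | 74 | 68 | 68 | 68 | 63 | 61 | 70.8% |
| DeepSeek V3 (2024-12-26) | 84 | 82 | 79 | 71 | 68 | 68 | 66 | 63 | 63 | 61 | 70.5% |
| Mistral Large 2 | 82 | 79 | 79 | 71 | 66 | 66 | 66 | 66 | 63 | 63 | 70.0% |
| DeepSeek-V2 Chat | 89 | 76 | 76 | 74 | 68 | 68 | 63 | 63 | 61 | 58 | 69.7% |
| DeepSeek V3.2 | 95 | 84 | 82 | 76 | 74 | 71 | 71 | 63 | 61 | 18 | 69.5% |
| Z.AI GLM 4.7 Flash | 79 | 74 | 74 | 71 | 68 | 68 | 63 | 63 | 61 | 55 | 67.6% |
| Mistral Large 3 | 74 | 71 | 71 | 68 | 68 | 68 | 66 | 63 | 63 | 61 | 67.4% |
| GPT-4.1 Mini | 76 | 74 | 74 | 71 | 66 | 66 | 63 | 61 | 50 | 34 | 63.4% |
| DeepSeek V3 (2025-03-24) | 89 | 89 | 74 | 74 | 68 | 66 | 63 | 58 | 29 | 0 | 61.1% |
| Claude Haiku 4.5 | 87 | 79 | 79 | 79 | 71 | 71 | 63 | 61 | 0 | 0 | 58.9% |
| ByteDance Seed 1.6 Flash | 74 | 68 | 66 | 63 | 61 | 61 | 61 | 55 | 50 | 29 | 58.7% |
| GPT-4o, Aug. 6th (temp=0) | 74 | 61 | 61 | 58 | 58 | 58 | 55 | 55 | 55 | 53 | 58.7% |
| Mistral Small 3.2 24B | 68 | 63 | 63 | 61 | 58 | 55 | 55 | 55 | 55 | 53 | 58.7% |
| GPT-5 Nano | 82 | 82 | 76 | 74 | 71 | 68 | 66 | 61 | 0 | 0 | 57.9% |
| DeepSeek V3.1 | 82 | 82 | 82 | 79 | 71 | 71 | 55 | 42 | 0 | 0 | 56.3% |
| Gemini 2.5 Flash Lite | 76 | 68 | 61 | 61 | 58 | 50 | 47 | 47 | 45 | 39 | 55.3% |
| GPT-4o, Aug. 6th (temp=1) | 68 | 68 | 66 | 63 | 61 | 58 | 55 | 47 | 45 | 0 | 53.2% |
| GPT-4o, May 13th (temp=1) | 74 | 71 | 68 | 53 | 53 | 53 | 45 | 45 | 32 | 32 | 52.4% |
| Mistral Small Creative | 63 | 53 | 53 | 50 | 47 | 47 | 45 | 39 | 39 | 21 | 45.8% |
| Arcee AI: Trinity Large (Preview) | 76 | 61 | 58 | 50 | 50 | 42 | 39 | 37 | 32 | 0 | 44.5% |
| Ministral 3 8B | 53 | 50 | 47 | 47 | 45 | 45 | 42 | 39 | 39 | 0 | 40.8% |
| Gemma 3 27B | 47 | 45 | 42 | 39 | 39 | 37 | 37 | 37 | 37 | 37 | 39.7% |
| Ministral 8B | 50 | 45 | 42 | 42 | 39 | 37 | 37 | 37 | 34 | 34 | 39.7% |
| Ministral 3 14B | 50 | 47 | 45 | 42 | 39 | 37 | 37 | 34 | 32 | 24 | 38.7% |
| Hermes 3 405B | 50 | 50 | 47 | 45 | 42 | 37 | 34 | 29 | 29 | 21 | 38.4% |
| Claude 3.5 Haiku | 39 | 37 | 34 | 34 | 34 | 34 | 34 | 34 | 32 | 32 | 34.5% |
| Ministral 3B | 53 | 50 | 45 | 39 | 39 | 26 | 24 | 18 | 13 | 8 | 31.6% |
| GPT-4o Mini (temp=0) | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 11 | 29.5% |
| Gemma 3 12B | 37 | 34 | 34 | 29 | 29 | 26 | 26 | 24 | 21 | 11 | 27.1% |
| Arcee AI: Trinity Mini | 37 | 37 | 37 | 37 | 34 | 26 | 24 | 16 | 11 | 8 | 26.6% |
| Qwen 2.5 72B | 42 | 39 | 39 | 37 | 34 | 32 | 21 | 8 | 3 | 0 | 25.5% |
| Ministral 3 3B | 50 | 39 | 39 | 32 | 29 | 26 | 16 | 16 | 5 | 0 | 25.3% |
| Llama 3.1 Nemotron 70B | 39 | 32 | 29 | 26 | 26 | 24 | 21 | 18 | 16 | 0 | 23.2% |
| Llama 3.1 70B | 47 | 34 | 32 | 26 | 21 | 16 | 16 | 13 | 3 | 0 | 20.8% |
| Hermes 3 70B | 34 | 34 | 26 | 26 | 18 | 16 | 16 | 11 | 3 | 0 | 18.4% |
| Cohere Command R+ (Aug. 2024) | 39 | 34 | 32 | 26 | 24 | 8 | 5 | 5 | 0 | 0 | 17.4% |
| GPT-4o Mini (temp=1) | 24 | 21 | 18 | 16 | 13 | 11 | 11 | 5 | 0 | 0 | 11.8% |
| Claude 3 Haiku | 26 | 18 | 11 | 8 | 5 | 5 | 5 | 3 | 0 | 0 | 8.2% |
| Mistral NeMO | 18 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.9% |
| Llama 3.1 8B | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5% |
| GPT-4.1 Nano | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3% |
| Rocinante 12B | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3% |
| Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |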
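
| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 98.5% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 85 | 97.5% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 96.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 85 | 95.5% |
| GPT-5.2 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95.5% |
| GPT-5 Mini | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 85 | 95.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 85 | 70 | 91.5% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 85 | 85 | 85 | 91.0% |
| GPT-5 | 100 | 100 | 95 | 95 | 90 | 90 | 90 | 85 | 85 | 65 | 89.5% |
| Z.AI GLM 5 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 75 | 70 | 88.5% |
| Grok 4 Fast | 100 | 100 | 95 | 95 | 85 | 85 | 85 | 80 | 80 | 75 | 88.0% |
| o4 Mini | 90 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 85 | 80 | 87.5% |
| Gemini 3 Pro (Preview) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 75 | 87.5% |
| Claude Sonnet 4.6 | 100 | 90 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 87.0% |
| Grok 4.1 Fast | 100 | 100 | 95 | 95 | 90 | 85 | 85 | 80 | 75 | 65 | 87.0% |
| o4 Mini High | 95 | 90 | 90 | 90 | 90 | 85 | 85 | 80 | 80 | 70 | 85.5% |
| ByteDance Seed 1.6 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 75 | 75 | 75 | 85.5% |
| Grok 4 | 100 | 95 | 90 | 90 | 90 | 85 | 85 | 65 | 60 | 55 | 81.5% |
| Claude Sonnet 4.5 | 100 | 90 | 85 | 85 | 85 | 75 | 75 | 70 | 70 | 70 | 80.5% |
| MoonshotAI: Kimi K2.5 | 90 | 90 | 90 | 90 | 90 | 85 | 80 | 65 | 60 | 60 | 80.0% |
| Z.AI GLM 4.7 | 100 | 90 | 85 | 80 | 80 | 75 | 75 | 75 | 70 | 70 | 80.0% |
| Z.AI GLM 4.6 | 90 | 90 | 85 | 85 | 85 | 80 | 75 | 75 | 75 | 60 | 80.0% |
| Claude Haiku 4.5 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80.0% |
| Claude Sonnet 4 | 100 | 85 | 85 | 85 | 70 | 70 | 70 | 70 | 70 | 70 | 77.5% |
| Claude 3.7 Sonnet | 90 | 90 | 80 | 75 | 75 | 75 | 75 | 75 | 65 | 65 | 76.5% |
| Gemini 3 Flash (Preview) | 90 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 70 | 76.0% |
| Gemini 2.5 Flash | 90 | 90 | 80 | 80 | 75 | 75 | 70 | 70 | 70 | 55 | 75.5% |
| Mistral Large | 90 | 80 | 75 | 75 | 75 | 65 | 65 | 60 | 60 | 60 | 70.5% |
| Claude 3.5 Sonnet | 80 | 75 | 70 | 70 | 70 | 70 | 70 | 65 | 65 | 65 | 70.0% |
| Claude Opus 4 | 90 | 75 | 75 | 75 | 75 | 65 | 65 | 60 | 60 | 60 | 70.0% |
| Minimax M2.5 | 80 | 80 | 75 | 75 | 70 | 70 | 65 | 65 | 60 | 60 | 70.0% |
| Z.AI GLM 4.5 | 75 | 75 | 75 | 75 | 75 | 70 | 70 | 65 | 60 | 55 | 69.5% |
| ByteDance Seed 1.6 Flash | 90 | 80 | 75 | 70 | 70 | 65 | 65 | 60 | 55 | 55 | 68.5% |
| Mistral Medium 3.1 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 70 | 70 | 0 | 66.5% |
| GPT-4.1 Mini | 70 | 70 | 70 | 70 | 70 | 70 | 65 | 65 | 65 | 50 | 66.5% |
| Mistral Large 3 | 85 | 75 | 65 | 65 | 65 | 60 | 60 | 60 | 60 | 60 | 65.5% |
| GPT-4o, May 13th (temp=0) | 70 | 70 | 70 | 65 | 65 | 65 | 65 | 65 | 65 | 50 | 65.0% |
| Mistral Large 2 | 80 | 75 | 65 | 65 | 65 | 65 | 65 | 60 | 50 | 50 | 64.0% |
| Llama 3.1 Nemotron 70B | 75 | 65 | 65 | 65 | 65 | 65 | 60 | 60 | 60 | 60 | 64.0% |
| Stealth: Aurora Alpha | 85 | 80 | 75 | 75 | 75 | 75 | 75 | 75 | 10 | 10 | 63.5% |
| Mistral Small 3.2 24B | 75 | 75 | 75 | 70 | 65 | 65 | 55 | 55 | 50 | 50 | 63.5% |
| GPT-4.1 | 75 | 65 | 65 | 65 | 65 | 65 | 65 | 60 | 55 | 50 | 63.0% |
| GPT-4o, Aug. 6th (temp=0) | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 60 | 60 | 55 | 63.0% |
| Z.AI GLM 4.7 Flash | 70 | 70 | 70 | 70 | 65 | 65 | 60 | 55 | 55 | 45 | 62.5% |
| DeepSeek-V2 Chat | 80 | 70 | 70 | 70 | 70 | 60 | 55 | 50 | 45 | 45 | 61.5% |
| GPT-4o, Aug. 6th (temp=1) | 70 | 70 | 65 | 65 | 60 | 55 | 55 | 55 | 55 | 50 | 60.0% |
| Ministral 3 14B | 80 | 80 | 75 | 75 | 65 | 65 | 65 | 60 | 5 | 5 | 57.5% |
| Writer: Palmyra X5 | 70 | 70 | 60 | 60 | 60 | 55 | 55 | 55 | 45 | 45 | 57.5% |
| Llama 3.1 70B | 70 | 65 | 60 | 55 | 55 | 55 | 55 | 55 | 55 | 50 | 57.5% |
| Mistral Small Creative | 75 | 70 | 70 | 60 | 55 | 55 | 55 | 50 | 45 | 40 | 57.5% |
| GPT-4o, May 13th (temp=1) | 70 | 65 | 60 | 60 | 60 | 60 | 55 | 50 | 50 | 40 | 57.0% |
| GPT-5 Nano | 75 | 65 | 65 | 60 | 60 | 55 | 55 | 45 | 45 | 45 | 57.0% |
| DeepSeek V3 (2025-03-24) | 70 | 70 | 60 | 55 | 55 | 55 | 50 | 45 | 45 | 45 | 55.0% |
| DeepSeek V3 (2024-12-26) | 70 | 65 | 60 | 55 | 55 | 55 | 55 | 45 | 45 | 45 | 55.0% |
| Gemini 2.5 Flash Lite | 75 | 60 | 60 | 55 | 55 | 55 | 55 | 50 | 40 | 35 | 54.0% |
| Qwen 2.5 72B | 70 | 60 | 60 | 60 | 55 | 55 | 50 | 50 | 45 | 35 | 54.0% |
| Gemma 3 27B | 70 | 65 | 55 | 55 | 55 | 50 | 50 | 45 | 45 | 45 | 53.5% |
| Claude 3.5 Haiku | 60 | 60 | 60 | 55 | 50 | 50 | 50 | 45 | 45 | 40 | 51.5% |
| DeepSeek V3.2 | 65 | 65 | 65 | 55 | 55 | 55 | 45 | 40 | 35 | 30 | 51.0% |
| Arcee AI: Trinity Large (Preview) | 70 | 55 | 55 | 55 | 55 | 50 | 50 | 45 | 35 | 20 | 49.0% |
| Hermes 3 405B | 65 | 55 | 50 | 45 | 45 | 45 | 45 | 40 | 40 | 40 | 47.0% |
| DeepSeek V3.1 | 70 | 55 | 55 | 55 | 50 | 45 | 40 | 40 | 40 | 0 | 45.0% |
| Ministral 3 3B | 50 | 45 | 45 | 40 | 40 | 40 | 35 | 35 | 30 | 30 | 39.0% |
| Ministral 3 8B | 55 | 50 | 45 | 45 | 40 | 35 | 30 | 25 | 20 | 5 | 35.0% |
| Ministral 8B | 65 | 45 | 45 | 45 | 35 | 30 | 30 | 25 | 0 | 0 | 32.0% |
| Gemma 3 12B | 50 | 35 | 35 | 35 | 25 | 25 | 25 | 20 | 20 | 15 | 28.5% |
| Ministral 3B | 55 | 40 | 30 | 30 | 30 | 30 | 30 | 15 | 15 | 5 | 28.0% |
| GPT-4o Mini (temp=1) | 35 | 35 | 35 | 35 | 35 | 25 | 25 | 20 | 10 | 10 | 26.5% |
| Hermes 3 70B | 45 | 40 | 35 | 35 | 30 | 30 | 25 | 20 | 5 | 0 | 26.5% |
| GPT-4o Mini (temp=0) | 35 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 26.0% |
| Arcee AI: Trinity Mini | 45 | 35 | 30 | 30 | 25 | 20 | 20 | 20 | 20 | 10 | 25.5% |
| Cohere Command R+ (Aug. 2024) | 50 | 45 | 40 | 40 | 35 | 30 | 0 | 0 | 0 | 0 | 24.0% |
| Mistral NeMO | 50 | 40 | 40 | 35 | 25 | 25 | 0 | 0 | 0 | 0 | 21.5% |
| Llama 3.1 8B | 35 | 30 | 30 | 30 | 30 | 20 | 15 | 10 | 0 | 0 | 20.0% |
| Claude 3 Haiku | 30 | 30 | 20 | 20 | 20 | 20 | 20 | 20 | 15 | 5 | 20.0% |
| Gemma 3 4B | 30 | 25 | 20 | 10 | 10 | 5 | 0 | 0 | 0 | 0 | 10.0% |
| Rocinante 12B | 50 | 45 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.5% |
| GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |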
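
| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.5% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 98.5% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 98.5% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 98.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 97.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 97.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 90 | 90 | 96.5% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 96.5% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 85 | 96.5% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 85 | 80 | 95.5% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 85 | 94.0% |
| o4 Mini | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 93.0% |
| Grok 4 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 92.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 90 | 90 | 85 | 85 | 85 | 85 | 92.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 90 | 90 | 85 | 85 | 85 | 85 | 92.0% |
| Llama 3.1 70B | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 91.0% |
| Llama 3.1 Nemotron 70B | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 91.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 85 | 85 | 85 | 91.0% |
| Gemini 2.5 Flash | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Mistral Large | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 85 | 89.5% |
| Mistral Large 2 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 89.0% |
| Hermes 3 405B | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 85 | 80 | 60 | 88.5% |
| Mistral Large 3 | 90 | 90 | 90 | 90 | 90 | 85 | 85 | 85 | 85 | 85 | 87.5% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 90 | 85 | 85 | 85 | 85 | 65 | 65 | 86.0% |
| Gemma 3 12B | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 75 | 85.5% |
| Minimax M2.5 | 100 | 100 | 90 | 90 | 90 | 85 | 75 | 75 | 75 | 70 | 85.0% |
| Claude Sonnet 4.6 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85.0% |
| GPT-4.1 Mini | 90 | 90 | 90 | 90 | 85 | 80 | 80 | 75 | 75 | 75 | 83.0% |
| ByteDance Seed 1.6 Flash | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 80 | 75 | 65 | 82.0% |
| Stealth: Aurora Alpha | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 0 | 81.0% |
| Mistral Small 3.2 24B | 90 | 90 | 90 | 80 | 75 | 75 | 75 | 75 | 75 | 75 | 80.0% |
| GPT-5 Nano | 90 | 90 | 80 | 80 | 80 | 80 | 70 | 70 | 70 | 70 | 78.0% |
| Qwen 2.5 72B | 90 | 90 | 90 | 80 | 80 | 80 | 80 | 80 | 70 | 35 | 77.5% |
| Z.AI GLM 4.7 Flash | 100 | 90 | 90 | 80 | 75 | 75 | 75 | 65 | 60 | 55 | 76.5% |
| Arcee AI: Trinity Large (Preview) | 90 | 85 | 85 | 80 | 75 | 70 | 70 | 70 | 70 | 65 | 76.0% |
| Writer: Palmyra X5 | 85 | 85 | 85 | 85 | 75 | 75 | 70 | 65 | 65 | 65 | 75.5% |
| Gemini 2.5 Flash Lite | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 70 | 74.5% |
| Mistral Small Creative | 80 | 80 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 65 | 71.5% |
| Cohere Command R+ (Aug. 2024) | 90 | 90 | 80 | 70 | 70 | 70 | 65 | 60 | 60 | 55 | 71.0% |
| Claude Haiku 4.5 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 60 | 60 | 60 | 70.5% |
| Gemma 3 27B | 75 | 75 | 70 | 65 | 65 | 65 | 65 | 55 | 55 | 55 | 64.5% |
| Ministral 3 8B | 70 | 70 | 65 | 65 | 60 | 60 | 60 | 50 | 50 | 35 | 58.5% |
| Ministral 3 14B | 85 | 75 | 65 | 60 | 60 | 60 | 60 | 55 | 45 | 5 | 57.0% |
| Ministral 8B | 75 | 75 | 65 | 65 | 60 | 60 | 55 | 55 | 35 | 20 | 56.5% |
| GPT-4o Mini (temp=0) | 60 | 60 | 60 | 60 | 60 | 60 | 50 | 50 | 50 | 50 | 56.0% |
| Ministral 3 3B | 65 | 65 | 65 | 65 | 60 | 60 | 55 | 55 | 40 | 30 | 56.0% |
| Claude 3.5 Haiku | 65 | 65 | 65 | 55 | 55 | 55 | 55 | 55 | 40 | 25 | 53.5% |
| Hermes 3 70B | 80 | 75 | 65 | 60 | 60 | 60 | 50 | 45 | 40 | 0 | 53.5% |
| GPT-4o Mini (temp=1) | 60 | 60 | 60 | 50 | 50 | 50 | 50 | 50 | 50 | 35 | 51.5% |
| Ministral 3B | 65 | 60 | 55 | 50 | 50 | 50 | 50 | 45 | 40 | 35 | 50.0% |
| Llama 3.1 8B | 75 | 70 | 65 | 60 | 60 | 40 | 40 | 35 | 30 | 20 | 49.5% |
| Mistral NeMO | 65 | 55 | 55 | 45 | 45 | 45 | 45 | 45 | 45 | 45 | 49.0% |
| Arcee AI: Trinity Mini | 60 | 60 | 55 | 50 | 45 | 45 | 40 | 35 | 35 | 35 | 46.0% |
| Claude 3 Haiku | 45 | 45 | 45 | 40 | 35 | 35 | 35 | 35 | 20 | 15 | 35.0% |
| Rocinante 12B | 55 | 40 | 35 | 15 | 15 | 15 | 5 | 5 | 5 | 0 | 19.0% |
| GPT-4.1 Nano | 20 | 20 | 15 | 10 | 10 | 10 | 10 | 10 | 0 | 0 | 10.5% |
| WizardLM 2 8x22b | 10 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.0% |
| Gemma 3 4B | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5% |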

tiers

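| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 70 | 97.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 96.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 70 | 96.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 96.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0% |
| Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 70 | 70 | 94.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 30 | 93.0% |
| Mistral Large | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 93.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 80 | 70 | 92.0% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 30 | 92.0% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 70 | 70 | 91.0% |
| Mistral Large 3 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| Mistral Large 2 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 80 | 60 | 90.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 70 | 70 | 70 | 70 | 70 | 85.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 70 | 70 | 50 | 85.0% |
| Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 80 | 80 | 30 | 85.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 70 | 0 | 84.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 90 | 90 | 70 | 60 | 60 | 60 | 83.0% |
| ByteDance Seed 1.6 Flash | 100 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 70 | 81.0% |
| Gemini 2.5 Flash | 100 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 70 | 70 | 80.0% |
| Minimax M2.5 | 100 | 100 | 90 | 80 | 70 | 70 | 70 | 50 | 50 | 50 | 73.0% |
| GPT-5 Nano | 100 | 100 | 100 | 80 | 80 | 80 | 80 | 50 | 50 | 0 | 72.0% |
| Mistral Medium 3.1 | 100 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 60 | 60 | 71.0% |
| Claude Sonnet 4.6 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0% |
| Gemma 3 27B | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0% |
| GPT-4o, Aug. 6th (temp=0) | 90 | 90 | 90 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 69.0% |
| GPT-4.1 Mini | 80 | 80 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 30 | 68.0% |
| GPT-4o, May 13th (temp=1) | 90 | 90 | 70 | 70 | 60 | 60 | 60 | 60 | 60 | 60 | 68.0% |
| Qwen 2.5 72B | 80 | 80 | 80 | 80 | 80 | 80 | 50 | 50 | 50 | 50 | 68.0% |
| GPT-4o, Aug. 6th (temp=1) | 90 | 90 | 70 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 67.0% |
| Arcee AI: Trinity Mini | 100 | 80 | 80 | 80 | 60 | 60 | 50 | 50 | 30 | 30 | 62.0% |
| Writer: Palmyra X5 | 70 | 70 | 70 | 70 | 70 | 70 | 60 | 60 | 60 | 0 | 60.0% |
| GPT-4o Mini (temp=1) | 80 | 80 | 80 | 80 | 50 | 50 | 50 | 50 | 40 | 40 | 60.0% |
| Arcee AI: Trinity Large (Preview) | 70 | 70 | 70 | 70 | 60 | 60 | 50 | 50 | 40 | 40 | 58.0% |
| Claude Haiku 4.5 | 70 | 70 | 70 | 70 | 60 | 60 | 50 | 40 | 40 | 40 | 57.0% |
| GPT-4o Mini (temp=0) | 80 | 80 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 56.0% |
| Gemma 3 12B | 80 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 53.0% |
| Mistral Small Creative | 60 | 60 | 60 | 60 | 60 | 60 | 50 | 50 | 40 | 30 | 53.0% |
| Gemini 2.5 Flash Lite | 50 | 50 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 42.0% |
| Llama 3.1 8B | 70 | 70 | 70 | 70 | 40 | 30 | 30 | 20 | 20 | 0 | 42.0% |
| Cohere Command R+ (Aug. 2024) | 80 | 60 | 60 | 60 | 50 | 40 | 20 | 20 | 20 | 0 | 41.0% |
| Ministral 3 3B | 70 | 70 | 60 | 50 | 40 | 40 | 40 | 30 | 0 | 0 | 40.0% |
| Mistral Small 3.2 24B | 60 | 50 | 50 | 40 | 40 | 40 | 30 | 30 | 30 | 10 | 38.0% |
| Mistral NeMO | 70 | 50 | 50 | 50 | 40 | 40 | 30 | 30 | 10 | 0 | 37.0% |
| Claude 3 Haiku | 60 | 60 | 40 | 40 | 40 | 30 | 20 | 20 | 10 | 10 | 33.0% |
| Ministral 3 14B | 50 | 50 | 40 | 40 | 40 | 20 | 20 | 20 | 20 | 10 | 31.0% |
| Ministral 3B | 70 | 70 | 60 | 40 | 20 | 20 | 10 | 10 | 0 | 0 | 30.0% |
| Hermes 3 70B | 60 | 50 | 50 | 30 | 30 | 20 | 20 | 10 | 0 | 0 | 27.0% |
| GPT-4.1 Nano | 20 | 20 | 20 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 8.0% |
| Ministral 8B | 50 | 10 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.0% |
| Rocinante 12B | 40 | 20 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.0% |
| Gemma 3 4B | 20 | 20 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.0% |
| Ministral 3 8B | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0% |
| WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |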
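
| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 97.5% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 96.7% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 96.7% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 92 | 92 | 92 | 92 | 92 | 92 | 95.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 92 | 92 | 92 | 92 | 92 | 92 | 95.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 83 | 95.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 75 | 75 | 93.3% |
| Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 75 | 75 | 93.3% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3% |
| GPT-5 Mini | 100 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92.5% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 92 | 75 | 75 | 92.5% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 92 | 83 | 83 | 83 | 83 | 92.5% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 75 | 75 | 75 | 90.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 75 | 75 | 75 | 90.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 92 | 75 | 75 | 75 | 75 | 89.2% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 75 | 8 | 86.7% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 92 | 83 | 83 | 83 | 75 | 75 | 75 | 86.7% |
| Mistral Small Creative | 100 | 100 | 100 | 92 | 92 | 83 | 83 | 83 | 75 | 58 | 86.7% |
| ByteDance Seed 1.6 Flash | 100 | 92 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 67 | 84.2% |
| GPT-4.1 Mini | 92 | 92 | 92 | 92 | 92 | 75 | 75 | 75 | 75 | 67 | 82.5% |
| Mistral Large | 100 | 100 | 100 | 92 | 75 | 75 | 75 | 75 | 67 | 67 | 82.5% |
| Gemini 2.5 Flash | 92 | 92 | 92 | 92 | 83 | 75 | 75 | 75 | 75 | 67 | 81.7% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 67 | 67 | 81.7% |
| GPT-4o, Aug. 6th (temp=0) | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 67 | 81.7% |
| GPT-5 Nano | 100 | 100 | 83 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 80.8% |
| GPT-4.1 | 100 | 100 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 80.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 92 | 92 | 92 | 92 | 58 | 50 | 25 | 80.0% |
| Gemma 3 27B | 100 | 100 | 92 | 75 | 75 | 75 | 75 | 75 | 75 | 50 | 79.2% |
| Ministral 3 8B | 83 | 83 | 83 | 75 | 75 | 75 | 75 | 75 | 67 | 67 | 75.8% |
| Claude Sonnet 4.6 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75.0% |
| Claude Haiku 4.5 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75.0% |
| DeepSeek-V2 Chat | 100 | 100 | 83 | 75 | 75 | 75 | 67 | 58 | 58 | 58 | 75.0% |
| Llama 3.1 70B | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 75 | 0 | 74.2% |
| Mistral Large 3 | 92 | 75 | 75 | 75 | 75 | 75 | 67 | 67 | 67 | 67 | 73.3% |
| Mistral Small 3.2 24B | 100 | 92 | 83 | 83 | 75 | 75 | 75 | 58 | 50 | 42 | 73.3% |
| Z.AI GLM 4.7 Flash | 100 | 83 | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 58 | 70.8% |
| Mistral Large 2 | 100 | 75 | 75 | 75 | 75 | 67 | 67 | 67 | 50 | 50 | 70.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 0 | 0 | 0 | 69.2% |
| DeepSeek V3.1 | 92 | 92 | 83 | 75 | 75 | 75 | 67 | 58 | 42 | 33 | 69.2% |
| Hermes 3 405B | 100 | 83 | 75 | 75 | 75 | 75 | 75 | 67 | 58 | 0 | 68.3% |
| Gemma 3 12B | 92 | 75 | 75 | 75 | 75 | 75 | 67 | 67 | 50 | 33 | 68.3% |
| Claude 3.5 Haiku | 83 | 83 | 83 | 83 | 67 | 67 | 67 | 67 | 50 | 25 | 67.5% |
| Ministral 8B | 100 | 83 | 75 | 75 | 75 | 67 | 58 | 50 | 50 | 42 | 67.5% |
| DeepSeek V3 (2024-12-26) | 92 | 83 | 75 | 75 | 75 | 67 | 58 | 58 | 42 | 42 | 66.7% |
| WizardLM 2 8x22b | 100 | 92 | 83 | 75 | 75 | 67 | 58 | 50 | 50 | 0 | 65.0% |
| Gemini 2.5 Flash Lite | 83 | 83 | 75 | 67 | 67 | 58 | 58 | 58 | 50 | 42 | 64.2% |
| GPT-4o, May 13th (temp=1) | 100 | 83 | 83 | 83 | 83 | 67 | 67 | 17 | 17 | 17 | 61.7% |
| Arcee AI: Trinity Large (Preview) | 75 | 75 | 67 | 58 | 50 | 50 | 50 | 50 | 50 | 42 | 56.7% |
| GPT-4o Mini (temp=1) | 67 | 67 | 67 | 67 | 50 | 50 | 50 | 50 | 50 | 42 | 55.8% |
| Arcee AI: Trinity Mini | 83 | 83 | 67 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 55.0% |
| Qwen 2.5 72B | 83 | 67 | 67 | 67 | 67 | 67 | 50 | 25 | 25 | 25 | 54.2% |
| Hermes 3 70B | 75 | 75 | 67 | 67 | 67 | 50 | 50 | 42 | 25 | 0 | 51.7% |
| GPT-4o Mini (temp=0) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0% |
| Cohere Command R+ (Aug. 2024) | 75 | 67 | 67 | 58 | 58 | 42 | 33 | 33 | 8 | 0 | 44.2% |
| Claude 3 Haiku | 58 | 58 | 58 | 58 | 42 | 33 | 25 | 25 | 25 | 25 | 40.8% |
| Ministral 3 14B | 50 | 50 | 50 | 50 | 42 | 42 | 33 | 33 | 25 | 17 | 39.2% |
| Mistral NeMO | 67 | 67 | 42 | 42 | 33 | 33 | 33 | 25 | 25 | 17 | 38.3% |
| Ministral 3B | 42 | 42 | 33 | 33 | 8 | 8 | 0 | 0 | 0 | 0 | 16.7% |
| Ministral 3 3B | 75 | 25 | 17 | 17 | 17 | 8 | 0 | 0 | 0 | 0 | 15.8% |
| Rocinante 12B | 42 | 42 | 25 | 25 | 17 | 0 | 0 | 0 | 0 | 0 | 15.0% |
| Gemma 3 4B | 25 | 25 | 17 | 8 | 8 | 0 | 0 | 0 | 0 | 0 | 8.3% |
| GPT-4.1 Nano | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.7% |
| Llama 3.1 8B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |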
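
| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 99.1% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 96.8% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 86 | 86 | 86 | 94.5% |
| GPT-5.2 | 95 | 95 | 95 | 95 | 91 | 91 | 86 | 86 | 86 | 86 | 90.9% |
| Gemini 3 Pro (Preview) | 100 | 95 | 95 | 91 | 91 | 91 | 86 | 86 | 86 | 86 | 90.9% |
| Grok 4 | 100 | 100 | 91 | 91 | 91 | 91 | 91 | 91 | 82 | 77 | 90.5% |
| GPT-5 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 86 | 90.5% |
| MoonshotAI: Kimi K2.5 | 95 | 95 | 95 | 91 | 91 | 86 | 86 | 86 | 86 | 82 | 89.5% |
| GPT-5.1 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 86 | 86 | 82 | 89.1% |
| Claude 3.5 Sonnet | 95 | 95 | 95 | 95 | 95 | 91 | 86 | 82 | 77 | 73 | 88.6% |
| Gemini 2.5 Flash | 100 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 87.7% |
| Mistral Medium 3.1 | 95 | 95 | 86 | 86 | 86 | 86 | 86 | 86 | 82 | 77 | 86.8% |
| Z.AI GLM 4.7 | 95 | 95 | 86 | 86 | 86 | 86 | 82 | 82 | 82 | 82 | 86.4% |
| Gemini 3 Flash (Preview) | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86.4% |
| Z.AI GLM 5 | 100 | 91 | 91 | 91 | 91 | 86 | 82 | 82 | 73 | 73 | 85.9% |
| Claude Sonnet 4 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 82 | 82 | 85.5% |
| Grok 4.1 Fast | 91 | 91 | 91 | 91 | 91 | 86 | 82 | 82 | 73 | 73 | 85.0% |
| Claude Sonnet 4.6 | 100 | 100 | 91 | 91 | 86 | 82 | 82 | 77 | 68 | 68 | 84.5% |
| o4 Mini High | 91 | 91 | 91 | 91 | 91 | 82 | 77 | 77 | 77 | 73 | 84.1% |
| Claude Opus 4 | 95 | 95 | 95 | 82 | 82 | 82 | 82 | 73 | 73 | 73 | 83.2% |
| GPT-4o, Aug. 6th (temp=0) | 86 | 86 | 86 | 86 | 86 | 86 | 77 | 77 | 77 | 77 | 82.7% |
| Mistral Large | 95 | 95 | 95 | 86 | 82 | 82 | 77 | 68 | 64 | 64 | 80.9% |
| Grok 4 Fast | 100 | 95 | 91 | 91 | 91 | 91 | 82 | 82 | 82 | 0 | 80.5% |
| o4 Mini | 91 | 82 | 82 | 82 | 82 | 82 | 77 | 77 | 77 | 73 | 80.5% |
| ByteDance Seed 1.6 | 86 | 86 | 77 | 77 | 77 | 77 | 77 | 77 | 77 | 77 | 79.1% |
| Qwen 3.5 Plus (2026-02-15) | 91 | 91 | 86 | 82 | 82 | 77 | 77 | 77 | 77 | 50 | 79.1% |
| GPT-4o, Aug. 6th (temp=1) | 91 | 91 | 91 | 91 | 86 | 77 | 73 | 68 | 59 | 59 | 78.6% |
| DeepSeek V3.2 | 100 | 91 | 91 | 91 | 77 | 73 | 73 | 68 | 64 | 55 | 78.2% |
| Mistral Large 2 | 86 | 86 | 86 | 82 | 82 | 82 | 77 | 73 | 64 | 59 | 77.7% |
| GPT-5 Mini | 86 | 82 | 77 | 77 | 77 | 77 | 77 | 77 | 73 | 68 | 77.3% |
| Z.AI GLM 4.6 | 100 | 95 | 86 | 86 | 82 | 82 | 73 | 59 | 59 | 45 | 76.8% |
| Qwen 3.5 397B A17B | 95 | 86 | 86 | 86 | 86 | 86 | 82 | 82 | 73 | 0 | 76.4% |
| GPT-5 Nano | 91 | 82 | 82 | 82 | 77 | 77 | 73 | 68 | 64 | 55 | 75.0% |
| Claude Haiku 4.5 | 91 | 91 | 91 | 82 | 82 | 77 | 73 | 68 | 68 | 0 | 72.3% |
| Z.AI GLM 4.7 Flash | 91 | 77 | 73 | 73 | 73 | 68 | 68 | 68 | 64 | 59 | 71.4% |
| Writer: Palmyra X5 | 86 | 82 | 82 | 77 | 68 | 68 | 68 | 68 | 59 | 55 | 71.4% |
| Stealth: Aurora Alpha | 73 | 73 | 73 | 73 | 73 | 73 | 73 | 68 | 68 | 64 | 70.9% |
| Mistral Large 3 | 91 | 77 | 77 | 77 | 68 | 64 | 64 | 64 | 64 | 64 | 70.9% |
| DeepSeek V3.1 | 95 | 95 | 86 | 82 | 82 | 77 | 64 | 45 | 41 | 41 | 70.9% |
| Minimax M2.5 | 82 | 82 | 77 | 73 | 68 | 68 | 64 | 64 | 64 | 59 | 70.0% |
| Claude 3.7 Sonnet | 100 | 100 | 86 | 86 | 77 | 64 | 59 | 50 | 45 | 32 | 70.0% |
| Z.AI GLM 4.5 | 91 | 73 | 68 | 68 | 68 | 68 | 64 | 64 | 59 | 59 | 68.2% |
| ByteDance Seed 1.6 Flash | 82 | 73 | 73 | 73 | 73 | 73 | 64 | 59 | 59 | 45 | 67.3% |
| GPT-4o, May 13th (temp=0) | 77 | 77 | 77 | 73 | 68 | 68 | 59 | 59 | 59 | 50 | 66.8% |
| GPT-4.1 | 95 | 82 | 82 | 82 | 82 | 82 | 77 | 64 | 0 | 0 | 64.5% |
| Gemini 2.5 Flash Lite | 100 | 95 | 91 | 77 | 73 | 73 | 64 | 55 | 0 | 0 | 62.7% |
| DeepSeek V3 (2024-12-26) | 82 | 73 | 73 | 68 | 68 | 68 | 59 | 45 | 45 | 36 | 61.8% |
| Ministral 3 14B | 82 | 82 | 77 | 68 | 68 | 45 | 45 | 45 | 41 | 41 | 59.5% |
| GPT-4.1 Mini | 77 | 68 | 68 | 68 | 59 | 59 | 55 | 41 | 41 | 36 | 57.3% |
| Arcee AI: Trinity Large (Preview) | 77 | 73 | 73 | 64 | 55 | 55 | 50 | 50 | 45 | 27 | 56.8% |
| GPT-4o, May 13th (temp=1) | 73 | 68 | 59 | 59 | 59 | 55 | 50 | 45 | 45 | 41 | 55.5% |
| Mistral Small Creative | 64 | 64 | 64 | 59 | 55 | 55 | 50 | 50 | 50 | 41 | 55.0% |
| DeepSeek-V2 Chat | 73 | 68 | 68 | 68 | 59 | 55 | 50 | 45 | 41 | 0 | 52.7% |
| Mistral Small 3.2 24B | 64 | 59 | 59 | 59 | 55 | 50 | 50 | 45 | 45 | 41 | 52.7% |
| DeepSeek V3 (2025-03-24) | 68 | 68 | 59 | 59 | 59 | 55 | 45 | 45 | 45 | 9 | 51.4% |
| Claude 3.5 Haiku | 55 | 55 | 55 | 55 | 55 | 55 | 55 | 45 | 41 | 41 | 50.9% |
| Hermes 3 405B | 82 | 73 | 68 | 64 | 59 | 50 | 36 | 27 | 23 | 23 | 50.5% |
| Gemma 3 27B | 59 | 59 | 55 | 45 | 45 | 41 | 41 | 41 | 41 | 41 | 46.8% |
| Llama 3.1 Nemotron 70B | 59 | 59 | 59 | 45 | 41 | 41 | 36 | 36 | 36 | 27 | 44.1% |
| Llama 3.1 70B | 59 | 59 | 55 | 55 | 45 | 45 | 45 | 27 | 27 | 9 | 42.7% |
| Cohere Command R+ (Aug. 2024) | 64 | 59 | 55 | 55 | 45 | 45 | 41 | 27 | 23 | 14 | 42.7% |
| Ministral 3 8B | 73 | 64 | 45 | 41 | 41 | 32 | 32 | 23 | 18 | 14 | 38.2% |
| Ministral 8B | 68 | 64 | 55 | 50 | 41 | 41 | 32 | 23 | 0 | 0 | 37.3% |
| WizardLM 2 8x22b | 59 | 45 | 41 | 41 | 41 | 41 | 36 | 27 | 23 | 9 | 36.4% |
| Arcee AI: Trinity Mini | 45 | 45 | 45 | 45 | 41 | 36 | 36 | 18 | 14 | 9 | 33.6% |
| Hermes 3 70B | 50 | 41 | 36 | 36 | 32 | 23 | 18 | 0 | 0 | 0 | 23.6% |
| GPT-4o Mini (temp=0) | 32 | 23 | 23 | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 20.5% |
| GPT-4o Mini (temp=1) | 50 | 32 | 32 | 23 | 23 | 18 | 14 | 5 | 0 | 0 | 19.5% |
| Gemma 3 12B | 32 | 27 | 27 | 27 | 23 | 14 | 14 | 9 | 5 | 0 | 17.7% |
| Ministral 3B | 32 | 32 | 27 | 23 | 23 | 14 | 14 | 9 | 5 | 0 | 17.7% |
| Claude 3 Haiku | 41 | 27 | 23 | 18 | 14 | 14 | 9 | 9 | 5 | 0 | 15.9% |
| Ministral 3 3B | 32 | 23 | 18 | 18 | 18 | 14 | 5 | 0 | 0 | 0 | 12.7% |
| Llama 3.1 8B | 41 | 32 | 27 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 11.4% |
| Mistral NeMO | 27 | 27 | 27 | 14 | 9 | 5 | 5 | 0 | 0 | 0 | 11.4% |
| Qwen 2.5 72B | 73 | 9 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.1% |
| Gemma 3 4B | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5% |
| Rocinante 12B | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5% |
| GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |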
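
| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 99.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 98.5% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 98.5% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 85 | 98.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 85 | 85 | 96.5% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 90 | 90 | 90 | 96.0% |
| GPT-5.2 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 90 | 96.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 85 | 85 | 95.5% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 85 | 85 | 95.5% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 85 | 75 | 95.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 85 | 94.5% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 85 | 85 | 94.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 95 | 90 | 90 | 90 | 90 | 80 | 93.5% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 85 | 85 | 75 | 93.5% |
| Gemini 2.5 Flash | 100 | 100 | 95 | 95 | 90 | 90 | 90 | 90 | 90 | 80 | 92.0% |
| Mistral Large 2 | 100 | 90 | 90 | 90 | 90 | 90 | 85 | 80 | 80 | 80 | 87.5% |
| Claude Sonnet 4 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 87.0% |
| Claude Sonnet 4.6 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 75 | 75 | 87.0% |
| Claude 3.7 Sonnet | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 75 | 75 | 87.0% |
| Mistral Large 3 | 100 | 90 | 90 | 90 | 90 | 85 | 80 | 75 | 75 | 75 | 85.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 85 | 85 | 85 | 85 | 85 | 80 | 70 | 70 | 84.5% |
| Claude Opus 4 | 100 | 100 | 90 | 85 | 85 | 85 | 85 | 75 | 70 | 65 | 84.0% |
| Gemini 3 Flash (Preview) | 100 | 85 | 85 | 85 | 85 | 85 | 85 | 70 | 70 | 70 | 82.0% |
| Stealth: Aurora Alpha | 90 | 90 | 90 | 90 | 90 | 85 | 80 | 75 | 75 | 50 | 81.5% |
| Mistral Large | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 65 | 65 | 65 | 81.5% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 90 | 85 | 85 | 75 | 55 | 15 | 80.5% |
| GPT-4o, May 13th (temp=0) | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 70 | 70 | 78.0% |
| DeepSeek V3.2 | 95 | 85 | 85 | 80 | 80 | 75 | 70 | 65 | 60 | 60 | 75.5% |
| Minimax M2.5 | 90 | 90 | 80 | 80 | 80 | 70 | 70 | 65 | 60 | 55 | 74.0% |
| Claude 3.5 Sonnet | 90 | 80 | 75 | 75 | 75 | 75 | 75 | 65 | 65 | 65 | 74.0% |
| Z.AI GLM 4.7 Flash | 90 | 90 | 85 | 85 | 80 | 70 | 70 | 65 | 45 | 40 | 72.0% |
| DeepSeek-V2 Chat | 90 | 90 | 85 | 85 | 85 | 80 | 70 | 55 | 40 | 40 | 72.0% |
| Ministral 3 14B | 85 | 80 | 80 | 75 | 75 | 70 | 70 | 70 | 55 | 45 | 70.5% |
| Claude Haiku 4.5 | 90 | 90 | 85 | 75 | 75 | 75 | 75 | 70 | 60 | 0 | 69.5% |
| GPT-5 Nano | 95 | 90 | 80 | 65 | 65 | 65 | 65 | 60 | 55 | 50 | 69.0% |
| GPT-4o, Aug. 6th (temp=1) | 80 | 80 | 80 | 80 | 75 | 70 | 65 | 55 | 55 | 40 | 68.0% |
| DeepSeek V3 (2024-12-26) | 90 | 85 | 85 | 70 | 65 | 65 | 60 | 55 | 55 | 45 | 67.5% |
| GPT-4o, Aug. 6th (temp=0) | 80 | 80 | 70 | 70 | 70 | 65 | 65 | 65 | 55 | 40 | 66.0% |
| Mistral Medium 3.1 | 90 | 85 | 75 | 70 | 70 | 65 | 55 | 55 | 45 | 40 | 65.0% |
| GPT-4.1 | 100 | 75 | 70 | 70 | 65 | 65 | 65 | 60 | 50 | 30 | 65.0% |
| DeepSeek V3.1 | 90 | 85 | 75 | 70 | 70 | 65 | 65 | 60 | 35 | 30 | 64.5% |
| ByteDance Seed 1.6 Flash | 80 | 80 | 80 | 70 | 70 | 65 | 60 | 60 | 45 | 35 | 64.5% |
| DeepSeek V3 (2025-03-24) | 85 | 80 | 75 | 70 | 65 | 60 | 55 | 55 | 50 | 45 | 64.0% |
| Writer: Palmyra X5 | 75 | 75 | 70 | 65 | 65 | 65 | 60 | 60 | 50 | 40 | 62.5% |
| GPT-4o, May 13th (temp=1) | 75 | 70 | 70 | 65 | 60 | 55 | 55 | 55 | 50 | 45 | 60.0% |
| GPT-4.1 Mini | 70 | 65 | 65 | 65 | 60 | 60 | 55 | 55 | 50 | 45 | 59.0% |
| Gemma 3 27B | 65 | 60 | 60 | 60 | 55 | 55 | 55 | 55 | 55 | 50 | 57.0% |
| Z.AI GLM 4.5 | 80 | 75 | 75 | 75 | 60 | 55 | 50 | 45 | 15 | 10 | 54.0% |
| Mistral Small Creative | 70 | 70 | 65 | 65 | 65 | 65 | 50 | 25 | 15 | 15 | 50.5% |
| Ministral 3 8B | 70 | 70 | 65 | 65 | 65 | 60 | 35 | 25 | 25 | 0 | 48.0% |
| Hermes 3 405B | 55 | 50 | 50 | 50 | 50 | 50 | 45 | 40 | 40 | 35 | 46.5% |
| Claude 3.5 Haiku | 60 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 40 | 0 | 45.0% |
| Mistral Small 3.2 24B | 55 | 50 | 50 | 50 | 50 | 35 | 35 | 35 | 35 | 0 | 39.5% |
| Ministral 3 3B | 55 | 50 | 50 | 50 | 50 | 45 | 35 | 25 | 20 | 10 | 39.0% |
| Llama 3.1 Nemotron 70B | 65 | 50 | 50 | 50 | 35 | 35 | 30 | 15 | 10 | 0 | 34.0% |
| Qwen 2.5 72B | 50 | 45 | 45 | 40 | 35 | 35 | 35 | 25 | 15 | 10 | 33.5% |
| WizardLM 2 8x22b | 50 | 50 | 45 | 35 | 30 | 30 | 30 | 15 | 15 | 10 | 31.0% |
| Llama 3.1 70B | 60 | 50 | 40 | 40 | 35 | 30 | 20 | 15 | 10 | 0 | 30.0% |
| Arcee AI: Trinity Large (Preview) | 65 | 60 | 50 | 50 | 35 | 30 | 0 | 0 | 0 | 0 | 29.0% |
| Ministral 3B | 50 | 40 | 35 | 30 | 20 | 20 | 20 | 15 | 15 | 15 | 26.0% |
| Ministral 8B | 55 | 55 | 45 | 40 | 25 | 0 | 0 | 0 | 0 | 0 | 22.0% |
| Hermes 3 70B | 40 | 40 | 30 | 25 | 20 | 15 | 15 | 10 | 10 | 5 | 21.0% |
| Gemini 2.5 Flash Lite | 45 | 40 | 30 | 25 | 25 | 20 | 10 | 0 | 0 | 0 | 19.5% |
| Cohere Command R+ (Aug. 2024) | 35 | 35 | 20 | 20 | 20 | 5 | 0 | 0 | 0 | 0 | 13.5% |
| Arcee AI: Trinity Mini | 30 | 25 | 20 | 15 | 15 | 10 | 10 | 0 | 0 | 0 | 12.5% |
| GPT-4o Mini (temp=0) | 30 | 25 | 20 | 20 | 5 | 5 | 5 | 5 | 5 | 0 | 12.0% |
| GPT-4o Mini (temp=1) | 20 | 15 | 15 | 5 | 5 | 5 | 5 | 0 | 0 | 0 | 7.0% |
| Llama 3.1 8B | 30 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0% |
| Mistral NeMO | 35 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.0% |
| Gemma 3 12B | 15 | 5 | 5 | 5 | 5 | 0 | 0 | 0 | 0 | 0 | 3.5% |
| Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Rocinante 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |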