Precision

Test: Codex Violation Detection

Avg. Score: 80.7%
Scenarios: 8

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Grok 4.1 Fast | 98.0% | $0.0021 | 21.1s | 90%
2 | ByteDance Seed 1.6 Flash | 94.1% | $0.0009 | 12.0s | 85%
3 | Gemini 3 Flash (Preview) | 93.9% | $0.0031 | 4.5s | 83%
4 | Grok 4 Fast | 96.8% | $0.0018 | 12.8s | 76%
5 | Gemini 2.5 Flash | 91.4% | $0.0025 | 2.8s | 77%
6 | Minimax M2.5 | 93.5% | $0.0020 | 25.5s | 84%
7 | Claude Opus 4.5 | 99.8% | $0.041 | 9.7s | 98%
8 | Claude Sonnet 4.5 | 96.3% | $0.024 | 8.9s | 88%
9 | Claude Sonnet 4 | 95.6% | $0.023 | 9.0s | 88%
10 | o4 Mini | 97.4% | $0.019 | 28.1s | 90%
11 | Claude Opus 4.6 | 98.5% | $0.040 | 10.2s | 94%
12 | Mistral Large 3 | 87.5% | $0.0030 | 10.2s | 74%
13 | Stealth: Aurora Alpha | 92.9% | | 6.2s | 74%
14 | GPT-5.2 | 95.7% | $0.024 | 24.2s | 89%
15 | Gemini 2.5 Pro | 98.4% | $0.035 | 23.2s | 93%
16 | Claude 3.5 Haiku | 89.8% | $0.0049 | 6.2s | 68%
17 | ByteDance Seed 1.6 | 96.8% | $0.0067 | 1.0m | 91%
18 | DeepSeek V3 (2024-12-26) | 89.6% | $0.0019 | 15.6s | 70%
19 | Mistral Large | 89.9% | $0.012 | 9.1s | 75%
20 | Z.AI GLM 5 | 97.9% | $0.013 | 56.4s | 90%
21 | DeepSeek-V2 Chat | 89.5% | $0.0020 | 13.1s | 66%
22 | GPT-5 Mini | 94.8% | $0.0092 | 51.7s | 86%
23 | Z.AI GLM 4.7 | 96.2% | $0.0091 | 1.0m | 89%
24 | GPT-4.1 Mini | 84.4% | $0.0015 | 5.8s | 66%
25 | Mistral Large 2 | 87.9% | $0.012 | 8.8s | 72%
26 | Z.AI GLM 4.5 | 88.7% | $0.0026 | 18.5s | 68%
27 | Z.AI GLM 4.7 Flash | 94.0% | $0.0018 | 1.1m | 86%
28 | Claude Sonnet 4.6 | 91.5% | $0.024 | 9.9s | 77%
29 | Mistral Medium 3.1 | 85.4% | $0.0029 | 7.7s | 64%
30 | DeepSeek V3.2 | 86.6% | $0.0013 | 17.2s | 66%
31 | Claude 3.5 Sonnet | 94.4% | $0.042 | 10.5s | 85%
32 | GPT-4o, Aug. 6th (temp=1) | 87.6% | $0.011 | 3.6s | 62%
33 | GPT-4.1 | 89.1% | $0.0081 | 10.0s | 61%
34 | Writer: Palmyra X5 | 84.3% | $0.0062 | 12.1s | 66%
35 | Qwen 3.5 Plus (2026-02-15) | 89.3% | $0.0041 | 34.2s | 66%
36 | o4 Mini High | 97.1% | $0.033 | 51.3s | 90%
37 | GPT-4o, Aug. 6th (temp=0) | 87.3% | $0.015 | 5.0s | 60%
38 | GPT-5.1 | 97.3% | $0.037 | 52.8s | 90%
39 | DeepSeek V3 (2025-03-24) | 86.8% | $0.0015 | 22.9s | 56%
40 | Gemini 3 Pro (Preview) | 97.1% | $0.050 | 34.0s | 91%
41 | Gemma 3 27B | 78.6% | $0.0005 | 13.3s | 60%
42 | Mistral Small Creative | 74.7% | $0.0006 | 4.3s | 56%
43 | MoonshotAI: Kimi K2.5 | 95.9% | $0.015 | 1.5m | 88%
44 | GPT-4o, May 13th (temp=1) | 85.5% | $0.026 | 3.7s | 64%
45 | Hermes 3 405B | 82.2% | $0.0044 | 17.2s | 53%
46 | Z.AI GLM 4.6 | 91.9% | $0.0049 | 1.3m | 76%
47 | Arcee AI: Trinity Mini | 80.9% | $0.0004 | 10.3s | 47%
48 | Claude Haiku 4.5 | 81.8% | $0.0078 | 5.8s | 49%
49 | Claude 3.7 Sonnet | 88.5% | $0.023 | 10.2s | 56%
50 | GPT-4o, May 13th (temp=0) | 88.6% | $0.030 | 5.4s | 54%
51 | Gemini 3.1 Pro (Preview) | 99.3% | $0.068 | 52.3s | 96%
52 | Mistral Small 3.2 24B | 72.3% | $0.0006 | 10.4s | 49%
53 | Qwen 2.5 72B | 78.0% | $0.0008 | 13.9s | 41%
54 | Llama 3.1 Nemotron 70B | 78.2% | $0.0055 | 18.5s | 45%
55 | Grok 4 | 97.4% | $0.051 | 1.2m | 87%
56 | DeepSeek V3.1 | 79.4% | $0.0014 | 31.1s | 44%
57 | Ministral 3 14B | 67.1% | $0.0010 | 6.3s | 41%
58 | GPT-4o Mini (temp=1) | 67.1% | $0.0006 | 7.4s | 34%
59 | Gemini 2.5 Flash Lite | 64.3% | $0.0005 | 2.3s | 31%
60 | Llama 3.1 70B | 72.2% | $0.0021 | 24.3s | 32%
61 | Ministral 3 8B | 58.4% | $0.0007 | 4.8s | 36%
62 | Gemma 3 12B | 62.3% | $0.0003 | 12.0s | 34%
63 | GPT-4o Mini (temp=0) | 69.1% | $0.0006 | 25.0s | 32%
64 | GPT-5 | 96.2% | $0.061 | 1.6m | 87%
65 | GPT-5 Nano | 88.5% | $0.0049 | 1.9m | 57%
66 | Ministral 3 3B | 56.1% | $0.0005 | 3.3s | 29%
67 | Ministral 8B | 52.4% | $0.0005 | 5.6s | 31%
68 | Ministral 3B | 53.5% | $0.0002 | 2.9s | 28%
69 | Claude 3 Haiku | 55.9% | $0.0015 | 3.6s | 26%
70 | Hermes 3 70B | 60.7% | $0.0013 | 29.2s | 31%
71 | Claude Opus 4 | 90.4% | $0.116 | 15.8s | 77%
72 | Cohere Command R+ (Aug. 2024) | 60.3% | $0.014 | 10.0s | 24%
73 | Qwen 3.5 397B A17B | 95.5% | $0.026 | 2.9m | 76%
74 | Mistral NeMO | 44.0% | $0.0007 | 13.3s | 21%
75 | Llama 3.1 8B | 34.1% | $0.0002 | 16.1s | 13%
76 | WizardLM 2 8x22b | 32.2% | $0.0036 | 15.7s | 6%
77 | Gemma 3 4B | 20.8% | $0.0002 | 12.5s | 14%
78 | Rocinante 12B | 21.8% | $0.0009 | 6.0s | 3%
79 | GPT-4.1 Nano | 20.4% | $0.0004 | 3.9s | 0%
80 | Arcee AI: Trinity Large (Preview) | 60.2% | $0.0000 | 3.1m | 29%
Average score: 80.75%

Individual Scenarios

matrix

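Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.7%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.7%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 99.4%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 99.3%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 96 | 99.3%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 97 | 98.4%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 94 | 98.1%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 94 | 94 | 98.1%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 94 | 90 | 98.0%
Grok 4 | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 94 | 94 | 97.6%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 96 | 93 | 93 | 97.5%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 94 | 94 | 94 | 97.2%
Gemini 2.5 Pro | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 97 | 94 | 91 | 96.9%
Grok 4 Fast | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 91 | 91 | 88 | 96.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 97 | 94 | 91 | 88 | 88 | 95.8%
Z.AI GLM 4.6 | 100 | 97 | 97 | 97 | 96 | 96 | 94 | 94 | 93 | 91 | 95.5%
Claude Sonnet 4.6 | 100 | 97 | 96 | 96 | 93 | 93 | 93 | 93 | 90 | 90 | 94.2%
Claude Sonnet 4 | 97 | 96 | 96 | 96 | 93 | 93 | 93 | 93 | 90 | 89 | 93.7%
ByteDance Seed 1.6 | 100 | 97 | 94 | 94 | 94 | 94 | 93 | 91 | 88 | 85 | 92.9%
Gemini 3 Pro (Preview) | 100 | 97 | 96 | 93 | 93 | 93 | 93 | 90 | 87 | 86 | 92.7%
Stealth: Aurora Alpha | 100 | 96 | 96 | 96 | 91 | 91 | 91 | 90 | 88 | 88 | 92.6%
Claude 3.5 Sonnet | 100 | 100 | 100 | 93 | 92 | 92 | 92 | 92 | 85 | 75 | 92.0%
MoonshotAI: Kimi K2.5 | 100 | 97 | 97 | 96 | 93 | 91 | 90 | 88 | 87 | 78 | 91.7%
Minimax M2.5 | 100 | 100 | 100 | 96 | 95 | 94 | 88 | 85 | 83 | 75 | 91.7%
Claude Opus 4.6 | 94 | 94 | 94 | 94 | 91 | 91 | 90 | 90 | 90 | 90 | 91.7%
Gemini 3 Flash (Preview) | 94 | 93 | 93 | 93 | 93 | 93 | 90 | 90 | 90 | 87 | 91.6%
Mistral Large | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 86 | 85 | 84 | 90.7%
GPT-5 Nano | 100 | 100 | 100 | 95 | 91 | 89 | 86 | 84 | 81 | 81 | 90.7%
Mistral Large 3 | 100 | 95 | 93 | 92 | 92 | 90 | 89 | 89 | 85 | 82 | 90.6%
Z.AI GLM 4.7 Flash | 100 | 95 | 94 | 93 | 93 | 92 | 88 | 86 | 86 | 79 | 90.5%
GPT-4.1 | 96 | 95 | 95 | 92 | 92 | 91 | 90 | 90 | 86 | 75 | 90.1%
Mistral Large 2 | 100 | 95 | 95 | 92 | 91 | 87 | 87 | 86 | 84 | 78 | 89.5%
Gemini 2.5 Flash | 96 | 93 | 93 | 93 | 90 | 88 | 87 | 86 | 85 | 79 | 89.0%
Claude Sonnet 4.5 | 94 | 93 | 91 | 90 | 90 | 88 | 86 | 85 | 85 | 84 | 88.5%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 95 | 95 | 91 | 89 | 81 | 71 | 56 | 88.0%
Claude Opus 4 | 97 | 93 | 91 | 88 | 88 | 86 | 85 | 83 | 81 | 81 | 87.3%
ByteDance Seed 1.6 Flash | 100 | 92 | 92 | 89 | 88 | 88 | 83 | 82 | 79 | 75 | 86.7%
DeepSeek-V2 Chat | 100 | 95 | 95 | 94 | 86 | 85 | 79 | 79 | 77 | 76 | 86.6%
Claude 3.7 Sonnet | 94 | 91 | 89 | 88 | 88 | 82 | 82 | 79 | 68 | 68 | 83.0%
Mistral Medium 3.1 | 89 | 87 | 83 | 82 | 82 | 80 | 80 | 77 | 76 | 73 | 80.9%
DeepSeek V3.2 | 100 | 85 | 83 | 81 | 81 | 79 | 78 | 72 | 69 | 68 | 79.7%
Claude Haiku 4.5 | 86 | 86 | 86 | 79 | 79 | 78 | 78 | 76 | 74 | 74 | 79.6%
GPT-4o, May 13th (temp=1) | 100 | 100 | 90 | 82 | 79 | 79 | 78 | 71 | 62 | 50 | 78.9%
Z.AI GLM 4.5 | 100 | 90 | 81 | 80 | 76 | 75 | 75 | 74 | 70 | 63 | 78.5%
GPT-4.1 Mini | 100 | 95 | 94 | 79 | 77 | 76 | 73 | 69 | 62 | 58 | 78.2%
Qwen 3.5 Plus (2026-02-15) | 97 | 94 | 93 | 91 | 88 | 88 | 85 | 82 | 47 | 0 | 76.3%
Qwen 2.5 72B | 100 | 90 | 86 | 75 | 75 | 75 | 73 | 71 | 67 | 50 | 76.2%
DeepSeek V3.1 | 100 | 88 | 86 | 85 | 84 | 83 | 82 | 79 | 71 | 0 | 75.8%
DeepSeek V3 (2025-03-24) | 96 | 95 | 86 | 85 | 82 | 81 | 80 | 75 | 68 | 0 | 74.8%
Writer: Palmyra X5 | 87 | 85 | 84 | 80 | 76 | 75 | 72 | 71 | 68 | 50 | 74.8%
GPT-4o, Aug. 6th (temp=1) | 91 | 83 | 81 | 80 | 80 | 75 | 73 | 71 | 58 | 53 | 74.6%
Claude 3.5 Haiku | 100 | 100 | 83 | 80 | 75 | 75 | 71 | 67 | 67 | 0 | 71.8%
Mistral Small Creative | 76 | 71 | 69 | 67 | 65 | 65 | 63 | 61 | 59 | 53 | 64.9%
GPT-4o, Aug. 6th (temp=0) | 92 | 84 | 79 | 79 | 79 | 79 | 75 | 73 | 2 | 2 | 64.3%
Hermes 3 405B | 100 | 83 | 80 | 63 | 60 | 57 | 57 | 45 | 44 | 40 | 63.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 60 | 50 | 50 | 50 | 50 | 30 | 28 | 61.8%
Gemma 3 27B | 78 | 77 | 75 | 57 | 54 | 53 | 50 | 50 | 45 | 33 | 57.2%
Hermes 3 70B | 80 | 80 | 75 | 67 | 57 | 50 | 50 | 44 | 14 | 0 | 51.8%
Ministral 3 8B | 71 | 58 | 57 | 55 | 52 | 50 | 48 | 47 | 42 | 35 | 51.5%
Ministral 8B | 63 | 57 | 56 | 50 | 50 | 50 | 50 | 47 | 46 | 45 | 51.5%
Mistral Small 3.2 24B | 68 | 67 | 64 | 57 | 55 | 52 | 46 | 44 | 39 | 21 | 51.2%
Llama 3.1 Nemotron 70B | 100 | 67 | 63 | 54 | 54 | 50 | 43 | 33 | 33 | 0 | 49.6%
GPT-4o, May 13th (temp=0) | 100 | 92 | 92 | 88 | 70 | 16 | 14 | 0 | 0 | 0 | 47.2%
Claude 3 Haiku | 75 | 63 | 50 | 43 | 40 | 38 | 33 | 33 | 33 | 33 | 44.1%
Ministral 3 14B | 57 | 57 | 57 | 57 | 53 | 50 | 44 | 38 | 26 | 0 | 43.9%
Gemma 3 12B | 75 | 67 | 60 | 45 | 40 | 33 | 33 | 29 | 29 | 25 | 43.5%
GPT-4o Mini (temp=1) | 67 | 60 | 56 | 50 | 38 | 33 | 33 | 30 | 20 | 0 | 38.6%
Cohere Command R+ (Aug. 2024) | 56 | 54 | 50 | 45 | 43 | 42 | 27 | 25 | 0 | 0 | 34.1%
WizardLM 2 8x22b | 56 | 41 | 39 | 36 | 35 | 30 | 29 | 25 | 21 | 17 | 32.9%
Ministral 3B | 36 | 33 | 33 | 30 | 30 | 30 | 28 | 27 | 21 | 18 | 28.8%
Gemini 2.5 Flash Lite | 56 | 52 | 51 | 47 | 38 | 27 | 0 | 0 | 0 | 0 | 27.1%
Llama 3.1 70B | 54 | 50 | 50 | 40 | 38 | 29 | 2 | 0 | 0 | 0 | 26.1%
Ministral 3 3B | 37 | 35 | 31 | 26 | 24 | 22 | 22 | 20 | 13 | 13 | 24.3%
Llama 3.1 8B | 67 | 50 | 25 | 20 | 14 | 8 | 0 | 0 | 0 | 0 | 18.4%
Mistral NeMO | 55 | 25 | 24 | 17 | 14 | 11 | 9 | 9 | 5 | 5 | 17.4%
Gemma 3 4B | 35 | 29 | 20 | 20 | 13 | 13 | 11 | 0 | 0 | 0 | 14.1%
GPT-4.1 Nano | 100 | 21 | 12 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 13.7%
GPT-4o Mini (temp=0) | 50 | 33 | 22 | 20 | 6 | 0 | 0 | 0 | 0 | 0 | 13.2%
Rocinante 12B | 33 | 33 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.2%
Arcee AI: Trinity Large (Preview) | 17 | 13 | 12 | 3 | 1 | 1 | 1 | 1 | 0 | 0 | 4.9%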
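Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.5%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.5%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 94 | 98.9%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 94 | 98.4%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 98.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 94 | 94 | 90 | 97.3%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 97.0%
o4 Mini High | 100 | 100 | 100 | 100 | 94 | 94 | 94 | 94 | 94 | 94 | 96.6%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 90 | 96.5%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 94 | 90 | 85 | 96.4%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 94 | 89 | 96.3%
GPT-4.1 | 100 | 100 | 100 | 100 | 95 | 95 | 94 | 93 | 93 | 89 | 95.9%
Minimax M2.5 | 100 | 100 | 100 | 100 | 94 | 94 | 94 | 93 | 92 | 88 | 95.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 93 | 92 | 88 | 80 | 95.3%
o4 Mini | 100 | 100 | 100 | 94 | 94 | 94 | 94 | 94 | 89 | 88 | 94.9%
Z.AI GLM 4.7 | 100 | 100 | 100 | 95 | 95 | 95 | 94 | 90 | 89 | 88 | 94.6%
GPT-5.1 | 100 | 100 | 95 | 95 | 95 | 94 | 94 | 94 | 90 | 89 | 94.6%
Stealth: Aurora Alpha | 100 | 100 | 100 | 94 | 94 | 94 | 94 | 94 | 94 | 82 | 94.5%
GPT-5 Mini | 100 | 100 | 94 | 94 | 94 | 94 | 90 | 90 | 89 | 89 | 93.6%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 95 | 95 | 89 | 88 | 85 | 84 | 93.6%
Qwen 3.5 397B A17B | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 89 | 89 | 89 | 93.2%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 93 | 92 | 92 | 92 | 91 | 86 | 81 | 92.7%
Z.AI GLM 4.6 | 100 | 95 | 95 | 95 | 94 | 90 | 90 | 89 | 88 | 83 | 92.0%
Gemini 3 Flash (Preview) | 95 | 95 | 95 | 95 | 95 | 89 | 89 | 89 | 89 | 84 | 91.6%
Z.AI GLM 4.5 | 94 | 94 | 94 | 94 | 94 | 94 | 89 | 88 | 87 | 87 | 91.5%
GPT-5.2 | 95 | 95 | 95 | 90 | 90 | 90 | 90 | 90 | 90 | 86 | 91.2%
GPT-4o, Aug. 6th (temp=0) | 100 | 92 | 92 | 92 | 92 | 92 | 86 | 86 | 86 | 85 | 90.1%
Claude Opus 4 | 100 | 95 | 94 | 89 | 89 | 89 | 89 | 89 | 83 | 79 | 89.6%
DeepSeek-V2 Chat | 100 | 100 | 94 | 94 | 88 | 88 | 87 | 87 | 81 | 76 | 89.4%
Claude 3.5 Sonnet | 94 | 94 | 94 | 94 | 88 | 88 | 88 | 82 | 82 | 81 | 88.7%
Qwen 3.5 Plus (2026-02-15) | 95 | 95 | 89 | 89 | 89 | 89 | 89 | 89 | 79 | 78 | 88.2%
DeepSeek V3 (2024-12-26) | 100 | 100 | 94 | 88 | 88 | 87 | 83 | 82 | 81 | 78 | 88.0%
Gemini 2.5 Flash | 95 | 95 | 94 | 90 | 90 | 85 | 85 | 83 | 83 | 79 | 88.0%
Writer: Palmyra X5 | 100 | 93 | 90 | 89 | 89 | 86 | 85 | 82 | 80 | 79 | 87.4%
Claude 3.5 Haiku | 100 | 89 | 88 | 88 | 88 | 88 | 88 | 88 | 78 | 78 | 86.9%
Claude 3.7 Sonnet | 94 | 88 | 88 | 88 | 88 | 88 | 83 | 83 | 83 | 80 | 86.5%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 88 | 83 | 67 | 67 | 60 | 86.4%
GPT-4o, May 13th (temp=1) | 100 | 100 | 93 | 90 | 85 | 85 | 85 | 78 | 77 | 67 | 85.9%
Mistral Large | 94 | 89 | 89 | 88 | 88 | 83 | 82 | 82 | 81 | 81 | 85.8%
Mistral Medium 3.1 | 94 | 88 | 88 | 88 | 88 | 88 | 84 | 82 | 79 | 78 | 85.8%
Mistral Large 3 | 88 | 88 | 88 | 88 | 87 | 87 | 83 | 83 | 82 | 81 | 85.4%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 93 | 92 | 92 | 90 | 86 | 0 | 85.3%
Mistral Large 2 | 94 | 89 | 89 | 87 | 83 | 82 | 82 | 82 | 82 | 78 | 84.9%
DeepSeek V3.2 | 100 | 94 | 89 | 88 | 84 | 83 | 83 | 81 | 78 | 50 | 83.2%
Mistral Small Creative | 100 | 91 | 85 | 85 | 83 | 83 | 77 | 75 | 75 | 71 | 82.5%
Gemma 3 27B | 90 | 89 | 83 | 82 | 80 | 80 | 80 | 80 | 80 | 75 | 81.9%
Hermes 3 405B | 100 | 91 | 90 | 88 | 82 | 79 | 70 | 70 | 69 | 60 | 79.8%
DeepSeek V3 (2025-03-24) | 100 | 100 | 88 | 88 | 88 | 87 | 86 | 82 | 76 | 0 | 79.5%
GPT-4o Mini (temp=0) | 78 | 78 | 78 | 78 | 78 | 78 | 78 | 78 | 78 | 60 | 76.0%
Mistral Small 3.2 24B | 88 | 81 | 80 | 78 | 78 | 76 | 72 | 68 | 67 | 67 | 75.5%
GPT-5 Nano | 100 | 94 | 94 | 94 | 93 | 93 | 92 | 88 | 0 | 0 | 74.9%
GPT-4.1 Mini | 80 | 80 | 78 | 77 | 76 | 75 | 75 | 74 | 65 | 64 | 74.4%
Ministral 3B | 91 | 90 | 89 | 85 | 83 | 75 | 75 | 63 | 42 | 42 | 73.4%
Arcee AI: Trinity Large (Preview) | 94 | 89 | 82 | 80 | 78 | 76 | 71 | 71 | 68 | 12 | 72.0%
Qwen 2.5 72B | 100 | 89 | 88 | 80 | 78 | 75 | 71 | 50 | 38 | 33 | 70.1%
Claude Haiku 4.5 | 94 | 89 | 89 | 89 | 87 | 83 | 83 | 81 | 0 | 0 | 69.6%
Gemini 2.5 Flash Lite | 84 | 79 | 74 | 71 | 70 | 68 | 67 | 62 | 59 | 57 | 69.0%
Gemma 3 12B | 80 | 75 | 73 | 73 | 71 | 70 | 70 | 67 | 64 | 45 | 68.8%
Ministral 3 8B | 85 | 77 | 77 | 75 | 75 | 73 | 73 | 71 | 71 | 0 | 67.7%
Ministral 3 3B | 100 | 91 | 89 | 89 | 86 | 75 | 56 | 47 | 38 | 6 | 67.6%
Ministral 8B | 79 | 77 | 71 | 67 | 65 | 64 | 63 | 63 | 63 | 59 | 66.9%
DeepSeek V3.1 | 94 | 94 | 89 | 85 | 83 | 83 | 72 | 65 | 0 | 0 | 66.6%
Ministral 3 14B | 71 | 71 | 69 | 67 | 64 | 63 | 62 | 60 | 53 | 50 | 62.9%
Llama 3.1 Nemotron 70B | 89 | 86 | 78 | 75 | 58 | 57 | 56 | 54 | 50 | 0 | 60.2%
Cohere Command R+ (Aug. 2024) | 100 | 88 | 83 | 78 | 64 | 57 | 50 | 40 | 33 | 0 | 59.3%
Hermes 3 70B | 100 | 75 | 73 | 73 | 67 | 56 | 55 | 50 | 36 | 0 | 58.4%
Llama 3.1 70B | 88 | 83 | 78 | 75 | 56 | 54 | 50 | 47 | 38 | 0 | 56.7%
GPT-4o Mini (temp=1) | 83 | 71 | 63 | 60 | 57 | 56 | 50 | 43 | 29 | 0 | 51.1%
Claude 3 Haiku | 75 | 63 | 50 | 44 | 43 | 43 | 43 | 38 | 17 | 0 | 41.5%
Mistral NeMO | 55 | 50 | 31 | 30 | 22 | 18 | 17 | 15 | 13 | 7 | 25.7%
Llama 3.1 8B | 40 | 33 | 14 | 11 | 9 | 8 | 0 | 0 | 0 | 0 | 11.6%
GPT-4.1 Nano | 50 | 20 | 15 | 11 | 8 | 0 | 0 | 0 | 0 | 0 | 10.5%
Rocinante 12B | 50 | 33 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.4%
Gemma 3 4B | 20 | 13 | 10 | 10 | 8 | 7 | 0 | 0 | 0 | 0 | 6.8%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%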
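Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 99.1%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 89 | 98.9%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 97.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 89 | 80 | 96.9%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 89 | 89 | 89 | 96.7%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 89 | 89 | 88 | 88 | 95.3%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 83 | 83 | 95.2%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 89 | 88 | 88 | 88 | 95.1%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 90 | 90 | 80 | 95.1%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 94.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 75 | 93.8%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 86 | 83 | 80 | 93.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 86 | 86 | 83 | 93.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 83 | 83 | 75 | 92.7%
GPT-5 Mini | 100 | 100 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 90 | 92.6%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 86 | 86 | 86 | 83 | 83 | 92.4%
Claude Sonnet 4.6 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 92.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 86 | 75 | 71 | 92.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 89 | 89 | 89 | 89 | 89 | 88 | 88 | 91.9%
GPT-5.2 | 100 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91.8%
GPT-5 | 100 | 100 | 100 | 100 | 91 | 91 | 90 | 90 | 83 | 73 | 91.8%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 90 | 89 | 89 | 89 | 80 | 80 | 91.7%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 50 | 91.7%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 90 | 88 | 82 | 78 | 78 | 91.5%
Grok 4 Fast | 100 | 100 | 100 | 91 | 91 | 90 | 90 | 90 | 82 | 75 | 90.9%
Grok 4.1 Fast | 100 | 100 | 100 | 91 | 91 | 90 | 90 | 89 | 82 | 73 | 90.5%
Stealth: Aurora Alpha | 100 | 100 | 90 | 89 | 89 | 89 | 89 | 89 | 89 | 82 | 90.5%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 75 | 90.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 86 | 89.8%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 89 | 88 | 88 | 86 | 86 | 80 | 78 | 89.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 88 | 88 | 86 | 86 | 86 | 86 | 75 | 89.3%
Grok 4 | 100 | 100 | 100 | 100 | 91 | 90 | 90 | 78 | 73 | 70 | 89.1%
Gemini 3 Flash (Preview) | 100 | 89 | 89 | 89 | 89 | 89 | 89 | 89 | 89 | 80 | 89.1%
Claude Sonnet 4.5 | 100 | 100 | 90 | 90 | 90 | 89 | 89 | 80 | 80 | 80 | 88.8%
Z.AI GLM 4.6 | 100 | 100 | 90 | 90 | 90 | 89 | 89 | 89 | 82 | 67 | 88.5%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 86 | 86 | 86 | 83 | 83 | 83 | 75 | 88.2%
DeepSeek V3 (2024-12-26) | 100 | 100 | 88 | 86 | 86 | 86 | 86 | 83 | 83 | 83 | 88.0%
Mistral Large | 100 | 100 | 89 | 89 | 89 | 88 | 88 | 78 | 78 | 78 | 87.5%
Gemini 2.5 Flash | 100 | 100 | 100 | 89 | 89 | 82 | 80 | 80 | 80 | 70 | 87.0%
Claude Opus 4 | 100 | 89 | 89 | 89 | 89 | 88 | 88 | 78 | 78 | 78 | 86.4%
Z.AI GLM 4.5 | 100 | 89 | 89 | 89 | 89 | 89 | 88 | 80 | 78 | 70 | 86.0%
Mistral Large 2 | 100 | 89 | 88 | 88 | 88 | 88 | 88 | 78 | 75 | 75 | 85.4%
GPT-4.1 | 89 | 88 | 88 | 88 | 88 | 88 | 88 | 86 | 78 | 75 | 85.2%
Claude Sonnet 4 | 100 | 90 | 90 | 90 | 80 | 80 | 80 | 80 | 80 | 80 | 85.0%
Llama 3.1 70B | 88 | 86 | 86 | 86 | 86 | 86 | 86 | 80 | 78 | 75 | 83.5%
Mistral Large 3 | 90 | 89 | 88 | 88 | 88 | 78 | 78 | 78 | 78 | 78 | 83.0%
GPT-5 Nano | 100 | 89 | 88 | 88 | 86 | 86 | 83 | 78 | 67 | 67 | 83.0%
Mistral Small 3.2 24B | 89 | 89 | 89 | 88 | 86 | 86 | 80 | 75 | 75 | 73 | 82.8%
Llama 3.1 Nemotron 70B | 89 | 88 | 88 | 88 | 88 | 78 | 78 | 78 | 78 | 73 | 82.3%
Gemini 2.5 Flash Lite | 100 | 89 | 86 | 86 | 86 | 86 | 78 | 75 | 71 | 63 | 81.8%
Ministral 3 14B | 100 | 100 | 89 | 89 | 88 | 88 | 88 | 78 | 50 | 50 | 81.8%
DeepSeek V3.2 | 88 | 88 | 88 | 86 | 86 | 86 | 83 | 71 | 67 | 63 | 80.4%
Gemma 3 27B | 100 | 88 | 86 | 86 | 80 | 75 | 70 | 67 | 67 | 67 | 78.4%
Mistral Medium 3.1 | 89 | 89 | 89 | 89 | 89 | 89 | 89 | 80 | 80 | 0 | 78.2%
GPT-4o Mini (temp=0) | 80 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75.5%
Hermes 3 405B | 88 | 86 | 83 | 75 | 71 | 71 | 71 | 67 | 67 | 67 | 74.6%
DeepSeek V3.1 | 100 | 86 | 86 | 86 | 83 | 75 | 71 | 71 | 71 | 0 | 73.0%
Arcee AI: Trinity Mini | 100 | 100 | 83 | 80 | 67 | 60 | 60 | 60 | 57 | 50 | 71.7%
GPT-4o Mini (temp=1) | 80 | 80 | 80 | 80 | 80 | 75 | 75 | 60 | 50 | 50 | 71.0%
Mistral Small Creative | 89 | 80 | 80 | 70 | 70 | 67 | 64 | 62 | 58 | 54 | 69.3%
Arcee AI: Trinity Large (Preview) | 80 | 75 | 70 | 70 | 70 | 67 | 64 | 62 | 55 | 41 | 65.3%
Hermes 3 70B | 100 | 83 | 80 | 75 | 71 | 67 | 63 | 56 | 36 | 0 | 63.1%
Ministral 3 3B | 75 | 67 | 67 | 63 | 60 | 60 | 60 | 56 | 56 | 50 | 61.2%
Gemma 3 12B | 75 | 63 | 63 | 63 | 57 | 57 | 57 | 50 | 50 | 44 | 57.8%
Ministral 3B | 86 | 56 | 56 | 56 | 56 | 54 | 50 | 44 | 42 | 38 | 53.5%
Ministral 3 8B | 64 | 62 | 58 | 58 | 54 | 50 | 47 | 46 | 40 | 35 | 51.4%
Cohere Command R+ (Aug. 2024) | 100 | 83 | 80 | 75 | 71 | 67 | 0 | 0 | 0 | 0 | 47.6%
Ministral 8B | 73 | 58 | 53 | 53 | 50 | 50 | 47 | 46 | 19 | 0 | 45.0%
Llama 3.1 8B | 67 | 63 | 56 | 56 | 50 | 50 | 50 | 44 | 0 | 0 | 43.5%
Mistral NeMO | 75 | 55 | 54 | 54 | 50 | 50 | 33 | 33 | 0 | 0 | 40.4%
Gemma 3 4B | 57 | 47 | 43 | 38 | 37 | 35 | 33 | 29 | 27 | 26 | 37.2%
Rocinante 12B | 67 | 64 | 33 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 17.8%
GPT-4.1 Nano | 25 | 23 | 18 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 7.7%
WizardLM 2 8x22b | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.0%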
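Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 99.1%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 99.1%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 99.1%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 99.1%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 89 | 98.9%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 98.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 89 | 88 | 97.6%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 63 | 96.3%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 96.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 96.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 96.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 89 | 89 | 89 | 95.7%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 89 | 89 | 89 | 95.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 86 | 78 | 95.1%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 95.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 94.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 89 | 89 | 89 | 88 | 86 | 94.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 93.6%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 88 | 88 | 93.5%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 89 | 89 | 89 | 89 | 89 | 89 | 93.3%
Claude Sonnet 4.6 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 86 | 83 | 83 | 80 | 80 | 80 | 89.2%
Gemma 3 27B | 100 | 89 | 89 | 88 | 88 | 88 | 88 | 86 | 86 | 86 | 88.5%
Writer: Palmyra X5 | 90 | 90 | 90 | 90 | 89 | 89 | 88 | 88 | 88 | 80 | 88.0%
Gemini 2.5 Flash Lite | 89 | 89 | 89 | 89 | 89 | 89 | 89 | 89 | 89 | 80 | 88.0%
Claude Haiku 4.5 | 89 | 89 | 89 | 89 | 89 | 89 | 89 | 78 | 78 | 78 | 85.6%
Mistral NeMO | 88 | 86 | 86 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 84.2%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 89 | 88 | 83 | 71 | 0 | 83.1%
Claude 3.5 Haiku | 88 | 88 | 88 | 86 | 86 | 86 | 86 | 86 | 71 | 57 | 82.0%
Arcee AI: Trinity Large (Preview) | 90 | 90 | 83 | 82 | 80 | 80 | 80 | 75 | 73 | 69 | 80.2%
Mistral Small Creative | 82 | 82 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 73 | 79.6%
Ministral 3 3B | 88 | 88 | 88 | 88 | 86 | 86 | 78 | 71 | 67 | 56 | 79.3%
Llama 3.1 8B | 100 | 100 | 89 | 88 | 80 | 78 | 71 | 63 | 56 | 50 | 77.4%
Ministral 3 14B | 90 | 89 | 88 | 78 | 78 | 78 | 78 | 70 | 67 | 50 | 76.4%
Ministral 3B | 88 | 86 | 83 | 78 | 75 | 75 | 75 | 71 | 64 | 63 | 75.7%
Claude 3 Haiku | 83 | 83 | 83 | 80 | 80 | 80 | 80 | 71 | 60 | 50 | 75.1%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 50 | 33 | 0 | 75.0%
Ministral 8B | 89 | 89 | 88 | 78 | 73 | 70 | 70 | 67 | 60 | 55 | 73.7%
Ministral 3 8B | 88 | 80 | 80 | 75 | 75 | 73 | 67 | 67 | 60 | 55 | 71.8%
Rocinante 12B | 100 | 86 | 80 | 67 | 67 | 50 | 50 | 50 | 50 | 25 | 62.4%
Gemma 3 4B | 40 | 33 | 20 | 20 | 20 | 20 | 17 | 17 | 17 | 14 | 21.8%
WizardLM 2 8x22b | 100 | 100 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21.7%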

tiers

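Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 98.3%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 98.3%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 98.3%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 98.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 98.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 96.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 96.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 96.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 80 | 96.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 80 | 96.3%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 96.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 96.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 83 | 95.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 67 | 95.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 80 | 80 | 94.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 75 | 93.5%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 83 | 83 | 93.3%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 83 | 83 | 83 | 91.7%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 80 | 80 | 80 | 80 | 80 | 90.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 75 | 75 | 75 | 90.0%
Mistral Large | 100 | 100 | 100 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 88.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 75 | 67 | 67 | 88.3%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 75 | 0 | 85.0%
Minimax M2.5 | 100 | 100 | 100 | 83 | 80 | 80 | 80 | 75 | 75 | 75 | 84.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 83 | 83 | 80 | 67 | 67 | 67 | 84.7%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 80 | 71 | 0 | 83.5%
Mistral Large 3 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83.3%
Mistral Large 2 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 75 | 75 | 75 | 75 | 60 | 60 | 82.0%
GPT-4.1 Mini | 100 | 100 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 50 | 81.0%
GPT-4o Mini (temp=0) | 100 | 100 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 80.0%
Claude Sonnet 4.6 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80.0%
Gemma 3 27B | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80.0%
Mistral Medium 3.1 | 100 | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 67 | 67 | 79.3%
Gemma 3 12B | 100 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 77.5%
GPT-4o, May 13th (temp=1) | 83 | 83 | 80 | 80 | 67 | 67 | 67 | 67 | 67 | 67 | 72.7%
GPT-4o, Aug. 6th (temp=0) | 83 | 83 | 83 | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 71.7%
GPT-4o, Aug. 6th (temp=1) | 83 | 83 | 80 | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 71.3%
Writer: Palmyra X5 | 80 | 80 | 80 | 80 | 80 | 80 | 67 | 67 | 67 | 33 | 71.3%
Claude Haiku 4.5 | 80 | 80 | 80 | 80 | 75 | 67 | 67 | 60 | 60 | 60 | 70.8%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 75 | 67 | 60 | 50 | 50 | 50 | 33 | 68.5%
Arcee AI: Trinity Large (Preview) | 80 | 80 | 80 | 80 | 67 | 67 | 60 | 57 | 57 | 50 | 67.8%
Gemini 2.5 Flash Lite | 75 | 75 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 63.0%
Mistral NeMO | 80 | 75 | 75 | 75 | 67 | 60 | 60 | 50 | 40 | 33 | 61.5%
Mistral Small Creative | 67 | 67 | 67 | 67 | 67 | 67 | 57 | 57 | 50 | 50 | 61.4%
Ministral 3 3B | 80 | 80 | 67 | 60 | 60 | 57 | 50 | 50 | 31 | 25 | 56.0%
Mistral Small 3.2 24B | 67 | 60 | 60 | 60 | 57 | 57 | 50 | 50 | 50 | 40 | 55.1%
Llama 3.1 8B | 80 | 80 | 63 | 63 | 50 | 50 | 50 | 50 | 43 | 17 | 54.5%
Claude 3 Haiku | 67 | 67 | 60 | 60 | 60 | 50 | 50 | 50 | 40 | 40 | 54.3%
Hermes 3 70B | 75 | 75 | 67 | 67 | 50 | 50 | 50 | 40 | 33 | 0 | 50.7%
Ministral 3B | 80 | 67 | 63 | 60 | 50 | 43 | 40 | 40 | 33 | 25 | 50.0%
Ministral 3 14B | 57 | 57 | 50 | 50 | 50 | 50 | 43 | 43 | 43 | 43 | 48.6%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 33 | 0 | 0 | 0 | 0 | 0 | 43.3%
Ministral 8B | 57 | 38 | 38 | 33 | 33 | 30 | 30 | 30 | 30 | 27 | 34.6%
Gemma 3 4B | 50 | 50 | 50 | 33 | 25 | 25 | 25 | 25 | 25 | 25 | 33.3%
Ministral 3 8B | 38 | 33 | 33 | 33 | 30 | 27 | 20 | 20 | 20 | 18 | 27.3%
Rocinante 12B | 60 | 50 | 50 | 20 | 14 | 11 | 0 | 0 | 0 | 0 | 20.5%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%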
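Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 98.6%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 98.6%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 98.6%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 98.6%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 98.6%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 98.6%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 98.6%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 98.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 98.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 97.5%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 96.7%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 96.7%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 86 | 83 | 83 | 93.8%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 83 | 83 | 83 | 93.6%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 86 | 83 | 80 | 93.5%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 83 | 83 | 93.3%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 83 | 83 | 93.3%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 86 | 86 | 86 | 83 | 83 | 92.4%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 86 | 83 | 83 | 83 | 83 | 91.9%
GPT-5.1 | 100 | 100 | 100 | 100 | 86 | 86 | 86 | 86 | 86 | 86 | 91.4%
GPT-5 | 100 | 100 | 100 | 100 | 86 | 86 | 86 | 86 | 86 | 86 | 91.4%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 86 | 83 | 50 | 90.5%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 67 | 67 | 67 | 90.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 83 | 83 | 83 | 80 | 80 | 80 | 89.0%
GPT-5 Nano | 100 | 100 | 100 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 88.3%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 0 | 88.3%
GPT-5 Mini | 100 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 87.1%
GPT-4.1 | 100 | 100 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 86.7%
Mistral Large | 100 | 100 | 100 | 86 | 83 | 83 | 83 | 83 | 71 | 71 | 86.2%
Gemma 3 27B | 100 | 100 | 86 | 83 | 83 | 83 | 83 | 83 | 83 | 67 | 85.2%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Gemini 2.5 Flash | 100 | 86 | 86 | 86 | 86 | 83 | 83 | 83 | 83 | 71 | 84.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 86 | 83 | 83 | 83 | 80 | 80 | 75 | 75 | 84.6%
DeepSeek V3.2 | 100 | 100 | 100 | 86 | 86 | 86 | 86 | 80 | 67 | 50 | 84.0%
Claude Sonnet 4.6 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83.3%
Claude Haiku 4.5 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83.3%
GPT-4.1 Mini | 86 | 86 | 86 | 86 | 86 | 83 | 83 | 83 | 83 | 71 | 83.3%
Mistral Small 3.2 24B | 100 | 100 | 100 | 86 | 83 | 83 | 83 | 67 | 63 | 57 | 82.2%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 83 | 80 | 80 | 80 | 71 | 67 | 57 | 81.9%
Hermes 3 405B | 100 | 100 | 100 | 83 | 83 | 83 | 83 | 83 | 80 | 20 | 81.7%
DeepSeek V3.1 | 100 | 86 | 86 | 83 | 83 | 83 | 80 | 71 | 60 | 57 | 79.0%
Mistral Large 3 | 86 | 83 | 83 | 83 | 83 | 83 | 71 | 71 | 71 | 71 | 78.8%
Mistral Large 2 | 100 | 83 | 83 | 83 | 83 | 71 | 71 | 71 | 67 | 67 | 78.1%
Gemma 3 12B | 86 | 83 | 83 | 83 | 83 | 83 | 71 | 71 | 67 | 60 | 77.2%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 83 | 83 | 80 | 67 | 67 | 67 | 67 | 57 | 77.0%
WizardLM 2 8x22b | 100 | 100 | 86 | 83 | 83 | 80 | 71 | 67 | 67 | 0 | 73.7%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 0 | 0 | 0 | 68.6%
Ministral 3 8B | 75 | 75 | 75 | 67 | 67 | 67 | 67 | 67 | 60 | 60 | 67.8%
Ministral 8B | 100 | 75 | 67 | 67 | 67 | 60 | 56 | 56 | 55 | 50 | 65.1%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 83 | 80 | 80 | 60 | 57 | 50 | 40 | 0 | 65.0%
Claude 3 Haiku | 80 | 80 | 80 | 80 | 60 | 57 | 50 | 50 | 50 | 50 | 63.7%
Hermes 3 70B | 83 | 75 | 71 | 71 | 71 | 67 | 67 | 67 | 50 | 0 | 62.3%
Mistral NeMO | 75 | 75 | 71 | 71 | 60 | 60 | 60 | 50 | 50 | 43 | 61.6%
Ministral 3 14B | 67 | 67 | 67 | 67 | 60 | 57 | 57 | 50 | 50 | 43 | 58.4%
Ministral 3 3B | 83 | 44 | 43 | 40 | 40 | 38 | 22 | 22 | 18 | 15 | 36.6%
Ministral 3B | 75 | 57 | 50 | 45 | 40 | 38 | 19 | 13 | 0 | 0 | 33.6%
Rocinante 12B | 75 | 75 | 67 | 67 | 50 | 0 | 0 | 0 | 0 | 0 | 33.3%
Gemma 3 4B | 67 | 50 | 50 | 40 | 40 | 33 | 25 | 17 | 0 | 0 | 32.2%
Llama 3.1 8B | 33 | 33 | 30 | 25 | 17 | 13 | 0 | 0 | 0 | 0 | 15.1%
GPT-4.1 Nano | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%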
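Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 99.1%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 99.1%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 92 | 98.3%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 91 | 98.2%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 83 | 96.7%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 91 | 91 | 91 | 96.4%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 90 | 89 | 89 | 95.9%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 82 | 95.2%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 82 | 95.2%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 89 | 82 | 95.1%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 75 | 95.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 92 | 92 | 91 | 91 | 91 | 91 | 94.7%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 85 | 85 | 82 | 94.2%
Gemini 2.5 Pro | 100 | 100 | 100 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 94.2%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 90 | 89 | 89 | 88 | 80 | 93.5%
GPT-5.2 | 100 | 100 | 92 | 92 | 92 | 92 | 91 | 91 | 91 | 91 | 93.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 91 | 90 | 89 | 88 | 88 | 82 | 92.7%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 90 | 89 | 89 | 89 | 88 | 82 | 92.6%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 71 | 67 | 92.1%
Gemini 2.5 Flash | 100 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91.8%
Gemini 3 Flash (Preview) | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 90.9%
MoonshotAI: Kimi K2.5 | 100 | 92 | 92 | 92 | 91 | 91 | 91 | 91 | 85 | 83 | 90.7%
GPT-4o, Aug. 6th (temp=0) | 91 | 91 | 91 | 91 | 91 | 91 | 90 | 90 | 90 | 90 | 90.5%
Claude 3.5 Sonnet | 100 | 92 | 92 | 92 | 92 | 92 | 91 | 90 | 83 | 82 | 90.4%
Mistral Medium 3.1 | 92 | 92 | 91 | 91 | 91 | 91 | 91 | 91 | 90 | 83 | 90.2%
ByteDance Seed 1.6 | 91 | 91 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.2%
GPT-4o, May 13th (temp=0) | 100 | 90 | 90 | 90 | 89 | 89 | 88 | 88 | 88 | 86 | 89.6%
Claude Sonnet 4 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 83 | 83 | 89.4%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 0 | 89.2%
Writer: Palmyra X5 | 100 | 91 | 90 | 89 | 89 | 89 | 89 | 88 | 83 | 78 | 88.5%
GPT-5 Mini | 91 | 90 | 90 | 90 | 90 | 90 | 90 | 89 | 83 | 82 | 88.5%
Z.AI GLM 4.7 | 92 | 92 | 91 | 91 | 91 | 91 | 83 | 83 | 83 | 83 | 88.0%
Mistral Large 3 | 100 | 100 | 90 | 90 | 90 | 89 | 80 | 80 | 80 | 80 | 87.9%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 82 | 82 | 80 | 78 | 77 | 75 | 87.3%
Mistral Large | 92 | 92 | 92 | 91 | 90 | 89 | 83 | 83 | 80 | 80 | 87.1%
GPT-4o, May 13th (temp=1) | 100 | 100 | 89 | 88 | 88 | 88 | 86 | 83 | 75 | 75 | 87.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 89 | 89 | 89 | 88 | 75 | 75 | 60 | 86.4%
Z.AI GLM 4.5 | 100 | 89 | 89 | 89 | 88 | 88 | 82 | 80 | 80 | 75 | 85.8%
Claude Opus 4 | 92 | 92 | 92 | 83 | 83 | 83 | 83 | 82 | 82 | 82 | 85.4%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 90 | 89 | 89 | 82 | 0 | 85.0%
Stealth: Aurora Alpha | 100 | 89 | 89 | 82 | 82 | 82 | 82 | 82 | 82 | 80 | 84.9%
Mistral Large 2 | 91 | 91 | 91 | 90 | 83 | 83 | 83 | 82 | 80 | 73 | 84.7%
Z.AI GLM 4.6 | 100 | 92 | 91 | 91 | 83 | 83 | 82 | 75 | 73 | 73 | 84.2%
GPT-4.1 Mini | 100 | 90 | 89 | 89 | 89 | 88 | 88 | 71 | 67 | 67 | 83.6%
Qwen 3.5 Plus (2026-02-15) | 91 | 90 | 86 | 85 | 85 | 83 | 83 | 77 | 77 | 77 | 83.3%
DeepSeek V3.1 | 92 | 92 | 91 | 90 | 83 | 83 | 80 | 75 | 67 | 67 | 81.9%
Ministral 3 14B | 100 | 100 | 90 | 89 | 75 | 75 | 75 | 75 | 67 | 67 | 81.2%
Claude 3.7 Sonnet | 100 | 100 | 91 | 91 | 90 | 73 | 70 | 69 | 64 | 55 | 80.2%
Qwen 3.5 397B A17B | 92 | 91 | 91 | 91 | 91 | 91 | 83 | 83 | 82 | 0 | 79.5%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 90 | 0 | 0 | 78.2%
Llama 3.1 Nemotron 70B | 100 | 88 | 88 | 88 | 75 | 71 | 71 | 67 | 67 | 67 | 78.0%
DeepSeek V3 (2025-03-24) | 89 | 89 | 88 | 88 | 78 | 75 | 75 | 75 | 73 | 43 | 77.1%
Llama 3.1 70B | 100 | 88 | 88 | 78 | 78 | 75 | 75 | 75 | 56 | 43 | 75.4%
Mistral Small 3.2 24B | 88 | 86 | 80 | 75 | 75 | 73 | 73 | 70 | 67 | 67 | 75.2%
Hermes 3 405B | 100 | 100 | 89 | 80 | 73 | 71 | 70 | 57 | 56 | 50 | 74.6%
Mistral Small Creative | 80 | 80 | 80 | 78 | 78 | 73 | 70 | 70 | 70 | 67 | 74.5%
DeepSeek-V2 Chat | 100 | 89 | 89 | 89 | 88 | 78 | 75 | 70 | 67 | 0 | 74.4%
Arcee AI: Trinity Large (Preview) | 90 | 82 | 82 | 80 | 78 | 70 | 70 | 67 | 64 | 56 | 73.7%
Gemma 3 27B | 88 | 78 | 75 | 75 | 73 | 67 | 67 | 67 | 67 | 67 | 72.1%
Cohere Command R+ (Aug. 2024) | 100 | 88 | 78 | 78 | 75 | 67 | 64 | 58 | 57 | 50 | 71.4%
Gemini 2.5 Flash Lite | 100 | 100 | 92 | 90 | 82 | 82 | 80 | 78 | 0 | 0 | 70.3%
WizardLM 2 8x22b | 88 | 75 | 67 | 67 | 67 | 67 | 60 | 58 | 57 | 43 | 64.8%
GPT-4o Mini (temp=0) | 80 | 75 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 57 | 63.2%
GPT-4o Mini (temp=1) | 86 | 80 | 80 | 75 | 75 | 60 | 50 | 40 | 33 | 33 | 61.2%
Ministral 3 8B | 80 | 71 | 69 | 64 | 58 | 58 | 55 | 46 | 45 | 42 | 58.9%
Hermes 3 70B | 86 | 80 | 71 | 71 | 67 | 60 | 57 | 33 | 0 | 0 | 52.6%
Claude 3 Haiku | 83 | 67 | 60 | 57 | 50 | 50 | 43 | 43 | 40 | 20 | 51.3%
Ministral 8B | 69 | 67 | 67 | 62 | 58 | 58 | 55 | 46 | 0 | 0 | 48.1%
Gemma 3 12B | 63 | 57 | 56 | 56 | 56 | 44 | 43 | 42 | 38 | 0 | 45.3%
Ministral 3B | 63 | 55 | 50 | 50 | 50 | 44 | 44 | 40 | 36 | 18 | 45.0%
Ministral 3 3B | 60 | 60 | 57 | 55 | 50 | 50 | 36 | 29 | 25 | 22 | 44.4%
Mistral NeMO | 67 | 56 | 56 | 44 | 40 | 40 | 38 | 33 | 33 | 27 | 43.3%
Llama 3.1 8B | 80 | 67 | 56 | 50 | 29 | 27 | 17 | 9 | 0 | 0 | 33.4%
Qwen 2.5 72B | 100 | 38 | 38 | 33 | 29 | 24 | 19 | 16 | 14 | 9 | 31.8%
Rocinante 12B | 50 | 25 | 20 | 20 | 14 | 13 | 0 | 0 | 0 | 0 | 14.2%
Gemma 3 4B | 40 | 20 | 14 | 13 | 12 | 9 | 9 | 7 | 5 | 0 | 12.8%
GPT-4.1 Nano | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.2%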
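Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 99.1%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 91 | 98.2%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 91 | 98.2%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 90 | 98.1%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 98.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 89 | 97.9%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 89 | 89 | 97.8%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 89 | 89 | 97.8%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 90 | 90 | 97.1%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 90 | 90 | 97.1%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 97.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 89 | 96.9%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 96.3%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 89 | 89 | 89 | 95.7%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 86 | 78 | 95.1%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 88 | 83 | 95.1%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 83 | 80 | 95.1%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 91 | 91 | 83 | 82 | 94.7%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 89 | 89 | 75 | 94.3%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3%
GPT-5.2 | 100 | 100 | 100 | 91 | 91 | 91 | 91 | 91 | 91 | 83 | 92.9%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 86 | 71 | 92.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 89 | 88 | 86 | 86 | 71 | 91.9%
Claude Opus 4 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 89 | 88 | 80 | 91.6%
GPT-5 Nano | 100 | 100 | 100 | 100 | 91 | 88 | 88 | 88 | 88 | 70 | 91.1%
Claude 3.5 Sonnet | 100 | 100 | 89 | 89 | 89 | 89 | 89 | 88 | 88 | 88 | 90.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 89 | 88 | 86 | 86 | 86 | 83 | 75 | 89.2%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 82 | 80 | 80 | 89.2%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 90 | 90 | 89 | 83 | 70 | 67 | 88.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 71 | 71 | 70 | 88.3%
Gemini 3 Flash (Preview) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 88.0%
DeepSeek V3.2 | 100 | 100 | 91 | 90 | 90 | 89 | 88 | 78 | 78 | 69 | 87.2%
Gemma 3 27B | 100 | 100 | 88 | 86 | 86 | 86 | 86 | 78 | 75 | 70 | 85.3%
DeepSeek V3 (2024-12-26) | 100 | 90 | 90 | 88 | 88 | 86 | 83 | 80 | 78 | 70 | 85.2%
Hermes 3 405B | 100 | 100 | 100 | 100 | 86 | 80 | 75 | 71 | 71 | 67 | 85.0%
GPT-4.1 Mini | 100 | 88 | 88 | 88 | 86 | 86 | 83 | 78 | 78 | 75 | 84.8%
Ministral 3 14B | 100 | 100 | 90 | 89 | 89 | 82 | 80 | 80 | 67 | 62 | 83.8%
DeepSeek V3 (2025-03-24) | 100 | 100 | 90 | 89 | 80 | 75 | 73 | 70 | 70 | 67 | 81.3%
Claude Haiku 4.5 | 100 | 100 | 90 | 89 | 89 | 89 | 89 | 80 | 78 | 0 | 80.3%
Ministral 3 3B | 100 | 100 | 100 | 86 | 80 | 75 | 75 | 67 | 60 | 50 | 79.2%
DeepSeek V3.1 | 100 | 100 | 90 | 88 | 80 | 80 | 75 | 73 | 56 | 50 | 79.1%
Writer: Palmyra X5 | 100 | 89 | 89 | 88 | 78 | 75 | 73 | 73 | 67 | 60 | 79.0%
GPT-4.1 | 100 | 89 | 80 | 80 | 78 | 75 | 73 | 73 | 73 | 56 | 77.5%
Z.AI GLM 4.5 | 100 | 89 | 89 | 89 | 86 | 83 | 78 | 75 | 44 | 40 | 77.3%
Qwen 2.5 72B | 100 | 100 | 83 | 83 | 80 | 80 | 67 | 63 | 57 | 50 | 76.3%
Mistral Medium 3.1 | 90 | 83 | 80 | 80 | 75 | 70 | 70 | 64 | 60 | 58 | 73.1%
Mistral Small Creative | 88 | 80 | 80 | 75 | 75 | 73 | 73 | 73 | 67 | 40 | 72.2%
Ministral 3 8B | 88 | 88 | 88 | 80 | 80 | 78 | 75 | 75 | 55 | 0 | 70.5%
Ministral 3B | 100 | 100 | 80 | 67 | 60 | 60 | 60 | 50 | 50 | 50 | 67.7%
Hermes 3 70B | 100 | 75 | 71 | 67 | 67 | 60 | 56 | 50 | 50 | 43 | 63.8%
Mistral Small 3.2 24B | 86 | 75 | 75 | 75 | 64 | 63 | 55 | 55 | 55 | 31 | 63.1%
WizardLM 2 8x22b | 83 | 75 | 75 | 67 | 67 | 63 | 56 | 50 | 44 | 43 | 62.2%
Llama 3.1 Nemotron 70B | 100 | 88 | 75 | 75 | 63 | 63 | 56 | 44 | 38 | 20 | 62.1%
Arcee AI: Trinity Mini | 100 | 100 | 75 | 67 | 67 | 50 | 50 | 33 | 33 | 0 | 57.5%
Llama 3.1 70B | 100 | 64 | 63 | 60 | 60 | 56 | 50 | 44 | 40 | 0 | 53.6%
GPT-4o Mini (temp=0) | 67 | 57 | 50 | 50 | 40 | 40 | 40 | 40 | 40 | 25 | 44.9%
Cohere Command R+ (Aug. 2024) | 63 | 63 | 60 | 60 | 60 | 38 | 31 | 22 | 20 | 0 | 41.5%
Arcee AI: Trinity Large (Preview) | 100 | 75 | 73 | 64 | 47 | 44 | 1 | 1 | 0 | 0 | 40.5%
GPT-4o Mini (temp=1) | 60 | 50 | 50 | 40 | 40 | 40 | 40 | 33 | 20 | 0 | 37.3%
Ministral 8B | 75 | 70 | 70 | 67 | 60 | 0 | 0 | 0 | 0 | 0 | 34.2%
Gemini 2.5 Flash Lite | 54 | 53 | 50 | 47 | 44 | 43 | 38 | 7 | 0 | 0 | 33.5%
Gemma 3 12B | 50 | 40 | 40 | 40 | 40 | 20 | 20 | 20 | 20 | 0 | 29.0%
Claude 3 Haiku | 33 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 20 | 17 | 25.2%
Llama 3.1 8B | 67 | 50 | 31 | 25 | 14 | 0 | 0 | 0 | 0 | 0 | 18.7%
Mistral NeMO | 80 | 40 | 22 | 15 | 10 | 9 | 6 | 0 | 0 | 0 | 18.3%
Gemma 3 4B | 25 | 20 | 17 | 13 | 11 | 0 | 0 | 0 | 0 | 0 | 8.5%
Rocinante 12B | 33 | 25 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.3%
GPT-4.1 Nano | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5%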