No hallucinated violations

Test: Codex Red Herring (False Positive Detection)

Avg. Score: 51.9%
Scenarios: 8

Overall Performance

Rank ▲ | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Grok 4.1 Fast | 97.9% | $0.0019 | 12.5s | 79%
2 | ByteDance Seed 1.6 Flash | 95.7% | $0.0008 | 9.1s | 69%
3 | o4 Mini | 97.5% | $0.014 | 25.0s | 78%
4 | Gemini 2.5 Flash Lite (Reasoning) | 94.4% | $0.0023 | 16.6s | 68%
5 | GPT-5.1 | 96.9% | $0.025 | 26.1s | 76%
6 | o4 Mini High | 98.1% | $0.027 | 52.5s | 81%
7 | GPT-5 Mini | 95.0% | $0.0059 | 37.8s | 67%
8 | GPT-5 Nano | 95.0% | $0.0035 | 1.1m | 70%
9 | Gemini 2.5 Flash (Reasoning) | 91.3% | $0.0085 | 14.2s | 62%
10 | ByteDance Seed 1.6 | 89.6% | $0.0043 | 32.7s | 56%
11 | GPT-5.2 | 89.3% | $0.013 | 14.5s | 55%
12 | Z.AI GLM 5 | 95.2% | $0.017 | 1.4m | 69%
13 | Aion 2.0 | 91.9% | $0.0096 | 1.3m | 63%
14 | Minimax M2.5 | 85.5% | $0.0032 | 25.9s | 50%
15 | GPT-4.1 | 85.9% | $0.0048 | 1.2s | 43%
16 | Grok 4 Fast | 80.3% | $0.0019 | 19.2s | 44%
17 | Arcee AI: Trinity Mini | 79.2% | $0.0009 | 26.3s | 33%
18 | Z.AI GLM 4.7 Flash | 83.4% | $0.0039 | 2.5m | 49%
19 | GPT-5 | 83.5% | $0.048 | 1.5m | 52%
20 | Claude Opus 4.6 (Reasoning) | 96.3% | $0.120 | 1.0m | 74%
21 | Ministral 3 8B | 74.5% | $0.0013 | 17.9s | 14%
22 | Claude Opus 4.6 | 79.7% | $0.049 | 13.2s | 34%
23 | Z.AI GLM 4.6 | 74.6% | $0.015 | 59.0s | 31%
24 | Claude Haiku 4.5 | 51.9% | $0.0080 | 4.2s | 28%
25 | Z.AI GLM 4.7 | 75.9% | $0.023 | 2.2m | 46%
26 | Z.AI GLM 4.5 | 63.4% | $0.0038 | 25.0s | 19%
27 | MoonshotAI: Kimi K2.5 | 75.6% | $0.017 | 2.6m | 48%
28 | Claude Sonnet 4.6 | 66.2% | $0.033 | 17.4s | 25%
29 | GPT-4.1 Nano | 58.4% | $0.0005 | 4.6s | 8%
30 | Ministral 8B | 60.8% | $0.0007 | 4.4s | 6%
31 | Cohere Command R+ (Aug. 2024) | 60.6% | $0.017 | 7.4s | 15%
32 | Gemini 3 Flash (Preview, Reasoning) | 67.6% | $0.028 | 49.2s | 20%
33 | Gemini 2.5 Flash | 31.4% | $0.0021 | 1.5s | 23%
34 | DeepSeek V3 (2024-12-26) | 34.3% | $0.0023 | 6.6s | 21%
35 | GPT-4o, Aug. 6th (temp=1) | 45.0% | $0.0094 | 2.4s | 13%
36 | Hermes 3 405B | 47.5% | $0.0059 | 5.2s | 9%
37 | Claude Opus 4.5 | 52.4% | $0.036 | 4.4s | 20%
38 | Hermes 3 70B | 43.4% | $0.0018 | 12.1s | 7%
39 | Gemini 3 Flash (Preview) | 25.0% | $0.0034 | 3.2s | 20%
40 | Claude Sonnet 4 | 45.5% | $0.022 | 4.5s | 13%
41 | DeepSeek-V2 Chat | 30.9% | $0.0023 | 8.3s | 15%
42 | Grok 4 | 75.5% | $0.074 | 1.6m | 37%
43 | DeepSeek V3 (2025-03-24) | 31.5% | $0.0015 | 15.7s | 15%
44 | Mistral Medium 3.1 | 39.4% | $0.0032 | 4.7s | 5%
45 | GPT-4o, Aug. 6th (temp=0) | 31.4% | $0.011 | 2.6s | 14%
46 | Qwen 2.5 72B | 35.2% | $0.0009 | 11.3s | 6%
47 | Claude Sonnet 4.5 | 32.8% | $0.024 | 5.9s | 19%
48 | GPT-4.1 Mini | 35.6% | $0.0018 | 6.9s | 4%
49 | Claude 3 Haiku | 19.3% | $0.0020 | 3.5s | 16%
50 | Llama 3.1 70B | 33.5% | $0.0031 | 16.4s | 7%
51 | Claude 3.7 Sonnet | 35.5% | $0.022 | 4.3s | 13%
52 | Rocinante 12B | 32.3% | $0.0015 | 13.3s | 6%
53 | GPT-4o Mini (temp=1) | 28.2% | $0.0008 | 5.6s | 6%
54 | Gemini 2.5 Flash Lite | 32.1% | $0.0010 | 4.0s | 2%
55 | Gemini 2.5 Pro | 70.4% | $0.073 | 52.1s | 21%
56 | Mistral Small 3.2 24B | 31.3% | $0.0010 | 11.9s | 3%
57 | Mistral Large 3 | 32.2% | $0.0040 | 11.7s | 2%
58 | Writer: Palmyra X5 | 24.0% | $0.0085 | 13.8s | 12%
59 | WizardLM 2 8x22b | 32.6% | $0.0041 | 55.3s | 12%
60 | Gemini 3.1 Pro (Preview) | 76.9% | $0.120 | 1.5m | 45%
61 | GPT-4o Mini (temp=0) | 24.5% | $0.0009 | 14.5s | 6%
62 | Claude 3.5 Sonnet | 39.1% | $0.045 | 6.9s | 15%
63 | DeepSeek V3.1 | 24.5% | $0.0019 | 36.6s | 11%
64 | Mistral Large 2 | 32.2% | $0.016 | 11.4s | 2%
65 | Claude Sonnet 4.6 (Reasoning) | 87.5% | $0.145 | 2.2m | 55%
66 | DeepSeek V3.2 | 21.3% | $0.0018 | 42.8s | 11%
67 | Gemma 3 12B | 13.7% | $0.0005 | 11.7s | 10%
68 | Ministral 3 14B | 29.3% | $0.0019 | 29.2s | 1%
69 | Mistral Large | 31.9% | $0.016 | 12.3s | 2%
70 | Gemma 3 27B | 12.5% | $0.0007 | 12.6s | 10%
71 | GPT-4o, May 13th (temp=1) | 23.5% | $0.033 | 3.0s | 16%
72 | Llama 3.1 Nemotron 70B | 22.6% | $0.0077 | 21.4s | 5%
73 | Ministral 3B | 9.9% | $0.0004 | 12.4s | 6%
74 | Mistral Small Creative | 9.3% | $0.0011 | 8.9s | 4%
75 | Gemini 3 Pro (Preview) | 75.6% | $0.144 | 1.7m | 47%
76 | Llama 3.1 8B | 10.0% | $0.0003 | 33.8s | 8%
77 | Gemma 3 4B | 5.6% | $0.0003 | 15.0s | 5%
78 | Ministral 3 3B | 6.7% | $0.0013 | 31.1s | 5%
79 | GPT-4o, May 13th (temp=0) | 16.7% | $0.043 | 6.4s | 11%
80 | Mistral NeMO | 14.7% | $0.0018 | 1.6m | 4%
81 | Arcee AI: Trinity Large (Preview) | 15.7% | $0.0000 | 1.7m | 2%
82 | Claude Opus 4 | 36.0% | $0.109 | 7.1s | 22%
83 | Qwen 3.5 Plus (2026-02-15) | 22.0% | $0.021 | 2.0m | 12%
84 | Qwen 3.5 397B A17B | 61.5% | $0.028 | 5.4m | 23%
Average score: 51.93%

Individual Scenarios

basic entries

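Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 20 | 92.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 2 | 90.2%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 1 | 90.1%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 86.7%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 86.7%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 83.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 83.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 33 | 20 | 68.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 0 | 63.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 33 | 61.7%
GPT-5 Mini | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 60.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 20 | 7 | 57.7%
Gemini 3 Pro (Preview) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 53.3%
Gemini 3.1 Pro (Preview) | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 25 | 52.5%
Z.AI GLM 4.7 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 46.7%
Rocinante 12B | 100 | 100 | 100 | 50 | 25 | 20 | 20 | 20 | 10 | 6 | 45.1%
Gemini 2.5 Pro | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 45.0%
Grok 4 | 100 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 40.0%
GPT-4.1 | 100 | 100 | 50 | 33 | 33 | 20 | 17 | 17 | 17 | 10 | 39.7%
Hermes 3 405B | 100 | 100 | 33 | 33 | 25 | 20 | 20 | 17 | 13 | 13 | 37.3%
Qwen 3.5 397B A17B | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 34.2%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 8 | 8 | 4 | 4 | 3 | 1 | 1 | 32.9%
GPT-4.1 Mini | 50 | 50 | 50 | 50 | 50 | 50 | 8 | 8 | 8 | 4 | 32.8%
Mistral NeMO | 100 | 100 | 100 | 8 | 6 | 5 | 4 | 3 | 0 | 0 | 32.6%
Ministral 8B | 100 | 100 | 100 | 6 | 3 | 3 | 2 | 2 | 2 | 2 | 32.0%
Claude Opus 4.5 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 17 | 31.7%
Grok 4 Fast | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 30.8%
Gemini 3 Flash (Preview, Reasoning) | 50 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 20 | 29.5%
Z.AI GLM 4.5 | 100 | 50 | 33 | 33 | 25 | 20 | 11 | 8 | 5 | 2 | 28.8%
Claude Haiku 4.5 | 50 | 33 | 25 | 25 | 25 | 25 | 25 | 20 | 20 | 20 | 26.8%
Cohere Command R+ (Aug. 2024) | 100 | 33 | 33 | 20 | 17 | 17 | 14 | 14 | 10 | 9 | 26.8%
Hermes 3 70B | 100 | 50 | 25 | 14 | 13 | 11 | 10 | 9 | 8 | 8 | 24.8%
DeepSeek V3 (2024-12-26) | 50 | 50 | 33 | 25 | 25 | 20 | 10 | 9 | 8 | 5 | 23.6%
Llama 3.1 70B | 100 | 33 | 33 | 17 | 13 | 10 | 8 | 7 | 6 | 6 | 23.4%
Claude 3 Haiku | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 20 | 13 | 23.3%
Claude 3.5 Sonnet | 33 | 33 | 25 | 20 | 20 | 20 | 20 | 20 | 17 | 14 | 22.3%
Gemini 2.5 Flash | 33 | 33 | 25 | 25 | 25 | 20 | 17 | 14 | 10 | 10 | 21.3%
GPT-4o, Aug. 6th (temp=1) | 50 | 50 | 20 | 17 | 17 | 14 | 14 | 11 | 10 | 7 | 21.0%
Claude Sonnet 4 | 33 | 25 | 20 | 20 | 20 | 20 | 17 | 17 | 17 | 17 | 20.5%
Claude Opus 4 | 25 | 25 | 25 | 20 | 20 | 17 | 17 | 17 | 11 | 9 | 18.5%
DeepSeek V3 (2025-03-24) | 25 | 25 | 20 | 20 | 17 | 14 | 14 | 13 | 13 | 8 | 16.9%
DeepSeek-V2 Chat | 50 | 33 | 17 | 13 | 13 | 11 | 11 | 9 | 5 | 4 | 16.6%
WizardLM 2 8x22b | 50 | 50 | 25 | 20 | 7 | 4 | 3 | 3 | 1 | 1 | 16.4%
GPT-4o, May 13th (temp=1) | 25 | 25 | 20 | 20 | 17 | 14 | 13 | 11 | 8 | 7 | 15.9%
Claude 3.7 Sonnet | 20 | 17 | 17 | 17 | 17 | 17 | 14 | 14 | 13 | 9 | 15.3%
Claude Sonnet 4.5 | 20 | 17 | 17 | 17 | 14 | 14 | 14 | 14 | 14 | 11 | 15.3%
Gemini 3 Flash (Preview) | 17 | 17 | 14 | 14 | 14 | 14 | 10 | 9 | 9 | 6 | 12.4%
GPT-4o Mini (temp=1) | 25 | 13 | 11 | 11 | 10 | 10 | 8 | 8 | 8 | 7 | 11.1%
GPT-4o, Aug. 6th (temp=0) | 14 | 14 | 13 | 13 | 11 | 11 | 9 | 8 | 8 | 7 | 10.9%
Qwen 3.5 Plus (2026-02-15) | 17 | 14 | 13 | 13 | 11 | 9 | 8 | 8 | 7 | 2 | 10.1%
DeepSeek V3.1 | 17 | 17 | 13 | 13 | 9 | 9 | 7 | 7 | 1 | 1 | 9.3%
Mistral Large 2 | 17 | 11 | 11 | 11 | 10 | 8 | 7 | 6 | 6 | 5 | 9.2%
Llama 3.1 Nemotron 70B | 13 | 11 | 11 | 10 | 10 | 9 | 9 | 7 | 7 | 6 | 9.2%
Gemma 3 27B | 11 | 11 | 11 | 9 | 9 | 9 | 8 | 8 | 7 | 5 | 8.9%
Gemma 3 12B | 25 | 9 | 8 | 8 | 8 | 7 | 6 | 6 | 5 | 4 | 8.6%
Qwen 2.5 72B | 20 | 14 | 13 | 8 | 7 | 6 | 6 | 5 | 5 | 2 | 8.6%
Mistral Large 3 | 14 | 11 | 10 | 10 | 8 | 7 | 7 | 7 | 6 | 4 | 8.3%
Llama 3.1 8B | 33 | 11 | 9 | 6 | 5 | 5 | 5 | 4 | 2 | 0 | 8.2%
Ministral 3B | 25 | 14 | 9 | 7 | 6 | 5 | 5 | 3 | 3 | 3 | 8.0%
Writer: Palmyra X5 | 33 | 10 | 8 | 5 | 4 | 4 | 3 | 3 | 3 | 3 | 7.7%
Ministral 3 3B | 33 | 8 | 6 | 4 | 4 | 3 | 3 | 2 | 1 | 0 | 6.4%
Mistral Large | 14 | 7 | 7 | 6 | 6 | 6 | 5 | 5 | 5 | 4 | 6.4%
DeepSeek V3.2 | 13 | 10 | 7 | 7 | 7 | 5 | 4 | 3 | 2 | 2 | 5.9%
GPT-4o, May 13th (temp=0) | 10 | 9 | 8 | 8 | 5 | 4 | 4 | 3 | 2 | 2 | 5.7%
Mistral Medium 3.1 | 7 | 7 | 6 | 6 | 6 | 6 | 6 | 5 | 5 | 3 | 5.6%
Mistral Small 3.2 24B | 8 | 7 | 6 | 6 | 6 | 5 | 5 | 5 | 4 | 2 | 5.5%
GPT-4o Mini (temp=0) | 7 | 7 | 6 | 6 | 6 | 6 | 6 | 5 | 3 | 1 | 5.2%
Mistral Small Creative | 5 | 4 | 4 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2.9%
Arcee AI: Trinity Large (Preview) | 13 | 3 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 0 | 2.5%
Gemma 3 4B | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2.3%
Ministral 3 14B | 3 | 3 | 3 | 3 | 2 | 1 | 1 | 1 | 1 | 0 | 1.8%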
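
Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 14 | 91.4%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 9 | 90.9%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 3 | 90.3%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 88.3%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 25 | 87.5%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 13 | 77.9%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 13 | 11 | 11 | 73.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 73.3%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 17 | 14 | 13 | 67.7%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 25 | 25 | 63.3%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 14 | 8 | 62.2%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 50 | 14 | 10 | 8 | 5 | 58.8%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 25 | 11 | 58.6%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 50 | 4 | 4 | 2 | 2 | 56.2%
Gemini 3 Flash (Preview, Reasoning) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 53.3%
Gemini 3 Pro (Preview) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 53.3%
Claude Haiku 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 48.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 33 | 33 | 33 | 33 | 33 | 25 | 17 | 14 | 14 | 33.7%
Hermes 3 405B | 100 | 50 | 33 | 25 | 25 | 25 | 20 | 20 | 20 | 17 | 33.5%
Claude 3.5 Sonnet | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 32.5%
Claude Sonnet 4 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 20 | 32.0%
Claude Opus 4 | 50 | 50 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 20 | 32.0%
WizardLM 2 8x22b | 100 | 50 | 50 | 50 | 17 | 13 | 11 | 10 | 10 | 8 | 31.8%
DeepSeek-V2 Chat | 50 | 50 | 33 | 33 | 25 | 25 | 20 | 20 | 20 | 11 | 28.8%
Qwen 2.5 72B | 100 | 50 | 33 | 25 | 20 | 13 | 13 | 13 | 13 | 8 | 28.6%
DeepSeek V3.1 | 100 | 50 | 25 | 25 | 17 | 17 | 14 | 11 | 10 | 8 | 27.6%
Cohere Command R+ (Aug. 2024) | 100 | 33 | 33 | 20 | 20 | 20 | 17 | 14 | 11 | 2 | 27.1%
DeepSeek V3 (2024-12-26) | 50 | 33 | 25 | 25 | 25 | 25 | 25 | 25 | 17 | 11 | 26.1%
Mistral NeMO | 100 | 100 | 11 | 8 | 8 | 7 | 7 | 7 | 7 | 2 | 25.6%
Claude Opus 4.5 | 33 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 20 | 20 | 25.5%
GPT-4o Mini (temp=1) | 100 | 25 | 25 | 25 | 17 | 14 | 14 | 13 | 13 | 8 | 25.4%
Claude 3.7 Sonnet | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 20 | 20 | 20 | 25.2%
Rocinante 12B | 100 | 33 | 20 | 17 | 17 | 14 | 13 | 11 | 8 | 6 | 23.9%
Llama 3.1 70B | 50 | 33 | 33 | 20 | 20 | 20 | 20 | 13 | 9 | 6 | 22.4%
Llama 3.1 Nemotron 70B | 100 | 25 | 20 | 14 | 13 | 13 | 10 | 10 | 7 | 6 | 21.7%
Gemini 3 Flash (Preview) | 25 | 25 | 25 | 20 | 20 | 20 | 17 | 17 | 17 | 17 | 20.2%
DeepSeek V3 (2025-03-24) | 33 | 20 | 20 | 20 | 20 | 20 | 20 | 17 | 17 | 14 | 20.1%
GPT-4o, May 13th (temp=1) | 50 | 33 | 20 | 17 | 17 | 14 | 14 | 14 | 11 | 8 | 19.9%
Claude Opus 4.6 | 33 | 33 | 25 | 20 | 14 | 14 | 14 | 13 | 11 | 11 | 18.9%
Gemini 2.5 Flash | 25 | 25 | 25 | 20 | 20 | 20 | 17 | 14 | 13 | 10 | 18.8%
Gemma 3 12B | 33 | 17 | 17 | 17 | 17 | 17 | 14 | 14 | 14 | 14 | 17.4%
DeepSeek V3.2 | 33 | 33 | 33 | 25 | 9 | 9 | 9 | 8 | 8 | 5 | 17.4%
Claude Sonnet 4.5 | 20 | 20 | 20 | 17 | 17 | 17 | 17 | 17 | 14 | 14 | 17.2%
Qwen 3.5 Plus (2026-02-15) | 50 | 20 | 20 | 17 | 13 | 13 | 13 | 13 | 10 | 4 | 17.1%
Gemma 3 27B | 20 | 20 | 20 | 20 | 17 | 17 | 17 | 17 | 14 | 10 | 17.1%
GPT-4o, Aug. 6th (temp=0) | 20 | 20 | 17 | 17 | 17 | 17 | 14 | 14 | 13 | 13 | 16.0%
GPT-4o Mini (temp=0) | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 13 | 13 | 15.8%
Claude 3 Haiku | 20 | 20 | 17 | 17 | 17 | 14 | 14 | 14 | 13 | 11 | 15.6%
Ministral 3B | 33 | 25 | 13 | 13 | 11 | 11 | 10 | 8 | 8 | 6 | 13.8%
Writer: Palmyra X5 | 33 | 20 | 14 | 14 | 13 | 10 | 9 | 9 | 8 | 4 | 13.5%
GPT-4.1 Mini | 20 | 20 | 20 | 17 | 13 | 9 | 9 | 9 | 8 | 6 | 13.0%
Mistral Large 3 | 11 | 11 | 11 | 11 | 11 | 10 | 10 | 10 | 7 | 7 | 9.9%
Ministral 3 3B | 25 | 14 | 10 | 9 | 8 | 8 | 8 | 7 | 4 | 1 | 9.5%
Llama 3.1 8B | 14 | 11 | 11 | 10 | 10 | 8 | 8 | 6 | 6 | 5 | 9.1%
GPT-4o, May 13th (temp=0) | 17 | 17 | 14 | 14 | 13 | 6 | 3 | 3 | 1 | 1 | 8.8%
Mistral Large | 11 | 11 | 9 | 9 | 8 | 8 | 8 | 8 | 6 | 6 | 8.5%
Mistral Large 2 | 11 | 9 | 9 | 9 | 8 | 7 | 7 | 7 | 6 | 5 | 7.9%
Arcee AI: Trinity Large (Preview) | 10 | 10 | 10 | 8 | 8 | 7 | 6 | 6 | 6 | 5 | 7.4%
Gemma 3 4B | 7 | 6 | 6 | 6 | 6 | 5 | 5 | 5 | 4 | 4 | 5.4%
Mistral Small Creative | 8 | 6 | 6 | 5 | 4 | 4 | 4 | 3 | 2 | 2 | 4.4%
Ministral 3 14B | 6 | 6 | 6 | 5 | 5 | 4 | 3 | 3 | 3 | 3 | 4.3%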
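
Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 3 | 90.3%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 9 | 85.9%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 20 | 80.3%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 20 | 80.3%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 75.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 25 | 20 | 10 | 70.5%
GPT-5 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 68.3%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 68.3%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 33 | 25 | 64.2%
Grok 4 Fast | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 63.3%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 6 | 5 | 4 | 1 | 61.6%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 33 | 25 | 20 | 17 | 14 | 60.9%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 50 | 25 | 8 | 5 | 0 | 58.9%
Claude Opus 4.5 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0%
Claude Haiku 4.5 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 53.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 50 | 50 | 33 | 25 | 25 | 20 | 17 | 52.0%
WizardLM 2 8x22b | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 51.7%
Claude Opus 4 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 50 | 50 | 50 | 25 | 8 | 7 | 5 | 49.4%
Claude Sonnet 4.5 | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 48.3%
Writer: Palmyra X5 | 100 | 100 | 50 | 50 | 50 | 33 | 33 | 33 | 25 | 7 | 48.2%
Claude 3.7 Sonnet | 100 | 100 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 46.7%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 50 | 3 | 2 | 2 | 2 | 2 | 46.0%
DeepSeek V3 (2024-12-26) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 25 | 14 | 11 | 40.0%
GPT-4o, Aug. 6th (temp=0) | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 17 | 33.3%
Rocinante 12B | 100 | 50 | 50 | 33 | 25 | 20 | 20 | 17 | 14 | 1 | 33.0%
Gemini 3 Flash (Preview) | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 32.5%
DeepSeek V3.2 | 50 | 50 | 50 | 33 | 33 | 33 | 17 | 17 | 17 | 9 | 30.9%
DeepSeek V3 (2025-03-24) | 50 | 50 | 33 | 25 | 25 | 25 | 25 | 20 | 14 | 14 | 28.2%
DeepSeek V3.1 | 50 | 50 | 33 | 33 | 33 | 25 | 25 | 13 | 9 | 9 | 28.1%
Gemini 2.5 Flash | 50 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 17 | 7 | 25.5%
Qwen 2.5 72B | 100 | 33 | 25 | 25 | 20 | 11 | 11 | 10 | 8 | 8 | 25.2%
GPT-4o, May 13th (temp=1) | 33 | 33 | 33 | 25 | 25 | 20 | 20 | 20 | 20 | 11 | 24.1%
DeepSeek-V2 Chat | 50 | 33 | 25 | 25 | 20 | 20 | 17 | 13 | 13 | 10 | 22.5%
Llama 3.1 70B | 100 | 20 | 17 | 14 | 13 | 13 | 11 | 11 | 10 | 7 | 21.5%
Claude Sonnet 4.6 | 33 | 25 | 25 | 25 | 20 | 20 | 17 | 17 | 17 | 14 | 21.3%
Mistral NeMO | 100 | 33 | 13 | 10 | 10 | 10 | 7 | 6 | 4 | 3 | 19.5%
GPT-4o Mini (temp=1) | 25 | 25 | 20 | 20 | 17 | 17 | 17 | 11 | 9 | 7 | 16.7%
Qwen 3.5 Plus (2026-02-15) | 33 | 33 | 20 | 20 | 17 | 17 | 10 | 8 | 7 | 2 | 16.7%
Claude 3 Haiku | 20 | 17 | 17 | 14 | 14 | 14 | 14 | 14 | 11 | 9 | 14.5%
Mistral Medium 3.1 | 25 | 17 | 17 | 11 | 11 | 10 | 10 | 9 | 9 | 9 | 12.8%
GPT-4.1 Mini | 17 | 14 | 14 | 13 | 13 | 13 | 11 | 10 | 10 | 9 | 12.3%
Mistral Small 3.2 24B | 14 | 14 | 14 | 13 | 13 | 13 | 11 | 11 | 10 | 8 | 12.1%
GPT-4o, May 13th (temp=0) | 25 | 25 | 20 | 20 | 20 | 4 | 2 | 1 | 1 | 1 | 12.0%
GPT-4o Mini (temp=0) | 14 | 14 | 14 | 14 | 14 | 13 | 13 | 11 | 10 | 2 | 11.9%
Mistral Large 3 | 14 | 14 | 14 | 14 | 10 | 10 | 10 | 9 | 8 | 8 | 11.2%
Mistral Large | 14 | 14 | 14 | 14 | 11 | 11 | 8 | 8 | 8 | 8 | 11.1%
Gemini 2.5 Flash Lite | 20 | 20 | 10 | 10 | 10 | 8 | 8 | 8 | 7 | 7 | 10.9%
Gemma 3 12B | 14 | 14 | 13 | 11 | 11 | 11 | 10 | 10 | 8 | 4 | 10.6%
Gemma 3 27B | 11 | 11 | 11 | 11 | 10 | 10 | 10 | 10 | 10 | 9 | 10.4%
Mistral Small Creative | 33 | 9 | 8 | 8 | 7 | 7 | 7 | 7 | 7 | 6 | 10.0%
Llama 3.1 Nemotron 70B | 17 | 11 | 11 | 10 | 10 | 9 | 9 | 8 | 7 | 7 | 9.9%
Mistral Large 2 | 17 | 17 | 11 | 9 | 8 | 8 | 8 | 7 | 7 | 7 | 9.8%
Ministral 3 3B | 14 | 14 | 14 | 11 | 9 | 8 | 7 | 7 | 6 | 6 | 9.6%
Ministral 3B | 14 | 11 | 10 | 10 | 9 | 8 | 7 | 7 | 7 | 5 | 8.8%
Llama 3.1 8B | 17 | 14 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 7.4%
Gemma 3 4B | 10 | 9 | 9 | 6 | 6 | 5 | 3 | 3 | 2 | 2 | 5.4%
Arcee AI: Trinity Large (Preview) | 8 | 6 | 6 | 6 | 6 | 5 | 5 | 5 | 4 | 3 | 5.4%
Ministral 3 14B | 6 | 6 | 6 | 5 | 5 | 5 | 4 | 3 | 3 | 1 | 4.3%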
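
Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 5 | 90.5%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 83.3%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 25 | 82.5%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 11 | 0 | 81.1%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 7 | 80.7%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 73.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 73.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 20 | 17 | 17 | 70.3%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 68.3%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 25 | 67.5%
Z.AI GLM 4.7 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 65.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 14 | 13 | 8 | 7 | 64.2%
GPT-5 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0%
Claude Haiku 4.5 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0%
DeepSeek-V2 Chat | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 58.3%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 33 | 20 | 9 | 9 | 8 | 57.9%
Hermes 3 405B | 100 | 100 | 100 | 100 | 33 | 33 | 33 | 33 | 25 | 17 | 57.5%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 17 | 14 | 8 | 55.6%
Claude Sonnet 4.5 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 55.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 13 | 8 | 6 | 4 | 1 | 53.2%
Claude 3.5 Sonnet | 100 | 100 | 100 | 50 | 33 | 33 | 25 | 25 | 25 | 20 | 51.2%
MoonshotAI: Kimi K2.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Claude Opus 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
DeepSeek V3 (2024-12-26) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 7 | 49.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 20 | 17 | 14 | 11 | 11 | 9 | 48.2%
GPT-4.1 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 47.5%
Hermes 3 70B | 100 | 100 | 100 | 50 | 33 | 25 | 25 | 17 | 14 | 9 | 47.3%
Gemini 2.5 Flash | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 45.0%
DeepSeek V3 (2025-03-24) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 45.0%
GPT-4o, Aug. 6th (temp=0) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 25 | 44.2%
Rocinante 12B | 100 | 100 | 100 | 50 | 20 | 20 | 20 | 11 | 10 | 8 | 43.9%
Claude Opus 4 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 36.7%
Writer: Palmyra X5 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 25 | 8 | 6 | 35.5%
WizardLM 2 8x22b | 50 | 50 | 50 | 50 | 50 | 33 | 25 | 17 | 17 | 11 | 35.3%
Gemini 3 Flash (Preview) | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 33.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 20 | 20 | 32.2%
Claude 3.7 Sonnet | 50 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 30.8%
DeepSeek V3.1 | 50 | 50 | 50 | 50 | 33 | 20 | 17 | 14 | 11 | 7 | 30.3%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 50 | 50 | 20 | 20 | 20 | 17 | 14 | 9 | 30.0%
GPT-4o, May 13th (temp=0) | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 20 | 28.7%
GPT-4o, May 13th (temp=1) | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 20 | 17 | 11 | 26.4%
DeepSeek V3.2 | 50 | 50 | 50 | 33 | 33 | 14 | 8 | 8 | 7 | 6 | 26.0%
GPT-4o Mini (temp=1) | 100 | 25 | 25 | 20 | 17 | 17 | 14 | 14 | 14 | 13 | 25.9%
Mistral Medium 3.1 | 50 | 33 | 33 | 33 | 33 | 20 | 14 | 13 | 10 | 9 | 24.9%
Claude Sonnet 4.6 | 33 | 25 | 25 | 25 | 25 | 20 | 20 | 20 | 20 | 17 | 23.0%
Gemini 2.5 Flash Lite | 33 | 33 | 25 | 20 | 20 | 20 | 11 | 11 | 8 | 7 | 18.9%
Claude 3 Haiku | 25 | 25 | 25 | 20 | 20 | 17 | 17 | 14 | 14 | 11 | 18.8%
GPT-4o Mini (temp=0) | 20 | 20 | 20 | 20 | 20 | 20 | 17 | 17 | 17 | 13 | 18.3%
Arcee AI: Trinity Large (Preview) | 100 | 20 | 11 | 8 | 8 | 8 | 7 | 6 | 6 | 0 | 17.4%
Gemma 3 27B | 25 | 20 | 20 | 17 | 17 | 17 | 14 | 13 | 13 | 11 | 16.5%
Ministral 3 14B | 50 | 50 | 10 | 9 | 9 | 8 | 8 | 8 | 7 | 6 | 16.5%
Mistral Large 2 | 20 | 20 | 20 | 17 | 17 | 14 | 13 | 13 | 13 | 9 | 15.4%
Mistral Large | 20 | 20 | 17 | 17 | 14 | 14 | 13 | 11 | 11 | 9 | 14.6%
Mistral Large 3 | 14 | 14 | 14 | 13 | 13 | 13 | 13 | 13 | 13 | 9 | 12.7%
Ministral 3B | 50 | 17 | 10 | 9 | 8 | 7 | 7 | 7 | 6 | 4 | 12.4%
Mistral Small 3.2 24B | 20 | 20 | 14 | 11 | 11 | 11 | 10 | 8 | 8 | 8 | 12.1%
Gemma 3 4B | 14 | 14 | 14 | 13 | 11 | 11 | 11 | 11 | 11 | 8 | 11.9%
Mistral NeMO | 17 | 14 | 14 | 14 | 13 | 11 | 10 | 10 | 8 | 6 | 11.7%
Gemma 3 12B | 14 | 11 | 11 | 10 | 10 | 10 | 10 | 10 | 8 | 7 | 10.2%
Llama 3.1 8B | 14 | 11 | 10 | 10 | 10 | 9 | 8 | 8 | 7 | 5 | 9.2%
Mistral Small Creative | 14 | 11 | 9 | 9 | 8 | 8 | 8 | 7 | 6 | 5 | 8.5%
Ministral 3 3B | 13 | 11 | 8 | 8 | 8 | 8 | 8 | 6 | 5 | 4 | 7.9%
Ministral 8B | 6 | 5 | 5 | 5 | 5 | 4 | 4 | 4 | 4 | 1 | 4.4%
Ministral 3 8B | 6 | 6 | 5 | 4 | 4 | 4 | 3 | 1 | 0 | 0 | 3.3%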

detailed entries

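Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 20 | 92.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 5 | 90.5%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 88.3%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 4 | 85.4%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 20 | 80.3%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 25 | 77.5%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 33 | 25 | 72.5%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 11 | 8 | 70.3%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 1 | 68.4%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 25 | 65.8%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 25 | 25 | 63.3%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 50 | 25 | 17 | 13 | 5 | 60.9%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 33 | 25 | 25 | 6 | 2 | 59.2%
Claude Sonnet 4.6 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 20 | 14 | 13 | 49.7%
Claude Haiku 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 48.3%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 50 | 7 | 3 | 2 | 2 | 1 | 46.5%
Z.AI GLM 4.7 Flash | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 25 | 45.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 50 | 50 | 33 | 33 | 33 | 20 | 20 | 17 | 45.7%
Claude Sonnet 4.5 | 100 | 100 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 17 | 42.5%
Gemini 2.5 Flash | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 40.0%
Rocinante 12B | 100 | 100 | 50 | 14 | 13 | 13 | 13 | 11 | 10 | 8 | 33.1%
Mistral Small 3.2 24B | 100 | 100 | 100 | 4 | 4 | 4 | 3 | 3 | 2 | 0 | 31.9%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 7 | 5 | 3 | 2 | 1 | 1 | 0 | 31.8%
WizardLM 2 8x22b | 100 | 100 | 33 | 25 | 20 | 14 | 7 | 6 | 6 | 3 | 31.4%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 31.2%
Gemini 2.5 Pro | 50 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 29.8%
Claude Opus 4.5 | 50 | 50 | 33 | 33 | 33 | 20 | 20 | 17 | 17 | 17 | 29.0%
GPT-4o, May 13th (temp=1) | 50 | 50 | 33 | 33 | 25 | 20 | 17 | 17 | 14 | 13 | 27.2%
Claude 3.7 Sonnet | 100 | 25 | 20 | 20 | 20 | 20 | 20 | 17 | 14 | 14 | 27.0%
Claude Opus 4 | 50 | 33 | 25 | 25 | 25 | 25 | 20 | 20 | 20 | 14 | 25.8%
Claude Sonnet 4 | 33 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 20 | 25.3%
Hermes 3 405B | 33 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 14 | 13 | 24.2%
DeepSeek-V2 Chat | 50 | 50 | 50 | 20 | 14 | 14 | 11 | 10 | 8 | 1 | 22.9%
Hermes 3 70B | 100 | 20 | 17 | 17 | 14 | 11 | 10 | 10 | 9 | 8 | 21.6%
Gemini 3 Flash (Preview) | 25 | 25 | 25 | 25 | 25 | 20 | 20 | 20 | 11 | 10 | 20.6%
Claude 3 Haiku | 33 | 25 | 25 | 25 | 25 | 17 | 13 | 13 | 11 | 11 | 19.7%
GPT-4o, Aug. 6th (temp=0) | 33 | 33 | 33 | 20 | 14 | 13 | 11 | 9 | 6 | 4 | 17.7%
DeepSeek V3 (2024-12-26) | 50 | 33 | 20 | 14 | 13 | 13 | 10 | 10 | 8 | 7 | 17.7%
GPT-4o Mini (temp=1) | 20 | 20 | 20 | 20 | 17 | 17 | 17 | 14 | 13 | 7 | 16.4%
Claude 3.5 Sonnet | 20 | 17 | 17 | 17 | 17 | 17 | 14 | 14 | 14 | 14 | 16.0%
DeepSeek V3.2 | 50 | 50 | 20 | 14 | 6 | 3 | 3 | 2 | 1 | 1 | 15.0%
Mistral Medium 3.1 | 33 | 25 | 17 | 17 | 13 | 13 | 9 | 8 | 8 | 3 | 14.4%
Llama 3.1 70B | 33 | 20 | 20 | 17 | 13 | 10 | 10 | 9 | 6 | 6 | 14.3%
Qwen 3.5 Plus (2026-02-15) | 25 | 25 | 20 | 20 | 17 | 14 | 9 | 6 | 3 | 2 | 14.2%
Writer: Palmyra X5 | 33 | 33 | 25 | 11 | 10 | 7 | 5 | 4 | 3 | 2 | 13.4%
GPT-4o Mini (temp=0) | 17 | 17 | 14 | 14 | 14 | 11 | 11 | 9 | 8 | 8 | 12.3%
Gemma 3 12B | 17 | 17 | 17 | 13 | 11 | 10 | 8 | 8 | 8 | 3 | 11.0%
Llama 3.1 Nemotron 70B | 17 | 13 | 13 | 11 | 11 | 11 | 11 | 8 | 6 | 6 | 10.6%
DeepSeek V3 (2025-03-24) | 17 | 17 | 13 | 10 | 10 | 9 | 9 | 9 | 3 | 1 | 9.7%
GPT-4.1 Mini | 50 | 9 | 6 | 6 | 6 | 5 | 5 | 3 | 3 | 3 | 9.6%
Llama 3.1 8B | 17 | 17 | 17 | 13 | 13 | 7 | 5 | 4 | 4 | 1 | 9.6%
Mistral Large 3 | 17 | 11 | 11 | 10 | 8 | 7 | 6 | 6 | 4 | 3 | 8.3%
Mistral Large 2 | 17 | 11 | 10 | 8 | 8 | 7 | 7 | 6 | 5 | 3 | 8.2%
Mistral Large | 11 | 10 | 9 | 9 | 8 | 7 | 5 | 4 | 4 | 3 | 7.0%
Gemma 3 27B | 14 | 8 | 8 | 7 | 6 | 5 | 5 | 5 | 4 | 4 | 6.6%
Mistral NeMO | 20 | 14 | 10 | 8 | 7 | 2 | 2 | 2 | 0 | 0 | 6.5%
GPT-4o, May 13th (temp=0) | 9 | 9 | 7 | 7 | 7 | 6 | 6 | 5 | 5 | 4 | 6.5%
Mistral Small Creative | 50 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 6.5%
DeepSeek V3.1 | 7 | 6 | 5 | 4 | 4 | 3 | 2 | 2 | 1 | 1 | 3.4%
Ministral 3B | 5 | 5 | 4 | 3 | 2 | 2 | 1 | 1 | 1 | 0 | 2.3%
Gemma 3 4B | 4 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 2.1%
Ministral 3 3B | 3 | 3 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1.1%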
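
Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 17 | 86.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 6 | 80.6%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 0 | 73.3%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 73.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 25 | 25 | 20 | 70.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 17 | 17 | 1 | 68.4%
ByteDance Seed 1.6 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 65.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 25 | 20 | 20 | 63.2%
Gemini 2.5 Pro | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0%
Grok 4 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 60.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 25 | 59.2%
MoonshotAI: Kimi K2.5 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 53.3%
Hermes 3 70B | 100 | 100 | 100 | 50 | 50 | 50 | 25 | 17 | 17 | 14 | 52.3%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 33 | 17 | 17 | 17 | 11 | 9 | 50.4%
Qwen 3.5 397B A17B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 20 | 17 | 8 | 7 | 7 | 7 | 46.5%
Arcee AI: Trinity Mini | 100 | 100 | 50 | 50 | 50 | 33 | 25 | 20 | 4 | 0 | 43.2%
Claude Haiku 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 25 | 20 | 42.8%
Hermes 3 405B | 100 | 100 | 33 | 25 | 25 | 25 | 25 | 20 | 20 | 20 | 39.3%
Llama 3.1 70B | 100 | 100 | 50 | 33 | 25 | 20 | 17 | 13 | 13 | 13 | 38.2%
Claude Opus 4.6 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 25 | 25 | 20 | 37.0%
Claude Sonnet 4 | 100 | 50 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 20 | 36.2%
Qwen 2.5 72B | 100 | 100 | 25 | 20 | 20 | 20 | 17 | 14 | 13 | 11 | 34.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 33 | 20 | 20 | 14 | 13 | 11 | 11 | 5 | 32.7%
Ministral 8B | 100 | 100 | 100 | 4 | 3 | 3 | 3 | 3 | 2 | 1 | 31.9%
Claude 3.7 Sonnet | 50 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 30.8%
DeepSeek V3 (2024-12-26) | 50 | 50 | 50 | 33 | 25 | 25 | 17 | 17 | 17 | 17 | 30.0%
Claude Opus 4 | 50 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 14 | 28.9%
Gemini 2.5 Flash Lite | 100 | 100 | 17 | 14 | 14 | 10 | 9 | 8 | 7 | 2 | 28.1%
Claude Opus 4.5 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 20 | 27.8%
Claude 3.5 Sonnet | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 25 | 20 | 27.0%
DeepSeek V3 (2025-03-24) | 50 | 33 | 25 | 25 | 25 | 25 | 25 | 20 | 17 | 17 | 26.2%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 50 | 17 | 14 | 11 | 11 | 11 | 9 | 7 | 23.0%
Ministral 3 8B | 100 | 100 | 3 | 3 | 3 | 2 | 2 | 2 | 1 | 0 | 21.6%
GPT-4o, May 13th (temp=1) | 50 | 25 | 25 | 17 | 17 | 17 | 17 | 17 | 17 | 14 | 21.4%
Gemini 3 Flash (Preview) | 25 | 25 | 25 | 20 | 20 | 20 | 17 | 17 | 17 | 14 | 19.9%
DeepSeek V3.1 | 50 | 50 | 20 | 20 | 14 | 14 | 10 | 10 | 8 | 2 | 19.8%
GPT-4.1 Mini | 50 | 50 | 17 | 17 | 17 | 14 | 11 | 8 | 7 | 3 | 19.4%
DeepSeek-V2 Chat | 25 | 25 | 25 | 20 | 20 | 20 | 17 | 17 | 13 | 11 | 19.2%
Mistral Small 3.2 24B | 100 | 14 | 13 | 11 | 11 | 11 | 9 | 8 | 8 | 5 | 19.1%
Claude 3 Haiku | 25 | 25 | 25 | 25 | 20 | 17 | 17 | 17 | 11 | 10 | 19.1%
Gemma 3 12B | 33 | 25 | 25 | 20 | 20 | 14 | 13 | 13 | 13 | 11 | 18.6%
Gemini 2.5 Flash | 20 | 20 | 17 | 17 | 17 | 17 | 17 | 17 | 14 | 10 | 16.4%
Llama 3.1 8B | 33 | 33 | 20 | 17 | 11 | 11 | 10 | 10 | 10 | 6 | 16.2%
GPT-4o, Aug. 6th (temp=0) | 20 | 20 | 20 | 20 | 17 | 14 | 14 | 11 | 10 | 9 | 15.5%
Claude Sonnet 4.5 | 17 | 17 | 17 | 17 | 17 | 14 | 14 | 14 | 14 | 13 | 15.3%
DeepSeek V3.2 | 25 | 25 | 20 | 20 | 14 | 14 | 13 | 10 | 6 | 3 | 15.0%
Gemma 3 27B | 20 | 17 | 17 | 17 | 14 | 14 | 14 | 14 | 13 | 9 | 14.9%
WizardLM 2 8x22b | 50 | 25 | 13 | 10 | 8 | 8 | 7 | 7 | 7 | 6 | 14.1%
GPT-4o, May 13th (temp=0) | 17 | 14 | 14 | 14 | 13 | 13 | 13 | 13 | 10 | 9 | 12.9%
Rocinante 12B | 25 | 20 | 14 | 14 | 13 | 10 | 8 | 7 | 7 | 5 | 12.4%
Mistral NeMO | 14 | 14 | 13 | 10 | 8 | 8 | 8 | 7 | 6 | 5 | 9.4%
Mistral Large | 10 | 8 | 8 | 8 | 8 | 8 | 7 | 6 | 6 | 6 | 7.4%
Mistral Large 2 | 8 | 8 | 8 | 8 | 8 | 7 | 7 | 5 | 5 | 5 | 7.0%
Mistral Large 3 | 8 | 8 | 8 | 8 | 7 | 7 | 6 | 6 | 5 | 5 | 6.9%
Writer: Palmyra X5 | 10 | 9 | 9 | 8 | 7 | 6 | 5 | 5 | 4 | 3 | 6.7%
Gemma 3 4B | 8 | 7 | 6 | 6 | 6 | 5 | 5 | 5 | 5 | 4 | 5.7%
Ministral 3B | 14 | 11 | 8 | 6 | 4 | 3 | 3 | 3 | 2 | 2 | 5.6%
Ministral 3 14B | 8 | 6 | 6 | 5 | 4 | 3 | 3 | 3 | 2 | 1 | 4.0%
Mistral Small Creative | 5 | 5 | 5 | 4 | 4 | 4 | 3 | 3 | 3 | 3 | 3.9%
Ministral 3 3B | 7 | 5 | 4 | 3 | 2 | 1 | 1 | 1 | 0 | 0 | 2.5%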
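
Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg ▼
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 83.3%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 1 | 76.8%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 68.3%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 65.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 33 | 33 | 65.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 58.3%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 11 | 11 | 8 | 58.1%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 9 | 8 | 7 | 54.1%
DeepSeek V3 (2025-03-24) | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 25 | 17 | 50.8%
Gemini 2.5 Flash | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
WizardLM 2 8x22b | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 6 | 1 | 47.4%
Claude 3.5 Sonnet | 100 | 100 | 50 | 50 | 33 | 33 | 33 | 25 | 25 | 20 | 47.0%
Claude Sonnet 4 | 100 | 100 | 100 | 33 | 25 | 25 | 20 | 20 | 20 | 20 | 46.3%
GPT-4o, Aug. 6th (temp=1) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 25 | 45.8%
GPT-4o, Aug. 6th (temp=0) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 45.0%
Claude Opus 4 | 100 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 25 | 44.2%
Claude Haiku 4.5 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 25 | 40.8%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 25 | 25 | 25 | 7 | 5 | 5 | 4 | 39.5%
Qwen 2.5 72B | 100 | 100 | 50 | 33 | 33 | 25 | 14 | 11 | 7 | 6 | 38.0%
DeepSeek V3 (2024-12-26) | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 25 | 20 | 17 | 37.8%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 17 | 14 | 13 | 36.0%
GPT-4o, May 13th (temp=0) | 50 | 50 | 50 | 50 | 50 | 25 | 20 | 20 | 17 | 14 | 34.6%
Claude Sonnet 4.5 | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 34.2%
DeepSeek V3.1 | 50 | 50 | 50 | 50 | 25 | 25 | 25 | 25 | 20 | 8 | 32.8%
DeepSeek-V2 Chat | 50 | 50 | 50 | 33 | 33 | 33 | 25 | 25 | 17 | 9 | 32.6%
Writer: Palmyra X5 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 20 | 6 | 4 | 31.3%
Mistral Small 3.2 24B | 100 | 100 | 25 | 25 | 11 | 11 | 10 | 10 | 10 | 10 | 31.2%
Hermes 3 405B | 100 | 33 | 25 | 25 | 25 | 25 | 20 | 20 | 14 | 11 | 29.9%
DeepSeek V3.2 | 50 | 50 | 33 | 33 | 33 | 33 | 20 | 20 | 17 | 7 | 29.7%
Gemini 3 Flash (Preview) | 50 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 20 | 20 | 29.0%
GPT-4o, May 13th (temp=1) | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 20 | 8 | 28.7%
Llama 3.1 70B | 100 | 50 | 33 | 17 | 14 | 13 | 10 | 9 | 8 | 8 | 26.1%
GPT-4.1 Nano | 100 | 100 | 33 | 5 | 5 | 4 | 2 | 2 | 2 | 1 | 25.4%
GPT-4o Mini (temp=0) | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 20 | 24.5%
Hermes 3 70B | 50 | 25 | 25 | 25 | 20 | 17 | 17 | 17 | 14 | 14 | 22.4%
GPT-4o Mini (temp=1) | 25 | 25 | 20 | 20 | 20 | 20 | 20 | 14 | 11 | 8 | 18.3%
Claude 3 Haiku | 33 | 20 | 20 | 17 | 17 | 14 | 14 | 14 | 14 | 11 | 17.5%
Rocinante 12B | 33 | 25 | 20 | 17 | 17 | 13 | 13 | 11 | 11 | 8 | 16.7%
Ministral 3 14B | 100 | 7 | 7 | 6 | 5 | 3 | 2 | 1 | 1 | 0 | 13.1%
Ministral 3B | 33 | 25 | 14 | 11 | 10 | 9 | 8 | 5 | 4 | 3 | 12.3%
Llama 3.1 Nemotron 70B | 33 | 11 | 11 | 10 | 10 | 9 | 9 | 9 | 9 | 8 | 12.0%
Llama 3.1 8B | 25 | 20 | 14 | 13 | 10 | 7 | 6 | 5 | 2 | 0 | 10.3%
Gemma 3 27B | 11 | 11 | 10 | 10 | 10 | 10 | 10 | 10 | 9 | 8 | 10.0%
Gemma 3 12B | 14 | 13 | 11 | 10 | 9 | 9 | 8 | 8 | 7 | 7 | 9.5%
Mistral Small Creative | 11 | 11 | 8 | 6 | 6 | 5 | 5 | 5 | 5 | 3 | 6.5%
Ministral 3 3B | 11 | 8 | 6 | 6 | 6 | 4 | 4 | 3 | 2 | 0 | 5.0%
Arcee AI: Trinity Large (Preview) | 9 | 8 | 7 | 5 | 4 | 4 | 4 | 2 | 2 | 2 | 4.7%
Gemma 3 4B | 5 | 5 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2.8%
Mistral NeMO | 7 | 6 | 4 | 3 | 3 | 2 | 1 | 0 | 0 | 0 | 2.5%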
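
Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg ▼
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 14 | 91.4%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 25 | 87.5%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 20 | 17 | 83.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 25 | 25 | 78.3%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0%
GPT-5 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 68.3%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 66.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 66.7%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 17 | 13 | 11 | 10 | 65.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 33 | 25 | 64.2%
Gemini 3.1 Pro (Preview) | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 33 | 33 | 33 | 33 | 33 | 33 | 60.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 14 | 58.1%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 25 | 8 | 8 | 7 | 6 | 55.4%
DeepSeek V3 (2025-03-24) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 25 | 20 | 52.8%
Claude Opus 4 | 100 | 100 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 51.7%
Rocinante 12B | 100 | 100 | 100 | 100 | 50 | 20 | 13 | 8 | 8 | 6 | 50.4%
Hermes 3 70B | 100 | 100 | 100 | 50 | 50 | 33 | 25 | 20 | 13 | 11 | 50.2%
DeepSeek V3 (2024-12-26) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
DeepSeek-V2 Chat | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 46.7%
DeepSeek V3.1 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 25 | 25 | 45.0%
Claude 3.7 Sonnet | 100 | 100 | 50 | 33 | 25 | 25 | 25 | 25 | 25 | 20 | 42.8%
Writer: Palmyra X5 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 20 | 20 | 20 | 36.0%
Claude Sonnet 4.5 | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 20 | 34.5%
Gemini 2.5 Flash | 50 | 50 | 50 | 50 | 50 | 33 | 25 | 25 | 6 | 5 | 34.3%
GPT-4.1 Nano | 100 | 100 | 100 | 9 | 7 | 7 | 6 | 5 | 4 | 4 | 34.1%
WizardLM 2 8x22b | 50 | 50 | 50 | 50 | 33 | 25 | 25 | 20 | 17 | 8 | 32.8%
Gemini 3 Flash (Preview) | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 20 | 20 | 32.3%
Mistral Small Creative | 100 | 100 | 33 | 20 | 17 | 14 | 8 | 8 | 8 | 7 | 31.5%
Qwen 2.5 72B | 100 | 50 | 50 | 50 | 11 | 11 | 10 | 9 | 8 | 7 | 30.6%
DeepSeek V3.2 | 50 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 20 | 17 | 30.3%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 33 | 33 | 25 | 25 | 20 | 20 | 20 | 17 | 29.3%
Claude 3 Haiku | 33 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 20 | 20 | 25.5%
GPT-4o, May 13th (temp=1) | 50 | 33 | 33 | 33 | 25 | 20 | 20 | 17 | 8 | 7 | 24.7%
GPT-4o, May 13th (temp=0) | 33 | 33 | 33 | 33 | 33 | 20 | 17 | 14 | 14 | 14 | 24.6%
Gemma 3 12B | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 17 | 14 | 11 | 23.4%
Llama 3.1 Nemotron 70B | 33 | 25 | 25 | 25 | 20 | 20 | 20 | 17 | 11 | 8 | 20.4%
Gemini 2.5 Flash Lite | 100 | 13 | 11 | 10 | 8 | 7 | 6 | 6 | 5 | 5 | 17.0%
Ministral 3B | 50 | 33 | 17 | 14 | 10 | 8 | 8 | 7 | 7 | 7 | 16.2%
Gemma 3 27B | 25 | 20 | 17 | 17 | 14 | 14 | 13 | 13 | 13 | 10 | 15.4%
GPT-4o Mini (temp=1) | 20 | 17 | 14 | 13 | 11 | 10 | 10 | 8 | 8 | 8 | 11.8%
Ministral 3 3B | 33 | 25 | 17 | 7 | 7 | 7 | 7 | 5 | 3 | 2 | 11.2%
Arcee AI: Trinity Large (Preview) | 50 | 20 | 8 | 7 | 5 | 4 | 4 | 4 | 2 | 2 | 10.5%
Llama 3.1 8B | 25 | 17 | 13 | 9 | 8 | 8 | 6 | 6 | 5 | 4 | 9.9%
Mistral NeMO | 17 | 13 | 11 | 10 | 10 | 9 | 8 | 8 | 7 | 6 | 9.9%
Gemma 3 4B | 11 | 11 | 10 | 10 | 9 | 9 | 9 | 8 | 7 | 7 | 9.2%
GPT-4o Mini (temp=0) | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 7.7%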