Correct "no violations" response

Test: Codex Red Herring (False Positive Detection)

Avg. Score: 46.1%
Scenarios: 8

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Grok 4.1 Fast | 96.3% | $0.0019 | 12.5s | 62%
2 | o4 Mini | 95.0% | $0.014 | 25.0s | 56%
3 | ByteDance Seed 1.6 Flash | 92.5% | $0.0008 | 9.1s | 47%
4 | o4 Mini High | 96.3% | $0.027 | 52.5s | 62%
5 | GPT-5.1 | 93.8% | $0.025 | 26.1s | 52%
6 | GPT-5 Mini | 91.3% | $0.0059 | 37.8s | 43%
7 | GPT-4.1 | 85.6% | $0.0048 | 1.2s | 38%
8 | Gemini 2.5 Flash Lite (Reasoning) | 88.8% | $0.0023 | 16.6s | 37%
9 | GPT-5 Nano | 90.0% | $0.0035 | 1.1m | 40%
10 | Ministral 8B | 74.4% | $0.0007 | 4.4s | 33%
11 | Z.AI GLM 5 | 91.3% | $0.017 | 1.4m | 43%
12 | Ministral 3 8B | 80.6% | $0.0013 | 17.9s | 30%
13 | Gemini 2.5 Flash (Reasoning) | 82.5% | $0.0085 | 14.2s | 24%
14 | Claude Sonnet 4.6 | 73.8% | $0.033 | 17.4s | 39%
15 | Claude Opus 4.6 | 81.9% | $0.049 | 13.2s | 38%
16 | GPT-4.1 Nano | 68.1% | $0.0005 | 4.6s | 24%
17 | ByteDance Seed 1.6 | 81.3% | $0.0043 | 32.7s | 22%
18 | GPT-5.2 | 81.3% | $0.013 | 14.5s | 22%
19 | Aion 2.0 | 83.8% | $0.0096 | 1.3m | 26%
20 | Ministral 3B | 43.1% | $0.0004 | 12.4s | 33%
21 | Minimax M2.5 | 73.1% | $0.0032 | 25.9s | 12%
22 | Claude Haiku 4.5 | 51.2% | $0.0080 | 4.2s | 24%
23 | GPT-4.1 Mini | 51.2% | $0.0018 | 6.9s | 18%
24 | Mistral Medium 3.1 | 53.1% | $0.0032 | 4.7s | 17%
25 | Mistral Small Creative | 37.5% | $0.0011 | 8.9s | 26%
26 | Ministral 3 3B | 37.5% | $0.0013 | 31.1s | 28%
27 | Z.AI GLM 4.5 | 61.3% | $0.0038 | 25.0s | 12%
28 | Arcee AI: Trinity Mini | 67.5% | $0.0009 | 26.3s | 6%
29 | Claude Opus 4.6 (Reasoning) | 92.5% | $0.120 | 1.0m | 47%
30 | Grok 4 Fast | 66.3% | $0.0019 | 19.2s | 5%
31 | Qwen 2.5 72B | 45.6% | $0.0009 | 11.3s | 15%
32 | Ministral 3 14B | 50.6% | $0.0019 | 29.2s | 15%
33 | Z.AI GLM 4.6 | 68.1% | $0.015 | 59.0s | 14%
34 | Llama 3.1 8B | 26.9% | $0.0003 | 33.8s | 25%
35 | Hermes 3 70B | 42.5% | $0.0018 | 12.1s | 9%
36 | Llama 3.1 Nemotron 70B | 35.0% | $0.0077 | 21.4s | 17%
37 | Arcee AI: Trinity Large (Preview) | 43.1% | $0.0000 | 1.7m | 22%
38 | GPT-4o Mini (temp=1) | 32.5% | $0.0008 | 5.6s | 7%
39 | Cohere Command R+ (Aug. 2024) | 50.0% | $0.017 | 7.4s | 0%
40 | Z.AI GLM 4.7 Flash | 70.0% | $0.0039 | 2.5m | 8%
41 | Mistral Large 3 | 36.9% | $0.0040 | 11.7s | 0%
42 | Claude Opus 4.5 | 40.0% | $0.036 | 4.4s | 9%
43 | GPT-4o, Aug. 6th (temp=1) | 32.5% | $0.0094 | 2.4s | 0%
44 | Hermes 3 405B | 30.6% | $0.0059 | 5.2s | 0%
45 | Gemini 2.5 Flash Lite | 25.0% | $0.0010 | 4.0s | 0%
46 | Mistral Small 3.2 24B | 26.3% | $0.0010 | 11.9s | 0%
47 | Mistral Large 2 | 35.0% | $0.016 | 11.4s | 0%
48 | Claude Sonnet 4 | 35.0% | $0.022 | 4.5s | 0%
49 | GPT-5 | 67.5% | $0.048 | 1.5m | 6%
50 | Mistral Large | 33.1% | $0.016 | 12.3s | 0%
51 | Claude 3 Haiku | 20.6% | $0.0020 | 3.5s | 0%
52 | Llama 3.1 70B | 25.0% | $0.0031 | 16.4s | 0%
53 | Rocinante 12B | 22.5% | $0.0015 | 13.3s | 0%
54 | Writer: Palmyra X5 | 22.5% | $0.0085 | 13.8s | 0%
55 | GPT-4o Mini (temp=0) | 16.9% | $0.0009 | 14.5s | 0%
56 | Gemini 3 Flash (Preview, Reasoning) | 44.4% | $0.028 | 49.2s | 0%
57 | WizardLM 2 8x22b | 26.9% | $0.0041 | 55.3s | 0%
58 | Claude Sonnet 4.5 | 20.6% | $0.024 | 5.9s | 0%
59 | DeepSeek-V2 Chat | 5.0% | $0.0023 | 8.3s | 0%
60 | DeepSeek V3 (2024-12-26) | 4.4% | $0.0023 | 6.6s | 0%
61 | Z.AI GLM 4.7 | 55.0% | $0.023 | 2.2m | 1%
62 | GPT-4o, Aug. 6th (temp=0) | 8.8% | $0.011 | 2.6s | 0%
63 | DeepSeek V3.1 | 12.5% | $0.0019 | 36.6s | 0%
64 | DeepSeek V3 (2025-03-24) | 5.6% | $0.0015 | 15.7s | 0%
65 | DeepSeek V3.2 | 13.8% | $0.0018 | 42.8s | 0%
66 | Claude 3.7 Sonnet | 15.6% | $0.022 | 4.3s | 0%
67 | Gemini 3 Flash (Preview) | 2.5% | $0.0034 | 3.2s | 0%
68 | Gemini 2.5 Flash | 0.6% | $0.0021 | 1.5s | 0%
69 | Gemma 3 12B | 1.3% | $0.0005 | 11.7s | 0%
70 | Gemma 3 27B | 0.0% | $0.0007 | 12.6s | 0%
71 | Gemma 3 4B | 0.0% | $0.0003 | 15.0s | 0%
72 | MoonshotAI: Kimi K2.5 | 52.5% | $0.017 | 2.6m | 0%
73 | GPT-4o, May 13th (temp=1) | 16.9% | $0.033 | 3.0s | 0%
74 | Grok 4 | 61.3% | $0.074 | 1.6m | 3%
75 | Claude 3.5 Sonnet | 18.8% | $0.045 | 6.9s | 0%
76 | Gemini 2.5 Pro | 46.3% | $0.073 | 52.1s | 0%
77 | Mistral NeMO | 8.8% | $0.0018 | 1.6m | 0%
78 | GPT-4o, May 13th (temp=0) | 1.9% | $0.043 | 6.4s | 0%
79 | Qwen 3.5 Plus (2026-02-15) | 18.8% | $0.021 | 2.0m | 0%
80 | Claude Sonnet 4.6 (Reasoning) | 76.3% | $0.145 | 2.2m | 15%
81 | Gemini 3.1 Pro (Preview) | 57.5% | $0.120 | 1.5m | 1%
82 | Claude Opus 4 | 18.1% | $0.109 | 7.1s | 0%
83 | Gemini 3 Pro (Preview) | 53.8% | $0.144 | 1.7m | 0%
84 | Qwen 3.5 397B A17B | 31.3% | $0.028 | 5.4m | 0%
Average score: 46.13%

Individual Scenarios

basic entries

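Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 60.0%
Ministral 8B | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 55.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
GPT-4.1 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 50.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 45.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Writer: Palmyra X5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 40.0%
Qwen 2.5 72B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 40.0%
Arcee AI: Trinity Large (Preview) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 40.0%
Mistral Small Creative | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 40.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Ministral 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 40.0%
Rocinante 12B | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 40.0%
Claude Opus 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Ministral 3 14B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Claude 3 Haiku | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
GPT-5 Mini | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Claude Sonnet 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
WizardLM 2 8x22b | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Ministral 3 3B | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Mistral NeMO | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Llama 3.1 8B | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Llama 3.1 Nemotron 70B | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Claude Sonnet 4 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Hermes 3 405B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, Aug. 6th (temp=1) | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o Mini (temp=1) | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Mistral Medium 3.1 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Claude Opus 4 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Z.AI GLM 4.5 | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Claude Haiku 4.5 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
GPT-4o, May 13th (temp=1) | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
DeepSeek V3.1 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Llama 3.1 70B | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Hermes 3 70B | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Gemini 3 Pro (Preview) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Z.AI GLM 4.7 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Grok 4 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
GPT-4.1 Mini | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
DeepSeek V3.2 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Cohere Command R+ (Aug. 2024) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Gemini 3 Flash (Preview, Reasoning) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Gemini 3 Flash (Preview) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Claude 3.5 Sonnet | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
DeepSeek V3 (2024-12-26) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Qwen 3.5 397B A17B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Pro | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4 Fast | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek-V2 Chat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.7 Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2025-03-24) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small 3.2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%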
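Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 70.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 65.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 55.0%
Claude Haiku 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Claude Sonnet 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
GPT-4.1 Mini | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Writer: Palmyra X5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Ministral 3 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Ministral 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Llama 3.1 Nemotron 70B | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 40.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Claude Opus 4.6 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Claude Sonnet 4 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
GPT-4o Mini (temp=1) | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 35.0%
Claude Opus 4 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
WizardLM 2 8x22b | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 30.0%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
DeepSeek V3.1 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
Qwen 2.5 72B | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
GPT-4o, May 13th (temp=1) | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V3.2 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Arcee AI: Trinity Large (Preview) | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Mistral Small Creative | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Mistral NeMO | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 8B | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Rocinante 12B | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V3 (2025-03-24) | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Ministral 3 14B | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Claude Opus 4.5 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Gemini 3 Pro (Preview) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Hermes 3 405B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Cohere Command R+ (Aug. 2024) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Gemini 3 Flash (Preview) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
DeepSeek-V2 Chat | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
DeepSeek V3 (2024-12-26) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Mistral Large 2 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Llama 3.1 70B | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Claude 3 Haiku | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Mistral Large 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.7 Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%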
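Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 75.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 75.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 65.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 60.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Claude Haiku 4.5 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Mistral Medium 3.1 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Arcee AI: Trinity Large (Preview) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Mistral Small Creative | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Ministral 3 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Ministral 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Claude Sonnet 4.6 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
GPT-4.1 Mini | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 45.0%
Qwen 2.5 72B | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 45.0%
GPT-5 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 40.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Ministral 3 14B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 40.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
WizardLM 2 8x22b | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 35.0%
Grok 4 Fast | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Writer: Palmyra X5 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Llama 3.1 Nemotron 70B | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Llama 3.1 8B | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Mistral Large 3 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Claude 3.7 Sonnet | 100 | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
Mistral Large 2 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Llama 3.1 70B | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Mistral NeMO | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V3.1 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
GPT-4o Mini (temp=0) | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Claude 3 Haiku | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Rocinante 12B | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Claude Opus 4.5 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Claude Sonnet 4.5 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
DeepSeek V3.2 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Mistral Large | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
GPT-4o Mini (temp=1) | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Claude Opus 4 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
GPT-4o, May 13th (temp=0) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
GPT-4o, May 13th (temp=1) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
DeepSeek V3 (2024-12-26) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
GPT-4o, Aug. 6th (temp=0) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Gemini 2.5 Flash | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Mistral Small 3.2 24B | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Gemma 3 12B | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek-V2 Chat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2025-03-24) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash Lite | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%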
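Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 75.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 70.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 65.0%
Claude Haiku 4.5 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 60.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 60.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Hermes 3 70B | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 50.0%
Ministral 3 14B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Claude Sonnet 4.6 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 45.0%
Ministral 3 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Ministral 8B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Ministral 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Mistral Large 3 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 40.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 40.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Mistral Medium 3.1 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 40.0%
Arcee AI: Trinity Large (Preview) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 40.0%
Mistral Small Creative | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 40.0%
GPT-4.1 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 35.0%
Ministral 3 8B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Rocinante 12B | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 35.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 30.0%
Mistral Large 2 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Mistral Large | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Claude Sonnet 4.5 | 100 | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
DeepSeek V3.2 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
GPT-4o Mini (temp=1) | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
GPT-5 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek-V2 Chat | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o Mini (temp=0) | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Claude 3 Haiku | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=1) | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
DeepSeek V3 (2024-12-26) | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
WizardLM 2 8x22b | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Llama 3.1 8B | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Writer: Palmyra X5 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Claude 3.7 Sonnet | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
DeepSeek V3.1 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
MoonshotAI: Kimi K2.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Opus 4.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Opus 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2025-03-24) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash Lite | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small 3.2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%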

detailed entries

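Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 70.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 65.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Claude Sonnet 4.6 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 60.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 0 | 55.0%
Claude Sonnet 4.5 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 50.0%
Claude Haiku 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Mistral Medium 3.1 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 50.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
GPT-4.1 Mini | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Mistral Small Creative | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 40.0%
Ministral 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Writer: Palmyra X5 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
WizardLM 2 8x22b | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Ministral 3 3B | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Llama 3.1 8B | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
GPT-4o, May 13th (temp=1) | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
GPT-4o, Aug. 6th (temp=0) | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
DeepSeek V3.1 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Claude Opus 4 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V3.2 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Claude Opus 4.5 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Claude 3.7 Sonnet | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Mistral Large | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
GPT-4o Mini (temp=1) | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Llama 3.1 Nemotron 70B | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Hermes 3 70B | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Claude 3 Haiku | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Rocinante 12B | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Gemini 3 Flash (Preview) | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
DeepSeek-V2 Chat | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Z.AI GLM 4.7 Flash | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Claude Sonnet 4 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Mistral Large 3 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Hermes 3 405B | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Mistral Large 2 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Llama 3.1 70B | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Gemini 2.5 Pro | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2024-12-26) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2025-03-24) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%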
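Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 85.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 75.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 75.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 65.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 65.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 65.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 60.0%
Hermes 3 70B | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 60.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 55.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Claude Opus 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Claude Haiku 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Ministral 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
GPT-4.1 Mini | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Ministral 8B | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 45.0%
Claude Opus 4.6 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 40.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Qwen 2.5 72B | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 40.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 40.0%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
WizardLM 2 8x22b | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Claude Sonnet 4 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 30.0%
Grok 4 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Claude Opus 4 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Ministral 3 8B | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Mistral Large 3 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Ministral 3 3B | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Llama 3.1 8B | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Gemini 2.5 Pro | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=1) | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Hermes 3 405B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Gemini 2.5 Flash Lite | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 70B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Claude 3 Haiku | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Arcee AI: Trinity Mini | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Mistral Large 2 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Mistral Small 3.2 24B | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Mistral Small Creative | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Ministral 3 14B | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
MoonshotAI: Kimi K2.5 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Claude 3.7 Sonnet | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
DeepSeek V3.1 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
DeepSeek V3.2 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Mistral Large | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Writer: Palmyra X5 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Claude Sonnet 4.5 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
DeepSeek-V2 Chat | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
DeepSeek V3 (2024-12-26) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Rocinante 12B | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Qwen 3.5 397B A17B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2025-03-24) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%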
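Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 60.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 60.0%
Ministral 3 14B | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0%
Qwen 2.5 72B | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 50.0%
GPT-4.1 Nano | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 50.0%
Mistral Small Creative | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Claude Haiku 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Claude 3.5 Sonnet | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 35.0%
Arcee AI: Trinity Large (Preview) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
WizardLM 2 8x22b | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 35.0%
Ministral 3 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Ministral 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Claude Sonnet 4 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Mistral Small 3.2 24B | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Llama 3.1 8B | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0%
Claude Opus 4 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Llama 3.1 Nemotron 70B | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Hermes 3 70B | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o Mini (temp=1) | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 70B | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=1) | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Writer: Palmyra X5 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0%
Hermes 3 405B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Claude 3 Haiku | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Rocinante 12B | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
GPT-4o, May 13th (temp=0) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
DeepSeek V3.1 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
DeepSeek V3.2 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Gemma 3 12B | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Claude Sonnet 4.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek-V2 Chat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2024-12-26) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%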
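Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 85.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
GPT-4.1 Nano | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 65.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 60.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 55.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 50.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 45.0%
Arcee AI: Trinity Large (Preview) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Mistral Small Creative | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 45.0%
Claude 3 Haiku | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Ministral 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 45.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
GPT-5 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Qwen 2.5 72B | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 40.0%
Hermes 3 70B | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 40.0%
Ministral 3 3B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 40.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
GPT-4o Mini (temp=1) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Llama 3.1 Nemotron 70B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Llama 3.1 8B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Claude 3.7 Sonnet | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
Claude Opus 4 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=1) | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V3.2 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
DeepSeek V3 (2025-03-24) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Gemini 2.5 Flash Lite | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
GPT-4o, May 13th (temp=0) | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
WizardLM 2 8x22b | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Claude Sonnet 4.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek-V2 Chat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2024-12-26) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Writer: Palmyra X5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%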