Structural validity

Test: Codex Violation Detection

Avg. Score: 97.1%
Scenarios: 8

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Claude 3 Haiku | 100.0% | $0.0015 | 3.6s | 100%
2 | Ministral 3B | 99.9% | $0.0002 | 2.9s | 99%
3 | Gemini 2.5 Flash | 100.0% | $0.0025 | 2.8s | 100%
4 | Mistral Small Creative | 100.0% | $0.0006 | 4.3s | 99%
5 | Ministral 3 3B | 99.9% | $0.0005 | 3.3s | 99%
6 | GPT-4.1 Mini | 100.0% | $0.0015 | 5.8s | 100%
7 | Gemini 3 Flash (Preview) | 100.0% | $0.0031 | 4.5s | 100%
8 | Arcee AI: Trinity Mini | 100.0% | $0.0004 | 10.3s | 100%
9 | Mistral Small 3.2 24B | 100.0% | $0.0006 | 10.4s | 100%
10 | ByteDance Seed 1.6 Flash | 100.0% | $0.0009 | 12.0s | 100%
11 | Gemma 3 27B | 100.0% | $0.0005 | 13.3s | 100%
12 | Mistral Large 3 | 100.0% | $0.0030 | 10.2s | 100%
13 | DeepSeek V3 (2024-12-26) | 100.0% | $0.0019 | 15.6s | 100%
14 | DeepSeek V3.2 | 100.0% | $0.0013 | 17.2s | 100%
15 | Qwen 2.5 72B | 99.9% | $0.0008 | 13.9s | 98%
16 | Writer: Palmyra X5 | 100.0% | $0.0062 | 12.1s | 100%
17 | Z.AI GLM 4.5 | 100.0% | $0.0026 | 18.5s | 100%
18 | Hermes 3 405B | 100.0% | $0.0044 | 17.2s | 100%
19 | Grok 4.1 Fast | 100.0% | $0.0021 | 21.1s | 100%
20 | Mistral Large 2 | 100.0% | $0.012 | 8.8s | 100%
21 | Mistral Large | 100.0% | $0.012 | 9.1s | 100%
22 | GPT-4o, Aug. 6th (temp=0) | 100.0% | $0.015 | 5.0s | 100%
23 | Minimax M2.5 | 99.9% | $0.0020 | 25.5s | 99%
24 | GPT-4o, May 13th (temp=1) | 100.0% | $0.026 | 3.7s | 100%
25 | Claude Sonnet 4 | 100.0% | $0.023 | 9.0s | 100%
26 | Claude Sonnet 4.5 | 100.0% | $0.024 | 8.9s | 100%
27 | Claude Sonnet 4.6 | 100.0% | $0.024 | 9.9s | 100%
28 | Ministral 3 14B | 98.9% | $0.0010 | 6.3s | 80%
29 | Mistral Medium 3.1 | 98.9% | $0.0029 | 7.7s | 80%
30 | o4 Mini | 100.0% | $0.019 | 28.1s | 100%
31 | Gemma 3 12B | 98.7% | $0.0003 | 12.0s | 80%
32 | GPT-4o Mini (temp=1) | 98.3% | $0.0006 | 7.4s | 77%
33 | DeepSeek-V2 Chat | 98.9% | $0.0020 | 13.1s | 80%
34 | GPT-5.2 | 100.0% | $0.024 | 24.2s | 100%
35 | Grok 4 Fast | 98.6% | $0.0018 | 12.8s | 79%
36 | Claude 3.5 Haiku | 98.8% | $0.0049 | 6.2s | 78%
37 | GPT-4o, Aug. 6th (temp=1) | 98.9% | $0.011 | 3.6s | 80%
38 | Z.AI GLM 4.7 Flash | 100.0% | $0.0018 | 1.1m | 100%
39 | GPT-5 Mini | 99.9% | $0.0092 | 51.7s | 99%
40 | ByteDance Seed 1.6 | 100.0% | $0.0067 | 1.0m | 100%
41 | Claude Opus 4.5 | 100.0% | $0.041 | 9.7s | 100%
42 | Claude Opus 4.6 | 100.0% | $0.040 | 10.2s | 100%
43 | Stealth: Aurora Alpha | 98.8% | | 6.2s | 78%
44 | Claude 3.5 Sonnet | 100.0% | $0.042 | 10.5s | 100%
45 | Z.AI GLM 4.7 | 100.0% | $0.0091 | 1.0m | 100%
46 | Z.AI GLM 5 | 100.0% | $0.013 | 56.4s | 100%
47 | Ministral 3 8B | 97.5% | $0.0007 | 4.8s | 69%
48 | Gemini 2.5 Pro | 100.0% | $0.035 | 23.2s | 100%
49 | Qwen 3.5 Plus (2026-02-15) | 98.9% | $0.0041 | 34.2s | 80%
50 | GPT-4.1 | 97.7% | $0.0081 | 10.0s | 72%
51 | DeepSeek V3 (2025-03-24) | 97.7% | $0.0015 | 22.9s | 72%
52 | Z.AI GLM 4.6 | 99.9% | $0.0049 | 1.3m | 99%
53 | Llama 3.1 Nemotron 70B | 97.7% | $0.0055 | 18.5s | 72%
54 | o4 Mini High | 100.0% | $0.033 | 51.3s | 100%
55 | Mistral NeMO | 96.0% | $0.0007 | 13.3s | 62%
56 | GPT-5.1 | 100.0% | $0.037 | 52.8s | 100%
57 | Claude Haiku 4.5 | 95.5% | $0.0078 | 5.8s | 61%
58 | MoonshotAI: Kimi K2.5 | 100.0% | $0.015 | 1.5m | 100%
59 | Gemma 3 4B | 93.5% | $0.0002 | 12.5s | 59%
60 | Gemini 3 Pro (Preview) | 99.9% | $0.050 | 34.0s | 98%
61 | Claude 3.7 Sonnet | 96.6% | $0.023 | 10.2s | 66%
62 | GPT-4o, May 13th (temp=0) | 96.6% | $0.030 | 5.4s | 66%
63 | GPT-4o Mini (temp=0) | 94.4% | $0.0006 | 25.0s | 56%
64 | DeepSeek V3.1 | 94.3% | $0.0014 | 31.1s | 56%
65 | Ministral 8B | 91.9% | $0.0005 | 5.6s | 47%
66 | Grok 4 | 100.0% | $0.051 | 1.2m | 100%
67 | Llama 3.1 70B | 92.3% | $0.0021 | 24.3s | 49%
68 | Gemini 3.1 Pro (Preview) | 100.0% | $0.068 | 52.3s | 100%
69 | Gemini 2.5 Flash Lite | 88.8% | $0.0005 | 2.3s | 39%
70 | Hermes 3 70B | 92.2% | $0.0013 | 29.2s | 47%
71 | Llama 3.1 8B | 88.5% | $0.0002 | 16.1s | 43%
72 | Cohere Command R+ (Aug. 2024) | 91.5% | $0.014 | 10.0s | 45%
73 | GPT-5 | 100.0% | $0.061 | 1.6m | 100%
74 | Claude Opus 4 | 100.0% | $0.116 | 15.8s | 100%
75 | GPT-5 Nano | 96.5% | $0.0049 | 1.9m | 64%
76 | Rocinante 12B | 80.2% | $0.0009 | 6.0s | 26%
77 | Arcee AI: Trinity Large (Preview) | 97.6% | $0.0000 | 3.1m | 72%
78 | Qwen 3.5 397B A17B | 98.9% | $0.026 | 2.9m | 80%
79 | WizardLM 2 8x22b | 61.5% | $0.0036 | 15.7s | 11%
80 | GPT-4.1 Nano | 42.8% | $0.0004 | 3.9s | 16%
Avg. Score: 97.09%

Individual Scenarios

matrix

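Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99.9%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99.9%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.6%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.6%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.6%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 98 | 94 | 99.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 94 | 93 | 98.7%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 91.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 91.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 91.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 10 | 90.8%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 96 | 10 | 90.5%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 10 | 82.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 10 | 82.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 0 | 81.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 75 | 10 | 10 | 79.4%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 10 | 10 | 73.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 10 | 10 | 0 | 71.5%
Gemma 3 4B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 10 | 10 | 10 | 58.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 98 | 69 | 0 | 0 | 0 | 0 | 56.7%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 10 | 10 | 10 | 10 | 10 | 55.0%
GPT-4.1 Nano | 100 | 100 | 50 | 50 | 40 | 28 | 10 | 10 | 0 | 0 | 38.8%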
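
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.5%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 99.2%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 99.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 99.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 91.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 91.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 91.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 10 | 90.6%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 50 | 50 | 89.9%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 95 | 0 | 89.1%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 77 | 0 | 87.7%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 10 | 82.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 10 | 82.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 94 | 10 | 10 | 81.4%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 0 | 81.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 94 | 10 | 0 | 0 | 70.4%
GPT-4.1 Nano | 100 | 50 | 50 | 50 | 50 | 40 | 30 | 10 | 10 | 0 | 39.0%
WizardLM 2 8x22b | 10 | 10 | 10 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 4.0%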
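
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 94 | 99.4%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 91.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 91.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 91.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 95 | 90 | 83 | 0 | 0 | 76.9%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 63 | 50 | 50 | 50 | 10 | 10 | 63.2%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
GPT-4.1 Nano | 100 | 100 | 50 | 50 | 50 | 47 | 0 | 0 | 0 | 0 | 39.7%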
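
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-4.1 Nano | 100 | 100 | 100 | 75 | 50 | 50 | 50 | 50 | 50 | 50 | 67.5%
WizardLM 2 8x22b | 100 | 50 | 50 | 50 | 10 | 10 | 10 | 10 | 0 | 0 | 29.0%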

tiers

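Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.6%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 98.3%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 91.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 91.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 10 | 10 | 73.0%
GPT-4.1 Nano | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 45.0%
WizardLM 2 8x22b | 10 | 10 | 10 | 10 | 10 | 10 | 0 | 0 | 0 | 0 | 6.0%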
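
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 94 | 99.4%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 97.5%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 10 | 10 | 81.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 10 | 10 | 73.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 10 | 0 | 0 | 69.8%
GPT-4.1 Nano | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%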
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
Claude Opus 4.6100100100100100100100100100100100.0%
GPT-5.1100100100100100100100100100100100.0%
o4 Mini High100100100100100100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100100100100100100.0%
Claude Opus 4.5100100100100100100100100100100100.0%
GPT-5100100100100100100100100100100100.0%
o4 Mini100100100100100100100100100100100.0%
Z.AI GLM 5100100100100100100100100100100100.0%
Gemini 2.5 Pro100100100100100100100100100100100.0%
GPT-5.2100100100100100100100100100100100.0%
Z.AI GLM 4.7100100100100100100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100100100100100100.0%
Claude Opus 4100100100100100100100100100100100.0%
Minimax M2.5100100100100100100100100100100100.0%
Claude Sonnet 4100100100100100100100100100100100.0%
Grok 4100100100100100100100100100100100.0%
Claude Sonnet 4.6100100100100100100100100100100100.0%
Claude Sonnet 4.5100100100100100100100100100100100.0%
Grok 4.1 Fast100100100100100100100100100100100.0%
ByteDance Seed 1.6100100100100100100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100100100100100100.0%
Stealth: Aurora Alpha 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5 Nano 100 100 100 100 100 100 100 100 100 100 100.0%
DeepSeek V3 (2025-03-24) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude 3.5 Sonnet 100 100 100 100 100 100 100 100 100 100 100.0%
DeepSeek V3 (2024-12-26) 100 100 100 100 100 100 100 100 100 100 100.0%
Mistral Large 3 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude 3.7 Sonnet 100 100 100 100 100 100 100 100 100 100 100.0%
DeepSeek V3.2 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.5 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=1) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=0) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=1) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude 3.5 Haiku 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4.1 Mini 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 1.6 Flash 100 100 100 100 100 100 100 100 100 100 100.0%
Mistral Medium 3.1 100 100 100 100 100 100 100 100 100 100 100.0%
DeepSeek V3.1 100 100 100 100 100 100 100 100 100 100 100.0%
Writer: Palmyra X5 100 100 100 100 100 100 100 100 100 100 100.0%
Mistral Large 2 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash 100 100 100 100 100 100 100 100 100 100 100.0%
Hermes 3 405B 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4o Mini (temp=1) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4o Mini (temp=0) 100 100 100 100 100 100 100 100 100 100 100.0%
Llama 3.1 70B 100 100 100 100 100 100 100 100 100 100 100.0%
Llama 3.1 Nemotron 70B 100 100 100 100 100 100 100 100 100 100 100.0%
Mistral Large 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 3 27B 100 100 100 100 100 100 100 100 100 100 100.0%
Mistral Small Creative 100 100 100 100 100 100 100 100 100 100 100.0%
Ministral 3 14B 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 2.5 72B 100 100 100 100 100 100 100 100 100 100 100.0%
Mistral Small 3.2 24B 100 100 100 100 100 100 100 100 100 100 100.0%
Arcee AI: Trinity Large (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude 3 Haiku 100 100 100 100 100 100 100 100 100 100 100.0%
Ministral 3 8B 100 100 100 100 100 100 100 100 100 100 100.0%
Arcee AI: Trinity Mini 100 100 100 100 100 100 100 100 100 100 100.0%
Ministral 3 3B 100 100 100 100 100 100 100 100 100 100 100.0%
Cohere Command R+ (Aug. 2024) 100 100 100 100 100 100 100 100 100 100 100.0%
Mistral NeMO 100 100 100 100 100 100 100 100 100 100 100.0%
WizardLM 2 8x22b 100 100 100 100 100 100 100 100 100 100 100.0%
Ministral 3B 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 3 4B 100 100 100 100 100 100 100 100 100 97 99.7%
Z.AI GLM 4.6 100 100 100 100 100 100 100 100 100 95 99.5%
Llama 3.1 8B 100 100 100 100 100 100 100 100 100 80 98.0%
Rocinante 12B 100 100 100 100 100 100 95 88 83 75 94.2%
Qwen 3.5 397B A17B 100 100 100 100 100 100 100 100 100 10 91.0%
Grok 4 Fast 100 100 100 100 100 100 100 100 100 10 91.0%
DeepSeek-V2 Chat 100 100 100 100 100 100 100 100 100 10 91.0%
Claude Haiku 4.5 100 100 100 100 100 100 100 100 100 10 91.0%
Gemma 3 12B 100 100 100 100 100 100 100 100 100 10 91.0%
Hermes 3 70B 100 100 100 100 100 100 100 100 100 0 90.0%
Ministral 8B 100 100 100 100 100 100 100 100 100 0 90.0%
GPT-4.1 100 100 100 100 100 100 100 93 10 10 81.3%
Gemini 2.5 Flash Lite 100 100 100 100 100 100 100 100 0 0 80.0%
GPT-4.1 Nano 100 100 50 50 50 50 10 10 0 0 42.0%
Model #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Avg ▼
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5 Mini 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100 100 100 100 100 100.0%
o4 Mini High 100 100 100 100 100 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.5 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5 100 100 100 100 100 100 100 100 100 100 100.0%
o4 Mini 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 5 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5.2 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Pro (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Opus 4 100 100 100 100 100 100 100 100 100 100 100.0%
Minimax M2.5 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4.1 Fast 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 1.6 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.6 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4.1 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 3 Flash (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.7 Flash 100 100 100 100 100 100 100 100 100 100 100.0%
Stealth: Aurora Alpha 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-5 Nano 100 100 100 100 100 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 100 100 100 100 100 100 100.0%
Claude 3.5 Sonnet 100 100 100 100 100 100 100 100 100 100 100.0%
DeepSeek V3 (2024-12-26) 100 100 100 100 100 100 100 100 100 100 100.0%
Mistral Large 3 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=0) 100 100 100 100 100 100 100 100 100 100 100.0%
DeepSeek-V2 Chat 100 100 100 100 100 100 100 100 100 100 100.0%
Claude 3.7 Sonnet 100 100 100 100 100 100 100 100 100 100 100.0%
DeepSeek V3.2 100 100 100 100 100 100 100 100 100 100 100.0%
Z.AI GLM 4.5 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=1) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4o, Aug. 6th (temp=0) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4o, May 13th (temp=1) 100 100 100 100 100 100 100 100 100 100 100.0%
Claude 3.5 Haiku 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4.1 Mini 100 100 100 100 100 100 100 100 100 100 100.0%
ByteDance Seed 1.6 Flash 100 100 100 100 100 100 100 100 100 100 100.0%
Mistral Medium 3.1 100 100 100 100 100 100 100 100 100 100 100.0%
DeepSeek V3.1 100 100 100 100 100 100 100 100 100 100 100.0%
Writer: Palmyra X5 100 100 100 100 100 100 100 100 100 100 100.0%
Mistral Large 2 100 100 100 100 100 100 100 100 100 100 100.0%
Gemini 2.5 Flash 100 100 100 100 100 100 100 100 100 100 100.0%
Hermes 3 405B 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4o Mini (temp=1) 100 100 100 100 100 100 100 100 100 100 100.0%
GPT-4o Mini (temp=0) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 3 12B 100 100 100 100 100 100 100 100 100 100 100.0%
Llama 3.1 Nemotron 70B 100 100 100 100 100 100 100 100 100 100 100.0%
Mistral Large 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 3 27B 100 100 100 100 100 100 100 100 100 100 100.0%
Mistral Small Creative 100 100 100 100 100 100 100 100 100 100 100.0%
Ministral 3 14B 100 100 100 100 100 100 100 100 100 100 100.0%
Qwen 2.5 72B 100 100 100 100 100 100 100 100 100 100 100.0%
Mistral Small 3.2 24B 100 100 100 100 100 100 100 100 100 100 100.0%
Hermes 3 70B 100 100 100 100 100 100 100 100 100 100 100.0%
Claude 3 Haiku 100 100 100 100 100 100 100 100 100 100 100.0%
Arcee AI: Trinity Mini 100 100 100 100 100 100 100 100 100 100 100.0%
Ministral 3 3B 100 100 100 100 100 100 100 100 100 100 100.0%
Cohere Command R+ (Aug. 2024) 100 100 100 100 100 100 100 100 100 100 100.0%
Gemma 3 4B 100 100 100 100 100 100 100 100 100 100 100.0%
WizardLM 2 8x22b 100 100 100 100 100 100 100 100 100 100 100.0%
Ministral 3B 100 100 100 100 100 100 100 100 100 100 100.0%
Arcee AI: Trinity Large (Preview) 100 100 100 100 100 100 100 100 100 100 100.0%
DeepSeek V3 (2025-03-24) 100 100 100 100 100 100 100 100 100 98 99.8%
Claude Haiku 4.5 100 100 100 100 100 100 100 100 100 10 91.0%
Ministral 3 8B 100 100 100 100 100 100 100 100 100 0 90.0%
Mistral NeMO 100 100 100 100 100 100 100 100 100 0 90.0%
Llama 3.1 70B 100 100 100 100 100 100 100 100 91 0 89.1%
Llama 3.1 8B 100 100 100 100 100 96 83 75 10 10 77.4%
Rocinante 12B 100 100 100 100 100 100 100 50 10 0 76.0%
Gemini 2.5 Flash Lite 100 100 100 100 100 100 98 48 0 0 74.6%
Ministral 8B 100 100 100 100 100 0 0 0 0 0 50.0%
GPT-4.1 Nano 100 50 50 50 50 50 50 50 0 0 45.0%