Codex Red Herring (False Positive Detection)
Tests whether models correctly report "no violations" when a codex is fully consistent with the prose passage. Models that hallucinate violations (false positives) fail. Uses a 2×2 matrix of text length × codex size, with each cell run in a basic-entry and a detailed-entry variant.
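The scoring shape of such a false-positive test can be sketched as follows; the function name, penalty value, and variant labels are illustrative assumptions, not the benchmark's actual rubric:

```python
def score_response(reported_violations: list[str]) -> float:
    """Score one model answer on a fully consistent passage.

    Ground truth is an empty violation list, so every reported
    violation is a hallucinated false positive. The 0.25 penalty
    per hallucination is an assumed value for illustration.
    """
    if not reported_violations:
        return 1.0  # correct "no violations" response
    return max(0.0, 1.0 - 0.25 * len(reported_violations))

# The test matrix: text length x codex size, each in a basic and a
# detailed-entry variant, giving eight test cases per model.
variants = [
    (text, codex, detail)
    for text in ("short ~524 words", "long ~1594 words")
    for codex in ("small, 11 entries", "big, 51 entries")
    for detail in ("basic", "detailed")
]
```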
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-4.1 Nano | 63% | $0.0005 | 4.6s | |
| Ministral 8B | 68% | $0.0007 | 4.4s | |
| Ministral 3 8B | 78% | $0.0013 | 17.9s | |
| ByteDance Seed 1.6 Flash | 94% | $0.0008 | 9.1s | |
| Arcee AI: Trinity Mini | 73% | $0.0009 | 26.3s | |
| Grok 4.1 Fast | 97% | $0.0019 | 12.5s | |
| GPT-4.1 | 86% | $0.0048 | 1.2s | |
| Grok 4 Fast | 73% | $0.0019 | 19.2s | |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | $0.0023 | 16.6s | |
| Gemini 2.5 Flash (Reasoning) | 87% | $0.0085 | 14.2s | |
| Minimax M2.5 | 79% | $0.0032 | 25.9s | |
| ByteDance Seed 1.6 | 85% | $0.0043 | 32.7s | |
| GPT-5.2 | 85% | $0.013 | 14.5s | |
| GPT-5 Mini | 93% | $0.0059 | 37.8s | |
| o4 Mini | 96% | $0.014 | 25.0s | |
| GPT-5 Nano | 93% | $0.0035 | 1.1m | |
| GPT-5.1 | 95% | $0.025 | 26.1s | |
| Z.AI GLM 4.6 | 71% | $0.015 | 59.0s | |
| o4 Mini High | 97% | $0.027 | 52.5s | |
| Aion 2.0 | 88% | $0.0096 | 1.3m | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| o4 Mini High | 97% | 72% | 72% | |
| Grok 4.1 Fast | 97% | 70% | 70% | |
| o4 Mini | 96% | 67% | 67% | |
| GPT-5.1 | 95% | 64% | 64% | |
| Claude Opus 4.6 (Reasoning) | 94% | 60% | 60% | |
| ByteDance Seed 1.6 Flash | 94% | 59% | 59% | |
| Z.AI GLM 5 | 93% | 56% | 56% | |
| GPT-5 Mini | 93% | 56% | 56% | |
| GPT-5 Nano | 93% | 55% | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | 53% | 53% | |
| Aion 2.0 | 88% | 45% | 45% | |
| Gemini 2.5 Flash (Reasoning) | 87% | 43% | 43% | |
| GPT-4.1 | 86% | 42% | 42% | |
| ByteDance Seed 1.6 | 85% | 39% | 39% | |
| GPT-5.2 | 85% | 39% | 39% | |
| Claude Opus 4.6 | 81% | 37% | 37% | |
| Claude Sonnet 4.6 (Reasoning) | 82% | 35% | 35% | |
| Claude Sonnet 4.6 | 70% | 34% | 34% | |
| Minimax M2.5 | 79% | 32% | 32% | |
| GPT-5 | 76% | 29% | 29% | |
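The stability metric named in the caption above is the product of the median run score and run-to-run consistency. A minimal sketch, assuming both inputs are 0–1 fractions (how the benchmark derives consistency from repeated runs is not specified on this page):

```python
import statistics

def stability(run_scores: list[float], consistency: float) -> float:
    """Stability = median run score x consistency (both 0-1 fractions)."""
    return statistics.median(run_scores) * consistency

# A model that always scores 100% but is only 70% consistent
# lands at 70% stability.
```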
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Grok 4.1 Fast | 97% | $0.0019 | 12.5s | 70% | |
| o4 Mini | 96% | $0.014 | 25.0s | 67% | |
| ByteDance Seed 1.6 Flash | 94% | $0.0008 | 9.1s | 59% | |
| o4 Mini High | 97% | $0.027 | 52.5s | 72% | |
| GPT-5.1 | 95% | $0.025 | 26.1s | 64% | |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | $0.0023 | 16.6s | 53% | |
| GPT-5 Mini | 93% | $0.0059 | 37.8s | 56% | |
| GPT-5 Nano | 93% | $0.0035 | 1.1m | 55% | |
| GPT-4.1 | 86% | $0.0048 | 1.2s | 42% | |
| Gemini 2.5 Flash (Reasoning) | 87% | $0.0085 | 14.2s | 43% | |
| Z.AI GLM 5 | 93% | $0.017 | 1.4m | 56% | |
| ByteDance Seed 1.6 | 85% | $0.0043 | 32.7s | 39% | |
| GPT-5.2 | 85% | $0.013 | 14.5s | 39% | |
| Aion 2.0 | 88% | $0.0096 | 1.3m | 45% | |
| Minimax M2.5 | 79% | $0.0032 | 25.9s | 32% | |
| Ministral 3 8B | 78% | $0.0013 | 17.9s | 24% | |
| Grok 4 Fast | 73% | $0.0019 | 19.2s | 25% | |
| Claude Opus 4.6 | 81% | $0.049 | 13.2s | 37% | |
| Arcee AI: Trinity Mini | 73% | $0.0009 | 26.3s | 22% | |
| Ministral 8B | 68% | $0.0007 | 4.4s | 21% | |
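The composite ranking above combines performance, cost, speed, and stability, but the weighting is not published on this page. A hypothetical sketch, assuming a simple weighted sum over pre-normalized 0–1 cost and time penalties; the weights are invented for illustration:

```python
def composite(score: float, cost_norm: float, time_norm: float,
              stability: float,
              weights: tuple = (0.4, 0.2, 0.2, 0.2)) -> float:
    """Hypothetical composite score.

    cost_norm and time_norm are assumed normalized to 0-1, where 0 is
    the cheapest/fastest model in the cohort. The weights are
    assumptions, not the benchmark's actual values.
    """
    w_perf, w_cost, w_time, w_stab = weights
    return (w_perf * score
            + w_cost * (1.0 - cost_norm)
            + w_time * (1.0 - time_norm)
            + w_stab * stability)
```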
| Model | Total | Short text (~524 words), small codex (11 entries) | Short text (~524 words), big codex (51 entries) | Long text (~1594 words), small codex (11 entries) | Long text (~1594 words), big codex (51 entries) | Short text (~524 words), small codex (11 detailed entries) | Short text (~524 words), big codex (51 detailed entries) | Long text (~1594 words), small codex (11 detailed entries) | Long text (~1594 words), big codex (51 detailed entries) |
|---|---|---|---|---|---|---|---|---|---|
| o4 Mini High | 97% | 93% | 100% | 100% | 85% | 100% | 100% | 100% | 100% |
| Grok 4.1 Fast | 97% | 100% | 100% | 100% | 92% | 100% | 100% | 85% | 100% |
| o4 Mini | 96% | 100% | 100% | 100% | 85% | 93% | 100% | 93% | 100% |
| GPT-5.1 | 95% | 93% | 93% | 100% | 100% | 93% | 85% | 100% | 100% |
| Claude Opus 4.6 (Reasoning) | 94% | 100% | 100% | 100% | 100% | 85% | 70% | 100% | 100% |
| ByteDance Seed 1.6 Flash | 94% | 100% | 100% | 100% | 93% | 84% | 93% | 100% | 84% |
| Z.AI GLM 5 | 93% | 93% | 78% | 100% | 83% | 100% | 93% | 100% | 100% |
| GPT-5 Mini | 93% | 100% | 100% | 100% | 45% | 100% | 100% | 100% | 100% |
| GPT-5 Nano | 93% | 85% | 78% | 100% | 100% | 93% | 100% | 93% | 93% |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | 93% | 100% | 93% | 93% | 100% | 85% | 78% | 93% |
| Aion 2.0 | 88% | 78% | 93% | 93% | 100% | 93% | 93% | 55% | 100% |
| Gemini 2.5 Flash (Reasoning) | 87% | 100% | 93% | 100% | 85% | 100% | 93% | 40% | 85% |
| GPT-4.1 | 86% | 41% | 100% | 100% | 45% | 100% | 100% | 100% | 100% |
| ByteDance Seed 1.6 | 85% | 100% | 100% | 100% | 51% | 100% | 100% | 48% | 85% |
| GPT-5.2 | 85% | 76% | 75% | 100% | 92% | 85% | 100% | 62% | 93% |
basic entries
Short text (~524 words), small codex (11 entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Arcee AI: Trinity Mini | 81% | $0.0012 | 36.6s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 5.3s | |
| Grok 4 Fast | 77% | $0.0008 | 8.7s | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0012 | 8.6s | |
| Grok 4.1 Fast | 100% | $0.0007 | 7.5s | |
| Llama 3.1 Nemotron 70B | 67% | $0.0021 | 9.0s | |
| Cohere Command R+ (Aug. 2024) | 65% | $0.0046 | 2.5s | |
| Minimax M2.5 | 93% | $0.0015 | 12.4s | |
| ByteDance Seed 1.6 | 100% | $0.0017 | 16.9s | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0051 | 8.5s | |
| Z.AI GLM 4.5 | 78% | $0.0018 | 13.0s | |
| GPT-5.2 | 76% | $0.0091 | 10.8s | |
| Z.AI GLM 4.6 | 93% | $0.0029 | 18.0s | |
| Z.AI GLM 4.7 Flash | 100% | $0.0012 | 47.1s | |
| GPT-5 Mini | 100% | $0.0044 | 29.1s | |
| o4 Mini | 100% | $0.0097 | 18.9s | |
| GPT-5.1 | 93% | $0.018 | 22.3s | |
| Aion 2.0 | 78% | $0.0054 | 59.6s | |
| GPT-5 Nano | 85% | $0.0036 | 1.2m | |
| Claude Opus 4.6 | 100% | $0.022 | 9.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 4.7 Flash | 100% | 100% | 100% | |
| ByteDance Seed 1.6 Flash | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 93% | 56% | 56% | |
| GPT-5.1 | 93% | 55% | 55% | |
| Z.AI GLM 5 | 93% | 55% | 55% | |
| o4 Mini High | 93% | 55% | 55% | |
| Minimax M2.5 | 93% | 55% | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | 55% | 55% | |
| Claude Sonnet 4.6 (Reasoning) | 85% | 40% | 40% | |
| GPT-5 Nano | 85% | 40% | 40% | |
| Z.AI GLM 4.5 | 78% | 32% | 32% | |
| Aion 2.0 | 78% | 31% | 31% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 5.3s | 100% | |
| Grok 4.1 Fast | 100% | $0.0007 | 7.5s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0051 | 8.5s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0017 | 16.9s | 100% | |
| GPT-5 Mini | 100% | $0.0044 | 29.1s | 100% | |
| o4 Mini | 100% | $0.0097 | 18.9s | 100% | |
| Z.AI GLM 4.7 Flash | 100% | $0.0012 | 47.1s | 100% | |
| Claude Opus 4.6 | 100% | $0.022 | 9.7s | 100% | |
| Claude Opus 4.6 (Reasoning) | 100% | $0.033 | 20.7s | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | $0.039 | 27.8s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0012 | 8.6s | 55% | |
| Minimax M2.5 | 93% | $0.0015 | 12.4s | 55% | |
| Z.AI GLM 4.6 | 93% | $0.0029 | 18.0s | 56% | |
| GPT-5.1 | 93% | $0.018 | 22.3s | 55% | |
| o4 Mini High | 93% | $0.019 | 35.0s | 55% | |
| Z.AI GLM 5 | 93% | $0.013 | 1.4m | 55% | |
| Z.AI GLM 4.5 | 78% | $0.0018 | 13.0s | 32% | |
| Grok 4 Fast | 77% | $0.0008 | 8.7s | 29% | |
| GPT-5 Nano | 85% | $0.0036 | 1.2m | 40% | |
| GPT-5.2 | 76% | $0.0091 | 10.8s | 27% | |
| Evaluator | Median |
|---|---|
| Correct "no violations" response | 42.5% |
| No hallucinated violations | 50.0% |
Short text (~524 words), big codex (51 entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Ministral 8B | 68% | $0.0003 | 3.4s | |
| Ministral 3 8B | 93% | $0.0004 | 1.2s | |
| GPT-4.1 | 100% | $0.0020 | 737ms | |
| Hermes 3 405B | 75% | $0.0026 | 1.7s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 6.7s | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0011 | 6.6s | |
| Grok 4.1 Fast | 100% | $0.0012 | 9.6s | |
| Arcee AI: Trinity Mini | 83% | $0.0003 | 15.8s | |
| Cohere Command R+ (Aug. 2024) | 65% | $0.0080 | 3.2s | |
| Gemini 2.5 Flash (Reasoning) | 93% | $0.0047 | 7.7s | |
| GPT-5.2 | 75% | $0.0087 | 9.4s | |
| Claude Sonnet 4 | 75% | $0.0096 | 2.1s | |
| ByteDance Seed 1.6 | 100% | $0.0021 | 16.2s | |
| Minimax M2.5 | 70% | $0.0016 | 17.5s | |
| GPT-5.1 | 93% | $0.014 | 17.8s | |
| GPT-5 Mini | 100% | $0.0044 | 28.8s | |
| o4 Mini | 100% | $0.010 | 20.6s | |
| Aion 2.0 | 93% | $0.0053 | 49.2s | |
| Claude Opus 4.6 | 100% | $0.031 | 12.1s | |
| o4 Mini High | 100% | $0.020 | 41.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | 100% | 100% | |
| ByteDance Seed 1.6 Flash | 100% | 100% | 100% | |
| Ministral 3 8B | 93% | 56% | 56% | |
| GPT-5.1 | 93% | 55% | 55% | |
| Aion 2.0 | 93% | 55% | 55% | |
| Gemini 2.5 Flash (Reasoning) | 93% | 55% | 55% | |
| Gemini 2.5 Pro | 85% | 40% | 40% | |
| Z.AI GLM 4.7 Flash | 85% | 40% | 40% | |
| Claude Sonnet 4 | 75% | 37% | 37% | |
| Claude Haiku 4.5 | 54% | 69% | 35% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| GPT-4.1 | 100% | $0.0020 | 737ms | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 6.7s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0011 | 6.6s | 100% | |
| Grok 4.1 Fast | 100% | $0.0012 | 9.6s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0021 | 16.2s | 100% | |
| GPT-5 Mini | 100% | $0.0044 | 28.8s | 100% | |
| o4 Mini | 100% | $0.010 | 20.6s | 100% | |
| Claude Opus 4.6 | 100% | $0.031 | 12.1s | 100% | |
| o4 Mini High | 100% | $0.020 | 41.1s | 100% | |
| Claude Opus 4.6 (Reasoning) | 100% | $0.041 | 24.0s | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | $0.045 | 32.9s | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | $0.044 | 35.6s | 100% | |
| Ministral 3 8B | 93% | $0.0004 | 1.2s | 56% | |
| Gemini 2.5 Flash (Reasoning) | 93% | $0.0047 | 7.7s | 55% | |
| GPT-5.1 | 93% | $0.014 | 17.8s | 55% | |
| Aion 2.0 | 93% | $0.0053 | 49.2s | 55% | |
| Arcee AI: Trinity Mini | 83% | $0.0003 | 15.8s | 31% | |
| Claude Sonnet 4 | 75% | $0.0096 | 2.1s | 37% | |
| Hermes 3 405B | 75% | $0.0026 | 1.7s | 24% | |
| GPT-5.2 | 75% | $0.0087 | 9.4s | 24% | |
| Evaluator | Median |
|---|---|
| Correct "no violations" response | 40.0% |
| No hallucinated violations | 50.8% |
Long text (~1594 words), small codex (11 entries)
Performance Score Distribution (Top 20)
| Model | Score |
|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| GPT-5.1 | 100% | |
| GPT-5 | 100% | |
| Z.AI GLM 5 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Grok 4.1 Fast | 100% | |
| Z.AI GLM 4.7 | 100% | |
| GPT-4.1 | 100% | |
| o4 Mini | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | |
| Grok 4 Fast | 100% | |
| GPT-5 Nano | 100% | |
| ByteDance Seed 1.6 Flash | 100% | |
| Mistral Medium 3.1 | 95% | |
| Claude Sonnet 4.6 | 93% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-4.1 Nano | 93% | $0.0001 | 962ms | |
| Gemini 2.5 Flash Lite | 74% | $0.0003 | 516ms | |
| Mistral Small 3.2 24B | 72% | $0.0003 | 1.0s | |
| Ministral 3 8B | 90% | $0.0004 | 1.4s | |
| Mistral Medium 3.1 | 95% | $0.0011 | 764ms | |
| GPT-4.1 | 100% | $0.0022 | 815ms | |
| Grok 4.1 Fast | 100% | $0.0009 | 6.1s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 7.4s | |
| Grok 4 Fast | 100% | $0.0010 | 13.9s | |
| Hermes 3 70B | 69% | $0.0009 | 6.2s | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0020 | 14.3s | |
| GPT-5.2 | 100% | $0.0083 | 9.6s | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0074 | 11.8s | |
| Minimax M2.5 | 88% | $0.0020 | 24.5s | |
| o4 Mini | 100% | $0.0085 | 15.4s | |
| GPT-5.1 | 100% | $0.012 | 13.2s | |
| GPT-5 Mini | 100% | $0.0044 | 29.6s | |
| ByteDance Seed 1.6 | 100% | $0.0033 | 31.2s | |
| Claude Sonnet 4.6 | 93% | $0.019 | 12.6s | |
| Z.AI GLM 5 | 100% | $0.0087 | 48.2s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | 100% | 100% | |
| Grok 4 Fast | 100% | 100% | 100% | |
| GPT-5 Nano | 100% | 100% | 100% | |
| ByteDance Seed 1.6 Flash | 100% | 100% | 100% | |
| Mistral Medium 3.1 | 95% | 70% | 70% | |
| Claude Sonnet 4.6 | 93% | 59% | 59% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| GPT-4.1 | 100% | $0.0022 | 815ms | 100% | |
| Grok 4.1 Fast | 100% | $0.0009 | 6.1s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 7.4s | 100% | |
| Grok 4 Fast | 100% | $0.0010 | 13.9s | 100% | |
| GPT-5.2 | 100% | $0.0083 | 9.6s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0074 | 11.8s | 100% | |
| o4 Mini | 100% | $0.0085 | 15.4s | 100% | |
| GPT-5.1 | 100% | $0.012 | 13.2s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0033 | 31.2s | 100% | |
| GPT-5 Mini | 100% | $0.0044 | 29.6s | 100% | |
| Z.AI GLM 5 | 100% | $0.0087 | 48.2s | 100% | |
| o4 Mini High | 100% | $0.018 | 33.8s | 100% | |
| GPT-5 Nano | 100% | $0.0033 | 1.0m | 100% | |
| Mistral Medium 3.1 | 95% | $0.0011 | 764ms | 70% | |
| GPT-5 | 100% | $0.030 | 57.1s | 100% | |
| Z.AI GLM 4.7 | 100% | $0.015 | 1.4m | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | $0.012 | 1.6m | 100% | |
| GPT-4.1 Nano | 93% | $0.0001 | 962ms | 58% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0020 | 14.3s | 55% | |
| Claude Sonnet 4.6 | 93% | $0.019 | 12.6s | 59% | |
| Evaluator | Median |
|---|---|
| Correct "no violations" response | 40.0% |
| No hallucinated violations | 33.0% |
Long text (~1594 words), big codex (51 entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-4.1 Nano | 93% | $0.0003 | 2.2s | |
| Ministral 3 8B | 93% | $0.0006 | 1.9s | |
| ByteDance Seed 1.6 Flash | 93% | $0.0010 | 14.8s | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0033 | 22.0s | |
| Grok 4.1 Fast | 92% | $0.0028 | 27.0s | |
| Gemini 2.5 Flash (Reasoning) | 85% | $0.012 | 18.1s | |
| GPT-5.2 | 92% | $0.015 | 17.4s | |
| GPT-5 Nano | 100% | $0.0038 | 1.3m | |
| Aion 2.0 | 100% | $0.0086 | 1.3m | |
| o4 Mini | 85% | $0.021 | 41.4s | |
| GPT-5.1 | 100% | $0.029 | 28.6s | |
| Z.AI GLM 5 | 83% | $0.019 | 1.7m | |
| Claude Opus 4.6 | 94% | $0.048 | 16.8s | |
| MoonshotAI: Kimi K2.5 | 77% | $0.019 | 2.8m | |
| o4 Mini High | 85% | $0.045 | 1.5m | |
| GPT-5 | 92% | $0.074 | 2.5m | |
| Z.AI GLM 4.7 Flash | 77% | $0.0058 | 4.4m | |
| Claude Opus 4.6 (Reasoning) | 100% | $0.143 | 1.3m | |
| Minimax M2.5 | 59% | $0.0035 | 36.5s | |
| Arcee AI: Trinity Mini | 60% | $0.0004 | 10.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| GPT-5 Nano | 100% | 100% | 100% | |
| Claude Opus 4.6 | 94% | 61% | 61% | |
| Ministral 3 8B | 93% | 56% | 56% | |
| GPT-4.1 Nano | 93% | 55% | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | 55% | 55% | |
| ByteDance Seed 1.6 Flash | 93% | 55% | 55% | |
| GPT-5 | 92% | 50% | 50% | |
| GPT-5.2 | 92% | 50% | 50% | |
| Grok 4.1 Fast | 92% | 50% | 50% | |
| o4 Mini High | 85% | 40% | 40% | |
| o4 Mini | 85% | 40% | 40% | |
| Gemini 2.5 Flash (Reasoning) | 85% | 40% | 40% | |
| Claude Sonnet 4.6 (Reasoning) | 83% | 33% | 33% | |
| Z.AI GLM 5 | 83% | 33% | 33% | |
| Claude Opus 4.5 | 33% | 72% | 30% | |
| MoonshotAI: Kimi K2.5 | 77% | 29% | 29% | |
| Z.AI GLM 4.7 Flash | 77% | 29% | 29% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| GPT-5.1 | 100% | $0.029 | 28.6s | 100% | |
| GPT-5 Nano | 100% | $0.0038 | 1.3m | 100% | |
| Aion 2.0 | 100% | $0.0086 | 1.3m | 100% | |
| Ministral 3 8B | 93% | $0.0006 | 1.9s | 56% | |
| GPT-4.1 Nano | 93% | $0.0003 | 2.2s | 55% | |
| ByteDance Seed 1.6 Flash | 93% | $0.0010 | 14.8s | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0033 | 22.0s | 55% | |
| Grok 4.1 Fast | 92% | $0.0028 | 27.0s | 50% | |
| GPT-5.2 | 92% | $0.015 | 17.4s | 50% | |
| Claude Opus 4.6 | 94% | $0.048 | 16.8s | 61% | |
| Gemini 2.5 Flash (Reasoning) | 85% | $0.012 | 18.1s | 40% | |
| Claude Opus 4.6 (Reasoning) | 100% | $0.143 | 1.3m | 100% | |
| o4 Mini | 85% | $0.021 | 41.4s | 40% | |
| o4 Mini High | 85% | $0.045 | 1.5m | 40% | |
| Z.AI GLM 5 | 83% | $0.019 | 1.7m | 33% | |
| Arcee AI: Trinity Mini | 60% | $0.0004 | 10.5s | 16% | |
| GPT-5 | 92% | $0.074 | 2.5m | 50% | |
| Minimax M2.5 | 59% | $0.0035 | 36.5s | 11% | |
| Claude 3 Haiku | 29% | $0.0014 | 3.1s | 28% | |
| GPT-4.1 | 45% | $0.0053 | 2.7s | 14% | |
| Evaluator | Median |
|---|---|
| Correct "no violations" response | 27.5% |
| No hallucinated violations | 27.8% |
detailed entries
Short text (~524 words), small codex (11 detailed entries)
Performance Score Distribution (Top 20)
| Model | Score |
|---|---|
| GPT-5 Mini | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| o4 Mini High | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| GPT-4.1 | 100% | |
| Grok 4 | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | |
| Mistral Large 3 | 100% | |
| Mistral Large 2 | 100% | |
| Mistral Large | 100% | |
| Ministral 3 14B | 100% | |
| Ministral 3 8B | 100% | |
| Ministral 8B | 100% | |
| Claude Haiku 4.5 | 95% | |
| GPT-4.1 Mini | 93% | |
| GPT-5.1 | 93% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Ministral 8B | 100% | $0.0004 | 338ms | |
| Mistral Small 3.2 24B | 63% | $0.0004 | 1.7s | |
| Ministral 3 8B | 100% | $0.0006 | 339ms | |
| GPT-4.1 Mini | 93% | $0.0007 | 1.6s | |
| Ministral 3 14B | 100% | $0.0008 | 531ms | |
| Arcee AI: Trinity Mini | 93% | $0.0013 | 41.5s | |
| ByteDance Seed 1.6 Flash | 84% | $0.0006 | 7.9s | |
| Mistral Large 3 | 100% | $0.0020 | 774ms | |
| Grok 4.1 Fast | 100% | $0.0011 | 8.6s | |
| GPT-4.1 | 100% | $0.0035 | 613ms | |
| Grok 4 Fast | 93% | $0.0013 | 13.5s | |
| Hermes 3 405B | 74% | $0.0039 | 2.5s | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0021 | 15.1s | |
| Claude Haiku 4.5 | 95% | $0.0044 | 1.3s | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0060 | 21.5s | |
| Mistral Large 2 | 100% | $0.0078 | 650ms | |
| Mistral Large | 100% | $0.0078 | 1.0s | |
| Minimax M2.5 | 78% | $0.0029 | 22.7s | |
| ByteDance Seed 1.6 | 100% | $0.0035 | 29.7s | |
| Cohere Command R+ (Aug. 2024) | 82% | $0.010 | 2.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| GPT-5 Mini | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| Grok 4 | 100% | 100% | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | 100% | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | 100% | 100% | |
| Mistral Large 3 | 100% | 100% | 100% | |
| Mistral Large 2 | 100% | 100% | 100% | |
| Mistral Large | 100% | 100% | 100% | |
| Ministral 3 14B | 100% | 100% | 100% | |
| Ministral 3 8B | 100% | 100% | 100% | |
| Ministral 8B | 100% | 100% | 100% | |
| Claude Haiku 4.5 | 95% | 70% | 70% | |
| GPT-4.1 Mini | 93% | 59% | 59% | |
| GPT-5.1 | 93% | 55% | 55% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Ministral 8B | 100% | $0.0004 | 338ms | 100% | |
| Ministral 3 8B | 100% | $0.0006 | 339ms | 100% | |
| Ministral 3 14B | 100% | $0.0008 | 531ms | 100% | |
| Mistral Large 3 | 100% | $0.0020 | 774ms | 100% | |
| GPT-4.1 | 100% | $0.0035 | 613ms | 100% | |
| Grok 4.1 Fast | 100% | $0.0011 | 8.6s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0021 | 15.1s | 100% | |
| Mistral Large 2 | 100% | $0.0078 | 650ms | 100% | |
| Mistral Large | 100% | $0.0078 | 1.0s | 100% | |
| Claude Sonnet 4.6 | 100% | $0.013 | 1.0s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0060 | 21.5s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0035 | 29.7s | 100% | |
| GPT-5 Mini | 100% | $0.0047 | 29.0s | 100% | |
| Claude Opus 4.5 | 100% | $0.021 | 2.0s | 100% | |
| Z.AI GLM 5 | 100% | $0.012 | 57.7s | 100% | |
| o4 Mini High | 100% | $0.023 | 51.7s | 100% | |
| Claude Haiku 4.5 | 95% | $0.0044 | 1.3s | 70% | |
| GPT-4.1 Mini | 93% | $0.0007 | 1.6s | 59% | |
| Grok 4 | 100% | $0.043 | 52.2s | 100% | |
| Grok 4 Fast | 93% | $0.0013 | 13.5s | 55% | |
| Evaluator | Median |
|---|---|
| Correct "no violations" response | 50.0% |
| No hallucinated violations | 67.5% |
Short text (~524 words), big codex (51 detailed entries)
Performance Score Distribution (Top 20)
| Model | Score |
|---|---|
| GPT-5 Mini | 100% | |
| Claude Opus 4.6 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| GPT-4.1 | 100% | |
| o4 Mini | 100% | |
| Grok 4 | 100% | |
| Grok 4 Fast | 100% | |
| Mistral Large 3 | 100% | |
| GPT-5 Nano | 100% | |
| Mistral Large 2 | 100% | |
| Mistral Large | 100% | |
| Ministral 3 8B | 100% | |
| Arcee AI: Trinity Mini | 100% | |
| Ministral 8B | 100% | |
| Z.AI GLM 5 | 93% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Ministral 8B | 100% | $0.0013 | 564ms | |
| Arcee AI: Trinity Mini | 100% | $0.0007 | 6.9s | |
| Ministral 3 8B | 100% | $0.0020 | 589ms | |
| ByteDance Seed 1.6 Flash | 93% | $0.0013 | 7.7s | |
| Grok 4 Fast | 100% | $0.0025 | 16.0s | |
| Grok 4.1 Fast | 100% | $0.0028 | 7.6s | |
| Gemini 2.5 Flash Lite (Reasoning) | 85% | $0.0023 | 14.2s | |
| GPT-4.1 | 100% | $0.0086 | 1.0s | |
| Mistral Large 3 | 100% | $0.0066 | 1.4s | |
| Z.AI GLM 4.6 | 90% | $0.0053 | 12.9s | |
| GPT-5 Nano | 100% | $0.0020 | 36.0s | |
| ByteDance Seed 1.6 | 100% | $0.0050 | 19.7s | |
| Minimax M2.5 | 93% | $0.0049 | 21.4s | |
| GPT-5 Mini | 100% | $0.0049 | 30.6s | |
| Gemini 2.5 Flash (Reasoning) | 93% | $0.0092 | 11.7s | |
| GPT-5.2 | 100% | $0.014 | 11.3s | |
| Z.AI GLM 4.5 | 93% | $0.0079 | 42.0s | |
| o4 Mini | 100% | $0.012 | 20.0s | |
| o4 Mini High | 100% | $0.020 | 33.9s | |
| Mistral Large | 100% | $0.026 | 1.6s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Grok 4 | 100% | 100% | 100% | |
| Grok 4 Fast | 100% | 100% | 100% | |
| Mistral Large 3 | 100% | 100% | 100% | |
| GPT-5 Nano | 100% | 100% | 100% | |
| Mistral Large 2 | 100% | 100% | 100% | |
| Mistral Large | 100% | 100% | 100% | |
| Ministral 3 8B | 100% | 100% | 100% | |
| Arcee AI: Trinity Mini | 100% | 100% | 100% | |
| Ministral 8B | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 90% | 60% | 60% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Ministral 8B | 100% | $0.0013 | 564ms | 100% | |
| Ministral 3 8B | 100% | $0.0020 | 589ms | 100% | |
| Arcee AI: Trinity Mini | 100% | $0.0007 | 6.9s | 100% | |
| Mistral Large 3 | 100% | $0.0066 | 1.4s | 100% | |
| Grok 4.1 Fast | 100% | $0.0028 | 7.6s | 100% | |
| GPT-4.1 | 100% | $0.0086 | 1.0s | 100% | |
| Grok 4 Fast | 100% | $0.0025 | 16.0s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0050 | 19.7s | 100% | |
| GPT-5.2 | 100% | $0.014 | 11.3s | 100% | |
| o4 Mini | 100% | $0.012 | 20.0s | 100% | |
| GPT-5 Mini | 100% | $0.0049 | 30.6s | 100% | |
| Mistral Large 2 | 100% | $0.026 | 1.4s | 100% | |
| Mistral Large | 100% | $0.026 | 1.6s | 100% | |
| GPT-5 Nano | 100% | $0.0020 | 36.0s | 100% | |
| Claude Sonnet 4.6 | 100% | $0.043 | 1.3s | 100% | |
| o4 Mini High | 100% | $0.020 | 33.9s | 100% | |
| Claude Opus 4.5 | 100% | $0.072 | 2.1s | 100% | |
| Claude Opus 4.6 | 100% | $0.082 | 9.9s | 100% | |
| Grok 4 | 100% | $0.063 | 52.0s | 100% | |
| ByteDance Seed 1.6 Flash | 93% | $0.0013 | 7.7s | 55% | |
| Evaluator | Median |
|---|---|
| Correct "no violations" response | 47.5% |
| No hallucinated violations | 56.1% |
Long text (~1594 words), small codex (11 detailed entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-4.1 Nano | 78% | $0.0003 | 1.4s | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 597ms | |
| GPT-4o Mini (temp=0) | 100% | $0.0005 | 670ms | |
| GPT-4.1 | 100% | $0.0045 | 728ms | |
| ByteDance Seed 1.6 Flash | 100% | $0.0008 | 9.1s | |
| Grok 4.1 Fast | 85% | $0.0022 | 20.0s | |
| Minimax M2.5 | 73% | $0.0029 | 22.1s | |
| Gemini 2.5 Flash Lite (Reasoning) | 78% | $0.0030 | 24.9s | |
| Cohere Command R+ (Aug. 2024) | 65% | $0.013 | 2.6s | |
| o4 Mini | 93% | $0.0094 | 17.8s | |
| GPT-5 Mini | 100% | $0.0060 | 40.1s | |
| GPT-5 Nano | 93% | $0.0044 | 1.4m | |
| Z.AI GLM 4.5 | 72% | $0.0047 | 42.7s | |
| Z.AI GLM 5 | 100% | $0.014 | 1.3m | |
| Claude Sonnet 4.6 | 86% | $0.029 | 15.0s | |
| o4 Mini High | 100% | $0.022 | 48.0s | |
| GPT-5.1 | 100% | $0.028 | 32.0s | |
| Z.AI GLM 4.7 Flash | 93% | $0.0038 | 2.5m | |
| GPT-5 | 93% | $0.045 | 1.3m | |
| Z.AI GLM 4.6 | 69% | $0.023 | 1.8m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| GPT-4o Mini (temp=1) | 100% | 100% | 100% | |
| GPT-4o Mini (temp=0) | 100% | 100% | 100% | |
| ByteDance Seed 1.6 Flash | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 93% | 55% | 55% | |
| GPT-5 | 93% | 55% | 55% | |
| Gemini 3 Pro (Preview) | 93% | 55% | 55% | |
| o4 Mini | 93% | 55% | 55% | |
| Z.AI GLM 4.7 Flash | 93% | 55% | 55% | |
| GPT-5 Nano | 93% | 55% | 55% | |
| Claude Haiku 4.5 | 46% | 89% | 44% | |
| Claude Sonnet 4.6 | 86% | 43% | 43% | |
| Grok 4.1 Fast | 85% | 40% | 40% | |
| Claude Opus 4.5 | 39% | 95% | 36% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 597ms | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0005 | 670ms | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0008 | 9.1s | 100% | |
| GPT-4.1 | 100% | $0.0045 | 728ms | 100% | |
| GPT-5 Mini | 100% | $0.0060 | 40.1s | 100% | |
| GPT-5.1 | 100% | $0.028 | 32.0s | 100% | |
| o4 Mini High | 100% | $0.022 | 48.0s | 100% | |
| Z.AI GLM 5 | 100% | $0.014 | 1.3m | 100% | |
| o4 Mini | 93% | $0.0094 | 17.8s | 55% | |
| GPT-5 Nano | 93% | $0.0044 | 1.4m | 55% | |
| Grok 4.1 Fast | 85% | $0.0022 | 20.0s | 40% | |
| Z.AI GLM 4.7 Flash | 93% | $0.0038 | 2.5m | 55% | |
| Claude Sonnet 4.6 | 86% | $0.029 | 15.0s | 43% | |
| GPT-4.1 Nano | 78% | $0.0003 | 1.4s | 29% | |
| GPT-5 | 93% | $0.045 | 1.3m | 55% | |
| Claude Opus 4.6 (Reasoning) | 100% | $0.144 | 1.3m | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 78% | $0.0030 | 24.9s | 31% | |
| Minimax M2.5 | 73% | $0.0029 | 22.1s | 31% | |
| Z.AI GLM 4.5 | 72% | $0.0047 | 42.7s | 28% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | $0.150 | 2.2m | 100% | |
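The page only says the overall ranking combines performance, cost, speed, and stability; it does not state the normalization or weights. A hypothetical sketch under assumed equal weighting, with cost and latency linearly rescaled against assumed caps (`max_cost`, `max_time` are invented parameters, not the benchmark's):

```python
def composite_score(perf: float, cost_usd: float, time_s: float,
                    stability: float, max_cost: float = 0.15,
                    max_time: float = 150.0) -> float:
    """Hypothetical composite score: equal-weight mean of performance,
    cheapness, speed, and stability, each mapped into [0, 1].
    The normalization is an assumption for illustration only."""
    cheapness = 1.0 - min(cost_usd / max_cost, 1.0)  # cheaper -> closer to 1
    speed = 1.0 - min(time_s / max_time, 1.0)        # faster -> closer to 1
    return (perf + cheapness + speed + stability) / 4.0

# A free, instant, perfectly stable model would score 1.0 under this scheme.
print(composite_score(1.0, 0.0, 0.0, 1.0))  # -> 1.0
```

Whatever the real weighting is, the tables above show it rewards the same trade-off: cheap, fast models with perfect stability outrank slightly stronger but erratic ones.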
| Median | Evaluator |
|---|---|
| 32.5% | Correct "no violations" response |
| 38.8% | No hallucinated violations |
Long text (~1594 words), big codex (51 detailed entries)
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Z.AI GLM 5 | 100% | |
| o4 Mini High | 100% | |
| Grok 4.1 Fast | 100% | |
| Aion 2.0 | 100% | |
| GPT-4.1 | 100% | |
| o4 Mini | 100% | |
| Ministral 3 8B | 100% | |
| Ministral 8B | 100% | |
| Ministral 3 14B | 93% | |
| Claude Sonnet 4.6 (Reasoning) | 93% | |
| GPT-5.2 | 93% | |
| Grok 4 Fast | 93% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | |
| GPT-5 Nano | 93% | |
| MoonshotAI: Kimi K2.5 | 92% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 8B | 100% | $0.0014 | 525ms | |
| Ministral 3 8B | 100% | $0.0022 | 678ms | |
| Ministral 3 14B | 93% | $0.0029 | 2.7s | |
| Arcee AI: Trinity Mini | 90% | $0.0018 | 45.5s | |
| GPT-4.1 | 100% | $0.0097 | 972ms | |
| Grok 4.1 Fast | 100% | $0.0031 | 13.4s | |
| ByteDance Seed 1.6 Flash | 84% | $0.0016 | 13.8s | |
| Grok 4 Fast | 93% | $0.0032 | 25.0s | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0038 | 27.3s | |
| Gemini 2.5 Flash (Reasoning) | 85% | $0.015 | 19.0s | |
| GPT-5.2 | 93% | $0.021 | 20.9s | |
| GPT-5 Mini | 100% | $0.0090 | 53.0s | |
| ByteDance Seed 1.6 | 85% | $0.0074 | 45.3s | |
| Minimax M2.5 | 83% | $0.0064 | 50.3s | |
| GPT-5 Nano | 93% | $0.0046 | 1.3m | |
| o4 Mini | 100% | $0.026 | 43.6s | |
| Z.AI GLM 4.5 | 70% | $0.0061 | 34.6s | |
| Gemini 3 Flash (Preview, Reasoning) | 91% | $0.031 | 52.9s | |
| Aion 2.0 | 100% | $0.017 | 1.7m | |
| GPT-5.1 | 100% | $0.051 | 44.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Ministral 3 8B | 100% | 100% | 100% | |
| Ministral 8B | 100% | 100% | 100% | |
| Ministral 3 14B | 93% | 56% | 56% | |
| Claude Sonnet 4.6 (Reasoning) | 93% | 55% | 55% | |
| GPT-5.2 | 93% | 55% | 55% | |
| Grok 4 Fast | 93% | 55% | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | 55% | 55% | |
| GPT-5 Nano | 93% | 55% | 55% | |
| MoonshotAI: Kimi K2.5 | 92% | 50% | 50% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Ministral 8B | 100% | $0.0014 | 525ms | 100% | |
| Ministral 3 8B | 100% | $0.0022 | 678ms | 100% | |
| GPT-4.1 | 100% | $0.0097 | 972ms | 100% | |
| Grok 4.1 Fast | 100% | $0.0031 | 13.4s | 100% | |
| GPT-5 Mini | 100% | $0.0090 | 53.0s | 100% | |
| o4 Mini | 100% | $0.026 | 43.6s | 100% | |
| GPT-5.1 | 100% | $0.051 | 44.5s | 100% | |
| Aion 2.0 | 100% | $0.017 | 1.7m | 100% | |
| Claude Opus 4.6 | 100% | $0.104 | 19.4s | 100% | |
| o4 Mini High | 100% | $0.048 | 1.5m | 100% | |
| Z.AI GLM 5 | 100% | $0.031 | 2.4m | 100% | |
| GPT-5 | 100% | $0.066 | 2.0m | 100% | |
| Ministral 3 14B | 93% | $0.0029 | 2.7s | 56% | |
| Grok 4 Fast | 93% | $0.0032 | 25.0s | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0038 | 27.3s | 55% | |
| GPT-5.2 | 93% | $0.021 | 20.9s | 55% | |
| GPT-5 Nano | 93% | $0.0046 | 1.3m | 55% | |
| Arcee AI: Trinity Mini | 90% | $0.0018 | 45.5s | 40% | |
| Gemini 3 Flash (Preview, Reasoning) | 91% | $0.031 | 52.9s | 46% | |
| ByteDance Seed 1.6 Flash | 84% | $0.0016 | 13.8s | 37% | |
| Median | Evaluator |
|---|---|
| 40.0% | Correct "no violations" response |
| 36.6% | No hallucinated violations |
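The two evaluators above reward responses that correctly report a clean codex rather than inventing violations. A minimal sketch of what such a check could look like; the phrase list and function name are assumptions for illustration, not the benchmark's actual evaluator, which may be rubric- or LLM-based:

```python
def reports_no_violations(response: str) -> bool:
    """Hypothetical pass check: does the response explicitly say the codex
    is clean, instead of listing (hallucinated) violations?
    The accepted phrases below are an invented approximation."""
    text = response.lower()
    clean_phrases = ("no violations", "no inconsistencies", "fully consistent")
    return any(phrase in text for phrase in clean_phrases)

print(reports_no_violations(
    "The codex is fully consistent with the passage; no violations found."
))  # -> True
print(reports_no_violations(
    "Violation: entry 12 contradicts paragraph 3."
))  # -> False
```

A real evaluator would also need to handle hedged answers ("mostly consistent, but…"), which a substring check like this cannot.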