Codex Red Herring (False Positive Detection)
Tests whether models correctly report "no violations" when a codex is fully consistent with the prose passage. Models that hallucinate violations (false positives) fail. Uses a 2×2 matrix of text length × codex size, each with basic and detailed-entry variants, for eight test cases in total.
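The test matrix described above can be enumerated as follows. This is a minimal sketch: the word and entry counts are the ones given in the result column headings on this page, while the names `TEXT_LENGTHS`, `CODEX_SIZES`, `ENTRY_STYLES`, and `variants` are illustrative assumptions, not the benchmark's own identifiers.

```python
from itertools import product

# Word and entry counts as reported in the result column headings.
# All identifiers here are illustrative, not the benchmark's own.
TEXT_LENGTHS = {"short": 524, "long": 1594}  # approximate word counts
CODEX_SIZES = {"small": 11, "big": 51}       # number of codex entries
ENTRY_STYLES = ("basic", "detailed")


def variants():
    """Return one dict per test case: entry style x text length x codex size."""
    cases = []
    for style, (length, words), (size, entries) in product(
        ENTRY_STYLES, TEXT_LENGTHS.items(), CODEX_SIZES.items()
    ):
        cases.append(
            {
                "style": style,
                "text": length,
                "words": words,
                "codex": size,
                "entries": entries,
            }
        )
    return cases


for case in variants():
    print(case)
```

The loop yields eight cases, in the same order as the per-variant columns of the results matrix.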
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| LFM2 24B | 72% | $0.0004 | 39.0s | |
| GPT-4.1 Nano | 63% | $0.0005 | 4.6s | |
| GPT-5.4 Nano | 68% | $0.0005 | 1.5s | |
| Ministral 8B | 68% | $0.0007 | 4.4s | |
| Ministral 3 8B | 78% | $0.0013 | 17.9s | |
| GPT-5.4 Nano (Reasoning, Low) | 97% | $0.0008 | 3.9s | |
| Inception Mercury | 98% | $0.0004 | 8.0s | |
| Inception Mercury 2 | 96% | $0.0027 | 3.8s | |
| Gemma 4 31B | 71% | $0.0009 | 11.7s | |
| GPT-5.4 Nano (Reasoning) | 94% | $0.0017 | 11.4s | |
| GPT-5.4 Mini (Reasoning, Low) | 95% | $0.0034 | 4.0s | |
| ByteDance Seed 1.6 Flash | 94% | $0.0008 | 9.1s | |
| Arcee AI: Trinity Mini | 73% | $0.0009 | 26.3s | |
| Grok 4.1 Fast | 97% | $0.0019 | 12.5s | |
| GPT-4.1 | 86% | $0.0048 | 1.2s | |
| Stealth: Healer Alpha | 84% | $0.0000 | 21.5s | |
| Grok 4 Fast | 73% | $0.0019 | 19.2s | |
| GPT-5.4 Mini (Reasoning) | 89% | $0.0090 | 10.8s | |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | $0.0023 | 16.6s | |
| Gemma 4 26B | 65% | $0.0012 | 39.3s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
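The quadrant split described above can be sketched as follows. This is an illustration under assumptions: the function name, the quadrant labels, and the tie-handling (values equal to the median count as cheap or strong) are hypothetical, and the sample uses only four rows from the price-performance table, whereas the chart takes its medians over all models with cost data.

```python
from statistics import median


def quadrants(models):
    """Split (name, cost, score) tuples into quadrants around the median cost and score."""
    cost_mid = median(cost for _, cost, _ in models)
    score_mid = median(score for _, _, score in models)
    result = {}
    for name, cost, score in models:
        price = "cheap" if cost <= cost_mid else "expensive"
        quality = "strong" if score >= score_mid else "weak"
        result[name] = f"{price}/{quality}"
    return result


# Four rows taken from the price-performance table on this page
# (cost in dollars, score in percent).
sample = [
    ("LFM2 24B", 0.0004, 72),
    ("Inception Mercury", 0.0004, 98),
    ("GPT-4.1", 0.0048, 86),
    ("GPT-5.4 Mini (Reasoning)", 0.0090, 89),
]
print(quadrants(sample))
```

With this four-row sample, Inception Mercury lands in the cheap/strong quadrant and GPT-4.1 in the expensive/weak one. The assignments would shift once the medians are taken over the full model list.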
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Nemotron 3 Super | 99% | 83% | 83% | |
| Inception Mercury | 98% | 77% | 77% | |
| o4 Mini High | 97% | 72% | 72% | |
| Grok 4.1 Fast | 97% | 70% | 70% | |
| GPT-5.4 Nano (Reasoning, Low) | 97% | 70% | 70% | |
| o4 Mini | 96% | 67% | 67% | |
| Inception Mercury 2 | 96% | 67% | 67% | |
| Z.AI GLM 5 Turbo | 96% | 65% | 65% | |
| GPT-5.1 | 95% | 64% | 64% | |
| GPT-5.4 Mini (Reasoning, Low) | 95% | 64% | 64% | |
| Claude Opus 4.6 (Reasoning) | 94% | 60% | 60% | |
| ByteDance Seed 1.6 Flash | 94% | 59% | 59% | |
| GPT-5.4 Nano (Reasoning) | 94% | 57% | 57% | |
| Z.AI GLM 5 | 93% | 56% | 56% | |
| GPT-5 Mini | 93% | 56% | 56% | |
| GPT-5 Nano | 93% | 55% | 55% | |
| Qwen 3.5 Plus (2026-04-20) | 93% | 55% | 55% | |
| Z.AI GLM 5.1 | 93% | 55% | 55% | |
| Gemma 4 26B (Reasoning) | 93% | 54% | 54% | |
| Grok 4.3 (Reasoning) | 92% | 53% | 53% | |
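The stability figure used for this ranking is stated as median × consistency. A minimal sketch of that product follows, with both inputs as fractions. How consistency itself is derived is not stated on this page, so it is treated here as a given input. In the rows above, stability equals consistency, which is what the formula gives if each model's median run score is 100%. Reading the Score column as a different aggregate than the median is an assumption.

```python
def stability(median_score: float, consistency: float) -> float:
    """Stability as median x consistency, with both inputs as fractions in [0, 1]."""
    if not (0.0 <= median_score <= 1.0 and 0.0 <= consistency <= 1.0):
        raise ValueError("inputs must be fractions between 0 and 1")
    return median_score * consistency


# Assumed inputs: a median run score of 100% and the 83% consistency
# listed in the first row of the table.
print(f"{stability(1.0, 0.83):.0%}")  # prints 83%
```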
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Inception Mercury | 98% | $0.0004 | 8.0s | 77% | |
| Nemotron 3 Super | 99% | $0.0000 | 1.3m | 83% | |
| GPT-5.4 Nano (Reasoning, Low) | 97% | $0.0008 | 3.9s | 70% | |
| Grok 4.1 Fast | 97% | $0.0019 | 12.5s | 70% | |
| Inception Mercury 2 | 96% | $0.0027 | 3.8s | 67% | |
| GPT-5.4 Mini (Reasoning, Low) | 95% | $0.0034 | 4.0s | 64% | |
| Z.AI GLM 5 Turbo | 96% | $0.0071 | 16.0s | 65% | |
| ByteDance Seed 1.6 Flash | 94% | $0.0008 | 9.1s | 59% | |
| o4 Mini | 96% | $0.014 | 25.0s | 67% | |
| GPT-5.4 Nano (Reasoning) | 94% | $0.0017 | 11.4s | 57% | |
| o4 Mini High | 97% | $0.027 | 52.5s | 72% | |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | $0.0023 | 16.6s | 53% | |
| GPT-5.1 | 95% | $0.025 | 26.1s | 64% | |
| GPT-5 Mini | 93% | $0.0059 | 37.8s | 56% | |
| GPT-5 Nano | 93% | $0.0035 | 1.1m | 55% | |
| MiniMax M2.7 | 90% | $0.0047 | 34.6s | 48% | |
| DeepSeek V4 Flash (Reasoning) | 89% | $0.0009 | 30.8s | 46% | |
| GPT-5.4 Mini (Reasoning) | 89% | $0.0090 | 10.8s | 46% | |
| Grok 4.3 (Reasoning) | 92% | $0.014 | 50.4s | 53% | |
| GPT-4.1 | 86% | $0.0048 | 1.2s | 42% | |
| Model | Total | Basic: Short text (~524 words), small codex (11 entries) | Basic: Short text (~524 words), big codex (51 entries) | Basic: Long text (~1594 words), small codex (11 entries) | Basic: Long text (~1594 words), big codex (51 entries) | Detailed: Short text (~524 words), small codex (11 detailed entries) | Detailed: Short text (~524 words), big codex (51 detailed entries) | Detailed: Long text (~1594 words), small codex (11 detailed entries) | Detailed: Long text (~1594 words), big codex (51 detailed entries) |
|---|---|---|---|---|---|---|---|---|---|
| Nemotron 3 Super | 99% | 100% | 100% | 100% | 93% | 100% | 100% | 100% | 100% |
| Inception Mercury | 98% | 100% | 93% | 100% | 93% | 100% | 100% | 100% | 100% |
| o4 Mini High | 97% | 93% | 100% | 100% | 85% | 100% | 100% | 100% | 100% |
| Grok 4.1 Fast | 97% | 100% | 100% | 100% | 92% | 100% | 100% | 85% | 100% |
| GPT-5.4 Nano (Reasoning, Low) | 97% | 100% | 90% | 100% | 100% | 93% | 93% | 100% | 100% |
| o4 Mini | 96% | 100% | 100% | 100% | 85% | 93% | 100% | 93% | 100% |
| Inception Mercury 2 | 96% | 100% | 100% | 100% | 100% | 85% | 100% | 100% | 85% |
| Z.AI GLM 5 Turbo | 96% | 100% | 100% | 100% | 68% | 100% | 100% | 100% | 100% |
| GPT-5.1 | 95% | 93% | 93% | 100% | 100% | 93% | 85% | 100% | 100% |
| GPT-5.4 Mini (Reasoning, Low) | 95% | 93% | 85% | 100% | 85% | 100% | 100% | 100% | 100% |
| Claude Opus 4.6 (Reasoning) | 94% | 100% | 100% | 100% | 100% | 85% | 70% | 100% | 100% |
| ByteDance Seed 1.6 Flash | 94% | 100% | 100% | 100% | 93% | 84% | 93% | 100% | 84% |
| GPT-5.4 Nano (Reasoning) | 94% | 100% | 100% | 100% | 58% | 100% | 100% | 93% | 100% |
| Z.AI GLM 5 | 93% | 93% | 78% | 100% | 83% | 100% | 93% | 100% | 100% |
| GPT-5 Mini | 93% | 100% | 100% | 100% | 45% | 100% | 100% | 100% | 100% |
basic entries
Short text (~524 words), small codex (11 entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| LFM2 24B | 100% | $0.0000 | 1.3s | |
| Gemma 4 26B | 100% | $0.0002 | 3.8s | |
| GPT-5.4 Nano | 80% | $0.0003 | 1.0s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0005 | 2.8s | |
| Inception Mercury | 100% | $0.0001 | 3.6s | |
| Arcee AI: Trinity Mini | 81% | $0.0012 | 36.6s | |
| Gemma 4 31B | 100% | $0.0002 | 12.4s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0008 | 3.9s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 5.3s | |
| Inception Mercury 2 | 100% | $0.0013 | 2.0s | |
| GPT-5.5 | 100% | $0.0041 | 1.3s | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 100% | $0.0024 | 850ms | |
| Grok 4 Fast | 77% | $0.0008 | 8.7s | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0012 | 8.6s | |
| Grok 4.1 Fast | 100% | $0.0007 | 7.5s | |
| GPT-5.4 Mini (Reasoning, Low) | 93% | $0.0019 | 3.6s | |
| Llama 3.1 Nemotron 70B | 67% | $0.0021 | 9.0s | |
| Mistral Small 4 (Reasoning) | 100% | $0.0009 | 9.3s | |
| Qwen 3 32B | 78% | $0.0004 | 12.3s | |
| Cohere Command R+ (Aug. 2024) | 65% | $0.0046 | 2.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.7 Max | 100% | 100% | 100% | |
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Gemini 3.5 Flash (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Reasoning) | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| MiniMax M2.7 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| LFM2 24B | 100% | $0.0000 | 1.3s | 100% | |
| Inception Mercury | 100% | $0.0001 | 3.6s | 100% | |
| Gemma 4 26B | 100% | $0.0002 | 3.8s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0005 | 2.8s | 100% | |
| Inception Mercury 2 | 100% | $0.0013 | 2.0s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0008 | 3.9s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 5.3s | 100% | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 100% | $0.0024 | 850ms | 100% | |
| Grok 4.1 Fast | 100% | $0.0007 | 7.5s | 100% | |
| Mistral Small 4 (Reasoning) | 100% | $0.0009 | 9.3s | 100% | |
| Gemma 4 31B | 100% | $0.0002 | 12.4s | 100% | |
| GPT-5.5 | 100% | $0.0041 | 1.3s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0036 | 9.8s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0039 | 9.0s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0017 | 16.9s | 100% | |
| Z.AI GLM 4.5 Air | 100% | $0.0012 | 19.2s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0051 | 8.5s | 100% | |
| MiniMax M2.7 | 100% | $0.0014 | 22.4s | 100% | |
| Claude Opus 4.7 | 100% | $0.011 | 940ms | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | $0.011 | 2.0s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 50.0% | Correct "no violations" response | | |
| 65.0% | No hallucinated violations | | |
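One way to read the two evaluators above is as a check on whether a model reported any violation for a codex that has none. The sketch below assumes each run's output has already been parsed into a list of reported violations. The parsing step, the `runs` structure, the example violation string, and the function name are all hypothetical, and the benchmark's actual scoring (including any partial credit) is not documented on this page.

```python
def false_positive_rate(runs):
    """Fraction of runs that reported at least one violation.

    Every codex in this test is consistent with its passage, so any
    reported violation counts as a false positive.
    """
    if not runs:
        raise ValueError("need at least one run")
    flagged = sum(1 for reported in runs if reported)
    return flagged / len(runs)


# Hypothetical parsed outputs from four runs of one model; an empty
# list stands for a "no violations" response.
runs = [[], [], ["Eye colour contradicts codex entry"], []]
print(false_positive_rate(runs))  # 0.25
```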
Short text (~524 words), big codex (51 entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| LFM2 24B | 100% | $0.0001 | 2.4s | |
| Ministral 8B | 68% | $0.0003 | 3.4s | |
| Gemma 4 26B | 95% | $0.0003 | 2.8s | |
| Ministral 3 8B | 93% | $0.0004 | 1.2s | |
| GPT-5.4 Nano | 73% | $0.0004 | 1.4s | |
| Gemma 4 31B | 100% | $0.0004 | 1.7s | |
| Inception Mercury | 93% | $0.0002 | 5.4s | |
| GPT-5.4 Nano (Reasoning, Low) | 90% | $0.0006 | 3.2s | |
| GPT-4.1 | 100% | $0.0020 | 737ms | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 6.7s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 5.1s | |
| Arcee AI: Trinity Mini | 83% | $0.0003 | 15.8s | |
| Grok 4.1 Fast | 100% | $0.0012 | 9.6s | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0011 | 6.6s | |
| Hermes 3 405B | 75% | $0.0026 | 1.7s | |
| Qwen 3 32B | 93% | $0.0004 | 10.6s | |
| Inception Mercury 2 | 100% | $0.0024 | 3.6s | |
| GPT-5.4 Mini (Reasoning, Low) | 85% | $0.0023 | 3.5s | |
| DeepSeek V4 Flash (Reasoning) | 85% | $0.0005 | 18.9s | |
| GPT-5.5 | 100% | $0.0069 | 1.6s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.7 Max | 100% | 100% | 100% | |
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Gemini 3.5 Flash (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Reasoning) | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| Qwen 3.6 27B | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| GPT-5.5 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| LFM2 24B | 100% | $0.0001 | 2.4s | 100% | |
| Gemma 4 31B | 100% | $0.0004 | 1.7s | 100% | |
| GPT-4.1 | 100% | $0.0020 | 737ms | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 5.1s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 6.7s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0011 | 6.6s | 100% | |
| Inception Mercury 2 | 100% | $0.0024 | 3.6s | 100% | |
| Grok 4.1 Fast | 100% | $0.0012 | 9.6s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0034 | 7.9s | 100% | |
| GPT-5.5 | 100% | $0.0069 | 1.6s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0021 | 16.2s | 100% | |
| GPT-5 Mini | 100% | $0.0044 | 28.8s | 100% | |
| o4 Mini | 100% | $0.010 | 20.6s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 51.1s | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | $0.019 | 997ms | 100% | |
| Grok 4.20 (Reasoning) | 100% | $0.0095 | 31.1s | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | $0.011 | 1.1m | 100% | |
| o4 Mini High | 100% | $0.020 | 41.1s | 100% | |
| Claude Opus 4.6 | 100% | $0.031 | 12.1s | 100% | |
| Z.AI GLM 5.1 | 100% | $0.013 | 1.3m | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 50.0% | Correct "no violations" response | | |
| 65.0% | No hallucinated violations | | |
Long text (~1594 words), small codex (11 entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-4.1 Nano | 93% | $0.0001 | 962ms | |
| LFM2 24B | 100% | $0.0001 | 1.8s | |
| Gemini 2.5 Flash Lite | 74% | $0.0003 | 516ms | |
| Mistral Small 3.2 24B | 72% | $0.0003 | 1.0s | |
| Ministral 3 8B | 90% | $0.0004 | 1.4s | |
| Grok 4.20 (Beta) | 77% | $0.0018 | 984ms | |
| GPT-5.4 Nano | 66% | $0.0004 | 1.6s | |
| Gemini 3.1 Flash Lite | 85% | $0.0007 | 741ms | |
| Gemini 3.1 Flash Lite (Reasoning) | 79% | $0.0008 | 874ms | |
| Gemini 3.1 Flash Lite (Preview) | 70% | $0.0008 | 882ms | |
| Gemma 4 31B | 100% | $0.0004 | 20.6s | |
| Inception Mercury | 100% | $0.0002 | 3.7s | |
| Mistral Medium 3.1 | 95% | $0.0011 | 764ms | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 3.6s | |
| Skyfall 36B V2 | 69% | $0.0010 | 1.9s | |
| GPT-4.1 | 100% | $0.0022 | 815ms | |
| Inception Mercury 2 | 100% | $0.0013 | 2.0s | |
| Cydonia 24B V4.1 | 62% | $0.0008 | 9.4s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 3.7s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 7.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.7 Max | 100% | 100% | 100% | |
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| LFM2 24B | 100% | $0.0001 | 1.8s | 100% | |
| Inception Mercury | 100% | $0.0002 | 3.7s | 100% | |
| Inception Mercury 2 | 100% | $0.0013 | 2.0s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 3.6s | 100% | |
| GPT-4.1 | 100% | $0.0022 | 815ms | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 3.7s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0021 | 3.0s | 100% | |
| Grok 4.1 Fast | 100% | $0.0009 | 6.1s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 7.4s | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 15.5s | 100% | |
| Grok 4 Fast | 100% | $0.0010 | 13.9s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0039 | 9.7s | 100% | |
| DeepSeek V4 Flash (Reasoning) | 100% | $0.0005 | 19.3s | 100% | |
| Gemma 4 31B | 100% | $0.0004 | 20.6s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0065 | 7.8s | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.0077 | 5.5s | 100% | |
| Xiaomi MIMO v2.5 | 100% | $0.0035 | 14.7s | 100% | |
| GPT-5.2 | 100% | $0.0083 | 9.6s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0074 | 11.8s | 100% | |
| MiniMax M2.7 | 100% | $0.0020 | 27.5s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 65.0% | Correct "no violations" response | | |
| 73.5% | No hallucinated violations | | |
Long text (~1594 words), big codex (51 entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-4.1 Nano | 93% | $0.0003 | 2.2s | |
| LFM2 24B | 100% | $0.0001 | 1.3s | |
| Ministral 3 8B | 93% | $0.0006 | 1.9s | |
| Cydonia 24B V4.1 | 71% | $0.0008 | 4.4s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0011 | 6.4s | |
| Inception Mercury | 93% | $0.0002 | 11.6s | |
| ByteDance Seed 1.6 Flash | 93% | $0.0010 | 14.8s | |
| Inception Mercury 2 | 100% | $0.0039 | 5.5s | |
| GPT-5.4 Mini (Reasoning, Low) | 85% | $0.0048 | 6.5s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 27.0s | |
| Mistral Small 4 (Reasoning) | 92% | $0.0026 | 21.1s | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0033 | 22.0s | |
| Grok 4.1 Fast | 92% | $0.0028 | 27.0s | |
| DeepSeek V4 Flash (Reasoning) | 78% | $0.0010 | 39.3s | |
| Z.AI GLM 5 Turbo | 68% | $0.0096 | 25.3s | |
| Gemini 2.5 Flash (Reasoning) | 85% | $0.012 | 18.1s | |
| GPT-5.2 | 92% | $0.015 | 17.4s | |
| Xiaomi MIMO v2.5 | 93% | $0.011 | 47.4s | |
| GPT-OSS 120B | 85% | $0.0013 | 1.2m | |
| GPT-5 Nano | 100% | $0.0038 | 1.3m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| DeepSeek V4 Pro (Reasoning) | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Stealth: Healer Alpha | 100% | 100% | 100% | |
| Inception Mercury 2 | 100% | 100% | 100% | |
| GPT-5 Nano | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | 100% | 100% | |
| LFM2 24B | 100% | 100% | 100% | |
| Claude Opus 4.6 | 94% | 61% | 61% | |
| Ministral 3 8B | 93% | 56% | 56% | |
| GPT-4.1 Nano | 93% | 55% | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | 55% | 55% | |
| Xiaomi MIMO v2.5 | 93% | 55% | 55% | |
| Nemotron 3 Super | 93% | 55% | 55% | |
| Inception Mercury | 93% | 55% | 55% | |
| Nemotron 3 Nano | 93% | 55% | 55% | |
| ByteDance Seed 1.6 Flash | 93% | 55% | 55% | |
| Grok 4.3 (Reasoning) | 92% | 50% | 50% | |
| GPT-5 | 92% | 50% | 50% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| LFM2 24B | 100% | $0.0001 | 1.3s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0011 | 6.4s | 100% | |
| Inception Mercury 2 | 100% | $0.0039 | 5.5s | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 27.0s | 100% | |
| GPT-5 Nano | 100% | $0.0038 | 1.3m | 100% | |
| GPT-5.1 | 100% | $0.029 | 28.6s | 100% | |
| Aion 2.0 | 100% | $0.0086 | 1.3m | 100% | |
| Ministral 3 8B | 93% | $0.0006 | 1.9s | 56% | |
| GPT-4.1 Nano | 93% | $0.0003 | 2.2s | 55% | |
| Inception Mercury | 93% | $0.0002 | 11.6s | 55% | |
| ByteDance Seed 1.6 Flash | 93% | $0.0010 | 14.8s | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0033 | 22.0s | 55% | |
| Mistral Small 4 (Reasoning) | 92% | $0.0026 | 21.1s | 50% | |
| DeepSeek V4 Pro (Reasoning) | 100% | $0.013 | 4.9m | 100% | |
| Grok 4.1 Fast | 92% | $0.0028 | 27.0s | 50% | |
| Xiaomi MIMO v2.5 | 93% | $0.011 | 47.4s | 55% | |
| GPT-5.2 | 92% | $0.015 | 17.4s | 50% | |
| Claude Opus 4.6 | 94% | $0.048 | 16.8s | 61% | |
| GPT-5.4 Mini (Reasoning, Low) | 85% | $0.0048 | 6.5s | 40% | |
| Nemotron 3 Super | 93% | $0.0000 | 2.3m | 55% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 30.0% | Correct "no violations" response | | |
| 40.0% | No hallucinated violations | | |
detailed entries
Short text (~524 words), small codex (11 detailed entries)
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Z.AI GLM 5.1 | 100% | |
| Z.AI GLM 5 Turbo | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| Gemma 4 31B (Reasoning) | 100% | |
| Gemma 4 26B (Reasoning) | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| Qwen 3.5 27B | 100% | |
| ByteDance Seed 1.6 | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | |
| o4 Mini High | 100% | |
| Claude Opus 4.7 | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| GPT-4.1 | 100% | |
| Grok 4 | 100% | |
| Gemma 4 31B | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| LFM2 24B | 62% | $0.0004 | 56.5s | |
| Ministral 8B | 100% | $0.0004 | 338ms | |
| Mistral Small 3.2 24B | 63% | $0.0004 | 1.7s | |
| GPT-5.4 Nano | 83% | $0.0005 | 1.3s | |
| Ministral 3 8B | 100% | $0.0006 | 339ms | |
| Gemma 4 26B | 100% | $0.0004 | 3.9s | |
| Gemma 4 31B | 100% | $0.0005 | 7.1s | |
| GPT-4.1 Mini | 93% | $0.0007 | 1.6s | |
| Cydonia 24B V4.1 | 83% | $0.0007 | 1.4s | |
| Ministral 3 14B | 100% | $0.0008 | 531ms | |
| GPT-5.4 Nano (Reasoning, Low) | 93% | $0.0006 | 2.8s | |
| Arcee AI: Trinity Mini | 93% | $0.0013 | 41.5s | |
| Inception Mercury | 100% | $0.0003 | 6.3s | |
| Mistral Large 3 | 100% | $0.0020 | 774ms | |
| ByteDance Seed 1.6 Flash | 84% | $0.0006 | 7.9s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0023 | 3.9s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0012 | 7.5s | |
| Grok 4.1 Fast | 100% | $0.0011 | 8.6s | |
| GPT-4.1 | 100% | $0.0035 | 613ms | |
| Inception Mercury 2 | 85% | $0.0023 | 3.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| Claude Opus 4.7 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| Grok 4 | 100% | 100% | 100% | |
| Gemma 4 31B | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Ministral 8B | 100% | $0.0004 | 338ms | 100% | |
| Ministral 3 8B | 100% | $0.0006 | 339ms | 100% | |
| Ministral 3 14B | 100% | $0.0008 | 531ms | 100% | |
| Gemma 4 26B | 100% | $0.0004 | 3.9s | 100% | |
| Mistral Large 3 | 100% | $0.0020 | 774ms | 100% | |
| Inception Mercury | 100% | $0.0003 | 6.3s | 100% | |
| Gemma 4 31B | 100% | $0.0005 | 7.1s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0023 | 3.9s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0012 | 7.5s | 100% | |
| GPT-4.1 | 100% | $0.0035 | 613ms | 100% | |
| Grok 4.1 Fast | 100% | $0.0011 | 8.6s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0050 | 6.1s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0021 | 15.1s | 100% | |
| Mistral Large 2 | 100% | $0.0078 | 650ms | 100% | |
| Mistral Large | 100% | $0.0078 | 1.0s | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.0068 | 5.9s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0055 | 11.5s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0060 | 21.5s | 100% | |
| Claude Sonnet 4.6 | 100% | $0.013 | 1.0s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0035 | 29.7s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 60.0% | Correct "no violations" response | | |
| 75.0% | No hallucinated violations | | |
Short text (~524 words), big codex (51 detailed entries)
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Qwen3.7 Max | 100% | |
| Z.AI GLM 5 Turbo | 100% | |
| GPT-5.4 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| Claude Opus 4.6 | 100% | |
| Gemma 4 31B (Reasoning) | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | |
| Gemma 4 26B (Reasoning) | 100% | |
| Claude Sonnet 4.6 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| MiniMax M2.7 | 100% | |
| DeepSeek V4 Flash (Reasoning) | 100% | |
| GPT-4.1 | 100% | |
| o4 Mini | 100% | |
| Grok 4 | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-5.4 Nano | 74% | $0.0006 | 1.4s | |
| GPT-5.4 Nano (Reasoning, Low) | 93% | $0.0008 | 3.4s | |
| Ministral 8B | 100% | $0.0013 | 564ms | |
| Gemma 4 26B | 88% | $0.0012 | 5.3s | |
| Arcee AI: Trinity Mini | 100% | $0.0007 | 6.9s | |
| Gemma 4 31B | 100% | $0.0018 | 3.1s | |
| Inception Mercury | 100% | $0.0009 | 8.6s | |
| Ministral 3 8B | 100% | $0.0020 | 589ms | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0038 | 2.7s | |
| ByteDance Seed 1.6 Flash | 93% | $0.0013 | 7.7s | |
| Inception Mercury 2 | 100% | $0.0031 | 3.7s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0020 | 8.1s | |
| Grok 4 Fast | 100% | $0.0025 | 16.0s | |
| Grok 4.1 Fast | 100% | $0.0028 | 7.6s | |
| DeepSeek V4 Flash (Reasoning) | 100% | $0.0009 | 19.7s | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0063 | 5.2s | |
| Gemini 2.5 Flash Lite (Reasoning) | 85% | $0.0023 | 14.2s | |
| Mistral Small 4 (Reasoning) | 93% | $0.0028 | 23.8s | |
| GPT-4.1 | 100% | $0.0086 | 1.0s | |
| Mistral Large 3 | 100% | $0.0066 | 1.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.7 Max | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| MiniMax M2.7 | 100% | 100% | 100% | |
| DeepSeek V4 Flash (Reasoning) | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Grok 4 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Ministral 8B | 100% | $0.0013 | 564ms | 100% | |
| Ministral 3 8B | 100% | $0.0020 | 589ms | 100% | |
| Gemma 4 31B | 100% | $0.0018 | 3.1s | 100% | |
| Arcee AI: Trinity Mini | 100% | $0.0007 | 6.9s | 100% | |
| Inception Mercury 2 | 100% | $0.0031 | 3.7s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0038 | 2.7s | 100% | |
| Inception Mercury | 100% | $0.0009 | 8.6s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0020 | 8.1s | 100% | |
| Grok 4.1 Fast | 100% | $0.0028 | 7.6s | 100% | |
| Mistral Large 3 | 100% | $0.0066 | 1.4s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0063 | 5.2s | 100% | |
| GPT-4.1 | 100% | $0.0086 | 1.0s | 100% | |
| Grok 4 Fast | 100% | $0.0025 | 16.0s | 100% | |
| DeepSeek V4 Flash (Reasoning) | 100% | $0.0009 | 19.7s | 100% | |
| MiniMax M2.7 | 100% | $0.0045 | 17.4s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0077 | 13.6s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0050 | 19.7s | 100% | |
| GPT-5.2 | 100% | $0.014 | 11.3s | 100% | |
| GPT-5 Mini | 100% | $0.0049 | 30.6s | 100% | |
| GPT-5 Nano | 100% | $0.0020 | 36.0s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 60.0% | Correct "no violations" response | | |
| 80.0% | No hallucinated violations | | |
Long text (~1594 words), small codex (11 detailed entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| LFM2 24B | 100% | $0.0002 | 1.3s | |
| GPT-4.1 Nano | 78% | $0.0003 | 1.4s | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 597ms | |
| GPT-4o Mini (temp=0) | 100% | $0.0005 | 670ms | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 3.0s | |
| Inception Mercury | 100% | $0.0003 | 4.6s | |
| Inception Mercury 2 | 100% | $0.0018 | 2.7s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0029 | 3.0s | |
| GPT-5.4 Nano (Reasoning) | 93% | $0.0014 | 6.7s | |
| GPT-4.1 | 100% | $0.0045 | 728ms | |
| ByteDance Seed 1.6 Flash | 100% | $0.0008 | 9.1s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 15.3s | |
| Grok 4.1 Fast | 85% | $0.0022 | 20.0s | |
| GPT-OSS 120B | 93% | $0.0008 | 34.8s | |
| DeepSeek V4 Flash (Reasoning) | 85% | $0.0010 | 35.4s | |
| Z.AI GLM 5 Turbo | 100% | $0.0064 | 16.6s | |
| MiniMax M2.5 | 73% | $0.0029 | 22.1s | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0084 | 8.9s | |
| Mistral Small 4 (Reasoning) | 78% | $0.0030 | 31.2s | |
| Gemini 2.5 Flash Lite (Reasoning) | 78% | $0.0030 | 24.9s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Gemini 3.5 Flash (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| Stealth: Healer Alpha | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | 100% | 100% | |
| Nemotron 3 Super | 100% | 100% | 100% | |
| Inception Mercury 2 | 100% | 100% | 100% | |
| Inception Mercury | 100% | 100% | 100% | |
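The stability ranking above multiplies a model's median score by its consistency. A minimal Python sketch of that arithmetic, assuming both values are expressed as fractions between 0 and 1. How the leaderboard derives consistency is not stated here, so it is taken as a given input; the function names and the second example row are hypothetical.

```python
from typing import List, Tuple

Row = Tuple[str, float, float]  # (model name, median score, consistency)


def stability(median_score: float, consistency: float) -> float:
    """Return median x consistency, with both inputs as fractions in [0, 1]."""
    for value in (median_score, consistency):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs must be fractions between 0 and 1")
    return median_score * consistency


def rank_by_stability(rows: List[Row]) -> List[Row]:
    """Sort rows from most to least stable (assumed ordering: descending)."""
    return sorted(rows, key=lambda r: stability(r[1], r[2]), reverse=True)


rows = [
    ("hypothetical-model", 0.90, 0.80),  # illustrative values, not from the page
    ("GPT-4.1", 1.00, 1.00),             # 100% score, 100% consistency in the table
]

for name, med, cons in rank_by_stability(rows):
    print(f"{name}: {stability(med, cons):.0%}")
```

With both inputs at 100%, as in every row of the table above, the product is 100%; a drop in either factor lowers stability proportionally.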
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 597ms | 100% | |
| LFM2 24B | 100% | $0.0002 | 1.3s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0005 | 670ms | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 3.0s | 100% | |
| Inception Mercury | 100% | $0.0003 | 4.6s | 100% | |
| Inception Mercury 2 | 100% | $0.0018 | 2.7s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0029 | 3.0s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0008 | 9.1s | 100% | |
| GPT-4.1 | 100% | $0.0045 | 728ms | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 15.3s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0084 | 8.9s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0064 | 16.6s | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.010 | 9.1s | 100% | |
| GPT-5 Mini | 100% | $0.0060 | 40.1s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 1.1m | 100% | |
| GPT-5.4 (Reasoning) | 100% | $0.029 | 27.7s | 100% | |
| GPT-5.1 | 100% | $0.028 | 32.0s | 100% | |
| o4 Mini High | 100% | $0.022 | 48.0s | 100% | |
| Z.AI GLM 5 | 100% | $0.014 | 1.3m | 100% | |
| Z.AI GLM 5.1 | 100% | $0.021 | 1.8m | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 50.0% | Correct "no violations" response | | |
| 60.0% | No hallucinated violations | | |
Long text (~1594 words), big codex (51 detailed entries)
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Qwen3.6 Max Preview | 100% | |
| Z.AI GLM 5.1 | 100% | |
| Z.AI GLM 5 Turbo | 100% | |
| Grok 4.3 (Reasoning) | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | |
| Gemma 4 26B (Reasoning) | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | |
| Grok 4.20 (Reasoning) | 100% | |
| Z.AI GLM 5 | 100% | |
| o4 Mini High | 100% | |
| Qwen 3.6 27B | 100% | |
| Grok 4.1 Fast | 100% | |
| Aion 2.0 | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-5.4 Nano | 78% | $0.0008 | 1.0s | |
| Ministral 8B | 100% | $0.0014 | 525ms | |
| Ministral 3 8B | 100% | $0.0022 | 678ms | |
| Ministral 3 14B | 93% | $0.0029 | 2.7s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0014 | 6.3s | |
| Arcee AI: Trinity Mini | 90% | $0.0018 | 45.5s | |
| GPT-4.1 | 100% | $0.0097 | 972ms | |
| Grok 4.1 Fast | 100% | $0.0031 | 13.4s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0073 | 5.9s | |
| Inception Mercury 2 | 85% | $0.0056 | 7.7s | |
| ByteDance Seed 1.6 Flash | 84% | $0.0016 | 13.8s | |
| Grok 4 Fast | 93% | $0.0032 | 25.0s | |
| Inception Mercury | 100% | $0.0007 | 20.5s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0035 | 40.9s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 27.7s | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0038 | 27.3s | |
| ByteDance Seed 2.0 Lite | 100% | $0.0056 | 25.4s | |
| Gemini 2.5 Flash (Reasoning) | 85% | $0.015 | 19.0s | |
| GPT-5.4 (Reasoning, Low) | 92% | $0.019 | 14.5s | |
| GPT-5.2 | 93% | $0.021 | 20.9s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| Qwen 3.6 27B | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Ministral 8B | 100% | $0.0014 | 525ms | 100% | |
| Ministral 3 8B | 100% | $0.0022 | 678ms | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0014 | 6.3s | 100% | |
| GPT-4.1 | 100% | $0.0097 | 972ms | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0073 | 5.9s | 100% | |
| Grok 4.1 Fast | 100% | $0.0031 | 13.4s | 100% | |
| Inception Mercury | 100% | $0.0007 | 20.5s | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 27.7s | 100% | |
| ByteDance Seed 2.0 Lite | 100% | $0.0056 | 25.4s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0035 | 40.9s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.016 | 33.5s | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | $0.033 | 18.8s | 100% | |
| GPT-5 Mini | 100% | $0.0090 | 53.0s | 100% | |
| Stealth: Hunter Alpha | 100% | $0.0000 | 1.1m | 100% | |
| DeepSeek V4 Flash (Reasoning) | 100% | $0.0021 | 1.2m | 100% | |
| o4 Mini | 100% | $0.026 | 43.6s | 100% | |
| Xiaomi MIMO v2.5 | 100% | $0.017 | 1.1m | 100% | |
| Grok 4.20 (Reasoning) | 100% | $0.027 | 57.4s | 100% | |
| Grok 4.3 (Reasoning) | 100% | $0.027 | 1.1m | 100% | |
| GPT-5.1 | 100% | $0.051 | 44.5s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 50.0% | Correct "no violations" response | | |
| 63.3% | No hallucinated violations | | |