Codex Red Herring (False Positive Detection)
Tests whether models correctly report "no violations" when a codex is fully consistent with the prose passage. Models that hallucinate violations (false positives) fail. Uses a 2×2 matrix of text length × codex size, each in basic-entry and detailed-entry variants.
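As a rough illustration of what such a harness checks (the benchmark's exact rubric is not shown here, so the pass condition and penalty scheme below are assumptions), a false-positive scorer can be sketched as:

```python
# Hypothetical scorer for a false-positive (red herring) run.
# Assumption: a run passes outright if the model reports no violations
# against the fully consistent codex, and loses a share of the score
# for each violation it hallucinates.

def score_run(reported_violations: list[str], max_penalized: int = 4) -> float:
    """Return a 0-100 score for one run against a fully consistent codex."""
    if not reported_violations:
        return 100.0  # correct "no violations" answer
    # Each hallucinated violation costs an equal share, capped at zero.
    penalty = min(len(reported_violations), max_penalized) / max_penalized
    return round(100.0 * (1.0 - penalty), 1)

print(score_run([]))                      # clean pass
print(score_run(["eye color mismatch"]))  # one hallucinated violation
```

Under this sketch a single invented violation still salvages partial credit, which matches the many 85-93% cell scores in the tables below, though the real weighting may differ.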
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| LFM2 24B | 72% | $0.0004 | 39.0s | |
| GPT-4.1 Nano | 63% | $0.0005 | 4.6s | |
| GPT-5.4 Nano | 68% | $0.0005 | 1.5s | |
| Ministral 8B | 68% | $0.0007 | 4.4s | |
| Ministral 3 8B | 78% | $0.0013 | 17.9s | |
| GPT-5.4 Nano (Reasoning, Low) | 97% | $0.0008 | 3.9s | |
| Inception Mercury | 98% | $0.0004 | 8.0s | |
| Inception Mercury 2 | 96% | $0.0027 | 3.8s | |
| GPT-5.4 Nano (Reasoning) | 94% | $0.0017 | 11.4s | |
| GPT-5.4 Mini (Reasoning, Low) | 95% | $0.0034 | 4.0s | |
| ByteDance Seed 1.6 Flash | 94% | $0.0008 | 9.1s | |
| Arcee AI: Trinity Mini | 73% | $0.0009 | 26.3s | |
| Grok 4.1 Fast | 97% | $0.0019 | 12.5s | |
| GPT-4.1 | 86% | $0.0048 | 1.2s | |
| Stealth: Healer Alpha | 84% | $0.0000 | 21.5s | |
| Grok 4 Fast | 73% | $0.0019 | 19.2s | |
| GPT-5.4 Mini (Reasoning) | 89% | $0.0090 | 10.8s | |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | $0.0023 | 16.6s | |
| Mistral Small 4 (Reasoning) | 83% | $0.0025 | 22.6s | |
| Z.AI GLM 5 Turbo | 96% | $0.0071 | 16.0s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
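The quadrant logic described above can be sketched minimally as follows, assuming plain medians over the models with cost data (the names and numbers are illustrative, not taken from the chart):

```python
# Classify models into cost/performance quadrants, with the quadrant
# boundaries drawn at the median cost and median score (example data).
from statistics import median

models = {  # name: (cost_usd, score_pct) -- made-up illustrative values
    "A": (0.0004, 98), "B": (0.0019, 97), "C": (0.0090, 89), "D": (0.0048, 86),
}

cost_med = median(c for c, _ in models.values())
score_med = median(s for _, s in models.values())

def quadrant(cost: float, score: float) -> str:
    cheap = cost <= cost_med
    good = score >= score_med
    return {
        (True, True): "cheap & strong",
        (True, False): "cheap & weak",
        (False, True): "pricey & strong",
        (False, False): "pricey & weak",
    }[(cheap, good)]

for name, (cost, score) in models.items():
    print(name, quadrant(cost, score))
```

Using the median rather than the mean keeps the quadrant boundaries robust to the handful of very expensive reasoning models.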
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Nemotron 3 Super | 99% | 83% | 83% | |
| Inception Mercury | 98% | 77% | 77% | |
| o4 Mini High | 97% | 72% | 72% | |
| Grok 4.1 Fast | 97% | 70% | 70% | |
| GPT-5.4 Nano (Reasoning, Low) | 97% | 70% | 70% | |
| o4 Mini | 96% | 67% | 67% | |
| Inception Mercury 2 | 96% | 67% | 67% | |
| Z.AI GLM 5 Turbo | 96% | 65% | 65% | |
| GPT-5.1 | 95% | 64% | 64% | |
| GPT-5.4 Mini (Reasoning, Low) | 95% | 64% | 64% | |
| Claude Opus 4.6 (Reasoning) | 94% | 60% | 60% | |
| ByteDance Seed 1.6 Flash | 94% | 59% | 59% | |
| GPT-5.4 Nano (Reasoning) | 94% | 57% | 57% | |
| Z.AI GLM 5 | 93% | 56% | 56% | |
| GPT-5 Mini | 93% | 56% | 56% | |
| GPT-5 Nano | 93% | 55% | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | 53% | 53% | |
| MiniMax M2.7 | 90% | 48% | 48% | |
| GPT-5.4 Mini (Reasoning) | 89% | 46% | 46% | |
| Aion 2.0 | 88% | 45% | 45% | |
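The stability ranking above is defined as median × consistency. A plausible reading, sketched below under the assumption that "median" means the median per-run score rather than the aggregate score shown in the table, is:

```python
# Stability as (median per-run score) x (consistency), both in [0, 1].
# Assumption: "median" refers to the median score across repeated runs,
# which would explain how a model with a 99% aggregate score can still
# land at 83% stability when its consistency is 83%.
from statistics import median

def stability(run_scores: list[float], consistency: float) -> float:
    """Both inputs as fractions in [0, 1]; returns stability as a fraction."""
    return median(run_scores) * consistency

# Illustrative: a perfect median run score leaves stability = consistency.
print(round(stability([1.0, 1.0, 0.93, 1.0], 0.83), 2))
```

This would make stability collapse to the consistency value whenever the median run is perfect, which matches the paired columns in the table.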
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Inception Mercury | 98% | $0.0004 | 8.0s | 77% | |
| Nemotron 3 Super | 99% | $0.0000 | 1.3m | 83% | |
| GPT-5.4 Nano (Reasoning, Low) | 97% | $0.0008 | 3.9s | 70% | |
| Grok 4.1 Fast | 97% | $0.0019 | 12.5s | 70% | |
| Inception Mercury 2 | 96% | $0.0027 | 3.8s | 67% | |
| GPT-5.4 Mini (Reasoning, Low) | 95% | $0.0034 | 4.0s | 64% | |
| Z.AI GLM 5 Turbo | 96% | $0.0071 | 16.0s | 65% | |
| ByteDance Seed 1.6 Flash | 94% | $0.0008 | 9.1s | 59% | |
| o4 Mini | 96% | $0.014 | 25.0s | 67% | |
| GPT-5.4 Nano (Reasoning) | 94% | $0.0017 | 11.4s | 57% | |
| o4 Mini High | 97% | $0.027 | 52.5s | 72% | |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | $0.0023 | 16.6s | 53% | |
| GPT-5.1 | 95% | $0.025 | 26.1s | 64% | |
| GPT-5 Mini | 93% | $0.0059 | 37.8s | 56% | |
| GPT-5 Nano | 93% | $0.0035 | 1.1m | 55% | |
| MiniMax M2.7 | 90% | $0.0047 | 34.6s | 48% | |
| GPT-5.4 Mini (Reasoning) | 89% | $0.0090 | 10.8s | 46% | |
| GPT-4.1 | 86% | $0.0048 | 1.2s | 42% | |
| Gemini 2.5 Flash (Reasoning) | 87% | $0.0085 | 14.2s | 43% | |
| Z.AI GLM 5 | 93% | $0.017 | 1.4m | 56% | |
| Model | Total ▼ | Short text (~524 words), small codex (11 entries) | Short text (~524 words), big codex (51 entries) | Long text (~1594 words), small codex (11 entries) | Long text (~1594 words), big codex (51 entries) | Short text (~524 words), small codex (11 detailed entries) | Short text (~524 words), big codex (51 detailed entries) | Long text (~1594 words), small codex (11 detailed entries) | Long text (~1594 words), big codex (51 detailed entries) |
|---|---|---|---|---|---|---|---|---|---|
| Nemotron 3 Super | 99% | 100% | 100% | 100% | 93% | 100% | 100% | 100% | 100% |
| Inception Mercury | 98% | 100% | 93% | 100% | 93% | 100% | 100% | 100% | 100% |
| o4 Mini High | 97% | 93% | 100% | 100% | 85% | 100% | 100% | 100% | 100% |
| Grok 4.1 Fast | 97% | 100% | 100% | 100% | 92% | 100% | 100% | 85% | 100% |
| GPT-5.4 Nano (Reasoning, Low) | 97% | 100% | 90% | 100% | 100% | 93% | 93% | 100% | 100% |
| o4 Mini | 96% | 100% | 100% | 100% | 85% | 93% | 100% | 93% | 100% |
| Inception Mercury 2 | 96% | 100% | 100% | 100% | 100% | 85% | 100% | 100% | 85% |
| Z.AI GLM 5 Turbo | 96% | 100% | 100% | 100% | 68% | 100% | 100% | 100% | 100% |
| GPT-5.1 | 95% | 93% | 93% | 100% | 100% | 93% | 85% | 100% | 100% |
| GPT-5.4 Mini (Reasoning, Low) | 95% | 93% | 85% | 100% | 85% | 100% | 100% | 100% | 100% |
| Claude Opus 4.6 (Reasoning) | 94% | 100% | 100% | 100% | 100% | 85% | 70% | 100% | 100% |
| ByteDance Seed 1.6 Flash | 94% | 100% | 100% | 100% | 93% | 84% | 93% | 100% | 84% |
| GPT-5.4 Nano (Reasoning) | 94% | 100% | 100% | 100% | 58% | 100% | 100% | 93% | 100% |
| Z.AI GLM 5 | 93% | 93% | 78% | 100% | 83% | 100% | 93% | 100% | 100% |
| GPT-5 Mini | 93% | 100% | 100% | 100% | 45% | 100% | 100% | 100% | 100% |
basic entries
Short text (~524 words), small codex (11 entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| LFM2 24B | 100% | $0.0000 | 1.3s | |
| GPT-5.4 Nano | 80% | $0.0003 | 1.0s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0005 | 2.8s | |
| Inception Mercury | 100% | $0.0001 | 3.6s | |
| Arcee AI: Trinity Mini | 81% | $0.0012 | 36.6s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0008 | 3.9s | |
| Inception Mercury 2 | 100% | $0.0013 | 2.0s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 5.3s | |
| Grok 4 Fast | 77% | $0.0008 | 8.7s | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0012 | 8.6s | |
| GPT-5.4 Mini (Reasoning, Low) | 93% | $0.0019 | 3.6s | |
| Grok 4.1 Fast | 100% | $0.0007 | 7.5s | |
| Llama 3.1 Nemotron 70B | 67% | $0.0021 | 9.0s | |
| Mistral Small 4 (Reasoning) | 100% | $0.0009 | 9.3s | |
| Qwen 3 32B | 78% | $0.0004 | 12.3s | |
| Cohere Command R+ (Aug. 2024) | 65% | $0.0046 | 2.5s | |
| MiniMax M2.5 | 93% | $0.0015 | 12.4s | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0039 | 9.0s | |
| ByteDance Seed 1.6 | 100% | $0.0017 | 16.9s | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0051 | 8.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| MiniMax M2.7 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 4.7 Flash | 100% | 100% | 100% | |
| Nemotron 3 Super | 100% | 100% | 100% | |
| Inception Mercury 2 | 100% | 100% | 100% | |
| Mistral Small 4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | 100% | 100% | |
| Inception Mercury | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | 100% | 100% | |
| ByteDance Seed 1.6 Flash | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| LFM2 24B | 100% | $0.0000 | 1.3s | 100% | |
| Inception Mercury | 100% | $0.0001 | 3.6s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0005 | 2.8s | 100% | |
| Inception Mercury 2 | 100% | $0.0013 | 2.0s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0008 | 3.9s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 5.3s | 100% | |
| Grok 4.1 Fast | 100% | $0.0007 | 7.5s | 100% | |
| Mistral Small 4 (Reasoning) | 100% | $0.0009 | 9.3s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0039 | 9.0s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0036 | 9.8s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0051 | 8.5s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0017 | 16.9s | 100% | |
| MiniMax M2.7 | 100% | $0.0014 | 22.4s | 100% | |
| GPT-5 Mini | 100% | $0.0044 | 29.1s | 100% | |
| o4 Mini | 100% | $0.0097 | 18.9s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 47.1s | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | $0.016 | 11.8s | 100% | |
| Z.AI GLM 4.7 Flash | 100% | $0.0012 | 47.1s | 100% | |
| Claude Opus 4.6 | 100% | $0.022 | 9.7s | 100% | |
| Claude Opus 4.6 (Reasoning) | 100% | $0.033 | 20.7s | 100% | |
| Median | Evaluator |
|---|---|
| 45.0% | Correct "no violations" response |
| 58.1% | No hallucinated violations |
Short text (~524 words), big codex (51 entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| LFM2 24B | 100% | $0.0001 | 2.4s | |
| Ministral 8B | 68% | $0.0003 | 3.4s | |
| Ministral 3 8B | 93% | $0.0004 | 1.2s | |
| GPT-5.4 Nano | 73% | $0.0004 | 1.4s | |
| Inception Mercury | 93% | $0.0002 | 5.4s | |
| GPT-5.4 Nano (Reasoning, Low) | 90% | $0.0006 | 3.2s | |
| GPT-4.1 | 100% | $0.0020 | 737ms | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 6.7s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 5.1s | |
| Arcee AI: Trinity Mini | 83% | $0.0003 | 15.8s | |
| Grok 4.1 Fast | 100% | $0.0012 | 9.6s | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0011 | 6.6s | |
| Hermes 3 405B | 75% | $0.0026 | 1.7s | |
| Qwen 3 32B | 93% | $0.0004 | 10.6s | |
| Inception Mercury 2 | 100% | $0.0024 | 3.6s | |
| GPT-5.4 Mini (Reasoning, Low) | 85% | $0.0023 | 3.5s | |
| Mistral Small 4 (Reasoning) | 93% | $0.0015 | 12.6s | |
| Z.AI GLM 5 Turbo | 100% | $0.0034 | 7.9s | |
| ByteDance Seed 1.6 | 100% | $0.0021 | 16.2s | |
| Gemini 2.5 Flash (Reasoning) | 93% | $0.0047 | 7.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | 100% | 100% | |
| Nemotron 3 Super | 100% | 100% | 100% | |
| Inception Mercury 2 | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | 100% | 100% | |
| ByteDance Seed 1.6 Flash | 100% | 100% | 100% | |
| LFM2 24B | 100% | 100% | 100% | |
| Ministral 3 8B | 93% | 56% | 56% | |
| GPT-5.1 | 93% | 55% | 55% | |
| GPT-5.4 Mini (Reasoning) | 93% | 55% | 55% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| LFM2 24B | 100% | $0.0001 | 2.4s | 100% | |
| GPT-4.1 | 100% | $0.0020 | 737ms | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 5.1s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 6.7s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0011 | 6.6s | 100% | |
| Inception Mercury 2 | 100% | $0.0024 | 3.6s | 100% | |
| Grok 4.1 Fast | 100% | $0.0012 | 9.6s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0034 | 7.9s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0021 | 16.2s | 100% | |
| GPT-5 Mini | 100% | $0.0044 | 28.8s | 100% | |
| o4 Mini | 100% | $0.010 | 20.6s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 51.1s | 100% | |
| o4 Mini High | 100% | $0.020 | 41.1s | 100% | |
| Claude Opus 4.6 | 100% | $0.031 | 12.1s | 100% | |
| Claude Opus 4.6 (Reasoning) | 100% | $0.041 | 24.0s | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | $0.045 | 32.9s | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | $0.044 | 35.6s | 100% | |
| Ministral 3 8B | 93% | $0.0004 | 1.2s | 56% | |
| Inception Mercury | 93% | $0.0002 | 5.4s | 55% | |
| Qwen 3 32B | 93% | $0.0004 | 10.6s | 55% | |
| Median | Evaluator |
|---|---|
| 45.0% | Correct "no violations" response |
| 59.4% | No hallucinated violations |
Long text (~1594 words), small codex (11 entries)
Performance Score Distribution (Top 20)
| Model | Score |
|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Z.AI GLM 5 Turbo | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5.4 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| GPT-5.1 | 100% | |
| GPT-5 | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | |
| Z.AI GLM 5 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Grok 4.1 Fast | 100% | |
| MiniMax M2.7 | 100% | |
| Z.AI GLM 4.7 | 100% | |
| GPT-4.1 | 100% | |
| o4 Mini | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-4.1 Nano | 93% | $0.0001 | 962ms | |
| LFM2 24B | 100% | $0.0001 | 1.8s | |
| Gemini 2.5 Flash Lite | 74% | $0.0003 | 516ms | |
| Mistral Small 3.2 24B | 72% | $0.0003 | 1.0s | |
| Ministral 3 8B | 90% | $0.0004 | 1.4s | |
| Grok 4.20 (Beta) | 77% | $0.0018 | 984ms | |
| GPT-5.4 Nano | 66% | $0.0004 | 1.6s | |
| Gemini 3.1 Flash Lite (Preview) | 70% | $0.0008 | 882ms | |
| Inception Mercury | 100% | $0.0002 | 3.7s | |
| Mistral Medium 3.1 | 95% | $0.0011 | 764ms | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 3.6s | |
| GPT-4.1 | 100% | $0.0022 | 815ms | |
| Inception Mercury 2 | 100% | $0.0013 | 2.0s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 3.7s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 7.4s | |
| Grok 4.1 Fast | 100% | $0.0009 | 6.1s | |
| Hermes 3 70B | 69% | $0.0009 | 6.2s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0021 | 3.0s | |
| Grok 4 Fast | 100% | $0.0010 | 13.9s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 15.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| MiniMax M2.7 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| LFM2 24B | 100% | $0.0001 | 1.8s | 100% | |
| Inception Mercury | 100% | $0.0002 | 3.7s | 100% | |
| Inception Mercury 2 | 100% | $0.0013 | 2.0s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 3.6s | 100% | |
| GPT-4.1 | 100% | $0.0022 | 815ms | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 3.7s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0021 | 3.0s | 100% | |
| Grok 4.1 Fast | 100% | $0.0009 | 6.1s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 7.4s | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 15.5s | 100% | |
| Grok 4 Fast | 100% | $0.0010 | 13.9s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0039 | 9.7s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0065 | 7.8s | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.0077 | 5.5s | 100% | |
| GPT-5.2 | 100% | $0.0083 | 9.6s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0074 | 11.8s | 100% | |
| MiniMax M2.7 | 100% | $0.0020 | 27.5s | 100% | |
| o4 Mini | 100% | $0.0085 | 15.4s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0033 | 31.2s | 100% | |
| GPT-5 Mini | 100% | $0.0044 | 29.6s | 100% | |
| Median | Evaluator |
|---|---|
| 47.5% | Correct "no violations" response |
| 61.9% | No hallucinated violations |
Long text (~1594 words), big codex (51 entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-4.1 Nano | 93% | $0.0003 | 2.2s | |
| LFM2 24B | 100% | $0.0001 | 1.3s | |
| Ministral 3 8B | 93% | $0.0006 | 1.9s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0011 | 6.4s | |
| Inception Mercury | 93% | $0.0002 | 11.6s | |
| ByteDance Seed 1.6 Flash | 93% | $0.0010 | 14.8s | |
| Inception Mercury 2 | 100% | $0.0039 | 5.5s | |
| GPT-5.4 Mini (Reasoning, Low) | 85% | $0.0048 | 6.5s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 27.0s | |
| Mistral Small 4 (Reasoning) | 92% | $0.0026 | 21.1s | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0033 | 22.0s | |
| Grok 4.1 Fast | 92% | $0.0028 | 27.0s | |
| Z.AI GLM 5 Turbo | 68% | $0.0096 | 25.3s | |
| Gemini 2.5 Flash (Reasoning) | 85% | $0.012 | 18.1s | |
| GPT-5.2 | 92% | $0.015 | 17.4s | |
| GPT-5 Nano | 100% | $0.0038 | 1.3m | |
| Aion 2.0 | 100% | $0.0086 | 1.3m | |
| o4 Mini | 85% | $0.021 | 41.4s | |
| GPT-5.1 | 100% | $0.029 | 28.6s | |
| Nemotron 3 Super | 93% | $0.0000 | 2.3m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Stealth: Healer Alpha | 100% | 100% | 100% | |
| Inception Mercury 2 | 100% | 100% | 100% | |
| GPT-5 Nano | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | 100% | 100% | |
| LFM2 24B | 100% | 100% | 100% | |
| Claude Opus 4.6 | 94% | 61% | 61% | |
| Ministral 3 8B | 93% | 56% | 56% | |
| GPT-4.1 Nano | 93% | 55% | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | 55% | 55% | |
| Nemotron 3 Super | 93% | 55% | 55% | |
| Inception Mercury | 93% | 55% | 55% | |
| Nemotron 3 Nano | 93% | 55% | 55% | |
| ByteDance Seed 1.6 Flash | 93% | 55% | 55% | |
| GPT-5 | 92% | 50% | 50% | |
| GPT-5.2 | 92% | 50% | 50% | |
| Grok 4.1 Fast | 92% | 50% | 50% | |
| Mistral Small 4 (Reasoning) | 92% | 50% | 50% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| LFM2 24B | 100% | $0.0001 | 1.3s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0011 | 6.4s | 100% | |
| Inception Mercury 2 | 100% | $0.0039 | 5.5s | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 27.0s | 100% | |
| GPT-5.1 | 100% | $0.029 | 28.6s | 100% | |
| GPT-5 Nano | 100% | $0.0038 | 1.3m | 100% | |
| Aion 2.0 | 100% | $0.0086 | 1.3m | 100% | |
| Ministral 3 8B | 93% | $0.0006 | 1.9s | 56% | |
| GPT-4.1 Nano | 93% | $0.0003 | 2.2s | 55% | |
| Inception Mercury | 93% | $0.0002 | 11.6s | 55% | |
| ByteDance Seed 1.6 Flash | 93% | $0.0010 | 14.8s | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0033 | 22.0s | 55% | |
| Mistral Small 4 (Reasoning) | 92% | $0.0026 | 21.1s | 50% | |
| Grok 4.1 Fast | 92% | $0.0028 | 27.0s | 50% | |
| GPT-5.2 | 92% | $0.015 | 17.4s | 50% | |
| Claude Opus 4.6 | 94% | $0.048 | 16.8s | 61% | |
| GPT-5.4 Mini (Reasoning, Low) | 85% | $0.0048 | 6.5s | 40% | |
| Gemini 2.5 Flash (Reasoning) | 85% | $0.012 | 18.1s | 40% | |
| Claude Opus 4.6 (Reasoning) | 100% | $0.143 | 1.3m | 100% | |
| Nemotron 3 Super | 93% | $0.0000 | 2.3m | 55% | |
| Median | Evaluator |
|---|---|
| 30.0% | Correct "no violations" response |
| 33.8% | No hallucinated violations |
detailed entries
Short text (~524 words), small codex (11 detailed entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| LFM2 24B | 62% | $0.0004 | 56.5s | |
| Ministral 8B | 100% | $0.0004 | 338ms | |
| Mistral Small 3.2 24B | 63% | $0.0004 | 1.7s | |
| GPT-5.4 Nano | 83% | $0.0005 | 1.3s | |
| Ministral 3 8B | 100% | $0.0006 | 339ms | |
| GPT-4.1 Mini | 93% | $0.0007 | 1.6s | |
| Ministral 3 14B | 100% | $0.0008 | 531ms | |
| GPT-5.4 Nano (Reasoning, Low) | 93% | $0.0006 | 2.8s | |
| Arcee AI: Trinity Mini | 93% | $0.0013 | 41.5s | |
| Inception Mercury | 100% | $0.0003 | 6.3s | |
| Mistral Large 3 | 100% | $0.0020 | 774ms | |
| ByteDance Seed 1.6 Flash | 84% | $0.0006 | 7.9s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0023 | 3.9s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0012 | 7.5s | |
| Grok 4.1 Fast | 100% | $0.0011 | 8.6s | |
| GPT-4.1 | 100% | $0.0035 | 613ms | |
| Inception Mercury 2 | 85% | $0.0023 | 3.4s | |
| Stealth: Healer Alpha | 85% | $0.0000 | 25.2s | |
| Hermes 3 405B | 74% | $0.0039 | 2.5s | |
| Grok 4 Fast | 93% | $0.0013 | 13.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| Grok 4 | 100% | 100% | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | 100% | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | 100% | 100% | |
| Mistral Large 3 | 100% | 100% | 100% | |
| Nemotron 3 Super | 100% | 100% | 100% | |
| Mistral Large 2 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Ministral 8B | 100% | $0.0004 | 338ms | 100% | |
| Ministral 3 8B | 100% | $0.0006 | 339ms | 100% | |
| Ministral 3 14B | 100% | $0.0008 | 531ms | 100% | |
| Mistral Large 3 | 100% | $0.0020 | 774ms | 100% | |
| Inception Mercury | 100% | $0.0003 | 6.3s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0023 | 3.9s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0012 | 7.5s | 100% | |
| GPT-4.1 | 100% | $0.0035 | 613ms | 100% | |
| Grok 4.1 Fast | 100% | $0.0011 | 8.6s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0050 | 6.1s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0021 | 15.1s | 100% | |
| Mistral Large 2 | 100% | $0.0078 | 650ms | 100% | |
| Mistral Large | 100% | $0.0078 | 1.0s | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.0068 | 5.9s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0055 | 11.5s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0060 | 21.5s | 100% | |
| Claude Sonnet 4.6 | 100% | $0.013 | 1.0s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0035 | 29.7s | 100% | |
| GPT-5 Mini | 100% | $0.0047 | 29.0s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 52.3s | 100% | |
| Median | Evaluator |
|---|---|
| 60.0% | Correct "no violations" response |
| 75.0% | No hallucinated violations |
Short text (~524 words), big codex (51 detailed entries)
Performance Score Distribution (Top 20)
| Model | Score |
|---|---|
| Z.AI GLM 5 Turbo | 100% | |
| GPT-5.4 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| Claude Opus 4.6 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| MiniMax M2.7 | 100% | |
| GPT-4.1 | 100% | |
| o4 Mini | 100% | |
| Grok 4 | 100% | |
| Grok 4 Fast | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | |
| Mistral Large 3 | 100% | |
| Nemotron 3 Super | 100% | |
| Inception Mercury 2 | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-5.4 Nano | 74% | $0.0006 | 1.4s | |
| GPT-5.4 Nano (Reasoning, Low) | 93% | $0.0008 | 3.4s | |
| Ministral 8B | 100% | $0.0013 | 564ms | |
| Arcee AI: Trinity Mini | 100% | $0.0007 | 6.9s | |
| Inception Mercury | 100% | $0.0009 | 8.6s | |
| Ministral 3 8B | 100% | $0.0020 | 589ms | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0038 | 2.7s | |
| ByteDance Seed 1.6 Flash | 93% | $0.0013 | 7.7s | |
| Inception Mercury 2 | 100% | $0.0031 | 3.7s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0020 | 8.1s | |
| Grok 4 Fast | 100% | $0.0025 | 16.0s | |
| Grok 4.1 Fast | 100% | $0.0028 | 7.6s | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0063 | 5.2s | |
| Gemini 2.5 Flash Lite (Reasoning) | 85% | $0.0023 | 14.2s | |
| Mistral Small 4 (Reasoning) | 93% | $0.0028 | 23.8s | |
| GPT-4.1 | 100% | $0.0086 | 1.0s | |
| Mistral Large 3 | 100% | $0.0066 | 1.4s | |
| Z.AI GLM 4.6 | 90% | $0.0053 | 12.9s | |
| MiniMax M2.7 | 100% | $0.0045 | 17.4s | |
| ByteDance Seed 1.6 | 100% | $0.0050 | 19.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| MiniMax M2.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Grok 4 | 100% | 100% | 100% | |
| Grok 4 Fast | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | 100% | 100% | |
| Mistral Large 3 | 100% | 100% | 100% | |
| Nemotron 3 Super | 100% | 100% | 100% | |
| Inception Mercury 2 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Ministral 8B | 100% | $0.0013 | 564ms | 100% | |
| Ministral 3 8B | 100% | $0.0020 | 589ms | 100% | |
| Arcee AI: Trinity Mini | 100% | $0.0007 | 6.9s | 100% | |
| Inception Mercury 2 | 100% | $0.0031 | 3.7s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0038 | 2.7s | 100% | |
| Inception Mercury | 100% | $0.0009 | 8.6s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0020 | 8.1s | 100% | |
| Grok 4.1 Fast | 100% | $0.0028 | 7.6s | 100% | |
| Mistral Large 3 | 100% | $0.0066 | 1.4s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0063 | 5.2s | 100% | |
| GPT-4.1 | 100% | $0.0086 | 1.0s | 100% | |
| Grok 4 Fast | 100% | $0.0025 | 16.0s | 100% | |
| MiniMax M2.7 | 100% | $0.0045 | 17.4s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0077 | 13.6s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0050 | 19.7s | 100% | |
| GPT-5.2 | 100% | $0.014 | 11.3s | 100% | |
| GPT-5 Mini | 100% | $0.0049 | 30.6s | 100% | |
| GPT-5 Nano | 100% | $0.0020 | 36.0s | 100% | |
| o4 Mini | 100% | $0.012 | 20.0s | 100% | |
| Mistral Large 2 | 100% | $0.026 | 1.4s | 100% | |
| Median | Evaluator |
|---|---|
| 60.0% | Correct "no violations" response |
| 76.7% | No hallucinated violations |
Long text (~1594 words), small codex (11 detailed entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| LFM2 24B | 100% | $0.0002 | 1.3s | |
| GPT-4.1 Nano | 78% | $0.0003 | 1.4s | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 597ms | |
| GPT-4o Mini (temp=0) | 100% | $0.0005 | 670ms | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 3.0s | |
| Inception Mercury | 100% | $0.0003 | 4.6s | |
| Inception Mercury 2 | 100% | $0.0018 | 2.7s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0029 | 3.0s | |
| GPT-5.4 Nano (Reasoning) | 93% | $0.0014 | 6.7s | |
| GPT-4.1 | 100% | $0.0045 | 728ms | |
| ByteDance Seed 1.6 Flash | 100% | $0.0008 | 9.1s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 15.3s | |
| Grok 4.1 Fast | 85% | $0.0022 | 20.0s | |
| Z.AI GLM 5 Turbo | 100% | $0.0064 | 16.6s | |
| MiniMax M2.5 | 73% | $0.0029 | 22.1s | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0084 | 8.9s | |
| Mistral Small 4 (Reasoning) | 78% | $0.0030 | 31.2s | |
| Gemini 2.5 Flash Lite (Reasoning) | 78% | $0.0030 | 24.9s | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.010 | 9.1s | |
| Cohere Command R+ (Aug. 2024) | 65% | $0.013 | 2.6s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| Stealth: Healer Alpha | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | 100% | 100% | |
| Nemotron 3 Super | 100% | 100% | 100% | |
| Inception Mercury 2 | 100% | 100% | 100% | |
| Inception Mercury | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-4o Mini (temp=1) | 100% | 100% | 100% | |
| GPT-4o Mini (temp=0) | 100% | 100% | 100% | |
| ByteDance Seed 1.6 Flash | 100% | 100% | 100% | |
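The stability column above is described as median × consistency. As a minimal sketch (the page states the product but not its rounding or display rules, which are assumptions here):

```python
def stability(median_score: float, consistency: float) -> float:
    """Stability as the product of the median run score and the
    run-to-run consistency, both given as fractions in [0, 1].
    Rounding/display behavior is an assumption, not taken from the page."""
    return median_score * consistency

# A model that scores perfectly and identically on every run
# gets the maximum stability of 1.0:
print(stability(1.0, 1.0))  # 1.0
```

This is why every row in the table above shows 100% stability: with a 100% median score and 100% consistency, the product is also 100%.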
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Score | Cost | Speed | Stability | ||
|---|---|---|---|---|---|
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 597ms | 100% | |
| LFM2 24B | 100% | $0.0002 | 1.3s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0005 | 670ms | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 3.0s | 100% | |
| Inception Mercury | 100% | $0.0003 | 4.6s | 100% | |
| Inception Mercury 2 | 100% | $0.0018 | 2.7s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0029 | 3.0s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0008 | 9.1s | 100% | |
| GPT-4.1 | 100% | $0.0045 | 728ms | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 15.3s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0084 | 8.9s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0064 | 16.6s | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.010 | 9.1s | 100% | |
| GPT-5 Mini | 100% | $0.0060 | 40.1s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 1.1m | 100% | |
| GPT-5.4 (Reasoning) | 100% | $0.029 | 27.7s | 100% | |
| GPT-5.1 | 100% | $0.028 | 32.0s | 100% | |
| o4 Mini High | 100% | $0.022 | 48.0s | 100% | |
| Z.AI GLM 5 | 100% | $0.014 | 1.3m | 100% | |
| GPT-5.4 Nano (Reasoning) | 93% | $0.0014 | 6.7s | 55% | |
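The composite ranking above combines performance, cost, speed, and stability, but the page does not publish the weights or normalization it uses. The sketch below is purely illustrative: the equal weighting and the `max_cost`/`max_time` caps are hypothetical choices, not the site's actual formula.

```python
def composite_score(performance: float, cost_usd: float, time_s: float,
                    stability: float, max_cost: float = 0.05,
                    max_time: float = 120.0) -> float:
    """Hypothetical composite of the four factors named on the page.
    Cost and time are normalized against assumed caps (max_cost, max_time)
    so that cheaper/faster maps to a higher term, then all four factors
    are averaged equally. The caps and equal weights are assumptions."""
    cost_term = 1.0 - min(cost_usd, max_cost) / max_cost
    speed_term = 1.0 - min(time_s, max_time) / max_time
    return (performance + cost_term + speed_term + stability) / 4.0
```

Under any such scheme, a model with perfect performance and stability at near-zero cost and latency approaches the maximum composite of 1.0, which matches the ordering seen above where cheap, fast, perfectly stable models lead the table.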
| Median | Evaluator |
|---|---|
| 45.0% | Correct "no violations" response |
| 55.8% | No hallucinated violations |
Long text (~1594 words), big codex (51 detailed entries)
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Score | ||
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Z.AI GLM 5 Turbo | 100% | |
| GPT-5 Mini | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | |
| Z.AI GLM 5 | 100% | |
| o4 Mini High | 100% | |
| Grok 4.1 Fast | 100% | |
| Aion 2.0 | 100% | |
| GPT-4.1 | 100% | |
| o4 Mini | 100% | |
| Stealth: Hunter Alpha | 100% | |
| ByteDance Seed 2.0 Mini | 100% | |
| Stealth: Healer Alpha | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | |
| ByteDance Seed 2.0 Lite | 100% | |
| Nemotron 3 Super | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Score | Cost | Time | ||
|---|---|---|---|---|
| GPT-5.4 Nano | 78% | $0.0008 | 1.0s | |
| Ministral 8B | 100% | $0.0014 | 525ms | |
| Ministral 3 8B | 100% | $0.0022 | 678ms | |
| Ministral 3 14B | 93% | $0.0029 | 2.7s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0014 | 6.3s | |
| Arcee AI: Trinity Mini | 90% | $0.0018 | 45.5s | |
| GPT-4.1 | 100% | $0.0097 | 972ms | |
| Grok 4.1 Fast | 100% | $0.0031 | 13.4s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0073 | 5.9s | |
| Inception Mercury 2 | 85% | $0.0056 | 7.7s | |
| ByteDance Seed 1.6 Flash | 84% | $0.0016 | 13.8s | |
| Grok 4 Fast | 93% | $0.0032 | 25.0s | |
| Inception Mercury | 100% | $0.0007 | 20.5s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0035 | 40.9s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 27.7s | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0038 | 27.3s | |
| ByteDance Seed 2.0 Lite | 100% | $0.0056 | 25.4s | |
| Gemini 2.5 Flash (Reasoning) | 85% | $0.015 | 19.0s | |
| GPT-5.4 (Reasoning, Low) | 92% | $0.019 | 14.5s | |
| GPT-5.2 | 93% | $0.021 | 20.9s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Score | Consistency | Stability | ||
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Stealth: Hunter Alpha | 100% | 100% | 100% | |
| ByteDance Seed 2.0 Mini | 100% | 100% | 100% | |
| Stealth: Healer Alpha | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | 100% | 100% | |
| ByteDance Seed 2.0 Lite | 100% | 100% | 100% | |
| Nemotron 3 Super | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Score | Cost | Speed | Stability | ||
|---|---|---|---|---|---|
| Ministral 8B | 100% | $0.0014 | 525ms | 100% | |
| Ministral 3 8B | 100% | $0.0022 | 678ms | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0014 | 6.3s | 100% | |
| GPT-4.1 | 100% | $0.0097 | 972ms | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0073 | 5.9s | 100% | |
| Grok 4.1 Fast | 100% | $0.0031 | 13.4s | 100% | |
| Inception Mercury | 100% | $0.0007 | 20.5s | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 27.7s | 100% | |
| ByteDance Seed 2.0 Lite | 100% | $0.0056 | 25.4s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0035 | 40.9s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.016 | 33.5s | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | $0.033 | 18.8s | 100% | |
| GPT-5 Mini | 100% | $0.0090 | 53.0s | 100% | |
| Stealth: Hunter Alpha | 100% | $0.0000 | 1.1m | 100% | |
| o4 Mini | 100% | $0.026 | 43.6s | 100% | |
| GPT-5.1 | 100% | $0.051 | 44.5s | 100% | |
| Aion 2.0 | 100% | $0.017 | 1.7m | 100% | |
| o4 Mini High | 100% | $0.048 | 1.5m | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 2.6m | 100% | |
| Claude Opus 4.6 | 100% | $0.104 | 19.4s | 100% | |
| Median | Evaluator |
|---|---|
| 50.0% | Correct "no violations" response |
| 57.5% | No hallucinated violations |