Codex Red Herring (False Positive Detection)
Tests whether models correctly report "no violations" when a codex is fully consistent with the prose passage. Models that hallucinate nonexistent violations (false positives) fail. The test uses a 2×2 matrix of text length × codex size, each in a bare-entry and a detailed-entry variant.
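As a rough illustration of the pass criterion, the check below sketches what "no false positives" means. This is an assumption-laden sketch, not the benchmark's actual grader (whose evaluators are the "Correct "no violations" response" and "No hallucinated violations" checks scored in the per-scenario tables); the phrase matching is invented for illustration.

```python
def reports_no_violations(response: str) -> bool:
    # Illustrative check (an assumption, not the benchmark's real evaluator):
    # a response passes iff it explicitly states that no violations were found.
    text = response.lower()
    return "no violations" in text or "no contradictions" in text

# A model that invents a violation (a false positive) fails:
reports_no_violations("The passage contains no violations of the codex.")   # True
reports_no_violations("Violation: the hero's eye color contradicts entry 3.")  # False
```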
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| LFM2 24B | 72% | $0.0004 | 39.0s | |
| GPT-4.1 Nano | 63% | $0.0005 | 4.6s | |
| GPT-5.4 Nano | 68% | $0.0005 | 1.5s | |
| Ministral 8B | 68% | $0.0007 | 4.4s | |
| Ministral 3 8B | 78% | $0.0013 | 17.9s | |
| GPT-5.4 Nano (Reasoning, Low) | 97% | $0.0008 | 3.9s | |
| Inception Mercury | 98% | $0.0004 | 8.0s | |
| Inception Mercury 2 | 96% | $0.0027 | 3.8s | |
| GPT-5.4 Nano (Reasoning) | 94% | $0.0017 | 11.4s | |
| GPT-5.4 Mini (Reasoning, Low) | 95% | $0.0034 | 4.0s | |
| ByteDance Seed 1.6 Flash | 94% | $0.0008 | 9.1s | |
| Arcee AI: Trinity Mini | 73% | $0.0009 | 26.3s | |
| Grok 4.1 Fast | 97% | $0.0019 | 12.5s | |
| GPT-4.1 | 86% | $0.0048 | 1.2s | |
| Stealth: Healer Alpha | 84% | $0.0000 | 21.5s | |
| Grok 4 Fast | 73% | $0.0019 | 19.2s | |
| GPT-5.4 Mini (Reasoning) | 89% | $0.0090 | 10.8s | |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | $0.0023 | 16.6s | |
| Mistral Small 4 (Reasoning) | 83% | $0.0025 | 22.6s | |
| Z.AI GLM 5 Turbo | 96% | $0.0071 | 16.0s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
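The quadrant logic described above can be sketched as follows. This is an illustrative reconstruction, not the site's plotting code: the `quadrant` helper is hypothetical, and the (cost, score) pairs are copied from the price-performance table above.

```python
import statistics

# (cost in $, score in %) pairs taken from the price-performance table.
models = {
    "LFM2 24B": (0.0004, 72),
    "GPT-4.1 Nano": (0.0005, 63),
    "Inception Mercury": (0.0004, 98),
    "Grok 4.1 Fast": (0.0019, 97),
    "GPT-4.1": (0.0048, 86),
}

# Quadrant lines sit at the median cost and median score of the plotted models.
cost_median = statistics.median(c for c, _ in models.values())
score_median = statistics.median(s for _, s in models.values())

def quadrant(cost: float, score: float) -> str:
    """Classify a model relative to the median quadrant lines."""
    cheap = cost <= cost_median
    strong = score >= score_median
    return {
        (True, True): "cheap & strong",
        (True, False): "cheap & weak",
        (False, True): "expensive & strong",
        (False, False): "expensive & weak",
    }[(cheap, strong)]
```

With these five data points the lines fall at $0.0005 and 86%, so Inception Mercury lands in the "cheap & strong" quadrant.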
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Nemotron 3 Super | 99% | 83% | 83% | |
| Inception Mercury | 98% | 77% | 77% | |
| o4 Mini High | 97% | 72% | 72% | |
| Grok 4.1 Fast | 97% | 70% | 70% | |
| GPT-5.4 Nano (Reasoning, Low) | 97% | 70% | 70% | |
| o4 Mini | 96% | 67% | 67% | |
| Inception Mercury 2 | 96% | 67% | 67% | |
| Z.AI GLM 5 Turbo | 96% | 65% | 65% | |
| GPT-5.1 | 95% | 64% | 64% | |
| GPT-5.4 Mini (Reasoning, Low) | 95% | 64% | 64% | |
| Claude Opus 4.6 (Reasoning) | 94% | 60% | 60% | |
| ByteDance Seed 1.6 Flash | 94% | 59% | 59% | |
| GPT-5.4 Nano (Reasoning) | 94% | 57% | 57% | |
| Z.AI GLM 5 | 93% | 56% | 56% | |
| GPT-5 Mini | 93% | 56% | 56% | |
| GPT-5 Nano | 93% | 55% | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | 53% | 53% | |
| MiniMax M2.7 | 90% | 48% | 48% | |
| GPT-5.4 Mini (Reasoning) | 89% | 46% | 46% | |
| Aion 2.0 | 88% | 45% | 45% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Inception Mercury | 98% | $0.0004 | 8.0s | 77% | |
| Nemotron 3 Super | 99% | $0.0000 | 1.3m | 83% | |
| GPT-5.4 Nano (Reasoning, Low) | 97% | $0.0008 | 3.9s | 70% | |
| Grok 4.1 Fast | 97% | $0.0019 | 12.5s | 70% | |
| Inception Mercury 2 | 96% | $0.0027 | 3.8s | 67% | |
| GPT-5.4 Mini (Reasoning, Low) | 95% | $0.0034 | 4.0s | 64% | |
| Z.AI GLM 5 Turbo | 96% | $0.0071 | 16.0s | 65% | |
| ByteDance Seed 1.6 Flash | 94% | $0.0008 | 9.1s | 59% | |
| o4 Mini | 96% | $0.014 | 25.0s | 67% | |
| GPT-5.4 Nano (Reasoning) | 94% | $0.0017 | 11.4s | 57% | |
| o4 Mini High | 97% | $0.027 | 52.5s | 72% | |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | $0.0023 | 16.6s | 53% | |
| GPT-5.1 | 95% | $0.025 | 26.1s | 64% | |
| GPT-5 Mini | 93% | $0.0059 | 37.8s | 56% | |
| GPT-5 Nano | 93% | $0.0035 | 1.1m | 55% | |
| MiniMax M2.7 | 90% | $0.0047 | 34.6s | 48% | |
| GPT-5.4 Mini (Reasoning) | 89% | $0.0090 | 10.8s | 46% | |
| GPT-4.1 | 86% | $0.0048 | 1.2s | 42% | |
| Gemini 2.5 Flash (Reasoning) | 87% | $0.0085 | 14.2s | 43% | |
| Z.AI GLM 5 | 93% | $0.017 | 1.4m | 56% | |
| Model | Total ▼ | Short text (~524 words), small codex (11 entries) | Short text (~524 words), big codex (51 entries) | Long text (~1594 words), small codex (11 entries) | Long text (~1594 words), big codex (51 entries) | Short text (~524 words), small codex (11 detailed entries) | Short text (~524 words), big codex (51 detailed entries) | Long text (~1594 words), small codex (11 detailed entries) | Long text (~1594 words), big codex (51 detailed entries) |
|---|---|---|---|---|---|---|---|---|---|
| Nemotron 3 Super | 99% | 100% | 100% | 100% | 93% | 100% | 100% | 100% | 100% |
| Inception Mercury | 98% | 100% | 93% | 100% | 93% | 100% | 100% | 100% | 100% |
| o4 Mini High | 97% | 93% | 100% | 100% | 85% | 100% | 100% | 100% | 100% |
| Grok 4.1 Fast | 97% | 100% | 100% | 100% | 92% | 100% | 100% | 85% | 100% |
| GPT-5.4 Nano (Reasoning, Low) | 97% | 100% | 90% | 100% | 100% | 93% | 93% | 100% | 100% |
| o4 Mini | 96% | 100% | 100% | 100% | 85% | 93% | 100% | 93% | 100% |
| Inception Mercury 2 | 96% | 100% | 100% | 100% | 100% | 85% | 100% | 100% | 85% |
| Z.AI GLM 5 Turbo | 96% | 100% | 100% | 100% | 68% | 100% | 100% | 100% | 100% |
| GPT-5.1 | 95% | 93% | 93% | 100% | 100% | 93% | 85% | 100% | 100% |
| GPT-5.4 Mini (Reasoning, Low) | 95% | 93% | 85% | 100% | 85% | 100% | 100% | 100% | 100% |
| Claude Opus 4.6 (Reasoning) | 94% | 100% | 100% | 100% | 100% | 85% | 70% | 100% | 100% |
| ByteDance Seed 1.6 Flash | 94% | 100% | 100% | 100% | 93% | 84% | 93% | 100% | 84% |
| GPT-5.4 Nano (Reasoning) | 94% | 100% | 100% | 100% | 58% | 100% | 100% | 93% | 100% |
| Z.AI GLM 5 | 93% | 93% | 78% | 100% | 83% | 100% | 93% | 100% | 100% |
| GPT-5 Mini | 93% | 100% | 100% | 100% | 45% | 100% | 100% | 100% | 100% |
basic entries
Short text (~524 words), small codex (11 entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| LFM2 24B | 100% | $0.0000 | 1.3s | |
| GPT-5.4 Nano | 80% | $0.0003 | 1.0s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0005 | 2.8s | |
| Inception Mercury | 100% | $0.0001 | 3.6s | |
| Arcee AI: Trinity Mini | 81% | $0.0012 | 36.6s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0008 | 3.9s | |
| Inception Mercury 2 | 100% | $0.0013 | 2.0s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 5.3s | |
| Grok 4 Fast | 77% | $0.0008 | 8.7s | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0012 | 8.6s | |
| GPT-5.4 Mini (Reasoning, Low) | 93% | $0.0019 | 3.6s | |
| Grok 4.1 Fast | 100% | $0.0007 | 7.5s | |
| Llama 3.1 Nemotron 70B | 67% | $0.0021 | 9.0s | |
| Mistral Small 4 (Reasoning) | 100% | $0.0009 | 9.3s | |
| Qwen 3 32B | 78% | $0.0004 | 12.3s | |
| Cohere Command R+ (Aug. 2024) | 65% | $0.0046 | 2.5s | |
| MiniMax M2.5 | 93% | $0.0015 | 12.4s | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0039 | 9.0s | |
| ByteDance Seed 1.6 | 100% | $0.0017 | 16.9s | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0051 | 8.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| MiniMax M2.7 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 4.7 Flash | 100% | 100% | 100% | |
| Nemotron 3 Super | 100% | 100% | 100% | |
| Inception Mercury 2 | 100% | 100% | 100% | |
| Mistral Small 4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | 100% | 100% | |
| Inception Mercury | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | 100% | 100% | |
| ByteDance Seed 1.6 Flash | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| LFM2 24B | 100% | $0.0000 | 1.3s | 100% | |
| Inception Mercury | 100% | $0.0001 | 3.6s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0005 | 2.8s | 100% | |
| Inception Mercury 2 | 100% | $0.0013 | 2.0s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0008 | 3.9s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 5.3s | 100% | |
| Grok 4.1 Fast | 100% | $0.0007 | 7.5s | 100% | |
| Mistral Small 4 (Reasoning) | 100% | $0.0009 | 9.3s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0039 | 9.0s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0036 | 9.8s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0051 | 8.5s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0017 | 16.9s | 100% | |
| MiniMax M2.7 | 100% | $0.0014 | 22.4s | 100% | |
| GPT-5 Mini | 100% | $0.0044 | 29.1s | 100% | |
| o4 Mini | 100% | $0.0097 | 18.9s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 47.1s | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | $0.016 | 11.8s | 100% | |
| Z.AI GLM 4.7 Flash | 100% | $0.0012 | 47.1s | 100% | |
| Claude Opus 4.6 | 100% | $0.022 | 9.7s | 100% | |
| Claude Opus 4.6 (Reasoning) | 100% | $0.033 | 20.7s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 45.0% | Correct "no violations" response | | |
| 58.1% | No hallucinated violations | | |
Short text (~524 words), big codex (51 entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| LFM2 24B | 100% | $0.0001 | 2.4s | |
| Ministral 8B | 68% | $0.0003 | 3.4s | |
| Ministral 3 8B | 93% | $0.0004 | 1.2s | |
| GPT-5.4 Nano | 73% | $0.0004 | 1.4s | |
| Inception Mercury | 93% | $0.0002 | 5.4s | |
| GPT-5.4 Nano (Reasoning, Low) | 90% | $0.0006 | 3.2s | |
| GPT-4.1 | 100% | $0.0020 | 737ms | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 6.7s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 5.1s | |
| Arcee AI: Trinity Mini | 83% | $0.0003 | 15.8s | |
| Grok 4.1 Fast | 100% | $0.0012 | 9.6s | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0011 | 6.6s | |
| Hermes 3 405B | 75% | $0.0026 | 1.7s | |
| Qwen 3 32B | 93% | $0.0004 | 10.6s | |
| Inception Mercury 2 | 100% | $0.0024 | 3.6s | |
| GPT-5.4 Mini (Reasoning, Low) | 85% | $0.0023 | 3.5s | |
| Mistral Small 4 (Reasoning) | 93% | $0.0015 | 12.6s | |
| Z.AI GLM 5 Turbo | 100% | $0.0034 | 7.9s | |
| ByteDance Seed 1.6 | 100% | $0.0021 | 16.2s | |
| Gemini 2.5 Flash (Reasoning) | 93% | $0.0047 | 7.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | 100% | 100% | |
| Nemotron 3 Super | 100% | 100% | 100% | |
| Inception Mercury 2 | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | 100% | 100% | |
| ByteDance Seed 1.6 Flash | 100% | 100% | 100% | |
| LFM2 24B | 100% | 100% | 100% | |
| Ministral 3 8B | 93% | 56% | 56% | |
| GPT-5.1 | 93% | 55% | 55% | |
| GPT-5.4 Mini (Reasoning) | 93% | 55% | 55% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| LFM2 24B | 100% | $0.0001 | 2.4s | 100% | |
| GPT-4.1 | 100% | $0.0020 | 737ms | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 5.1s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 6.7s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0011 | 6.6s | 100% | |
| Inception Mercury 2 | 100% | $0.0024 | 3.6s | 100% | |
| Grok 4.1 Fast | 100% | $0.0012 | 9.6s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0034 | 7.9s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0021 | 16.2s | 100% | |
| GPT-5 Mini | 100% | $0.0044 | 28.8s | 100% | |
| o4 Mini | 100% | $0.010 | 20.6s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 51.1s | 100% | |
| o4 Mini High | 100% | $0.020 | 41.1s | 100% | |
| Claude Opus 4.6 | 100% | $0.031 | 12.1s | 100% | |
| Claude Opus 4.6 (Reasoning) | 100% | $0.041 | 24.0s | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | $0.045 | 32.9s | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | $0.044 | 35.6s | 100% | |
| Ministral 3 8B | 93% | $0.0004 | 1.2s | 56% | |
| Inception Mercury | 93% | $0.0002 | 5.4s | 55% | |
| Qwen 3 32B | 93% | $0.0004 | 10.6s | 55% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 45.0% | Correct "no violations" response | | |
| 59.4% | No hallucinated violations | | |
Long text (~1594 words), small codex (11 entries)
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Z.AI GLM 5 Turbo | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5.4 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| GPT-5.1 | 100% | |
| GPT-5 | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | |
| Z.AI GLM 5 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Grok 4.1 Fast | 100% | |
| MiniMax M2.7 | 100% | |
| Z.AI GLM 4.7 | 100% | |
| GPT-4.1 | 100% | |
| o4 Mini | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-4.1 Nano | 93% | $0.0001 | 962ms | |
| LFM2 24B | 100% | $0.0001 | 1.8s | |
| Gemini 2.5 Flash Lite | 74% | $0.0003 | 516ms | |
| Mistral Small 3.2 24B | 72% | $0.0003 | 1.0s | |
| Ministral 3 8B | 90% | $0.0004 | 1.4s | |
| Grok 4.20 (Beta) | 77% | $0.0018 | 984ms | |
| GPT-5.4 Nano | 66% | $0.0004 | 1.6s | |
| Gemini 3.1 Flash Lite (Preview) | 70% | $0.0008 | 882ms | |
| Inception Mercury | 100% | $0.0002 | 3.7s | |
| Mistral Medium 3.1 | 95% | $0.0011 | 764ms | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 3.6s | |
| GPT-4.1 | 100% | $0.0022 | 815ms | |
| Inception Mercury 2 | 100% | $0.0013 | 2.0s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 3.7s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 7.4s | |
| Grok 4.1 Fast | 100% | $0.0009 | 6.1s | |
| Hermes 3 70B | 69% | $0.0009 | 6.2s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0021 | 3.0s | |
| Grok 4 Fast | 100% | $0.0010 | 13.9s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 15.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| MiniMax M2.7 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| LFM2 24B | 100% | $0.0001 | 1.8s | 100% | |
| Inception Mercury | 100% | $0.0002 | 3.7s | 100% | |
| Inception Mercury 2 | 100% | $0.0013 | 2.0s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 3.6s | 100% | |
| GPT-4.1 | 100% | $0.0022 | 815ms | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 3.7s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0021 | 3.0s | 100% | |
| Grok 4.1 Fast | 100% | $0.0009 | 6.1s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 7.4s | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 15.5s | 100% | |
| Grok 4 Fast | 100% | $0.0010 | 13.9s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0039 | 9.7s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0065 | 7.8s | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.0077 | 5.5s | 100% | |
| GPT-5.2 | 100% | $0.0083 | 9.6s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0074 | 11.8s | 100% | |
| MiniMax M2.7 | 100% | $0.0020 | 27.5s | 100% | |
| o4 Mini | 100% | $0.0085 | 15.4s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0033 | 31.2s | 100% | |
| GPT-5 Mini | 100% | $0.0044 | 29.6s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 47.5% | Correct "no violations" response | | |
| 61.9% | No hallucinated violations | | |
Long text (~1594 words), big codex (51 entries)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-4.1 Nano | 93% | $0.0003 | 2.2s | |
| LFM2 24B | 100% | $0.0001 | 1.3s | |
| Ministral 3 8B | 93% | $0.0006 | 1.9s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0011 | 6.4s | |
| Inception Mercury | 93% | $0.0002 | 11.6s | |
| ByteDance Seed 1.6 Flash | 93% | $0.0010 | 14.8s | |
| Inception Mercury 2 | 100% | $0.0039 | 5.5s | |
| GPT-5.4 Mini (Reasoning, Low) | 85% | $0.0048 | 6.5s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 27.0s | |
| Mistral Small 4 (Reasoning) | 92% | $0.0026 | 21.1s | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0033 | 22.0s | |
| Grok 4.1 Fast | 92% | $0.0028 | 27.0s | |
| Z.AI GLM 5 Turbo | 68% | $0.0096 | 25.3s | |
| Gemini 2.5 Flash (Reasoning) | 85% | $0.012 | 18.1s | |
| GPT-5.2 | 92% | $0.015 | 17.4s | |
| GPT-5 Nano | 100% | $0.0038 | 1.3m | |
| Aion 2.0 | 100% | $0.0086 | 1.3m | |
| o4 Mini | 85% | $0.021 | 41.4s | |
| GPT-5.1 | 100% | $0.029 | 28.6s | |
| Nemotron 3 Super | 93% | $0.0000 | 2.3m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Stealth: Healer Alpha | 100% | 100% | 100% | |
| Inception Mercury 2 | 100% | 100% | 100% | |
| GPT-5 Nano | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | 100% | 100% | |
| LFM2 24B | 100% | 100% | 100% | |
| Claude Opus 4.6 | 94% | 61% | 61% | |
| Ministral 3 8B | 93% | 56% | 56% | |
| GPT-4.1 Nano | 93% | 55% | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | 55% | 55% | |
| Nemotron 3 Super | 93% | 55% | 55% | |
| Inception Mercury | 93% | 55% | 55% | |
| Nemotron 3 Nano | 93% | 55% | 55% | |
| ByteDance Seed 1.6 Flash | 93% | 55% | 55% | |
| GPT-5 | 92% | 50% | 50% | |
| GPT-5.2 | 92% | 50% | 50% | |
| Grok 4.1 Fast | 92% | 50% | 50% | |
| Mistral Small 4 (Reasoning) | 92% | 50% | 50% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| LFM2 24B | 100% | $0.0001 | 1.3s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0011 | 6.4s | 100% | |
| Inception Mercury 2 | 100% | $0.0039 | 5.5s | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 27.0s | 100% | |
| GPT-5.1 | 100% | $0.029 | 28.6s | 100% | |
| GPT-5 Nano | 100% | $0.0038 | 1.3m | 100% | |
| Aion 2.0 | 100% | $0.0086 | 1.3m | 100% | |
| Ministral 3 8B | 93% | $0.0006 | 1.9s | 56% | |
| GPT-4.1 Nano | 93% | $0.0003 | 2.2s | 55% | |
| Inception Mercury | 93% | $0.0002 | 11.6s | 55% | |
| ByteDance Seed 1.6 Flash | 93% | $0.0010 | 14.8s | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0033 | 22.0s | 55% | |
| Mistral Small 4 (Reasoning) | 92% | $0.0026 | 21.1s | 50% | |
| Grok 4.1 Fast | 92% | $0.0028 | 27.0s | 50% | |
| GPT-5.2 | 92% | $0.015 | 17.4s | 50% | |
| Claude Opus 4.6 | 94% | $0.048 | 16.8s | 61% | |
| GPT-5.4 Mini (Reasoning, Low) | 85% | $0.0048 | 6.5s | 40% | |
| Gemini 2.5 Flash (Reasoning) | 85% | $0.012 | 18.1s | 40% | |
| Claude Opus 4.6 (Reasoning) | 100% | $0.143 | 1.3m | 100% | |
| Nemotron 3 Super | 93% | $0.0000 | 2.3m | 55% | |