Codex Violation Detection
This test detects factual inconsistencies between a story bible (codex) and prose passages. The model must output structured XML identifying each violation by paragraph number and offending substring.
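The page doesn't reproduce the output schema, but as a rough sketch of the task, a model response might be parsed like this (the `<violations>`/`<violation>` element and attribute names below are illustrative assumptions, not the benchmark's actual schema):

```python
import xml.etree.ElementTree as ET

# Hypothetical model output -- tag and attribute names are illustrative
# assumptions, not the benchmark's actual schema.
response = """
<violations>
  <violation paragraph="2" substring="her blue eyes">
    Codex states Mara's eyes are green.
  </violation>
  <violation paragraph="5" substring="the sword Dawnbreaker">
    Codex names the sword Nightfall.
  </violation>
</violations>
"""

def parse_violations(xml_text: str) -> list[dict]:
    """Extract (paragraph, substring, explanation) triples from a report."""
    root = ET.fromstring(xml_text)
    return [
        {
            "paragraph": int(v.get("paragraph")),
            "substring": v.get("substring"),
            "explanation": (v.text or "").strip(),
        }
        for v in root.iter("violation")
    ]

found = parse_violations(response)
print(len(found))  # 2 violations recovered
```

Scoring then reduces to matching each recovered (paragraph, substring) pair against the ground-truth violations.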
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Grok 4 Fast | 96% | $0.0018 | 12.8s | |
| Grok 4.1 Fast | 97% | $0.0021 | 21.1s | |
| Z.AI GLM 5 Turbo | 97% | $0.0072 | 17.1s | |
| Gemini 2.5 Flash (Reasoning) | 97% | $0.0079 | 12.5s | |
| Stealth: Healer Alpha | 95% | $0.0000 | 21.8s | |
| GPT-5.4 | 97% | $0.013 | 8.8s | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | $0.011 | 18.0s | |
| Qwen 3.5 Flash | 96% | $0.0038 | 1.0m | |
| Stealth: Hunter Alpha | 95% | $0.0000 | 44.0s | |
| ByteDance Seed 2.0 Lite | 96% | $0.0067 | 1.1m | |
| ByteDance Seed 1.6 | 96% | $0.0067 | 1.0m | |
| Z.AI GLM 4.7 | 96% | $0.0091 | 1.0m | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0020 | 17.6s | |
| Aion 2.0 | 95% | $0.0084 | 1.3m | |
| Z.AI GLM 5 | 97% | $0.013 | 56.4s | |
| GPT-5.4 (Reasoning, Low) | 96% | $0.020 | 13.5s | |
| GPT-5.4 Mini (Reasoning, Low) | 93% | $0.0055 | 6.7s | |
| Gemini 2.5 Flash | 91% | $0.0025 | 2.8s | |
| Inception Mercury 2 | 92% | $0.0030 | 4.6s | |
| MiniMax M2.7 | 92% | $0.0022 | 29.4s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
8 low-scoring outliers hidden: Llama 3.1 8B (46.2%), GPT-5.4 Nano (Reasoning, Low) (42.6%), Gemma 3 4B (39.2%), WizardLM 2 8x22b (36.9%), GPT-5.4 Nano (36.6%), Rocinante 12B (36.1%), GPT-4.1 Nano (21.9%), LFM2 24B (14.2%).
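The quadrant construction is straightforward: split the scatter plot at the median cost and median score. A minimal sketch with made-up (cost, score) pairs (not the data above):

```python
from statistics import median

# Hypothetical (cost_usd, score) pairs -- illustrative only.
results = {
    "model-a": (0.0018, 0.98),
    "model-b": (0.013, 0.97),
    "model-c": (0.0025, 0.91),
    "model-d": (0.041, 0.99),
}

cost_median = median(c for c, _ in results.values())
score_median = median(s for _, s in results.values())

# "Cheap and good" quadrant: below median cost, above median score.
best_value = [
    name for name, (cost, score) in results.items()
    if cost < cost_median and score > score_median
]
print(best_value)  # only model-a lands in that quadrant here
```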
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 99% | 97% | 97% | |
| Claude Opus 4.5 | 99% | 97% | 97% | |
| Claude Opus 4.6 (Reasoning) | 99% | 96% | 96% | |
| Gemini 2.5 Pro | 99% | 95% | 95% | |
| Grok 4.20 (Beta, Reasoning) | 98% | 95% | 95% | |
| Claude Opus 4.6 | 99% | 94% | 94% | |
| Qwen 3.5 27B | 98% | 94% | 94% | |
| Gemini 2.5 Flash (Reasoning) | 97% | 92% | 92% | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | 92% | 92% | |
| Z.AI GLM 5 Turbo | 97% | 92% | 92% | |
| Grok 4.1 Fast | 97% | 92% | 92% | |
| GPT-5.2 | 97% | 94% | 92% | |
| GPT-5.4 | 97% | 91% | 91% | |
| Qwen 3.5 35B | 97% | 92% | 91% | |
| Claude Sonnet 4.6 (Reasoning) | 97% | 91% | 91% | |
| Z.AI GLM 5 | 97% | 92% | 91% | |
| GPT-5.1 | 97% | 93% | 90% | |
| Qwen 3.5 122B | 96% | 91% | 90% | |
| GPT-5 | 97% | 93% | 90% | |
| Grok 4 | 97% | 91% | 89% | |
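Per the caption, the stability metric is just the product of the two score columns; note the displayed percentages are rounded, so recomputing from them can differ from the Stability column by a point:

```python
def stability(median_score: float, consistency: float) -> float:
    """Stability as defined above: median score times consistency (0-1 scale)."""
    return median_score * consistency

# E.g. a model with a 97% median and 94% consistency:
print(f"{stability(0.97, 0.94):.0%}")  # 91%
```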
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Grok 4.1 Fast | 97% | $0.0021 | 21.1s | 92% | |
| Gemini 2.5 Flash (Reasoning) | 97% | $0.0079 | 12.5s | 92% | |
| Z.AI GLM 5 Turbo | 97% | $0.0072 | 17.1s | 92% | |
| GPT-5.4 | 97% | $0.013 | 8.8s | 91% | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | $0.011 | 18.0s | 92% | |
| Gemini 3 Flash (Preview) | 94% | $0.0031 | 4.5s | 83% | |
| Stealth: Healer Alpha | 95% | $0.0000 | 21.8s | 86% | |
| Inception Mercury 2 | 92% | $0.0030 | 4.6s | 83% | |
| Gemini 2.5 Flash | 91% | $0.0025 | 2.8s | 82% | |
| GPT-5.4 Mini (Reasoning, Low) | 93% | $0.0055 | 6.7s | 82% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0020 | 17.6s | 84% | |
| Grok 4 Fast | 96% | $0.0018 | 12.8s | 77% | |
| Grok 4.20 (Beta, Reasoning) | 98% | $0.026 | 16.8s | 95% | |
| Gemini 3.1 Flash Lite (Preview) | 90% | $0.0019 | 2.2s | 75% | |
| Claude Sonnet 4.5 | 97% | $0.024 | 8.9s | 89% | |
| GPT-5.4 (Reasoning, Low) | 96% | $0.020 | 13.5s | 88% | |
| Stealth: Hunter Alpha | 95% | $0.0000 | 44.0s | 86% | |
| Claude Sonnet 4 | 95% | $0.023 | 9.0s | 87% | |
| Claude Opus 4.5 | 99% | $0.041 | 9.7s | 97% | |
| MiniMax M2.7 | 92% | $0.0022 | 29.4s | 83% | |
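The composite formula isn't published on this page; purely to illustrate the idea, one could min-max-normalize cost and latency and average the four factors with equal weights (the weights, caps, and normalization here are all assumptions, not the leaderboard's actual formula):

```python
def composite(score: float, cost_usd: float, seconds: float, stability: float,
              cost_cap: float = 0.05, time_cap: float = 120.0) -> float:
    """Toy composite: average of score, cheapness, speed, and stability.

    cost_cap and time_cap are arbitrary normalization ceilings chosen for
    this sketch, not values used by the real leaderboard.
    """
    cheapness = 1.0 - min(cost_usd / cost_cap, 1.0)
    speed = 1.0 - min(seconds / time_cap, 1.0)
    return (score + cheapness + speed + stability) / 4.0

# With the Grok 4.1 Fast and Grok 4.20 (Beta, Reasoning) rows above, the
# cheaper model wins despite its slightly lower raw score:
print(composite(0.97, 0.0021, 21.1, 0.92) > composite(0.98, 0.026, 16.8, 0.95))
```

Any formula of this shape rewards cheap, fast, stable models, which matches the ordering visible in the table.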
| Model | Total ▼ | Small codex (7 entries), short passage (165 words) | Large codex (40 entries), short passage (165 words) | Small codex (7 entries), long passage (734 words) | Large codex (40 entries), long passage (1,019 words) | 5 codex entries | 10 codex entries | 20 codex entries | 40 codex entries |
|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | 99% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 98% |
| Gemini 3.1 Pro (Preview) | 99% | 100% | 99% | 99% | 96% | 100% | 100% | 99% | 100% |
| Claude Opus 4.6 (Reasoning) | 99% | 100% | 97% | 98% | 96% | 100% | 100% | 100% | 100% |
| Gemini 2.5 Pro | 99% | 99% | 100% | 99% | 96% | 100% | 99% | 97% | 99% |
| Claude Opus 4.6 | 99% | 100% | 98% | 99% | 91% | 100% | 100% | 100% | 100% |
| Grok 4.20 (Beta, Reasoning) | 98% | 100% | 97% | 100% | 96% | 100% | 98% | 96% | 100% |
| Qwen 3.5 27B | 98% | 100% | 94% | 98% | 96% | 99% | 100% | 94% | 100% |
| Z.AI GLM 5 Turbo | 97% | 100% | 98% | 94% | 94% | 99% | 100% | 95% | 100% |
| Grok 4.1 Fast | 97% | 99% | 96% | 93% | 95% | 100% | 100% | 95% | 100% |
| GPT-5.4 | 97% | 100% | 98% | 91% | 95% | 100% | 100% | 95% | 99% |
| Grok 4 | 97% | 97% | 99% | 90% | 98% | 97% | 99% | 96% | 100% |
| Gemini 3 Flash (Preview, Reasoning) | 97% | 100% | 96% | 97% | 92% | 100% | 100% | 95% | 98% |
| GPT-5.2 | 97% | 100% | 95% | 96% | 96% | 100% | 99% | 95% | 96% |
| Gemini 2.5 Flash (Reasoning) | 97% | 100% | 95% | 95% | 95% | 100% | 100% | 93% | 98% |
| Qwen 3.5 35B | 97% | 99% | 96% | 93% | 93% | 100% | 100% | 94% | 100% |
Matrix
Small codex (7 entries), short passage (165 words)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0012 | 1.9s | |
| DeepSeek V3.2 | 98% | $0.0006 | 16.2s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 18.6s | |
| Mistral Medium 3.1 | 97% | $0.0018 | 5.6s | |
| Gemini 3 Flash (Preview) | 100% | $0.0025 | 3.6s | |
| Grok 4 Fast | 99% | $0.0012 | 9.5s | |
| Inception Mercury | 95% | $0.0004 | 5.9s | |
| DeepSeek-V2 Chat | 99% | $0.0011 | 12.7s | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0011 | 14.8s | |
| Grok 4.1 Fast | 99% | $0.0011 | 13.4s | |
| Stealth: Aurora Alpha | 87% | — | 5.8s | |
| Gemini 2.5 Flash | 97% | $0.0017 | 2.1s | |
| Inception Mercury 2 | 97% | $0.0018 | 2.6s | |
| Gemini 2.5 Flash Lite (Reasoning) | 95% | $0.0012 | 6.9s | |
| GPT-5.4 Nano (Reasoning) | 89% | $0.0022 | 9.4s | |
| GPT-5.4 Mini (Reasoning, Low) | 99% | $0.0041 | 4.4s | |
| Llama 3.1 70B | 97% | $0.0009 | 14.9s | |
| Stealth: Hunter Alpha | 100% | $0.0000 | 28.3s | |
| Z.AI GLM 4.5 | 98% | $0.0019 | 15.2s | |
| Gemma 3 12B | 95% | $0.0001 | 14.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Claude Sonnet 4.5 | 100% | 100% | 100% | |
| Stealth: Hunter Alpha | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0012 | 1.9s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0025 | 3.6s | 100% | |
| Gemini 2.5 Flash | 97% | $0.0017 | 2.1s | 97% | |
| Grok 4 Fast | 99% | $0.0012 | 9.5s | 97% | |
| Inception Mercury 2 | 97% | $0.0018 | 2.6s | 97% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 18.6s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0052 | 7.8s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 99% | $0.0041 | 4.4s | 97% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0011 | 14.8s | 98% | |
| Grok 4.1 Fast | 99% | $0.0011 | 13.4s | 97% | |
| Z.AI GLM 5 Turbo | 100% | $0.0047 | 10.5s | 98% | |
| DeepSeek-V2 Chat | 99% | $0.0011 | 12.7s | 95% | |
| Stealth: Hunter Alpha | 100% | $0.0000 | 28.3s | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 99% | $0.0080 | 3.4s | 97% | |
| Mistral Medium 3.1 | 97% | $0.0018 | 5.6s | 92% | |
| GPT-4.1 | 100% | $0.0065 | 7.7s | 97% | |
| Llama 3.1 70B | 97% | $0.0009 | 14.9s | 95% | |
| DeepSeek V3.2 | 98% | $0.0006 | 16.2s | 94% | |
| Z.AI GLM 4.5 | 98% | $0.0019 | 15.2s | 95% | |
| Inception Mercury | 95% | $0.0004 | 5.9s | 89% | |
| Evaluator | Median |
|---|---|
| Accuracy (recall) | 91.5% |
| Precision | 99.0% |
| Structural validity | 100.0% |
Large codex (40 entries), short passage (165 words)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Grok 4 Fast | 97% | $0.0024 | 15.1s | |
| Stealth: Healer Alpha | 95% | $0.0000 | 25.4s | |
| Z.AI GLM 5 Turbo | 98% | $0.0097 | 19.4s | |
| Grok 4.1 Fast | 96% | $0.0032 | 24.9s | |
| Grok 4.20 (Beta) | 95% | $0.0090 | 3.5s | |
| Gemini 3 Flash (Preview) | 93% | $0.0039 | 6.2s | |
| Inception Mercury 2 | 92% | $0.0040 | 5.5s | |
| Stealth: Hunter Alpha | 95% | $0.0000 | 44.1s | |
| Gemini 2.5 Flash | 90% | $0.0038 | 4.0s | |
| GPT-5.4 | 98% | $0.018 | 11.4s | |
| GPT-5.4 Mini (Reasoning, Low) | 92% | $0.0070 | 6.8s | |
| GPT-4.1 | 94% | $0.011 | 15.1s | |
| MiniMax M2.5 | 91% | $0.0023 | 30.1s | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.0098 | 14.4s | |
| Gemini 3.1 Flash Lite (Preview) | 90% | $0.0031 | 3.0s | |
| Gemini 3 Flash (Preview, Reasoning) | 96% | $0.012 | 19.4s | |
| Gemini 2.5 Flash Lite (Reasoning) | 90% | $0.0030 | 31.0s | |
| GPT-5.4 Nano (Reasoning) | 90% | $0.0045 | 18.9s | |
| Mistral Small 4 (Reasoning) | 87% | $0.0025 | 16.2s | |
| Qwen 3.5 Flash | 96% | $0.0045 | 1.3m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Claude Sonnet 4.5 | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 99% | 98% | 98% | |
| Gemini 2.5 Pro | 100% | 97% | 97% | |
| Grok 4 | 99% | 98% | 97% | |
| Grok 4.20 (Beta, Reasoning) | 97% | 98% | 95% | |
| Claude Sonnet 4.6 (Reasoning) | 97% | 97% | 95% | |
| Claude Opus 4.6 | 98% | 97% | 95% | |
| Claude Sonnet 4.6 | 98% | 96% | 94% | |
| Z.AI GLM 5 Turbo | 98% | 96% | 94% | |
| GPT-5.4 Mini (Reasoning) | 96% | 97% | 94% | |
| MoonshotAI: Kimi K2.5 | 98% | 96% | 94% | |
| Grok 4 Fast | 97% | 97% | 94% | |
| ByteDance Seed 1.6 | 97% | 97% | 94% | |
| GPT-5.4 | 98% | 96% | 93% | |
| Qwen 3.5 35B | 96% | 96% | 93% | |
| Grok 4.1 Fast | 96% | 96% | 93% | |
| GPT-5 | 96% | 96% | 93% | |
| Qwen 3.5 Flash | 96% | 96% | 92% | |
| GPT-5.4 (Reasoning) | 96% | 96% | 92% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Grok 4 Fast | 97% | $0.0024 | 15.1s | 94% | |
| Z.AI GLM 5 Turbo | 98% | $0.0097 | 19.4s | 94% | |
| Grok 4.1 Fast | 96% | $0.0032 | 24.9s | 93% | |
| Grok 4.20 (Beta) | 95% | $0.0090 | 3.5s | 90% | |
| Stealth: Healer Alpha | 95% | $0.0000 | 25.4s | 91% | |
| GPT-5.4 | 98% | $0.018 | 11.4s | 93% | |
| Gemini 3 Flash (Preview) | 93% | $0.0039 | 6.2s | 88% | |
| Claude Sonnet 4.5 | 100% | $0.036 | 11.3s | 100% | |
| Inception Mercury 2 | 92% | $0.0040 | 5.5s | 88% | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.0098 | 14.4s | 90% | |
| Gemini 3 Flash (Preview, Reasoning) | 96% | $0.012 | 19.4s | 91% | |
| GPT-5.4 Mini (Reasoning, Low) | 92% | $0.0070 | 6.8s | 86% | |
| Gemini 3.1 Flash Lite (Preview) | 90% | $0.0031 | 3.0s | 84% | |
| Grok 4.20 (Beta, Reasoning) | 97% | $0.031 | 17.2s | 95% | |
| Stealth: Hunter Alpha | 95% | $0.0000 | 44.1s | 88% | |
| GPT-4.1 | 94% | $0.011 | 15.1s | 86% | |
| Gemini 2.5 Flash | 90% | $0.0038 | 4.0s | 83% | |
| Gemini 2.5 Pro | 100% | $0.038 | 23.8s | 97% | |
| Claude Sonnet 4.6 | 98% | $0.038 | 13.8s | 94% | |
| GPT-5.4 Nano (Reasoning) | 90% | $0.0045 | 18.9s | 84% | |
| Evaluator | Median |
|---|---|
| Accuracy (recall) | 75.0% |
| Precision | 91.3% |
| Structural validity | 100.0% |
Small codex (7 entries), long passage (734 words)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Claude Haiku 4.5 | 93% | $0.0051 | 4.6s | |
| GPT-4.1 Mini | 87% | $0.0011 | 4.4s | |
| Grok 4 Fast | 93% | $0.0019 | 16.5s | |
| Stealth: Healer Alpha | 90% | $0.0000 | 25.0s | |
| Qwen 3.5 Plus (2026-02-15) | 95% | $0.0025 | 21.6s | |
| MiniMax M2.7 | 91% | $0.0021 | 28.0s | |
| DeepSeek-V2 Chat | 85% | $0.0013 | 9.8s | |
| Gemini 2.5 Flash | 87% | $0.0020 | 2.4s | |
| Gemini 3 Flash (Preview) | 88% | $0.0022 | 3.6s | |
| Grok 4.1 Fast | 93% | $0.0025 | 31.2s | |
| Mistral Medium 3.1 | 79% | $0.0021 | 5.0s | |
| Z.AI GLM 5 Turbo | 94% | $0.0077 | 19.5s | |
| Inception Mercury 2 | 88% | $0.0032 | 4.9s | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.0094 | 16.9s | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | $0.012 | 20.1s | |
| Stealth: Hunter Alpha | 95% | $0.0000 | 1.2m | |
| Z.AI GLM 4.5 | 85% | $0.0018 | 13.7s | |
| Ministral 3 14B | 80% | $0.0006 | 4.7s | |
| Gemini 2.5 Flash Lite (Reasoning) | 86% | $0.0021 | 16.2s | |
| MiniMax M2.5 | 88% | $0.0018 | 23.0s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 99% | 97% | 97% | |
| Gemini 2.5 Pro | 99% | 96% | 96% | |
| Qwen 3.5 27B | 98% | 97% | 95% | |
| Claude Opus 4.6 | 99% | 95% | 95% | |
| Qwen 3.5 397B A17B | 98% | 95% | 95% | |
| Claude Opus 4.6 (Reasoning) | 98% | 93% | 93% | |
| Claude Haiku 4.5 | 93% | 100% | 93% | |
| GPT-5.2 | 96% | 97% | 93% | |
| Qwen 3.5 Flash | 98% | 96% | 93% | |
| o4 Mini | 95% | 95% | 92% | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | 94% | 92% | |
| GPT-5 Mini | 96% | 95% | 91% | |
| Gemini 3 Pro (Preview) | 95% | 94% | 91% | |
| Aion 2.0 | 96% | 93% | 90% | |
| Qwen 3.5 122B | 97% | 91% | 89% | |
| ByteDance Seed 1.6 | 94% | 92% | 89% | |
| ByteDance Seed 2.0 Mini | 94% | 91% | 88% | |
| GPT-5.1 | 96% | 90% | 87% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Claude Haiku 4.5 | 93% | $0.0051 | 4.6s | 93% | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | $0.012 | 20.1s | 92% | |
| Qwen 3.5 Plus (2026-02-15) | 95% | $0.0025 | 21.6s | 84% | |
| Grok 4 Fast | 93% | $0.0019 | 16.5s | 83% | |
| Inception Mercury 2 | 88% | $0.0032 | 4.9s | 86% | |
| Gemini 3 Flash (Preview) | 88% | $0.0022 | 3.6s | 83% | |
| Claude Opus 4.5 | 100% | $0.030 | 9.1s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.0094 | 16.9s | 87% | |
| GPT-4.1 Mini | 87% | $0.0011 | 4.4s | 82% | |
| Z.AI GLM 5 Turbo | 94% | $0.0077 | 19.5s | 85% | |
| MiniMax M2.7 | 91% | $0.0021 | 28.0s | 85% | |
| Gemini 3.1 Flash Lite (Preview) | 84% | $0.0014 | 1.9s | 81% | |
| Grok 4.20 (Beta, Reasoning) | 100% | $0.032 | 17.9s | 100% | |
| Grok 4.1 Fast | 93% | $0.0025 | 31.2s | 82% | |
| MiniMax M2.5 | 88% | $0.0018 | 23.0s | 83% | |
| Stealth: Healer Alpha | 90% | $0.0000 | 25.0s | 79% | |
| Gemini 2.5 Flash | 87% | $0.0020 | 2.4s | 76% | |
| Qwen 3.5 Flash | 98% | $0.0044 | 1.3m | 93% | |
| Gemini 2.5 Flash Lite (Reasoning) | 86% | $0.0021 | 16.2s | 81% | |
| Claude Opus 4.6 | 99% | $0.030 | 10.2s | 95% | |
| Evaluator | Median |
|---|---|
| Accuracy (recall) | 67.3% |
| Precision | 88.6% |
| Structural validity | 100.0% |
Large codex (40 entries), long passage (1,019 words)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Grok 4 Fast | 96% | $0.0032 | 21.1s | |
| Grok 4.1 Fast | 95% | $0.0040 | 40.0s | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.016 | 24.6s | |
| Stealth: Healer Alpha | 91% | $0.0000 | 33.9s | |
| Inception Mercury 2 | 89% | $0.0066 | 10.4s | |
| Gemini 3 Flash (Preview) | 90% | $0.0068 | 9.3s | |
| Stealth: Hunter Alpha | 91% | $0.0000 | 1.1m | |
| Z.AI GLM 5 Turbo | 94% | $0.016 | 36.7s | |
| Gemini 2.5 Flash | 87% | $0.0051 | 5.7s | |
| Gemini 2.5 Flash Lite (Reasoning) | 89% | $0.0041 | 42.5s | |
| GPT-5.4 | 95% | $0.028 | 20.6s | |
| Qwen 3.5 Flash | 93% | $0.0058 | 1.4m | |
| GPT-5 Mini | 95% | $0.017 | 1.3m | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.010 | 9.6s | |
| Gemini 3 Flash (Preview, Reasoning) | 92% | $0.021 | 33.3s | |
| Z.AI GLM 4.7 | 93% | $0.018 | 2.3m | |
| Aion 2.0 | 87% | $0.015 | 2.3m | |
| Gemini 3.1 Flash Lite (Preview) | 83% | $0.0038 | 3.7s | |
| Mistral Large 3 | 85% | $0.0060 | 25.8s | |
| GPT-5.4 Nano (Reasoning) | 70% | $0.0096 | 40.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Grok 4 | 98% | 97% | 95% | |
| Claude Sonnet 4.6 (Reasoning) | 97% | 97% | 95% | |
| GPT-5 | 97% | 98% | 95% | |
| Claude Opus 4.6 (Reasoning) | 96% | 97% | 94% | |
| Gemini 3.1 Pro (Preview) | 96% | 98% | 94% | |
| Qwen 3.5 27B | 96% | 99% | 94% | |
| Claude Opus 4.5 | 96% | 98% | 94% | |
| GPT-5.4 | 95% | 98% | 93% | |
| GPT-5.2 | 96% | 96% | 93% | |
| GPT-5 Mini | 95% | 97% | 93% | |
| Gemini 2.5 Pro | 96% | 96% | 93% | |
| GPT-5.4 (Reasoning) | 95% | 96% | 92% | |
| Nemotron 3 Super | 95% | 97% | 92% | |
| Gemini 2.5 Flash (Reasoning) | 95% | 97% | 92% | |
| GPT-5.4 (Reasoning, Low) | 95% | 97% | 92% | |
| Qwen 3.5 397B A17B | 97% | 96% | 92% | |
| Grok 4.20 (Beta, Reasoning) | 96% | 96% | 92% | |
| Z.AI GLM 5 Turbo | 94% | 97% | 92% | |
| o4 Mini High | 95% | 97% | 91% | |
| GPT-5.1 | 94% | 96% | 90% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Grok 4 Fast | 96% | $0.0032 | 21.1s | 89% | |
| Grok 4.1 Fast | 95% | $0.0040 | 40.0s | 90% | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.016 | 24.6s | 92% | |
| Z.AI GLM 5 Turbo | 94% | $0.016 | 36.7s | 92% | |
| GPT-5.4 | 95% | $0.028 | 20.6s | 93% | |
| GPT-5 Mini | 95% | $0.017 | 1.3m | 93% | |
| Gemini 3 Flash (Preview) | 90% | $0.0068 | 9.3s | 87% | |
| Stealth: Hunter Alpha | 91% | $0.0000 | 1.1m | 86% | |
| Inception Mercury 2 | 89% | $0.0066 | 10.4s | 85% | |
| Gemini 3 Flash (Preview, Reasoning) | 92% | $0.021 | 33.3s | 88% | |
| GPT-5.4 (Reasoning, Low) | 95% | $0.042 | 30.5s | 92% | |
| Qwen 3.5 Flash | 93% | $0.0058 | 1.4m | 86% | |
| Gemini 2.5 Flash Lite (Reasoning) | 89% | $0.0041 | 42.5s | 85% | |
| Grok 4.20 (Beta, Reasoning) | 96% | $0.049 | 27.1s | 92% | |
| Stealth: Healer Alpha | 91% | $0.0000 | 33.9s | 81% | |
| Qwen 3.5 27B | 96% | $0.027 | 2.2m | 94% | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.010 | 9.6s | 84% | |
| o4 Mini | 92% | $0.031 | 42.1s | 89% | |
| GPT-5.2 | 96% | $0.052 | 55.4s | 93% | |
| Gemini 2.5 Flash | 87% | $0.0051 | 5.7s | 81% | |
| Evaluator | Median |
|---|---|
| Accuracy (recall) | 56.3% |
| Precision | 89.1% |
| Structural validity | 100.0% |
Tiers
5 codex entries
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 97% | $0.0008 | 1.3s | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.3s | |
| Gemini 2.5 Flash Lite (Reasoning) | 98% | $0.0008 | 5.2s | |
| Grok 4 Fast | 99% | $0.0008 | 6.9s | |
| DeepSeek-V2 Chat | 96% | $0.0007 | 6.2s | |
| DeepSeek V3 (2024-12-26) | 96% | $0.0007 | 7.3s | |
| GPT-5.4 Nano (Reasoning) | 89% | $0.0014 | 6.3s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 12.7s | |
| Qwen 3 32B | 100% | $0.0005 | 13.1s | |
| Grok 4.1 Fast | 100% | $0.0009 | 9.2s | |
| Claude 3.5 Haiku | 97% | $0.0022 | 5.3s | |
| Stealth: Aurora Alpha | 94% | — | 2.7s | |
| Qwen 3.5 Plus (2026-02-15) | 97% | $0.0014 | 12.5s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0032 | 3.8s | |
| Llama 3.1 Nemotron 70B | 96% | $0.0020 | 8.2s | |
| GPT-4.1 | 100% | $0.0035 | 3.2s | |
| Hermes 3 405B | 100% | $0.0016 | 10.8s | |
| DeepSeek V3 (2025-03-24) | 99% | $0.0006 | 9.6s | |
| Llama 3.1 70B | 95% | $0.0008 | 10.8s | |
| Z.AI GLM 4.6 | 97% | $0.0014 | 23.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.3s | 100% | |
| Grok 4.1 Fast | 100% | $0.0009 | 9.2s | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 12.7s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0032 | 3.8s | 100% | |
| GPT-4.1 | 100% | $0.0035 | 3.2s | 100% | |
| Qwen 3 32B | 100% | $0.0005 | 13.1s | 100% | |
| Hermes 3 405B | 100% | $0.0016 | 10.8s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 97% | $0.0008 | 1.3s | 92% | |
| Grok 4 Fast | 99% | $0.0008 | 6.9s | 95% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0037 | 5.9s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 98% | $0.0008 | 5.2s | 94% | |
| GPT-5.4 | 100% | $0.0051 | 3.7s | 100% | |
| DeepSeek V3 (2025-03-24) | 99% | $0.0006 | 9.6s | 95% | |
| Z.AI GLM 4.5 | 99% | $0.0013 | 11.9s | 95% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0055 | 7.3s | 100% | |
| Z.AI GLM 5 Turbo | 99% | $0.0031 | 7.6s | 95% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.0055 | 8.9s | 100% | |
| Inception Mercury 2 | 95% | $0.0011 | 1.8s | 88% | |
| DeepSeek-V2 Chat | 96% | $0.0007 | 6.2s | 89% | |
| Claude Sonnet 4 | 100% | $0.0081 | 4.2s | 100% | |
| Evaluator | Median |
|---|---|
| Accuracy (recall) | 92.0% |
| Precision | 96.0% |
| Structural validity | 100.0% |
10 codex entries
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 4.5s | |
| Gemini 3.1 Flash Lite (Preview) | 94% | $0.0011 | 1.5s | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.7s | |
| Mistral Medium 3.1 | 99% | $0.0015 | 3.7s | |
| Grok 4 Fast | 97% | $0.0010 | 7.8s | |
| Inception Mercury 2 | 98% | $0.0020 | 3.2s | |
| Stealth: Healer Alpha | 97% | $0.0000 | 19.0s | |
| Qwen3 235B A22B Instruct 2507 | 96% | $0.0004 | 10.3s | |
| Qwen 3.5 Plus (2026-02-15) | 94% | $0.0018 | 12.4s | |
| Mistral Small Creative | 93% | $0.0003 | 2.3s | |
| Grok 4.1 Fast | 100% | $0.0013 | 18.5s | |
| Mistral Small 4 (Reasoning) | 93% | $0.0014 | 11.4s | |
| MiniMax M2.5 | 97% | $0.0014 | 17.9s | |
| Z.AI GLM 5 Turbo | 100% | $0.0035 | 9.0s | |
| Writer: Palmyra X5 | 94% | $0.0034 | 6.6s | |
| GPT-5.4 Nano (Reasoning) | 87% | $0.0018 | 7.4s | |
| ByteDance Seed 1.6 Flash | 94% | $0.0005 | 6.5s | |
| MiniMax M2.7 | 98% | $0.0014 | 19.7s | |
| GPT-5.4 | 100% | $0.0061 | 4.2s | |
| Gemini 2.5 Flash Lite (Reasoning) | 95% | $0.0014 | 10.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| Claude Sonnet 4.5 | 100% | 100% | 100% | |
| Qwen 3.5 35B | 100% | 100% | 100% | |
| ByteDance Seed 2.0 Mini | 100% | 100% | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview) | 100% | 100% | 100% | |
| ByteDance Seed 2.0 Lite | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.7s | 100% | |
| Stealth: Aurora Alpha | 100% | — | 4.5s | 100% | |
| Mistral Medium 3.1 | 99% | $0.0015 | 3.7s | 96% | |
| Z.AI GLM 5 Turbo | 100% | $0.0035 | 9.0s | 100% | |
| GPT-5.4 | 100% | $0.0061 | 4.2s | 100% | |
| Inception Mercury 2 | 98% | $0.0020 | 3.2s | 94% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0049 | 7.5s | 100% | |
| Grok 4.1 Fast | 100% | $0.0013 | 18.5s | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.0067 | 11.1s | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.0096 | 6.3s | 100% | |
| Qwen3 235B A22B Instruct 2507 | 96% | $0.0004 | 10.3s | 90% | |
| ByteDance Seed 1.6 Flash | 94% | $0.0005 | 6.5s | 90% | |
| Claude Sonnet 4 | 100% | $0.012 | 4.6s | 100% | |
| Claude Sonnet 4.5 | 100% | $0.012 | 5.1s | 100% | |
| Llama 3.1 Nemotron 70B | 97% | $0.0030 | 9.7s | 91% | |
| Mistral Small Creative | 93% | $0.0003 | 2.3s | 83% | |
| Gemini 2.5 Flash Lite (Reasoning) | 95% | $0.0014 | 10.7s | 89% | |
| GPT-4o, Aug. 6th (temp=0) | 94% | $0.0063 | 2.2s | 91% | |
| GPT-5.4 Mini (Reasoning, Low) | 95% | $0.0029 | 4.0s | 86% | |
| MiniMax M2.7 | 98% | $0.0014 | 19.7s | 93% | |
| Evaluator | Median |
|---|---|
| Accuracy (recall) | 86.7% |
| Precision | 93.4% |
| Structural validity | 100.0% |
20 codex entries
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Grok 4.20 (Beta) | 99% | $0.0047 | 2.0s | |
| Grok 4 Fast | 87% | $0.0018 | 13.2s | |
| Gemini 2.5 Flash | 93% | $0.0023 | 2.5s | |
| Stealth: Healer Alpha | 92% | $0.0000 | 23.5s | |
| Gemini 3 Flash (Preview) | 92% | $0.0031 | 4.0s | |
| Grok 4.1 Fast | 95% | $0.0022 | 21.9s | |
| Mistral Medium 3.1 | 92% | $0.0027 | 8.1s | |
| ByteDance Seed 1.6 Flash | 87% | $0.0009 | 11.8s | |
| Gemini 3.1 Flash Lite (Preview) | 88% | $0.0019 | 2.2s | |
| Gemini 2.5 Flash Lite (Reasoning) | 91% | $0.0022 | 17.1s | |
| GPT-4.1 | 75% | $0.0076 | 8.2s | |
| GPT-5.4 Mini (Reasoning, Low) | 91% | $0.0054 | 8.4s | |
| Z.AI GLM 5 Turbo | 95% | $0.0077 | 20.7s | |
| GPT-5.4 | 95% | $0.011 | 8.4s | |
| Claude Haiku 4.5 | 83% | $0.0070 | 5.0s | |
| Inception Mercury 2 | 86% | $0.0026 | 4.0s | |
| Gemini 2.5 Flash (Reasoning) | 93% | $0.0076 | 12.3s | |
| Stealth: Hunter Alpha | 93% | $0.0000 | 52.6s | |
| GPT-5.4 Nano (Reasoning) | 55% | $0.0019 | 14.4s | |
| Gemini 2.5 Flash Lite | 71% | $0.0004 | 1.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 99% | 97% | 97% | |
| Grok 4.20 (Beta) | 99% | 95% | 95% | |
| GPT-5 | 97% | 97% | 94% | |
| GPT-5.1 | 96% | 96% | 93% | |
| GPT-5.4 | 95% | 97% | 93% | |
| Claude Sonnet 4.5 | 97% | 93% | 93% | |
| Gemini 3 Flash (Preview) | 92% | 100% | 92% | |
| GPT-5.2 | 95% | 96% | 92% | |
| Gemini 2.5 Pro | 97% | 96% | 92% | |
| Grok 4.20 (Beta, Reasoning) | 96% | 94% | 92% | |
| ByteDance Seed 2.0 Lite | 97% | 95% | 91% | |
| Grok 4 | 96% | 94% | 91% | |
| Gemini 3 Pro (Preview) | 95% | 95% | 91% | |
| Grok 4.1 Fast | 95% | 95% | 91% | |
| Claude Sonnet 4 | 92% | 97% | 89% | |
| GPT-4o, Aug. 6th (temp=0) | 91% | 97% | 89% | |
| Claude Sonnet 4.6 (Reasoning) | 93% | 95% | 89% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Grok 4.20 (Beta) | 99% | $0.0047 | 2.0s | 95% | |
| Gemini 3 Flash (Preview) | 92% | $0.0031 | 4.0s | 92% | |
| Gemini 2.5 Flash | 93% | $0.0023 | 2.5s | 88% | |
| Mistral Medium 3.1 | 92% | $0.0027 | 8.1s | 88% | |
| GPT-5.4 | 95% | $0.011 | 8.4s | 93% | |
| Gemini 3.1 Flash Lite (Preview) | 88% | $0.0019 | 2.2s | 88% | |
| Grok 4.1 Fast | 95% | $0.0022 | 21.9s | 91% | |
| GPT-4o, Aug. 6th (temp=0) | 91% | $0.011 | 3.8s | 89% | |
| Gemini 2.5 Flash (Reasoning) | 93% | $0.0076 | 12.3s | 86% | |
| GPT-5.4 Mini (Reasoning, Low) | 91% | $0.0054 | 8.4s | 85% | |
| Z.AI GLM 5 Turbo | 95% | $0.0077 | 20.7s | 89% | |
| Claude Sonnet 4.5 | 97% | $0.022 | 8.9s | 93% | |
| Gemini 2.5 Flash Lite (Reasoning) | 91% | $0.0022 | 17.1s | 86% | |
| Claude Opus 4.6 | 100% | $0.035 | 8.0s | 100% | |
| Claude Opus 4.5 | 100% | $0.037 | 8.6s | 100% | |
| Stealth: Healer Alpha | 92% | $0.0000 | 23.5s | 84% | |
| Gemini 3 Flash (Preview, Reasoning) | 95% | $0.013 | 20.4s | 88% | |
| GPT-5.4 (Reasoning, Low) | 94% | $0.018 | 11.7s | 88% | |
| Inception Mercury 2 | 86% | $0.0026 | 4.0s | 80% | |
| ByteDance Seed 1.6 Flash | 87% | $0.0009 | 11.8s | 80% | |
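The composite ranking above blends performance, cost, speed, and stability into one score. The page does not publish the weights or the normalization, so the following is only a minimal sketch of one plausible scheme; the weight vector, the min-max ranges, and the function names are all illustrative assumptions, not the leaderboard's actual formula.

```python
def minmax(x, lo, hi, invert=False):
    """Normalize x into [0, 1]; invert for metrics where lower is better."""
    if hi == lo:
        return 1.0
    v = (x - lo) / (hi - lo)
    return 1.0 - v if invert else v

def composite(score, cost, seconds, stability,
              cost_range=(0.0, 0.05), time_range=(0.0, 180.0),
              weights=(0.5, 0.2, 0.1, 0.2)):
    """Weighted blend of the four leaderboard columns.

    Higher score/stability and lower cost/time rank better. The weights
    and ranges are assumptions chosen for illustration only.
    """
    parts = (
        score,
        minmax(cost, *cost_range, invert=True),
        minmax(seconds, *time_range, invert=True),
        stability,
    )
    return sum(w * p for w, p in zip(weights, parts))
```

Under such a scheme a cheap, fast, stable model can outrank a slightly higher-scoring but expensive one, which matches the spirit of the ranking above even if the exact ordering differs.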
| Median | Evaluator |
|---|---|
| 74.5% | Accuracy (recall) |
| 88.4% | Precision |
| 100.0% | Structural validity |
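The evaluator rows above report recall against the ground-truth violations, precision over the model's reported violations, and the structural validity of the XML output. A minimal sketch of how such scoring could work, assuming each violation is keyed by its (paragraph number, substring) pair; that pairing rule, the helper name, and the example data are assumptions based on the task description, not the benchmark's actual harness:

```python
def score_detections(predicted, truth):
    """Recall and precision over violation keys.

    Each violation is represented as a (paragraph_number, substring)
    tuple; this keying rule is an assumption drawn from the task
    description, not the benchmark's published matcher.
    """
    pred, gold = set(predicted), set(truth)
    true_positives = len(pred & gold)
    recall = true_positives / len(gold) if gold else 1.0
    precision = true_positives / len(pred) if pred else 1.0
    return recall, precision

# Hypothetical run: three true violations, the model finds two of them
# plus one spurious detection.
truth = [(2, "blue eyes"), (5, "in 1864"), (9, "her brother")]
predicted = [(2, "blue eyes"), (5, "in 1864"), (7, "the castle")]
recall, precision = score_detections(predicted, truth)  # 2/3 recall, 2/3 precision
```

Structural validity would be checked separately, e.g. by attempting to parse the model's XML before any matching is done.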
40 codex entries
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time |
|---|---|---|---|
| Stealth: Healer Alpha | 98% | $0.0000 | 16.1s |
| Grok 4.1 Fast | 100% | $0.0016 | 10.1s |
| Grok 4 Fast | 100% | $0.0021 | 12.1s |
| Stealth: Hunter Alpha | 97% | $0.0000 | 32.5s |
| Qwen 3.5 Flash | 98% | $0.0036 | 47.1s |
| Z.AI GLM 5 Turbo | 100% | $0.0057 | 13.3s |
| Gemini 2.5 Flash Lite (Reasoning) | 97% | $0.0016 | 11.6s |
| Gemini 2.5 Flash | 96% | $0.0027 | 3.3s |
| Z.AI GLM 4.7 | 98% | $0.0070 | 28.9s |
| Nemotron 3 Super | 97% | $0.0000 | 2.8m |
| Inception Mercury 2 | 94% | $0.0029 | 4.4s |
| Gemini 2.5 Flash (Reasoning) | 98% | $0.0072 | 10.7s |
| GPT-5 Mini | 99% | $0.0075 | 1.3m |
| GPT-5.4 | 99% | $0.011 | 7.9s |
| Mistral Small 4 (Reasoning) | 94% | $0.0020 | 15.1s |
| GPT-5.4 Mini (Reasoning, Low) | 94% | $0.0054 | 6.2s |
| Qwen 3.5 9B | 95% | $0.0020 | 2.0m |
| Stealth: Aurora Alpha | 92% | — | 5.8s |
| Mistral Large 3 | 94% | $0.0040 | 6.6s |
| ByteDance Seed 2.0 Mini | 98% | $0.0030 | 2.6m |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% |
| Claude Opus 4.6 | 100% | 100% | 100% |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% |
| Grok 4.1 Fast | 100% | 100% | 100% |
| Grok 4 | 100% | 100% | 100% |
| Grok 4 Fast | 100% | 100% | 100% |
| Z.AI GLM 5 Turbo | 100% | 98% | 98% |
| Qwen 3.5 35B | 100% | 98% | 98% |
| Qwen 3.5 27B | 100% | 97% | 97% |
| GPT-5.4 Mini (Reasoning) | 99% | 97% | 97% |
| ByteDance Seed 2.0 Lite | 97% | 100% | 97% |
| GPT-5.4 (Reasoning) | 99% | 96% | 96% |
| GPT-5 Mini | 99% | 96% | 96% |
| GPT-5.4 | 99% | 96% | 96% |
| Claude Opus 4.5 | 98% | 97% | 95% |
| Aion 2.0 | 98% | 97% | 95% |
| Gemini 2.5 Pro | 99% | 95% | 95% |
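The stability ranking is stated as median × consistency. The page does not spell out how consistency is derived from repeat runs, so the following is a minimal sketch under one assumption: that consistency is 1 minus the max-min spread of the run scores (scores on a 0-1 scale). Only the median × consistency product itself comes from the leaderboard.

```python
import statistics

def stability(run_scores):
    """Stability = median run score × consistency.

    'Consistency' is modeled here as 1 minus the max-min spread of the
    run scores; this definition is an assumption, since the leaderboard
    only states stability = median × consistency.
    """
    median = statistics.median(run_scores)
    consistency = 1.0 - (max(run_scores) - min(run_scores))
    return median * consistency

# A model that always scores 0.99 ends up more stable than one that
# sometimes hits 1.00 but swings between runs.
steady = stability([0.99, 0.99, 0.99])   # 0.99 * 1.0  = 0.99
swingy = stability([1.00, 0.90, 1.00])   # 1.00 * 0.90 = 0.90
```

This captures why, in the table above, a model with a perfect median score can still rank below 100% stability when its consistency dips.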
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Grok 4.1 Fast | 100% | $0.0016 | 10.1s | 100% |
| Grok 4 Fast | 100% | $0.0021 | 12.1s | 100% |
| Z.AI GLM 5 Turbo | 100% | $0.0057 | 13.3s | 98% |
| Stealth: Healer Alpha | 98% | $0.0000 | 16.1s | 95% |
| GPT-5.4 (Reasoning, Low) | 100% | $0.017 | 11.0s | 100% |
| GPT-5.4 | 99% | $0.011 | 7.9s | 96% |
| Gemini 2.5 Flash (Reasoning) | 98% | $0.0072 | 10.7s | 94% |
| Gemini 2.5 Flash Lite (Reasoning) | 97% | $0.0016 | 11.6s | 91% |
| GPT-5.4 Mini (Reasoning) | 99% | $0.013 | 18.6s | 97% |
| Grok 4.20 (Beta, Reasoning) | 100% | $0.022 | 13.0s | 100% |
| Gemini 2.5 Flash | 96% | $0.0027 | 3.3s | 90% |
| Stealth: Hunter Alpha | 97% | $0.0000 | 32.5s | 91% |
| Gemini 3 Flash (Preview, Reasoning) | 98% | $0.0093 | 14.7s | 92% |
| Qwen 3.5 35B | 100% | $0.015 | 41.5s | 98% |
| ByteDance Seed 2.0 Lite | 97% | $0.0060 | 51.4s | 97% |
| Z.AI GLM 4.7 | 98% | $0.0070 | 28.9s | 92% |
| Inception Mercury 2 | 94% | $0.0029 | 4.4s | 88% |
| Qwen 3.5 Flash | 98% | $0.0036 | 47.1s | 92% |
| Z.AI GLM 5 | 97% | $0.0082 | 33.0s | 92% |
| GPT-5 Mini | 99% | $0.0075 | 1.3m | 96% |
| Median | Evaluator |
|---|---|
| 76.8% | Accuracy (recall) |
| 92.6% | Precision |
| 100.0% | Structural validity |