Codex Violation Detection
Detects factual inconsistencies between a story bible and prose passages. The model must output structured XML identifying each violation with paragraph number and substring.
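As a rough illustration of the output format described above, the sketch below parses one possible XML response and rejects malformed output. The tag and attribute names (`violations`, `violation`, `paragraph`, `substring`) and the sample text are assumptions made for this example. The benchmark's actual schema is not shown on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: these tag and attribute names are assumptions,
# not the benchmark's documented format.
SAMPLE_OUTPUT = """
<violations>
  <violation paragraph="2">
    <substring>her green eyes</substring>
  </violation>
  <violation paragraph="5">
    <substring>the northern gate</substring>
  </violation>
</violations>
"""


def parse_violations(xml_text):
    """Return a list of (paragraph, substring) pairs, or None if the
    output is not well-formed or does not match the assumed schema."""
    try:
        root = ET.fromstring(xml_text.strip())
    except ET.ParseError:
        return None
    if root.tag != "violations":
        return None
    found = []
    for node in root.findall("violation"):
        paragraph = node.get("paragraph")
        substring = node.findtext("substring")
        if paragraph is None or not paragraph.isdigit():
            return None
        if substring is None or not substring.strip():
            return None
        found.append((int(paragraph), substring.strip()))
    return found


violations = parse_violations(SAMPLE_OUTPUT)
```

Returning `None` for anything that fails to parse keeps the structural check separate from the question of whether the reported violations are correct.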
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Grok 4 Fast | 96% | $0.0018 | 12.8s | |
| Grok 4.1 Fast | 97% | $0.0021 | 21.1s | |
| Z.AI GLM 5 Turbo | 97% | $0.0072 | 17.1s | |
| Gemini 2.5 Flash (Reasoning) | 97% | $0.0079 | 12.5s | |
| Stealth: Healer Alpha | 95% | $0.0000 | 21.8s | |
| Gemma 4 31B | 97% | $0.0009 | 37.9s | |
| GPT-5.4 | 97% | $0.013 | 8.8s | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | $0.011 | 18.0s | |
| DeepSeek V4 Flash (Reasoning) | 97% | $0.0010 | 40.3s | |
| Qwen 3.5 Flash | 96% | $0.0038 | 1.0m | |
| Qwen 3.6 Flash | 96% | $0.011 | 29.9s | |
| Xiaomi MIMO v2.5 | 95% | $0.0058 | 21.8s | |
| Gemma 4 31B (Reasoning) | 97% | $0.0017 | 2.8m | |
| Xiaomi MIMO v2.5 Pro | 96% | $0.0091 | 37.0s | |
| Stealth: Hunter Alpha | 95% | $0.0000 | 44.0s | |
| ByteDance Seed 2.0 Lite | 96% | $0.0067 | 1.1m | |
| ByteDance Seed 1.6 | 96% | $0.0067 | 1.0m | |
| Gemma 4 26B (Reasoning) | 95% | $0.0022 | 2.0m | |
| Qwen 3.6 35B | 96% | $0.013 | 51.0s | |
| Z.AI GLM 4.7 | 96% | $0.0091 | 1.0m | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
10 low-scoring outliers hidden: Claude 3 Haiku (58.4%), Mistral NeMO (53.6%), Llama 3.1 8B (46.2%), GPT-5.4 Nano (Reasoning, Low) (42.6%), Gemma 3 4B (39.2%), WizardLM 2 8x22b (36.9%), GPT-5.4 Nano (36.6%), Rocinante 12B (36.1%), GPT-4.1 Nano (21.9%), LFM2 24B (14.2%).
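The quadrant split described above can be sketched as follows. The five rows are copied from the price-performance table earlier on this page, so the medians computed here are for this small sample only and will differ from the lines drawn on the full chart. How ties with the median are assigned is an assumption.

```python
from statistics import median

# (model, score in %, cost in USD) copied from the price-performance table.
SAMPLE = [
    ("Grok 4 Fast", 96, 0.0018),
    ("Gemma 4 31B", 97, 0.0009),
    ("GPT-5.4", 97, 0.013),
    ("Qwen 3.6 35B", 96, 0.013),
    ("Xiaomi MIMO v2.5", 95, 0.0058),
]


def quadrants(rows):
    """Label each model by its position relative to the median score and cost."""
    score_mid = median(score for _, score, _ in rows)
    cost_mid = median(cost for _, _, cost in rows)
    labels = {}
    for name, score, cost in rows:
        # Assumption: values equal to the median count as "high score" / "low cost".
        perf = "high score" if score >= score_mid else "low score"
        price = "low cost" if cost <= cost_mid else "high cost"
        labels[name] = f"{perf}, {price}"
    return labels


for model, label in quadrants(SAMPLE).items():
    print(f"{model}: {label}")
```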
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 99% | 97% | 97% | |
| Claude Opus 4.5 | 99% | 97% | 97% | |
| Claude Opus 4.6 (Reasoning) | 99% | 96% | 96% | |
| Gemini 2.5 Pro | 99% | 95% | 95% | |
| Grok 4.20 (Reasoning) | 98% | 95% | 95% | |
| Grok 4.20 (Beta, Reasoning) | 98% | 95% | 95% | |
| Qwen3.6 Max Preview | 98% | 95% | 95% | |
| Claude Opus 4.6 | 99% | 94% | 94% | |
| Qwen 3.5 27B | 98% | 94% | 94% | |
| Z.AI GLM 5.1 | 98% | 94% | 94% | |
| GPT-5.5 | 98% | 93% | 93% | |
| GPT-5.5 (Reasoning) | 97% | 93% | 93% | |
| Qwen 3.5 Plus (2026-04-20) | 97% | 93% | 93% | |
| GPT-5.5 (Reasoning, Low) | 97% | 93% | 93% | |
| Gemma 4 31B (Reasoning) | 97% | 93% | 93% | |
| Gemini 2.5 Flash (Reasoning) | 97% | 92% | 92% | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | 92% | 92% | |
| Z.AI GLM 5 Turbo | 97% | 92% | 92% | |
| Grok 4.1 Fast | 97% | 92% | 92% | |
| Claude Opus 4.7 (Reasoning) | 97% | 92% | 92% | |
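The stability figure in the table above is described as median × consistency. A minimal sketch of that product follows. The per-run scores are made up for illustration, and because the page does not say how consistency is derived, it is passed in as a given number. Recomputing from the rounded percentages in the table may not reproduce the Stability column exactly.

```python
from statistics import median


def stability(run_scores, consistency):
    """Median run score times consistency, both as fractions in [0, 1]."""
    return median(run_scores) * consistency


# Hypothetical per-run scores for one model (not taken from the benchmark).
runs = [0.99, 1.00, 0.98, 0.99, 1.00]
value = stability(runs, consistency=0.97)
print(f"{value:.1%}")
```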
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Grok 4.1 Fast | 97% | $0.0021 | 21.1s | 92% | |
| Gemini 2.5 Flash (Reasoning) | 97% | $0.0079 | 12.5s | 92% | |
| Z.AI GLM 5 Turbo | 97% | $0.0072 | 17.1s | 92% | |
| GPT-5.4 | 97% | $0.013 | 8.8s | 91% | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | $0.011 | 18.0s | 92% | |
| Stealth: Healer Alpha | 95% | $0.0000 | 21.8s | 86% | |
| Gemini 3 Flash (Preview) | 94% | $0.0031 | 4.5s | 83% | |
| Gemma 4 31B | 97% | $0.0009 | 37.9s | 90% | |
| Inception Mercury 2 | 92% | $0.0030 | 4.6s | 83% | |
| DeepSeek V4 Flash (Reasoning) | 97% | $0.0010 | 40.3s | 90% | |
| Gemini 2.5 Flash | 91% | $0.0025 | 2.8s | 82% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0020 | 17.6s | 84% | |
| GPT-5.4 Mini (Reasoning, Low) | 93% | $0.0055 | 6.7s | 82% | |
| Grok 4 Fast | 96% | $0.0018 | 12.8s | 77% | |
| Grok 4.20 (Beta, Reasoning) | 98% | $0.026 | 16.8s | 95% | |
| Stealth: Hunter Alpha | 95% | $0.0000 | 44.0s | 86% | |
| GPT-5.5 | 98% | $0.030 | 7.7s | 93% | |
| Xiaomi MIMO v2.5 Pro | 96% | $0.0091 | 37.0s | 89% | |
| Grok 4.20 (Reasoning) | 98% | $0.015 | 45.4s | 95% | |
| Claude Sonnet 4.5 | 97% | $0.024 | 8.9s | 89% | |
| Model | Total | Matrix: small codex (7 entries), short passage (165 words) | Matrix: large codex (40 entries), short passage (165 words) | Matrix: small codex (7 entries), long passage (734 words) | Matrix: large codex (40 entries), long passage (1,019 words) | Tiers: 5 codex entries | Tiers: 10 codex entries | Tiers: 20 codex entries | Tiers: 40 codex entries |
|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | 99% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 98% |
| Gemini 3.1 Pro (Preview) | 99% | 100% | 99% | 99% | 96% | 100% | 100% | 99% | 100% |
| Claude Opus 4.6 (Reasoning) | 99% | 100% | 97% | 98% | 96% | 100% | 100% | 100% | 100% |
| Gemini 2.5 Pro | 99% | 99% | 100% | 99% | 96% | 100% | 99% | 97% | 99% |
| Claude Opus 4.6 | 99% | 100% | 98% | 99% | 91% | 100% | 100% | 100% | 100% |
| Grok 4.20 (Beta, Reasoning) | 98% | 100% | 97% | 100% | 96% | 100% | 98% | 96% | 100% |
| Qwen3.6 Max Preview | 98% | 100% | 97% | 98% | 98% | 100% | 100% | 94% | 100% |
| Grok 4.20 (Reasoning) | 98% | 100% | 97% | 97% | 98% | 100% | 97% | 96% | 100% |
| Z.AI GLM 5.1 | 98% | 100% | 98% | 96% | 96% | 100% | 99% | 93% | 99% |
| GPT-5.5 | 98% | 100% | 100% | 94% | 95% | 100% | 99% | 92% | 100% |
| Qwen 3.5 27B | 98% | 100% | 94% | 98% | 96% | 99% | 100% | 94% | 100% |
| Grok 4.3 (Reasoning) | 97% | 100% | 97% | 97% | 96% | 100% | 96% | 95% | 99% |
| GPT-5.5 (Reasoning) | 97% | 100% | 97% | 92% | 97% | 100% | 100% | 93% | 100% |
| Z.AI GLM 5 Turbo | 97% | 100% | 98% | 94% | 94% | 99% | 100% | 95% | 100% |
| Gemma 4 31B (Reasoning) | 97% | 99% | 96% | 96% | 93% | 100% | 100% | 95% | 99% |
matrix
Small codex (7 entries), short passage (165 words)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Healer Alpha | 100% | $0.0000 | 18.6s | |
| Gemini 3.1 Flash Lite | 100% | $0.0012 | 2.0s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0012 | 1.9s | |
| Gemini 3.1 Flash Lite (Reasoning) | 90% | $0.0011 | 6.5s | |
| DeepSeek V3.2 | 98% | $0.0006 | 16.2s | |
| Grok 4 Fast | 99% | $0.0012 | 9.5s | |
| Stealth: Hunter Alpha | 100% | $0.0000 | 28.3s | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0011 | 14.8s | |
| Mistral Medium 3.1 | 97% | $0.0018 | 5.6s | |
| DeepSeek-V2 Chat | 99% | $0.0011 | 12.7s | |
| Grok 4.1 Fast | 99% | $0.0011 | 13.4s | |
| Gemini 3 Flash (Preview) | 100% | $0.0025 | 3.6s | |
| DeepSeek V4 Flash | 96% | $0.0002 | 6.6s | |
| Inception Mercury | 95% | $0.0004 | 5.9s | |
| GPT-5.4 Nano (Reasoning) | 89% | $0.0022 | 9.4s | |
| Stealth: Aurora Alpha | 87% | — | 5.8s | |
| Xiaomi MIMO v2.5 | 99% | $0.0031 | 11.9s | |
| Gemini 2.5 Flash Lite (Reasoning) | 95% | $0.0012 | 6.9s | |
| Llama 3.1 70B | 97% | $0.0009 | 14.9s | |
| Gemma 4 26B | 97% | $0.0003 | 24.6s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Qwen 3.6 Flash | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0012 | 1.9s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0025 | 3.6s | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0012 | 2.0s | 97% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 18.6s | 100% | |
| Grok 4 Fast | 99% | $0.0012 | 9.5s | 97% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0011 | 14.8s | 98% | |
| Gemini 2.5 Flash | 97% | $0.0017 | 2.1s | 97% | |
| Grok 4.1 Fast | 99% | $0.0011 | 13.4s | 97% | |
| Inception Mercury 2 | 97% | $0.0018 | 2.6s | 97% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0052 | 7.8s | 100% | |
| Stealth: Hunter Alpha | 100% | $0.0000 | 28.3s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 99% | $0.0041 | 4.4s | 97% | |
| Xiaomi MIMO v2.5 | 99% | $0.0031 | 11.9s | 97% | |
| DeepSeek-V2 Chat | 99% | $0.0011 | 12.7s | 95% | |
| Z.AI GLM 5 Turbo | 100% | $0.0047 | 10.5s | 98% | |
| DeepSeek V4 Flash | 96% | $0.0002 | 6.6s | 93% | |
| DeepSeek V3.2 | 98% | $0.0006 | 16.2s | 94% | |
| Z.AI GLM 4.5 | 98% | $0.0019 | 15.2s | 95% | |
| Llama 3.1 70B | 97% | $0.0009 | 14.9s | 95% | |
| Xiaomi MIMO v2.5 Pro | 100% | $0.0051 | 20.3s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 93.0% | Accuracy (recall) | ||
| 100.0% | Precision | ||
| 100.0% | Structural validity |
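The evaluator table above reports recall and precision separately. The sketch below shows one way those two numbers could be computed for a single response. The matching rule (exact match on paragraph number and substring), the handling of empty sets, and the sample violations are all assumptions for illustration. The benchmark's own scoring rules are not given on this page.

```python
def recall_and_precision(expected, predicted):
    """Compare planted violations with reported ones.

    Both arguments are iterables of (paragraph, substring) pairs.
    Assumption: a report counts as a hit only on an exact pair match.
    """
    expected, predicted = set(expected), set(predicted)
    hits = len(expected & predicted)
    # Assumption: with nothing to find (or nothing reported) the ratio is 1.0.
    recall = hits / len(expected) if expected else 1.0
    precision = hits / len(predicted) if predicted else 1.0
    return recall, precision


# Hypothetical example: three planted violations, three reported, two correct.
expected = {(2, "her green eyes"), (5, "the northern gate"), (7, "three days earlier")}
predicted = {(2, "her green eyes"), (5, "the northern gate"), (6, "a silver ring")}

recall, precision = recall_and_precision(expected, predicted)
print(f"recall={recall:.1%} precision={precision:.1%}")
```

Under these assumptions, a missed violation lowers recall, while a false alarm lowers precision.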
Large codex (40 entries), short passage (165 words)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Grok 4 Fast | 97% | $0.0024 | 15.1s | |
| Stealth: Healer Alpha | 95% | $0.0000 | 25.4s | |
| Z.AI GLM 5 Turbo | 98% | $0.0097 | 19.4s | |
| Grok 4.1 Fast | 96% | $0.0032 | 24.9s | |
| Grok 4.20 | 94% | $0.0081 | 8.6s | |
| Grok 4.20 (Beta) | 95% | $0.0090 | 3.5s | |
| Gemini 3 Flash (Preview) | 93% | $0.0039 | 6.2s | |
| DeepSeek V4 Flash | 92% | $0.0005 | 12.9s | |
| Inception Mercury 2 | 92% | $0.0040 | 5.5s | |
| Gemma 4 31B | 94% | $0.0013 | 1.2m | |
| Stealth: Hunter Alpha | 95% | $0.0000 | 44.1s | |
| DeepSeek V4 Flash (Reasoning) | 96% | $0.0015 | 58.3s | |
| Gemini 2.5 Flash | 90% | $0.0038 | 4.0s | |
| GPT-5.4 | 98% | $0.018 | 11.4s | |
| Xiaomi MIMO v2.5 | 94% | $0.0071 | 25.0s | |
| GPT-5.4 Mini (Reasoning, Low) | 92% | $0.0070 | 6.8s | |
| Xiaomi MIMO v2.5 Pro | 96% | $0.0093 | 35.3s | |
| Gemini 3.1 Flash Lite (Reasoning) | 90% | $0.0027 | 3.2s | |
| Qwen 3.6 Flash | 97% | $0.012 | 33.9s | |
| GPT-4.1 | 94% | $0.011 | 15.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.5 | 100% | 100% | 100% | |
| GPT-5.5 | 100% | 100% | 100% | |
| Claude Sonnet 4.5 | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 99% | 98% | 98% | |
| Gemini 2.5 Pro | 100% | 97% | 97% | |
| Grok 4 | 99% | 98% | 97% | |
| GPT-5.5 (Reasoning, Low) | 99% | 97% | 96% | |
| Z.AI GLM 5.1 | 98% | 97% | 96% | |
| GPT-5.5 (Reasoning) | 97% | 98% | 96% | |
| Grok 4.20 (Beta, Reasoning) | 97% | 98% | 95% | |
| Claude Sonnet 4.6 (Reasoning) | 97% | 97% | 95% | |
| Claude Opus 4.6 | 98% | 97% | 95% | |
| MoonshotAI: Kimi K2.6 | 98% | 97% | 95% | |
| Qwen3.6 Max Preview | 97% | 97% | 94% | |
| DeepSeek V4 Pro (Reasoning) | 97% | 96% | 94% | |
| Claude Sonnet 4.6 | 98% | 96% | 94% | |
| Grok 4.20 (Reasoning) | 97% | 98% | 94% | |
| Z.AI GLM 5 Turbo | 98% | 96% | 94% | |
| Qwen 3.6 Flash | 97% | 96% | 94% | |
| GPT-5.4 Mini (Reasoning) | 96% | 97% | 94% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Grok 4 Fast | 97% | $0.0024 | 15.1s | 94% | |
| Z.AI GLM 5 Turbo | 98% | $0.0097 | 19.4s | 94% | |
| Grok 4.1 Fast | 96% | $0.0032 | 24.9s | 93% | |
| Grok 4.20 (Beta) | 95% | $0.0090 | 3.5s | 90% | |
| Stealth: Healer Alpha | 95% | $0.0000 | 25.4s | 91% | |
| GPT-5.4 | 98% | $0.018 | 11.4s | 93% | |
| Gemini 3 Flash (Preview) | 93% | $0.0039 | 6.2s | 88% | |
| Claude Sonnet 4.5 | 100% | $0.036 | 11.3s | 100% | |
| Inception Mercury 2 | 92% | $0.0040 | 5.5s | 88% | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.0098 | 14.4s | 90% | |
| Grok 4.20 | 94% | $0.0081 | 8.6s | 87% | |
| GPT-5.5 | 100% | $0.042 | 10.0s | 100% | |
| Qwen 3.6 Flash | 97% | $0.012 | 33.9s | 94% | |
| Gemini 3 Flash (Preview, Reasoning) | 96% | $0.012 | 19.4s | 91% | |
| GPT-5.4 Mini (Reasoning, Low) | 92% | $0.0070 | 6.8s | 86% | |
| DeepSeek V4 Flash | 92% | $0.0005 | 12.9s | 84% | |
| Xiaomi MIMO v2.5 | 94% | $0.0071 | 25.0s | 89% | |
| Gemini 3.1 Flash Lite (Preview) | 90% | $0.0031 | 3.0s | 84% | |
| Xiaomi MIMO v2.5 Pro | 96% | $0.0093 | 35.3s | 90% | |
| Grok 4.20 (Beta, Reasoning) | 97% | $0.031 | 17.2s | 95% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 81.3% | Accuracy (recall) | ||
| 93.0% | Precision | ||
| 100.0% | Structural validity |
Small codex (7 entries), long passage (734 words)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 4 31B | 97% | $0.0005 | 44.4s | |
| Grok 4 Fast | 93% | $0.0019 | 16.5s | |
| Claude Haiku 4.5 | 93% | $0.0051 | 4.6s | |
| GPT-4.1 Mini | 87% | $0.0011 | 4.4s | |
| Stealth: Healer Alpha | 90% | $0.0000 | 25.0s | |
| Gemma 4 26B | 90% | $0.0005 | 15.6s | |
| DeepSeek V4 Flash | 89% | $0.0002 | 8.0s | |
| MiniMax M2.7 | 91% | $0.0021 | 28.0s | |
| Qwen 3.5 Plus (2026-02-15) | 95% | $0.0025 | 21.6s | |
| Grok 4.1 Fast | 93% | $0.0025 | 31.2s | |
| DeepSeek V4 Flash (Reasoning) | 94% | $0.0010 | 45.3s | |
| Z.AI GLM 4.5 Air | 92% | $0.0026 | 39.2s | |
| DeepSeek-V2 Chat | 85% | $0.0013 | 9.8s | |
| Z.AI GLM 5 Turbo | 94% | $0.0077 | 19.5s | |
| Gemini 3 Flash (Preview) | 88% | $0.0022 | 3.6s | |
| Gemini 2.5 Flash | 87% | $0.0020 | 2.4s | |
| Mistral Medium 3.1 | 79% | $0.0021 | 5.0s | |
| Stealth: Hunter Alpha | 95% | $0.0000 | 1.2m | |
| Xiaomi MIMO v2.5 | 93% | $0.0057 | 22.2s | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.0094 | 16.9s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 99% | 97% | 97% | |
| Claude Opus 4.7 | 97% | 100% | 97% | |
| Gemini 2.5 Pro | 99% | 96% | 96% | |
| Qwen 3.5 27B | 98% | 97% | 95% | |
| Claude Opus 4.6 | 99% | 95% | 95% | |
| Qwen 3.5 397B A17B | 98% | 95% | 95% | |
| Qwen3.6 Max Preview | 98% | 97% | 94% | |
| Claude Opus 4.6 (Reasoning) | 98% | 93% | 93% | |
| Claude Haiku 4.5 | 93% | 100% | 93% | |
| Gemma 4 31B | 97% | 96% | 93% | |
| GPT-5.2 | 96% | 97% | 93% | |
| Qwen 3.5 Flash | 98% | 96% | 93% | |
| o4 Mini | 95% | 95% | 92% | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | 94% | 92% | |
| Qwen 3.6 27B | 95% | 94% | 91% | |
| Qwen 3.5 Plus (2026-04-20) | 96% | 94% | 91% | |
| Grok 4.20 (Reasoning) | 97% | 95% | 91% | |
| GPT-5 Mini | 96% | 95% | 91% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 93% | $0.0051 | 4.6s | 93% | |
| Gemma 4 31B | 97% | $0.0005 | 44.4s | 93% | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | $0.012 | 20.1s | 92% | |
| Qwen 3.5 Plus (2026-02-15) | 95% | $0.0025 | 21.6s | 84% | |
| Grok 4 Fast | 93% | $0.0019 | 16.5s | 83% | |
| DeepSeek V4 Flash (Reasoning) | 94% | $0.0010 | 45.3s | 88% | |
| Gemma 4 26B | 90% | $0.0005 | 15.6s | 83% | |
| Xiaomi MIMO v2.5 | 93% | $0.0057 | 22.2s | 87% | |
| Inception Mercury 2 | 88% | $0.0032 | 4.9s | 86% | |
| Claude Opus 4.5 | 100% | $0.030 | 9.1s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.0094 | 16.9s | 87% | |
| Qwen 3.5 Flash | 98% | $0.0044 | 1.3m | 93% | |
| Gemini 3 Flash (Preview) | 88% | $0.0022 | 3.6s | 83% | |
| MiniMax M2.7 | 91% | $0.0021 | 28.0s | 85% | |
| Z.AI GLM 5 Turbo | 94% | $0.0077 | 19.5s | 85% | |
| GPT-4.1 Mini | 87% | $0.0011 | 4.4s | 82% | |
| DeepSeek V4 Flash | 89% | $0.0002 | 8.0s | 80% | |
| Grok 4.1 Fast | 93% | $0.0025 | 31.2s | 82% | |
| Qwen 3.6 Flash | 95% | $0.012 | 33.6s | 89% | |
| Z.AI GLM 4.5 Air | 92% | $0.0026 | 39.2s | 85% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 70.5% | Accuracy (recall) | ||
| 89.3% | Precision | ||
| 100.0% | Structural validity |
Large codex (40 entries), long passage (1,019 words)
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Grok 4 Fast | 96% | $0.0032 | 21.1s | |
| Grok 4.1 Fast | 95% | $0.0040 | 40.0s | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.016 | 24.6s | |
| DeepSeek V4 Flash (Reasoning) | 96% | $0.0022 | 1.5m | |
| Stealth: Healer Alpha | 91% | $0.0000 | 33.9s | |
| Gemma 4 31B | 92% | $0.0018 | 55.9s | |
| Inception Mercury 2 | 89% | $0.0066 | 10.4s | |
| Gemini 3 Flash (Preview) | 90% | $0.0068 | 9.3s | |
| Stealth: Hunter Alpha | 91% | $0.0000 | 1.1m | |
| Z.AI GLM 5 Turbo | 94% | $0.016 | 36.7s | |
| Gemini 2.5 Flash | 87% | $0.0051 | 5.7s | |
| Gemini 2.5 Flash Lite (Reasoning) | 89% | $0.0041 | 42.5s | |
| GPT-5.4 | 95% | $0.028 | 20.6s | |
| Qwen 3.5 Flash | 93% | $0.0058 | 1.4m | |
| GPT-5 Mini | 95% | $0.017 | 1.3m | |
| Qwen 3.6 Flash | 94% | $0.020 | 54.5s | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.010 | 9.6s | |
| Xiaomi MIMO v2.5 | 86% | $0.014 | 51.9s | |
| Gemini 3 Flash (Preview, Reasoning) | 92% | $0.021 | 33.3s | |
| Z.AI GLM 4.7 | 93% | $0.018 | 2.3m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.6 Max Preview | 98% | 99% | 97% | |
| Grok 4 | 98% | 97% | 95% | |
| Claude Sonnet 4.6 (Reasoning) | 97% | 97% | 95% | |
| GPT-5 | 97% | 98% | 95% | |
| GPT-5.5 (Reasoning) | 97% | 97% | 95% | |
| Claude Opus 4.6 (Reasoning) | 96% | 97% | 94% | |
| Grok 4.20 (Reasoning) | 98% | 97% | 94% | |
| Gemini 3.1 Pro (Preview) | 96% | 98% | 94% | |
| Qwen 3.5 27B | 96% | 99% | 94% | |
| Claude Opus 4.5 | 96% | 98% | 94% | |
| Claude Opus 4.7 (Reasoning) | 96% | 98% | 94% | |
| GPT-5.5 | 95% | 98% | 94% | |
| GPT-5.5 (Reasoning, Low) | 96% | 96% | 93% | |
| GPT-5.4 | 95% | 98% | 93% | |
| GPT-5.2 | 96% | 96% | 93% | |
| DeepSeek V4 Flash (Reasoning) | 96% | 96% | 93% | |
| GPT-5 Mini | 95% | 97% | 93% | |
| Gemini 2.5 Pro | 96% | 96% | 93% | |
| Grok 4.3 (Reasoning) | 96% | 96% | 93% | |
| GPT-5.4 (Reasoning) | 95% | 96% | 92% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Grok 4 Fast | 96% | $0.0032 | 21.1s | 89% | |
| Grok 4.1 Fast | 95% | $0.0040 | 40.0s | 90% | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.016 | 24.6s | 92% | |
| DeepSeek V4 Flash (Reasoning) | 96% | $0.0022 | 1.5m | 93% | |
| Z.AI GLM 5 Turbo | 94% | $0.016 | 36.7s | 92% | |
| GPT-5.4 | 95% | $0.028 | 20.6s | 93% | |
| Gemma 4 31B | 92% | $0.0018 | 55.9s | 90% | |
| GPT-5 Mini | 95% | $0.017 | 1.3m | 93% | |
| Grok 4.20 (Reasoning) | 98% | $0.030 | 1.4m | 94% | |
| Gemini 3 Flash (Preview) | 90% | $0.0068 | 9.3s | 87% | |
| Qwen 3.6 Flash | 94% | $0.020 | 54.5s | 91% | |
| Stealth: Hunter Alpha | 91% | $0.0000 | 1.1m | 86% | |
| Inception Mercury 2 | 89% | $0.0066 | 10.4s | 85% | |
| Gemini 3 Flash (Preview, Reasoning) | 92% | $0.021 | 33.3s | 88% | |
| GPT-5.4 (Reasoning, Low) | 95% | $0.042 | 30.5s | 92% | |
| Qwen 3.5 Flash | 93% | $0.0058 | 1.4m | 86% | |
| Gemini 2.5 Flash Lite (Reasoning) | 89% | $0.0041 | 42.5s | 85% | |
| Stealth: Healer Alpha | 91% | $0.0000 | 33.9s | 81% | |
| Grok 4.20 (Beta, Reasoning) | 96% | $0.049 | 27.1s | 92% | |
| Qwen 3.5 27B | 96% | $0.027 | 2.2m | 94% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 65.2% | Accuracy (recall) | ||
| 90.6% | Precision | ||
| 100.0% | Structural validity |
tiers
5 codex entries
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| DeepSeek V4 Flash | 98% | $0.0001 | 3.4s | |
| Gemini 3.1 Flash Lite (Reasoning) | 96% | $0.0008 | 1.4s | |
| Gemini 3.1 Flash Lite (Preview) | 97% | $0.0008 | 1.3s | |
| Gemma 4 26B | 100% | $0.0002 | 8.4s | |
| Grok 4 Fast | 99% | $0.0008 | 6.9s | |
| Gemini 2.5 Flash Lite (Reasoning) | 98% | $0.0008 | 5.2s | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.3s | |
| DeepSeek-V2 Chat | 96% | $0.0007 | 6.2s | |
| DeepSeek V3 (2024-12-26) | 96% | $0.0007 | 7.3s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 12.7s | |
| DeepSeek V4 Flash (Reasoning) | 99% | $0.0003 | 12.7s | |
| Qwen 3 32B | 100% | $0.0005 | 13.1s | |
| Grok 4.1 Fast | 100% | $0.0009 | 9.2s | |
| Gemma 4 31B | 100% | $0.0003 | 23.4s | |
| GPT-5.4 Nano (Reasoning) | 89% | $0.0014 | 6.3s | |
| Qwen 3.5 Plus (2026-02-15) | 97% | $0.0014 | 12.5s | |
| DeepSeek V4 Pro | 97% | $0.0010 | 8.1s | |
| DeepSeek V3 (2025-03-24) | 99% | $0.0006 | 9.6s | |
| Stealth: Aurora Alpha | 94% | — | 2.7s | |
| Llama 3.1 Nemotron 70B | 96% | $0.0020 | 8.2s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Grok 4.20 (Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.3s | 100% | |
| Gemma 4 26B | 100% | $0.0002 | 8.4s | 100% | |
| Grok 4.1 Fast | 100% | $0.0009 | 9.2s | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 12.7s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0032 | 3.8s | 100% | |
| DeepSeek V4 Flash | 98% | $0.0001 | 3.4s | 94% | |
| GPT-4.1 | 100% | $0.0035 | 3.2s | 100% | |
| Qwen 3 32B | 100% | $0.0005 | 13.1s | 100% | |
| Hermes 3 405B | 100% | $0.0016 | 10.8s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 97% | $0.0008 | 1.3s | 92% | |
| Grok 4 Fast | 99% | $0.0008 | 6.9s | 95% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0037 | 5.9s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 98% | $0.0008 | 5.2s | 94% | |
| Gemini 3.1 Flash Lite (Reasoning) | 96% | $0.0008 | 1.4s | 91% | |
| GPT-5.4 | 100% | $0.0051 | 3.7s | 100% | |
| DeepSeek V3 (2025-03-24) | 99% | $0.0006 | 9.6s | 95% | |
| DeepSeek V4 Flash (Reasoning) | 99% | $0.0003 | 12.7s | 95% | |
| Z.AI GLM 4.5 | 99% | $0.0013 | 11.9s | 95% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0055 | 7.3s | 100% | |
| Gemma 4 31B | 100% | $0.0003 | 23.4s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 96.0% | Accuracy (recall) | ||
| 98.0% | Precision | ||
| 100.0% | Structural validity |
10 codex entries
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 4.5s | |
| Gemini 3.1 Flash Lite (Preview) | 94% | $0.0011 | 1.5s | |
| Gemma 4 26B | 100% | $0.0003 | 10.1s | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.7s | |
| Mistral Medium 3.1 | 99% | $0.0015 | 3.7s | |
| Grok 4 Fast | 97% | $0.0010 | 7.8s | |
| DeepSeek V4 Pro | 98% | $0.0027 | 11.8s | |
| Inception Mercury 2 | 98% | $0.0020 | 3.2s | |
| Stealth: Healer Alpha | 97% | $0.0000 | 19.0s | |
| Gemma 4 31B | 100% | $0.0005 | 32.7s | |
| Qwen3 235B A22B Instruct 2507 | 96% | $0.0004 | 10.3s | |
| Qwen 3.5 Plus (2026-02-15) | 94% | $0.0018 | 12.4s | |
| Mistral Small Creative | 93% | $0.0003 | 2.3s | |
| DeepSeek V4 Flash (Reasoning) | 98% | $0.0005 | 17.7s | |
| Grok 4.1 Fast | 100% | $0.0013 | 18.5s | |
| Mistral Small 4 (Reasoning) | 93% | $0.0014 | 11.4s | |
| MiniMax M2.5 | 97% | $0.0014 | 17.9s | |
| Z.AI GLM 5 Turbo | 100% | $0.0035 | 9.0s | |
| Gemini 3.1 Flash Lite (Reasoning) | 89% | $0.0010 | 2.2s | |
| Writer: Palmyra X5 | 94% | $0.0034 | 6.6s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| DeepSeek V4 Pro (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.6 27B | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 4.5s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.7s | 100% | |
| Gemma 4 26B | 100% | $0.0003 | 10.1s | 100% | |
| Mistral Medium 3.1 | 99% | $0.0015 | 3.7s | 96% | |
| Z.AI GLM 5 Turbo | 100% | $0.0035 | 9.0s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0049 | 7.5s | 100% | |
| GPT-5.4 | 100% | $0.0061 | 4.2s | 100% | |
| Inception Mercury 2 | 98% | $0.0020 | 3.2s | 94% | |
| Grok 4.1 Fast | 100% | $0.0013 | 18.5s | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.0067 | 11.1s | 100% | |
| DeepSeek V4 Flash (Reasoning) | 98% | $0.0005 | 17.7s | 94% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.0096 | 6.3s | 100% | |
| Gemma 4 31B | 100% | $0.0005 | 32.7s | 100% | |
| Qwen3 235B A22B Instruct 2507 | 96% | $0.0004 | 10.3s | 90% | |
| ByteDance Seed 1.6 Flash | 94% | $0.0005 | 6.5s | 90% | |
| Xiaomi MIMO v2.5 | 99% | $0.0031 | 12.0s | 92% | |
| Claude Sonnet 4 | 100% | $0.012 | 4.6s | 100% | |
| Llama 3.1 Nemotron 70B | 97% | $0.0030 | 9.7s | 91% | |
| DeepSeek V4 Pro | 98% | $0.0027 | 11.8s | 91% | |
| Z.AI GLM 4.7 | 100% | $0.0047 | 26.8s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 89.2% | Accuracy (recall) | ||
| 94.0% | Precision | ||
| 100.0% | Structural validity |
20 codex entries
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Grok 4.20 (Beta) | 99% | $0.0047 | 2.0s | |
| Grok 4.20 | 99% | $0.0047 | 5.5s | |
| Grok 4 Fast | 87% | $0.0018 | 13.2s | |
| Gemini 2.5 Flash | 93% | $0.0023 | 2.5s | |
| Stealth: Healer Alpha | 92% | $0.0000 | 23.5s | |
| Gemini 3 Flash (Preview) | 92% | $0.0031 | 4.0s | |
| Grok 4.1 Fast | 95% | $0.0022 | 21.9s | |
| Mistral Medium 3.1 | 92% | $0.0027 | 8.1s | |
| DeepSeek V4 Flash | 88% | $0.0003 | 8.1s | |
| Gemma 4 31B | 92% | $0.0007 | 20.8s | |
| ByteDance Seed 1.6 Flash | 87% | $0.0009 | 11.8s | |
| DeepSeek V4 Flash (Reasoning) | 94% | $0.0009 | 34.1s | |
| Gemini 3.1 Flash Lite (Reasoning) | 90% | $0.0018 | 2.1s | |
| Gemini 3.1 Flash Lite | 89% | $0.0018 | 2.6s | |
| Gemini 3.1 Flash Lite (Preview) | 88% | $0.0019 | 2.2s | |
| Gemini 2.5 Flash Lite (Reasoning) | 91% | $0.0022 | 17.1s | |
| GPT-4.1 | 75% | $0.0076 | 8.2s | |
| GPT-5.4 Mini (Reasoning, Low) | 91% | $0.0054 | 8.4s | |
| Z.AI GLM 5 Turbo | 95% | $0.0077 | 20.7s | |
| GPT-5.4 | 95% | $0.011 | 8.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.20 | 99% | 98% | 98% | |
| Gemini 3.1 Pro (Preview) | 99% | 97% | 97% | |
| Grok 4.20 (Beta) | 99% | 95% | 95% | |
| GPT-5 | 97% | 97% | 94% | |
| Grok 4.20 (Reasoning) | 96% | 96% | 93% | |
| GPT-5.1 | 96% | 96% | 93% | |
| GPT-5.4 | 95% | 97% | 93% | |
| Claude Sonnet 4.5 | 97% | 93% | 93% | |
| Gemini 3 Flash (Preview) | 92% | 100% | 92% | |
| Gemma 4 31B | 92% | 100% | 92% | |
| GPT-5.2 | 95% | 96% | 92% | |
| Gemini 2.5 Pro | 97% | 96% | 92% | |
| MoonshotAI: Kimi K2.6 | 97% | 94% | 92% | |
| Grok 4.20 (Beta, Reasoning) | 96% | 94% | 92% | |
| ByteDance Seed 2.0 Lite | 97% | 95% | 91% | |
| Grok 4 | 96% | 94% | 91% | |
| Gemini 3 Pro (Preview) | 95% | 95% | 91% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Grok 4.20 | 99% | $0.0047 | 5.5s | 98% | |
| Grok 4.20 (Beta) | 99% | $0.0047 | 2.0s | 95% | |
| Gemini 3 Flash (Preview) | 92% | $0.0031 | 4.0s | 92% | |
| Gemini 2.5 Flash | 93% | $0.0023 | 2.5s | 88% | |
| Gemma 4 31B | 92% | $0.0007 | 20.8s | 92% | |
| Mistral Medium 3.1 | 92% | $0.0027 | 8.1s | 88% | |
| GPT-5.4 | 95% | $0.011 | 8.4s | 93% | |
| Grok 4.1 Fast | 95% | $0.0022 | 21.9s | 91% | |
| Gemini 3.1 Flash Lite (Preview) | 88% | $0.0019 | 2.2s | 88% | |
| Gemini 3.1 Flash Lite (Reasoning) | 90% | $0.0018 | 2.1s | 83% | |
| Gemini 3.1 Flash Lite | 89% | $0.0018 | 2.6s | 83% | |
| GPT-4o, Aug. 6th (temp=0) | 91% | $0.011 | 3.8s | 89% | |
| Z.AI GLM 5 Turbo | 95% | $0.0077 | 20.7s | 89% | |
| Gemini 2.5 Flash (Reasoning) | 93% | $0.0076 | 12.3s | 86% | |
| GPT-5.4 Mini (Reasoning, Low) | 91% | $0.0054 | 8.4s | 85% | |
| Gemini 2.5 Flash Lite (Reasoning) | 91% | $0.0022 | 17.1s | 86% | |
| Claude Sonnet 4.5 | 97% | $0.022 | 8.9s | 93% | |
| Claude Opus 4.6 | 100% | $0.035 | 8.0s | 100% | |
| Stealth: Healer Alpha | 92% | $0.0000 | 23.5s | 84% | |
| Claude Opus 4.5 | 100% | $0.037 | 8.6s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 79.1% | Accuracy (recall) | ||
| 89.2% | Precision | ||
| 100.0% | Structural validity | | |
40 codex entries
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Healer Alpha | 98% | $0.0000 | 16.1s | |
| Grok 4.1 Fast | 100% | $0.0016 | 10.1s | |
| Gemma 4 31B | 100% | $0.0012 | 29.3s | |
| Grok 4 Fast | 100% | $0.0021 | 12.1s | |
| DeepSeek V4 Flash (Reasoning) | 99% | $0.0010 | 36.2s | |
| Stealth: Hunter Alpha | 97% | $0.0000 | 32.5s | |
| Qwen 3.5 Flash | 98% | $0.0036 | 47.1s | |
| Z.AI GLM 5 Turbo | 100% | $0.0057 | 13.3s | |
| Gemini 2.5 Flash Lite (Reasoning) | 97% | $0.0016 | 11.6s | |
| Xiaomi MIMO v2.5 | 97% | $0.0045 | 16.2s | |
| DeepSeek V4 Pro (Reasoning) | 89% | $0.0098 | 1.8m | |
| Xiaomi MIMO v2.5 Pro | 99% | $0.0082 | 28.3s | |
| Gemini 2.5 Flash | 96% | $0.0027 | 3.3s | |
| Z.AI GLM 4.7 | 98% | $0.0070 | 28.9s | |
| Gemma 4 26B (Reasoning) | 99% | $0.0024 | 1.5m | |
| Nemotron 3 Super | 97% | $0.0000 | 2.8m | |
| Gemma 4 31B (Reasoning) | 99% | $0.0019 | 2.6m | |
| Inception Mercury 2 | 94% | $0.0029 | 4.4s | |
| Gemini 2.5 Flash (Reasoning) | 98% | $0.0072 | 10.7s | |
| GPT-5 Mini | 99% | $0.0075 | 1.3m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Grok 4.20 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| GPT-5.5 | 100% | 100% | 100% | |
| Qwen 3.6 35B | 100% | 100% | 100% | |
| Grok 4 | 100% | 100% | 100% | |
| Gemma 4 31B | 100% | 100% | 100% | |
| Grok 4 Fast | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 98% | 98% | |
| Qwen 3.5 35B | 100% | 98% | 98% | |
| Z.AI GLM 5.1 | 99% | 97% | 97% | |
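The stability ranking above multiplies a model's median score by its consistency. A minimal Python sketch of that calculation follows, assuming both values are fractions between 0 and 1. The per-run scores, the consistency values, and the function names are hypothetical, and the source does not state how consistency itself is measured.

```python
from statistics import median


def stability(run_scores, consistency):
    """Stability as the caption describes it: median score times consistency.

    run_scores: per-run scores as fractions in [0, 1] (hypothetical input).
    consistency: fraction in [0, 1]; how the leaderboard derives it is not
    stated, so it is taken as given here.
    """
    return median(run_scores) * consistency


def rank_by_stability(models):
    """Sort model records by stability, highest first."""
    return sorted(
        models,
        key=lambda m: stability(m["runs"], m["consistency"]),
        reverse=True,
    )


# Hypothetical data, not taken from the leaderboard.
models = [
    {"name": "model-a", "runs": [0.96, 1.00, 0.98], "consistency": 0.90},
    {"name": "model-b", "runs": [0.95, 0.95, 0.95], "consistency": 1.00},
]
ranked = [m["name"] for m in rank_by_stability(models)]
```

Because the tables show rounded percentages, and the Score column is not necessarily the median, multiplying the displayed Score and Consistency values may not reproduce the displayed Stability exactly.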
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Grok 4.1 Fast | 100% | $0.0016 | 10.1s | 100% | |
| Grok 4 Fast | 100% | $0.0021 | 12.1s | 100% | |
| Gemma 4 31B | 100% | $0.0012 | 29.3s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0057 | 13.3s | 98% | |
| Stealth: Healer Alpha | 98% | $0.0000 | 16.1s | 95% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.017 | 11.0s | 100% | |
| GPT-5.4 | 99% | $0.011 | 7.9s | 96% | |
| DeepSeek V4 Flash (Reasoning) | 99% | $0.0010 | 36.2s | 95% | |
| Gemini 2.5 Flash (Reasoning) | 98% | $0.0072 | 10.7s | 94% | |
| GPT-5.5 | 100% | $0.022 | 5.8s | 100% | |
| Grok 4.20 (Reasoning) | 100% | $0.015 | 33.6s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 97% | $0.0016 | 11.6s | 91% | |
| Xiaomi MIMO v2.5 Pro | 99% | $0.0082 | 28.3s | 95% | |
| Xiaomi MIMO v2.5 | 97% | $0.0045 | 16.2s | 93% | |
| GPT-5.4 Mini (Reasoning) | 99% | $0.013 | 18.6s | 97% | |
| Qwen 3.6 35B | 100% | $0.013 | 48.4s | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | $0.022 | 13.0s | 100% | |
| Qwen 3.6 Flash | 99% | $0.010 | 26.2s | 95% | |
| Gemini 2.5 Flash | 96% | $0.0027 | 3.3s | 90% | |
| Stealth: Hunter Alpha | 97% | $0.0000 | 32.5s | 91% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 84.5% | Accuracy (recall) | ||
| 94.7% | Precision | ||
| 100.0% | Structural validity | | |
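The three evaluators in the table above (recall, precision, structural validity) could be computed roughly as in the sketch below. It is an illustration under stated assumptions, not the benchmark's actual scorer: the `<violation paragraph="...">` layout is a hypothetical schema, violations are matched by exact (paragraph, substring) equality, and an empty set scores 1.0 by convention. The real test may match substrings more leniently.

```python
import xml.etree.ElementTree as ET


def parse_violations(xml_text):
    """Return a set of (paragraph, substring) pairs, or None if invalid.

    Returning None stands in for a failed structural-validity check.
    The tag and attribute names are hypothetical.
    """
    try:
        root = ET.fromstring(xml_text)
        return {
            (int(v.get("paragraph")), (v.text or "").strip())
            for v in root.iter("violation")
        }
    except (ET.ParseError, TypeError, ValueError):
        # Malformed XML, or a missing / non-numeric paragraph attribute.
        return None


def precision_recall(expected, predicted):
    """Exact-match precision and recall over sets of violations."""
    hits = len(expected & predicted)
    # Assumed convention: an empty set counts as a perfect score.
    precision = hits / len(predicted) if predicted else 1.0
    recall = hits / len(expected) if expected else 1.0
    return precision, recall


# Hypothetical ground truth and model output.
expected = {
    (2, "blue eyes"),
    (5, "north gate"),
    (7, "three brothers"),
    (9, "silver sword"),
}
output = (
    "<violations>"
    '<violation paragraph="2">blue eyes</violation>'
    '<violation paragraph="5">north gate</violation>'
    '<violation paragraph="8">old mill</violation>'
    "</violations>"
)
predicted = parse_violations(output)
precision, recall = precision_recall(expected, predicted)
```

In this made-up case the model reports three violations, two of which are real, and misses two of the four expected ones, so precision comes out higher than recall. The medians in the table show the same direction of gap.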