Codex Violation Detection
Detects factual inconsistencies between a story bible and prose passages. The model must output structured XML identifying each violation with paragraph number and substring.
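The page does not show the XML schema the models are asked to produce. As a minimal sketch, assuming a hypothetical `<violations>` root with `<violation paragraph="N">substring</violation>` children, a response could be parsed and structurally checked like this (the element and attribute names are illustrative, not taken from the benchmark):

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Violation(NamedTuple):
    paragraph: int  # paragraph number within the prose passage
    substring: str  # text span that contradicts the story bible


def parse_violations(xml_text: str) -> list[Violation]:
    """Parse a model response; raise ValueError if it is structurally invalid."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc
    # Hypothetical schema: the root element name is an assumption.
    if root.tag != "violations":
        raise ValueError(f"unexpected root element: {root.tag!r}")
    found = []
    for node in root.findall("violation"):
        paragraph = node.get("paragraph")
        text = (node.text or "").strip()
        if paragraph is None or not paragraph.isdigit() or not text:
            raise ValueError("violation needs a numeric paragraph and a substring")
        found.append(Violation(int(paragraph), text))
    return found


# Made-up response, for illustration only.
sample = """
<violations>
  <violation paragraph="2">her green eyes</violation>
  <violation paragraph="5">the summer of 1892</violation>
</violations>
"""
violations = parse_violations(sample.strip())
```

Raising on malformed output keeps structural failures separate from detection quality, so a response that cannot be parsed is never mistaken for one that found no violations.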
[Figure: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-5.4 | 97% | $0.013 | 8.8s | |
| Gemini 2.5 Flash (Reasoning) | 97% | $0.0079 | 12.5s | |
| Z.AI GLM 5 Turbo | 97% | $0.0072 | 17.1s | |
| Z.AI GLM 5.2 (Reasoning, High) | 97% | $0.0100 | 27.9s | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | $0.011 | 18.0s | |
| Gemma 4 31B | 97% | $0.0009 | 37.9s | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 95% | $0.011 | 3.6s | |
| DeepSeek V4 Flash (Reasoning) | 97% | $0.0010 | 40.3s | |
| Xiaomi MIMO v2.5 | 95% | $0.0058 | 21.8s | |
| Qwen 3.6 Flash | 96% | $0.011 | 29.9s | |
| Gemini 2.5 Flash | 91% | $0.0025 | 2.8s | |
| GPT-5.4 (Reasoning, Low) | 96% | $0.020 | 13.5s | |
| Inception Mercury 2 | 92% | $0.0030 | 4.6s | |
| Gemini 3 Flash (Preview) | 94% | $0.0031 | 4.5s | |
| GPT-5.4 Mini (Reasoning, Low) | 93% | $0.0055 | 6.7s | |
| Xiaomi MIMO v2.5 Pro | 96% | $0.0091 | 37.0s | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0020 | 17.6s | |
| GPT-5.5 | 98% | $0.030 | 7.7s | |
| Claude Sonnet 4.5 | 97% | $0.024 | 8.9s | |
| Grok 4.20 (Reasoning) | 98% | $0.015 | 45.4s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
12 low-scoring outliers hidden: Mistral Small 4 (62.0%), Ministral 3 3B (61.5%), Cohere Command R+ (Aug. 2024) (61.4%), Hermes 3 70B (60.5%), Ministral 8B (59.6%), Ministral 3B (59.5%), Mistral NeMO (53.6%), GPT-5.4 Nano (Reasoning, Low) (42.6%), Gemma 3 4B (39.2%), WizardLM 2 8x22b (36.9%), GPT-5.4 Nano (36.6%), GPT-4.1 Nano (21.9%).
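The median-based quadrant split described above can be sketched as follows. The rows are a handful of entries copied from the price-performance table, so the cut points here are only illustrative; the page computes its medians over every model with cost data. How points lying exactly on a line are assigned is an assumption.

```python
from statistics import median

# (model, score in percent, cost in USD): a small sample from the table above.
rows = [
    ("GPT-5.4", 97, 0.013),
    ("Gemma 4 31B", 97, 0.0009),
    ("Gemini 2.5 Flash", 91, 0.0025),
    ("Inception Mercury 2", 92, 0.0030),
    ("GPT-5.5", 98, 0.030),
]


def quadrants(rows):
    """Split models into four quadrants at the median score and median cost."""
    score_cut = median(score for _, score, _ in rows)
    cost_cut = median(cost for _, _, cost in rows)
    labels = {}
    for model, score, cost in rows:
        # Assumption: a point on a quadrant line counts toward the better side.
        perf = "high score" if score >= score_cut else "low score"
        price = "low cost" if cost <= cost_cut else "high cost"
        labels[model] = f"{perf}, {price}"
    return score_cut, cost_cut, labels


score_cut, cost_cut, labels = quadrants(rows)
for model, label in labels.items():
    print(f"{model}: {label}")
```

Using medians rather than fixed thresholds means the quadrants always divide the plotted models roughly in half on each axis, whatever the absolute score or price level.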
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 99% | 97% | 97% | |
| Claude Opus 4.5 | 99% | 97% | 97% | |
| Claude Opus 4.6 (Reasoning) | 99% | 96% | 96% | |
| Gemini 2.5 Pro | 99% | 95% | 95% | |
| Grok 4.20 (Reasoning) | 98% | 95% | 95% | |
| Qwen3.6 Max Preview | 98% | 95% | 95% | |
| Gemini 3.5 Flash (Reasoning) | 98% | 94% | 94% | |
| Claude Opus 4.6 | 99% | 94% | 94% | |
| Qwen 3.5 27B | 98% | 94% | 94% | |
| Z.AI GLM 5.1 | 98% | 94% | 94% | |
| GPT-5.5 | 98% | 93% | 93% | |
| GPT-5.5 (Reasoning) | 97% | 93% | 93% | |
| Qwen 3.5 Plus (2026-04-20) | 97% | 93% | 93% | |
| GPT-5.5 (Reasoning, Low) | 97% | 93% | 93% | |
| Gemma 4 31B (Reasoning) | 97% | 93% | 93% | |
| Qwen3.7 Max | 98% | 93% | 93% | |
| Gemini 2.5 Flash (Reasoning) | 97% | 92% | 92% | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | 92% | 92% | |
| Z.AI GLM 5 Turbo | 97% | 92% | 92% | |
| Claude Opus 4.7 (Reasoning) | 97% | 92% | 92% | |
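The stability ranking above is described as median × consistency. A minimal sketch of that product, assuming both values are fractions between 0 and 1; the page does not state how consistency itself is derived, so it is taken as an input here, and the run scores are hypothetical:

```python
from statistics import median


def stability(run_scores: list[float], consistency: float) -> float:
    """Median run score multiplied by consistency, both as fractions in [0, 1]."""
    if not run_scores:
        raise ValueError("need at least one run score")
    if not 0.0 <= consistency <= 1.0:
        raise ValueError("consistency must be between 0 and 1")
    return median(run_scores) * consistency


# Hypothetical per-run scores for one model; not taken from the benchmark.
runs = [1.00, 0.98, 1.00, 0.96, 1.00]
value = stability(runs, consistency=0.95)
print(f"{value:.0%}")
```

Because the median ignores occasional bad runs, the consistency factor is what penalizes a model whose results vary from run to run.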
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash (Reasoning) | 97% | $0.0079 | 12.5s | 92% | |
| Z.AI GLM 5 Turbo | 97% | $0.0072 | 17.1s | 92% | |
| GPT-5.4 | 97% | $0.013 | 8.8s | 91% | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | $0.011 | 18.0s | 92% | |
| Gemini 3 Flash (Preview) | 94% | $0.0031 | 4.5s | 83% | |
| Gemma 4 31B | 97% | $0.0009 | 37.9s | 90% | |
| DeepSeek V4 Flash (Reasoning) | 97% | $0.0010 | 40.3s | 90% | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 95% | $0.011 | 3.6s | 86% | |
| Inception Mercury 2 | 92% | $0.0030 | 4.6s | 83% | |
| Gemini 2.5 Flash | 91% | $0.0025 | 2.8s | 82% | |
| Z.AI GLM 5.2 (Reasoning, High) | 97% | $0.0100 | 27.9s | 90% | |
| Gemini 2.5 Flash Lite (Reasoning) | 93% | $0.0020 | 17.6s | 84% | |
| GPT-5.4 Mini (Reasoning, Low) | 93% | $0.0055 | 6.7s | 82% | |
| GPT-5.5 | 98% | $0.030 | 7.7s | 93% | |
| Grok 4.20 (Reasoning) | 98% | $0.015 | 45.4s | 95% | |
| Xiaomi MIMO v2.5 Pro | 96% | $0.0091 | 37.0s | 89% | |
| Claude Sonnet 4.5 | 97% | $0.024 | 8.9s | 89% | |
| GPT-5.4 (Reasoning, Low) | 96% | $0.020 | 13.5s | 88% | |
| Xiaomi MIMO v2.5 | 95% | $0.0058 | 21.8s | 82% | |
| Gemini 3.1 Flash Lite (Preview) | 90% | $0.0019 | 2.2s | 75% | |
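The page does not publish how the four factors are combined into the composite score. Purely as an illustration of one way such a ranking could be built, the sketch below min-max normalizes each factor and averages them with equal weights. The normalization, the weights, and the function names are all assumptions, and the result is not expected to reproduce the order shown above.

```python
def normalize(values, higher_is_better=True):
    """Min-max scale values to [0, 1]; invert when lower raw values are better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def composite(rows, weights=(0.25, 0.25, 0.25, 0.25)):
    """rows: (model, score %, cost USD, seconds, stability %). Equal weights are assumed."""
    score = normalize([r[1] for r in rows])
    cost = normalize([r[2] for r in rows], higher_is_better=False)
    speed = normalize([r[3] for r in rows], higher_is_better=False)
    stab = normalize([r[4] for r in rows])
    totals = {}
    for i, row in enumerate(rows):
        parts = (score[i], cost[i], speed[i], stab[i])
        totals[row[0]] = sum(w * p for w, p in zip(weights, parts))
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Four rows copied from the table above, used only as sample input.
rows = [
    ("Gemini 2.5 Flash (Reasoning)", 97, 0.0079, 12.5, 92),
    ("GPT-5.4", 97, 0.013, 8.8, 91),
    ("Gemma 4 31B", 97, 0.0009, 37.9, 90),
    ("GPT-5.5", 98, 0.030, 7.7, 93),
]
ranking = composite(rows)
for model, total in ranking:
    print(f"{model}: {total:.2f}")
```

Changing the weights shifts the ranking toward cheap, fast, or accurate models, which is why a composite order can differ from a pure score order.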
| Model | Total | Matrix: small codex (7 entries), short passage (165 words) | Matrix: large codex (40 entries), short passage (165 words) | Matrix: small codex (7 entries), long passage (734 words) | Matrix: large codex (40 entries), long passage (1,019 words) | Tiers: 5 codex entries | Tiers: 10 codex entries | Tiers: 20 codex entries | Tiers: 40 codex entries |
|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | 99% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 98% |
| Gemini 3.1 Pro (Preview) | 99% | 100% | 99% | 99% | 96% | 100% | 100% | 99% | 100% |
| Claude Opus 4.6 (Reasoning) | 99% | 100% | 97% | 98% | 96% | 100% | 100% | 100% | 100% |
| Gemini 2.5 Pro | 99% | 99% | 100% | 99% | 96% | 100% | 99% | 97% | 99% |
| Claude Opus 4.6 | 99% | 100% | 98% | 99% | 91% | 100% | 100% | 100% | 100% |
| Qwen3.6 Max Preview | 98% | 100% | 97% | 98% | 98% | 100% | 100% | 94% | 100% |
| Qwen3.7 Max | 98% | 100% | 98% | 95% | 98% | 100% | 100% | 95% | 100% |
| Grok 4.20 (Reasoning) | 98% | 100% | 97% | 97% | 98% | 100% | 97% | 96% | 100% |
| Gemini 3.5 Flash (Reasoning) | 98% | 100% | 98% | 98% | 95% | 100% | 99% | 95% | 100% |
| Z.AI GLM 5.1 | 98% | 100% | 98% | 96% | 96% | 100% | 99% | 93% | 99% |
| GPT-5.5 | 98% | 100% | 100% | 94% | 95% | 100% | 99% | 92% | 100% |
| Qwen 3.5 27B | 98% | 100% | 94% | 98% | 96% | 99% | 100% | 94% | 100% |
| Grok 4.3 (Reasoning) | 97% | 100% | 97% | 97% | 96% | 100% | 96% | 95% | 99% |
| GPT-5.5 (Reasoning) | 97% | 100% | 97% | 92% | 97% | 100% | 100% | 93% | 100% |
| Z.AI GLM 5 Turbo | 97% | 100% | 98% | 94% | 94% | 99% | 100% | 95% | 100% |
matrix
Small codex (7 entries), short passage (165 words)
[Figure: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 3.1 Flash Lite | 100% | $0.0012 | 2.0s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0012 | 1.9s | |
| Gemini 3.1 Flash Lite (Reasoning) | 90% | $0.0011 | 6.5s | |
| DeepSeek V3.2 | 98% | $0.0006 | 16.2s | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0011 | 14.8s | |
| Mistral Medium 3.1 | 97% | $0.0018 | 5.6s | |
| DeepSeek-V2 Chat | 99% | $0.0011 | 12.7s | |
| Gemini 3 Flash (Preview) | 100% | $0.0025 | 3.6s | |
| DeepSeek V4 Flash | 96% | $0.0002 | 6.6s | |
| GPT-5.4 Nano (Reasoning) | 89% | $0.0022 | 9.4s | |
| Xiaomi MIMO v2.5 | 99% | $0.0031 | 11.9s | |
| Gemini 2.5 Flash Lite (Reasoning) | 95% | $0.0012 | 6.9s | |
| Llama 3.1 70B | 97% | $0.0009 | 14.9s | |
| Gemma 4 26B | 97% | $0.0003 | 24.6s | |
| DeepSeek V3 (2025-03-24) | 99% | $0.0007 | 36.3s | |
| Gemini 2.5 Flash | 97% | $0.0017 | 2.1s | |
| Inception Mercury 2 | 97% | $0.0018 | 2.6s | |
| Gemma 3 12B | 95% | $0.0001 | 14.1s | |
| Gemma 4 31B | 97% | $0.0004 | 23.8s | |
| DeepSeek V4 Flash (Reasoning) | 98% | $0.0006 | 27.3s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.7 Max | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.5 Flash (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.8 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Claude Opus 4.8 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| MiniMax M3 | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0012 | 1.9s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0025 | 3.6s | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0012 | 2.0s | 97% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0011 | 14.8s | 98% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0052 | 7.8s | 100% | |
| Gemini 2.5 Flash | 97% | $0.0017 | 2.1s | 97% | |
| Inception Mercury 2 | 97% | $0.0018 | 2.6s | 97% | |
| GPT-5.4 Mini (Reasoning, Low) | 99% | $0.0041 | 4.4s | 97% | |
| Xiaomi MIMO v2.5 | 99% | $0.0031 | 11.9s | 97% | |
| Z.AI GLM 5 Turbo | 100% | $0.0047 | 10.5s | 98% | |
| DeepSeek-V2 Chat | 99% | $0.0011 | 12.7s | 95% | |
| DeepSeek V4 Flash | 96% | $0.0002 | 6.6s | 93% | |
| Xiaomi MIMO v2.5 Pro | 100% | $0.0051 | 20.3s | 100% | |
| DeepSeek V3.2 | 98% | $0.0006 | 16.2s | 94% | |
| Z.AI GLM 4.5 | 98% | $0.0019 | 15.2s | 95% | |
| Llama 3.1 70B | 97% | $0.0009 | 14.9s | 95% | |
| GPT-4.1 | 100% | $0.0065 | 7.7s | 97% | |
| GPT-4o, Aug. 6th (temp=0) | 99% | $0.0080 | 3.4s | 97% | |
| Mistral Medium 3.1 | 97% | $0.0018 | 5.6s | 92% | |
| Gemma 4 26B | 97% | $0.0003 | 24.6s | 97% | |
| Median | Evaluator |
|---|---|
| 96.0% | Accuracy (recall) |
| 100.0% | Precision |
| 100.0% | Structural validity |
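The evaluator table above reports recall and precision separately. A minimal sketch of how the two differ, assuming each violation is identified by a (paragraph, substring) pair and that a prediction counts only on an exact match; the benchmark's real matching rule is not shown on the page and may be more lenient. The sample violations are made up.

```python
def precision_recall(predicted, expected):
    """Return (precision, recall) for predicted vs. expected violation sets."""
    predicted, expected = set(predicted), set(expected)
    hits = len(predicted & expected)
    # Assumption: an empty prediction set or empty answer key scores 1.0.
    precision = hits / len(predicted) if predicted else 1.0
    recall = hits / len(expected) if expected else 1.0
    return precision, recall


# Hypothetical answer key and model output.
expected = {
    (2, "her green eyes"),
    (5, "the summer of 1892"),
    (7, "his left hand"),
    (9, "twelve ships"),
}
predicted = {
    (2, "her green eyes"),
    (5, "the summer of 1892"),
    (6, "the old mill"),  # false positive
}
precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.1%} recall={recall:.1%}")
```

Recall falls when planted violations are missed, while precision falls when consistent text is flagged, so a model can score well on one and poorly on the other.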
Large codex (40 entries), short passage (165 words)
[Figure: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Z.AI GLM 5 Turbo | 98% | $0.0097 | 19.4s | |
| Grok 4.20 | 94% | $0.0081 | 8.6s | |
| Gemini 3 Flash (Preview) | 93% | $0.0039 | 6.2s | |
| DeepSeek V4 Flash | 92% | $0.0005 | 12.9s | |
| Inception Mercury 2 | 92% | $0.0040 | 5.5s | |
| Gemma 4 31B | 94% | $0.0013 | 1.2m | |
| DeepSeek V4 Flash (Reasoning) | 96% | $0.0015 | 58.3s | |
| Gemini 2.5 Flash | 90% | $0.0038 | 4.0s | |
| GPT-5.4 | 98% | $0.018 | 11.4s | |
| Xiaomi MIMO v2.5 | 94% | $0.0071 | 25.0s | |
| GPT-5.4 Mini (Reasoning, Low) | 92% | $0.0070 | 6.8s | |
| Xiaomi MIMO v2.5 Pro | 96% | $0.0093 | 35.3s | |
| Gemini 3.1 Flash Lite (Reasoning) | 90% | $0.0027 | 3.2s | |
| Qwen 3.6 Flash | 97% | $0.012 | 33.9s | |
| GPT-4.1 | 94% | $0.011 | 15.1s | |
| MiniMax M2.5 | 91% | $0.0023 | 30.1s | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.0098 | 14.4s | |
| Z.AI GLM 5.2 (Reasoning, High) | 97% | $0.012 | 35.5s | |
| Gemini 3.1 Flash Lite (Preview) | 90% | $0.0031 | 3.0s | |
| Gemini 3 Flash (Preview, Reasoning) | 96% | $0.012 | 19.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.5 | 100% | 100% | 100% | |
| GPT-5.5 | 100% | 100% | 100% | |
| Claude Sonnet 4.5 | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 99% | 98% | 98% | |
| Gemini 2.5 Pro | 100% | 97% | 97% | |
| GPT-5.5 (Reasoning, Low) | 99% | 97% | 96% | |
| Gemini 3.5 Flash (Reasoning) | 98% | 97% | 96% | |
| Z.AI GLM 5.1 | 98% | 97% | 96% | |
| GPT-5.5 (Reasoning) | 97% | 98% | 96% | |
| Qwen3.7 Max | 98% | 97% | 95% | |
| MiniMax M3 | 98% | 97% | 95% | |
| Claude Sonnet 4.6 (Reasoning) | 97% | 97% | 95% | |
| Claude Opus 4.6 | 98% | 97% | 95% | |
| MoonshotAI: Kimi K2.6 | 98% | 97% | 95% | |
| Qwen3.6 Max Preview | 97% | 97% | 94% | |
| DeepSeek V4 Pro (Reasoning) | 97% | 96% | 94% | |
| Claude Sonnet 4.6 | 98% | 96% | 94% | |
| Grok 4.20 (Reasoning) | 97% | 98% | 94% | |
| Z.AI GLM 5 Turbo | 98% | 96% | 94% | |
| Qwen 3.6 Flash | 97% | 96% | 94% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Z.AI GLM 5 Turbo | 98% | $0.0097 | 19.4s | 94% | |
| GPT-5.4 | 98% | $0.018 | 11.4s | 93% | |
| Gemini 3 Flash (Preview) | 93% | $0.0039 | 6.2s | 88% | |
| Claude Sonnet 4.5 | 100% | $0.036 | 11.3s | 100% | |
| Inception Mercury 2 | 92% | $0.0040 | 5.5s | 88% | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 94% | $0.016 | 5.0s | 92% | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.0098 | 14.4s | 90% | |
| Grok 4.20 | 94% | $0.0081 | 8.6s | 87% | |
| GPT-5.5 | 100% | $0.042 | 10.0s | 100% | |
| Qwen 3.6 Flash | 97% | $0.012 | 33.9s | 94% | |
| Gemini 3 Flash (Preview, Reasoning) | 96% | $0.012 | 19.4s | 91% | |
| GPT-5.4 Mini (Reasoning, Low) | 92% | $0.0070 | 6.8s | 86% | |
| Z.AI GLM 5.2 (Reasoning, High) | 97% | $0.012 | 35.5s | 92% | |
| DeepSeek V4 Flash | 92% | $0.0005 | 12.9s | 84% | |
| Xiaomi MIMO v2.5 | 94% | $0.0071 | 25.0s | 89% | |
| Gemini 3.1 Flash Lite (Preview) | 90% | $0.0031 | 3.0s | 84% | |
| Xiaomi MIMO v2.5 Pro | 96% | $0.0093 | 35.3s | 90% | |
| GPT-4.1 | 94% | $0.011 | 15.1s | 86% | |
| Gemini 3.1 Flash Lite (Reasoning) | 90% | $0.0027 | 3.2s | 82% | |
| Gemini 2.5 Flash | 90% | $0.0038 | 4.0s | 83% | |
| Median | Evaluator |
|---|---|
| 83.6% | Accuracy (recall) |
| 93.4% | Precision |
| 100.0% | Structural validity |
Small codex (7 entries), long passage (734 words)
[Figure: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 4 31B | 97% | $0.0005 | 44.4s | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 96% | $0.0087 | 2.9s | |
| Claude Haiku 4.5 | 93% | $0.0051 | 4.6s | |
| GPT-4.1 Mini | 87% | $0.0011 | 4.4s | |
| Gemma 4 26B | 90% | $0.0005 | 15.6s | |
| DeepSeek V4 Flash | 89% | $0.0002 | 8.0s | |
| MiniMax M2.7 | 91% | $0.0021 | 28.0s | |
| Qwen 3.5 Plus (2026-02-15) | 95% | $0.0025 | 21.6s | |
| DeepSeek V4 Flash (Reasoning) | 94% | $0.0010 | 45.3s | |
| Z.AI GLM 4.5 Air | 92% | $0.0026 | 39.2s | |
| DeepSeek-V2 Chat | 85% | $0.0013 | 9.8s | |
| Z.AI GLM 5 Turbo | 94% | $0.0077 | 19.5s | |
| Gemini 3 Flash (Preview) | 88% | $0.0022 | 3.6s | |
| Gemini 2.5 Flash | 87% | $0.0020 | 2.4s | |
| Mistral Medium 3.1 | 79% | $0.0021 | 5.0s | |
| Xiaomi MIMO v2.5 | 93% | $0.0057 | 22.2s | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.0094 | 16.9s | |
| Inception Mercury 2 | 88% | $0.0032 | 4.9s | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | $0.012 | 20.1s | |
| Z.AI GLM 4.5 | 85% | $0.0018 | 13.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 99% | 97% | 97% | |
| Claude Opus 4.7 | 97% | 100% | 97% | |
| Gemini 2.5 Pro | 99% | 96% | 96% | |
| Qwen 3.5 27B | 98% | 97% | 95% | |
| Claude Opus 4.6 | 99% | 95% | 95% | |
| Claude Opus 4.8 (Reasoning, Low) | 99% | 95% | 95% | |
| Gemini 3.5 Flash (Reasoning) | 98% | 95% | 95% | |
| Qwen 3.5 397B A17B | 98% | 95% | 95% | |
| Qwen3.6 Max Preview | 98% | 97% | 94% | |
| Claude Opus 4.6 (Reasoning) | 98% | 93% | 93% | |
| Claude Haiku 4.5 | 93% | 100% | 93% | |
| Gemma 4 31B | 97% | 96% | 93% | |
| GPT-5.2 | 96% | 97% | 93% | |
| Qwen 3.5 Flash | 98% | 96% | 93% | |
| o4 Mini | 95% | 95% | 92% | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | 94% | 92% | |
| Qwen 3.6 27B | 95% | 94% | 91% | |
| Qwen 3.5 Plus (2026-04-20) | 96% | 94% | 91% | |
| Grok 4.20 (Reasoning) | 97% | 95% | 91% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 93% | $0.0051 | 4.6s | 93% | |
| Gemma 4 31B | 97% | $0.0005 | 44.4s | 93% | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 96% | $0.0087 | 2.9s | 89% | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | $0.012 | 20.1s | 92% | |
| Qwen 3.5 Plus (2026-02-15) | 95% | $0.0025 | 21.6s | 84% | |
| DeepSeek V4 Flash (Reasoning) | 94% | $0.0010 | 45.3s | 88% | |
| Claude Opus 4.5 | 100% | $0.030 | 9.1s | 100% | |
| Xiaomi MIMO v2.5 | 93% | $0.0057 | 22.2s | 87% | |
| Gemma 4 26B | 90% | $0.0005 | 15.6s | 83% | |
| Qwen 3.5 Flash | 98% | $0.0044 | 1.3m | 93% | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.0094 | 16.9s | 87% | |
| Inception Mercury 2 | 88% | $0.0032 | 4.9s | 86% | |
| Gemini 3 Flash (Preview) | 88% | $0.0022 | 3.6s | 83% | |
| Z.AI GLM 5 Turbo | 94% | $0.0077 | 19.5s | 85% | |
| MiniMax M2.7 | 91% | $0.0021 | 28.0s | 85% | |
| DeepSeek V4 Flash | 89% | $0.0002 | 8.0s | 80% | |
| Qwen 3.6 Flash | 95% | $0.012 | 33.6s | 89% | |
| GPT-4.1 Mini | 87% | $0.0011 | 4.4s | 82% | |
| Z.AI GLM 4.5 Air | 92% | $0.0026 | 39.2s | 85% | |
| GPT-5 Mini | 96% | $0.011 | 52.1s | 91% | |
| Median | Evaluator |
|---|---|
| 77.0% | Accuracy (recall) |
| 89.6% | Precision |
| 100.0% | Structural validity |
Large codex (40 entries), long passage (1,019 words)
[Figure: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Inception Mercury 2 | 89% | $0.0066 | 10.4s | |
| Gemini 3 Flash (Preview) | 90% | $0.0068 | 9.3s | |
| Gemini 2.5 Flash (Reasoning) | 95% | $0.016 | 24.6s | |
| Gemini 2.5 Flash | 87% | $0.0051 | 5.7s | |
| Gemma 4 31B | 92% | $0.0018 | 55.9s | |
| Z.AI GLM 5 Turbo | 94% | $0.016 | 36.7s | |
| GPT-5.4 | 95% | $0.028 | 20.6s | |
| Gemini 2.5 Flash Lite (Reasoning) | 89% | $0.0041 | 42.5s | |
| DeepSeek V4 Flash (Reasoning) | 96% | $0.0022 | 1.5m | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.010 | 9.6s | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 89% | $0.023 | 7.8s | |
| Gemini 3.1 Flash Lite (Preview) | 83% | $0.0038 | 3.7s | |
| Xiaomi MIMO v2.5 | 86% | $0.014 | 51.9s | |
| Gemini 3 Flash (Preview, Reasoning) | 92% | $0.021 | 33.3s | |
| Qwen 3.6 Flash | 94% | $0.020 | 54.5s | |
| Mistral Large 3 | 85% | $0.0060 | 25.8s | |
| Gemini 3.1 Flash Lite (Reasoning) | 81% | $0.0035 | 4.1s | |
| Qwen 3.5 Flash | 93% | $0.0058 | 1.4m | |
| GPT-5 Mini | 95% | $0.017 | 1.3m | |
| Gemma 4 26B | 84% | $0.0012 | 54.3s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.6 Max Preview | 98% | 99% | 97% | |
| Qwen3.7 Max | 98% | 98% | 96% | |
| Claude Sonnet 4.6 (Reasoning) | 97% | 97% | 95% | |
| GPT-5 | 97% | 98% | 95% | |
| GPT-5.5 (Reasoning) | 97% | 97% | 95% | |
| Claude Opus 4.6 (Reasoning) | 96% | 97% | 94% | |
| Grok 4.20 (Reasoning) | 98% | 97% | 94% | |
| Gemini 3.1 Pro (Preview) | 96% | 98% | 94% | |
| Qwen 3.5 27B | 96% | 99% | 94% | |
| Claude Opus 4.5 | 96% | 98% | 94% | |
| Claude Opus 4.7 (Reasoning) | 96% | 98% | 94% | |
| Claude Sonnet 5 (Reasoning, Low) | 97% | 97% | 94% | |
| GPT-5.5 | 95% | 98% | 94% | |
| Claude Sonnet 5 (Reasoning) | 96% | 97% | 93% | |
| GPT-5.5 (Reasoning, Low) | 96% | 96% | 93% | |
| GPT-5.4 | 95% | 98% | 93% | |
| GPT-5.2 | 96% | 96% | 93% | |
| DeepSeek V4 Flash (Reasoning) | 96% | 96% | 93% | |
| GPT-5 Mini | 95% | 97% | 93% | |
| Gemini 2.5 Pro | 96% | 96% | 93% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash (Reasoning) | 95% | $0.016 | 24.6s | 92% | |
| GPT-5.4 | 95% | $0.028 | 20.6s | 93% | |
| Z.AI GLM 5 Turbo | 94% | $0.016 | 36.7s | 92% | |
| Gemini 3 Flash (Preview) | 90% | $0.0068 | 9.3s | 87% | |
| Gemma 4 31B | 92% | $0.0018 | 55.9s | 90% | |
| DeepSeek V4 Flash (Reasoning) | 96% | $0.0022 | 1.5m | 93% | |
| Inception Mercury 2 | 89% | $0.0066 | 10.4s | 85% | |
| Qwen 3.6 Flash | 94% | $0.020 | 54.5s | 91% | |
| GPT-5.4 (Reasoning, Low) | 95% | $0.042 | 30.5s | 92% | |
| Gemini 3 Flash (Preview, Reasoning) | 92% | $0.021 | 33.3s | 88% | |
| GPT-5 Mini | 95% | $0.017 | 1.3m | 93% | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.010 | 9.6s | 84% | |
| Grok 4.20 (Reasoning) | 98% | $0.030 | 1.4m | 94% | |
| Gemini 2.5 Flash | 87% | $0.0051 | 5.7s | 81% | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 89% | $0.023 | 7.8s | 85% | |
| Gemini 2.5 Flash Lite (Reasoning) | 89% | $0.0041 | 42.5s | 85% | |
| GPT-5.5 | 95% | $0.068 | 17.5s | 94% | |
| o4 Mini | 92% | $0.031 | 42.1s | 89% | |
| GPT-5.2 | 96% | $0.052 | 55.4s | 93% | |
| Z.AI GLM 5.2 (Reasoning, High) | 95% | $0.029 | 1.3m | 91% | |
| Median | Evaluator |
|---|---|
| 70.8% | Accuracy (recall) |
| 91.6% | Precision |
| 100.0% | Structural validity |
tiers
5 codex entries
[Figure: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| DeepSeek V4 Flash | 98% | $0.0001 | 3.4s | |
| Gemini 3.1 Flash Lite (Reasoning) | 96% | $0.0008 | 1.4s | |
| Gemini 3.1 Flash Lite (Preview) | 97% | $0.0008 | 1.3s | |
| Gemma 4 26B | 100% | $0.0002 | 8.4s | |
| Gemini 2.5 Flash Lite (Reasoning) | 98% | $0.0008 | 5.2s | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.3s | |
| DeepSeek-V2 Chat | 96% | $0.0007 | 6.2s | |
| DeepSeek V3 (2024-12-26) | 96% | $0.0007 | 7.3s | |
| DeepSeek V4 Flash (Reasoning) | 99% | $0.0003 | 12.7s | |
| Qwen 3 32B | 100% | $0.0005 | 13.1s | |
| Gemma 4 31B | 100% | $0.0003 | 23.4s | |
| GPT-5.4 Nano (Reasoning) | 89% | $0.0014 | 6.3s | |
| Qwen 3.5 Plus (2026-02-15) | 97% | $0.0014 | 12.5s | |
| DeepSeek V4 Pro | 97% | $0.0010 | 8.1s | |
| DeepSeek V3 (2025-03-24) | 99% | $0.0006 | 9.6s | |
| Llama 3.1 Nemotron 70B | 96% | $0.0020 | 8.2s | |
| Z.AI GLM 4.5 Air | 95% | $0.0011 | 13.4s | |
| Z.AI GLM 4.6 | 97% | $0.0014 | 23.7s | |
| Gemini 3.1 Flash Lite | 96% | $0.0008 | 3.6s | |
| Hermes 3 405B | 100% | $0.0016 | 10.8s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.7 Max | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.5 Flash (Reasoning) | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.8 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Grok 4.20 (Reasoning) | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.3s | 100% | |
| Gemma 4 26B | 100% | $0.0002 | 8.4s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0032 | 3.8s | 100% | |
| GPT-4.1 | 100% | $0.0035 | 3.2s | 100% | |
| DeepSeek V4 Flash | 98% | $0.0001 | 3.4s | 94% | |
| Qwen 3 32B | 100% | $0.0005 | 13.1s | 100% | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 100% | $0.0044 | 1.7s | 100% | |
| Hermes 3 405B | 100% | $0.0016 | 10.8s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 97% | $0.0008 | 1.3s | 92% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0037 | 5.9s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 98% | $0.0008 | 5.2s | 94% | |
| Gemini 3.1 Flash Lite (Reasoning) | 96% | $0.0008 | 1.4s | 91% | |
| GPT-5.4 | 100% | $0.0051 | 3.7s | 100% | |
| DeepSeek V3 (2025-03-24) | 99% | $0.0006 | 9.6s | 95% | |
| DeepSeek V4 Flash (Reasoning) | 99% | $0.0003 | 12.7s | 95% | |
| Z.AI GLM 4.5 | 99% | $0.0013 | 11.9s | 95% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0055 | 7.3s | 100% | |
| Gemma 4 31B | 100% | $0.0003 | 23.4s | 100% | |
| Z.AI GLM 5 Turbo | 99% | $0.0031 | 7.6s | 95% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.0055 | 8.9s | 100% | |
| Median | Evaluator |
|---|---|
| 97.0% | Accuracy (recall) |
| 98.3% | Precision |
| 100.0% | Structural validity |
10 codex entries
[Figure: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 94% | $0.0011 | 1.5s | |
| Gemma 4 26B | 100% | $0.0003 | 10.1s | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.7s | |
| Mistral Medium 3.1 | 99% | $0.0015 | 3.7s | |
| DeepSeek V4 Pro | 98% | $0.0027 | 11.8s | |
| Inception Mercury 2 | 98% | $0.0020 | 3.2s | |
| Gemma 4 31B | 100% | $0.0005 | 32.7s | |
| Qwen3 235B A22B Instruct 2507 | 96% | $0.0004 | 10.3s | |
| Qwen 3.5 Plus (2026-02-15) | 94% | $0.0018 | 12.4s | |
| DeepSeek V4 Flash (Reasoning) | 98% | $0.0005 | 17.7s | |
| Mistral Small 4 (Reasoning) | 93% | $0.0014 | 11.4s | |
| MiniMax M2.5 | 97% | $0.0014 | 17.9s | |
| Z.AI GLM 5 Turbo | 100% | $0.0035 | 9.0s | |
| Gemini 3.1 Flash Lite (Reasoning) | 89% | $0.0010 | 2.2s | |
| Writer: Palmyra X5 | 94% | $0.0034 | 6.6s | |
| GPT-5.4 Nano (Reasoning) | 87% | $0.0018 | 7.4s | |
| ByteDance Seed 1.6 Flash | 94% | $0.0005 | 6.5s | |
| MiniMax M2.7 | 98% | $0.0014 | 19.7s | |
| Xiaomi MIMO v2.5 | 99% | $0.0031 | 12.0s | |
| GPT-5.4 | 100% | $0.0061 | 4.2s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.7 Max | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5.2 (Reasoning, High) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| Claude Opus 4.8 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Claude Opus 4.8 (Reasoning, Low) | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Claude Sonnet 5 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 5 (Reasoning, Low) | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.7s | 100% | |
| Gemma 4 26B | 100% | $0.0003 | 10.1s | 100% | |
| Mistral Medium 3.1 | 99% | $0.0015 | 3.7s | 96% | |
| Z.AI GLM 5 Turbo | 100% | $0.0035 | 9.0s | 100% | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 100% | $0.0065 | 2.2s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0049 | 7.5s | 100% | |
| GPT-5.4 | 100% | $0.0061 | 4.2s | 100% | |
| Inception Mercury 2 | 98% | $0.0020 | 3.2s | 94% | |
| Z.AI GLM 5.2 (Reasoning, High) | 100% | $0.0044 | 11.2s | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.0067 | 11.1s | 100% | |
| DeepSeek V4 Flash (Reasoning) | 98% | $0.0005 | 17.7s | 94% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.0096 | 6.3s | 100% | |
| Gemma 4 31B | 100% | $0.0005 | 32.7s | 100% | |
| Qwen3 235B A22B Instruct 2507 | 96% | $0.0004 | 10.3s | 90% | |
| ByteDance Seed 1.6 Flash | 94% | $0.0005 | 6.5s | 90% | |
| Xiaomi MIMO v2.5 | 99% | $0.0031 | 12.0s | 92% | |
| Claude Sonnet 4 | 100% | $0.012 | 4.6s | 100% | |
| Llama 3.1 Nemotron 70B | 97% | $0.0030 | 9.7s | 91% | |
| DeepSeek V4 Pro | 98% | $0.0027 | 11.8s | 91% | |
| Z.AI GLM 4.7 | 100% | $0.0047 | 26.8s | 100% | |
| Median | Evaluator |
|---|---|
| 91.7% | Accuracy (recall) |
| 96.4% | Precision |
| 100.0% | Structural validity |
20 codex entries
[Figure: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Grok 4.20 | 99% | $0.0047 | 5.5s | |
| Gemini 2.5 Flash | 93% | $0.0023 | 2.5s | |
| Gemini 3 Flash (Preview) | 92% | $0.0031 | 4.0s | |
| Mistral Medium 3.1 | 92% | $0.0027 | 8.1s | |
| DeepSeek V4 Flash | 88% | $0.0003 | 8.1s | |
| Gemma 4 31B | 92% | $0.0007 | 20.8s | |
| ByteDance Seed 1.6 Flash | 87% | $0.0009 | 11.8s | |
| DeepSeek V4 Flash (Reasoning) | 94% | $0.0009 | 34.1s | |
| Gemini 3.1 Flash Lite (Reasoning) | 90% | $0.0018 | 2.1s | |
| Gemini 3.1 Flash Lite | 89% | $0.0018 | 2.6s | |
| Gemini 3.1 Flash Lite (Preview) | 88% | $0.0019 | 2.2s | |
| Gemini 2.5 Flash Lite (Reasoning) | 91% | $0.0022 | 17.1s | |
| GPT-4.1 | 75% | $0.0076 | 8.2s | |
| GPT-5.4 Mini (Reasoning, Low) | 91% | $0.0054 | 8.4s | |
| Z.AI GLM 5 Turbo | 95% | $0.0077 | 20.7s | |
| GPT-5.4 | 95% | $0.011 | 8.4s | |
| Claude Haiku 4.5 | 83% | $0.0070 | 5.0s | |
| Inception Mercury 2 | 86% | $0.0026 | 4.0s | |
| Gemini 2.5 Flash (Reasoning) | 93% | $0.0076 | 12.3s | |
| Gemini 2.5 Flash Lite | 71% | $0.0004 | 1.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.20 | 99% | 98% | 98% | |
| Gemini 3.1 Pro (Preview) | 99% | 97% | 97% | |
| GPT-5 | 97% | 97% | 94% | |
| Grok 4.20 (Reasoning) | 96% | 96% | 93% | |
| GPT-5.1 | 96% | 96% | 93% | |
| GPT-5.4 | 95% | 97% | 93% | |
| Claude Sonnet 4.5 | 97% | 93% | 93% | |
| Gemma 4 31B | 92% | 100% | 92% | |
| Gemini 3 Flash (Preview) | 92% | 100% | 92% | |
| GPT-5.2 | 95% | 96% | 92% | |
| Gemini 2.5 Pro | 97% | 96% | 92% | |
| MoonshotAI: Kimi K2.6 | 97% | 94% | 92% | |
| ByteDance Seed 2.0 Lite | 97% | 95% | 91% | |
| Z.AI GLM 5.2 (Reasoning, High) | 97% | 94% | 91% | |
| Qwen3.7 Max | 95% | 95% | 91% | |
| Gemini 3 Pro (Preview) | 95% | 95% | 91% | |
| GPT-5.5 (Reasoning) | 93% | 98% | 91% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Grok 4.20 | 99% | $0.0047 | 5.5s | 98% | |
| Gemini 3 Flash (Preview) | 92% | $0.0031 | 4.0s | 92% | |
| Gemini 2.5 Flash | 93% | $0.0023 | 2.5s | 88% | |
| Gemma 4 31B | 92% | $0.0007 | 20.8s | 92% | |
| Mistral Medium 3.1 | 92% | $0.0027 | 8.1s | 88% | |
| GPT-5.4 | 95% | $0.011 | 8.4s | 93% | |
| Gemini 3.1 Flash Lite (Preview) | 88% | $0.0019 | 2.2s | 88% | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 92% | $0.012 | 3.3s | 90% | |
| Gemini 3.1 Flash Lite (Reasoning) | 90% | $0.0018 | 2.1s | 83% | |
| Gemini 3.1 Flash Lite | 89% | $0.0018 | 2.6s | 83% | |
| GPT-4o, Aug. 6th (temp=0) | 91% | $0.011 | 3.8s | 89% | |
| Z.AI GLM 5.2 (Reasoning, High) | 97% | $0.0091 | 27.8s | 91% | |
| Z.AI GLM 5 Turbo | 95% | $0.0077 | 20.7s | 89% | |
| Gemini 2.5 Flash (Reasoning) | 93% | $0.0076 | 12.3s | 86% | |
| GPT-5.4 Mini (Reasoning, Low) | 91% | $0.0054 | 8.4s | 85% | |
| Gemini 2.5 Flash Lite (Reasoning) | 91% | $0.0022 | 17.1s | 86% | |
| Claude Sonnet 4.5 | 97% | $0.022 | 8.9s | 93% | |
| Claude Opus 4.6 | 100% | $0.035 | 8.0s | 100% | |
| Claude Opus 4.5 | 100% | $0.037 | 8.6s | 100% | |
| DeepSeek V4 Flash (Reasoning) | 94% | $0.0009 | 34.1s | 85% | |
| Median | Evaluator |
|---|---|
| 80.9% | Accuracy (recall) |
| 90.1% | Precision |
| 100.0% | Structural validity |
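The evaluators above report recall and precision over detected violations. As a rough illustration of how those two figures relate, the sketch below scores one response by exact matching on (paragraph number, substring) pairs. The function name, the exact-match rule, and the handling of empty sets are assumptions made for illustration; the benchmark's actual matching logic is not described on this page.

```python
Violation = tuple[int, str]  # (paragraph number, offending substring)


def score_detections(
    expected: set[Violation], predicted: set[Violation]
) -> tuple[float, float]:
    """Return (recall, precision) using exact matching of violation pairs.

    Assumption: an empty expected or predicted set scores 1.0 for the
    corresponding metric, since there is nothing to miss or to get wrong.
    """
    true_positives = len(expected & predicted)
    recall = true_positives / len(expected) if expected else 1.0
    precision = true_positives / len(predicted) if predicted else 1.0
    return recall, precision


# Illustrative data only: two planted violations, one found, one false alarm.
expected = {(1, "blue eyes"), (3, "in 1850")}
predicted = {(1, "blue eyes"), (2, "a tall man")}

recall, precision = score_detections(expected, predicted)
```

Recall falls when a planted violation is missed, while precision falls when a passage is flagged that does not contradict the story bible, which is why the two medians can differ.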
40 codex entries
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 4 31B | 100% | $0.0012 | 29.3s | |
| DeepSeek V4 Flash (Reasoning) | 99% | $0.0010 | 36.2s | |
| Gemini 2.5 Flash Lite (Reasoning) | 97% | $0.0016 | 11.6s | |
| Z.AI GLM 5 Turbo | 100% | $0.0057 | 13.3s | |
| Gemini 2.5 Flash | 96% | $0.0027 | 3.3s | |
| Xiaomi MIMO v2.5 | 97% | $0.0045 | 16.2s | |
| MiniMax M3 | 98% | $0.0025 | 38.8s | |
| Z.AI GLM 5.2 (Reasoning, High) | 100% | $0.0076 | 17.0s | |
| Inception Mercury 2 | 94% | $0.0029 | 4.4s | |
| Xiaomi MIMO v2.5 Pro | 99% | $0.0082 | 28.3s | |
| GPT-5.4 | 99% | $0.011 | 7.9s | |
| Gemini 2.5 Flash (Reasoning) | 98% | $0.0072 | 10.7s | |
| GPT-5.4 Mini (Reasoning, Low) | 94% | $0.0054 | 6.2s | |
| Qwen 3.5 Flash | 98% | $0.0036 | 47.1s | |
| Mistral Large 3 | 94% | $0.0040 | 6.6s | |
| Z.AI GLM 4.7 | 98% | $0.0070 | 28.9s | |
| Mistral Small 4 (Reasoning) | 94% | $0.0020 | 15.1s | |
| Gemini 3 Flash (Preview, Reasoning) | 98% | $0.0093 | 14.7s | |
| GPT-5 Mini | 99% | $0.0075 | 1.3m | |
| Qwen 3.6 Flash | 99% | $0.010 | 26.2s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.7 Max | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5.2 (Reasoning, High) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Grok 4.20 (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| GPT-5.5 | 100% | 100% | 100% | |
| Qwen 3.6 35B | 100% | 100% | 100% | |
| Gemma 4 31B | 100% | 100% | 100% | |
| Gemini 3.5 Flash (Reasoning) | 100% | 98% | 98% | |
| Z.AI GLM 5 Turbo | 100% | 98% | 98% | |
| Qwen 3.5 35B | 100% | 98% | 98% | |
| Z.AI GLM 5.1 | 99% | 97% | 97% | |
| Qwen 3.5 27B | 100% | 97% | 97% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Z.AI GLM 5 Turbo | 100% | $0.0057 | 13.3s | 98% | |
| Z.AI GLM 5.2 (Reasoning, High) | 100% | $0.0076 | 17.0s | 100% | |
| Gemma 4 31B | 100% | $0.0012 | 29.3s | 100% | |
| GPT-5.4 | 99% | $0.011 | 7.9s | 96% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.017 | 11.0s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 98% | $0.0072 | 10.7s | 94% | |
| GPT-5.5 | 100% | $0.022 | 5.8s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 97% | $0.0016 | 11.6s | 91% | |
| Gemini 2.5 Flash | 96% | $0.0027 | 3.3s | 90% | |
| Xiaomi MIMO v2.5 | 97% | $0.0045 | 16.2s | 93% | |
| GPT-5.4 Mini (Reasoning) | 99% | $0.013 | 18.6s | 97% | |
| DeepSeek V4 Flash (Reasoning) | 99% | $0.0010 | 36.2s | 95% | |
| Inception Mercury 2 | 94% | $0.0029 | 4.4s | 88% | |
| Gemini 3 Flash (Preview, Reasoning) | 98% | $0.0093 | 14.7s | 92% | |
| Xiaomi MIMO v2.5 Pro | 99% | $0.0082 | 28.3s | 95% | |
| Grok 4.20 (Reasoning) | 100% | $0.015 | 33.6s | 100% | |
| Qwen 3.6 Flash | 99% | $0.010 | 26.2s | 95% | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 96% | $0.010 | 3.1s | 88% | |
| MiniMax M3 | 98% | $0.0025 | 38.8s | 93% | |
| Mistral Large 3 | 94% | $0.0040 | 6.6s | 87% | |
| Median | Evaluator |
|---|---|
| 86.8% | Accuracy (recall) |
| 95.1% | Precision |
| 100.0% | Structural validity |