Codex Violation Detection
Tests whether a model can detect factual inconsistencies between a story bible (the codex) and prose passages. The model must output structured XML that identifies each violation by paragraph number and the offending substring.
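The exact XML schema is not given on this page. As a minimal sketch, assuming a hypothetical `<violations>` root with `<violation>` children that each hold a `<paragraph>` and a `<substring>` element, a response could be parsed and sanity-checked like this. The tag names, the `Violation` class, and the sample data are all illustrative assumptions, not the benchmark's actual format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    paragraph: int   # 1-based paragraph number in the passage
    substring: str   # exact text that contradicts the codex


def parse_violations(xml_text: str) -> list[Violation]:
    """Parse a model response written in the hypothetical <violations> schema."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for node in root.iter("violation"):
        # int() raises ValueError if <paragraph> is missing or not a number.
        paragraph = int(node.findtext("paragraph", default="").strip())
        substring = node.findtext("substring", default="").strip()
        found.append(Violation(paragraph, substring))
    return found


def is_grounded(violation: Violation, paragraphs: list[str]) -> bool:
    """Check that the quoted substring really occurs in the cited paragraph."""
    index = violation.paragraph - 1
    return 0 <= index < len(paragraphs) and violation.substring in paragraphs[index]


# Hypothetical example data, not taken from the benchmark.
SAMPLE_RESPONSE = """
<violations>
  <violation>
    <paragraph>2</paragraph>
    <substring>her green eyes</substring>
  </violation>
</violations>
"""
SAMPLE_PASSAGE = [
    "Mara crossed the bridge at dawn.",
    "She wiped the rain from her green eyes and kept walking.",
]
```

Checking that the substring appears in the cited paragraph is one plausible way to reject answers that name the right paragraph but quote text that is not there.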
| Model | Total | Small codex, short passage | Large codex, short passage | Small codex, long passage | Large codex, long passage | 5 codex entries | 10 codex entries | 20 codex entries | 40 codex entries |
|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | 99% | 100% | 100% | 95% | 95% | 100% | 100% | 100% | 98% |
| Claude Opus 4.6 | 98% | 100% | 98% | 95% | 94% | 100% | 100% | 100% | 100% |
| Gemini 2.5 Pro | 98% | 99% | 100% | 96% | 95% | 100% | 99% | 97% | 99% |
| Claude Sonnet 4.5 | 98% | 100% | 100% | 96% | 90% | 98% | 100% | 97% | 99% |
| MoonshotAI: Kimi K2.5 | 97% | 99% | 98% | 95% | 91% | 100% | 100% | 93% | 99% |
| Z.AI GLM 5 | 97% | 100% | 95% | 98% | 91% | 100% | 99% | 93% | 97% |
| Z.AI GLM 4.7 | 97% | 100% | 94% | 97% | 93% | 100% | 100% | 91% | 98% |
| Gemini 3 Pro (Preview) | 97% | 99% | 93% | 99% | 89% | 100% | 100% | 95% | 98% |
| Claude Sonnet 4 | 96% | 100% | 97% | 95% | 93% | 100% | 100% | 92% | 96% |
| GPT-5.1 | 96% | 100% | 94% | 97% | 91% | 100% | 95% | 96% | 98% |
| o4 Mini High | 96% | 98% | 94% | 96% | 93% | 100% | 95% | 93% | 98% |
| GPT-5.2 | 96% | 100% | 95% | 91% | 92% | 100% | 99% | 95% | 96% |
| GPT-5 | 96% | 100% | 96% | 94% | 92% | 96% | 95% | 97% | 98% |
| Gemini 3 Flash (Preview) | 96% | 100% | 93% | 99% | 92% | 100% | 100% | 92% | 90% |
| o4 Mini | 96% | 98% | 93% | 97% | 92% | 100% | 96% | 92% | 98% |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
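To illustrate how quadrant lines at the median split the chart, here is a rough sketch. The function name, the quadrant labels, the tie-breaking rule, and the cost and score values are assumptions made for this example; the page does not publish its plotting code.

```python
from statistics import median


def assign_quadrants(models: dict[str, tuple[float, float]]) -> dict[str, str]:
    """Map model name -> quadrant label, given (total_cost, score) pairs."""
    cost_median = median(cost for cost, _ in models.values())
    score_median = median(score for _, score in models.values())
    quadrants = {}
    for name, (cost, score) in models.items():
        # Ties with a median count as "low cost" / "high score" (an assumption).
        cost_side = "low cost" if cost <= cost_median else "high cost"
        score_side = "high score" if score >= score_median else "low score"
        quadrants[name] = f"{cost_side}, {score_side}"
    return quadrants


# Made-up costs (USD) and scores, for illustration only.
EXAMPLE = {
    "model-a": (0.50, 0.99),
    "model-b": (2.00, 0.98),
    "model-c": (0.10, 0.90),
    "model-d": (3.00, 0.92),
}
```

Because the lines sit at the medians, roughly half the plotted models fall on each side of each line, so the quadrants describe relative standing within this test, not absolute thresholds.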
Per-category charts
Each of the eight categories (Small codex, short passage; Large codex, short passage; Small codex, long passage; Large codex, long passage; 5 codex entries; 10 codex entries; 20 codex entries; 40 codex entries) has the same four charts:
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
Most Stable Models (Top 20), ranked by stability (median × consistency)
Top Overall Models (Top 20), ranked by composite score (performance, cost, speed & stability)
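The stability ranking is described only as median × consistency, and the page does not define consistency. As a sketch under a loudly stated assumption, the code below treats consistency as one minus the coefficient of variation of a model's repeated-run scores, clamped to the range 0 to 1. That definition, the function names, and the run data are hypothetical.

```python
from statistics import mean, median, pstdev


def consistency(scores: list[float]) -> float:
    """Hypothetical consistency: 1 minus the coefficient of variation, clamped to [0, 1]."""
    avg = mean(scores)
    if avg <= 0:
        return 0.0
    return max(0.0, min(1.0, 1.0 - pstdev(scores) / avg))


def stability(scores: list[float]) -> float:
    """Stability as the page states it: median × consistency."""
    return median(scores) * consistency(scores)


# Made-up per-run scores, for illustration only.
RUNS = {
    "steady-model": [0.96, 0.96, 0.96, 0.96],
    "erratic-model": [1.00, 0.60, 1.00, 0.60],
}
ranking = sorted(RUNS, key=lambda name: stability(RUNS[name]), reverse=True)
```

Under this assumed definition, a model that scores a little lower but identically on every run can outrank one with a higher peak and large swings between runs.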