Codex Violation Detection

Tests whether a model detects factual inconsistencies between a story bible (the codex) and prose passages. The model must output structured XML that identifies each violation by its paragraph number and the offending substring.
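As a rough illustration of what consuming such output could look like, the sketch below parses a violation list and checks that each reported substring really occurs in the paragraph it cites. The page does not publish the XML schema, so the element and attribute names (`violations`, `violation`, `paragraph`, `substring`) and the helper names are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    paragraph: int   # 1-indexed paragraph number reported by the model
    substring: str   # text span the model flags as contradicting the codex


def parse_violations(xml_text: str) -> list[Violation]:
    """Parse model output. Tag and attribute names are hypothetical."""
    root = ET.fromstring(xml_text)
    found = []
    for node in root.iter("violation"):
        paragraph = int(node.get("paragraph", "0"))
        substring = (node.findtext("substring") or "").strip()
        found.append(Violation(paragraph, substring))
    return found


def is_grounded(v: Violation, paragraphs: list[str]) -> bool:
    """True if the flagged substring occurs verbatim in the cited paragraph."""
    if not v.substring or not 1 <= v.paragraph <= len(paragraphs):
        return False
    return v.substring in paragraphs[v.paragraph - 1]


# Hypothetical model output and passage, for illustration only.
SAMPLE_XML = """<violations>
  <violation paragraph="2"><substring>her green eyes</substring></violation>
</violations>"""

SAMPLE_PARAGRAPHS = [
    "Mara crossed the bridge at dawn.",
    "The guard remembered her green eyes from the market.",
]
```

A scorer built on this would likely compare the grounded violations against a reference list; how this benchmark actually scores answers is not described on the page.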

Price-Performance Score Distribution (Top 20)


Model                                 Score  Cost     Time
Grok 4 Fast                           96%    $0.0018  12.8s
Grok 4.1 Fast                         97%    $0.0021  21.1s
Z.AI GLM 5 Turbo                      97%    $0.0072  17.1s
Gemini 2.5 Flash (Reasoning)          97%    $0.0079  12.5s
Stealth: Healer Alpha                 95%    $0.0000  21.8s
GPT-5.4                               97%    $0.013   8.8s
Gemini 3 Flash (Preview, Reasoning)   97%    $0.011   18.0s
Qwen 3.5 Flash                        96%    $0.0038  1.0m
Stealth: Hunter Alpha                 95%    $0.0000  44.0s
ByteDance Seed 2.0 Lite               96%    $0.0067  1.1m
ByteDance Seed 1.6                    96%    $0.0067  1.0m
Z.AI GLM 4.7                          96%    $0.0091  1.0m
Gemini 2.5 Flash Lite (Reasoning)     93%    $0.0020  17.6s
Aion 2.0                              95%    $0.0084  1.3m
Z.AI GLM 5                            97%    $0.013   56.4s
GPT-5.4 (Reasoning, Low)              96%    $0.020   13.5s
GPT-5.4 Mini (Reasoning, Low)         93%    $0.0055  6.7s
Gemini 2.5 Flash                      91%    $0.0025  2.8s
Inception Mercury 2                   92%    $0.0030  4.6s
MiniMax M2.7                          92%    $0.0022  29.4s

Cost vs Performance

Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
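The median-based quadrant split described above can be sketched as follows. This is a minimal illustration, not the site's implementation: the four sample rows are copied from the tables on this page, the medians are taken over those four rows only (not the full data set), and the quadrant labels and tie handling are assumptions.

```python
from statistics import median

# (model, cost in USD, score as a fraction): four rows copied from this page.
SAMPLE = [
    ("Grok 4 Fast", 0.0018, 0.96),
    ("GPT-5.4", 0.013, 0.97),
    ("Gemini 2.5 Flash", 0.0025, 0.91),
    ("GPT-5.4 (Reasoning, Low)", 0.020, 0.96),
]


def quadrants(rows):
    """Label each model relative to the median cost and median score of `rows`."""
    cost_cut = median(cost for _, cost, _ in rows)
    score_cut = median(score for _, _, score in rows)
    labels = {}
    for name, cost, score in rows:
        # Assumption: values exactly on a median line count as the favourable side.
        cost_side = "low cost" if cost <= cost_cut else "high cost"
        score_side = "high score" if score >= score_cut else "low score"
        labels[name] = f"{cost_side}, {score_side}"
    return labels


if __name__ == "__main__":
    for model, label in quadrants(SAMPLE).items():
        print(f"{model}: {label}")
```

Because the cut lines are medians of whatever rows are passed in, hiding or adding models (such as the low-scoring outliers listed below) moves the quadrant boundaries.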

8 low-scoring outliers hidden: Llama 3.1 8B (46.2%), GPT-5.4 Nano (Reasoning, Low) (42.6%), Gemma 3 4B (39.2%), WizardLM 2 8x22b (36.9%), GPT-5.4 Nano (36.6%), Rocinante 12B (36.1%), GPT-4.1 Nano (21.9%), LFM2 24B (14.2%).

Most Stable Models (Top 20)

Ranked by stability (median × consistency).
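The stability formula is a plain product, which a short sketch makes concrete. The page does not say how consistency is computed, so it is taken as a given input here; the run scores and the function name are hypothetical.

```python
from statistics import median


def stability(run_scores, consistency):
    """Stability as stated on the page: median score times consistency.

    `run_scores` are per-run scores as fractions; `consistency` is supplied
    as-is because its definition is not given in the source.
    """
    return median(run_scores) * consistency


# Hypothetical runs: median 0.98, consistency 0.94.
example = stability([0.96, 0.98, 0.99], 0.94)
example_percent = f"{example:.0%}"
```

Since the product uses the median of the runs, it need not match what you get by multiplying the rounded percentages shown in a table row.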

Model                                 Score  Consistency  Stability
Gemini 3.1 Pro (Preview)              99%    97%          97%
Claude Opus 4.5                       99%    97%          97%
Claude Opus 4.6 (Reasoning)           99%    96%          96%
Gemini 2.5 Pro                        99%    95%          95%
Grok 4.20 (Beta, Reasoning)           98%    95%          95%
Claude Opus 4.6                       99%    94%          94%
Qwen 3.5 27B                          98%    94%          94%
Gemini 2.5 Flash (Reasoning)          97%    92%          92%
Gemini 3 Flash (Preview, Reasoning)   97%    92%          92%
Z.AI GLM 5 Turbo                      97%    92%          92%
Grok 4.1 Fast                         97%    92%          92%
GPT-5.2                               97%    94%          92%
GPT-5.4                               97%    91%          91%
Qwen 3.5 35B                          97%    92%          91%
Claude Sonnet 4.6 (Reasoning)         97%    91%          91%
Z.AI GLM 5                            97%    92%          91%
GPT-5.1                               97%    93%          90%
Qwen 3.5 122B                         96%    91%          90%
GPT-5                                 97%    93%          90%
Grok 4                                97%    91%          89%

Top Overall Models (Top 20)

Ranked by composite score (performance, cost, speed & stability).

Model                                 Score  Cost     Speed  Stability
Grok 4.1 Fast                         97%    $0.0021  21.1s  92%
Gemini 2.5 Flash (Reasoning)          97%    $0.0079  12.5s  92%
Z.AI GLM 5 Turbo                      97%    $0.0072  17.1s  92%
GPT-5.4                               97%    $0.013   8.8s   91%
Gemini 3 Flash (Preview, Reasoning)   97%    $0.011   18.0s  92%
Gemini 3 Flash (Preview)              94%    $0.0031  4.5s   83%
Stealth: Healer Alpha                 95%    $0.0000  21.8s  86%
Inception Mercury 2                   92%    $0.0030  4.6s   83%
Gemini 2.5 Flash                      91%    $0.0025  2.8s   82%
GPT-5.4 Mini (Reasoning, Low)         93%    $0.0055  6.7s   82%
Gemini 2.5 Flash Lite (Reasoning)     93%    $0.0020  17.6s  84%
Grok 4 Fast                           96%    $0.0018  12.8s  77%
Grok 4.20 (Beta, Reasoning)           98%    $0.026   16.8s  95%
Gemini 3.1 Flash Lite (Preview)       90%    $0.0019  2.2s   75%
Claude Sonnet 4.5                     97%    $0.024   8.9s   89%
GPT-5.4 (Reasoning, Low)              96%    $0.020   13.5s  88%
Stealth: Hunter Alpha                 95%    $0.0000  44.0s  86%
Claude Sonnet 4                       95%    $0.023   9.0s   87%
Claude Opus 4.5                       99%    $0.041   9.7s   97%
MiniMax M2.7                          92%    $0.0022  29.4s  83%

Results by Scenario (matrix and tiers)

Top 15 of 118 models, sorted by total score.

Matrix columns:
A = Small codex (7 entries), short passage (165 words)
B = Large codex (40 entries), short passage (165 words)
C = Small codex (7 entries), long passage (734 words)
D = Large codex (40 entries), long passage (1,019 words)

Tier columns: 5, 10, 20, and 40 codex entries.

Model                                 Total  A     B     C     D     5     10    20    40
Claude Opus 4.5                       99%    100%  100%  100%  96%   100%  100%  100%  98%
Gemini 3.1 Pro (Preview)              99%    100%  99%   99%   96%   100%  100%  99%   100%
Claude Opus 4.6 (Reasoning)           99%    100%  97%   98%   96%   100%  100%  100%  100%
Gemini 2.5 Pro                        99%    99%   100%  99%   96%   100%  99%   97%   99%
Claude Opus 4.6                       99%    100%  98%   99%   91%   100%  100%  100%  100%
Grok 4.20 (Beta, Reasoning)           98%    100%  97%   100%  96%   100%  98%   96%   100%
Qwen 3.5 27B                          98%    100%  94%   98%   96%   99%   100%  94%   100%
Z.AI GLM 5 Turbo                      97%    100%  98%   94%   94%   99%   100%  95%   100%
Grok 4.1 Fast                         97%    99%   96%   93%   95%   100%  100%  95%   100%
GPT-5.4                               97%    100%  98%   91%   95%   100%  100%  95%   99%
Grok 4                                97%    97%   99%   90%   98%   97%   99%   96%   100%
Gemini 3 Flash (Preview, Reasoning)   97%    100%  96%   97%   92%   100%  100%  95%   98%
GPT-5.2                               97%    100%  95%   96%   96%   100%  99%   95%   96%
Gemini 2.5 Flash (Reasoning)          97%    100%  95%   95%   95%   100%  100%  93%   98%
Qwen 3.5 35B                          97%    99%   96%   93%   93%   100%  100%  94%   100%
