Codex Violation Detection

Tests whether a model can detect factual inconsistencies between a story bible (the codex) and prose passages. The model must output structured XML identifying each violation by paragraph number and substring.
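As a rough illustration of what consuming such output might look like, the sketch below parses a violation list and checks that each reported substring really occurs in the cited paragraph. The XML schema shown here (a `violations` root with `violation` elements carrying a `paragraph` attribute) is an assumption made for the example; the page does not publish the actual format, and the sample passage and helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical output format: the tag and attribute names are assumptions,
# not the benchmark's actual schema.
SAMPLE_OUTPUT = """
<violations>
  <violation paragraph="2">her green eyes</violation>
  <violation paragraph="3">the northern port of Saltmere</violation>
</violations>
"""


def parse_violations(xml_text):
    """Return (paragraph_number, substring) pairs from a model's XML output."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for node in root.iter("violation"):
        paragraph = int(node.get("paragraph"))
        substring = (node.text or "").strip()
        results.append((paragraph, substring))
    return results


def check_substrings(violations, paragraphs):
    """Keep only violations whose substring occurs in the cited paragraph (1-indexed)."""
    valid = []
    for number, substring in violations:
        in_range = 1 <= number <= len(paragraphs)
        if in_range and substring and substring in paragraphs[number - 1]:
            valid.append((number, substring))
    return valid


# Made-up passage, one string per paragraph.
paragraphs = [
    "Mira left the capital at dawn.",
    "Everyone remembered her green eyes.",
    "She sailed from the southern port of Saltmere.",
]

violations = parse_violations(SAMPLE_OUTPUT)
grounded = check_substrings(violations, paragraphs)
```

In this sample the second reported violation quotes text that does not appear in paragraph 3, so `check_substrings` drops it. A check like this only confirms that the quoted span exists; deciding whether it truly contradicts the codex is a separate step.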

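| Model | Total | Small codex, short passage | Large codex, short passage | Small codex, long passage | Large codex, long passage | 5 codex entries | 10 codex entries | 20 codex entries | 40 codex entries |
|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | 99% | 100% | 100% | 95% | 95% | 100% | 100% | 100% | 98% |
| Claude Opus 4.6 | 98% | 100% | 98% | 95% | 94% | 100% | 100% | 100% | 100% |
| Gemini 2.5 Pro | 98% | 99% | 100% | 96% | 95% | 100% | 99% | 97% | 99% |
| Claude Sonnet 4.5 | 98% | 100% | 100% | 96% | 90% | 98% | 100% | 97% | 99% |
| MoonshotAI: Kimi K2.5 | 97% | 99% | 98% | 95% | 91% | 100% | 100% | 93% | 99% |
| Z.AI GLM 5 | 97% | 100% | 95% | 98% | 91% | 100% | 99% | 93% | 97% |
| Z.AI GLM 4.7 | 97% | 100% | 94% | 97% | 93% | 100% | 100% | 91% | 98% |
| Gemini 3 Pro (Preview) | 97% | 99% | 93% | 99% | 89% | 100% | 100% | 95% | 98% |
| Claude Sonnet 4 | 96% | 100% | 97% | 95% | 93% | 100% | 100% | 92% | 96% |
| GPT-5.1 | 96% | 100% | 94% | 97% | 91% | 100% | 95% | 96% | 98% |
| o4 Mini High | 96% | 98% | 94% | 96% | 93% | 100% | 95% | 93% | 98% |
| GPT-5.2 | 96% | 100% | 95% | 91% | 92% | 100% | 99% | 95% | 96% |
| GPT-5 | 96% | 100% | 96% | 94% | 92% | 96% | 95% | 97% | 98% |
| Gemini 3 Flash (Preview) | 96% | 100% | 93% | 99% | 92% | 100% | 100% | 92% | 90% |
| o4 Mini | 96% | 98% | 93% | 97% | 92% | 100% | 96% | 92% | 98% |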
Showing models 1–15 of 61, sorted by Total (descending).

Cost vs Performance

Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
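The quadrant lines described above could be computed along the following lines. This is a minimal sketch, not the site's implementation: the rows are placeholder data (no cost figures appear on this page), the function name is hypothetical, and the tie-breaking rule for points that sit exactly on a median line is an assumption.

```python
from statistics import median

# Hypothetical (model, total_cost_usd, score_percent) rows; the names and
# numbers are placeholders, not values from the leaderboard.
rows = [
    ("model-a", 0.40, 99.0),
    ("model-b", 0.10, 97.0),
    ("model-c", 0.90, 95.0),
    ("model-d", 0.05, 90.0),
    ("model-e", None, 96.0),  # no cost data, so it is left off the plot
]


def quadrants(rows):
    """Return the median cost, the median score, and a quadrant label per model."""
    # Only models with available cost data are considered.
    priced = [(name, cost, score) for name, cost, score in rows if cost is not None]
    cost_line = median(cost for _, cost, _ in priced)
    score_line = median(score for _, _, score in priced)

    labels = {}
    for name, cost, score in priced:
        # Assumption: points exactly on a line count as cheap / strong.
        cost_side = "cheap" if cost <= cost_line else "expensive"
        score_side = "strong" if score >= score_line else "weak"
        labels[name] = f"{cost_side}/{score_side}"
    return cost_line, score_line, labels


cost_line, score_line, labels = quadrants(rows)
```

Because the lines sit at the medians of the plotted models, roughly half of the models fall on each side of each line, so the quadrants describe relative standing within this test rather than any absolute cost or score threshold.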

Small codex, short passage

Tags: 0-shot, Tooling, Utility, Logic, Rule following

[Chart: Performance Score Distribution (Top 20)]

Large codex, short passage

Tags: 0-shot, Tooling, Utility, Logic, Rule following

[Chart: Performance Score Distribution (Top 20)]

Small codex, long passage

Tags: 0-shot, Tooling, Utility, Logic, Rule following

[Chart: Performance Score Distribution (Top 20)]

Large codex, long passage

Tags: 0-shot, Tooling, Utility, Logic, Rule following

[Chart: Performance Score Distribution (Top 20)]

5 codex entries

Tags: 0-shot, Tooling, Utility, Logic, Rule following

[Chart: Performance Score Distribution (Top 20)]

10 codex entries

Tags: 0-shot, Tooling, Utility, Logic, Rule following

[Chart: Performance Score Distribution (Top 20)]

20 codex entries

Tags: 0-shot, Tooling, Utility, Logic, Rule following

[Chart: Performance Score Distribution (Top 20)]

40 codex entries

Tags: 0-shot, Tooling, Utility, Logic, Rule following

[Chart: Performance Score Distribution (Top 20)]