Codex Red Herring (False Positive Detection)

Tests whether a model correctly reports "no violations" when the codex is fully consistent with the prose passage. A model that hallucinates a violation (a false positive) fails. The test uses a 2×2 matrix of text length × codex size, with bare and detailed-entry variants.
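The setup described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness. The names `build_conditions` and `pass_rate`, the labels "short" and "small", and the reading that every cell of the matrix is run with both entry variants are all assumptions made for the example.

```python
from itertools import product

# Assumed labels: the page names the two axes and the two entry variants,
# but "short" and "small" are guesses at the opposite ends of each axis.
TEXT_LENGTHS = ("short", "long")
CODEX_SIZES = ("small", "big")
ENTRY_VARIANTS = ("bare", "detailed")


def build_conditions():
    """Enumerate every text length x codex size cell for each entry variant."""
    return [
        {"text": text, "codex": codex, "entries": variant}
        for text, codex, variant in product(TEXT_LENGTHS, CODEX_SIZES, ENTRY_VARIANTS)
    ]


def pass_rate(reported_violations_per_run):
    """Return the fraction of runs in which the model reported no violations.

    Each element is the list of violations a model claimed to find in one run.
    The codex is consistent with the passage, so any non-empty list is a
    false positive and that run counts as a failure.
    """
    if not reported_violations_per_run:
        raise ValueError("at least one run is required")
    passes = sum(1 for violations in reported_violations_per_run if not violations)
    return passes / len(reported_violations_per_run)
```

For example, `pass_rate([[], [], ["eye color mismatch"]])` counts two clean runs out of three. The run counts and the violation text here are illustrative and are not taken from the benchmark.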

Long text (~1594 words), big codex (51 detailed entries)

Hallucination

Performance Score Distribution (Top 20)


Model                                Score
Claude Opus 4.6 (Reasoning)          100%
GPT-5 Mini                           100%
GPT-5.1                              100%
Claude Opus 4.6                      100%
GPT-5                                100%
Z.AI GLM 5                           100%
o4 Mini High                         100%
Grok 4.1 Fast                        100%
Aion 2.0                             100%
GPT-4.1                              100%
o4 Mini                              100%
Ministral 3 8B                       100%
Ministral 8B                         100%
Ministral 3 14B                      93%
Claude Sonnet 4.6 (Reasoning)        93%
GPT-5.2                              93%
Grok 4 Fast                          93%
Gemini 2.5 Flash Lite (Reasoning)    93%
GPT-5 Nano                           93%
MoonshotAI: Kimi K2.5                92%