Codex Red Herring (False Positive Detection)

Tests whether models correctly report "no violations" when a codex is fully consistent with the prose passage. Models that report violations that do not exist (false positives) fail. Uses a 2×2 matrix of text length × codex size, with bare and detailed-entry variants.
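
The pass/fail rule and the test matrix described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the function names, the "long" and "small" labels, the full crossing of both entry variants with every matrix cell, and the idea that a percentage score is the share of passing runs are all assumptions.

```python
from itertools import product

# Hypothetical labels. The source names the two axes and the two entry
# variants, but "long" and "small" are assumed counterparts of the
# "short" text and "big" codex it mentions.
TEXT_LENGTHS = ("short", "long")
CODEX_SIZES = ("small", "big")
ENTRY_STYLES = ("bare", "detailed")


def conditions():
    """Enumerate each text length x codex size cell in both entry variants.

    Assumes the variants are crossed with every cell of the 2x2 matrix.
    """
    return list(product(TEXT_LENGTHS, CODEX_SIZES, ENTRY_STYLES))


def passes(reported_violations):
    """A run passes only if the model reports no violations at all.

    The codex is fully consistent with the passage, so any reported
    violation is a false positive.
    """
    return len(reported_violations) == 0


def pass_rate(runs):
    """Percentage of passing runs; each run is a list of reported violations.

    Assumption: the score is the share of runs that pass.
    """
    if not runs:
        raise ValueError("need at least one run")
    return 100.0 * sum(passes(run) for run in runs) / len(runs)
```

For example, `pass_rate([[], [], ["eye color mismatch"]])` treats the third run as a failure because it reports a violation that cannot exist in a consistent codex.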

Short text (~524 words), big codex (51 detailed entries)

Hallucination

Performance Score Distribution (Top 20)

Model                    Score
GPT-5 Mini               100%
Claude Opus 4.6          100%
Claude Sonnet 4.6        100%
ByteDance Seed 1.6       100%
o4 Mini High             100%
GPT-5.2                  100%
Claude Opus 4.5          100%
Grok 4.1 Fast            100%
GPT-4.1                  100%
o4 Mini                  100%
Grok 4                   100%
Grok 4 Fast              100%
Mistral Large 3          100%
GPT-5 Nano               100%
Mistral Large 2          100%
Mistral Large            100%
Ministral 3 8B           100%
Arcee AI: Trinity Mini   100%
Ministral 8B             100%
Z.AI GLM 5               93%