Codex Red Herring (False Positive Detection)

Tests whether a model correctly reports "no violations" when a codex is fully consistent with the prose passage. A model that hallucinates violations (false positives) fails. The test uses a 2×2 matrix of text length × codex size, run once with basic codex entries and once with detailed entries.
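
As a rough illustration of the test layout described above, the eight scenarios can be enumerated as follows. This is a minimal sketch, not the benchmark's own harness: the names `scenarios` and `is_false_positive` are hypothetical, and the word and entry counts are the approximate figures shown in the per-scenario results on this page.

```python
from itertools import product

# Approximate sizes as listed in the per-scenario results on this page.
TEXT_LENGTHS = {"short": 524, "long": 1594}   # words in the prose passage
CODEX_SIZES = {"small": 11, "big": 51}        # number of codex entries
ENTRY_STYLES = ("basic", "detailed")          # the two entry variants


def scenarios():
    """Return the 2 x 2 matrix of text length x codex size, once per entry style."""
    return [
        {
            "entry_style": style,
            "text": text,
            "words": TEXT_LENGTHS[text],
            "codex": codex,
            "entries": CODEX_SIZES[codex],
        }
        for style, text, codex in product(ENTRY_STYLES, TEXT_LENGTHS, CODEX_SIZES)
    ]


def is_false_positive(reported_violations):
    """The codex is fully consistent, so any reported violation is a false positive."""
    return len(reported_violations) > 0


if __name__ == "__main__":
    for s in scenarios():
        print(s)
```

Because every scenario is consistent by construction, the only correct answer is an empty violation list; anything else counts against the model.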

Price-Performance Score Distribution (Top 20)


| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-4.1 Nano | 63% | $0.0005 | 4.6s |
| Ministral 8B | 68% | $0.0007 | 4.4s |
| Ministral 3 8B | 78% | $0.0013 | 17.9s |
| ByteDance Seed 1.6 Flash | 94% | $0.0008 | 9.1s |
| Arcee AI: Trinity Mini | 73% | $0.0009 | 26.3s |
| Grok 4.1 Fast | 97% | $0.0019 | 12.5s |
| GPT-4.1 | 86% | $0.0048 | 1.2s |
| Grok 4 Fast | 73% | $0.0019 | 19.2s |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | $0.0023 | 16.6s |
| Gemini 2.5 Flash (Reasoning) | 87% | $0.0085 | 14.2s |
| Minimax M2.5 | 79% | $0.0032 | 25.9s |
| ByteDance Seed 1.6 | 85% | $0.0043 | 32.7s |
| GPT-5.2 | 85% | $0.013 | 14.5s |
| GPT-5 Mini | 93% | $0.0059 | 37.8s |
| o4 Mini | 96% | $0.014 | 25.0s |
| GPT-5 Nano | 93% | $0.0035 | 1.1m |
| GPT-5.1 | 95% | $0.025 | 26.1s |
| Z.AI GLM 4.6 | 71% | $0.015 | 59.0s |
| o4 Mini High | 97% | $0.027 | 52.5s |
| Aion 2.0 | 88% | $0.0096 | 1.3m |

Cost vs Performance

Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
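
The quadrant split described above can be sketched as follows. This is an illustrative sketch only: the function name `quadrants`, the labels, and the tie-handling at the median are assumptions, and the sample uses just four of the models listed in the price-performance table, so its medians differ from the ones drawn on the chart.

```python
from statistics import median


def quadrants(models):
    """Classify each model against the median score and median cost.

    models maps a model name to a (score, cost_in_dollars) pair.
    Ties are counted as "high score" and "low cost" (an assumption).
    """
    score_mid = median(score for score, _ in models.values())
    cost_mid = median(cost for _, cost in models.values())
    result = {}
    for name, (score, cost) in models.items():
        perf = "high score" if score >= score_mid else "low score"
        price = "low cost" if cost <= cost_mid else "high cost"
        result[name] = f"{perf}, {price}"
    return result


# Four (score, cost) pairs taken from the price-performance table.
sample = {
    "GPT-4.1 Nano": (0.63, 0.0005),
    "Grok 4.1 Fast": (0.97, 0.0019),
    "GPT-5.1": (0.95, 0.025),
    "Z.AI GLM 4.6": (0.71, 0.015),
}

if __name__ == "__main__":
    for name, label in quadrants(sample).items():
        print(f"{name}: {label}")
```

Models landing in the "high score, low cost" quadrant are the ones that look best on price-performance.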

Most Stable Models (Top 20)

Ranked by stability (median × consistency).

| Model | Score | Consistency | Stability |
|---|---|---|---|
| o4 Mini High | 97% | 72% | 72% |
| Grok 4.1 Fast | 97% | 70% | 70% |
| o4 Mini | 96% | 67% | 67% |
| GPT-5.1 | 95% | 64% | 64% |
| Claude Opus 4.6 (Reasoning) | 94% | 60% | 60% |
| ByteDance Seed 1.6 Flash | 94% | 59% | 59% |
| Z.AI GLM 5 | 93% | 56% | 56% |
| GPT-5 Mini | 93% | 56% | 56% |
| GPT-5 Nano | 93% | 55% | 55% |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | 53% | 53% |
| Aion 2.0 | 88% | 45% | 45% |
| Gemini 2.5 Flash (Reasoning) | 87% | 43% | 43% |
| GPT-4.1 | 86% | 42% | 42% |
| ByteDance Seed 1.6 | 85% | 39% | 39% |
| GPT-5.2 | 85% | 39% | 39% |
| Claude Opus 4.6 | 81% | 37% | 37% |
| Claude Sonnet 4.6 (Reasoning) | 82% | 35% | 35% |
| Claude Sonnet 4.6 | 70% | 34% | 34% |
| Minimax M2.5 | 79% | 32% | 32% |
| GPT-5 | 76% | 29% | 29% |
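
The stability ranking above is defined as median × consistency. A minimal sketch of that product is shown below; the page does not say how consistency is derived or which scores the median is taken over, so here `consistency` is simply passed in and `run_scores` is a hypothetical list of individual scores in the range 0 to 1.

```python
from statistics import median


def stability(run_scores, consistency):
    """Stability as stated on this page: median score times consistency.

    run_scores: hypothetical list of individual scores between 0 and 1.
    consistency: value between 0 and 1, taken as given (its formula is not stated).
    """
    return median(run_scores) * consistency


if __name__ == "__main__":
    # Illustrative inputs only, not real benchmark runs.
    print(stability([1.0, 1.0, 0.85, 1.0], 0.72))
```

In the listing above, Stability equals Consistency for every model shown, which would be consistent with a median of 100%, although the page does not confirm this.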

Top Overall Models (Top 20)

Ranked by composite score (performance, cost, speed & stability).

| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Grok 4.1 Fast | 97% | $0.0019 | 12.5s | 70% |
| o4 Mini | 96% | $0.014 | 25.0s | 67% |
| ByteDance Seed 1.6 Flash | 94% | $0.0008 | 9.1s | 59% |
| o4 Mini High | 97% | $0.027 | 52.5s | 72% |
| GPT-5.1 | 95% | $0.025 | 26.1s | 64% |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | $0.0023 | 16.6s | 53% |
| GPT-5 Mini | 93% | $0.0059 | 37.8s | 56% |
| GPT-5 Nano | 93% | $0.0035 | 1.1m | 55% |
| GPT-4.1 | 86% | $0.0048 | 1.2s | 42% |
| Gemini 2.5 Flash (Reasoning) | 87% | $0.0085 | 14.2s | 43% |
| Z.AI GLM 5 | 93% | $0.017 | 1.4m | 56% |
| ByteDance Seed 1.6 | 85% | $0.0043 | 32.7s | 39% |
| GPT-5.2 | 85% | $0.013 | 14.5s | 39% |
| Aion 2.0 | 88% | $0.0096 | 1.3m | 45% |
| Minimax M2.5 | 79% | $0.0032 | 25.9s | 32% |
| Ministral 3 8B | 78% | $0.0013 | 17.9s | 24% |
| Grok 4 Fast | 73% | $0.0019 | 19.2s | 25% |
| Claude Opus 4.6 | 81% | $0.049 | 13.2s | 37% |
| Arcee AI: Trinity Mini | 73% | $0.0009 | 26.3s | 22% |
| Ministral 8B | 68% | $0.0007 | 4.4s | 21% |

Scores per scenario, sorted by Total (descending). Short text is ~524 words and long text is ~1594 words; a small codex has 11 entries and a big codex has 51 entries. The first four scenario columns use basic entries, the last four use detailed entries.

| Model | Total | Basic: short text, small codex | Basic: short text, big codex | Basic: long text, small codex | Basic: long text, big codex | Detailed: short text, small codex | Detailed: short text, big codex | Detailed: long text, small codex | Detailed: long text, big codex |
|---|---|---|---|---|---|---|---|---|---|
| o4 Mini High | 97% | 93% | 100% | 100% | 85% | 100% | 100% | 100% | 100% |
| Grok 4.1 Fast | 97% | 100% | 100% | 100% | 92% | 100% | 100% | 85% | 100% |
| o4 Mini | 96% | 100% | 100% | 100% | 85% | 93% | 100% | 93% | 100% |
| GPT-5.1 | 95% | 93% | 93% | 100% | 100% | 93% | 85% | 100% | 100% |
| Claude Opus 4.6 (Reasoning) | 94% | 100% | 100% | 100% | 100% | 85% | 70% | 100% | 100% |
| ByteDance Seed 1.6 Flash | 94% | 100% | 100% | 100% | 93% | 84% | 93% | 100% | 84% |
| Z.AI GLM 5 | 93% | 93% | 78% | 100% | 83% | 100% | 93% | 100% | 100% |
| GPT-5 Mini | 93% | 100% | 100% | 100% | 45% | 100% | 100% | 100% | 100% |
| GPT-5 Nano | 93% | 85% | 78% | 100% | 100% | 93% | 100% | 93% | 93% |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | 93% | 100% | 93% | 93% | 100% | 85% | 78% | 93% |
| Aion 2.0 | 88% | 78% | 93% | 93% | 100% | 93% | 93% | 55% | 100% |
| Gemini 2.5 Flash (Reasoning) | 87% | 100% | 93% | 100% | 85% | 100% | 93% | 40% | 85% |
| GPT-4.1 | 86% | 41% | 100% | 100% | 45% | 100% | 100% | 100% | 100% |
| ByteDance Seed 1.6 | 85% | 100% | 100% | 100% | 51% | 100% | 100% | 48% | 85% |
| GPT-5.2 | 85% | 76% | 75% | 100% | 92% | 85% | 100% | 62% | 93% |

Showing models 1–15 of 84.
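
The page does not state how the Total column is derived, but the rows shown above look consistent with an unweighted mean of the eight scenario scores. The sketch below checks that reading against three rows from the table; `total_score` is a hypothetical helper, and small differences are expected because the displayed percentages are already rounded.

```python
def total_score(scenario_scores):
    """Unweighted mean of the per-scenario scores (an assumed aggregation)."""
    return sum(scenario_scores) / len(scenario_scores)


# (displayed Total, eight scenario scores) copied from the table, in percent.
rows = {
    "o4 Mini High": (97, [93, 100, 100, 85, 100, 100, 100, 100]),
    "GPT-4.1": (86, [41, 100, 100, 45, 100, 100, 100, 100]),
    "GPT-5 Mini": (93, [100, 100, 100, 45, 100, 100, 100, 100]),
}

if __name__ == "__main__":
    for name, (displayed, scores) in rows.items():
        print(f"{name}: displayed {displayed}%, mean {total_score(scores):.2f}%")
```

Under this reading, a single weak scenario (for example 45% on long text with a big codex) costs a model about seven points of Total.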

basic entries

Short text (~524 words), small codex (11 entries)

Hallucination

Short text (~524 words), big codex (51 entries)

Hallucination

Long text (~1594 words), small codex (11 entries)

Hallucination

Long text (~1594 words), big codex (51 entries)

Hallucination

detailed entries

Short text (~524 words), small codex (11 detailed entries)

Hallucination

Short text (~524 words), big codex (51 detailed entries)

Hallucination

Performance Score Distribution (Top 20)


| Model | Score |
|---|---|
| GPT-5 Mini | 100% |
| Claude Opus 4.6 | 100% |
| Claude Sonnet 4.6 | 100% |
| ByteDance Seed 1.6 | 100% |
| o4 Mini High | 100% |
| GPT-5.2 | 100% |
| Claude Opus 4.5 | 100% |
| Grok 4.1 Fast | 100% |
| GPT-4.1 | 100% |
| o4 Mini | 100% |
| Grok 4 | 100% |
| Grok 4 Fast | 100% |
| Mistral Large 3 | 100% |
| GPT-5 Nano | 100% |
| Mistral Large 2 | 100% |
| Mistral Large | 100% |
| Ministral 3 8B | 100% |
| Arcee AI: Trinity Mini | 100% |
| Ministral 8B | 100% |
| Z.AI GLM 5 | 93% |

Long text (~1594 words), small codex (11 detailed entries)

Hallucination

Long text (~1594 words), big codex (51 detailed entries)

Hallucination

Performance Score Distribution (Top 20)


| Model | Score |
|---|---|
| Claude Opus 4.6 (Reasoning) | 100% |
| GPT-5 Mini | 100% |
| GPT-5.1 | 100% |
| Claude Opus 4.6 | 100% |
| GPT-5 | 100% |
| Z.AI GLM 5 | 100% |
| o4 Mini High | 100% |
| Grok 4.1 Fast | 100% |
| Aion 2.0 | 100% |
| GPT-4.1 | 100% |
| o4 Mini | 100% |
| Ministral 3 8B | 100% |
| Ministral 8B | 100% |
| Ministral 3 14B | 93% |
| Claude Sonnet 4.6 (Reasoning) | 93% |
| GPT-5.2 | 93% |
| Grok 4 Fast | 93% |
| Gemini 2.5 Flash Lite (Reasoning) | 93% |
| GPT-5 Nano | 93% |
| MoonshotAI: Kimi K2.5 | 92% |