Codex Red Herring (False Positive Detection)

Tests whether models correctly report "no violations" when a codex is fully consistent with the prose passage. Models that hallucinate violations (false positives) fail. The test uses a 2×2 matrix of text length × codex size, run once with basic codex entries and once with detailed entries, for eight variants in total.
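As a rough illustration of the test matrix described above, the sketch below enumerates the eight variants. The word counts and entry counts are taken from the variant labels on this page; the function names and the `is_pass` rule are hypothetical and only mirror the stated idea that any reported violation counts as a false positive.

```python
from itertools import product

# Values taken from the variant labels on this page.
TEXT_LENGTHS = {"short": 524, "long": 1594}  # approximate word counts
CODEX_SIZES = {"small": 11, "big": 51}       # number of codex entries
ENTRY_STYLES = ("basic", "detailed")


def build_variants():
    """Return the eight test variants in the order the page lists them."""
    variants = []
    # product() varies the last argument fastest: codex size, then text length.
    for style, text, codex in product(ENTRY_STYLES, TEXT_LENGTHS, CODEX_SIZES):
        variants.append({
            "entry_style": style,
            "text": text,
            "words": TEXT_LENGTHS[text],
            "codex": codex,
            "entries": CODEX_SIZES[codex],
        })
    return variants


def is_pass(reported_violations):
    """Hypothetical pass rule: the codex is fully consistent with the text,
    so any reported violation is treated as a false positive."""
    return len(reported_violations) == 0


if __name__ == "__main__":
    for v in build_variants():
        print(f"{v['entry_style']:8} | {v['text']:5} text (~{v['words']} words) "
              f"| {v['codex']:5} codex ({v['entries']} entries)")
```

How individual runs are aggregated into the percentage scores shown on this page is not stated, so the sketch stops at the per-response pass check.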

Price-Performance Score Distribution (Top 20)


| Model | Score | Cost | Time |
|---|---|---|---|
| LFM2 24B | 72% | $0.0004 | 39.0s |
| GPT-4.1 Nano | 63% | $0.0005 | 4.6s |
| GPT-5.4 Nano | 68% | $0.0005 | 1.5s |
| Ministral 8B | 68% | $0.0007 | 4.4s |
| Ministral 3 8B | 78% | $0.0013 | 17.9s |
| GPT-5.4 Nano (Reasoning, Low) | 97% | $0.0008 | 3.9s |
| Inception Mercury | 98% | $0.0004 | 8.0s |
| Inception Mercury 2 | 96% | $0.0027 | 3.8s |
| GPT-5.4 Nano (Reasoning) | 94% | $0.0017 | 11.4s |
| GPT-5.4 Mini (Reasoning, Low) | 95% | $0.0034 | 4.0s |
| ByteDance Seed 1.6 Flash | 94% | $0.0008 | 9.1s |
| Arcee AI: Trinity Mini | 73% | $0.0009 | 26.3s |
| Grok 4.1 Fast | 97% | $0.0019 | 12.5s |
| GPT-4.1 | 86% | $0.0048 | 1.2s |
| Stealth: Healer Alpha | 84% | $0.0000 | 21.5s |
| Grok 4 Fast | 73% | $0.0019 | 19.2s |
| GPT-5.4 Mini (Reasoning) | 89% | $0.0090 | 10.8s |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | $0.0023 | 16.6s |
| Mistral Small 4 (Reasoning) | 83% | $0.0025 | 22.6s |
| Z.AI GLM 5 Turbo | 96% | $0.0071 | 16.0s |

Cost vs Performance

Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
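The quadrant split described above can be sketched as follows. This is a minimal illustration, not the site's implementation: the scores and costs are copied from rows on this page, the last entry is a hypothetical model with no cost data, the medians are computed over this small sample only, and the tie-handling (a value equal to the median counts as "low cost" or "high score") is an assumption.

```python
from statistics import median

# (model, score, cost in USD). Values copied from the tables on this page,
# except the last row, which is a hypothetical entry without cost data.
RESULTS = [
    ("Inception Mercury", 0.98, 0.0004),
    ("Grok 4.1 Fast", 0.97, 0.0019),
    ("GPT-5.4 Mini (Reasoning)", 0.89, 0.0090),
    ("GPT-4.1", 0.86, 0.0048),
    ("LFM2 24B", 0.72, 0.0004),
    ("Example model without cost", 0.90, None),
]


def quadrants(results):
    """Place each priced model in a cost/score quadrant split at the medians."""
    # Only models with available cost data take part, as the chart note says.
    priced = [r for r in results if r[2] is not None]
    score_mid = median(score for _, score, _ in priced)
    cost_mid = median(cost for _, _, cost in priced)

    labels = {}
    for name, score, cost in priced:
        # Assumption: ties with the median fall on the favourable side.
        cost_side = "low cost" if cost <= cost_mid else "high cost"
        score_side = "high score" if score >= score_mid else "low score"
        labels[name] = f"{cost_side}, {score_side}"
    return score_mid, cost_mid, labels


if __name__ == "__main__":
    score_mid, cost_mid, labels = quadrants(RESULTS)
    print(f"median score: {score_mid:.2f}, median cost: ${cost_mid:.4f}")
    for name, label in labels.items():
        print(f"{name}: {label}")
```

Because the medians depend on which models are included, the quadrant a model lands in on the full chart may differ from the one this five-model sample gives.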

Most Stable Models (Top 20)

Ranked by stability (median × consistency).

| Model | Score | Consistency | Stability |
|---|---|---|---|
| Nemotron 3 Super | 99% | 83% | 83% |
| Inception Mercury | 98% | 77% | 77% |
| o4 Mini High | 97% | 72% | 72% |
| Grok 4.1 Fast | 97% | 70% | 70% |
| GPT-5.4 Nano (Reasoning, Low) | 97% | 70% | 70% |
| o4 Mini | 96% | 67% | 67% |
| Inception Mercury 2 | 96% | 67% | 67% |
| Z.AI GLM 5 Turbo | 96% | 65% | 65% |
| GPT-5.1 | 95% | 64% | 64% |
| GPT-5.4 Mini (Reasoning, Low) | 95% | 64% | 64% |
| Claude Opus 4.6 (Reasoning) | 94% | 60% | 60% |
| ByteDance Seed 1.6 Flash | 94% | 59% | 59% |
| GPT-5.4 Nano (Reasoning) | 94% | 57% | 57% |
| Z.AI GLM 5 | 93% | 56% | 56% |
| GPT-5 Mini | 93% | 56% | 56% |
| GPT-5 Nano | 93% | 55% | 55% |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | 53% | 53% |
| MiniMax M2.7 | 90% | 48% | 48% |
| GPT-5.4 Mini (Reasoning) | 89% | 46% | 46% |
| Aion 2.0 | 88% | 45% | 45% |

Top Overall Models (Top 20)

Ranked by composite score (performance, cost, speed and stability).

| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Inception Mercury | 98% | $0.0004 | 8.0s | 77% |
| Nemotron 3 Super | 99% | $0.0000 | 1.3m | 83% |
| GPT-5.4 Nano (Reasoning, Low) | 97% | $0.0008 | 3.9s | 70% |
| Grok 4.1 Fast | 97% | $0.0019 | 12.5s | 70% |
| Inception Mercury 2 | 96% | $0.0027 | 3.8s | 67% |
| GPT-5.4 Mini (Reasoning, Low) | 95% | $0.0034 | 4.0s | 64% |
| Z.AI GLM 5 Turbo | 96% | $0.0071 | 16.0s | 65% |
| ByteDance Seed 1.6 Flash | 94% | $0.0008 | 9.1s | 59% |
| o4 Mini | 96% | $0.014 | 25.0s | 67% |
| GPT-5.4 Nano (Reasoning) | 94% | $0.0017 | 11.4s | 57% |
| o4 Mini High | 97% | $0.027 | 52.5s | 72% |
| Gemini 2.5 Flash Lite (Reasoning) | 92% | $0.0023 | 16.6s | 53% |
| GPT-5.1 | 95% | $0.025 | 26.1s | 64% |
| GPT-5 Mini | 93% | $0.0059 | 37.8s | 56% |
| GPT-5 Nano | 93% | $0.0035 | 1.1m | 55% |
| MiniMax M2.7 | 90% | $0.0047 | 34.6s | 48% |
| GPT-5.4 Mini (Reasoning) | 89% | $0.0090 | 10.8s | 46% |
| GPT-4.1 | 86% | $0.0048 | 1.2s | 42% |
| Gemini 2.5 Flash (Reasoning) | 87% | $0.0085 | 14.2s | 43% |
| Z.AI GLM 5 | 93% | $0.017 | 1.4m | 56% |


Scores by test variant, sorted by Total (descending). Short text is ~524 words and long text is ~1594 words; a small codex has 11 entries and a big codex has 51. "Basic" and "Detailed" refer to the codex entry style.

| Model | Total | Basic: short text, small codex | Basic: short text, big codex | Basic: long text, small codex | Basic: long text, big codex | Detailed: short text, small codex | Detailed: short text, big codex | Detailed: long text, small codex | Detailed: long text, big codex |
|---|---|---|---|---|---|---|---|---|---|
| Nemotron 3 Super | 99% | 100% | 100% | 100% | 93% | 100% | 100% | 100% | 100% |
| Inception Mercury | 98% | 100% | 93% | 100% | 93% | 100% | 100% | 100% | 100% |
| o4 Mini High | 97% | 93% | 100% | 100% | 85% | 100% | 100% | 100% | 100% |
| Grok 4.1 Fast | 97% | 100% | 100% | 100% | 92% | 100% | 100% | 85% | 100% |
| GPT-5.4 Nano (Reasoning, Low) | 97% | 100% | 90% | 100% | 100% | 93% | 93% | 100% | 100% |
| o4 Mini | 96% | 100% | 100% | 100% | 85% | 93% | 100% | 93% | 100% |
| Inception Mercury 2 | 96% | 100% | 100% | 100% | 100% | 85% | 100% | 100% | 85% |
| Z.AI GLM 5 Turbo | 96% | 100% | 100% | 100% | 68% | 100% | 100% | 100% | 100% |
| GPT-5.1 | 95% | 93% | 93% | 100% | 100% | 93% | 85% | 100% | 100% |
| GPT-5.4 Mini (Reasoning, Low) | 95% | 93% | 85% | 100% | 85% | 100% | 100% | 100% | 100% |
| Claude Opus 4.6 (Reasoning) | 94% | 100% | 100% | 100% | 100% | 85% | 70% | 100% | 100% |
| ByteDance Seed 1.6 Flash | 94% | 100% | 100% | 100% | 93% | 84% | 93% | 100% | 84% |
| GPT-5.4 Nano (Reasoning) | 94% | 100% | 100% | 100% | 58% | 100% | 100% | 93% | 100% |
| Z.AI GLM 5 | 93% | 93% | 78% | 100% | 83% | 100% | 93% | 100% | 100% |
| GPT-5 Mini | 93% | 100% | 100% | 100% | 45% | 100% | 100% | 100% | 100% |
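
The page does not say how the Total column is derived. The rows shown are consistent with an unweighted mean of the eight variant scores, so the sketch below assumes that; treat it as an inference rather than the site's documented formula. The per-variant scores are copied from two rows of the table, and `summarize` is a hypothetical helper name.

```python
from statistics import mean

# Per-variant scores in table order: four basic-entry variants,
# then four detailed-entry variants. Values copied from this page.
VARIANT_SCORES = {
    "Nemotron 3 Super": [100, 100, 100, 93, 100, 100, 100, 100],
    "GPT-5 Mini": [100, 100, 100, 45, 100, 100, 100, 100],
}


def summarize(scores):
    """Assumed aggregation: Total is the unweighted mean of all eight cells."""
    basic, detailed = scores[:4], scores[4:]
    return {
        "total": round(mean(scores)),   # rounded to a whole percent
        "basic": mean(basic),           # mean over basic-entry variants
        "detailed": mean(detailed),     # mean over detailed-entry variants
    }


if __name__ == "__main__":
    for model, scores in VARIANT_SCORES.items():
        print(model, summarize(scores))
```

Splitting the mean by entry style makes it easier to see where a model loses points: GPT-5 Mini, for example, drops only on the long-text, big-codex variant with basic entries. Because the displayed cells are already rounded, a Total recomputed this way can differ from the published one by a point.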
Showing models 1–15 of 116 (page 1 of 8).
