Codex Extraction

Evaluates a model's ability to extract structured codex entries (characters, locations, objects, lore) from prose passages and return them as well-formed XML.
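The page does not publish the exact XML schema the test expects, but the kind of output being graded can be illustrated with a small sketch. The tag and attribute names below (`codex`, `entry`, `type`, `name`, `description`) are assumptions for illustration only:

```python
# Hypothetical sketch of a structured codex entry extracted from prose.
# Tag/attribute names are illustrative assumptions, not the benchmark's schema.
import xml.etree.ElementTree as ET

passage = (
    "The Rusty Lantern sat at the crossroads, its keeper Maren "
    "polishing the brass lamp that gave the inn its name."
)

codex = ET.Element("codex")
loc = ET.SubElement(codex, "entry", type="location", name="The Rusty Lantern")
ET.SubElement(loc, "description").text = "An inn at the crossroads."
char = ET.SubElement(codex, "entry", type="character", name="Maren")
ET.SubElement(char, "description").text = "Keeper of The Rusty Lantern."

xml_out = ET.tostring(codex, encoding="unicode")
print(xml_out)
```

A grader can then check well-formedness simply by parsing the output and inspecting the entries.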

Price-Performance Score Distribution (Top 20)


| Model | Score | Cost | Time |
|---|---|---|---|
| Gemini 3 Flash (Preview) | 97% | $0.0027 | 3.9s |
| Grok 4 Fast | 96% | $0.0012 | 8.7s |
| Qwen 3.5 Plus (2026-02-15) | 98% | $0.0030 | 10.6s |
| Mistral Medium 3.1 | 96% | $0.0026 | 5.8s |
| Mistral Small Creative | 94% | $0.0006 | 3.9s |
| Mistral Large 3 | 94% | $0.0027 | 8.2s |
| Grok 4.1 Fast | 97% | $0.0017 | 22.1s |
| Gemini 2.5 Flash | 94% | $0.0023 | 2.5s |
| Gemini 3.1 Flash Lite (Preview) | 94% | $0.0017 | 2.0s |
| Z.AI GLM 5 Turbo | 97% | $0.0068 | 16.0s |
| Z.AI GLM 4.5 | 96% | $0.0028 | 16.8s |
| Grok 4.20 (Beta) | 95% | $0.0049 | 2.0s |
| Ministral 3 8B | 94% | $0.0006 | 3.3s |
| DeepSeek V3.1 | 91% | $0.0012 | 26.1s |
| DeepSeek-V2 Chat | 95% | $0.0019 | 14.8s |
| Claude Haiku 4.5 | 95% | $0.0073 | 4.5s |
| Mistral Small 3.2 24B | 93% | $0.0005 | 4.4s |
| Stealth: Healer Alpha | 95% | $0.0000 | 24.6s |
| Gemini 3 Flash (Preview, Reasoning) | 98% | $0.0096 | 22.2s |
| DeepSeek V3 (2024-12-26) | 94% | $0.0017 | 13.5s |

Cost vs Performance

Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.

6 low-scoring outliers hidden: Gemma 3 12B (77.7%), GPT-4.1 Nano (75.3%), Llama 3.1 8B (71.5%), LFM2 24B (49.0%), Rocinante 12B (47.9%), Mistral NeMo (26.1%).
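The quadrant placement described above can be reproduced from the leaderboard data: the chart splits models along the median cost and median score. A minimal sketch, using a subset of the rows from the table above:

```python
# Quadrant lines are drawn at the median cost and median score;
# each model then falls into one of four cost/performance regions.
# Data points are a subset of the leaderboard table above.
from statistics import median

models = {
    "Gemini 3 Flash (Preview)": (0.0027, 0.97),
    "Grok 4 Fast": (0.0012, 0.96),
    "Qwen 3.5 Plus (2026-02-15)": (0.0030, 0.98),
    "Mistral Medium 3.1": (0.0026, 0.96),
    "Mistral Small Creative": (0.0006, 0.94),
}

cost_line = median(c for c, _ in models.values())
score_line = median(s for _, s in models.values())

def quadrant(cost: float, score: float) -> str:
    """Classify a model relative to the median cost/score lines."""
    cheap = cost < cost_line
    strong = score >= score_line
    if cheap and strong:
        return "low cost / high score"
    if cheap:
        return "low cost / low score"
    if strong:
        return "high cost / high score"
    return "high cost / low score"

print(quadrant(*models["Grok 4 Fast"]))  # low cost / high score
```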

Most Stable Models (Top 20)

Ranked by stability (median × consistency).

| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 98% | 99% | 97% |
| Claude Opus 4.5 | 99% | 99% | 97% |
| Claude Opus 4.6 | 98% | 98% | 97% |
| Grok 4 | 98% | 98% | 97% |
| Claude Sonnet 4.6 | 98% | 98% | 96% |
| GPT-5 | 98% | 97% | 96% |
| Claude Opus 4 | 98% | 98% | 96% |
| Claude Sonnet 4.6 (Reasoning) | 97% | 98% | 96% |
| Gemini 3 Flash (Preview, Reasoning) | 98% | 97% | 96% |
| Z.AI GLM 5 | 97% | 97% | 95% |
| Aion 2.0 | 97% | 97% | 95% |
| Grok 4.20 (Beta, Reasoning) | 97% | 98% | 95% |
| Qwen 3.5 Plus (2026-02-15) | 98% | 98% | 95% |
| Z.AI GLM 5 Turbo | 97% | 97% | 95% |
| o4 Mini High | 97% | 97% | 95% |
| Gemini 3 Flash (Preview) | 97% | 97% | 95% |
| Gemini 2.5 Pro | 97% | 97% | 94% |
| Gemini 3 Pro (Preview) | 97% | 96% | 94% |
| GPT-5.4 (Reasoning) | 97% | 97% | 94% |
| GPT-5.4 Mini (Reasoning) | 97% | 97% | 94% |
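Since stability is defined on this page as median score × consistency, any row can be checked by hand. For example, the top entry (98% score, 99% consistency) gives 0.98 × 0.99 ≈ 0.97, matching the 97% stability shown:

```python
# Stability, as defined on this page, is the product of a model's
# median score and its consistency.
def stability(median_score: float, consistency: float) -> float:
    """Return stability = median score x consistency."""
    return median_score * consistency

# Check against the top row: 98% score, 99% consistency -> 97%.
print(round(stability(0.98, 0.99), 2))  # 0.97
```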

Top Overall Models (Top 20)

Ranked by composite score (performance, cost, speed & stability).

| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Gemini 3 Flash (Preview) | 97% | $0.0027 | 3.9s | 95% |
| Qwen 3.5 Plus (2026-02-15) | 98% | $0.0030 | 10.6s | 95% |
| Grok 4 Fast | 96% | $0.0012 | 8.7s | 93% |
| Gemini 3.1 Flash Lite (Preview) | 94% | $0.0017 | 2.0s | 92% |
| Mistral Medium 3.1 | 96% | $0.0026 | 5.8s | 92% |
| Grok 4.20 (Beta) | 95% | $0.0049 | 2.0s | 92% |
| Mistral Small Creative | 94% | $0.0006 | 3.9s | 89% |
| Grok 4.1 Fast | 97% | $0.0017 | 22.1s | 94% |
| Ministral 3 8B | 94% | $0.0006 | 3.3s | 88% |
| Z.AI GLM 5 Turbo | 97% | $0.0068 | 16.0s | 95% |
| Gemini 2.5 Flash | 94% | $0.0023 | 2.5s | 89% |
| Mistral Large 3 | 94% | $0.0027 | 8.2s | 91% |
| Z.AI GLM 4.5 | 96% | $0.0028 | 16.8s | 92% |
| Mistral Small 3.2 24B | 93% | $0.0005 | 4.4s | 89% |
| Claude Haiku 4.5 | 95% | $0.0073 | 4.5s | 91% |
| DeepSeek-V2 Chat | 95% | $0.0019 | 14.8s | 90% |
| MiniMax M2.7 | 96% | $0.0022 | 21.7s | 92% |
| GPT-5.4 Mini | 93% | $0.0031 | 2.2s | 88% |
| Gemini 3 Flash (Preview, Reasoning) | 98% | $0.0096 | 22.2s | 96% |
| Inception Mercury 2 | 92% | $0.0022 | 3.5s | 88% |
Score by Test Passage

| Model | Total | Short: The Rusty Lantern (Explicit) | Medium: Through the Thornveil (Scattered) | Medium: The Hollow (Inferred) | Long: The Spire of Echoes (Dense) |
|---|---|---|---|---|---|
| Claude Opus 4.5 | 99% | 99% | 98% | 99% | 98% |
| Claude Opus 4.6 (Reasoning) | 98% | 99% | 99% | 99% | 98% |
| Grok 4 | 98% | 99% | 99% | 98% | 97% |
| Claude Opus 4.6 | 98% | 99% | 97% | 99% | 98% |
| GPT-5 | 98% | 99% | 98% | 99% | 96% |
| Gemini 3 Flash (Preview, Reasoning) | 98% | 98% | 97% | 98% | 99% |
| Claude Opus 4 | 98% | 99% | 97% | 97% | 98% |
| Claude Sonnet 4.6 | 98% | 99% | 97% | 98% | 97% |
| Qwen 3.5 Plus (2026-02-15) | 98% | 99% | 97% | 97% | 97% |
| Aion 2.0 | 97% | 98% | 97% | 97% | 97% |
| Claude Sonnet 4.6 (Reasoning) | 97% | 97% | 98% | 97% | 97% |
| Grok 4.20 (Beta, Reasoning) | 97% | 99% | 97% | 97% | 97% |
| Z.AI GLM 5 Turbo | 97% | 98% | 98% | 96% | 96% |
| Gemini 3 Pro (Preview) | 97% | 95% | 98% | 97% | 99% |
| Z.AI GLM 5 | 97% | 97% | 98% | 96% | 98% |

Showing models 1–15 of 116, sorted by total score (descending).

Test Passages

- Short: The Rusty Lantern (Explicit)
- Medium: Through the Thornveil (Scattered)
- Medium: The Hollow (Inferred)
- Long: The Spire of Echoes (Dense)