Codex Extraction

Evaluates a model's ability to extract structured codex entries (characters, locations, objects, lore) from prose passages and return them as well-formed XML.
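As a rough illustration of what "well-formed XML" means for a harness like this, the sketch below parses a model's output and rejects anything that fails to parse or carries an unknown entry type. The schema is hypothetical: the tag names `codex`, `entry`, and `name`, the `type` attribute, and the helper `parse_codex` are assumptions made for this example, not the benchmark's actual format.

```python
import xml.etree.ElementTree as ET

# Entry categories named in the test description above.
ALLOWED_TYPES = {"character", "location", "object", "lore"}


def parse_codex(xml_text):
    """Return a list of (type, name) pairs from a hypothetical codex document.

    Raises ValueError if the text is not well-formed XML or an entry is invalid.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc

    entries = []
    for entry in root.iter("entry"):  # assumed tag name
        kind = entry.get("type")      # assumed attribute name
        name = (entry.findtext("name") or "").strip()
        if kind not in ALLOWED_TYPES or not name:
            raise ValueError(f"invalid entry: type={kind!r}, name={name!r}")
        entries.append((kind, name))
    return entries
```

A real grader would also compare the extracted entries against a reference list for each passage; that step is omitted here because the scoring rules are not described on this page.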

Price-Performance Score Distribution (Top 20)


Model | Score | Cost | Time
Gemini 3 Flash (Preview) | 97% | $0.0027 | 3.9s
DeepSeek V4 Flash | 95% | $0.0003 | 7.7s
Grok 4 Fast | 96% | $0.0012 | 8.7s
Qwen 3.5 Plus (2026-02-15) | 98% | $0.0030 | 10.6s
Mistral Medium 3.1 | 96% | $0.0026 | 5.8s
Xiaomi MIMO v2.5 | 97% | $0.0034 | 13.4s
Mistral Small Creative | 94% | $0.0006 | 3.9s
Gemini 3.1 Flash Lite (Reasoning) | 95% | $0.0018 | 2.0s
Mistral Large 3 | 94% | $0.0027 | 8.2s
Grok 4.1 Fast | 97% | $0.0017 | 22.1s
Gemini 2.5 Flash | 94% | $0.0023 | 2.5s
Gemini 3.5 Flash (Reasoning, Minimal) | 98% | $0.010 | 3.1s
Gemini 3.1 Flash Lite (Preview) | 94% | $0.0017 | 2.0s
Z.AI GLM 5 Turbo | 97% | $0.0068 | 16.0s
Z.AI GLM 4.5 | 96% | $0.0028 | 16.8s
DeepSeek V4 Pro | 95% | $0.0021 | 16.0s
Grok 4.20 (Beta) | 95% | $0.0049 | 2.0s
Ministral 3 8B | 94% | $0.0006 | 3.3s
Gemini 3.1 Flash Lite | 94% | $0.0017 | 5.1s
Xiaomi MIMO v2.5 Pro | 96% | $0.0048 | 18.7s

Cost vs Performance

Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.

8 low-scoring outliers hidden: Gemma 4 26B (80.5%), Gemma 3 12B (77.7%), GPT-4.1 Nano (75.3%), Llama 3.1 8B (71.5%), Skyfall 36B V2 (67.9%), LFM2 24B (49.0%), Rocinante 12B (47.9%), Mistral NeMO (26.1%).
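The quadrant split described above can be sketched as follows: compute the median cost and median score, then label each model by which side of each median it falls on. The function name `classify_quadrants`, the label strings, and the tie-handling (values equal to a median count as "cheap" or "high") are assumptions for illustration; the page does not say how the chart treats ties.

```python
from statistics import median


def classify_quadrants(models):
    """models: list of (name, cost_usd, score_pct). Returns {name: quadrant label}."""
    cost_mid = median(cost for _, cost, _ in models)
    score_mid = median(score for _, _, score in models)

    labels = {}
    for name, cost, score in models:
        cost_side = "cheap" if cost <= cost_mid else "expensive"
        score_side = "high" if score >= score_mid else "low"
        labels[name] = f"{cost_side}/{score_side}"
    return labels


# A few rows taken from the price-performance list on this page.
sample = [
    ("Gemini 3 Flash (Preview)", 0.0027, 97),
    ("DeepSeek V4 Flash", 0.0003, 95),
    ("Grok 4 Fast", 0.0012, 96),
    ("Z.AI GLM 5 Turbo", 0.0068, 97),
    ("Ministral 3 8B", 0.0006, 94),
]

for name, label in classify_quadrants(sample).items():
    print(f"{name}: {label}")
```

Because the medians here come from a five-model sample rather than the full set of models with cost data, the labels will not necessarily match the quadrants drawn on the actual chart.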

Most Stable Models (Top 20)

Ranked by stability (median × consistency).

Model | Score | Consistency | Stability
Claude Opus 4.6 (Reasoning) | 98% | 99% | 97%
Claude Opus 4.5 | 99% | 99% | 97%
Claude Opus 4.6 | 98% | 98% | 97%
Grok 4 | 98% | 98% | 97%
Grok 4.20 (Reasoning) | 98% | 98% | 96%
Gemini 3.5 Flash (Reasoning) | 98% | 98% | 96%
Claude Sonnet 4.6 | 98% | 98% | 96%
GPT-5 | 98% | 97% | 96%
Claude Opus 4 | 98% | 98% | 96%
Z.AI GLM 5.1 | 98% | 98% | 96%
Claude Sonnet 4.6 (Reasoning) | 97% | 98% | 96%
Gemini 3 Flash (Preview, Reasoning) | 98% | 97% | 96%
Gemini 3.5 Flash (Reasoning, Minimal) | 98% | 98% | 95%
Qwen3.6 Max Preview | 97% | 97% | 95%
Z.AI GLM 5 | 97% | 97% | 95%
MoonshotAI: Kimi K2.6 | 97% | 98% | 95%
Aion 2.0 | 97% | 97% | 95%
Grok 4.20 (Beta, Reasoning) | 97% | 98% | 95%
Qwen 3.5 Plus (2026-02-15) | 98% | 98% | 95%
Z.AI GLM 5 Turbo | 97% | 97% | 95%
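The stability ranking is described as median × consistency. A minimal sketch of that arithmetic follows, assuming the median is taken over per-run scores expressed as fractions and that consistency is supplied as an already-computed fraction (the page does not say how consistency itself is derived). The names `stability`, `rank_by_stability`, and the run scores in the test data are hypothetical.

```python
from statistics import median


def stability(run_scores, consistency):
    """Stability as the median run score multiplied by a consistency factor."""
    return median(run_scores) * consistency


def rank_by_stability(models):
    """models: {name: (run_scores, consistency)}. Returns names, most stable first."""
    return sorted(
        models,
        key=lambda name: stability(*models[name]),
        reverse=True,
    )
```

The percentages shown in the table appear to be rounded, so multiplying the displayed score and consistency values may not reproduce the displayed stability figure exactly.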

Top Overall Models (Top 20)

Ranked by composite score (performance, cost, speed & stability).

Model | Score | Cost | Speed | Stability
Gemini 3 Flash (Preview) | 97% | $0.0027 | 3.9s | 95%
Qwen 3.5 Plus (2026-02-15) | 98% | $0.0030 | 10.6s | 95%
Grok 4 Fast | 96% | $0.0012 | 8.7s | 93%
Gemini 3.1 Flash Lite (Reasoning) | 95% | $0.0018 | 2.0s | 92%
Gemini 3.1 Flash Lite (Preview) | 94% | $0.0017 | 2.0s | 92%
Mistral Medium 3.1 | 96% | $0.0026 | 5.8s | 92%
Gemini 3.5 Flash (Reasoning, Minimal) | 98% | $0.010 | 3.1s | 95%
Xiaomi MIMO v2.5 | 97% | $0.0034 | 13.4s | 94%
Gemini 3.1 Flash Lite | 94% | $0.0017 | 5.1s | 92%
Grok 4.20 (Beta) | 95% | $0.0049 | 2.0s | 92%
Mistral Small Creative | 94% | $0.0006 | 3.9s | 89%
Grok 4.1 Fast | 97% | $0.0017 | 22.1s | 94%
Ministral 3 8B | 94% | $0.0006 | 3.3s | 88%
Z.AI GLM 5 Turbo | 97% | $0.0068 | 16.0s | 95%
Gemini 2.5 Flash | 94% | $0.0023 | 2.5s | 89%
Grok 4.20 | 95% | $0.0048 | 4.9s | 91%
Mistral Large 3 | 94% | $0.0027 | 8.2s | 91%
DeepSeek V4 Pro | 95% | $0.0021 | 16.0s | 92%
Z.AI GLM 4.5 | 96% | $0.0028 | 16.8s | 92%
Mistral Small 3.2 24B | 93% | $0.0005 | 4.4s | 89%
Model | Total | Short: The Rusty Lantern (Explicit) | Medium: Through the Thornveil (Scattered) | Medium: The Hollow (Inferred) | Long: The Spire of Echoes (Dense)
Claude Opus 4.5 | 99% | 99% | 98% | 99% | 98%
Claude Opus 4.6 (Reasoning) | 98% | 99% | 99% | 99% | 98%
Grok 4 | 98% | 99% | 99% | 98% | 97%
Gemini 3.5 Flash (Reasoning) | 98% | 98% | 99% | 98% | 99%
Claude Opus 4.6 | 98% | 99% | 97% | 99% | 98%
Z.AI GLM 5.1 | 98% | 99% | 99% | 97% | 97%
GPT-5 | 98% | 99% | 98% | 99% | 96%
Gemini 3 Flash (Preview, Reasoning) | 98% | 98% | 97% | 98% | 99%
Gemini 3.5 Flash (Reasoning, Minimal) | 98% | 98% | 99% | 97% | 97%
Claude Opus 4 | 98% | 99% | 97% | 97% | 98%
Grok 4.20 (Reasoning) | 98% | 99% | 98% | 98% | 97%
Claude Sonnet 4.6 | 98% | 99% | 97% | 98% | 97%
Qwen 3.5 Plus (2026-02-15) | 98% | 99% | 97% | 97% | 97%
Qwen3.6 Max Preview | 97% | 99% | 97% | 96% | 98%
Aion 2.0 | 97% | 98% | 97% | 97% | 97%
Showing 1–15 of 151 models, sorted by total score.

Short: The Rusty Lantern (Explicit)

Medium: Through the Thornveil (Scattered)

Medium: The Hollow (Inferred)

Long: The Spire of Echoes (Dense)