Codex Extraction

Evaluates a model's ability to extract structured codex entries (characters, locations, objects, lore) from prose passages and return them as well-formed XML.
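To illustrate what "well-formed XML" means for a response, here is a minimal sketch of a well-formedness check followed by entry extraction. The element names (`codex`, `entry`, `name`), the `type` attribute, and the sample entries are assumptions made for illustration; the benchmark's actual schema and scoring are not shown on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical model output: the tag names, the "type" attribute, and the
# sample entries are illustrative assumptions, not the benchmark's schema.
SAMPLE = """<codex>
  <entry type="character"><name>Mara</name></entry>
  <entry type="location"><name>The Rusty Lantern</name></entry>
</codex>"""


def parse_codex(xml_text):
    """Return (type, name) pairs, or None if the XML is not well-formed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Unclosed tags, stray text after the root, and similar errors land here.
        return None
    return [(e.get("type"), e.findtext("name")) for e in root.iter("entry")]


entries = parse_codex(SAMPLE)
broken = parse_codex("<codex><entry>")  # unclosed tags, so not well-formed
```

A response that fails to parse yields `None` here, which a grader could treat as a failed extraction before comparing any entries.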

Price-Performance Score Distribution (Top 20)

Model | Score | Cost | Time
Gemini 3 Flash (Preview) | 97% | $0.0027 | 3.9s
Grok 4 Fast | 96% | $0.0012 | 8.7s
Mistral Small Creative | 94% | $0.0006 | 3.9s
Mistral Medium 3.1 | 96% | $0.0026 | 5.8s
Gemini 2.5 Flash | 94% | $0.0023 | 2.5s
Ministral 3 8B | 94% | $0.0006 | 3.3s
Qwen 3.5 Plus (2026-02-15) | 98% | $0.0030 | 10.6s
Mistral Large 3 | 94% | $0.0027 | 8.2s
Z.AI GLM 4.5 | 96% | $0.0028 | 16.8s
Mistral Small 3.2 24B | 93% | $0.0005 | 4.4s
Gemini 2.5 Flash Lite | 91% | $0.0005 | 1.9s
Grok 4.1 Fast | 97% | $0.0017 | 22.1s
Claude Haiku 4.5 | 95% | $0.0073 | 4.5s
Ministral 8B | 92% | $0.0004 | 3.8s
DeepSeek-V2 Chat | 95% | $0.0019 | 14.8s
DeepSeek V3 (2024-12-26) | 94% | $0.0017 | 13.5s
Gemini 2.5 Flash Lite (Reasoning) | 94% | $0.0021 | 15.6s
Arcee AI: Trinity Large (Preview) | 84% | $0.0000 | 20.5s
Ministral 3 14B | 91% | $0.0009 | 6.0s
GPT-4.1 Mini | 91% | $0.0015 | 6.6s

Cost vs Performance

Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.

Five low-scoring outliers are hidden: Gemma 3 12B (77.7%), GPT-4.1 Nano (75.3%), Llama 3.1 8B (71.5%), Rocinante 12B (47.9%), Mistral NeMO (26.1%).
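The quadrant split described in this section can be sketched as follows. This is an illustration only: it uses five score and cost pairs read from the tables on this page, so the medians are those of this small subset rather than of the full chart, and the tie-handling rule (values equal to the median count as "high score" and "low cost") is an assumption.

```python
from statistics import median

# (score, cost in USD) for a handful of models from this page.
MODELS = {
    "Gemini 3 Flash (Preview)": (0.97, 0.0027),
    "Grok 4 Fast": (0.96, 0.0012),
    "Mistral Small Creative": (0.94, 0.0006),
    "Claude Haiku 4.5": (0.95, 0.0073),
    "GPT-4.1 Mini": (0.91, 0.0015),
}


def quadrants(models):
    """Label each model relative to the median score and median cost."""
    score_mid = median(score for score, _ in models.values())
    cost_mid = median(cost for _, cost in models.values())
    labels = {}
    for name, (score, cost) in models.items():
        # Assumption: ties with the median fall on the favourable side.
        perf = "high score" if score >= score_mid else "low score"
        price = "low cost" if cost <= cost_mid else "high cost"
        labels[name] = f"{perf}, {price}"
    return labels


labels = quadrants(MODELS)
```

Because the lines sit at the medians, roughly half of the plotted models fall on each side of each line by construction; the quadrant a model lands in describes its position relative to the other plotted models, not an absolute threshold.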

Most Stable Models (Top 20)

Ranked by stability (median × consistency).

Model | Score | Consistency | Stability
Claude Opus 4.6 (Reasoning) | 98% | 99% | 97%
Claude Opus 4.5 | 99% | 99% | 97%
Claude Opus 4.6 | 98% | 98% | 97%
Grok 4 | 98% | 98% | 97%
Claude Sonnet 4.6 | 98% | 98% | 96%
GPT-5 | 98% | 97% | 96%
Claude Opus 4 | 98% | 98% | 96%
Claude Sonnet 4.6 (Reasoning) | 97% | 98% | 96%
Gemini 3 Flash (Preview, Reasoning) | 98% | 97% | 96%
Z.AI GLM 5 | 97% | 97% | 95%
Aion 2.0 | 97% | 97% | 95%
Qwen 3.5 Plus (2026-02-15) | 98% | 98% | 95%
o4 Mini High | 97% | 97% | 95%
Gemini 3 Flash (Preview) | 97% | 97% | 95%
Gemini 2.5 Pro | 97% | 97% | 94%
Gemini 3 Pro (Preview) | 97% | 96% | 94%
MoonshotAI: Kimi K2.5 | 96% | 97% | 94%
Claude 3.7 Sonnet | 97% | 98% | 94%
Claude 3.5 Sonnet | 96% | 98% | 94%
Grok 4.1 Fast | 97% | 96% | 94%
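The stability figure is described as median × consistency. A minimal sketch of that product, using the displayed values for Claude Opus 4.6 (Reasoning); the function name is hypothetical, and how the page computes the median and consistency themselves is not stated here.

```python
def stability(median_score, consistency):
    """Stability as described on this page: median score times consistency."""
    return median_score * consistency


# Displayed values for Claude Opus 4.6 (Reasoning): score 98%, consistency 99%.
value = stability(0.98, 0.99)
percent = round(value * 100)  # 0.98 * 0.99 = 0.9702
```

Recomputing from the displayed percentages does not always reproduce the listed stability exactly. For example, 99% × 99% gives about 98%, while Claude Opus 4.5 is listed at 97%, presumably because the displayed inputs are rounded.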

Top Overall Models (Top 20)

Ranked by composite score (performance, cost, speed & stability).

Model | Score | Cost | Speed | Stability
Gemini 3 Flash (Preview) | 97% | $0.0027 | 3.9s | 95%
Qwen 3.5 Plus (2026-02-15) | 98% | $0.0030 | 10.6s | 95%
Grok 4 Fast | 96% | $0.0012 | 8.7s | 93%
Mistral Medium 3.1 | 96% | $0.0026 | 5.8s | 92%
Mistral Small Creative | 94% | $0.0006 | 3.9s | 89%
Ministral 3 8B | 94% | $0.0006 | 3.3s | 88%
Gemini 2.5 Flash | 94% | $0.0023 | 2.5s | 89%
Mistral Large 3 | 94% | $0.0027 | 8.2s | 91%
Mistral Small 3.2 24B | 93% | $0.0005 | 4.4s | 89%
Claude Haiku 4.5 | 95% | $0.0073 | 4.5s | 91%
Grok 4.1 Fast | 97% | $0.0017 | 22.1s | 94%
Z.AI GLM 4.5 | 96% | $0.0028 | 16.8s | 92%
Ministral 8B | 92% | $0.0004 | 3.8s | 87%
Gemini 2.5 Flash Lite | 91% | $0.0005 | 1.9s | 86%
DeepSeek-V2 Chat | 95% | $0.0019 | 14.8s | 90%
GPT-4o, Aug. 6th (temp=0) | 93% | $0.0100 | 4.0s | 92%
DeepSeek V3 (2024-12-26) | 94% | $0.0017 | 13.5s | 89%
Ministral 3 14B | 91% | $0.0009 | 6.0s | 87%
Gemini 3 Flash (Preview, Reasoning) | 98% | $0.0096 | 22.2s | 96%
Claude Sonnet 4.6 | 98% | $0.022 | 7.4s | 96%
Model | Total | Short: The Rusty Lantern (Explicit) | Medium: Through the Thornveil (Scattered) | Medium: The Hollow (Inferred) | Long: The Spire of Echoes (Dense)
Claude Opus 4.5 | 99% | 99% | 98% | 99% | 98%
Claude Opus 4.6 (Reasoning) | 98% | 99% | 99% | 99% | 98%
Grok 4 | 98% | 99% | 99% | 98% | 97%
Claude Opus 4.6 | 98% | 99% | 97% | 99% | 98%
GPT-5 | 98% | 99% | 98% | 99% | 96%
Gemini 3 Flash (Preview, Reasoning) | 98% | 98% | 97% | 98% | 99%
Claude Opus 4 | 98% | 99% | 97% | 97% | 98%
Claude Sonnet 4.6 | 98% | 99% | 97% | 98% | 97%
Qwen 3.5 Plus (2026-02-15) | 98% | 99% | 97% | 97% | 97%
Aion 2.0 | 97% | 98% | 97% | 97% | 97%
Claude Sonnet 4.6 (Reasoning) | 97% | 97% | 98% | 97% | 97%
Gemini 3 Pro (Preview) | 97% | 95% | 98% | 97% | 99%
Z.AI GLM 5 | 97% | 97% | 98% | 96% | 98%
Grok 4.1 Fast | 97% | 99% | 98% | 96% | 95%
Gemini 3 Flash (Preview) | 97% | 98% | 98% | 95% | 98%

Showing models 1–15 of 84, sorted by total score (descending).

Per-test breakdowns (each with Tooling and Reasoning views):

Short: The Rusty Lantern (Explicit)
Medium: Through the Thornveil (Scattered)
Medium: The Hollow (Inferred)
Long: The Spire of Echoes (Dense)