Relationship tree
Extracts a deterministic XML family and relationship tree from cumulative literary prose.
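To make the task concrete, here is a minimal sketch of what such an XML tree could look like and how it might be read back in a deterministic order. The element names (`tree`, `character`, `relationship`), the attributes, and the sample characters are all assumptions for illustration; the page does not show the real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical output shape: element and attribute names are assumptions,
# not the benchmark's actual schema.
SAMPLE = """
<tree>
  <characters>
    <character id="c1" name="Anna" />
    <character id="c2" name="Boris" />
    <character id="c3" name="Clara" />
  </characters>
  <relationships>
    <relationship from="c2" to="c3" type="sibling" />
    <relationship from="c1" to="c2" type="parent" />
  </relationships>
</tree>
"""

def load_tree(xml_text):
    """Parse the XML and return (characters, relationships)."""
    root = ET.fromstring(xml_text.strip())
    characters = {c.get("id"): c.get("name") for c in root.iter("character")}
    # Sorting makes the result independent of the order in the document.
    relationships = sorted(
        (r.get("from"), r.get("to"), r.get("type"))
        for r in root.iter("relationship")
    )
    return characters, relationships

def dangling_endpoints(characters, relationships):
    """Return relationships whose endpoints are not declared characters."""
    return [
        (a, b, kind)
        for a, b, kind in relationships
        if a not in characters or b not in characters
    ]

characters, relationships = load_tree(SAMPLE)
print(relationships)
print(dangling_endpoints(characters, relationships))
```

Sorting the relationship triples is one simple way to get a stable result from the same input. The dangling-endpoint check is only a guess at what an endpoint integrity evaluator might look for.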
[Chart: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-5.5 (Reasoning, Low) | 87% | $0.097 | 42.6s | |
| Gemini 3 Flash (Preview) | 81% | $0.0087 | 7.1s | |
| Gemini 2.5 Pro | 89% | $0.099 | 58.2s | |
| Gemini 2.5 Flash | 80% | $0.0075 | 6.1s | |
| GPT-5.4 (Reasoning, Low) | 88% | $0.053 | 35.5s | |
| GPT-5.5 | 88% | $0.105 | 37.1s | |
| Xiaomi MIMO v2.5 | 84% | $0.0027 | 1.1m | |
| Gemini 3.1 Flash Lite (Preview) | 78% | $0.0046 | 3.2s | |
| Grok 4.20 (Beta, Reasoning) | 83% | $0.032 | 30.1s | |
| GPT-5.2 | 87% | $0.062 | 50.7s | |
| Gemini 3.1 Flash Lite (Reasoning) | 78% | $0.0033 | 3.2s | |
| Gemma 4 26B | 79% | $0.0018 | 27.2s | |
| DeepSeek V4 Pro (Reasoning) | 85% | $0.020 | 2.3m | |
| Qwen 3.6 Flash | 81% | $0.015 | 51.7s | |
| Gemini 3.1 Flash Lite | 76% | $0.0031 | 3.3s | |
| Claude Opus 4.7 (Reasoning) | 89% | $0.184 | 25.0s | |
| GPT-5.4 | 81% | $0.033 | 17.4s | |
| Gemini 3 Flash (Preview, Reasoning) | 85% | $0.030 | 40.3s | |
| Claude Sonnet 4.6 | 83% | $0.081 | 23.3s | |
| DeepSeek V3.1 | 74% | $0.0055 | 52.0s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
2 low-scoring outliers hidden: Llama 3.1 8B (38.0%), Gemma 3 12B (37.1%).
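The quadrant split described above can be sketched as follows. The rows are a small subset of the price-performance table, so the medians here are not the ones the chart uses. The quadrant labels and the handling of ties are assumptions.

```python
from statistics import median

# (model, score in percent, cost in USD): a subset of the table above.
ROWS = [
    ("Gemini 2.5 Pro", 89, 0.099),
    ("Gemini 2.5 Flash", 80, 0.0075),
    ("Claude Opus 4.7 (Reasoning)", 89, 0.184),
    ("DeepSeek V3.1", 74, 0.0055),
    ("Gemma 4 26B", 79, 0.0018),
]

def quadrants(rows):
    """Label each model relative to the median score and median cost."""
    score_mid = median(score for _, score, _ in rows)
    cost_mid = median(cost for _, _, cost in rows)
    labels = {}
    for name, score, cost in rows:
        # Assumption: values exactly on a median line count as the better side.
        vertical = "high score" if score >= score_mid else "low score"
        horizontal = "low cost" if cost <= cost_mid else "high cost"
        labels[name] = f"{vertical}, {horizontal}"
    return labels

labels = quadrants(ROWS)
for name, label in labels.items():
    print(f"{name}: {label}")
```

Models labelled "high score, low cost" sit in the quadrant most readers care about. Using medians rather than fixed thresholds means the lines move whenever the set of plotted models changes.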
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| GPT-5.4 (Reasoning) | 96% | 93% | 90% | |
| Claude Opus 4.6 (Reasoning) | 95% | 93% | 88% | |
| GPT-5.5 (Reasoning) | 94% | 93% | 87% | |
| Claude Opus 4.8 (Reasoning) | 94% | 90% | 86% | |
| Claude Sonnet 4.6 (Reasoning) | 92% | 88% | 83% | |
| MoonshotAI: Kimi K2.6 | 92% | 89% | 82% | |
| GPT-5.4 (Reasoning, Low) | 88% | 93% | 80% | |
| Claude Opus 4.6 | 91% | 90% | 80% | |
| GPT-5.2 | 87% | 92% | 79% | |
| Z.AI GLM 5.1 | 91% | 88% | 79% | |
| MiniMax M3 | 87% | 91% | 78% | |
| Claude Opus 4.5 | 85% | 92% | 78% | |
| Claude Opus 4.8 (Reasoning, Low) | 90% | 86% | 77% | |
| Z.AI GLM 5 Turbo | 87% | 86% | 77% | |
| GPT-5.4 | 81% | 95% | 77% | |
| Grok 4.20 (Beta, Reasoning) | 83% | 92% | 77% | |
| Claude Sonnet 4.5 | 79% | 97% | 77% | |
| Claude Opus 4.7 (Reasoning) | 89% | 83% | 76% | |
| Qwen 3.6 Flash | 81% | 94% | 76% | |
| Gemini 3 Flash (Preview) | 81% | 94% | 76% | |
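The stability ranking is described as median × consistency. Below is a minimal sketch of that product, assuming both inputs are fractions between 0 and 1. The function names are illustrative.

```python
def stability(median_score, consistency):
    """Stability as the product of median score and consistency (both 0..1)."""
    return median_score * consistency

def as_percent(value):
    """Format a fraction as a whole-number percentage."""
    return f"{round(value * 100)}%"

# Claude Opus 4.6 (Reasoning): 95% score, 93% consistency
print(as_percent(stability(0.95, 0.93)))  # 88%
# Claude Sonnet 4.5: 79% score, 97% consistency
print(as_percent(stability(0.79, 0.97)))  # 77%
```

Recomputing from the rounded table values does not always match the published column. GPT-5.4 (Reasoning), for example, gives 0.96 × 0.93 ≈ 89% against a listed 90%. Presumably the page multiplies unrounded values, though that is an assumption.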
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.4 (Reasoning, Low) | 88% | $0.053 | 35.5s | 80% | |
| Gemini 3 Flash (Preview) | 81% | $0.0087 | 7.1s | 76% | |
| Gemini 3.1 Flash Lite (Preview) | 78% | $0.0046 | 3.2s | 76% | |
| Gemini 2.5 Flash | 80% | $0.0075 | 6.1s | 75% | |
| GPT-5.2 | 87% | $0.062 | 50.7s | 79% | |
| Grok 4.20 (Beta, Reasoning) | 83% | $0.032 | 30.1s | 77% | |
| Gemma 4 26B | 79% | $0.0018 | 27.2s | 76% | |
| GPT-5.4 | 81% | $0.033 | 17.4s | 77% | |
| Gemini 3.1 Flash Lite (Reasoning) | 78% | $0.0033 | 3.2s | 74% | |
| Qwen 3.6 Flash | 81% | $0.015 | 51.7s | 76% | |
| Xiaomi MIMO v2.5 | 84% | $0.0027 | 1.1m | 73% | |
| Claude Opus 4.6 | 91% | $0.154 | 31.4s | 80% | |
| Gemini 3.1 Flash Lite | 76% | $0.0031 | 3.3s | 71% | |
| GPT-5.4 (Reasoning) | 96% | $0.175 | 2.6m | 90% | |
| Claude Sonnet 4.6 | 83% | $0.081 | 23.3s | 76% | |
| GPT-5.5 | 88% | $0.105 | 37.1s | 75% | |
| Inception Mercury 2 | 76% | $0.0096 | 12.8s | 71% | |
| ByteDance Seed 1.6 | 81% | $0.014 | 1.2m | 72% | |
| Grok 4.20 (Reasoning) | 81% | $0.027 | 40.0s | 71% | |
| GPT-5 Mini | 84% | $0.019 | 2.0m | 75% | |
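The page does not say how the composite score combines performance, cost, speed, and stability. The sketch below is one hypothetical way to do it: min-max normalise each column, invert cost and time so that lower is better, and take an equally weighted sum. The weights, the normalisation, and every function name are assumptions, not the site's method.

```python
def parse_seconds(text):
    """Convert a table duration such as "42.6s" or "1.1m" to seconds."""
    value, unit = float(text[:-1]), text[-1]
    return value * 60 if unit == "m" else value

def min_max(values, lower_is_better=False):
    """Scale values to 0..1; optionally flip so the smallest value scores 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1 - s for s in scaled] if lower_is_better else scaled

def composite(rows, weights=(0.25, 0.25, 0.25, 0.25)):
    """Rank rows by a weighted sum of normalised columns (hypothetical weights)."""
    columns = [
        min_max([r[1] for r in rows]),                                   # score
        min_max([r[2] for r in rows], lower_is_better=True),             # cost
        min_max([parse_seconds(r[3]) for r in rows], lower_is_better=True),  # time
        min_max([r[4] for r in rows]),                                   # stability
    ]
    totals = [
        sum(w * col[i] for w, col in zip(weights, columns))
        for i in range(len(rows))
    ]
    names = [r[0] for r in rows]
    return sorted(zip(names, totals), key=lambda pair: pair[1], reverse=True)

# (model, score %, cost USD, time, stability %) from the table above
ROWS = [
    ("GPT-5.4 (Reasoning)", 96, 0.175, "2.6m", 90),
    ("Gemini 3.1 Flash Lite", 76, 0.0031, "3.3s", 71),
    ("GPT-5.4 (Reasoning, Low)", 88, 0.053, "35.5s", 80),
]

ranking = composite(ROWS)
for name, total in ranking:
    print(f"{name}: {total:.2f}")
```

Any scheme of this kind rewards balanced models over ones that lead on a single axis, which would explain why the top scorer by accuracy is not at the top of this list.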
| Model | Total | Core relationship tree | Family relationship tree |
|---|---|---|---|
| GPT-5.4 (Reasoning) | 96% | 99% | 93% |
| Claude Opus 4.6 (Reasoning) | 95% | 98% | 92% |
| GPT-5.5 (Reasoning) | 94% | 98% | 90% |
| Claude Opus 4.8 (Reasoning) | 94% | 99% | 89% |
| Claude Sonnet 4.6 (Reasoning) | 92% | 98% | 87% |
| MoonshotAI: Kimi K2.6 | 92% | 97% | 88% |
| Claude Opus 4.6 | 91% | 95% | 88% |
| Z.AI GLM 5.1 | 91% | 96% | 87% |
| Claude Opus 4.8 (Reasoning, Low) | 90% | 90% | 90% |
| GPT-5 | 90% | 98% | 82% |
| Claude Opus 4.7 (Reasoning) | 89% | 97% | 82% |
| Gemini 2.5 Pro | 89% | 97% | 81% |
| Claude Opus 4.7 | 89% | 95% | 83% |
| GPT-5.4 (Reasoning, Low) | 88% | 91% | 85% |
| MoonshotAI: Kimi K2.5 | 88% | 93% | 83% |
Core relationship tree
[Chart: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 3 Flash (Preview, Reasoning) | 90% | $0.024 | 27.7s | |
| Xiaomi MIMO v2.5 | 88% | $0.0017 | 47.0s | |
| Gemini 3 Flash (Preview) | 83% | $0.0065 | 5.0s | |
| GPT-5.4 (Reasoning, Low) | 91% | $0.033 | 20.9s | |
| GPT-5 Mini | 87% | $0.013 | 1.4m | |
| Gemma 4 26B | 81% | $0.0010 | 22.3s | |
| DeepSeek V4 Flash (Reasoning) | 84% | $0.0022 | 1.9m | |
| Z.AI GLM 5 Turbo | 92% | $0.019 | 1.7m | |
| Gemini 3.5 Flash (Reasoning, Minimal) | 83% | $0.021 | 5.7s | |
| ByteDance Seed 1.6 | 85% | $0.010 | 59.2s | |
| Gemini 2.5 Flash | 81% | $0.0050 | 5.3s | |
| Gemini 3.1 Flash Lite (Preview) | 79% | $0.0029 | 2.6s | |
| GPT-5.2 | 90% | $0.039 | 32.6s | |
| GPT-5.4 Mini (Reasoning, Low) | 76% | $0.0076 | 9.4s | |
| GPT-4o, Aug. 6th (temp=0) | 81% | $0.020 | 7.2s | |
| DeepSeek V3.1 | 73% | $0.0028 | 37.1s | |
| DeepSeek V4 Pro (Reasoning) | 88% | $0.011 | 1.4m | |
| GPT-5.1 | 82% | $0.016 | 11.2s | |
| Grok 4.20 (Beta, Reasoning) | 86% | $0.023 | 30.1s | |
| Gemini 3.1 Flash Lite (Reasoning) | 79% | $0.0024 | 2.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Sonnet 4.6 (Reasoning) | 98% | 100% | 98% | |
| Claude Opus 4.8 (Reasoning) | 99% | 99% | 97% | |
| GPT-5.5 (Reasoning) | 98% | 98% | 96% | |
| GPT-5.5 (Reasoning, Low) | 97% | 98% | 96% | |
| Claude Opus 4.6 (Reasoning) | 98% | 97% | 96% | |
| GPT-5.4 (Reasoning) | 99% | 98% | 96% | |
| Gemini 3.1 Pro (Preview) | 97% | 98% | 95% | |
| GPT-5 | 98% | 97% | 94% | |
| MoonshotAI: Kimi K2.6 | 97% | 95% | 93% | |
| Claude Opus 4.7 (Reasoning) | 97% | 97% | 93% | |
| Gemini 2.5 Pro | 97% | 95% | 93% | |
| GPT-5.5 | 94% | 96% | 91% | |
| Claude Opus 4.7 | 95% | 92% | 90% | |
| Z.AI GLM 5 Turbo | 92% | 96% | 89% | |
| Z.AI GLM 5.1 | 96% | 89% | 88% | |
| Claude Opus 4.6 | 95% | 90% | 86% | |
| GPT-5.2 | 90% | 95% | 85% | |
| Qwen3.7 Max | 92% | 88% | 85% | |
| GPT-5.4 (Reasoning, Low) | 91% | 94% | 85% | |
| MiniMax M3 | 90% | 91% | 84% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.5 (Reasoning, Low) | 97% | $0.070 | 24.6s | 96% | |
| GPT-5.5 | 94% | $0.062 | 16.3s | 91% | |
| GPT-5.4 (Reasoning, Low) | 91% | $0.033 | 20.9s | 85% | |
| Gemini 3 Flash (Preview) | 83% | $0.0065 | 5.0s | 81% | |
| Z.AI GLM 5 Turbo | 92% | $0.019 | 1.7m | 89% | |
| Xiaomi MIMO v2.5 | 88% | $0.0017 | 47.0s | 78% | |
| GPT-5.2 | 90% | $0.039 | 32.6s | 85% | |
| Gemini 2.5 Pro | 97% | $0.082 | 51.0s | 93% | |
| Grok 4.20 (Beta, Reasoning) | 86% | $0.023 | 30.1s | 83% | |
| Gemma 4 26B | 81% | $0.0010 | 22.3s | 80% | |
| Gemini 3 Flash (Preview, Reasoning) | 90% | $0.024 | 27.7s | 77% | |
| Gemini 2.5 Flash | 81% | $0.0050 | 5.3s | 78% | |
| GPT-5.4 (Reasoning) | 99% | $0.092 | 1.4m | 96% | |
| Claude Opus 4.7 (Reasoning) | 97% | $0.111 | 17.6s | 93% | |
| ByteDance Seed 1.6 | 85% | $0.010 | 59.2s | 80% | |
| Gemini 3.1 Flash Lite (Preview) | 79% | $0.0029 | 2.6s | 76% | |
| GPT-5.4 | 82% | $0.020 | 12.5s | 79% | |
| Claude Opus 4.6 | 95% | $0.088 | 19.0s | 86% | |
| GPT-5 | 98% | $0.077 | 2.0m | 94% | |
| Grok 4.20 (Reasoning) | 85% | $0.024 | 43.9s | 80% | |
| Median | Evaluator |
|---|---|
| 78.2% | Alias accuracy |
| 100.0% | Character precision |
| 100.0% | Character recall |
| 100.0% | Isolated character handling |
| 100.0% | Red-herring resistance |
| 21.0% | Relationship category recall |
| 100.0% | Relationship endpoint integrity |
| 81.1% | Relationship precision |
| 17.2% | Relationship recall |
| 78.6% | Relationship type accuracy |
| 100.0% | XML structure |
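The evaluator table pairs a high median relationship precision with a low median relationship recall. A minimal sketch of how such a pair could be computed is shown below, treating each relationship as a `(source, target, type)` triple and requiring an exact match. The triple representation, the exact-match rule, and the sample data are assumptions; the page does not define the evaluators.

```python
def precision_recall(predicted, expected):
    """Exact-match precision and recall over sets of relationship triples."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    # Assumption: an empty set scores 1.0 rather than raising an error.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(expected) if expected else 1.0
    return precision, recall

# Hypothetical reference and model output
expected = {
    ("Anna", "Boris", "parent"),
    ("Boris", "Clara", "sibling"),
    ("Anna", "Clara", "parent"),
    ("Clara", "Dmitri", "spouse"),
}
predicted = {
    ("Anna", "Boris", "parent"),
    ("Boris", "Clara", "cousin"),  # right endpoints, wrong type
}

precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.25
```

Under this reading, a model that emits only a few relationships can keep precision high while recall stays low, which is consistent with the medians above.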
Family relationship tree
[Chart: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash | 78% | $0.010 | 6.9s | |
| Gemini 3.1 Flash Lite (Preview) | 78% | $0.0062 | 3.9s | |
| GPT-5.4 (Reasoning, Low) | 85% | $0.072 | 50.1s | |
| Gemini 3 Flash (Preview) | 79% | $0.011 | 9.2s | |
| Gemini 3.1 Flash Lite | 76% | $0.0037 | 4.2s | |
| Qwen 3.6 Flash | 80% | $0.019 | 1.1m | |
| Grok 4.20 (Beta, Reasoning) | 80% | $0.041 | 30.2s | |
| Gemini 3.1 Flash Lite (Reasoning) | 77% | $0.0042 | 3.9s | |
| GPT-5.4 | 80% | $0.047 | 22.4s | |
| Gemini 3 Flash (Preview, Reasoning) | 79% | $0.037 | 53.0s | |
| Grok 4.3 | 75% | $0.012 | 8.3s | |
| GPT-5.4 Mini (Reasoning, Low) | 75% | $0.020 | 19.5s | |
| DeepSeek V3.2 | 78% | $0.0070 | 40.9s | |
| Gemma 4 26B | 77% | $0.0026 | 32.0s | |
| GPT-5.4 Nano (Reasoning) | 78% | $0.013 | 59.0s | |
| DeepSeek V4 Flash | 74% | $0.0028 | 14.4s | |
| Grok 4.3 (Reasoning) | 75% | $0.018 | 14.1s | |
| GPT-5.2 | 84% | $0.086 | 1.1m | |
| Inception Mercury 2 | 75% | $0.013 | 16.6s | |
| Gemma 4 31B | 79% | $0.0040 | 2.3m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 92% | 98% | 90% | |
| GPT-5.4 (Reasoning) | 93% | 95% | 88% | |
| GPT-5.5 (Reasoning) | 90% | 98% | 88% | |
| Claude Opus 4.6 | 88% | 97% | 85% | |
| Claude Opus 4.8 (Reasoning, Low) | 90% | 95% | 85% | |
| GPT-5.4 (Reasoning, Low) | 85% | 98% | 83% | |
| MoonshotAI: Kimi K2.6 | 88% | 94% | 83% | |
| Claude Opus 4.8 (Reasoning) | 89% | 95% | 83% | |
| GPT-5.2 | 84% | 97% | 82% | |
| Z.AI GLM 5.1 | 87% | 94% | 82% | |
| DeepSeek V4 Flash (Reasoning) | 84% | 97% | 82% | |
| Gemini 3.5 Flash (Reasoning) | 84% | 97% | 82% | |
| Claude Sonnet 4.6 (Reasoning) | 87% | 95% | 81% | |
| GPT-5 | 82% | 99% | 81% | |
| MiniMax M3 | 85% | 94% | 80% | |
| Claude Opus 4.5 | 83% | 95% | 80% | |
| Qwen3.6 Max Preview | 82% | 97% | 80% | |
| MoonshotAI: Kimi K2.5 | 83% | 92% | 78% | |
| Gemma 4 31B | 79% | 99% | 78% | |
| GPT-5.5 | 81% | 96% | 78% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.4 (Reasoning, Low) | 85% | $0.072 | 50.1s | 83% | |
| Gemini 3 Flash (Preview) | 79% | $0.011 | 9.2s | 76% | |
| Gemini 3.1 Flash Lite (Preview) | 78% | $0.0062 | 3.9s | 76% | |
| GPT-5.2 | 84% | $0.086 | 1.1m | 82% | |
| Gemini 2.5 Flash | 78% | $0.010 | 6.9s | 75% | |
| Qwen 3.6 Flash | 80% | $0.019 | 1.1m | 77% | |
| DeepSeek V3.2 | 78% | $0.0070 | 40.9s | 76% | |
| Grok 4.20 (Beta, Reasoning) | 80% | $0.041 | 30.2s | 76% | |
| Claude Opus 4.6 | 88% | $0.221 | 43.8s | 85% | |
| GPT-5.4 | 80% | $0.047 | 22.4s | 76% | |
| Gemini 3.5 Flash (Reasoning) | 84% | $0.136 | 58.4s | 82% | |
| Gemini 3.1 Flash Lite (Reasoning) | 77% | $0.0042 | 3.9s | 73% | |
| Gemma 4 26B | 77% | $0.0026 | 32.0s | 74% | |
| GPT-5.4 Nano (Reasoning) | 78% | $0.013 | 59.0s | 74% | |
| Gemini 3.1 Flash Lite | 76% | $0.0037 | 4.2s | 72% | |
| Gemma 4 31B | 79% | $0.0040 | 2.3m | 78% | |
| Gemini 3 Flash (Preview, Reasoning) | 79% | $0.037 | 53.0s | 75% | |
| Grok 4.20 (Reasoning) | 76% | $0.030 | 36.0s | 75% | |
| Inception Mercury 2 | 75% | $0.013 | 16.6s | 73% | |
| GPT-5 Mini | 81% | $0.024 | 2.5m | 78% | |
| Median | Evaluator |
|---|---|
| 74.1% | Alias accuracy |
| 100.0% | Character precision |
| 97.2% | Character recall |
| 100.0% | Isolated character handling |
| 100.0% | Red-herring resistance |
| 9.1% | Relationship category recall |
| 100.0% | Relationship endpoint integrity |
| 70.5% | Relationship precision |
| 2.9% | Relationship recall |
| 81.8% | Relationship type accuracy |
| 100.0% | XML structure |