Data extraction
Extract key details from a given block of text.
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 3 4B | 92% | $0.0000 | 303ms | |
| Mistral Small Creative | 93% | $0.0000 | 362ms | |
| Ministral 3B | 73% | $0.0000 | 308ms | |
| Ministral 8B | 75% | $0.0000 | 331ms | |
| Gemini 2.5 Flash Lite | 92% | $0.0000 | 357ms | |
| Ministral 3 3B | 78% | $0.0000 | 416ms | |
| Llama 3.1 8B | 85% | $0.0000 | 441ms | |
| Inception Mercury | 91% | $0.0000 | 528ms | |
| Ministral 3 14B | 88% | $0.0000 | 448ms | |
| Gemma 3 12B | 92% | $0.0000 | 542ms | |
| Ministral 3 8B | 71% | $0.0000 | 382ms | |
| Mistral Small 3.2 24B | 83% | $0.0000 | 691ms | |
| Mistral Small 4 | 88% | $0.0000 | 539ms | |
| Gemini 2.5 Flash | 83% | $0.0000 | 473ms | |
| Gemma 3 27B | 92% | $0.0000 | 780ms | |
| LFM2 24B | 79% | $0.0000 | 1.4s | |
| Stealth: Aurora Alpha | 92% | — | 1.6s | |
| GPT-5.4 Nano | 93% | $0.0000 | 768ms | |
| Mistral Medium 3.1 | 88% | $0.0000 | 655ms | |
| Arcee AI: Trinity Large (Preview) | 81% | $0.0000 | 1.1s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
11 low-scoring outliers hidden: Arcee AI: Trinity Large (Preview) (80.8%), LFM2 24B (79.2%), Ministral 3 3B (78.3%), Rocinante 12B (77.9%), Ministral 8B (75.0%), WizardLM 2 8x22b (73.3%), Ministral 3B (72.9%), Grok 4.20 (Beta, Reasoning) (71.7%), Ministral 3 8B (70.8%), Mistral Large (70.4%), Cohere Command R+ (Aug. 2024) (63.3%).
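The quadrant split described above can be sketched as follows. This is only an illustration, not the site's implementation: the function name `quadrants`, the quadrant labels, the tie-handling rule (values equal to the median count as cheap and strong), and all sample values are assumptions.

```python
from statistics import median

def quadrants(models):
    """Split (name, cost, score) tuples into quadrants at the median cost and score.

    Models whose cost is None are skipped, mirroring the note that only
    models with available cost data are shown.
    """
    priced = [m for m in models if m[1] is not None]
    cost_mid = median(cost for _, cost, _ in priced)
    score_mid = median(score for _, _, score in priced)
    result = {}
    for name, cost, score in priced:
        cost_side = "cheap" if cost <= cost_mid else "expensive"
        score_side = "strong" if score >= score_mid else "weak"
        result[name] = f"{cost_side}/{score_side}"
    return result

# Hypothetical values for illustration only (not taken from the tables).
sample = [
    ("model_a", 0.0001, 0.95),
    ("model_b", 0.0004, 0.90),
    ("model_c", 0.0020, 0.99),
    ("model_d", 0.0030, 0.70),
    ("model_e", None, 0.92),  # no cost data, so it is left out
]
placement = quadrants(sample)
print(placement)
```

Because the lines sit at the medians, roughly half of the priced models fall on each side of each line, whatever the absolute costs are.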
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3 Flash (Preview, Reasoning) | 99% | 82% | 82% | |
| Claude Sonnet 4 | 96% | 72% | 72% | |
| GPT-4o Mini (temp=0) | 96% | 72% | 72% | |
| GPT-4o Mini (temp=1) | 94% | 63% | 63% | |
| Z.AI GLM 4.6 | 96% | 60% | 60% | |
| Mistral Small Creative | 93% | 59% | 59% | |
| DeepSeek V3 (2025-03-24) | 91% | 55% | 55% | |
| GPT-5.4 Nano | 93% | 55% | 55% | |
| Gemini 2.5 Pro | 94% | 53% | 53% | |
| Gemini 2.5 Flash Lite (Reasoning) | 94% | 53% | 53% | |
| ByteDance Seed 2.0 Lite | 94% | 53% | 53% | |
| Claude Opus 4 | 92% | 53% | 53% | |
| Gemini 3.1 Pro (Preview) | 93% | 50% | 50% | |
| Z.AI GLM 5 | 93% | 50% | 50% | |
| MoonshotAI: Kimi K2.5 | 93% | 50% | 50% | |
| Gemini 2.5 Flash (Reasoning) | 93% | 50% | 50% | |
| GPT-5.4 Mini | 93% | 50% | 50% | |
| GPT-5 Mini | 93% | 47% | 47% | |
| GPT-5.2 | 93% | 47% | 47% | |
| MiniMax M2.5 | 93% | 47% | 47% | |
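The stability ranking above is described as median × consistency. A minimal sketch of that product, assuming the median is taken over per-run scores and that consistency is supplied as a number between 0 and 1 (the page does not say how consistency is derived, so it is simply an input here). The function name `stability` and the sample runs are hypothetical.

```python
from statistics import median

def stability(run_scores, consistency):
    """Stability as the median of the run scores multiplied by consistency."""
    return median(run_scores) * consistency

# Hypothetical runs: three perfect runs and one half-right run.
runs = [1.0, 1.0, 0.5, 1.0]
mean_score = sum(runs) / len(runs)      # 0.875
stable = stability(runs, consistency=0.72)
print(f"mean={mean_score:.3f} stability={stable:.2f}")
```

In this sketch the median stays at 1.0 even though the mean drops to 0.875, so stability equals consistency. That would be one way to read rows in the table where Stability matches Consistency while Score is below 100%, but it is an inference rather than something the page states.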
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Claude Sonnet 4 | 96% | $0.0004 | 1.6s | 72% | |
| Gemini 3 Flash (Preview, Reasoning) | 99% | $0.0026 | 7.0s | 82% | |
| GPT-4o Mini (temp=0) | 96% | $0.0000 | 8.1s | 72% | |
| Mistral Small Creative | 93% | $0.0000 | 362ms | 59% | |
| GPT-5.4 Nano | 93% | $0.0000 | 768ms | 55% | |
| GPT-5.4 Mini | 93% | $0.0001 | 658ms | 50% | |
| Gemini 2.5 Flash Lite (Reasoning) | 94% | $0.0004 | 3.8s | 53% | |
| DeepSeek V3 (2025-03-24) | 91% | $0.0000 | 2.4s | 55% | |
| Gemma 3 4B | 92% | $0.0000 | 303ms | 45% | |
| Gemini 2.5 Flash Lite | 92% | $0.0000 | 357ms | 45% | |
| Gemma 3 12B | 92% | $0.0000 | 542ms | 45% | |
| Inception Mercury 2 | 92% | $0.0002 | 471ms | 45% | |
| Gemma 3 27B | 92% | $0.0000 | 780ms | 45% | |
| Gemini 3 Flash (Preview) | 92% | $0.0001 | 835ms | 45% | |
| GPT-5.4 | 92% | $0.0003 | 696ms | 45% | |
| Gemini 3.1 Flash Lite (Preview) | 91% | $0.0000 | 708ms | 44% | |
| GPT-5.4 Nano (Reasoning, Low) | 92% | $0.0001 | 1.9s | 45% | |
| DeepSeek V3 (2024-12-26) | 90% | $0.0000 | 1.3s | 47% | |
| GPT-5.4 Mini (Reasoning, Low) | 92% | $0.0003 | 2.2s | 45% | |
| GPT-4o Mini (temp=1) | 94% | $0.0000 | 16.4s | 63% | |
| Model | Total | Who's the tallest? | What's the color of the car? | What instrument does Lucy play? | Guess the pet | What's the correct time? | Who's the sister? | Contextual pronoun | Indirect birth year | Fruits excluding citrus | Future event time | Highest-rated movie | All valid emails |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3 Flash (Preview, Reasoning) | 99% | 100% | 100% | 100% | 100% | 90% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Z.AI GLM 4.6 | 96% | 100% | 100% | 100% | 100% | 50% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Claude Sonnet 4 | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 50% | 100% | 100% |
| GPT-4o Mini (temp=0) | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 50% | 100% | 100% |
| Gemini 2.5 Pro | 94% | 100% | 100% | 100% | 100% | 40% | 100% | 100% | 100% | 90% | 100% | 100% | 100% |
| Gemini 2.5 Flash Lite (Reasoning) | 94% | 100% | 100% | 100% | 100% | 30% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| ByteDance Seed 2.0 Lite | 94% | 100% | 100% | 100% | 100% | 30% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| GPT-4o Mini (temp=1) | 94% | 100% | 100% | 100% | 100% | 80% | 100% | 100% | 100% | 100% | 50% | 100% | 100% |
| Gemini 3.1 Pro (Preview) | 93% | 100% | 100% | 100% | 100% | 20% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Z.AI GLM 5 | 93% | 100% | 100% | 100% | 100% | 20% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| MoonshotAI: Kimi K2.5 | 93% | 100% | 100% | 100% | 100% | 20% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Gemini 2.5 Flash (Reasoning) | 93% | 100% | 80% | 100% | 100% | 60% | 100% | 100% | 100% | 80% | 100% | 100% | 100% |
| GPT-5.4 Mini | 93% | 100% | 100% | 100% | 100% | 20% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Mistral Small Creative | 93% | 100% | 100% | 100% | 100% | 70% | 100% | 100% | 100% | 100% | 50% | 100% | 100% |
| GPT-5.4 Nano | 93% | 100% | 100% | 90% | 100% | 60% | 100% | 100% | 100% | 100% | 65% | 100% | 100% |
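The page does not state how the Total column is computed, but the rows above are consistent with an unweighted mean of the twelve per-test scores, rounded to a whole percent. A short check under that assumption, using three rows copied from the table (the helper name `total_score` is hypothetical):

```python
def total_score(per_test):
    """Unweighted mean of per-test percentages, rounded to a whole percent."""
    return round(sum(per_test) / len(per_test))

# Per-test scores copied from the table, in column order.
rows = {
    "Gemini 3 Flash (Preview, Reasoning)": [100, 100, 100, 100, 90, 100, 100, 100, 100, 100, 100, 100],
    "Gemini 2.5 Pro": [100, 100, 100, 100, 40, 100, 100, 100, 90, 100, 100, 100],
    "GPT-5.4 Nano": [100, 100, 90, 100, 60, 100, 100, 100, 100, 65, 100, 100],
}
totals = {name: total_score(scores) for name, scores in rows.items()}
for name, total in totals.items():
    print(f"{name}: {total}%")
```

The results (99, 94, and 93) match the Total values listed for these three models.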
Who's the tallest?
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 217ms | |
| Ministral 3B | 100% | $0.0000 | 269ms | |
| Gemma 3 12B | 100% | $0.0000 | 322ms | |
| Ministral 8B | 90% | $0.0000 | 261ms | |
| Ministral 3 3B | 100% | $0.0000 | 272ms | |
| Mistral Small Creative | 100% | $0.0000 | 276ms | |
| Ministral 3 8B | 100% | $0.0000 | 368ms | |
| LFM2 24B | 100% | $0.0000 | 374ms | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 363ms | |
| Ministral 3 14B | 100% | $0.0000 | 392ms | |
| Llama 3.1 8B | 100% | $0.0000 | 382ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 389ms | |
| Mistral NeMO | 80% | $0.0000 | 318ms | |
| Inception Mercury | 100% | $0.0000 | 402ms | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0000 | 594ms | |
| Gemma 3 27B | 100% | $0.0000 | 467ms | |
| Mistral Small 4 | 100% | $0.0000 | 463ms | |
| Llama 3.1 Nemotron 70B | 100% | $0.0000 | 481ms | |
| Gemini 2.5 Flash | 100% | $0.0000 | 399ms | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 574ms | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 217ms | 100% | |
| Ministral 3B | 100% | $0.0000 | 269ms | 100% | |
| Ministral 3 3B | 100% | $0.0000 | 272ms | 100% | |
| Mistral Small Creative | 100% | $0.0000 | 276ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 322ms | 100% | |
| LFM2 24B | 100% | $0.0000 | 374ms | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 363ms | 100% | |
| Ministral 3 8B | 100% | $0.0000 | 368ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 389ms | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 382ms | 100% | |
| Ministral 3 14B | 100% | $0.0000 | 392ms | 100% | |
| Inception Mercury | 100% | $0.0000 | 402ms | 100% | |
| Gemini 2.5 Flash | 100% | $0.0000 | 399ms | 100% | |
| Gemma 3 27B | 100% | $0.0000 | 467ms | 100% | |
| Mistral Small 4 | 100% | $0.0000 | 463ms | 100% | |
| Llama 3.1 Nemotron 70B | 100% | $0.0000 | 481ms | 100% | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 574ms | 100% | |
| Llama 3.1 70B | 100% | $0.0000 | 419ms | 100% | |
| GPT-5.4 Nano | 100% | $0.0000 | 579ms | 100% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0000 | 594ms | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Matches Regex | | |
| 100.0% | Matches text | | |
What's the color of the car?
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 221ms | |
| Ministral 3B | 100% | $0.0000 | 273ms | |
| Ministral 8B | 100% | $0.0000 | 256ms | |
| LFM2 24B | 100% | $0.0000 | 733ms | |
| Mistral NeMO | 100% | $0.0000 | 536ms | |
| Ministral 3 3B | 100% | $0.0000 | 293ms | |
| Mistral Small Creative | 100% | $0.0000 | 412ms | |
| Qwen3 235B A22B Instruct 2507 | 100% | $0.0000 | 758ms | |
| Ministral 3 8B | 100% | $0.0000 | 361ms | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 406ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 375ms | |
| Llama 3.1 8B | 100% | $0.0000 | 308ms | |
| Ministral 3 14B | 100% | $0.0000 | 362ms | |
| Gemma 3 12B | 100% | $0.0000 | 403ms | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 554ms | |
| Gemma 3 27B | 100% | $0.0000 | 518ms | |
| Mistral Small 4 | 100% | $0.0000 | 488ms | |
| Inception Mercury | 100% | $0.0000 | 508ms | |
| Stealth: Aurora Alpha | 100% | — | 533ms | |
| Mistral Medium 3.1 | 100% | $0.0000 | 441ms | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 221ms | 100% | |
| Ministral 3B | 100% | $0.0000 | 273ms | 100% | |
| Ministral 8B | 100% | $0.0000 | 256ms | 100% | |
| Ministral 3 3B | 100% | $0.0000 | 293ms | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 308ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 375ms | 100% | |
| Ministral 3 8B | 100% | $0.0000 | 361ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 403ms | 100% | |
| Ministral 3 14B | 100% | $0.0000 | 362ms | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 406ms | 100% | |
| Mistral Small Creative | 100% | $0.0000 | 412ms | 100% | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 554ms | 100% | |
| Gemma 3 27B | 100% | $0.0000 | 518ms | 100% | |
| Mistral Small 4 | 100% | $0.0000 | 488ms | 100% | |
| Inception Mercury | 100% | $0.0000 | 508ms | 100% | |
| Mistral NeMO | 100% | $0.0000 | 536ms | 100% | |
| Mistral Medium 3.1 | 100% | $0.0000 | 441ms | 100% | |
| Stealth: Aurora Alpha | 100% | — | 533ms | 100% | |
| Gemini 2.5 Flash | 100% | $0.0000 | 549ms | 100% | |
| GPT-5.4 Nano | 100% | $0.0000 | 578ms | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Matches Regex | | |
| 100.0% | Matches text | | |
What instrument does Lucy play?
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 430ms | |
| Gemma 3 4B | 100% | $0.0000 | 217ms | |
| Mistral Small Creative | 100% | $0.0000 | 269ms | |
| Mistral NeMO | 100% | $0.0000 | 577ms | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 335ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 335ms | |
| Ministral 3 14B | 100% | $0.0000 | 515ms | |
| Gemma 3 12B | 100% | $0.0000 | 362ms | |
| Llama 3.1 8B | 90% | $0.0000 | 382ms | |
| Mistral Small 4 | 100% | $0.0000 | 543ms | |
| Gemma 3 27B | 100% | $0.0000 | 621ms | |
| Inception Mercury | 100% | $0.0000 | 523ms | |
| Stealth: Aurora Alpha | 100% | — | 809ms | |
| Mistral Medium 3.1 | 100% | $0.0000 | 350ms | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0000 | 547ms | |
| Gemini 2.5 Flash | 100% | $0.0000 | 499ms | |
| Llama 3.1 Nemotron 70B | 95% | $0.0000 | 465ms | |
| GPT-5.4 Nano | 90% | $0.0000 | 550ms | |
| Hermes 3 70B | 100% | $0.0000 | 507ms | |
| Mistral Large 3 | 100% | $0.0000 | 547ms | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 217ms | 100% | |
| Mistral Small Creative | 100% | $0.0000 | 269ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 335ms | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 335ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 362ms | 100% | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 430ms | 100% | |
| Mistral Medium 3.1 | 100% | $0.0000 | 350ms | 100% | |
| Inception Mercury | 100% | $0.0000 | 523ms | 100% | |
| Ministral 3 14B | 100% | $0.0000 | 515ms | 100% | |
| Gemini 2.5 Flash | 100% | $0.0000 | 499ms | 100% | |
| Mistral Small 4 | 100% | $0.0000 | 543ms | 100% | |
| Mistral NeMO | 100% | $0.0000 | 577ms | 100% | |
| Hermes 3 70B | 100% | $0.0000 | 507ms | 100% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0000 | 547ms | 100% | |
| Gemma 3 27B | 100% | $0.0000 | 621ms | 100% | |
| Mistral Large 3 | 100% | $0.0000 | 547ms | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0000 | 680ms | 100% | |
| Inception Mercury 2 | 100% | $0.0001 | 390ms | 100% | |
| Qwen 2.5 72B | 100% | $0.0000 | 729ms | 100% | |
| GPT-5.4 Mini | 100% | $0.0001 | 621ms | 100% | |