Data extraction
Extract key details from a given block of text.
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 3 4B | 92% | $0.0000 | 303ms | |
| Mistral Small Creative | 93% | $0.0000 | 362ms | |
| Ministral 3B | 73% | $0.0000 | 308ms | |
| Ministral 8B | 75% | $0.0000 | 331ms | |
| Gemini 2.5 Flash Lite | 92% | $0.0000 | 357ms | |
| Ministral 3 3B | 78% | $0.0000 | 416ms | |
| Llama 3.1 8B | 85% | $0.0000 | 441ms | |
| Inception Mercury | 91% | $0.0000 | 528ms | |
| Ministral 3 14B | 88% | $0.0000 | 448ms | |
| Gemma 3 12B | 92% | $0.0000 | 542ms | |
| Ministral 3 8B | 71% | $0.0000 | 382ms | |
| Mistral Small 3.2 24B | 83% | $0.0000 | 691ms | |
| Mistral Small 4 | 88% | $0.0000 | 539ms | |
| Gemini 2.5 Flash | 83% | $0.0000 | 473ms | |
| Gemma 3 27B | 92% | $0.0000 | 780ms | |
| LFM2 24B | 79% | $0.0000 | 1.4s | |
| Stealth: Aurora Alpha | 92% | — | 1.6s | |
| GPT-5.4 Nano | 93% | $0.0000 | 768ms | |
| Mistral Medium 3.1 | 88% | $0.0000 | 655ms | |
| Arcee AI: Trinity Large (Preview) | 81% | $0.0000 | 1.1s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
13 low-scoring outliers hidden: Arcee AI: Trinity Large (Preview) (80.8%), LFM2 24B (79.2%), Ministral 3 3B (78.3%), Rocinante 12B (77.9%), Ministral 8B (75.0%), WizardLM 2 8x22b (73.3%), Ministral 3B (72.9%), Grok 4.20 (Beta, Reasoning) (71.7%), DeepSeek V4 Flash (71.3%), Ministral 3 8B (70.8%), Mistral Large (70.4%), Grok 4.20 (Reasoning) (66.3%), Cohere Command R+ (Aug. 2024) (63.3%).
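The quadrant logic described above can be sketched as follows. This is an illustrative sketch only: the function name and the quadrant labels are assumptions, not the chart's actual implementation; the one thing taken from the text is that quadrant lines sit at the median cost and median score.

```python
from statistics import median

def quadrants(models):
    """Classify (name, cost, score) tuples into cost/performance quadrants.

    Quadrant lines are drawn at the median cost and median score,
    mirroring the chart described above. Illustrative sketch only;
    the labels are hypothetical.
    """
    costs = [cost for _, cost, _ in models]
    scores = [score for _, _, score in models]
    median_cost, median_score = median(costs), median(scores)
    out = {}
    for name, cost, score in models:
        cheap = cost <= median_cost
        strong = score >= median_score
        out[name] = ("cheap & strong" if cheap and strong
                     else "cheap & weak" if cheap
                     else "costly & strong" if strong
                     else "costly & weak")
    return out
```

Models in the "cheap & strong" quadrant dominate the price-performance ranking; the hidden outliers listed above would all fall below the median-score line.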
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3 Flash (Preview, Reasoning) | 99% | 82% | 82% | |
| Gemma 4 26B (Reasoning) | 98% | 74% | 74% | |
| Claude Sonnet 4 | 96% | 72% | 72% | |
| GPT-4o Mini (temp=0) | 96% | 72% | 72% | |
| Gemma 4 31B (Reasoning) | 98% | 69% | 69% | |
| Gemma 4 31B | 97% | 64% | 64% | |
| GPT-4o Mini (temp=1) | 94% | 63% | 63% | |
| Claude Opus 4.7 | 96% | 60% | 60% | |
| Z.AI GLM 4.6 | 96% | 60% | 60% | |
| Mistral Small Creative | 93% | 59% | 59% | |
| Gemma 4 26B | 95% | 56% | 56% | |
| DeepSeek V3 (2025-03-24) | 91% | 55% | 55% | |
| GPT-5.4 Nano | 93% | 55% | 55% | |
| Z.AI GLM 5.1 | 94% | 53% | 53% | |
| Gemini 2.5 Pro | 94% | 53% | 53% | |
| Gemini 2.5 Flash Lite (Reasoning) | 94% | 53% | 53% | |
| ByteDance Seed 2.0 Lite | 94% | 53% | 53% | |
| Claude Opus 4 | 92% | 53% | 53% | |
| Qwen3.6 Max Preview | 93% | 50% | 50% | |
| Gemini 3.1 Pro (Preview) | 93% | 50% | 50% | |
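The stability metric above is defined as median × consistency. A minimal sketch of that product, assuming "consistency" measures how tightly a model's per-run scores cluster — the benchmark's exact consistency formula is not given in this report, so the spread-based proxy below is an assumption:

```python
from statistics import median

def stability(run_scores):
    """Stability = median score × consistency (illustrative sketch).

    `consistency` here is a stand-in: 1 minus the mean absolute
    deviation from the median, clamped to [0, 1]. The benchmark's
    real consistency definition may differ.
    """
    med = median(run_scores)
    mad = sum(abs(s - med) for s in run_scores) / len(run_scores)
    consistency = max(0.0, 1.0 - mad)
    return med * consistency
```

The product explains the table: a model scoring 99% with tightly clustered runs (Gemini 3 Flash) outranks higher-variance models with similar medians.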
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3 Flash (Preview, Reasoning) | 99% | $0.0026 | 7.0s | 82% | |
| Claude Sonnet 4 | 96% | $0.0004 | 1.6s | 72% | |
| GPT-4o Mini (temp=0) | 96% | $0.0000 | 8.1s | 72% | |
| Gemma 4 31B | 97% | $0.0000 | 5.1s | 64% | |
| Claude Opus 4.7 | 96% | $0.0008 | 1.0s | 60% | |
| Mistral Small Creative | 93% | $0.0000 | 362ms | 59% | |
| Gemma 4 26B | 95% | $0.0000 | 1.8s | 56% | |
| GPT-5.4 Nano | 93% | $0.0000 | 768ms | 55% | |
| Gemini 2.5 Flash Lite (Reasoning) | 94% | $0.0004 | 3.8s | 53% | |
| GPT-5.4 Mini | 93% | $0.0001 | 658ms | 50% | |
| Gemma 4 26B (Reasoning) | 98% | $0.0003 | 33.2s | 74% | |
| DeepSeek V3 (2025-03-24) | 91% | $0.0000 | 2.4s | 55% | |
| GPT-4o Mini (temp=1) | 94% | $0.0000 | 16.4s | 63% | |
| ByteDance Seed 2.0 Lite | 94% | $0.0008 | 10.2s | 53% | |
| Gemma 3 4B | 92% | $0.0000 | 303ms | 45% | |
| Gemini 2.5 Flash Lite | 92% | $0.0000 | 357ms | 45% | |
| Gemma 3 12B | 92% | $0.0000 | 542ms | 45% | |
| Gemma 3 27B | 92% | $0.0000 | 780ms | 45% | |
| Gemini 3.1 Flash Lite | 92% | $0.0000 | 757ms | 45% | |
| Inception Mercury 2 | 92% | $0.0002 | 471ms | 45% | |
| Model | Total ▼ | Who's the tallest? | What's the color of the car? | What instrument does Lucy play? | Guess the pet | What's the correct time? | Who's the sister? | Contextual pronoun | Indirect birth year | Fruits excluding citrus | Future event time | Highest-rated movie | All valid emails |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3 Flash (Preview, Reasoning) | 99% | 100% | 100% | 100% | 100% | 90% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Gemma 4 26B (Reasoning) | 98% | 100% | 100% | 100% | 100% | 80% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Gemma 4 31B (Reasoning) | 98% | 100% | 100% | 100% | 100% | 70% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Gemma 4 31B | 97% | 100% | 100% | 100% | 100% | 60% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Claude Opus 4.7 | 96% | 100% | 100% | 100% | 100% | 50% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Z.AI GLM 4.6 | 96% | 100% | 100% | 100% | 100% | 50% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Claude Sonnet 4 | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 50% | 100% | 100% |
| GPT-4o Mini (temp=0) | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 50% | 100% | 100% |
| Gemma 4 26B | 95% | 100% | 100% | 100% | 100% | 40% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Z.AI GLM 5.1 | 94% | 100% | 100% | 100% | 100% | 30% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Gemini 2.5 Pro | 94% | 100% | 100% | 100% | 100% | 40% | 100% | 100% | 100% | 90% | 100% | 100% | 100% |
| Gemini 2.5 Flash Lite (Reasoning) | 94% | 100% | 100% | 100% | 100% | 30% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| ByteDance Seed 2.0 Lite | 94% | 100% | 100% | 100% | 100% | 30% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| GPT-4o Mini (temp=1) | 94% | 100% | 100% | 100% | 100% | 80% | 100% | 100% | 100% | 100% | 50% | 100% | 100% |
| Qwen3.6 Max Preview | 93% | 100% | 100% | 100% | 100% | 20% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
Who's the tallest?
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 217ms | |
| Ministral 3B | 100% | $0.0000 | 269ms | |
| Gemma 3 12B | 100% | $0.0000 | 322ms | |
| Ministral 8B | 90% | $0.0000 | 261ms | |
| Ministral 3 3B | 100% | $0.0000 | 272ms | |
| Mistral Small Creative | 100% | $0.0000 | 276ms | |
| LFM2 24B | 100% | $0.0000 | 374ms | |
| Ministral 3 8B | 100% | $0.0000 | 368ms | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 363ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 389ms | |
| Ministral 3 14B | 100% | $0.0000 | 392ms | |
| Llama 3.1 8B | 100% | $0.0000 | 382ms | |
| Mistral NeMO | 80% | $0.0000 | 318ms | |
| Inception Mercury | 100% | $0.0000 | 402ms | |
| Gemma 3 27B | 100% | $0.0000 | 467ms | |
| Mistral Small 4 | 100% | $0.0000 | 463ms | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 574ms | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0000 | 594ms | |
| Gemma 4 26B | 100% | $0.0000 | 1.8s | |
| Gemini 2.5 Flash | 100% | $0.0000 | 399ms | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 217ms | 100% | |
| Ministral 3B | 100% | $0.0000 | 269ms | 100% | |
| Ministral 3 3B | 100% | $0.0000 | 272ms | 100% | |
| Mistral Small Creative | 100% | $0.0000 | 276ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 322ms | 100% | |
| LFM2 24B | 100% | $0.0000 | 374ms | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 363ms | 100% | |
| Ministral 3 8B | 100% | $0.0000 | 368ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 389ms | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 382ms | 100% | |
| Ministral 3 14B | 100% | $0.0000 | 392ms | 100% | |
| Inception Mercury | 100% | $0.0000 | 402ms | 100% | |
| Gemini 2.5 Flash | 100% | $0.0000 | 399ms | 100% | |
| Gemma 3 27B | 100% | $0.0000 | 467ms | 100% | |
| Mistral Small 4 | 100% | $0.0000 | 463ms | 100% | |
| Llama 3.1 Nemotron 70B | 100% | $0.0000 | 481ms | 100% | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 574ms | 100% | |
| Llama 3.1 70B | 100% | $0.0000 | 419ms | 100% | |
| GPT-5.4 Nano | 100% | $0.0000 | 579ms | 100% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0000 | 594ms | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Matches Regex | | |
| 100.0% | Matches text | | |
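The two evaluators named above score responses by pattern or substring match. A hedged sketch of that scoring — the benchmark's actual patterns and expected answers are not shown in this report, so the examples below are hypothetical:

```python
import re

def matches_regex(response, pattern):
    """Score 1.0 if the response matches the regex, else 0.0 (sketch)."""
    return 1.0 if re.search(pattern, response) else 0.0

def matches_text(response, expected):
    """Score 1.0 if the expected text appears verbatim,
    case-insensitively, else 0.0 (sketch)."""
    return 1.0 if expected.lower() in response.lower() else 0.0
```

Binary per-run scores like these, averaged over repeated runs, produce the percentage scores shown in the tables.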
What's the color of the car?
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 221ms | |
| Ministral 3B | 100% | $0.0000 | 273ms | |
| Ministral 8B | 100% | $0.0000 | 256ms | |
| LFM2 24B | 100% | $0.0000 | 733ms | |
| Mistral NeMO | 100% | $0.0000 | 536ms | |
| Ministral 3 3B | 100% | $0.0000 | 293ms | |
| Mistral Small Creative | 100% | $0.0000 | 412ms | |
| Qwen3 235B A22B Instruct 2507 | 100% | $0.0000 | 758ms | |
| Ministral 3 8B | 100% | $0.0000 | 361ms | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 406ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 375ms | |
| Llama 3.1 8B | 100% | $0.0000 | 308ms | |
| Gemma 4 26B | 100% | $0.0000 | 1.5s | |
| Ministral 3 14B | 100% | $0.0000 | 362ms | |
| Gemma 3 12B | 100% | $0.0000 | 403ms | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 554ms | |
| Gemma 3 27B | 100% | $0.0000 | 518ms | |
| Gemma 4 31B | 100% | $0.0000 | 2.0s | |
| Mistral Small 4 | 100% | $0.0000 | 488ms | |
| Inception Mercury | 100% | $0.0000 | 508ms | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 221ms | 100% | |
| Ministral 3B | 100% | $0.0000 | 273ms | 100% | |
| Ministral 8B | 100% | $0.0000 | 256ms | 100% | |
| Ministral 3 3B | 100% | $0.0000 | 293ms | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 308ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 375ms | 100% | |
| Ministral 3 8B | 100% | $0.0000 | 361ms | 100% | |
| Ministral 3 14B | 100% | $0.0000 | 362ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 403ms | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 406ms | 100% | |
| Mistral Small Creative | 100% | $0.0000 | 412ms | 100% | |
| Mistral Small 4 | 100% | $0.0000 | 488ms | 100% | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 554ms | 100% | |
| Gemma 3 27B | 100% | $0.0000 | 518ms | 100% | |
| Inception Mercury | 100% | $0.0000 | 508ms | 100% | |
| Mistral NeMO | 100% | $0.0000 | 536ms | 100% | |
| Mistral Medium 3.1 | 100% | $0.0000 | 441ms | 100% | |
| Stealth: Aurora Alpha | 100% | — | 533ms | 100% | |
| Gemini 2.5 Flash | 100% | $0.0000 | 549ms | 100% | |
| GPT-5.4 Nano | 100% | $0.0000 | 578ms | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Matches Regex | | |
| 100.0% | Matches text | | |
What instrument does Lucy play?
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 430ms | |
| Gemma 3 4B | 100% | $0.0000 | 217ms | |
| Mistral Small Creative | 100% | $0.0000 | 269ms | |
| Mistral NeMO | 100% | $0.0000 | 577ms | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 335ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 335ms | |
| Gemma 3 12B | 100% | $0.0000 | 362ms | |
| Ministral 3 14B | 100% | $0.0000 | 515ms | |
| Llama 3.1 8B | 90% | $0.0000 | 382ms | |
| Mistral Small 4 | 100% | $0.0000 | 543ms | |
| Gemma 4 26B | 100% | $0.0000 | 566ms | |
| Gemma 4 31B | 100% | $0.0000 | 12.3s | |
| Gemma 3 27B | 100% | $0.0000 | 621ms | |
| Inception Mercury | 100% | $0.0000 | 523ms | |
| Stealth: Aurora Alpha | 100% | — | 809ms | |
| Mistral Medium 3.1 | 100% | $0.0000 | 350ms | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0000 | 547ms | |
| Gemini 2.5 Flash | 100% | $0.0000 | 499ms | |
| Llama 3.1 Nemotron 70B | 95% | $0.0000 | 465ms | |
| GPT-5.4 Nano | 90% | $0.0000 | 550ms | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 217ms | 100% | |
| Mistral Small Creative | 100% | $0.0000 | 269ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 335ms | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 335ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 362ms | 100% | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 430ms | 100% | |
| Mistral Medium 3.1 | 100% | $0.0000 | 350ms | 100% | |
| Inception Mercury | 100% | $0.0000 | 523ms | 100% | |
| Ministral 3 14B | 100% | $0.0000 | 515ms | 100% | |
| Gemini 2.5 Flash | 100% | $0.0000 | 499ms | 100% | |
| Mistral Small 4 | 100% | $0.0000 | 543ms | 100% | |
| Gemma 4 26B | 100% | $0.0000 | 566ms | 100% | |
| Mistral NeMO | 100% | $0.0000 | 577ms | 100% | |
| Hermes 3 70B | 100% | $0.0000 | 507ms | 100% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0000 | 547ms | 100% | |
| Gemma 3 27B | 100% | $0.0000 | 621ms | 100% | |
| Mistral Large 3 | 100% | $0.0000 | 547ms | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0000 | 680ms | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0000 | 709ms | 100% | |
| Inception Mercury 2 | 100% | $0.0001 | 390ms | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Matches Regex | | |
| 100.0% | Matches text | | |
Guess the pet
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 261ms | |
| Ministral 3B | 100% | $0.0000 | 267ms | |
| Ministral 8B | 100% | $0.0000 | 282ms | |
| Ministral 3 3B | 100% | $0.0000 | 656ms | |
| Mistral NeMO | 100% | $0.0000 | 647ms | |
| Mistral Small Creative | 100% | $0.0000 | 601ms | |
| LFM2 24B | 100% | $0.0000 | 1.6s | |
| Gemma 3 12B | 100% | $0.0000 | 327ms | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 361ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 345ms | |
| Ministral 3 8B | 100% | $0.0000 | 377ms | |
| Stealth: Aurora Alpha | 100% | — | 445ms | |
| Inception Mercury | 100% | $0.0000 | 342ms | |
| Ministral 3 14B | 100% | $0.0000 | 438ms | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 686ms | |
| Gemma 3 27B | 100% | $0.0000 | 467ms | |
| Llama 3.1 8B | 100% | $0.0000 | 332ms | |
| Mistral Small 4 | 100% | $0.0000 | 441ms | |
| Gemma 4 31B | 100% | $0.0000 | 1.2s | |
| DeepSeek V4 Flash | 100% | $0.0000 | 3.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 261ms | 100% | |
| Ministral 3B | 100% | $0.0000 | 267ms | 100% | |
| Ministral 8B | 100% | $0.0000 | 282ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 327ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 345ms | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 361ms | 100% | |
| Inception Mercury | 100% | $0.0000 | 342ms | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 332ms | 100% | |
| Ministral 3 8B | 100% | $0.0000 | 377ms | 100% | |
| Mistral Small 4 | 100% | $0.0000 | 441ms | 100% | |
| Ministral 3 14B | 100% | $0.0000 | 438ms | 100% | |
| Gemma 3 27B | 100% | $0.0000 | 467ms | 100% | |
| Stealth: Aurora Alpha | 100% | — | 445ms | 100% | |
| Gemini 2.5 Flash | 100% | $0.0000 | 466ms | 100% | |
| Mistral Small Creative | 100% | $0.0000 | 601ms | 100% | |
| Mistral Large 3 | 100% | $0.0000 | 480ms | 100% | |
| GPT-5.4 Nano | 100% | $0.0000 | 548ms | 100% | |
| Ministral 3 3B | 100% | $0.0000 | 656ms | 100% | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 686ms | 100% | |
| Mistral NeMO | 100% | $0.0000 | 647ms | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Matches Regex | — | |
| 100.0% | Matches text | | |
What's the correct time?
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Mistral Small Creative | 70% | $0.0000 | 307ms | |
| GPT-5.4 Nano | 60% | $0.0000 | 809ms | |
| Claude Sonnet 4 | 100% | $0.0003 | 1.4s | |
| GPT-4o Mini (temp=0) | 100% | $0.0000 | 3.1s | |
| Gemma 4 31B | 60% | $0.0000 | 968ms | |
| GPT-4o Mini (temp=1) | 80% | $0.0000 | 13.7s | |
| DeepSeek V3 (2025-03-24) | 70% | $0.0002 | 7.8s | |
| Gemini 3 Flash (Preview, Reasoning) | 90% | $0.024 | 45.0s | |
| Gemini 2.5 Flash (Reasoning) | 60% | $0.028 | 47.6s | |
| Gemma 4 26B (Reasoning) | 80% | $0.0028 | 4.9m | |
| Claude Opus 4.7 | 50% | $0.0006 | 1.5s | |
| Rocinante 12B | 50% | $0.0000 | 4.0s | |
| Claude Opus 4 | 50% | $0.0014 | 4.6s | |
| Gemma 4 31B (Reasoning) | 70% | $0.0034 | 7.6m | |
| Z.AI GLM 4.6 | 50% | $0.0076 | 2.5m | |
| Gemma 3 4B | 0% | $0.0000 | 321ms | |
| Ministral 3B | 20% | $0.0000 | 293ms | |
| Mistral NeMO | 0% | $0.0000 | 7.2s | |
| Ministral 8B | 10% | $0.0000 | 284ms | |
| Ministral 3 3B | 0% | $0.0000 | 319ms | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Sonnet 4 | 100% | 100% | 100% | |
| GPT-4o Mini (temp=0) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 90% | 40% | 40% | |
| Gemma 4 26B (Reasoning) | 80% | 20% | 20% | |
| GPT-4o Mini (temp=1) | 80% | 20% | 20% | |
| Gemma 4 31B (Reasoning) | 70% | 8% | 8% | |
| DeepSeek V3 (2025-03-24) | 70% | 8% | 8% | |
| Mistral Small Creative | 70% | 8% | 8% | |
| Gemma 4 31B | 60% | 2% | 2% | |
| Gemini 2.5 Flash (Reasoning) | 60% | 2% | 2% | |
| GPT-5.4 Nano | 60% | 2% | 2% | |
| Claude Opus 4.6 (Reasoning) | 0% | 100% | 0% | |
| Qwen3.6 Max Preview | 20% | 20% | 0% | |
| Gemini 3.1 Pro (Preview) | 20% | 20% | 0% | |
| Z.AI GLM 5.1 | 30% | 8% | 0% | |
| Z.AI GLM 5 Turbo | 0% | 100% | 0% | |
| Claude Sonnet 4.6 (Reasoning) | 0% | 100% | 0% | |
| Grok 4.3 (Reasoning) | 0% | 100% | 0% | |
| GPT-5.4 (Reasoning) | 0% | 100% | 0% | |
| Claude Opus 4.7 (Reasoning) | 0% | 100% | 0% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Claude Sonnet 4 | 100% | $0.0003 | 1.4s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0000 | 3.1s | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 90% | $0.024 | 45.0s | 40% | |
| GPT-4o Mini (temp=1) | 80% | $0.0000 | 13.7s | 20% | |
| Mistral Small Creative | 70% | $0.0000 | 307ms | 8% | |
| DeepSeek V3 (2025-03-24) | 70% | $0.0002 | 7.8s | 8% | |
| GPT-5.4 Nano | 60% | $0.0000 | 809ms | 2% | |
| Gemma 4 31B | 60% | $0.0000 | 968ms | 2% | |
| Gemma 4 26B (Reasoning) | 80% | $0.0028 | 4.9m | 20% | |
| Claude Opus 4.7 | 50% | $0.0006 | 1.5s | 0% | |
| Rocinante 12B | 50% | $0.0000 | 4.0s | 0% | |
| Claude Opus 4 | 50% | $0.0014 | 4.6s | 0% | |
| Gemma 4 26B | 40% | $0.0000 | 2.3s | 0% | |
| Gemini 2.5 Flash (Reasoning) | 60% | $0.028 | 47.6s | 2% | |
| Mistral Small 4 (Reasoning) | 40% | $0.0027 | 25.0s | 0% | |
| DeepSeek V3 (2024-12-26) | 30% | $0.0002 | 4.8s | 0% | |
| Hermes 3 405B | 30% | $0.0000 | 7.8s | 0% | |
| Gemini 2.5 Flash Lite (Reasoning) | 30% | $0.0033 | 24.9s | 0% | |
| Z.AI GLM 4.6 | 50% | $0.0076 | 2.5m | 0% | |
| Ministral 3B | 20% | $0.0000 | 293ms | 0% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 0.0% | Matches Regex | | |
Who's the sister?
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 4.2s | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 625ms | |
| Ministral 3B | 100% | $0.0000 | 289ms | |
| LFM2 24B | 100% | $0.0000 | 401ms | |
| Nemotron 3 Super | 100% | $0.0000 | 1.4s | |
| Gemma 3 4B | 100% | $0.0000 | 252ms | |
| Gemma 3 12B | 100% | $0.0000 | 392ms | |
| Stealth: Healer Alpha | 100% | $0.0000 | 1.8s | |
| Ministral 8B | 100% | $0.0000 | 276ms | |
| Ministral 3 3B | 100% | $0.0000 | 363ms | |
| Mistral Small Creative | 100% | $0.0000 | 306ms | |
| Inception Mercury | 100% | $0.0000 | 337ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 375ms | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 461ms | |
| Gemma 3 27B | 100% | $0.0000 | 533ms | |
| Gemma 4 26B | 100% | $0.0000 | 701ms | |
| Mistral NeMO | 100% | $0.0000 | 13.7s | |
| Ministral 3 8B | 100% | $0.0000 | 333ms | |
| GPT-4.1 Nano | 100% | $0.0000 | 1.2s | |
| Qwen3 235B A22B Instruct 2507 | 90% | $0.0000 | 1.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 252ms | 100% | |
| Ministral 3B | 100% | $0.0000 | 289ms | 100% | |
| LFM2 24B | 100% | $0.0000 | 401ms | 100% | |
| Ministral 8B | 100% | $0.0000 | 276ms | 100% | |
| Mistral Small Creative | 100% | $0.0000 | 306ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 392ms | 100% | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 625ms | 100% | |
| Ministral 3 3B | 100% | $0.0000 | 363ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 375ms | 100% | |
| Ministral 3 8B | 100% | $0.0000 | 333ms | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 461ms | 100% | |
| Inception Mercury | 100% | $0.0000 | 337ms | 100% | |
| Gemma 3 27B | 100% | $0.0000 | 533ms | 100% | |
| Ministral 3 14B | 100% | $0.0000 | 360ms | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 315ms | 100% | |
| Mistral Small 4 | 100% | $0.0000 | 451ms | 100% | |
| Gemma 4 26B | 100% | $0.0000 | 701ms | 100% | |
| Gemini 2.5 Flash | 100% | $0.0000 | 410ms | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 1.4s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0000 | 714ms | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Matches Regex | | |
| 100.0% | Matches text | | |
Contextual pronoun
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 238ms | |
| Ministral 3B | 80% | $0.0000 | 262ms | |
| LFM2 24B | 100% | $0.0000 | 537ms | |
| Ministral 3 3B | 100% | $0.0000 | 277ms | |
| Ministral 8B | 90% | $0.0000 | 276ms | |
| Mistral NeMO | 100% | $0.0000 | 3.9s | |
| Mistral Small Creative | 100% | $0.0000 | 276ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 328ms | |
| Inception Mercury | 100% | $0.0000 | 327ms | |
| Ministral 3 8B | 100% | $0.0000 | 393ms | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 558ms | |
| Gemma 3 12B | 100% | $0.0000 | 509ms | |
| Ministral 3 14B | 100% | $0.0000 | 414ms | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 634ms | |
| Llama 3.1 8B | 95% | $0.0000 | 398ms | |
| Mistral Small 4 | 100% | $0.0000 | 413ms | |
| Stealth: Aurora Alpha | 100% | — | 2.4s | |
| Gemma 4 26B | 100% | $0.0000 | 920ms | |
| Gemma 3 27B | 100% | $0.0000 | 1.5s | |
| Gemini 2.5 Flash | 100% | $0.0000 | 430ms | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 238ms | 100% | |
| Ministral 3 3B | 100% | $0.0000 | 277ms | 100% | |
| Mistral Small Creative | 100% | $0.0000 | 276ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 328ms | 100% | |
| Inception Mercury | 100% | $0.0000 | 327ms | 100% | |
| Ministral 3 8B | 100% | $0.0000 | 393ms | 100% | |
| Ministral 3 14B | 100% | $0.0000 | 414ms | 100% | |
| Mistral Small 4 | 100% | $0.0000 | 413ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 509ms | 100% | |
| LFM2 24B | 100% | $0.0000 | 537ms | 100% | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 558ms | 100% | |
| Gemini 2.5 Flash | 100% | $0.0000 | 430ms | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 634ms | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0000 | 631ms | 100% | |
| Mistral Large 3 | 100% | $0.0000 | 499ms | 100% | |
| Mistral Medium 3.1 | 100% | $0.0000 | 551ms | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0000 | 682ms | 100% | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0000 | 691ms | 100% | |
| Hermes 3 70B | 100% | $0.0000 | 590ms | 100% | |
| Llama 3.1 70B | 100% | $0.0001 | 413ms | 100% | |
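The composite ranking above folds performance, cost, speed, and stability into one number. The site's actual formula is not published on this page, so the sketch below is only one plausible reading: each signal is mapped to the 0–1 range (the cost and latency ceilings are assumptions) and the four are averaged with equal weight.

```python
def composite_score(performance: float, cost_usd: float, latency_ms: float,
                    stability: float) -> float:
    """Hypothetical equal-weight composite of four 0-1 signals.

    Cost and latency are normalized against assumed ceilings
    ($0.01 and 5s) so that cheaper/faster maps toward 1.0.
    """
    cost_signal = max(0.0, 1.0 - cost_usd / 0.01)       # assumed ceiling
    speed_signal = max(0.0, 1.0 - latency_ms / 5000.0)  # assumed ceiling
    return (performance + cost_signal + speed_signal + stability) / 4

# A free, fast, perfectly stable model scores close to 1.0:
print(round(composite_score(1.0, 0.0, 238.0, 1.0), 3))
```

Under this weighting, a model that is free and sub-second but only 90% accurate can still outrank a slower 100% model, which matches the general shape of the table above.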
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Matches Regex | | |
| 100.0% | Matches text | | |
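The two evaluators listed above can be approximated by simple string checks. A minimal sketch (the function names and the case-insensitive comparison are assumptions; the benchmark's real evaluator code is not shown on this page):

```python
import re

def matches_regex(output: str, pattern: str) -> bool:
    # Pass when the model output contains at least one match for the pattern.
    return re.search(pattern, output) is not None

def matches_text(output: str, expected: str) -> bool:
    # Pass when the expected answer appears verbatim, ignoring case.
    return expected.casefold() in output.casefold()

print(matches_regex("She was born in 1952.", r"\b19\d{2}\b"))  # True
print(matches_text("Answer: Paris", "paris"))                  # True
```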
Indirect birth year
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 309ms | |
| LFM2 24B | 100% | $0.0000 | 1.5s | |
| Mistral Small Creative | 100% | $0.0000 | 329ms | |
| Gemma 3 12B | 100% | $0.0000 | 427ms | |
| Ministral 3 3B | 90% | $0.0000 | 361ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 350ms | |
| Llama 3.1 8B | 100% | $0.0000 | 582ms | |
| Stealth: Aurora Alpha | 100% | — | 871ms | |
| Ministral 3 14B | 100% | $0.0000 | 434ms | |
| Inception Mercury | 100% | $0.0000 | 335ms | |
| Mistral Small 4 | 90% | $0.0000 | 749ms | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0000 | 640ms | |
| Llama 3.1 Nemotron 70B | 100% | $0.0000 | 467ms | |
| Gemma 3 27B | 100% | $0.0000 | 790ms | |
| GPT-4.1 Nano | 100% | $0.0000 | 937ms | |
| GPT-5.4 Nano | 100% | $0.0000 | 704ms | |
| Qwen3 235B A22B Instruct 2507 | 100% | $0.0000 | 1.9s | |
| Gemini 3.1 Flash Lite | 100% | $0.0000 | 595ms | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0000 | 702ms | |
| Hermes 3 70B | 100% | $0.0000 | 447ms | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
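Stability above is defined as median × consistency. A minimal sketch, assuming per-run scores in the 0–1 range and taking consistency as one minus the score spread (the site's exact consistency metric is not stated here, so that term is an assumed stand-in):

```python
from statistics import median

def stability(scores: list[float]) -> float:
    # Stability = median score x consistency. The consistency term
    # (1 - spread of scores) is an assumed stand-in for the site's metric.
    consistency = 1.0 - (max(scores) - min(scores))
    return median(scores) * consistency

print(stability([1.0, 1.0, 1.0]))  # a perfectly repeatable model -> 1.0
print(stability([1.0, 0.8, 0.9]))
```

Whatever the exact consistency definition, the structure explains the tables: only models that score high *and* score the same way on every run reach 100% stability.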
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 309ms | 100% | |
| Mistral Small Creative | 100% | $0.0000 | 329ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 350ms | 100% | |
| Inception Mercury | 100% | $0.0000 | 335ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 427ms | 100% | |
| Ministral 3 14B | 100% | $0.0000 | 434ms | 100% | |
| Llama 3.1 Nemotron 70B | 100% | $0.0000 | 467ms | 100% | |
| Hermes 3 70B | 100% | $0.0000 | 447ms | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 582ms | 100% | |
| Gemini 2.5 Flash | 100% | $0.0000 | 501ms | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0000 | 595ms | 100% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0000 | 640ms | 100% | |
| Mistral Large 3 | 100% | $0.0000 | 533ms | 100% | |
| Mistral Medium 3.1 | 100% | $0.0000 | 560ms | 100% | |
| GPT-5.4 Nano | 100% | $0.0000 | 704ms | 100% | |
| Gemma 3 27B | 100% | $0.0000 | 790ms | 100% | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0000 | 702ms | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0000 | 702ms | 100% | |
| Qwen 2.5 72B | 100% | $0.0000 | 700ms | 100% | |
| GPT-4.1 Nano | 100% | $0.0000 | 937ms | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Matches Regex | | |
| 100.0% | Matches text | | |
Fruits excluding citrus
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 3B | 90% | $0.0000 | 344ms | |
| Gemma 3 4B | 100% | $0.0000 | 419ms | |
| Ministral 3 3B | 100% | $0.0000 | 387ms | |
| Ministral 8B | 90% | $0.0000 | 404ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 368ms | |
| Mistral Small Creative | 100% | $0.0000 | 419ms | |
| Mistral NeMo | 100% | $0.0000 | 1.7s | |
| Ministral 3 8B | 100% | $0.0000 | 436ms | |
| Ministral 3 14B | 100% | $0.0000 | 608ms | |
| Llama 3.1 8B | 90% | $0.0000 | 575ms | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 712ms | |
| Inception Mercury | 90% | $0.0000 | 542ms | |
| Mistral Small 4 | 100% | $0.0000 | 532ms | |
| LFM2 24B | 100% | $0.0000 | 2.0s | |
| Gemma 3 27B | 100% | $0.0000 | 1.1s | |
| Stealth: Aurora Alpha | 100% | — | 3.4s | |
| Gemma 4 31B | 100% | $0.0000 | 3.5s | |
| Gemma 3 12B | 100% | $0.0000 | 987ms | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0000 | 1.4s | |
| Gemini 3.1 Flash Lite | 100% | $0.0000 | 647ms | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 419ms | 100% | |
| Ministral 3 3B | 100% | $0.0000 | 387ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 368ms | 100% | |
| Mistral Small Creative | 100% | $0.0000 | 419ms | 100% | |
| Ministral 3 8B | 100% | $0.0000 | 436ms | 100% | |
| Mistral Small 4 | 100% | $0.0000 | 532ms | 100% | |
| Ministral 3 14B | 100% | $0.0000 | 608ms | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 712ms | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0000 | 647ms | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0000 | 742ms | 100% | |
| Mistral Medium 3.1 | 100% | $0.0001 | 559ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 987ms | 100% | |
| Mistral Large 3 | 100% | $0.0001 | 643ms | 100% | |
| Gemma 3 27B | 100% | $0.0000 | 1.1s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0001 | 832ms | 100% | |
| Qwen 2.5 72B | 100% | $0.0000 | 1.1s | 100% | |
| Llama 3.1 70B | 100% | $0.0001 | 683ms | 100% | |
| GPT-5.4 Nano | 100% | $0.0000 | 1.1s | 100% | |
| Inception Mercury 2 | 100% | $0.0002 | 489ms | 100% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0001 | 1.3s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Contains a list of texts | | |
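The "Contains a list of texts" evaluator presumably passes an answer only when every expected item appears in it. A minimal sketch under that assumption (the function name and case-insensitive matching are hypothetical):

```python
def contains_all(output: str, expected_items: list[str]) -> bool:
    # Pass only when every expected item occurs in the output, ignoring case.
    text = output.casefold()
    return all(item.casefold() in text for item in expected_items)

# e.g. a "fruits excluding citrus" answer checked against expected items:
print(contains_all("apple, banana, mango", ["apple", "mango"]))  # True
print(contains_all("apple, banana", ["apple", "mango"]))         # False
```

Note this check only enforces presence of the expected items; penalizing forbidden items (such as citrus fruits) would need an additional exclusion list.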
Future event time
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 257ms | |
| Inception Mercury | 100% | $0.0000 | 389ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 350ms | |
| Gemma 3 12B | 100% | $0.0000 | 503ms | |
| Stealth: Aurora Alpha | 100% | — | 630ms | |
| Gemma 3 27B | 100% | $0.0000 | 696ms | |
| Gemma 4 26B | 100% | $0.0000 | 1.2s | |
| Gemma 4 31B | 100% | $0.0000 | 2.4s | |
| Gemini 2.5 Flash | 100% | $0.0000 | 380ms | |
| Gemini 3.1 Flash Lite (Reasoning) | 95% | $0.0000 | 608ms | |
| GPT-4.1 Nano | 95% | $0.0000 | 958ms | |
| Gemini 3.1 Flash Lite | 100% | $0.0000 | 670ms | |
| Gemini 3.1 Flash Lite (Preview) | 95% | $0.0000 | 728ms | |
| Grok 4.20 (Beta) | 90% | $0.0002 | 419ms | |
| Gemini 3 Flash (Preview) | 100% | $0.0001 | 789ms | |
| Inception Mercury 2 | 100% | $0.0001 | 337ms | |
| GPT-5.4 Mini | 100% | $0.0001 | 653ms | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0001 | 1.4s | |
| DeepSeek V4 Flash (Reasoning) | 95% | $0.0000 | 2.6s | |
| Nemotron 3 Super | 100% | $0.0000 | 2.8s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 257ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 350ms | 100% | |
| Inception Mercury | 100% | $0.0000 | 389ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 503ms | 100% | |
| Gemini 2.5 Flash | 100% | $0.0000 | 380ms | 100% | |
| Stealth: Aurora Alpha | 100% | — | 630ms | 100% | |
| Gemma 3 27B | 100% | $0.0000 | 696ms | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0000 | 670ms | 100% | |
| Inception Mercury 2 | 100% | $0.0001 | 337ms | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0001 | 789ms | 100% | |
| Gemma 4 26B | 100% | $0.0000 | 1.2s | 100% | |
| GPT-5.4 Mini | 100% | $0.0001 | 653ms | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0001 | 1.4s | 100% | |
| Gemma 4 31B | 100% | $0.0000 | 2.4s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 2.8s | 100% | |
| GPT-5.4 | 100% | $0.0003 | 681ms | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0001 | 2.0s | 100% | |
| Grok 4 Fast | 100% | $0.0002 | 2.0s | 100% | |
| GPT-5.2 | 100% | $0.0004 | 1.1s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0001 | 3.3s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Matches Regex | — | |
| 90.0% | Matches text | | |
Highest-rated movie
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 267ms | |
| Ministral 3B | 100% | $0.0000 | 273ms | |
| Ministral 8B | 100% | $0.0000 | 311ms | |
| Mistral Small Creative | 100% | $0.0000 | 339ms | |
| Ministral 3 8B | 100% | $0.0000 | 353ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 351ms | |
| Inception Mercury | 100% | $0.0000 | 376ms | |
| Ministral 3 14B | 100% | $0.0000 | 350ms | |
| Ministral 3 3B | 100% | $0.0000 | 388ms | |
| Llama 3.1 8B | 90% | $0.0000 | 369ms | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 414ms | |
| Gemma 3 12B | 100% | $0.0000 | 486ms | |
| LFM2 24B | 100% | $0.0000 | 2.0s | |
| Stealth: Aurora Alpha | 100% | — | 429ms | |
| Inception Mercury 2 | 100% | $0.0001 | 373ms | |
| Mistral Small 4 | 100% | $0.0000 | 523ms | |
| Gemini 2.5 Flash | 100% | $0.0000 | 465ms | |
| Mistral Medium 3.1 | 100% | $0.0001 | 556ms | |
| Qwen3 235B A22B Instruct 2507 | 100% | $0.0000 | 1.0s | |
| Gemma 3 27B | 100% | $0.0000 | 621ms | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemma 3 4B | 100% | $0.0000 | 267ms | 100% | |
| Ministral 3B | 100% | $0.0000 | 273ms | 100% | |
| Ministral 8B | 100% | $0.0000 | 311ms | 100% | |
| Mistral Small Creative | 100% | $0.0000 | 339ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 351ms | 100% | |
| Ministral 3 8B | 100% | $0.0000 | 353ms | 100% | |
| Inception Mercury | 100% | $0.0000 | 376ms | 100% | |
| Ministral 3 14B | 100% | $0.0000 | 350ms | 100% | |
| Ministral 3 3B | 100% | $0.0000 | 388ms | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 414ms | 100% | |
| Stealth: Aurora Alpha | 100% | — | 429ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 486ms | 100% | |
| Mistral Small 4 | 100% | $0.0000 | 523ms | 100% | |
| Gemini 2.5 Flash | 100% | $0.0000 | 465ms | 100% | |
| Inception Mercury 2 | 100% | $0.0001 | 373ms | 100% | |
| Gemma 3 27B | 100% | $0.0000 | 621ms | 100% | |
| Mistral Medium 3.1 | 100% | $0.0001 | 556ms | 100% | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 725ms | 100% | |
| Mistral Large 3 | 100% | $0.0001 | 547ms | 100% | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0000 | 648ms | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Matches Regex | | |
| 100.0% | Matches text | | |
All valid emails
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 3B | 80% | $0.0000 | 380ms | |
| Stealth: Aurora Alpha | 100% | — | 613ms | |
| Ministral 3 3B | 100% | $0.0000 | 1.0s | |
| Mistral Small Creative | 100% | $0.0000 | 437ms | |
| Gemma 3 4B | 100% | $0.0000 | 656ms | |
| Ministral 3 8B | 100% | $0.0000 | 469ms | |
| Ministral 8B | 80% | $0.0000 | 521ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 398ms | |
| Llama 3.1 8B | 100% | $0.0000 | 440ms | |
| Inception Mercury | 100% | $0.0000 | 502ms | |
| Ministral 3 14B | 100% | $0.0000 | 610ms | |
| LFM2 24B | 100% | $0.0000 | 2.0s | |
| Gemma 4 26B | 100% | $0.0000 | 981ms | |
| Mistral Small 4 | 100% | $0.0000 | 670ms | |
| Gemma 3 12B | 100% | $0.0000 | 1.1s | |
| DeepSeek V4 Flash | 100% | $0.0000 | 2.0s | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 1.5s | |
| Mistral Small 3.2 24B | 100% | $0.0000 | 1.9s | |
| GPT-4.1 Nano | 100% | $0.0000 | 1.4s | |
| GPT-5.4 Nano | 100% | $0.0001 | 902ms | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Mistral Small Creative | 100% | $0.0000 | 437ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0000 | 398ms | 100% | |
| Ministral 3 8B | 100% | $0.0000 | 469ms | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 440ms | 100% | |
| Stealth: Aurora Alpha | 100% | — | 613ms | 100% | |
| Gemma 3 4B | 100% | $0.0000 | 656ms | 100% | |
| Inception Mercury | 100% | $0.0000 | 502ms | 100% | |
| Ministral 3 14B | 100% | $0.0000 | 610ms | 100% | |
| Mistral Small 4 | 100% | $0.0000 | 670ms | 100% | |
| Ministral 3 3B | 100% | $0.0000 | 1.0s | 100% | |
| Gemma 4 26B | 100% | $0.0000 | 981ms | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 1.1s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 722ms | 100% | |
| GPT-5.4 Nano | 100% | $0.0001 | 902ms | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0001 | 764ms | 100% | |
| Gemini 2.5 Flash | 100% | $0.0001 | 467ms | 100% | |
| Inception Mercury 2 | 100% | $0.0001 | 354ms | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0001 | 984ms | 100% | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0001 | 824ms | 100% | |
| Arcee AI: Trinity Large (Preview) | 100% | $0.0000 | 1.5s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Contains a list of texts | | |
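For the "All valid emails" task above, extraction typically comes down to a pragmatic address pattern. The sketch below uses a conventional regex (not a full RFC 5322 validator, and not necessarily the pattern this benchmark uses):

```python
import re

# Pragmatic email pattern: local part, "@", domain with a dotted TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text: str) -> list[str]:
    # Returns every substring that looks like a conventional email address.
    return EMAIL_RE.findall(text)

print(extract_emails("Write a@b.com or c.d@e.org; ignore user@localhost"))
# → ['a@b.com', 'c.d@e.org']
```

Dotless hosts such as `user@localhost` are deliberately rejected here, which matches the usual "valid email" reading of this kind of task.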