N-Length Sentences
Write sentences with exactly N words
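
The page names its evaluator only as "Matches word count" and does not say how words are tokenized. As a minimal sketch, assuming words are whitespace-separated tokens and the score is the fraction of sentences that hit the target exactly, a check could look like this (`count_words` and `score_response` are hypothetical names, not the benchmark's own code):

```python
def count_words(sentence: str) -> int:
    """Count whitespace-separated tokens (assumed tokenization rule)."""
    return len(sentence.split())


def score_response(sentences: list[str], n: int) -> float:
    """Return the fraction of sentences that contain exactly n words."""
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if count_words(s) == n)
    return hits / len(sentences)


# One sentence has five words, the other has six.
mixed = ["The cat sat down quietly.", "Birds sing loudly in the morning."]
mixed_score = score_response(mixed, 5)
```

Under this rule punctuation attached to a word does not add to the count, while hyphenated compounds count as one word; a different tokenizer would score some sentences differently.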
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 3.1 Flash Lite | 100% | $0.0001 | 1.5s | |
| Gemini 3.1 Flash Lite (Reasoning) | 98% | $0.0001 | 2.0s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 1.2s | |
| Inception Mercury | 98% | $0.0001 | 1.6s | |
| Stealth: Aurora Alpha | 98% | — | 1.7s | |
| Gemma 4 26B | 96% | $0.0000 | 3.8s | |
| Gemini 3 Flash (Preview) | 99% | $0.0004 | 1.9s | |
| Inception Mercury 2 | 100% | $0.0007 | 1.2s | |
| Llama 3.1 Nemotron 70B | 81% | $0.0001 | 5.7s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0008 | 5.7s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 7.5s | |
| Llama 3.1 70B | 84% | $0.0002 | 2.1s | |
| Nemotron 3 Super | 100% | $0.0000 | 16.0s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0025 | 6.3s | |
| Gemma 4 31B | 89% | $0.0001 | 8.6s | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0038 | 8.1s | |
| GPT-OSS 120B | 100% | $0.0004 | 20.3s | |
| Claude Opus 4.5 | 94% | $0.0052 | 6.8s | |
| Stealth: Healer Alpha | 79% | $0.0000 | 12.1s | |
| GPT-5 Nano | 100% | $0.0010 | 28.2s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
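
The quadrant split described above can be sketched as follows. This is an illustration only: the function name `assign_quadrants`, the label strings, and the handling of ties (values on a median line count as "low cost" or "high score") are assumptions, and the sample rows are a handful of entries copied from the price-performance table rather than the full data set.

```python
from statistics import median


def assign_quadrants(models):
    """Label each model relative to the median cost and median score.

    `models` maps a model name to a (cost_usd, score_pct) tuple. Entries
    whose cost is None are skipped, mirroring the rule that only models
    with available cost data are shown.
    """
    priced = {name: cs for name, cs in models.items() if cs[0] is not None}
    cost_line = median(cost for cost, _ in priced.values())
    score_line = median(score for _, score in priced.values())
    labels = {}
    for name, (cost, score) in priced.items():
        cost_side = "low cost" if cost <= cost_line else "high cost"
        score_side = "high score" if score >= score_line else "low score"
        labels[name] = f"{cost_side}, {score_side}"
    return labels


sample = {
    "Gemini 3.1 Flash Lite": (0.0001, 100),
    "Llama 3.1 70B": (0.0002, 84),
    "GPT-5.4 Mini (Reasoning)": (0.0038, 100),
    "Claude Opus 4.5": (0.0052, 94),
    "Stealth: Aurora Alpha": (None, 98),  # no cost listed, so it is skipped
}
labels = assign_quadrants(sample)
```

Because the lines sit at the medians of whatever subset is plotted, a model's quadrant can change when models are added or removed.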
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Qwen 3.5 Flash | 100% | 100% | 100% | |
| Qwen 3.5 9B | 100% | 100% | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | 100% | 100% | |
| Gemini 3.1 Flash Lite | 100% | 100% | 100% | |
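
The stability ranking is stated as median × consistency, but the page does not define how consistency is measured. The sketch below therefore takes consistency as a given number between 0 and 1 and only illustrates the multiplication and the sort. The names `stability` and `rank_by_stability` and all figures in `example` are hypothetical.

```python
from statistics import median


def stability(run_scores, consistency):
    """Median of the per-run scores times a consistency factor in [0, 1]."""
    return median(run_scores) * consistency


def rank_by_stability(entries):
    """entries maps name -> (run_scores, consistency); most stable first."""
    return sorted(entries, key=lambda name: stability(*entries[name]), reverse=True)


# Hypothetical figures, not taken from the leaderboard.
example = {
    "model_a": ([1.0, 1.0, 1.0], 1.0),
    "model_b": ([1.0, 0.9, 1.0], 0.9),
    "model_c": ([0.8, 0.8, 0.7], 1.0),
}
ranking = rank_by_stability(example)
```

With a product of this form, a model with a perfect median but uneven runs (`model_b`) can still rank above a perfectly consistent model with a lower median (`model_c`).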
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 1.2s | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0001 | 1.5s | 100% | |
| Inception Mercury 2 | 100% | $0.0007 | 1.2s | 98% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 7.5s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0025 | 6.3s | 100% | |
| Gemini 3 Flash (Preview) | 99% | $0.0004 | 1.9s | 95% | |
| Nemotron 3 Super | 100% | $0.0000 | 16.0s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0038 | 8.1s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0008 | 5.7s | 95% | |
| Inception Mercury | 98% | $0.0001 | 1.6s | 90% | |
| GPT-OSS 120B | 100% | $0.0004 | 20.3s | 98% | |
| GPT-5 Nano | 100% | $0.0010 | 28.2s | 100% | |
| Stealth: Aurora Alpha | 98% | — | 1.7s | 90% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.0093 | 10.4s | 100% | |
| Gemini 3.1 Flash Lite (Reasoning) | 98% | $0.0001 | 2.0s | 85% | |
| o4 Mini | 100% | $0.0083 | 20.8s | 100% | |
| GPT-5.2 | 100% | $0.011 | 15.0s | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.010 | 17.7s | 100% | |
| Qwen 3.5 Flash | 100% | $0.0024 | 38.6s | 100% | |
| GPT-5 Mini | 99% | $0.0043 | 26.2s | 94% | |
| Model | Total | Write sentences with 5 words each | Write sentences with 10 words each | Write sentences with 20 words each |
|---|---|---|---|---|
| Qwen3.6 Max Preview | 100% | 100% | 100% | 100% |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | 100% |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | 100% |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | 100% |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | 100% |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | 100% |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | 100% |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | 100% |
| Qwen 3.5 122B | 100% | 100% | 100% | 100% |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | 100% |
| Qwen 3.5 27B | 100% | 100% | 100% | 100% |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | 100% |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | 100% |
| o4 Mini High | 100% | 100% | 100% | 100% |
| GPT-5.2 | 100% | 100% | 100% | 100% |
Write sentences with 5 words each
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Llama 3.1 8B | 100% | $0.0000 | 877ms | |
| Gemini 3.1 Flash Lite | 100% | $0.0001 | 2.3s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 904ms | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0001 | 945ms | |
| Inception Mercury | 100% | $0.0001 | 1.2s | |
| Llama 3.1 70B | 100% | $0.0001 | 1.3s | |
| Gemini 3 Flash (Preview) | 100% | $0.0002 | 1.3s | |
| Mistral Medium 3.1 | 99% | $0.0002 | 1.9s | |
| Inception Mercury 2 | 100% | $0.0006 | 1.1s | |
| Gemma 4 26B | 100% | $0.0000 | 4.3s | |
| Stealth: Aurora Alpha | 98% | — | 1.6s | |
| Qwen 2.5 72B | 89% | $0.0007 | 44.4s | |
| Gemma 4 31B | 100% | $0.0000 | 4.7s | |
| Grok 4.3 | 92% | $0.0002 | 1.5s | |
| Llama 3.1 Nemotron 70B | 100% | $0.0001 | 3.8s | |
| DeepSeek V3 (2024-12-26) | 95% | $0.0001 | 4.3s | |
| DeepSeek V4 Pro | 98% | $0.0002 | 4.7s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 5.1s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0015 | 3.9s | |
| Claude Sonnet 4.6 | 100% | $0.0016 | 3.3s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Grok 4.20 (Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 904ms | 100% | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0001 | 945ms | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0002 | 1.3s | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0001 | 2.3s | 100% | |
| Inception Mercury 2 | 100% | $0.0006 | 1.1s | 100% | |
| Llama 3.1 Nemotron 70B | 100% | $0.0001 | 3.8s | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 877ms | 99% | |
| Llama 3.1 70B | 100% | $0.0001 | 1.3s | 99% | |
| Gemma 4 26B | 100% | $0.0000 | 4.3s | 100% | |
| Gemma 4 31B | 100% | $0.0000 | 4.7s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 5.1s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 8.5s | 100% | |
| Inception Mercury | 100% | $0.0001 | 1.2s | 97% | |
| Claude Sonnet 4.6 | 100% | $0.0016 | 3.3s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0015 | 3.9s | 100% | |
| Mistral Medium 3.1 | 99% | $0.0002 | 1.9s | 97% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0009 | 7.1s | 100% | |
| GPT-5.4 Mini | 99% | $0.0004 | 871ms | 96% | |
| GPT-4o Mini (temp=0) | 98% | $0.0001 | 3.2s | 96% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0025 | 6.5s | 100% | |
| Median | Evaluator |
|---|---|
| 98.1% | Matches word count |
Write sentences with 10 words each
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.4s | |
| Inception Mercury | 99% | $0.0001 | 1.5s | |
| Gemma 4 26B | 100% | $0.0000 | 3.6s | |
| Gemini 3.1 Flash Lite | 100% | $0.0001 | 1.1s | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0001 | 1.1s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 1.2s | |
| Llama 3.1 70B | 96% | $0.0002 | 1.9s | |
| Gemini 3 Flash (Preview) | 100% | $0.0004 | 2.0s | |
| Inception Mercury 2 | 100% | $0.0006 | 1.2s | |
| Llama 3.1 Nemotron 70B | 99% | $0.0001 | 6.0s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0007 | 5.4s | |
| Gemma 4 31B | 100% | $0.0001 | 13.0s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0008 | 6.1s | |
| GPT-OSS 120B | 100% | $0.0003 | 16.2s | |
| Nemotron 3 Super | 100% | $0.0000 | 20.5s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0021 | 6.5s | |
| GPT-5 Nano | 100% | $0.0008 | 23.8s | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0029 | 6.9s | |
| Llama 3.1 8B | 95% | $0.0000 | 824ms | |
| Claude Sonnet 4.5 | 97% | $0.0029 | 5.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Grok 4.20 (Reasoning) | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.4s | 100% | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0001 | 1.1s | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0001 | 1.1s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 1.2s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0004 | 2.0s | 100% | |
| Gemma 4 26B | 100% | $0.0000 | 3.6s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0007 | 5.4s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0008 | 6.1s | 100% | |
| Inception Mercury 2 | 100% | $0.0006 | 1.2s | 97% | |
| Inception Mercury | 99% | $0.0001 | 1.5s | 97% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0021 | 6.5s | 100% | |
| Gemma 4 31B | 100% | $0.0001 | 13.0s | 98% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0029 | 6.9s | 100% | |
| Llama 3.1 Nemotron 70B | 99% | $0.0001 | 6.0s | 97% | |
| Nemotron 3 Super | 100% | $0.0000 | 20.5s | 100% | |
| GPT-OSS 120B | 100% | $0.0003 | 16.2s | 98% | |
| GPT-5 Nano | 100% | $0.0008 | 23.8s | 100% | |
| Claude Opus 4.5 | 100% | $0.0051 | 6.4s | 98% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.0079 | 9.3s | 100% | |
| Llama 3.1 70B | 96% | $0.0002 | 1.9s | 89% | |
| Median | Evaluator |
|---|---|
| 92.4% | Matches word count |
Write sentences with 20 words each
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 1.4s | |
| Gemini 3.1 Flash Lite | 100% | $0.0002 | 1.2s | |
| Gemini 3.1 Flash Lite (Reasoning) | 94% | $0.0001 | 4.1s | |
| Inception Mercury | 95% | $0.0001 | 2.0s | |
| Stealth: Aurora Alpha | 98% | — | 2.1s | |
| Inception Mercury 2 | 100% | $0.0008 | 1.4s | |
| Gemini 3 Flash (Preview) | 97% | $0.0005 | 2.3s | |
| GPT-5.4 Nano (Reasoning, Low) | 99% | $0.0011 | 6.7s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0013 | 9.3s | |
| Nemotron 3 Super | 100% | $0.0000 | 19.1s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0040 | 8.4s | |
| GPT-OSS 120B | 100% | $0.0005 | 20.9s | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0060 | 10.9s | |
| Gemma 4 26B | 88% | $0.0001 | 3.3s | |
| ByteDance Seed 1.6 Flash | 92% | $0.0009 | 16.4s | |
| Stealth: Healer Alpha | 78% | $0.0000 | 16.6s | |
| Nemotron 3 Nano | 100% | $0.0005 | 31.4s | |
| GPT-5 Nano | 100% | $0.0014 | 37.4s | |
| ByteDance Seed 1.6 | 100% | $0.0032 | 36.1s | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.012 | 19.9s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3.1 Flash Lite | 100% | $0.0002 | 1.2s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 1.4s | 100% | |
| Inception Mercury 2 | 100% | $0.0008 | 1.4s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0013 | 9.3s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0040 | 8.4s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0060 | 10.9s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 19.1s | 100% | |
| GPT-OSS 120B | 100% | $0.0005 | 20.9s | 100% | |
| Gemini 3 Flash (Preview) | 97% | $0.0005 | 2.3s | 93% | |
| GPT-5.4 Nano (Reasoning, Low) | 99% | $0.0011 | 6.7s | 91% | |
| Nemotron 3 Nano | 100% | $0.0005 | 31.4s | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.014 | 13.9s | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.012 | 19.9s | 100% | |
| GPT-5.2 | 100% | $0.015 | 18.4s | 100% | |
| GPT-5 Nano | 100% | $0.0014 | 37.4s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0032 | 36.1s | 100% | |
| o4 Mini | 100% | $0.011 | 25.9s | 100% | |
| Inception Mercury | 95% | $0.0001 | 2.0s | 85% | |
| Qwen 3.5 Flash | 100% | $0.0027 | 40.1s | 100% | |
| GPT-5.1 | 100% | $0.015 | 24.4s | 100% | |
| Median | Evaluator |
|---|---|
| 65.1% | Matches word count |