N-Length Sentences
Write sentences with exactly N words
[Chart: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 1.2s | |
| Inception Mercury | 98% | $0.0001 | 1.6s | |
| Stealth: Aurora Alpha | 98% | — | 1.7s | |
| Inception Mercury 2 | 100% | $0.0007 | 1.2s | |
| Gemini 3 Flash (Preview) | 99% | $0.0004 | 1.9s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0008 | 5.7s | |
| Llama 3.1 70B | 84% | $0.0002 | 2.1s | |
| Llama 3.1 Nemotron 70B | 81% | $0.0001 | 5.7s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 7.5s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0025 | 6.3s | |
| Nemotron 3 Super | 100% | $0.0000 | 16.0s | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0038 | 8.1s | |
| Claude Opus 4.5 | 94% | $0.0052 | 6.8s | |
| Llama 3.1 8B | 80% | $0.0000 | 910ms | |
| GPT-5 Nano | 100% | $0.0010 | 28.2s | |
| Stealth: Healer Alpha | 79% | $0.0000 | 12.1s | |
| GPT-4o, May 13th (temp=0) | 87% | $0.0025 | 4.6s | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.0093 | 10.4s | |
| Claude Opus 4.6 | 84% | $0.0055 | 7.6s | |
| GPT-5 Mini | 99% | $0.0043 | 26.2s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
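The quadrant split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the medians are taken over the plotted (priced) models, ties count as "high score" and "low cost", and the function name `quadrants` and the quadrant labels are hypothetical. The sample rows are copied from the price-performance table above.

```python
from statistics import median

# (model, score in %, cost in USD), copied from the table above.
# A cost of None stands for the "—" entry (no cost data available).
rows = [
    ("Gemini 3.1 Flash Lite (Preview)", 100, 0.0001),
    ("Inception Mercury", 98, 0.0001),
    ("Stealth: Aurora Alpha", 98, None),
    ("Claude Opus 4.5", 94, 0.0052),
    ("Llama 3.1 8B", 80, 0.0000),
    ("GPT-5.4 (Reasoning, Low)", 100, 0.0093),
]

def quadrants(rows):
    """Assign each priced model to a quadrant around the median score and cost."""
    # Only models with cost data are placed on the chart.
    priced = [r for r in rows if r[2] is not None]
    score_mid = median(r[1] for r in priced)
    cost_mid = median(r[2] for r in priced)
    result = {}
    for name, score, cost in priced:
        # Tie-breaking at the median is an assumption of this sketch.
        perf = "high score" if score >= score_mid else "low score"
        price = "low cost" if cost <= cost_mid else "high cost"
        result[name] = f"{perf}, {price}"
    return result

for name, quadrant in quadrants(rows).items():
    print(f"{name}: {quadrant}")
```

With these six rows the median score is 98% and the median cost is $0.0001, so Claude Opus 4.5 lands in the "low score, high cost" quadrant and Stealth: Aurora Alpha is skipped for lack of cost data.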
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Qwen 3.5 Flash | 100% | 100% | 100% | |
| Qwen 3.5 9B | 100% | 100% | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | 100% | 100% | |
| Nemotron 3 Super | 100% | 100% | 100% | |
| GPT-5 Nano | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | 100% | 100% | |
| GPT-5 | 100% | 99% | 99% | |
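The stability ranking above is described as median × consistency. A minimal sketch of that product follows, assuming the Score column holds the median score and that both values are expressed as fractions. How consistency itself is measured is not stated on this page, so it is taken as an input here, and the function name `stability` is hypothetical.

```python
def stability(median_score: float, consistency: float) -> float:
    """Return median × consistency; both inputs are fractions in [0, 1]."""
    return median_score * consistency

# GPT-5 row from the table above: 100% score, 99% consistency.
print(f"{stability(1.00, 0.99):.0%}")  # 99%
```

Under this reading, a model can only reach 100% stability when both its median score and its consistency are 100%.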
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 1.2s | 100% | |
| Inception Mercury 2 | 100% | $0.0007 | 1.2s | 98% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0010 | 7.5s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0025 | 6.3s | 100% | |
| Gemini 3 Flash (Preview) | 99% | $0.0004 | 1.9s | 95% | |
| Nemotron 3 Super | 100% | $0.0000 | 16.0s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0038 | 8.1s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0008 | 5.7s | 95% | |
| Inception Mercury | 98% | $0.0001 | 1.6s | 90% | |
| GPT-5 Nano | 100% | $0.0010 | 28.2s | 100% | |
| Stealth: Aurora Alpha | 98% | — | 1.7s | 90% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.0093 | 10.4s | 100% | |
| o4 Mini | 100% | $0.0083 | 20.8s | 100% | |
| GPT-5.2 | 100% | $0.011 | 15.0s | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.010 | 17.7s | 100% | |
| Qwen 3.5 Flash | 100% | $0.0024 | 38.6s | 100% | |
| GPT-5 Mini | 99% | $0.0043 | 26.2s | 94% | |
| Z.AI GLM 5 Turbo | 100% | $0.011 | 26.6s | 100% | |
| GPT-5.4 (Reasoning) | 100% | $0.013 | 19.5s | 98% | |
| o4 Mini High | 100% | $0.011 | 27.6s | 100% | |
| Model | Total | Write sentences with 5 words each | Write sentences with 10 words each | Write sentences with 20 words each |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | 100% |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | 100% |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | 100% |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | 100% |
| Qwen 3.5 122B | 100% | 100% | 100% | 100% |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | 100% |
| Qwen 3.5 27B | 100% | 100% | 100% | 100% |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | 100% |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | 100% |
| o4 Mini High | 100% | 100% | 100% | 100% |
| GPT-5.2 | 100% | 100% | 100% | 100% |
| o4 Mini | 100% | 100% | 100% | 100% |
| Qwen 3.5 Flash | 100% | 100% | 100% | 100% |
| Qwen 3.5 9B | 100% | 100% | 100% | 100% |
| Gemini 3.1 Flash Lite (Preview) | 100% | 100% | 100% | 100% |
Write sentences with 5 words each
[Chart: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Llama 3.1 8B | 100% | $0.0000 | 877ms | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 904ms | |
| Inception Mercury | 100% | $0.0001 | 1.2s | |
| Llama 3.1 70B | 100% | $0.0001 | 1.3s | |
| Gemini 3 Flash (Preview) | 100% | $0.0002 | 1.3s | |
| Mistral Medium 3.1 | 99% | $0.0002 | 1.9s | |
| Inception Mercury 2 | 100% | $0.0006 | 1.1s | |
| Stealth: Aurora Alpha | 98% | — | 1.6s | |
| Qwen 2.5 72B | 89% | $0.0007 | 44.4s | |
| Llama 3.1 Nemotron 70B | 100% | $0.0001 | 3.8s | |
| DeepSeek V3 (2024-12-26) | 95% | $0.0001 | 4.3s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 5.1s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0015 | 3.9s | |
| Claude Sonnet 4.6 | 100% | $0.0016 | 3.3s | |
| Nemotron 3 Super | 100% | $0.0000 | 8.5s | |
| GPT-4o, May 13th (temp=1) | 99% | $0.0015 | 4.3s | |
| GPT-4.1 Nano | 97% | $0.0000 | 1.3s | |
| GPT-4o Mini (temp=1) | 98% | $0.0001 | 7.9s | |
| GPT-5.4 Mini | 99% | $0.0004 | 871ms | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0009 | 7.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| MiniMax M2.5 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 904ms | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0002 | 1.3s | 100% | |
| Inception Mercury 2 | 100% | $0.0006 | 1.1s | 100% | |
| Llama 3.1 Nemotron 70B | 100% | $0.0001 | 3.8s | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 877ms | 99% | |
| Llama 3.1 70B | 100% | $0.0001 | 1.3s | 99% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0006 | 5.1s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 8.5s | 100% | |
| Inception Mercury | 100% | $0.0001 | 1.2s | 97% | |
| Claude Sonnet 4.6 | 100% | $0.0016 | 3.3s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0015 | 3.9s | 100% | |
| Mistral Medium 3.1 | 99% | $0.0002 | 1.9s | 97% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0009 | 7.1s | 100% | |
| GPT-5.4 Mini | 99% | $0.0004 | 871ms | 96% | |
| GPT-4o Mini (temp=0) | 98% | $0.0001 | 3.2s | 96% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0025 | 6.5s | 100% | |
| GPT-4o, May 13th (temp=0) | 99% | $0.0018 | 3.9s | 98% | |
| GPT-4.1 Mini | 98% | $0.0002 | 1.9s | 94% | |
| Gemma 3 27B | 98% | $0.0000 | 3.5s | 94% | |
| GPT-4o, May 13th (temp=1) | 99% | $0.0015 | 4.3s | 97% | |
| Median | Evaluator |
|---|---|
| 97.8% | Matches word count |
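The evaluator for this test is listed as "Matches word count". The page does not say how words are split or how per-sentence results are combined into a score, so the following is only a sketch under two assumptions: words are whitespace-separated tokens, and the score is the fraction of sentences with exactly N words. The names `word_count` and `score` are hypothetical.

```python
def word_count(sentence: str) -> int:
    """Count whitespace-separated tokens; punctuation stays attached to words."""
    return len(sentence.split())

def score(sentences: list[str], n: int) -> float:
    """Fraction of sentences that contain exactly n words (assumed scoring rule)."""
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if word_count(s) == n)
    return hits / len(sentences)

sample = [
    "The cat sat down quietly.",              # 5 words
    "Rain fell all night long.",              # 5 words
    "She smiled.",                            # 2 words
    "We walked home after dinner together.",  # 6 words
]
print(score(sample, 5))  # 0.5
```

A stricter evaluator might treat hyphenated words or contractions differently, which would change the counts for some sentences.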
Write sentences with 10 words each
[Chart: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Inception Mercury | 99% | $0.0001 | 1.5s | |
| Stealth: Aurora Alpha | 100% | — | 1.4s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 1.2s | |
| Llama 3.1 70B | 96% | $0.0002 | 1.9s | |
| Gemini 3 Flash (Preview) | 100% | $0.0004 | 2.0s | |
| Inception Mercury 2 | 100% | $0.0006 | 1.2s | |
| Llama 3.1 Nemotron 70B | 99% | $0.0001 | 6.0s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0007 | 5.4s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0008 | 6.1s | |
| Claude 3.5 Haiku | 97% | $0.0007 | 2.7s | |
| Nemotron 3 Super | 100% | $0.0000 | 20.5s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0021 | 6.5s | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0029 | 6.9s | |
| Llama 3.1 8B | 95% | $0.0000 | 824ms | |
| GPT-5 Nano | 100% | $0.0008 | 23.8s | |
| Claude Sonnet 4.5 | 97% | $0.0029 | 5.4s | |
| Claude Sonnet 4 | 92% | $0.0025 | 4.9s | |
| Stealth: Healer Alpha | 81% | $0.0000 | 9.7s | |
| Claude Opus 4.5 | 100% | $0.0051 | 6.4s | |
| GPT-4o, Aug. 6th (temp=1) | 95% | $0.0015 | 2.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Qwen 3.5 Flash | 100% | 100% | 100% | |
| Qwen 3.5 9B | 100% | 100% | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.4s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 1.2s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0004 | 2.0s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0007 | 5.4s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0008 | 6.1s | 100% | |
| Inception Mercury 2 | 100% | $0.0006 | 1.2s | 97% | |
| Inception Mercury | 99% | $0.0001 | 1.5s | 97% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0021 | 6.5s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0029 | 6.9s | 100% | |
| Llama 3.1 Nemotron 70B | 99% | $0.0001 | 6.0s | 97% | |
| Nemotron 3 Super | 100% | $0.0000 | 20.5s | 100% | |
| GPT-5 Nano | 100% | $0.0008 | 23.8s | 100% | |
| Claude Opus 4.5 | 100% | $0.0051 | 6.4s | 98% | |
| Claude 3.5 Haiku | 97% | $0.0007 | 2.7s | 92% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.0079 | 9.3s | 100% | |
| Llama 3.1 70B | 96% | $0.0002 | 1.9s | 89% | |
| GPT-5 Mini | 100% | $0.0037 | 22.9s | 98% | |
| Qwen 3.5 Flash | 100% | $0.0022 | 39.8s | 100% | |
| Claude Sonnet 4.5 | 97% | $0.0029 | 5.4s | 91% | |
| GPT-4o, Aug. 6th (temp=1) | 95% | $0.0015 | 2.4s | 89% | |
| Median | Evaluator |
|---|---|
| 90.7% | Matches word count |
Write sentences with 20 words each
[Chart: Performance Score Distribution (Top 20)]
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 1.4s | |
| Inception Mercury | 95% | $0.0001 | 2.0s | |
| Stealth: Aurora Alpha | 98% | — | 2.1s | |
| Inception Mercury 2 | 100% | $0.0008 | 1.4s | |
| Gemini 3 Flash (Preview) | 97% | $0.0005 | 2.3s | |
| GPT-5.4 Nano (Reasoning, Low) | 99% | $0.0011 | 6.7s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0013 | 9.3s | |
| Nemotron 3 Super | 100% | $0.0000 | 19.1s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0040 | 8.4s | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0060 | 10.9s | |
| ByteDance Seed 1.6 Flash | 92% | $0.0009 | 16.4s | |
| Stealth: Healer Alpha | 78% | $0.0000 | 16.6s | |
| Nemotron 3 Nano | 100% | $0.0005 | 31.4s | |
| GPT-5 Nano | 100% | $0.0014 | 37.4s | |
| ByteDance Seed 1.6 | 100% | $0.0032 | 36.1s | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.012 | 19.9s | |
| Qwen 3.5 Flash | 100% | $0.0027 | 40.1s | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.014 | 13.9s | |
| GPT-5 Mini | 98% | $0.0056 | 32.6s | |
| o4 Mini | 100% | $0.011 | 25.9s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| MiniMax M2.7 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Qwen 3.5 35B | 100% | 100% | 100% | |
| ByteDance Seed 2.0 Mini | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0001 | 1.4s | 100% | |
| Inception Mercury 2 | 100% | $0.0008 | 1.4s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0013 | 9.3s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0040 | 8.4s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0060 | 10.9s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 19.1s | 100% | |
| Gemini 3 Flash (Preview) | 97% | $0.0005 | 2.3s | 93% | |
| GPT-5.4 Nano (Reasoning, Low) | 99% | $0.0011 | 6.7s | 91% | |
| Nemotron 3 Nano | 100% | $0.0005 | 31.4s | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | $0.014 | 13.9s | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.012 | 19.9s | 100% | |
| GPT-5.2 | 100% | $0.015 | 18.4s | 100% | |
| GPT-5 Nano | 100% | $0.0014 | 37.4s | 100% | |
| o4 Mini | 100% | $0.011 | 25.9s | 100% | |
| ByteDance Seed 1.6 | 100% | $0.0032 | 36.1s | 100% | |
| Inception Mercury | 95% | $0.0001 | 2.0s | 85% | |
| Qwen 3.5 Flash | 100% | $0.0027 | 40.1s | 100% | |
| GPT-5.1 | 100% | $0.015 | 24.4s | 100% | |
| MiniMax M2.7 | 100% | $0.0040 | 44.9s | 100% | |
| Stealth: Aurora Alpha | 98% | — | 2.1s | 85% | |
| Median | Evaluator |
|---|---|
| 59.4% | Matches word count |