Language Writing
Can the model generate text in different languages?
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 2.0s | |
| Inception Mercury | 96% | $0.0002 | 1.5s | |
| GPT-4.1 Nano | 93% | $0.0001 | 4.0s | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 4.8s | |
| Mistral NeMo | 67% | $0.0001 | 4.3s | |
| Inception Mercury 2 | 100% | $0.0006 | 1.4s | |
| GPT-4.1 Mini | 99% | $0.0004 | 3.4s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 5.6s | |
| Arcee AI: Trinity Mini | 81% | $0.0002 | 15.9s | |
| Claude 3 Haiku | 81% | $0.0007 | 3.8s | |
| Grok 4.3 | 94% | $0.0009 | 3.8s | |
| Gemini 3.1 Flash Lite (Preview) | 95% | $0.0011 | 3.7s | |
| Gemini 3.1 Flash Lite | 97% | $0.0011 | 6.2s | |
| Gemini 3.1 Flash Lite (Reasoning) | 98% | $0.0011 | 5.3s | |
| Nemotron 3 Nano | 95% | $0.0002 | 10.8s | |
| DeepSeek V4 Flash (Reasoning) | 90% | $0.0002 | 20.8s | |
| Nemotron 3 Super | 98% | $0.0000 | 21.7s | |
| DeepSeek V4 Flash | 87% | $0.0002 | 12.4s | |
| Mistral Small 3.2 24B | 71% | $0.0003 | 11.0s | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 16.1s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
14 low-scoring outliers hidden: LFM2 24B (64.3%), WizardLM 2 8x22b (61.1%), Ministral 3B (59.5%), Ministral 8B (52.8%), Rocinante 12B (51.9%), Mistral Medium 3.1 (49.0%), Mistral Small 4 (48.9%), Ministral 3 8B (47.9%), Qwen3 235B A22B Instruct 2507 (46.7%), Writer: Palmyra X5 (43.2%), Ministral 3 3B (36.2%), Mistral Small Creative (33.7%), Llama 3.1 Nemotron 70B (33.6%), Ministral 3 14B (10.0%).
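The quadrant split described above is a simple median partition; a minimal sketch, assuming each model is a (cost, score) pair and drawing the quadrant lines with `statistics.median` (the model names below are made up for illustration):

```python
from statistics import median

def quadrants(models):
    """Classify models into four quadrants at the median cost and median score.

    `models` maps name -> (cost_usd, score_pct). Only models with cost data
    belong here, mirroring the chart above.
    """
    costs = [c for c, _ in models.values()]
    scores = [s for _, s in models.values()]
    cost_med, score_med = median(costs), median(scores)
    out = {}
    for name, (cost, score) in models.items():
        side = "cheap" if cost <= cost_med else "pricey"   # left/right of vertical line
        level = "high" if score >= score_med else "low"    # above/below horizontal line
        out[name] = f"{side}/{level}"
    return out

# Hypothetical sample data, not taken from the tables on this page.
sample = {
    "Model A": (0.0001, 99),
    "Model B": (0.0010, 95),
    "Model C": (0.0002, 70),
}
print(quadrants(sample))
# -> {'Model A': 'cheap/high', 'Model B': 'pricey/high', 'Model C': 'cheap/low'}
```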
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Gemma 4 31B | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview) | 100% | 100% | 100% | |
| DeepSeek-V2 Chat | 100% | 100% | 100% | |
| Stealth: Aurora Alpha | 100% | 100% | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | 100% | 100% | |
| GPT-4o Mini (temp=1) | 100% | 100% | 100% | |
| GPT-4o Mini (temp=0) | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | 99% | 99% | |
| Z.AI GLM 5 Turbo | 100% | 98% | 98% | |
| GPT-5.5 (Reasoning) | 99% | 98% | 98% | |
| Claude Opus 4.7 | 100% | 98% | 98% | |
| Z.AI GLM 4.5 | 100% | 97% | 97% | |
| o4 Mini High | 100% | 97% | 97% | |
| Inception Mercury 2 | 100% | 96% | 96% | |
| Claude Opus 4.5 | 99% | 96% | 96% | |
| GPT-5.5 | 98% | 97% | 96% | |
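The stability metric used in the table above is defined as median × consistency. The page does not spell out how consistency is computed; the sketch below assumes it decreases with score spread across repeated runs, which reproduces the pattern where identical runs yield 100%:

```python
from statistics import median, pstdev

def stability(run_scores):
    """Stability = median score x consistency.

    Assumption: consistency is modeled here as 1 - (population stdev / 100),
    i.e. less spread across repeated runs -> higher consistency. The page
    does not publish its exact consistency formula.
    """
    med = median(run_scores)                     # typical performance (0-100)
    consistency = 1 - pstdev(run_scores) / 100   # 1.0 when every run is identical
    return med * consistency

# Identical runs: stability equals the median itself.
print(stability([100, 100, 100]))
```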
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 2.0s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 4.8s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 5.6s | 100% | |
| Inception Mercury 2 | 100% | $0.0006 | 1.4s | 96% | |
| Gemini 3 Flash (Preview) | 100% | $0.0020 | 5.6s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0022 | 3.5s | 99% | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 16.1s | 100% | |
| GPT-4.1 Mini | 99% | $0.0004 | 3.4s | 93% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | $0.0052 | 6.1s | 100% | |
| Z.AI GLM 4.5 | 100% | $0.0013 | 14.5s | 97% | |
| Z.AI GLM 5 Turbo | 100% | $0.0037 | 14.7s | 98% | |
| Hermes 3 405B | 99% | $0.0000 | 21.0s | 94% | |
| Gemma 4 31B | 100% | $0.0003 | 32.1s | 100% | |
| GPT-5.4 Mini | 97% | $0.0020 | 2.5s | 87% | |
| GPT-4o, Aug. 6th (temp=1) | 99% | $0.0056 | 6.5s | 94% | |
| GPT-OSS 120B | 99% | $0.0003 | 28.0s | 96% | |
| Gemini 3.1 Flash Lite (Reasoning) | 98% | $0.0011 | 5.3s | 84% | |
| Nemotron 3 Super | 98% | $0.0000 | 21.7s | 91% | |
| o4 Mini | 100% | $0.0071 | 16.7s | 100% | |
| GPT-5.4 Nano (Reasoning) | 98% | $0.0016 | 6.1s | 83% | |
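The composite ranking above folds performance, cost, speed, and stability into one number. The exact weighting is not published on this page; the sketch below assumes an equal-weight average, with cost and latency mapped onto a 0–100 scale via illustrative caps (`max_cost` and `max_time` are assumptions, not values from the benchmark):

```python
def composite(score, cost, time_s, stability,
              max_cost=0.01, max_time=60.0):
    """Hypothetical composite of the four ranking inputs.

    score and stability are already 0-100; cost (USD) and time (seconds)
    are inverted and rescaled so that cheaper and faster -> higher.
    """
    cost_part = max(0.0, 1 - cost / max_cost) * 100    # free -> 100, at/over cap -> 0
    speed_part = max(0.0, 1 - time_s / max_time) * 100  # instant -> 100, at/over cap -> 0
    return (score + stability + cost_part + speed_part) / 4

# A free, instant, perfectly stable perfect scorer maxes out the composite.
print(composite(100, 0.0, 0.0, 100))
```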
| Model | Total ▼ | Character dialogue (Spanish) in a story | Character dialogue (French) in a story | Character dialogue (German) in a story | Character dialogue (Italian) in a story | Character dialogue (Hindi) in a story |
|---|---|---|---|---|---|---|
| Qwen3.6 Max Preview | 100% | 100% | 100% | 100% | 100% | 100% |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | 100% | 100% | 100% |
| Claude Sonnet 4.6 | 100% | 100% | 100% | 100% | 100% | 100% |
| o4 Mini | 100% | 100% | 100% | 100% | 100% | 100% |
| Gemma 4 31B | 100% | 100% | 100% | 100% | 100% | 100% |
| Gemini 3 Flash (Preview) | 100% | 100% | 100% | 100% | 100% | 100% |
| DeepSeek-V2 Chat | 100% | 100% | 100% | 100% | 100% | 100% |
| Stealth: Aurora Alpha | 100% | 100% | 100% | 100% | 100% | 100% |
| GPT-4o, Aug. 6th (temp=0) | 100% | 100% | 100% | 100% | 100% | 100% |
| GPT-4o Mini (temp=1) | 100% | 100% | 100% | 100% | 100% | 100% |
| GPT-4o Mini (temp=0) | 100% | 100% | 100% | 100% | 100% | 100% |
| GPT-5.4 Mini (Reasoning, Low) | 100% | 100% | 100% | 99% | 100% | 100% |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | 100% | 99% | 100% |
| Z.AI GLM 4.5 | 100% | 100% | 100% | 100% | 98% | 100% |
| Claude Opus 4.7 | 100% | 99% | 100% | 99% | 100% | 100% |
Character dialogue (Spanish) in a story
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.7s | |
| Inception Mercury | 90% | $0.0002 | 1.6s | |
| Llama 3.1 8B | 78% | $0.0001 | 3.0s | |
| Ministral 3 3B | 60% | $0.0001 | 2.3s | |
| GPT-4.1 Nano | 100% | $0.0001 | 3.7s | |
| Inception Mercury 2 | 100% | $0.0005 | 1.4s | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 3.6s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 4.5s | |
| GPT-4.1 Mini | 100% | $0.0004 | 3.4s | |
| Arcee AI: Trinity Mini | 85% | $0.0002 | 5.0s | |
| Grok 4.3 | 100% | $0.0008 | 2.8s | |
| Claude 3 Haiku | 98% | $0.0007 | 4.0s | |
| Stealth: Healer Alpha | 99% | $0.0000 | 16.1s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 3.5s | |
| Nemotron 3 Super | 100% | $0.0000 | 12.5s | |
| Mistral Small 3.2 24B | 100% | $0.0003 | 15.1s | |
| Gemini 3.1 Flash Lite | 100% | $0.0010 | 3.6s | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0011 | 3.6s | |
| DeepSeek V4 Flash (Reasoning) | 100% | $0.0002 | 13.1s | |
| Gemma 4 26B | 100% | $0.0003 | 28.0s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Grok 4.20 (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.7s | 100% | |
| Inception Mercury 2 | 100% | $0.0005 | 1.4s | 100% | |
| GPT-4.1 Nano | 100% | $0.0001 | 3.7s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 3.6s | 100% | |
| GPT-4.1 Mini | 100% | $0.0004 | 3.4s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 4.5s | 100% | |
| Grok 4.3 | 100% | $0.0008 | 2.8s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 3.5s | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0010 | 3.6s | 100% | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0011 | 3.6s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0020 | 3.5s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0019 | 5.9s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 12.5s | 100% | |
| DeepSeek V4 Flash (Reasoning) | 100% | $0.0002 | 13.1s | 100% | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 13.9s | 100% | |
| GPT-5.4 Nano (Reasoning) | 99% | $0.0014 | 5.3s | 98% | |
| GPT-5.4 Nano (Reasoning, Low) | 99% | $0.0017 | 6.4s | 98% | |
| Mistral Small 3.2 24B | 100% | $0.0003 | 15.1s | 100% | |
| Grok 4.20 | 100% | $0.0019 | 9.8s | 100% | |
| GPT-OSS 120B | 100% | $0.0002 | 17.8s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 98.8% | Parse dialogue | | |
Character dialogue (French) in a story
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.7s | |
| Inception Mercury | 100% | $0.0002 | 1.2s | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 4.3s | |
| Mistral NeMo | 60% | $0.0001 | 4.9s | |
| Llama 3.1 8B | 78% | $0.0001 | 4.6s | |
| Inception Mercury 2 | 100% | $0.0006 | 1.4s | |
| GPT-4.1 Nano | 93% | $0.0001 | 4.9s | |
| Arcee AI: Trinity Mini | 100% | $0.0002 | 6.1s | |
| GPT-4.1 Mini | 96% | $0.0005 | 3.4s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 8.0s | |
| Grok 4.3 | 100% | $0.0008 | 3.5s | |
| DeepSeek V4 Flash | 100% | $0.0002 | 13.2s | |
| DeepSeek V4 Flash (Reasoning) | 80% | $0.0002 | 10.6s | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0011 | 3.5s | |
| DeepSeek V3 (2025-03-24) | 100% | $0.0005 | 13.9s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 3.7s | |
| Nemotron 3 Nano | 97% | $0.0002 | 9.9s | |
| Grok 4 Fast | 96% | $0.0005 | 7.0s | |
| Gemma 3 4B | 96% | $0.0001 | 13.9s | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 13.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Grok 4.20 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.7s | 100% | |
| Inception Mercury | 100% | $0.0002 | 1.2s | 100% | |
| Inception Mercury 2 | 100% | $0.0006 | 1.4s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 4.3s | 100% | |
| Arcee AI: Trinity Mini | 100% | $0.0002 | 6.1s | 100% | |
| Grok 4.3 | 100% | $0.0008 | 3.5s | 100% | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0011 | 3.5s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 3.7s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 8.0s | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0011 | 5.2s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0015 | 6.1s | 100% | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 13.4s | 100% | |
| DeepSeek V4 Flash | 100% | $0.0002 | 13.2s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0023 | 3.4s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0020 | 5.5s | 100% | |
| GPT-5.4 Nano | 100% | $0.0018 | 6.8s | 100% | |
| DeepSeek V3 (2025-03-24) | 100% | $0.0005 | 13.9s | 100% | |
| Hermes 3 70B | 100% | $0.0003 | 15.8s | 100% | |
| Z.AI GLM 4.5 Air | 100% | $0.0007 | 13.6s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0015 | 6.2s | 99% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 99.7% | Parse dialogue | | |
Character dialogue (German) in a story
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.7s | |
| Inception Mercury | 98% | $0.0002 | 1.5s | |
| GPT-4.1 Nano | 92% | $0.0001 | 2.8s | |
| Inception Mercury 2 | 100% | $0.0006 | 1.5s | |
| GPT-4.1 Mini | 100% | $0.0004 | 2.6s | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 4.3s | |
| Nemotron 3 Super | 98% | $0.0000 | 9.2s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 5.4s | |
| Gemini 2.5 Flash Lite | 98% | $0.0005 | 4.6s | |
| Arcee AI: Trinity Mini | 100% | $0.0004 | 15.0s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 3.7s | |
| Gemini 3.1 Flash Lite | 100% | $0.0011 | 3.7s | |
| Grok 4.3 | 98% | $0.0010 | 4.3s | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0011 | 6.3s | |
| Arcee AI: Trinity Large (Preview) | 93% | $0.0000 | 14.1s | |
| Mistral Small 3.2 24B | 80% | $0.0003 | 9.3s | |
| GPT-5.4 Mini | 100% | $0.0020 | 2.4s | |
| Gemini 3 Flash (Preview) | 100% | $0.0016 | 4.6s | |
| Nemotron 3 Nano | 100% | $0.0002 | 10.5s | |
| DeepSeek V4 Flash (Reasoning) | 100% | $0.0002 | 54.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Grok 4.20 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.7s | 100% | |
| Inception Mercury 2 | 100% | $0.0006 | 1.5s | 100% | |
| GPT-4.1 Mini | 100% | $0.0004 | 2.6s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 4.3s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 5.4s | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0011 | 3.7s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 3.7s | 100% | |
| GPT-5.4 Mini | 100% | $0.0020 | 2.4s | 100% | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0011 | 6.3s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0016 | 4.6s | 100% | |
| Nemotron 3 Nano | 100% | $0.0002 | 10.5s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0023 | 5.4s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0019 | 7.5s | 100% | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 13.8s | 100% | |
| Arcee AI: Trinity Mini | 100% | $0.0004 | 15.0s | 100% | |
| Z.AI GLM 4.5 | 100% | $0.0012 | 12.7s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 99% | $0.0025 | 4.2s | 98% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0036 | 4.7s | 100% | |
| GPT-5.4 Nano | 99% | $0.0018 | 7.0s | 98% | |
| Grok 4.20 | 100% | $0.0023 | 11.5s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 98.7% | Parse dialogue | | |
Character dialogue (Italian) in a story
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.5s | |
| Inception Mercury | 100% | $0.0002 | 1.7s | |
| GPT-4.1 Mini | 100% | $0.0004 | 2.6s | |
| Inception Mercury 2 | 98% | $0.0006 | 1.4s | |
| Mistral NeMo | 80% | $0.0001 | 4.5s | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 4.6s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 4.4s | |
| Claude 3 Haiku | 78% | $0.0005 | 3.5s | |
| GPT-4.1 Nano | 90% | $0.0001 | 4.6s | |
| Mistral Small 3.2 24B | 100% | $0.0003 | 9.5s | |
| Ministral 3 8B | 77% | $0.0002 | 6.8s | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0011 | 6.0s | |
| Grok 4.3 | 100% | $0.0010 | 4.5s | |
| Arcee AI: Trinity Mini | 60% | $0.0002 | 7.8s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0012 | 4.0s | |
| Gemini 3.1 Flash Lite | 100% | $0.0011 | 15.1s | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 12.7s | |
| DeepSeek V4 Flash | 98% | $0.0002 | 11.0s | |
| LFM2 24B | 85% | $0.0001 | 11.7s | |
| Nemotron 3 Nano | 87% | $0.0002 | 9.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| Grok 4.20 (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.5s | 100% | |
| Inception Mercury | 100% | $0.0002 | 1.7s | 100% | |
| GPT-4.1 Mini | 100% | $0.0004 | 2.6s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 4.4s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 4.6s | 100% | |
| Grok 4.3 | 100% | $0.0010 | 4.5s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0012 | 4.0s | 100% | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0011 | 6.0s | 100% | |
| GPT-5.4 Mini | 100% | $0.0022 | 2.5s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0003 | 9.5s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0022 | 3.2s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0017 | 6.7s | 100% | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 12.7s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0021 | 6.1s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0015 | 5.8s | 98% | |
| Gemini 2.5 Flash | 100% | $0.0026 | 6.1s | 100% | |
| Hermes 3 405B | 100% | $0.0000 | 18.6s | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0011 | 15.1s | 100% | |
| Inception Mercury 2 | 98% | $0.0006 | 1.4s | 93% | |
| DeepSeek V3.1 | 100% | $0.0008 | 17.3s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 98.9% | Parse dialogue | | |
Character dialogue (Hindi) in a story
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Qwen3.6 Max Preview | 100% | |
| Z.AI GLM 5 Turbo | 100% | |
| Grok 4.3 (Reasoning) | 100% | |
| GPT-5.5 (Reasoning) | 100% | |
| Claude Sonnet 4.6 | 100% | |
| o4 Mini High | 100% | |
| Claude Opus 4.7 | 100% | |
| o4 Mini | 100% | |
| Gemma 4 31B | 100% | |
| GPT-OSS 120B | 100% | |
| Z.AI GLM 4.5 | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | |
| Gemini 3 Flash (Preview) | 100% | |
| DeepSeek-V2 Chat | 100% | |
| Inception Mercury 2 | 100% | |
| Stealth: Aurora Alpha | 100% | |
| GPT-4.1 Mini | 100% | |
| GPT-5 Nano | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | |
| GPT-5.4 Mini | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 3.2s | |
| Inception Mercury | 91% | $0.0002 | 1.5s | |
| GPT-4.1 Nano | 90% | $0.0001 | 4.0s | |
| Mistral NeMo | 80% | $0.0001 | 3.7s | |
| Inception Mercury 2 | 100% | $0.0006 | 1.3s | |
| GPT-4.1 Mini | 100% | $0.0005 | 5.0s | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 7.3s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 5.9s | |
| Claude 3 Haiku | 100% | $0.0007 | 4.1s | |
| Gemini 3.1 Flash Lite | 84% | $0.0010 | 3.7s | |
| Nemotron 3 Nano | 100% | $0.0002 | 8.8s | |
| Gemini 3.1 Flash Lite (Reasoning) | 92% | $0.0010 | 6.9s | |
| Nemotron 3 Super | 96% | $0.0000 | 31.4s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0021 | 3.2s | |
| Gemma 4 31B | 100% | $0.0002 | 15.8s | |
| GPT-5.4 Mini | 100% | $0.0024 | 2.8s | |
| GPT-5.4 Nano (Reasoning) | 91% | $0.0015 | 5.7s | |
| GPT-OSS 120B | 100% | $0.0003 | 21.3s | |
| Gemini 3 Flash (Preview) | 100% | $0.0021 | 5.8s | |
| DeepSeek V3.1 | 100% | $0.0011 | 16.3s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| Claude Opus 4.7 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Gemma 4 31B | 100% | 100% | 100% | |
| GPT-OSS 120B | 100% | 100% | 100% | |
| Z.AI GLM 4.5 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview) | 100% | 100% | 100% | |
| DeepSeek-V2 Chat | 100% | 100% | 100% | |
| Inception Mercury 2 | 100% | 100% | 100% | |
| Stealth: Aurora Alpha | 100% | 100% | 100% | |
| GPT-4.1 Mini | 100% | 100% | 100% | |
| GPT-5 Nano | 100% | 100% | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | 100% | 100% | |
| GPT-5.4 Mini | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Inception Mercury 2 | 100% | $0.0006 | 1.3s | 100% | |
| Stealth: Aurora Alpha | 100% | — | 3.2s | 100% | |
| GPT-4.1 Mini | 100% | $0.0005 | 5.0s | 100% | |
| Claude 3 Haiku | 100% | $0.0007 | 4.1s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 5.9s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 7.3s | 100% | |
| Nemotron 3 Nano | 100% | $0.0002 | 8.8s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0021 | 3.2s | 100% | |
| GPT-5.4 Mini | 100% | $0.0024 | 2.8s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0021 | 5.8s | 100% | |
| Gemma 4 31B | 100% | $0.0002 | 15.8s | 100% | |
| DeepSeek V3.1 | 100% | $0.0011 | 16.3s | 100% | |
| GPT-OSS 120B | 100% | $0.0003 | 21.3s | 100% | |
| Z.AI GLM 4.5 | 100% | $0.0018 | 16.6s | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | $0.0055 | 6.4s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0041 | 11.7s | 100% | |
| DeepSeek-V2 Chat | 100% | $0.0002 | 26.9s | 100% | |
| Hermes 3 405B | 98% | $0.0000 | 18.9s | 90% | |
| o4 Mini | 100% | $0.0079 | 19.4s | 100% | |
| GPT-4o, Aug. 6th (temp=1) | 97% | $0.0049 | 5.6s | 89% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 75.6% | Parse dialogue | | |