Language Writing
Can the model generate text in different languages?
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 2.0s | |
| Inception Mercury | 96% | $0.0002 | 1.5s | |
| GPT-4.1 Nano | 93% | $0.0001 | 4.0s | |
| Inception Mercury 2 | 100% | $0.0006 | 1.4s | |
| GPT-4.1 Mini | 99% | $0.0004 | 3.4s | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 4.8s | |
| Mistral NeMO | 67% | $0.0001 | 4.3s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 5.6s | |
| Claude 3 Haiku | 81% | $0.0007 | 3.8s | |
| Arcee AI: Trinity Mini | 81% | $0.0002 | 15.9s | |
| Gemini 3.1 Flash Lite (Preview) | 95% | $0.0011 | 3.7s | |
| Nemotron 3 Nano | 95% | $0.0002 | 10.8s | |
| GPT-5.4 Mini | 97% | $0.0020 | 2.5s | |
| Nemotron 3 Super | 98% | $0.0000 | 21.7s | |
| Mistral Small 3.2 24B | 71% | $0.0003 | 11.0s | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 16.1s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0022 | 3.5s | |
| GPT-5.4 Nano (Reasoning) | 98% | $0.0016 | 6.1s | |
| Gemini 3 Flash (Preview) | 100% | $0.0020 | 5.6s | |
| GPT-5.4 Nano | 92% | $0.0017 | 6.6s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
11 low-scoring outliers hidden: Ministral 8B (52.8%), Rocinante 12B (51.9%), Mistral Medium 3.1 (49.0%), Mistral Small 4 (48.9%), Ministral 3 8B (47.9%), Qwen3 235B A22B Instruct 2507 (46.7%), Writer: Palmyra X5 (43.2%), Ministral 3 3B (36.2%), Mistral Small Creative (33.7%), Llama 3.1 Nemotron 70B (33.6%), Ministral 3 14B (10.0%).
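The quadrant split described above can be sketched in a few lines: each model is placed relative to the median cost and median score of the plotted set. This is a minimal illustration of the chart's logic, not the site's implementation; the tuple layout and labels are assumptions.

```python
from statistics import median

def quadrant_labels(models):
    """Classify (name, cost_usd, score) tuples into the four quadrants
    drawn at the median cost and median score. Models without cost data
    are expected to be filtered out beforehand, as in the chart."""
    cost_med = median(cost for _, cost, _ in models)
    score_med = median(score for _, _, score in models)
    labels = {}
    for name, cost, score in models:
        cheap, strong = cost <= cost_med, score >= score_med
        if cheap and strong:
            labels[name] = "cheap & strong"
        elif cheap:
            labels[name] = "cheap & weak"
        elif strong:
            labels[name] = "expensive & strong"
        else:
            labels[name] = "expensive & weak"
    return labels
```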
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview) | 100% | 100% | 100% | |
| DeepSeek-V2 Chat | 100% | 100% | 100% | |
| Stealth: Aurora Alpha | 100% | 100% | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | 100% | 100% | |
| GPT-4o Mini (temp=1) | 100% | 100% | 100% | |
| GPT-4o Mini (temp=0) | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | 99% | 99% | |
| Z.AI GLM 5 Turbo | 100% | 98% | 98% | |
| Z.AI GLM 4.5 | 100% | 97% | 97% | |
| o4 Mini High | 100% | 97% | 97% | |
| Inception Mercury 2 | 100% | 96% | 96% | |
| Claude Opus 4.5 | 99% | 96% | 96% | |
| GPT-5 Nano | 99% | 96% | 96% | |
| GPT-4o, Aug. 6th (temp=1) | 99% | 94% | 94% | |
| Hermes 3 405B | 99% | 94% | 94% | |
| GPT-4.1 Mini | 99% | 93% | 93% | |
| Nemotron 3 Super | 98% | 91% | 91% | |
| GPT-4o, May 13th (temp=0) | 97% | 89% | 89% | |
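The stability metric used in the ranking above is stated as median × consistency. A minimal sketch follows; note that the consistency term here (1 minus the population standard deviation of the run scores) is an assumption, since the page does not define it precisely.

```python
from statistics import median, pstdev

def stability(run_scores):
    """Stability as median score times consistency.
    Consistency is modelled as 1 - population std-dev of the run scores
    (an assumed definition). Scores are fractions in [0, 1]."""
    consistency = max(0.0, 1.0 - pstdev(run_scores))
    return median(run_scores) * consistency
```

For a model scoring 100% on every run this yields a stability of 1.0; any spread between runs pulls both factors down.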
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 2.0s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 4.8s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 5.6s | 100% | |
| Inception Mercury 2 | 100% | $0.0006 | 1.4s | 96% | |
| Gemini 3 Flash (Preview) | 100% | $0.0020 | 5.6s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0022 | 3.5s | 99% | |
| GPT-4.1 Mini | 99% | $0.0004 | 3.4s | 93% | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 16.1s | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | $0.0052 | 6.1s | 100% | |
| Z.AI GLM 4.5 | 100% | $0.0013 | 14.5s | 97% | |
| Z.AI GLM 5 Turbo | 100% | $0.0037 | 14.7s | 98% | |
| Hermes 3 405B | 99% | $0.0000 | 21.0s | 94% | |
| GPT-5.4 Mini | 97% | $0.0020 | 2.5s | 87% | |
| GPT-4o, Aug. 6th (temp=1) | 99% | $0.0056 | 6.5s | 94% | |
| o4 Mini | 100% | $0.0071 | 16.7s | 100% | |
| Nemotron 3 Super | 98% | $0.0000 | 21.7s | 91% | |
| GPT-5.4 Nano (Reasoning) | 98% | $0.0016 | 6.1s | 83% | |
| Grok 4.20 (Beta) | 97% | $0.0034 | 3.3s | 84% | |
| Gemini 2.5 Flash | 97% | $0.0026 | 6.0s | 83% | |
| Claude Sonnet 4.6 | 100% | $0.010 | 13.7s | 100% | |
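The composite ranking above combines performance, cost, speed, and stability, but the page does not publish its weighting. The sketch below is purely illustrative: the caps (`max_cost`, `max_time`) and the weight vector are assumptions, not the site's formula.

```python
def composite(score, stability, cost_usd, time_s,
              max_cost=0.01, max_time=30.0, w=(0.4, 0.2, 0.2, 0.2)):
    """Hypothetical composite score. Cheaper and faster are better, so
    cost and time are inverted after normalising against illustrative
    caps; all four terms then get an assumed weighting."""
    cost_term = 1.0 - min(cost_usd / max_cost, 1.0)   # cheaper is better
    speed_term = 1.0 - min(time_s / max_time, 1.0)    # faster is better
    return w[0] * score + w[1] * cost_term + w[2] * speed_term + w[3] * stability
```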
| Model | Total ▼ | Character dialogue (Spanish) in a story | Character dialogue (French) in a story | Character dialogue (German) in a story | Character dialogue (Italian) in a story | Character dialogue (Hindi) in a story |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 100% | 100% | 100% | 100% | 100% | 100% |
| o4 Mini | 100% | 100% | 100% | 100% | 100% | 100% |
| Gemini 3 Flash (Preview) | 100% | 100% | 100% | 100% | 100% | 100% |
| DeepSeek-V2 Chat | 100% | 100% | 100% | 100% | 100% | 100% |
| Stealth: Aurora Alpha | 100% | 100% | 100% | 100% | 100% | 100% |
| GPT-4o, Aug. 6th (temp=0) | 100% | 100% | 100% | 100% | 100% | 100% |
| GPT-4o Mini (temp=1) | 100% | 100% | 100% | 100% | 100% | 100% |
| GPT-4o Mini (temp=0) | 100% | 100% | 100% | 100% | 100% | 100% |
| GPT-5.4 Mini (Reasoning, Low) | 100% | 100% | 100% | 99% | 100% | 100% |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | 100% | 99% | 100% |
| Z.AI GLM 4.5 | 100% | 100% | 100% | 100% | 98% | 100% |
| Inception Mercury 2 | 100% | 100% | 100% | 100% | 98% | 100% |
| o4 Mini High | 100% | 99% | 100% | 99% | 100% | 100% |
| GPT-4o, Aug. 6th (temp=1) | 99% | 100% | 100% | 100% | 100% | 97% |
| GPT-5 Nano | 99% | 98% | 99% | 100% | 100% | 100% |
Character dialogue (Spanish) in a story
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | |
| Z.AI GLM 5 Turbo | 100% | |
| GPT-5 Mini | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| Qwen 3.5 27B | 100% | |
| ByteDance Seed 1.6 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| Claude Opus 4.5 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Z.AI GLM 4.7 | 100% | |
| GPT-4.1 | 100% | |
| o4 Mini | 100% | |
| Grok 4 | 100% | |
| Claude Sonnet 4.5 | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.7s | |
| Inception Mercury | 90% | $0.0002 | 1.6s | |
| Llama 3.1 8B | 78% | $0.0001 | 3.0s | |
| Ministral 3 3B | 60% | $0.0001 | 2.3s | |
| GPT-4.1 Nano | 100% | $0.0001 | 3.7s | |
| Inception Mercury 2 | 100% | $0.0005 | 1.4s | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 3.6s | |
| GPT-4.1 Mini | 100% | $0.0004 | 3.4s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 4.5s | |
| Arcee AI: Trinity Mini | 85% | $0.0002 | 5.0s | |
| Claude 3 Haiku | 98% | $0.0007 | 4.0s | |
| Stealth: Healer Alpha | 99% | $0.0000 | 16.1s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 3.5s | |
| Nemotron 3 Super | 100% | $0.0000 | 12.5s | |
| Mistral Small 3.2 24B | 100% | $0.0003 | 15.1s | |
| Z.AI GLM 5 Turbo | 100% | $0.0027 | 10.2s | |
| WizardLM 2 8x22b | 73% | $0.0006 | 10.7s | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 13.9s | |
| Grok 4.1 Fast | 96% | $0.0006 | 11.7s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0020 | 3.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Grok 4 | 100% | 100% | 100% | |
| Claude Sonnet 4.5 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.7s | 100% | |
| Inception Mercury 2 | 100% | $0.0005 | 1.4s | 100% | |
| GPT-4.1 Nano | 100% | $0.0001 | 3.7s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 3.6s | 100% | |
| GPT-4.1 Mini | 100% | $0.0004 | 3.4s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 4.5s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 3.5s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0020 | 3.5s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0019 | 5.9s | 100% | |
| Nemotron 3 Super | 100% | $0.0000 | 12.5s | 100% | |
| GPT-5.4 Nano (Reasoning) | 99% | $0.0014 | 5.3s | 98% | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 13.9s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 99% | $0.0017 | 6.4s | 98% | |
| Mistral Small 3.2 24B | 100% | $0.0003 | 15.1s | 100% | |
| GPT-5.4 Nano | 99% | $0.0016 | 6.5s | 97% | |
| Mistral Large | 100% | $0.0030 | 8.8s | 100% | |
| GPT-4.1 | 100% | $0.0041 | 5.7s | 100% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0006 | 16.6s | 100% | |
| Z.AI GLM 4.5 | 100% | $0.0010 | 15.3s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0027 | 10.2s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 98.0% | Parse dialogue | | |
Character dialogue (French) in a story
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.7s | |
| Inception Mercury | 100% | $0.0002 | 1.2s | |
| Inception Mercury 2 | 100% | $0.0006 | 1.4s | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 4.3s | |
| Mistral NeMO | 60% | $0.0001 | 4.9s | |
| Llama 3.1 8B | 78% | $0.0001 | 4.6s | |
| GPT-4.1 Mini | 96% | $0.0005 | 3.4s | |
| Arcee AI: Trinity Mini | 100% | $0.0002 | 6.1s | |
| GPT-4.1 Nano | 93% | $0.0001 | 4.9s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 3.7s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 8.0s | |
| GPT-5.4 Mini | 96% | $0.0018 | 2.3s | |
| DeepSeek V3 (2025-03-24) | 100% | $0.0005 | 13.9s | |
| Grok 4 Fast | 96% | $0.0005 | 7.0s | |
| Nemotron 3 Nano | 97% | $0.0002 | 9.9s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0015 | 6.1s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0023 | 3.4s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0015 | 6.2s | |
| Gemini 3 Flash (Preview) | 100% | $0.0020 | 5.5s | |
| Grok 4.1 Fast | 96% | $0.0006 | 11.6s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.7s | 100% | |
| Inception Mercury | 100% | $0.0002 | 1.2s | 100% | |
| Inception Mercury 2 | 100% | $0.0006 | 1.4s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 4.3s | 100% | |
| Arcee AI: Trinity Mini | 100% | $0.0002 | 6.1s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 3.7s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 8.0s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0023 | 3.4s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0015 | 6.1s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0020 | 5.5s | 100% | |
| GPT-5.4 Nano | 100% | $0.0018 | 6.8s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0015 | 6.2s | 99% | |
| Grok 4.20 (Beta) | 100% | $0.0033 | 3.4s | 100% | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 13.4s | 100% | |
| DeepSeek V3 (2025-03-24) | 100% | $0.0005 | 13.9s | 100% | |
| Hermes 3 70B | 100% | $0.0003 | 15.8s | 100% | |
| Stealth: Hunter Alpha | 100% | $0.0000 | 18.6s | 100% | |
| Z.AI GLM 4.5 | 100% | $0.0013 | 16.0s | 100% | |
| Hermes 3 405B | 100% | $0.0000 | 20.9s | 100% | |
| GPT-4.1 | 100% | $0.0042 | 8.2s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 98.8% | Parse dialogue | | |
Character dialogue (German) in a story
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Z.AI GLM 5 Turbo | 100% | |
| GPT-5 Mini | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Qwen 3.5 122B | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| Qwen 3.5 27B | 100% | |
| ByteDance Seed 1.6 | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | |
| Z.AI GLM 4.7 | 100% | |
| GPT-4.1 | 100% | |
| o4 Mini | 100% | |
| Grok 4 | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.7s | |
| Inception Mercury | 98% | $0.0002 | 1.5s | |
| GPT-4.1 Nano | 92% | $0.0001 | 2.8s | |
| Inception Mercury 2 | 100% | $0.0006 | 1.5s | |
| GPT-4.1 Mini | 100% | $0.0004 | 2.6s | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 4.3s | |
| Nemotron 3 Super | 98% | $0.0000 | 9.2s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 5.4s | |
| Gemini 2.5 Flash Lite | 98% | $0.0005 | 4.6s | |
| Arcee AI: Trinity Mini | 100% | $0.0004 | 15.0s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 3.7s | |
| Arcee AI: Trinity Large (Preview) | 93% | $0.0000 | 14.1s | |
| Mistral Small 3.2 24B | 80% | $0.0003 | 9.3s | |
| GPT-5.4 Mini | 100% | $0.0020 | 2.4s | |
| Gemini 3 Flash (Preview) | 100% | $0.0016 | 4.6s | |
| Nemotron 3 Nano | 100% | $0.0002 | 10.5s | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 13.8s | |
| Gemma 3 4B | 96% | $0.0001 | 12.8s | |
| Hermes 3 70B | 72% | $0.0002 | 13.5s | |
| GPT-5.4 Mini (Reasoning, Low) | 99% | $0.0025 | 4.2s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Grok 4 | 100% | 100% | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.7s | 100% | |
| Inception Mercury 2 | 100% | $0.0006 | 1.5s | 100% | |
| GPT-4.1 Mini | 100% | $0.0004 | 2.6s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 4.3s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 5.4s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 3.7s | 100% | |
| GPT-5.4 Mini | 100% | $0.0020 | 2.4s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0016 | 4.6s | 100% | |
| Nemotron 3 Nano | 100% | $0.0002 | 10.5s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0023 | 5.4s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0019 | 7.5s | 100% | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 13.8s | 100% | |
| Arcee AI: Trinity Mini | 100% | $0.0004 | 15.0s | 100% | |
| Z.AI GLM 4.5 | 100% | $0.0012 | 12.7s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 99% | $0.0025 | 4.2s | 98% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0036 | 4.7s | 100% | |
| GPT-5.4 Nano | 99% | $0.0018 | 7.0s | 98% | |
| Inception Mercury | 98% | $0.0002 | 1.5s | 91% | |
| GPT-4.1 | 100% | $0.0042 | 6.4s | 100% | |
| Mistral Large | 100% | $0.0032 | 9.8s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 98.3% | Parse dialogue | | |
Character dialogue (Italian) in a story
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| GPT-5.4 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| Claude Opus 4.6 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| Qwen 3.5 27B | 100% | |
| ByteDance Seed 1.6 | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| o4 Mini High | 100% | |
| Aion 2.0 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Z.AI GLM 4.7 | 100% | |
| GPT-4.1 | 100% | |
| o4 Mini | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.5s | |
| Inception Mercury | 100% | $0.0002 | 1.7s | |
| GPT-4.1 Mini | 100% | $0.0004 | 2.6s | |
| Inception Mercury 2 | 98% | $0.0006 | 1.4s | |
| Mistral NeMO | 80% | $0.0001 | 4.5s | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 4.6s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 4.4s | |
| Claude 3 Haiku | 78% | $0.0005 | 3.5s | |
| GPT-4.1 Nano | 90% | $0.0001 | 4.6s | |
| Mistral Small 3.2 24B | 100% | $0.0003 | 9.5s | |
| Ministral 3 8B | 77% | $0.0002 | 6.8s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0012 | 4.0s | |
| Arcee AI: Trinity Mini | 60% | $0.0002 | 7.8s | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 12.7s | |
| LFM2 24B | 85% | $0.0001 | 11.7s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0022 | 3.2s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0015 | 5.8s | |
| Nemotron 3 Nano | 87% | $0.0002 | 9.4s | |
| Grok 4 Fast | 96% | $0.0006 | 7.7s | |
| GPT-5.4 Mini | 100% | $0.0022 | 2.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 1.5s | 100% | |
| Inception Mercury | 100% | $0.0002 | 1.7s | 100% | |
| GPT-4.1 Mini | 100% | $0.0004 | 2.6s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 4.4s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 4.6s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0012 | 4.0s | 100% | |
| GPT-5.4 Mini | 100% | $0.0022 | 2.5s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0003 | 9.5s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0022 | 3.2s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0017 | 6.7s | 100% | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 12.7s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0021 | 6.1s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0015 | 5.8s | 98% | |
| Gemini 2.5 Flash | 100% | $0.0026 | 6.1s | 100% | |
| Hermes 3 405B | 100% | $0.0000 | 18.6s | 100% | |
| Inception Mercury 2 | 98% | $0.0006 | 1.4s | 93% | |
| DeepSeek V3.1 | 100% | $0.0008 | 17.3s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0041 | 7.6s | 100% | |
| GPT-4.1 | 100% | $0.0043 | 7.5s | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | $0.0053 | 5.5s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 98.5% | Parse dialogue | | |
Character dialogue (Hindi) in a story
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Z.AI GLM 5 Turbo | 100% | |
| Claude Sonnet 4.6 | 100% | |
| o4 Mini High | 100% | |
| o4 Mini | 100% | |
| Z.AI GLM 4.5 | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | |
| Gemini 3 Flash (Preview) | 100% | |
| DeepSeek-V2 Chat | 100% | |
| Inception Mercury 2 | 100% | |
| Stealth: Aurora Alpha | 100% | |
| GPT-4.1 Mini | 100% | |
| GPT-5 Nano | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | |
| GPT-5.4 Mini | 100% | |
| DeepSeek V3.1 | 100% | |
| GPT-4o Mini (temp=1) | 100% | |
| GPT-4o Mini (temp=0) | 100% | |
| Nemotron 3 Nano | 100% | |
| Claude 3 Haiku | 100% | |
| Claude Opus 4.5 | 99% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 100% | — | 3.2s | |
| Inception Mercury | 91% | $0.0002 | 1.5s | |
| GPT-4.1 Nano | 90% | $0.0001 | 4.0s | |
| Inception Mercury 2 | 100% | $0.0006 | 1.3s | |
| Mistral NeMO | 80% | $0.0001 | 3.7s | |
| GPT-4.1 Mini | 100% | $0.0005 | 5.0s | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 7.3s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 5.9s | |
| Claude 3 Haiku | 100% | $0.0007 | 4.1s | |
| Nemotron 3 Nano | 100% | $0.0002 | 8.8s | |
| Nemotron 3 Super | 96% | $0.0000 | 31.4s | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0021 | 3.2s | |
| GPT-5.4 Mini | 100% | $0.0024 | 2.8s | |
| GPT-5.4 Nano (Reasoning) | 91% | $0.0015 | 5.7s | |
| Gemini 3 Flash (Preview) | 100% | $0.0021 | 5.8s | |
| GPT-5.4 Nano | 81% | $0.0016 | 6.1s | |
| DeepSeek V3.1 | 100% | $0.0011 | 16.3s | |
| Hermes 3 405B | 98% | $0.0000 | 18.9s | |
| Z.AI GLM 4.5 | 100% | $0.0018 | 16.6s | |
| Grok 4.20 (Beta) | 92% | $0.0038 | 3.3s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| o4 Mini | 100% | 100% | 100% | |
| Z.AI GLM 4.5 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview) | 100% | 100% | 100% | |
| DeepSeek-V2 Chat | 100% | 100% | 100% | |
| Inception Mercury 2 | 100% | 100% | 100% | |
| Stealth: Aurora Alpha | 100% | 100% | 100% | |
| GPT-4.1 Mini | 100% | 100% | 100% | |
| GPT-5 Nano | 100% | 100% | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | 100% | 100% | |
| GPT-5.4 Mini | 100% | 100% | 100% | |
| DeepSeek V3.1 | 100% | 100% | 100% | |
| GPT-4o Mini (temp=1) | 100% | 100% | 100% | |
| GPT-4o Mini (temp=0) | 100% | 100% | 100% | |
| Nemotron 3 Nano | 100% | 100% | 100% | |
| Claude 3 Haiku | 100% | 100% | 100% | |
| Claude Opus 4.5 | 99% | 96% | 96% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Inception Mercury 2 | 100% | $0.0006 | 1.3s | 100% | |
| Stealth: Aurora Alpha | 100% | — | 3.2s | 100% | |
| Claude 3 Haiku | 100% | $0.0007 | 4.1s | 100% | |
| GPT-4.1 Mini | 100% | $0.0005 | 5.0s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 5.9s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 7.3s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0021 | 3.2s | 100% | |
| Nemotron 3 Nano | 100% | $0.0002 | 8.8s | 100% | |
| GPT-5.4 Mini | 100% | $0.0024 | 2.8s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0021 | 5.8s | 100% | |
| DeepSeek V3.1 | 100% | $0.0011 | 16.3s | 100% | |
| Z.AI GLM 4.5 | 100% | $0.0018 | 16.6s | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | $0.0055 | 6.4s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.0041 | 11.7s | 100% | |
| DeepSeek-V2 Chat | 100% | $0.0002 | 26.9s | 100% | |
| Hermes 3 405B | 98% | $0.0000 | 18.9s | 90% | |
| o4 Mini | 100% | $0.0079 | 19.4s | 100% | |
| GPT-4o, Aug. 6th (temp=1) | 97% | $0.0049 | 5.6s | 89% | |
| GPT-4o, May 13th (temp=0) | 98% | $0.0091 | 7.3s | 94% | |
| Claude Sonnet 4.6 | 100% | $0.013 | 16.2s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 74.6% | Parse dialogue | | |