Dialogue tags
Tasks that test control of dialogue in generated text: writing unattributed dialogue and hitting target dialogue-to-narration ratios at fixed word counts.
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-5 Mini | 83% | $0.0096 | 52.2s | |
| Z.AI GLM 5 Turbo | 87% | $0.030 | 1.3m | |
| Inception Mercury 2 | 71% | $0.0037 | 6.1s | |
| GPT-5 | 84% | $0.049 | 1.4m | |
| Qwen 3.5 27B | 66% | $0.023 | 1.8m | |
| Claude Opus 4.6 | 73% | $0.013 | 14.7s | |
| GPT-5.4 (Reasoning) | 62% | $0.027 | 36.1s | |
| GPT-5.5 (Reasoning) | 62% | $0.043 | 26.8s | |
| Gemini 3 Flash (Preview, Reasoning) | 61% | $0.017 | 30.3s | |
| Nemotron 3 Super | 74% | $0.0000 | 2.5m | |
| Qwen 3.6 35B | 63% | $0.012 | 1.2m | |
| Claude Opus 4.6 (Reasoning) | 80% | $0.070 | 37.7s | |
| Gemini 3.1 Pro (Preview) | 99% | $0.135 | 1.9m | |
| o4 Mini High | 79% | $0.045 | 1.8m | |
| Grok 4.3 (Reasoning) | 75% | $0.035 | 3.0m | |
| GPT-5.1 | 63% | $0.027 | 47.2s | |
| Claude Sonnet 4.6 | 67% | $0.0074 | 12.1s | |
| Z.AI GLM 5.1 | 83% | $0.042 | 3.5m | |
| GPT-5.5 | 59% | $0.019 | 17.7s | |
| GPT-5.2 | 60% | $0.024 | 34.2s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
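The quadrant split described above can be sketched as follows. This is a hypothetical illustration, not the site's actual plotting code: each model is classified against the median cost and median score of the cohort.

```python
# Sketch of the cost-vs-performance quadrant split: quadrant lines sit at
# the median cost and median score, so each model lands in one of four
# quadrants (cheap/strong, cheap/weak, expensive/strong, expensive/weak).
from statistics import median

def quadrant(models):
    """models: list of (name, cost_usd, score) tuples with cost data available."""
    cost_med = median(cost for _, cost, _ in models)
    score_med = median(score for _, _, score in models)
    out = {}
    for name, cost, score in models:
        horiz = "cheap" if cost <= cost_med else "expensive"
        vert = "strong" if score >= score_med else "weak"
        out[name] = f"{horiz}/{vert}"
    return out
```

Models without cost data would simply be filtered out before the medians are taken, matching the note above.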
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 99% | 90% | 90% | |
| GPT-5 Mini | 83% | 48% | 48% | |
| Z.AI GLM 5 Turbo | 87% | 48% | 48% | |
| o4 Mini High | 79% | 47% | 45% | |
| MiniMax M2.7 | 81% | 45% | 45% | |
| Claude Opus 4.6 (Reasoning) | 80% | 45% | 44% | |
| GPT-5 | 84% | 43% | 43% | |
| Z.AI GLM 5.1 | 83% | 43% | 43% | |
| MoonshotAI: Kimi K2.6 | 80% | 41% | 41% | |
| Claude Sonnet 4.6 (Reasoning) | 75% | 40% | 38% | |
| Claude Opus 4.6 | 73% | 42% | 35% | |
| Claude Sonnet 4.6 | 67% | 50% | 35% | |
| Nemotron 3 Super | 74% | 36% | 34% | |
| MiniMax M2.5 | 74% | 43% | 33% | |
| Grok 4.3 (Reasoning) | 75% | 33% | 33% | |
| Inception Mercury 2 | 71% | 38% | 31% | |
| Qwen3.6 Max Preview | 69% | 27% | 25% | |
| o4 Mini | 68% | 32% | 23% | |
| Gemini 3.1 Flash Lite (Reasoning) | 52% | 46% | 23% | |
| Claude Opus 4.7 (Reasoning) | 63% | 37% | 23% | |
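The "median × consistency" ranking above can be sketched in a few lines. The site does not define consistency here, so the spread-based form below is an assumption for illustration only.

```python
# Hedged sketch of the stability metric: median run score times a
# consistency factor. The consistency formula (1 minus twice the
# population std. dev.) is an assumed placeholder, not the site's rule.
from statistics import median, pstdev

def stability(run_scores):
    """run_scores: per-run scores in [0, 1] for one model on this test."""
    med = median(run_scores)
    consistency = max(0.0, 1.0 - 2 * pstdev(run_scores))  # assumed form
    return med * consistency
```

Under any such definition, a model that scores 80% every run outranks one that averages 80% by swinging between 100% and 60%, which is the behavior the table rewards.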
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| GPT-5 Mini | 83% | $0.0096 | 52.2s | 48% | |
| Inception Mercury 2 | 71% | $0.0037 | 6.1s | 31% | |
| Z.AI GLM 5 Turbo | 87% | $0.030 | 1.3m | 48% | |
| Claude Opus 4.6 | 73% | $0.013 | 14.7s | 35% | |
| Claude Sonnet 4.6 | 67% | $0.0074 | 12.1s | 35% | |
| GPT-5 | 84% | $0.049 | 1.4m | 43% | |
| Gemini 3.1 Flash Lite (Reasoning) | 52% | $0.0006 | 4.1s | 23% | |
| Gemini 3.1 Pro (Preview) | 99% | $0.135 | 1.9m | 90% | |
| Gemini 3.1 Flash Lite (Preview) | 51% | $0.0006 | 2.7s | 23% | |
| Nemotron 3 Super | 74% | $0.0000 | 2.5m | 34% | |
| Claude Opus 4.5 | 64% | $0.013 | 13.5s | 19% | |
| GPT-4o Mini (temp=0) | 55% | $0.0003 | 7.8s | 19% | |
| Claude Opus 4.7 (Reasoning) | 63% | $0.019 | 10.6s | 23% | |
| GPT-4o, Aug. 6th (temp=0) | 57% | $0.0049 | 6.0s | 18% | |
| o4 Mini High | 79% | $0.045 | 1.8m | 45% | |
| Claude Opus 4.6 (Reasoning) | 80% | $0.070 | 37.7s | 44% | |
| Gemini 3.1 Flash Lite | 48% | $0.0006 | 4.7s | 19% | |
| GPT-4o, Aug. 6th (temp=1) | 51% | $0.0050 | 6.0s | 17% | |
| Claude Opus 4.7 | 57% | $0.019 | 11.3s | 20% | |
| o4 Mini | 68% | $0.023 | 58.0s | 23% | |
Per-task scores; tasks are grouped as Ungrouped, dialogue-200, and dialogue-500.
| Model | Total ▼ | Write unattributed dialogue | Write 200 words with 10% dialogue | Write 200 words with 50% dialogue | Write 200 words with 90% dialogue | Write 500 words with 30% dialogue | Write 500 words with 50% dialogue | Write 500 words with 70% dialogue |
|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 99% | 100% | 100% | 100% | 99% | 96% | 100% | 100% |
| Z.AI GLM 5 Turbo | 87% | 100% | 100% | 98% | 69% | 97% | 95% | 50% |
| GPT-5 | 84% | 100% | 100% | 95% | 63% | 84% | 95% | 54% |
| Z.AI GLM 5.1 | 83% | 100% | 100% | 98% | 79% | 86% | 81% | 39% |
| GPT-5 Mini | 83% | 90% | 100% | 90% | 62% | 87% | 79% | 70% |
| MiniMax M2.7 | 81% | 86% | 97% | 96% | 89% | 71% | 72% | 54% |
| Claude Opus 4.6 (Reasoning) | 80% | 100% | 100% | 90% | 94% | 68% | 53% | 57% |
| MoonshotAI: Kimi K2.6 | 80% | 100% | 100% | 86% | 98% | 61% | 81% | 31% |
| o4 Mini High | 79% | 96% | 100% | 85% | 74% | 80% | 57% | 62% |
| Grok 4.3 (Reasoning) | 75% | 100% | 94% | 95% | 97% | 71% | 37% | 33% |
| Claude Sonnet 4.6 (Reasoning) | 75% | 96% | 99% | 99% | 72% | 82% | 50% | 26% |
| Nemotron 3 Super | 74% | 56% | 90% | 90% | 82% | 71% | 57% | 74% |
| MiniMax M2.5 | 74% | 86% | 76% | 98% | 62% | 74% | 63% | 55% |
| Claude Opus 4.6 | 73% | 100% | 82% | 79% | 94% | 43% | 71% | 41% |
| Inception Mercury 2 | 71% | 80% | 90% | 97% | 83% | 58% | 56% | 34% |
dialogue-200
Write 200 words with 10% dialogue
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Stealth: Aurora Alpha | 89% | — | 7.4s | |
| Inception Mercury | 85% | $0.0002 | 8.2s | |
| Inception Mercury 2 | 90% | $0.0025 | 4.0s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0041 | 24.5s | |
| Claude Opus 4.5 | 96% | $0.0077 | 9.3s | |
| GPT-5.5 | 96% | $0.0087 | 8.9s | |
| Claude Haiku 4.5 | 76% | $0.0016 | 4.4s | |
| GPT-OSS 120B | 89% | $0.0015 | 1.5m | |
| Claude Opus 4.6 | 82% | $0.0079 | 9.7s | |
| GPT-5 Mini | 100% | $0.0088 | 45.8s | |
| GPT-5.4 Nano (Reasoning, Low) | 81% | $0.0025 | 15.4s | |
| Claude Sonnet 4 | 83% | $0.0046 | 7.5s | |
| Qwen 3.6 35B | 89% | $0.0075 | 53.0s | |
| GPT-5.2 | 100% | $0.026 | 29.9s | |
| Qwen 3.6 Flash | 92% | $0.015 | 46.1s | |
| o4 Mini | 97% | $0.022 | 55.0s | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.025 | 40.7s | |
| GPT-5.4 (Reasoning, Low) | 91% | $0.025 | 23.9s | |
| Nemotron 3 Super | 90% | $0.0000 | 2.0m | |
| GPT-5.4 (Reasoning) | 100% | $0.034 | 38.8s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | 99% | 99% | |
| Qwen 3.5 397B A17B | 100% | 99% | 99% | |
| MoonshotAI: Kimi K2.6 | 100% | 99% | 99% | |
| GPT-5.2 | 100% | 99% | 99% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 99% | 99% | |
| o4 Mini High | 100% | 98% | 98% | |
| Z.AI GLM 5.1 | 100% | 97% | 97% | |
| Claude Sonnet 4.6 (Reasoning) | 99% | 95% | 95% | |
| Gemma 4 31B (Reasoning) | 99% | 94% | 94% | |
| Grok 4.20 (Reasoning) | 97% | 90% | 90% | |
| GPT-5.5 | 96% | 91% | 90% | |
| Claude Opus 4.5 | 96% | 89% | 88% | |
| MoonshotAI: Kimi K2.5 | 97% | 87% | 86% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| GPT-5.4 Nano (Reasoning) | 100% | $0.0041 | 24.5s | 99% | |
| GPT-5 Mini | 100% | $0.0088 | 45.8s | 100% | |
| GPT-5.5 | 96% | $0.0087 | 8.9s | 90% | |
| Claude Opus 4.5 | 96% | $0.0077 | 9.3s | 88% | |
| GPT-5.2 | 100% | $0.026 | 29.9s | 99% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.025 | 40.7s | 99% | |
| GPT-5.4 (Reasoning) | 100% | $0.034 | 38.8s | 100% | |
| GPT-5.1 | 100% | $0.032 | 49.0s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.026 | 1.1m | 100% | |
| Claude Opus 4.7 | 88% | $0.011 | 7.5s | 82% | |
| Claude Sonnet 4.6 | 87% | $0.0045 | 8.0s | 71% | |
| o4 Mini High | 100% | $0.036 | 1.5m | 98% | |
| GPT-5 | 100% | $0.046 | 1.3m | 100% | |
| o4 Mini | 97% | $0.022 | 55.0s | 80% | |
| Inception Mercury 2 | 90% | $0.0025 | 4.0s | 60% | |
| Grok 4.20 (Reasoning) | 97% | $0.022 | 1.7m | 90% | |
| Claude Sonnet 4 | 83% | $0.0046 | 7.5s | 64% | |
| Claude Sonnet 4.5 | 84% | $0.0046 | 7.8s | 62% | |
| Claude Opus 4.7 (Reasoning) | 86% | $0.011 | 7.2s | 63% | |
| Qwen 3.6 Flash | 92% | $0.015 | 46.1s | 69% | |
| Median | Evaluator |
|---|---|
| 49.7% | Dialogue to Total Word Ratio |
| 73.6% | Matches word count |
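The two evaluators above can be sketched roughly as follows. This is an illustrative guess at their mechanics, assuming dialogue is delimited by straight double quotes; the benchmark's actual parsing and scoring curve are not shown here.

```python
# Minimal sketch of a "Dialogue to Total Word Ratio" evaluator: count words
# inside double quotes versus all words, then score how close the ratio
# lands to the requested target. The linear falloff is an assumed curve.
import re

def dialogue_ratio(text):
    total = len(text.split())
    quoted = re.findall(r'"([^"]*)"', text)
    dialogue = sum(len(q.split()) for q in quoted)
    return dialogue / total if total else 0.0

def ratio_score(text, target=0.10):
    # full marks at the target, falling off linearly (assumed curve)
    return max(0.0, 1.0 - abs(dialogue_ratio(text) - target) / target)
```

A "Matches word count" check would work the same way, comparing `len(text.split())` against the requested 200 or 500 words.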
Write 200 words with 50% dialogue
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Inception Mercury 2 | 97% | $0.0031 | 5.3s | |
| Stealth: Aurora Alpha | 85% | — | 7.2s | |
| GPT-5.5 | 91% | $0.0094 | 8.9s | |
| GPT-5.4 Nano (Reasoning) | 95% | $0.0072 | 34.0s | |
| GPT-5.4 Nano (Reasoning, Low) | 80% | $0.0027 | 15.2s | |
| GPT-5 Mini | 90% | $0.0084 | 50.9s | |
| Claude Opus 4.6 (Reasoning) | 90% | $0.034 | 20.9s | |
| GPT-OSS 120B | 97% | $0.0012 | 2.6m | |
| GPT-5.4 (Reasoning, Low) | 83% | $0.018 | 19.5s | |
| Gemini 3 Flash (Preview, Reasoning) | 87% | $0.024 | 39.7s | |
| Claude Opus 4.6 | 79% | $0.0081 | 9.6s | |
| Qwen 3.6 35B | 100% | $0.015 | 1.3m | |
| GPT-5.4 (Reasoning) | 100% | $0.042 | 47.1s | |
| GPT-5.1 | 95% | $0.038 | 1.1m | |
| GPT-5.5 (Reasoning, Low) | 98% | $0.041 | 23.6s | |
| Nemotron 3 Super | 90% | $0.0000 | 2.7m | |
| GPT-4o, Aug. 6th (temp=0) | 83% | $0.0031 | 4.0s | |
| Z.AI GLM 5 Turbo | 98% | $0.028 | 1.3m | |
| GPT-5.2 | 90% | $0.038 | 48.2s | |
| o4 Mini | 89% | $0.027 | 1.1m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 99% | 99% | |
| Qwen 3.6 35B | 100% | 99% | 99% | |
| GPT-5.5 (Reasoning, Low) | 98% | 96% | 95% | |
| Claude Sonnet 4.6 (Reasoning) | 99% | 96% | 95% | |
| Qwen 3.5 397B A17B | 99% | 92% | 92% | |
| Qwen 3.5 27B | 98% | 92% | 92% | |
| Qwen 3.5 35B | 97% | 91% | 91% | |
| Z.AI GLM 5 Turbo | 98% | 90% | 90% | |
| Z.AI GLM 5.1 | 98% | 90% | 90% | |
| Inception Mercury 2 | 97% | 86% | 86% | |
| MiniMax M2.5 | 98% | 86% | 86% | |
| GPT-OSS 120B | 97% | 83% | 83% | |
| Gemma 4 31B (Reasoning) | 96% | 82% | 82% | |
| MiniMax M2.7 | 96% | 77% | 77% | |
| GPT-5.5 | 91% | 75% | 74% | |
| Grok 4.20 (Beta, Reasoning) | 91% | 72% | 71% | |
| GPT-5.1 | 95% | 70% | 70% | |
| GPT-5.4 Nano (Reasoning) | 95% | 70% | 70% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Inception Mercury 2 | 97% | $0.0031 | 5.3s | 86% | |
| Qwen 3.6 35B | 100% | $0.015 | 1.3m | 99% | |
| GPT-5.5 (Reasoning, Low) | 98% | $0.041 | 23.6s | 95% | |
| GPT-5.4 (Reasoning) | 100% | $0.042 | 47.1s | 100% | |
| GPT-5.5 | 91% | $0.0094 | 8.9s | 74% | |
| GPT-5.4 Nano (Reasoning) | 95% | $0.0072 | 34.0s | 70% | |
| Z.AI GLM 5 Turbo | 98% | $0.028 | 1.3m | 90% | |
| GPT-4o, Aug. 6th (temp=0) | 83% | $0.0031 | 4.0s | 62% | |
| Qwen 3.5 35B | 97% | $0.032 | 1.7m | 91% | |
| Qwen 3.5 27B | 98% | $0.028 | 2.0m | 92% | |
| GPT-OSS 120B | 97% | $0.0012 | 2.6m | 83% | |
| GPT-5.5 (Reasoning) | 100% | $0.083 | 39.8s | 99% | |
| GPT-5 Mini | 90% | $0.0084 | 50.9s | 60% | |
| Claude Opus 4.6 (Reasoning) | 90% | $0.034 | 20.9s | 64% | |
| GPT-5.4 Nano (Reasoning, Low) | 80% | $0.0027 | 15.2s | 52% | |
| GPT-5.4 (Reasoning, Low) | 83% | $0.018 | 19.5s | 59% | |
| Z.AI GLM 5.1 | 98% | $0.036 | 2.5m | 90% | |
| GPT-5.1 | 95% | $0.038 | 1.1m | 70% | |
| Claude Sonnet 4.6 | 77% | $0.0048 | 7.9s | 48% | |
| Claude Opus 4.6 | 79% | $0.0081 | 9.6s | 49% | |
| Median | Evaluator |
|---|---|
| 41.0% | Dialogue to Total Word Ratio |
| 68.2% | Matches word count |
Write 200 words with 90% dialogue
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-4o Mini (temp=0) | 97% | $0.0002 | 4.6s | |
| Claude Opus 4.6 | 94% | $0.0084 | 10.9s | |
| Claude Opus 4.6 (Reasoning) | 94% | $0.011 | 11.8s | |
| Inception Mercury 2 | 83% | $0.0027 | 4.1s | |
| DeepSeek V3.2 | 73% | $0.0002 | 23.1s | |
| GPT-OSS 120B | 81% | $0.0013 | 56.3s | |
| Claude Opus 4.5 | 78% | $0.0085 | 9.8s | |
| Claude Opus 4.7 (Reasoning) | 86% | $0.013 | 7.2s | |
| Qwen 3.5 Flash | 82% | $0.0057 | 1.4m | |
| Nemotron 3 Super | 82% | $0.0000 | 1.2m | |
| Qwen 3.6 35B | 81% | $0.011 | 51.9s | |
| DeepSeek-V2 Chat | 73% | $0.0001 | 20.6s | |
| Gemini 3 Flash (Preview, Reasoning) | 74% | $0.014 | 23.7s | |
| Qwen 3.5 27B | 93% | $0.024 | 1.6m | |
| GPT-5.4 (Reasoning, Low) | 78% | $0.013 | 16.2s | |
| GPT-5.5 (Reasoning, Low) | 78% | $0.015 | 12.4s | |
| GPT-5.5 (Reasoning) | 93% | $0.054 | 27.7s | |
| GPT-5.4 (Reasoning) | 83% | $0.024 | 30.0s | |
| Grok 4 | 78% | $0.0083 | 19.1s | |
| MiniMax M2.7 | 89% | $0.015 | 2.5m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| MoonshotAI: Kimi K2.6 | 98% | 95% | 94% | |
| Gemini 3.1 Pro (Preview) | 99% | 94% | 94% | |
| GPT-4o Mini (temp=0) | 97% | 91% | 90% | |
| Grok 4.3 (Reasoning) | 97% | 89% | 88% | |
| Claude Opus 4.6 (Reasoning) | 94% | 85% | 83% | |
| Claude Opus 4.6 | 94% | 84% | 82% | |
| GPT-5.5 (Reasoning) | 93% | 80% | 79% | |
| Qwen 3.5 27B | 93% | 71% | 70% | |
| Claude Opus 4.7 (Reasoning) | 86% | 74% | 68% | |
| GPT-4o, Aug. 6th (temp=0) | 68% | 99% | 67% | |
| GPT-5.2 | 88% | 69% | 67% | |
| Gemma 4 26B (Reasoning) | 85% | 66% | 62% | |
| GPT-4o, Aug. 6th (temp=1) | 65% | 93% | 61% | |
| MiniMax M2.7 | 89% | 61% | 60% | |
| Claude Sonnet 4.6 | 77% | 81% | 60% | |
| GPT-5.4 (Reasoning) | 83% | 62% | 57% | |
| Claude Opus 4.5 | 78% | 62% | 57% | |
| Qwen 3.5 397B A17B | 84% | 64% | 57% | |
| Inception Mercury 2 | 83% | 61% | 56% | |
| Qwen3.6 Max Preview | 82% | 62% | 55% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| GPT-4o Mini (temp=0) | 97% | $0.0002 | 4.6s | 90% | |
| Claude Opus 4.6 | 94% | $0.0084 | 10.9s | 82% | |
| Claude Opus 4.6 (Reasoning) | 94% | $0.011 | 11.8s | 83% | |
| Claude Opus 4.7 (Reasoning) | 86% | $0.013 | 7.2s | 68% | |
| Inception Mercury 2 | 83% | $0.0027 | 4.1s | 56% | |
| GPT-4o, Aug. 6th (temp=0) | 68% | $0.0032 | 4.0s | 67% | |
| Claude Sonnet 4.6 | 77% | $0.0050 | 8.7s | 60% | |
| Claude Opus 4.5 | 78% | $0.0085 | 9.8s | 57% | |
| Grok 4.3 (Reasoning) | 97% | $0.028 | 2.2m | 88% | |
| GPT-4o, Aug. 6th (temp=1) | 65% | $0.0032 | 4.1s | 61% | |
| GPT-5.5 | 81% | $0.0098 | 9.6s | 51% | |
| Gemini 2.5 Flash Lite | 72% | $0.0001 | 1.7s | 51% | |
| GPT-5.5 (Reasoning) | 93% | $0.054 | 27.7s | 79% | |
| GPT-4o Mini (temp=1) | 68% | $0.0002 | 4.4s | 53% | |
| Grok 4 | 78% | $0.0083 | 19.1s | 52% | |
| GPT-5.2 | 88% | $0.032 | 39.2s | 67% | |
| GPT-5.4 Mini | 73% | $0.0015 | 2.3s | 46% | |
| Qwen 3.5 27B | 93% | $0.024 | 1.6m | 70% | |
| GPT-5.4 (Reasoning, Low) | 78% | $0.013 | 16.2s | 53% | |
| Nemotron 3 Super | 82% | $0.0000 | 1.2m | 54% | |
| Median | Evaluator |
|---|---|
| 58.4% | Dialogue to Total Word Ratio |
| 59.3% | Matches word count |
dialogue-500
Write 500 words with 30% dialogue
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-5 Mini | 87% | $0.013 | 1.1m | |
| Z.AI GLM 5 Turbo | 97% | $0.053 | 2.2m | |
| GPT-5 | 84% | $0.074 | 2.0m | |
| MiniMax M2.5 | 74% | $0.024 | 3.9m | |
| o4 Mini High | 80% | $0.066 | 2.5m | |
| Gemini 3.1 Pro (Preview) | 96% | $0.176 | 2.3m | |
| Inception Mercury 2 | 58% | $0.0064 | 10.9s | |
| Nemotron 3 Super | 71% | $0.0000 | 4.5m | |
| Grok 4.3 (Reasoning) | 71% | $0.047 | 3.7m | |
| Gemini 3 Flash (Preview) | 34% | $0.0021 | 6.8s | |
| Claude Sonnet 4.6 | 51% | $0.011 | 17.5s | |
| o4 Mini | 73% | $0.043 | 1.7m | |
| Claude Opus 4.5 | 52% | $0.018 | 19.6s | |
| Claude Opus 4.6 | 43% | $0.019 | 21.0s | |
| Claude 3.7 Sonnet | 38% | $0.012 | 15.7s | |
| Grok 4.20 (Beta) | 33% | $0.0040 | 4.0s | |
| Z.AI GLM 5.1 | 86% | $0.067 | 5.9m | |
| GPT-4o, Aug. 6th (temp=0) | 31% | $0.0070 | 8.1s | |
| Claude Opus 4.7 | 38% | $0.027 | 15.6s | |
| Claude Opus 4.7 (Reasoning) | 37% | $0.027 | 14.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Z.AI GLM 5 Turbo | 97% | 83% | 83% | |
| Gemini 3.1 Pro (Preview) | 96% | 76% | 76% | |
| GPT-5 Mini | 87% | 59% | 57% | |
| Z.AI GLM 5.1 | 86% | 57% | 56% | |
| Claude Sonnet 4.6 (Reasoning) | 82% | 58% | 55% | |
| o4 Mini High | 80% | 47% | 45% | |
| MiniMax M2.7 | 71% | 63% | 43% | |
| Claude Sonnet 4.6 | 51% | 81% | 41% | |
| o4 Mini | 73% | 55% | 39% | |
| GPT-5 | 84% | 36% | 36% | |
| Claude Opus 4.5 | 52% | 69% | 35% | |
| Grok 4.3 (Reasoning) | 71% | 48% | 34% | |
| MiniMax M2.5 | 74% | 34% | 33% | |
| Nemotron 3 Super | 71% | 36% | 32% | |
| Claude Opus 4.7 | 38% | 76% | 32% | |
| Claude 3.7 Sonnet | 38% | 65% | 30% | |
| Claude Opus 4.6 (Reasoning) | 68% | 33% | 28% | |
| Gemini 3 Flash (Preview) | 34% | 56% | 27% | |
| Grok 4.20 (Beta) | 33% | 62% | 27% | |
| Inception Mercury 2 | 58% | 45% | 26% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Z.AI GLM 5 Turbo | 97% | $0.053 | 2.2m | 83% | |
| GPT-5 Mini | 87% | $0.013 | 1.1m | 57% | |
| Gemini 3.1 Pro (Preview) | 96% | $0.176 | 2.3m | 76% | |
| Claude Sonnet 4.6 | 51% | $0.011 | 17.5s | 41% | |
| o4 Mini | 73% | $0.043 | 1.7m | 39% | |
| Inception Mercury 2 | 58% | $0.0064 | 10.9s | 26% | |
| Claude Opus 4.5 | 52% | $0.018 | 19.6s | 35% | |
| o4 Mini High | 80% | $0.066 | 2.5m | 45% | |
| GPT-5 | 84% | $0.074 | 2.0m | 36% | |
| Claude 3.7 Sonnet | 38% | $0.012 | 15.7s | 30% | |
| Gemini 3 Flash (Preview) | 34% | $0.0021 | 6.8s | 27% | |
| Claude Opus 4.7 | 38% | $0.027 | 15.6s | 32% | |
| Z.AI GLM 5.1 | 86% | $0.067 | 5.9m | 56% | |
| Grok 4.20 (Beta) | 33% | $0.0040 | 4.0s | 27% | |
| MiniMax M2.5 | 74% | $0.024 | 3.9m | 33% | |
| Nemotron 3 Super | 71% | $0.0000 | 4.5m | 32% | |
| GPT-4o, Aug. 6th (temp=0) | 31% | $0.0070 | 8.1s | 26% | |
| Claude Opus 4.6 | 43% | $0.019 | 21.0s | 20% | |
| GPT-4o Mini (temp=0) | 30% | $0.0005 | 10.8s | 22% | |
| Grok 4.3 (Reasoning) | 71% | $0.047 | 3.7m | 34% | |
| Median | Evaluator |
|---|---|
| 9.2% | Dialogue to Total Word Ratio |
| 15.6% | Matches word count |
Write 500 words with 50% dialogue
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-5 Mini | 79% | $0.012 | 1.2m | |
| GPT-5 | 95% | $0.059 | 1.5m | |
| Z.AI GLM 5 Turbo | 95% | $0.055 | 2.2m | |
| Claude Opus 4.6 | 71% | $0.019 | 20.3s | |
| Inception Mercury 2 | 56% | $0.0058 | 9.1s | |
| o4 Mini | 67% | $0.030 | 1.3m | |
| Claude Sonnet 4.6 | 49% | $0.011 | 17.6s | |
| Claude Opus 4.6 (Reasoning) | 53% | $0.026 | 24.3s | |
| Gemini 3.1 Flash Lite (Reasoning) | 32% | $0.0010 | 4.1s | |
| Claude 3.7 Sonnet | 32% | $0.012 | 16.7s | |
| GPT-4o Mini (temp=0) | 29% | $0.0005 | 13.5s | |
| Nemotron 3 Super | 57% | $0.0000 | 3.2m | |
| Grok 4 Fast | 30% | $0.0005 | 9.3s | |
| Claude Opus 4.5 | 37% | $0.019 | 18.9s | |
| Claude Sonnet 4.6 (Reasoning) | 50% | $0.042 | 37.2s | |
| Ministral 3 14B | 26% | $0.0002 | 9.5s | |
| MiniMax M2.7 | 72% | $0.028 | 5.6m | |
| MiniMax M2.5 | 63% | $0.025 | 3.8m | |
| Gemini 3.1 Flash Lite (Preview) | 32% | $0.0009 | 3.8s | |
| Claude Opus 4.7 (Reasoning) | 32% | $0.028 | 15.3s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 95% | 79% | 79% | |
| GPT-5 | 95% | 70% | 70% | |
| MoonshotAI: Kimi K2.6 | 81% | 56% | 51% | |
| GPT-5 Mini | 79% | 58% | 48% | |
| Claude Opus 4.6 | 71% | 59% | 43% | |
| Z.AI GLM 5.1 | 81% | 36% | 34% | |
| Claude Sonnet 4.6 | 49% | 67% | 32% | |
| MiniMax M2.7 | 72% | 37% | 31% | |
| o4 Mini | 67% | 53% | 30% | |
| Claude Sonnet 4.6 (Reasoning) | 50% | 58% | 28% | |
| Claude 3.7 Sonnet | 32% | 59% | 27% | |
| Claude Opus 4.5 | 37% | 58% | 27% | |
| Gemini 3.1 Flash Lite (Reasoning) | 32% | 62% | 26% | |
| Claude Opus 4.7 (Reasoning) | 32% | 62% | 26% | |
| Claude Opus 4.6 (Reasoning) | 53% | 49% | 25% | |
| Claude Opus 4.7 | 35% | 77% | 25% | |
| MiniMax M2.5 | 63% | 42% | 24% | |
| Nemotron 3 Super | 57% | 43% | 24% | |
| GPT-4o Mini (temp=0) | 29% | 58% | 23% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Z.AI GLM 5 Turbo | 95% | $0.055 | 2.2m | 79% | |
| GPT-5 | 95% | $0.059 | 1.5m | 70% | |
| GPT-5 Mini | 79% | $0.012 | 1.2m | 48% | |
| Claude Opus 4.6 | 71% | $0.019 | 20.3s | 43% | |
| Gemini 3.1 Pro (Preview) | 100% | $0.189 | 2.6m | 100% | |
| Claude Sonnet 4.6 | 49% | $0.011 | 17.6s | 32% | |
| Inception Mercury 2 | 56% | $0.0058 | 9.1s | 20% | |
| o4 Mini | 67% | $0.030 | 1.3m | 30% | |
| Claude Opus 4.6 (Reasoning) | 53% | $0.026 | 24.3s | 25% | |
| Gemini 3.1 Flash Lite (Reasoning) | 32% | $0.0010 | 4.1s | 26% | |
| Gemini 3.1 Flash Lite (Preview) | 32% | $0.0009 | 3.8s | 23% | |
| Nemotron 3 Super | 57% | $0.0000 | 3.2m | 24% | |
| Grok 4 Fast | 30% | $0.0005 | 9.3s | 23% | |
| Claude Sonnet 4.6 (Reasoning) | 50% | $0.042 | 37.2s | 28% | |
| Claude Opus 4.5 | 37% | $0.019 | 18.9s | 27% | |
| GPT-4o Mini (temp=0) | 29% | $0.0005 | 13.5s | 23% | |
| Claude 3.7 Sonnet | 32% | $0.012 | 16.7s | 27% | |
| Ministral 3 14B | 26% | $0.0002 | 9.5s | 20% | |
| MiniMax M2.5 | 63% | $0.025 | 3.8m | 24% | |
| Claude Opus 4.7 | 35% | $0.029 | 16.4s | 25% | |
| Median | Evaluator |
|---|---|
| 17.4% | Dialogue to Total Word Ratio |
| 14.8% | Matches word count |
Write 500 words with 70% dialogue
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Nemotron 3 Super | 74% | $0.0000 | 3.3m | |
| GPT-5 Mini | 70% | $0.014 | 1.2m | |
| Grok 4 Fast | 61% | $0.0005 | 9.4s | |
| Inception Mercury 2 | 34% | $0.0047 | 7.3s | |
| Claude 3.7 Sonnet | 54% | $0.012 | 16.2s | |
| Claude Opus 4.6 (Reasoning) | 57% | $0.021 | 22.2s | |
| Mistral Medium 3.1 | 44% | $0.0017 | 18.0s | |
| GPT-4o, Aug. 6th (temp=1) | 45% | $0.0074 | 8.9s | |
| Gemini 2.5 Flash Lite (Reasoning) | 28% | $0.0022 | 26.4s | |
| Gemini 3.1 Flash Lite (Reasoning) | 38% | $0.0010 | 4.0s | |
| GPT-4o Mini (temp=0) | 32% | $0.0005 | 11.7s | |
| GPT-4o, Aug. 6th (temp=0) | 54% | $0.0070 | 8.2s | |
| Gemini 3.1 Flash Lite | 32% | $0.0010 | 7.0s | |
| Claude Opus 4.6 | 41% | $0.019 | 21.7s | |
| DeepSeek-V2 Chat | 40% | $0.0002 | 41.2s | |
| Claude 3.5 Sonnet | 46% | $0.013 | 53.7s | |
| GPT-5 | 54% | $0.051 | 1.5m | |
| Claude Sonnet 4.6 | 37% | $0.011 | 17.3s | |
| Aion 2.0 | 28% | $0.0020 | 31.3s | |
| o4 Mini High | 62% | $0.064 | 2.6m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude 3.7 Sonnet | 54% | 76% | 39% | |
| Nemotron 3 Super | 74% | 44% | 38% | |
| Claude Opus 4.6 | 41% | 67% | 31% | |
| Grok 4 Fast | 61% | 55% | 31% | |
| Gemini 3.1 Flash Lite (Reasoning) | 38% | 71% | 30% | |
| Inception Mercury 2 | 34% | 60% | 29% | |
| Claude Opus 4.6 (Reasoning) | 57% | 54% | 29% | |
| o4 Mini High | 62% | 43% | 28% | |
| GPT-4o Mini (temp=0) | 32% | 62% | 26% | |
| GPT-5 Mini | 70% | 34% | 25% | |
| MiniMax M2.5 | 55% | 50% | 25% | |
| Gemini 2.5 Flash Lite (Reasoning) | 28% | 55% | 24% | |
| Claude Sonnet 4.6 | 37% | 62% | 23% | |
| MiniMax M2.7 | 54% | 42% | 21% | |
| Claude Opus 4.7 (Reasoning) | 36% | 50% | 20% | |
| MoonshotAI: Kimi K2.6 | 31% | 58% | 20% | |
| GPT-OSS 120B | 28% | 62% | 20% | |
| Aion 2.0 | 28% | 54% | 18% | |
| ByteDance Seed 1.6 | 25% | 59% | 18% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Grok 4 Fast | 61% | $0.0005 | 9.4s | 31% | |
| Claude 3.7 Sonnet | 54% | $0.012 | 16.2s | 39% | |
| GPT-5 Mini | 70% | $0.014 | 1.2m | 25% | |
| Claude Opus 4.6 (Reasoning) | 57% | $0.021 | 22.2s | 29% | |
| Gemini 3.1 Flash Lite (Reasoning) | 38% | $0.0010 | 4.0s | 30% | |
| GPT-4o, Aug. 6th (temp=0) | 54% | $0.0070 | 8.2s | 17% | |
| Nemotron 3 Super | 74% | $0.0000 | 3.3m | 38% | |
| Inception Mercury 2 | 34% | $0.0047 | 7.3s | 29% | |
| Claude Opus 4.6 | 41% | $0.019 | 21.7s | 31% | |
| Mistral Medium 3.1 | 44% | $0.0017 | 18.0s | 17% | |
| GPT-4o Mini (temp=0) | 32% | $0.0005 | 11.7s | 26% | |
| GPT-4o, Aug. 6th (temp=1) | 45% | $0.0074 | 8.9s | 13% | |
| Claude Sonnet 4.6 | 37% | $0.011 | 17.3s | 23% | |
| Gemini 3.1 Pro (Preview) | 100% | $0.229 | 3.3m | 100% | |
| DeepSeek-V2 Chat | 40% | $0.0002 | 41.2s | 18% | |
| Gemini 3.1 Flash Lite | 32% | $0.0010 | 7.0s | 17% | |
| Grok 4.1 Fast | 31% | $0.0005 | 12.8s | 17% | |
| Gemini 2.5 Flash Lite (Reasoning) | 28% | $0.0022 | 26.4s | 24% | |
| Ministral 3 8B | 27% | $0.0001 | 3.9s | 17% | |
| Stealth: Healer Alpha | 29% | $0.0000 | 10.1s | 15% | |
| Median | Evaluator |
|---|---|
| 17.3% | Dialogue to Total Word Ratio |
| 10.1% | Matches word count |
Ungrouped
Write unattributed dialogue
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Gemini 2.5 Flash Lite | 96% | $0.0001 | 1.4s | |
| Ministral 3 14B | 100% | $0.0001 | 8.2s | |
| Mistral Small Creative | 88% | $0.0001 | 2.2s | |
| Mistral Small 3.2 24B | 100% | $0.0001 | 4.1s | |
| Mistral Small 4 | 100% | $0.0002 | 2.9s | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 3.4s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0004 | 2.0s | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0004 | 3.6s | |
| Qwen3 235B A22B Instruct 2507 | 100% | $0.0001 | 8.3s | |
| Gemini 3.1 Flash Lite | 100% | $0.0004 | 2.0s | |
| Llama 3.1 70B | 96% | $0.0003 | 3.2s | |
| DeepSeek V4 Flash | 100% | $0.0001 | 6.8s | |
| Grok 4 Fast | 100% | $0.0003 | 4.2s | |
| Gemma 3 12B | 100% | $0.0000 | 7.4s | |
| GPT-4.1 Mini | 87% | $0.0004 | 3.3s | |
| Claude 3 Haiku | 100% | $0.0004 | 3.4s | |
| DeepSeek V4 Flash (Reasoning) | 100% | $0.0001 | 6.3s | |
| Llama 3.1 Nemotron 70B | 100% | $0.0001 | 9.6s | |
| GPT-4o Mini (temp=1) | 100% | $0.0002 | 15.5s | |
| Z.AI GLM 4.5 | 96% | $0.0005 | 5.8s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Qwen3.6 Max Preview | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5.1 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Grok 4.3 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.7 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.5 (Reasoning, Low) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Gemma 4 31B (Reasoning) | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Qwen 3.5 Plus (2026-04-20) | 100% | 100% | 100% | |
| Gemma 4 26B (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Mistral Small 4 | 100% | $0.0002 | 2.9s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 3.4s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0001 | 4.1s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0004 | 2.0s | 100% | |
| Gemini 3.1 Flash Lite | 100% | $0.0004 | 2.0s | 100% | |
| Grok 4 Fast | 100% | $0.0003 | 4.2s | 100% | |
| Claude 3 Haiku | 100% | $0.0004 | 3.4s | 100% | |
| Gemini 3.1 Flash Lite (Reasoning) | 100% | $0.0004 | 3.6s | 100% | |
| DeepSeek V4 Flash (Reasoning) | 100% | $0.0001 | 6.3s | 100% | |
| DeepSeek V4 Flash | 100% | $0.0001 | 6.8s | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 7.4s | 100% | |
| Ministral 3 14B | 100% | $0.0001 | 8.2s | 100% | |
| Qwen3 235B A22B Instruct 2507 | 100% | $0.0001 | 8.3s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0009 | 3.2s | 100% | |
| Grok 4.20 | 100% | $0.0007 | 4.9s | 100% | |
| Llama 3.1 Nemotron 70B | 100% | $0.0001 | 9.6s | 100% | |
| Mistral Medium 3.1 | 100% | $0.0006 | 6.1s | 100% | |
| Mistral Large 3 | 100% | $0.0005 | 7.2s | 100% | |
| Grok 4.1 Fast | 100% | $0.0003 | 10.2s | 100% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0004 | 10.1s | 100% | |
| Median | Evaluator |
|---|---|
| 100.0% | Count dialogue tags |
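A "Count dialogue tags" evaluator for the unattributed-dialogue task could be sketched as below. The verb list and regex are illustrative assumptions, not the benchmark's actual rule; the idea is simply to flag attributions like `she said` or `said Tom` adjacent to quoted speech, where a passing response should contain none.

```python
# Hypothetical tag counter: match a dialogue-tag verb either right after a
# closing quote ('"...," she said') or right before an opening quote
# ('said Tom, "..."'). Verb list and pattern are illustrative only.
import re

TAG_VERBS = r"(said|asked|replied|whispered|shouted|muttered)"
TAG_RE = re.compile(
    r'"\s*,?\s*\w+\s+' + TAG_VERBS          # "..." she said
    + r'|' + TAG_VERBS + r'\s+\w+\s*,?\s*"',  # said Tom, "..."
    re.IGNORECASE,
)

def count_dialogue_tags(text):
    # len() of findall counts matches even though the pattern has groups
    return len(TAG_RE.findall(text))
```

A response scoring 100% on this task would keep the count at zero, carrying speaker changes through paragraphing alone.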