Dialogue tags
A set of tasks that test how models handle dialogue in generated text: writing unattributed dialogue, and hitting target dialogue-to-narration ratios at fixed word counts.
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-5 Mini | 83% | $0.0096 | 52.2s | |
| Z.AI GLM 5 Turbo | 87% | $0.030 | 1.3m | |
| Inception Mercury 2 | 71% | $0.0037 | 6.1s | |
| Claude Opus 4.6 | 73% | $0.013 | 14.7s | |
| GPT-5 | 84% | $0.049 | 1.4m | |
| Qwen 3.5 27B | 66% | $0.023 | 1.8m | |
| GPT-5.4 (Reasoning) | 62% | $0.027 | 36.1s | |
| Gemini 3 Flash (Preview, Reasoning) | 61% | $0.017 | 30.3s | |
| Claude Opus 4.6 (Reasoning) | 80% | $0.070 | 37.7s | |
| Nemotron 3 Super | 74% | $0.0000 | 2.5m | |
| Gemini 3.1 Pro (Preview) | 99% | $0.135 | 1.9m | |
| GPT-5.1 | 63% | $0.027 | 47.2s | |
| o4 Mini High | 79% | $0.045 | 1.8m | |
| Claude Sonnet 4.6 | 67% | $0.0074 | 12.1s | |
| GPT-5.2 | 60% | $0.024 | 34.2s | |
| MiniMax M2.7 | 81% | $0.021 | 4.3m | |
| MiniMax M2.5 | 74% | $0.015 | 2.9m | |
| o4 Mini | 68% | $0.023 | 58.0s | |
| GPT-5.4 (Reasoning, Low) | 55% | $0.016 | 21.6s | |
| Claude Sonnet 4.6 (Reasoning) | 75% | $0.101 | 1.2m | |
Cost vs Performance
Plots each model's total cost for this test against its test score. Quadrant lines are drawn at the median values; only models with available cost data are shown.
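The quadrant layout described above can be reproduced directly: split the plane at the median cost and median score, then label each model by which side of each line it falls on. A minimal sketch (the tuple shape is an assumption, not the site's data format):

```python
from statistics import median

def quadrants(models):
    """Classify models into the four cost/performance quadrants,
    splitting at the median cost and median score as the chart does.

    models: list of (name, cost, score) tuples.
    """
    cost_med = median(c for _, c, _ in models)
    score_med = median(s for _, _, s in models)
    out = {}
    for name, cost, score in models:
        horiz = "cheap" if cost <= cost_med else "expensive"
        vert = "strong" if score >= score_med else "weak"
        out[name] = f"{horiz}/{vert}"
    return out
```

The "cheap/strong" quadrant is where the best price-performance models land.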
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 99% | 90% | 90% | |
| GPT-5 Mini | 83% | 48% | 48% | |
| Z.AI GLM 5 Turbo | 87% | 48% | 48% | |
| o4 Mini High | 79% | 47% | 45% | |
| MiniMax M2.7 | 81% | 45% | 45% | |
| Claude Opus 4.6 (Reasoning) | 80% | 45% | 44% | |
| GPT-5 | 84% | 43% | 43% | |
| Claude Sonnet 4.6 (Reasoning) | 75% | 40% | 38% | |
| Claude Opus 4.6 | 73% | 42% | 35% | |
| Claude Sonnet 4.6 | 67% | 50% | 35% | |
| Nemotron 3 Super | 74% | 36% | 34% | |
| MiniMax M2.5 | 74% | 43% | 33% | |
| Inception Mercury 2 | 71% | 38% | 31% | |
| o4 Mini | 68% | 32% | 23% | |
| Gemini 3.1 Flash Lite (Preview) | 51% | 45% | 23% | |
| Claude Opus 4.5 | 64% | 34% | 19% | |
| GPT-4o Mini (temp=0) | 55% | 37% | 19% | |
| Grok 4 | 49% | 37% | 18% | |
| GPT-4o, Aug. 6th (temp=0) | 57% | 37% | 18% | |
| MoonshotAI: Kimi K2.5 | 64% | 22% | 18% | |
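The stability column above is the product stated in the ranking description: median score × consistency. As a rough illustration (the site's exact consistency formula is not given here; this sketch assumes consistency is one minus the spread of run scores):

```python
from statistics import median

def stability(run_scores: list[float]) -> float:
    """Stability = median score x consistency, per the ranking above.

    `consistency` here is an assumed definition: 1 minus the score
    range (max - min), so identical runs give consistency 1.0.
    Scores are fractions in [0, 1].
    """
    med = median(run_scores)
    consistency = 1.0 - (max(run_scores) - min(run_scores))
    return med * consistency
```

Under this definition a model that scores 70% on every run beats one that averages 80% but swings between 50% and 95%.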
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5 Mini | 83% | $0.0096 | 52.2s | 48% | |
| Inception Mercury 2 | 71% | $0.0037 | 6.1s | 31% | |
| Claude Opus 4.6 | 73% | $0.013 | 14.7s | 35% | |
| Claude Sonnet 4.6 | 67% | $0.0074 | 12.1s | 35% | |
| Z.AI GLM 5 Turbo | 87% | $0.030 | 1.3m | 48% | |
| Gemini 3.1 Flash Lite (Preview) | 51% | $0.0006 | 2.7s | 23% | |
| GPT-4o Mini (temp=0) | 55% | $0.0003 | 7.8s | 19% | |
| Claude Opus 4.5 | 64% | $0.013 | 13.5s | 19% | |
| GPT-4o, Aug. 6th (temp=0) | 57% | $0.0049 | 6.0s | 18% | |
| GPT-5 | 84% | $0.049 | 1.4m | 43% | |
| Claude Opus 4.6 (Reasoning) | 80% | $0.070 | 37.7s | 44% | |
| Gemini 3.1 Pro (Preview) | 99% | $0.135 | 1.9m | 90% | |
| GPT-4o, Aug. 6th (temp=1) | 51% | $0.0050 | 6.0s | 17% | |
| Grok 4 Fast | 46% | $0.0004 | 6.4s | 18% | |
| o4 Mini High | 79% | $0.045 | 1.8m | 45% | |
| Gemini 3 Flash (Preview) | 47% | $0.0014 | 4.8s | 16% | |
| Nemotron 3 Super | 74% | $0.0000 | 2.5m | 34% | |
| Inception Mercury | 49% | $0.0005 | 8.1s | 11% | |
| o4 Mini | 68% | $0.023 | 58.0s | 23% | |
| GPT-4o Mini (temp=1) | 44% | $0.0003 | 9.4s | 16% | |
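The composite ranking above blends performance, cost, speed, and stability. The exact weights are not published on this page; a minimal sketch with assumed equal weights and min-max normalization, inverting cost and time so that cheaper and faster rank higher:

```python
def composite_score(models):
    """Rank models by an assumed equal-weight blend of four signals.

    models: list of dicts with 'name', 'score', 'cost', 'time_s',
    'stability'. Score and stability are "higher is better"; cost
    and time are inverted before blending.
    """
    def norm(values, invert=False):
        lo, hi = min(values), max(values)
        if hi == lo:
            return [1.0] * len(values)
        scaled = [(v - lo) / (hi - lo) for v in values]
        return [1.0 - s for s in scaled] if invert else scaled

    score = norm([m["score"] for m in models])
    cost = norm([m["cost"] for m in models], invert=True)
    speed = norm([m["time_s"] for m in models], invert=True)
    stab = norm([m["stability"] for m in models])
    blended = [(s + c + sp + st) / 4
               for s, c, sp, st in zip(score, cost, speed, stab)]
    ranked = sorted(zip(models, blended), key=lambda p: -p[1])
    return [(m["name"], round(b, 3)) for m, b in ranked]
```

This explains why a mid-scoring but cheap, fast model such as Inception Mercury 2 can outrank higher-scoring but slower, pricier models in the overall list.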
| Model | Total ▼ | Write unattributed dialogue | Write 200 words with 10% dialogue | Write 200 words with 50% dialogue | Write 200 words with 90% dialogue | Write 500 words with 30% dialogue | Write 500 words with 50% dialogue | Write 500 words with 70% dialogue |
|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 99% | 100% | 100% | 100% | 99% | 96% | 100% | 100% |
| Z.AI GLM 5 Turbo | 87% | 100% | 100% | 98% | 69% | 97% | 95% | 50% |
| GPT-5 | 84% | 100% | 100% | 95% | 63% | 84% | 95% | 54% |
| GPT-5 Mini | 83% | 90% | 100% | 90% | 62% | 87% | 79% | 70% |
| MiniMax M2.7 | 81% | 86% | 97% | 96% | 89% | 71% | 72% | 54% |
| Claude Opus 4.6 (Reasoning) | 80% | 100% | 100% | 90% | 94% | 68% | 53% | 57% |
| o4 Mini High | 79% | 96% | 100% | 85% | 74% | 80% | 57% | 62% |
| Claude Sonnet 4.6 (Reasoning) | 75% | 96% | 99% | 99% | 72% | 82% | 50% | 26% |
| Nemotron 3 Super | 74% | 56% | 90% | 90% | 82% | 71% | 57% | 74% |
| MiniMax M2.5 | 74% | 86% | 76% | 98% | 62% | 74% | 63% | 55% |
| Claude Opus 4.6 | 73% | 100% | 82% | 79% | 94% | 43% | 71% | 41% |
| Inception Mercury 2 | 71% | 80% | 90% | 97% | 83% | 58% | 56% | 34% |
| o4 Mini | 68% | 72% | 97% | 89% | 55% | 73% | 67% | 22% |
| Claude Sonnet 4.6 | 67% | 92% | 87% | 77% | 77% | 51% | 49% | 37% |
| Qwen 3.5 27B | 66% | 100% | 96% | 98% | 93% | 46% | 16% | 10% |
dialogue-200
Write 200 words with 10% dialogue
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Aurora Alpha | 89% | — | 7.4s | |
| Inception Mercury | 85% | $0.0002 | 8.2s | |
| Inception Mercury 2 | 90% | $0.0025 | 4.0s | |
| Claude Opus 4.5 | 96% | $0.0077 | 9.3s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0041 | 24.5s | |
| Claude Haiku 4.5 | 76% | $0.0016 | 4.4s | |
| Claude Opus 4.6 | 82% | $0.0079 | 9.7s | |
| GPT-5.4 Nano (Reasoning, Low) | 81% | $0.0025 | 15.4s | |
| Claude Sonnet 4 | 83% | $0.0046 | 7.5s | |
| GPT-5 Mini | 100% | $0.0088 | 45.8s | |
| GPT-5.2 | 100% | $0.026 | 29.9s | |
| Claude Sonnet 4.6 | 87% | $0.0045 | 8.0s | |
| GPT-5.4 (Reasoning, Low) | 91% | $0.025 | 23.9s | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.025 | 40.7s | |
| o4 Mini | 97% | $0.022 | 55.0s | |
| GPT-5.4 (Reasoning) | 100% | $0.034 | 38.8s | |
| Claude Sonnet 4.5 | 84% | $0.0046 | 7.8s | |
| Z.AI GLM 5 Turbo | 100% | $0.026 | 1.1m | |
| GPT-5.1 | 100% | $0.032 | 49.0s | |
| Grok 4.20 (Beta, Reasoning) | 92% | $0.043 | 28.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | 99% | 99% | |
| Qwen 3.5 397B A17B | 100% | 99% | 99% | |
| GPT-5.2 | 100% | 99% | 99% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 99% | 99% | |
| o4 Mini High | 100% | 98% | 98% | |
| Claude Sonnet 4.6 (Reasoning) | 99% | 95% | 95% | |
| Claude Opus 4.5 | 96% | 89% | 88% | |
| MoonshotAI: Kimi K2.5 | 97% | 87% | 86% | |
| Qwen 3.5 122B | 97% | 83% | 83% | |
| MiniMax M2.7 | 97% | 81% | 81% | |
| GPT-5 Nano | 97% | 81% | 81% | |
| o4 Mini | 97% | 80% | 80% | |
| Qwen 3.5 27B | 96% | 76% | 76% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.4 Nano (Reasoning) | 100% | $0.0041 | 24.5s | 99% | |
| GPT-5 Mini | 100% | $0.0088 | 45.8s | 100% | |
| Claude Opus 4.5 | 96% | $0.0077 | 9.3s | 88% | |
| GPT-5.2 | 100% | $0.026 | 29.9s | 99% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.025 | 40.7s | 99% | |
| GPT-5.4 (Reasoning) | 100% | $0.034 | 38.8s | 100% | |
| GPT-5.1 | 100% | $0.032 | 49.0s | 100% | |
| Z.AI GLM 5 Turbo | 100% | $0.026 | 1.1m | 100% | |
| Claude Sonnet 4.6 | 87% | $0.0045 | 8.0s | 71% | |
| o4 Mini High | 100% | $0.036 | 1.5m | 98% | |
| GPT-5 | 100% | $0.046 | 1.3m | 100% | |
| Inception Mercury 2 | 90% | $0.0025 | 4.0s | 60% | |
| o4 Mini | 97% | $0.022 | 55.0s | 80% | |
| Claude Sonnet 4 | 83% | $0.0046 | 7.5s | 64% | |
| Claude Sonnet 4.5 | 84% | $0.0046 | 7.8s | 62% | |
| Stealth: Aurora Alpha | 89% | — | 7.4s | 66% | |
| Qwen 3.5 122B | 97% | $0.035 | 1.2m | 83% | |
| GPT-5 Nano | 97% | $0.0051 | 2.1m | 81% | |
| Claude Opus 4.6 | 82% | $0.0079 | 9.7s | 57% | |
| Grok 4.20 (Beta, Reasoning) | 92% | $0.043 | 28.1s | 71% | |
| Median | Evaluator |
|---|---|
| 41.4% | Dialogue to Total Word Ratio |
| 67.2% | Matches word count |
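The two evaluators above can be approximated in a few lines: the ratio check counts words inside double quotes, and the word-count check scores distance from the 200-word target. Both are sketches; the benchmark's exact scoring curves are not published, and the linear falloff and 10% tolerance below are assumptions.

```python
import re

def dialogue_ratio(text: str) -> float:
    """Fraction of words that appear inside double-quoted dialogue."""
    quoted = re.findall(r'"([^"]*)"', text)
    dialogue_words = sum(len(q.split()) for q in quoted)
    total_words = len(text.split())
    return dialogue_words / total_words if total_words else 0.0

def word_count_score(text: str, target: int = 200,
                     tolerance: float = 0.10) -> float:
    """1.0 at the target word count, falling linearly to 0.0 at
    +/- `tolerance` x target. Falloff shape is an assumption."""
    n = len(text.split())
    err = abs(n - target) / target
    return max(0.0, 1.0 - err / tolerance)
```

For example, `dialogue_ratio('"Hi there," she said.')` counts 2 quoted words out of 4 total, giving 0.5, roughly the 50%-dialogue target.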
Write 200 words with 50% dialogue
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Inception Mercury 2 | 97% | $0.0031 | 5.3s | |
| Stealth: Aurora Alpha | 85% | — | 7.2s | |
| GPT-5.4 Nano (Reasoning) | 95% | $0.0072 | 34.0s | |
| GPT-5.4 Nano (Reasoning, Low) | 80% | $0.0027 | 15.2s | |
| Claude Opus 4.6 (Reasoning) | 90% | $0.034 | 20.9s | |
| GPT-5 Mini | 90% | $0.0084 | 50.9s | |
| Claude Opus 4.6 | 79% | $0.0081 | 9.6s | |
| GPT-5.4 (Reasoning, Low) | 83% | $0.018 | 19.5s | |
| Gemini 3 Flash (Preview, Reasoning) | 87% | $0.024 | 39.7s | |
| GPT-5.4 (Reasoning) | 100% | $0.042 | 47.1s | |
| GPT-4o, Aug. 6th (temp=0) | 83% | $0.0031 | 4.0s | |
| GPT-5.1 | 95% | $0.038 | 1.1m | |
| Claude Sonnet 4.6 | 77% | $0.0048 | 7.9s | |
| GPT-5.2 | 90% | $0.038 | 48.2s | |
| Grok 4.20 (Beta, Reasoning) | 91% | $0.051 | 35.2s | |
| Z.AI GLM 5 Turbo | 98% | $0.028 | 1.3m | |
| Inception Mercury | 74% | $0.0002 | 9.8s | |
| o4 Mini | 89% | $0.027 | 1.1m | |
| GPT-5 | 95% | $0.046 | 1.2m | |
| Nemotron 3 Super | 90% | $0.0000 | 2.7m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 99% | 96% | 95% | |
| Qwen 3.5 397B A17B | 99% | 92% | 92% | |
| Qwen 3.5 27B | 98% | 92% | 92% | |
| Qwen 3.5 35B | 97% | 91% | 91% | |
| Z.AI GLM 5 Turbo | 98% | 90% | 90% | |
| Inception Mercury 2 | 97% | 86% | 86% | |
| MiniMax M2.5 | 98% | 86% | 86% | |
| MiniMax M2.7 | 96% | 77% | 77% | |
| Grok 4.20 (Beta, Reasoning) | 91% | 72% | 71% | |
| GPT-5.1 | 95% | 70% | 70% | |
| GPT-5.4 Nano (Reasoning) | 95% | 70% | 70% | |
| GPT-5 | 95% | 70% | 70% | |
| GPT-5.2 | 90% | 67% | 67% | |
| MoonshotAI: Kimi K2.5 | 89% | 70% | 67% | |
| Claude Opus 4.6 (Reasoning) | 90% | 65% | 64% | |
| GPT-4o, Aug. 6th (temp=0) | 83% | 71% | 62% | |
| o4 Mini | 89% | 61% | 61% | |
| Nemotron 3 Super | 90% | 60% | 60% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Inception Mercury 2 | 97% | $0.0031 | 5.3s | 86% | |
| GPT-5.4 (Reasoning) | 100% | $0.042 | 47.1s | 100% | |
| GPT-5.4 Nano (Reasoning) | 95% | $0.0072 | 34.0s | 70% | |
| Z.AI GLM 5 Turbo | 98% | $0.028 | 1.3m | 90% | |
| GPT-4o, Aug. 6th (temp=0) | 83% | $0.0031 | 4.0s | 62% | |
| Qwen 3.5 35B | 97% | $0.032 | 1.7m | 91% | |
| Qwen 3.5 27B | 98% | $0.028 | 2.0m | 92% | |
| GPT-5 Mini | 90% | $0.0084 | 50.9s | 60% | |
| GPT-5.4 Nano (Reasoning, Low) | 80% | $0.0027 | 15.2s | 52% | |
| Claude Opus 4.6 (Reasoning) | 90% | $0.034 | 20.9s | 64% | |
| GPT-5.4 (Reasoning, Low) | 83% | $0.018 | 19.5s | 59% | |
| Claude Sonnet 4.6 | 77% | $0.0048 | 7.9s | 48% | |
| Claude Opus 4.6 | 79% | $0.0081 | 9.6s | 49% | |
| GPT-5.1 | 95% | $0.038 | 1.1m | 70% | |
| Grok 4.20 (Beta, Reasoning) | 91% | $0.051 | 35.2s | 71% | |
| Gemini 3 Flash (Preview, Reasoning) | 87% | $0.024 | 39.7s | 59% | |
| GPT-5.2 | 90% | $0.038 | 48.2s | 67% | |
| Stealth: Aurora Alpha | 85% | — | 7.2s | 54% | |
| Inception Mercury | 74% | $0.0002 | 9.8s | 41% | |
| GPT-5 | 95% | $0.046 | 1.2m | 70% | |
| Median | Evaluator |
|---|---|
| 39.0% | Dialogue to Total Word Ratio |
| 64.2% | Matches word count |
Write 200 words with 90% dialogue
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-4o Mini (temp=0) | 97% | $0.0002 | 4.6s | |
| Claude Opus 4.6 | 94% | $0.0084 | 10.9s | |
| Inception Mercury 2 | 83% | $0.0027 | 4.1s | |
| Claude Opus 4.6 (Reasoning) | 94% | $0.011 | 11.8s | |
| Claude Opus 4.5 | 78% | $0.0085 | 9.8s | |
| DeepSeek V3.2 | 73% | $0.0002 | 23.1s | |
| Gemini 3 Flash (Preview, Reasoning) | 74% | $0.014 | 23.7s | |
| DeepSeek-V2 Chat | 73% | $0.0001 | 20.6s | |
| GPT-5.4 (Reasoning, Low) | 78% | $0.013 | 16.2s | |
| Nemotron 3 Super | 82% | $0.0000 | 1.2m | |
| GPT-5.4 Mini | 73% | $0.0015 | 2.3s | |
| Qwen 3.5 Flash | 82% | $0.0057 | 1.4m | |
| Gemini 2.5 Flash Lite | 72% | $0.0001 | 1.7s | |
| Grok 4 | 78% | $0.0083 | 19.1s | |
| GPT-5.4 (Reasoning) | 83% | $0.024 | 30.0s | |
| GPT-5.4 Nano (Reasoning) | 75% | $0.0038 | 20.7s | |
| Claude Sonnet 4.6 | 77% | $0.0050 | 8.7s | |
| GPT-5.2 | 88% | $0.032 | 39.2s | |
| GPT-4o Mini (temp=1) | 68% | $0.0002 | 4.4s | |
| Grok 4.1 Fast | 69% | $0.0003 | 7.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 99% | 94% | 94% | |
| GPT-4o Mini (temp=0) | 97% | 91% | 90% | |
| Claude Opus 4.6 (Reasoning) | 94% | 85% | 83% | |
| Claude Opus 4.6 | 94% | 84% | 82% | |
| Qwen 3.5 27B | 93% | 71% | 70% | |
| GPT-4o, Aug. 6th (temp=0) | 68% | 99% | 67% | |
| GPT-5.2 | 88% | 69% | 67% | |
| GPT-4o, Aug. 6th (temp=1) | 65% | 93% | 61% | |
| MiniMax M2.7 | 89% | 61% | 60% | |
| Claude Sonnet 4.6 | 77% | 81% | 60% | |
| GPT-5.4 (Reasoning) | 83% | 62% | 57% | |
| Claude Opus 4.5 | 78% | 62% | 57% | |
| Qwen 3.5 397B A17B | 84% | 64% | 57% | |
| Inception Mercury 2 | 83% | 61% | 56% | |
| Nemotron 3 Super | 82% | 61% | 54% | |
| MoonshotAI: Kimi K2.5 | 79% | 61% | 54% | |
| GPT-5.4 (Reasoning, Low) | 78% | 64% | 53% | |
| GPT-4o Mini (temp=1) | 68% | 77% | 53% | |
| Qwen 3.5 122B | 80% | 56% | 52% | |
| Grok 4 | 78% | 65% | 52% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-4o Mini (temp=0) | 97% | $0.0002 | 4.6s | 90% | |
| Claude Opus 4.6 | 94% | $0.0084 | 10.9s | 82% | |
| Claude Opus 4.6 (Reasoning) | 94% | $0.011 | 11.8s | 83% | |
| Inception Mercury 2 | 83% | $0.0027 | 4.1s | 56% | |
| GPT-4o, Aug. 6th (temp=0) | 68% | $0.0032 | 4.0s | 67% | |
| Claude Sonnet 4.6 | 77% | $0.0050 | 8.7s | 60% | |
| Claude Opus 4.5 | 78% | $0.0085 | 9.8s | 57% | |
| GPT-4o, Aug. 6th (temp=1) | 65% | $0.0032 | 4.1s | 61% | |
| Gemini 2.5 Flash Lite | 72% | $0.0001 | 1.7s | 51% | |
| GPT-4o Mini (temp=1) | 68% | $0.0002 | 4.4s | 53% | |
| Grok 4 | 78% | $0.0083 | 19.1s | 52% | |
| GPT-5.4 Mini | 73% | $0.0015 | 2.3s | 46% | |
| GPT-5.2 | 88% | $0.032 | 39.2s | 67% | |
| GPT-5.4 (Reasoning, Low) | 78% | $0.013 | 16.2s | 53% | |
| Qwen 3.5 27B | 93% | $0.024 | 1.6m | 70% | |
| DeepSeek-V2 Chat | 73% | $0.0001 | 20.6s | 48% | |
| Nemotron 3 Super | 82% | $0.0000 | 1.2m | 54% | |
| GPT-5.4 Nano (Reasoning) | 75% | $0.0038 | 20.7s | 46% | |
| GPT-5.4 (Reasoning) | 83% | $0.024 | 30.0s | 57% | |
| Grok 4.1 Fast | 69% | $0.0003 | 7.5s | 41% | |
dialogue-500
Write 500 words with 30% dialogue
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-5 Mini | 87% | $0.013 | 1.1m | |
| Z.AI GLM 5 Turbo | 97% | $0.053 | 2.2m | |
| GPT-5 | 84% | $0.074 | 2.0m | |
| MiniMax M2.5 | 74% | $0.024 | 3.9m | |
| o4 Mini High | 80% | $0.066 | 2.5m | |
| Gemini 3.1 Pro (Preview) | 96% | $0.176 | 2.3m | |
| Inception Mercury 2 | 58% | $0.0064 | 10.9s | |
| Nemotron 3 Super | 71% | $0.0000 | 4.5m | |
| Gemini 3 Flash (Preview) | 34% | $0.0021 | 6.8s | |
| Claude Sonnet 4.6 | 51% | $0.011 | 17.5s | |
| o4 Mini | 73% | $0.043 | 1.7m | |
| Claude Opus 4.5 | 52% | $0.018 | 19.6s | |
| Claude Opus 4.6 | 43% | $0.019 | 21.0s | |
| Claude 3.7 Sonnet | 38% | $0.012 | 15.7s | |
| Grok 4.20 (Beta) | 33% | $0.0040 | 4.0s | |
| GPT-4o, Aug. 6th (temp=0) | 31% | $0.0070 | 8.1s | |
| Qwen 3.5 27B | 46% | $0.028 | 2.4m | |
| Gemini 3.1 Flash Lite (Preview) | 29% | $0.0009 | 3.7s | |
| GPT-4o Mini (temp=0) | 30% | $0.0005 | 10.8s | |
| ByteDance Seed 2.0 Lite | 30% | $0.0053 | 1.0m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Z.AI GLM 5 Turbo | 97% | 83% | 83% | |
| Gemini 3.1 Pro (Preview) | 96% | 76% | 76% | |
| GPT-5 Mini | 87% | 59% | 57% | |
| Claude Sonnet 4.6 (Reasoning) | 82% | 58% | 55% | |
| o4 Mini High | 80% | 47% | 45% | |
| MiniMax M2.7 | 71% | 63% | 43% | |
| Claude Sonnet 4.6 | 51% | 81% | 41% | |
| o4 Mini | 73% | 55% | 39% | |
| GPT-5 | 84% | 36% | 36% | |
| Claude Opus 4.5 | 52% | 69% | 35% | |
| MiniMax M2.5 | 74% | 34% | 33% | |
| Nemotron 3 Super | 71% | 36% | 32% | |
| Claude 3.7 Sonnet | 38% | 65% | 30% | |
| Claude Opus 4.6 (Reasoning) | 68% | 33% | 28% | |
| Gemini 3 Flash (Preview) | 34% | 56% | 27% | |
| Grok 4.20 (Beta) | 33% | 62% | 27% | |
| Inception Mercury 2 | 58% | 45% | 26% | |
| GPT-4o, Aug. 6th (temp=0) | 31% | 62% | 26% | |
| ByteDance Seed 2.0 Mini | 35% | 67% | 24% | |
| Qwen 3.5 27B | 46% | 46% | 23% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Z.AI GLM 5 Turbo | 97% | $0.053 | 2.2m | 83% | |
| GPT-5 Mini | 87% | $0.013 | 1.1m | 57% | |
| Gemini 3.1 Pro (Preview) | 96% | $0.176 | 2.3m | 76% | |
| Claude Sonnet 4.6 | 51% | $0.011 | 17.5s | 41% | |
| o4 Mini | 73% | $0.043 | 1.7m | 39% | |
| Inception Mercury 2 | 58% | $0.0064 | 10.9s | 26% | |
| Claude Opus 4.5 | 52% | $0.018 | 19.6s | 35% | |
| o4 Mini High | 80% | $0.066 | 2.5m | 45% | |
| GPT-5 | 84% | $0.074 | 2.0m | 36% | |
| Claude 3.7 Sonnet | 38% | $0.012 | 15.7s | 30% | |
| Gemini 3 Flash (Preview) | 34% | $0.0021 | 6.8s | 27% | |
| Grok 4.20 (Beta) | 33% | $0.0040 | 4.0s | 27% | |
| MiniMax M2.5 | 74% | $0.024 | 3.9m | 33% | |
| Nemotron 3 Super | 71% | $0.0000 | 4.5m | 32% | |
| GPT-4o, Aug. 6th (temp=0) | 31% | $0.0070 | 8.1s | 26% | |
| Claude Opus 4.6 | 43% | $0.019 | 21.0s | 20% | |
| GPT-4o Mini (temp=0) | 30% | $0.0005 | 10.8s | 22% | |
| Gemini 3.1 Flash Lite (Preview) | 29% | $0.0009 | 3.7s | 20% | |
| Grok 4 | 32% | $0.015 | 35.4s | 22% | |
| Gemini 2.5 Flash (Reasoning) | 33% | $0.019 | 37.0s | 22% | |
| Median | Evaluator |
|---|---|
| 8.6% | Dialogue to Total Word Ratio |
| 13.6% | Matches word count |
Write 500 words with 50% dialogue
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-5 Mini | 79% | $0.012 | 1.2m | |
| GPT-5 | 95% | $0.059 | 1.5m | |
| Z.AI GLM 5 Turbo | 95% | $0.055 | 2.2m | |
| Claude Opus 4.6 | 71% | $0.019 | 20.3s | |
| Inception Mercury 2 | 56% | $0.0058 | 9.1s | |
| Claude Sonnet 4.6 | 49% | $0.011 | 17.6s | |
| Claude Opus 4.6 (Reasoning) | 53% | $0.026 | 24.3s | |
| o4 Mini | 67% | $0.030 | 1.3m | |
| GPT-4o Mini (temp=0) | 29% | $0.0005 | 13.5s | |
| Claude 3.7 Sonnet | 32% | $0.012 | 16.7s | |
| Grok 4 Fast | 30% | $0.0005 | 9.3s | |
| Claude Opus 4.5 | 37% | $0.019 | 18.9s | |
| Claude Sonnet 4.6 (Reasoning) | 50% | $0.042 | 37.2s | |
| Ministral 3 14B | 26% | $0.0002 | 9.5s | |
| Gemini 3.1 Flash Lite (Preview) | 32% | $0.0009 | 3.8s | |
| Nemotron 3 Super | 57% | $0.0000 | 3.2m | |
| GPT-4.1 | 26% | $0.0060 | 9.5s | |
| GPT-5.4 Nano (Reasoning) | 31% | $0.0063 | 28.7s | |
| Inception Mercury | 29% | $0.0005 | 10.2s | |
| Grok 4 | 25% | $0.015 | 35.0s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 95% | 79% | 79% | |
| GPT-5 | 95% | 70% | 70% | |
| GPT-5 Mini | 79% | 58% | 48% | |
| Claude Opus 4.6 | 71% | 59% | 43% | |
| Claude Sonnet 4.6 | 49% | 67% | 32% | |
| MiniMax M2.7 | 72% | 37% | 31% | |
| o4 Mini | 67% | 53% | 30% | |
| Claude Sonnet 4.6 (Reasoning) | 50% | 58% | 28% | |
| Claude 3.7 Sonnet | 32% | 59% | 27% | |
| Claude Opus 4.5 | 37% | 58% | 27% | |
| Claude Opus 4.6 (Reasoning) | 53% | 49% | 25% | |
| MiniMax M2.5 | 63% | 42% | 24% | |
| Nemotron 3 Super | 57% | 43% | 24% | |
| GPT-4o Mini (temp=0) | 29% | 58% | 23% | |
| Gemini 3.1 Flash Lite (Preview) | 32% | 72% | 23% | |
| Grok 4 Fast | 30% | 58% | 23% | |
| Claude Opus 4 | 26% | 64% | 22% | |
| o4 Mini High | 57% | 41% | 21% | |
| Inception Mercury 2 | 56% | 40% | 20% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Z.AI GLM 5 Turbo | 95% | $0.055 | 2.2m | 79% | |
| GPT-5 | 95% | $0.059 | 1.5m | 70% | |
| GPT-5 Mini | 79% | $0.012 | 1.2m | 48% | |
| Claude Opus 4.6 | 71% | $0.019 | 20.3s | 43% | |
| Inception Mercury 2 | 56% | $0.0058 | 9.1s | 20% | |
| Claude Sonnet 4.6 | 49% | $0.011 | 17.6s | 32% | |
| o4 Mini | 67% | $0.030 | 1.3m | 30% | |
| Claude Opus 4.6 (Reasoning) | 53% | $0.026 | 24.3s | 25% | |
| Gemini 3.1 Pro (Preview) | 100% | $0.189 | 2.6m | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 32% | $0.0009 | 3.8s | 23% | |
| Grok 4 Fast | 30% | $0.0005 | 9.3s | 23% | |
| GPT-4o Mini (temp=0) | 29% | $0.0005 | 13.5s | 23% | |
| Claude Opus 4.5 | 37% | $0.019 | 18.9s | 27% | |
| Claude 3.7 Sonnet | 32% | $0.012 | 16.7s | 27% | |
| Claude Sonnet 4.6 (Reasoning) | 50% | $0.042 | 37.2s | 28% | |
| Ministral 3 14B | 26% | $0.0002 | 9.5s | 20% | |
| GPT-4.1 | 26% | $0.0060 | 9.5s | 18% | |
| GPT-5.4 Nano (Reasoning) | 31% | $0.0063 | 28.7s | 17% | |
| Inception Mercury | 29% | $0.0005 | 10.2s | 10% | |
| Gemini 2.5 Flash Lite | 21% | $0.0003 | 3.6s | 10% | |
| Median | Evaluator |
|---|---|
| 17.4% | Dialogue to Total Word Ratio |
| 12.0% | Matches word count |
Write 500 words with 70% dialogue
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-5 Mini | 70% | $0.014 | 1.2m | |
| Grok 4 Fast | 61% | $0.0005 | 9.4s | |
| Nemotron 3 Super | 74% | $0.0000 | 3.3m | |
| Inception Mercury 2 | 34% | $0.0047 | 7.3s | |
| Claude 3.7 Sonnet | 54% | $0.012 | 16.2s | |
| Claude Opus 4.6 (Reasoning) | 57% | $0.021 | 22.2s | |
| GPT-4o, Aug. 6th (temp=1) | 45% | $0.0074 | 8.9s | |
| Mistral Medium 3.1 | 44% | $0.0017 | 18.0s | |
| Gemini 2.5 Flash Lite (Reasoning) | 28% | $0.0022 | 26.4s | |
| GPT-4o Mini (temp=0) | 32% | $0.0005 | 11.7s | |
| GPT-4o, Aug. 6th (temp=0) | 54% | $0.0070 | 8.2s | |
| Claude Opus 4.6 | 41% | $0.019 | 21.7s | |
| DeepSeek-V2 Chat | 40% | $0.0002 | 41.2s | |
| Claude 3.5 Sonnet | 46% | $0.013 | 53.7s | |
| Claude Sonnet 4.6 | 37% | $0.011 | 17.3s | |
| GPT-5 | 54% | $0.051 | 1.5m | |
| Aion 2.0 | 28% | $0.0020 | 31.3s | |
| Inception Mercury | 28% | $0.0011 | 7.9s | |
| Stealth: Healer Alpha | 29% | $0.0000 | 10.1s | |
| Llama 3.1 70B | 29% | $0.0006 | 6.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude 3.7 Sonnet | 54% | 76% | 39% | |
| Nemotron 3 Super | 74% | 44% | 38% | |
| Claude Opus 4.6 | 41% | 67% | 31% | |
| Grok 4 Fast | 61% | 55% | 31% | |
| Inception Mercury 2 | 34% | 60% | 29% | |
| Claude Opus 4.6 (Reasoning) | 57% | 54% | 29% | |
| o4 Mini High | 62% | 43% | 28% | |
| GPT-4o Mini (temp=0) | 32% | 62% | 26% | |
| GPT-5 Mini | 70% | 34% | 25% | |
| MiniMax M2.5 | 55% | 50% | 25% | |
| Gemini 2.5 Flash Lite (Reasoning) | 28% | 55% | 24% | |
| Claude Sonnet 4.6 | 37% | 62% | 23% | |
| MiniMax M2.7 | 54% | 42% | 21% | |
| Aion 2.0 | 28% | 54% | 18% | |
| ByteDance Seed 1.6 | 25% | 59% | 18% | |
| DeepSeek-V2 Chat | 40% | 45% | 18% | |
| Grok 4 | 31% | 61% | 17% | |
| Ministral 3 8B | 27% | 68% | 17% | |
| Mistral Medium 3.1 | 44% | 37% | 17% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Grok 4 Fast | 61% | $0.0005 | 9.4s | 31% | |
| Claude 3.7 Sonnet | 54% | $0.012 | 16.2s | 39% | |
| Claude Opus 4.6 (Reasoning) | 57% | $0.021 | 22.2s | 29% | |
| GPT-5 Mini | 70% | $0.014 | 1.2m | 25% | |
| GPT-4o, Aug. 6th (temp=0) | 54% | $0.0070 | 8.2s | 17% | |
| Nemotron 3 Super | 74% | $0.0000 | 3.3m | 38% | |
| Inception Mercury 2 | 34% | $0.0047 | 7.3s | 29% | |
| Claude Opus 4.6 | 41% | $0.019 | 21.7s | 31% | |
| Mistral Medium 3.1 | 44% | $0.0017 | 18.0s | 17% | |
| GPT-4o Mini (temp=0) | 32% | $0.0005 | 11.7s | 26% | |
| GPT-4o, Aug. 6th (temp=1) | 45% | $0.0074 | 8.9s | 13% | |
| Claude Sonnet 4.6 | 37% | $0.011 | 17.3s | 23% | |
| Gemini 3.1 Pro (Preview) | 100% | $0.229 | 3.3m | 100% | |
| DeepSeek-V2 Chat | 40% | $0.0002 | 41.2s | 18% | |
| Grok 4.1 Fast | 31% | $0.0005 | 12.8s | 17% | |
| Gemini 2.5 Flash Lite (Reasoning) | 28% | $0.0022 | 26.4s | 24% | |
| Ministral 3 8B | 27% | $0.0001 | 3.9s | 17% | |
| Stealth: Healer Alpha | 29% | $0.0000 | 10.1s | 15% | |
| Inception Mercury | 28% | $0.0011 | 7.9s | 15% | |
| Gemini 3.1 Flash Lite (Preview) | 26% | $0.0009 | 3.6s | 15% | |
| Median | Evaluator |
|---|---|
| 21.3% | Dialogue to Total Word Ratio |
| 10.0% | Matches word count |
Ungrouped
Write unattributed dialogue
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Z.AI GLM 5 Turbo | 100% | |
| GPT-5.4 (Reasoning) | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Qwen 3.5 122B | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | |
| Z.AI GLM 5 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| Qwen 3.5 27B | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| Aion 2.0 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 96% | $0.0001 | 1.4s | |
| Mistral Small Creative | 88% | $0.0001 | 2.2s | |
| Ministral 3 14B | 100% | $0.0001 | 8.2s | |
| Mistral Small 4 | 100% | $0.0002 | 2.9s | |
| Mistral Small 3.2 24B | 100% | $0.0001 | 4.1s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0004 | 2.0s | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 3.4s | |
| Llama 3.1 70B | 96% | $0.0003 | 3.2s | |
| GPT-4.1 Mini | 87% | $0.0004 | 3.3s | |
| Qwen3 235B A22B Instruct 2507 | 100% | $0.0001 | 8.3s | |
| Claude 3 Haiku | 100% | $0.0004 | 3.4s | |
| Grok 4 Fast | 100% | $0.0003 | 4.2s | |
| Gemma 3 12B | 100% | $0.0000 | 7.4s | |
| Gemini 3 Flash (Preview) | 100% | $0.0009 | 3.2s | |
| Z.AI GLM 4.5 | 96% | $0.0005 | 5.8s | |
| GPT-4o Mini (temp=1) | 100% | $0.0002 | 15.5s | |
| Llama 3.1 Nemotron 70B | 100% | $0.0001 | 9.6s | |
| Mistral Medium 3.1 | 100% | $0.0006 | 6.1s | |
| Mistral Large 3 | 100% | $0.0005 | 7.2s | |
| DeepSeek V3 (2025-03-24) | 100% | $0.0003 | 14.6s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Mistral Small 4 | 100% | $0.0002 | 2.9s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0002 | 3.4s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0001 | 4.1s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0004 | 2.0s | 100% | |
| Grok 4 Fast | 100% | $0.0003 | 4.2s | 100% | |
| Claude 3 Haiku | 100% | $0.0004 | 3.4s | 100% | |
| Gemma 3 12B | 100% | $0.0000 | 7.4s | 100% | |
| Ministral 3 14B | 100% | $0.0001 | 8.2s | 100% | |
| Qwen3 235B A22B Instruct 2507 | 100% | $0.0001 | 8.3s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0009 | 3.2s | 100% | |
| Llama 3.1 Nemotron 70B | 100% | $0.0001 | 9.6s | 100% | |
| Mistral Medium 3.1 | 100% | $0.0006 | 6.1s | 100% | |
| Mistral Large 3 | 100% | $0.0005 | 7.2s | 100% | |
| Grok 4.1 Fast | 100% | $0.0003 | 10.2s | 100% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0004 | 10.1s | 100% | |
| Gemma 3 27B | 100% | $0.0001 | 12.5s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0002 | 15.5s | 100% | |
| DeepSeek V3 (2025-03-24) | 100% | $0.0003 | 14.6s | 100% | |
| DeepSeek-V2 Chat | 100% | $0.0001 | 16.1s | 100% | |
| DeepSeek V3.1 | 100% | $0.0002 | 15.4s | 100% | |
| Median | Evaluator |
|---|---|
| 96.1% | Count dialogue tags |
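The "Count dialogue tags" evaluator for the unattributed-dialogue task checks whether quoted speech carries attribution tags such as "she said". A rough regex sketch; the benchmark's actual tag detection and verb list are not published, so both patterns below are illustrative assumptions:

```python
import re

# Illustrative verb list; the benchmark's actual list is not published.
VERBS = "said|asked|replied|whispered|shouted|muttered|answered"

def count_dialogue_tags(text: str) -> int:
    """Count speech attributions adjacent to double-quoted dialogue,
    covering both '"...," she said' and 'She said, "..."' shapes."""
    # Tag after the closing quote: '"...," she said'
    after = re.findall(rf'"\s*(\w+)\s+(?:{VERBS})\b', text)
    # Tag before the opening quote: 'She said, "..."'
    before = re.findall(rf'\b(?:{VERBS})\s*,?\s*"', text)
    return len(after) + len(before)
```

A fully unattributed passage scores zero tags, which is why nearly every model hits 100% on this task.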