Text Replacement
Tests deterministic text transformations: renaming characters and locations, expanding contractions, rewriting tense, shifting POV, swapping character genders, combining transformations, and avoiding specific words. Each run is scored by checking every expected change independently.
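The independent-check scoring can be sketched as follows. This is a hypothetical illustration of the rubric's shape, not the benchmark's actual harness; `score_rename` and the sample text are invented for the example.

```python
# Hypothetical sketch: each expected change is verified independently,
# and the test score is the fraction of checks that pass.

def score_rename(output: str, renames: dict[str, str]) -> float:
    """Score a character-rename transformation by independent checks."""
    checks = []
    for old, new in renames.items():
        checks.append(new in output)      # new name must appear
        checks.append(old not in output)  # old name must be gone
    return sum(checks) / len(checks)

text = "Mirabel crossed the bridge while Aldric waited."
print(score_rename(text, {"Elena": "Mirabel", "Gregor": "Aldric"}))  # 1.0
```

A partially converted output (say, one name still unrenamed) fails two of the four checks and scores 0.5 rather than 0, which is what "checking every expected change independently" buys over pass/fail grading.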
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 97% | $0.0003 | 1.7s | |
| Gemini 3.1 Flash Lite (Preview) | 99% | $0.0010 | 1.8s | |
| Mistral Small 4 | 96% | $0.0004 | 3.3s | |
| Mistral Small 3.2 24B | 97% | $0.0002 | 5.0s | |
| Gemini 2.5 Flash | 99% | $0.0015 | 2.2s | |
| Grok 4 Fast | 99% | $0.0008 | 6.5s | |
| Gemini 3 Flash (Preview) | 99% | $0.0019 | 3.4s | |
| GPT-4.1 Mini | 98% | $0.0011 | 7.0s | |
| Inception Mercury 2 | 95% | $0.0017 | 2.3s | |
| Mistral Large 3 | 98% | $0.0011 | 7.7s | |
| Gemma 3 12B | 95% | $0.0001 | 9.0s | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0015 | 7.2s | |
| Qwen 2.5 72B | 98% | $0.0003 | 10.9s | |
| GPT-4o Mini (temp=1) | 95% | $0.0004 | 9.5s | |
| Grok 4.20 (Beta) | 98% | $0.0034 | 1.8s | |
| Claude Haiku 4.5 | 99% | $0.0036 | 3.2s | |
| Stealth: Hunter Alpha | 98% | $0.0000 | 19.5s | |
| Mistral Medium 3.1 | 97% | $0.0013 | 5.9s | |
| Grok 4.1 Fast | 99% | $0.0010 | 12.4s | |
| Mistral Small Creative | 96% | $0.0002 | 3.1s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
12 low-scoring outliers hidden: Ministral 3 8B (87.0%), Ministral 8B (86.7%), Mistral NeMO (86.6%), Arcee AI: Trinity Mini (85.7%), Nemotron 3 Nano (83.3%), Ministral 3 3B (81.2%), Ministral 3B (80.9%), Cohere Command R+ (Aug. 2024) (73.7%), LFM2 24B (71.7%), Hermes 3 70B (69.5%), Rocinante 12B (66.3%), Claude 3 Haiku (61.1%).
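The quadrant split described above can be sketched like this; the data points are illustrative, not taken from the tables, and the tie-breaking at the median lines is an assumption.

```python
# Quadrant lines sit at the median cost and median score, so each
# model falls into one of four quadrants.
from statistics import median

models = {
    "A": (0.0003, 0.97),  # (cost in $, score)
    "B": (0.0036, 0.99),
    "C": (0.0002, 0.95),
    "D": (0.0034, 0.96),
}

cost_med = median(c for c, _ in models.values())
score_med = median(s for _, s in models.values())

def quadrant(cost: float, score: float) -> str:
    cheap = "cheap" if cost <= cost_med else "expensive"
    good = "high-score" if score >= score_med else "low-score"
    return f"{cheap}/{good}"

for name, (c, s) in models.items():
    print(name, quadrant(c, s))
```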
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 | 100% | 99% | 99% | |
| Claude Opus 4.5 | 100% | 98% | 98% | |
| Claude Sonnet 4 | 100% | 98% | 98% | |
| Gemini 3 Pro (Preview) | 100% | 98% | 98% | |
| Claude Opus 4.6 (Reasoning) | 100% | 98% | 98% | |
| Claude Sonnet 4.5 | 100% | 98% | 98% | |
| Grok 4 | 100% | 98% | 98% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 98% | 98% | |
| Qwen 3.5 27B | 99% | 98% | 98% | |
| Z.AI GLM 5 | 99% | 98% | 98% | |
| Gemini 2.5 Pro | 99% | 97% | 97% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | 97% | 97% | |
| Gemini 3 Flash (Preview, Reasoning) | 99% | 97% | 97% | |
| Z.AI GLM 4.7 | 99% | 97% | 97% | |
| Claude Sonnet 4.6 | 99% | 97% | 97% | |
| Grok 4.20 (Beta, Reasoning) | 99% | 97% | 97% | |
| Claude Haiku 4.5 | 99% | 97% | 97% | |
| Gemini 3 Flash (Preview) | 99% | 97% | 97% | |
| GPT-5 | 99% | 96% | 96% | |
| GPT-5.1 | 99% | 96% | 96% | |
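The stability column above multiplies the median score by a consistency term. A minimal sketch, assuming consistency is one minus the run-to-run range; the site's exact definition may differ:

```python
# Stability = median score x consistency across repeated runs.
from statistics import median

def stability(run_scores: list[float]) -> tuple[float, float, float]:
    med = median(run_scores)
    spread = max(run_scores) - min(run_scores)
    consistency = 1.0 - spread  # assumption: 1 minus the observed range
    return med, consistency, med * consistency

med, cons, stab = stability([0.99, 1.00, 0.98, 1.00, 0.99])
print(round(med, 2), round(cons, 2), round(stab, 2))
```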
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3.1 Flash Lite (Preview) | 99% | $0.0010 | 1.8s | 96% | |
| Gemini 3 Flash (Preview) | 99% | $0.0019 | 3.4s | 97% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0015 | 7.2s | 97% | |
| Claude Haiku 4.5 | 99% | $0.0036 | 3.2s | 97% | |
| Grok 4 Fast | 99% | $0.0008 | 6.5s | 95% | |
| Gemini 2.5 Flash | 99% | $0.0015 | 2.2s | 92% | |
| GPT-4.1 Mini | 98% | $0.0011 | 7.0s | 94% | |
| Stealth: Healer Alpha | 99% | $0.0000 | 14.3s | 94% | |
| Grok 4.1 Fast | 99% | $0.0010 | 12.4s | 92% | |
| Mistral Medium 3.1 | 97% | $0.0013 | 5.9s | 91% | |
| Gemini 2.5 Flash Lite | 97% | $0.0003 | 1.7s | 86% | |
| Claude Sonnet 4.5 | 100% | $0.011 | 4.9s | 98% | |
| Claude Sonnet 4 | 100% | $0.011 | 6.1s | 98% | |
| GPT-4.1 | 98% | $0.0054 | 4.4s | 92% | |
| Grok 4.20 (Beta) | 98% | $0.0034 | 1.8s | 89% | |
| Qwen 2.5 72B | 98% | $0.0003 | 10.9s | 89% | |
| Mistral Large 3 | 98% | $0.0011 | 7.7s | 88% | |
| Claude Sonnet 4.6 | 99% | $0.011 | 4.7s | 97% | |
| Mistral Large | 98% | $0.0044 | 7.6s | 90% | |
| Stealth: Hunter Alpha | 98% | $0.0000 | 19.5s | 89% | |
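The composite ranking combines performance, cost, speed, and stability, but the weighting is not published here. This sketch assumes min-max normalization against illustrative caps and equal weights, purely to show the shape of such a score:

```python
# Assumed composite: normalize cost and latency so higher is better,
# then average the four components with equal weights.

def composite(score, cost, seconds, stability,
              max_cost=0.011, max_seconds=20.0):
    cheapness = 1.0 - min(cost / max_cost, 1.0)    # cheaper is better
    speed = 1.0 - min(seconds / max_seconds, 1.0)  # faster is better
    return (score + cheapness + speed + stability) / 4

print(round(composite(0.99, 0.0010, 1.8, 0.96), 3))
```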
| Model | Total ▼ | Character rename (specific) | Character rename (generic) | Location rename (specific) | Location rename (generic) | Expand contractions (specific) | Expand contractions (generic) | Tense rewrite (specific) | Tense rewrite (generic) | POV shift (specific) | POV shift (generic) | Gender swap (specific) | Gender swap (generic) | Combined POV + tense (specific) | Combined POV + tense (generic) | Passive → active (specific) | Passive → active (generic) | Avoid said/asked (specific) | Avoid said/asked (generic) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 98% | 97% | 100% | 100% |
| Claude Opus 4.6 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 99% | 100% | 100% | 100% | 99% | 98% | 98% | 100% | 100% |
| Claude Sonnet 4.5 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 99% | 99% | 96% | 100% | 100% |
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 100% | 100% | 99% | 99% | 96% | 100% | 100% |
| Grok 4 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 99% | 96% | 100% | 100% |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 100% | 100% | 100% | 99% | 99% | 98% | 96% | 100% | 100% |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 99% | 100% | 100% | 100% | 100% | 100% | 99% | 99% | 96% | 100% | 100% |
| Claude Opus 4.5 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 99% | 100% | 100% | 100% | 99% | 97% | 97% | 100% | 100% |
| Z.AI GLM 5 | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 98% | 100% | 100% | 100% | 100% | 100% | 99% | 98% | 96% | 100% | 100% |
| Gemini 2.5 Pro | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 99% | 99% | 95% | 100% | 100% |
| GPT-5 | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 98% | 98% | 97% | 100% | 100% |
| Qwen 3.5 27B | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 97% | 100% | 100% | 100% | 99% | 100% | 99% | 98% | 97% | 100% | 100% |
| Z.AI GLM 4.7 | 99% | 100% | 100% | 100% | 99% | 100% | 100% | 100% | 97% | 100% | 100% | 100% | 100% | 100% | 99% | 98% | 96% | 100% | 100% |
| Qwen 3.5 Plus (2026-02-15) | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 100% | 99% | 100% | 99% | 99% | 97% | 96% | 100% | 100% |
| Grok 4.20 (Beta, Reasoning) | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 94% | 100% | 100% | 100% | 100% | 99% | 99% | 99% | 96% | 100% | 100% |
Generic Prompt
Character rename: Elena->Mirabel, Gregor->Aldric
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Inception Mercury | 100% | $0.0004 | 815ms | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | |
| Inception Mercury 2 | 100% | $0.0007 | 971ms | |
| Ministral 8B | 100% | $0.0001 | 3.2s | |
| Ministral 3 8B | 100% | $0.0002 | 2.9s | |
| Mistral Small Creative | 100% | $0.0002 | 2.9s | |
| Mistral Small 4 | 100% | $0.0004 | 2.8s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 6.7s | |
| GPT-4.1 Nano | 100% | $0.0003 | 3.7s | |
| Ministral 3 14B | 100% | $0.0002 | 4.0s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.6s | |
| Grok 4 Fast | 100% | $0.0005 | 3.3s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0007 | 2.8s | |
| GPT-5.4 Nano | 100% | $0.0007 | 2.7s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0007 | 3.2s | |
| Gemma 3 4B | 100% | $0.0001 | 6.1s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 5.7s | |
| Mistral NeMO | 93% | $0.0002 | 2.3s | |
| Llama 3.1 8B | 100% | $0.0000 | 9.9s | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Inception Mercury | 100% | $0.0004 | 815ms | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | 100% | |
| Inception Mercury 2 | 100% | $0.0007 | 971ms | 100% | |
| Ministral 3 8B | 100% | $0.0002 | 2.9s | 100% | |
| Mistral Small Creative | 100% | $0.0002 | 2.9s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.6s | 100% | |
| Ministral 8B | 100% | $0.0001 | 3.2s | 100% | |
| Mistral Small 4 | 100% | $0.0004 | 2.8s | 100% | |
| GPT-5.4 Nano | 100% | $0.0007 | 2.7s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0007 | 2.8s | 100% | |
| GPT-4.1 Nano | 100% | $0.0003 | 3.7s | 100% | |
| Grok 4 Fast | 100% | $0.0005 | 3.3s | 100% | |
| Ministral 3 14B | 100% | $0.0002 | 4.0s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0007 | 3.2s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0007 | 4.7s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 5.7s | 100% | |
| Gemma 3 4B | 100% | $0.0001 | 6.1s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.2s | 100% | |
| GPT-5.4 Mini | 100% | $0.0027 | 1.9s | 100% | |
| Median | Evaluator |
|---|---|
| 100.0% | Name replacement accuracy |
| 100.0% | No remaining old names |
| 100.0% | Non-name text preserved |
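The three evaluators for the rename test imply deterministic checks like the ones below. The function names and reference texts are assumptions; they illustrate the kind of check each evaluator name suggests.

```python
# Illustrative evaluator checks for a character-rename test.
import re

def no_remaining_old_names(output: str, old_names: list[str]) -> bool:
    """"No remaining old names": none of the old names survive as whole words."""
    return not any(re.search(rf"\b{re.escape(n)}\b", output) for n in old_names)

def non_name_text_preserved(source: str, output: str,
                            renames: dict[str, str]) -> bool:
    """"Non-name text preserved": renaming the source reproduces the output."""
    expected = source
    for old, new in renames.items():
        expected = re.sub(rf"\b{re.escape(old)}\b", new, expected)
    return expected == output

src = "Elena nodded at Gregor."
out = "Mirabel nodded at Aldric."
renames = {"Elena": "Mirabel", "Gregor": "Aldric"}
print(no_remaining_old_names(out, list(renames)))  # True
print(non_name_text_preserved(src, out, renames))  # True
```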
Location rename: market square, outer ring, bridge, northern mines
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.5s | |
| GPT-4.1 Nano | 98% | $0.0003 | 3.5s | |
| Mistral Small 4 | 99% | $0.0004 | 2.9s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 6.9s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.7s | |
| Inception Mercury | 100% | $0.0004 | 4.0s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.6s | |
| Grok 4 Fast | 100% | $0.0005 | 3.8s | |
| Inception Mercury 2 | 100% | $0.0010 | 1.4s | |
| Llama 3.1 70B | 100% | $0.0005 | 17.8s | |
| ByteDance Seed 1.6 Flash | 99% | $0.0003 | 5.3s | |
| Grok 4.1 Fast | 100% | $0.0006 | 5.2s | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | |
| Gemma 3 12B | 100% | $0.0001 | 8.6s | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0009 | 7.3s | |
| GPT-4.1 Mini | 100% | $0.0010 | 5.8s | |
| Mistral Large 3 | 100% | $0.0010 | 7.3s | |
| Mistral Small Creative | 96% | $0.0002 | 2.9s | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 6.8s | |
| Stealth: Hunter Alpha | 100% | $0.0000 | 16.3s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.5s | 100% | |
| Inception Mercury 2 | 100% | $0.0010 | 1.4s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.7s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | 100% | |
| Grok 4 Fast | 100% | $0.0005 | 3.8s | 100% | |
| Inception Mercury | 100% | $0.0004 | 4.0s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.6s | 100% | |
| Grok 4.1 Fast | 100% | $0.0006 | 5.2s | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 6.9s | 100% | |
| GPT-5.4 Mini | 100% | $0.0027 | 2.2s | 100% | |
| GPT-4.1 Mini | 100% | $0.0010 | 5.8s | 100% | |
| Mistral Small 4 | 99% | $0.0004 | 2.9s | 96% | |
| Gemma 3 12B | 100% | $0.0001 | 8.6s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0009 | 7.3s | 100% | |
| Claude Haiku 4.5 | 100% | $0.0034 | 2.8s | 100% | |
| ByteDance Seed 1.6 Flash | 99% | $0.0003 | 5.3s | 97% | |
| Mistral Large 3 | 100% | $0.0010 | 7.3s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 6.8s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0030 | 4.2s | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0034 | 3.7s | 100% | |
| Median | Evaluator |
|---|---|
| 100.0% | Name replacement accuracy |
| 100.0% | No remaining old names |
| 100.0% | Non-name text preserved |
Expand all contractions
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Z.AI GLM 5 Turbo | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Qwen 3.5 122B | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | |
| Z.AI GLM 5 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| Qwen 3.5 27B | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
| Z.AI GLM 4.7 | 100% | |
| Gemini 2.5 Pro | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0002 | 1.3s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 3.7s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0007 | 1.5s | |
| Mistral Small 4 | 99% | $0.0003 | 2.4s | |
| Stealth: Healer Alpha | 99% | $0.0000 | 7.8s | |
| Mistral NeMO | 98% | $0.0001 | 2.6s | |
| Grok 4 Fast | 100% | $0.0007 | 4.9s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 7.0s | |
| Gemini 2.5 Flash | 100% | $0.0012 | 1.8s | |
| Ministral 8B | 95% | $0.0001 | 2.4s | |
| Inception Mercury | 98% | $0.0003 | 4.9s | |
| GPT-5.4 Nano (Reasoning, Low) | 98% | $0.0006 | 2.4s | |
| Ministral 3 14B | 98% | $0.0002 | 3.2s | |
| Stealth: Hunter Alpha | 100% | $0.0000 | 15.8s | |
| Llama 3.1 8B | 98% | $0.0000 | 6.1s | |
| GPT-5.4 Nano (Reasoning) | 98% | $0.0006 | 2.5s | |
| Inception Mercury 2 | 97% | $0.0012 | 1.6s | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 8.4s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 8.4s | |
| Qwen 2.5 72B | 100% | $0.0002 | 7.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| Gemini 2.5 Pro | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0002 | 1.3s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0007 | 1.5s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 3.7s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0012 | 1.8s | 100% | |
| Mistral Small 4 | 99% | $0.0003 | 2.4s | 99% | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.7s | 100% | |
| Grok 4 Fast | 100% | $0.0007 | 4.9s | 99% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 7.0s | 100% | |
| Mistral Large 3 | 100% | $0.0008 | 5.8s | 100% | |
| Mistral NeMO | 98% | $0.0001 | 2.6s | 97% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0012 | 5.5s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 8.4s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 99% | $0.0008 | 4.6s | 99% | |
| GPT-5.4 Nano (Reasoning) | 98% | $0.0006 | 2.5s | 97% | |
| Ministral 3 14B | 98% | $0.0002 | 3.2s | 97% | |
| GPT-5.4 Nano (Reasoning, Low) | 98% | $0.0006 | 2.4s | 97% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 8.4s | 99% | |
| Claude Haiku 4.5 | 100% | $0.0027 | 2.2s | 100% | |
| GPT-4.1 Mini | 99% | $0.0008 | 5.1s | 99% | |
| Mistral Small Creative | 97% | $0.0002 | 2.3s | 97% | |
| Median | Evaluator |
|---|---|
| 99.7% | Name replacement accuracy |
| 100.0% | Non-name text preserved |
| 100.0% | Possessive traps preserved |
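The "possessive traps preserved" evaluator points at the classic failure mode of contraction expansion: a possessive "'s" wrongly expanded to "is". A minimal sketch of both checks, with an invented contraction list and sample sentences:

```python
# Contraction-expansion checks: no contractions left, possessives untouched.
import re

CONTRACTIONS = ["don't", "can't", "it's", "they're", "won't", "i'm"]

def no_contractions_left(output: str) -> bool:
    lowered = output.lower()
    return not any(c in lowered for c in CONTRACTIONS)

def possessives_preserved(source: str, output: str) -> bool:
    # Naive: only catches possessives on capitalized names, e.g. "Elena's".
    possessives = re.findall(r"\b[A-Z][a-z]+'s\b", source)
    return all(p in output for p in possessives)

src = "She said it's late, but Elena's lamp still burns."
out = "She said it is late, but Elena's lamp still burns."
print(no_contractions_left(out), possessives_preserved(src, out))  # True True
```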
Tense rewriting: past to present
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Qwen3 235B A22B Instruct 2507 | 100% | $0.0003 | 10.6s | |
| Gemini 2.5 Flash Lite | 98% | $0.0003 | 1.5s | |
| Mistral Small 3.2 24B | 99% | $0.0002 | 5.9s | |
| Mistral Small 4 | 99% | $0.0004 | 2.8s | |
| Ministral 3 14B | 99% | $0.0002 | 3.8s | |
| Mistral NeMO | 98% | $0.0002 | 3.0s | |
| Ministral 3 8B | 98% | $0.0002 | 3.7s | |
| Ministral 8B | 97% | $0.0001 | 3.6s | |
| Qwen 2.5 72B | 99% | $0.0003 | 10.4s | |
| Stealth: Hunter Alpha | 97% | $0.0000 | 11.2s | |
| Mistral Large 3 | 99% | $0.0010 | 7.4s | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0015 | 6.6s | |
| Stealth: Healer Alpha | 96% | $0.0000 | 13.0s | |
| Rocinante 12B | 69% | $0.0004 | 8.1s | |
| Mistral Small Creative | 96% | $0.0002 | 2.9s | |
| Arcee AI: Trinity Large (Preview) | 99% | $0.0000 | 21.2s | |
| Gemini 3.1 Flash Lite (Preview) | 96% | $0.0009 | 1.7s | |
| GPT-5.4 Nano (Reasoning) | 95% | $0.0008 | 3.6s | |
| Llama 3.1 8B | 96% | $0.0001 | 11.5s | |
| LFM2 24B | 96% | $0.0001 | 11.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Sonnet 4.5 | 100% | 100% | 100% | |
| Claude Opus 4.6 (Reasoning) | 100% | 99% | 99% | |
| Qwen3 235B A22B Instruct 2507 | 100% | 99% | 99% | |
| Claude Sonnet 4 | 100% | 99% | 99% | |
| Claude Opus 4.6 | 99% | 100% | 99% | |
| Gemini 3 Pro (Preview) | 99% | 100% | 99% | |
| Mistral Large 3 | 99% | 100% | 99% | |
| Claude 3.7 Sonnet | 99% | 100% | 99% | |
| Mistral Large 2 | 99% | 100% | 99% | |
| Mistral Large | 99% | 100% | 99% | |
| Mistral Small 3.2 24B | 99% | 100% | 99% | |
| Arcee AI: Trinity Large (Preview) | 99% | 100% | 99% | |
| Claude Opus 4.5 | 99% | 100% | 99% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | 100% | 99% | |
| Writer: Palmyra X5 | 100% | 99% | 99% | |
| Claude 3.5 Sonnet | 99% | 99% | 99% | |
| Grok 4 | 100% | 99% | 99% | |
| Gemini 3.1 Pro (Preview) | 99% | 100% | 99% | |
| Ministral 3 14B | 99% | 100% | 99% | |
| Ministral 3 8B | 98% | 100% | 98% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Mistral Small 3.2 24B | 99% | $0.0002 | 5.9s | 99% | |
| Ministral 3 14B | 99% | $0.0002 | 3.8s | 99% | |
| Qwen3 235B A22B Instruct 2507 | 100% | $0.0003 | 10.6s | 99% | |
| Ministral 3 8B | 98% | $0.0002 | 3.7s | 98% | |
| Mistral Large 3 | 99% | $0.0010 | 7.4s | 99% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0015 | 6.6s | 99% | |
| Mistral Small 4 | 99% | $0.0004 | 2.8s | 97% | |
| Mistral NeMO | 98% | $0.0002 | 3.0s | 97% | |
| Gemini 2.5 Flash Lite | 98% | $0.0003 | 1.5s | 96% | |
| Arcee AI: Trinity Large (Preview) | 99% | $0.0000 | 21.2s | 99% | |
| Qwen 2.5 72B | 99% | $0.0003 | 10.4s | 96% | |
| Gemini 3.1 Flash Lite (Preview) | 96% | $0.0009 | 1.7s | 96% | |
| Writer: Palmyra X5 | 100% | $0.0033 | 9.2s | 99% | |
| Mistral Large | 99% | $0.0041 | 7.1s | 99% | |
| Mistral Large 2 | 99% | $0.0041 | 7.2s | 99% | |
| Ministral 8B | 97% | $0.0001 | 3.6s | 93% | |
| LFM2 24B | 96% | $0.0001 | 11.1s | 96% | |
| Gemini 3 Flash (Preview) | 96% | $0.0018 | 3.2s | 95% | |
| Mistral Medium 3.1 | 96% | $0.0012 | 6.5s | 95% | |
| Llama 3.1 Nemotron 70B | 97% | $0.0013 | 13.9s | 97% | |
| Median | Evaluator |
|---|---|
| 88.6% | Dialogue content preserved |
| 97.4% | Name replacement accuracy |
| 100.0% | Non-name text preserved |
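The low "Dialogue content preserved" median (88.6%) reflects models that re-tense quoted speech along with the narration. One way such an evaluator might work, with invented sample sentences:

```python
# Quoted dialogue should survive a tense rewrite verbatim,
# even though the narration around it changes tense.
import re

def quoted_spans(text: str) -> list[str]:
    return re.findall(r'"([^"]*)"', text)

source = 'Elena walked in. "The mines are flooding," she warned.'
output = 'Elena walks in. "The mines are flooding," she warns.'
print(quoted_spans(source) == quoted_spans(output))  # True
```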
POV shift: 3rd person to 1st person (Elena's perspective)
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | |
|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | |
| Z.AI GLM 5 Turbo | 100% | |
| GPT-5.4 (Reasoning) | 100% | |
| GPT-5.1 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Qwen 3.5 122B | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| Qwen 3.5 27B | 100% | |
| ByteDance Seed 1.6 | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| GPT-5.2 | 100% | |
| Grok 4.1 Fast | 100% | |
| Z.AI GLM 4.6 | 100% | |
| MiniMax M2.7 | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.7s | |
| Inception Mercury 2 | 100% | $0.0007 | 1.0s | |
| Inception Mercury | 100% | $0.0004 | 1.8s | |
| Grok 4 Fast | 100% | $0.0004 | 2.6s | |
| GPT-4.1 Nano | 98% | $0.0003 | 3.5s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 7.4s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.2s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.7s | |
| Mistral Small 4 | 87% | $0.0004 | 3.3s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0008 | 3.2s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0007 | 2.9s | |
| GPT-5.4 Nano | 100% | $0.0007 | 3.4s | |
| Ministral 3 14B | 99% | $0.0002 | 3.7s | |
| Llama 3.1 8B | 99% | $0.0000 | 5.8s | |
| Gemma 3 12B | 100% | $0.0001 | 7.7s | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.2s | |
| Grok 4.1 Fast | 100% | $0.0006 | 7.3s | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 9.2s | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 9.0s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| MiniMax M2.7 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.7s | 100% | |
| Inception Mercury 2 | 100% | $0.0007 | 1.0s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.7s | 100% | |
| Grok 4 Fast | 100% | $0.0004 | 2.6s | 100% | |
| Inception Mercury | 100% | $0.0004 | 1.8s | 99% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0008 | 3.2s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.2s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0007 | 2.9s | 99% | |
| GPT-5.4 Nano | 100% | $0.0007 | 3.4s | 99% | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.2s | 100% | |
| Ministral 3 14B | 99% | $0.0002 | 3.7s | 99% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 7.4s | 100% | |
| GPT-5.4 Mini | 100% | $0.0026 | 2.2s | 99% | |
| GPT-5.4 Mini (Reasoning) | 100% | $0.0030 | 2.5s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 7.7s | 100% | |
| GPT-5.4 Mini (Reasoning, Low) | 100% | $0.0028 | 3.1s | 100% | |
| Grok 4.1 Fast | 100% | $0.0006 | 7.3s | 100% | |
| Claude Haiku 4.5 | 100% | $0.0033 | 2.9s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0014 | 6.5s | 100% | |
| Median | Evaluator |
|---|---|
| 100.0% | Name replacement accuracy |
| 100.0% | No remaining old names |
| 100.0% | Non-name text preserved |
Multi-character gender swap: Priya(F)->Rohan(M), Mara unchanged
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Mistral Small Creative | 100% | $0.0002 | 3.2s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0010 | 1.9s | |
| Grok 4 Fast | 100% | $0.0007 | 5.0s | |
| GPT-5.4 Nano (Reasoning, Low) | 94% | $0.0008 | 3.3s | |
| Stealth: Hunter Alpha | 100% | $0.0000 | 12.6s | |
| Gemini 2.5 Flash | 100% | $0.0017 | 2.4s | |
| Grok 4.1 Fast | 100% | $0.0007 | 10.2s | |
| Mistral Medium 3.1 | 100% | $0.0014 | 6.9s | |
| GPT-4.1 Mini | 100% | $0.0012 | 6.5s | |
| Inception Mercury 2 | 97% | $0.0014 | 2.0s | |
| Qwen 2.5 72B | 89% | $0.0003 | 11.6s | |
| Gemini 3 Flash (Preview) | 100% | $0.0021 | 3.5s | |
| Stealth: Healer Alpha | 99% | $0.0000 | 13.9s | |
| ByteDance Seed 1.6 Flash | 95% | $0.0007 | 12.0s | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0017 | 7.6s | |
| GPT-5.4 Mini (Reasoning, Low) | 89% | $0.0035 | 2.9s | |
| Inception Mercury | 93% | $0.0005 | 3.4s | |
| Llama 3.1 Nemotron 70B | 100% | $0.0015 | 16.8s | |
| Claude Haiku 4.5 | 100% | $0.0041 | 3.1s | |
| Mistral Small 4 (Reasoning) | 86% | $0.0016 | 14.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| MiniMax M2.7 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Mistral Small Creative | 100% | $0.0002 | 3.2s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0010 | 1.9s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0017 | 2.4s | 100% | |
| Grok 4 Fast | 100% | $0.0007 | 5.0s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0021 | 3.5s | 100% | |
| GPT-4.1 Mini | 100% | $0.0012 | 6.5s | 100% | |
| Mistral Medium 3.1 | 100% | $0.0014 | 6.9s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0017 | 7.6s | 100% | |
| Grok 4.1 Fast | 100% | $0.0007 | 10.2s | 100% | |
| Stealth: Hunter Alpha | 100% | $0.0000 | 12.6s | 100% | |
| Claude Haiku 4.5 | 100% | $0.0041 | 3.1s | 100% | |
| Stealth: Healer Alpha | 99% | $0.0000 | 13.9s | 99% | |
| GPT-4.1 | 100% | $0.0059 | 4.6s | 100% | |
| Llama 3.1 Nemotron 70B | 100% | $0.0015 | 16.8s | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | $0.0074 | 2.7s | 100% | |
| GPT-4o, Aug. 6th (temp=1) | 100% | $0.0074 | 3.5s | 99% | |
| Inception Mercury 2 | 97% | $0.0014 | 2.0s | 95% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.0065 | 10.3s | 100% | |
| MiniMax M2.7 | 100% | $0.0026 | 22.1s | 100% | |
| Hermes 3 405B | 100% | $0.0013 | 25.6s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Dialogue content preserved | ||
| 85.7% | Mara pronouns preserved (coreference test) | ||
| 99.0% | Name replacement accuracy | ||
| 100.0% | Non-name text preserved |
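The 85.7% median on the Mara-pronoun row is the coreference trap: after a gender swap, pronouns referring to the unchanged character must stay feminine. A minimal sketch of such a check, assuming sentence-level attribution (a deliberate simplification; the suite's actual evaluator is not shown here):

```python
import re

def mara_pronouns_preserved(output: str) -> bool:
    # Hypothetical check: in any sentence that mentions Mara, no masculine
    # pronoun may appear. Attributing pronouns at sentence level is a
    # simplification of real coreference resolution.
    for sentence in re.split(r"(?<=[.!?])\s+", output):
        if "Mara" in sentence and re.search(r"\b(he|him|his)\b", sentence, re.IGNORECASE):
            return False
    return True
```

A sentence like "He handed Mara her coat" would false-positive under this sketch, which is exactly the ambiguity that drags the median below 100%.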
Combined: 3rd person past → 1st person present
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 99% | $0.0003 | 1.8s | |
| Grok 4 Fast | 98% | $0.0006 | 5.0s | |
| GPT-5.4 Nano | 99% | $0.0007 | 3.2s | |
| Gemini 2.5 Flash | 94% | $0.0015 | 2.1s | |
| GPT-4.1 Nano | 98% | $0.0003 | 3.6s | |
| Stealth: Hunter Alpha | 99% | $0.0000 | 22.8s | |
| GPT-5.4 Nano (Reasoning) | 97% | $0.0007 | 3.1s | |
| Gemini 3.1 Flash Lite (Preview) | 98% | $0.0009 | 1.7s | |
| Stealth: Healer Alpha | 97% | $0.0000 | 11.0s | |
| Gemma 3 12B | 94% | $0.0001 | 9.3s | |
| Mistral Small 4 | 85% | $0.0004 | 4.8s | |
| Mistral Small 3.2 24B | 98% | $0.0002 | 4.4s | |
| GPT-4.1 Mini | 99% | $0.0010 | 12.0s | |
| Qwen 2.5 72B | 98% | $0.0003 | 9.5s | |
| Gemma 3 27B | 98% | $0.0002 | 13.1s | |
| Hermes 3 70B | 99% | $0.0003 | 13.7s | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0014 | 6.7s | |
| LFM2 24B | 97% | $0.0001 | 11.4s | |
| Gemini 3 Flash (Preview) | 98% | $0.0018 | 3.1s | |
| Mistral Large 3 | 98% | $0.0010 | 7.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Z.AI GLM 5 Turbo | 100% | 99% | 99% | |
| Claude Opus 4.6 (Reasoning) | 99% | 100% | 99% | |
| Claude Opus 4.6 | 99% | 100% | 99% | |
| GPT-5.4 Mini (Reasoning) | 99% | 100% | 99% | |
| Claude Sonnet 4 | 99% | 100% | 99% | |
| Grok 4.20 (Beta, Reasoning) | 99% | 100% | 99% | |
| Aion 2.0 | 99% | 99% | 99% | |
| Gemini 2.5 Pro | 99% | 99% | 99% | |
| GPT-5.4 | 99% | 99% | 99% | |
| Gemini 3.1 Pro (Preview) | 99% | 100% | 99% | |
| Qwen 3.5 397B A17B | 99% | 100% | 99% | |
| Claude Sonnet 4.6 | 99% | 100% | 99% | |
| Claude Opus 4.5 | 99% | 100% | 99% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | 100% | 99% | |
| GPT-4o, May 13th (temp=0) | 99% | 100% | 99% | |
| Claude 3.5 Sonnet | 99% | 100% | 99% | |
| Claude 3.7 Sonnet | 99% | 100% | 99% | |
| GPT-4o, Aug. 6th (temp=0) | 99% | 100% | 99% | |
| DeepSeek V3.2 | 99% | 100% | 99% | |
| Claude Haiku 4.5 | 99% | 99% | 99% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 99% | $0.0003 | 1.8s | 98% | |
| GPT-5.4 Nano | 99% | $0.0007 | 3.2s | 98% | |
| Gemini 3.1 Flash Lite (Preview) | 98% | $0.0009 | 1.7s | 98% | |
| Mistral Small 3.2 24B | 98% | $0.0002 | 4.4s | 97% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0014 | 6.7s | 99% | |
| GPT-4.1 Nano | 98% | $0.0003 | 3.6s | 96% | |
| Grok 4 Fast | 98% | $0.0006 | 5.0s | 96% | |
| Qwen 2.5 72B | 98% | $0.0003 | 9.5s | 98% | |
| Gemini 3 Flash (Preview) | 98% | $0.0018 | 3.1s | 97% | |
| Claude Haiku 4.5 | 99% | $0.0033 | 2.8s | 99% | |
| Hermes 3 70B | 99% | $0.0003 | 13.7s | 98% | |
| Mistral Large 3 | 98% | $0.0010 | 7.1s | 98% | |
| GPT-4.1 Mini | 99% | $0.0010 | 12.0s | 98% | |
| GPT-5.4 Mini (Reasoning, Low) | 98% | $0.0027 | 3.9s | 96% | |
| Stealth: Healer Alpha | 97% | $0.0000 | 11.0s | 95% | |
| GPT-5.4 Mini (Reasoning) | 99% | $0.0050 | 6.2s | 99% | |
| Stealth: Hunter Alpha | 99% | $0.0000 | 22.8s | 98% | |
| Gemma 3 27B | 98% | $0.0002 | 13.1s | 95% | |
| GPT-5.4 Nano (Reasoning, Low) | 97% | $0.0007 | 2.8s | 92% | |
| GPT-5.4 Nano (Reasoning) | 97% | $0.0007 | 3.1s | 93% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 99.3% | Dialogue content preserved | ||
| 96.2% | Name replacement accuracy | ||
| 100.0% | Non-name text preserved |
Passive voice → active voice
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Grok 4.1 Fast | 96% | $0.0014 | 16.8s | |
| Qwen 3.5 Plus (2026-02-15) | 96% | $0.0021 | 10.9s | |
| Gemini 2.5 Flash | 95% | $0.0021 | 2.9s | |
| Gemini 3.1 Flash Lite (Preview) | 93% | $0.0013 | 2.4s | |
| Gemini 3 Flash (Preview) | 94% | $0.0026 | 4.3s | |
| Gemini 2.5 Flash Lite | 90% | $0.0004 | 2.5s | |
| Grok 4 Fast | 93% | $0.0013 | 11.1s | |
| Grok 4.20 (Beta) | 95% | $0.0047 | 2.6s | |
| Claude Haiku 4.5 | 95% | $0.0050 | 5.3s | |
| Stealth: Hunter Alpha | 87% | $0.0000 | 33.4s | |
| DeepSeek V3.1 | 89% | $0.0008 | 37.0s | |
| DeepSeek-V2 Chat | 93% | $0.0011 | 22.6s | |
| DeepSeek V3 (2024-12-26) | 93% | $0.0012 | 23.8s | |
| GPT-4.1 Mini | 90% | $0.0015 | 10.8s | |
| Mistral Large 3 | 91% | $0.0015 | 10.5s | |
| DeepSeek V3.2 | 96% | $0.0008 | 52.0s | |
| GPT-5.4 (Reasoning, Low) | 97% | $0.0130 | 8.3s | |
| GPT-4.1 | 89% | $0.0074 | 7.2s | |
| GPT-5.4 | 94% | $0.0130 | 8.5s | |
| GPT-5.4 Mini | 91% | $0.0039 | 3.8s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.5 | 97% | 99% | 97% | |
| Claude Opus 4.6 | 98% | 99% | 96% | |
| GPT-5 | 97% | 99% | 96% | |
| Claude Sonnet 4 | 97% | 99% | 96% | |
| GPT-5.4 (Reasoning) | 97% | 99% | 96% | |
| Qwen 3.5 27B | 97% | 99% | 96% | |
| Z.AI GLM 5 Turbo | 96% | 99% | 96% | |
| GPT-5.4 (Reasoning, Low) | 97% | 99% | 95% | |
| Qwen 3.5 397B A17B | 97% | 98% | 95% | |
| Gemini 3 Pro (Preview) | 96% | 99% | 95% | |
| Gemini 3.1 Pro (Preview) | 96% | 99% | 95% | |
| Z.AI GLM 4.7 | 96% | 99% | 95% | |
| Claude Sonnet 4.6 (Reasoning) | 96% | 99% | 95% | |
| Claude Opus 4.6 (Reasoning) | 96% | 98% | 94% | |
| Qwen 3.5 Plus (2026-02-15) | 96% | 98% | 94% | |
| Grok 4.1 Fast | 96% | 98% | 94% | |
| Grok 4.20 (Beta, Reasoning) | 96% | 97% | 94% | |
| DeepSeek V3.2 | 96% | 98% | 94% | |
| GPT-5.1 | 96% | 98% | 94% | |
| Aion 2.0 | 96% | 98% | 94% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Qwen 3.5 Plus (2026-02-15) | 96% | $0.0021 | 10.9s | 94% | |
| Gemini 2.5 Flash | 95% | $0.0021 | 2.9s | 92% | |
| Grok 4.1 Fast | 96% | $0.0014 | 16.8s | 94% | |
| Gemini 3 Flash (Preview) | 94% | $0.0026 | 4.3s | 92% | |
| GPT-5.4 (Reasoning, Low) | 97% | $0.0130 | 8.3s | 95% | |
| Gemini 3.1 Flash Lite (Preview) | 93% | $0.0013 | 2.4s | 91% | |
| Grok 4.20 (Beta) | 95% | $0.0047 | 2.6s | 91% | |
| Claude Sonnet 4 | 97% | $0.0150 | 9.5s | 96% | |
| Claude Haiku 4.5 | 95% | $0.0050 | 5.3s | 90% | |
| Claude Sonnet 4.5 | 96% | $0.0150 | 7.4s | 93% | |
| Claude Opus 4.5 | 97% | $0.0250 | 8.2s | 97% | |
| Claude Opus 4.6 | 98% | $0.0250 | 8.7s | 96% | |
| Mistral Large 3 | 91% | $0.0015 | 10.5s | 90% | |
| DeepSeek-V2 Chat | 93% | $0.0011 | 22.6s | 91% | |
| Claude Sonnet 4.6 | 94% | $0.0150 | 7.5s | 92% | |
| GPT-5.4 | 94% | $0.0130 | 8.5s | 91% | |
| GPT-5.4 Mini | 91% | $0.0039 | 3.8s | 88% | |
| Mistral Large 2 | 91% | $0.0061 | 10.7s | 90% | |
| Mistral Large | 91% | $0.0061 | 10.5s | 90% | |
| DeepSeek V3 (2024-12-26) | 93% | $0.0012 | 23.8s | 90% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 91.4% | Dialogue content preserved | ||
| 100.0% | No hallucinated or fabricated content | ||
| 87.5% | Non-passive narration preserved | ||
| 73.9% | Passive → active voice transformations | ||
| 100.0% | Structural similarity to original |
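The 73.9% median on the passive → active row reflects how fuzzy passive detection is compared with exact-string checks like renames. A rough heuristic sketch, assuming "be-verb + past participle" as the passive marker (an assumption that over-triggers on predicate adjectives such as "he is tired"):

```python
import re

# Assumed heuristic: a be-verb followed by a word ending in -ed/-en.
# Over-triggering on predicate adjectives is part of why this test
# scores lower than exact-match transformations.
BE_VERBS = r"(?:is|are|was|were|been|being)"

def looks_passive(sentence: str) -> bool:
    return re.search(rf"\b{BE_VERBS}\s+\w+(?:ed|en)\b", sentence) is not None
```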
Avoid said/asked/replied/answered
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | |
| Mistral Small Creative | 98% | $0.0002 | 3.0s | |
| Mistral Small 4 | 100% | $0.0004 | 2.7s | |
| Inception Mercury 2 | 100% | $0.0008 | 1.1s | |
| Stealth: Healer Alpha | 98% | $0.0000 | 5.3s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.8s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.7s | |
| Gemma 3 12B | 100% | $0.0001 | 8.8s | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0006 | 4.9s | |
| Grok 4 Fast | 100% | $0.0006 | 5.3s | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 8.5s | |
| Stealth: Hunter Alpha | 100% | $0.0000 | 39.4s | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 9.0s | |
| Inception Mercury | 92% | $0.0004 | 4.0s | |
| Qwen 2.5 72B | 98% | $0.0003 | 9.8s | |
| Mistral Medium 3.1 | 100% | $0.0012 | 4.7s | |
| GPT-4.1 Mini | 100% | $0.0010 | 6.2s | |
| Mistral Large 3 | 100% | $0.0010 | 7.2s | |
| Gemma 3 27B | 100% | $0.0002 | 14.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | 100% | |
| Inception Mercury 2 | 100% | $0.0008 | 1.1s | 100% | |
| Mistral Small 4 | 100% | $0.0004 | 2.7s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.7s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.8s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0006 | 4.9s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | 100% | |
| Grok 4 Fast | 100% | $0.0006 | 5.3s | 100% | |
| Mistral Medium 3.1 | 100% | $0.0012 | 4.7s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 8.8s | 100% | |
| GPT-4.1 Mini | 100% | $0.0010 | 6.2s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 8.5s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.8s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 9.0s | 100% | |
| Mistral Large 3 | 100% | $0.0010 | 7.2s | 100% | |
| GPT-5.4 Mini | 100% | $0.0027 | 1.8s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 7.2s | 100% | |
| Grok 4.20 (Beta) | 100% | $0.0032 | 1.7s | 100% | |
| Claude Haiku 4.5 | 100% | $0.0033 | 2.7s | 100% | |
| Gemma 3 27B | 100% | $0.0002 | 14.1s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Forbidden words eliminated | ||
| 100.0% | Non-name text preserved | ||
| 100.0% | Structural similarity to original |
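The near-universal 100% scores here match how mechanical the check is: a whole-word scan for the four forbidden dialogue tags. A sketch of that check (word boundaries keep words like "unanswered" from tripping it):

```python
import re

FORBIDDEN = ("said", "asked", "replied", "answered")

def forbidden_words_eliminated(text: str) -> bool:
    # Whole-word, case-insensitive scan; "unanswered" contains "answered"
    # but has no word boundary before it, so it does not match.
    pattern = r"\b(?:" + "|".join(FORBIDDEN) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is None
```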
Specific Prompt
Character rename: Elena → Mirabel, Gregor → Aldric
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 3B | 100% | $0.0000 | 2.0s | |
| Mistral NeMO | 100% | $0.0002 | 1.6s | |
| Ministral 3 3B | 100% | $0.0001 | 2.3s | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | |
| Ministral 8B | 100% | $0.0001 | 3.4s | |
| Ministral 3 8B | 100% | $0.0002 | 3.2s | |
| Mistral Small Creative | 100% | $0.0002 | 3.0s | |
| Gemma 3 4B | 100% | $0.0001 | 5.2s | |
| Llama 3.1 8B | 100% | $0.0000 | 9.0s | |
| GPT-4.1 Nano | 100% | $0.0003 | 3.6s | |
| Mistral Small 4 | 100% | $0.0004 | 2.9s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.6s | |
| Ministral 3 14B | 95% | $0.0002 | 4.3s | |
| Inception Mercury | 100% | $0.0004 | 4.1s | |
| Llama 3.1 70B | 100% | $0.0004 | 11.0s | |
| Gemma 3 12B | 100% | $0.0001 | 8.4s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0007 | 2.9s | |
| GPT-5.4 Nano | 100% | $0.0007 | 2.9s | |
| Grok 4 Fast | 100% | $0.0005 | 4.3s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0007 | 2.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Mistral NeMO | 100% | $0.0002 | 1.6s | 100% | |
| Ministral 3B | 100% | $0.0000 | 2.0s | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | 100% | |
| Ministral 3 3B | 100% | $0.0001 | 2.3s | 100% | |
| Mistral Small Creative | 100% | $0.0002 | 3.0s | 100% | |
| Ministral 3 8B | 100% | $0.0002 | 3.2s | 100% | |
| Ministral 8B | 100% | $0.0001 | 3.4s | 100% | |
| Mistral Small 4 | 100% | $0.0004 | 2.9s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.8s | 100% | |
| GPT-4.1 Nano | 100% | $0.0003 | 3.6s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0007 | 2.7s | 100% | |
| GPT-5.4 Nano | 100% | $0.0007 | 2.9s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0007 | 2.9s | 100% | |
| Inception Mercury | 100% | $0.0004 | 4.1s | 100% | |
| Inception Mercury 2 | 100% | $0.0013 | 1.5s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.6s | 100% | |
| Gemma 3 4B | 100% | $0.0001 | 5.2s | 100% | |
| Grok 4 Fast | 100% | $0.0005 | 4.3s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 8.4s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Name replacement accuracy | ||
| 100.0% | No remaining old names | ||
| 100.0% | Non-name text preserved |
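Renames are the easiest case because the transformation itself is deterministic. A minimal sketch, assuming word-boundary replacement, which also carries possessives over ("Elena's" → "Mirabel's") without touching non-name text:

```python
import re

RENAMES = {"Elena": "Mirabel", "Gregor": "Aldric"}

def rename(text: str) -> str:
    # Word boundaries keep substrings safe; the apostrophe in "Elena's"
    # is a non-word character, so the possessive form is renamed too.
    for old, new in RENAMES.items():
        text = re.sub(rf"\b{old}\b", new, text)
    return text
```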
Location rename: market square, outer ring, bridge, northern mines
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 3B | 100% | $0.0000 | 1.9s | |
| Ministral 3 3B | 100% | $0.0001 | 1.9s | |
| Ministral 8B | 100% | $0.0001 | 3.2s | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.5s | |
| Ministral 3 8B | 100% | $0.0002 | 3.1s | |
| Mistral Small Creative | 100% | $0.0002 | 3.0s | |
| Ministral 3 14B | 100% | $0.0002 | 3.8s | |
| Mistral Small 4 | 100% | $0.0004 | 2.8s | |
| Gemma 3 4B | 100% | $0.0001 | 5.8s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.5s | |
| Llama 3.1 8B | 100% | $0.0001 | 9.7s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 12.4s | |
| Gemma 3 12B | 100% | $0.0001 | 8.5s | |
| Grok 4 Fast | 100% | $0.0006 | 3.6s | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0007 | 2.8s | |
| GPT-5.4 Nano | 100% | $0.0007 | 3.0s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0009 | 4.3s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.6s | |
| Inception Mercury | 100% | $0.0004 | 5.5s | |
| Grok 4.1 Fast | 100% | $0.0005 | 5.3s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Ministral 3B | 100% | $0.0000 | 1.9s | 100% | |
| Ministral 3 3B | 100% | $0.0001 | 1.9s | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.5s | 100% | |
| Ministral 8B | 100% | $0.0001 | 3.2s | 100% | |
| Ministral 3 8B | 100% | $0.0002 | 3.1s | 100% | |
| Mistral Small Creative | 100% | $0.0002 | 3.0s | 100% | |
| Mistral Small 4 | 100% | $0.0004 | 2.8s | 100% | |
| Ministral 3 14B | 100% | $0.0002 | 3.8s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.5s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.6s | 100% | |
| Gemma 3 4B | 100% | $0.0001 | 5.8s | 100% | |
| GPT-5.4 Nano (Reasoning, Low) | 100% | $0.0007 | 2.8s | 100% | |
| Grok 4 Fast | 100% | $0.0006 | 3.6s | 100% | |
| GPT-5.4 Nano | 100% | $0.0007 | 3.0s | 100% | |
| Inception Mercury | 100% | $0.0004 | 5.5s | 100% | |
| Grok 4.1 Fast | 100% | $0.0005 | 5.3s | 100% | |
| Claude 3 Haiku | 100% | $0.0009 | 4.0s | 100% | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0009 | 4.3s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.0s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 8.5s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Name replacement accuracy | ||
| 100.0% | No remaining old names | ||
| 100.0% | Non-name text preserved |
Expand all contractions
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 8B | 100% | $0.0001 | 2.4s | |
| Ministral 3 8B | 100% | $0.0001 | 3.0s | |
| Gemini 2.5 Flash Lite | 100% | $0.0002 | 1.3s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 3.5s | |
| Mistral Small 4 | 100% | $0.0003 | 2.6s | |
| Gemma 3 12B | 100% | $0.0001 | 5.9s | |
| Mistral NeMO | 98% | $0.0001 | 1.8s | |
| LFM2 24B | 100% | $0.0001 | 8.1s | |
| Qwen 2.5 72B | 100% | $0.0002 | 8.5s | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 7.1s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0008 | 1.6s | |
| Llama 3.1 8B | 99% | $0.0000 | 7.8s | |
| Arcee AI: Trinity Mini | 99% | $0.0002 | 4.6s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 7.6s | |
| GPT-4.1 Nano | 99% | $0.0002 | 3.1s | |
| Qwen3 235B A22B Instruct 2507 | 100% | $0.0003 | 9.9s | |
| Claude 3 Haiku | 100% | $0.0007 | 3.7s | |
| Stealth: Hunter Alpha | 100% | $0.0000 | 14.6s | |
| Mistral Small Creative | 98% | $0.0002 | 2.4s | |
| GPT-4.1 Mini | 100% | $0.0008 | 4.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Ministral 8B | 100% | $0.0001 | 2.4s | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0002 | 1.3s | 100% | |
| Ministral 3 8B | 100% | $0.0001 | 3.0s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 3.5s | 100% | |
| Mistral Small 4 | 100% | $0.0003 | 2.6s | 100% | |
| LFM2 24B | 100% | $0.0001 | 8.1s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 5.9s | 99% | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 7.1s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 7.6s | 100% | |
| Qwen 2.5 72B | 100% | $0.0002 | 8.5s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0008 | 1.6s | 99% | |
| Qwen3 235B A22B Instruct 2507 | 100% | $0.0003 | 9.9s | 100% | |
| GPT-4.1 Mini | 100% | $0.0008 | 4.7s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0012 | 1.7s | 100% | |
| Mistral Medium 3.1 | 100% | $0.0010 | 4.3s | 100% | |
| Claude 3 Haiku | 100% | $0.0007 | 3.7s | 99% | |
| Mistral Large 3 | 100% | $0.0009 | 5.9s | 100% | |
| Arcee AI: Trinity Mini | 99% | $0.0002 | 4.6s | 99% | |
| Llama 3.1 70B | 100% | $0.0005 | 7.8s | 99% | |
| GPT-4.1 Nano | 99% | $0.0002 | 3.1s | 98% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Name replacement accuracy | ||
| 100.0% | Non-name text preserved | ||
| 100.0% | Possessive traps preserved |
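The "possessive traps preserved" evaluator is what separates this test from a blind apostrophe rewrite: "Mara's lamp" must survive while "it's" expands. A sketch assuming a fixed lookup table (the table below is illustrative, not the suite's actual list):

```python
import re

# Illustrative table; apostrophe words not in it (possessives) pass through.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "won't": "will not"}

def expand(text: str) -> str:
    def repl(match: re.Match) -> str:
        word = match.group(0)
        expanded = CONTRACTIONS.get(word.lower())
        if expanded is None:
            return word  # possessive trap: leave "Mara's" untouched
        if word[0].isupper():  # preserve a leading capital ("Don't" -> "Do not")
            return expanded[0].upper() + expanded[1:]
        return expanded
    return re.sub(r"\b\w+'\w+\b", repl, text)
```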
Tense rewriting: past to present
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Qwen 3.5 122B | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| Qwen 3.5 27B | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| Claude Opus 4.5 | 100% | |
| Aion 2.0 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
| Z.AI GLM 4.7 | 100% | |
| GPT-4.1 | 100% | |
| Gemini 2.5 Pro | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 3B | 99% | $0.0000 | 1.9s | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | |
| Mistral Small Creative | 100% | $0.0002 | 2.9s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.7s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.6s | |
| Llama 3.1 8B | 100% | $0.0001 | 10.6s | |
| Gemini 2.5 Flash | 100% | $0.0015 | 2.1s | |
| Claude 3 Haiku | 100% | $0.0009 | 4.5s | |
| Mistral Medium 3.1 | 100% | $0.0013 | 5.3s | |
| Gemma 3 12B | 100% | $0.0001 | 8.4s | |
| Mistral Small 4 | 99% | $0.0004 | 5.1s | |
| Ministral 3 14B | 99% | $0.0002 | 3.9s | |
| GPT-5.4 Mini | 100% | $0.0027 | 2.3s | |
| GPT-4.1 Mini | 100% | $0.0011 | 6.5s | |
| Grok 4.20 (Beta) | 100% | $0.0032 | 1.6s | |
| Mistral Large 3 | 100% | $0.0011 | 7.8s | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 6.4s | |
| Mistral NeMO | 99% | $0.0002 | 2.7s | |
| Qwen 2.5 72B | 100% | $0.0003 | 10.7s | |
| Claude Haiku 4.5 | 100% | $0.0034 | 2.8s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| Gemini 2.5 Pro | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | 100% | |
| Mistral Small Creative | 100% | $0.0002 | 2.9s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.6s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.7s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0015 | 2.1s | 100% | |
| Claude 3 Haiku | 100% | $0.0009 | 4.5s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 8.4s | 100% | |
| Ministral 3 14B | 99% | $0.0002 | 3.9s | 99% | |
| Mistral Medium 3.1 | 100% | $0.0013 | 5.3s | 100% | |
| GPT-4.1 Mini | 100% | $0.0011 | 6.5s | 100% | |
| Llama 3.1 8B | 100% | $0.0001 | 10.6s | 100% | |
| Mistral Large 3 | 100% | $0.0011 | 7.8s | 100% | |
| Qwen 2.5 72B | 100% | $0.0003 | 10.7s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 6.4s | 100% | |
| Mistral NeMO | 99% | $0.0002 | 2.7s | 99% | |
| GPT-5.4 Mini | 100% | $0.0027 | 2.3s | 100% | |
| Mistral Small 4 | 99% | $0.0004 | 5.1s | 98% | |
| Grok 4.20 (Beta) | 100% | $0.0032 | 1.6s | 100% | |
| Ministral 3B | 99% | $0.0000 | 1.9s | 97% | |
| Ministral 3 8B | 99% | $0.0002 | 3.5s | 98% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Dialogue content preserved | ||
| 100.0% | Name replacement accuracy | ||
| 100.0% | Non-name text preserved |
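Scoring each expected change independently, rather than pass/fail on the whole passage, is why medians sit at 100% here while fuzzier suites dip. A sketch of that scoring, with a hypothetical expected-change list:

```python
import re

# Hypothetical expectations for one sample: each (past, present) pair is
# scored on its own, so a single missed verb costs one check, not the sample.
EXPECTED = [("walked", "walks"), ("opened", "opens"), ("was", "is")]

def independent_score(output: str) -> float:
    hits = 0
    for past, present in EXPECTED:
        gone = re.search(rf"\b{past}\b", output) is None
        there = re.search(rf"\b{present}\b", output) is not None
        hits += gone and there
    return hits / len(EXPECTED)
```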
POV shift: 3rd person to 1st person (Elena's perspective)
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 3B | 99% | $0.0000 | 2.2s | |
| Mistral NeMO | 100% | $0.0002 | 2.0s | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | |
| Ministral 3 3B | 100% | $0.0001 | 2.4s | |
| Ministral 8B | 100% | $0.0001 | 3.5s | |
| Mistral Small Creative | 100% | $0.0002 | 3.2s | |
| Ministral 3 8B | 100% | $0.0002 | 3.0s | |
| Mistral Small 4 | 100% | $0.0004 | 3.3s | |
| Ministral 3 14B | 100% | $0.0002 | 3.8s | |
| GPT-4.1 Nano | 100% | $0.0003 | 3.5s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.8s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.8s | |
| Gemma 3 4B | 100% | $0.0001 | 6.4s | |
| GPT-5.4 Nano (Reasoning) | 100% | $0.0009 | 3.5s | |
| Inception Mercury | 100% | $0.0005 | 3.9s | |
| Gemini 2.5 Flash | 100% | $0.0015 | 2.1s | |
| Inception Mercury 2 | 100% | $0.0018 | 2.3s | |
| Llama 3.1 8B | 100% | $0.0000 | 8.6s | |
| Mistral Medium 3.1 | 100% | $0.0013 | 6.2s | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 8.0s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | 100% | |
| Mistral NeMO | 100% | $0.0002 | 2.0s | 100% | |
| Ministral 3 8B | 100% | $0.0002 | 3.0s | 100% | |
| Mistral Small Creative | 100% | $0.0002 | 3.2s | 100% | |
| Ministral 8B | 100% | $0.0001 | 3.5s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.8s | 100% | |
| Mistral Small 4 | 100% | $0.0004 | 3.3s | 100% | |
| Ministral 3 14B | 100% | $0.0002 | 3.8s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.8s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0015 | 2.1s | 100% | |
| Ministral 3 3B | 100% | $0.0001 | 2.4s | 99% | |
| Inception Mercury 2 | 100% | $0.0018 | 2.3s | 100% | |
| Gemma 3 4B | 100% | $0.0001 | 6.4s | 100% | |
| Inception Mercury | 100% | $0.0005 | 3.9s | 99% | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.1s | 100% | |
| GPT-5.4 Mini | 100% | $0.0027 | 1.9s | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 8.6s | 100% | |
| GPT-4.1 Nano | 100% | $0.0003 | 3.5s | 98% | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 8.0s | 100% | |
| Grok 4 Fast | 100% | $0.0009 | 6.8s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Name replacement accuracy | ||
| 100.0% | No remaining old names | ||
| 100.0% | Non-name text preserved |
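A naive version of the POV shift itself is pronoun substitution over narration; what the evaluators above actually verify is that names, pronouns, and non-name text all survive it. A sketch under those assumptions (capitalized forms and the object-vs-possessive ambiguity of "her" are deliberately ignored):

```python
import re

# Naive narration-only substitution; a real rewrite must also fix verb
# agreement and leave quoted dialogue untouched.
SUBS = [(r"\bElena\b", "I"), (r"\bshe\b", "I"), (r"\bher\b", "my")]

def shift_pov(narration: str) -> str:
    for pattern, replacement in SUBS:
        narration = re.sub(pattern, replacement, narration)
    return narration
```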
Multi-character gender swap: Priya (F) → Rohan (M), Mara unchanged
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 3B | 99% | $0.0001 | 2.5s | |
| Ministral 3 3B | 100% | $0.0001 | 2.2s | |
| Mistral NeMO | 100% | $0.0002 | 3.6s | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.9s | |
| Ministral 3 8B | 100% | $0.0002 | 3.9s | |
| Ministral 8B | 100% | $0.0001 | 3.8s | |
| Mistral Small Creative | 100% | $0.0003 | 3.2s | |
| Ministral 3 14B | 100% | $0.0003 | 4.6s | |
| GPT-4.1 Nano | 95% | $0.0003 | 3.9s | |
| Mistral Small 4 | 100% | $0.0005 | 3.3s | |
| Mistral Small 3.2 24B | 86% | $0.0003 | 4.9s | |
| Gemma 3 4B | 100% | $0.0001 | 6.6s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 1.9s | |
| GPT-5.4 Nano | 100% | $0.0009 | 3.3s | |
| Llama 3.1 8B | 100% | $0.0000 | 13.0s | |
| Qwen 2.5 72B | 100% | $0.0003 | 11.4s | |
| Claude 3 Haiku | 100% | $0.0011 | 5.2s | |
| Stealth: Hunter Alpha | 99% | $0.0000 | 22.2s | |
| GPT-5.4 Nano (Reasoning, Low) | 98% | $0.0009 | 13.4s | |
| Stealth: Healer Alpha | 100% | $0.0000 | 15.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning, Low) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
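The stability ranking above is described as median × consistency. A minimal sketch of one way to combine per-run scores under that rule, assuming consistency is taken as 1 minus the run-to-run score spread (an assumption; the benchmark's exact definition isn't stated here):

```python
from statistics import median

def stability(run_scores):
    """Combine per-run scores (each in [0, 1]) into one stability figure.

    Sketch of "median x consistency": the median rewards typical
    performance, while consistency (here, 1 minus the score spread,
    a hypothetical definition) penalizes run-to-run variance.
    """
    med = median(run_scores)
    consistency = 1.0 - (max(run_scores) - min(run_scores))
    return med * consistency

# A model that scores 100% on every run keeps full stability.
print(stability([1.0, 1.0, 1.0]))  # 1.0
# One 50% outlier halves the consistency term.
print(stability([1.0, 0.5, 1.0]))  # 0.5
```

Under this definition a single bad run hurts the stability column even when the median is unaffected, which matches how the table separates the Score and Stability figures.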
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Ministral 3 3B | 100% | $0.0001 | 2.2s | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.9s | 100% | |
| Mistral Small Creative | 100% | $0.0003 | 3.2s | 100% | |
| Ministral 8B | 100% | $0.0001 | 3.8s | 100% | |
| Mistral NeMO | 100% | $0.0002 | 3.6s | 100% | |
| Ministral 3 8B | 100% | $0.0002 | 3.9s | 100% | |
| Mistral Small 4 | 100% | $0.0005 | 3.3s | 100% | |
| Ministral 3 14B | 100% | $0.0003 | 4.6s | 100% | |
| Gemma 3 4B | 100% | $0.0001 | 6.6s | 100% | |
| GPT-5.4 Nano | 100% | $0.0009 | 3.3s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0017 | 2.4s | 100% | |
| Claude 3 Haiku | 100% | $0.0011 | 5.2s | 100% | |
| Grok 4 Fast | 100% | $0.0010 | 7.0s | 100% | |
| Qwen 2.5 72B | 100% | $0.0003 | 11.4s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0011 | 1.9s | 98% | |
| Gemini 3 Flash (Preview) | 100% | $0.0022 | 3.6s | 100% | |
| GPT-4.1 Mini | 100% | $0.0012 | 7.8s | 99% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 15.5s | 100% | |
| Grok 4.20 (Beta) | 100% | $0.0038 | 1.9s | 100% | |
| Gemma 3 27B | 100% | $0.0003 | 18.4s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Dialogue content preserved | ||
| 100.0% | Mara pronouns preserved (coreference test) | ||
| 100.0% | Name replacement accuracy | ||
| 100.0% | Non-name text preserved |
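The test description states each expected change is scored independently. A hedged sketch of that scheme, using hypothetical substring predicates modeled on the evaluator rows above (the real evaluators are presumably more robust than simple substring checks):

```python
def score_output(text, expected_changes):
    """Score a rewrite as the fraction of independent checks that pass.

    `expected_changes` maps an evaluator label to a predicate over the
    output text. Each check passes or fails on its own, so one miss
    only costs its share of the total.
    """
    passed = sum(1 for check in expected_changes.values() if check(text))
    return passed / len(expected_changes)

# Illustrative checks for the Priya -> Rohan gender swap above.
checks = {
    "Name replacement accuracy": lambda t: "Rohan" in t,
    "No remaining old names":    lambda t: "Priya" not in t,
    "Mara pronouns preserved":   lambda t: "Mara" in t,
}
print(score_output("Rohan waved as Mara left.", checks))  # 1.0
```

Scoring each change independently is what lets the per-evaluator medians above differ within a single test variant.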
Combined: 3rd person past → 1st person present
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5.4 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| Claude Opus 4.6 | 100% | |
| Claude Opus 4.5 | 100% | |
| GPT-4.1 | 100% | |
| Claude Opus 4 | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | |
| Gemini 2.5 Flash | 100% | |
| GPT-5 | 100% | |
| Aion 2.0 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Claude Sonnet 4 | 100% | |
| Z.AI GLM 4.5 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| GPT-5.4 Mini (Reasoning) | 100% | |
| Gemini 2.5 Pro | 100% | |
| Grok 4 | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 99% | $0.0003 | 1.6s | |
| Gemini 2.5 Flash | 100% | $0.0015 | 2.0s | |
| Gemini 3.1 Flash Lite (Preview) | 99% | $0.0009 | 1.8s | |
| Qwen3 235B A22B Instruct 2507 | 99% | $0.0003 | 8.6s | |
| Mistral NeMO | 98% | $0.0002 | 2.6s | |
| Ministral 8B | 97% | $0.0001 | 2.9s | |
| Mistral Small Creative | 99% | $0.0002 | 2.8s | |
| Ministral 3 8B | 99% | $0.0002 | 3.1s | |
| Ministral 3 14B | 99% | $0.0003 | 3.8s | |
| Mistral Small 3.2 24B | 99% | $0.0002 | 4.3s | |
| Mistral Medium 3.1 | 99% | $0.0013 | 7.1s | |
| Mistral Small 4 | 97% | $0.0004 | 2.8s | |
| GPT-4.1 Mini | 99% | $0.0011 | 6.4s | |
| Stealth: Hunter Alpha | 99% | $0.0000 | 13.6s | |
| Llama 3.1 8B | 98% | $0.0001 | 9.4s | |
| Mistral Large 3 | 99% | $0.0011 | 7.8s | |
| Z.AI GLM 4.5 | 100% | $0.0030 | 25.1s | |
| Grok 4 Fast | 99% | $0.0013 | 10.2s | |
| Qwen 2.5 72B | 99% | $0.0003 | 10.1s | |
| Stealth: Healer Alpha | 99% | $0.0000 | 21.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| Claude Opus 4 | 100% | 100% | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | 100% | 100% | |
| Gemini 2.5 Flash | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.5 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 99% | 99% | |
| GPT-5.4 Mini (Reasoning) | 100% | 99% | 99% | |
| Gemini 2.5 Pro | 100% | 99% | 99% | |
| Qwen 3.5 27B | 100% | 99% | 99% | |
| Z.AI GLM 4.7 | 100% | 99% | 99% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash | 100% | $0.0015 | 2.0s | 100% | |
| Gemini 2.5 Flash Lite | 99% | $0.0003 | 1.6s | 99% | |
| Gemini 3.1 Flash Lite (Preview) | 99% | $0.0009 | 1.8s | 99% | |
| Qwen3 235B A22B Instruct 2507 | 99% | $0.0003 | 8.6s | 99% | |
| Mistral Small Creative | 99% | $0.0002 | 2.8s | 99% | |
| Ministral 3 8B | 99% | $0.0002 | 3.1s | 99% | |
| Ministral 3 14B | 99% | $0.0003 | 3.8s | 99% | |
| Mistral Small 3.2 24B | 99% | $0.0002 | 4.3s | 99% | |
| Mistral Large 3 | 99% | $0.0011 | 7.8s | 99% | |
| Stealth: Hunter Alpha | 99% | $0.0000 | 13.6s | 99% | |
| Mistral Medium 3.1 | 99% | $0.0013 | 7.1s | 99% | |
| Qwen 2.5 72B | 99% | $0.0003 | 10.1s | 99% | |
| DeepSeek-V2 Chat | 99% | $0.0008 | 14.6s | 99% | |
| GPT-4.1 | 100% | $0.0055 | 3.8s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0015 | 6.6s | 98% | |
| Claude Haiku 4.5 | 99% | $0.0036 | 3.1s | 98% | |
| Gemini 3 Flash (Preview) | 98% | $0.0019 | 3.1s | 98% | |
| Gemma 3 12B | 98% | $0.0001 | 10.0s | 98% | |
| Grok 4.1 Fast | 100% | $0.0015 | 16.8s | 99% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | $0.0069 | 2.5s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Dialogue content preserved | ||
| 98.1% | Name replacement accuracy | ||
| 100.0% | Non-name text preserved |
Passive voice → active voice
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 3 Flash (Preview) | 99% | $0.0027 | 4.4s | |
| Grok 4 Fast | 98% | $0.0015 | 12.2s | |
| Grok 4.20 (Beta) | 98% | $0.0048 | 2.4s | |
| Gemini 2.5 Flash | 97% | $0.0021 | 3.1s | |
| Stealth: Healer Alpha | 98% | $0.0000 | 31.8s | |
| Qwen 3.5 Plus (2026-02-15) | 97% | $0.0022 | 9.5s | |
| Gemini 3.1 Flash Lite (Preview) | 95% | $0.0014 | 2.4s | |
| Grok 4.1 Fast | 98% | $0.0020 | 28.2s | |
| Mistral Large 3 | 96% | $0.0016 | 10.4s | |
| Gemini 2.5 Flash Lite (Reasoning) | 95% | $0.0031 | 25.5s | |
| Claude Haiku 4.5 | 96% | $0.0052 | 6.2s | |
| DeepSeek V3.2 | 99% | $0.0006 | 54.4s | |
| Claude Sonnet 4.5 | 99% | $0.016 | 7.0s | |
| Stealth: Hunter Alpha | 95% | $0.0000 | 26.3s | |
| Gemini 2.5 Flash Lite | 94% | $0.0004 | 2.4s | |
| DeepSeek V3 (2024-12-26) | 95% | $0.0012 | 22.5s | |
| Gemma 3 12B | 94% | $0.0001 | 12.1s | |
| DeepSeek V3.1 | 90% | $0.0009 | 33.9s | |
| Hermes 3 405B | 94% | $0.0018 | 27.1s | |
| Mistral Medium 3.1 | 94% | $0.0019 | 6.3s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 99% | 100% | 99% | |
| Claude Opus 4.6 | 98% | 100% | 98% | |
| Grok 4.20 (Beta, Reasoning) | 99% | 99% | 98% | |
| Grok 4 | 99% | 99% | 98% | |
| Claude Sonnet 4.6 | 98% | 99% | 98% | |
| Gemini 2.5 Pro | 99% | 98% | 97% | |
| Claude Sonnet 4.5 | 99% | 98% | 97% | |
| DeepSeek V3.2 | 99% | 98% | 97% | |
| Gemini 3 Flash (Preview) | 99% | 99% | 97% | |
| Z.AI GLM 5 Turbo | 98% | 99% | 97% | |
| Claude Opus 4.6 (Reasoning) | 99% | 98% | 97% | |
| Claude Sonnet 4.6 (Reasoning) | 99% | 98% | 97% | |
| Grok 4.20 (Beta) | 98% | 99% | 97% | |
| Z.AI GLM 5 | 98% | 98% | 97% | |
| ByteDance Seed 1.6 | 97% | 100% | 97% | |
| GPT-5 | 98% | 97% | 97% | |
| Grok 4 Fast | 98% | 98% | 97% | |
| Grok 4.1 Fast | 98% | 98% | 97% | |
| o4 Mini | 98% | 98% | 96% | |
| GPT-5.1 | 98% | 98% | 96% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3 Flash (Preview) | 99% | $0.0027 | 4.4s | 97% | |
| Grok 4 Fast | 98% | $0.0015 | 12.2s | 97% | |
| Grok 4.20 (Beta) | 98% | $0.0048 | 2.4s | 97% | |
| Gemini 2.5 Flash | 97% | $0.0021 | 3.1s | 95% | |
| Qwen 3.5 Plus (2026-02-15) | 97% | $0.0022 | 9.5s | 95% | |
| Grok 4.1 Fast | 98% | $0.0020 | 28.2s | 97% | |
| Claude Sonnet 4.5 | 99% | $0.016 | 7.0s | 97% | |
| Mistral Large 3 | 96% | $0.0016 | 10.4s | 95% | |
| Stealth: Healer Alpha | 98% | $0.0000 | 31.8s | 95% | |
| Claude Sonnet 4.6 | 98% | $0.016 | 7.4s | 98% | |
| DeepSeek V3.2 | 99% | $0.0006 | 54.4s | 97% | |
| Gemini 3.1 Flash Lite (Preview) | 95% | $0.0014 | 2.4s | 92% | |
| Mistral Large 2 | 96% | $0.0064 | 10.4s | 95% | |
| Claude Sonnet 4 | 98% | $0.016 | 9.2s | 96% | |
| Mistral Large | 95% | $0.0064 | 10.4s | 94% | |
| Claude Haiku 4.5 | 96% | $0.0052 | 6.2s | 92% | |
| Gemini 2.5 Flash Lite | 94% | $0.0004 | 2.4s | 92% | |
| Stealth: Hunter Alpha | 95% | $0.0000 | 26.3s | 94% | |
| Mistral Medium 3.1 | 94% | $0.0019 | 6.3s | 93% | |
| GPT-5.4 | 97% | $0.013 | 8.2s | 95% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 94.3% | Dialogue content preserved | ||
| 100.0% | No hallucinated or fabricated content | ||
| 92.9% | Non-passive narration preserved | ||
| 84.9% | Passive → active voice transformations | ||
| 100.0% | Structural similarity to original |
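The passive → active transformation is the weakest evaluator row above (84.9% median). A naive heuristic hints at why it is also hard to check mechanically; this regex sketch flags "to be" plus a regular past participle, and it both misses irregular participles and can flag adjectival uses, so it is illustrative only:

```python
import re

# Naive passive-voice detector: an auxiliary "to be" form followed by
# a word ending in -ed or -en. Rough heuristic, not a real parser.
PASSIVE = re.compile(
    r"\b(?:was|were|is|are|been|being)\s+\w+(?:ed|en)\b",
    re.IGNORECASE,
)

def passive_hits(text):
    """Return the candidate passive constructions found in `text`."""
    return PASSIVE.findall(text)

print(passive_hits("The door was opened by Mara."))  # ['was opened']
print(passive_hits("Mara opened the door."))         # []
```

A production evaluator would need real part-of-speech tagging; the gap between this heuristic and genuine voice analysis mirrors the gap between this row's 84.9% and the near-perfect scores elsewhere.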
Avoid said/asked/replied/answered
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.5s | |
| Mistral Small 4 | 99% | $0.0004 | 3.1s | |
| Inception Mercury | 95% | $0.0004 | 3.4s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 6.2s | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.7s | |
| Gemma 3 4B | 98% | $0.0001 | 5.5s | |
| Gemma 3 12B | 98% | $0.0001 | 7.4s | |
| Grok 4 Fast | 100% | $0.0008 | 6.0s | |
| Inception Mercury 2 | 100% | $0.0012 | 1.7s | |
| Claude 3 Haiku | 85% | $0.0008 | 4.3s | |
| Stealth: Hunter Alpha | 95% | $0.0000 | 11.2s | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 9.5s | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.0s | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 9.7s | |
| Qwen 2.5 72B | 100% | $0.0003 | 10.4s | |
| GPT-5.4 Nano (Reasoning) | 95% | $0.0011 | 4.2s | |
| Mistral Medium 3.1 | 100% | $0.0012 | 5.1s | |
| GPT-4.1 Mini | 100% | $0.0010 | 6.3s | |
| Mistral Large 3 | 100% | $0.0011 | 7.2s | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.2s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Z.AI GLM 5 Turbo | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.4 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Qwen 3.5 122B | 100% | 100% | 100% | |
| Grok 4.20 (Beta, Reasoning) | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Qwen 3.5 27B | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.5s | 100% | |
| Gemini 3.1 Flash Lite (Preview) | 100% | $0.0009 | 1.7s | 100% | |
| Inception Mercury 2 | 100% | $0.0012 | 1.7s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 6.2s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.0s | 100% | |
| Grok 4 Fast | 100% | $0.0008 | 6.0s | 100% | |
| Mistral Medium 3.1 | 100% | $0.0012 | 5.1s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.2s | 100% | |
| GPT-4.1 Mini | 100% | $0.0010 | 6.3s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 9.5s | 100% | |
| Mistral Large 3 | 100% | $0.0011 | 7.2s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 9.7s | 100% | |
| Qwen 2.5 72B | 100% | $0.0003 | 10.4s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 6.9s | 100% | |
| Grok 4.20 (Beta) | 100% | $0.0032 | 1.7s | 100% | |
| Claude Haiku 4.5 | 100% | $0.0034 | 2.7s | 100% | |
| Stealth: Healer Alpha | 100% | $0.0000 | 16.4s | 100% | |
| Mistral Small 4 | 99% | $0.0004 | 3.1s | 96% | |
| DeepSeek-V2 Chat | 100% | $0.0008 | 16.8s | 100% | |
| Qwen3 235B A22B Instruct 2507 | 100% | $0.0004 | 18.2s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Forbidden words eliminated | ||
| 100.0% | Non-name text preserved | ||
| 100.0% | Structural similarity to original |
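For the word-avoidance variant, the "Forbidden words eliminated" criterion can be sketched as a whole-word scan; the function name and word list here are illustrative, not the benchmark's actual checker:

```python
import re

# The dialogue tags this variant asks models to avoid.
FORBIDDEN = ("said", "asked", "replied", "answered")

def forbidden_words_eliminated(text):
    """Pass when no banned dialogue tag survives the rewrite.

    Word boundaries prevent false hits inside longer words
    (e.g. "aforesaid" would not match "said" without them... it would
    with a plain substring check).
    """
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(FORBIDDEN), re.IGNORECASE)
    return pattern.search(text) is None

print(forbidden_words_eliminated('"Go," she whispered.'))    # True
print(forbidden_words_eliminated('"Go," she said softly.'))  # False
```

A binary check like this explains the clustering at 100% in the tables above: most models either purge every banned tag or miss at least one, with little middle ground per sample.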