Text Replacement
Tests deterministic text transformations: renaming characters and locations, expanding contractions, rewriting tense, shifting point of view (POV), swapping character genders, combining transformations, and avoiding specific words. Each output is scored by checking every expected change independently.
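Scoring of this kind can be illustrated with a minimal sketch (the checks below are hypothetical and simplified, not the benchmark's actual implementation): each expected change is verified independently, and the score is the fraction of checks that pass.

```python
import re

def score_character_rename(output: str) -> float:
    """Score a rename task (Elena -> Mirabel, Gregor -> Aldric) by
    checking each expected change independently (illustrative only)."""
    checks = [
        "Mirabel" in output,                       # new name present
        "Aldric" in output,                        # new name present
        re.search(r"\bElena\b", output) is None,   # old name fully removed
        re.search(r"\bGregor\b", output) is None,  # old name fully removed
    ]
    return sum(checks) / len(checks)

print(score_character_rename("Mirabel crossed the bridge to meet Aldric."))  # 1.0
print(score_character_rename("Mirabel crossed the bridge to meet Gregor."))  # 0.5
```

Because every check is scored separately, a model that performs most of a transformation correctly still earns partial credit rather than failing outright.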
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 97% | $0.0003 | 1.7s | |
| Mistral Small 3.2 24B | 97% | $0.0002 | 5.0s | |
| Gemini 2.5 Flash | 99% | $0.0015 | 2.2s | |
| Gemini 3 Flash (Preview) | 99% | $0.0019 | 3.4s | |
| Grok 4 Fast | 99% | $0.0008 | 6.5s | |
| GPT-4.1 Mini | 98% | $0.0011 | 7.0s | |
| Mistral Large 3 | 98% | $0.0011 | 7.7s | |
| Gemma 3 12B | 95% | $0.0001 | 9.0s | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0015 | 7.2s | |
| Qwen 2.5 72B | 98% | $0.0003 | 10.9s | |
| GPT-4o Mini (temp=1) | 95% | $0.0004 | 9.5s | |
| Claude Haiku 4.5 | 99% | $0.0036 | 3.2s | |
| Mistral Medium 3.1 | 97% | $0.0013 | 5.9s | |
| Mistral Small Creative | 96% | $0.0002 | 3.1s | |
| Grok 4.1 Fast | 99% | $0.0010 | 12.4s | |
| Ministral 3 14B | 93% | $0.0002 | 4.1s | |
| Gemini 2.5 Flash Lite (Reasoning) | 95% | $0.0019 | 17.4s | |
| DeepSeek V3 (2024-12-26) | 96% | $0.0008 | 16.0s | |
| GPT-4.1 | 98% | $0.0054 | 4.4s | |
| Mistral Large 2 | 98% | $0.0044 | 7.6s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
10 low-scoring outliers hidden: Ministral 3 8B (87.0%), Ministral 8B (86.7%), Mistral NeMO (86.6%), Arcee AI: Trinity Mini (85.7%), Ministral 3 3B (81.2%), Ministral 3B (80.9%), Cohere Command R+ (Aug. 2024) (73.7%), Hermes 3 70B (69.5%), Rocinante 12B (66.3%), Claude 3 Haiku (61.1%).
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 | 100% | 99% | 99% | |
| Claude Opus 4.5 | 100% | 98% | 98% | |
| Claude Sonnet 4 | 100% | 98% | 98% | |
| Gemini 3 Pro (Preview) | 100% | 98% | 98% | |
| Claude Opus 4.6 (Reasoning) | 100% | 98% | 98% | |
| Claude Sonnet 4.5 | 100% | 98% | 98% | |
| Grok 4 | 100% | 98% | 98% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 98% | 98% | |
| Z.AI GLM 5 | 99% | 98% | 98% | |
| Gemini 2.5 Pro | 99% | 97% | 97% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | 97% | 97% | |
| Gemini 3 Flash (Preview, Reasoning) | 99% | 97% | 97% | |
| Z.AI GLM 4.7 | 99% | 97% | 97% | |
| Claude Sonnet 4.6 | 99% | 97% | 97% | |
| Claude Haiku 4.5 | 99% | 97% | 97% | |
| Gemini 3 Flash (Preview) | 99% | 97% | 97% | |
| GPT-5 | 99% | 96% | 96% | |
| GPT-5.1 | 99% | 96% | 96% | |
| Claude 3.7 Sonnet | 99% | 96% | 96% | |
| ByteDance Seed 1.6 | 99% | 95% | 95% | |
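The stability ranking above multiplies the median score by a consistency factor. A minimal sketch of that composition follows; the consistency formula used here (one minus the score spread across runs) is an assumed definition for illustration, not necessarily the benchmark's own.

```python
from statistics import median

def stability(run_scores: list[float]) -> float:
    """Stability = median score x consistency. Consistency is taken
    here as 1 - (max - min) spread across runs (assumed definition)."""
    med = median(run_scores)
    consistency = 1.0 - (max(run_scores) - min(run_scores))
    return med * consistency

# A model scoring 100%, 100%, 98% across runs: high median, small spread.
print(round(stability([1.0, 1.0, 0.98]), 3))  # 0.98
```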
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 3 Flash (Preview) | 99% | $0.0019 | 3.4s | 97% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0015 | 7.2s | 97% | |
| Claude Haiku 4.5 | 99% | $0.0036 | 3.2s | 97% | |
| Grok 4 Fast | 99% | $0.0008 | 6.5s | 95% | |
| Gemini 2.5 Flash | 99% | $0.0015 | 2.2s | 92% | |
| GPT-4.1 Mini | 98% | $0.0011 | 7.0s | 94% | |
| Grok 4.1 Fast | 99% | $0.0010 | 12.4s | 92% | |
| Mistral Medium 3.1 | 97% | $0.0013 | 5.9s | 91% | |
| Gemini 2.5 Flash Lite | 97% | $0.0003 | 1.7s | 86% | |
| Claude Sonnet 4.5 | 100% | $0.011 | 4.9s | 98% | |
| Claude Sonnet 4 | 100% | $0.011 | 6.1s | 98% | |
| GPT-4.1 | 98% | $0.0054 | 4.4s | 92% | |
| Qwen 2.5 72B | 98% | $0.0003 | 10.9s | 89% | |
| Mistral Large 3 | 98% | $0.0011 | 7.7s | 88% | |
| Claude Sonnet 4.6 | 99% | $0.011 | 4.7s | 97% | |
| Mistral Large | 98% | $0.0044 | 7.6s | 90% | |
| Gemini 2.5 Flash (Reasoning) | 98% | $0.0063 | 11.7s | 93% | |
| Claude 3.7 Sonnet | 99% | $0.011 | 5.9s | 96% | |
| Mistral Small Creative | 96% | $0.0002 | 3.1s | 84% | |
| GPT-4o, May 13th (temp=0) | 99% | $0.011 | 3.5s | 94% | |
| | | Specific Prompt | Generic Prompt | Specific Prompt | Generic Prompt | Specific Prompt | Generic Prompt | Specific Prompt | Generic Prompt | Specific Prompt | Generic Prompt | Specific Prompt | Generic Prompt | Specific Prompt | Generic Prompt | Specific Prompt | Generic Prompt | Specific Prompt | Generic Prompt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | Total ▼ | Character rename: Elena->Mirabel, Gregor->Aldric | Character rename: Elena->Mirabel, Gregor->Aldric | Location rename: market square, outer ring, bridge, northern mines | Location rename: market square, outer ring, bridge, northern mines | Expand all contractions | Expand all contractions | Tense rewriting: past to present | Tense rewriting: past to present | POV shift: 3rd person to 1st person (Elena's perspective) | POV shift: 3rd person to 1st person (Elena's perspective) | Multi-character gender swap: Priya(F)->Rohan(M), Mara unchanged | Multi-character gender swap: Priya(F)->Rohan(M), Mara unchanged | Combined: 3rd person past → 1st person present | Combined: 3rd person past → 1st person present | Passive voice → active voice | Passive voice → active voice | Avoid said/asked/replied/answered | Avoid said/asked/replied/answered |
| Claude Sonnet 4 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 98% | 97% | 100% | 100% |
| Claude Opus 4.6 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 99% | 100% | 100% | 100% | 99% | 98% | 98% | 100% | 100% |
| Claude Sonnet 4.5 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 99% | 99% | 96% | 100% | 100% |
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 100% | 100% | 99% | 99% | 96% | 100% | 100% |
| Grok 4 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 99% | 96% | 100% | 100% |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 100% | 100% | 100% | 99% | 99% | 98% | 96% | 100% | 100% |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 99% | 100% | 100% | 100% | 100% | 100% | 99% | 99% | 96% | 100% | 100% |
| Claude Opus 4.5 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 99% | 100% | 100% | 100% | 99% | 97% | 97% | 100% | 100% |
| Z.AI GLM 5 | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 98% | 100% | 100% | 100% | 100% | 100% | 99% | 98% | 96% | 100% | 100% |
| Gemini 2.5 Pro | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 99% | 99% | 95% | 100% | 100% |
| GPT-5 | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 98% | 98% | 97% | 100% | 100% |
| Z.AI GLM 4.7 | 99% | 100% | 100% | 100% | 99% | 100% | 100% | 100% | 97% | 100% | 100% | 100% | 100% | 100% | 99% | 98% | 96% | 100% | 100% |
| Qwen 3.5 Plus (2026-02-15) | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 100% | 99% | 100% | 99% | 99% | 97% | 96% | 100% | 100% |
| Gemini 3 Flash (Preview, Reasoning) | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 98% | 100% | 100% | 100% | 100% | 100% | 99% | 97% | 95% | 100% | 100% |
| Gemini 3.1 Pro (Preview) | 99% | 100% | 100% | 94% | 100% | 100% | 100% | 100% | 99% | 100% | 100% | 100% | 100% | 100% | 99% | 99% | 96% | 100% | 100% |
Generic Prompt
Character rename: Elena->Mirabel, Gregor->Aldric
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| Aion 2.0 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | |
| Ministral 8B | 100% | $0.0001 | 3.2s | |
| Ministral 3 8B | 100% | $0.0002 | 2.9s | |
| Mistral Small Creative | 100% | $0.0002 | 2.9s | |
| GPT-4.1 Nano | 100% | $0.0003 | 3.7s | |
| Ministral 3 14B | 100% | $0.0002 | 4.0s | |
| Grok 4 Fast | 100% | $0.0005 | 3.3s | |
| Gemma 3 4B | 100% | $0.0001 | 6.1s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 5.7s | |
| Mistral NeMO | 93% | $0.0002 | 2.3s | |
| Llama 3.1 8B | 100% | $0.0000 | 9.9s | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0007 | 4.7s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0003 | 6.6s | |
| Gemma 3 12B | 100% | $0.0001 | 8.6s | |
| Mistral Medium 3.1 | 100% | $0.0012 | 4.8s | |
| Rocinante 12B | 97% | $0.0004 | 6.3s | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.2s | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 8.7s | |
| Grok 4.1 Fast | 100% | $0.0007 | 7.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | 100% | |
| Ministral 3 8B | 100% | $0.0002 | 2.9s | 100% | |
| Mistral Small Creative | 100% | $0.0002 | 2.9s | 100% | |
| Ministral 8B | 100% | $0.0001 | 3.2s | 100% | |
| GPT-4.1 Nano | 100% | $0.0003 | 3.7s | 100% | |
| Grok 4 Fast | 100% | $0.0005 | 3.3s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | 100% | |
| Ministral 3 14B | 100% | $0.0002 | 4.0s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0007 | 4.7s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.2s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 5.7s | 100% | |
| Gemma 3 4B | 100% | $0.0001 | 6.1s | 100% | |
| Mistral Medium 3.1 | 100% | $0.0012 | 4.8s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0003 | 6.6s | 100% | |
| Claude Haiku 4.5 | 100% | $0.0034 | 2.6s | 100% | |
| GPT-4.1 Mini | 100% | $0.0010 | 6.7s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0029 | 4.3s | 100% | |
| Grok 4.1 Fast | 100% | $0.0007 | 7.7s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 8.6s | 100% | |
| Mistral Large 3 | 100% | $0.0010 | 7.4s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Name replacement accuracy | | |
| 100.0% | No remaining old names | | |
| 100.0% | Non-name text preserved | | |
Location rename: market square, outer ring, bridge, northern mines
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Z.AI GLM 5 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| Aion 2.0 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.5s | |
| GPT-4.1 Nano | 98% | $0.0003 | 3.5s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.6s | |
| Grok 4 Fast | 100% | $0.0005 | 3.8s | |
| Llama 3.1 70B | 100% | $0.0005 | 17.8s | |
| ByteDance Seed 1.6 Flash | 99% | $0.0003 | 5.3s | |
| Grok 4.1 Fast | 100% | $0.0006 | 5.2s | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | |
| Gemma 3 12B | 100% | $0.0001 | 8.6s | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0009 | 7.3s | |
| GPT-4.1 Mini | 100% | $0.0010 | 5.8s | |
| Mistral Large 3 | 100% | $0.0010 | 7.3s | |
| Mistral Small Creative | 96% | $0.0002 | 2.9s | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 6.8s | |
| Gemma 3 27B | 98% | $0.0002 | 12.0s | |
| Llama 3.1 8B | 99% | $0.0000 | 11.9s | |
| Ministral 3 14B | 96% | $0.0002 | 4.4s | |
| Gemma 3 4B | 96% | $0.0001 | 8.4s | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0007 | 14.5s | |
| DeepSeek-V2 Chat | 99% | $0.0007 | 14.2s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.5s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | 100% | |
| Grok 4 Fast | 100% | $0.0005 | 3.8s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.6s | 100% | |
| Grok 4.1 Fast | 100% | $0.0006 | 5.2s | 100% | |
| GPT-4.1 Mini | 100% | $0.0010 | 5.8s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 8.6s | 100% | |
| Claude Haiku 4.5 | 100% | $0.0034 | 2.8s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0009 | 7.3s | 100% | |
| ByteDance Seed 1.6 Flash | 99% | $0.0003 | 5.3s | 97% | |
| Mistral Large 3 | 100% | $0.0010 | 7.3s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 6.8s | 100% | |
| Mistral Small Creative | 96% | $0.0002 | 2.9s | 96% | |
| Ministral 3 14B | 96% | $0.0002 | 4.4s | 96% | |
| GPT-4.1 | 100% | $0.0052 | 4.1s | 100% | |
| GPT-4.1 Nano | 98% | $0.0003 | 3.5s | 91% | |
| Mistral Medium 3.1 | 96% | $0.0012 | 4.6s | 96% | |
| GPT-4o, Aug. 6th (temp=1) | 100% | $0.0065 | 2.8s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0044 | 6.7s | 100% | |
| Mistral Large 2 | 100% | $0.0042 | 7.2s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Name replacement accuracy | | |
| 100.0% | No remaining old names | | |
| 100.0% | Non-name text preserved | | |
Expand all contractions
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Z.AI GLM 5 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
| Z.AI GLM 4.7 | 100% | |
| Gemini 2.5 Pro | 100% | |
| Grok 4 | 100% | |
| Claude Sonnet 4.5 | 100% | |
| Claude Opus 4 | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0002 | 1.3s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 3.7s | |
| Mistral NeMO | 98% | $0.0001 | 2.6s | |
| Grok 4 Fast | 100% | $0.0007 | 4.9s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 7.0s | |
| Gemini 2.5 Flash | 100% | $0.0012 | 1.8s | |
| Ministral 8B | 95% | $0.0001 | 2.4s | |
| Ministral 3 14B | 98% | $0.0002 | 3.2s | |
| Llama 3.1 8B | 98% | $0.0000 | 6.1s | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 8.4s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 8.4s | |
| Qwen 2.5 72B | 100% | $0.0002 | 7.7s | |
| Mistral Small Creative | 97% | $0.0002 | 2.3s | |
| Mistral Large 3 | 100% | $0.0008 | 5.8s | |
| Arcee AI: Trinity Mini | 97% | $0.0003 | 12.7s | |
| GPT-4.1 Nano | 97% | $0.0002 | 2.8s | |
| Gemini 2.5 Flash Lite (Reasoning) | 99% | $0.0008 | 4.6s | |
| Ministral 3 8B | 96% | $0.0001 | 3.2s | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.7s | |
| GPT-4.1 Mini | 99% | $0.0008 | 5.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| Gemini 2.5 Pro | 100% | 100% | 100% | |
| Grok 4 | 100% | 100% | 100% | |
| Claude Sonnet 4.5 | 100% | 100% | 100% | |
| Claude Opus 4 | 100% | 100% | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0002 | 1.3s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 3.7s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0012 | 1.8s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.7s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 7.0s | 100% | |
| Grok 4 Fast | 100% | $0.0007 | 4.9s | 99% | |
| Mistral Large 3 | 100% | $0.0008 | 5.8s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0012 | 5.5s | 100% | |
| Mistral NeMO | 98% | $0.0001 | 2.6s | 97% | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 8.4s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 99% | $0.0008 | 4.6s | 99% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 8.4s | 99% | |
| Claude Haiku 4.5 | 100% | $0.0027 | 2.2s | 100% | |
| Qwen 2.5 72B | 100% | $0.0002 | 7.7s | 98% | |
| GPT-4.1 Mini | 99% | $0.0008 | 5.1s | 99% | |
| Ministral 3 14B | 98% | $0.0002 | 3.2s | 97% | |
| Mistral Small Creative | 97% | $0.0002 | 2.3s | 97% | |
| Hermes 3 70B | 100% | $0.0003 | 11.8s | 100% | |
| GPT-4.1 Nano | 97% | $0.0002 | 2.8s | 96% | |
| Mistral Medium 3.1 | 98% | $0.0010 | 5.4s | 98% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Name replacement accuracy | | |
| 100.0% | Non-name text preserved | | |
| 100.0% | Possessive traps preserved | | |
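The "possessive traps preserved" evaluator targets the classic failure mode of naive apostrophe handling: expanding possessives (`the miner's lamp`) as if they were contractions. A sketch of a transformation that avoids the trap follows; the contraction list is a hypothetical excerpt, not the benchmark's actual set.

```python
import re

# Hypothetical excerpt of a contraction map, for illustration only.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}

def expand_contractions(text: str) -> str:
    """Expand listed contractions while leaving possessives
    (e.g. "the miner's lamp") untouched."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
        re.IGNORECASE,
    )

    def repl(m: re.Match) -> str:
        expansion = CONTRACTIONS[m.group(0).lower()]
        # Preserve sentence-initial capitalization.
        if m.group(0)[0].isupper():
            expansion = expansion[0].upper() + expansion[1:]
        return expansion

    return pattern.sub(repl, text)

print(expand_contractions("It's cold; don't touch the miner's lamp."))
# It is cold; do not touch the miner's lamp.
```

Matching against an explicit whitelist of contractions, rather than any apostrophe-`s` sequence, is what keeps `miner's` intact.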
Tense rewriting: past to present
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 98% | $0.0003 | 1.5s | |
| Mistral Small 3.2 24B | 99% | $0.0002 | 5.9s | |
| Ministral 3 14B | 99% | $0.0002 | 3.8s | |
| Mistral NeMO | 98% | $0.0002 | 3.0s | |
| Ministral 3 8B | 98% | $0.0002 | 3.7s | |
| Ministral 8B | 97% | $0.0001 | 3.6s | |
| Qwen 2.5 72B | 99% | $0.0003 | 10.4s | |
| Mistral Large 3 | 99% | $0.0010 | 7.4s | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0015 | 6.6s | |
| Rocinante 12B | 69% | $0.0004 | 8.1s | |
| Mistral Small Creative | 96% | $0.0002 | 2.9s | |
| Arcee AI: Trinity Large (Preview) | 99% | $0.0000 | 21.2s | |
| Llama 3.1 8B | 96% | $0.0001 | 11.5s | |
| Writer: Palmyra X5 | 100% | $0.0033 | 9.2s | |
| Gemini 3 Flash (Preview) | 96% | $0.0018 | 3.2s | |
| Mistral Medium 3.1 | 96% | $0.0012 | 6.5s | |
| Hermes 3 70B | 95% | $0.0003 | 13.9s | |
| Llama 3.1 70B | 96% | $0.0007 | 26.0s | |
| Mistral Large 2 | 99% | $0.0041 | 7.2s | |
| Mistral Large | 99% | $0.0041 | 7.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Sonnet 4.5 | 100% | 100% | 100% | |
| Claude Opus 4.6 (Reasoning) | 100% | 99% | 99% | |
| Claude Sonnet 4 | 100% | 99% | 99% | |
| Claude Opus 4.6 | 99% | 100% | 99% | |
| Gemini 3 Pro (Preview) | 99% | 100% | 99% | |
| Mistral Large 3 | 99% | 100% | 99% | |
| Claude 3.7 Sonnet | 99% | 100% | 99% | |
| Mistral Large 2 | 99% | 100% | 99% | |
| Mistral Large | 99% | 100% | 99% | |
| Mistral Small 3.2 24B | 99% | 100% | 99% | |
| Arcee AI: Trinity Large (Preview) | 99% | 100% | 99% | |
| Claude Opus 4.5 | 99% | 100% | 99% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | 100% | 99% | |
| Writer: Palmyra X5 | 100% | 99% | 99% | |
| Claude 3.5 Sonnet | 99% | 99% | 99% | |
| Grok 4 | 100% | 99% | 99% | |
| Gemini 3.1 Pro (Preview) | 99% | 100% | 99% | |
| Ministral 3 14B | 99% | 100% | 99% | |
| Ministral 3 8B | 98% | 100% | 98% | |
| Claude Sonnet 4.6 (Reasoning) | 99% | 97% | 97% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Mistral Small 3.2 24B | 99% | $0.0002 | 5.9s | 99% | |
| Ministral 3 14B | 99% | $0.0002 | 3.8s | 99% | |
| Ministral 3 8B | 98% | $0.0002 | 3.7s | 98% | |
| Mistral Large 3 | 99% | $0.0010 | 7.4s | 99% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0015 | 6.6s | 99% | |
| Mistral NeMO | 98% | $0.0002 | 3.0s | 97% | |
| Gemini 2.5 Flash Lite | 98% | $0.0003 | 1.5s | 96% | |
| Arcee AI: Trinity Large (Preview) | 99% | $0.0000 | 21.2s | 99% | |
| Qwen 2.5 72B | 99% | $0.0003 | 10.4s | 96% | |
| Writer: Palmyra X5 | 100% | $0.0033 | 9.2s | 99% | |
| Mistral Large | 99% | $0.0041 | 7.1s | 99% | |
| Mistral Large 2 | 99% | $0.0041 | 7.2s | 99% | |
| Ministral 8B | 97% | $0.0001 | 3.6s | 93% | |
| Gemini 3 Flash (Preview) | 96% | $0.0018 | 3.2s | 95% | |
| Mistral Medium 3.1 | 96% | $0.0012 | 6.5s | 95% | |
| Llama 3.1 Nemotron 70B | 97% | $0.0013 | 13.9s | 97% | |
| Claude Haiku 4.5 | 97% | $0.0033 | 3.3s | 96% | |
| Mistral Small Creative | 96% | $0.0002 | 2.9s | 91% | |
| Llama 3.1 8B | 96% | $0.0001 | 11.5s | 91% | |
| GPT-4o, Aug. 6th (temp=0) | 97% | $0.0063 | 2.6s | 97% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 90.0% | Dialogue content preserved | | |
| 97.8% | Name replacement accuracy | | |
| 100.0% | Non-name text preserved | | |
POV shift: 3rd person to 1st person (Elena's perspective)
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | |
|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | |
| GPT-5.1 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| GPT-5.2 | 100% | |
| Grok 4.1 Fast | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
| Z.AI GLM 4.7 | 100% | |
| GPT-4.1 | 100% | |
| Gemini 2.5 Pro | 100% | |
| Grok 4 | 100% | |
| Claude Sonnet 4.5 | 100% | |
| Claude Opus 4 | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.7s | |
| Grok 4 Fast | 100% | $0.0004 | 2.6s | |
| GPT-4.1 Nano | 98% | $0.0003 | 3.5s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.2s | |
| Ministral 3 14B | 99% | $0.0002 | 3.7s | |
| Llama 3.1 8B | 99% | $0.0000 | 5.8s | |
| Gemma 3 12B | 100% | $0.0001 | 7.7s | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.2s | |
| Grok 4.1 Fast | 100% | $0.0006 | 7.3s | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 9.2s | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 9.0s | |
| GPT-4.1 Mini | 99% | $0.0010 | 6.3s | |
| Mistral Large 3 | 100% | $0.0010 | 7.3s | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0013 | 9.2s | |
| Qwen 2.5 72B | 100% | $0.0003 | 10.7s | |
| ByteDance Seed 1.6 Flash | 99% | $0.0005 | 9.5s | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0014 | 6.5s | |
| Gemma 3 4B | 95% | $0.0001 | 5.9s | |
| Arcee AI: Trinity Mini | 96% | $0.0002 | 6.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| Gemini 2.5 Pro | 100% | 100% | 100% | |
| Grok 4 | 100% | 100% | 100% | |
| Claude Sonnet 4.5 | 100% | 100% | 100% | |
| Claude Opus 4 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.7s | 100% | |
| Grok 4 Fast | 100% | $0.0004 | 2.6s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.2s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.2s | 100% | |
| Ministral 3 14B | 99% | $0.0002 | 3.7s | 99% | |
| Gemma 3 12B | 100% | $0.0001 | 7.7s | 100% | |
| Grok 4.1 Fast | 100% | $0.0006 | 7.3s | 100% | |
| Claude Haiku 4.5 | 100% | $0.0033 | 2.9s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0014 | 6.5s | 100% | |
| Mistral Large 3 | 100% | $0.0010 | 7.3s | 100% | |
| GPT-4.1 Nano | 98% | $0.0003 | 3.5s | 96% | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 9.0s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 9.2s | 100% | |
| Llama 3.1 8B | 99% | $0.0000 | 5.8s | 96% | |
| Qwen 2.5 72B | 100% | $0.0003 | 10.7s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0013 | 9.2s | 99% | |
| GPT-4.1 Mini | 99% | $0.0010 | 6.3s | 96% | |
| GPT-4.1 | 100% | $0.0050 | 4.5s | 100% | |
| Mistral Large 2 | 100% | $0.0041 | 6.9s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 99.7% | Name replacement accuracy | | |
| 100.0% | No remaining old names | | |
| 100.0% | Non-name text preserved | | |
Multi-character gender swap: Priya(F)->Rohan(M), Mara unchanged
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| Aion 2.0 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
| Z.AI GLM 4.7 | 100% | |
| GPT-4.1 | 100% | |
| Gemini 2.5 Pro | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Mistral Small Creative | 100% | $0.0002 | 3.2s | |
| Grok 4 Fast | 100% | $0.0007 | 5.0s | |
| Gemini 2.5 Flash | 100% | $0.0017 | 2.4s | |
| Mistral Medium 3.1 | 100% | $0.0014 | 6.9s | |
| GPT-4.1 Mini | 100% | $0.0012 | 6.5s | |
| Grok 4.1 Fast | 100% | $0.0007 | 10.2s | |
| Gemini 3 Flash (Preview) | 100% | $0.0021 | 3.5s | |
| Qwen 2.5 72B | 89% | $0.0003 | 11.6s | |
| ByteDance Seed 1.6 Flash | 95% | $0.0007 | 12.0s | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0017 | 7.6s | |
| Claude Haiku 4.5 | 100% | $0.0041 | 3.1s | |
| Llama 3.1 Nemotron 70B | 100% | $0.0015 | 16.8s | |
| Ministral 3 3B | 89% | $0.0001 | 2.2s | |
| Llama 3.1 70B | 96% | $0.0005 | 34.1s | |
| GPT-4.1 | 100% | $0.0059 | 4.6s | |
| Hermes 3 405B | 100% | $0.0013 | 25.6s | |
| Gemma 3 27B | 94% | $0.0002 | 17.7s | |
| Arcee AI: Trinity Mini | 91% | $0.0002 | 7.4s | |
| GPT-4o, Aug. 6th (temp=0) | 100% | $0.0074 | 2.7s | |
| Ministral 3B | 87% | $0.0001 | 2.2s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| Gemini 2.5 Pro | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Mistral Small Creative | 100% | $0.0002 | 3.2s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0017 | 2.4s | 100% | |
| Grok 4 Fast | 100% | $0.0007 | 5.0s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0021 | 3.5s | 100% | |
| GPT-4.1 Mini | 100% | $0.0012 | 6.5s | 100% | |
| Mistral Medium 3.1 | 100% | $0.0014 | 6.9s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0017 | 7.6s | 100% | |
| Grok 4.1 Fast | 100% | $0.0007 | 10.2s | 100% | |
| Claude Haiku 4.5 | 100% | $0.0041 | 3.1s | 100% | |
| GPT-4.1 | 100% | $0.0059 | 4.6s | 100% | |
| Llama 3.1 Nemotron 70B | 100% | $0.0015 | 16.8s | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | $0.0074 | 2.7s | 100% | |
| GPT-4o, Aug. 6th (temp=1) | 100% | $0.0074 | 3.5s | 99% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | $0.0065 | 10.3s | 100% | |
| Hermes 3 405B | 100% | $0.0013 | 25.6s | 100% | |
| GPT-4o, May 13th (temp=1) | 100% | $0.012 | 3.9s | 100% | |
| GPT-4o, May 13th (temp=0) | 100% | $0.012 | 4.6s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 99% | $0.0055 | 8.3s | 96% | |
| Claude Sonnet 4.6 | 100% | $0.012 | 5.1s | 100% | |
| Claude Sonnet 4.5 | 100% | $0.012 | 5.2s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Dialogue content preserved | ||
| 98.7% | Mara pronouns preserved (coreference test) | ||
| 99.4% | Name replacement accuracy | ||
| 100.0% | Non-name text preserved |
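The "Mara pronouns preserved" evaluator above checks that a secondary character's pronouns survive the transformation. A rough sketch of such a check, assuming simple whole-word counting (not the benchmark's actual scorer):

```python
import re

def pronoun_counts(text, pronouns=("she", "her", "hers")):
    """Count case-insensitive whole-word occurrences of each pronoun."""
    return {p: len(re.findall(rf"\b{p}\b", text, flags=re.IGNORECASE))
            for p in pronouns}

original = "Mara lowered her lantern. She waited."
rewritten = "Mara lowered her lantern. She waits."

# Pronouns referring to Mara must survive the tense rewrite.
assert pronoun_counts(original) == pronoun_counts(rewritten)
```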
Combined: 3rd person past → 1st person present
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 99% | $0.0003 | 1.8s | |
| Grok 4 Fast | 98% | $0.0006 | 5.0s | |
| Gemini 2.5 Flash | 94% | $0.0015 | 2.1s | |
| GPT-4.1 Nano | 98% | $0.0003 | 3.6s | |
| Gemma 3 12B | 94% | $0.0001 | 9.3s | |
| Mistral Small 3.2 24B | 98% | $0.0002 | 4.4s | |
| GPT-4.1 Mini | 99% | $0.0010 | 12.0s | |
| Qwen 2.5 72B | 98% | $0.0003 | 9.5s | |
| Gemma 3 27B | 98% | $0.0002 | 13.1s | |
| Hermes 3 70B | 99% | $0.0003 | 13.7s | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0014 | 6.7s | |
| Gemini 3 Flash (Preview) | 98% | $0.0018 | 3.1s | |
| Mistral Large 3 | 98% | $0.0010 | 7.1s | |
| Claude Haiku 4.5 | 99% | $0.0033 | 2.8s | |
| Mistral Medium 3.1 | 96% | $0.0012 | 5.8s | |
| DeepSeek V3.1 | 99% | $0.0007 | 36.6s | |
| Mistral Small Creative | 95% | $0.0002 | 3.4s | |
| DeepSeek V3 (2024-12-26) | 97% | $0.0007 | 14.3s | |
| Grok 4.1 Fast | 95% | $0.0008 | 11.0s | |
| Hermes 3 405B | 99% | $0.0011 | 23.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 99% | 100% | 99% | |
| Claude Opus 4.6 | 99% | 100% | 99% | |
| Claude Sonnet 4 | 99% | 100% | 99% | |
| Aion 2.0 | 99% | 99% | 99% | |
| Gemini 2.5 Pro | 99% | 99% | 99% | |
| Gemini 3.1 Pro (Preview) | 99% | 100% | 99% | |
| Qwen 3.5 397B A17B | 99% | 100% | 99% | |
| Claude Sonnet 4.6 | 99% | 100% | 99% | |
| Claude Opus 4.5 | 99% | 100% | 99% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | 100% | 99% | |
| GPT-4o, May 13th (temp=0) | 99% | 100% | 99% | |
| Claude 3.5 Sonnet | 99% | 100% | 99% | |
| Claude 3.7 Sonnet | 99% | 100% | 99% | |
| GPT-4o, Aug. 6th (temp=0) | 99% | 100% | 99% | |
| DeepSeek V3.2 | 99% | 100% | 99% | |
| Claude Haiku 4.5 | 99% | 99% | 99% | |
| GPT-5.1 | 99% | 99% | 99% | |
| Hermes 3 405B | 99% | 100% | 98% | |
| Hermes 3 70B | 99% | 100% | 98% | |
| DeepSeek V3.1 | 99% | 99% | 98% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 99% | $0.0003 | 1.8s | 98% | |
| Mistral Small 3.2 24B | 98% | $0.0002 | 4.4s | 97% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0014 | 6.7s | 99% | |
| GPT-4.1 Nano | 98% | $0.0003 | 3.6s | 96% | |
| Grok 4 Fast | 98% | $0.0006 | 5.0s | 96% | |
| Qwen 2.5 72B | 98% | $0.0003 | 9.5s | 98% | |
| Gemini 3 Flash (Preview) | 98% | $0.0018 | 3.1s | 97% | |
| Claude Haiku 4.5 | 99% | $0.0033 | 2.8s | 99% | |
| Hermes 3 70B | 99% | $0.0003 | 13.7s | 98% | |
| Mistral Large 3 | 98% | $0.0010 | 7.1s | 98% | |
| GPT-4.1 Mini | 99% | $0.0010 | 12.0s | 98% | |
| Gemma 3 27B | 98% | $0.0002 | 13.1s | 95% | |
| DeepSeek V3 (2024-12-26) | 97% | $0.0007 | 14.3s | 97% | |
| Mistral Medium 3.1 | 96% | $0.0012 | 5.8s | 95% | |
| GPT-4.1 | 99% | $0.0050 | 4.4s | 98% | |
| GPT-4o Mini (temp=0) | 95% | $0.0004 | 9.6s | 95% | |
| DeepSeek-V2 Chat | 97% | $0.0008 | 15.2s | 97% | |
| GPT-4o, Aug. 6th (temp=0) | 99% | $0.0063 | 2.9s | 99% | |
| Mistral Large | 98% | $0.0041 | 6.7s | 98% | |
| Mistral Large 2 | 98% | $0.0041 | 7.3s | 98% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Dialogue content preserved | ||
| 96.1% | Name replacement accuracy | ||
| 100.0% | Non-name text preserved |
Passive voice → active voice
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Grok 4.1 Fast | 96% | $0.0014 | 16.8s | |
| Qwen 3.5 Plus (2026-02-15) | 96% | $0.0021 | 10.9s | |
| Gemini 2.5 Flash | 95% | $0.0021 | 2.9s | |
| Gemini 3 Flash (Preview) | 94% | $0.0026 | 4.3s | |
| Gemini 2.5 Flash Lite | 90% | $0.0004 | 2.5s | |
| Grok 4 Fast | 93% | $0.0013 | 11.1s | |
| Claude Haiku 4.5 | 95% | $0.0050 | 5.3s | |
| DeepSeek V3.1 | 89% | $0.0008 | 37.0s | |
| DeepSeek-V2 Chat | 93% | $0.0011 | 22.6s | |
| DeepSeek V3 (2024-12-26) | 93% | $0.0012 | 23.8s | |
| GPT-4.1 Mini | 90% | $0.0015 | 10.8s | |
| Mistral Large 3 | 91% | $0.0015 | 10.5s | |
| DeepSeek V3.2 | 96% | $0.0008 | 52.0s | |
| GPT-4.1 | 89% | $0.0074 | 7.2s | |
| Minimax M2.5 | 94% | $0.0018 | 1.3m | |
| Claude Sonnet 4 | 97% | $0.015 | 9.5s | |
| Z.AI GLM 4.5 | 94% | $0.0055 | 44.8s | |
| Claude Sonnet 4.5 | 96% | $0.015 | 7.4s | |
| ByteDance Seed 1.6 Flash | 90% | $0.0011 | 19.6s | |
| Mistral Large 2 | 91% | $0.0061 | 10.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.5 | 97% | 99% | 97% | |
| Claude Opus 4.6 | 98% | 99% | 96% | |
| GPT-5 | 97% | 99% | 96% | |
| Claude Sonnet 4 | 97% | 99% | 96% | |
| Qwen 3.5 397B A17B | 97% | 98% | 95% | |
| Gemini 3 Pro (Preview) | 96% | 99% | 95% | |
| Gemini 3.1 Pro (Preview) | 96% | 99% | 95% | |
| Z.AI GLM 4.7 | 96% | 99% | 95% | |
| Claude Sonnet 4.6 (Reasoning) | 96% | 99% | 95% | |
| Claude Opus 4.6 (Reasoning) | 96% | 98% | 94% | |
| Qwen 3.5 Plus (2026-02-15) | 96% | 98% | 94% | |
| Grok 4.1 Fast | 96% | 98% | 94% | |
| DeepSeek V3.2 | 96% | 98% | 94% | |
| GPT-5.1 | 96% | 98% | 94% | |
| Aion 2.0 | 96% | 98% | 94% | |
| Claude Sonnet 4.5 | 96% | 98% | 93% | |
| Z.AI GLM 5 | 96% | 97% | 93% | |
| Gemini 3 Flash (Preview, Reasoning) | 95% | 98% | 93% | |
| Gemini 2.5 Flash (Reasoning) | 94% | 98% | 93% | |
| Z.AI GLM 4.5 | 94% | 98% | 93% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Qwen 3.5 Plus (2026-02-15) | 96% | $0.0021 | 10.9s | 94% | |
| Gemini 2.5 Flash | 95% | $0.0021 | 2.9s | 92% | |
| Grok 4.1 Fast | 96% | $0.0014 | 16.8s | 94% | |
| Gemini 3 Flash (Preview) | 94% | $0.0026 | 4.3s | 92% | |
| Claude Sonnet 4 | 97% | $0.015 | 9.5s | 96% | |
| Claude Haiku 4.5 | 95% | $0.0050 | 5.3s | 90% | |
| Claude Sonnet 4.5 | 96% | $0.015 | 7.4s | 93% | |
| Claude Opus 4.5 | 97% | $0.025 | 8.2s | 97% | |
| Claude Opus 4.6 | 98% | $0.025 | 8.7s | 96% | |
| Mistral Large 3 | 91% | $0.0015 | 10.5s | 90% | |
| DeepSeek-V2 Chat | 93% | $0.0011 | 22.6s | 91% | |
| Claude Sonnet 4.6 | 94% | $0.015 | 7.5s | 92% | |
| Mistral Large 2 | 91% | $0.0061 | 10.7s | 90% | |
| Mistral Large | 91% | $0.0061 | 10.5s | 90% | |
| DeepSeek V3 (2024-12-26) | 93% | $0.0012 | 23.8s | 90% | |
| GPT-4.1 Mini | 90% | $0.0015 | 10.8s | 88% | |
| Gemini 2.5 Flash (Reasoning) | 94% | $0.014 | 21.5s | 93% | |
| Claude 3.7 Sonnet | 93% | $0.015 | 8.4s | 91% | |
| Grok 4 Fast | 93% | $0.0013 | 11.1s | 84% | |
| Gemini 2.5 Flash Lite (Reasoning) | 95% | $0.0034 | 40.3s | 93% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 91.4% | Dialogue content preserved | ||
| 100.0% | No hallucinated or fabricated content | ||
| 87.5% | Non-passive narration preserved | ||
| 74.5% | Passive → active voice transformations | ||
| 100.0% | Structural similarity to original |
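The low median on "Passive → active voice transformations" suggests models leave some passives behind. A deliberately naive detector of leftover passives, assuming the common "be + -ed participle" pattern (irregular participles like "taken" are missed; purely illustrative):

```python
import re

# A form of "to be" followed by a word ending in "-ed".
PASSIVE = re.compile(r"\b(?:was|were|is|are|been|being)\s+\w+ed\b",
                     re.IGNORECASE)

def passive_hits(text):
    """Return the passive-looking spans found in text."""
    return PASSIVE.findall(text)

print(passive_hits("The gate was opened by the guard."))  # one hit
print(passive_hits("The guard opened the gate."))         # no hits
```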
Avoid said/asked/replied/answered
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| Aion 2.0 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | |
| Mistral Small Creative | 98% | $0.0002 | 3.0s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.8s | |
| Gemma 3 12B | 100% | $0.0001 | 8.8s | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0006 | 4.9s | |
| Grok 4 Fast | 100% | $0.0006 | 5.3s | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 8.5s | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 9.0s | |
| Qwen 2.5 72B | 98% | $0.0003 | 9.8s | |
| Mistral Medium 3.1 | 100% | $0.0012 | 4.7s | |
| GPT-4.1 Mini | 100% | $0.0010 | 6.2s | |
| Mistral Large 3 | 100% | $0.0010 | 7.2s | |
| Gemma 3 27B | 100% | $0.0002 | 14.1s | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.8s | |
| ByteDance Seed 1.6 Flash | 96% | $0.0006 | 10.7s | |
| Hermes 3 70B | 100% | $0.0003 | 14.3s | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 7.2s | |
| Llama 3.1 70B | 100% | $0.0004 | 21.0s | |
| Z.AI GLM 4.5 | 100% | $0.0016 | 13.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.8s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0006 | 4.9s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | 100% | |
| Grok 4 Fast | 100% | $0.0006 | 5.3s | 100% | |
| Mistral Medium 3.1 | 100% | $0.0012 | 4.7s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 8.8s | 100% | |
| GPT-4.1 Mini | 100% | $0.0010 | 6.2s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 8.5s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.8s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 9.0s | 100% | |
| Mistral Large 3 | 100% | $0.0010 | 7.2s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 7.2s | 100% | |
| Claude Haiku 4.5 | 100% | $0.0033 | 2.7s | 100% | |
| Gemma 3 27B | 100% | $0.0002 | 14.1s | 100% | |
| Hermes 3 70B | 100% | $0.0003 | 14.3s | 100% | |
| Grok 4.1 Fast | 100% | $0.0009 | 12.9s | 100% | |
| DeepSeek-V2 Chat | 100% | $0.0007 | 15.2s | 100% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0008 | 15.1s | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | $0.0036 | 5.6s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Forbidden words eliminated | ||
| 100.0% | Non-name text preserved | ||
| 100.0% | Structural similarity to original |
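The "Forbidden words eliminated" evaluator reduces to a whole-word scan for the banned dialogue tags. A small sketch of that check (an assumption about the scorer, not its actual code):

```python
import re

FORBIDDEN = ("said", "asked", "replied", "answered")

def forbidden_hits(text):
    """Whole-word, case-insensitive matches of the banned dialogue tags."""
    pattern = r"\b(?:" + "|".join(FORBIDDEN) + r")\b"
    return re.findall(pattern, text, flags=re.IGNORECASE)

print(forbidden_hits('"Go," she said.'))       # ['said']
print(forbidden_hits('"Go," she whispered.'))  # []
```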
Specific Prompt
Character rename: Elena → Mirabel, Gregor → Aldric
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| Aion 2.0 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 3B | 100% | $0.0000 | 2.0s | |
| Mistral NeMO | 100% | $0.0002 | 1.6s | |
| Ministral 3 3B | 100% | $0.0001 | 2.3s | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | |
| Ministral 8B | 100% | $0.0001 | 3.4s | |
| Ministral 3 8B | 100% | $0.0002 | 3.2s | |
| Mistral Small Creative | 100% | $0.0002 | 3.0s | |
| Gemma 3 4B | 100% | $0.0001 | 5.2s | |
| Llama 3.1 8B | 100% | $0.0000 | 9.0s | |
| GPT-4.1 Nano | 100% | $0.0003 | 3.6s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.6s | |
| Ministral 3 14B | 95% | $0.0002 | 4.3s | |
| Llama 3.1 70B | 100% | $0.0004 | 11.0s | |
| Gemma 3 12B | 100% | $0.0001 | 8.4s | |
| Grok 4 Fast | 100% | $0.0005 | 4.3s | |
| Rocinante 12B | 99% | $0.0004 | 7.3s | |
| Claude 3 Haiku | 90% | $0.0008 | 4.3s | |
| Qwen 2.5 72B | 100% | $0.0003 | 10.5s | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 9.8s | |
| Grok 4.1 Fast | 100% | $0.0007 | 8.3s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Mistral NeMO | 100% | $0.0002 | 1.6s | 100% | |
| Ministral 3B | 100% | $0.0000 | 2.0s | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | 100% | |
| Ministral 3 3B | 100% | $0.0001 | 2.3s | 100% | |
| Mistral Small Creative | 100% | $0.0002 | 3.0s | 100% | |
| Ministral 3 8B | 100% | $0.0002 | 3.2s | 100% | |
| Ministral 8B | 100% | $0.0001 | 3.4s | 100% | |
| GPT-4.1 Nano | 100% | $0.0003 | 3.6s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.6s | 100% | |
| Gemma 3 4B | 100% | $0.0001 | 5.2s | 100% | |
| Grok 4 Fast | 100% | $0.0005 | 4.3s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.1s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 8.4s | 100% | |
| GPT-4.1 Mini | 100% | $0.0010 | 5.7s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0009 | 6.3s | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 9.0s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 4.2s | 100% | |
| Grok 4.1 Fast | 100% | $0.0007 | 8.3s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 9.5s | 100% | |
| Mistral Large 3 | 100% | $0.0010 | 7.9s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Name replacement accuracy | ||
| 100.0% | No remaining old names | ||
| 100.0% | Non-name text preserved |
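A deterministic rename like Elena → Mirabel can be performed, and checked, with a word-boundary substitution so that embedded strings are left alone. A minimal sketch (the mapping comes from the prompt above; the helper name is mine):

```python
import re

RENAMES = {"Elena": "Mirabel", "Gregor": "Aldric"}

def rename(text, mapping=RENAMES):
    """Whole-word replacement, so e.g. 'Elenas' is untouched."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)

print(rename("Elena nodded at Gregor; Elena's hands shook."))
# → "Mirabel nodded at Aldric; Mirabel's hands shook."
```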
Location rename: market square, outer ring, bridge, northern mines
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| Aion 2.0 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 3B | 100% | $0.0000 | 1.9s | |
| Ministral 3 3B | 100% | $0.0001 | 1.9s | |
| Ministral 8B | 100% | $0.0001 | 3.2s | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.5s | |
| Ministral 3 8B | 100% | $0.0002 | 3.1s | |
| Mistral Small Creative | 100% | $0.0002 | 3.0s | |
| Ministral 3 14B | 100% | $0.0002 | 3.8s | |
| Gemma 3 4B | 100% | $0.0001 | 5.8s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.5s | |
| Llama 3.1 8B | 100% | $0.0001 | 9.7s | |
| Gemma 3 12B | 100% | $0.0001 | 8.5s | |
| Grok 4 Fast | 100% | $0.0006 | 3.6s | |
| Grok 4.1 Fast | 100% | $0.0005 | 5.3s | |
| Rocinante 12B | 99% | $0.0003 | 8.4s | |
| Claude 3 Haiku | 100% | $0.0009 | 4.0s | |
| Qwen 2.5 72B | 100% | $0.0003 | 10.4s | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 9.5s | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 10.1s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 9.3s | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0010 | 23.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Ministral 3B | 100% | $0.0000 | 1.9s | 100% | |
| Ministral 3 3B | 100% | $0.0001 | 1.9s | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.5s | 100% | |
| Ministral 8B | 100% | $0.0001 | 3.2s | 100% | |
| Ministral 3 8B | 100% | $0.0002 | 3.1s | 100% | |
| Mistral Small Creative | 100% | $0.0002 | 3.0s | 100% | |
| Ministral 3 14B | 100% | $0.0002 | 3.8s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.5s | 100% | |
| Gemma 3 4B | 100% | $0.0001 | 5.8s | 100% | |
| Grok 4 Fast | 100% | $0.0006 | 3.6s | 100% | |
| Grok 4.1 Fast | 100% | $0.0005 | 5.3s | 100% | |
| Claude 3 Haiku | 100% | $0.0009 | 4.0s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.0s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 8.5s | 100% | |
| Llama 3.1 8B | 100% | $0.0001 | 9.7s | 100% | |
| GPT-4.1 Mini | 100% | $0.0011 | 6.0s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 9.5s | 100% | |
| Qwen 2.5 72B | 100% | $0.0003 | 10.4s | 100% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0005 | 9.3s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.2s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Name replacement accuracy | ||
| 100.0% | No remaining old names | ||
| 100.0% | Non-name text preserved |
Expand all contractions
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Aion 2.0 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
| Z.AI GLM 4.7 | 100% | |
| GPT-4.1 | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 8B | 100% | $0.0001 | 2.4s | |
| Gemini 2.5 Flash Lite | 100% | $0.0002 | 1.3s | |
| Ministral 3 8B | 100% | $0.0001 | 3.0s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 3.5s | |
| Mistral NeMO | 98% | $0.0001 | 1.8s | |
| Gemma 3 12B | 100% | $0.0001 | 5.9s | |
| Arcee AI: Trinity Mini | 99% | $0.0002 | 4.6s | |
| Llama 3.1 8B | 99% | $0.0000 | 7.8s | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 7.1s | |
| GPT-4.1 Nano | 99% | $0.0002 | 3.1s | |
| Qwen 2.5 72B | 100% | $0.0002 | 8.5s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 7.6s | |
| Claude 3 Haiku | 100% | $0.0007 | 3.7s | |
| Mistral Small Creative | 98% | $0.0002 | 2.4s | |
| Ministral 3 14B | 99% | $0.0002 | 3.3s | |
| GPT-4.1 Mini | 100% | $0.0008 | 4.7s | |
| Gemini 2.5 Flash | 100% | $0.0012 | 1.7s | |
| Mistral Medium 3.1 | 100% | $0.0010 | 4.3s | |
| Llama 3.1 70B | 100% | $0.0005 | 7.8s | |
| Mistral Large 3 | 100% | $0.0009 | 5.9s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Ministral 8B | 100% | $0.0001 | 2.4s | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0002 | 1.3s | 100% | |
| Ministral 3 8B | 100% | $0.0001 | 3.0s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 3.5s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 5.9s | 99% | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 7.1s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 7.6s | 100% | |
| Qwen 2.5 72B | 100% | $0.0002 | 8.5s | 100% | |
| GPT-4.1 Mini | 100% | $0.0008 | 4.7s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0012 | 1.7s | 100% | |
| Mistral Medium 3.1 | 100% | $0.0010 | 4.3s | 100% | |
| Claude 3 Haiku | 100% | $0.0007 | 3.7s | 99% | |
| Mistral Large 3 | 100% | $0.0009 | 5.9s | 100% | |
| Arcee AI: Trinity Mini | 99% | $0.0002 | 4.6s | 99% | |
| Llama 3.1 70B | 100% | $0.0005 | 7.8s | 99% | |
| GPT-4.1 Nano | 99% | $0.0002 | 3.1s | 98% | |
| Gemma 3 27B | 100% | $0.0001 | 14.6s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0012 | 5.3s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.8s | 100% | |
| Llama 3.1 8B | 99% | $0.0000 | 7.8s | 99% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Name replacement accuracy | ||
| 100.0% | Non-name text preserved | ||
| 100.0% | Possessive traps preserved |
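The "Possessive traps preserved" evaluator penalizes expanding a possessive 's as if it were "is". A whitelist-based expander sidesteps the trap, since a bare 's is ambiguous; a minimal sketch with a deliberately tiny mapping:

```python
import re

# Only known contractions are expanded; possessives like
# "the captain's" are not in the whitelist and pass through.
CONTRACTIONS = {
    "don't": "do not", "isn't": "is not",
    "it's": "it is", "can't": "cannot",
}

def expand(text):
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")
    return pattern.sub(lambda m: CONTRACTIONS[m.group(1)], text)

print(expand("The captain's log isn't done, and it's late."))
# → "The captain's log is not done, and it is late."
```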
Tense rewriting: past to present
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 3B | 99% | $0.0000 | 1.9s | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | |
| Mistral Small Creative | 100% | $0.0002 | 2.9s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.6s | |
| Llama 3.1 8B | 100% | $0.0001 | 10.6s | |
| Ministral 3 14B | 99% | $0.0002 | 3.9s | |
| Claude 3 Haiku | 100% | $0.0009 | 4.5s | |
| Mistral NeMO | 99% | $0.0002 | 2.7s | |
| Gemma 3 12B | 100% | $0.0001 | 8.4s | |
| Gemini 2.5 Flash | 100% | $0.0015 | 2.1s | |
| Ministral 3 8B | 99% | $0.0002 | 3.5s | |
| Ministral 8B | 98% | $0.0001 | 3.2s | |
| Mistral Medium 3.1 | 100% | $0.0013 | 5.3s | |
| Qwen 2.5 72B | 100% | $0.0003 | 10.7s | |
| GPT-4.1 Mini | 100% | $0.0011 | 6.5s | |
| Mistral Large 3 | 100% | $0.0011 | 7.8s | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 6.4s | |
| Grok 4 Fast | 100% | $0.0010 | 8.9s | |
| Gemma 3 27B | 100% | $0.0002 | 13.2s | |
| Gemini 3 Flash (Preview) | 99% | $0.0018 | 3.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency). Click a model name to view its detail page.
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| Gemini 2.5 Pro | 100% | 100% | 100% | |
| Claude Sonnet 4.5 | 100% | 100% | 100% | |
| Claude Opus 4 | 100% | 100% | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability). Click a model name to view its detail page.
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | 100% | |
| Mistral Small Creative | 100% | $0.0002 | 2.9s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.6s | 100% | |
| Claude 3 Haiku | 100% | $0.0009 | 4.5s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0015 | 2.1s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 8.4s | 100% | |
| Ministral 3 14B | 99% | $0.0002 | 3.9s | 99% | |
| Mistral Medium 3.1 | 100% | $0.0013 | 5.3s | 100% | |
| Llama 3.1 8B | 100% | $0.0001 | 10.6s | 100% | |
| GPT-4.1 Mini | 100% | $0.0011 | 6.5s | 100% | |
| Qwen 2.5 72B | 100% | $0.0003 | 10.7s | 100% | |
| Mistral Large 3 | 100% | $0.0011 | 7.8s | 100% | |
| Mistral NeMO | 99% | $0.0002 | 2.7s | 99% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 6.4s | 100% | |
| Ministral 3B | 99% | $0.0000 | 1.9s | 97% | |
| Ministral 3 8B | 99% | $0.0002 | 3.5s | 98% | |
| Grok 4 Fast | 100% | $0.0010 | 8.9s | 99% | |
| Gemma 3 27B | 100% | $0.0002 | 13.2s | 100% | |
| Claude Haiku 4.5 | 100% | $0.0034 | 2.8s | 100% | |
| DeepSeek V3 (2024-12-26) | 100% | $0.0007 | 14.1s | 100% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Dialogue content preserved | ||
| 100.0% | Name replacement accuracy | ||
| 100.0% | Non-name text preserved |
POV shift: 3rd person to 1st person (Elena's perspective)
Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| Aion 2.0 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
Price-Performance Score Distribution (Top 20)
Click a model name to view its detail page.
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 3B | 99% | $0.0000 | 2.2s | |
| Mistral NeMO | 100% | $0.0002 | 2.0s | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | |
| Ministral 3 3B | 100% | $0.0001 | 2.4s | |
| Ministral 8B | 100% | $0.0001 | 3.5s | |
| Mistral Small Creative | 100% | $0.0002 | 3.2s | |
| Ministral 3 8B | 100% | $0.0002 | 3.0s | |
| Ministral 3 14B | 100% | $0.0002 | 3.8s | |
| GPT-4.1 Nano | 100% | $0.0003 | 3.5s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.8s | |
| Gemma 3 4B | 100% | $0.0001 | 6.4s | |
| Gemini 2.5 Flash | 100% | $0.0015 | 2.1s | |
| Mistral Medium 3.1 | 100% | $0.0013 | 6.2s | |
| Llama 3.1 8B | 100% | $0.0000 | 8.6s | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.1s | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 8.0s | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 8.1s | |
| GPT-4.1 Mini | 100% | $0.0011 | 6.6s | |
| Grok 4 Fast | 100% | $0.0009 | 6.8s | |
| Mistral Large 3 | 100% | $0.0011 | 7.8s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
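The stability column above is the product of the median score and a consistency term. A minimal sketch of that ranking metric, assuming per-run scores in [0, 1]; the benchmark's exact consistency definition is not stated here, so this sketch hypothetically treats consistency as one minus the spread (max − min) of run scores:

```python
from statistics import median

def stability(run_scores: list[float]) -> float:
    """Stability = median score x consistency.

    `run_scores` holds one score in [0, 1] per benchmark run.
    Assumption: consistency is modeled as 1 minus the score range;
    the leaderboard may compute it differently.
    """
    med = median(run_scores)
    consistency = 1.0 - (max(run_scores) - min(run_scores))
    return med * consistency

# A model that scores 100% on every run keeps full stability,
# while a single 90% run drags stability down to 90%.
print(stability([1.0, 1.0, 1.0]))   # perfectly consistent
print(stability([0.9, 1.0, 1.0]))   # one weaker run
```

Under this assumption, a model with identical runs ranks strictly above one with the same median but any variance, which matches how the table orders models with equal scores.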
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.6s | 100% | |
| Mistral NeMO | 100% | $0.0002 | 2.0s | 100% | |
| Ministral 3 8B | 100% | $0.0002 | 3.0s | 100% | |
| Mistral Small Creative | 100% | $0.0002 | 3.2s | 100% | |
| Ministral 8B | 100% | $0.0001 | 3.5s | 100% | |
| Ministral 3 14B | 100% | $0.0002 | 3.8s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 4.8s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0015 | 2.1s | 100% | |
| Ministral 3 3B | 100% | $0.0001 | 2.4s | 99% | |
| Gemma 3 4B | 100% | $0.0001 | 6.4s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.1s | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 8.6s | 100% | |
| GPT-4.1 Nano | 100% | $0.0003 | 3.5s | 98% | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 8.0s | 100% | |
| Grok 4 Fast | 100% | $0.0009 | 6.8s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 8.1s | 100% | |
| GPT-4.1 Mini | 100% | $0.0011 | 6.6s | 100% | |
| Mistral Medium 3.1 | 100% | $0.0013 | 6.2s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 10.1s | 100% | |
| Mistral Large 3 | 100% | $0.0011 | 7.8s | 100% | |
| Median | Evaluator |
|---|---|
| 100.0% | Name replacement accuracy |
| 100.0% | No remaining old names |
| 100.0% | Non-name text preserved |
Multi-character gender swap: Priya (F) → Rohan (M), Mara unchanged
Performance Score Distribution (Top 20)
| Model | Score |
|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| Claude Opus 4.5 | 100% | |
| Aion 2.0 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
| Z.AI GLM 4.7 | 100% | |
| GPT-4.1 | 100% | |
| Gemini 2.5 Pro | 100% | |
| Grok 4 | 100% | |
| Claude Sonnet 4.5 | 100% | |
| Claude Opus 4 | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Ministral 3B | 99% | $0.0001 | 2.5s | |
| Mistral NeMO | 100% | $0.0002 | 3.6s | |
| Ministral 3 3B | 100% | $0.0001 | 2.2s | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.9s | |
| Ministral 3 8B | 100% | $0.0002 | 3.9s | |
| Mistral Small Creative | 100% | $0.0003 | 3.2s | |
| Ministral 8B | 100% | $0.0001 | 3.8s | |
| GPT-4.1 Nano | 95% | $0.0003 | 3.9s | |
| Ministral 3 14B | 100% | $0.0003 | 4.6s | |
| Mistral Small 3.2 24B | 86% | $0.0003 | 4.9s | |
| Gemma 3 4B | 100% | $0.0001 | 6.6s | |
| Gemini 2.5 Flash | 100% | $0.0017 | 2.4s | |
| Claude 3 Haiku | 100% | $0.0011 | 5.2s | |
| Llama 3.1 8B | 100% | $0.0000 | 13.0s | |
| Grok 4 Fast | 100% | $0.0010 | 7.0s | |
| Qwen 2.5 72B | 100% | $0.0003 | 11.4s | |
| Gemini 3 Flash (Preview) | 100% | $0.0022 | 3.6s | |
| GPT-4.1 Mini | 100% | $0.0012 | 7.8s | |
| Mistral Medium 3.1 | 99% | $0.0015 | 6.1s | |
| Mistral Large 3 | 99% | $0.0013 | 8.3s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| Gemini 2.5 Pro | 100% | 100% | 100% | |
| Grok 4 | 100% | 100% | 100% | |
| Claude Sonnet 4.5 | 100% | 100% | 100% | |
| Claude Opus 4 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Ministral 3 3B | 100% | $0.0001 | 2.2s | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.9s | 100% | |
| Mistral Small Creative | 100% | $0.0003 | 3.2s | 100% | |
| Mistral NeMO | 100% | $0.0002 | 3.6s | 100% | |
| Ministral 8B | 100% | $0.0001 | 3.8s | 100% | |
| Ministral 3 8B | 100% | $0.0002 | 3.9s | 100% | |
| Ministral 3 14B | 100% | $0.0003 | 4.6s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0017 | 2.4s | 100% | |
| Gemma 3 4B | 100% | $0.0001 | 6.6s | 100% | |
| Claude 3 Haiku | 100% | $0.0011 | 5.2s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0022 | 3.6s | 100% | |
| Grok 4 Fast | 100% | $0.0010 | 7.0s | 100% | |
| Ministral 3B | 99% | $0.0001 | 2.5s | 95% | |
| GPT-4.1 Mini | 100% | $0.0012 | 7.8s | 99% | |
| Claude Haiku 4.5 | 100% | $0.0043 | 3.4s | 100% | |
| Qwen 2.5 72B | 100% | $0.0003 | 11.4s | 100% | |
| Mistral Large 3 | 99% | $0.0013 | 8.3s | 99% | |
| Mistral Medium 3.1 | 99% | $0.0015 | 6.1s | 97% | |
| GPT-4.1 | 100% | $0.0062 | 4.9s | 100% | |
| Llama 3.1 8B | 100% | $0.0000 | 13.0s | 98% | |
| Median | Evaluator |
|---|---|
| 100.0% | Dialogue content preserved |
| 100.0% | Mara pronouns preserved (coreference test) |
| 100.0% | Name replacement accuracy |
| 100.0% | Non-name text preserved |
Combined: 3rd person past → 1st person present
Performance Score Distribution (Top 20)
| Model | Score |
|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| Claude Opus 4.6 | 100% | |
| Claude Opus 4.5 | 100% | |
| GPT-4.1 | 100% | |
| Claude Opus 4 | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | |
| Gemini 2.5 Flash | 100% | |
| GPT-5 | 100% | |
| Aion 2.0 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Claude Sonnet 4 | 100% | |
| Z.AI GLM 4.5 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| Gemini 2.5 Pro | 100% | |
| Grok 4 | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Z.AI GLM 4.7 | 100% | |
| GPT-4o, May 13th (temp=0) | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Gemini 2.5 Flash Lite | 99% | $0.0003 | 1.6s | |
| Gemini 2.5 Flash | 100% | $0.0015 | 2.0s | |
| Mistral NeMO | 98% | $0.0002 | 2.6s | |
| Ministral 8B | 97% | $0.0001 | 2.9s | |
| Mistral Small Creative | 99% | $0.0002 | 2.8s | |
| Ministral 3 8B | 99% | $0.0002 | 3.1s | |
| Ministral 3 14B | 99% | $0.0003 | 3.8s | |
| Mistral Small 3.2 24B | 99% | $0.0002 | 4.3s | |
| Mistral Medium 3.1 | 99% | $0.0013 | 7.1s | |
| Llama 3.1 8B | 98% | $0.0001 | 9.4s | |
| GPT-4.1 Mini | 99% | $0.0011 | 6.4s | |
| Mistral Large 3 | 99% | $0.0011 | 7.8s | |
| Rocinante 12B | 96% | $0.0004 | 6.3s | |
| Qwen 2.5 72B | 99% | $0.0003 | 10.1s | |
| Grok 4 Fast | 99% | $0.0013 | 10.2s | |
| Z.AI GLM 4.5 | 100% | $0.0030 | 25.1s | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0015 | 6.6s | |
| DeepSeek-V2 Chat | 99% | $0.0008 | 14.6s | |
| Gemma 3 12B | 98% | $0.0001 | 10.0s | |
| Gemini 3 Flash (Preview) | 98% | $0.0019 | 3.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
| Claude Opus 4 | 100% | 100% | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | 100% | 100% | |
| Gemini 2.5 Flash | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.5 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 99% | 99% | |
| Gemini 2.5 Pro | 100% | 99% | 99% | |
| Z.AI GLM 4.7 | 100% | 99% | 99% | |
| GPT-4o, May 13th (temp=0) | 100% | 99% | 99% | |
| GPT-5.2 | 99% | 100% | 99% | |
| Gemini 3 Pro (Preview) | 99% | 100% | 99% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Gemini 2.5 Flash | 100% | $0.0015 | 2.0s | 100% | |
| Gemini 2.5 Flash Lite | 99% | $0.0003 | 1.6s | 99% | |
| Mistral Small Creative | 99% | $0.0002 | 2.8s | 99% | |
| Ministral 3 8B | 99% | $0.0002 | 3.1s | 99% | |
| Ministral 3 14B | 99% | $0.0003 | 3.8s | 99% | |
| Mistral Small 3.2 24B | 99% | $0.0002 | 4.3s | 99% | |
| Mistral Large 3 | 99% | $0.0011 | 7.8s | 99% | |
| Mistral Medium 3.1 | 99% | $0.0013 | 7.1s | 99% | |
| Qwen 2.5 72B | 99% | $0.0003 | 10.1s | 99% | |
| GPT-4.1 | 100% | $0.0055 | 3.8s | 100% | |
| DeepSeek-V2 Chat | 99% | $0.0008 | 14.6s | 99% | |
| Claude Haiku 4.5 | 99% | $0.0036 | 3.1s | 98% | |
| Gemini 3 Flash (Preview) | 98% | $0.0019 | 3.1s | 98% | |
| Qwen 3.5 Plus (2026-02-15) | 99% | $0.0015 | 6.6s | 98% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | $0.0069 | 2.5s | 100% | |
| Gemma 3 12B | 98% | $0.0001 | 10.0s | 98% | |
| Mistral Large 2 | 99% | $0.0046 | 7.1s | 99% | |
| Mistral Large | 99% | $0.0046 | 7.6s | 99% | |
| Grok 4 Fast | 99% | $0.0013 | 10.2s | 97% | |
| Grok 4.1 Fast | 100% | $0.0015 | 16.8s | 99% | |
| Median | Evaluator |
|---|---|
| 100.0% | Dialogue content preserved |
| 98.2% | Name replacement accuracy |
| 100.0% | Non-name text preserved |
Passive voice → active voice
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Gemini 3 Flash (Preview) | 99% | $0.0027 | 4.4s | |
| Grok 4 Fast | 98% | $0.0015 | 12.2s | |
| Gemini 2.5 Flash | 97% | $0.0021 | 3.1s | |
| Qwen 3.5 Plus (2026-02-15) | 97% | $0.0022 | 9.5s | |
| Grok 4.1 Fast | 98% | $0.0020 | 28.2s | |
| Mistral Large 3 | 96% | $0.0016 | 10.4s | |
| Claude Haiku 4.5 | 96% | $0.0052 | 6.2s | |
| Gemini 2.5 Flash Lite (Reasoning) | 95% | $0.0031 | 25.5s | |
| Claude Sonnet 4.5 | 99% | $0.016 | 7.0s | |
| DeepSeek V3.2 | 99% | $0.0006 | 54.4s | |
| Gemini 2.5 Flash Lite | 94% | $0.0004 | 2.4s | |
| Gemma 3 12B | 94% | $0.0001 | 12.1s | |
| DeepSeek V3 (2024-12-26) | 95% | $0.0012 | 22.5s | |
| Mistral Medium 3.1 | 94% | $0.0019 | 6.3s | |
| Hermes 3 405B | 94% | $0.0018 | 27.1s | |
| DeepSeek V3.1 | 90% | $0.0009 | 33.9s | |
| Claude Sonnet 4.6 | 98% | $0.016 | 7.4s | |
| Mistral Large 2 | 96% | $0.0064 | 10.4s | |
| Mistral Large | 95% | $0.0064 | 10.4s | |
| Writer: Palmyra X5 | 95% | $0.0052 | 9.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 99% | 100% | 99% | |
| Claude Opus 4.6 | 98% | 100% | 98% | |
| Grok 4 | 99% | 99% | 98% | |
| Claude Sonnet 4.6 | 98% | 99% | 98% | |
| Gemini 2.5 Pro | 99% | 98% | 97% | |
| Claude Sonnet 4.5 | 99% | 98% | 97% | |
| DeepSeek V3.2 | 99% | 98% | 97% | |
| Gemini 3 Flash (Preview) | 99% | 99% | 97% | |
| Claude Opus 4.6 (Reasoning) | 99% | 98% | 97% | |
| Claude Sonnet 4.6 (Reasoning) | 99% | 98% | 97% | |
| Z.AI GLM 5 | 98% | 98% | 97% | |
| ByteDance Seed 1.6 | 97% | 100% | 97% | |
| GPT-5 | 98% | 97% | 97% | |
| Grok 4 Fast | 98% | 98% | 97% | |
| Grok 4.1 Fast | 98% | 98% | 97% | |
| o4 Mini | 98% | 98% | 96% | |
| GPT-5.1 | 98% | 98% | 96% | |
| Claude Opus 4.5 | 97% | 99% | 96% | |
| Claude Sonnet 4 | 98% | 98% | 96% | |
| Gemini 3 Flash (Preview, Reasoning) | 97% | 99% | 95% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Gemini 3 Flash (Preview) | 99% | $0.0027 | 4.4s | 97% | |
| Grok 4 Fast | 98% | $0.0015 | 12.2s | 97% | |
| Gemini 2.5 Flash | 97% | $0.0021 | 3.1s | 95% | |
| Qwen 3.5 Plus (2026-02-15) | 97% | $0.0022 | 9.5s | 95% | |
| Grok 4.1 Fast | 98% | $0.0020 | 28.2s | 97% | |
| Claude Sonnet 4.5 | 99% | $0.016 | 7.0s | 97% | |
| Mistral Large 3 | 96% | $0.0016 | 10.4s | 95% | |
| Claude Sonnet 4.6 | 98% | $0.016 | 7.4s | 98% | |
| DeepSeek V3.2 | 99% | $0.0006 | 54.4s | 97% | |
| Mistral Large 2 | 96% | $0.0064 | 10.4s | 95% | |
| Claude Sonnet 4 | 98% | $0.016 | 9.2s | 96% | |
| Mistral Large | 95% | $0.0064 | 10.4s | 94% | |
| Claude Haiku 4.5 | 96% | $0.0052 | 6.2s | 92% | |
| Gemini 2.5 Flash Lite | 94% | $0.0004 | 2.4s | 92% | |
| Mistral Medium 3.1 | 94% | $0.0019 | 6.3s | 93% | |
| Claude Opus 4.6 | 98% | $0.026 | 8.7s | 98% | |
| DeepSeek V3 (2024-12-26) | 95% | $0.0012 | 22.5s | 92% | |
| Gemma 3 12B | 94% | $0.0001 | 12.1s | 92% | |
| Writer: Palmyra X5 | 95% | $0.0052 | 9.7s | 93% | |
| Claude Opus 4.5 | 97% | $0.026 | 8.0s | 96% | |
| Median | Evaluator |
|---|---|
| 95.7% | Dialogue content preserved |
| 100.0% | No hallucinated or fabricated content |
| 94.6% | Non-passive narration preserved |
| 86.5% | Passive → active voice transformations |
| 100.0% | Structural similarity to original |
Avoid said/asked/replied/answered
Performance Score Distribution (Top 20)
| Model | Score |
|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5 Mini | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Grok 4.1 Fast | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.5s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 6.2s | |
| Gemma 3 4B | 98% | $0.0001 | 5.5s | |
| Gemma 3 12B | 98% | $0.0001 | 7.4s | |
| Grok 4 Fast | 100% | $0.0008 | 6.0s | |
| Claude 3 Haiku | 85% | $0.0008 | 4.3s | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 9.5s | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.0s | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 9.7s | |
| Qwen 2.5 72B | 100% | $0.0003 | 10.4s | |
| Mistral Medium 3.1 | 100% | $0.0012 | 5.1s | |
| GPT-4.1 Mini | 100% | $0.0010 | 6.3s | |
| Mistral Large 3 | 100% | $0.0011 | 7.2s | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.2s | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 6.9s | |
| Gemma 3 27B | 100% | $0.0002 | 17.0s | |
| Hermes 3 70B | 91% | $0.0010 | 78s | |
| ByteDance Seed 1.6 Flash | 96% | $0.0007 | 12.6s | |
| DeepSeek-V2 Chat | 100% | $0.0008 | 16.8s | |
| Ministral 8B | 94% | $0.0001 | 2.9s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5 Mini | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Grok 4.1 Fast | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0003 | 1.5s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0014 | 2.0s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 6.2s | 100% | |
| Grok 4 Fast | 100% | $0.0008 | 6.0s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0018 | 3.2s | 100% | |
| Mistral Medium 3.1 | 100% | $0.0012 | 5.1s | 100% | |
| GPT-4.1 Mini | 100% | $0.0010 | 6.3s | 100% | |
| Mistral Large 3 | 100% | $0.0011 | 7.2s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0004 | 9.5s | 100% | |
| GPT-4o Mini (temp=0) | 100% | $0.0004 | 9.7s | 100% | |
| Qwen 2.5 72B | 100% | $0.0003 | 10.4s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0015 | 6.9s | 100% | |
| Claude Haiku 4.5 | 100% | $0.0034 | 2.7s | 100% | |
| DeepSeek-V2 Chat | 100% | $0.0008 | 16.8s | 100% | |
| GPT-4.1 | 100% | $0.0052 | 4.0s | 100% | |
| Mistral Large 2 | 100% | $0.0042 | 7.2s | 100% | |
| Mistral Large | 100% | $0.0042 | 7.3s | 100% | |
| Grok 4.1 Fast | 100% | $0.0013 | 16.7s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 100% | $0.0018 | 16.3s | 100% | |
| GPT-4o, Aug. 6th (temp=0) | 100% | $0.0065 | 2.4s | 100% | |
| Median | Evaluator |
|---|---|
| 100.0% | Forbidden words eliminated |
| 100.0% | Non-name text preserved |
| 100.0% | Structural similarity to original |