Text Replacement
Tests deterministic text transformations: renaming characters/locations, expanding contractions, tense rewriting, POV shifts, gender swaps, combined transformations, and word avoidance. Scored by checking each expected change independently.
Expand all contractions
Text Editing
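The scoring rule described above, where each expected change is checked independently, can be sketched roughly as follows. This is a minimal illustration and not the benchmark's actual scorer: the `ExpectedChange` type, the `score_changes` function, and the sample rename data are all hypothetical, and plain substring matching is assumed for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedChange:
    """One expected edit: `old` should be gone and `new` should be present.

    Hypothetical structure; the benchmark's real data format is not published here.
    """
    old: str
    new: str


def score_changes(output: str, changes: list[ExpectedChange]) -> float:
    """Return the fraction of expected changes found in the output.

    Each change is checked on its own, so one missed edit lowers the
    score without zeroing it.
    """
    if not changes:
        return 1.0
    passed = sum(
        1 for c in changes if c.new in output and c.old not in output
    )
    return passed / len(changes)


# Hypothetical rename task: Alice -> Clara, London -> Lisbon.
changes = [
    ExpectedChange(old="Alice", new="Clara"),
    ExpectedChange(old="London", new="Lisbon"),
]
output = "Clara left London at dawn."
print(score_changes(output, changes))  # 0.5
```

Under this assumption a response that applies one of two renames earns half credit, which would be consistent with the 98% and 99% scores in the tables below being partial rather than all-or-nothing results.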
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Z.AI GLM 5 | 100% | |
| Claude Sonnet 4.6 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| ByteDance Seed 1.6 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| o4 Mini High | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Aion 2.0 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
| Z.AI GLM 4.7 | 100% | |
| GPT-4.1 | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Ministral 8B | 100% | $0.0001 | 2.4s | |
| Gemini 2.5 Flash Lite | 100% | $0.0002 | 1.3s | |
| Ministral 3 8B | 100% | $0.0001 | 3.0s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 3.5s | |
| Mistral NeMO | 98% | $0.0001 | 1.8s | |
| Gemma 3 12B | 100% | $0.0001 | 5.9s | |
| Arcee AI: Trinity Mini | 99% | $0.0002 | 4.6s | |
| Llama 3.1 8B | 99% | $0.0000 | 7.8s | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 7.1s | |
| GPT-4.1 Nano | 99% | $0.0002 | 3.1s | |
| Qwen 2.5 72B | 100% | $0.0002 | 8.5s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 7.6s | |
| Claude 3 Haiku | 100% | $0.0007 | 3.7s | |
| Mistral Small Creative | 98% | $0.0002 | 2.4s | |
| Ministral 3 14B | 99% | $0.0002 | 3.3s | |
| GPT-4.1 Mini | 100% | $0.0008 | 4.7s | |
| Gemini 2.5 Flash | 100% | $0.0012 | 1.7s | |
| Mistral Medium 3.1 | 100% | $0.0010 | 4.3s | |
| Llama 3.1 70B | 100% | $0.0005 | 7.8s | |
| Mistral Large 3 | 100% | $0.0009 | 5.9s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4.6 (Reasoning) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| Claude Sonnet 4.6 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| ByteDance Seed 1.6 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| o4 Mini High | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Aion 2.0 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| GPT-4.1 | 100% | 100% | 100% | |
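As a worked example of the ranking formula stated above (stability = median × consistency), here is a small sketch. The page does not say how consistency is measured, so it is taken as a given input; the `stability` function name and the sample run scores are hypothetical.

```python
from statistics import median


def stability(median_score: float, consistency: float) -> float:
    """Stability as the product of median score and consistency.

    Both inputs are fractions in [0, 1]. How consistency itself is
    computed is not stated on the page, so it is passed in as-is.
    """
    for name, value in (("median_score", median_score),
                        ("consistency", consistency)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0, 1], got {value}")
    return median_score * consistency


# Every model in the table above: 100% median, 100% consistency.
print(f"{stability(1.0, 1.0):.0%}")  # 100%

# Hypothetical per-run scores for one model, with an assumed 99% consistency.
run_scores = [1.0, 1.0, 0.98]
print(f"{stability(median(run_scores), 0.99):.0%}")  # 99%
```

Because the two factors are multiplied, a model with a perfect median can still rank lower if its runs vary.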
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Ministral 8B | 100% | $0.0001 | 2.4s | 100% | |
| Gemini 2.5 Flash Lite | 100% | $0.0002 | 1.3s | 100% | |
| Ministral 3 8B | 100% | $0.0001 | 3.0s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 3.5s | 100% | |
| Gemma 3 12B | 100% | $0.0001 | 5.9s | 99% | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 7.1s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 7.6s | 100% | |
| Qwen 2.5 72B | 100% | $0.0002 | 8.5s | 100% | |
| GPT-4.1 Mini | 100% | $0.0008 | 4.7s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0012 | 1.7s | 100% | |
| Mistral Medium 3.1 | 100% | $0.0010 | 4.3s | 100% | |
| Claude 3 Haiku | 100% | $0.0007 | 3.7s | 99% | |
| Mistral Large 3 | 100% | $0.0009 | 5.9s | 100% | |
| Arcee AI: Trinity Mini | 99% | $0.0002 | 4.6s | 99% | |
| Llama 3.1 70B | 100% | $0.0005 | 7.8s | 99% | |
| GPT-4.1 Nano | 99% | $0.0002 | 3.1s | 98% | |
| Gemma 3 27B | 100% | $0.0001 | 14.6s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0012 | 5.3s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.8s | 100% | |
| Llama 3.1 8B | 99% | $0.0000 | 7.8s | 99% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Contraction expansion accuracy | ||
| 100.0% | Non-contraction text preserved | ||
| 100.0% | Possessive traps preserved | | |
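The three evaluators listed above could be approximated along these lines. This is a sketch under stated assumptions, not the benchmark's implementation: the `evaluate` and `has_word` helpers, the dictionary keys, and the sample sentence are all hypothetical, and each evaluator is assumed to be a simple fraction of passed checks.

```python
import re


def fraction_passed(checks: list[bool]) -> float:
    """Share of checks that passed; an empty list counts as a full pass."""
    return sum(checks) / len(checks) if checks else 1.0


def has_word(text: str, phrase: str) -> bool:
    """Case-insensitive search for a phrase not embedded in a longer word."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def evaluate(
    output: str,
    expansions: dict[str, str],   # contraction -> expected expansion
    preserved: list[str],         # phrases that must survive verbatim
    possessives: list[str],       # possessive forms that must NOT be expanded
) -> dict[str, float]:
    """Score one response on three independent (hypothetical) evaluators."""
    return {
        "contraction_expansion_accuracy": fraction_passed(
            [has_word(output, full) and not has_word(output, short)
             for short, full in expansions.items()]
        ),
        "non_contraction_text_preserved": fraction_passed(
            [phrase in output for phrase in preserved]
        ),
        "possessive_traps_preserved": fraction_passed(
            [has_word(output, p) for p in possessives]
        ),
    }


# Source sentence (hypothetical): "She's sure Sarah's dog didn't bark."
expansions = {"She's": "She is", "didn't": "did not"}
preserved = ["sure", "dog", "bark."]
possessives = ["Sarah's"]

good = evaluate("She is sure Sarah's dog did not bark.",
                expansions, preserved, possessives)
bad = evaluate("She is sure Sarah is dog didn't bark.",
               expansions, preserved, possessives)
print(good)
print(bad)
```

The possessive check is kept separate because "Sarah's" looks like a contraction but is not one; a response that rewrites it as "Sarah is" fails that evaluator even if every real contraction was expanded correctly.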