Text Replacement
Tests deterministic text transformations: renaming characters/locations, expanding contractions, tense rewriting, POV shifts, gender swaps, combined transformations, and word avoidance. Scored by checking each expected change independently.
Expand all contractions
Text Editing
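The description says each expected change is checked independently. A minimal sketch of what that could look like follows; the `ExpectedChange` type, the `score_changes` function, and the substring matching rule are illustrative assumptions, not the benchmark's actual scorer.

```python
from dataclasses import dataclass


@dataclass
class ExpectedChange:
    """One expected edit: `old` should disappear and `new` should appear."""
    old: str
    new: str


def score_changes(output: str, expected: list[ExpectedChange]) -> float:
    """Return the fraction of expected changes found in `output`.

    Each change is checked on its own, so one missed edit does not
    zero out the others. Matching is plain substring search (an assumption).
    """
    if not expected:
        return 1.0
    passed = sum(
        1 for change in expected
        if change.new in output and change.old not in output
    )
    return passed / len(expected)


changes = [
    ExpectedChange(old="don't", new="do not"),
    ExpectedChange(old="it's", new="it is"),
    ExpectedChange(old="we'll", new="we will"),
]
output = "I do not think it is ready, but we'll see."
print(score_changes(output, changes))  # two of the three changes were applied
```

Under this scheme a response that misses one contraction still earns partial credit, which would be consistent with the non-round scores such as 97% and 98% in the tables.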
Performance Score Distribution (Top 20)
| Model | Score | |
|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | |
| GPT-5.1 | 100% | |
| Claude Opus 4.6 | 100% | |
| GPT-5 | 100% | |
| Qwen 3.5 397B A17B | 100% | |
| Z.AI GLM 5 | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | |
| GPT-5.2 | 100% | |
| Claude Opus 4.5 | 100% | |
| Z.AI GLM 4.6 | 100% | |
| Gemini 3 Pro (Preview) | 100% | |
| Claude Sonnet 4 | 100% | |
| Z.AI GLM 4.7 | 100% | |
| Gemini 2.5 Pro | 100% | |
| Grok 4 | 100% | |
| Claude Sonnet 4.5 | 100% | |
| Claude Opus 4 | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | |
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0002 | 1.3s | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 3.7s | |
| Mistral NeMO | 98% | $0.0001 | 2.6s | |
| Grok 4 Fast | 100% | $0.0007 | 4.9s | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 7.0s | |
| Gemini 2.5 Flash | 100% | $0.0012 | 1.8s | |
| Ministral 8B | 95% | $0.0001 | 2.4s | |
| Ministral 3 14B | 98% | $0.0002 | 3.2s | |
| Llama 3.1 8B | 98% | $0.0000 | 6.1s | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 8.4s | |
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 8.4s | |
| Qwen 2.5 72B | 100% | $0.0002 | 7.7s | |
| Mistral Small Creative | 97% | $0.0002 | 2.3s | |
| Mistral Large 3 | 100% | $0.0008 | 5.8s | |
| Arcee AI: Trinity Mini | 97% | $0.0003 | 12.7s | |
| GPT-4.1 Nano | 97% | $0.0002 | 2.8s | |
| Gemini 2.5 Flash Lite (Reasoning) | 99% | $0.0008 | 4.6s | |
| Ministral 3 8B | 96% | $0.0001 | 3.2s | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.7s | |
| GPT-4.1 Mini | 99% | $0.0008 | 5.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100% | 100% | 100% | |
| Gemini 3.1 Pro (Preview) | 100% | 100% | 100% | |
| GPT-5.1 | 100% | 100% | 100% | |
| Claude Opus 4.6 | 100% | 100% | 100% | |
| GPT-5 | 100% | 100% | 100% | |
| Qwen 3.5 397B A17B | 100% | 100% | 100% | |
| Z.AI GLM 5 | 100% | 100% | 100% | |
| MoonshotAI: Kimi K2.5 | 100% | 100% | 100% | |
| Gemini 3 Flash (Preview, Reasoning) | 100% | 100% | 100% | |
| GPT-5.2 | 100% | 100% | 100% | |
| Claude Opus 4.5 | 100% | 100% | 100% | |
| Z.AI GLM 4.6 | 100% | 100% | 100% | |
| Gemini 3 Pro (Preview) | 100% | 100% | 100% | |
| Claude Sonnet 4 | 100% | 100% | 100% | |
| Z.AI GLM 4.7 | 100% | 100% | 100% | |
| Gemini 2.5 Pro | 100% | 100% | 100% | |
| Grok 4 | 100% | 100% | 100% | |
| Claude Sonnet 4.5 | 100% | 100% | 100% | |
| Claude Opus 4 | 100% | 100% | 100% | |
| Gemini 2.5 Flash (Reasoning) | 100% | 100% | 100% | |
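The stability figure is described as median × consistency. As a rough sketch of that arithmetic: the page does not define how consistency is measured, so the hypothetical `stability` function below takes it as an input, and the assumption that the median is taken over per-run scores is also ours.

```python
from statistics import median


def stability(run_scores: list[float], consistency: float) -> float:
    """Multiply the median run score by a consistency factor.

    `run_scores` and `consistency` are both fractions in [0, 1].
    How consistency is derived is not stated on the page, so it is
    passed in rather than computed here.
    """
    return median(run_scores) * consistency


# Hypothetical numbers, not taken from the leaderboard.
print(stability([1.0, 1.0, 0.96], consistency=0.98))
```

With a median of 100%, the stability value simply equals the consistency factor, which is why every row in the table above shows the same figure in all three columns.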
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed, and stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | 100% | $0.0002 | 1.3s | 100% | |
| Mistral Small 3.2 24B | 100% | $0.0002 | 3.7s | 100% | |
| Gemini 2.5 Flash | 100% | $0.0012 | 1.8s | 100% | |
| Gemini 3 Flash (Preview) | 100% | $0.0015 | 2.7s | 100% | |
| GPT-4o Mini (temp=1) | 100% | $0.0003 | 7.0s | 100% | |
| Grok 4 Fast | 100% | $0.0007 | 4.9s | 99% | |
| Mistral Large 3 | 100% | $0.0008 | 5.8s | 100% | |
| Qwen 3.5 Plus (2026-02-15) | 100% | $0.0012 | 5.5s | 100% | |
| Mistral NeMO | 98% | $0.0001 | 2.6s | 97% | |
| GPT-4o Mini (temp=0) | 100% | $0.0003 | 8.4s | 100% | |
| Gemini 2.5 Flash Lite (Reasoning) | 99% | $0.0008 | 4.6s | 99% | |
| ByteDance Seed 1.6 Flash | 100% | $0.0004 | 8.4s | 99% | |
| Claude Haiku 4.5 | 100% | $0.0027 | 2.2s | 100% | |
| Qwen 2.5 72B | 100% | $0.0002 | 7.7s | 98% | |
| GPT-4.1 Mini | 99% | $0.0008 | 5.1s | 99% | |
| Ministral 3 14B | 98% | $0.0002 | 3.2s | 97% | |
| Mistral Small Creative | 97% | $0.0002 | 2.3s | 97% | |
| Hermes 3 70B | 100% | $0.0003 | 11.8s | 100% | |
| GPT-4.1 Nano | 97% | $0.0002 | 2.8s | 96% | |
| Mistral Medium 3.1 | 98% | $0.0010 | 5.4s | 98% | |
| Median | Evaluator | Top 3 | Flop 3 |
|---|---|---|---|
| 100.0% | Contraction expansion accuracy | ||
| 100.0% | Non-contraction text preserved | ||
| 100.0% | Possessive traps preserved | ||
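The three evaluators above can be pictured as three separate checks on one model response. The sketch below is only an illustration: the `EXPANSIONS` map, the `POSSESSIVE_TRAPS` list, the `evaluate` function, and the sample sentences are hypothetical, and the benchmark's real fixtures and matching rules are not shown on this page.

```python
import re

# Hypothetical fixture: the benchmark's real inputs and rules are not shown.
EXPANSIONS = {"can't": "cannot", "it's": "it is", "they're": "they are"}
POSSESSIVE_TRAPS = ["Sarah's", "dog's"]


def _remaining_words(text: str, phrases) -> list[str]:
    """Drop the given phrases and return the words that are left."""
    for phrase in phrases:
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    return text.split()


def evaluate(source: str, output: str) -> dict[str, bool]:
    src, out = source.lower(), output.lower()
    # Only contractions that actually occur in the source are checked.
    present = {s: f for s, f in EXPANSIONS.items() if s in src}
    return {
        # Every contraction is gone and its expanded form appears.
        "contractions_expanded": all(
            short not in out and full in out for short, full in present.items()
        ),
        # Words outside the contractions are identical and in the same order.
        "other_text_preserved": _remaining_words(source, present.keys())
        == _remaining_words(output, present.values()),
        # Possessives such as "Sarah's" must not be rewritten as "Sarah is".
        "possessives_preserved": all(
            trap in output for trap in POSSESSIVE_TRAPS if trap in source
        ),
    }


source = "Sarah's cat can't swim, and it's fine."
good = "Sarah's cat cannot swim, and it is fine."
bad = "Sarah is cat cannot swim, and it is fine."
print(evaluate(source, good))
print(evaluate(source, bad))
```

The `bad` response shows why the possessive check is scored separately: it expands every real contraction correctly yet still damages the text by treating a possessive as a contraction.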