Text Editing

This category comprises 18 scenarios across 3 subcategories, with 89 models scored.

Subcategories

| Subcategory | Avg Score | Best Model | Best Score |
|---|---|---|---|
| Transformation | 81.25% | Gemini 3 Pro (Preview) | 98.12% |
| Preservation | 92.40% | Claude Opus 4.6 | 100.00% |
| Structural Integrity | 98.02% | Claude Opus 4.6 (Reasoning) | 100.00% |

Model Leaderboard

All 89 models, ranked by their Text Editing category score.

| # | Model | Text Editing | Transformation | Preservation | Structural Integrity | Overall |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4 | 99.13% | 97.70% | 99.69% | 100.00% | 88.72% |
| 2 | Claude Sonnet 4.5 | 99.02% | 97.74% | 99.31% | 100.00% | 88.03% |
| 3 | GPT-5 | 98.90% | 98.11% | 98.57% | 100.00% | 91.93% |
| 4 | Gemini 3 Pro (Preview) | 98.86% | 98.12% | 98.47% | 100.00% | 88.79% |
| 5 | Claude Opus 4.6 (Reasoning) | 98.86% | 97.85% | 98.72% | 100.00% | 95.02% |
| 6 | Grok 4 | 98.76% | 97.06% | 99.23% | 100.00% | 88.12% |
| 7 | Qwen 3.5 27B | 98.69% | 97.53% | 98.55% | 100.00% | 90.85% |
| 8 | Z.AI GLM 5 | 98.59% | 97.15% | 98.62% | 100.00% | 91.23% |
| 9 | Gemini 2.5 Pro | 98.58% | 96.93% | 98.80% | 100.00% | 88.53% |
| 10 | GPT-5.1 | 98.54% | 97.43% | 98.18% | 100.00% | 92.54% |
| 11 | Gemini 3.1 Pro (Preview) | 98.51% | 96.51% | 99.03% | 100.00% | 94.37% |
| 12 | ByteDance Seed 1.6 | 98.40% | 97.62% | 97.58% | 100.00% | 90.70% |
| 13 | Claude Opus 4.6 | 98.35% | 95.06% | 100.00% | 100.00% | 92.35% |
| 14 | Claude Sonnet 4.6 (Reasoning) | 98.30% | 95.84% | 99.06% | 100.00% | 93.66% |
| 15 | Z.AI GLM 4.7 | 98.22% | 95.98% | 98.67% | 100.00% | 88.69% |
| 16 | Gemini 3 Flash (Preview, Reasoning) | 98.12% | 96.17% | 98.19% | 100.00% | 90.50% |
| 17 | Gemini 2.5 Flash (Reasoning) | 98.12% | 96.95% | 97.39% | 100.00% | 86.51% |
| 18 | Qwen 3.5 Plus (2026-02-15) | 98.10% | 94.91% | 99.38% | 100.00% | 85.96% |
| 19 | Qwen 3.5 397B A17B | 98.05% | 97.78% | 96.38% | 100.00% | 91.73% |
| 20 | Grok 4.1 Fast | 97.87% | 96.32% | 97.29% | 100.00% | 89.55% |
| 21 | Gemini 2.5 Flash | 97.83% | 94.82% | 98.67% | 100.00% | 80.60% |
| 22 | MoonshotAI: Kimi K2.5 | 97.79% | 96.37% | 98.19% | 98.81% | 91.04% |
| 23 | Z.AI GLM 4.6 | 97.78% | 95.02% | 98.31% | 100.00% | 89.11% |
| 24 | Claude Opus 4.5 | 97.69% | 93.08% | 100.00% | 100.00% | 89.69% |
| 25 | GPT-5.2 | 97.54% | 96.25% | 96.38% | 100.00% | 90.26% |
| 26 | Gemini 3 Flash (Preview) | 97.54% | 93.70% | 98.90% | 100.00% | 85.35% |
| 27 | Grok 4 Fast | 97.26% | 93.19% | 98.60% | 100.00% | 86.15% |
| 28 | Claude Opus 4 | 97.25% | 93.42% | 98.34% | 100.00% | 87.69% |
| 29 | GPT-5 Mini | 97.13% | 93.87% | 97.51% | 100.00% | 92.62% |
| 30 | Claude 3.7 Sonnet | 97.12% | 93.64% | 97.73% | 100.00% | 83.39% |
| 31 | Claude Haiku 4.5 | 96.81% | 90.77% | 99.64% | 100.00% | 85.14% |
| 32 | Claude 3.5 Sonnet | 96.57% | 92.76% | 96.96% | 100.00% | 84.24% |
| 33 | Claude Sonnet 4.6 | 96.37% | 89.29% | 99.82% | 100.00% | 91.15% |
| 34 | Qwen 3.5 122B | 96.31% | 91.84% | 98.28% | 98.81% | 91.53% |
| 35 | Minimax M2.5 | 96.02% | 90.05% | 98.01% | 100.00% | 88.71% |
| 36 | DeepSeek V3.2 | 95.78% | 95.62% | 91.71% | 100.00% | 82.25% |
| 37 | GPT-4.1 Mini | 95.62% | 89.26% | 97.60% | 100.00% | 83.20% |
| 38 | GPT-4o, May 13th (temp=0) | 95.35% | 87.89% | 98.16% | 100.00% | 85.36% |
| 39 | Aion 2.0 | 95.34% | 93.58% | 97.21% | 95.24% | 89.21% |
| 40 | Z.AI GLM 4.5 | 95.32% | 95.45% | 91.69% | 98.81% | 86.27% |
| 41 | Mistral Large | 95.14% | 90.77% | 94.67% | 100.00% | 80.15% |
| 42 | Qwen 3.5 35B | 94.95% | 91.82% | 94.21% | 98.81% | 88.00% |
| 43 | Gemini 2.5 Flash Lite (Reasoning) | 94.54% | 94.57% | 89.04% | 100.00% | 85.75% |
| 44 | GPT-4.1 | 94.40% | 86.53% | 96.66% | 100.00% | 88.68% |
| 45 | o4 Mini High | 94.36% | 87.18% | 97.09% | 98.81% | 90.29% |
| 46 | Mistral Large 2 | 94.16% | 90.87% | 91.61% | 100.00% | 82.41% |
| 47 | Mistral Large 3 | 94.09% | 90.68% | 91.61% | 100.00% | 85.43% |
| 48 | GPT-4o, Aug. 6th (temp=0) | 93.77% | 88.11% | 96.79% | 96.43% | 82.45% |
| 49 | Mistral Medium 3.1 | 93.77% | 84.76% | 96.56% | 100.00% | 77.83% |
| 50 | DeepSeek V3 (2024-12-26) | 93.58% | 89.33% | 92.60% | 98.81% | 83.68% |
| 51 | Qwen 3.5 Flash | 92.80% | 88.93% | 93.04% | 96.43% | 86.38% |
| 52 | GPT-4o, May 13th (temp=1) | 92.41% | 79.68% | 97.54% | 100.00% | 83.80% |
| 53 | Gemini 2.5 Flash Lite | 92.13% | 84.21% | 92.17% | 100.00% | 81.08% |
| 54 | Llama 3.1 70B | 92.10% | 84.01% | 94.67% | 97.62% | 78.40% |
| 55 | ByteDance Seed 1.6 Flash | 91.64% | 81.97% | 95.32% | 97.62% | 73.27% |
| 56 | Writer: Palmyra X5 | 91.20% | 79.31% | 94.28% | 100.00% | 79.57% |
| 57 | DeepSeek-V2 Chat | 90.90% | 83.56% | 91.53% | 97.62% | 84.83% |
| 58 | o4 Mini | 90.61% | 77.98% | 96.24% | 97.62% | 88.35% |
| 59 | Mistral Small Creative | 90.31% | 73.96% | 96.98% | 100.00% | 73.27% |
| 60 | DeepSeek V3 (2025-03-24) | 89.57% | 79.33% | 90.56% | 98.81% | 81.99% |
| 61 | Mistral Small 3.2 24B | 89.48% | 76.07% | 92.38% | 100.00% | 78.60% |
| 62 | Qwen 2.5 72B | 89.18% | 71.74% | 95.78% | 100.00% | 75.46% |
| 63 | Hermes 3 405B | 89.14% | 69.94% | 97.48% | 100.00% | 82.86% |
| 64 | WizardLM 2 8x22b | 88.13% | 81.62% | 89.67% | 93.12% | 71.07% |
| 65 | DeepSeek V3.1 | 87.27% | 81.88% | 84.70% | 95.24% | 82.39% |
| 66 | Llama 3.1 Nemotron 70B | 87.26% | 73.30% | 93.45% | 95.04% | 74.70% |
| 67 | GPT-4o, Aug. 6th (temp=1) | 86.72% | 72.28% | 95.01% | 92.86% | 82.62% |
| 68 | Gemma 3 27B | 86.63% | 64.41% | 95.49% | 100.00% | 77.85% |
| 69 | Arcee AI: Trinity Large (Preview) | 86.62% | 76.70% | 83.17% | 100.00% | 73.33% |
| 70 | Ministral 3 14B | 86.20% | 69.48% | 89.12% | 100.00% | 72.54% |
| 71 | Z.AI GLM 4.7 Flash | 85.82% | 70.63% | 90.40% | 96.43% | 84.82% |
| 72 | GPT-4o Mini (temp=1) | 85.78% | 70.40% | 86.94% | 100.00% | 79.08% |
| 73 | Gemma 3 12B | 85.18% | 67.87% | 87.66% | 100.00% | 78.41% |
| 74 | GPT-4o Mini (temp=0) | 84.62% | 67.00% | 86.87% | 100.00% | 78.29% |
| 75 | GPT-5 Nano | 82.74% | 61.06% | 91.91% | 95.24% | 82.60% |
| 76 | Ministral 3 8B | 78.52% | 52.05% | 83.53% | 100.00% | 71.76% |
| 77 | Gemma 3 4B | 78.38% | 40.80% | 94.35% | 100.00% | 68.57% |
| 78 | Ministral 8B | 77.52% | 48.88% | 83.69% | 100.00% | 64.87% |
| 79 | GPT-4.1 Nano | 76.06% | 50.14% | 78.02% | 100.00% | 71.94% |
| 80 | Llama 3.1 8B | 75.45% | 51.53% | 80.76% | 94.05% | 63.37% |
| 81 | Arcee AI: Trinity Mini | 73.88% | 50.86% | 85.07% | 85.71% | 70.90% |
| 82 | Mistral NeMo | 73.69% | 45.51% | 82.35% | 93.21% | 65.04% |
| 83 | LFM2 24B | 71.56% | 66.33% | 57.43% | 90.94% | 58.77% |
| 84 | Ministral 3B | 70.91% | 40.57% | 74.55% | 97.62% | 61.29% |
| 85 | Ministral 3 3B | 69.80% | 39.84% | 73.12% | 96.43% | 67.22% |
| 86 | Cohere Command R+ (Aug. 2024) | 68.40% | 39.15% | 70.67% | 95.39% | 69.03% |
| 87 | Claude 3 Haiku | 64.36% | 48.58% | 63.45% | 81.03% | 71.19% |
| 88 | Hermes 3 70B | 63.34% | 44.57% | 66.87% | 78.57% | 72.57% |
| 89 | Rocinante 12B | 56.31% | 32.19% | 63.96% | 72.78% | 54.55% |
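The Text Editing column is consistent with an unweighted mean of the three subcategory scores. A minimal sketch of that assumed aggregation (the function name is illustrative, not from the benchmark):

```python
def text_editing_score(transformation, preservation, structural_integrity):
    """Assumed aggregation: unweighted mean of the three subcategory
    percentages, rounded to two decimals as displayed in the leaderboard."""
    return round((transformation + preservation + structural_integrity) / 3, 2)

# Claude Sonnet 4's row from the leaderboard above:
print(text_editing_score(97.70, 99.69, 100.00))  # 99.13, matching its listed score
```

Because the displayed subcategory values are themselves rounded, recomputing from them can shift the last digit by 0.01 (e.g. GPT-5's row recomputes to 98.89 against a listed 98.90), so the true aggregation presumably runs on unrounded scores.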