Text Editing

18 scenarios across 3 subcategories. 116 models scored.

Subcategories

Subcategory | Avg Score | Best Model | Best Score
Transformation | 81.33% | Gemini 3 Pro (Preview) | 98.12%
Preservation | 92.50% | Claude Opus 4.6 | 100.00%
Structural Integrity | 98.00% | Claude Opus 4.6 (Reasoning) | 100.00%

Model Leaderboard

All models ranked by their Text Editing category score.

# | Model | Text Editing | Transformation | Preservation | Structural Integrity | Overall
1 | Claude Sonnet 4 | 99.13% | 97.70% | 99.69% | 100.00% | 88.72%
2 | Claude Sonnet 4.5 | 99.02% | 97.74% | 99.31% | 100.00% | 88.03%
3 | GPT-5 | 98.90% | 98.11% | 98.57% | 100.00% | 91.93%
4 | Gemini 3 Pro (Preview) | 98.86% | 98.12% | 98.47% | 100.00% | 88.79%
5 | Claude Opus 4.6 (Reasoning) | 98.86% | 97.85% | 98.72% | 100.00% | 95.02%
6 | Grok 4 | 98.76% | 97.06% | 99.23% | 100.00% | 88.12%
7 | Qwen 3.5 27B | 98.69% | 97.53% | 98.55% | 100.00% | 90.85%
8 | Grok 4.20 (Beta, Reasoning) | 98.69% | 97.27% | 98.80% | 100.00% | 91.49%
9 | Z.AI GLM 5 | 98.59% | 97.15% | 98.62% | 100.00% | 91.23%
10 | Gemini 2.5 Pro | 98.58% | 96.93% | 98.80% | 100.00% | 88.53%
11 | GPT-5.1 | 98.54% | 97.43% | 98.18% | 100.00% | 92.54%
12 | Gemini 3.1 Pro (Preview) | 98.51% | 96.51% | 99.03% | 100.00% | 94.37%
13 | GPT-5.4 (Reasoning) | 98.42% | 97.84% | 97.42% | 100.00% | 93.24%
14 | ByteDance Seed 1.6 | 98.40% | 97.62% | 97.58% | 100.00% | 90.70%
15 | Claude Opus 4.6 | 98.35% | 95.06% | 100.00% | 100.00% | 92.35%
16 | Claude Sonnet 4.6 (Reasoning) | 98.30% | 95.84% | 99.06% | 100.00% | 93.66%
17 | Z.AI GLM 4.7 | 98.22% | 95.98% | 98.67% | 100.00% | 88.69%
18 | Z.AI GLM 5 Turbo | 98.17% | 96.68% | 97.83% | 100.00% | 94.27%
19 | Gemini 3 Flash (Preview, Reasoning) | 98.12% | 96.17% | 98.19% | 100.00% | 90.50%
20 | Gemini 2.5 Flash (Reasoning) | 98.12% | 96.95% | 97.39% | 100.00% | 86.51%
21 | Qwen 3.5 Plus (2026-02-15) | 98.10% | 94.91% | 99.38% | 100.00% | 85.96%
22 | Qwen 3.5 397B A17B | 98.05% | 97.78% | 96.38% | 100.00% | 91.73%
23 | GPT-5.4 (Reasoning, Low) | 98.01% | 96.83% | 97.21% | 100.00% | 91.41%
24 | Grok 4.1 Fast | 97.87% | 96.32% | 97.29% | 100.00% | 89.55%
25 | Gemini 2.5 Flash | 97.83% | 94.82% | 98.67% | 100.00% | 80.60%
26 | MoonshotAI: Kimi K2.5 | 97.79% | 96.37% | 98.19% | 98.81% | 91.04%
27 | Z.AI GLM 4.6 | 97.78% | 95.02% | 98.31% | 100.00% | 89.11%
28 | Claude Opus 4.5 | 97.69% | 93.08% | 100.00% | 100.00% | 89.69%
29 | GPT-5.2 | 97.54% | 96.25% | 96.38% | 100.00% | 90.26%
30 | Gemini 3 Flash (Preview) | 97.54% | 93.70% | 98.90% | 100.00% | 85.35%
31 | Grok 4 Fast | 97.26% | 93.19% | 98.60% | 100.00% | 86.15%
32 | Claude Opus 4 | 97.25% | 93.42% | 98.34% | 100.00% | 87.69%
33 | GPT-5 Mini | 97.13% | 93.87% | 97.51% | 100.00% | 92.62%
34 | Claude 3.7 Sonnet | 97.12% | 93.64% | 97.73% | 100.00% | 83.39%
35 | Claude Haiku 4.5 | 96.81% | 90.77% | 99.64% | 100.00% | 85.14%
36 | GPT-5.4 | 96.73% | 94.09% | 96.10% | 100.00% | 84.32%
37 | Claude 3.5 Sonnet | 96.57% | 92.76% | 96.96% | 100.00% | 84.24%
38 | Gemini 3.1 Flash Lite (Preview) | 96.46% | 91.27% | 98.12% | 100.00% | 85.87%
39 | Claude Sonnet 4.6 | 96.37% | 89.29% | 99.82% | 100.00% | 91.15%
40 | Qwen 3.5 122B | 96.31% | 91.84% | 98.28% | 98.81% | 91.53%
41 | Stealth: Healer Alpha | 96.04% | 90.26% | 97.86% | 100.00% | 85.93%
42 | MiniMax M2.5 | 96.02% | 90.05% | 98.01% | 100.00% | 88.71%
43 | GPT-5.4 Mini (Reasoning) | 95.78% | 93.03% | 95.50% | 98.81% | 90.65%
44 | DeepSeek V3.2 | 95.78% | 95.62% | 91.71% | 100.00% | 82.25%
45 | GPT-4.1 Mini | 95.62% | 89.26% | 97.60% | 100.00% | 83.20%
46 | Stealth: Hunter Alpha | 95.53% | 91.32% | 96.45% | 98.81% | 87.34%
47 | Grok 4.20 (Beta) | 95.49% | 92.06% | 94.41% | 100.00% | 83.85%
48 | GPT-4o, May 13th (temp=0) | 95.35% | 87.89% | 98.16% | 100.00% | 85.36%
49 | Aion 2.0 | 95.34% | 93.58% | 97.21% | 95.24% | 89.21%
50 | Z.AI GLM 4.5 | 95.32% | 95.45% | 91.69% | 98.81% | 86.27%
51 | Mistral Large | 95.14% | 90.77% | 94.67% | 100.00% | 80.15%
52 | ByteDance Seed 2.0 Lite | 95.03% | 88.71% | 97.58% | 98.81% | 84.80%
53 | Qwen 3.5 35B | 94.95% | 91.82% | 94.21% | 98.81% | 88.00%
54 | Gemini 2.5 Flash Lite (Reasoning) | 94.54% | 94.57% | 89.04% | 100.00% | 85.75%
55 | GPT-4.1 | 94.40% | 86.53% | 96.66% | 100.00% | 88.68%
56 | o4 Mini High | 94.36% | 87.18% | 97.09% | 98.81% | 90.29%
57 | Mistral Large 2 | 94.16% | 90.87% | 91.61% | 100.00% | 82.41%
58 | Mistral Large 3 | 94.09% | 90.68% | 91.61% | 100.00% | 85.43%
59 | GPT-4o, Aug. 6th (temp=0) | 93.77% | 88.11% | 96.79% | 96.43% | 82.45%
60 | Mistral Medium 3.1 | 93.77% | 84.76% | 96.56% | 100.00% | 77.83%
61 | DeepSeek V3 (2024-12-26) | 93.58% | 89.33% | 92.60% | 98.81% | 83.68%
62 | Qwen 3.5 Flash | 92.80% | 88.93% | 93.04% | 96.43% | 86.38%
63 | GPT-5.4 Mini (Reasoning, Low) | 92.63% | 85.94% | 91.94% | 100.00% | 85.75%
64 | GPT-4o, May 13th (temp=1) | 92.41% | 79.68% | 97.54% | 100.00% | 83.80%
65 | MiniMax M2.7 | 92.14% | 82.95% | 95.84% | 97.62% | 89.10%
66 | Gemini 2.5 Flash Lite | 92.13% | 84.21% | 92.17% | 100.00% | 81.08%
67 | Llama 3.1 70B | 92.10% | 84.01% | 94.67% | 97.62% | 78.40%
68 | Qwen3 235B A22B Instruct 2507 | 91.75% | 79.66% | 95.60% | 100.00% | 80.10%
69 | ByteDance Seed 1.6 Flash | 91.64% | 81.97% | 95.32% | 97.62% | 73.27%
70 | Writer: Palmyra X5 | 91.20% | 79.31% | 94.28% | 100.00% | 79.57%
71 | ByteDance Seed 2.0 Mini | 91.08% | 86.72% | 90.10% | 96.43% | 86.91%
72 | Mistral Small 4 | 91.00% | 81.54% | 91.45% | 100.00% | 76.46%
73 | DeepSeek-V2 Chat | 90.90% | 83.56% | 91.53% | 97.62% | 84.83%
74 | o4 Mini | 90.61% | 77.98% | 96.24% | 97.62% | 88.35%
75 | GPT-5.4 Mini | 90.60% | 83.34% | 88.47% | 100.00% | 82.43%
76 | Mistral Small 4 (Reasoning) | 90.58% | 79.45% | 93.47% | 98.81% | 82.39%
77 | Mistral Small Creative | 90.31% | 73.96% | 96.98% | 100.00% | 73.27%
78 | Qwen 3 32B | 89.95% | 77.20% | 93.83% | 98.81% | 82.21%
79 | DeepSeek V3 (2025-03-24) | 89.57% | 79.33% | 90.56% | 98.81% | 81.99%
80 | Mistral Small 3.2 24B | 89.48% | 76.07% | 92.38% | 100.00% | 78.60%
81 | Qwen 2.5 72B | 89.18% | 71.74% | 95.78% | 100.00% | 75.46%
82 | Hermes 3 405B | 89.14% | 69.94% | 97.48% | 100.00% | 82.86%
83 | WizardLM 2 8x22b | 88.13% | 81.62% | 89.67% | 93.12% | 71.07%
84 | DeepSeek V3.1 | 87.27% | 81.88% | 84.70% | 95.24% | 82.39%
85 | Llama 3.1 Nemotron 70B | 87.26% | 73.30% | 93.45% | 95.04% | 74.70%
86 | GPT-4o, Aug. 6th (temp=1) | 86.72% | 72.28% | 95.01% | 92.86% | 82.62%
87 | Gemma 3 27B | 86.63% | 64.41% | 95.49% | 100.00% | 77.85%
88 | Arcee AI: Trinity Large (Preview) | 86.62% | 76.70% | 83.17% | 100.00% | 73.33%
89 | Nemotron 3 Super | 86.34% | 79.06% | 89.49% | 90.48% | 84.56%
90 | Ministral 3 14B | 86.20% | 69.48% | 89.12% | 100.00% | 72.54%
91 | Z.AI GLM 4.7 Flash | 85.82% | 70.63% | 90.40% | 96.43% | 84.82%
92 | GPT-4o Mini (temp=1) | 85.78% | 70.40% | 86.94% | 100.00% | 79.08%
93 | Qwen 3.5 9B | 85.35% | 71.91% | 92.48% | 91.67% | 86.05%
94 | Inception Mercury 2 | 85.26% | 70.42% | 91.31% | 94.05% | 83.85%
95 | Gemma 3 12B | 85.18% | 67.87% | 87.66% | 100.00% | 78.41%
96 | GPT-4o Mini (temp=0) | 84.62% | 67.00% | 86.87% | 100.00% | 78.29%
97 | GPT-5.4 Nano (Reasoning) | 83.32% | 59.85% | 90.10% | 100.00% | 81.36%
98 | GPT-5 Nano | 82.74% | 61.06% | 91.91% | 95.24% | 82.60%
99 | GPT-5.4 Nano (Reasoning, Low) | 82.23% | 61.68% | 85.00% | 100.00% | 79.48%
100 | Inception Mercury | 79.53% | 58.12% | 87.15% | 93.32% | 79.50%
101 | GPT-5.4 Nano | 79.22% | 53.44% | 84.23% | 100.00% | 74.40%
102 | Ministral 3 8B | 78.52% | 52.05% | 83.53% | 100.00% | 71.76%
103 | Gemma 3 4B | 78.38% | 40.80% | 94.35% | 100.00% | 68.57%
104 | Ministral 8B | 77.52% | 48.88% | 83.69% | 100.00% | 64.87%
105 | GPT-4.1 Nano | 76.06% | 50.14% | 78.02% | 100.00% | 71.94%
106 | Nemotron 3 Nano | 75.81% | 62.02% | 78.40% | 87.00% | 77.73%
107 | Llama 3.1 8B | 75.45% | 51.53% | 80.76% | 94.05% | 63.37%
108 | Arcee AI: Trinity Mini | 73.88% | 50.86% | 85.07% | 85.71% | 70.90%
109 | Mistral NeMO | 73.69% | 45.51% | 82.35% | 93.21% | 65.04%
110 | LFM2 24B | 71.56% | 66.33% | 57.43% | 90.94% | 58.77%
111 | Ministral 3B | 70.91% | 40.57% | 74.55% | 97.62% | 61.29%
112 | Ministral 3 3B | 69.80% | 39.84% | 73.12% | 96.43% | 67.22%
113 | Cohere Command R+ (Aug. 2024) | 68.40% | 39.15% | 70.67% | 95.39% | 69.03%
114 | Claude 3 Haiku | 64.36% | 48.58% | 63.45% | 81.03% | 71.19%
115 | Hermes 3 70B | 63.34% | 44.57% | 66.87% | 78.57% | 72.57%
116 | Rocinante 12B | 56.31% | 32.19% | 63.96% | 72.78% | 54.55%
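
The page does not state how the Text Editing category score is derived. In the rows above, the listed value appears consistent with an unweighted mean of the Transformation, Preservation, and Structural Integrity scores, to within about 0.01 percentage points (presumably rounding of the underlying values). The sketch below checks that assumption against a few rows copied from the leaderboard. The function name `category_score` and the tolerance are illustrative choices, not part of the benchmark.

```python
# Assumption: category score = unweighted mean of the three subcategory scores.
# This is inferred from the published figures, not a documented formula.

def category_score(transformation, preservation, structural_integrity):
    """Return the unweighted mean of the three subcategory scores."""
    return (transformation + preservation + structural_integrity) / 3


# (model, listed Text Editing, Transformation, Preservation, Structural Integrity)
# Values copied from the leaderboard rows above.
ROWS = [
    ("Claude Sonnet 4", 99.13, 97.70, 99.69, 100.00),
    ("GPT-5", 98.90, 98.11, 98.57, 100.00),
    ("MoonshotAI: Kimi K2.5", 97.79, 96.37, 98.19, 98.81),
    ("Rocinante 12B", 56.31, 32.19, 63.96, 72.78),
]

# Hypothetical tolerance that allows for rounding in the displayed percentages.
TOLERANCE = 0.02


def check_rows(rows, tolerance=TOLERANCE):
    """Compare the computed mean with the listed category score for each row."""
    results = []
    for model, listed, t, p, s in rows:
        computed = category_score(t, p, s)
        results.append((model, listed, computed, abs(computed - listed) <= tolerance))
    return results


if __name__ == "__main__":
    for model, listed, computed, ok in check_rows(ROWS):
        status = "matches" if ok else "differs"
        print(f"{model}: listed {listed:.2f}, mean {computed:.2f} ({status})")
```

If the assumption holds, models tied at the displayed precision (for example ranks 4 and 5, both 98.86%) are presumably ordered by their unrounded scores.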