Text Editing

18 scenarios across 3 subcategories. 154 models scored.

Subcategories

Subcategory           Avg Score  Best Model                    Best Score
Transformation        83.94%     Gemini 3.5 Flash (Reasoning)  98.44%
Preservation          93.68%     Claude Opus 4.6               100.00%
Structural Integrity  98.32%     Qwen3.7 Max                   100.00%

Model Leaderboard

All models ranked by their Text Editing category score.

# Model Text Editing Transformation Preservation Structural Integrity Overall
1 Claude Sonnet 4 99.13% 97.70% 99.69% 100.00% 88.72%
2 Claude Sonnet 4.5 99.02% 97.74% 99.31% 100.00% 88.03%
3 Z.AI GLM 5.1 98.90% 97.82% 98.88% 100.00% 94.37%
4 GPT-5 98.90% 98.11% 98.57% 100.00% 91.93%
5 Gemini 3 Pro (Preview) 98.86% 98.12% 98.47% 100.00% 88.79%
6 Claude Opus 4.6 (Reasoning) 98.86% 97.85% 98.72% 100.00% 95.02%
7 Grok 4.20 (Reasoning) 98.83% 97.60% 98.90% 100.00% 91.39%
8 Gemma 4 31B (Reasoning) 98.83% 97.51% 98.98% 100.00% 91.71%
9 GPT-5.5 (Reasoning) 98.79% 97.78% 98.57% 100.00% 92.98%
10 Claude Opus 4.8 (Reasoning) 98.78% 96.33% 100.00% 100.00% 92.22%
11 Grok 4 98.76% 97.06% 99.23% 100.00% 88.12%
12 Claude Opus 4.8 (Reasoning, Low) 98.71% 96.27% 99.87% 100.00% 92.14%
13 Qwen 3.5 27B 98.69% 97.53% 98.55% 100.00% 90.85%
14 Grok 4.20 (Beta, Reasoning) 98.69% 97.27% 98.80% 100.00% 91.49%
15 Z.AI GLM 5 98.59% 97.15% 98.62% 100.00% 91.23%
16 GPT-5.5 (Reasoning, Low) 98.59% 96.83% 98.93% 100.00% 92.59%
17 Qwen3.6 Max Preview 98.58% 96.97% 98.78% 100.00% 94.54%
18 Gemini 2.5 Pro 98.58% 96.93% 98.80% 100.00% 88.53%
19 DeepSeek V4 Pro (Reasoning) 98.56% 97.09% 98.60% 100.00% 90.10%
20 Gemma 4 31B 98.56% 95.89% 99.80% 100.00% 86.91%
21 GPT-5.1 98.54% 97.43% 98.18% 100.00% 92.54%
22 Gemini 3.1 Pro (Preview) 98.51% 96.51% 99.03% 100.00% 94.37%
23 GPT-5.4 (Reasoning) 98.42% 97.84% 97.42% 100.00% 93.24%
24 ByteDance Seed 1.6 98.40% 97.62% 97.58% 100.00% 90.70%
25 Claude Opus 4.6 98.35% 95.06% 100.00% 100.00% 92.35%
26 Claude Sonnet 4.6 (Reasoning) 98.30% 95.84% 99.06% 100.00% 93.66%
27 Gemini 3.5 Flash (Reasoning, Minimal) 98.26% 96.60% 98.17% 100.00% 86.47%
28 Gemma 4 26B (Reasoning) 98.26% 95.76% 99.01% 100.00% 91.49%
29 Z.AI GLM 4.7 98.22% 95.98% 98.67% 100.00% 88.69%
30 GPT-5.5 98.20% 96.02% 98.59% 100.00% 89.09%
31 Z.AI GLM 5 Turbo 98.17% 96.68% 97.83% 100.00% 94.27%
32 Gemini 3 Flash (Preview, Reasoning) 98.12% 96.17% 98.19% 100.00% 90.50%
33 Gemini 2.5 Flash (Reasoning) 98.12% 96.95% 97.39% 100.00% 86.51%
34 Qwen 3.5 Plus (2026-02-15) 98.10% 94.91% 99.38% 100.00% 85.96%
35 Qwen3.7 Max 98.08% 95.96% 98.27% 100.00% 95.75%
36 Qwen 3.5 397B A17B 98.05% 97.78% 96.38% 100.00% 91.73%
37 GPT-5.4 (Reasoning, Low) 98.01% 96.83% 97.21% 100.00% 91.41%
38 MoonshotAI: Kimi K2.6 98.00% 96.37% 98.83% 98.81% 92.31%
39 DeepSeek V4 Pro 97.98% 95.32% 98.62% 100.00% 82.63%
40 Grok 4.1 Fast 97.87% 96.32% 97.29% 100.00% 89.55%
41 Gemini 2.5 Flash 97.83% 94.82% 98.67% 100.00% 80.60%
42 MoonshotAI: Kimi K2.5 97.79% 96.37% 98.19% 98.81% 91.04%
43 Gemini 3.5 Flash (Reasoning) 97.78% 98.44% 98.47% 96.43% 94.08%
44 Z.AI GLM 4.6 97.78% 95.02% 98.31% 100.00% 89.11%
45 Qwen 3.5 Plus (2026-04-20) 97.70% 96.29% 96.81% 100.00% 91.51%
46 Claude Opus 4.5 97.69% 93.08% 100.00% 100.00% 89.69%
47 Grok 4.3 (Reasoning) 97.64% 96.47% 97.63% 98.81% 93.60%
48 Claude Opus 4.7 (Reasoning) 97.58% 92.89% 99.85% 100.00% 93.23%
49 Claude Opus 4.7 97.55% 92.82% 99.85% 100.00% 89.93%
50 GPT-5.2 97.54% 96.25% 96.38% 100.00% 90.26%
51 Gemini 3 Flash (Preview) 97.54% 93.70% 98.90% 100.00% 85.35%
52 MiniMax M3 97.51% 93.99% 98.55% 100.00% 90.88%
53 Gemini 3.1 Flash Lite 97.35% 93.93% 98.11% 100.00% 85.75%
54 Grok 4 Fast 97.26% 93.19% 98.60% 100.00% 86.15%
55 Claude Opus 4 97.25% 93.42% 98.34% 100.00% 87.69%
56 GPT-5 Mini 97.13% 93.87% 97.51% 100.00% 92.62%
57 Claude 3.7 Sonnet 97.12% 93.64% 97.73% 100.00% 83.39%
58 Gemma 4 26B 97.04% 91.43% 99.69% 100.00% 85.84%
59 Gemini 3.1 Flash Lite (Reasoning) 96.90% 92.71% 97.98% 100.00% 86.41%
60 Claude Haiku 4.5 96.81% 90.77% 99.64% 100.00% 85.14%
61 GPT-5.4 96.73% 94.09% 96.10% 100.00% 84.32%
62 Claude 3.5 Sonnet 96.57% 92.76% 96.96% 100.00% 84.24%
63 Gemini 3.1 Flash Lite (Preview) 96.46% 91.27% 98.12% 100.00% 85.87%
64 Claude Sonnet 4.6 96.37% 89.29% 99.82% 100.00% 91.15%
65 Qwen 3.5 122B 96.31% 91.84% 98.28% 98.81% 91.53%
66 DeepSeek V4 Flash (Reasoning) 96.15% 94.95% 98.26% 95.24% 89.01%
67 Qwen 3.6 Flash 96.09% 91.94% 97.52% 98.81% 90.65%
68 Stealth: Healer Alpha 96.04% 90.26% 97.86% 100.00% 85.93%
69 MiniMax M2.5 96.02% 90.05% 98.01% 100.00% 88.71%
70 Xiaomi MIMO v2.5 Pro 95.96% 91.14% 96.73% 100.00% 87.36%
71 GPT-5.4 Mini (Reasoning) 95.78% 93.03% 95.50% 98.81% 90.65%
72 DeepSeek V3.2 95.78% 95.62% 91.71% 100.00% 82.25%
73 Grok 4.20 95.63% 89.78% 97.12% 100.00% 81.70%
74 GPT-4.1 Mini 95.62% 89.26% 97.60% 100.00% 83.20%
75 Stealth: Hunter Alpha 95.53% 91.32% 96.45% 98.81% 87.34%
76 Grok 4.20 (Beta) 95.49% 92.06% 94.41% 100.00% 83.85%
77 GPT-4o, May 13th (temp=0) 95.35% 87.89% 98.16% 100.00% 85.36%
78 Xiaomi MIMO v2.5 95.34% 87.99% 98.04% 100.00% 85.05%
79 Aion 2.0 95.34% 93.58% 97.21% 95.24% 89.21%
80 Z.AI GLM 4.5 95.32% 95.45% 91.69% 98.81% 86.27%
81 Mistral Large 95.14% 90.77% 94.67% 100.00% 80.15%
82 Qwen 3.6 35B 95.10% 90.37% 97.30% 97.62% 89.05%
83 ByteDance Seed 2.0 Lite 95.03% 88.71% 97.58% 98.81% 84.80%
84 Qwen 3.5 35B 94.95% 91.82% 94.21% 98.81% 88.00%
85 Gemini 2.5 Flash Lite (Reasoning) 94.54% 94.57% 89.04% 100.00% 85.75%
86 GPT-4.1 94.40% 86.53% 96.66% 100.00% 88.68%
87 Z.AI GLM 4.5 Air 94.38% 86.09% 97.05% 100.00% 83.12%
88 o4 Mini High 94.36% 87.18% 97.09% 98.81% 90.29%
89 Mistral Large 2 94.16% 90.87% 91.61% 100.00% 82.41%
90 Mistral Large 3 94.09% 90.68% 91.61% 100.00% 85.43%
91 Qwen 3.6 27B 93.97% 88.52% 95.77% 97.62% 89.72%
92 GPT-4o, Aug. 6th (temp=0) 93.77% 88.11% 96.79% 96.43% 82.45%
93 Mistral Medium 3.1 93.77% 84.76% 96.56% 100.00% 77.83%
94 DeepSeek V3 (2024-12-26) 93.58% 89.33% 92.60% 98.81% 83.68%
95 DeepSeek V4 Flash 93.25% 89.68% 90.06% 100.00% 82.02%
96 Qwen 3.5 Flash 92.80% 88.93% 93.04% 96.43% 86.38%
97 GPT-5.4 Mini (Reasoning, Low) 92.63% 85.94% 91.94% 100.00% 85.75%
98 GPT-4o, May 13th (temp=1) 92.41% 79.68% 97.54% 100.00% 83.80%
99 MiniMax M2.7 92.14% 82.95% 95.84% 97.62% 89.10%
100 Gemini 2.5 Flash Lite 92.13% 84.21% 92.17% 100.00% 81.08%
101 Llama 3.1 70B 92.10% 84.01% 94.67% 97.62% 78.40%
102 Qwen3 235B A22B Instruct 2507 91.75% 79.66% 95.60% 100.00% 80.10%
103 GPT-OSS 120B 91.73% 83.23% 94.34% 97.62% 86.44%
104 ByteDance Seed 1.6 Flash 91.64% 81.97% 95.32% 97.62% 73.27%
105 Writer: Palmyra X5 91.20% 79.31% 94.28% 100.00% 79.57%
106 ByteDance Seed 2.0 Mini 91.08% 86.72% 90.10% 96.43% 86.91%
107 Mistral Small 4 91.00% 81.54% 91.45% 100.00% 76.46%
108 DeepSeek-V2 Chat 90.90% 83.56% 91.53% 97.62% 84.83%
109 o4 Mini 90.61% 77.98% 96.24% 97.62% 88.35%
110 GPT-5.4 Mini 90.60% 83.34% 88.47% 100.00% 82.43%
111 Mistral Small 4 (Reasoning) 90.58% 79.45% 93.47% 98.81% 82.39%
112 Mistral Small Creative 90.31% 73.96% 96.98% 100.00% 73.27%
113 Grok 4.3 90.19% 81.66% 88.90% 100.00% 78.66%
114 Qwen 3 32B 89.95% 77.20% 93.83% 98.81% 82.21%
115 DeepSeek V3 (2025-03-24) 89.57% 79.33% 90.56% 98.81% 81.99%
116 Mistral Small 3.2 24B 89.48% 76.07% 92.38% 100.00% 78.58%
117 Qwen 2.5 72B 89.18% 71.74% 95.78% 100.00% 75.46%
118 Hermes 3 405B 89.14% 69.94% 97.48% 100.00% 82.86%
119 WizardLM 2 8x22b 88.13% 81.62% 89.67% 93.12% 71.06%
120 DeepSeek V3.1 87.27% 81.88% 84.70% 95.24% 82.39%
121 Llama 3.1 Nemotron 70B 87.26% 73.30% 93.45% 95.04% 74.70%
122 GPT-4o, Aug. 6th (temp=1) 86.72% 72.28% 95.01% 92.86% 82.62%
123 Gemma 3 27B 86.63% 64.41% 95.49% 100.00% 77.85%
124 Arcee AI: Trinity Large (Preview) 86.62% 76.70% 83.17% 100.00% 73.33%
125 Nemotron 3 Super 86.34% 79.06% 89.49% 90.48% 84.56%
126 Ministral 3 14B 86.20% 69.48% 89.12% 100.00% 72.54%
127 Cydonia 24B V4.1 86.15% 65.80% 92.65% 100.00% 75.09%
128 Z.AI GLM 4.7 Flash 85.82% 70.63% 90.40% 96.43% 84.82%
129 GPT-4o Mini (temp=1) 85.78% 70.40% 86.94% 100.00% 79.08%
130 Qwen 3.5 9B 85.35% 71.91% 92.48% 91.67% 86.05%
131 Inception Mercury 2 85.26% 70.42% 91.31% 94.05% 83.85%
132 Gemma 3 12B 85.18% 67.87% 87.66% 100.00% 78.41%
133 GPT-4o Mini (temp=0) 84.62% 67.00% 86.87% 100.00% 78.29%
134 GPT-5.4 Nano (Reasoning) 83.32% 59.85% 90.10% 100.00% 81.36%
135 GPT-5 Nano 82.74% 61.06% 91.91% 95.24% 82.60%
136 GPT-5.4 Nano (Reasoning, Low) 82.23% 61.68% 85.00% 100.00% 79.48%
137 Inception Mercury 79.53% 58.12% 87.15% 93.32% 79.50%
138 GPT-5.4 Nano 79.22% 53.44% 84.23% 100.00% 74.40%
139 Ministral 3 8B 78.52% 52.05% 83.53% 100.00% 71.76%
140 Gemma 3 4B 78.38% 40.80% 94.35% 100.00% 68.57%
141 Ministral 8B 77.52% 48.88% 83.69% 100.00% 64.87%
142 Skyfall 36B V2 76.69% 55.99% 82.94% 91.15% 65.76%
143 GPT-4.1 Nano 76.06% 50.14% 78.02% 100.00% 71.94%
144 Nemotron 3 Nano 75.81% 62.02% 78.40% 87.00% 77.73%
145 Llama 3.1 8B 75.45% 51.53% 80.76% 94.05% 63.35%
146 Arcee AI: Trinity Mini 73.88% 50.86% 85.07% 85.71% 70.90%
147 Mistral NeMO 73.69% 45.51% 82.35% 93.21% 65.04%
148 LFM2 24B 71.56% 66.33% 57.43% 90.94% 58.77%
149 Ministral 3B 70.91% 40.57% 74.55% 97.62% 61.29%
150 Ministral 3 3B 69.80% 39.84% 73.12% 96.43% 67.22%
151 Cohere Command R+ (Aug. 2024) 68.40% 39.15% 70.67% 95.39% 69.03%
152 Claude 3 Haiku 64.36% 48.58% 63.45% 81.03% 71.19%
153 Hermes 3 70B 63.34% 44.57% 66.87% 78.57% 72.57%
154 Rocinante 12B 56.31% 32.19% 63.96% 72.78% 54.54%
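
How the Text Editing score relates to the three subcategory columns is not stated on the page. For the rows checked below, it matches the unweighted mean of Transformation, Preservation, and Structural Integrity to within rounding. The sketch below tests that reading on four rows copied from the leaderboard. The averaging rule is an assumption inferred from the numbers, and the names `mean_of_subcategories` and `max_deviation` are hypothetical helpers, not part of any benchmark tooling.

```python
# Sketch: compare the published Text Editing score with the unweighted mean
# of the three subcategory scores. The averaging rule is an ASSUMPTION
# inferred from the table, not a documented scoring formula.

ROWS = [
    # (model, text_editing, transformation, preservation, structural_integrity)
    ("Claude Sonnet 4", 99.13, 97.70, 99.69, 100.00),
    ("Gemini 3.5 Flash (Reasoning)", 97.78, 98.44, 98.47, 96.43),
    ("LFM2 24B", 71.56, 66.33, 57.43, 90.94),
    ("Rocinante 12B", 56.31, 32.19, 63.96, 72.78),
]


def mean_of_subcategories(transformation, preservation, structural):
    """Unweighted mean of the three subcategory scores, in percent."""
    return (transformation + preservation + structural) / 3


def max_deviation(rows):
    """Largest absolute gap between the published score and the mean."""
    return max(
        abs(mean_of_subcategories(t, p, s) - published)
        for _model, published, t, p, s in rows
    )


if __name__ == "__main__":
    for model, published, t, p, s in ROWS:
        estimate = mean_of_subcategories(t, p, s)
        print(f"{model}: published {published:.2f}%, mean {estimate:.2f}%")
    print(f"max deviation: {max_deviation(ROWS):.4f} percentage points")
```

Gaps of about 0.01 percentage points are expected, because the subcategory values shown are themselves rounded to two decimals. A larger gap on other rows would suggest the category score is weighted differently, for example by scenario count.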