Transformation

Subcategory of Text Editing. 154 models scored.

Model Leaderboard

All 154 models, ranked in descending order of Transformation subcategory score. The table also lists each model's Text Editing and Overall scores.

#	Model	Transformation	Text Editing	Overall
1	Gemini 3.5 Flash (Reasoning)	98.44%	97.78%	94.08%
2	Gemini 3 Pro (Preview)	98.12%	98.86%	88.79%
3	GPT-5	98.11%	98.90%	91.93%
4	Claude Opus 4.6 (Reasoning)	97.85%	98.86%	95.02%
5	GPT-5.4 (Reasoning)	97.84%	98.42%	93.24%
6	Z.AI GLM 5.1	97.82%	98.90%	94.37%
7	GPT-5.5 (Reasoning)	97.78%	98.79%	92.98%
8	Qwen 3.5 397B A17B	97.78%	98.05%	91.73%
9	Claude Sonnet 4.5	97.74%	99.02%	88.03%
10	Claude Sonnet 4	97.70%	99.13%	88.72%
11	ByteDance Seed 1.6	97.62%	98.40%	90.70%
12	Grok 4.20 (Reasoning)	97.60%	98.83%	91.39%
13	Qwen 3.5 27B	97.53%	98.69%	90.85%
14	Gemma 4 31B (Reasoning)	97.51%	98.83%	91.71%
15	GPT-5.1	97.43%	98.54%	92.54%
16	Grok 4.20 (Beta, Reasoning)	97.27%	98.69%	91.49%
17	Z.AI GLM 5	97.15%	98.59%	91.23%
18	DeepSeek V4 Pro (Reasoning)	97.09%	98.56%	90.10%
19	Grok 4	97.06%	98.76%	88.12%
20	Qwen3.6 Max Preview	96.97%	98.58%	94.54%
21	Gemini 2.5 Flash (Reasoning)	96.95%	98.12%	86.51%
22	Gemini 2.5 Pro	96.93%	98.58%	88.53%
23	GPT-5.4 (Reasoning, Low)	96.83%	98.01%	91.41%
24	GPT-5.5 (Reasoning, Low)	96.83%	98.59%	92.59%
25	Z.AI GLM 5 Turbo	96.68%	98.17%	94.27%
26	Gemini 3.5 Flash (Reasoning, Minimal)	96.60%	98.26%	86.47%
27	Gemini 3.1 Pro (Preview)	96.51%	98.51%	94.37%
28	Grok 4.3 (Reasoning)	96.47%	97.64%	93.60%
29	MoonshotAI: Kimi K2.6	96.37%	98.00%	92.31%
30	MoonshotAI: Kimi K2.5	96.37%	97.79%	91.04%
31	Claude Opus 4.8 (Reasoning)	96.33%	98.78%	92.22%
32	Grok 4.1 Fast	96.32%	97.87%	89.55%
33	Qwen 3.5 Plus (2026-04-20)	96.29%	97.70%	91.51%
34	Claude Opus 4.8 (Reasoning, Low)	96.27%	98.71%	92.14%
35	GPT-5.2	96.25%	97.54%	90.26%
36	Gemini 3 Flash (Preview, Reasoning)	96.17%	98.12%	90.50%
37	GPT-5.5	96.02%	98.20%	89.09%
38	Z.AI GLM 4.7	95.98%	98.22%	88.69%
39	Qwen3.7 Max	95.96%	98.08%	95.75%
40	Gemma 4 31B	95.89%	98.56%	86.91%
41	Claude Sonnet 4.6 (Reasoning)	95.84%	98.30%	93.66%
42	Gemma 4 26B (Reasoning)	95.76%	98.26%	91.49%
43	DeepSeek V3.2	95.62%	95.78%	82.25%
44	Z.AI GLM 4.5	95.45%	95.32%	86.27%
45	DeepSeek V4 Pro	95.32%	97.98%	82.63%
46	Claude Opus 4.6	95.06%	98.35%	92.35%
47	Z.AI GLM 4.6	95.02%	97.78%	89.11%
48	DeepSeek V4 Flash (Reasoning)	94.95%	96.15%	89.01%
49	Qwen 3.5 Plus (2026-02-15)	94.91%	98.10%	85.96%
50	Gemini 2.5 Flash	94.82%	97.83%	80.60%
51	Gemini 2.5 Flash Lite (Reasoning)	94.57%	94.54%	85.75%
52	GPT-5.4	94.09%	96.73%	84.32%
53	MiniMax M3	93.99%	97.51%	90.88%
54	Gemini 3.1 Flash Lite	93.93%	97.35%	85.75%
55	GPT-5 Mini	93.87%	97.13%	92.62%
56	Gemini 3 Flash (Preview)	93.70%	97.54%	85.35%
57	Claude 3.7 Sonnet	93.64%	97.12%	83.39%
58	Aion 2.0	93.58%	95.34%	89.21%
59	Claude Opus 4	93.42%	97.25%	87.69%
60	Grok 4 Fast	93.19%	97.26%	86.15%
61	Claude Opus 4.5	93.08%	97.69%	89.69%
62	GPT-5.4 Mini (Reasoning)	93.03%	95.78%	90.65%
63	Claude Opus 4.7 (Reasoning)	92.89%	97.58%	93.23%
64	Claude Opus 4.7	92.82%	97.55%	89.93%
65	Claude 3.5 Sonnet	92.76%	96.57%	84.24%
66	Gemini 3.1 Flash Lite (Reasoning)	92.71%	96.90%	86.41%
67	Grok 4.20 (Beta)	92.06%	95.49%	83.85%
68	Qwen 3.6 Flash	91.94%	96.09%	90.65%
69	Qwen 3.5 122B	91.84%	96.31%	91.53%
70	Qwen 3.5 35B	91.82%	94.95%	88.00%
71	Gemma 4 26B	91.43%	97.04%	85.84%
72	Stealth: Hunter Alpha	91.32%	95.53%	87.34%
73	Gemini 3.1 Flash Lite (Preview)	91.27%	96.46%	85.87%
74	Xiaomi MIMO v2.5 Pro	91.14%	95.96%	87.36%
75	Mistral Large 2	90.87%	94.16%	82.41%
76	Claude Haiku 4.5	90.77%	96.81%	85.14%
77	Mistral Large	90.77%	95.14%	80.15%
78	Mistral Large 3	90.68%	94.09%	85.43%
79	Qwen 3.6 35B	90.37%	95.10%	89.05%
80	Stealth: Healer Alpha	90.26%	96.04%	85.93%
81	MiniMax M2.5	90.05%	96.02%	88.71%
82	Grok 4.20	89.78%	95.63%	81.70%
83	DeepSeek V4 Flash	89.68%	93.25%	82.02%
84	DeepSeek V3 (2024-12-26)	89.33%	93.58%	83.68%
85	Claude Sonnet 4.6	89.29%	96.37%	91.15%
86	GPT-4.1 Mini	89.26%	95.62%	83.20%
87	Qwen 3.5 Flash	88.93%	92.80%	86.38%
88	ByteDance Seed 2.0 Lite	88.71%	95.03%	84.80%
89	Qwen 3.6 27B	88.52%	93.97%	89.72%
90	GPT-4o, Aug. 6th (temp=0)	88.11%	93.77%	82.45%
91	Xiaomi MIMO v2.5	87.99%	95.34%	85.05%
92	GPT-4o, May 13th (temp=0)	87.89%	95.35%	85.36%
93	o4 Mini High	87.18%	94.36%	90.29%
94	ByteDance Seed 2.0 Mini	86.72%	91.08%	86.91%
95	GPT-4.1	86.53%	94.40%	88.68%
96	Z.AI GLM 4.5 Air	86.09%	94.38%	83.12%
97	GPT-5.4 Mini (Reasoning, Low)	85.94%	92.63%	85.75%
98	Mistral Medium 3.1	84.76%	93.77%	77.83%
99	Gemini 2.5 Flash Lite	84.21%	92.13%	81.08%
100	Llama 3.1 70B	84.01%	92.10%	78.40%
101	DeepSeek-V2 Chat	83.56%	90.90%	84.83%
102	GPT-5.4 Mini	83.34%	90.60%	82.43%
103	GPT-OSS 120B	83.23%	91.73%	86.44%
104	MiniMax M2.7	82.95%	92.14%	89.10%
105	ByteDance Seed 1.6 Flash	81.97%	91.64%	73.27%
106	DeepSeek V3.1	81.88%	87.27%	82.39%
107	Grok 4.3	81.66%	90.19%	78.66%
108	WizardLM 2 8x22b	81.62%	88.13%	71.06%
109	Mistral Small 4	81.54%	91.00%	76.46%
110	GPT-4o, May 13th (temp=1)	79.68%	92.41%	83.80%
111	Qwen3 235B A22B Instruct 2507	79.66%	91.75%	80.10%
112	Mistral Small 4 (Reasoning)	79.45%	90.58%	82.39%
113	DeepSeek V3 (2025-03-24)	79.33%	89.57%	81.99%
114	Writer: Palmyra X5	79.31%	91.20%	79.57%
115	Nemotron 3 Super	79.06%	86.34%	84.56%
116	o4 Mini	77.98%	90.61%	88.35%
117	Qwen 3 32B	77.20%	89.95%	82.21%
118	Arcee AI: Trinity Large (Preview)	76.70%	86.62%	73.33%
119	Mistral Small 3.2 24B	76.07%	89.48%	78.58%
120	Mistral Small Creative	73.96%	90.31%	73.27%
121	Llama 3.1 Nemotron 70B	73.30%	87.26%	74.70%
122	GPT-4o, Aug. 6th (temp=1)	72.28%	86.72%	82.62%
123	Qwen 3.5 9B	71.91%	85.35%	86.05%
124	Qwen 2.5 72B	71.74%	89.18%	75.46%
125	Z.AI GLM 4.7 Flash	70.63%	85.82%	84.82%
126	Inception Mercury 2	70.42%	85.26%	83.85%
127	GPT-4o Mini (temp=1)	70.40%	85.78%	79.08%
128	Hermes 3 405B	69.94%	89.14%	82.86%
129	Ministral 3 14B	69.48%	86.20%	72.54%
130	Gemma 3 12B	67.87%	85.18%	78.41%
131	GPT-4o Mini (temp=0)	67.00%	84.62%	78.29%
132	LFM2 24B	66.33%	71.56%	58.77%
133	Cydonia 24B V4.1	65.80%	86.15%	75.09%
134	Gemma 3 27B	64.41%	86.63%	77.85%
135	Nemotron 3 Nano	62.02%	75.81%	77.73%
136	GPT-5.4 Nano (Reasoning, Low)	61.68%	82.23%	79.48%
137	GPT-5 Nano	61.06%	82.74%	82.60%
138	GPT-5.4 Nano (Reasoning)	59.85%	83.32%	81.36%
139	Inception Mercury	58.12%	79.53%	79.50%
140	Skyfall 36B V2	55.99%	76.69%	65.76%
141	GPT-5.4 Nano	53.44%	79.22%	74.40%
142	Ministral 3 8B	52.05%	78.52%	71.76%
143	Llama 3.1 8B	51.53%	75.45%	63.35%
144	Arcee AI: Trinity Mini	50.86%	73.88%	70.90%
145	GPT-4.1 Nano	50.14%	76.06%	71.94%
146	Ministral 8B	48.88%	77.52%	64.87%
147	Claude 3 Haiku	48.58%	64.36%	71.19%
148	Mistral NeMO	45.51%	73.69%	65.04%
149	Hermes 3 70B	44.57%	63.34%	72.57%
150	Gemma 3 4B	40.80%	78.38%	68.57%
151	Ministral 3B	40.57%	70.91%	61.29%
152	Ministral 3 3B	39.84%	69.80%	67.22%
153	Cohere Command R+ (Aug. 2024)	39.15%	68.40%	69.03%
154	Rocinante 12B	32.19%	56.31%	54.54%
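
To work with these rows outside the page, the table can be read as tab-separated text and re-sorted by any score column. The following is a minimal sketch, assuming the table has been copied as plain text with tabs between columns; the names `SAMPLE`, `parse_percent`, `load_leaderboard`, and `rank_by` are illustrative and not part of the leaderboard itself. Only the first three rows are included to keep the example short.

```python
import csv
import io

# First three rows copied from the table above, tab-separated like the table.
SAMPLE = (
    "#\tModel\tTransformation\tText Editing\tOverall\n"
    "1\tGemini 3.5 Flash (Reasoning)\t98.44%\t97.78%\t94.08%\n"
    "2\tGemini 3 Pro (Preview)\t98.12%\t98.86%\t88.79%\n"
    "3\tGPT-5\t98.11%\t98.90%\t91.93%\n"
)


def parse_percent(text):
    """Convert a string such as '98.44%' to the float 98.44."""
    return float(text.strip().rstrip("%"))


def load_leaderboard(tsv_text):
    """Parse the tab-separated table into a list of dicts with numeric scores."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for record in reader:
        rows.append({
            "rank": int(record["#"]),
            "model": record["Model"],
            "transformation": parse_percent(record["Transformation"]),
            "text_editing": parse_percent(record["Text Editing"]),
            "overall": parse_percent(record["Overall"]),
        })
    return rows


def rank_by(rows, column):
    """Return the rows sorted by the given score column, highest first."""
    return sorted(rows, key=lambda row: row[column], reverse=True)


rows = load_leaderboard(SAMPLE)
by_transformation = [row["model"] for row in rank_by(rows, "transformation")]
by_overall = [row["model"] for row in rank_by(rows, "overall")]
print(by_transformation)
print(by_overall)
```

Sorting by `transformation` reproduces the order shown in the table, while sorting by `overall` can reorder the same models, since the two scores do not always move together.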