Passive voice → active voice

Text Replacement

Tests deterministic text transformations: renaming characters or locations, expanding contractions, rewriting tense, shifting POV, swapping genders, combined transformations, and word avoidance. Each expected change is checked independently for scoring.
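
As a rough illustration of independent per-change scoring, the sketch below checks a list of expected changes against a model's output, one at a time. The ExpectedChange structure, the substring-based checks, and the sample sentence are assumptions made for illustration; the benchmark's actual checker is not described on this page.

```python
from dataclasses import dataclass


@dataclass
class ExpectedChange:
    """One expected edit (hypothetical structure, not the benchmark's own)."""
    must_contain: str      # text that should appear after the edit
    must_not_contain: str  # text that should be gone after the edit


def score_output(output: str, changes: list[ExpectedChange]) -> float:
    """Return the percentage of expected changes that were applied.

    Each change is checked on its own, so one missed edit does not
    affect the credit given for the others.
    """
    if not changes:
        return 100.0
    passed = sum(
        1
        for c in changes
        if c.must_contain in output and c.must_not_contain not in output
    )
    return 100.0 * passed / len(changes)


# Two expected edits: a character rename and a contraction expansion.
changes = [
    ExpectedChange(must_contain="Mara", must_not_contain="Anna"),
    ExpectedChange(must_contain="did not", must_not_contain="didn't"),
]
output = "Mara didn't answer."  # rename applied, contraction left in place
print(score_output(output, changes))  # 50.0
```

Under this scheme a partially correct rewrite still earns credit for the changes it got right, which is why scores cluster in the 90s and do not collapse to pass/fail.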

Text Editing Hallucination

Performance Score Distribution (Top 20)

Model	Score
Claude Opus 4.8 (Reasoning)	98%
Z.AI GLM 5.2 (Reasoning, High)	98%
Claude Opus 4.8 (Reasoning, Low)	98%
Claude Opus 4.6	98%
Gemma 4 31B	98%
Claude Opus 4.5	97%
Aion 3.0	97%
GPT-5	97%
GPT-5.4 (Reasoning)	97%
GPT-5.5 (Reasoning, Low)	97%
Claude Sonnet 4	97%
Grok 4.20 (Reasoning)	97%
Z.AI GLM 5.1	97%
GPT-5.5 (Reasoning)	97%
Gemini 3.5 Flash (Reasoning)	97%
Qwen 3.5 397B A17B	97%
Qwen 3.5 27B	97%
Grok 4.5 (Reasoning, High)	97%
GPT-5.4 (Reasoning, Low)	97%
Grok 4.5 (Reasoning, Low)	97%

Model	Score	Cost	Time
Z.AI GLM 5.2 (Reasoning, High)	98%	$0.0057	16.7s
Gemma 4 31B	98%	$0.0005	39.7s
Qwen 3.5 Plus (2026-02-15)	96%	$0.0021	10.9s
DeepSeek V4 Flash	89%	$0.0002	9.5s
Gemma 4 26B	95%	$0.0003	23.2s
Gemini 2.5 Flash	95%	$0.0021	2.9s
Gemini 3.1 Flash Lite	94%	$0.0013	2.5s
Gemini 3.1 Flash Lite (Preview)	93%	$0.0013	2.4s
Gemini 3 Flash (Preview)	94%	$0.0026	4.3s
Gemini 2.5 Flash Lite	90%	$0.0004	2.5s
DeepSeek V3.1	89%	$0.0008	37.0s
Gemini 3.1 Flash Lite (Reasoning)	93%	$0.0013	2.5s
DeepSeek V3.2	96%	$0.0008	52.0s
Gemini 3.5 Flash (Reasoning, Minimal)	96%	$0.0080	3.5s
Claude Haiku 4.5	95%	$0.0050	5.3s
DeepSeek V4 Flash (Reasoning)	94%	$0.0009	3.2m
DeepSeek V4 Pro	93%	$0.0019	19.3s
DeepSeek-V2 Chat	93%	$0.0011	22.6s
DeepSeek V3 (2024-12-26)	93%	$0.0012	23.8s
Xiaomi MIMO v2.5	94%	$0.0046	17.2s

Model	Score	Cost	Time	Stability
Z.AI GLM 5.2 (Reasoning, High)	98%	$0.0057	16.7s	96%
Qwen 3.5 Plus (2026-02-15)	96%	$0.0021	10.9s	94%
Gemini 3.5 Flash (Reasoning, Minimal)	96%	$0.0080	3.5s	94%
Gemma 4 31B	98%	$0.0005	39.7s	96%
Gemini 2.5 Flash	95%	$0.0021	2.9s	92%
Gemini 3.1 Flash Lite	94%	$0.0013	2.5s	92%
GPT-5.4 (Reasoning, Low)	97%	$0.013	8.3s	95%
Gemma 4 26B	95%	$0.0003	23.2s	94%
Gemini 3 Flash (Preview)	94%	$0.0026	4.3s	92%
Claude Sonnet 4	97%	$0.015	9.5s	96%
Gemini 3.1 Flash Lite (Preview)	93%	$0.0013	2.4s	91%
GPT-5.6 Terra	96%	$0.013	3.9s	94%
Gemini 3.1 Flash Lite (Reasoning)	93%	$0.0013	2.5s	90%
Claude Haiku 4.5	95%	$0.0050	5.3s	90%
Claude Opus 4.5	97%	$0.025	8.2s	97%
Claude Sonnet 4.5	96%	$0.015	7.4s	93%
Claude Opus 4.6	98%	$0.025	8.7s	96%
Grok 4.5 (Reasoning, Low)	97%	$0.015	22.8s	96%
GPT-5.6 Luna (Reasoning)	94%	$0.0079	7.6s	91%
Grok 4.20	93%	$0.0029	6.1s	90%

Rank	Model	Avg. Cost	Avg. Time	Stability	# 1	# 2	# 3	# 4	# 5	# 6	# 7	Total
35	Claude Opus 4.8 (Reasoning)	$0.036	11.1s	97%	100	98	98	98	98	98	97	98%
1	Z.AI GLM 5.2 (Reasoning, High)	$0.0057	16.7s	96%	100	99	99	99	98	98	96	98%
41	Claude Opus 4.8 (Reasoning, Low)	$0.036	11.5s	96%	99	98	98	98	98	98	95	98%
17	Claude Opus 4.6	$0.025	8.7s	96%	98	98	98	98	98	97	97	98%
4	Gemma 4 31B	$0.0005	39.7s	96%	99	98	98	98	98	96	96	98%
15	Claude Opus 4.5	$0.025	8.2s	97%	98	98	98	98	98	97	97	97%
112	Aion 3.0	$0.047	2.4m	96%	99	98	98	98	97	97	97	97%
81	GPT-5	$0.045	1.1m	96%	98	98	98	98	97	97	96	97%
63	GPT-5.4 (Reasoning)	$0.039	34.2s	96%	98	98	98	97	97	97	96	97%
27	GPT-5.5 (Reasoning, Low)	$0.026	8.6s	95%	98	98	97	97	97	96	96	97%
10	Claude Sonnet 4	$0.015	9.5s	96%	97	97	97	96	96	96	96	97%
53	Grok 4.20 (Reasoning)	$0.015	59.7s	96%	98	98	98	98	97	96	95	97%
106	Z.AI GLM 5.1	$0.033	2.7m	95%	98	98	97	97	97	97	95	97%
70	GPT-5.5 (Reasoning)	$0.050	20.2s	96%	98	98	97	97	97	96	96	97%
85	Gemini 3.5 Flash (Reasoning)	$0.061	25.7s	95%	98	98	98	97	97	96	95	97%
93	Qwen 3.5 397B A17B	$0.0084	3.1m	95%	98	98	98	97	97	96	95	97%
72	Qwen 3.5 27B	$0.021	1.6m	96%	98	98	97	97	97	96	96	97%
77	Grok 4.5 (Reasoning, High)	$0.038	1.1m	95%	98	98	97	97	96	96	96	97%
7	GPT-5.4 (Reasoning, Low)	$0.013	8.3s	95%	98	98	97	97	96	96	96	97%
18	Grok 4.5 (Reasoning, Low)	$0.015	22.8s	96%	98	97	97	97	97	96	96	97%
80	Qwen 3.5 Plus (2026-04-20)	$0.018	1.9m	95%	98	98	97	97	96	96	95	97%
123	Gemma 4 31B (Reasoning)	$0.0021	4.5m	95%	98	98	97	97	96	96	95	97%
51	Z.AI GLM 5 Turbo	$0.020	46.1s	96%	97	97	97	97	96	96	95	96%
83	Claude Opus 4.6 (Reasoning)	$0.058	25.1s	94%	98	97	97	96	96	95	95	96%
55	Claude Opus 4.7	$0.035	7.0s	94%	98	98	97	97	95	95	93	96%
16	Claude Sonnet 4.5	$0.015	7.4s	93%	98	97	96	96	96	96	95	96%
61	Claude Opus 4.7 (Reasoning)	$0.041	8.4s	94%	98	97	97	97	96	95	93	96%
45	GPT-5.1	$0.023	24.6s	94%	98	98	97	96	96	95	95	96%
131	Qwen3.6 Max Preview	$0.062	3.4m	94%	98	98	96	96	95	95	95	96%
46	Claude Sonnet 5 (Reasoning, Low)	$0.024	19.0s	93%	98	98	96	96	95	95	94	96%
140	Gemini 3.1 Pro (Preview)	$0.160	2.4m	95%	97	97	96	96	96	95	95	96%
30	GPT-5.5	$0.026	6.3s	95%	97	97	96	96	96	96	94	96%
52	Aion 2.0	$0.0054	1.1m	94%	98	98	96	96	96	95	94	96%
78	Z.AI GLM 5	$0.016	1.8m	93%	98	98	97	96	95	94	94	96%
3	Gemini 3.5 Flash (Reasoning, Minimal)	$0.0080	3.5s	94%	98	97	96	96	95	95	95	96%
116	Z.AI GLM 4.7	$0.013	3.6m	95%	97	97	96	96	95	95	95	96%
2	Qwen 3.5 Plus (2026-02-15)	$0.0021	10.9s	94%	96	96	96	96	96	96	93	96%
95	Aion 3.0 Mini	$0.0099	2.9m	94%	97	97	96	96	96	96	93	96%
23	DeepSeek V3.2	$0.0008	52.0s	94%	97	97	96	96	96	95	94	96%
22	Claude Sonnet 5	$0.014	11.5s	93%	98	98	97	96	94	94	93	96%
64	GPT-5.6 Sol (Reasoning)	$0.040	15.5s	94%	97	97	96	95	95	95	94	96%
122	Claude Sonnet 4.6 (Reasoning)	$0.092	59.3s	95%	97	96	96	96	95	95	94	96%
86	DeepSeek V4 Pro (Reasoning)	$0.0086	2.3m	95%	97	96	96	96	95	95	95	96%
12	GPT-5.6 Terra	$0.013	3.9s	94%	97	96	96	96	96	95	93	96%
8	Gemma 4 26B	$0.0003	23.2s	94%	97	96	96	96	95	95	93	95%
65	ByteDance Seed 1.6	$0.0074	1.4m	93%	97	97	97	95	95	95	94	95%
87	Gemini 2.5 Pro	$0.052	40.0s	93%	98	97	96	95	94	94	93	95%
50	Claude Sonnet 5 (Reasoning)	$0.024	18.0s	93%	98	96	95	95	95	94	94	95%
142	MoonshotAI: Kimi K2.6	$0.060	5.9m	93%	97	96	96	95	95	95	92	95%
128	MoonshotAI: Kimi K2.5	$0.023	4.1m	93%	97	96	96	95	95	95	93	95%
100	Gemma 4 26B (Reasoning)	$0.0038	3.2m	93%	97	96	96	95	95	94	92	95%
47	Gemini 3 Flash (Preview, Reasoning)	$0.018	29.1s	93%	96	96	96	95	94	94	94	95%
117	Qwen3.7 Max	$0.060	1.8m	93%	96	96	96	95	94	94	93	95%
14	Claude Haiku 4.5	$0.0050	5.3s	90%	99	96	95	94	94	93	93	95%
43	GPT-5.6 Sol	$0.026	6.1s	93%	97	96	95	95	95	94	94	95%
5	Gemini 2.5 Flash	$0.0021	2.9s	92%	97	96	95	95	94	94	93	95%
54	Z.AI GLM 4.6	$0.0052	58.2s	92%	98	96	95	95	94	94	93	95%
36	GPT-5 Mini	$0.0068	39.9s	93%	96	96	95	95	95	93	93	95%
29	Gemini 2.5 Flash Lite (Reasoning)	$0.0034	40.3s	93%	97	96	95	95	94	94	93	95%
28	GPT-5.6 Terra (Reasoning)	$0.017	7.4s	93%	96	96	95	94	94	94	93	95%
62	MiniMax M2.5	$0.0018	1.3m	91%	97	96	96	95	94	92	91	94%
34	Gemini 2.5 Flash (Reasoning)	$0.014	21.5s	93%	96	95	95	95	94	94	93	94%
19	GPT-5.6 Luna (Reasoning)	$0.0079	7.6s	91%	96	95	95	95	94	94	89	94%
9	Gemini 3 Flash (Preview)	$0.0026	4.3s	92%	95	95	95	94	94	94	91	94%
40	Z.AI GLM 4.5	$0.0055	44.8s	93%	95	95	95	95	94	93	92	94%
108	DeepSeek V4 Flash (Reasoning)	$0.0009	3.2m	89%	97	96	95	95	94	94	87	94%
129	MiniMax M3	$0.013	4.4m	91%	96	95	94	94	94	93	92	94%
6	Gemini 3.1 Flash Lite	$0.0013	2.5s	92%	96	94	94	94	94	93	93	94%
31	GPT-5.4	$0.013	8.5s	91%	96	96	96	96	92	91	90	94%
25	Xiaomi MIMO v2.5	$0.0046	17.2s	90%	96	96	95	94	93	92	89	94%
60	Qwen 3.6 Flash	$0.013	41.4s	91%	96	95	94	94	93	92	92	94%
48	GPT-5.2	$0.018	14.2s	91%	95	95	95	94	93	93	91	94%
32	Claude Sonnet 4.6	$0.015	7.5s	92%	95	94	94	93	93	93	93	94%
21	DeepSeek V4 Pro	$0.0019	19.3s	91%	95	95	94	94	92	91	91	93%
89	Grok 4.3 (Reasoning)	$0.017	1.3m	84%	97	97	96	95	95	95	79	93%
73	Qwen 3.5 Flash	$0.0041	1.0m	84%	97	96	96	95	95	94	79	93%
13	Gemini 3.1 Flash Lite (Reasoning)	$0.0013	2.5s	90%	95	94	94	93	92	92	91	93%
11	Gemini 3.1 Flash Lite (Preview)	$0.0013	2.4s	91%	94	94	94	94	93	93	90	93%
26	DeepSeek-V2 Chat	$0.0011	22.6s	91%	94	94	93	93	93	92	91	93%
33	DeepSeek V3 (2024-12-26)	$0.0012	23.8s	90%	95	94	93	93	93	91	90	93%
110	Qwen 3.5 122B	$0.035	1.5m	83%	98	98	97	96	96	84	81	93%
76	ByteDance Seed 2.0 Lite	$0.0076	1.4m	90%	94	94	94	93	91	91	90	93%
20	Grok 4.20	$0.0029	6.1s	90%	94	94	93	92	92	91	91	93%
56	GPT-5.4 Mini (Reasoning)	$0.013	25.4s	91%	93	93	93	92	92	91	91	92%
119	Claude Opus 4	$0.075	13.4s	83%	96	96	95	95	94	94	77	92%
69	MiniMax M2.7	$0.0048	47.5s	87%	96	94	92	91	91	89	89	92%
42	Xiaomi MIMO v2.5 Pro	$0.0034	15.4s	88%	95	93	92	92	92	89	88	92%
37	Mistral Large 2	$0.0061	10.7s	90%	92	92	91	91	91	91	91	91%
24	Mistral Large 3	$0.0015	10.5s	90%	92	91	91	91	91	91	91	91%
44	GPT-5.4 Mini (Reasoning, Low)	$0.0040	4.3s	87%	94	93	91	90	89	89	88	91%
38	GPT-5.4 Mini	$0.0039	3.8s	88%	93	91	91	90	90	89	89	91%
59	ByteDance Seed 1.6 Flash	$0.0011	19.6s	85%	94	94	92	89	89	88	88	90%
39	GPT-4.1 Mini	$0.0015	10.8s	88%	92	92	91	91	91	89	87	90%
49	GPT-5.6 Luna	$0.0052	7.5s	88%	92	91	90	90	90	89	88	90%
67	GPT-OSS 120B	$0.0009	25.8s	83%	94	92	92	92	91	90	79	90%
79	Z.AI GLM 4.5 Air	$0.0023	38.9s	79%	97	96	95	93	92	80	77	90%
57	Gemini 2.5 Flash Lite	$0.0004	2.5s	82%	94	93	93	93	89	88	77	90%
124	Qwen 3.5 35B	$0.023	1.3m	72%	97	96	96	96	94	85	61	89%
84	DeepSeek V4 Flash	$0.0002	9.5s	71%	96	95	95	95	95	90	59	89%
71	GPT-4.1	$0.0074	7.2s	81%	94	94	94	93	90	81	77	89%
113	DeepSeek V3.1	$0.0008	37.0s	60%	98	97	97	96	95	95	43	89%
58	Mistral Small 4	$0.0006	4.6s	85%	90	89	89	88	87	86	85	88%
82	DeepSeek V3 (2025-03-24)	$0.0008	36.8s	79%	95	93	93	92	86	78	77	88%
94	Qwen 3.6 35B	$0.0084	50.0s	76%	94	94	93	92	92	78	70	88%
127	ByteDance Seed 2.0 Mini	$0.0028	3.0m	78%	93	92	92	91	88	79	75	87%
74	Qwen 3 32B	$0.0009	25.8s	81%	91	91	90	88	87	83	79	87%
75	Mistral Small 4 (Reasoning)	$0.0024	22.5s	80%	93	91	91	90	86	79	78	87%
68	Qwen 2.5 72B	$0.0004	16.0s	83%	90	89	88	88	85	84	83	87%
66	Mistral Small 3.2 24B	$0.0003	7.0s	84%	86	86	86	86	85	85	82	85%
91	Grok 4.3	$0.0029	6.4s	68%	97	96	92	90	83	74	62	85%
133	o4 Mini High	$0.043	1.1m	64%	94	94	92	92	86	83	49	84%
90	Writer: Palmyra X5	$0.0049	17.6s	74%	90	90	87	86	86	73	72	83%
92	Llama 3.1 70B	$0.0007	13.1s	71%	91	90	87	80	80	78	74	83%
126	Z.AI GLM 4.7 Flash	$0.0027	1.8m	72%	90	89	87	84	79	78	68	82%
98	GPT-4o Mini (temp=1)	$0.0005	11.4s	69%	89	89	88	78	76	76	76	82%
130	GPT-4o, Aug. 6th (temp=0)	$0.0085	3.7s	46%	93	93	92	92	91	87	20	81%
88	Mistral Medium 3.1	$0.0018	6.2s	75%	84	83	83	79	79	78	78	81%
103	Inception Mercury 2	$0.0020	2.6s	67%	90	90	78	78	77	74	71	80%
138	Qwen 3.5 9B	$0.0018	2.7m	57%	96	94	82	79	79	79	49	80%
107	Qwen3 235B A22B Instruct 2507	$0.0005	21.2s	69%	86	86	83	78	77	73	73	79%
141	Qwen 3.6 27B	$0.026	1.8m	42%	97	97	97	96	94	34	34	78%
114	Hermes 3 70B	$0.0005	22.6s	67%	89	82	81	81	79	78	59	78%
104	Cydonia 24B V4.1	$0.0006	15.3s	72%	82	82	77	77	76	75	72	77%
105	Hermes 3 405B	$0.0016	36.3s	75%	79	78	78	78	76	76	75	77%
97	Gemma 3 4B	$0.0001	8.4s	73%	80	79	78	76	76	76	74	77%
109	Ministral 3 14B	$0.0003	5.5s	67%	82	82	82	74	74	74	71	77%
102	Gemma 3 27B	$0.0003	23.9s	75%	77	77	77	77	77	75	75	76%
132	GPT-4o, Aug. 6th (temp=1)	$0.0084	4.6s	46%	91	89	89	86	81	78	20	76%
96	GPT-4o Mini (temp=0)	$0.0005	13.6s	76%	76	76	76	76	76	76	76	76%
99	Arcee AI: Trinity Mini	$0.0003	10.7s	74%	79	78	78	78	73	73	73	76%
101	Gemma 3 12B	$0.0001	13.5s	74%	76	76	76	76	76	76	74	75%
118	GPT-5.4 Nano (Reasoning, Low)	$0.0011	4.1s	66%	83	81	74	73	71	71	69	75%
111	Ministral 3 8B	$0.0002	4.3s	70%	77	77	76	75	73	70	67	74%
121	GPT-5.4 Nano (Reasoning)	$0.0020	14.2s	69%	78	75	73	72	72	71	71	73%
115	GPT-5.4 Nano	$0.0010	3.8s	69%	75	74	74	72	71	71	69	72%
125	GPT-4.1 Nano	$0.0004	5.0s	61%	84	81	78	73	66	64	60	72%
136	o4 Mini	$0.022	35.1s	54%	91	81	80	78	77	49	47	72%
120	Ministral 8B	$0.0002	4.0s	68%	76	72	72	71	71	69	68	71%
139	Nemotron 3 Super	$0.0000	1.4m	45%	96	95	77	76	58	58	34	71%
135	GPT-5 Nano	$0.0039	1.5m	62%	77	76	75	71	66	63	62	70%
144	WizardLM 2 8x22b	$0.0016	2.3m	27%	94	93	92	86	72	27	3	67%
143	Nemotron 3 Nano	$0.0017	2.1m	35%	88	85	60	56	48	40	35	59%
134	Ministral 3B	$0.0001	2.8s	55%	65	62	62	61	55	52	51	58%
137	Ministral 3 3B	$0.0002	2.9s	49%	67	64	58	57	57	49	48	57%
145	Mistral NeMO	$0.0002	3.9s	22%	63	52	49	32	24	20	17	37%
146	Cohere Command R+ (Aug. 2024)	$0.0088	16.4s	24%	40	29	27	27	27	23	22	28%
88.63%
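
The Total column appears consistent with the arithmetic mean of the seven score columns (# 1 to # 7), rounded to a whole percent. This is an inference from the figures, not a documented formula, and a row can differ by a point because the per-column values shown are themselves already rounded. A quick check on three rows copied from the table:

```python
from statistics import mean

# (scores from columns # 1 to # 7, Total as listed in the table)
rows = {
    "Claude Opus 4.8 (Reasoning)": ([100, 98, 98, 98, 98, 98, 97], 98),
    "WizardLM 2 8x22b": ([94, 93, 92, 86, 72, 27, 3], 67),
    "Cohere Command R+ (Aug. 2024)": ([40, 29, 27, 27, 27, 23, 22], 28),
}


def total_from_scores(scores: list[int]) -> int:
    """Assumed formula: mean of the seven scores, rounded to a whole percent."""
    return round(mean(scores))


for name, (scores, listed) in rows.items():
    print(f"{name}: computed {total_from_scores(scores)}%, listed {listed}%")
```

All three rows match the listed Total under this assumption.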

Median	Evaluator	Top 3	Flop 3
95.7%	Dialogue content preserved unchanged	100 GPT-5.5 (Reasoning, Low); 100 GPT-5.4 (Reasoning, Low); 100 Qwen3.7 Max	14 Cohere Command R+ (Aug. 2024); 36 Mistral NeMO; 64 Nemotron 3 Super
100.0%	No hallucinated or fabricated content	100 Qwen 3.6 35B; 100 Claude Haiku 4.5; 100 Gemini 2.5 Flash Lite (Reasoning)	15 Cohere Command R+ (Aug. 2024); 52 Mistral NeMO; 71 Qwen3 235B A22B Instruct 2507
87.5%	Non-passive narration preserved	100 Claude Sonnet 4; 100 Claude Opus 4.6; 100 Gemma 4 26B	11 Cohere Command R+ (Aug. 2024); 16 Mistral NeMO; 38 Ministral 3 3B
81.9%	Passive → active voice transformations	98 Claude Sonnet 4.5; 98 Grok 4.20 (Reasoning); 98 GPT-5	0 Gemma 3 12B; 2 Mistral NeMO; 2 Ministral 3 3B
100.0%	Structural similarity to original	100 Gemini 2.5 Flash (Reasoning); 100 Gemini 2.5 Flash; 100 Ministral 3B	72 WizardLM 2 8x22b; 73 Nemotron 3 Nano; 76 Nemotron 3 Super
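
To make the "Dialogue content preserved unchanged" evaluator more concrete, here is one way such a check could work: extract the quoted spans from the original and the edited text, then count how many original spans survive verbatim. The regex, the function names, and the sample sentences are assumptions for illustration only; the page does not say how the real evaluator is implemented.

```python
import re

# Matches text between straight double quotes (assumed dialogue convention).
QUOTE_RE = re.compile(r'"([^"]*)"')


def dialogue_spans(text: str) -> list[str]:
    """Return the contents of all double-quoted spans, in order."""
    return QUOTE_RE.findall(text)


def dialogue_preserved(original: str, edited: str) -> float:
    """Percentage of original dialogue spans found verbatim in the edit."""
    spans = dialogue_spans(original)
    if not spans:
        return 100.0
    edited_spans = dialogue_spans(edited)
    kept = sum(1 for s in spans if s in edited_spans)
    return 100.0 * kept / len(spans)


original = 'The door was opened by Tom. "Who was sent by the king?" he asked.'
good_edit = 'Tom opened the door. "Who was sent by the king?" he asked.'
bad_edit = 'Tom opened the door. "Who did the king send?" he asked.'

print(dialogue_preserved(original, good_edit))  # 100.0
print(dialogue_preserved(original, bad_edit))   # 0.0
```

The second edit turns the narration active correctly but also rewrites the passive construction inside the quotation, which is the kind of over-editing this evaluator would penalize.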
