Passive voice → active voice

Text Replacement

Tests deterministic text transformations: renaming characters/locations, expanding contractions, tense rewriting, POV shifts, gender swaps, combined transformations, and word avoidance. Scored by checking each expected change independently.
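The scoring idea described above, checking each expected change on its own and reporting the share that passed, can be sketched as follows. This is a minimal illustration, not the benchmark's actual scorer: the `ExpectedChange` structure, the `score_output` function, and the substring-based check are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedChange:
    """One independently checked edit (hypothetical structure)."""
    must_contain: str      # text that should appear after the edit
    must_not_contain: str  # text that should be gone after the edit


def score_output(output: str, changes: list[ExpectedChange]) -> float:
    """Return the fraction of expected changes that were applied."""
    if not changes:
        return 1.0
    passed = sum(
        1
        for change in changes
        if change.must_contain in output and change.must_not_contain not in output
    )
    return passed / len(changes)


# Assumed sample: the source sentence was "Anna didn't go to Paris."
output = "Maria did not go to Paris."
changes = [
    ExpectedChange(must_contain="Maria", must_not_contain="Anna"),      # rename character
    ExpectedChange(must_contain="did not", must_not_contain="didn't"),  # expand contraction
    ExpectedChange(must_contain="Lyon", must_not_contain="Paris"),      # rename location (missed)
]
print(f"{score_output(output, changes):.0%}")  # prints "67%"
```

Because every change is checked separately, a model that misses one rename still earns credit for the edits it applied correctly.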

Text Editing Hallucination

Performance Score Distribution (Top 20)

Model	Score
Gemini 3.1 Pro (Preview)	99%
GPT-5.5 (Reasoning)	99%
Z.AI GLM 5.2 (Reasoning, High)	99%
Aion 3.0	99%
Gemini 2.5 Pro	99%
Claude Opus 4.8 (Reasoning)	99%
Gemini 3.5 Flash (Reasoning, Minimal)	99%
GPT-5.6 Sol (Reasoning)	99%
Claude Opus 4.8 (Reasoning, Low)	99%
Z.AI GLM 5.1	99%
Gemma 4 31B (Reasoning)	99%
GPT-5.6 Luna (Reasoning)	99%
Claude Sonnet 4.5	99%
Claude Opus 4.6 (Reasoning)	99%
Grok 4.5 (Reasoning, High)	99%
Claude Sonnet 4.6 (Reasoning)	99%
Grok 4.20 (Reasoning)	99%
Claude Opus 4.7	99%
Gemini 3 Flash (Preview)	99%
Gemma 4 31B	99%

Price-Performance Score Distribution (Top 20)

Model	Score	Cost	Time
Gemini 3 Flash (Preview)	99%	$0.0027	4.4s
DeepSeek V4 Flash	98%	$0.0003	11.7s
Gemini 3.5 Flash (Reasoning, Minimal)	99%	$0.0082	3.4s
Grok 4.20	97%	$0.0029	6.9s
Gemini 2.5 Flash	97%	$0.0021	3.1s
Qwen 3.5 Plus (2026-02-15)	97%	$0.0022	9.5s
Gemini 3.1 Flash Lite (Preview)	95%	$0.0014	2.4s
Gemma 4 31B	99%	$0.0004	48.7s
Gemini 2.5 Flash Lite (Reasoning)	95%	$0.0031	25.5s
Mistral Large 3	96%	$0.0016	10.4s
DeepSeek V4 Pro	96%	$0.0021	25.4s
Gemini 3.1 Flash Lite	95%	$0.0014	2.4s
Claude Haiku 4.5	96%	$0.0052	6.2s
DeepSeek V3.2	99%	$0.0006	54.4s
Gemini 3.1 Flash Lite (Reasoning)	95%	$0.0014	13.9s
GPT-5.6 Terra	98%	$0.013	4.5s
DeepSeek V3.1	90%	$0.0009	33.9s
GPT-5.6 Luna (Reasoning)	99%	$0.013	13.6s
Claude Sonnet 4.5	99%	$0.016	7.0s
GPT-5.6 Luna	95%	$0.0054	3.3s

Top Overall Models (Top 20)

Model	Score	Cost	Speed	Stability
Gemini 3 Flash (Preview)	99%	$0.0027	4.4s	97%
Gemini 3.5 Flash (Reasoning, Minimal)	99%	$0.0082	3.4s	99%
DeepSeek V4 Flash	98%	$0.0003	11.7s	97%
Grok 4.20	97%	$0.0029	6.9s	97%
Gemini 2.5 Flash	97%	$0.0021	3.1s	95%
Qwen 3.5 Plus (2026-02-15)	97%	$0.0022	9.5s	95%
GPT-5.6 Luna (Reasoning)	99%	$0.013	13.6s	98%
Gemma 4 31B	99%	$0.0004	48.7s	98%
Claude Sonnet 4.5	99%	$0.016	7.0s	97%
GPT-5.6 Terra	98%	$0.013	4.5s	97%
Mistral Large 3	96%	$0.0016	10.4s	95%
Claude Sonnet 4.6	98%	$0.016	7.4s	98%
DeepSeek V3.2	99%	$0.0006	54.4s	97%
Gemini 3.1 Flash Lite	95%	$0.0014	2.4s	93%
Claude Sonnet 5	98%	$0.015	12.0s	97%
Grok 4.5 (Reasoning, Low)	98%	$0.013	19.7s	97%
Gemini 3.1 Flash Lite (Preview)	95%	$0.0014	2.4s	92%
GPT-5.6 Luna	95%	$0.0054	3.3s	93%
DeepSeek V4 Pro	96%	$0.0021	25.4s	94%
Mistral Large 2	96%	$0.0064	10.4s	95%

Rank	Model	Avg. Cost	Avg. Time	Stability	# 1	# 2	# 3	# 4	# 5	# 6	# 7	Total
138	Gemini 3.1 Pro (Preview)	$0.176	2.9m	99%	99	99	99	99	99	99	99	99%
93	GPT-5.5 (Reasoning)	$0.085	29.9s	98%	100	99	99	99	99	99	98	99%
49	Z.AI GLM 5.2 (Reasoning, High)	$0.021	1.2m	98%	100	99	99	99	99	99	98	99%
115	Aion 3.0	$0.061	3.3m	99%	99	99	99	99	99	99	99	99%
78	Gemini 2.5 Pro	$0.060	43.9s	97%	100	100	100	99	99	98	98	99%
108	Claude Opus 4.8 (Reasoning)	$0.107	38.4s	99%	99	99	99	99	99	99	98	99%
2	Gemini 3.5 Flash (Reasoning, Minimal)	$0.0082	3.4s	99%	99	99	99	99	99	99	98	99%
82	GPT-5.6 Sol (Reasoning)	$0.075	29.3s	98%	100	99	99	99	98	98	98	99%
107	Claude Opus 4.8 (Reasoning, Low)	$0.106	38.3s	99%	99	99	99	99	99	98	98	99%
120	Z.AI GLM 5.1	$0.056	4.2m	97%	100	99	99	99	99	98	97	99%
128	Gemma 4 31B (Reasoning)	$0.0034	8.1m	97%	100	99	99	99	99	98	97	99%
7	GPT-5.6 Luna (Reasoning)	$0.013	13.6s	98%	99	99	99	99	98	98	98	99%
9	Claude Sonnet 4.5	$0.016	7.0s	97%	100	99	99	99	98	98	97	99%
140	Claude Opus 4.6 (Reasoning)	$0.201	1.8m	97%	100	100	99	99	98	98	98	99%
85	Grok 4.5 (Reasoning, High)	$0.051	1.6m	98%	99	99	99	99	98	98	98	99%
132	Claude Sonnet 4.6 (Reasoning)	$0.154	1.9m	97%	100	99	99	98	98	98	98	99%
59	Grok 4.20 (Reasoning)	$0.021	1.2m	97%	99	99	99	98	98	98	98	99%
41	Claude Opus 4.7	$0.037	7.5s	98%	99	99	98	98	98	98	98	99%
1	Gemini 3 Flash (Preview)	$0.0027	4.4s	97%	99	99	99	98	98	98	98	99%
8	Gemma 4 31B	$0.0004	48.7s	98%	99	99	98	98	98	98	98	99%
50	GPT-5.5 (Reasoning, Low)	$0.038	14.4s	97%	99	99	99	98	98	98	98	99%
13	DeepSeek V3.2	$0.0006	54.4s	97%	99	99	99	99	98	98	97	99%
91	GPT-5.4 (Reasoning)	$0.066	1.0m	96%	100	100	98	98	98	98	96	98%
53	Z.AI GLM 5 Turbo	$0.024	54.7s	97%	99	99	98	98	98	98	98	98%
27	Claude Opus 4.6	$0.026	8.7s	98%	98	98	98	98	98	98	98	98%
71	Claude Sonnet 5 (Reasoning)	$0.044	34.3s	98%	99	98	98	98	98	98	98	98%
33	GPT-5.5	$0.027	7.3s	97%	99	99	98	98	98	98	98	98%
126	Qwen3.6 Max Preview	$0.071	4.7m	96%	100	100	99	99	98	97	96	98%
113	DeepSeek V4 Pro (Reasoning)	$0.018	4.6m	95%	100	100	99	98	98	98	97	98%
96	Claude Opus 4.7 (Reasoning)	$0.087	19.5s	97%	99	99	98	98	98	98	97	98%
34	GPT-5.6 Sol	$0.027	7.0s	95%	100	100	100	99	98	98	94	98%
130	MiniMax M3	$0.020	7.6m	95%	100	100	99	99	98	98	94	98%
89	GPT-5	$0.055	1.5m	97%	99	99	99	99	98	97	96	98%
16	Grok 4.5 (Reasoning, Low)	$0.013	19.7s	97%	99	98	98	98	98	98	97	98%
12	Claude Sonnet 4.6	$0.016	7.4s	98%	98	98	98	98	98	98	98	98%
116	Z.AI GLM 5	$0.031	4.7m	97%	99	99	98	98	98	98	97	98%
72	Claude Sonnet 5 (Reasoning, Low)	$0.044	34.6s	98%	98	98	98	98	98	98	98	98%
88	Gemma 4 26B (Reasoning)	$0.0046	3.7m	96%	99	99	99	99	98	97	95	98%
21	Claude Sonnet 4	$0.016	9.2s	96%	100	99	98	98	97	97	96	98%
10	GPT-5.6 Terra	$0.013	4.5s	97%	98	98	98	98	98	98	96	98%
60	o4 Mini	$0.029	43.2s	96%	99	99	98	98	98	97	96	98%
68	GPT-5.1	$0.035	43.5s	96%	99	98	98	98	98	98	97	98%
123	Qwen 3.5 27B	$0.053	4.2m	95%	100	99	98	98	97	97	97	98%
15	Claude Sonnet 5	$0.015	12.0s	97%	98	98	98	98	98	97	97	98%
104	Gemini 3.5 Flash (Reasoning)	$0.093	40.3s	95%	100	100	98	98	97	97	97	98%
81	Z.AI GLM 4.7	$0.020	2.4m	95%	99	99	99	98	98	98	94	98%
58	GPT-5.6 Terra (Reasoning)	$0.034	17.6s	95%	99	99	98	98	97	97	96	98%
3	DeepSeek V4 Flash	$0.0003	11.7s	97%	98	98	98	98	98	98	98	98%
105	Qwen 3.5 397B A17B	$0.013	4.3m	94%	100	100	98	97	97	97	96	98%
145	MoonshotAI: Kimi K2.6	$0.083	9.3m	91%	100	100	100	100	99	98	87	98%
4	Grok 4.20	$0.0029	6.9s	97%	98	98	98	98	98	98	96	97%
83	Qwen 3.6 27B	$0.028	2.0m	93%	99	99	99	96	96	96	95	97%
36	Claude Opus 4.5	$0.026	8.0s	96%	98	98	98	97	97	97	97	97%
45	GPT-5 Mini	$0.0095	1.0m	95%	99	98	98	98	96	95	95	97%
122	Qwen3.7 Max	$0.084	2.5m	94%	99	98	98	97	97	97	95	97%
46	DeepSeek V4 Flash (Reasoning)	$0.0018	1.3m	94%	99	99	98	98	96	95	94	97%
69	GPT-5.2	$0.032	30.1s	93%	100	99	98	97	97	95	95	97%
56	GPT-5.4 (Reasoning, Low)	$0.027	19.9s	94%	99	98	97	96	96	96	96	97%
94	Claude Opus 4	$0.078	13.3s	94%	99	98	98	97	96	96	95	97%
52	Xiaomi MIMO v2.5 Pro	$0.011	49.1s	92%	100	99	99	98	96	95	92	97%
61	Gemini 3 Flash (Preview, Reasoning)	$0.026	44.8s	95%	98	97	97	97	97	97	96	97%
76	Grok 4.3 (Reasoning)	$0.022	1.8m	95%	98	97	97	97	96	96	96	97%
5	Gemini 2.5 Flash	$0.0021	3.1s	95%	98	98	97	96	96	96	96	97%
37	Gemini 2.5 Flash (Reasoning)	$0.015	23.2s	94%	98	98	97	97	97	96	94	97%
48	Z.AI GLM 4.6	$0.010	57.5s	95%	98	98	98	97	96	96	95	97%
106	Qwen 3.5 122B	$0.048	2.4m	92%	99	98	98	96	96	95	94	97%
57	ByteDance Seed 1.6	$0.0081	1.5m	97%	97	97	97	97	97	97	97	97%
6	Qwen 3.5 Plus (2026-02-15)	$0.0022	9.5s	95%	98	98	96	96	96	96	96	97%
54	Z.AI GLM 4.5	$0.0070	1.0m	92%	99	98	98	96	96	94	93	97%
26	GPT-5.4	$0.013	8.2s	95%	98	97	97	97	96	96	95	97%
19	DeepSeek V4 Pro	$0.0021	25.4s	94%	98	97	97	96	96	95	95	96%
112	o4 Mini High	$0.068	1.7m	93%	99	98	98	97	96	95	92	96%
87	Qwen 3.5 Plus (2026-04-20)	$0.022	2.3m	94%	98	97	97	96	95	95	94	96%
22	Claude Haiku 4.5	$0.0052	6.2s	92%	99	97	96	96	95	93	93	96%
65	Qwen 3.6 Flash	$0.016	49.1s	92%	99	96	96	96	95	95	92	96%
11	Mistral Large 3	$0.0016	10.4s	95%	96	96	96	96	96	96	95	96%
20	Mistral Large 2	$0.0064	10.4s	95%	96	96	96	96	96	95	95	96%
74	Qwen 3.6 35B	$0.014	1.2m	91%	98	98	96	94	94	94	94	95%
42	GPT-4o, Aug. 6th (temp=0)	$0.0097	4.1s	88%	99	99	99	95	92	92	92	95%
40	WizardLM 2 8x22b	$0.0011	48.1s	93%	98	97	96	96	95	94	93	95%
73	GPT-5.4 Mini (Reasoning)	$0.022	50.2s	93%	97	96	96	95	95	95	94	95%
18	GPT-5.6 Luna	$0.0054	3.3s	93%	97	96	96	95	95	95	94	95%
142	MoonshotAI: Kimi K2.5	$0.035	8.9m	86%	99	98	98	98	97	96	80	95%
80	Qwen 3.5 35B	$0.028	1.4m	92%	98	98	95	95	94	94	93	95%
29	DeepSeek V3 (2024-12-26)	$0.0012	22.5s	92%	97	97	97	95	94	94	93	95%
14	Gemini 3.1 Flash Lite	$0.0014	2.4s	93%	97	96	95	95	95	95	94	95%
17	Gemini 3.1 Flash Lite (Preview)	$0.0014	2.4s	92%	97	97	96	95	94	94	92	95%
117	ByteDance Seed 2.0 Mini	$0.0044	5.1m	92%	98	96	96	95	94	94	93	95%
55	Gemini 2.5 Flash Lite (Reasoning)	$0.0031	25.5s	86%	99	98	98	98	96	95	81	95%
31	Writer: Palmyra X5	$0.0052	9.7s	93%	97	95	95	95	94	94	93	95%
86	Z.AI GLM 4.5 Air	$0.0095	2.4m	91%	98	96	96	96	94	93	90	95%
63	MiniMax M2.5	$0.0025	1.2m	92%	96	96	95	95	94	94	92	95%
25	Gemini 3.1 Flash Lite (Reasoning)	$0.0014	13.9s	93%	96	95	95	95	95	94	92	95%
32	Grok 4.3	$0.0031	5.6s	91%	98	96	95	95	93	92	92	94%
35	Gemma 4 26B	$0.0004	31.0s	92%	96	96	96	94	94	93	92	94%
28	Qwen3 235B A22B Instruct 2507	$0.0004	17.2s	93%	95	95	95	95	94	93	93	94%
23	Gemini 2.5 Flash Lite	$0.0004	2.4s	92%	96	94	94	93	93	93	93	94%
43	Xiaomi MIMO v2.5	$0.0034	12.7s	88%	98	96	96	94	92	91	88	94%
66	Hermes 3 405B	$0.0018	27.1s	84%	98	98	97	96	95	94	78	94%
24	Mistral Medium 3.1	$0.0019	6.3s	93%	94	94	94	94	94	93	92	94%
30	Gemma 3 12B	$0.0001	12.1s	92%	95	94	94	94	94	93	91	94%
38	GPT-4.1 Mini	$0.0016	9.7s	90%	95	94	94	92	92	92	91	93%
39	Mistral Small 3.2 24B	$0.0003	6.7s	89%	95	95	93	92	92	90	90	93%
51	GPT-5.4 Mini (Reasoning, Low)	$0.0065	11.5s	89%	94	94	93	92	92	92	89	92%
47	GPT-4.1	$0.0078	5.6s	90%	94	93	92	92	92	92	91	92%
44	GPT-4o Mini (temp=0)	$0.0006	14.1s	90%	92	92	91	91	91	91	91	91%
109	ByteDance Seed 2.0 Lite	$0.0094	1.8m	73%	96	96	95	95	94	94	62	90%
101	DeepSeek V3.1	$0.0009	33.9s	62%	98	98	98	97	95	95	46	90%
62	GPT-5.4 Mini	$0.0040	2.6s	87%	92	90	90	89	89	89	87	90%
70	Qwen 2.5 72B	$0.0004	16.5s	86%	92	90	89	89	88	88	86	89%
64	Mistral Small 4	$0.0006	4.5s	86%	92	89	89	89	89	88	87	89%
75	Llama 3.1 70B	$0.0007	24.8s	83%	93	92	89	88	87	87	85	89%
67	Ministral 3 14B	$0.0003	5.9s	86%	90	89	88	88	88	87	86	88%
114	Qwen 3 32B	$0.0011	1.0m	60%	98	96	95	95	94	93	43	88%
134	Aion 3.0 Mini	$0.011	2.9m	44%	99	99	99	99	99	98	20	88%
79	ByteDance Seed 1.6 Flash	$0.0013	24.3s	79%	93	93	91	88	88	83	78	88%
77	GPT-4o Mini (temp=1)	$0.0006	13.5s	81%	91	90	90	89	89	89	76	88%
90	Cydonia 24B V4.1	$0.0006	20.7s	74%	95	93	91	85	84	80	76	86%
84	Gemma 3 27B	$0.0003	24.0s	77%	94	93	92	91	79	77	77	86%
129	Qwen 3.5 Flash	$0.0059	1.8m	45%	98	98	98	97	97	94	20	86%
97	GPT-5.4 Nano (Reasoning)	$0.0049	30.4s	75%	95	88	87	84	83	82	78	85%
119	Mistral Small 4 (Reasoning)	$0.0048	46.1s	59%	93	93	93	92	90	89	40	84%
100	GPT-5.4 Nano (Reasoning, Low)	$0.0017	10.7s	66%	94	93	91	78	78	77	77	84%
121	GPT-OSS 120B	$0.0011	1.1m	57%	98	98	97	96	95	52	52	84%
136	Aion 2.0	$0.0085	1.8m	31%	99	98	98	98	97	96	0	84%
143	MiniMax M2.7	$0.023	5.4m	52%	98	97	96	96	96	59	37	83%
131	Z.AI GLM 4.7 Flash	$0.0041	3.1m	59%	96	95	94	92	90	63	48	83%
95	Arcee AI: Trinity Mini	$0.0004	11.6s	76%	88	83	82	81	80	80	78	82%
125	DeepSeek V3 (2025-03-24)	$0.0009	48.3s	45%	97	97	96	95	94	70	20	81%
133	Nemotron 3 Super	$0.0000	3.6m	56%	99	97	96	94	78	52	49	81%
124	GPT-5 Nano	$0.0064	2.3m	70%	89	89	83	82	80	72	67	80%
98	GPT-5.4 Nano	$0.0011	3.7s	73%	88	82	81	79	78	77	77	80%
99	Mistral NeMO	$0.0003	2.3s	71%	88	85	78	78	78	78	78	80%
92	Gemma 3 4B	$0.0001	11.8s	80%	80	80	80	80	80	80	80	80%
144	Nemotron 3 Nano	$0.0052	5.9m	48%	94	93	92	90	90	49	33	77%
127	GPT-4o, Aug. 6th (temp=1)	$0.0088	3.5s	45%	98	93	92	88	75	74	20	77%
110	GPT-4.1 Nano	$0.0004	4.9s	70%	84	80	76	76	75	72	72	76%
137	Qwen 3.5 9B	$0.0023	3.2m	48%	99	96	96	95	52	52	39	76%
111	Ministral 3B	$0.0001	2.9s	70%	79	78	78	78	76	76	64	75%
102	Ministral 3 8B	$0.0003	4.0s	75%	76	76	76	76	76	75	75	75%
103	Ministral 8B	$0.0002	3.9s	74%	76	75	75	75	75	75	74	75%
135	DeepSeek-V2 Chat	$0.0011	20.5s	31%	97	94	94	92	91	20	20	73%
118	Ministral 3 3B	$0.0002	2.9s	67%	78	78	78	78	64	64	62	72%
141	Inception Mercury 2	$0.0034	4.7s	28%	98	97	52	52	52	42	37	61%
139	Cohere Command R+ (Aug. 2024)	$0.0095	20.5s	41%	78	76	71	54	53	50	46	61%
146	Hermes 3 70B	$0.0039	5.3m	0%	92	91	0	0	0	0	0	26%
92.32%
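The page does not define how the Total column is computed. For the rows checked, it is consistent, up to rounding of the displayed per-run values, with the arithmetic mean of the seven run scores. The sketch below reproduces two totals from the table under that assumption; the function name `mean_total` is hypothetical.

```python
def mean_total(runs: list[int]) -> int:
    """Average the per-run scores and round to a whole percent (assumed formula)."""
    return round(sum(runs) / len(runs))


# Run scores copied from the table above.
hermes_3_70b = [92, 91, 0, 0, 0, 0, 0]
kimi_k2_6 = [100, 100, 100, 100, 99, 98, 87]

print(mean_total(hermes_3_70b))  # prints 26, matching the listed Total of 26%
print(mean_total(kimi_k2_6))     # prints 98, matching the listed Total of 98%
```

Since the per-run columns are themselves rounded, a recomputed mean can differ from the listed Total by one point for some rows.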

Median	Evaluator	Top 3	Flop 3
97.9%	Dialogue content preserved unchanged	100 Qwen 3.5 Plus (2026-04-20); 100 MoonshotAI: Kimi K2.6; 100 Grok 4.5 (Reasoning, High)	24 Hermes 3 70B; 57 Inception Mercury 2; 69 DeepSeek-V2 Chat
100.0%	No hallucinated or fabricated content	100 WizardLM 2 8x22b; 100 GPT-4.1 Nano; 100 MoonshotAI: Kimi K2.6	29 Hermes 3 70B; 61 Cohere Command R+ (Aug. 2024); 79 Nemotron 3 Nano
96.4%	Non-passive narration preserved	100 Z.AI GLM 5 Turbo; 100 Qwen3 235B A22B Instruct 2507; 100 GPT-5.6 Luna (Reasoning)	29 Hermes 3 70B; 55 Inception Mercury 2; 63 Cohere Command R+ (Aug. 2024)
89.3%	Passive → active voice transformations	99 Claude Opus 4.6 (Reasoning); 99 GPT-5.6 Sol; 98 Gemini 3.5 Flash (Reasoning)	0 Gemma 3 4B; 0 Ministral 3 3B; 1 Ministral 3B
100.0%	Structural similarity to original	100 MiniMax M2.5; 100 Llama 3.1 70B; 100 Gemini 3.1 Flash Lite (Preview)	29 Hermes 3 70B; 76 Inception Mercury 2; 76 Qwen 3.5 9B

Charts: Performance Score Distribution (Top 20); Price-Performance Score Distribution (Top 20); Most Stable Models (Top 20); Top Overall Models (Top 20)

Most Stable Models (Top 20)

Model	Score	Consistency	Stability
Gemini 3.1 Pro (Preview)	99%	100%	99%
Aion 3.0	99%	100%	99%
Claude Opus 4.8 (Reasoning)	99%	99%	99%
Gemini 3.5 Flash (Reasoning, Minimal)	99%	99%	99%
Claude Opus 4.8 (Reasoning, Low)	99%	99%	99%
Claude Opus 4.6	98%	100%	98%
GPT-5.5 (Reasoning)	99%	99%	98%
Z.AI GLM 5.2 (Reasoning, High)	99%	99%	98%
GPT-5.6 Sol (Reasoning)	99%	99%	98%
GPT-5.6 Luna (Reasoning)	99%	99%	98%
Grok 4.5 (Reasoning, High)	99%	99%	98%
Claude Sonnet 4.6	98%	99%	98%
Claude Opus 4.7	99%	99%	98%
Gemma 4 31B	99%	99%	98%
Claude Sonnet 5 (Reasoning, Low)	98%	99%	98%
Claude Sonnet 5 (Reasoning)	98%	99%	98%
Gemini 2.5 Pro	99%	98%	97%
Claude Sonnet 4.5	99%	98%	97%
DeepSeek V3.2	99%	98%	97%
Grok 4.20 (Reasoning)	99%	99%	97%