Bad Writing Habits
Detects common prose-quality anti-patterns in AI-generated creative writing, including passive voice, overuse of the past progressive, weak dialogue tags, filter words, purple prose, clichés, and characteristic AI-ism words, adverbs, and names, among others.
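A minimal sketch of how detectors for a few of these anti-patterns might work. The word lists and the passive-voice regex below are illustrative assumptions, not the benchmark's actual rules:

```python
import re

# Illustrative word lists — NOT the benchmark's actual rule sets.
FILTER_WORDS = {"saw", "heard", "felt", "noticed", "realized", "wondered"}
WEAK_DIALOGUE_TAGS = {"exclaimed", "stated", "queried", "opined"}
# Crude passive-voice heuristic: a be-verb followed by an -ed participle.
PASSIVE = re.compile(r"\b(?:was|were|is|are|been|being)\s+\w+ed\b", re.I)

def detect_anti_patterns(text):
    """Return only the anti-pattern categories that fired, with their hits."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    hits = {
        "filter_words": sorted(words & FILTER_WORDS),
        "weak_dialogue_tags": sorted(words & WEAK_DIALOGUE_TAGS),
        "passive_voice": PASSIVE.findall(text),
    }
    return {k: v for k, v in hits.items() if v}

sample = 'She felt the door was opened by the wind. "Stop!" he exclaimed.'
print(detect_anti_patterns(sample))
```

A real checker would need part-of-speech tagging to avoid false positives (e.g. adjectival participles), but this shows the shape of a rule-based pass.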
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-5.4 Mini | 87% | $0.015 | 16.8s | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.015 | 16.8s | |
| GPT-5.4 Mini (Reasoning) | 88% | $0.022 | 28.1s | |
| GPT-5.4 | 90% | $0.049 | 1.4m | |
| GPT-5.4 (Reasoning, Low) | 90% | $0.055 | 1.4m | |
| Writer: Palmyra X5 | 84% | $0.011 | 22.0s | |
| Z.AI GLM 5 Turbo | 84% | $0.0081 | 33.2s | |
| Qwen3 235B A22B Instruct 2507 | 85% | $0.0011 | 59.2s | |
| Grok 4.20 (Beta) | 82% | $0.018 | 15.8s | |
| Mistral Small 4 (Reasoning) | 82% | $0.0022 | 30.2s | |
| Claude Sonnet 4.5 | 84% | $0.035 | 38.1s | |
| Z.AI GLM 5 | 83% | $0.0084 | 1.2m | |
| Mistral Small 4 | 81% | $0.0014 | 18.2s | |
| Rocinante 12B | 82% | $0.0014 | 38.4s | |
| Mistral Medium 3.1 | 81% | $0.0048 | 36.5s | |
| Grok 4.20 (Beta, Reasoning) | 83% | $0.039 | 34.0s | |
| DeepSeek V3 (2025-03-24) | 82% | $0.0014 | 39.4s | |
| Grok 4.1 Fast | 81% | $0.0018 | 37.8s | |
| Qwen 3.5 Flash | 81% | $0.0025 | 47.5s | |
| GPT-5.4 Nano (Reasoning, Low) | 81% | $0.0055 | 20.6s | |
Cost vs Performance
Compares total cost for this test against the test score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
5 low-scoring outliers hidden: Inception Mercury 2 (67.7%), GPT-5 Nano (67.7%), Inception Mercury (67.5%), Stealth: Aurora Alpha (66.9%), Nemotron 3 Nano (65.7%).
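The quadrant placement described above can be sketched with stdlib `statistics.median`. The rows below are a small illustrative subset of the table, and the quadrant labels are my own naming, not the page's:

```python
from statistics import median

# (model, total cost in $, score %) — a few rows from the table above
models = [
    ("GPT-5.4 Mini", 0.015, 87),
    ("GPT-5.4", 0.049, 90),
    ("Qwen3 235B A22B Instruct 2507", 0.0011, 85),
    ("Mistral Small 4", 0.0014, 81),
]

# Quadrant lines are drawn at the median cost and median score.
cost_med = median(c for _, c, _ in models)
score_med = median(s for _, _, s in models)

def quadrant(cost, score):
    side = "cheap" if cost <= cost_med else "expensive"
    level = "strong" if score >= score_med else "weak"
    return f"{side}/{level}"

for name, cost, score in models:
    print(name, quadrant(cost, score))
```

Note that with median splits roughly half the models land on each side of every line, so "weak" here only means below the median of the plotted field.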
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| GPT-5.4 | 90% | 94% | 85% | |
| GPT-5.4 (Reasoning, Low) | 90% | 94% | 85% | |
| GPT-5.4 (Reasoning) | 90% | 94% | 85% | |
| GPT-5.4 Mini | 87% | 95% | 83% | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | 95% | 82% | |
| GPT-5.4 Mini (Reasoning) | 88% | 94% | 82% | |
| GPT-5.1 | 86% | 93% | 80% | |
| Qwen 3.5 397B A17B | 85% | 92% | 79% | |
| GPT-5 | 84% | 93% | 78% | |
| Gemini 3.1 Pro (Preview) | 83% | 92% | 77% | |
| Qwen3 235B A22B Instruct 2507 | 85% | 91% | 77% | |
| Z.AI GLM 5 Turbo | 84% | 90% | 76% | |
| Claude Opus 4.6 (Reasoning) | 84% | 91% | 76% | |
| Writer: Palmyra X5 | 84% | 89% | 76% | |
| Qwen 3.5 Flash | 81% | 93% | 76% | |
| Claude Opus 4 | 84% | 90% | 76% | |
| Qwen 3.5 9B | 81% | 93% | 75% | |
| Grok 4.20 (Beta, Reasoning) | 83% | 89% | 75% | |
| Grok 4.1 Fast | 81% | 92% | 75% | |
| Qwen 3.5 35B | 81% | 92% | 75% | |
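Stability above is defined as median × consistency. A minimal sketch, assuming consistency is derived from the spread of per-run scores (the page does not give its exact formula, so the spread-based definition below is an assumption):

```python
from statistics import median, pstdev

def stability(run_scores):
    """Stability = median score × consistency, on a 0-1 scale.

    Assumption: consistency is modeled here as 1 minus the population
    standard deviation of the run scores; the site may define it differently.
    """
    med = median(run_scores)
    consistency = 1 - pstdev(run_scores)
    return med * consistency

runs = [0.90, 0.89, 0.91, 0.90]  # hypothetical per-run scores for one model
print(round(stability(runs), 3))
```

Whatever the exact consistency formula, the multiplicative form means a high-scoring but erratic model is penalized relative to a slightly lower-scoring but repeatable one.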
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed, and stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| GPT-5.4 Mini | 87% | $0.015 | 16.8s | 83% | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.015 | 16.8s | 82% | |
| GPT-5.4 Mini (Reasoning) | 88% | $0.022 | 28.1s | 82% | |
| GPT-5.4 | 90% | $0.049 | 1.4m | 85% | |
| GPT-5.4 (Reasoning, Low) | 90% | $0.055 | 1.4m | 85% | |
| Qwen3 235B A22B Instruct 2507 | 85% | $0.0011 | 59.2s | 77% | |
| Writer: Palmyra X5 | 84% | $0.011 | 22.0s | 76% | |
| Z.AI GLM 5 Turbo | 84% | $0.0081 | 33.2s | 76% | |
| Mistral Small 4 (Reasoning) | 82% | $0.0022 | 30.2s | 75% | |
| Mistral Small 4 | 81% | $0.0014 | 18.2s | 74% | |
| Grok 4.20 (Beta) | 82% | $0.018 | 15.8s | 74% | |
| Grok 4.1 Fast | 81% | $0.0018 | 37.8s | 75% | |
| Qwen 3.5 Flash | 81% | $0.0025 | 47.5s | 76% | |
| GPT-5.4 Nano (Reasoning, Low) | 81% | $0.0055 | 20.6s | 74% | |
| Mistral Medium 3.1 | 81% | $0.0048 | 36.5s | 75% | |
| DeepSeek V3 (2025-03-24) | 82% | $0.0014 | 39.4s | 74% | |
| GPT-5.4 Nano | 80% | $0.0057 | 26.3s | 75% | |
| GPT-5.4 Nano (Reasoning) | 80% | $0.0061 | 24.5s | 75% | |
| Claude Sonnet 4.5 | 84% | $0.035 | 38.1s | 75% | |
| GPT-5.4 (Reasoning) | 90% | $0.089 | 2.6m | 85% | |
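The composite ranking blends performance, cost, speed, and stability. Since the page does not publish its weighting, the sketch below assumes an equal-weight blend of min-max-normalized metrics, with cost and time inverted so that cheaper and faster score higher:

```python
def normalize(values, invert=False):
    """Min-max normalize to [0, 1]; invert for metrics where lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    norm = [(v - lo) / (hi - lo) for v in values]
    return [1 - n for n in norm] if invert else norm

def composite(scores, costs, times, stabilities):
    # Equal-weight average of normalized metrics — an assumed weighting.
    parts = [
        normalize(scores),
        normalize(costs, invert=True),   # cheaper is better
        normalize(times, invert=True),   # faster is better
        normalize(stabilities),
    ]
    return [sum(col) / len(col) for col in zip(*parts)]

# Illustrative rows from the table above: (score %, cost $, time s, stability %)
rows = [(87, 0.015, 16.8, 83), (90, 0.049, 84.0, 85), (85, 0.0011, 59.2, 77)]
print(composite(*map(list, zip(*rows))))
```

On this toy subset the equal-weight blend already reproduces the table's ordering, with the mid-priced, fast GPT-5.4 Mini edging out the higher-scoring but slower and pricier GPT-5.4.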
Scores by prompt and scenario: the first six scenario columns use the "genre" prompt, the next six the Novelcrafter Default Prompt, and the last six the Detailed Writing Rules prompt.
| Model | Total | Literary fiction: old friends reunite | Thriller: chase through city streets | Romance: separated couple reunites | Fantasy: entering an ancient ruin | Mystery: examining a crime scene | Horror: alone in an eerie place at night | Literary fiction: old friends reunite | Thriller: chase through city streets | Romance: separated couple reunites | Fantasy: entering an ancient ruin | Mystery: examining a crime scene | Horror: alone in an eerie place at night | Literary fiction: old friends reunite | Thriller: chase through city streets | Romance: separated couple reunites | Fantasy: entering an ancient ruin | Mystery: examining a crime scene | Horror: alone in an eerie place at night |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | 90% | 89% | 92% | 92% | 84% | 92% | 88% | 88% | 92% | 89% | 90% | 91% | 90% | 91% | 93% | 92% | 89% | 92% | 90% |
| GPT-5.4 (Reasoning) | 90% | 88% | 91% | 91% | 89% | 92% | 88% | 86% | 93% | 89% | 90% | 93% | 92% | 87% | 91% | 88% | 92% | 91% | 93% |
| GPT-5.4 (Reasoning, Low) | 90% | 89% | 91% | 90% | 86% | 91% | 86% | 86% | 93% | 90% | 90% | 89% | 90% | 88% | 91% | 92% | 89% | 93% | 90% |
| GPT-5.4 Mini (Reasoning) | 88% | 86% | 90% | 89% | 84% | 89% | 85% | 84% | 90% | 87% | 85% | 89% | 86% | 87% | 90% | 87% | 87% | 90% | 90% |
| GPT-5.4 Mini | 87% | 88% | 89% | 88% | 83% | 89% | 86% | 87% | 87% | 90% | 86% | 87% | 85% | 88% | 88% | 89% | 86% | 89% | 86% |
| GPT-5.4 Mini (Reasoning, Low) | 87% | 87% | 87% | 90% | 83% | 87% | 86% | 84% | 88% | 86% | 84% | 87% | 86% | 86% | 89% | 88% | 87% | 88% | 88% |
| GPT-5.1 | 86% | 84% | 83% | 83% | 80% | 85% | 83% | 84% | 88% | 87% | 87% | 90% | 88% | 85% | 90% | 86% | 86% | 90% | 89% |
| Qwen 3.5 397B A17B | 85% | 78% | 87% | 83% | 80% | 84% | 86% | 85% | 86% | 85% | 86% | 85% | 86% | 87% | 91% | 87% | 84% | 88% | 87% |
| Qwen3 235B A22B Instruct 2507 | 85% | 83% | 86% | 83% | 78% | 81% | 81% | 86% | 87% | 83% | 80% | 85% | 85% | 89% | 91% | 87% | 84% | 87% | 87% |
| GPT-5 | 84% | 81% | 82% | 80% | 79% | 85% | 83% | 84% | 85% | 85% | 82% | 88% | 86% | 86% | 86% | 84% | 85% | 87% | 89% |
| Writer: Palmyra X5 | 84% | 83% | 86% | 80% | 75% | 81% | 81% | 85% | 88% | 83% | 78% | 83% | 84% | 90% | 90% | 90% | 84% | 89% | 85% |
| Claude Sonnet 4.5 | 84% | 81% | 85% | 80% | 77% | 85% | 79% | 84% | 85% | 83% | 74% | 83% | 86% | 86% | 90% | 88% | 87% | 90% | 89% |
| Z.AI GLM 5 Turbo | 84% | 81% | 85% | 84% | 77% | 80% | 83% | 83% | 87% | 84% | 77% | 86% | 84% | 84% | 90% | 85% | 85% | 88% | 88% |
| Claude Opus 4.6 (Reasoning) | 84% | 78% | 81% | 78% | 78% | 81% | 83% | 81% | 82% | 85% | 84% | 86% | 82% | 85% | 90% | 89% | 85% | 92% | 88% |
| Claude Opus 4 | 84% | 82% | 81% | 85% | 74% | 83% | 79% | 83% | 85% | 86% | 77% | 82% | 83% | 92% | 86% | 88% | 84% | 88% | 87% |
Detailed Writing Rules
Literary fiction: old friends reunite
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Grok 4.20 (Beta) | 88% | $0.017 | 14.1s | |
| Writer: Palmyra X5 | 90% | $0.014 | 23.9s | |
| Qwen3 235B A22B Instruct 2507 | 89% | $0.0013 | 1.2m | |
| Mistral Medium 3.1 | 87% | $0.0052 | 38.3s | |
| Mistral Small 4 (Reasoning) | 85% | $0.0025 | 29.1s | |
| Mistral Large | 86% | $0.018 | 33.2s | |
| DeepSeek V3 (2025-03-24) | 85% | $0.0018 | 38.5s | |
| Qwen 3 32B | 86% | $0.0019 | 1.7m | |
| GPT-5.4 Mini | 88% | $0.015 | 16.1s | |
| Z.AI GLM 5 | 88% | $0.011 | 1.1m | |
| Hermes 3 405B | 85% | $0.0054 | 35.9s | |
| MiniMax M2.7 | 86% | $0.0040 | 1.3m | |
| ByteDance Seed 1.6 Flash | 84% | $0.0014 | 27.7s | |
| Mistral Small 4 | 84% | $0.0022 | 26.2s | |
| GPT-5.4 Nano (Reasoning, Low) | 84% | $0.0044 | 18.2s | |
| Grok 4.1 Fast | 85% | $0.0026 | 47.0s | |
| GPT-5.4 Nano | 84% | $0.0051 | 18.4s | |
| GPT-5.4 Mini (Reasoning) | 87% | $0.030 | 32.6s | |
| GPT-5.4 Mini (Reasoning, Low) | 86% | $0.015 | 15.7s | |
| Mistral Large 2 | 86% | $0.018 | 32.8s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| Claude Opus 4 | 92% | 96% | 89% | |
| Qwen3 235B A22B Instruct 2507 | 89% | 96% | 86% | |
| Writer: Palmyra X5 | 90% | 96% | 86% | |
| GPT-5.4 | 91% | 94% | 86% | |
| GPT-5.4 (Reasoning, Low) | 88% | 97% | 85% | |
| GPT-5.4 Mini (Reasoning) | 87% | 97% | 85% | |
| Z.AI GLM 5 | 88% | 97% | 85% | |
| GPT-5.4 (Reasoning) | 87% | 96% | 84% | |
| Qwen 3.5 397B A17B | 87% | 95% | 84% | |
| Claude Sonnet 4.6 (Reasoning) | 85% | 97% | 83% | |
| GPT-5.4 Mini | 88% | 96% | 83% | |
| o4 Mini | 84% | 99% | 83% | |
| MiniMax M2.7 | 86% | 96% | 83% | |
| GPT-5.2 | 84% | 98% | 82% | |
| WizardLM 2 8x22b | 84% | 97% | 82% | |
| MiniMax M2.5 | 85% | 95% | 82% | |
| Grok 4.1 Fast | 85% | 96% | 82% | |
| Mistral Large 2 | 86% | 96% | 82% | |
| GPT-5 | 86% | 96% | 82% | |
| Qwen 3 32B | 86% | 94% | 82% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed, and stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Writer: Palmyra X5 | 90% | $0.014 | 23.9s | 86% | |
| Qwen3 235B A22B Instruct 2507 | 89% | $0.0013 | 1.2m | 86% | |
| GPT-5.4 Mini | 88% | $0.015 | 16.1s | 83% | |
| Grok 4.20 (Beta) | 88% | $0.017 | 14.1s | 82% | |
| Z.AI GLM 5 | 88% | $0.011 | 1.1m | 85% | |
| GPT-5.4 | 91% | $0.049 | 1.4m | 86% | |
| GPT-5.4 Mini (Reasoning) | 87% | $0.030 | 32.6s | 85% | |
| Mistral Medium 3.1 | 87% | $0.0052 | 38.3s | 81% | |
| Mistral Small 4 (Reasoning) | 85% | $0.0025 | 29.1s | 82% | |
| GPT-5.4 Mini (Reasoning, Low) | 86% | $0.015 | 15.7s | 81% | |
| Mistral Large 2 | 86% | $0.018 | 32.8s | 82% | |
| Grok 4.1 Fast | 85% | $0.0026 | 47.0s | 82% | |
| GPT-5.4 Nano | 84% | $0.0051 | 18.4s | 81% | |
| DeepSeek V3 (2025-03-24) | 85% | $0.0018 | 38.5s | 81% | |
| o4 Mini | 84% | $0.014 | 25.2s | 83% | |
| MiniMax M2.7 | 86% | $0.0040 | 1.3m | 83% | |
| Mistral Large | 86% | $0.018 | 33.2s | 81% | |
| GPT-5.4 Nano (Reasoning, Low) | 84% | $0.0044 | 18.2s | 80% | |
| Claude Sonnet 4 | 88% | $0.043 | 51.6s | 81% | |
| GPT-5.4 Nano (Reasoning) | 84% | $0.0059 | 25.8s | 81% | |
Thriller: chase through city streets
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Z.AI GLM 5 Turbo | 90% | $0.0078 | 26.6s | |
| Writer: Palmyra X5 | 90% | $0.011 | 18.7s | |
| Qwen3 235B A22B Instruct 2507 | 91% | $0.0014 | 59.9s | |
| Z.AI GLM 5 | 91% | $0.0075 | 44.3s | |
| GPT-5.4 Mini (Reasoning) | 90% | $0.023 | 28.1s | |
| GPT-5.4 Mini (Reasoning, Low) | 89% | $0.014 | 16.8s | |
| GPT-5.4 (Reasoning, Low) | 91% | $0.050 | 1.2m | |
| Rocinante 12B | 87% | $0.0015 | 36.4s | |
| GPT-5.4 | 93% | $0.046 | 1.4m | |
| GPT-5.4 Mini | 88% | $0.015 | 16.7s | |
| Claude Sonnet 4.6 | 90% | $0.036 | 40.3s | |
| Claude Sonnet 4.5 | 90% | $0.041 | 37.3s | |
| DeepSeek V3 (2025-03-24) | 84% | $0.0016 | 14.5s | |
| ByteDance Seed 1.6 Flash | 84% | $0.0012 | 24.5s | |
| Claude Sonnet 4 | 86% | $0.038 | 44.7s | |
| MiniMax M2.5 | 86% | $0.0034 | 1.6m | |
| Qwen 3.5 397B A17B | 91% | $0.0049 | 3.3m | |
| Hermes 3 70B | 82% | $0.0015 | 21.9s | |
| Hermes 3 405B | 84% | $0.0054 | 37.5s | |
| Claude Sonnet 4.6 (Reasoning) | 90% | $0.065 | 1.1m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| GPT-5.4 | 93% | 99% | 91% | |
| Qwen3 235B A22B Instruct 2507 | 91% | 98% | 90% | |
| GPT-5.4 (Reasoning, Low) | 91% | 97% | 89% | |
| GPT-5.4 (Reasoning) | 91% | 98% | 89% | |
| Qwen 3.5 397B A17B | 91% | 96% | 87% | |
| Writer: Palmyra X5 | 90% | 96% | 87% | |
| GPT-5.4 Mini | 88% | 99% | 87% | |
| GPT-5.4 Mini (Reasoning) | 90% | 97% | 87% | |
| Claude Sonnet 4.6 (Reasoning) | 90% | 96% | 86% | |
| Z.AI GLM 5 Turbo | 90% | 93% | 86% | |
| GPT-5.1 | 90% | 96% | 86% | |
| Z.AI GLM 5 | 91% | 95% | 86% | |
| GPT-5.4 Mini (Reasoning, Low) | 89% | 97% | 86% | |
| Claude Sonnet 4.5 | 90% | 95% | 85% | |
| Claude Opus 4.6 (Reasoning) | 90% | 95% | 85% | |
| Claude Opus 4.5 | 87% | 96% | 85% | |
| Claude Opus 4.6 | 89% | 95% | 84% | |
| GPT-5 | 86% | 97% | 83% | |
| MiniMax M2.5 | 86% | 95% | 83% | |
| Claude Sonnet 4.6 | 90% | 92% | 82% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed, and stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Qwen3 235B A22B Instruct 2507 | 91% | $0.0014 | 59.9s | 90% | |
| GPT-5.4 | 93% | $0.046 | 1.4m | 91% | |
| Writer: Palmyra X5 | 90% | $0.011 | 18.7s | 87% | |
| Z.AI GLM 5 Turbo | 90% | $0.0078 | 26.6s | 86% | |
| Z.AI GLM 5 | 91% | $0.0075 | 44.3s | 86% | |
| GPT-5.4 Mini (Reasoning) | 90% | $0.023 | 28.1s | 87% | |
| GPT-5.4 (Reasoning, Low) | 91% | $0.050 | 1.2m | 89% | |
| GPT-5.4 Mini | 88% | $0.015 | 16.7s | 87% | |
| GPT-5.4 Mini (Reasoning, Low) | 89% | $0.014 | 16.8s | 86% | |
| Qwen 3.5 397B A17B | 91% | $0.0049 | 3.3m | 87% | |
| Claude Sonnet 4.5 | 90% | $0.041 | 37.3s | 85% | |
| Claude Sonnet 4.6 (Reasoning) | 90% | $0.065 | 1.1m | 86% | |
| Claude Sonnet 4.6 | 90% | $0.036 | 40.3s | 82% | |
| Rocinante 12B | 87% | $0.0015 | 36.4s | 80% | |
| Mistral Medium 3.1 | 86% | $0.0059 | 40.6s | 82% | |
| GPT-5.1 | 90% | $0.052 | 2.2m | 86% | |
| MiniMax M2.5 | 86% | $0.0034 | 1.6m | 83% | |
| MiniMax M2.7 | 87% | $0.0035 | 1.1m | 80% | |
| Claude Opus 4.6 (Reasoning) | 90% | $0.091 | 1.2m | 85% | |
| Claude Opus 4.5 | 87% | $0.069 | 43.9s | 85% | |
Romance: separated couple reunites
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| Writer: Palmyra X5 | 90% | $0.013 | 22.8s | |
| GPT-5.4 (Reasoning, Low) | 92% | $0.056 | 1.3m | |
| GPT-5.4 Mini | 89% | $0.014 | 15.7s | |
| GPT-5.4 Mini (Reasoning, Low) | 88% | $0.014 | 16.1s | |
| GPT-5.4 | 92% | $0.051 | 1.3m | |
| Grok 4.20 (Beta) | 87% | $0.019 | 15.1s | |
| Claude Sonnet 4 | 88% | $0.045 | 54.2s | |
| Hermes 3 405B | 84% | $0.0054 | 49.2s | |
| Qwen 3.5 Flash | 85% | $0.0024 | 35.9s | |
| Qwen3 235B A22B Instruct 2507 | 87% | $0.0011 | 49.4s | |
| MiniMax M2.5 | 85% | $0.0043 | 1.8m | |
| Z.AI GLM 5 | 88% | $0.012 | 1.8m | |
| Claude Sonnet 4.5 | 88% | $0.045 | 41.1s | |
| GPT-5.4 Nano (Reasoning) | 83% | $0.0055 | 23.1s | |
| Grok 4.1 Fast | 86% | $0.0021 | 39.6s | |
| GPT-5.4 Mini (Reasoning) | 87% | $0.022 | 26.8s | |
| Z.AI GLM 5 Turbo | 85% | $0.013 | 40.7s | |
| Stealth: Hunter Alpha | 84% | $0.0000 | 48.4s | |
| Hermes 3 70B | 83% | $0.0015 | 38.9s | |
| Mistral Small 4 (Reasoning) | 83% | $0.0027 | 32.6s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| GPT-5.4 (Reasoning, Low) | 92% | 97% | 89% | |
| GPT-5.4 | 92% | 95% | 88% | |
| GPT-5.4 Mini | 89% | 97% | 86% | |
| Writer: Palmyra X5 | 90% | 95% | 85% | |
| Qwen 3.5 397B A17B | 87% | 98% | 85% | |
| Claude Opus 4.6 (Reasoning) | 89% | 95% | 85% | |
| GPT-5.4 Mini (Reasoning, Low) | 88% | 96% | 85% | |
| GPT-5.4 Mini (Reasoning) | 87% | 98% | 85% | |
| Grok 4.20 (Beta, Reasoning) | 85% | 99% | 84% | |
| Grok 4.20 (Beta) | 87% | 96% | 84% | |
| GPT-5.4 (Reasoning) | 88% | 95% | 83% | |
| Claude Opus 4.6 | 87% | 96% | 83% | |
| Z.AI GLM 5 | 88% | 93% | 83% | |
| Qwen3 235B A22B Instruct 2507 | 87% | 96% | 83% | |
| MoonshotAI: Kimi K2.5 | 86% | 93% | 82% | |
| Claude Sonnet 4.5 | 88% | 93% | 82% | |
| Claude Sonnet 4.6 (Reasoning) | 87% | 93% | 82% | |
| GPT-5.1 | 86% | 95% | 82% | |
| GPT-5.4 Nano (Reasoning) | 83% | 96% | 81% | |
| Stealth: Hunter Alpha | 84% | 96% | 81% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed, and stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| Writer: Palmyra X5 | 90% | $0.013 | 22.8s | 85% | |
| GPT-5.4 Mini | 89% | $0.014 | 15.7s | 86% | |
| GPT-5.4 (Reasoning, Low) | 92% | $0.056 | 1.3m | 89% | |
| GPT-5.4 Mini (Reasoning, Low) | 88% | $0.014 | 16.1s | 85% | |
| GPT-5.4 | 92% | $0.051 | 1.3m | 88% | |
| Grok 4.20 (Beta) | 87% | $0.019 | 15.1s | 84% | |
| GPT-5.4 Mini (Reasoning) | 87% | $0.022 | 26.8s | 85% | |
| Qwen3 235B A22B Instruct 2507 | 87% | $0.0011 | 49.4s | 83% | |
| Claude Sonnet 4.5 | 88% | $0.045 | 41.1s | 82% | |
| Grok 4.20 (Beta, Reasoning) | 85% | $0.034 | 25.6s | 84% | |
| Grok 4.1 Fast | 86% | $0.0021 | 39.6s | 79% | |
| Z.AI GLM 5 | 88% | $0.012 | 1.8m | 83% | |
| Qwen 3.5 Flash | 85% | $0.0024 | 35.9s | 80% | |
| GPT-5.4 Nano (Reasoning) | 83% | $0.0055 | 23.1s | 81% | |
| Claude Opus 4.6 (Reasoning) | 89% | $0.098 | 1.3m | 85% | |
| GPT-5.4 Nano | 83% | $0.0050 | 19.6s | 81% | |
| Stealth: Hunter Alpha | 84% | $0.0000 | 48.4s | 81% | |
| Claude Sonnet 4 | 88% | $0.045 | 54.2s | 79% | |
| Mistral Medium 3.1 | 84% | $0.0058 | 42.8s | 81% | |
| Mistral Small 4 (Reasoning) | 83% | $0.0027 | 32.6s | 81% | |
Fantasy: entering an ancient ruin
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-5.4 Mini (Reasoning) | 87% | $0.018 | 22.3s | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.014 | 16.2s | |
| GPT-5.4 Mini | 86% | $0.016 | 17.1s | |
| GPT-5.4 | 89% | $0.039 | 1.2m | |
| Z.AI GLM 5 Turbo | 85% | $0.0090 | 25.9s | |
| GPT-5.4 (Reasoning, Low) | 89% | $0.056 | 1.3m | |
| Claude Sonnet 4.6 | 87% | $0.040 | 37.2s | |
| Writer: Palmyra X5 | 84% | $0.013 | 22.1s | |
| Claude Sonnet 4.5 | 87% | $0.045 | 39.5s | |
| Qwen 3.5 35B | 81% | $0.043 | 2.3m | |
| Z.AI GLM 5 | 83% | $0.0095 | 1.5m | |
| DeepSeek V3 (2024-12-26) | 78% | $0.0029 | 35.6s | |
| MiniMax M2.7 | 82% | $0.0032 | 59.3s | |
| Qwen 3.5 Flash | 81% | $0.0033 | 45.4s | |
| Qwen 3.5 122B | 80% | $0.016 | 39.5s | |
| Qwen3 235B A22B Instruct 2507 | 84% | $0.0017 | 1.1m | |
| GPT-5.4 (Reasoning) | 92% | $0.081 | 2.3m | |
| Qwen 3.5 397B A17B | 84% | $0.0048 | 3.3m | |
| GPT-4.1 | 79% | $0.021 | 40.8s | |
| DeepSeek-V2 Chat | 78% | $0.0029 | 42.0s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| GPT-5.4 (Reasoning) | 92% | 98% | 90% | |
| GPT-5.4 | 89% | 97% | 88% | |
| GPT-5.1 | 86% | 98% | 85% | |
| GPT-5.4 (Reasoning, Low) | 89% | 92% | 84% | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | 96% | 84% | |
| Claude Sonnet 4.6 | 87% | 96% | 84% | |
| Claude Sonnet 4.5 | 87% | 95% | 83% | |
| Claude Opus 4.6 | 86% | 96% | 82% | |
| Claude Sonnet 4.6 (Reasoning) | 87% | 95% | 82% | |
| Claude Opus 4.6 (Reasoning) | 85% | 97% | 82% | |
| GPT-5.4 Mini | 86% | 93% | 81% | |
| GPT-5.4 Mini (Reasoning) | 87% | 92% | 81% | |
| Qwen 3.5 397B A17B | 84% | 94% | 81% | |
| Z.AI GLM 5 Turbo | 85% | 93% | 80% | |
| GPT-5 | 85% | 95% | 80% | |
| Gemini 3.1 Pro (Preview) | 83% | 93% | 78% | |
| Qwen 3.5 35B | 81% | 96% | 78% | |
| Claude Opus 4.5 | 83% | 91% | 77% | |
| Qwen 3.5 27B | 79% | 97% | 77% | |
| Qwen 3.5 122B | 80% | 95% | 77% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed, and stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.014 | 16.2s | 84% | |
| GPT-5.4 | 89% | $0.039 | 1.2m | 88% | |
| GPT-5.4 Mini (Reasoning) | 87% | $0.018 | 22.3s | 81% | |
| GPT-5.4 (Reasoning) | 92% | $0.081 | 2.3m | 90% | |
| GPT-5.4 Mini | 86% | $0.016 | 17.1s | 81% | |
| Claude Sonnet 4.6 | 87% | $0.040 | 37.2s | 84% | |
| Claude Sonnet 4.5 | 87% | $0.045 | 39.5s | 83% | |
| Z.AI GLM 5 Turbo | 85% | $0.0090 | 25.9s | 80% | |
| GPT-5.4 (Reasoning, Low) | 89% | $0.056 | 1.3m | 84% | |
| GPT-5.1 | 86% | $0.053 | 1.8m | 85% | |
| Writer: Palmyra X5 | 84% | $0.013 | 22.1s | 76% | |
| Claude Opus 4.6 | 86% | $0.087 | 1.1m | 82% | |
| Qwen 3.5 Flash | 81% | $0.0033 | 45.4s | 76% | |
| Qwen 3.5 122B | 80% | $0.016 | 39.5s | 77% | |
| Grok 4.20 (Beta, Reasoning) | 81% | $0.030 | 24.6s | 77% | |
| Qwen 3.5 9B | 81% | $0.0011 | 43.6s | 75% | |
| Qwen3 235B A22B Instruct 2507 | 84% | $0.0017 | 1.1m | 73% | |
| GPT-5.4 Nano (Reasoning) | 79% | $0.0060 | 26.5s | 76% | |
| MiniMax M2.7 | 82% | $0.0032 | 59.3s | 74% | |
| o4 Mini | 79% | $0.019 | 34.0s | 77% | |
Mystery: examining a crime scene
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-5.4 Mini | 89% | $0.014 | 15.4s | |
| GPT-5.4 Mini (Reasoning) | 90% | $0.026 | 33.4s | |
| Rocinante 12B | 87% | $0.0022 | 45.1s | |
| GPT-5.4 Mini (Reasoning, Low) | 88% | $0.014 | 15.7s | |
| Stealth: Hunter Alpha | 86% | $0.0000 | 49.6s | |
| Writer: Palmyra X5 | 89% | $0.013 | 21.6s | |
| Grok 4.1 Fast | 86% | $0.0017 | 49.2s | |
| Z.AI GLM 5 Turbo | 88% | $0.0088 | 33.3s | |
| Qwen3 235B A22B Instruct 2507 | 87% | $0.0011 | 47.2s | |
| Claude Haiku 4.5 | 87% | $0.015 | 23.7s | |
| Mistral Large | 86% | $0.018 | 30.5s | |
| GPT-5.4 (Reasoning, Low) | 93% | $0.054 | 1.2m | |
| MiniMax M2.7 | 85% | $0.0043 | 1.3m | |
| Grok 4.20 (Beta) | 86% | $0.015 | 12.7s | |
| Claude Sonnet 4.5 | 90% | $0.042 | 39.8s | |
| GPT-5.4 Nano (Reasoning, Low) | 82% | $0.0050 | 22.8s | |
| Mistral Small 4 | 83% | $0.0017 | 17.4s | |
| GPT-5.4 Nano | 83% | $0.0053 | 18.9s | |
| Hermes 3 405B | 84% | $0.0049 | 25.9s | |
| GPT-5.4 | 92% | $0.044 | 1.3m | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| GPT-5.4 (Reasoning, Low) | 93% | 98% | 91% | |
| GPT-5.4 Mini (Reasoning) | 90% | 99% | 90% | |
| GPT-5.1 | 90% | 99% | 89% | |
| Claude Opus 4.6 (Reasoning) | 92% | 96% | 88% | |
| GPT-5.4 Mini | 89% | 98% | 87% | |
| GPT-5.4 | 92% | 96% | 87% | |
| GPT-5.4 (Reasoning) | 91% | 94% | 86% | |
| GPT-5 | 87% | 98% | 86% | |
| Claude Sonnet 4.5 | 90% | 96% | 85% | |
| Writer: Palmyra X5 | 89% | 96% | 85% | |
| GPT-5.4 Mini (Reasoning, Low) | 88% | 95% | 84% | |
| Grok 4.20 (Beta, Reasoning) | 86% | 97% | 84% | |
| Claude Opus 4 | 88% | 95% | 84% | |
| Grok 4.1 Fast | 86% | 96% | 83% | |
| Claude Opus 4.6 | 87% | 95% | 83% | |
| Rocinante 12B | 87% | 95% | 83% | |
| Qwen3 235B A22B Instruct 2507 | 87% | 96% | 83% | |
| Claude Haiku 4.5 | 87% | 96% | 83% | |
| Z.AI GLM 5 | 88% | 95% | 82% | |
| Z.AI GLM 5 Turbo | 88% | 94% | 82% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed, and stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| GPT-5.4 Mini (Reasoning) | 90% | $0.026 | 33.4s | 90% | |
| GPT-5.4 Mini | 89% | $0.014 | 15.4s | 87% | |
| GPT-5.4 (Reasoning, Low) | 93% | $0.054 | 1.2m | 91% | |
| Writer: Palmyra X5 | 89% | $0.013 | 21.6s | 85% | |
| GPT-5.4 Mini (Reasoning, Low) | 88% | $0.014 | 15.7s | 84% | |
| GPT-5.4 | 92% | $0.044 | 1.3m | 87% | |
| Claude Sonnet 4.5 | 90% | $0.042 | 39.8s | 85% | |
| Z.AI GLM 5 Turbo | 88% | $0.0088 | 33.3s | 82% | |
| Claude Haiku 4.5 | 87% | $0.015 | 23.7s | 83% | |
| Qwen3 235B A22B Instruct 2507 | 87% | $0.0011 | 47.2s | 83% | |
| GPT-5.1 | 90% | $0.052 | 1.6m | 89% | |
| Rocinante 12B | 87% | $0.0022 | 45.1s | 83% | |
| Grok 4.1 Fast | 86% | $0.0017 | 49.2s | 83% | |
| Grok 4.20 (Beta) | 86% | $0.015 | 12.7s | 81% | |
| Stealth: Hunter Alpha | 86% | $0.0000 | 49.6s | 81% | |
| Claude Opus 4.6 (Reasoning) | 92% | $0.091 | 1.2m | 88% | |
| Claude Sonnet 4.6 | 89% | $0.037 | 40.6s | 82% | |
| GPT-5.4 Nano | 83% | $0.0053 | 18.9s | 82% | |
| Grok 4.20 (Beta, Reasoning) | 86% | $0.040 | 35.2s | 84% | |
| o4 Mini | 84% | $0.017 | 30.1s | 81% | |
Horror: alone in an eerie place at night
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-5.4 Mini (Reasoning) | 90% | $0.017 | 22.1s | |
| Z.AI GLM 5 Turbo | 88% | $0.0072 | 27.5s | |
| GPT-5.4 Nano (Reasoning, Low) | 85% | $0.0045 | 19.4s | |
| GPT-5.4 Mini (Reasoning, Low) | 88% | $0.013 | 14.9s | |
| Z.AI GLM 5 | 88% | $0.0075 | 50.7s | |
| Qwen3 235B A22B Instruct 2507 | 87% | $0.0015 | 48.5s | |
| GPT-5.4 Mini | 86% | $0.014 | 15.8s | |
| Mistral Large | 86% | $0.015 | 24.6s | |
| Claude 3.5 Haiku | 83% | $0.0050 | 9.0s | |
| GPT-5.4 | 90% | $0.039 | 1.1m | |
| Writer: Palmyra X5 | 85% | $0.012 | 19.1s | |
| Claude Sonnet 4.5 | 89% | $0.040 | 37.3s | |
| Mistral Large 3 | 83% | $0.0037 | 29.1s | |
| Qwen 3.5 397B A17B | 87% | $0.012 | 1.3m | |
| GPT-5.4 Nano | 84% | $0.0047 | 16.7s | |
| MiniMax M2.7 | 87% | $0.0029 | 1.0m | |
| Hermes 3 405B | 81% | $0.0050 | 26.4s | |
| Grok 4.20 (Beta) | 86% | $0.017 | 15.6s | |
| GPT-5.4 (Reasoning, Low) | 90% | $0.048 | 1.1m | |
| Claude Sonnet 4.6 | 88% | $0.034 | 35.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| GPT-5.4 (Reasoning) | 93% | 98% | 91% | |
| GPT-5.4 | 90% | 97% | 87% | |
| GPT-5.4 Mini (Reasoning) | 90% | 95% | 87% | |
| GPT-5.4 (Reasoning, Low) | 90% | 95% | 86% | |
| Z.AI GLM 5 Turbo | 88% | 95% | 86% | |
| Claude Opus 4.6 (Reasoning) | 88% | 97% | 86% | |
| Claude Sonnet 4.6 (Reasoning) | 90% | 95% | 85% | |
| GPT-5 | 89% | 96% | 85% | |
| GPT-5.4 Mini (Reasoning, Low) | 88% | 96% | 84% | |
| Gemini 3.1 Pro (Preview) | 86% | 97% | 84% | |
| Grok 4.20 (Beta) | 86% | 98% | 84% | |
| GPT-5.1 | 89% | 94% | 84% | |
| Claude Opus 4.5 | 87% | 95% | 84% | |
| Qwen 3.5 397B A17B | 87% | 97% | 83% | |
| Claude Sonnet 4.5 | 89% | 94% | 83% | |
| Claude Opus 4 | 87% | 95% | 83% | |
| Claude Sonnet 4.6 | 88% | 94% | 82% | |
| GPT-5.4 Nano | 84% | 97% | 82% | |
| Mistral Large | 86% | 94% | 82% | |
| GPT-5.4 Mini | 86% | 93% | 81% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed, and stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| GPT-5.4 Mini (Reasoning) | 90% | $0.017 | 22.1s | 87% | |
| Z.AI GLM 5 Turbo | 88% | $0.0072 | 27.5s | 86% | |
| GPT-5.4 Mini (Reasoning, Low) | 88% | $0.013 | 14.9s | 84% | |
| GPT-5.4 | 90% | $0.039 | 1.1m | 87% | |
| GPT-5.4 (Reasoning) | 93% | $0.075 | 2.1m | 91% | |
| Grok 4.20 (Beta) | 86% | $0.017 | 15.6s | 84% | |
| GPT-5.4 (Reasoning, Low) | 90% | $0.048 | 1.1m | 86% | |
| Claude Sonnet 4.5 | 89% | $0.040 | 37.3s | 83% | |
| Qwen 3.5 397B A17B | 87% | $0.012 | 1.3m | 83% | |
| GPT-5.4 Mini | 86% | $0.014 | 15.8s | 81% | |
| Claude Sonnet 4.6 | 88% | $0.034 | 35.1s | 82% | |
| Mistral Large | 86% | $0.015 | 24.6s | 82% | |
| Qwen3 235B A22B Instruct 2507 | 87% | $0.0015 | 48.5s | 80% | |
| GPT-5.4 Nano | 84% | $0.0047 | 16.7s | 82% | |
| MiniMax M2.7 | 87% | $0.0029 | 1.0m | 80% | |
| Z.AI GLM 5 | 88% | $0.0075 | 50.7s | 79% | |
| GPT-5.4 Nano (Reasoning, Low) | 85% | $0.0045 | 19.4s | 80% | |
| Writer: Palmyra X5 | 85% | $0.012 | 19.1s | 80% | |
| Claude Sonnet 4.6 (Reasoning) | 90% | $0.074 | 1.2m | 85% | |
| Claude Haiku 4.5 | 84% | $0.014 | 22.8s | 81% | |
genre
Literary fiction: old friends reunite
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-5.4 Mini | 88% | $0.018 | 20.0s | |
| DeepSeek V3 (2025-03-24) | 85% | $0.0012 | 39.8s | |
| Mistral Small Creative | 83% | $0.0007 | 10.0s | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.018 | 21.3s | |
| Ministral 8B | 83% | $0.0003 | 12.7s | |
| Ministral 3 14B | 82% | $0.0006 | 17.7s | |
| Mistral Large | 85% | $0.012 | 33.8s | |
| GPT-5.4 Mini (Reasoning) | 86% | $0.030 | 33.3s | |
| Writer: Palmyra X5 | 83% | $0.011 | 23.1s | |
| Mistral Medium 3.1 | 83% | $0.0047 | 41.0s | |
| ByteDance Seed 1.6 Flash | 83% | $0.0013 | 30.7s | |
| Grok 4.20 (Beta) | 83% | $0.016 | 14.2s | |
| Ministral 3B | 78% | $0.0001 | 7.0s | |
| o4 Mini | 81% | $0.017 | 27.0s | |
| Mistral Small 4 | 81% | $0.0015 | 23.5s | |
| Mistral Large 3 | 79% | $0.0028 | 31.2s | |
| Grok 4.1 Fast | 81% | $0.0019 | 38.6s | |
| Qwen3 235B A22B Instruct 2507 | 83% | $0.0011 | 1.2m | |
| GPT-5.4 Nano (Reasoning, Low) | 78% | $0.0054 | 19.1s | |
| Mistral Small 3.2 24B | 78% | $0.0006 | 22.9s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability |
|---|---|---|---|
| GPT-5.4 (Reasoning) | 88% | 98% | 86% | |
| GPT-5.4 Mini | 88% | 96% | 86% | |
| GPT-5.4 | 89% | 96% | 85% | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | 97% | 85% | |
| GPT-5.4 (Reasoning, Low) | 89% | 95% | 84% | |
| GPT-5.4 Mini (Reasoning) | 86% | 96% | 84% | |
| GPT-5.1 | 84% | 97% | 83% | |
| Grok 4.1 Fast | 81% | 98% | 80% | |
| Grok 4.20 (Beta, Reasoning) | 83% | 96% | 80% | |
| Ministral 3 14B | 82% | 96% | 79% | |
| Mistral Medium 3.1 | 83% | 96% | 79% | |
| Grok 4 | 81% | 97% | 79% | |
| Mistral Small Creative | 83% | 93% | 79% | |
| DeepSeek V3 (2025-03-24) | 85% | 91% | 79% | |
| Mistral Large | 85% | 92% | 78% | |
| Ministral 8B | 83% | 93% | 78% | |
| Qwen 3 32B | 81% | 97% | 78% | |
| GPT-5 | 81% | 94% | 78% | |
| o4 Mini | 81% | 94% | 78% | |
| Grok 4.20 (Beta) | 83% | 95% | 78% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed, and stability).
| Model | Score | Cost | Speed | Stability |
|---|---|---|---|---|
| GPT-5.4 Mini | 88% | $0.018 | 20.0s | 86% | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.018 | 21.3s | 85% | |
| GPT-5.4 Mini (Reasoning) | 86% | $0.030 | 33.3s | 84% | |
| Mistral Small Creative | 83% | $0.0007 | 10.0s | 79% | |
| Ministral 3 14B | 82% | $0.0006 | 17.7s | 79% | |
| DeepSeek V3 (2025-03-24) | 85% | $0.0012 | 39.8s | 79% | |
| Ministral 8B | 83% | $0.0003 | 12.7s | 78% | |
| Mistral Large | 85% | $0.012 | 33.8s | 78% | |
| Mistral Medium 3.1 | 83% | $0.0047 | 41.0s | 79% | |
| Grok 4.20 (Beta) | 83% | $0.016 | 14.2s | 78% | |
| Grok 4.1 Fast | 81% | $0.0019 | 38.6s | 80% | |
| Writer: Palmyra X5 | 83% | $0.011 | 23.1s | 77% | |
| ByteDance Seed 1.6 Flash | 83% | $0.0013 | 30.7s | 77% | |
| Mistral Small 4 | 81% | $0.0015 | 23.5s | 75% | |
| GPT-5.4 | 89% | $0.070 | 2.0m | 85% | |
| o4 Mini | 81% | $0.017 | 27.0s | 78% | |
| Qwen3 235B A22B Instruct 2507 | 83% | $0.0011 | 1.2m | 78% | |
| GPT-5.4 (Reasoning, Low) | 89% | $0.070 | 1.9m | 84% | |
| GPT-4.1 | 81% | $0.017 | 34.0s | 77% | |
| Mistral Small 4 (Reasoning) | 81% | $0.0024 | 37.2s | 75% | |
Thriller: chase through city streets
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time |
|---|---|---|---|
| GPT-5.4 Mini (Reasoning) | 90% | $0.022 | 26.3s | |
| GPT-5.4 Mini | 89% | $0.015 | 18.3s | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.013 | 16.0s | |
| Writer: Palmyra X5 | 86% | $0.011 | 22.5s | |
| Gemini 2.5 Flash | 81% | $0.0039 | 8.1s | |
| Qwen3 235B A22B Instruct 2507 | 86% | $0.0011 | 1.1m | |
| GPT-5.4 | 92% | $0.050 | 1.4m | |
| GPT-5.4 (Reasoning, Low) | 91% | $0.051 | 1.4m | |
| Mistral Small 4 (Reasoning) | 83% | $0.0023 | 32.8s | |
| Qwen 3.5 397B A17B | 87% | $0.016 | 1.5m | |
| Z.AI GLM 5 Turbo | 85% | $0.0068 | 30.7s | |
| GPT-4.1 | 83% | $0.018 | 46.3s | |
| Qwen 3.5 35B | 84% | $0.0068 | 26.9s | |
| Mistral Small 4 | 79% | $0.0011 | 17.2s | |
| Gemini 2.5 Flash Lite (Reasoning) | 81% | $0.0027 | 28.7s | |
| Ministral 8B | 79% | $0.0002 | 8.4s | |
| MiniMax M2.5 | 81% | $0.0026 | 35.8s | |
| GPT-4.1 Mini | 81% | $0.0025 | 14.4s | |
| GPT-5.4 Nano (Reasoning, Low) | 80% | $0.0049 | 17.8s | |
| Claude Sonnet 4.5 | 85% | $0.027 | 34.3s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| GPT-5.4 | 92% | 98% | 90% | |
| GPT-5.4 (Reasoning) | 91% | 95% | 88% | |
| GPT-5.4 (Reasoning, Low) | 91% | 94% | 86% | |
| GPT-5.4 Mini | 89% | 96% | 86% | |
| GPT-5.4 Mini (Reasoning) | 90% | 93% | 85% | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | 94% | 83% | |
| Qwen 3.5 397B A17B | 87% | 94% | 82% | |
| Writer: Palmyra X5 | 86% | 94% | 81% | |
| Gemini 3.1 Pro (Preview) | 86% | 91% | 80% | |
| GPT-5 | 82% | 96% | 80% | |
| o4 Mini High | 81% | 97% | 79% | |
| Qwen3 235B A22B Instruct 2507 | 86% | 93% | 79% | |
| GPT-5.1 | 83% | 93% | 78% | |
| MiniMax M2.5 | 81% | 95% | 78% | |
| GPT-4.1 | 83% | 92% | 78% | |
| Claude Sonnet 4.5 | 85% | 93% | 78% | |
| Z.AI GLM 5 Turbo | 85% | 94% | 78% | |
| Gemini 2.5 Flash Lite (Reasoning) | 81% | 96% | 78% | |
| GPT-5.4 Nano (Reasoning, Low) | 80% | 96% | 78% | |
| Qwen 3.5 Flash | 80% | 95% | 77% | |
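The stability column above appears to follow the stated formula (median score × consistency); a minimal sketch of the arithmetic, assuming the percentages are simply multiplied as fractions and rounded:

```python
def stability(median_score: float, consistency: float) -> float:
    """Stability as the product of median score and consistency,
    both given as fractions (e.g. 0.92 for 92%)."""
    return round(median_score * consistency, 2)

# Checking against the GPT-5.4 row above (92% score, 98% consistency -> 90%):
print(stability(0.92, 0.98))  # 0.9
```

This reproduces the table's values to within rounding, e.g. the o4 Mini High row (81% × 97% → 79%).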
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.4 Mini | 89% | $0.015 | 18.3s | 86% | |
| GPT-5.4 Mini (Reasoning) | 90% | $0.022 | 26.3s | 85% | |
| GPT-5.4 | 92% | $0.050 | 1.4m | 90% | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.013 | 16.0s | 83% | |
| Writer: Palmyra X5 | 86% | $0.011 | 22.5s | 81% | |
| GPT-5.4 (Reasoning, Low) | 91% | $0.051 | 1.4m | 86% | |
| Qwen3 235B A22B Instruct 2507 | 86% | $0.0011 | 1.1m | 79% | |
| Z.AI GLM 5 Turbo | 85% | $0.0068 | 30.7s | 78% | |
| Qwen 3.5 397B A17B | 87% | $0.016 | 1.5m | 82% | |
| Qwen 3.5 35B | 84% | $0.0068 | 26.9s | 76% | |
| Claude Sonnet 4.5 | 85% | $0.027 | 34.3s | 78% | |
| MiniMax M2.5 | 81% | $0.0026 | 35.8s | 78% | |
| GPT-4.1 Mini | 81% | $0.0025 | 14.4s | 77% | |
| Gemini 2.5 Flash Lite (Reasoning) | 81% | $0.0027 | 28.7s | 78% | |
| GPT-5.4 Nano (Reasoning, Low) | 80% | $0.0049 | 17.8s | 78% | |
| Mistral Small 4 (Reasoning) | 83% | $0.0023 | 32.8s | 75% | |
| GPT-4.1 | 83% | $0.018 | 46.3s | 78% | |
| Gemini 2.5 Flash (Reasoning) | 81% | $0.0100 | 19.8s | 76% | |
| Ministral 8B | 79% | $0.0002 | 8.4s | 75% | |
| o4 Mini High | 81% | $0.022 | 37.6s | 79% | |
Romance: separated couple reunites
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Rocinante 12B | 84% | $0.0007 | 17.7s | |
| GPT-5.4 Mini (Reasoning) | 89% | $0.019 | 21.8s | |
| GPT-5.4 Mini (Reasoning, Low) | 90% | $0.017 | 19.1s | |
| GPT-5.4 Mini | 88% | $0.018 | 19.0s | |
| Mistral Small Creative | 80% | $0.0007 | 11.0s | |
| DeepSeek V3 (2025-03-24) | 85% | $0.0011 | 38.6s | |
| Mistral Large | 82% | $0.011 | 28.6s | |
| Mistral Small 4 | 81% | $0.0016 | 23.0s | |
| Claude 3.5 Haiku | 82% | $0.0021 | 8.5s | |
| Mistral Small 4 (Reasoning) | 82% | $0.0025 | 37.5s | |
| Writer: Palmyra X5 | 80% | $0.011 | 23.9s | |
| ByteDance Seed 1.6 Flash | 81% | $0.0012 | 26.6s | |
| Ministral 3 3B | 79% | $0.0002 | 4.1s | |
| Ministral 3 14B | 79% | $0.0005 | 12.6s | |
| Z.AI GLM 5 Turbo | 84% | $0.0082 | 37.6s | |
| MiniMax M2.7 | 83% | $0.0039 | 1.1m | |
| Qwen 3.5 35B | 82% | $0.0096 | 39.3s | |
| Qwen3 235B A22B Instruct 2507 | 83% | $0.0006 | 52.1s | |
| Grok 4 Fast | 79% | $0.0016 | 22.9s | |
| Mistral Large 3 | 81% | $0.0028 | 32.4s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| GPT-5.4 | 92% | 97% | 90% | |
| GPT-5.4 (Reasoning, Low) | 90% | 97% | 88% | |
| GPT-5.4 Mini (Reasoning, Low) | 90% | 98% | 88% | |
| GPT-5.4 (Reasoning) | 91% | 97% | 87% | |
| GPT-5.4 Mini (Reasoning) | 89% | 96% | 86% | |
| GPT-5.4 Mini | 88% | 96% | 85% | |
| DeepSeek V3 (2025-03-24) | 85% | 96% | 81% | |
| Claude Opus 4 | 85% | 96% | 80% | |
| Qwen 3.5 397B A17B | 83% | 96% | 79% | |
| Mistral Small 4 | 81% | 97% | 79% | |
| Mistral Small 4 (Reasoning) | 82% | 96% | 79% | |
| Qwen 3.5 35B | 82% | 97% | 79% | |
| Qwen3 235B A22B Instruct 2507 | 83% | 93% | 78% | |
| MiniMax M2.7 | 83% | 93% | 78% | |
| Mistral Medium 3.1 | 81% | 96% | 78% | |
| Qwen 3.5 122B | 80% | 98% | 78% | |
| ByteDance Seed 1.6 Flash | 81% | 96% | 77% | |
| GPT-5.1 | 83% | 93% | 77% | |
| Qwen 3.5 Flash | 80% | 97% | 77% | |
| Claude Sonnet 4 | 79% | 97% | 77% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.4 Mini (Reasoning, Low) | 90% | $0.017 | 19.1s | 88% | |
| GPT-5.4 Mini (Reasoning) | 89% | $0.019 | 21.8s | 86% | |
| GPT-5.4 Mini | 88% | $0.018 | 19.0s | 85% | |
| DeepSeek V3 (2025-03-24) | 85% | $0.0011 | 38.6s | 81% | |
| GPT-5.4 | 92% | $0.065 | 1.9m | 90% | |
| GPT-5.4 (Reasoning, Low) | 90% | $0.067 | 1.7m | 88% | |
| Mistral Small 4 | 81% | $0.0016 | 23.0s | 79% | |
| Mistral Small 4 (Reasoning) | 82% | $0.0025 | 37.5s | 79% | |
| ByteDance Seed 1.6 Flash | 81% | $0.0012 | 26.6s | 77% | |
| Qwen3 235B A22B Instruct 2507 | 83% | $0.0006 | 52.1s | 78% | |
| Qwen 3.5 35B | 82% | $0.0096 | 39.3s | 79% | |
| Z.AI GLM 5 Turbo | 84% | $0.0082 | 37.6s | 76% | |
| Mistral Large | 82% | $0.011 | 28.6s | 77% | |
| Rocinante 12B | 84% | $0.0007 | 17.7s | 72% | |
| Ministral 3 3B | 79% | $0.0002 | 4.1s | 75% | |
| Mistral Small Creative | 80% | $0.0007 | 11.0s | 75% | |
| Claude 3.5 Haiku | 82% | $0.0021 | 8.5s | 73% | |
| MiniMax M2.7 | 83% | $0.0039 | 1.1m | 78% | |
| Mistral Large 3 | 81% | $0.0028 | 32.4s | 76% | |
| Qwen 3.5 Flash | 80% | $0.0020 | 36.1s | 77% | |
Fantasy: entering an ancient ruin
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-5.4 Mini (Reasoning, Low) | 83% | $0.016 | 18.6s | |
| GPT-5.4 Mini | 83% | $0.016 | 18.8s | |
| Rocinante 12B | 78% | $0.0009 | 26.7s | |
| GPT-5.4 Mini (Reasoning) | 84% | $0.024 | 29.9s | |
| ByteDance Seed 1.6 Flash | 76% | $0.0014 | 29.6s | |
| Mistral Small 4 | 76% | $0.0015 | 19.7s | |
| Qwen 3.5 9B | 79% | $0.0008 | 56.9s | |
| GPT-5.4 Nano | 77% | $0.0050 | 17.8s | |
| Qwen3 235B A22B Instruct 2507 | 78% | $0.0010 | 56.7s | |
| Z.AI GLM 5 | 79% | $0.0078 | 45.3s | |
| Gemini 3.1 Flash Lite (Preview) | 76% | $0.0029 | 8.8s | |
| Hermes 3 405B | 80% | $0.0030 | 1.6m | |
| GPT-4o Mini (temp=1) | 77% | $0.0012 | 28.4s | |
| Claude 3 Haiku | 74% | $0.0025 | 16.9s | |
| o4 Mini | 77% | $0.015 | 23.0s | |
| Grok 4.20 (Beta) | 78% | $0.018 | 19.1s | |
| Z.AI GLM 5 Turbo | 77% | $0.0074 | 34.2s | |
| Mistral Medium 3.1 | 74% | $0.0048 | 33.3s | |
| Mistral NeMo | 73% | $0.0005 | 7.9s | |
| DeepSeek-V2 Chat | 76% | $0.0023 | 46.2s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| GPT-5.4 (Reasoning) | 89% | 95% | 85% | |
| GPT-5.4 (Reasoning, Low) | 86% | 98% | 85% | |
| GPT-5.4 | 84% | 96% | 81% | |
| GPT-5.4 Mini (Reasoning, Low) | 83% | 97% | 80% | |
| GPT-5.4 Mini | 83% | 96% | 80% | |
| GPT-5.4 Mini (Reasoning) | 84% | 92% | 76% | |
| Qwen 3.5 9B | 79% | 96% | 75% | |
| GPT-5.4 Nano | 77% | 96% | 75% | |
| Grok 4.20 (Beta) | 78% | 96% | 75% | |
| Qwen3 235B A22B Instruct 2507 | 78% | 95% | 75% | |
| GPT-5.1 | 80% | 94% | 74% | |
| Qwen 3.5 397B A17B | 80% | 92% | 74% | |
| GPT-5.2 | 77% | 97% | 74% | |
| o4 Mini | 77% | 95% | 74% | |
| Rocinante 12B | 78% | 93% | 74% | |
| Claude Opus 4.6 (Reasoning) | 78% | 94% | 74% | |
| DeepSeek-V2 Chat | 76% | 97% | 73% | |
| GPT-5 Mini | 77% | 96% | 73% | |
| GPT-5 | 79% | 95% | 73% | |
| Grok 4.1 Fast | 75% | 97% | 73% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.4 Mini | 83% | $0.016 | 18.8s | 80% | |
| GPT-5.4 Mini (Reasoning, Low) | 83% | $0.016 | 18.6s | 80% | |
| GPT-5.4 (Reasoning, Low) | 86% | $0.056 | 1.4m | 85% | |
| GPT-5.4 Mini (Reasoning) | 84% | $0.024 | 29.9s | 76% | |
| GPT-5.4 (Reasoning) | 89% | $0.079 | 2.2m | 85% | |
| GPT-5.4 | 84% | $0.048 | 1.4m | 81% | |
| Qwen 3.5 9B | 79% | $0.0008 | 56.9s | 75% | |
| Rocinante 12B | 78% | $0.0009 | 26.7s | 74% | |
| GPT-5.4 Nano | 77% | $0.0050 | 17.8s | 75% | |
| Qwen3 235B A22B Instruct 2507 | 78% | $0.0010 | 56.7s | 75% | |
| Grok 4.20 (Beta) | 78% | $0.018 | 19.1s | 75% | |
| Gemini 3.1 Flash Lite (Preview) | 76% | $0.0029 | 8.8s | 72% | |
| Hermes 3 405B | 80% | $0.0030 | 1.6m | 73% | |
| o4 Mini | 77% | $0.015 | 23.0s | 74% | |
| Z.AI GLM 5 | 79% | $0.0078 | 45.3s | 72% | |
| GPT-5 Mini | 77% | $0.010 | 41.6s | 73% | |
| DeepSeek-V2 Chat | 76% | $0.0023 | 46.2s | 73% | |
| ByteDance Seed 1.6 Flash | 76% | $0.0014 | 29.6s | 72% | |
| Mistral Small 4 | 76% | $0.0015 | 19.7s | 71% | |
| Mistral Large | 76% | $0.014 | 34.6s | 73% | |
Mystery: examining a crime scene
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Rocinante 12B | 85% | $0.0006 | 25.4s | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.016 | 18.8s | |
| GPT-5.4 Mini | 89% | $0.017 | 19.5s | |
| Mistral Small 4 (Reasoning) | 83% | $0.0022 | 30.7s | |
| Qwen 3 32B | 83% | $0.0012 | 39.7s | |
| Mistral Small Creative | 81% | $0.0006 | 9.8s | |
| GPT-5.4 Nano | 80% | $0.0057 | 20.4s | |
| Mistral NeMo | 81% | $0.0003 | 11.1s | |
| Mistral Large 3 | 82% | $0.0026 | 28.4s | |
| Mistral Small 4 | 79% | $0.0011 | 18.0s | |
| LFM2 24B | 78% | $0.0002 | 33.2s | |
| Ministral 3 14B | 79% | $0.0005 | 12.5s | |
| GPT-5.4 Nano (Reasoning, Low) | 81% | $0.0055 | 20.4s | |
| GPT-5.4 Mini (Reasoning) | 89% | $0.031 | 37.4s | |
| DeepSeek V3 (2025-03-24) | 80% | $0.0016 | 33.7s | |
| Qwen 3.5 Flash | 81% | $0.0019 | 36.5s | |
| o4 Mini | 81% | $0.014 | 22.3s | |
| Hermes 3 405B | 82% | $0.0020 | 59.2s | |
| Grok 4.1 Fast | 80% | $0.0016 | 40.6s | |
| Gemma 3 12B | 79% | $0.0004 | 37.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| GPT-5.4 (Reasoning) | 92% | 98% | 91% | |
| GPT-5.4 | 92% | 96% | 88% | |
| GPT-5.4 (Reasoning, Low) | 91% | 94% | 87% | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | 96% | 84% | |
| GPT-5.4 Mini (Reasoning) | 89% | 95% | 84% | |
| GPT-5.1 | 85% | 98% | 83% | |
| GPT-5 | 85% | 96% | 83% | |
| Claude Sonnet 4.5 | 85% | 96% | 81% | |
| Qwen 3.5 397B A17B | 84% | 95% | 80% | |
| Qwen 3 32B | 83% | 97% | 80% | |
| GPT-5.4 Mini | 89% | 92% | 80% | |
| Claude Opus 4 | 83% | 94% | 79% | |
| Mistral Small 4 (Reasoning) | 83% | 94% | 78% | |
| o4 Mini | 81% | 94% | 77% | |
| Writer: Palmyra X5 | 81% | 97% | 77% | |
| GPT-5.4 Nano (Reasoning, Low) | 81% | 95% | 77% | |
| GPT-5.4 Nano | 80% | 94% | 77% | |
| Grok 4 | 79% | 98% | 77% | |
| Rocinante 12B | 85% | 88% | 77% | |
| Qwen3 235B A22B Instruct 2507 | 81% | 95% | 77% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.016 | 18.8s | 84% | |
| GPT-5.4 Mini | 89% | $0.017 | 19.5s | 80% | |
| GPT-5.4 Mini (Reasoning) | 89% | $0.031 | 37.4s | 84% | |
| Rocinante 12B | 85% | $0.0006 | 25.4s | 77% | |
| Qwen 3 32B | 83% | $0.0012 | 39.7s | 80% | |
| GPT-5.4 | 92% | $0.057 | 1.6m | 88% | |
| Mistral Small 4 (Reasoning) | 83% | $0.0022 | 30.7s | 78% | |
| Claude Sonnet 4.5 | 85% | $0.029 | 35.7s | 81% | |
| Mistral Large 3 | 82% | $0.0026 | 28.4s | 76% | |
| GPT-5.4 (Reasoning, Low) | 91% | $0.060 | 1.6m | 87% | |
| Mistral Small Creative | 81% | $0.0006 | 9.8s | 75% | |
| Mistral NeMo | 81% | $0.0003 | 11.1s | 75% | |
| GPT-5.4 Nano (Reasoning, Low) | 81% | $0.0055 | 20.4s | 77% | |
| GPT-5.4 Nano | 80% | $0.0057 | 20.4s | 77% | |
| Writer: Palmyra X5 | 81% | $0.011 | 24.2s | 77% | |
| o4 Mini | 81% | $0.014 | 22.3s | 77% | |
| Ministral 3 14B | 79% | $0.0005 | 12.5s | 75% | |
| Qwen 3.5 Flash | 81% | $0.0019 | 36.5s | 75% | |
| Mistral Small 4 | 79% | $0.0011 | 18.0s | 75% | |
| MiniMax M2.5 | 82% | $0.0030 | 1.0m | 76% | |
Horror: alone in an eerie place at night
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| DeepSeek V3 (2025-03-24) | 85% | $0.0011 | 25.4s | |
| GPT-5.4 Mini (Reasoning, Low) | 86% | $0.013 | 15.6s | |
| GPT-5.4 Mini (Reasoning) | 85% | $0.015 | 18.0s | |
| GPT-5.4 Mini | 86% | $0.013 | 15.9s | |
| Qwen 3.5 Flash | 81% | $0.0015 | 28.4s | |
| Z.AI GLM 5 | 82% | $0.0072 | 59.0s | |
| LFM2 24B | 80% | $0.0002 | 27.7s | |
| Rocinante 12B | 81% | $0.0007 | 21.4s | |
| Mistral Small 4 | 83% | $0.0012 | 21.9s | |
| Mistral NeMo | 77% | $0.0003 | 10.0s | |
| Qwen 3.5 397B A17B | 86% | $0.017 | 1.7m | |
| Z.AI GLM 5 Turbo | 83% | $0.0066 | 34.0s | |
| Claude 3.5 Haiku | 78% | $0.0022 | 10.2s | |
| Mistral Small 4 (Reasoning) | 80% | $0.0021 | 28.9s | |
| Qwen 3.5 9B | 79% | $0.0006 | 53.5s | |
| Ministral 3 3B | 77% | $0.0003 | 7.5s | |
| Gemma 3 12B | 78% | $0.0002 | 36.7s | |
| Ministral 8B | 78% | $0.0002 | 11.1s | |
| Qwen3 235B A22B Instruct 2507 | 81% | $0.0011 | 1.1m | |
| Mistral Small Creative | 80% | $0.0006 | 9.8s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| GPT-5.4 | 88% | 97% | 86% | |
| GPT-5.4 Mini (Reasoning) | 85% | 97% | 83% | |
| GPT-5.4 (Reasoning) | 88% | 95% | 82% | |
| GPT-5.4 (Reasoning, Low) | 86% | 95% | 82% | |
| GPT-5.4 Mini | 86% | 96% | 82% | |
| GPT-5.4 Mini (Reasoning, Low) | 86% | 95% | 82% | |
| Qwen 3.5 397B A17B | 86% | 95% | 81% | |
| Claude Opus 4.6 (Reasoning) | 83% | 97% | 80% | |
| DeepSeek V3 (2025-03-24) | 85% | 94% | 80% | |
| Claude Opus 4.6 | 82% | 96% | 80% | |
| GPT-5.1 | 83% | 95% | 79% | |
| Claude Opus 4.5 | 84% | 94% | 79% | |
| GPT-5 | 83% | 96% | 79% | |
| Qwen 3.5 Flash | 81% | 95% | 79% | |
| Claude Sonnet 4.6 (Reasoning) | 83% | 93% | 78% | |
| Mistral Small 4 | 83% | 96% | 78% | |
| Z.AI GLM 5 Turbo | 83% | 94% | 78% | |
| Gemini 3.1 Pro (Preview) | 85% | 93% | 78% | |
| Claude Sonnet 4.6 | 81% | 95% | 77% | |
| Grok 4.20 (Beta) | 80% | 96% | 77% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.4 Mini | 86% | $0.013 | 15.9s | 82% | |
| GPT-5.4 Mini (Reasoning) | 85% | $0.015 | 18.0s | 83% | |
| GPT-5.4 Mini (Reasoning, Low) | 86% | $0.013 | 15.6s | 82% | |
| DeepSeek V3 (2025-03-24) | 85% | $0.0011 | 25.4s | 80% | |
| Mistral Small 4 | 83% | $0.0012 | 21.9s | 78% | |
| Qwen 3.5 Flash | 81% | $0.0015 | 28.4s | 79% | |
| Z.AI GLM 5 Turbo | 83% | $0.0066 | 34.0s | 78% | |
| GPT-5.4 | 88% | $0.048 | 1.4m | 86% | |
| Qwen 3.5 397B A17B | 86% | $0.017 | 1.7m | 81% | |
| Rocinante 12B | 81% | $0.0007 | 21.4s | 74% | |
| Grok 4.1 Fast | 80% | $0.0013 | 27.0s | 75% | |
| Grok 4.20 (Beta) | 80% | $0.016 | 15.3s | 77% | |
| LFM2 24B | 80% | $0.0002 | 27.7s | 75% | |
| Mistral Small Creative | 80% | $0.0006 | 9.8s | 73% | |
| Ministral 3 14B | 78% | $0.0005 | 13.1s | 74% | |
| Mistral Large 2 | 81% | $0.0094 | 25.1s | 75% | |
| Z.AI GLM 5 | 82% | $0.0072 | 59.0s | 76% | |
| Mistral Small 4 (Reasoning) | 80% | $0.0021 | 28.9s | 74% | |
| Writer: Palmyra X5 | 81% | $0.011 | 22.8s | 73% | |
| Ministral 8B | 78% | $0.0002 | 11.1s | 73% | |
Novelcrafter Default Prompt
Literary fiction: old friends reunite
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Mistral Small Creative | 86% | $0.0005 | 8.0s | |
| Mistral Large 2 | 88% | $0.0099 | 26.9s | |
| Mistral Small 4 | 86% | $0.0012 | 20.4s | |
| Ministral 3 14B | 86% | $0.0005 | 11.5s | |
| GPT-5.4 Mini | 87% | $0.015 | 16.0s | |
| Mistral Small 4 (Reasoning) | 85% | $0.0017 | 27.6s | |
| Grok 4.20 (Beta) | 85% | $0.017 | 15.1s | |
| Mistral Large 3 | 85% | $0.0019 | 19.0s | |
| Mistral Medium 3.1 | 85% | $0.0039 | 31.4s | |
| Qwen3 235B A22B Instruct 2507 | 86% | $0.0012 | 1.2m | |
| LFM2 24B | 82% | $0.0002 | 23.9s | |
| Writer: Palmyra X5 | 85% | $0.011 | 22.5s | |
| Grok 4.20 (Beta, Reasoning) | 91% | $0.045 | 38.3s | |
| Ministral 3 3B | 78% | $0.0003 | 7.7s | |
| Hermes 3 405B | 82% | $0.0023 | 35.3s | |
| GPT-5.4 Mini (Reasoning, Low) | 84% | $0.015 | 17.3s | |
| Llama 3.1 8B | 80% | $0.0002 | 28.3s | |
| Grok 4 Fast | 82% | $0.0018 | 28.2s | |
| Grok 4.1 Fast | 85% | $0.0018 | 1.1m | |
| Ministral 3 8B | 80% | $0.0004 | 9.3s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| Grok 4.20 (Beta, Reasoning) | 91% | 94% | 87% | |
| GPT-5.4 Mini | 87% | 98% | 85% | |
| GPT-5.4 | 88% | 97% | 85% | |
| Mistral Large 2 | 88% | 95% | 85% | |
| GPT-5.4 (Reasoning, Low) | 86% | 97% | 84% | |
| GPT-5.4 (Reasoning) | 86% | 97% | 83% | |
| Mistral Small 4 | 86% | 96% | 83% | |
| Gemini 3.1 Pro (Preview) | 83% | 99% | 83% | |
| Writer: Palmyra X5 | 85% | 97% | 82% | |
| Qwen 3.5 397B A17B | 85% | 97% | 82% | |
| Mistral Medium 3.1 | 85% | 96% | 82% | |
| Mistral Large 3 | 85% | 97% | 82% | |
| Mistral Small Creative | 86% | 93% | 82% | |
| GPT-5.4 Mini (Reasoning) | 84% | 97% | 81% | |
| GPT-5 | 84% | 97% | 81% | |
| Ministral 3 14B | 86% | 95% | 81% | |
| Claude Opus 4 | 83% | 97% | 81% | |
| GPT-5.1 | 84% | 94% | 81% | |
| Qwen3 235B A22B Instruct 2507 | 86% | 93% | 80% | |
| Grok 4.1 Fast | 85% | 95% | 80% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| Mistral Large 2 | 88% | $0.0099 | 26.9s | 85% | |
| Grok 4.20 (Beta, Reasoning) | 91% | $0.045 | 38.3s | 87% | |
| GPT-5.4 Mini | 87% | $0.015 | 16.0s | 85% | |
| Mistral Small Creative | 86% | $0.0005 | 8.0s | 82% | |
| Mistral Small 4 | 86% | $0.0012 | 20.4s | 83% | |
| Ministral 3 14B | 86% | $0.0005 | 11.5s | 81% | |
| Mistral Large 3 | 85% | $0.0019 | 19.0s | 82% | |
| Mistral Medium 3.1 | 85% | $0.0039 | 31.4s | 82% | |
| Writer: Palmyra X5 | 85% | $0.011 | 22.5s | 82% | |
| Mistral Small 4 (Reasoning) | 85% | $0.0017 | 27.6s | 80% | |
| Qwen3 235B A22B Instruct 2507 | 86% | $0.0012 | 1.2m | 80% | |
| GPT-5.4 Mini (Reasoning, Low) | 84% | $0.015 | 17.3s | 80% | |
| GPT-5.4 Mini (Reasoning) | 84% | $0.025 | 28.5s | 81% | |
| Grok 4.1 Fast | 85% | $0.0018 | 1.1m | 80% | |
| Grok 4.20 (Beta) | 85% | $0.017 | 15.1s | 79% | |
| Grok 4 Fast | 82% | $0.0018 | 28.2s | 79% | |
| GPT-5.4 Nano (Reasoning, Low) | 83% | $0.0062 | 22.3s | 79% | |
| LFM2 24B | 82% | $0.0002 | 23.9s | 78% | |
| Stealth: Healer Alpha | 81% | $0.0000 | 25.3s | 79% | |
| GPT-5.4 Nano (Reasoning) | 82% | $0.0069 | 27.2s | 79% | |
Thriller: chase through city streets
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Rocinante 12B | 86% | $0.0008 | 26.2s | |
| Qwen3 235B A22B Instruct 2507 | 87% | $0.0006 | 49.7s | |
| GPT-5.4 Mini (Reasoning) | 90% | $0.020 | 29.8s | |
| Z.AI GLM 5 Turbo | 87% | $0.0063 | 29.9s | |
| Writer: Palmyra X5 | 88% | $0.010 | 20.5s | |
| GPT-5.4 Mini (Reasoning, Low) | 88% | $0.014 | 16.1s | |
| Mistral Small 4 (Reasoning) | 84% | $0.0015 | 21.5s | |
| Grok 4.1 Fast | 85% | $0.0014 | 23.5s | |
| Mistral Small 4 | 83% | $0.0009 | 12.8s | |
| Claude Haiku 4.5 | 85% | $0.0089 | 18.4s | |
| MiniMax M2.5 | 86% | $0.0026 | 46.1s | |
| Gemini 2.5 Flash (Reasoning) | 83% | $0.0076 | 14.4s | |
| Stealth: Hunter Alpha | 84% | $0.0000 | 47.7s | |
| GPT-5.4 Mini | 87% | $0.012 | 14.1s | |
| Qwen 3.5 35B | 85% | $0.0086 | 33.0s | |
| Z.AI GLM 4.7 | 84% | $0.0095 | 1.5m | |
| Ministral 3 14B | 79% | $0.0004 | 8.9s | |
| Gemini 2.5 Flash Lite | 83% | $0.0007 | 9.0s | |
| Grok 4.20 (Beta) | 84% | $0.017 | 17.9s | |
| Mistral Medium 3.1 | 82% | $0.0041 | 30.6s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| GPT-5.4 | 92% | 98% | 90% | |
| GPT-5.4 (Reasoning) | 93% | 95% | 88% | |
| GPT-5.4 (Reasoning, Low) | 93% | 96% | 87% | |
| GPT-5.4 Mini (Reasoning) | 90% | 96% | 86% | |
| Gemini 3.1 Pro (Preview) | 87% | 98% | 85% | |
| GPT-5.4 Mini (Reasoning, Low) | 88% | 97% | 85% | |
| GPT-5.1 | 88% | 95% | 85% | |
| Writer: Palmyra X5 | 88% | 97% | 84% | |
| Grok 4.1 Fast | 85% | 99% | 84% | |
| GPT-5 | 85% | 98% | 84% | |
| GPT-5.4 Mini | 87% | 96% | 83% | |
| Qwen 3.5 35B | 85% | 96% | 82% | |
| Z.AI GLM 5 Turbo | 87% | 95% | 82% | |
| Qwen3 235B A22B Instruct 2507 | 87% | 94% | 82% | |
| Claude Sonnet 4.5 | 85% | 97% | 82% | |
| Claude Haiku 4.5 | 85% | 94% | 81% | |
| Claude Opus 4.6 (Reasoning) | 82% | 98% | 81% | |
| Grok 4.20 (Beta) | 84% | 94% | 81% | |
| Claude Opus 4 | 85% | 95% | 81% | |
| Z.AI GLM 4.7 | 84% | 94% | 80% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.4 Mini (Reasoning) | 90% | $0.020 | 29.8s | 86% | |
| GPT-5.4 Mini (Reasoning, Low) | 88% | $0.014 | 16.1s | 85% | |
| Writer: Palmyra X5 | 88% | $0.010 | 20.5s | 84% | |
| Grok 4.1 Fast | 85% | $0.0014 | 23.5s | 84% | |
| Z.AI GLM 5 Turbo | 87% | $0.0063 | 29.9s | 82% | |
| GPT-5.4 Mini | 87% | $0.012 | 14.1s | 83% | |
| Qwen3 235B A22B Instruct 2507 | 87% | $0.0006 | 49.7s | 82% | |
| GPT-5.4 | 92% | $0.045 | 1.4m | 90% | |
| GPT-5.4 (Reasoning, Low) | 93% | $0.046 | 1.2m | 87% | |
| Qwen 3.5 35B | 85% | $0.0086 | 33.0s | 82% | |
| Claude Haiku 4.5 | 85% | $0.0089 | 18.4s | 81% | |
| MiniMax M2.5 | 86% | $0.0026 | 46.1s | 80% | |
| Mistral Small 4 (Reasoning) | 84% | $0.0015 | 21.5s | 80% | |
| Mistral Small 4 | 83% | $0.0009 | 12.8s | 78% | |
| Rocinante 12B | 86% | $0.0008 | 26.2s | 75% | |
| Gemini 2.5 Flash Lite | 83% | $0.0007 | 9.0s | 77% | |
| Grok 4.20 (Beta) | 84% | $0.017 | 17.9s | 81% | |
| Z.AI GLM 5 | 86% | $0.0062 | 1.3m | 78% | |
| Claude Sonnet 4.5 | 85% | $0.028 | 34.3s | 82% | |
| GPT-5.4 Nano (Reasoning, Low) | 81% | $0.0055 | 22.0s | 79% | |
Romance: separated couple reunites
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-5.4 Mini | 90% | $0.014 | 15.5s | |
| Mistral Small Creative | 86% | $0.0006 | 9.0s | |
| Mistral Large 3 | 84% | $0.0026 | 28.0s | |
| GPT-5.4 Mini (Reasoning, Low) | 86% | $0.015 | 16.6s | |
| Ministral 3 14B | 84% | $0.0004 | 9.5s | |
| Mistral Small 4 | 84% | $0.0011 | 16.9s | |
| GPT-5.4 Mini (Reasoning) | 87% | $0.021 | 24.2s | |
| Qwen 3.5 Flash | 84% | $0.0026 | 45.9s | |
| Qwen 3 32B | 84% | $0.0015 | 1.1m | |
| Qwen 3.5 122B | 84% | $0.016 | 47.3s | |
| ByteDance Seed 1.6 Flash | 84% | $0.0012 | 28.0s | |
| Mistral Small 4 (Reasoning) | 85% | $0.0019 | 29.4s | |
| Mistral Large 2 | 83% | $0.0097 | 32.4s | |
| Qwen 3.5 35B | 84% | $0.016 | 1.0m | |
| Mistral Medium 3.1 | 82% | $0.0040 | 36.4s | |
| Mistral Large | 85% | $0.010 | 27.4s | |
| Grok 4.20 (Beta, Reasoning) | 87% | $0.046 | 44.0s | |
| Stealth: Healer Alpha | 80% | $0.0000 | 26.3s | |
| LFM2 24B | 81% | $0.0001 | 19.3s | |
| Z.AI GLM 5 Turbo | 84% | $0.0085 | 41.1s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| GPT-5.4 (Reasoning, Low) | 90% | 99% | 90% | |
| GPT-5.4 Mini | 90% | 98% | 88% | |
| GPT-5.4 (Reasoning) | 89% | 97% | 87% | |
| GPT-5.4 Mini (Reasoning) | 87% | 98% | 85% | |
| GPT-5.4 Mini (Reasoning, Low) | 86% | 98% | 84% | |
| GPT-5.4 | 89% | 94% | 83% | |
| Mistral Small Creative | 86% | 96% | 83% | |
| Grok 4.20 (Beta, Reasoning) | 87% | 93% | 83% | |
| GPT-5.1 | 87% | 96% | 83% | |
| Claude Opus 4 | 86% | 96% | 82% | |
| Qwen 3.5 397B A17B | 85% | 98% | 82% | |
| o4 Mini High | 84% | 98% | 82% | |
| Qwen 3.5 Flash | 84% | 97% | 82% | |
| Gemini 3.1 Pro (Preview) | 83% | 97% | 81% | |
| DeepSeek V3 (2025-03-24) | 84% | 96% | 81% | |
| Mistral Large 2 | 83% | 96% | 80% | |
| Claude Opus 4.6 (Reasoning) | 85% | 93% | 80% | |
| Qwen 3.5 27B | 83% | 97% | 80% | |
| Mistral Small 4 | 84% | 95% | 80% | |
| Qwen 3.5 122B | 84% | 93% | 79% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.4 Mini | 90% | $0.014 | 15.5s | 88% | |
| Mistral Small Creative | 86% | $0.0006 | 9.0s | 83% | |
| GPT-5.4 Mini (Reasoning, Low) | 86% | $0.015 | 16.6s | 84% | |
| GPT-5.4 Mini (Reasoning) | 87% | $0.021 | 24.2s | 85% | |
| GPT-5.4 (Reasoning, Low) | 90% | $0.052 | 1.4m | 90% | |
| Mistral Small 4 | 84% | $0.0011 | 16.9s | 80% | |
| Ministral 3 14B | 84% | $0.0004 | 9.5s | 79% | |
| Qwen 3.5 Flash | 84% | $0.0026 | 45.9s | 82% | |
| ByteDance Seed 1.6 Flash | 84% | $0.0012 | 28.0s | 79% | |
| Grok 4.20 (Beta, Reasoning) | 87% | $0.046 | 44.0s | 83% | |
| o4 Mini High | 84% | $0.021 | 39.0s | 82% | |
| Mistral Large 2 | 83% | $0.0097 | 32.4s | 80% | |
| Mistral Small 4 (Reasoning) | 85% | $0.0019 | 29.4s | 76% | |
| DeepSeek V3 (2025-03-24) | 84% | $0.0013 | 1.1m | 81% | |
| Mistral Large | 85% | $0.010 | 27.4s | 77% | |
| o4 Mini | 83% | $0.012 | 22.7s | 79% | |
| GPT-5.4 | 89% | $0.051 | 1.4m | 83% | |
| Mistral Large 3 | 84% | $0.0026 | 28.0s | 75% | |
| Mistral Medium 3.1 | 82% | $0.0040 | 36.4s | 79% | |
| Qwen 3.5 122B | 84% | $0.016 | 47.3s | 79% | |
Fantasy: entering an ancient ruin
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| GPT-5.4 Mini | 86% | $0.013 | 15.7s | |
| GPT-5.4 Mini (Reasoning, Low) | 84% | $0.014 | 15.9s | |
| Gemini 3.1 Flash Lite (Preview) | 81% | $0.0027 | 8.1s | |
| GPT-5.4 | 90% | $0.042 | 1.3m | |
| GPT-5.4 Mini (Reasoning) | 85% | $0.020 | 37.9s | |
| Mistral Small Creative | 78% | $0.0007 | 8.7s | |
| Qwen 3.5 Flash | 82% | $0.0039 | 1.1m | |
| Qwen 3.5 9B | 82% | $0.0009 | 1.3m | |
| GPT-5.4 (Reasoning, Low) | 90% | $0.049 | 1.2m | |
| Qwen3 235B A22B Instruct 2507 | 80% | $0.0010 | 41.5s | |
| Qwen 3.5 122B | 81% | $0.015 | 43.6s | |
| ByteDance Seed 1.6 Flash | 79% | $0.0013 | 29.3s | |
| Grok 4.20 (Beta, Reasoning) | 87% | $0.045 | 40.0s | |
| Grok 4.1 Fast | 78% | $0.0018 | 35.4s | |
| Mistral Small 4 | 77% | $0.0013 | 17.2s | |
| DeepSeek V3 (2025-03-24) | 77% | $0.0013 | 43.2s | |
| Rocinante 12B | 78% | $0.0010 | 21.7s | |
| LFM2 24B | 77% | $0.0002 | 28.5s | |
| Qwen 3 32B | 80% | $0.0016 | 1.0m | |
| GPT-5.4 Nano (Reasoning) | 78% | $0.0066 | 26.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| GPT-5.4 (Reasoning, Low) | 90% | 98% | 88% | |
| GPT-5.4 (Reasoning) | 90% | 96% | 87% | |
| GPT-5.4 | 90% | 95% | 86% | |
| GPT-5.1 | 87% | 97% | 86% | |
| Grok 4.20 (Beta, Reasoning) | 87% | 97% | 85% | |
| GPT-5.4 Mini | 86% | 95% | 82% | |
| Qwen 3.5 397B A17B | 86% | 96% | 82% | |
| Gemini 3.1 Pro (Preview) | 83% | 98% | 81% | |
| Qwen 3.5 Flash | 82% | 98% | 81% | |
| GPT-5.4 Mini (Reasoning) | 85% | 95% | 80% | |
| Claude Opus 4.6 (Reasoning) | 84% | 94% | 80% | |
| Qwen 3.5 35B | 83% | 95% | 80% | |
| Gemini 3.1 Flash Lite (Preview) | 81% | 98% | 80% | |
| GPT-5.4 Mini (Reasoning, Low) | 84% | 94% | 79% | |
| Qwen 3.5 122B | 81% | 96% | 79% | |
| o4 Mini High | 81% | 97% | 79% | |
| Qwen 3.5 9B | 82% | 95% | 79% | |
| Qwen3 235B A22B Instruct 2507 | 80% | 98% | 78% | |
| Qwen 3.5 27B | 80% | 96% | 77% | |
| GPT-5 | 82% | 93% | 77% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.4 Mini | 86% | $0.013 | 15.7s | 82% | |
| GPT-5.4 | 90% | $0.042 | 1.3m | 86% | |
| GPT-5.4 (Reasoning, Low) | 90% | $0.049 | 1.2m | 88% | |
| Grok 4.20 (Beta, Reasoning) | 87% | $0.045 | 40.0s | 85% | |
| GPT-5.4 Mini (Reasoning, Low) | 84% | $0.014 | 15.9s | 79% | |
| Gemini 3.1 Flash Lite (Preview) | 81% | $0.0027 | 8.1s | 80% | |
| GPT-5.4 Mini (Reasoning) | 85% | $0.020 | 37.9s | 80% | |
| GPT-5.1 | 87% | $0.049 | 1.5m | 86% | |
| Qwen 3.5 Flash | 82% | $0.0039 | 1.1m | 81% | |
| Qwen3 235B A22B Instruct 2507 | 80% | $0.0010 | 41.5s | 78% | |
| Qwen 3.5 122B | 81% | $0.015 | 43.6s | 79% | |
| Qwen 3.5 9B | 82% | $0.0009 | 1.3m | 79% | |
| Qwen 3.5 35B | 83% | $0.019 | 1.2m | 80% | |
| GPT-5.4 (Reasoning) | 90% | $0.074 | 2.2m | 87% | |
| o4 Mini High | 81% | $0.025 | 47.7s | 79% | |
| Rocinante 12B | 78% | $0.0010 | 21.7s | 74% | |
| Mistral Small 4 | 77% | $0.0013 | 17.2s | 74% | |
| Grok 4.1 Fast | 78% | $0.0018 | 35.4s | 74% | |
| ByteDance Seed 1.6 Flash | 79% | $0.0013 | 29.3s | 73% | |
| GPT-5.4 Nano (Reasoning, Low) | 77% | $0.0058 | 22.7s | 74% | |
Mystery: examining a crime scene
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Stealth: Healer Alpha | 85% | $0.0000 | 22.2s | |
| Grok 4.20 (Beta) | 87% | $0.016 | 16.9s | |
| Z.AI GLM 5 Turbo | 86% | $0.0062 | 27.2s | |
| Cohere Command R+ (Aug. 2024) | 89% | $0.017 | 52.5s | |
| GPT-5.4 Mini | 87% | $0.015 | 17.5s | |
| GPT-5.4 Mini (Reasoning) | 89% | $0.025 | 33.6s | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.014 | 16.6s | |
| Mistral Medium 3.1 | 84% | $0.0039 | 28.3s | |
| Mistral Small 4 (Reasoning) | 84% | $0.0021 | 28.4s | |
| Grok 4.20 (Beta, Reasoning) | 90% | $0.053 | 47.5s | |
| GPT-5.4 | 91% | $0.047 | 1.4m | |
| Mistral Small Creative | 82% | $0.0005 | 6.9s | |
| Hermes 3 405B | 82% | $0.0019 | 37.5s | |
| Claude Sonnet 4.6 | 85% | $0.026 | 39.4s | |
| Mistral Large 2 | 84% | $0.0090 | 23.4s | |
| Stealth: Hunter Alpha | 84% | $0.0000 | 42.9s | |
| Qwen 3.5 Flash | 84% | $0.0033 | 55.3s | |
| LFM2 24B | 80% | $0.0002 | 25.5s | |
| GPT-5.4 Nano (Reasoning, Low) | 84% | $0.0068 | 24.2s | |
| Mistral Large 3 | 81% | $0.0024 | 24.5s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| GPT-5.4 (Reasoning) | 93% | 96% | 90% | |
| GPT-5.4 | 91% | 97% | 89% | |
| Grok 4.20 (Beta, Reasoning) | 90% | 97% | 87% | |
| GPT-5.4 (Reasoning, Low) | 89% | 97% | 87% | |
| GPT-5.1 | 90% | 94% | 87% | |
| GPT-5.4 Mini (Reasoning) | 89% | 97% | 86% | |
| Z.AI GLM 5 Turbo | 86% | 98% | 84% | |
| GPT-5.4 Mini | 87% | 97% | 84% | |
| Grok 4.20 (Beta) | 87% | 94% | 83% | |
| Mistral Large 2 | 84% | 99% | 83% | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | 96% | 83% | |
| GPT-5 | 88% | 94% | 83% | |
| ByteDance Seed 2.0 Lite | 87% | 94% | 82% | |
| Mistral Medium 3.1 | 84% | 96% | 81% | |
| Claude Opus 4.6 (Reasoning) | 86% | 96% | 81% | |
| Qwen 3.5 397B A17B | 85% | 95% | 81% | |
| Qwen 3.5 9B | 83% | 97% | 81% | |
| Cohere Command R+ (Aug. 2024) | 89% | 92% | 81% | |
| Qwen 3.5 27B | 82% | 98% | 81% | |
| Claude Sonnet 4.6 (Reasoning) | 83% | 96% | 81% | |
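The stability column above can be sketched as a simple product. This is a minimal sketch, assuming stability = median score × consistency; note that the displayed percentages are rounded, so a product of the rounded table values can differ from the listed stability by about a point (the underlying computation presumably uses unrounded values).

```python
# Sketch of the stability metric (an assumption based on the table caption:
# stability = median score x consistency, both expressed as fractions).
def stability(median_score: float, consistency: float) -> float:
    """Return the stability product as a fraction in [0, 1]."""
    return median_score * consistency

# Example using the GPT-5.4 (Reasoning) row (93% median, 96% consistency).
# The product of the rounded inputs is ~89%, one point below the listed 90%,
# which is consistent with the table being computed on unrounded values.
print(round(stability(0.93, 0.96) * 100))
```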
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.4 Mini (Reasoning) | 89% | $0.025 | 33.6s | 86% | |
| GPT-5.4 | 91% | $0.047 | 1.4m | 89% | |
| Z.AI GLM 5 Turbo | 86% | $0.0062 | 27.2s | 84% | |
| GPT-5.4 Mini | 87% | $0.015 | 17.5s | 84% | |
| GPT-5.4 Mini (Reasoning, Low) | 87% | $0.014 | 16.6s | 83% | |
| Grok 4.20 (Beta) | 87% | $0.016 | 16.9s | 83% | |
| Grok 4.20 (Beta, Reasoning) | 90% | $0.053 | 47.5s | 87% | |
| Cohere Command R+ (Aug. 2024) | 89% | $0.017 | 52.5s | 81% | |
| Stealth: Healer Alpha | 85% | $0.0000 | 22.2s | 80% | |
| Mistral Large 2 | 84% | $0.0090 | 23.4s | 83% | |
| Mistral Medium 3.1 | 84% | $0.0039 | 28.3s | 81% | |
| Stealth: Hunter Alpha | 84% | $0.0000 | 42.9s | 80% | |
| GPT-5.4 (Reasoning, Low) | 89% | $0.055 | 1.4m | 87% | |
| Qwen3 235B A22B Instruct 2507 | 85% | $0.0008 | 1.0m | 80% | |
| GPT-5.1 | 90% | $0.053 | 2.2m | 87% | |
| GPT-5.4 Nano (Reasoning, Low) | 84% | $0.0068 | 24.2s | 79% | |
| GPT-5.4 Nano (Reasoning) | 84% | $0.0079 | 32.4s | 80% | |
| ByteDance Seed 2.0 Lite | 87% | $0.013 | 2.4m | 82% | |
| Mistral Small 4 (Reasoning) | 84% | $0.0021 | 28.4s | 78% | |
| GPT-5.4 (Reasoning) | 93% | $0.087 | 2.5m | 90% | |
Horror: alone in an eerie place at night
Performance Score Distribution (Top 20)
Price-Performance Score Distribution (Top 20)
| Model | Score | Cost | Time | |
|---|---|---|---|---|
| Rocinante 12B | 84% | $0.0006 | 11.9s | |
| GPT-5.4 Mini | 85% | $0.013 | 15.4s | |
| GPT-5.4 Mini (Reasoning) | 86% | $0.014 | 20.5s | |
| Z.AI GLM 5 Turbo | 84% | $0.0060 | 27.7s | |
| GPT-5.4 Nano | 84% | $0.0060 | 23.9s | |
| GPT-5.4 Mini (Reasoning, Low) | 86% | $0.012 | 15.0s | |
| Mistral Small 4 (Reasoning) | 82% | $0.0017 | 23.6s | |
| Mistral Medium 3.1 | 83% | $0.0030 | 25.4s | |
| Writer: Palmyra X5 | 84% | $0.010 | 20.4s | |
| Ministral 8B | 80% | $0.0002 | 5.2s | |
| Grok 4.20 (Beta) | 86% | $0.017 | 17.0s | |
| Hermes 3 405B | 82% | $0.0020 | 35.8s | |
| GPT-5.4 Nano (Reasoning, Low) | 82% | $0.0060 | 21.9s | |
| Qwen3 235B A22B Instruct 2507 | 85% | $0.0008 | 1.0m | |
| GPT-5.4 | 90% | $0.040 | 1.2m | |
| Aion 2.0 | 83% | $0.0050 | 1.1m | |
| Qwen 3.5 9B | 83% | $0.0018 | 2.7m | |
| Qwen 3 32B | 82% | $0.0012 | 30.5s | |
| Claude Haiku 4.5 | 81% | $0.0093 | 19.7s | |
| Grok 4.1 Fast | 81% | $0.0014 | 31.7s | |
Most Stable Models (Top 20)
Ranked by stability (median × consistency).
| Model | Score | Consistency | Stability | |
|---|---|---|---|---|
| GPT-5.4 (Reasoning, Low) | 90% | 97% | 87% | |
| GPT-5.4 (Reasoning) | 92% | 93% | 87% | |
| Gemini 3.1 Pro (Preview) | 87% | 99% | 85% | |
| GPT-5.4 | 90% | 93% | 84% | |
| GPT-5.4 Mini (Reasoning) | 86% | 98% | 84% | |
| GPT-5.1 | 88% | 95% | 84% | |
| GPT-5.4 Mini | 85% | 96% | 83% | |
| GPT-5.4 Nano | 84% | 97% | 82% | |
| Qwen 3.5 397B A17B | 86% | 95% | 82% | |
| Writer: Palmyra X5 | 84% | 97% | 82% | |
| Z.AI GLM 5 | 85% | 96% | 82% | |
| GPT-5.4 Mini (Reasoning, Low) | 86% | 96% | 82% | |
| Qwen 3.5 27B | 82% | 98% | 81% | |
| GPT-5.4 Nano (Reasoning) | 83% | 97% | 80% | |
| Qwen 3.5 35B | 84% | 96% | 80% | |
| Cohere Command R+ (Aug. 2024) | 81% | 98% | 80% | |
| MiniMax M2.7 | 83% | 95% | 80% | |
| Grok 4.20 (Beta) | 86% | 93% | 80% | |
| Grok 4.20 (Beta, Reasoning) | 85% | 92% | 80% | |
| Qwen3 235B A22B Instruct 2507 | 85% | 95% | 80% | |
Top Overall Models (Top 20)
Ranked by composite score (performance, cost, speed & stability).
| Model | Score | Cost | Speed | Stability | |
|---|---|---|---|---|---|
| GPT-5.4 Mini (Reasoning) | 86% | $0.014 | 20.5s | 84% | |
| GPT-5.4 Mini (Reasoning, Low) | 86% | $0.012 | 15.0s | 82% | |
| GPT-5.4 Mini | 85% | $0.013 | 15.4s | 83% | |
| GPT-5.4 Nano | 84% | $0.0060 | 23.9s | 82% | |
| GPT-5.4 (Reasoning, Low) | 90% | $0.041 | 1.1m | 87% | |
| Writer: Palmyra X5 | 84% | $0.010 | 20.4s | 82% | |
| Z.AI GLM 5 | 85% | $0.0061 | 59.0s | 82% | |
| Grok 4.20 (Beta) | 86% | $0.017 | 17.0s | 80% | |
| Rocinante 12B | 84% | $0.0006 | 11.9s | 78% | |
| Qwen3 235B A22B Instruct 2507 | 85% | $0.0008 | 1.0m | 80% | |
| GPT-5.4 Nano (Reasoning) | 83% | $0.0054 | 24.4s | 80% | |
| Mistral Medium 3.1 | 83% | $0.0030 | 25.4s | 80% | |
| GPT-5.4 | 90% | $0.040 | 1.2m | 84% | |
| Mistral Small 4 (Reasoning) | 82% | $0.0017 | 23.6s | 79% | |
| MiniMax M2.7 | 83% | $0.0040 | 55.8s | 80% | |
| GPT-5.4 Nano (Reasoning, Low) | 82% | $0.0060 | 21.9s | 79% | |
| Qwen 3.5 Flash | 83% | $0.0021 | 44.4s | 79% | |
| Qwen 3.5 397B A17B | 86% | $0.0084 | 2.0m | 82% | |
| Z.AI GLM 5 Turbo | 84% | $0.0060 | 27.7s | 77% | |
| Qwen 3 32B | 82% | $0.0012 | 30.5s | 78% | |