Dialogue

Subcategory of Creative Writing. 155 models scored.

Model Leaderboard

All models ranked by their Dialogue subcategory score.

#	Model	Dialogue	Creative Writing	Overall
1	GPT-5.5 (Reasoning, Low)	96.25%	90.24%	92.59%
2	GPT-5.5	95.32%	90.39%	89.09%
3	Claude Sonnet 4.6	95.20%	83.31%	91.15%
4	Claude Sonnet 4.6 (Reasoning)	94.92%	83.09%	93.66%
5	GPT-5.5 (Reasoning)	94.90%	90.26%	92.98%
6	GPT-5.4	94.79%	90.94%	84.32%
7	MiniMax M3	93.86%	84.57%	90.88%
8	GPT-5.4 (Reasoning)	93.69%	91.17%	93.24%
9	Claude Opus 4.6 (Reasoning)	93.56%	84.55%	95.02%
10	Z.AI GLM 5 Turbo	93.23%	84.66%	94.27%
11	MoonshotAI: Kimi K2.6	92.60%	85.47%	92.31%
12	Qwen 3.5 397B A17B	92.54%	86.93%	91.73%
13	Z.AI GLM 5.1	92.52%	84.05%	94.37%
14	Claude Opus 4.6	92.37%	83.59%	92.35%
15	GPT-5.4 (Reasoning, Low)	92.18%	90.51%	91.41%
16	Qwen 3.5 9B	91.95%	84.35%	86.05%
17	Qwen3.6 Max Preview	91.85%	88.42%	94.54%
18	DeepSeek V4 Pro	91.48%	83.70%	82.63%
19	GPT-5	91.30%	86.87%	91.93%
20	Grok 4.3 (Reasoning)	90.45%	85.11%	93.60%
21	GPT-5.4 Mini	90.45%	88.10%	82.43%
22	GPT-5.4 Mini (Reasoning)	90.04%	88.66%	90.65%
23	GPT-5.4 Mini (Reasoning, Low)	89.66%	87.72%	85.75%
24	Claude Sonnet 4.5	89.63%	84.19%	88.03%
25	Qwen 3.5 35B	89.61%	83.51%	88.00%
26	Claude Opus 4	89.08%	83.79%	87.69%
27	Qwen 3.6 35B	89.07%	85.97%	89.05%
28	Qwen3.7 Max	88.93%	85.39%	95.75%
29	MiniMax M2.5	88.93%	81.21%	88.71%
30	MiniMax M2.7	88.91%	81.70%	89.10%
31	Claude Opus 4.5	88.91%	81.71%	89.69%
32	Qwen 3.5 27B	88.67%	82.54%	90.85%
33	Claude Opus 4.7	88.63%	84.74%	89.93%
34	DeepSeek V4 Flash	88.48%	83.42%	82.02%
35	Z.AI GLM 5	88.46%	83.63%	91.23%
36	Claude Opus 4.7 (Reasoning)	88.14%	84.73%	93.23%
37	DeepSeek V4 Flash (Reasoning)	88.10%	83.03%	89.01%
38	Claude Opus 4.8 (Reasoning, Low)	87.92%	85.86%	92.14%
39	GPT-5.1	87.44%	87.20%	92.54%
40	Qwen3 235B A22B Instruct 2507	87.23%	84.81%	80.10%
41	Gemini 3.1 Pro (Preview)	87.20%	85.44%	94.37%
42	Grok 4.3	87.13%	84.51%	78.66%
43	Qwen 3.5 Flash	87.13%	83.81%	86.38%
44	Qwen 3.5 122B	86.76%	83.02%	91.53%
45	Qwen 3.6 Flash	86.74%	86.02%	90.65%
46	Claude Opus 4.8 (Reasoning)	86.50%	85.25%	92.22%
47	MoonshotAI: Kimi K2.5	86.21%	81.35%	91.04%
48	ByteDance Seed 2.0 Lite	85.96%	82.35%	84.80%
49	GPT-5 Mini	85.94%	80.48%	92.62%
50	ByteDance Seed 1.6	85.62%	78.43%	90.70%
51	Mistral Small 4 (Reasoning)	85.39%	81.67%	82.39%
52	Qwen 3.5 Plus (2026-04-20)	85.08%	85.18%	91.51%
53	GPT-5.4 Nano (Reasoning)	85.05%	80.97%	81.36%
54	Writer: Palmyra X5	84.42%	83.95%	79.57%
55	DeepSeek V4 Pro (Reasoning)	84.25%	82.99%	90.10%
56	DeepSeek V3 (2025-03-24)	83.31%	82.34%	81.99%
57	Grok 4.20 (Reasoning)	83.26%	86.25%	91.39%
58	GPT-5.4 Nano	83.18%	80.50%	74.40%
59	GPT-5.4 Nano (Reasoning, Low)	83.10%	80.93%	79.48%
60	Mistral Large 2	83.10%	81.86%	82.41%
61	Claude Sonnet 4	82.84%	79.21%	88.72%
62	Mistral Small 4	82.49%	81.12%	76.46%
63	Claude Haiku 4.5	82.25%	78.96%	85.14%
64	Mistral Large	81.53%	82.02%	80.15%
65	Mistral Medium 3.1	81.51%	81.70%	77.83%
66	Gemma 4 26B (Reasoning)	81.41%	76.38%	91.49%
67	ByteDance Seed 2.0 Mini	81.36%	80.11%	86.91%
68	Gemini 3.1 Flash Lite (Reasoning)	81.09%	76.31%	86.41%
69	Grok 4.20 (Beta, Reasoning)	80.80%	84.50%	91.49%
70	GPT-5.2	80.71%	80.36%	90.26%
71	Mistral Large 3	80.22%	81.21%	85.43%
72	Claude 3.5 Sonnet	80.13%	78.69%	84.24%
73	Gemma 4 31B (Reasoning)	80.03%	78.13%	91.71%
74	ByteDance Seed 1.6 Flash	79.96%	81.51%	73.27%
75	Xiaomi MIMO v2.5 Pro	79.81%	81.08%	87.36%
76	Grok 4.1 Fast	78.96%	82.14%	89.55%
77	Gemini 3.1 Flash Lite	78.90%	76.01%	85.75%
78	o4 Mini	78.87%	82.04%	88.35%
79	Qwen 3 32B	78.83%	81.30%	82.21%
80	Mistral Small Creative	78.30%	80.29%	73.27%
81	o4 Mini High	78.26%	82.72%	90.29%
82	Qwen 3.6 27B	78.25%	82.81%	89.72%
83	WizardLM 2 8x22b	77.45%	79.06%	71.06%
84	Gemini 2.5 Pro	77.40%	81.03%	88.53%
85	Xiaomi MIMO v2.5	77.40%	79.16%	85.05%
86	Stealth: Healer Alpha	77.16%	78.28%	85.93%
87	Grok 4.20	77.13%	83.44%	81.70%
88	GPT-4.1	76.96%	81.24%	88.68%
89	Aion 2.0	76.51%	80.24%	89.21%
90	Stealth: Hunter Alpha	76.41%	79.18%	87.34%
91	Ministral 8B	76.29%	76.87%	64.87%
92	Gemini 3.1 Flash Lite (Preview)	75.91%	75.78%	85.87%
93	Ministral 3B	75.29%	75.49%	61.29%
94	Claude 3.7 Sonnet	75.00%	76.31%	83.39%
95	Ministral 3 14B	74.59%	79.11%	72.54%
96	Ministral 3 3B	74.49%	75.45%	67.22%
97	Skyfall 36B V2	74.44%	83.32%	65.76%
98	Z.AI GLM 4.6	74.32%	78.86%	89.11%
99	Grok 4.20 (Beta)	74.11%	82.80%	83.85%
100	Gemma 4 26B	73.77%	75.17%	85.84%
101	DeepSeek V3.2	73.33%	79.95%	82.25%
102	Inception Mercury	73.29%	69.99%	79.50%
103	Z.AI GLM 4.5	73.22%	76.56%	86.27%
104	Llama 3.1 8B	73.18%	76.54%	63.35%
105	Ministral 3 8B	72.81%	77.26%	71.76%
106	Rocinante 12B	72.56%	81.94%	54.54%
107	LFM2 24B	72.22%	78.10%	58.77%
108	Z.AI GLM 4.7	71.01%	78.89%	88.69%
109	Llama 3.1 70B	70.76%	72.78%	78.40%
110	Gemini 2.5 Flash	70.64%	77.57%	80.60%
111	Gemma 4 31B	70.38%	75.59%	86.91%
112	DeepSeek V3 (2024-12-26)	69.79%	77.88%	83.68%
113	Z.AI GLM 4.7 Flash	69.49%	77.36%	84.82%
114	Gemini 3 Flash (Preview, Reasoning)	69.48%	75.87%	90.50%
115	Grok 4 Fast	69.13%	77.03%	86.15%
116	Mistral Small 3.2 24B	68.93%	71.87%	78.58%
117	Hermes 3 405B	68.75%	80.92%	82.86%
118	Grok 4	68.65%	77.34%	88.12%
119	DeepSeek V3.1	68.61%	77.45%	82.39%
120	Gemini 3.5 Flash (Reasoning)	67.65%	79.87%	94.08%
121	Nemotron 3 Super	67.38%	69.75%	84.56%
122	Mistral NeMO	67.00%	76.72%	65.04%
123	Gemini 3 Pro (Preview)	67.00%	77.77%	88.79%
124	DeepSeek-V2 Chat	66.95%	77.20%	84.83%
125	Gemini 2.5 Flash Lite (Reasoning)	65.73%	71.64%	85.75%
126	Z.AI GLM 4.5 Air	65.43%	74.61%	83.12%
127	Gemini 2.5 Flash (Reasoning)	65.33%	76.30%	86.51%
128	Arcee AI: Trinity Large (Preview)	65.33%	75.26%	73.33%
129	GPT-4o, May 13th (temp=0)	64.81%	74.89%	85.36%
130	Gemini 3 Flash (Preview)	64.78%	75.04%	85.35%
131	Gemma 3 27B	64.44%	78.79%	77.85%
132	Qwen 2.5 72B	64.34%	75.16%	75.46%
133	Cohere Command R+ (Aug. 2024)	64.24%	77.70%	69.03%
134	Gemini 3.5 Flash (Reasoning, Minimal)	63.97%	78.55%	86.47%
135	Arcee AI: Trinity Mini	63.42%	74.01%	70.90%
136	GPT-4o, Aug. 6th (temp=0)	63.20%	73.65%	82.45%
137	Cydonia 24B V4.1	62.72%	74.19%	75.09%
138	Gemini 2.5 Flash Lite	62.67%	75.05%	81.08%
139	GPT-5 Nano	62.32%	67.04%	82.60%
140	Llama 3.1 Nemotron 70B	62.12%	71.71%	74.70%
141	Qwen 3.5 Plus (2026-02-15)	61.06%	77.07%	85.96%
142	Nemotron 3 Nano	60.70%	65.87%	77.73%
143	Claude 3 Haiku	60.58%	74.53%	71.19%
144	Inception Mercury 2	59.94%	68.31%	83.85%
145	GPT-4o, Aug. 6th (temp=1)	59.42%	75.50%	82.62%
146	Hermes 3 70B	59.25%	77.41%	72.57%
147	Stealth: Aurora Alpha	59.16%	67.54%	83.79%
148	GPT-4o, May 13th (temp=1)	58.51%	75.88%	83.80%
149	GPT-4.1 Mini	57.72%	74.52%	83.20%
150	GPT-4o Mini (temp=0)	56.91%	73.10%	78.29%
151	GPT-OSS 120B	55.74%	67.85%	86.44%
152	GPT-4.1 Nano	55.61%	71.81%	71.94%
153	Gemma 3 12B	54.87%	75.38%	78.41%
154	GPT-4o Mini (temp=1)	53.50%	74.37%	79.08%
155	Gemma 3 4B	52.95%	72.10%	68.57%