"Not X but Y" pattern overuse

Test: Bad Writing Habits

Avg. Score

86.5%

Scenarios

Overall Performance

Performance Score Distribution (Top 20)

Rank	Model	Score	Avg. Cost	Avg. Time	Stability
1	Claude 3 Haiku	100.0%	$0.0025	14.9s	100%
2	Qwen 2.5 72B	99.6%	$0.0010	36.7s	95%
3	GPT-4o Mini (temp=0)	99.6%	$0.0012	34.8s	93%
4	o4 Mini High	100.0%	$0.025	47.2s	100%
5	Mistral NeMO	98.0%	$0.0005	10.1s	77%
6	Llama 3.1 70B	98.1%	$0.0015	29.4s	81%
7	GPT-4o, Aug. 6th (temp=0)	98.6%	$0.023	22.7s	84%
8	Hermes 3 405B	98.6%	$0.0032	53.2s	82%
9	Gemini 3 Flash (Preview)	96.5%	$0.0078	19.6s	75%
10	GPT-4o Mini (temp=1)	97.0%	$0.0012	34.8s	71%
11	Stealth: Aurora Alpha	95.7%	$0.0000	9.8s	63%
12	Llama 3.1 8B	98.2%	$0.0003	1.3m	77%
13	GPT-4o, May 13th (temp=0)	97.6%	$0.035	14.1s	74%
14	Hermes 3 70B	96.8%	$0.0010	1.2m	76%
15	o4 Mini	96.1%	$0.015	25.7s	68%
16	Mistral Large	95.2%	$0.014	30.9s	63%
17	GPT-4o, May 13th (temp=1)	96.0%	$0.033	14.4s	64%
18	Ministral 8B	90.7%	$0.0004	10.4s	49%
19	Mistral Large 2	93.5%	$0.013	29.4s	57%
20	Grok 4.1 Fast	92.9%	$0.0018	37.8s	53%
21	Ministral 3 14B	89.6%	$0.0007	11.7s	47%
22	Rocinante 12B	92.6%	$0.0014	38.4s	51%
23	Llama 3.1 Nemotron 70B	90.7%	$0.0038	31.7s	48%
24	Gemma 3 12B	91.4%	$0.0004	41.3s	47%
25	Mistral Large 3	90.1%	$0.0033	30.3s	47%
26	DeepSeek V3 (2024-12-26)	91.9%	$0.0021	54.6s	51%
27	Z.AI GLM 4.7	93.7%	$0.010	1.4m	61%
28	Cohere Command R+ (Aug. 2024)	92.4%	$0.020	52.5s	56%
29	Ministral 3 8B	88.1%	$0.0008	19.6s	41%
30	Z.AI GLM 4.5	89.1%	$0.0051	42.1s	43%
31	ByteDance Seed 1.6	97.2%	$0.013	2.5m	70%
32	Mistral Small Creative	85.1%	$0.0007	9.1s	35%
33	Ministral 3B	84.2%	$0.0001	8.1s	35%
34	Gemini 3.1 Pro (Preview)	99.9%	$0.107	1.8m	98%
35	Z.AI GLM 4.7 Flash	88.9%	$0.0017	1.2m	47%
36	DeepSeek-V2 Chat	88.4%	$0.0021	53.3s	42%
37	GPT-4.1 Nano	84.7%	$0.0007	13.3s	32%
38	Qwen 3.5 Plus (2026-02-15)	82.1%	$0.0060	31.5s	42%
39	Ministral 3 3B	82.7%	$0.0005	11.1s	32%
40	Arcee AI: Trinity Large (Preview)	85.9%	$0.0000	43.6s	37%
41	GPT-4o, Aug. 6th (temp=1)	87.1%	$0.018	24.4s	37%
42	Gemini 3 Pro (Preview)	92.1%	$0.055	54.4s	57%
43	Mistral Medium 3.1	85.5%	$0.0048	36.5s	36%
44	Z.AI GLM 5	89.0%	$0.0084	1.2m	44%
45	Claude 3.5 Sonnet	89.5%	$0.048	35.5s	50%
46	GPT-4.1 Mini	82.0%	$0.0027	19.0s	30%
47	Minimax M2.5	87.5%	$0.0034	1.3m	40%
48	GPT-5.1	92.4%	$0.054	1.8m	62%
49	Mistral Small 3.2 24B	100.0%	$0.0069	5.7m	100%
50	Arcee AI: Trinity Mini	74.9%	$0.0003	9.2s	26%
51	Gemma 3 4B	78.3%	$0.0002	20.0s	25%
52	Z.AI GLM 4.6	82.9%	$0.0065	51.5s	32%
53	Gemma 3 27B	81.9%	$0.0006	52.6s	28%
54	DeepSeek V3 (2025-03-24)	81.5%	$0.0014	39.4s	25%
55	WizardLM 2 8x22b	85.6%	$0.0026	1.8m	41%
56	Gemini 2.5 Pro	83.6%	$0.036	36.2s	37%
57	Qwen 3.5 397B A17B	92.9%	$0.014	3.0m	58%
58	Claude Sonnet 4.5	84.3%	$0.035	38.1s	34%
59	ByteDance Seed 1.6 Flash	75.9%	$0.0013	27.3s	25%
60	Claude Haiku 4.5	77.7%	$0.011	21.6s	24%
61	Grok 4 Fast	73.3%	$0.0017	24.1s	23%
62	Claude 3.5 Haiku	76.7%	$0.0035	10.8s	16%
63	Claude Sonnet 4	83.6%	$0.032	43.7s	31%
64	Gemini 2.5 Flash Lite	72.2%	$0.0009	9.5s	18%
65	DeepSeek V3.1	82.8%	$0.0020	1.8m	35%
66	Claude Opus 4.5	88.2%	$0.070	53.4s	44%
67	GPT-4.1	79.3%	$0.018	44.7s	26%
68	Gemini 2.5 Flash	70.9%	$0.0052	10.6s	19%
69	Claude Sonnet 4.6	80.4%	$0.031	39.3s	26%
70	Writer: Palmyra X5	70.8%	$0.011	22.0s	16%
71	Claude 3.7 Sonnet	78.2%	$0.042	46.7s	25%
72	GPT-5 Mini	69.0%	$0.0100	57.4s	19%
73	Grok 4	82.3%	$0.048	1.7m	33%
74	Claude Opus 4.6	81.3%	$0.078	1.2m	37%
75	DeepSeek V3.2	71.2%	$0.0014	1.9m	21%
76	GPT-5	89.3%	$0.065	2.8m	47%
77	GPT-5.2	76.9%	$0.056	1.5m	26%
78	MoonshotAI: Kimi K2.5	59.6%	$0.019	3.2m	10%
79	Claude Opus 4	86.4%	$0.209	1.4m	37%
80	GPT-5 Nano	14.2%	$0.0042	1.4m	0%
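
The Avg. Time column above mixes seconds ("14.9s") and minutes ("1.3m"), so it cannot be sorted or compared as-is. A minimal sketch of a normalizer follows, assuming those two suffixes are the only ones that occur; the helper name `time_to_seconds` is hypothetical and not part of the benchmark's tooling.

```python
def time_to_seconds(cell: str) -> float:
    """Convert an Avg. Time cell such as '14.9s' or '1.3m' to seconds.

    Assumption: only the 's' (seconds) and 'm' (minutes) suffixes appear,
    as in the table above.
    """
    cell = cell.strip()
    if cell.endswith("m"):
        return float(cell[:-1]) * 60.0
    if cell.endswith("s"):
        return float(cell[:-1])
    raise ValueError(f"unrecognized time cell: {cell!r}")


if __name__ == "__main__":
    # Values copied from the Avg. Time column of the table above.
    for raw in ["14.9s", "1.3m", "5.7m"]:
        print(raw, "->", round(time_to_seconds(raw), 1), "seconds")
```

With times on one scale, slow entries such as Mistral Small 3.2 24B (5.7m) stand out directly against the sub-minute models around them.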

Individual Scenarios

Detailed Writing Rules

Fantasy: entering an ancient ruin

Model	# 1	# 2	# 3	# 4	# 5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
o4 Mini	100	100	100	100	100	100.0%
Grok 4.1 Fast	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
Stealth: Aurora Alpha	100	100	100	100	100	100.0%
DeepSeek V3 (2024-12-26)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Hermes 3 70B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	96	99.3%
Hermes 3 405B	100	100	100	100	90	98.1%
Z.AI GLM 4.5	100	100	100	100	85	97.1%
Gemini 3 Flash (Preview)	100	100	100	100	81	96.2%
DeepSeek V3.2	100	100	100	100	78	95.6%
Qwen 2.5 72B	100	100	100	100	78	95.5%
GPT-4o, Aug. 6th (temp=1)	100	100	100	100	77	95.3%
Cohere Command R+ (Aug. 2024)	100	100	100	100	62	92.4%
Mistral Large	100	100	100	91	68	91.9%
Mistral Large 2	100	100	100	100	58	91.6%
Mistral Large 3	100	100	100	90	50	88.1%
Rocinante 12B	100	100	100	100	38	87.7%
Grok 4 Fast	100	100	100	100	37	87.5%
Gemma 3 12B	100	100	100	100	31	86.2%
Minimax M2.5	100	100	100	100	30	86.0%
Grok 4	100	100	100	100	24	84.9%
GPT-5.1	100	100	100	78	41	83.9%
Gemma 3 27B	100	100	100	100	17	83.3%
GPT-4.1 Nano	100	100	100	100	16	83.2%
Z.AI GLM 4.7 Flash	100	100	98	74	43	83.1%
Gemini 3 Pro (Preview)	100	100	97	73	33	80.6%
Z.AI GLM 4.7	100	100	100	100	0	80.0%
DeepSeek-V2 Chat	100	100	100	100	0	80.0%
Ministral 3B	100	100	100	100	0	80.0%
GPT-5	100	100	100	83	0	76.5%
DeepSeek V3.1	100	100	100	69	0	73.8%
Claude Opus 4.6	100	100	87	79	0	73.2%
Claude Sonnet 4.5	100	100	100	55	6	72.1%
Claude 3.5 Sonnet	100	100	62	48	47	71.4%
Qwen 3.5 397B A17B	100	100	100	47	0	69.4%
Claude Opus 4	100	100	100	42	0	68.5%
Llama 3.1 Nemotron 70B	100	100	100	37	0	67.3%
Claude Sonnet 4.6	100	100	100	34	0	66.9%
Ministral 3 3B	100	100	64	52	0	63.2%
GPT-4.1 Mini	100	100	100	11	0	62.3%
Qwen 3.5 Plus (2026-02-15)	100	100	91	13	0	60.9%
GPT-4.1	100	100	100	0	0	60.0%
DeepSeek V3 (2025-03-24)	100	100	100	0	0	60.0%
Writer: Palmyra X5	100	100	100	0	0	60.0%
Ministral 3 14B	100	100	100	0	0	60.0%
Gemini 2.5 Flash Lite	100	100	60	40	0	59.9%
Z.AI GLM 4.6	100	97	95	0	0	58.2%
Claude 3.7 Sonnet	100	100	80	0	0	56.0%
Claude Opus 4.5	100	100	66	0	0	53.3%
Z.AI GLM 5	100	100	63	0	0	52.6%
Arcee AI: Trinity Mini	100	100	63	0	0	52.6%
Arcee AI: Trinity Large (Preview)	100	100	48	3	0	50.3%
Claude Sonnet 4	100	100	32	8	0	48.0%
Ministral 8B	100	97	25	0	0	44.5%
WizardLM 2 8x22b	100	100	19	0	0	43.9%
Ministral 3 8B	100	100	0	0	0	40.0%
Mistral Small Creative	100	92	0	0	0	38.4%
GPT-5.2	100	69	14	0	0	36.6%
Gemini 2.5 Flash	100	35	28	0	0	32.5%
Claude Haiku 4.5	100	62	0	0	0	32.5%
Mistral Medium 3.1	100	46	0	0	0	29.2%
ByteDance Seed 1.6 Flash	58	55	13	0	0	25.3%
GPT-5 Mini	92	25	0	0	0	23.5%
Gemini 2.5 Pro	74	27	0	0	0	20.2%
Claude 3.5 Haiku	100	0	0	0	0	20.0%
Gemma 3 4B	100	0	0	0	0	20.0%
MoonshotAI: Kimi K2.5	98	0	0	0	0	19.7%
GPT-5 Nano	0	0	0	0	0	0.0%
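
The Avg column in these scenario tables appears to be the mean of the five numbered columns, with small differences that are consistent with the displayed scores being rounded to whole numbers. The sketch below checks one tab-separated row under that assumption; the function names and the 0.5-point tolerance are illustrative choices, not part of the benchmark.

```python
from statistics import mean


def parse_row(line: str) -> tuple[str, list[float], float]:
    """Split one tab-separated scenario row into (model, scores, reported average)."""
    cells = line.rstrip("\n").split("\t")
    model = cells[0]
    scores = [float(c) for c in cells[1:-1]]
    reported = float(cells[-1].rstrip("%"))
    return model, scores, reported


def check_row(line: str, tolerance: float = 0.5) -> bool:
    """Return True if the reported average is within `tolerance` of the mean.

    Assumption: each displayed score is rounded to an integer, so the mean of
    the displayed scores can differ from the true average by up to 0.5.
    """
    _, scores, reported = parse_row(line)
    return abs(mean(scores) - reported) <= tolerance


if __name__ == "__main__":
    # Rows copied from the fantasy scenario table above.
    rows = [
        "Hermes 3 405B\t100\t100\t100\t100\t90\t98.1%",
        "Mistral Large\t100\t100\t100\t91\t68\t91.9%",
    ]
    for row in rows:
        model, scores, reported = parse_row(row)
        print(model, round(mean(scores), 1), reported, check_row(row))
```

A row that fails this check would point to a transcription error in either the individual scores or the average, which is useful when copying these tables by hand.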

Horror: alone in an eerie place at night

Model	# 1	# 2	# 3	# 4	# 5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
Qwen 3.5 397B A17B	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
GPT-5	100	100	100	100	100	100.0%
o4 Mini	100	100	100	100	100	100.0%
Gemini 2.5 Pro	100	100	100	100	100	100.0%
Gemini 3 Pro (Preview)	100	100	100	100	100	100.0%
Claude Sonnet 4	100	100	100	100	100	100.0%
Grok 4.1 Fast	100	100	100	100	100	100.0%
Stealth: Aurora Alpha	100	100	100	100	100	100.0%
Claude 3.5 Sonnet	100	100	100	100	100	100.0%
Mistral Large 3	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Mistral Small Creative	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
Rocinante 12B	100	100	100	100	100	100.0%
Cohere Command R+ (Aug. 2024)	100	100	100	100	99	99.7%
Mistral Medium 3.1	100	100	100	100	94	98.7%
Claude Opus 4.6	100	100	100	100	88	97.6%
Mistral Large 2	100	100	100	100	87	97.4%
Mistral Large	100	100	100	100	84	96.8%
Llama 3.1 70B	100	100	100	100	74	94.9%
Gemini 3 Flash (Preview)	100	100	100	100	69	93.8%
Z.AI GLM 4.7	100	100	100	100	52	90.4%
Grok 4 Fast	100	100	100	100	45	89.0%
Minimax M2.5	100	100	100	100	27	85.4%
ByteDance Seed 1.6	100	100	100	100	13	82.7%
Claude Opus 4.5	100	100	100	100	0	80.0%
Claude Sonnet 4.5	100	100	100	100	0	80.0%
Claude 3.7 Sonnet	100	100	100	100	0	80.0%
Claude 3.5 Haiku	100	100	100	100	0	80.0%
Writer: Palmyra X5	100	100	100	100	0	80.0%
Llama 3.1 Nemotron 70B	100	100	100	100	0	80.0%
Ministral 3 8B	100	100	100	100	0	80.0%
Ministral 8B	100	100	100	100	0	80.0%
Ministral 3B	100	100	100	100	0	80.0%
GPT-4o, Aug. 6th (temp=1)	100	100	100	97	0	79.4%
Hermes 3 70B	100	100	100	60	30	77.8%
Ministral 3 14B	100	100	100	48	0	69.6%
Gemma 3 12B	100	100	100	42	0	68.4%
GPT-4.1 Nano	100	100	100	23	15	67.6%
WizardLM 2 8x22b	100	99	52	46	36	66.7%
Z.AI GLM 5	100	100	100	34	0	66.7%
GPT-5.1	100	100	100	11	0	62.2%
Qwen 3.5 Plus (2026-02-15)	100	94	80	32	0	61.2%
Claude Sonnet 4.6	100	100	100	3	2	60.8%
Z.AI GLM 4.5	100	100	100	2	0	60.5%
GPT-4.1	100	100	100	0	0	60.0%
DeepSeek V3 (2025-03-24)	100	100	100	0	0	60.0%
DeepSeek V3 (2024-12-26)	100	100	100	0	0	60.0%
Claude Haiku 4.5	100	100	100	0	0	60.0%
Gemini 2.5 Flash	100	100	100	0	0	60.0%
Ministral 3 3B	100	100	100	0	0	60.0%
Z.AI GLM 4.7 Flash	100	100	100	0	0	59.9%
DeepSeek V3.2	100	87	50	49	0	57.2%
DeepSeek V3.1	100	100	81	0	0	56.2%
Z.AI GLM 4.6	100	100	76	0	0	55.3%
Arcee AI: Trinity Large (Preview)	100	100	45	29	0	54.7%
Claude Opus 4	100	100	65	7	0	54.4%
MoonshotAI: Kimi K2.5	100	87	59	0	0	49.3%
DeepSeek-V2 Chat	100	100	18	11	0	45.8%
Grok 4	100	100	0	0	0	40.0%
Gemma 3 4B	100	100	0	0	0	40.0%
Gemma 3 27B	100	99	0	0	0	39.8%
GPT-4.1 Mini	100	86	1	0	0	37.3%
ByteDance Seed 1.6 Flash	65	56	50	0	0	34.2%
GPT-5 Mini	100	49	7	0	0	31.1%
Arcee AI: Trinity Mini	47	38	14	0	0	19.7%
GPT-5.2	58	27	0	0	0	17.0%
Gemini 2.5 Flash Lite	1	0	0	0	0	0.2%
GPT-5 Nano	0	0	0	0	0	0.0%

Literary fiction: old friends reunite

Model	# 1	# 2	# 3	# 4	# 5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
Qwen 3.5 397B A17B	100	100	100	100	100	100.0%
Claude Opus 4.6	100	100	100	100	100	100.0%
GPT-5.1	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
Claude Opus 4.5	100	100	100	100	100	100.0%
GPT-5	100	100	100	100	100	100.0%
o4 Mini	100	100	100	100	100	100.0%
Z.AI GLM 5	100	100	100	100	100	100.0%
Gemini 2.5 Pro	100	100	100	100	100	100.0%
GPT-5.2	100	100	100	100	100	100.0%
Z.AI GLM 4.7	100	100	100	100	100	100.0%
Gemini 3 Pro (Preview)	100	100	100	100	100	100.0%
Claude Opus 4	100	100	100	100	100	100.0%
Minimax M2.5	100	100	100	100	100	100.0%
Claude Sonnet 4	100	100	100	100	100	100.0%
Grok 4	100	100	100	100	100	100.0%
Claude Sonnet 4.6	100	100	100	100	100	100.0%
Claude Sonnet 4.5	100	100	100	100	100	100.0%
Grok 4.1 Fast	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
Z.AI GLM 4.6	100	100	100	100	100	100.0%
GPT-4.1	100	100	100	100	100	100.0%
Gemini 3 Flash (Preview)	100	100	100	100	100	100.0%
Z.AI GLM 4.7 Flash	100	100	100	100	100	100.0%
Stealth: Aurora Alpha	100	100	100	100	100	100.0%
DeepSeek V3 (2025-03-24)	100	100	100	100	100	100.0%
Grok 4 Fast	100	100	100	100	100	100.0%
Claude 3.5 Sonnet	100	100	100	100	100	100.0%
Mistral Large 3	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
DeepSeek-V2 Chat	100	100	100	100	100	100.0%
Claude Haiku 4.5	100	100	100	100	100	100.0%
Z.AI GLM 4.5	100	100	100	100	100	100.0%
Claude 3.5 Haiku	100	100	100	100	100	100.0%
Mistral Large 2	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Mistral Large	100	100	100	100	100	100.0%
Gemma 3 27B	100	100	100	100	100	100.0%
Mistral Small Creative	100	100	100	100	100	100.0%
Ministral 3 14B	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Arcee AI: Trinity Large (Preview)	100	100	100	100	100	100.0%
Hermes 3 70B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Ministral 3 8B	100	100	100	100	100	100.0%
Arcee AI: Trinity Mini	100	100	100	100	100	100.0%
Ministral 3 3B	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
GPT-4.1 Nano	100	100	100	100	100	100.0%
Gemma 3 4B	100	100	100	100	100	100.0%
Ministral 8B	100	100	100	100	100	100.0%
Ministral 3B	100	100	100	100	100	100.0%
Rocinante 12B	100	100	100	100	100	100.0%
DeepSeek V3 (2024-12-26)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	100	99.9%
DeepSeek V3.2	100	100	100	100	82	96.4%
Qwen 3.5 Plus (2026-02-15)	100	100	100	100	78	95.6%
GPT-4o, Aug. 6th (temp=0)	100	100	100	95	76	94.2%
WizardLM 2 8x22b	100	100	100	100	64	92.8%
ByteDance Seed 1.6 Flash	100	100	100	100	62	92.5%
Gemini 2.5 Flash Lite	100	100	100	100	59	91.8%
GPT-5 Mini	100	100	100	82	76	91.7%
DeepSeek V3.1	100	100	100	100	55	91.0%
MoonshotAI: Kimi K2.5	100	100	100	100	54	90.8%
GPT-4o Mini (temp=1)	100	100	100	100	52	90.5%
Writer: Palmyra X5	100	100	100	100	50	90.0%
Mistral Medium 3.1	100	100	100	100	45	88.9%
GPT-4o, Aug. 6th (temp=1)	100	100	100	98	38	87.1%
GPT-4.1 Mini	100	100	100	77	49	85.2%
Gemma 3 12B	100	100	100	100	24	84.8%
Cohere Command R+ (Aug. 2024)	100	100	100	98	14	82.5%
Gemini 2.5 Flash	100	100	100	100	0	80.0%
Llama 3.1 Nemotron 70B	100	100	100	100	0	80.0%
Claude 3.7 Sonnet	100	100	100	86	0	77.3%
GPT-5 Nano	0	0	0	0	0	0.0%

Mystery: examining a crime scene

Model	# 1	# 2	# 3	# 4	# 5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
Qwen 3.5 397B A17B	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
Claude Opus 4.5	100	100	100	100	100	100.0%
o4 Mini	100	100	100	100	100	100.0%
Z.AI GLM 5	100	100	100	100	100	100.0%
Claude Opus 4	100	100	100	100	100	100.0%
Claude Sonnet 4.6	100	100	100	100	100	100.0%
Z.AI GLM 4.6	100	100	100	100	100	100.0%
Z.AI GLM 4.7 Flash	100	100	100	100	100	100.0%
Stealth: Aurora Alpha	100	100	100	100	100	100.0%
Mistral Large 3	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
DeepSeek-V2 Chat	100	100	100	100	100	100.0%
Claude Haiku 4.5	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	100	100.0%
ByteDance Seed 1.6 Flash	100	100	100	100	100	100.0%
Writer: Palmyra X5	100	100	100	100	100	100.0%
Mistral Large 2	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Gemma 3 12B	100	100	100	100	100	100.0%
Gemma 3 27B	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Hermes 3 70B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Ministral 3 8B	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
GPT-4.1 Nano	100	100	100	100	100	100.0%
WizardLM 2 8x22b	100	100	100	100	100	100.0%
Rocinante 12B	100	100	100	100	100	100.0%
Gemini 2.5 Pro	100	100	100	100	99	99.7%
Ministral 8B	100	100	100	100	92	98.5%
Mistral Large	100	100	100	100	88	97.7%
Gemini 3 Flash (Preview)	100	100	100	100	87	97.4%
Llama 3.1 Nemotron 70B	100	100	100	100	84	96.9%
GPT-4.1 Mini	100	100	100	100	83	96.7%
Cohere Command R+ (Aug. 2024)	100	100	100	92	90	96.4%
Z.AI GLM 4.5	100	100	100	100	82	96.3%
Minimax M2.5	100	100	100	100	71	94.2%
Claude 3.7 Sonnet	100	100	100	100	69	93.8%
DeepSeek V3.2	100	100	100	100	68	93.7%
GPT-4o, Aug. 6th (temp=1)	100	100	100	87	78	92.9%
Gemma 3 4B	100	100	100	100	63	92.5%
Z.AI GLM 4.7	100	100	100	100	59	91.8%
Ministral 3 14B	100	100	100	100	51	90.2%
Gemini 2.5 Flash Lite	100	100	100	100	49	89.8%
Claude Sonnet 4.5	100	100	100	100	48	89.7%
Gemini 3 Pro (Preview)	100	100	89	75	72	87.3%
Qwen 3.5 Plus (2026-02-15)	100	100	100	81	51	86.2%
Llama 3.1 70B	100	100	100	100	30	86.0%
ByteDance Seed 1.6	100	100	100	100	8	81.7%
GPT-5.1	100	100	81	70	55	81.0%
Grok 4	100	100	100	100	0	80.0%
DeepSeek V3 (2025-03-24)	100	100	100	100	0	80.0%
Grok 4.1 Fast	100	100	100	95	0	79.0%
GPT-5	100	100	100	88	0	77.6%
DeepSeek V3.1	100	100	100	50	36	77.1%
Arcee AI: Trinity Mini	100	100	100	46	39	77.1%
Claude Sonnet 4	100	100	100	81	0	76.3%
Claude 3.5 Sonnet	100	100	100	72	0	74.4%
Llama 3.1 8B	100	100	100	68	0	73.6%
GPT-5 Mini	100	100	91	71	0	72.3%
Ministral 3 3B	100	100	100	61	0	72.1%
Claude Opus 4.6	100	100	100	59	0	71.8%
DeepSeek V3 (2024-12-26)	100	100	100	35	11	69.1%
Gemini 2.5 Flash	100	100	100	42	0	68.5%
Ministral 3B	100	100	100	19	0	63.8%
Mistral Small Creative	100	100	92	13	0	61.1%
Mistral Medium 3.1	100	100	100	0	0	60.0%
GPT-4.1	100	100	37	0	0	47.3%
GPT-5.2	100	100	0	0	0	40.0%
MoonshotAI: Kimi K2.5	100	58	17	0	0	34.9%
Arcee AI: Trinity Large (Preview)	100	72	0	0	0	34.5%
Claude 3.5 Haiku	100	0	0	0	0	20.0%
GPT-5 Nano	72	0	0	0	0	14.3%
Grok 4 Fast	57	0	0	0	0	11.5%

Romance: separated couple reunites

Model	# 1	# 2	# 3	# 4	# 5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
GPT-5.1	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
MoonshotAI: Kimi K2.5	100	100	100	100	100	100.0%
Claude Opus 4.5	100	100	100	100	100	100.0%
GPT-5	100	100	100	100	100	100.0%
o4 Mini	100	100	100	100	100	100.0%
Z.AI GLM 5	100	100	100	100	100	100.0%
Gemini 3 Pro (Preview)	100	100	100	100	100	100.0%
Claude Opus 4	100	100	100	100	100	100.0%
Minimax M2.5	100	100	100	100	100	100.0%
Claude Sonnet 4.6	100	100	100	100	100	100.0%
Grok 4.1 Fast	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
Gemini 3 Flash (Preview)	100	100	100	100	100	100.0%
Z.AI GLM 4.7 Flash	100	100	100	100	100	100.0%
Stealth: Aurora Alpha	100	100	100	100	100	100.0%
DeepSeek V3 (2025-03-24)	100	100	100	100	100	100.0%
DeepSeek V3 (2024-12-26)	100	100	100	100	100	100.0%
Mistral Large 3	100	100	100	100	100	100.0%
DeepSeek-V2 Chat	100	100	100	100	100	100.0%
DeepSeek V3.2	100	100	100	100	100	100.0%
Z.AI GLM 4.5	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=1)	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	100	100.0%
Claude 3.5 Haiku	100	100	100	100	100	100.0%
GPT-4.1 Mini	100	100	100	100	100	100.0%
Mistral Medium 3.1	100	100	100	100	100	100.0%
Writer: Palmyra X5	100	100	100	100	100	100.0%
Gemini 2.5 Flash	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Gemma 3 12B	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Mistral Large	100	100	100	100	100	100.0%
Gemma 3 27B	100	100	100	100	100	100.0%
Mistral Small Creative	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Arcee AI: Trinity Large (Preview)	100	100	100	100	100	100.0%
Hermes 3 70B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Ministral 3 8B	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
GPT-4.1 Nano	100	100	100	100	100	100.0%
Ministral 8B	100	100	100	100	100	100.0%
Rocinante 12B	100	100	100	100	100	100.0%
Claude Opus 4.6	100	100	100	100	99	99.7%
GPT-4.1	100	100	100	100	95	99.1%
Hermes 3 405B	100	100	100	100	92	98.4%
Llama 3.1 Nemotron 70B	100	100	100	100	86	97.2%
Qwen 2.5 72B	100	100	100	100	86	97.2%
Claude 3.7 Sonnet	100	100	100	100	86	97.1%
Ministral 3 3B	100	100	100	100	81	96.1%
Qwen 3.5 397B A17B	100	100	100	100	72	94.4%
GPT-5 Mini	100	100	100	100	65	93.1%
Gemma 3 4B	100	100	100	100	64	92.8%
Claude Sonnet 4.5	100	100	100	100	63	92.5%
Gemini 2.5 Flash Lite	100	100	100	97	66	92.5%
Claude 3.5 Sonnet	100	100	100	100	62	92.3%
Claude Sonnet 4	100	100	100	100	60	92.1%
WizardLM 2 8x22b	100	100	100	100	55	91.0%
Ministral 3B	100	100	100	100	55	91.0%
Qwen 3.5 Plus (2026-02-15)	100	100	100	100	50	89.9%
Ministral 3 14B	100	100	100	96	50	89.0%
Grok 4 Fast	100	100	100	100	45	89.0%
Grok 4	100	100	100	100	41	88.1%
Gemini 2.5 Pro	100	100	100	100	30	86.1%
GPT-4o, May 13th (temp=0)	100	100	100	100	22	84.4%
Cohere Command R+ (Aug. 2024)	100	100	100	100	18	83.6%
GPT-5.2	100	100	100	100	11	82.1%
Z.AI GLM 4.7	100	100	100	100	7	81.5%
Z.AI GLM 4.6	100	100	100	100	0	80.0%
ByteDance Seed 1.6 Flash	100	100	100	100	0	80.0%
Arcee AI: Trinity Mini	100	100	100	100	0	80.0%
DeepSeek V3.1	100	100	100	73	0	74.6%
Claude Haiku 4.5	100	100	100	54	0	70.8%
Mistral Large 2	100	100	100	18	5	64.7%
GPT-5 Nano	100	80	47	0	0	45.5%

Thriller: chase through city streets

Model	# 1	# 2	# 3	# 4	# 5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
Qwen 3.5 397B A17B	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
Claude Opus 4.5	100	100	100	100	100	100.0%
Z.AI GLM 5	100	100	100	100	100	100.0%
Z.AI GLM 4.7	100	100	100	100	100	100.0%
Claude Opus 4	100	100	100	100	100	100.0%
Claude Sonnet 4.6	100	100	100	100	100	100.0%
Claude Sonnet 4.5	100	100	100	100	100	100.0%
Grok 4.1 Fast	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
Gemini 3 Flash (Preview)	100	100	100	100	100	100.0%
Stealth: Aurora Alpha	100	100	100	100	100	100.0%
DeepSeek V3 (2025-03-24)	100	100	100	100	100	100.0%
DeepSeek V3 (2024-12-26)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
Claude 3.7 Sonnet	100	100	100	100	100	100.0%
Z.AI GLM 4.5	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	100	100.0%
ByteDance Seed 1.6 Flash	100	100	100	100	100	100.0%
Mistral Medium 3.1	100	100	100	100	100	100.0%
Mistral Large 2	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Gemini 2.5 Flash Lite	100	100	100	100	100	100.0%
Llama 3.1 Nemotron 70B	100	100	100	100	100	100.0%
Mistral Large	100	100	100	100	100	100.0%
Mistral Small Creative	100	100	100	100	100	100.0%
Ministral 3 14B	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Arcee AI: Trinity Large (Preview)	100	100	100	100	100	100.0%
Hermes 3 70B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Ministral 3 8B	100	100	100	100	100	100.0%
Arcee AI: Trinity Mini	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
GPT-4.1 Nano	100	100	100	100	100	100.0%
Ministral 8B	100	100	100	100	100	100.0%
Ministral 3B	100	100	100	100	100	100.0%
Rocinante 12B	100	100	100	100	100	100.0%
Minimax M2.5	100	100	100	100	98	99.6%
GPT-5	100	100	100	100	94	98.9%
Z.AI GLM 4.7 Flash	100	100	100	100	85	97.0%
GPT-5 Mini	100	100	100	100	85	97.0%
MoonshotAI: Kimi K2.5	100	100	100	94	89	96.7%
DeepSeek-V2 Chat	100	100	100	100	75	95.0%
GPT-5.1	100	100	100	100	74	94.8%
GPT-4.1 Mini	100	100	100	95	77	94.3%
DeepSeek V3.1	100	100	100	100	67	93.4%
Claude 3.5 Sonnet	100	100	100	100	61	92.3%
o4 Mini	100	100	100	100	58	91.7%
Claude Opus 4.6	100	100	100	88	64	90.3%
Gemma 3 27B	100	100	100	100	51	90.1%
Grok 4 Fast	100	100	100	82	67	89.8%
Gemma 3 4B	100	100	100	100	45	88.9%
Mistral Large 3	100	100	100	100	38	87.5%
Claude Sonnet 4	100	100	100	100	35	86.9%
Grok 4	100	100	100	100	31	86.3%
GPT-4.1	100	100	100	100	29	85.9%
Z.AI GLM 4.6	100	100	100	100	17	83.3%
Gemini 2.5 Pro	100	100	100	100	16	83.3%
GPT-5.2	100	100	100	92	21	82.6%
GPT-4o, Aug. 6th (temp=1)	100	100	100	100	10	81.9%
Gemini 3 Pro (Preview)	100	100	100	100	0	80.0%
Claude Haiku 4.5	100	100	100	100	0	80.0%
Claude 3.5 Haiku	100	100	100	100	0	80.0%
Gemma 3 12B	100	100	100	100	0	80.0%
Writer: Palmyra X5	100	100	100	91	0	78.3%
Cohere Command R+ (Aug. 2024)	100	100	100	84	0	76.8%
DeepSeek V3.2	100	100	100	78	0	75.5%
Qwen 3.5 Plus (2026-02-15)	100	100	86	48	40	74.6%
Ministral 3 3B	100	100	100	47	12	71.7%
WizardLM 2 8x22b	100	100	60	31	22	62.7%
Gemini 2.5 Flash	100	100	46	0	0	49.2%
GPT-5 Nano	100	66	31	18	0	42.9%

Genre

Fantasy: entering an ancient ruin

Model	# 1	# 2	# 3	# 4	# 5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
Z.AI GLM 4.7 Flash	100	100	100	100	100	100.0%
Stealth: Aurora Alpha	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Gemma 3 12B	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Hermes 3 70B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Ministral 3 8B	100	100	100	100	100	100.0%
Cohere Command R+ (Aug. 2024)	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
WizardLM 2 8x22b	100	100	100	100	100	100.0%
Gemini 3 Pro (Preview)	100	100	100	100	82	96.5%
Llama 3.1 Nemotron 70B	100	100	100	100	81	96.1%
Rocinante 12B	100	100	100	100	79	95.9%
Claude 3.5 Sonnet	100	100	100	91	88	95.8%
Ministral 3 3B	100	100	100	100	78	95.6%
Llama 3.1 8B	100	100	100	100	71	94.1%
o4 Mini	100	100	100	100	53	90.7%
Gemini 2.5 Pro	100	100	100	100	42	88.4%
Mistral Small Creative	100	100	92	91	54	87.4%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	27	85.3%
Ministral 8B	100	100	100	100	25	85.0%
Z.AI GLM 4.7	100	100	100	100	21	84.2%
DeepSeek-V2 Chat	100	100	100	80	35	82.8%
Z.AI GLM 5	100	100	100	100	14	82.7%
GPT-5.1	100	100	97	59	56	82.5%
Ministral 3 14B	100	100	100	100	11	82.1%
Arcee AI: Trinity Mini	100	100	94	91	20	81.1%
Claude 3.5 Haiku	100	100	100	100	2	80.4%
Grok 4.1 Fast	100	100	100	100	0	80.0%
DeepSeek V3 (2024-12-26)	100	100	100	100	0	80.0%
GPT-5.2	100	100	90	53	51	78.7%
Mistral Large 3	100	100	100	86	5	78.2%
Mistral Large 2	100	100	100	54	16	74.0%
GPT-5	100	87	78	77	19	72.4%
Arcee AI: Trinity Large (Preview)	100	100	100	40	20	72.1%
Gemini 3 Flash (Preview)	100	100	100	60	0	71.9%
Claude Sonnet 4.5	100	100	100	34	25	71.8%
Qwen 3.5 Plus (2026-02-15)	100	100	81	73	0	70.7%
Mistral Large	100	100	100	48	0	69.6%
Mistral Medium 3.1	100	100	79	48	3	65.9%
Z.AI GLM 4.6	100	100	100	25	0	65.0%
GPT-4.1 Nano	100	100	99	18	8	64.8%
Gemma 3 4B	100	100	100	14	6	64.0%
Ministral 3B	100	100	68	40	0	61.6%
Grok 4 Fast	100	93	82	28	0	60.6%
GPT-4o, Aug. 6th (temp=1)	100	100	100	0	0	60.0%
DeepSeek V3.1	100	100	100	0	0	60.0%
Gemini 2.5 Flash	100	89	77	30	0	59.3%
Qwen 3.5 397B A17B	100	100	59	33	0	58.4%
DeepSeek V3.2	100	100	87	0	0	57.5%
GPT-4.1	100	100	56	0	0	51.2%
Claude Haiku 4.5	100	100	50	0	0	49.9%
Z.AI GLM 4.5	100	100	45	0	0	49.2%
Gemma 3 27B	100	99	24	20	0	48.5%
Gemini 2.5 Flash Lite	100	100	40	0	0	47.9%
Claude Sonnet 4.6	100	59	48	30	0	47.4%
Claude Opus 4.6	100	84	36	7	0	45.5%
Claude Opus 4	100	100	8	0	0	41.6%
Grok 4	100	100	0	0	0	40.0%
DeepSeek V3 (2025-03-24)	100	100	0	0	0	40.0%
GPT-4.1 Mini	100	45	19	12	0	35.2%
Minimax M2.5	100	74	0	0	0	34.8%
Claude Opus 4.5	100	53	20	0	0	34.6%
GPT-5 Mini	100	40	15	0	0	30.9%
ByteDance Seed 1.6 Flash	100	11	0	0	0	22.1%
Claude 3.7 Sonnet	92	16	0	0	0	21.6%
MoonshotAI: Kimi K2.5	100	5	0	0	0	20.9%
Claude Sonnet 4	100	0	0	0	0	20.0%
GPT-5 Nano	0	0	0	0	0	0.0%
Writer: Palmyra X5	0	0	0	0	0	0.0%

Horror: alone in an eerie place at night

Model	# 1	# 2	# 3	# 4	# 5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
Qwen 3.5 397B A17B	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
Claude Sonnet 4	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
GPT-4.1	100	100	100	100	100	100.0%
Gemini 3 Flash (Preview)	100	100	100	100	100	100.0%
Claude 3.5 Sonnet	100	100	100	100	100	100.0%
DeepSeek V3 (2024-12-26)	100	100	100	100	100	100.0%
Mistral Large 3	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
DeepSeek-V2 Chat	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=1)	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	100	100.0%
Mistral Large 2	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Mistral Large	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Hermes 3 70B	100	100	100	100	100	100.0%
Ministral 3 3B	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
WizardLM 2 8x22b	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	99	99.8%
Mistral NeMO	100	100	100	100	86	97.2%
Ministral 3 14B	100	100	100	100	83	96.6%
ByteDance Seed 1.6 Flash	100	100	100	100	74	94.8%
Z.AI GLM 5	100	100	100	100	67	93.4%
Arcee AI: Trinity Large (Preview)	100	100	100	100	65	93.0%
Mistral Medium 3.1	100	100	100	100	63	92.6%
DeepSeek V3 (2025-03-24)	100	100	100	100	61	92.3%
Llama 3.1 70B	100	100	100	100	59	91.9%
Claude Opus 4.6	100	100	95	89	61	89.1%
Z.AI GLM 4.7	100	100	100	82	61	88.6%
GPT-4o Mini (temp=1)	100	100	100	100	42	88.5%
Ministral 8B	100	100	100	73	61	86.8%
Claude Sonnet 4.6	100	100	100	100	0	80.0%
Claude 3.5 Haiku	100	100	100	100	0	80.0%
Claude Opus 4.5	100	100	100	72	25	79.4%
Gemini 2.5 Pro	100	100	67	65	57	77.9%
Z.AI GLM 4.7 Flash	100	100	85	48	45	75.8%
GPT-5.1	100	100	73	69	34	75.4%
GPT-5	100	100	100	76	0	75.1%
Claude 3.7 Sonnet	100	100	91	60	21	74.6%
Z.AI GLM 4.5	100	100	100	73	0	74.6%
o4 Mini	100	100	100	60	12	74.4%
Grok 4.1 Fast	100	100	59	58	48	73.0%
Grok 4	100	100	100	57	0	71.4%
Cohere Command R+ (Aug. 2024)	100	100	100	52	0	70.3%
Mistral Small Creative	100	100	100	49	0	69.9%
Ministral 3B	100	100	80	64	0	68.6%
Claude Opus 4	100	100	85	42	15	68.3%
Z.AI GLM 4.6	100	100	58	48	33	67.8%
Arcee AI: Trinity Mini	93	80	73	56	16	63.7%
Gemma 3 27B	100	100	100	13	0	62.6%
Gemini 3 Pro (Preview)	100	100	100	12	0	62.3%
Gemini 2.5 Flash	100	100	58	43	8	61.9%
GPT-4.1 Mini	100	100	100	9	0	61.7%
Gemma 3 12B	100	100	100	0	0	60.0%
Llama 3.1 Nemotron 70B	100	100	100	0	0	60.0%
Ministral 3 8B	100	100	100	0	0	60.0%
Rocinante 12B	100	100	100	0	0	60.0%
DeepSeek V3.1	100	70	62	49	0	56.3%
Minimax M2.5	100	100	56	0	0	51.2%
Qwen 3.5 Plus (2026-02-15)	100	59	54	24	17	50.9%
Stealth: Aurora Alpha	100	100	26	9	6	48.2%
Claude Sonnet 4.5	100	100	35	0	0	47.0%
Claude Haiku 4.5	100	100	26	0	0	45.1%
GPT-5.2	100	94	15	0	0	41.8%
Gemma 3 4B	100	100	0	0	0	40.0%
Grok 4 Fast	100	69	19	0	0	37.7%
MoonshotAI: Kimi K2.5	83	22	0	0	0	21.1%
GPT-4.1 Nano	100	5	0	0	0	21.1%
Writer: Palmyra X5	100	0	0	0	0	20.0%
Gemini 2.5 Flash Lite	100	0	0	0	0	20.0%
DeepSeek V3.2	61	20	0	0	0	16.1%
GPT-5 Mini	0	0	0	0	0	0.0%
GPT-5 Nano	0	0	0	0	0	0.0%

Literary fiction: old friends reunite

Model	# 1	# 2	# 3	# 4	# 5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
Qwen 3.5 397B A17B	100	100	100	100	100	100.0%
GPT-5.1	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
Claude Opus 4.5	100	100	100	100	100	100.0%
GPT-5	100	100	100	100	100	100.0%
o4 Mini	100	100	100	100	100	100.0%
Claude Opus 4	100	100	100	100	100	100.0%
Minimax M2.5	100	100	100	100	100	100.0%
Grok 4	100	100	100	100	100	100.0%
Claude Sonnet 4.6	100	100	100	100	100	100.0%
Claude Sonnet 4.5	100	100	100	100	100	100.0%
Grok 4.1 Fast	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
GPT-4.1	100	100	100	100	100	100.0%
Stealth: Aurora Alpha	100	100	100	100	100	100.0%
DeepSeek V3 (2024-12-26)	100	100	100	100	100	100.0%
Mistral Large 3	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
Claude Haiku 4.5	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=1)	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	100	100.0%
Mistral Medium 3.1	100	100	100	100	100	100.0%
DeepSeek V3.1	100	100	100	100	100	100.0%
Mistral Large 2	100	100	100	100	100	100.0%
Gemini 2.5 Flash	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
Gemma 3 12B	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Mistral Large	100	100	100	100	100	100.0%
Gemma 3 27B	100	100	100	100	100	100.0%
Ministral 3 14B	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Ministral 3 8B	100	100	100	100	100	100.0%
Arcee AI: Trinity Mini	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
Gemma 3 4B	100	100	100	100	100	100.0%
Ministral 8B	100	100	100	100	100	100.0%
WizardLM 2 8x22b	100	100	100	100	100	100.0%
GPT-4.1 Nano	100	100	100	100	99	99.9%
Claude 3.5 Sonnet	100	100	100	100	99	99.8%
Rocinante 12B	100	100	100	100	99	99.8%
Gemini 3 Pro (Preview)	100	100	100	100	92	98.4%
GPT-5.2	100	100	100	100	88	97.7%
DeepSeek V3 (2025-03-24)	100	100	100	100	81	96.3%
Z.AI GLM 4.7 Flash	100	100	100	100	77	95.5%
DeepSeek-V2 Chat	100	100	100	100	77	95.5%
Z.AI GLM 4.7	100	100	100	100	73	94.7%
Gemini 3 Flash (Preview)	100	100	100	100	73	94.6%
Ministral 3B	100	100	100	100	72	94.4%
Claude Sonnet 4	100	100	100	100	71	94.1%
GPT-4o Mini (temp=0)	100	100	100	100	66	93.2%
DeepSeek V3.2	100	100	100	82	81	92.6%
Qwen 3.5 Plus (2026-02-15)	100	100	100	100	62	92.4%
Mistral Small Creative	100	100	100	100	56	91.3%
ByteDance Seed 1.6 Flash	100	100	100	100	51	90.3%
Writer: Palmyra X5	100	100	100	100	40	88.0%
GPT-5 Mini	100	100	100	98	42	87.8%
Hermes 3 70B	100	100	100	100	36	87.2%
Arcee AI: Trinity Large (Preview)	100	100	100	100	36	87.1%
Z.AI GLM 4.5	100	100	100	100	35	86.9%
Grok 4 Fast	100	100	100	92	35	85.3%
Cohere Command R+ (Aug. 2024)	100	100	100	100	26	85.3%
Llama 3.1 Nemotron 70B	100	100	100	100	12	82.4%
Gemini 2.5 Pro	100	100	100	88	24	82.4%
Gemini 2.5 Flash Lite	100	100	100	100	4	80.7%
Claude 3.5 Haiku	100	100	100	100	2	80.3%
MoonshotAI: Kimi K2.5	100	100	100	100	0	80.0%
Z.AI GLM 4.6	100	100	100	100	0	80.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	0	80.0%
GPT-4.1 Mini	100	100	100	100	0	80.0%
Ministral 3 3B	100	100	100	79	0	75.7%
Claude 3.7 Sonnet	100	100	100	22	0	64.3%
Claude Opus 4.6	100	100	68	45	5	63.7%
Z.AI GLM 5	100	100	100	0	0	60.0%
GPT-5 Nano	10	0	0	0	0	2.0%

Mystery: examining a crime scene

Model	#1	#2	#3	#4	#5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
Claude Opus 4.5	100	100	100	100	100	100.0%
GPT-5	100	100	100	100	100	100.0%
Claude Opus 4	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
Stealth: Aurora Alpha	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=1)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	100	100.0%
GPT-4.1 Mini	100	100	100	100	100	100.0%
Mistral Large 2	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Llama 3.1 Nemotron 70B	100	100	100	100	100	100.0%
Mistral Large	100	100	100	100	100	100.0%
Mistral Small Creative	100	100	100	100	100	100.0%
Ministral 3 14B	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Arcee AI: Trinity Large (Preview)	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
Gemma 3 4B	100	100	100	100	100	100.0%
Rocinante 12B	100	100	100	100	100	100.0%
GPT-4.1 Nano	100	100	100	100	98	99.7%
GPT-5.1	100	100	100	100	97	99.5%
Cohere Command R+ (Aug. 2024)	100	100	100	100	97	99.4%
Claude 3.5 Sonnet	100	100	100	100	88	97.6%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	85	97.1%
Z.AI GLM 4.7	100	100	100	100	85	96.9%
Z.AI GLM 4.7 Flash	100	100	100	100	83	96.6%
DeepSeek-V2 Chat	100	100	100	100	77	95.4%
Minimax M2.5	100	100	100	100	77	95.4%
Gemini 3 Flash (Preview)	100	100	100	100	75	95.1%
Qwen 3.5 Plus (2026-02-15)	100	100	100	100	74	94.9%
Gemini 2.5 Pro	100	100	100	100	73	94.6%
WizardLM 2 8x22b	100	100	100	100	73	94.6%
Hermes 3 70B	100	100	100	100	72	94.4%
Mistral Large 3	100	100	100	100	68	93.5%
Z.AI GLM 5	100	100	100	100	67	93.4%
Gemini 2.5 Flash Lite	100	100	100	100	63	92.6%
o4 Mini	100	100	100	100	62	92.3%
Z.AI GLM 4.5	100	100	100	100	53	90.6%
Gemini 3 Pro (Preview)	100	100	100	86	66	90.4%
Ministral 3 8B	100	100	100	98	32	85.9%
Z.AI GLM 4.6	100	100	100	66	63	85.7%
Gemma 3 12B	100	100	100	100	27	85.5%
Claude Haiku 4.5	100	100	100	100	27	85.4%
Claude Opus 4.6	100	100	100	71	53	84.8%
Ministral 3 3B	100	100	100	78	41	83.8%
Grok 4.1 Fast	100	100	100	100	18	83.6%
Claude Sonnet 4.5	100	100	100	100	18	83.6%
GPT-5 Mini	100	100	94	81	35	81.9%
Hermes 3 405B	100	100	100	91	18	81.8%
Claude 3.7 Sonnet	100	100	100	100	5	81.0%
Claude Sonnet 4	100	100	100	100	4	80.7%
GPT-5.2	100	100	91	58	54	80.6%
Gemma 3 27B	100	100	100	100	0	80.0%
Ministral 8B	100	100	100	84	10	78.8%
Qwen 3.5 397B A17B	100	100	100	69	22	78.2%
GPT-4.1	100	100	100	65	2	73.4%
DeepSeek V3.1	100	100	100	42	15	71.2%
Ministral 3B	100	100	100	37	0	67.4%
Gemini 2.5 Flash	100	100	72	41	20	66.7%
DeepSeek V3.2	100	100	100	19	0	63.9%
DeepSeek V3 (2025-03-24)	100	100	100	15	0	63.0%
Mistral Medium 3.1	100	100	100	9	0	61.8%
DeepSeek V3 (2024-12-26)	100	100	67	28	14	61.6%
Arcee AI: Trinity Mini	100	100	64	41	0	60.9%
Grok 4	100	100	83	19	0	60.6%
Claude Sonnet 4.6	100	100	100	0	0	60.0%
ByteDance Seed 1.6 Flash	100	100	35	0	0	47.0%
Grok 4 Fast	100	87	23	0	0	41.9%
Claude 3.5 Haiku	100	100	0	0	0	40.0%
MoonshotAI: Kimi K2.5	100	64	29	0	0	38.6%
Writer: Palmyra X5	55	38	37	36	0	33.0%
GPT-5 Nano	12	0	0	0	0	2.4%

Romance: separated couple reunites

Model	#1	#2	#3	#4	#5	Avg
Qwen 3.5 397B A17B	100	100	100	100	100	100.0%
GPT-5.1	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
Claude Opus 4.5	100	100	100	100	100	100.0%
Z.AI GLM 5	100	100	100	100	100	100.0%
Z.AI GLM 4.7	100	100	100	100	100	100.0%
Gemini 3 Pro (Preview)	100	100	100	100	100	100.0%
Claude Opus 4	100	100	100	100	100	100.0%
Minimax M2.5	100	100	100	100	100	100.0%
Claude Sonnet 4	100	100	100	100	100	100.0%
Grok 4	100	100	100	100	100	100.0%
Claude Sonnet 4.5	100	100	100	100	100	100.0%
Grok 4.1 Fast	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
Stealth: Aurora Alpha	100	100	100	100	100	100.0%
Qwen 3.5 Plus (2026-02-15)	100	100	100	100	100	100.0%
Grok 4 Fast	100	100	100	100	100	100.0%
DeepSeek V3 (2024-12-26)	100	100	100	100	100	100.0%
Mistral Large 3	100	100	100	100	100	100.0%
DeepSeek-V2 Chat	100	100	100	100	100	100.0%
Claude 3.7 Sonnet	100	100	100	100	100	100.0%
Z.AI GLM 4.5	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	100	100.0%
ByteDance Seed 1.6 Flash	100	100	100	100	100	100.0%
Mistral Medium 3.1	100	100	100	100	100	100.0%
Writer: Palmyra X5	100	100	100	100	100	100.0%
Mistral Large 2	100	100	100	100	100	100.0%
Gemini 2.5 Flash	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Gemma 3 12B	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Gemini 2.5 Flash Lite	100	100	100	100	100	100.0%
Mistral Large	100	100	100	100	100	100.0%
Mistral Small Creative	100	100	100	100	100	100.0%
Ministral 3 14B	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Arcee AI: Trinity Large (Preview)	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Ministral 3 8B	100	100	100	100	100	100.0%
Ministral 3 3B	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Cohere Command R+ (Aug. 2024)	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
GPT-4.1 Nano	100	100	100	100	100	100.0%
Ministral 8B	100	100	100	100	100	100.0%
WizardLM 2 8x22b	100	100	100	100	100	100.0%
Rocinante 12B	100	100	100	100	100	100.0%
Claude Opus 4.6	100	100	100	100	99	99.8%
Claude 3.5 Sonnet	100	100	100	100	94	98.9%
DeepSeek V3.1	100	100	100	100	93	98.6%
Gemini 3.1 Pro (Preview)	100	100	100	100	89	97.8%
Hermes 3 70B	100	100	100	100	87	97.5%
Gemini 3 Flash (Preview)	100	100	100	100	86	97.1%
Llama 3.1 Nemotron 70B	100	100	100	100	86	97.1%
Claude Haiku 4.5	100	100	100	100	81	96.2%
GPT-5.2	100	100	100	100	80	95.9%
DeepSeek V3 (2025-03-24)	100	100	100	100	77	95.3%
GPT-5	100	100	100	100	72	94.4%
GPT-4.1 Mini	100	100	100	100	66	93.2%
Z.AI GLM 4.7 Flash	100	100	100	100	57	91.4%
GPT-5 Mini	100	100	100	100	28	85.5%
GPT-4o, May 13th (temp=0)	100	100	100	100	16	83.2%
Gemma 3 27B	100	100	100	100	15	83.1%
GPT-4.1	100	100	100	61	47	81.6%
Z.AI GLM 4.6	100	100	100	100	0	80.1%
o4 Mini	100	100	100	100	0	80.0%
DeepSeek V3.2	100	100	100	100	0	80.0%
GPT-4o, Aug. 6th (temp=1)	100	100	100	100	0	80.0%
Claude 3.5 Haiku	100	100	100	100	0	80.0%
Ministral 3B	100	100	100	100	0	80.0%
MoonshotAI: Kimi K2.5	100	100	100	98	0	79.6%
Gemini 2.5 Pro	100	100	100	88	0	77.5%
Arcee AI: Trinity Mini	100	100	76	69	11	71.3%
Gemma 3 4B	100	100	100	22	21	68.6%
Claude Sonnet 4.6	100	100	100	31	0	66.2%
GPT-5 Nano	88	72	0	0	0	31.9%

Thriller: chase through city streets

Model	#1	#2	#3	#4	#5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
GPT-5.1	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
o4 Mini	100	100	100	100	100	100.0%
Minimax M2.5	100	100	100	100	100	100.0%
Grok 4	100	100	100	100	100	100.0%
Claude Sonnet 4.5	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
Z.AI GLM 4.7 Flash	100	100	100	100	100	100.0%
DeepSeek V3 (2025-03-24)	100	100	100	100	100	100.0%
Grok 4 Fast	100	100	100	100	100	100.0%
Mistral Large 3	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
Claude 3.7 Sonnet	100	100	100	100	100	100.0%
Claude Haiku 4.5	100	100	100	100	100	100.0%
Z.AI GLM 4.5	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=1)	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	100	100.0%
GPT-4.1 Mini	100	100	100	100	100	100.0%
DeepSeek V3.1	100	100	100	100	100	100.0%
Mistral Large 2	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Gemma 3 12B	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Llama 3.1 Nemotron 70B	100	100	100	100	100	100.0%
Mistral Large	100	100	100	100	100	100.0%
Ministral 3 14B	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Arcee AI: Trinity Large (Preview)	100	100	100	100	100	100.0%
Hermes 3 70B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Ministral 3 8B	100	100	100	100	100	100.0%
Arcee AI: Trinity Mini	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
GPT-4.1 Nano	100	100	100	100	100	100.0%
Gemma 3 4B	100	100	100	100	100	100.0%
Ministral 3B	100	100	100	100	100	100.0%
Ministral 8B	100	100	100	100	91	98.3%
Stealth: Aurora Alpha	100	100	100	100	89	97.9%
WizardLM 2 8x22b	100	100	100	100	89	97.8%
Mistral Medium 3.1	100	100	100	100	89	97.8%
GPT-5	100	100	100	94	88	96.4%
Gemini 3 Flash (Preview)	100	100	100	100	80	96.0%
Grok 4.1 Fast	100	100	100	100	80	95.9%
GPT-4.1	100	100	100	100	77	95.5%
GPT-5.2	100	100	100	89	86	94.9%
Z.AI GLM 5	100	100	100	100	63	92.6%
ByteDance Seed 1.6 Flash	100	100	100	100	61	92.3%
Mistral Small Creative	100	100	100	100	59	91.8%
Gemini 2.5 Pro	100	100	100	100	49	89.8%
DeepSeek V3 (2024-12-26)	100	100	100	100	47	89.5%
Claude Opus 4	100	100	100	100	47	89.3%
Gemma 3 27B	100	100	100	100	37	87.3%
Claude Opus 4.5	100	100	100	90	45	87.0%
Z.AI GLM 4.7	100	100	100	95	39	86.9%
Cohere Command R+ (Aug. 2024)	100	100	100	100	32	86.3%
Gemini 3 Pro (Preview)	100	100	100	95	32	85.4%
Claude Sonnet 4	100	100	100	100	25	84.9%
Writer: Palmyra X5	100	100	100	100	24	84.8%
Rocinante 12B	100	100	100	89	32	84.1%
Ministral 3 3B	100	100	100	100	19	83.9%
Claude Sonnet 4.6	100	100	100	65	51	83.1%
Qwen 3.5 Plus (2026-02-15)	100	100	89	60	59	81.6%
Gemini 2.5 Flash	100	100	100	100	2	80.5%
Z.AI GLM 4.6	100	100	100	100	0	80.0%
Claude 3.5 Haiku	100	100	100	100	0	80.0%
GPT-5 Mini	100	100	100	57	40	79.3%
Qwen 3.5 397B A17B	100	100	100	67	28	79.1%
Claude Opus 4.6	100	100	96	91	0	77.3%
Claude 3.5 Sonnet	100	100	100	72	0	74.4%
Gemini 2.5 Flash Lite	100	100	85	58	0	68.6%
DeepSeek-V2 Chat	100	100	100	26	0	65.2%
MoonshotAI: Kimi K2.5	100	100	46	0	0	49.2%
DeepSeek V3.2	94	83	44	3	0	44.9%
GPT-5 Nano	33	0	0	0	0	6.6%

Novelcrafter Default Prompt

Fantasy: entering an ancient ruin

Model	#1	#2	#3	#4	#5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
o4 Mini	100	100	100	100	100	100.0%
Z.AI GLM 4.7	100	100	100	100	100	100.0%
Gemini 3 Pro (Preview)	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
DeepSeek V3 (2024-12-26)	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Gemma 3 12B	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Llama 3.1 Nemotron 70B	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Hermes 3 70B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Cohere Command R+ (Aug. 2024)	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
Gemini 3 Flash (Preview)	100	100	100	100	99	99.7%
Stealth: Aurora Alpha	100	100	100	100	94	98.8%
GPT-5.1	100	100	100	100	88	97.7%
Qwen 3.5 Plus (2026-02-15)	100	100	100	93	75	93.6%
Qwen 3.5 397B A17B	100	100	100	100	66	93.1%
GPT-4o, May 13th (temp=0)	100	100	100	100	47	89.5%
Z.AI GLM 4.7 Flash	100	100	100	100	15	83.0%
Grok 4	100	100	97	90	23	81.8%
GPT-4o, May 13th (temp=1)	100	100	100	100	5	81.0%
Mistral Large 2	100	100	100	100	2	80.4%
Grok 4.1 Fast	100	100	100	100	0	80.0%
Claude 3.5 Haiku	100	100	100	100	0	80.0%
Gemma 3 27B	100	100	100	100	0	80.0%
Gemma 3 4B	100	100	100	100	0	80.0%
Ministral 8B	100	100	100	100	0	80.0%
Rocinante 12B	100	100	100	100	0	80.0%
Claude Sonnet 4.6	100	100	100	100	0	80.0%
Ministral 3B	100	100	100	53	41	78.7%
Z.AI GLM 4.6	100	100	100	93	0	78.5%
GPT-5	100	100	97	85	0	76.4%
Claude Opus 4.6	100	100	84	78	0	72.4%
DeepSeek-V2 Chat	100	100	100	62	0	72.3%
Claude Opus 4.5	100	100	100	59	0	71.8%
GPT-4.1	100	100	100	51	0	70.2%
Z.AI GLM 4.5	100	100	100	48	0	69.6%
Z.AI GLM 5	100	100	100	34	0	66.9%
Ministral 3 14B	100	100	93	38	0	66.3%
Gemini 2.5 Pro	100	100	100	30	0	65.9%
GPT-5.2	100	100	100	26	0	65.3%
Claude 3.7 Sonnet	100	100	100	12	0	62.4%
Grok 4 Fast	100	100	79	32	0	62.2%
Mistral Large	100	100	100	7	0	61.5%
Claude Haiku 4.5	100	100	100	5	0	61.1%
WizardLM 2 8x22b	100	100	100	5	0	61.1%
Arcee AI: Trinity Large (Preview)	100	100	100	4	0	60.7%
Ministral 3 8B	100	100	55	47	0	60.5%
Minimax M2.5	100	100	100	0	0	60.0%
Claude 3.5 Sonnet	100	100	100	0	0	60.0%
Arcee AI: Trinity Mini	100	100	100	0	0	60.0%
GPT-4.1 Nano	100	100	100	0	0	60.0%
Mistral Small Creative	100	100	91	0	0	58.1%
DeepSeek V3.1	100	100	51	17	0	53.5%
DeepSeek V3.2	100	68	60	20	0	49.5%
Mistral Medium 3.1	100	100	45	0	0	48.9%
Claude Sonnet 4	100	100	19	0	0	43.9%
Gemini 2.5 Flash Lite	100	94	10	0	0	40.8%
GPT-4.1 Mini	100	100	1	0	0	40.1%
Claude Opus 4	100	100	0	0	0	40.0%
ByteDance Seed 1.6 Flash	100	100	0	0	0	40.0%
GPT-5 Mini	100	35	31	21	2	37.7%
GPT-4o, Aug. 6th (temp=1)	100	36	0	0	0	27.2%
Ministral 3 3B	100	17	15	2	0	26.9%
Writer: Palmyra X5	100	19	0	0	0	23.8%
DeepSeek V3 (2025-03-24)	100	12	0	0	0	22.4%
Mistral Large 3	86	0	0	0	0	17.2%
Gemini 2.5 Flash	54	8	2	0	0	12.9%
MoonshotAI: Kimi K2.5	49	10	0	0	0	11.6%
Claude Sonnet 4.5	46	0	0	0	0	9.2%
GPT-5 Nano	0	0	0	0	0	0.0%

Horror: alone in an eerie place at night

Model	#1	#2	#3	#4	#5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
Qwen 3.5 397B A17B	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
o4 Mini	100	100	100	100	100	100.0%
Z.AI GLM 5	100	100	100	100	100	100.0%
Claude Opus 4	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
Z.AI GLM 4.5	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	100	100.0%
Claude 3.5 Haiku	100	100	100	100	100	100.0%
Mistral Medium 3.1	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Mistral Large	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Arcee AI: Trinity Large (Preview)	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
Gemma 3 12B	100	100	100	100	99	99.9%
GPT-4.1 Mini	100	100	100	100	96	99.2%
Claude Sonnet 4	100	100	100	100	88	97.6%
Gemini 3 Flash (Preview)	100	100	100	100	85	97.0%
Hermes 3 405B	100	100	100	100	85	97.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	80	96.0%
Mistral Small Creative	100	100	100	100	74	94.7%
DeepSeek V3 (2024-12-26)	100	100	100	100	70	93.9%
GPT-5.1	100	100	100	100	70	93.9%
Z.AI GLM 4.7	100	100	100	100	58	91.6%
Cohere Command R+ (Aug. 2024)	100	100	100	85	70	90.9%
Ministral 3 14B	100	100	100	76	68	88.8%
DeepSeek V3.1	100	100	100	100	43	88.6%
Claude 3.5 Sonnet	100	100	100	100	36	87.2%
Ministral 3 8B	100	100	100	100	32	86.4%
Hermes 3 70B	100	100	100	74	55	85.8%
ByteDance Seed 1.6	100	100	100	100	27	85.4%
Grok 4	100	100	100	63	60	84.6%
Ministral 8B	100	100	100	73	44	83.5%
Claude Opus 4.5	100	80	80	79	75	82.7%
Llama 3.1 Nemotron 70B	100	100	100	59	51	81.9%
Claude Sonnet 4.5	100	100	100	100	0	80.0%
Grok 4.1 Fast	100	100	100	100	0	80.0%
Stealth: Aurora Alpha	100	100	100	100	0	80.0%
DeepSeek-V2 Chat	100	100	100	100	0	80.0%
GPT-4o, Aug. 6th (temp=1)	100	100	100	100	0	80.0%
Rocinante 12B	100	100	100	100	0	80.0%
Gemini 3 Pro (Preview)	100	100	100	58	40	79.6%
Claude Sonnet 4.6	100	100	98	96	0	78.8%
GPT-5.2	100	100	100	92	0	78.5%
Gemini 2.5 Pro	100	100	100	90	0	78.1%
Mistral Large 3	100	100	100	89	0	77.9%
Z.AI GLM 4.6	100	100	95	74	19	77.7%
Mistral Large 2	100	100	100	36	35	74.1%
Gemma 3 4B	100	100	100	32	25	71.5%
Minimax M2.5	100	100	71	38	29	67.5%
Claude Opus 4.6	100	100	100	27	0	65.5%
ByteDance Seed 1.6 Flash	100	66	57	55	47	64.9%
Writer: Palmyra X5	100	100	92	33	0	64.9%
Grok 4 Fast	100	98	82	42	0	64.6%
GPT-4.1	100	100	100	0	0	60.0%
DeepSeek V3 (2025-03-24)	100	100	100	0	0	60.0%
Gemini 2.5 Flash	100	100	100	0	0	60.0%
Ministral 3 3B	100	100	100	0	0	60.0%
Ministral 3B	100	100	100	0	0	60.0%
GPT-5	100	100	57	35	4	59.1%
Gemma 3 27B	100	92	85	16	0	58.6%
Arcee AI: Trinity Mini	100	100	82	0	0	56.5%
MoonshotAI: Kimi K2.5	100	100	75	0	0	54.9%
Qwen 3.5 Plus (2026-02-15)	100	76	46	24	20	53.3%
Claude Haiku 4.5	100	100	33	22	0	51.0%
Claude 3.7 Sonnet	100	100	38	12	0	49.9%
Z.AI GLM 4.7 Flash	100	72	66	9	0	49.3%
WizardLM 2 8x22b	100	75	62	0	0	47.5%
Gemini 2.5 Flash Lite	100	100	32	0	0	46.4%
GPT-4.1 Nano	100	100	15	0	0	43.1%
GPT-5 Mini	100	79	0	0	0	35.8%
DeepSeek V3.2	100	42	30	0	0	34.4%
GPT-5 Nano	0	0	0	0	0	0.0%

Literary fiction: old friends reunite

Model	#1	#2	#3	#4	#5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
Qwen 3.5 397B A17B	100	100	100	100	100	100.0%
Claude Opus 4.6	100	100	100	100	100	100.0%
GPT-5.1	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
Claude Opus 4.5	100	100	100	100	100	100.0%
GPT-5	100	100	100	100	100	100.0%
o4 Mini	100	100	100	100	100	100.0%
Z.AI GLM 5	100	100	100	100	100	100.0%
Z.AI GLM 4.7	100	100	100	100	100	100.0%
Gemini 3 Pro (Preview)	100	100	100	100	100	100.0%
Claude Opus 4	100	100	100	100	100	100.0%
Minimax M2.5	100	100	100	100	100	100.0%
Claude Sonnet 4.5	100	100	100	100	100	100.0%
Grok 4.1 Fast	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
Z.AI GLM 4.6	100	100	100	100	100	100.0%
Gemini 3 Flash (Preview)	100	100	100	100	100	100.0%
Z.AI GLM 4.7 Flash	100	100	100	100	100	100.0%
Stealth: Aurora Alpha	100	100	100	100	100	100.0%
Qwen 3.5 Plus (2026-02-15)	100	100	100	100	100	100.0%
DeepSeek V3 (2025-03-24)	100	100	100	100	100	100.0%
Claude 3.5 Sonnet	100	100	100	100	100	100.0%
DeepSeek V3 (2024-12-26)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
DeepSeek-V2 Chat	100	100	100	100	100	100.0%
DeepSeek V3.2	100	100	100	100	100	100.0%
Claude 3.5 Haiku	100	100	100	100	100	100.0%
ByteDance Seed 1.6 Flash	100	100	100	100	100	100.0%
Mistral Medium 3.1	100	100	100	100	100	100.0%
DeepSeek V3.1	100	100	100	100	100	100.0%
Mistral Large 2	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Gemma 3 12B	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Gemini 2.5 Flash Lite	100	100	100	100	100	100.0%
Mistral Large	100	100	100	100	100	100.0%
Gemma 3 27B	100	100	100	100	100	100.0%
Mistral Small Creative	100	100	100	100	100	100.0%
Ministral 3 14B	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Arcee AI: Trinity Large (Preview)	100	100	100	100	100	100.0%
Hermes 3 70B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Ministral 3 8B	100	100	100	100	100	100.0%
Ministral 3 3B	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Cohere Command R+ (Aug. 2024)	100	100	100	100	100	100.0%
GPT-4.1 Nano	100	100	100	100	100	100.0%
Ministral 8B	100	100	100	100	100	100.0%
Rocinante 12B	100	100	100	100	100	100.0%
Llama 3.1 Nemotron 70B	100	100	100	100	99	99.8%
GPT-4.1 Mini	100	100	100	99	98	99.4%
GPT-5 Mini	100	100	100	100	95	98.9%
Mistral NeMO	100	100	100	100	95	98.9%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	93	98.7%
Gemini 2.5 Flash	100	100	100	100	90	97.9%
GPT-5.2	100	100	100	100	86	97.3%
Grok 4	100	100	100	100	85	97.1%
WizardLM 2 8x22b	100	100	100	100	73	94.5%
Ministral 3B	100	100	100	100	65	93.0%
Mistral Large 3	100	100	100	94	70	92.8%
Claude 3.7 Sonnet	100	100	100	100	51	90.3%
Gemini 2.5 Pro	100	100	100	100	50	90.1%
Gemma 3 4B	100	100	100	100	40	87.9%
MoonshotAI: Kimi K2.5	100	100	100	91	48	87.9%
GPT-4.1	100	100	100	100	37	87.3%
Writer: Palmyra X5	100	100	100	100	29	85.9%
Grok 4 Fast	100	100	100	62	55	83.3%
GPT-4o, Aug. 6th (temp=1)	100	100	100	100	15	83.1%
Claude Sonnet 4.6	100	100	100	100	13	82.5%
Claude Sonnet 4	100	100	100	100	3	80.6%
Z.AI GLM 4.5	100	100	100	100	0	80.0%
GPT-4o, May 13th (temp=1)	100	100	100	100	0	80.0%
Arcee AI: Trinity Mini	100	100	100	82	0	76.4%
Claude Haiku 4.5	100	100	57	43	36	67.1%
GPT-5 Nano	47	27	0	0	0	14.9%

Mystery: examining a crime scene

Model	#1	#2	#3	#4	#5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
Qwen 3.5 397B A17B	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
o4 Mini	100	100	100	100	100	100.0%
Z.AI GLM 5	100	100	100	100	100	100.0%
Z.AI GLM 4.7	100	100	100	100	100	100.0%
Gemini 3 Pro (Preview)	100	100	100	100	100	100.0%
Minimax M2.5	100	100	100	100	100	100.0%
Claude Sonnet 4	100	100	100	100	100	100.0%
Claude Sonnet 4.5	100	100	100	100	100	100.0%
Grok 4.1 Fast	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
Z.AI GLM 4.6	100	100	100	100	100	100.0%
Gemini 3 Flash (Preview)	100	100	100	100	100	100.0%
DeepSeek V3 (2025-03-24)	100	100	100	100	100	100.0%
DeepSeek V3 (2024-12-26)	100	100	100	100	100	100.0%
Mistral Large 3	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
DeepSeek-V2 Chat	100	100	100	100	100	100.0%
Claude Haiku 4.5	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=1)	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	100	100.0%
ByteDance Seed 1.6 Flash	100	100	100	100	100	100.0%
DeepSeek V3.1	100	100	100	100	100	100.0%
Mistral Large 2	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Llama 3.1 Nemotron 70B	100	100	100	100	100	100.0%
Mistral Large	100	100	100	100	100	100.0%
Ministral 3 14B	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Arcee AI: Trinity Mini	100	100	100	100	100	100.0%
Ministral 3 3B	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Cohere Command R+ (Aug. 2024)	100	100	100	100	100	100.0%
Rocinante 12B	100	100	100	100	100	100.0%
Gemini 2.5 Pro	100	100	100	100	98	99.6%
Z.AI GLM 4.5	100	100	100	100	98	99.6%
Claude Opus 4.5	100	100	100	100	97	99.3%
Hermes 3 70B	100	100	100	100	95	99.1%
GPT-4o, May 13th (temp=1)	100	100	100	100	95	99.0%
GPT-5.1	100	100	100	100	93	98.7%
Claude Opus 4	100	100	100	100	89	97.8%
Stealth: Aurora Alpha	100	100	100	100	88	97.7%
GPT-4.1 Mini	100	100	100	100	87	97.3%
Ministral 3B	100	100	100	100	85	96.9%
Ministral 8B	100	100	100	100	83	96.7%
GPT-5 Mini	100	100	100	100	80	96.0%
Mistral Medium 3.1	100	100	100	100	74	94.9%
GPT-5.2	100	100	100	100	74	94.9%
Arcee AI: Trinity Large (Preview)	100	100	100	100	64	92.9%
Gemma 3 4B	100	100	100	100	64	92.8%
Llama 3.1 70B	100	100	100	100	62	92.4%
Gemini 2.5 Flash Lite	100	100	100	100	52	90.5%
Claude 3.7 Sonnet	100	100	100	100	51	90.1%
WizardLM 2 8x22b	100	100	100	100	48	89.7%
GPT-4.1 Nano	100	100	100	96	31	85.3%
Claude 3.5 Sonnet	100	100	100	63	54	83.3%
GPT-4o Mini (temp=1)	100	100	100	100	13	82.7%
GPT-5	100	100	100	92	15	81.5%
Z.AI GLM 4.7 Flash	100	100	100	100	0	80.0%
Gemma 3 12B	100	100	100	100	0	80.0%
Writer: Palmyra X5	100	100	100	95	0	79.0%
Claude Opus 4.6	100	100	90	56	46	78.2%
Ministral 3 8B	100	100	100	41	39	76.0%
DeepSeek V3.2	100	100	100	77	0	75.5%
MoonshotAI: Kimi K2.5	100	100	100	54	13	73.5%
Qwen 3.5 Plus (2026-02-15)	100	100	100	66	0	73.2%
Mistral NeMO	100	100	100	54	0	70.8%
Grok 4	91	73	71	64	35	66.7%
Gemini 2.5 Flash	100	100	83	31	0	62.8%
GPT-4.1	100	100	100	14	0	62.7%
Gemma 3 27B	100	100	87	18	0	60.8%
Claude Sonnet 4.6	100	100	100	0	0	60.0%
Claude 3.5 Haiku	100	100	100	0	0	60.0%
Mistral Small Creative	100	100	95	0	0	58.9%
Grok 4 Fast	100	56	0	0	0	31.3%
GPT-5 Nano	6	0	0	0	0	1.3%

Romance: separated couple reunites

Model	#1	#2	#3	#4	#5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
Qwen 3.5 397B A17B	100	100	100	100	100	100.0%
GPT-5 Mini	100	100	100	100	100	100.0%
GPT-5.1	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
Claude Opus 4.5	100	100	100	100	100	100.0%
GPT-5	100	100	100	100	100	100.0%
o4 Mini	100	100	100	100	100	100.0%
Z.AI GLM 5	100	100	100	100	100	100.0%
GPT-5.2	100	100	100	100	100	100.0%
Z.AI GLM 4.7	100	100	100	100	100	100.0%
Gemini 3 Pro (Preview)	100	100	100	100	100	100.0%
Claude Opus 4	100	100	100	100	100	100.0%
Minimax M2.5	100	100	100	100	100	100.0%
Grok 4	100	100	100	100	100	100.0%
Claude Sonnet 4.6	100	100	100	100	100	100.0%
Claude Sonnet 4.5	100	100	100	100	100	100.0%
Grok 4.1 Fast	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
Z.AI GLM 4.6	100	100	100	100	100	100.0%
Gemini 3 Flash (Preview)	100	100	100	100	100	100.0%
Z.AI GLM 4.7 Flash	100	100	100	100	100	100.0%
Stealth: Aurora Alpha	100	100	100	100	100	100.0%
Qwen 3.5 Plus (2026-02-15)	100	100	100	100	100	100.0%
Claude 3.5 Sonnet	100	100	100	100	100	100.0%
DeepSeek V3 (2024-12-26)	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
Claude 3.7 Sonnet	100	100	100	100	100	100.0%
Claude Haiku 4.5	100	100	100	100	100	100.0%
DeepSeek V3.2	100	100	100	100	100	100.0%
Z.AI GLM 4.5	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=1)	100	100	100	100	100	100.0%
Claude 3.5 Haiku	100	100	100	100	100	100.0%
ByteDance Seed 1.6 Flash	100	100	100	100	100	100.0%
Mistral Medium 3.1	100	100	100	100	100	100.0%
Mistral Large 2	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=1)	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Gemma 3 12B	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Llama 3.1 Nemotron 70B	100	100	100	100	100	100.0%
Gemma 3 27B	100	100	100	100	100	100.0%
Mistral Small Creative	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Arcee AI: Trinity Large (Preview)	100	100	100	100	100	100.0%
Hermes 3 70B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Ministral 3 3B	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Cohere Command R+ (Aug. 2024)	100	100	100	100	100	100.0%
GPT-4.1 Nano	100	100	100	100	100	100.0%
Ministral 8B	100	100	100	100	100	100.0%
WizardLM 2 8x22b	100	100	100	100	100	100.0%
Ministral 3B	100	100	100	100	100	100.0%
Claude Sonnet 4	100	100	100	100	97	99.3%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	94	98.8%
DeepSeek V3.1	100	100	100	100	90	97.9%
DeepSeek V3 (2025-03-24)	100	100	100	100	89	97.9%
Mistral NeMO	100	100	100	100	85	97.0%
Mistral Large	100	100	100	100	85	96.9%
Ministral 3 8B	100	100	100	100	84	96.8%
GPT-4o, May 13th (temp=1)	100	100	100	100	75	95.0%
Gemini 2.5 Flash	100	100	100	100	72	94.4%
GPT-4.1 Mini	100	100	100	100	71	94.3%
GPT-4.1	100	100	100	100	62	92.3%
Ministral 3 14B	100	100	100	100	54	90.8%
Gemini 2.5 Pro	100	100	100	100	48	89.7%
Claude Opus 4.6	100	100	100	100	40	88.0%
Mistral Large 3	100	100	100	100	31	86.2%
Writer: Palmyra X5	100	100	100	100	31	86.1%
Grok 4 Fast	100	100	100	100	27	85.3%
DeepSeek-V2 Chat	100	100	100	100	0	80.0%
Gemini 2.5 Flash Lite	100	100	100	100	0	80.0%
Rocinante 12B	100	100	100	100	0	80.0%
Arcee AI: Trinity Mini	100	100	100	73	17	78.1%
MoonshotAI: Kimi K2.5	100	100	100	66	23	77.9%
Gemma 3 4B	100	100	100	28	27	71.1%
GPT-5 Nano	67	25	13	0	0	21.1%

Thriller: chase through city streets

Model	#1	#2	#3	#4	#5	Avg
Gemini 3.1 Pro (Preview)	100	100	100	100	100	100.0%
Qwen 3.5 397B A17B	100	100	100	100	100	100.0%
GPT-5 Mini	100	100	100	100	100	100.0%
o4 Mini High	100	100	100	100	100	100.0%
Claude Opus 4.5	100	100	100	100	100	100.0%
GPT-5	100	100	100	100	100	100.0%
o4 Mini	100	100	100	100	100	100.0%
GPT-5.2	100	100	100	100	100	100.0%
Z.AI GLM 4.7	100	100	100	100	100	100.0%
Minimax M2.5	100	100	100	100	100	100.0%
Grok 4.1 Fast	100	100	100	100	100	100.0%
ByteDance Seed 1.6	100	100	100	100	100	100.0%
Z.AI GLM 4.6	100	100	100	100	100	100.0%
GPT-4.1	100	100	100	100	100	100.0%
Stealth: Aurora Alpha	100	100	100	100	100	100.0%
DeepSeek V3 (2025-03-24)	100	100	100	100	100	100.0%
Grok 4 Fast	100	100	100	100	100	100.0%
DeepSeek V3 (2024-12-26)	100	100	100	100	100	100.0%
Mistral Large 3	100	100	100	100	100	100.0%
GPT-4o, May 13th (temp=0)	100	100	100	100	100	100.0%
DeepSeek-V2 Chat	100	100	100	100	100	100.0%
Claude Haiku 4.5	100	100	100	100	100	100.0%
Z.AI GLM 4.5	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=1)	100	100	100	100	100	100.0%
GPT-4o, Aug. 6th (temp=0)	100	100	100	100	100	100.0%
Claude 3.5 Haiku	100	100	100	100	100	100.0%
GPT-4.1 Mini	100	100	100	100	100	100.0%
Mistral Medium 3.1	100	100	100	100	100	100.0%
Writer: Palmyra X5	100	100	100	100	100	100.0%
Mistral Large 2	100	100	100	100	100	100.0%
Hermes 3 405B	100	100	100	100	100	100.0%
GPT-4o Mini (temp=0)	100	100	100	100	100	100.0%
Gemma 3 12B	100	100	100	100	100	100.0%
Llama 3.1 70B	100	100	100	100	100	100.0%
Mistral Large	100	100	100	100	100	100.0%
Gemma 3 27B	100	100	100	100	100	100.0%
Qwen 2.5 72B	100	100	100	100	100	100.0%
Mistral Small 3.2 24B	100	100	100	100	100	100.0%
Arcee AI: Trinity Large (Preview)	100	100	100	100	100	100.0%
Hermes 3 70B	100	100	100	100	100	100.0%
Claude 3 Haiku	100	100	100	100	100	100.0%
Ministral 3 8B	100	100	100	100	100	100.0%
Ministral 3 3B	100	100	100	100	100	100.0%
Llama 3.1 8B	100	100	100	100	100	100.0%
Cohere Command R+ (Aug. 2024)	100	100	100	100	100	100.0%
Mistral NeMO	100	100	100	100	100	100.0%
GPT-4.1 Nano	100	100	100	100	100	100.0%
Gemma 3 4B	100	100	100	100	100	100.0%
Ministral 8B	100	100	100	100	100	100.0%
Ministral 3B	100	100	100	100	100	100.0%
Rocinante 12B	100	100	100	100	100	100.0%
Grok 4	100	100	100	100	99	99.7%
Claude Sonnet 4	100	100	100	100	97	99.4%
GPT-4o, May 13th (temp=1)	100	100	100	100	93	98.5%
DeepSeek V3.1	100	100	100	100	93	98.5%
Qwen 3.5 Plus (2026-02-15)	100	100	100	100	93	98.5%
WizardLM 2 8x22b	100	100	100	100	92	98.4%
Gemini 3 Pro (Preview)	100	100	100	100	91	98.2%
Gemini 3 Flash (Preview)	100	100	100	100	89	97.8%
Gemini 2.5 Flash Lite	100	100	100	100	86	97.2%
Claude Opus 4	100	100	100	100	76	95.2%
Llama 3.1 Nemotron 70B	100	100	100	100	68	93.6%
GPT-5.1	100	100	100	100	68	93.5%
Z.AI GLM 5	100	100	100	100	66	93.3%
Claude Sonnet 4.5	100	100	100	99	61	92.1%
Gemini 2.5 Flash	100	100	100	100	48	89.5%
Z.AI GLM 4.7 Flash	100	100	100	100	44	88.8%
MoonshotAI: Kimi K2.5	100	100	100	100	33	86.6%
Claude 3.5 Sonnet	100	100	100	100	22	84.5%
GPT-4o Mini (temp=1)	100	100	100	100	18	83.6%
ByteDance Seed 1.6 Flash	100	100	100	100	13	82.7%
Gemini 2.5 Pro	100	100	100	100	5	81.0%
Claude Sonnet 4.6	100	100	100	100	3	80.6%
Mistral Small Creative	100	100	100	90	11	80.4%
Ministral 3 14B	100	100	100	95	0	79.1%
GPT-5 Nano	100	100	88	73	0	72.3%
Arcee AI: Trinity Mini	100	100	77	66	14	71.3%
Claude 3.7 Sonnet	100	100	100	49	0	69.9%
Claude Opus 4.6	100	96	90	33	11	66.1%
DeepSeek V3.2	100	100	49	0	0	49.8%
