Content Invention

Subcategory of Hallucination. 154 models scored.

Model Leaderboard

All models are ranked by their Content Invention subcategory score. Models with equal scores appear in descending order of Overall score.

#	Model	Content Invention	Hallucination	Overall
1	Qwen3.7 Max	100.00%	94.32%	95.75%
2	Claude Opus 4.6 (Reasoning)	100.00%	98.13%	95.02%
3	Qwen3.6 Max Preview	100.00%	91.72%	94.54%
4	Gemini 3.1 Pro (Preview)	100.00%	89.06%	94.37%
5	Z.AI GLM 5.1	100.00%	97.67%	94.37%
6	Z.AI GLM 5 Turbo	100.00%	98.64%	94.27%
7	Gemini 3.5 Flash (Reasoning)	100.00%	91.22%	94.08%
8	Claude Sonnet 4.6 (Reasoning)	100.00%	93.96%	93.66%
9	Grok 4.3 (Reasoning)	100.00%	97.40%	93.60%
10	GPT-5.4 (Reasoning)	100.00%	90.43%	93.24%
11	Claude Opus 4.7 (Reasoning)	100.00%	97.39%	93.23%
12	GPT-5.5 (Reasoning)	100.00%	84.24%	92.98%
13	GPT-5 Mini	100.00%	97.71%	92.62%
14	GPT-5.5 (Reasoning, Low)	100.00%	84.41%	92.59%
15	GPT-5.1	100.00%	98.44%	92.54%
16	Claude Opus 4.6	100.00%	93.60%	92.35%
17	MoonshotAI: Kimi K2.6	100.00%	88.85%	92.31%
18	Claude Opus 4.8 (Reasoning)	100.00%	94.83%	92.22%
19	Claude Opus 4.8 (Reasoning, Low)	100.00%	94.65%	92.14%
20	GPT-5	100.00%	91.84%	91.93%
21	Qwen 3.5 397B A17B	100.00%	82.10%	91.73%
22	Gemma 4 31B (Reasoning)	100.00%	94.41%	91.71%
23	Qwen 3.5 122B	100.00%	86.58%	91.53%
24	Qwen 3.5 Plus (2026-04-20)	100.00%	97.13%	91.51%
25	Gemma 4 26B (Reasoning)	100.00%	97.59%	91.49%
26	Grok 4.20 (Beta, Reasoning)	100.00%	96.28%	91.49%
27	GPT-5.4 (Reasoning, Low)	100.00%	92.29%	91.41%
28	Grok 4.20 (Reasoning)	100.00%	95.66%	91.39%
29	Z.AI GLM 5	100.00%	97.74%	91.23%
30	Claude Sonnet 4.6	100.00%	89.98%	91.15%
31	MoonshotAI: Kimi K2.5	100.00%	88.02%	91.04%
32	MiniMax M3	100.00%	95.66%	90.88%
33	Qwen 3.5 27B	100.00%	86.58%	90.85%
34	ByteDance Seed 1.6	100.00%	95.14%	90.70%
35	Qwen 3.6 Flash	100.00%	92.26%	90.65%
36	GPT-5.4 Mini (Reasoning)	100.00%	96.22%	90.65%
37	Gemini 3 Flash (Preview, Reasoning)	100.00%	85.32%	90.50%
38	o4 Mini High	100.00%	99.06%	90.29%
39	GPT-5.2	100.00%	95.09%	90.26%
40	DeepSeek V4 Pro (Reasoning)	100.00%	90.88%	90.10%
41	Claude Opus 4.7	100.00%	85.89%	89.93%
42	Qwen 3.6 27B	100.00%	95.73%	89.72%
43	Claude Opus 4.5	100.00%	82.06%	89.69%
44	Grok 4.1 Fast	100.00%	99.02%	89.55%
45	Z.AI GLM 4.6	100.00%	90.08%	89.11%
46	MiniMax M2.7	100.00%	96.47%	89.10%
47	GPT-5.5	100.00%	80.57%	89.09%
48	Qwen 3.6 35B	100.00%	90.19%	89.05%
49	Gemini 3 Pro (Preview)	100.00%	88.23%	88.79%
50	Claude Sonnet 4	100.00%	80.08%	88.72%
51	MiniMax M2.5	100.00%	92.94%	88.71%
52	Z.AI GLM 4.7	100.00%	88.47%	88.69%
53	GPT-4.1	100.00%	95.24%	88.68%
54	Gemini 2.5 Pro	100.00%	86.11%	88.53%
55	o4 Mini	100.00%	98.75%	88.35%
56	Grok 4	100.00%	89.45%	88.12%
57	Claude Sonnet 4.5	100.00%	75.57%	88.03%
58	Claude Opus 4	100.00%	75.68%	87.69%
59	Xiaomi MIMO v2.5 Pro	100.00%	95.17%	87.36%
60	Stealth: Hunter Alpha	100.00%	90.78%	87.34%
61	ByteDance Seed 2.0 Mini	100.00%	91.29%	86.91%
62	Gemma 4 31B	100.00%	90.17%	86.91%
63	Gemini 2.5 Flash (Reasoning)	100.00%	95.60%	86.51%
64	Gemini 3.5 Flash (Reasoning, Minimal)	100.00%	79.09%	86.47%
65	GPT-OSS 120B	100.00%	95.29%	86.44%
66	Gemini 3.1 Flash Lite (Reasoning)	100.00%	75.26%	86.41%
67	Z.AI GLM 4.5	100.00%	87.05%	86.27%
68	Grok 4 Fast	100.00%	91.09%	86.15%
69	Qwen 3.5 Plus (2026-02-15)	100.00%	73.35%	85.96%
70	Stealth: Healer Alpha	100.00%	94.67%	85.93%
71	Gemini 3.1 Flash Lite (Preview)	100.00%	74.58%	85.87%
72	Gemma 4 26B	100.00%	88.45%	85.84%
73	Gemini 3.1 Flash Lite	100.00%	75.72%	85.75%
74	GPT-5.4 Mini (Reasoning, Low)	100.00%	98.43%	85.75%
75	Mistral Large 3	100.00%	78.17%	85.43%
76	GPT-4o, May 13th (temp=0)	100.00%	69.76%	85.36%
77	Gemini 3 Flash (Preview)	100.00%	71.24%	85.35%
78	Claude Haiku 4.5	100.00%	83.86%	85.14%
79	Xiaomi MIMO v2.5	100.00%	95.25%	85.05%
80	ByteDance Seed 2.0 Lite	100.00%	79.75%	84.80%
81	Nemotron 3 Super	100.00%	99.69%	84.56%
82	GPT-5.4	100.00%	73.78%	84.32%
83	Claude 3.5 Sonnet	100.00%	76.31%	84.24%
84	Grok 4.20 (Beta)	100.00%	78.28%	83.85%
85	GPT-4o, May 13th (temp=1)	100.00%	73.40%	83.80%
86	DeepSeek V3 (2024-12-26)	100.00%	73.11%	83.68%
87	Claude 3.7 Sonnet	100.00%	75.18%	83.39%
88	GPT-4.1 Mini	100.00%	81.14%	83.20%
89	Z.AI GLM 4.5 Air	100.00%	92.74%	83.12%
90	Hermes 3 405B	100.00%	79.70%	82.86%
91	DeepSeek V4 Pro	100.00%	78.72%	82.63%
92	GPT-4o, Aug. 6th (temp=1)	100.00%	79.53%	82.62%
93	GPT-4o, Aug. 6th (temp=0)	100.00%	73.35%	82.45%
94	GPT-5.4 Mini	100.00%	78.40%	82.43%
95	Mistral Large 2	100.00%	77.87%	82.41%
96	DeepSeek V3.1	100.00%	72.80%	82.39%
97	DeepSeek V3.2	100.00%	72.50%	82.25%
98	Qwen 3 32B	100.00%	89.56%	82.21%
99	DeepSeek V4 Flash	100.00%	75.47%	82.02%
100	Grok 4.20	100.00%	73.01%	81.70%
101	GPT-5.4 Nano (Reasoning)	100.00%	97.95%	81.36%
102	Gemini 2.5 Flash Lite	100.00%	76.17%	81.08%
103	Gemini 2.5 Flash	100.00%	71.70%	80.60%
104	Mistral Large	100.00%	77.50%	80.15%
105	GPT-5.4 Nano (Reasoning, Low)	100.00%	99.03%	79.48%
106	GPT-4o Mini (temp=1)	100.00%	76.78%	79.08%
107	Grok 4.3	100.00%	77.16%	78.66%
108	Mistral Small 3.2 24B	100.00%	75.64%	78.58%
109	Gemma 3 12B	100.00%	69.15%	78.41%
110	GPT-4o Mini (temp=0)	100.00%	73.56%	78.29%
111	Gemma 3 27B	100.00%	68.74%	77.85%
112	Mistral Medium 3.1	100.00%	82.09%	77.83%
113	Mistral Small 4	100.00%	73.41%	76.46%
114	Qwen 2.5 72B	100.00%	79.56%	75.46%
115	GPT-5.4 Nano	100.00%	89.29%	74.40%
116	Arcee AI: Trinity Large (Preview)	100.00%	76.47%	73.33%
117	Mistral Small Creative	100.00%	74.46%	73.27%
118	Ministral 3 14B	100.00%	79.99%	72.54%
119	GPT-4.1 Nano	100.00%	87.73%	71.94%
120	Ministral 3 8B	100.00%	92.52%	71.76%
121	Arcee AI: Trinity Mini	100.00%	91.12%	70.90%
122	Gemma 3 4B	100.00%	67.60%	68.57%
123	Ministral 8B	100.00%	89.19%	64.87%
124	LFM2 24B	100.00%	90.53%	58.77%
125	DeepSeek V4 Flash (Reasoning)	96.43%	95.02%	89.01%
126	Qwen 3.5 9B	96.43%	85.58%	86.05%
127	Mistral Small 4 (Reasoning)	96.43%	92.98%	82.39%
128	Llama 3.1 Nemotron 70B	96.43%	74.99%	74.70%
129	Gemini 2.5 Flash Lite (Reasoning)	95.24%	95.59%	85.75%
130	Qwen 3.5 Flash	92.90%	80.63%	86.38%
131	Aion 2.0	92.89%	93.57%	89.21%
132	Qwen 3.5 35B	92.86%	80.87%	88.00%
133	Z.AI GLM 4.7 Flash	92.86%	89.86%	84.82%
134	Writer: Palmyra X5	92.86%	72.04%	79.57%
135	Inception Mercury	92.86%	95.08%	79.50%
136	Llama 3.1 70B	92.86%	69.78%	78.40%
137	Cydonia 24B V4.1	91.07%	81.53%	75.09%
138	DeepSeek-V2 Chat	90.48%	69.48%	84.83%
139	ByteDance Seed 1.6 Flash	89.88%	94.52%	73.27%
140	Ministral 3 3B	89.29%	70.45%	67.22%
141	GPT-5 Nano	88.10%	93.52%	82.60%
142	DeepSeek V3 (2025-03-24)	88.10%	67.07%	81.99%
143	WizardLM 2 8x22b	85.73%	70.19%	71.06%
144	Qwen3 235B A22B Instruct 2507	85.71%	69.82%	80.10%
145	Ministral 3B	85.71%	70.75%	61.29%
146	Nemotron 3 Nano	82.14%	89.16%	77.73%
147	Inception Mercury 2	81.55%	92.60%	83.85%
148	Llama 3.1 8B	77.38%	43.27%	63.35%
149	Mistral NeMO	76.19%	62.63%	65.04%
150	Claude 3 Haiku	62.50%	60.81%	71.19%
151	Hermes 3 70B	60.79%	67.03%	72.57%
152	Skyfall 36B V2	56.07%	66.29%	65.76%
153	Rocinante 12B	38.95%	52.07%	54.54%
154	Cohere Command R+ (Aug. 2024)	37.90%	64.40%	69.03%