Hallucination

30 scenarios across 3 subcategories. 146 models scored.

Subcategories

Subcategory	Avg Score	Best Model	Best Score
False Positives	75.58%	o4 Mini High	98.59%
Content Invention	97.74%	GPT-5.6 Sol (Reasoning)	100.00%
Output Corruption	99.65%	GPT-5.6 Sol (Reasoning)	100.00%

Model Leaderboard

All 146 models, ranked from highest to lowest Hallucination category score.

#	Model	Hallucination	False Positives	Content Invention	Output Corruption	Overall
1	o4 Mini High	99.53%	98.59%	100.00%	100.00%	88.78%
2	Z.AI GLM 5 Turbo	99.32%	97.97%	100.00%	99.99%	93.29%
3	Claude Opus 4.6 (Reasoning)	99.06%	97.19%	100.00%	100.00%	95.06%
4	Grok 4.5 (Reasoning, High)	99.06%	97.19%	100.00%	100.00%	94.12%
5	Grok 4.5 (Reasoning, Low)	99.01%	97.03%	100.00%	100.00%	90.94%
6	Z.AI GLM 5.2 (Reasoning, High)	98.99%	96.98%	100.00%	99.98%	93.41%
7	Z.AI GLM 5.1	98.84%	96.51%	100.00%	100.00%	93.74%
8	Gemma 4 26B (Reasoning)	98.79%	96.41%	100.00%	99.95%	89.02%
9	GPT-5.4 Mini (Reasoning, Low)	98.78%	96.36%	100.00%	99.98%	83.57%
10	Claude Opus 4.7 (Reasoning)	98.69%	96.09%	100.00%	99.98%	92.53%
11	GPT-5.6 Luna (Reasoning)	98.65%	95.94%	100.00%	100.00%	90.61%
12	o4 Mini	98.54%	95.63%	100.00%	100.00%	86.56%
13	Qwen 3.5 Plus (2026-04-20)	98.38%	96.25%	100.00%	98.89%	89.79%
14	GPT-5 Mini	98.28%	94.85%	100.00%	100.00%	91.31%
15	GPT-5.4 Mini (Reasoning)	98.11%	94.32%	100.00%	100.00%	89.82%
16	Z.AI GLM 5	98.04%	94.11%	100.00%	100.00%	89.60%
17	Grok 4.3 (Reasoning)	97.86%	93.59%	100.00%	100.00%	90.99%
18	MiniMax M3	97.83%	93.49%	100.00%	100.00%	90.45%
19	Gemini 2.5 Flash (Reasoning)	97.79%	93.44%	100.00%	99.94%	84.14%
20	GPT-OSS 120B	97.63%	92.97%	100.00%	99.92%	84.81%
21	Xiaomi MIMO v2.5	97.61%	92.92%	100.00%	99.93%	83.95%
22	ByteDance Seed 1.6	97.57%	92.71%	100.00%	100.00%	89.59%
23	GPT-5.2	97.55%	92.64%	100.00%	100.00%	89.45%
24	Qwen 3.6 27B	97.42%	92.33%	100.00%	99.95%	88.33%
25	Claude Opus 4.8 (Reasoning)	97.41%	92.24%	100.00%	100.00%	92.33%
26	Nemotron 3 Super	97.34%	92.03%	100.00%	100.00%	81.69%
27	Claude Opus 4.8 (Reasoning, Low)	97.33%	91.98%	100.00%	100.00%	91.89%
28	Qwen3.7 Max	97.15%	91.51%	100.00%	99.95%	94.55%
29	Xiaomi MIMO v2.5 Pro	97.02%	91.05%	100.00%	100.00%	86.05%
30	Gemini 2.5 Flash Lite (Reasoning)	96.99%	95.78%	95.24%	99.96%	83.10%
31	GPT-5.4 Nano (Reasoning)	96.98%	90.95%	100.00%	100.00%	80.02%
32	Claude Sonnet 4.6 (Reasoning)	96.98%	90.94%	100.00%	100.00%	93.64%
33	DeepSeek V4 Flash (Reasoning)	96.92%	94.32%	96.43%	100.00%	88.06%
34	Claude Opus 4.6	96.80%	90.40%	100.00%	100.00%	92.31%
35	Claude Sonnet 5 (Reasoning)	96.77%	90.31%	100.00%	100.00%	90.40%
36	Claude Sonnet 5 (Reasoning, Low)	96.44%	89.32%	100.00%	100.00%	90.16%
37	Gemma 4 31B (Reasoning)	96.37%	89.11%	100.00%	100.00%	89.64%
38	Grok 4.20 (Reasoning)	96.30%	88.89%	100.00%	100.00%	90.87%
39	GPT-5.1	96.22%	88.65%	100.00%	100.00%	90.73%
40	Qwen 3.6 Flash	96.13%	88.39%	100.00%	100.00%	89.31%
41	GPT-5	95.92%	87.76%	100.00%	100.00%	91.48%
42	GPT-5.6 Sol	95.87%	87.60%	100.00%	100.00%	90.76%
43	Qwen3.6 Max Preview	95.86%	87.58%	100.00%	100.00%	93.72%
44	GPT-5.4 (Reasoning, Low)	95.71%	87.14%	100.00%	100.00%	90.91%
45	GPT-5.4 Nano (Reasoning, Low)	95.71%	87.13%	100.00%	100.00%	77.46%
46	Gemini 3.5 Flash (Reasoning)	95.61%	86.82%	100.00%	100.00%	93.35%
47	GPT-4.1	95.60%	86.82%	100.00%	99.97%	86.82%
48	Aion 3.0	95.57%	86.82%	100.00%	99.88%	88.78%
49	DeepSeek V4 Pro (Reasoning)	95.25%	86.87%	100.00%	98.89%	89.28%
50	GPT-5.4 (Reasoning)	95.22%	85.65%	100.00%	100.00%	93.85%
51	Gemma 4 31B	95.09%	85.26%	100.00%	100.00%	85.23%
52	Claude Sonnet 4.6	94.99%	84.99%	100.00%	99.97%	90.66%
53	GPT-5.6 Terra (Reasoning)	94.98%	84.95%	100.00%	100.00%	92.49%
54	MiniMax M2.5	94.72%	84.66%	100.00%	99.50%	86.71%
55	MiniMax M2.7	94.69%	84.34%	100.00%	99.72%	86.23%
56	Gemini 3.1 Pro (Preview)	94.53%	83.59%	100.00%	100.00%	94.08%
57	MoonshotAI: Kimi K2.6	94.43%	83.28%	100.00%	100.00%	92.57%
58	Z.AI GLM 4.6	94.42%	84.38%	100.00%	98.89%	87.64%
59	Z.AI GLM 4.7	94.23%	82.73%	100.00%	99.96%	87.67%
60	Gemma 4 26B	94.22%	82.67%	100.00%	100.00%	84.89%
61	Aion 3.0 Mini	94.08%	88.15%	95.24%	98.85%	83.69%
62	Qwen 3.6 35B	94.07%	83.33%	100.00%	98.89%	87.66%
63	ByteDance Seed 2.0 Mini	94.07%	82.30%	100.00%	99.90%	85.69%
64	Ministral 3 8B	94.02%	82.06%	100.00%	100.00%	69.98%
65	MoonshotAI: Kimi K2.5	94.01%	82.03%	100.00%	100.00%	90.86%
66	Z.AI GLM 4.5 Air	94.00%	83.12%	100.00%	98.87%	80.74%
67	GPT-5.6 Luna	93.37%	80.10%	100.00%	100.00%	85.06%
68	Qwen 3.5 122B	93.29%	79.87%	100.00%	100.00%	90.32%
69	Qwen 3.5 27B	93.29%	79.86%	100.00%	100.00%	90.05%
70	Inception Mercury 2	93.22%	98.13%	81.55%	99.99%	81.99%
71	Mistral Small 4 (Reasoning)	93.18%	83.11%	96.43%	100.00%	79.48%
72	Aion 2.0	93.10%	86.41%	92.89%	100.00%	86.66%
73	Gemini 2.5 Pro	93.05%	79.16%	100.00%	100.00%	88.44%
74	GPT-5 Nano	92.99%	90.90%	88.10%	99.96%	80.16%
75	Claude Opus 4.7	92.95%	78.84%	100.00%	100.00%	89.90%
76	Gemini 3 Flash (Preview, Reasoning)	92.65%	77.99%	100.00%	99.97%	89.93%
77	GPT-5.5 (Reasoning)	92.12%	76.35%	100.00%	100.00%	93.72%
78	Claude Haiku 4.5	91.93%	75.80%	100.00%	100.00%	83.36%
79	Ministral 8B	91.89%	75.68%	100.00%	100.00%	63.77%
80	Z.AI GLM 4.5	91.83%	76.65%	100.00%	98.85%	84.95%
81	Qwen 3.5 9B	91.81%	81.32%	96.43%	97.67%	84.05%
82	GPT-5.6 Sol (Reasoning)	91.72%	75.16%	100.00%	100.00%	95.15%
83	GPT-5.5 (Reasoning, Low)	91.37%	74.11%	100.00%	100.00%	92.51%
84	Claude Opus 4.5	91.03%	73.09%	100.00%	100.00%	89.60%
85	ByteDance Seed 1.6 Flash	90.83%	83.03%	89.88%	99.57%	70.92%
86	GPT-5.6 Terra	90.63%	71.90%	100.00%	100.00%	88.23%
87	GPT-5.5	90.28%	70.85%	100.00%	100.00%	89.37%
88	GPT-4.1 Nano	90.07%	70.27%	100.00%	99.95%	69.90%
89	Claude Sonnet 4	90.04%	70.11%	100.00%	100.00%	87.64%
90	Qwen 3.5 397B A17B	90.04%	70.17%	100.00%	99.93%	91.09%
91	Z.AI GLM 4.7 Flash	90.00%	77.13%	92.86%	100.00%	82.21%
92	Gemini 3.5 Flash (Reasoning, Minimal)	89.55%	68.64%	100.00%	100.00%	85.88%
93	ByteDance Seed 2.0 Lite	89.50%	70.73%	100.00%	97.78%	84.27%
94	DeepSeek V4 Pro	89.36%	68.09%	100.00%	99.98%	82.05%
95	Qwen 3.5 35B	89.24%	74.87%	92.86%	100.00%	87.01%
96	Qwen 3 32B	89.06%	67.79%	100.00%	99.39%	79.37%
97	Mistral Medium 3.1	88.93%	66.78%	100.00%	100.00%	76.08%
98	Claude Sonnet 5	88.74%	66.23%	100.00%	100.00%	87.34%
99	Qwen 3.5 Flash	88.70%	73.21%	92.90%	99.98%	85.66%
100	Nemotron 3 Nano	88.55%	85.80%	82.14%	97.72%	74.50%
101	Arcee AI: Trinity Mini	88.35%	65.05%	100.00%	100.00%	67.68%
102	Gemini 2.5 Flash Lite	88.08%	64.28%	100.00%	99.96%	79.91%
103	Gemini 3.1 Flash Lite	87.86%	63.57%	100.00%	100.00%	85.09%
104	Claude Opus 4	87.84%	63.52%	100.00%	100.00%	87.22%
105	Grok 4.3	87.79%	63.41%	100.00%	99.96%	78.00%
106	Claude Sonnet 4.5	87.78%	63.35%	100.00%	100.00%	87.54%
107	DeepSeek V4 Flash	87.73%	63.20%	100.00%	100.00%	82.02%
108	Gemini 3.1 Flash Lite (Reasoning)	87.63%	62.89%	100.00%	100.00%	85.91%
109	Gemini 3.1 Flash Lite (Preview)	87.29%	61.89%	100.00%	99.97%	85.41%
110	Hermes 3 405B	87.02%	61.06%	100.00%	100.00%	80.80%
111	GPT-5.4 Nano	86.86%	60.58%	100.00%	100.00%	72.16%
112	Mistral Small 3.2 24B	86.84%	62.38%	100.00%	98.14%	77.36%
113	Qwen 3.5 Plus (2026-02-15)	86.62%	60.20%	100.00%	99.65%	86.17%
114	GPT-5.4	86.32%	58.96%	100.00%	100.00%	84.31%
115	DeepSeek V3.2	86.25%	58.75%	100.00%	100.00%	82.22%
116	GPT-4.1 Mini	86.23%	58.70%	100.00%	100.00%	81.40%
117	GPT-4o, Aug. 6th (temp=1)	86.00%	58.17%	100.00%	99.82%	81.28%
118	DeepSeek V3 (2024-12-26)	85.69%	57.07%	100.00%	100.00%	82.62%
119	Gemini 3 Flash (Preview)	85.61%	56.88%	100.00%	99.95%	85.47%
120	DeepSeek V3.1	85.55%	56.76%	100.00%	99.89%	82.35%
121	Qwen 2.5 72B	85.54%	58.36%	100.00%	98.28%	73.17%
122	Mistral Large 2	85.45%	56.35%	100.00%	100.00%	81.50%
123	GPT-4o Mini (temp=1)	85.28%	55.83%	100.00%	100.00%	77.82%
124	Gemini 2.5 Flash	85.27%	56.72%	100.00%	99.08%	80.61%
125	Cydonia 24B V4.1	85.22%	66.97%	91.07%	97.62%	72.68%
126	Mistral Large 3	85.18%	55.54%	100.00%	100.00%	84.29%
127	GPT-4o, Aug. 6th (temp=0)	83.33%	50.00%	100.00%	100.00%	82.18%
128	Ministral 3 14B	83.19%	49.57%	100.00%	100.00%	70.45%
129	DeepSeek-V2 Chat	82.48%	56.98%	90.48%	100.00%	84.09%
130	GPT-4o Mini (temp=0)	82.23%	46.70%	100.00%	100.00%	76.86%
131	GPT-5.4 Mini	81.85%	45.55%	100.00%	100.00%	80.45%
132	Mistral Small 4	81.83%	45.50%	100.00%	100.00%	75.23%
133	Gemma 3 27B	80.53%	41.59%	100.00%	100.00%	75.70%
134	Writer: Palmyra X5	80.49%	48.62%	92.86%	100.00%	78.11%
135	Grok 4.20	80.48%	41.45%	100.00%	99.99%	81.21%
136	Llama 3.1 70B	80.47%	61.33%	92.86%	87.24%	77.41%
137	WizardLM 2 8x22b	79.28%	57.02%	85.73%	95.09%	71.45%
138	Gemma 3 12B	77.13%	31.38%	100.00%	100.00%	76.07%
139	DeepSeek V3 (2025-03-24)	76.88%	47.98%	88.10%	94.58%	79.93%
140	Qwen3 235B A22B Instruct 2507	76.34%	43.34%	85.71%	99.97%	78.07%
141	Gemma 3 4B	75.94%	27.83%	100.00%	100.00%	66.33%
142	Ministral 3 3B	75.10%	36.00%	89.29%	100.00%	65.02%
143	Ministral 3B	73.79%	35.66%	85.71%	100.00%	59.25%
144	Hermes 3 70B	70.71%	53.99%	60.79%	97.34%	69.74%
145	Mistral NeMO	69.32%	31.81%	76.19%	99.97%	63.80%
146	Cohere Command R+ (Aug. 2024)	66.30%	61.00%	37.90%	100.00%	67.04%
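
The page does not say how the Hallucination category score is derived from the three subcategory scores. The rows spot-checked here are consistent, to within rounding, with an unweighted mean of False Positives, Content Invention, and Output Corruption. The sketch below checks that assumption against a handful of rows copied from the table. The function names and the tolerance are illustrative choices, not part of the benchmark.

```python
# Rows copied from the leaderboard above:
# (model, false_positives, content_invention, output_corruption, listed_hallucination)
ROWS = [
    ("o4 Mini High", 98.59, 100.00, 100.00, 99.53),
    ("Z.AI GLM 5 Turbo", 97.97, 100.00, 99.99, 99.32),
    ("Inception Mercury 2", 98.13, 81.55, 99.99, 93.22),
    ("Hermes 3 70B", 53.99, 60.79, 97.34, 70.71),
    ("Cohere Command R+ (Aug. 2024)", 61.00, 37.90, 100.00, 66.30),
]


def category_score(false_positives, content_invention, output_corruption):
    """Unweighted mean of the three subcategory scores.

    This formula is an assumption inferred from the table, not a documented rule.
    """
    return (false_positives + content_invention + output_corruption) / 3


def check_rows(rows, tolerance=0.011):
    """Return (model, computed_score, matches_listed) for each row.

    The tolerance allows for the published values being rounded to two decimals.
    """
    results = []
    for model, fp, ci, oc, listed in rows:
        computed = category_score(fp, ci, oc)
        results.append((model, computed, abs(computed - listed) <= tolerance))
    return results


if __name__ == "__main__":
    for model, computed, ok in check_rows(ROWS):
        status = "matches" if ok else "differs from"
        print(f"{model}: computed {computed:.2f} {status} the listed score")
```

If the assumption holds, it would explain why a model with a perfect Content Invention and Output Corruption score can still rank low: with equal weighting, the False Positives column accounts for nearly all of the spread in the category score.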