Output Corruption

Subcategory of Hallucination. 155 models scored.

Model Leaderboard

All 155 models, ranked from highest to lowest Output Corruption subcategory score.

#	Model	Output Corruption	Hallucination	Overall
1	Claude Opus 4.6 (Reasoning)	100.00%	98.13%	95.02%
2	Gemini 3.1 Pro (Preview)	100.00%	89.06%	94.37%
3	Z.AI GLM 5.1	100.00%	97.67%	94.37%
4	Gemini 3.5 Flash (Reasoning)	100.00%	91.22%	94.08%
5	Claude Sonnet 4.6 (Reasoning)	100.00%	93.96%	93.66%
6	Grok 4.3 (Reasoning)	100.00%	97.40%	93.60%
7	GPT-5.4 (Reasoning)	100.00%	90.43%	93.24%
8	GPT-5.5 (Reasoning)	100.00%	84.24%	92.98%
9	GPT-5 Mini	100.00%	97.71%	92.62%
10	GPT-5.5 (Reasoning, Low)	100.00%	84.41%	92.59%
11	GPT-5.1	100.00%	98.44%	92.54%
12	Claude Opus 4.6	100.00%	93.60%	92.35%
13	MoonshotAI: Kimi K2.6	100.00%	88.85%	92.31%
14	Claude Opus 4.8 (Reasoning)	100.00%	94.83%	92.22%
15	Claude Opus 4.8 (Reasoning, Low)	100.00%	94.65%	92.14%
16	GPT-5	100.00%	91.84%	91.93%
17	Gemma 4 31B (Reasoning)	100.00%	94.41%	91.71%
18	Qwen 3.5 122B	100.00%	86.58%	91.53%
19	Grok 4.20 (Beta, Reasoning)	100.00%	96.28%	91.49%
20	GPT-5.4 (Reasoning, Low)	100.00%	92.29%	91.41%
21	Grok 4.20 (Reasoning)	100.00%	95.66%	91.39%
22	MoonshotAI: Kimi K2.5	100.00%	88.02%	91.04%
23	MiniMax M3	100.00%	95.66%	90.88%
24	Qwen 3.5 27B	100.00%	86.58%	90.85%
25	ByteDance Seed 1.6	100.00%	95.14%	90.70%
26	Qwen 3.6 Flash	100.00%	92.26%	90.65%
27	o4 Mini High	100.00%	99.06%	90.29%
28	GPT-5.2	100.00%	95.09%	90.26%
29	Claude Opus 4.7	100.00%	85.89%	89.93%
30	Claude Opus 4.5	100.00%	82.06%	89.69%
31	Aion 2.0	100.00%	93.57%	89.21%
32	GPT-5.5	100.00%	80.57%	89.09%
33	DeepSeek V4 Flash (Reasoning)	100.00%	95.02%	89.01%
34	Gemini 3 Pro (Preview)	100.00%	88.23%	88.79%
35	Claude Sonnet 4	100.00%	80.08%	88.72%
36	Gemini 2.5 Pro	100.00%	86.11%	88.53%
37	o4 Mini	100.00%	98.75%	88.35%
38	Grok 4	100.00%	89.45%	88.12%
39	Claude Sonnet 4.5	100.00%	75.57%	88.03%
40	Qwen 3.5 35B	100.00%	80.87%	88.00%
41	Claude Opus 4	100.00%	75.68%	87.69%
42	Xiaomi MIMO v2.5 Pro	100.00%	95.17%	87.36%
43	Stealth: Hunter Alpha	100.00%	90.78%	87.34%
44	Gemma 4 31B	100.00%	90.17%	86.91%
45	Gemini 3.5 Flash (Reasoning, Minimal)	100.00%	79.09%	86.47%
46	Gemini 3.1 Flash Lite (Reasoning)	100.00%	75.26%	86.41%
47	Grok 4 Fast	100.00%	91.09%	86.15%
48	Gemma 4 26B	100.00%	88.45%	85.84%
49	Gemini 3.1 Flash Lite	100.00%	75.72%	85.75%
50	Mistral Large 3	100.00%	78.17%	85.43%
51	GPT-4o, May 13th (temp=0)	100.00%	69.76%	85.36%
52	Claude Haiku 4.5	100.00%	83.86%	85.14%
53	DeepSeek-V2 Chat	100.00%	69.48%	84.83%
54	Z.AI GLM 4.7 Flash	100.00%	89.86%	84.82%
55	Nemotron 3 Super	100.00%	99.69%	84.56%
56	GPT-5.4	100.00%	73.78%	84.32%
57	Claude 3.5 Sonnet	100.00%	76.31%	84.24%
58	Grok 4.20 (Beta)	100.00%	78.28%	83.85%
59	GPT-4o, May 13th (temp=1)	100.00%	73.40%	83.80%
60	DeepSeek V3 (2024-12-26)	100.00%	73.11%	83.68%
61	Claude 3.7 Sonnet	100.00%	75.18%	83.39%
62	GPT-4.1 Mini	100.00%	81.14%	83.20%
63	Hermes 3 405B	100.00%	79.70%	82.86%
64	GPT-4o, Aug. 6th (temp=0)	100.00%	73.35%	82.45%
65	GPT-5.4 Mini	100.00%	78.40%	82.43%
66	Mistral Large 2	100.00%	77.87%	82.41%
67	Mistral Small 4 (Reasoning)	100.00%	92.98%	82.39%
68	DeepSeek V3.2	100.00%	72.50%	82.25%
69	DeepSeek V4 Flash	100.00%	75.47%	82.02%
70	GPT-5.4 Nano (Reasoning)	100.00%	97.95%	81.36%
71	Mistral Large	100.00%	77.50%	80.15%
72	Writer: Palmyra X5	100.00%	72.04%	79.57%
73	GPT-5.4 Nano (Reasoning, Low)	100.00%	99.03%	79.48%
74	GPT-4o Mini (temp=1)	100.00%	76.78%	79.08%
75	Gemma 3 12B	100.00%	69.15%	78.41%
76	GPT-4o Mini (temp=0)	100.00%	73.56%	78.29%
77	Gemma 3 27B	100.00%	68.74%	77.85%
78	Mistral Medium 3.1	100.00%	82.09%	77.83%
79	Mistral Small 4	100.00%	73.41%	76.46%
80	GPT-5.4 Nano	100.00%	89.29%	74.40%
81	Arcee AI: Trinity Large (Preview)	100.00%	76.47%	73.33%
82	Mistral Small Creative	100.00%	74.46%	73.27%
83	Ministral 3 14B	100.00%	79.99%	72.54%
84	Ministral 3 8B	100.00%	92.52%	71.76%
85	Claude 3 Haiku	100.00%	60.81%	71.19%
86	Arcee AI: Trinity Mini	100.00%	91.12%	70.90%
87	Cohere Command R+ (Aug. 2024)	100.00%	64.40%	69.03%
88	Gemma 3 4B	100.00%	67.60%	68.57%
89	Ministral 3 3B	100.00%	70.45%	67.22%
90	Ministral 8B	100.00%	89.19%	64.87%
91	Ministral 3B	100.00%	70.75%	61.29%
92	LFM2 24B	100.00%	90.53%	58.77%
93	GPT-5.4 Mini (Reasoning)	100.00%	96.22%	90.65%
94	Qwen3.6 Max Preview	100.00%	91.72%	94.54%
95	Z.AI GLM 5	100.00%	97.74%	91.23%
96	Z.AI GLM 5 Turbo	99.99%	98.64%	94.27%
97	Grok 4.20	99.99%	73.01%	81.70%
98	Grok 4.1 Fast	99.99%	99.02%	89.55%
99	Inception Mercury 2	99.99%	92.60%	83.85%
100	Claude Opus 4.7 (Reasoning)	99.98%	97.39%	93.23%
101	Qwen 3.5 Flash	99.98%	80.63%	86.38%
102	DeepSeek V4 Pro	99.98%	78.72%	82.63%
103	GPT-5.4 Mini (Reasoning, Low)	99.98%	98.43%	85.75%
104	GPT-4.1	99.97%	95.24%	88.68%
105	Gemini 3 Flash (Preview, Reasoning)	99.97%	85.32%	90.50%
106	Qwen3 235B A22B Instruct 2507	99.97%	69.82%	80.10%
107	Claude Sonnet 4.6	99.97%	89.98%	91.15%
108	Mistral NeMO	99.97%	62.63%	65.04%
109	Gemini 3.1 Flash Lite (Preview)	99.97%	74.58%	85.87%
110	Gemini 2.5 Flash Lite (Reasoning)	99.96%	95.59%	85.75%
111	Grok 4.3	99.96%	77.16%	78.66%
112	GPT-5 Nano	99.96%	93.52%	82.60%
113	Z.AI GLM 4.7	99.96%	88.47%	88.69%
114	Gemini 2.5 Flash Lite	99.96%	76.17%	81.08%
115	Gemini 3 Flash (Preview)	99.95%	71.24%	85.35%
116	Qwen3.7 Max	99.95%	94.32%	95.75%
117	Gemma 4 26B (Reasoning)	99.95%	97.59%	91.49%
118	Qwen 3.6 27B	99.95%	95.73%	89.72%
119	GPT-4.1 Nano	99.95%	87.73%	71.94%
120	Gemini 2.5 Flash (Reasoning)	99.94%	95.60%	86.51%
121	Stealth: Healer Alpha	99.93%	94.67%	85.93%
122	Stealth: Aurora Alpha	99.93%	99.93%	83.79%
123	Qwen 3.5 397B A17B	99.93%	82.10%	91.73%
124	Xiaomi MIMO v2.5	99.93%	95.25%	85.05%
125	GPT-OSS 120B	99.92%	95.29%	86.44%
126	ByteDance Seed 2.0 Mini	99.90%	91.29%	86.91%
127	DeepSeek V3.1	99.89%	72.80%	82.39%
128	GPT-4o, Aug. 6th (temp=1)	99.82%	79.53%	82.62%
129	Llama 3.1 Nemotron 70B	99.75%	74.99%	74.70%
130	MiniMax M2.7	99.72%	96.47%	89.10%
131	Qwen 3.5 Plus (2026-02-15)	99.65%	73.35%	85.96%
132	ByteDance Seed 1.6 Flash	99.57%	94.52%	73.27%
133	MiniMax M2.5	99.50%	92.94%	88.71%
134	Qwen 3 32B	99.39%	89.56%	82.21%
135	Gemini 2.5 Flash	99.08%	71.70%	80.60%
136	Qwen 3.5 Plus (2026-04-20)	98.89%	97.13%	91.51%
137	DeepSeek V4 Pro (Reasoning)	98.89%	90.88%	90.10%
138	Z.AI GLM 4.6	98.89%	90.08%	89.11%
139	Qwen 3.6 35B	98.89%	90.19%	89.05%
140	Z.AI GLM 4.5 Air	98.87%	92.74%	83.12%
141	Z.AI GLM 4.5	98.85%	87.05%	86.27%
142	Qwen 2.5 72B	98.28%	79.56%	75.46%
143	Mistral Small 3.2 24B	98.14%	75.64%	78.58%
144	Skyfall 36B V2	97.93%	66.29%	65.76%
145	ByteDance Seed 2.0 Lite	97.78%	79.75%	84.80%
146	Nemotron 3 Nano	97.72%	89.16%	77.73%
147	Qwen 3.5 9B	97.67%	85.58%	86.05%
148	Cydonia 24B V4.1	97.62%	81.53%	75.09%
149	Hermes 3 70B	97.34%	67.03%	72.57%
150	WizardLM 2 8x22b	95.09%	70.19%	71.06%
151	DeepSeek V3 (2025-03-24)	94.58%	67.07%	81.99%
152	Inception Mercury	94.25%	95.08%	79.50%
153	Rocinante 12B	89.87%	52.07%	54.54%
154	Llama 3.1 70B	87.24%	69.78%	78.40%
155	Llama 3.1 8B	34.01%	43.27%	63.35%
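
The ranking above can be reproduced from the tab-separated rows. The following is a minimal sketch, not the leaderboard's own tooling: the names `SAMPLE`, `parse_percent`, `load_rows`, and `rank_by` are hypothetical, and the three sample rows are copied from the table in shuffled order. It assumes ranking means sorting by the Output Corruption column in descending order. The table shows scores rounded to two decimals and does not state a tie-breaking rule, so this sketch simply keeps tied rows in their input order.

```python
import csv
import io

# Three rows copied from the table above (tab-separated), deliberately shuffled.
SAMPLE = (
    "#\tModel\tOutput Corruption\tHallucination\tOverall\n"
    "153\tRocinante 12B\t89.87%\t52.07%\t54.54%\n"
    "155\tLlama 3.1 8B\t34.01%\t43.27%\t63.35%\n"
    "154\tLlama 3.1 70B\t87.24%\t69.78%\t78.40%\n"
)


def parse_percent(text):
    """Convert a string such as '89.87%' to the float 89.87."""
    return float(text.strip().rstrip("%"))


def load_rows(tsv_text):
    """Parse the tab-separated table into a list of dicts with numeric scores."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {
            "model": row["Model"],
            "output_corruption": parse_percent(row["Output Corruption"]),
            "hallucination": parse_percent(row["Hallucination"]),
            "overall": parse_percent(row["Overall"]),
        }
        for row in reader
    ]


def rank_by(rows, key="output_corruption"):
    """Sort rows by the given score, highest first.

    sorted() is stable, so rows with equal scores keep their input order;
    the source does not say how the leaderboard breaks ties.
    """
    return sorted(rows, key=lambda r: r[key], reverse=True)


ranked = rank_by(load_rows(SAMPLE))
for position, row in enumerate(ranked, start=1):
    print(position, row["model"], f"{row['output_corruption']:.2f}%")
```

Passing `key="overall"` or `key="hallucination"` to `rank_by` re-sorts the same rows by one of the other two columns.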