Deduction

Subcategory of Reasoning. 155 models scored.

Model Leaderboard

All models ranked by their Deduction subcategory score, highest first. Models with equal Deduction scores are listed in descending order of Overall score.

#	Model	Deduction	Reasoning	Overall
1	Gemini 3.5 Flash (Reasoning)	99.50%	98.45%	94.08%
2	Gemini 3 Flash (Preview, Reasoning)	99.50%	98.05%	90.50%
3	Gemma 4 26B (Reasoning)	99.00%	95.21%	91.49%
4	Gemma 4 31B (Reasoning)	98.50%	97.19%	91.71%
5	Gemma 4 31B	98.00%	96.52%	86.91%
6	Claude Opus 4.7	97.50%	95.72%	89.93%
7	Z.AI GLM 4.6	97.50%	95.12%	89.11%
8	Gemini 2.5 Pro	97.00%	96.91%	88.53%
9	Gemma 4 26B	97.00%	88.02%	85.84%
10	Z.AI GLM 5.1	96.50%	96.83%	94.37%
11	Gemini 2.5 Flash Lite (Reasoning)	96.50%	93.86%	85.75%
12	ByteDance Seed 2.0 Lite	96.50%	95.50%	84.80%
13	Qwen3.6 Max Preview	96.00%	96.51%	94.54%
14	Gemini 3.1 Pro (Preview)	96.00%	96.01%	94.37%
15	Z.AI GLM 5	96.00%	95.89%	91.23%
16	MoonshotAI: Kimi K2.5	96.00%	95.41%	91.04%
17	GPT-5.4 Mini	96.00%	88.04%	82.43%
18	Gemini 2.5 Flash (Reasoning)	95.89%	93.81%	86.51%
19	GPT-5 Mini	95.50%	94.36%	92.62%
20	GPT-5.2	95.50%	94.54%	90.26%
21	MiniMax M2.5	95.50%	92.42%	88.71%
22	o4 Mini	95.50%	94.45%	88.35%
23	Qwen3.7 Max	95.00%	95.84%	95.75%
24	Z.AI GLM 5 Turbo	95.00%	95.67%	94.27%
25	Grok 4.3 (Reasoning)	95.00%	95.46%	93.60%
26	GPT-5.4 (Reasoning)	95.00%	94.78%	93.24%
27	Claude Opus 4.7 (Reasoning)	95.00%	95.44%	93.23%
28	GPT-5.5 (Reasoning)	95.00%	94.89%	92.98%
29	GPT-5.5 (Reasoning, Low)	95.00%	95.01%	92.59%
30	GPT-5.1	95.00%	95.14%	92.54%
31	MoonshotAI: Kimi K2.6	95.00%	95.22%	92.31%
32	GPT-5	95.00%	95.67%	91.93%
33	Qwen 3.5 397B A17B	95.00%	95.06%	91.73%
34	Qwen 3.5 122B	95.00%	94.93%	91.53%
35	GPT-5.4 (Reasoning, Low)	95.00%	94.34%	91.41%
36	Qwen 3.5 27B	95.00%	92.73%	90.85%
37	Qwen 3.6 Flash	95.00%	94.20%	90.65%
38	GPT-5.4 Mini (Reasoning)	95.00%	94.56%	90.65%
39	o4 Mini High	95.00%	95.02%	90.29%
40	GPT-5.5	95.00%	95.16%	89.09%
41	Qwen 3.6 35B	95.00%	94.08%	89.05%
42	Gemini 3 Pro (Preview)	95.00%	95.24%	88.79%
43	Z.AI GLM 4.7	95.00%	94.99%	88.69%
44	Grok 4	95.00%	96.01%	88.12%
45	Qwen 3.5 35B	95.00%	94.88%	88.00%
46	GPT-OSS 120B	95.00%	92.42%	86.44%
47	Qwen 3.5 Flash	95.00%	94.66%	86.38%
48	Grok 4 Fast	95.00%	94.89%	86.15%
49	Qwen 3.5 9B	95.00%	92.93%	86.05%
50	Gemini 3.1 Flash Lite	95.00%	92.10%	85.75%
51	GPT-5.4 Mini (Reasoning, Low)	95.00%	92.28%	85.75%
52	Gemini 3 Flash (Preview)	95.00%	94.79%	85.35%
53	Z.AI GLM 4.7 Flash	95.00%	89.50%	84.82%
54	Nemotron 3 Super	95.00%	93.11%	84.56%
55	GPT-5.4	95.00%	93.92%	84.32%
56	Inception Mercury 2	95.00%	92.03%	83.85%
57	Stealth: Aurora Alpha	95.00%	90.11%	83.79%
58	GPT-5 Nano	95.00%	89.61%	82.60%
59	Gemini 2.5 Flash Lite	95.00%	85.80%	81.08%
60	Gemini 2.5 Flash	95.00%	92.60%	80.60%
61	Inception Mercury	95.00%	85.96%	79.50%
62	GPT-5.4 Nano (Reasoning, Low)	95.00%	78.93%	79.48%
63	Gemma 3 12B	95.00%	79.42%	78.41%
64	Gemma 3 27B	95.00%	86.74%	77.85%
65	Nemotron 3 Nano	95.00%	89.91%	77.73%
66	Gemma 3 4B	95.00%	73.64%	68.57%
67	MiniMax M2.7	94.94%	93.28%	89.10%
68	Qwen 3.5 Plus (2026-04-20)	94.44%	94.99%	91.51%
69	DeepSeek V4 Pro (Reasoning)	94.44%	94.61%	90.10%
70	DeepSeek V4 Flash (Reasoning)	94.44%	95.08%	89.01%
71	Claude Sonnet 4	94.44%	94.48%	88.72%
72	Gemini 3.1 Flash Lite (Reasoning)	94.44%	91.72%	86.41%
73	Gemini 3.1 Flash Lite (Preview)	94.44%	92.15%	85.87%
74	GPT-4o Mini (temp=0)	94.44%	81.26%	78.29%
75	GPT-4.1 Nano	94.44%	70.24%	71.94%
76	Qwen 3.6 27B	93.94%	93.08%	89.72%
77	Qwen 3.5 Plus (2026-02-15)	93.94%	93.45%	85.96%
78	GPT-5.4 Nano (Reasoning)	93.89%	88.48%	81.36%
79	GPT-4o Mini (temp=1)	93.44%	80.28%	79.08%
80	GPT-5.4 Nano	93.06%	75.66%	74.40%
81	Mistral Small Creative	92.94%	87.99%	73.27%
82	Gemini 3.5 Flash (Reasoning, Minimal)	92.83%	94.12%	86.47%
83	Aion 2.0	92.78%	94.13%	89.21%
84	Z.AI GLM 4.5	92.22%	91.03%	86.27%
85	Claude Opus 4	91.94%	92.59%	87.69%
86	ByteDance Seed 1.6 Flash	91.67%	86.52%	73.27%
87	Grok 4.1 Fast	91.11%	93.58%	89.55%
88	ByteDance Seed 2.0 Mini	91.11%	92.40%	86.91%
89	Xiaomi MIMO v2.5	90.56%	92.43%	85.05%
90	Mistral Small 4 (Reasoning)	90.44%	87.78%	82.39%
91	DeepSeek V3 (2024-12-26)	90.39%	88.71%	83.68%
92	DeepSeek V3.2	90.06%	89.46%	82.25%
93	ByteDance Seed 1.6	90.00%	91.49%	90.70%
94	Stealth: Healer Alpha	90.00%	91.67%	85.93%
95	GPT-4.1	89.94%	88.46%	88.68%
96	Stealth: Hunter Alpha	89.94%	91.67%	87.34%
97	DeepSeek V3 (2025-03-24)	89.61%	88.45%	81.99%
98	Mistral Small 4	89.50%	78.72%	76.46%
99	Claude Opus 4.6 (Reasoning)	89.44%	93.77%	95.02%
100	Claude Sonnet 4.6 (Reasoning)	89.44%	92.76%	93.66%
101	Claude Opus 4.8 (Reasoning)	89.44%	93.53%	92.22%
102	Claude Opus 4.8 (Reasoning, Low)	89.44%	93.37%	92.14%
103	Claude Opus 4.5	89.44%	93.93%	89.69%
104	Claude Sonnet 4.5	89.44%	92.50%	88.03%
105	Xiaomi MIMO v2.5 Pro	89.44%	92.07%	87.36%
106	Mistral Large 3	89.44%	88.95%	85.43%
107	GPT-4o, May 13th (temp=0)	89.44%	88.58%	85.36%
108	Claude Haiku 4.5	89.44%	87.76%	85.14%
109	DeepSeek-V2 Chat	89.44%	88.70%	84.83%
110	Claude 3.5 Sonnet	89.44%	90.30%	84.24%
111	GPT-4o, May 13th (temp=1)	89.44%	85.98%	83.80%
112	Claude 3.7 Sonnet	89.44%	89.94%	83.39%
113	GPT-4.1 Mini	89.44%	85.83%	83.20%
114	GPT-4o, Aug. 6th (temp=1)	89.44%	86.91%	82.62%
115	GPT-4o, Aug. 6th (temp=0)	89.44%	87.59%	82.45%
116	Writer: Palmyra X5	89.44%	86.57%	79.57%
117	Mistral Medium 3.1	89.44%	89.32%	77.83%
118	Qwen 2.5 72B	89.44%	83.43%	75.46%
119	Ministral 3 14B	89.44%	83.24%	72.54%
120	Hermes 3 70B	89.39%	79.08%	72.57%
121	Claude Opus 4.6	88.89%	93.33%	92.35%
122	Qwen 3 32B	88.89%	86.35%	82.21%
123	Claude 3 Haiku	88.89%	77.94%	71.19%
124	Z.AI GLM 4.5 Air	88.83%	87.91%	83.12%
125	Qwen3 235B A22B Instruct 2507	88.39%	85.82%	80.10%
126	Hermes 3 405B	88.28%	85.58%	82.86%
127	MiniMax M3	87.78%	91.84%	90.88%
128	Mistral Large 2	87.78%	88.20%	82.41%
129	Mistral NeMO	87.33%	57.59%	65.04%
130	Llama 3.1 Nemotron 70B	87.22%	82.19%	74.70%
131	Llama 3.1 8B	86.78%	69.12%	63.35%
132	Grok 4.20 (Beta)	86.72%	87.05%	83.85%
133	Grok 4.3	85.89%	84.62%	78.66%
134	Llama 3.1 70B	85.00%	79.31%	78.40%
135	DeepSeek V3.1	84.94%	83.95%	82.39%
136	DeepSeek V4 Pro	84.17%	87.07%	82.63%
137	Claude Sonnet 4.6	83.89%	88.48%	91.15%
138	Mistral Small 3.2 24B	83.89%	81.71%	78.58%
139	Arcee AI: Trinity Large (Preview)	83.89%	77.24%	73.33%
140	Cydonia 24B V4.1	83.61%	76.47%	75.09%
141	Arcee AI: Trinity Mini	82.28%	76.94%	70.90%
142	Grok 4.20	81.50%	84.81%	81.70%
143	LFM2 24B	78.89%	54.88%	58.77%
144	Ministral 3 3B	77.78%	71.88%	67.22%
145	Ministral 8B	76.00%	73.78%	64.87%
146	Rocinante 12B	74.89%	54.31%	54.54%
147	Ministral 3B	72.83%	69.70%	61.29%
148	Skyfall 36B V2	70.94%	62.23%	65.76%
149	WizardLM 2 8x22b	69.94%	67.36%	71.06%
150	DeepSeek V4 Flash	68.94%	79.16%	82.02%
151	Grok 4.20 (Beta, Reasoning)	68.33%	82.64%	91.49%
152	Ministral 3 8B	68.33%	71.64%	71.76%
153	Mistral Large	66.06%	76.31%	80.15%
154	Cohere Command R+ (Aug. 2024)	63.83%	65.10%	69.03%
155	Grok 4.20 (Reasoning)	61.11%	79.11%	91.39%
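
The ordering of the table above can be reproduced with a short script. This is a minimal sketch, not the leaderboard's own tooling: it assumes the rows are available as tab-separated text with the header shown above, and it assumes the tie-break is descending Overall score, which is what the listed rows suggest rather than a documented rule. The names `parse_pct`, `load_rows`, and `rank_by_deduction` are hypothetical helpers introduced here for illustration.

```python
import csv
import io


def parse_pct(text):
    """Convert a string such as '99.50%' to the float 99.5."""
    return float(text.strip().rstrip("%"))


def load_rows(tsv_text):
    """Parse tab-separated leaderboard text into a list of dicts.

    Assumes the header row '#', 'Model', 'Deduction', 'Reasoning', 'Overall'.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for rec in reader:
        rows.append(
            {
                "rank": int(rec["#"]),
                "model": rec["Model"],
                "deduction": parse_pct(rec["Deduction"]),
                "reasoning": parse_pct(rec["Reasoning"]),
                "overall": parse_pct(rec["Overall"]),
            }
        )
    return rows


def rank_by_deduction(rows):
    """Sort by Deduction descending, then Overall descending (assumed tie-break)."""
    return sorted(rows, key=lambda r: (-r["deduction"], -r["overall"]))


# Four rows copied from the table above, deliberately out of order.
SAMPLE = "\n".join(
    [
        "#\tModel\tDeduction\tReasoning\tOverall",
        "10\tZ.AI GLM 5.1\t96.50%\t96.83%\t94.37%",
        "2\tGemini 3 Flash (Preview, Reasoning)\t99.50%\t98.05%\t90.50%",
        "11\tGemini 2.5 Flash Lite (Reasoning)\t96.50%\t93.86%\t85.75%",
        "1\tGemini 3.5 Flash (Reasoning)\t99.50%\t98.45%\t94.08%",
    ]
)

if __name__ == "__main__":
    for row in rank_by_deduction(load_rows(SAMPLE)):
        print(row["rank"], row["model"], row["deduction"], row["overall"])
```

Because Python's sort is stable, rows that tie on both Deduction and Overall keep their input order, so the sketch does not impose any further tie-break of its own.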