Reasoning

20 scenarios across 2 subcategories. 155 models scored.

Subcategories

Subcategory  Avg Score  Best Model                    Best Score
Deduction    90.63%     Gemini 3.5 Flash (Reasoning)  99.50%
Attention    85.94%     Claude Opus 4.5               98.41%

Model Leaderboard

All models ranked by their Reasoning category score.

# Model Reasoning Deduction Attention Overall
1 Gemini 3.5 Flash (Reasoning) 98.45% 99.50% 97.41% 94.08%
2 Gemini 3 Flash (Preview, Reasoning) 98.05% 99.50% 96.60% 90.50%
3 Gemma 4 31B (Reasoning) 97.19% 98.50% 95.88% 91.71%
4 Gemini 2.5 Pro 96.91% 97.00% 96.82% 88.53%
5 Z.AI GLM 5.1 96.83% 96.50% 97.16% 94.37%
6 Gemma 4 31B 96.52% 98.00% 95.04% 86.91%
7 Qwen3.6 Max Preview 96.51% 96.00% 97.02% 94.54%
8 Gemini 3.1 Pro (Preview) 96.01% 96.00% 96.03% 94.37%
9 Grok 4 96.01% 95.00% 97.02% 88.12%
10 Z.AI GLM 5 95.89% 96.00% 95.78% 91.23%
11 Qwen3.7 Max 95.84% 95.00% 96.68% 95.75%
12 Claude Opus 4.7 95.72% 97.50% 93.95% 89.93%
13 Z.AI GLM 5 Turbo 95.67% 95.00% 96.34% 94.27%
14 GPT-5 95.67% 95.00% 96.33% 91.93%
15 ByteDance Seed 2.0 Lite 95.50% 96.50% 94.49% 84.80%
16 Grok 4.3 (Reasoning) 95.46% 95.00% 95.92% 93.60%
17 Claude Opus 4.7 (Reasoning) 95.44% 95.00% 95.87% 93.23%
18 MoonshotAI: Kimi K2.5 95.41% 96.00% 94.83% 91.04%
19 Gemini 3 Pro (Preview) 95.24% 95.00% 95.47% 88.79%
20 MoonshotAI: Kimi K2.6 95.22% 95.00% 95.44% 92.31%
21 Gemma 4 26B (Reasoning) 95.21% 99.00% 91.42% 91.49%
22 GPT-5.5 95.16% 95.00% 95.33% 89.09%
23 GPT-5.1 95.14% 95.00% 95.29% 92.54%
24 Z.AI GLM 4.6 95.12% 97.50% 92.74% 89.11%
25 DeepSeek V4 Flash (Reasoning) 95.08% 94.44% 95.71% 89.01%
26 Qwen 3.5 397B A17B 95.06% 95.00% 95.13% 91.73%
27 o4 Mini High 95.02% 95.00% 95.05% 90.29%
28 GPT-5.5 (Reasoning, Low) 95.01% 95.00% 95.03% 92.59%
29 Z.AI GLM 4.7 94.99% 95.00% 94.98% 88.69%
30 Qwen 3.5 Plus (2026-04-20) 94.99% 94.44% 95.53% 91.51%
31 Qwen 3.5 122B 94.93% 95.00% 94.87% 91.53%
32 Grok 4 Fast 94.89% 95.00% 94.78% 86.15%
33 GPT-5.5 (Reasoning) 94.89% 95.00% 94.78% 92.98%
34 Qwen 3.5 35B 94.88% 95.00% 94.75% 88.00%
35 Gemini 3 Flash (Preview) 94.79% 95.00% 94.58% 85.35%
36 GPT-5.4 (Reasoning) 94.78% 95.00% 94.55% 93.24%
37 Qwen 3.5 Flash 94.66% 95.00% 94.31% 86.38%
38 DeepSeek V4 Pro (Reasoning) 94.61% 94.44% 94.77% 90.10%
39 GPT-5.4 Mini (Reasoning) 94.56% 95.00% 94.13% 90.65%
40 GPT-5.2 94.54% 95.50% 93.58% 90.26%
41 Claude Sonnet 4 94.48% 94.44% 94.52% 88.72%
42 o4 Mini 94.45% 95.50% 93.40% 88.35%
43 GPT-5 Mini 94.36% 95.50% 93.22% 92.62%
44 GPT-5.4 (Reasoning, Low) 94.34% 95.00% 93.67% 91.41%
45 Qwen 3.6 Flash 94.20% 95.00% 93.41% 90.65%
46 Aion 2.0 94.13% 92.78% 95.49% 89.21%
47 Gemini 3.5 Flash (Reasoning, Minimal) 94.12% 92.83% 95.40% 86.47%
48 Qwen 3.6 35B 94.08% 95.00% 93.16% 89.05%
49 Claude Opus 4.5 93.93% 89.44% 98.41% 89.69%
50 GPT-5.4 93.92% 95.00% 92.85% 84.32%
51 Gemini 2.5 Flash Lite (Reasoning) 93.86% 96.50% 91.22% 85.75%
52 Gemini 2.5 Flash (Reasoning) 93.81% 95.89% 91.73% 86.51%
53 Claude Opus 4.6 (Reasoning) 93.77% 89.44% 98.10% 95.02%
54 Grok 4.1 Fast 93.58% 91.11% 96.05% 89.55%
55 Claude Opus 4.8 (Reasoning) 93.53% 89.44% 97.61% 92.22%
56 Qwen 3.5 Plus (2026-02-15) 93.45% 93.94% 92.96% 85.96%
57 Claude Opus 4.8 (Reasoning, Low) 93.37% 89.44% 97.30% 92.14%
58 Claude Opus 4.6 93.33% 88.89% 97.78% 92.35%
59 MiniMax M2.7 93.28% 94.94% 91.61% 89.10%
60 Nemotron 3 Super 93.11% 95.00% 91.22% 84.56%
61 Qwen 3.6 27B 93.08% 93.94% 92.21% 89.72%
62 Qwen 3.5 9B 92.93% 95.00% 90.87% 86.05%
63 Claude Sonnet 4.6 (Reasoning) 92.76% 89.44% 96.08% 93.66%
64 Qwen 3.5 27B 92.73% 95.00% 90.46% 90.85%
65 Gemini 2.5 Flash 92.60% 95.00% 90.21% 80.60%
66 Claude Opus 4 92.59% 91.94% 93.24% 87.69%
67 Claude Sonnet 4.5 92.50% 89.44% 95.56% 88.03%
68 Xiaomi MIMO v2.5 92.43% 90.56% 94.31% 85.05%
69 GPT-OSS 120B 92.42% 95.00% 89.84% 86.44%
70 MiniMax M2.5 92.42% 95.50% 89.33% 88.71%
71 ByteDance Seed 2.0 Mini 92.40% 91.11% 93.68% 86.91%
72 GPT-5.4 Mini (Reasoning, Low) 92.28% 95.00% 89.57% 85.75%
73 Gemini 3.1 Flash Lite (Preview) 92.15% 94.44% 89.85% 85.87%
74 Gemini 3.1 Flash Lite 92.10% 95.00% 89.19% 85.75%
75 Xiaomi MIMO v2.5 Pro 92.07% 89.44% 94.70% 87.36%
76 Inception Mercury 2 92.03% 95.00% 89.05% 83.85%
77 MiniMax M3 91.84% 87.78% 95.91% 90.88%
78 Gemini 3.1 Flash Lite (Reasoning) 91.72% 94.44% 89.00% 86.41%
79 Stealth: Healer Alpha 91.67% 90.00% 93.35% 85.93%
80 Stealth: Hunter Alpha 91.67% 89.94% 93.39% 87.34%
81 ByteDance Seed 1.6 91.49% 90.00% 92.98% 90.70%
82 Z.AI GLM 4.5 91.03% 92.22% 89.84% 86.27%
83 Claude 3.5 Sonnet 90.30% 89.44% 91.16% 84.24%
84 Stealth: Aurora Alpha 90.11% 95.00% 85.21% 83.79%
85 Claude 3.7 Sonnet 89.94% 89.44% 90.43% 83.39%
86 Nemotron 3 Nano 89.91% 95.00% 84.82% 77.73%
87 GPT-5 Nano 89.61% 95.00% 84.22% 82.60%
88 Z.AI GLM 4.7 Flash 89.50% 95.00% 83.99% 84.82%
89 DeepSeek V3.2 89.46% 90.06% 88.87% 82.25%
90 Mistral Medium 3.1 89.32% 89.44% 89.19% 77.83%
91 Mistral Large 3 88.95% 89.44% 88.45% 85.43%
92 DeepSeek V3 (2024-12-26) 88.71% 90.39% 87.02% 83.68%
93 DeepSeek-V2 Chat 88.70% 89.44% 87.95% 84.83%
94 GPT-4o, May 13th (temp=0) 88.58% 89.44% 87.71% 85.36%
95 GPT-5.4 Nano (Reasoning) 88.48% 93.89% 83.07% 81.36%
96 Claude Sonnet 4.6 88.48% 83.89% 93.07% 91.15%
97 GPT-4.1 88.46% 89.94% 86.98% 88.68%
98 DeepSeek V3 (2025-03-24) 88.45% 89.61% 87.29% 81.99%
99 Mistral Large 2 88.20% 87.78% 88.62% 82.41%
100 GPT-5.4 Mini 88.04% 96.00% 80.08% 82.43%
101 Gemma 4 26B 88.02% 97.00% 79.04% 85.84%
102 Mistral Small Creative 87.99% 92.94% 83.04% 73.27%
103 Z.AI GLM 4.5 Air 87.91% 88.83% 86.99% 83.12%
104 Mistral Small 4 (Reasoning) 87.78% 90.44% 85.12% 82.39%
105 Claude Haiku 4.5 87.76% 89.44% 86.08% 85.14%
106 GPT-4o, Aug. 6th (temp=0) 87.59% 89.44% 85.73% 82.45%
107 DeepSeek V4 Pro 87.07% 84.17% 89.98% 82.63%
108 Grok 4.20 (Beta) 87.05% 86.72% 87.37% 83.85%
109 GPT-4o, Aug. 6th (temp=1) 86.91% 89.44% 84.37% 82.62%
110 Gemma 3 27B 86.74% 95.00% 78.48% 77.85%
111 Writer: Palmyra X5 86.57% 89.44% 83.69% 79.57%
112 ByteDance Seed 1.6 Flash 86.52% 91.67% 81.38% 73.27%
113 Qwen 3 32B 86.35% 88.89% 83.81% 82.21%
114 GPT-4o, May 13th (temp=1) 85.98% 89.44% 82.52% 83.80%
115 Inception Mercury 85.96% 95.00% 76.92% 79.50%
116 GPT-4.1 Mini 85.83% 89.44% 82.22% 83.20%
117 Qwen3 235B A22B Instruct 2507 85.82% 88.39% 83.25% 80.10%
118 Gemini 2.5 Flash Lite 85.80% 95.00% 76.60% 81.08%
119 Hermes 3 405B 85.58% 88.28% 82.88% 82.86%
120 Grok 4.20 84.81% 81.50% 88.11% 81.70%
121 Grok 4.3 84.62% 85.89% 83.35% 78.66%
122 DeepSeek V3.1 83.95% 84.94% 82.96% 82.39%
123 Qwen 2.5 72B 83.43% 89.44% 77.42% 75.46%
124 Ministral 3 14B 83.24% 89.44% 77.03% 72.54%
125 Grok 4.20 (Beta, Reasoning) 82.64% 68.33% 96.94% 91.49%
126 Llama 3.1 Nemotron 70B 82.19% 87.22% 77.15% 74.70%
127 Mistral Small 3.2 24B 81.71% 83.89% 79.53% 78.58%
128 GPT-4o Mini (temp=0) 81.26% 94.44% 68.07% 78.29%
129 GPT-4o Mini (temp=1) 80.28% 93.44% 67.12% 79.08%
130 Gemma 3 12B 79.42% 95.00% 63.84% 78.41%
131 Llama 3.1 70B 79.31% 85.00% 73.63% 78.40%
132 DeepSeek V4 Flash 79.16% 68.94% 89.38% 82.02%
133 Grok 4.20 (Reasoning) 79.11% 61.11% 97.12% 91.39%
134 Hermes 3 70B 79.08% 89.39% 68.76% 72.57%
135 GPT-5.4 Nano (Reasoning, Low) 78.93% 95.00% 62.85% 79.48%
136 Mistral Small 4 78.72% 89.50% 67.95% 76.46%
137 Claude 3 Haiku 77.94% 88.89% 66.98% 71.19%
138 Arcee AI: Trinity Large (Preview) 77.24% 83.89% 70.59% 73.33%
139 Arcee AI: Trinity Mini 76.94% 82.28% 71.60% 70.90%
140 Cydonia 24B V4.1 76.47% 83.61% 69.33% 75.09%
141 Mistral Large 76.31% 66.06% 86.57% 80.15%
142 GPT-5.4 Nano 75.66% 93.06% 58.27% 74.40%
143 Ministral 8B 73.78% 76.00% 71.56% 64.87%
144 Gemma 3 4B 73.64% 95.00% 52.28% 68.57%
145 Ministral 3 3B 71.88% 77.78% 65.98% 67.22%
146 Ministral 3 8B 71.64% 68.33% 74.95% 71.76%
147 GPT-4.1 Nano 70.24% 94.44% 46.04% 71.94%
148 Ministral 3B 69.70% 72.83% 66.56% 61.29%
149 Llama 3.1 8B 69.12% 86.78% 51.46% 63.35%
150 WizardLM 2 8x22b 67.36% 69.94% 64.77% 71.06%
151 Cohere Command R+ (Aug. 2024) 65.10% 63.83% 66.37% 69.03%
152 Skyfall 36B V2 62.23% 70.94% 53.52% 65.76%
153 Mistral NeMO 57.59% 87.33% 27.84% 65.04%
154 LFM2 24B 54.88% 78.89% 30.88% 58.77%
155 Rocinante 12B 54.31% 74.89% 33.73% 54.54%
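
The page does not say how the Reasoning column is derived, but the rows above are consistent with it being the unweighted mean of the Deduction and Attention columns. The sketch below checks that assumption against a few rows copied from the leaderboard. The names ROWS, category_score, and max_deviation are illustrative only and are not part of any published tooling.

```python
# Sample rows copied from the leaderboard:
# (model, deduction %, attention %, published reasoning %)
ROWS = [
    ("Gemini 3.5 Flash (Reasoning)", 99.50, 97.41, 98.45),
    ("Z.AI GLM 5.1", 96.50, 97.16, 96.83),
    ("Claude Opus 4.5", 89.44, 98.41, 93.93),
    ("Grok 4.20 (Reasoning)", 61.11, 97.12, 79.11),
    ("Mistral NeMO", 87.33, 27.84, 57.59),
    ("Rocinante 12B", 74.89, 33.73, 54.31),
]


def category_score(deduction: float, attention: float) -> float:
    """Assumed formula: unweighted mean of the two subcategory scores."""
    return (deduction + attention) / 2


def max_deviation(rows) -> float:
    """Largest absolute gap between the assumed formula and the published score."""
    return max(abs(category_score(d, a) - published) for _, d, a, published in rows)


if __name__ == "__main__":
    for model, d, a, published in ROWS:
        print(f"{model}: computed {category_score(d, a):.3f}, published {published:.2f}")
    print(f"max deviation: {max_deviation(ROWS):.4f}")
```

For these sample rows the gap stays within about 0.005 points, which is what rounding the published figures to two decimals would produce. That supports the assumed formula but does not confirm it.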