Reasoning

20 scenarios across 2 subcategories (Deduction and Attention). 118 models scored.

Subcategories

Subcategory | Avg Score | Best Model | Best Score
Deduction | 90.49% | Gemini 3 Flash (Preview, Reasoning) | 99.50%
Attention | 83.98% | Claude Opus 4.5 | 98.41%

Model Leaderboard

All models are ranked by their Reasoning category score, highest first. The Overall column is shown for reference and does not determine the order.

Rank | Model | Reasoning | Deduction | Attention | Overall
1 | Gemini 3 Flash (Preview, Reasoning) | 98.05% | 99.50% | 96.60% | 90.50%
2 | Gemini 2.5 Pro | 96.91% | 97.00% | 96.82% | 88.53%
3 | Gemini 3.1 Pro (Preview) | 96.01% | 96.00% | 96.03% | 94.37%
4 | Grok 4 | 96.01% | 95.00% | 97.02% | 88.12%
5 | Z.AI GLM 5 | 95.89% | 96.00% | 95.78% | 91.23%
6 | Z.AI GLM 5 Turbo | 95.67% | 95.00% | 96.34% | 94.27%
7 | GPT-5 | 95.67% | 95.00% | 96.33% | 91.93%
8 | ByteDance Seed 2.0 Lite | 95.50% | 96.50% | 94.49% | 84.80%
9 | MoonshotAI: Kimi K2.5 | 95.41% | 96.00% | 94.83% | 91.04%
10 | Gemini 3 Pro (Preview) | 95.24% | 95.00% | 95.47% | 88.79%
11 | GPT-5.1 | 95.14% | 95.00% | 95.29% | 92.54%
12 | Z.AI GLM 4.6 | 95.12% | 97.50% | 92.74% | 89.11%
13 | Qwen 3.5 397B A17B | 95.06% | 95.00% | 95.13% | 91.73%
14 | o4 Mini High | 95.02% | 95.00% | 95.05% | 90.29%
15 | Z.AI GLM 4.7 | 94.99% | 95.00% | 94.98% | 88.69%
16 | Qwen 3.5 122B | 94.93% | 95.00% | 94.87% | 91.53%
17 | Grok 4 Fast | 94.89% | 95.00% | 94.78% | 86.15%
18 | Qwen 3.5 35B | 94.88% | 95.00% | 94.75% | 88.00%
19 | Gemini 3 Flash (Preview) | 94.79% | 95.00% | 94.58% | 85.35%
20 | GPT-5.4 (Reasoning) | 94.78% | 95.00% | 94.55% | 93.24%
21 | Qwen 3.5 Flash | 94.66% | 95.00% | 94.31% | 86.38%
22 | GPT-5.4 Mini (Reasoning) | 94.56% | 95.00% | 94.13% | 90.65%
23 | GPT-5.2 | 94.54% | 95.50% | 93.58% | 90.26%
24 | Claude Sonnet 4 | 94.48% | 94.44% | 94.52% | 88.72%
25 | o4 Mini | 94.45% | 95.50% | 93.40% | 88.35%
26 | GPT-5 Mini | 94.36% | 95.50% | 93.22% | 92.62%
27 | GPT-5.4 (Reasoning, Low) | 94.34% | 95.00% | 93.67% | 91.41%
28 | Aion 2.0 | 94.13% | 92.78% | 95.49% | 89.21%
29 | Claude Opus 4.5 | 93.93% | 89.44% | 98.41% | 89.69%
30 | GPT-5.4 | 93.92% | 95.00% | 92.85% | 84.32%
31 | Gemini 2.5 Flash Lite (Reasoning) | 93.86% | 96.50% | 91.22% | 85.75%
32 | Gemini 2.5 Flash (Reasoning) | 93.81% | 95.89% | 91.73% | 86.51%
33 | Claude Opus 4.6 (Reasoning) | 93.77% | 89.44% | 98.10% | 95.02%
34 | Grok 4.1 Fast | 93.58% | 91.11% | 96.05% | 89.55%
35 | Qwen 3.5 Plus (2026-02-15) | 93.45% | 93.94% | 92.96% | 85.96%
36 | Claude Opus 4.6 | 93.33% | 88.89% | 97.78% | 92.35%
37 | MiniMax M2.7 | 93.28% | 94.94% | 91.61% | 89.10%
38 | Nemotron 3 Super | 93.11% | 95.00% | 91.22% | 84.56%
39 | Qwen 3.5 9B | 92.93% | 95.00% | 90.87% | 86.05%
40 | Claude Sonnet 4.6 (Reasoning) | 92.76% | 89.44% | 96.08% | 93.66%
41 | Qwen 3.5 27B | 92.73% | 95.00% | 90.46% | 90.85%
42 | Gemini 2.5 Flash | 92.60% | 95.00% | 90.21% | 80.60%
43 | Claude Opus 4 | 92.59% | 91.94% | 93.24% | 87.69%
44 | Claude Sonnet 4.5 | 92.50% | 89.44% | 95.56% | 88.03%
45 | MiniMax M2.5 | 92.42% | 95.50% | 89.33% | 88.71%
46 | ByteDance Seed 2.0 Mini | 92.40% | 91.11% | 93.68% | 86.91%
47 | GPT-5.4 Mini (Reasoning, Low) | 92.28% | 95.00% | 89.57% | 85.75%
48 | Gemini 3.1 Flash Lite (Preview) | 92.15% | 94.44% | 89.85% | 85.87%
49 | Inception Mercury 2 | 92.03% | 95.00% | 89.05% | 83.85%
50 | Stealth: Healer Alpha | 91.67% | 90.00% | 93.35% | 85.93%
51 | Stealth: Hunter Alpha | 91.67% | 89.94% | 93.39% | 87.34%
52 | ByteDance Seed 1.6 | 91.49% | 90.00% | 92.98% | 90.70%
53 | Z.AI GLM 4.5 | 91.03% | 92.22% | 89.84% | 86.27%
54 | Claude 3.5 Sonnet | 90.30% | 89.44% | 91.16% | 84.24%
55 | Stealth: Aurora Alpha | 90.11% | 95.00% | 85.21% | 83.79%
56 | Claude 3.7 Sonnet | 89.94% | 89.44% | 90.43% | 83.39%
57 | Nemotron 3 Nano | 89.91% | 95.00% | 84.82% | 77.73%
58 | GPT-5 Nano | 89.61% | 95.00% | 84.22% | 82.60%
59 | Z.AI GLM 4.7 Flash | 89.50% | 95.00% | 83.99% | 84.82%
60 | DeepSeek V3.2 | 89.46% | 90.06% | 88.87% | 82.25%
61 | Mistral Medium 3.1 | 89.32% | 89.44% | 89.19% | 77.83%
62 | Mistral Large 3 | 88.95% | 89.44% | 88.45% | 85.43%
63 | DeepSeek V3 (2024-12-26) | 88.71% | 90.39% | 87.02% | 83.68%
64 | DeepSeek-V2 Chat | 88.70% | 89.44% | 87.95% | 84.83%
65 | GPT-4o, May 13th (temp=0) | 88.58% | 89.44% | 87.71% | 85.36%
66 | GPT-5.4 Nano (Reasoning) | 88.48% | 93.89% | 83.07% | 81.36%
67 | Claude Sonnet 4.6 | 88.48% | 83.89% | 93.07% | 91.15%
68 | GPT-4.1 | 88.46% | 89.94% | 86.98% | 88.68%
69 | DeepSeek V3 (2025-03-24) | 88.45% | 89.61% | 87.29% | 81.99%
70 | Mistral Large 2 | 88.20% | 87.78% | 88.62% | 82.41%
71 | GPT-5.4 Mini | 88.04% | 96.00% | 80.08% | 82.43%
72 | Mistral Small Creative | 87.99% | 92.94% | 83.04% | 73.27%
73 | Mistral Small 4 (Reasoning) | 87.78% | 90.44% | 85.12% | 82.39%
74 | Claude Haiku 4.5 | 87.76% | 89.44% | 86.08% | 85.14%
75 | GPT-4o, Aug. 6th (temp=0) | 87.59% | 89.44% | 85.73% | 82.45%
76 | Grok 4.20 (Beta) | 87.05% | 86.72% | 87.37% | 83.85%
77 | GPT-4o, Aug. 6th (temp=1) | 86.91% | 89.44% | 84.37% | 82.62%
78 | Gemma 3 27B | 86.74% | 95.00% | 78.48% | 77.85%
79 | Writer: Palmyra X5 | 86.57% | 89.44% | 83.69% | 79.57%
80 | ByteDance Seed 1.6 Flash | 86.52% | 91.67% | 81.38% | 73.27%
81 | Qwen 3 32B | 86.35% | 88.89% | 83.81% | 82.21%
82 | GPT-4o, May 13th (temp=1) | 85.98% | 89.44% | 82.52% | 83.80%
83 | Inception Mercury | 85.96% | 95.00% | 76.92% | 79.50%
84 | GPT-4.1 Mini | 85.83% | 89.44% | 82.22% | 83.20%
85 | Qwen3 235B A22B Instruct 2507 | 85.82% | 88.39% | 83.25% | 80.10%
86 | Gemini 2.5 Flash Lite | 85.80% | 95.00% | 76.60% | 81.08%
87 | Hermes 3 405B | 85.58% | 88.28% | 82.88% | 82.86%
88 | DeepSeek V3.1 | 83.95% | 84.94% | 82.96% | 82.39%
89 | Qwen 2.5 72B | 83.43% | 89.44% | 77.42% | 75.46%
90 | Ministral 3 14B | 83.24% | 89.44% | 77.03% | 72.54%
91 | Grok 4.20 (Beta, Reasoning) | 82.64% | 68.33% | 96.94% | 91.49%
92 | Claude 3.5 Haiku | 82.23% | 93.94% | 70.51% | 83.73%
93 | Llama 3.1 Nemotron 70B | 82.19% | 87.22% | 77.15% | 74.70%
94 | Mistral Small 3.2 24B | 81.71% | 83.89% | 79.53% | 78.60%
95 | GPT-4o Mini (temp=0) | 81.26% | 94.44% | 68.07% | 78.29%
96 | GPT-4o Mini (temp=1) | 80.28% | 93.44% | 67.12% | 79.08%
97 | Gemma 3 12B | 79.42% | 95.00% | 63.84% | 78.41%
98 | Llama 3.1 70B | 79.31% | 85.00% | 73.63% | 78.40%
99 | Hermes 3 70B | 79.08% | 89.39% | 68.76% | 72.57%
100 | GPT-5.4 Nano (Reasoning, Low) | 78.93% | 95.00% | 62.85% | 79.48%
101 | Mistral Small 4 | 78.72% | 89.50% | 67.95% | 76.46%
102 | Claude 3 Haiku | 77.94% | 88.89% | 66.98% | 71.19%
103 | Arcee AI: Trinity Large (Preview) | 77.24% | 83.89% | 70.59% | 73.33%
104 | Arcee AI: Trinity Mini | 76.94% | 82.28% | 71.60% | 70.90%
105 | Mistral Large | 76.31% | 66.06% | 86.57% | 80.15%
106 | GPT-5.4 Nano | 75.66% | 93.06% | 58.27% | 74.40%
107 | Ministral 8B | 73.78% | 76.00% | 71.56% | 64.87%
108 | Gemma 3 4B | 73.64% | 95.00% | 52.28% | 68.57%
109 | Ministral 3 3B | 71.88% | 77.78% | 65.98% | 67.22%
110 | Ministral 3 8B | 71.64% | 68.33% | 74.95% | 71.76%
111 | GPT-4.1 Nano | 70.24% | 94.44% | 46.04% | 71.94%
112 | Ministral 3B | 69.70% | 72.83% | 66.56% | 61.29%
113 | Llama 3.1 8B | 69.12% | 86.78% | 51.46% | 63.37%
114 | WizardLM 2 8x22b | 67.36% | 69.94% | 64.77% | 71.07%
115 | Cohere Command R+ (Aug. 2024) | 65.10% | 63.83% | 66.37% | 69.03%
116 | Mistral NeMO | 57.59% | 87.33% | 27.84% | 65.04%
117 | LFM2 24B | 54.88% | 78.89% | 30.88% | 58.77%
118 | Rocinante 12B | 54.31% | 74.89% | 33.73% | 54.55%
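
In the rows above, each Reasoning score appears to equal the unweighted mean of the Deduction and Attention scores, to within rounding. This is an observation from the published figures, not a documented scoring formula. The Python sketch below checks it on four rows copied from the table. The names `ROWS`, `mean_gap`, and `TOLERANCE` are illustrative, and the tolerance is an assumption that allows for two-decimal rounding.

```python
# Rows copied from the leaderboard: (model, reasoning, deduction, attention).
ROWS = [
    ("Gemini 3 Flash (Preview, Reasoning)", 98.05, 99.50, 96.60),
    ("Claude Opus 4.5", 93.93, 89.44, 98.41),
    ("Grok 4.20 (Beta, Reasoning)", 82.64, 68.33, 96.94),
    ("Mistral NeMO", 57.59, 87.33, 27.84),
]

# Assumption: published values are rounded to two decimals, so a small
# gap between the listed score and the recomputed mean is expected.
TOLERANCE = 0.011


def mean_gap(reasoning: float, deduction: float, attention: float) -> float:
    """Absolute difference between the listed Reasoning score and the
    unweighted mean of the two subcategory scores."""
    return abs(reasoning - (deduction + attention) / 2)


for model, reasoning, deduction, attention in ROWS:
    gap = mean_gap(reasoning, deduction, attention)
    status = "consistent" if gap <= TOLERANCE else "differs"
    print(f"{model}: gap {gap:.3f} ({status})")
```

If the check holds across the table, a model with uneven subcategory scores, such as a low Deduction score paired with a high Attention score, can rank well below models that trail it on its stronger subcategory.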