Reasoning

20 scenarios across 2 subcategories. 104 models scored.

Subcategories

| Subcategory | Avg Score | Best Model | Best Score |
| --- | --- | --- | --- |
| Deduction | 90.46% | Gemini 3 Flash (Preview, Reasoning) | 99.50% |
| Attention | 84.13% | Claude Opus 4.5 | 98.41% |

Model Leaderboard

All models ranked by their Reasoning category score.

| # | Model | Reasoning | Deduction | Attention | Overall |
| --- | --- | --- | --- | --- | --- |
| 1 | Gemini 3 Flash (Preview, Reasoning) | 98.05% | 99.50% | 96.60% | 90.50% |
| 2 | Gemini 2.5 Pro | 96.91% | 97.00% | 96.82% | 88.53% |
| 3 | Gemini 3.1 Pro (Preview) | 96.01% | 96.00% | 96.03% | 94.37% |
| 4 | Grok 4 | 96.01% | 95.00% | 97.02% | 88.12% |
| 5 | Z.AI GLM 5 | 95.89% | 96.00% | 95.78% | 91.23% |
| 6 | GPT-5 | 95.67% | 95.00% | 96.33% | 91.93% |
| 7 | ByteDance Seed 2.0 Lite | 95.50% | 96.50% | 94.49% | 84.80% |
| 8 | MoonshotAI: Kimi K2.5 | 95.41% | 96.00% | 94.83% | 91.04% |
| 9 | Gemini 3 Pro (Preview) | 95.24% | 95.00% | 95.47% | 88.79% |
| 10 | GPT-5.1 | 95.14% | 95.00% | 95.29% | 92.54% |
| 11 | Z.AI GLM 4.6 | 95.12% | 97.50% | 92.74% | 89.11% |
| 12 | Qwen 3.5 397B A17B | 95.06% | 95.00% | 95.13% | 91.73% |
| 13 | o4 Mini High | 95.02% | 95.00% | 95.05% | 90.29% |
| 14 | Z.AI GLM 4.7 | 94.99% | 95.00% | 94.98% | 88.69% |
| 15 | Qwen 3.5 122B | 94.93% | 95.00% | 94.87% | 91.53% |
| 16 | Grok 4 Fast | 94.89% | 95.00% | 94.78% | 86.15% |
| 17 | Qwen 3.5 35B | 94.88% | 95.00% | 94.75% | 88.00% |
| 18 | Gemini 3 Flash (Preview) | 94.79% | 95.00% | 94.58% | 85.35% |
| 19 | GPT-5.4 (Reasoning) | 94.78% | 95.00% | 94.55% | 93.24% |
| 20 | Qwen 3.5 Flash | 94.66% | 95.00% | 94.31% | 86.38% |
| 21 | GPT-5.2 | 94.54% | 95.50% | 93.58% | 90.26% |
| 22 | Claude Sonnet 4 | 94.48% | 94.44% | 94.52% | 88.72% |
| 23 | o4 Mini | 94.45% | 95.50% | 93.40% | 88.35% |
| 24 | GPT-5 Mini | 94.36% | 95.50% | 93.22% | 92.62% |
| 25 | GPT-5.4 (Reasoning, Low) | 94.34% | 95.00% | 93.67% | 91.41% |
| 26 | Aion 2.0 | 94.13% | 92.78% | 95.49% | 89.21% |
| 27 | Claude Opus 4.5 | 93.93% | 89.44% | 98.41% | 89.69% |
| 28 | GPT-5.4 | 93.92% | 95.00% | 92.85% | 84.32% |
| 29 | Gemini 2.5 Flash Lite (Reasoning) | 93.86% | 96.50% | 91.22% | 85.75% |
| 30 | Gemini 2.5 Flash (Reasoning) | 93.81% | 95.89% | 91.73% | 86.51% |
| 31 | Claude Opus 4.6 (Reasoning) | 93.77% | 89.44% | 98.10% | 95.02% |
| 32 | Grok 4.1 Fast | 93.58% | 91.11% | 96.05% | 89.55% |
| 33 | Qwen 3.5 Plus (2026-02-15) | 93.45% | 93.94% | 92.96% | 85.96% |
| 34 | Claude Opus 4.6 | 93.33% | 88.89% | 97.78% | 92.35% |
| 35 | Nemotron 3 Super | 93.11% | 95.00% | 91.22% | 84.56% |
| 36 | Qwen 3.5 9B | 92.93% | 95.00% | 90.87% | 86.05% |
| 37 | Claude Sonnet 4.6 (Reasoning) | 92.76% | 89.44% | 96.08% | 93.66% |
| 38 | Qwen 3.5 27B | 92.73% | 95.00% | 90.46% | 90.85% |
| 39 | Gemini 2.5 Flash | 92.60% | 95.00% | 90.21% | 80.60% |
| 40 | Claude Opus 4 | 92.59% | 91.94% | 93.24% | 87.69% |
| 41 | Claude Sonnet 4.5 | 92.50% | 89.44% | 95.56% | 88.03% |
| 42 | Minimax M2.5 | 92.42% | 95.50% | 89.33% | 88.71% |
| 43 | ByteDance Seed 2.0 Mini | 92.40% | 91.11% | 93.68% | 86.91% |
| 44 | Gemini 3.1 Flash Lite (Preview) | 92.15% | 94.44% | 89.85% | 85.87% |
| 45 | Inception Mercury 2 | 92.03% | 95.00% | 89.05% | 83.85% |
| 46 | Stealth: Hunter Alpha | 91.67% | 89.94% | 93.39% | 87.33% |
| 47 | Stealth: Healer Alpha | 91.57% | 89.78% | 93.35% | 85.92% |
| 48 | ByteDance Seed 1.6 | 91.49% | 90.00% | 92.98% | 90.70% |
| 49 | Z.AI GLM 4.5 | 91.03% | 92.22% | 89.84% | 86.27% |
| 50 | Claude 3.5 Sonnet | 90.30% | 89.44% | 91.16% | 84.24% |
| 51 | Nemotron 3 Nano | 90.14% | 95.45% | 84.82% | 77.78% |
| 52 | Stealth: Aurora Alpha | 90.11% | 95.00% | 85.21% | 83.79% |
| 53 | Claude 3.7 Sonnet | 89.94% | 89.44% | 90.43% | 83.39% |
| 54 | GPT-5 Nano | 89.61% | 95.00% | 84.22% | 82.60% |
| 55 | Z.AI GLM 4.7 Flash | 89.50% | 95.00% | 83.99% | 84.82% |
| 56 | DeepSeek V3.2 | 89.46% | 90.06% | 88.87% | 82.25% |
| 57 | Mistral Medium 3.1 | 89.32% | 89.44% | 89.19% | 77.83% |
| 58 | Mistral Large 3 | 88.95% | 89.44% | 88.45% | 85.43% |
| 59 | DeepSeek V3 (2024-12-26) | 88.71% | 90.39% | 87.02% | 83.68% |
| 60 | DeepSeek-V2 Chat | 88.70% | 89.44% | 87.95% | 84.83% |
| 61 | GPT-4o, May 13th (temp=0) | 88.58% | 89.44% | 87.71% | 85.36% |
| 62 | Claude Sonnet 4.6 | 88.48% | 83.89% | 93.07% | 91.15% |
| 63 | GPT-4.1 | 88.46% | 89.94% | 86.98% | 88.68% |
| 64 | DeepSeek V3 (2025-03-24) | 88.45% | 89.61% | 87.29% | 81.99% |
| 65 | Mistral Large 2 | 88.20% | 87.78% | 88.62% | 82.41% |
| 66 | Mistral Small Creative | 87.99% | 92.94% | 83.04% | 73.27% |
| 67 | Claude Haiku 4.5 | 87.76% | 89.44% | 86.08% | 85.14% |
| 68 | GPT-4o, Aug. 6th (temp=0) | 87.59% | 89.44% | 85.73% | 82.45% |
| 69 | GPT-4o, Aug. 6th (temp=1) | 86.91% | 89.44% | 84.37% | 82.62% |
| 70 | Gemma 3 27B | 86.74% | 95.00% | 78.48% | 77.85% |
| 71 | Writer: Palmyra X5 | 86.57% | 89.44% | 83.69% | 79.57% |
| 72 | ByteDance Seed 1.6 Flash | 86.52% | 91.67% | 81.38% | 73.27% |
| 73 | GPT-4o, May 13th (temp=1) | 85.98% | 89.44% | 82.52% | 83.80% |
| 74 | Inception Mercury | 85.96% | 95.00% | 76.92% | 79.50% |
| 75 | GPT-4.1 Mini | 85.83% | 89.44% | 82.22% | 83.20% |
| 76 | Gemini 2.5 Flash Lite | 85.80% | 95.00% | 76.60% | 81.08% |
| 77 | Hermes 3 405B | 85.58% | 88.28% | 82.88% | 82.86% |
| 78 | DeepSeek V3.1 | 83.95% | 84.94% | 82.96% | 82.39% |
| 79 | Qwen 2.5 72B | 83.43% | 89.44% | 77.42% | 75.46% |
| 80 | Ministral 3 14B | 83.24% | 89.44% | 77.03% | 72.54% |
| 81 | Claude 3.5 Haiku | 82.23% | 93.94% | 70.51% | 83.73% |
| 82 | Llama 3.1 Nemotron 70B | 82.19% | 87.22% | 77.15% | 74.70% |
| 83 | Mistral Small 3.2 24B | 81.71% | 83.89% | 79.53% | 78.60% |
| 84 | GPT-4o Mini (temp=0) | 81.26% | 94.44% | 68.07% | 78.29% |
| 85 | GPT-4o Mini (temp=1) | 80.28% | 93.44% | 67.12% | 79.08% |
| 86 | Gemma 3 12B | 79.42% | 95.00% | 63.84% | 78.41% |
| 87 | Llama 3.1 70B | 79.31% | 85.00% | 73.63% | 78.40% |
| 88 | Hermes 3 70B | 79.08% | 89.39% | 68.76% | 72.57% |
| 89 | Claude 3 Haiku | 77.94% | 88.89% | 66.98% | 71.19% |
| 90 | Arcee AI: Trinity Large (Preview) | 77.24% | 83.89% | 70.59% | 73.33% |
| 91 | Arcee AI: Trinity Mini | 76.94% | 82.28% | 71.60% | 70.90% |
| 92 | Mistral Large | 76.31% | 66.06% | 86.57% | 80.15% |
| 93 | Ministral 8B | 73.78% | 76.00% | 71.56% | 64.87% |
| 94 | Gemma 3 4B | 73.64% | 95.00% | 52.28% | 68.57% |
| 95 | Ministral 3 3B | 71.88% | 77.78% | 65.98% | 67.22% |
| 96 | Ministral 3 8B | 71.64% | 68.33% | 74.95% | 71.76% |
| 97 | GPT-4.1 Nano | 70.24% | 94.44% | 46.04% | 71.94% |
| 98 | Ministral 3B | 69.70% | 72.83% | 66.56% | 61.29% |
| 99 | Llama 3.1 8B | 69.12% | 86.78% | 51.46% | 63.37% |
| 100 | WizardLM 2 8x22b | 67.36% | 69.94% | 64.77% | 71.07% |
| 101 | Cohere Command R+ (Aug. 2024) | 65.10% | 63.83% | 66.37% | 69.03% |
| 102 | Mistral NeMO | 57.59% | 87.33% | 27.84% | 65.04% |
| 103 | LFM2 24B | 54.88% | 78.89% | 30.88% | 58.77% |
| 104 | Rocinante 12B | 54.31% | 74.89% | 33.73% | 54.55% |
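A note on how the Reasoning column relates to the two subcategory columns: for every row checked, it matches the unweighted mean of the Deduction and Attention scores to the displayed rounding. Whether the benchmark actually weights the two subcategories equally is an inference from the published numbers, not a documented fact; the sketch below just verifies the pattern on a few rows copied from the table.

```python
# Check (for a sample of rows copied from the leaderboard above) that the
# Reasoning score equals the unweighted mean of Deduction and Attention.
# Equal weighting is an assumption inferred from the data, not documented.

rows = [
    # (model, reasoning, deduction, attention)
    ("Gemini 3 Flash (Preview, Reasoning)", 98.05, 99.50, 96.60),
    ("Claude Opus 4.5", 93.93, 89.44, 98.41),
    ("Z.AI GLM 4.6", 95.12, 97.50, 92.74),
    ("Mistral Large", 76.31, 66.06, 86.57),
]

for model, reasoning, deduction, attention in rows:
    mean = (deduction + attention) / 2
    print(f"{model}: table={reasoning:.2f} mean={mean:.2f}")
    # Agreement within 0.01 allows for the table's two-decimal rounding.
    assert abs(mean - reasoning) <= 0.01
```

The 0.01 tolerance matters: a row like Claude Opus 4.5 has a true mean of 93.925, which the table rounds to 93.93, so exact equality would fail.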