Reasoning

20 scenarios across 2 subcategories. 91 models scored.

Subcategories

Subcategory | Avg Score | Best Model | Best Score
Deduction | 89.95% | Gemini 3 Flash (Preview, Reasoning) | 99.50%
Attention | 83.20% | Claude Opus 4.5 | 98.41%

Model Leaderboard

All models ranked by their Reasoning category score.

# Model Reasoning Deduction Attention Overall
1 Gemini 3 Flash (Preview, Reasoning) 98.05% 99.50% 96.60% 90.50%
2 Gemini 2.5 Pro 96.91% 97.00% 96.82% 88.53%
3 Gemini 3.1 Pro (Preview) 96.01% 96.00% 96.03% 94.37%
4 Grok 4 96.01% 95.00% 97.02% 88.12%
5 Z.AI GLM 5 95.89% 96.00% 95.78% 91.23%
6 GPT-5 95.67% 95.00% 96.33% 91.93%
7 MoonshotAI: Kimi K2.5 95.41% 96.00% 94.83% 91.04%
8 Gemini 3 Pro (Preview) 95.24% 95.00% 95.47% 88.79%
9 GPT-5.1 95.14% 95.00% 95.29% 92.54%
10 Z.AI GLM 4.6 95.12% 97.50% 92.74% 89.11%
11 Qwen 3.5 397B A17B 95.06% 95.00% 95.13% 91.73%
12 o4 Mini High 95.02% 95.00% 95.05% 90.29%
13 Z.AI GLM 4.7 94.99% 95.00% 94.98% 88.69%
14 Qwen 3.5 122B 94.93% 95.00% 94.87% 91.53%
15 Grok 4 Fast 94.89% 95.00% 94.78% 86.15%
16 Qwen 3.5 35B 94.88% 95.00% 94.75% 88.00%
17 Gemini 3 Flash (Preview) 94.79% 95.00% 94.58% 85.35%
18 Qwen 3.5 Flash 94.66% 95.00% 94.31% 86.38%
19 GPT-5.2 94.54% 95.50% 93.58% 90.26%
20 Claude Sonnet 4 94.48% 94.44% 94.52% 88.72%
21 o4 Mini 94.45% 95.50% 93.40% 88.35%
22 GPT-5 Mini 94.36% 95.50% 93.22% 92.62%
23 Aion 2.0 94.13% 92.78% 95.49% 89.21%
24 Claude Opus 4.5 93.93% 89.44% 98.41% 89.69%
25 Gemini 2.5 Flash Lite (Reasoning) 93.86% 96.50% 91.22% 85.75%
26 Gemini 2.5 Flash (Reasoning) 93.81% 95.89% 91.73% 86.51%
27 Claude Opus 4.6 (Reasoning) 93.77% 89.44% 98.10% 95.02%
28 Grok 4.1 Fast 93.58% 91.11% 96.05% 89.55%
29 Qwen 3.5 Plus (2026-02-15) 93.45% 93.94% 92.96% 85.96%
30 Claude Opus 4.6 93.33% 88.89% 97.78% 92.35%
31 Claude Sonnet 4.6 (Reasoning) 92.76% 89.44% 96.08% 93.66%
32 Qwen 3.5 27B 92.73% 95.00% 90.46% 90.85%
33 Gemini 2.5 Flash 92.60% 95.00% 90.21% 80.60%
34 Claude Opus 4 92.59% 91.94% 93.24% 87.69%
35 Claude Sonnet 4.5 92.50% 89.44% 95.56% 88.03%
36 Minimax M2.5 92.42% 95.50% 89.33% 88.71%
37 ByteDance Seed 1.6 91.49% 90.00% 92.98% 90.70%
38 Z.AI GLM 4.5 91.03% 92.22% 89.84% 86.27%
39 Claude 3.5 Sonnet 90.30% 89.44% 91.16% 84.24%
40 Stealth: Aurora Alpha 90.11% 95.00% 85.21% 83.79%
41 Claude 3.7 Sonnet 89.94% 89.44% 90.43% 83.39%
42 GPT-5 Nano 89.61% 95.00% 84.22% 82.60%
43 Z.AI GLM 4.7 Flash 89.50% 95.00% 83.99% 84.82%
44 DeepSeek V3.2 89.46% 90.06% 88.87% 82.25%
45 Mistral Medium 3.1 89.32% 89.44% 89.19% 77.83%
46 Mistral Large 3 88.95% 89.44% 88.45% 85.43%
47 DeepSeek V3 (2024-12-26) 88.71% 90.39% 87.02% 83.68%
48 DeepSeek-V2 Chat 88.70% 89.44% 87.95% 84.83%
49 GPT-4o, May 13th (temp=0) 88.58% 89.44% 87.71% 85.36%
50 Claude Sonnet 4.6 88.48% 83.89% 93.07% 91.15%
51 GPT-4.1 88.46% 89.94% 86.98% 88.68%
52 DeepSeek V3 (2025-03-24) 88.45% 89.61% 87.29% 81.99%
53 Mistral Large 2 88.20% 87.78% 88.62% 82.41%
54 Mistral Small Creative 87.99% 92.94% 83.04% 73.27%
55 Claude Haiku 4.5 87.76% 89.44% 86.08% 85.14%
56 GPT-4o, Aug. 6th (temp=0) 87.59% 89.44% 85.73% 82.45%
57 GPT-4o, Aug. 6th (temp=1) 86.91% 89.44% 84.37% 82.62%
58 Gemma 3 27B 86.74% 95.00% 78.48% 77.85%
59 Writer: Palmyra X5 86.57% 89.44% 83.69% 79.57%
60 ByteDance Seed 1.6 Flash 86.52% 91.67% 81.38% 73.27%
61 GPT-4o, May 13th (temp=1) 85.98% 89.44% 82.52% 83.80%
62 GPT-4.1 Mini 85.83% 89.44% 82.22% 83.20%
63 Gemini 2.5 Flash Lite 85.80% 95.00% 76.60% 81.08%
64 Hermes 3 405B 85.58% 88.28% 82.88% 82.86%
65 DeepSeek V3.1 83.95% 84.94% 82.96% 82.39%
66 Qwen 2.5 72B 83.43% 89.44% 77.42% 75.46%
67 Ministral 3 14B 83.24% 89.44% 77.03% 72.54%
68 Claude 3.5 Haiku 82.23% 93.94% 70.51% 83.73%
69 Llama 3.1 Nemotron 70B 82.19% 87.22% 77.15% 74.70%
70 Mistral Small 3.2 24B 81.71% 83.89% 79.53% 78.60%
71 GPT-4o Mini (temp=0) 81.26% 94.44% 68.07% 78.29%
72 GPT-4o Mini (temp=1) 80.28% 93.44% 67.12% 79.08%
73 Gemma 3 12B 79.42% 95.00% 63.84% 78.41%
74 Llama 3.1 70B 79.31% 85.00% 73.63% 78.40%
75 Hermes 3 70B 79.08% 89.39% 68.76% 72.57%
76 Claude 3 Haiku 77.94% 88.89% 66.98% 71.19%
77 Arcee AI: Trinity Large (Preview) 77.24% 83.89% 70.59% 73.33%
78 Arcee AI: Trinity Mini 76.94% 82.28% 71.60% 70.90%
79 Mistral Large 76.31% 66.06% 86.57% 80.15%
80 Ministral 8B 73.78% 76.00% 71.56% 64.87%
81 Gemma 3 4B 73.64% 95.00% 52.28% 68.57%
82 Ministral 3 3B 71.88% 77.78% 65.98% 67.22%
83 Ministral 3 8B 71.64% 68.33% 74.95% 71.76%
84 GPT-4.1 Nano 70.24% 94.44% 46.04% 71.94%
85 Ministral 3B 69.70% 72.83% 66.56% 61.29%
86 Llama 3.1 8B 69.12% 86.78% 51.46% 63.37%
87 WizardLM 2 8x22b 67.36% 69.94% 64.77% 71.07%
88 Cohere Command R+ (Aug. 2024) 65.10% 63.83% 66.37% 69.03%
89 Mistral NeMO 57.59% 87.33% 27.84% 65.04%
90 LFM2 24B 54.88% 78.89% 30.88% 58.77%
91 Rocinante 12B 54.31% 74.89% 33.73% 54.55%
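
The page does not state how the Reasoning score is derived from the two subcategories. In the rows above it appears to match the unweighted mean of Deduction and Attention to within rounding, and the sketch below checks that assumption on a handful of rows copied from the leaderboard. The function names are illustrative only and the averaging rule is inferred from the numbers, not taken from the benchmark's documentation.

```python
# Rows copied from the leaderboard: (model, reasoning, deduction, attention).
ROWS = [
    ("Gemini 3 Flash (Preview, Reasoning)", 98.05, 99.50, 96.60),
    ("Gemini 2.5 Pro", 96.91, 97.00, 96.82),
    ("Claude Opus 4.5", 93.93, 89.44, 98.41),
    ("GPT-4.1 Nano", 70.24, 94.44, 46.04),
    ("Mistral NeMO", 57.59, 87.33, 27.84),
]


def mean_of_subcategories(deduction: float, attention: float) -> float:
    """Assumed rule: category score = unweighted mean of its subcategories."""
    return (deduction + attention) / 2


def max_deviation(rows) -> float:
    """Largest gap between the published Reasoning score and the assumed mean."""
    return max(
        abs(mean_of_subcategories(deduction, attention) - reasoning)
        for _, reasoning, deduction, attention in rows
    )


if __name__ == "__main__":
    for model, reasoning, deduction, attention in ROWS:
        estimate = mean_of_subcategories(deduction, attention)
        print(f"{model}: published {reasoning:.2f}%, mean {estimate:.3f}%")
    print(f"max deviation: {max_deviation(ROWS):.3f} points")
```

If the assumption holds, a model with a large gap between the two subcategories (GPT-4.1 Nano, for example) can rank low on Reasoning despite a strong Deduction score, because both subcategories carry equal weight.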