Tooling

15 scenarios across 1 subcategory. 155 models scored.

Subcategories

Subcategory | Avg Score | Best Model | Best Score
XML | 95.20% | Claude Opus 4.6 (Reasoning) | 100.00%

Model Leaderboard

All models ranked by their Tooling category score.

# Model Tooling XML Overall
1 Claude Opus 4.6 (Reasoning) 100.00% 100.00% 95.06%
2 Qwen3.7 Max 100.00% 100.00% 94.55%
3 GPT-5.4 (Reasoning) 100.00% 100.00% 93.85%
4 Z.AI GLM 5.1 100.00% 100.00% 93.74%
5 Qwen3.6 Max Preview 100.00% 100.00% 93.72%
6 GPT-5.5 (Reasoning) 100.00% 100.00% 93.72%
7 Claude Sonnet 4.6 (Reasoning) 100.00% 100.00% 93.64%
8 Gemini 3.5 Flash (Reasoning) 100.00% 100.00% 93.35%
9 Claude Opus 4.7 (Reasoning) 100.00% 100.00% 92.53%
10 Claude Opus 4.6 100.00% 100.00% 92.31%
11 Grok 4.20 (Beta, Reasoning) 100.00% 100.00% 90.98%
12 Grok 4.20 (Reasoning) 100.00% 100.00% 90.87%
13 MoonshotAI: Kimi K2.5 100.00% 100.00% 90.86%
14 Claude Sonnet 4.6 100.00% 100.00% 90.66%
15 Gemini 3 Flash (Preview, Reasoning) 100.00% 100.00% 89.93%
16 GPT-5.4 Mini (Reasoning) 100.00% 100.00% 89.82%
17 Claude Opus 4.5 100.00% 100.00% 89.60%
18 Grok 4.1 Fast 100.00% 100.00% 89.55%
19 GPT-5.2 100.00% 100.00% 89.45%
20 GPT-5.5 100.00% 100.00% 89.37%
21 o4 Mini High 100.00% 100.00% 88.78%
22 Gemini 2.5 Pro 100.00% 100.00% 88.44%
23 Claude Sonnet 4 100.00% 100.00% 87.64%
24 Claude Opus 4 100.00% 100.00% 87.22%
25 o4 Mini 100.00% 100.00% 86.56%
26 Xiaomi MIMO v2.5 Pro 100.00% 100.00% 86.05%
27 Stealth: Healer Alpha 100.00% 100.00% 85.93%
28 Gemini 3.5 Flash (Reasoning, Minimal) 100.00% 100.00% 85.88%
29 Gemma 4 31B 100.00% 100.00% 85.23%
30 GPT-OSS 120B 100.00% 100.00% 84.81%
31 Claude 3.5 Sonnet 100.00% 100.00% 84.24%
32 Gemini 2.5 Flash (Reasoning) 100.00% 100.00% 84.14%
33 GPT-5.4 Mini (Reasoning, Low) 100.00% 100.00% 83.57%
34 Grok 4.20 (Beta) 100.00% 100.00% 82.64%
35 DeepSeek V3.2 99.99% 99.99% 82.22%
36 Stealth: Hunter Alpha 99.99% 99.99% 87.34%
37 Grok 4 99.99% 99.99% 88.12%
38 GPT-5 Mini 99.99% 99.99% 91.31%
39 Gemini 3.1 Flash Lite (Preview) 99.98% 99.98% 85.41%
40 Gemini 3 Pro (Preview) 99.98% 99.98% 88.79%
41 Gemini 3.1 Flash Lite 99.97% 99.97% 85.09%
42 GPT-5.4 (Reasoning, Low) 99.97% 99.97% 90.91%
43 Gemini 2.5 Flash 99.97% 99.97% 80.61%
44 MiniMax M3 99.96% 99.96% 90.45%
45 GPT-4o, Aug. 6th (temp=0) 99.96% 99.96% 82.18%
46 ByteDance Seed 2.0 Lite 99.93% 99.93% 84.27%
47 Gemini 3.1 Pro (Preview) 99.92% 99.92% 94.08%
48 MoonshotAI: Kimi K2.6 99.81% 99.81% 92.57%
49 Qwen 3.5 397B A17B 99.81% 99.81% 91.09%
50 Gemini 3.1 Flash Lite (Reasoning) 99.81% 99.81% 85.91%
51 Qwen 3.5 Plus (2026-02-15) 99.78% 99.78% 86.17%
52 Qwen 3.6 Flash 99.78% 99.78% 89.31%
53 ByteDance Seed 2.0 Mini 99.70% 99.70% 85.69%
54 Stealth: Aurora Alpha 99.69% 99.69% 83.79%
55 Grok 4 Fast 99.65% 99.65% 86.15%
56 DeepSeek V4 Pro 99.57% 99.57% 82.05%
57 Claude Opus 4.8 (Reasoning) 99.53% 99.53% 92.33%
58 Claude Opus 4.8 (Reasoning, Low) 99.48% 99.48% 91.89%
59 Claude Opus 4.7 99.48% 99.48% 89.90%
60 Xiaomi MIMO v2.5 99.46% 99.46% 83.95%
61 DeepSeek V4 Pro (Reasoning) 99.40% 99.40% 89.28%
62 GPT-5 Nano 99.34% 99.34% 80.16%
63 Qwen 3.5 122B 99.33% 99.33% 90.32%
64 MiniMax M2.7 99.32% 99.32% 86.23%
65 Claude 3.7 Sonnet 99.32% 99.32% 83.39%
66 GPT-4o Mini (temp=1) 99.26% 99.26% 79.08%
67 Qwen 3.5 27B 99.17% 99.17% 90.05%
68 Gemma 4 26B (Reasoning) 99.04% 99.04% 89.02%
69 GPT-4o, Aug. 6th (temp=1) 98.86% 98.86% 81.28%
70 DeepSeek-V2 Chat 98.80% 98.80% 84.09%
71 Z.AI GLM 4.5 Air 98.75% 98.75% 80.74%
72 GPT-5 98.67% 98.67% 91.48%
73 Z.AI GLM 4.7 98.67% 98.67% 87.67%
74 DeepSeek V3 (2024-12-26) 98.58% 98.58% 82.62%
75 Z.AI GLM 4.5 98.58% 98.58% 84.95%
76 GPT-5.4 Mini 98.55% 98.55% 80.45%
77 GPT-4o Mini (temp=0) 98.54% 98.54% 78.29%
78 GPT-5.5 (Reasoning, Low) 98.50% 98.50% 92.51%
79 Gemma 4 31B (Reasoning) 98.50% 98.50% 89.64%
80 Z.AI GLM 5 98.50% 98.50% 89.60%
81 Grok 4.3 (Reasoning) 98.47% 98.47% 90.99%
82 GPT-5.1 98.33% 98.33% 90.73%
83 ByteDance Seed 1.6 98.33% 98.33% 89.59%
84 Inception Mercury 2 98.33% 98.33% 81.99%
85 Gemini 2.5 Flash Lite (Reasoning) 98.28% 98.28% 83.10%
86 MiniMax M2.5 98.27% 98.27% 86.71%
87 GPT-5.4 98.04% 98.04% 84.31%
88 Gemini 3 Flash (Preview) 98.04% 98.04% 85.47%
89 Z.AI GLM 4.6 97.99% 97.99% 87.64%
90 Inception Mercury 97.98% 97.98% 79.50%
91 Qwen 3.6 27B 97.91% 97.91% 88.33%
92 Claude Sonnet 4.5 97.75% 97.75% 87.54%
93 GPT-4.1 97.71% 97.71% 86.82%
94 GPT-4o, May 13th (temp=1) 97.31% 97.31% 82.99%
95 Mistral Large 2 97.31% 97.31% 81.50%
96 Mistral Small 4 (Reasoning) 97.28% 97.28% 79.48%
97 Mistral Large 3 97.22% 97.22% 84.29%
98 GPT-4o, May 13th (temp=0) 97.18% 97.18% 84.73%
99 Hermes 3 405B 96.98% 96.98% 80.80%
100 GPT-4.1 Mini 96.94% 96.94% 81.40%
101 Qwen 3.5 9B 96.84% 96.84% 84.05%
102 Mistral Small 3.2 24B 96.82% 96.82% 77.36%
103 Qwen 2.5 72B 96.82% 96.82% 73.17%
104 DeepSeek V3.1 96.80% 96.80% 82.35%
105 Claude Haiku 4.5 96.75% 96.75% 83.36%
106 GPT-5.4 Nano (Reasoning) 96.43% 96.43% 80.02%
107 Mistral Large 96.39% 96.39% 79.91%
108 DeepSeek V4 Flash (Reasoning) 96.17% 96.17% 88.06%
109 Gemma 3 27B 96.15% 96.15% 75.70%
110 Writer: Palmyra X5 96.12% 96.12% 78.11%
111 Gemini 2.5 Flash Lite 96.00% 96.00% 79.91%
112 Ministral 3 8B 95.77% 95.77% 69.98%
113 Llama 3.1 Nemotron 70B 95.74% 95.74% 74.70%
114 Arcee AI: Trinity Mini 95.64% 95.64% 67.68%
115 Qwen 3.5 Plus (2026-04-20) 95.33% 95.33% 89.79%
116 Qwen 3 32B 95.19% 95.19% 79.37%
117 Aion 2.0 95.10% 95.10% 86.66%
118 Claude 3 Haiku 95.06% 95.06% 70.13%
119 Grok 4.20 95.00% 95.00% 81.21%
120 Qwen 3.5 35B 94.74% 94.74% 87.01%
121 Hermes 3 70B 94.47% 94.47% 69.74%
122 Z.AI GLM 5 Turbo 94.30% 94.30% 93.29%
123 Mistral Medium 3.1 94.17% 94.17% 76.08%
124 Gemma 3 12B 94.16% 94.16% 76.07%
125 DeepSeek V4 Flash 94.00% 94.00% 82.02%
126 Qwen3 235B A22B Instruct 2507 93.86% 93.86% 78.07%
127 Mistral Small 4 93.85% 93.85% 75.23%
128 Arcee AI: Trinity Large (Preview) 93.42% 93.42% 73.33%
129 Z.AI GLM 4.7 Flash 92.53% 92.53% 82.21%
130 Gemma 3 4B 92.40% 92.40% 66.33%
131 Gemma 4 26B 91.94% 91.94% 84.89%
132 GPT-5.4 Nano 91.85% 91.85% 72.16%
133 GPT-5.4 Nano (Reasoning, Low) 91.46% 91.46% 77.46%
134 WizardLM 2 8x22b 91.22% 91.22% 71.45%
135 Grok 4.3 91.21% 91.21% 78.00%
136 Ministral 3 3B 89.41% 89.41% 65.02%
137 Qwen 3.5 Flash 89.39% 89.39% 85.66%
138 Ministral 3 14B 87.84% 87.84% 70.45%
139 Llama 3.1 70B 87.82% 87.82% 77.41%
140 Cohere Command R+ (Aug. 2024) 87.53% 87.53% 67.04%
141 Mistral Small Creative 86.85% 86.85% 73.27%
142 Cydonia 24B V4.1 86.72% 86.72% 72.68%
143 DeepSeek V3 (2025-03-24) 86.36% 86.36% 79.93%
144 Ministral 3B 84.70% 84.70% 59.25%
145 Ministral 8B 84.65% 84.65% 63.77%
146 Nemotron 3 Nano 83.65% 83.65% 74.50%
147 Qwen 3.6 35B 82.67% 82.67% 87.66%
148 Nemotron 3 Super 79.58% 79.58% 78.99%
149 Mistral NeMO 79.34% 79.34% 63.80%
150 GPT-4.1 Nano 78.47% 78.47% 69.90%
151 Llama 3.1 8B 68.57% 68.57% 61.44%
152 Skyfall 36B V2 67.04% 67.04% 63.65%
153 ByteDance Seed 1.6 Flash 48.22% 48.22% 71.22%
154 Rocinante 12B 39.83% 39.83% 54.02%
155 LFM2 24B 15.71% 15.71% 57.93%