Tooling

13 scenarios across 1 subcategory. 155 models scored.

Subcategories

Subcategory	Avg Score	Best Model	Best Score
XML	96.04%	Qwen3.7 Max	100.00%

Model Leaderboard

All models ranked by their Tooling category score.

#	Model	Tooling	XML	Overall
1	Qwen3.7 Max	100.00%	100.00%	95.75%
2	Claude Opus 4.6 (Reasoning)	100.00%	100.00%	95.02%
3	Qwen3.6 Max Preview	100.00%	100.00%	94.54%
4	Z.AI GLM 5.1	100.00%	100.00%	94.37%
5	Gemini 3.5 Flash (Reasoning)	100.00%	100.00%	94.08%
6	Claude Sonnet 4.6 (Reasoning)	100.00%	100.00%	93.66%
7	GPT-5.4 (Reasoning)	100.00%	100.00%	93.24%
8	Claude Opus 4.7 (Reasoning)	100.00%	100.00%	93.23%
9	GPT-5.5 (Reasoning)	100.00%	100.00%	92.98%
10	GPT-5.5 (Reasoning, Low)	100.00%	100.00%	92.59%
11	Claude Opus 4.6	100.00%	100.00%	92.35%
12	GPT-5	100.00%	100.00%	91.93%
13	Gemma 4 31B (Reasoning)	100.00%	100.00%	91.71%
14	Qwen 3.5 122B	100.00%	100.00%	91.53%
15	Grok 4.20 (Beta, Reasoning)	100.00%	100.00%	91.49%
16	Grok 4.20 (Reasoning)	100.00%	100.00%	91.39%
17	Z.AI GLM 5	100.00%	100.00%	91.23%
18	Claude Sonnet 4.6	100.00%	100.00%	91.15%
19	MoonshotAI: Kimi K2.5	100.00%	100.00%	91.04%
20	GPT-5.4 Mini (Reasoning)	100.00%	100.00%	90.65%
21	Gemini 3 Flash (Preview, Reasoning)	100.00%	100.00%	90.50%
22	o4 Mini High	100.00%	100.00%	90.29%
23	GPT-5.2	100.00%	100.00%	90.26%
24	Claude Opus 4.5	100.00%	100.00%	89.69%
25	Grok 4.1 Fast	100.00%	100.00%	89.55%
26	GPT-5.5	100.00%	100.00%	89.09%
27	Claude Sonnet 4	100.00%	100.00%	88.72%
28	Z.AI GLM 4.7	100.00%	100.00%	88.69%
29	Gemini 2.5 Pro	100.00%	100.00%	88.53%
30	o4 Mini	100.00%	100.00%	88.35%
31	Claude Sonnet 4.5	100.00%	100.00%	88.03%
32	Claude Opus 4	100.00%	100.00%	87.69%
33	Xiaomi MIMO v2.5 Pro	100.00%	100.00%	87.36%
34	Gemma 4 31B	100.00%	100.00%	86.91%
35	Gemini 2.5 Flash (Reasoning)	100.00%	100.00%	86.51%
36	Gemini 3.5 Flash (Reasoning, Minimal)	100.00%	100.00%	86.47%
37	GPT-OSS 120B	100.00%	100.00%	86.44%
38	Stealth: Healer Alpha	100.00%	100.00%	85.93%
39	GPT-5.4 Mini (Reasoning, Low)	100.00%	100.00%	85.75%
40	Claude 3.5 Sonnet	100.00%	100.00%	84.24%
41	Grok 4.20 (Beta)	100.00%	100.00%	83.85%
42	DeepSeek V3 (2024-12-26)	100.00%	100.00%	83.68%
43	Stealth: Hunter Alpha	99.99%	99.99%	87.34%
44	Grok 4	99.99%	99.99%	88.12%
45	DeepSeek V3.2	99.99%	99.99%	82.25%
46	Z.AI GLM 4.6	99.99%	99.99%	89.11%
47	GPT-5 Mini	99.99%	99.99%	92.62%
48	MiniMax M2.7	99.98%	99.98%	89.10%
49	Gemini 3.1 Flash Lite (Preview)	99.98%	99.98%	85.87%
50	Gemini 3 Pro (Preview)	99.98%	99.98%	88.79%
51	Grok 4.3 (Reasoning)	99.97%	99.97%	93.60%
52	Gemini 3.1 Flash Lite	99.97%	99.97%	85.75%
53	GPT-5.4 (Reasoning, Low)	99.96%	99.96%	91.41%
54	Gemini 2.5 Flash	99.96%	99.96%	80.60%
55	Xiaomi MIMO v2.5	99.96%	99.96%	85.05%
56	MiniMax M3	99.95%	99.95%	90.88%
57	GPT-4o, Aug. 6th (temp=0)	99.95%	99.95%	82.45%
58	ByteDance Seed 2.0 Lite	99.91%	99.91%	84.80%
59	Gemini 3.1 Pro (Preview)	99.90%	99.90%	94.37%
60	Z.AI GLM 4.5	99.89%	99.89%	86.27%
61	Mistral Small 3.2 24B	99.89%	99.89%	78.58%
62	Gemma 3 27B	99.88%	99.88%	77.85%
63	GPT-5.4 Mini	99.86%	99.86%	82.43%
64	Hermes 3 405B	99.78%	99.78%	82.86%
65	Mistral Large 2	99.78%	99.78%	82.41%
66	MoonshotAI: Kimi K2.6	99.77%	99.77%	92.31%
67	Qwen 3.5 397B A17B	99.77%	99.77%	91.73%
68	Gemini 3.1 Flash Lite (Reasoning)	99.77%	99.77%	86.41%
69	DeepSeek-V2 Chat	99.76%	99.76%	84.83%
70	Qwen 3.5 Plus (2026-02-15)	99.74%	99.74%	85.96%
71	GPT-4o, Aug. 6th (temp=1)	99.73%	99.73%	82.62%
72	Qwen 3.6 Flash	99.73%	99.73%	90.65%
73	Mistral Small 4 (Reasoning)	99.73%	99.73%	82.39%
74	Stealth: Aurora Alpha	99.69%	99.69%	83.79%
75	Claude Opus 4.8 (Reasoning, Low)	99.68%	99.68%	92.14%
76	GPT-4o, May 13th (temp=1)	99.68%	99.68%	83.80%
77	Mistral Large 3	99.66%	99.66%	85.43%
78	Grok 4 Fast	99.65%	99.65%	86.15%
79	ByteDance Seed 2.0 Mini	99.64%	99.64%	86.91%
80	Z.AI GLM 4.5 Air	99.60%	99.60%	83.12%
81	Gemini 2.5 Flash Lite (Reasoning)	99.54%	99.54%	85.75%
82	Aion 2.0	99.53%	99.53%	89.21%
83	DeepSeek V4 Pro	99.49%	99.49%	82.63%
84	Claude 3 Haiku	99.47%	99.47%	71.19%
85	Claude Opus 4.8 (Reasoning)	99.44%	99.44%	92.22%
86	Ministral 3 8B	99.42%	99.42%	71.76%
87	Qwen 2.5 72B	99.38%	99.38%	75.46%
88	Claude Opus 4.7	99.38%	99.38%	89.93%
89	Writer: Palmyra X5	99.34%	99.34%	79.57%
90	Claude 3.7 Sonnet	99.32%	99.32%	83.39%
91	DeepSeek V4 Pro (Reasoning)	99.28%	99.28%	90.10%
92	GPT-4o Mini (temp=1)	99.26%	99.26%	79.08%
93	Qwen3 235B A22B Instruct 2507	99.23%	99.23%	80.10%
94	GPT-4o, May 13th (temp=0)	99.22%	99.22%	85.36%
95	GPT-5 Nano	99.21%	99.21%	82.60%
96	Claude Haiku 4.5	99.10%	99.10%	85.14%
97	Qwen 3.5 27B	99.00%	99.00%	90.85%
98	GPT-4.1	98.86%	98.86%	88.68%
99	Gemma 4 26B (Reasoning)	98.85%	98.85%	91.49%
100	Mistral Large	98.67%	98.67%	80.15%
101	GPT-4o Mini (temp=0)	98.54%	98.54%	78.29%
102	GPT-5.1	98.00%	98.00%	92.54%
103	ByteDance Seed 1.6	98.00%	98.00%	90.70%
104	Inception Mercury 2	98.00%	98.00%	83.85%
105	Inception Mercury	97.98%	97.98%	79.50%
106	DeepSeek V3.1	97.96%	97.96%	82.39%
107	MiniMax M2.5	97.93%	97.93%	88.71%
108	GPT-4.1 Mini	97.92%	97.92%	83.20%
109	Gemma 3 4B	97.88%	97.88%	68.57%
110	Hermes 3 70B	97.86%	97.86%	72.57%
111	Gemma 3 12B	97.69%	97.69%	78.41%
112	GPT-5.4	97.65%	97.65%	84.32%
113	Gemini 3 Flash (Preview)	97.64%	97.64%	85.35%
114	Mistral Medium 3.1	97.50%	97.50%	77.83%
115	Qwen 3.6 27B	97.49%	97.49%	89.72%
116	Qwen 3 32B	97.43%	97.43%	82.21%
117	Arcee AI: Trinity Mini	97.16%	97.16%	70.90%
118	Qwen 3.5 9B	97.00%	97.00%	86.05%
119	Gemini 2.5 Flash Lite	96.60%	96.60%	81.08%
120	Qwen 3.5 Plus (2026-04-20)	96.00%	96.00%	91.51%
121	DeepSeek V4 Flash (Reasoning)	96.00%	96.00%	89.01%
122	DeepSeek V4 Flash	95.80%	95.80%	82.02%
123	Llama 3.1 Nemotron 70B	95.74%	95.74%	74.70%
124	GPT-5.4 Nano (Reasoning)	95.71%	95.71%	81.36%
125	Mistral Small 4	95.02%	95.02%	76.46%
126	Grok 4.20	94.00%	94.00%	81.70%
127	Qwen 3.5 35B	93.98%	93.98%	88.00%
128	Z.AI GLM 5 Turbo	93.95%	93.95%	94.27%
129	Ministral 3 3B	93.79%	93.79%	67.22%
130	Z.AI GLM 4.7 Flash	93.74%	93.74%	84.82%
131	DeepSeek V3 (2025-03-24)	93.53%	93.53%	81.99%
132	Nemotron 3 Super	93.49%	93.49%	84.56%
133	Arcee AI: Trinity Large (Preview)	93.42%	93.42%	73.33%
134	Grok 4.3	92.65%	92.65%	78.66%
135	Ministral 3 14B	91.91%	91.91%	72.54%
136	Cohere Command R+ (Aug. 2024)	91.84%	91.84%	69.03%
137	Gemma 4 26B	91.13%	91.13%	85.84%
138	WizardLM 2 8x22b	90.27%	90.27%	71.06%
139	GPT-5.4 Nano	90.22%	90.22%	74.40%
140	Cydonia 24B V4.1	90.17%	90.17%	75.09%
141	GPT-5.4 Nano (Reasoning, Low)	89.75%	89.75%	79.48%
142	Llama 3.1 70B	88.59%	88.59%	78.40%
143	Qwen 3.5 Flash	87.87%	87.87%	86.38%
144	Ministral 3B	87.64%	87.64%	61.29%
145	Mistral Small Creative	86.85%	86.85%	73.27%
146	Ministral 8B	85.58%	85.58%	64.87%
147	Nemotron 3 Nano	83.98%	83.98%	77.73%
148	Mistral NeMO	83.21%	83.21%	65.04%
149	GPT-4.1 Nano	81.37%	81.37%	71.94%
150	Qwen 3.6 35B	80.00%	80.00%	89.05%
151	Skyfall 36B V2	69.65%	69.65%	65.76%
152	Llama 3.1 8B	69.49%	69.49%	63.35%
153	ByteDance Seed 1.6 Flash	39.46%	39.46%	73.27%
154	Rocinante 12B	38.30%	38.30%	54.54%
155	LFM2 24B	16.85%	16.85%	58.77%
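
The ranking above can be reproduced with a short sketch. It assumes the category score is the unweighted mean of its subcategory scores, which the table is consistent with (Tooling equals XML in every row because XML is the only subcategory), but the actual aggregation and tie-breaking rules are not stated here. The names `ROWS`, `category_score`, and `rank` are hypothetical and used for illustration only; the sample rows are copied from the table.

```python
from statistics import mean

# A few rows copied from the leaderboard: (model, {subcategory: score in percent}).
ROWS = [
    ("GPT-4.1", {"XML": 98.86}),
    ("Qwen3.7 Max", {"XML": 100.00}),
    ("LFM2 24B", {"XML": 16.85}),
    ("Grok 4.20", {"XML": 94.00}),
]


def category_score(subscores):
    """Assumed aggregation: unweighted mean of the subcategory scores."""
    return mean(list(subscores.values()))


def rank(rows):
    """Return (position, model, score) tuples, highest category score first."""
    scored = [(model, category_score(subs)) for model, subs in rows]
    # Ties keep their input order here; the leaderboard's real tie-break is unknown.
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(pos, model, score) for pos, (model, score) in enumerate(scored, start=1)]


if __name__ == "__main__":
    for pos, model, score in rank(ROWS):
        print(f"{pos}\t{model}\t{score:.2f}%")
```

With a single subcategory the mean is just the XML score, so the Tooling and XML columns coincide; the two would diverge only if more subcategories were added.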