Rule Following

12 scenarios across 1 subcategory. 155 models scored.

Subcategories

Subcategory	Avg Score	Best Model	Best Score
Constraint Adherence	61.63%	Qwen3.7 Max	95.76%

Model Leaderboard

All models ranked by their Rule Following category score.

#	Model	Rule Following	Constraint Adherence	Overall
1	Qwen3.7 Max	95.76%	95.76%	95.75%
2	Gemini 3.5 Flash (Reasoning)	92.04%	92.04%	94.08%
3	Gemini 3.1 Pro (Preview)	91.21%	91.21%	94.37%
4	Claude Opus 4.6 (Reasoning)	89.78%	89.78%	95.02%
5	Z.AI GLM 5.1	88.41%	88.41%	94.37%
6	Z.AI GLM 5 Turbo	86.78%	86.78%	94.27%
7	Claude Sonnet 4.6 (Reasoning)	85.73%	85.73%	93.66%
8	Gemma 4 31B (Reasoning)	85.00%	85.00%	91.71%
9	Claude Opus 4.6	83.11%	83.11%	92.35%
10	Grok 4.3 (Reasoning)	82.80%	82.80%	93.60%
11	Qwen3.6 Max Preview	82.79%	82.79%	94.54%
12	Claude Sonnet 4.6	82.50%	82.50%	91.15%
13	Grok 4.20 (Reasoning)	82.04%	82.04%	91.39%
14	Claude Sonnet 4	81.52%	81.52%	88.72%
15	Qwen 3.5 122B	80.00%	80.00%	91.53%
16	GPT-5.5 (Reasoning)	79.40%	79.40%	92.98%
17	Qwen 3.5 397B A17B	79.39%	79.39%	91.73%
18	GPT-5.4 (Reasoning)	79.29%	79.29%	93.24%
19	ByteDance Seed 1.6	77.71%	77.71%	90.70%
20	Qwen 3.6 35B	77.34%	77.34%	89.05%
21	GPT-5	77.13%	77.13%	91.93%
22	MoonshotAI: Kimi K2.6	76.94%	76.94%	92.31%
23	GPT-5.5 (Reasoning, Low)	76.90%	76.90%	92.59%
24	Claude Sonnet 4.5	76.80%	76.80%	88.03%
25	GPT-5 Mini	76.44%	76.44%	92.62%
26	Qwen 3.5 27B	76.04%	76.04%	90.85%
27	Grok 4.20 (Beta, Reasoning)	75.31%	75.31%	91.49%
28	Gemma 4 26B (Reasoning)	74.75%	74.75%	91.49%
29	Gemini 3 Flash (Preview, Reasoning)	74.48%	74.48%	90.50%
30	GPT-4o, Aug. 6th (temp=0)	74.19%	74.19%	82.45%
31	GPT-5.1	74.05%	74.05%	92.54%
32	Claude Opus 4.7 (Reasoning)	74.04%	74.04%	93.23%
33	Claude 3.7 Sonnet	73.78%	73.78%	83.39%
34	GPT-4o, May 13th (temp=0)	73.24%	73.24%	85.36%
35	DeepSeek V4 Pro (Reasoning)	72.74%	72.74%	90.10%
36	Gemma 4 31B	72.72%	72.72%	86.91%
37	o4 Mini High	72.70%	72.70%	90.29%
38	Claude Opus 4.5	72.61%	72.61%	89.69%
39	GPT-5.5	72.34%	72.34%	89.09%
40	MoonshotAI: Kimi K2.5	72.03%	72.03%	91.04%
41	Gemma 4 26B	71.75%	71.75%	85.84%
42	Qwen 3.6 Flash	71.50%	71.50%	90.65%
43	Qwen 3.6 27B	71.37%	71.37%	89.72%
44	Grok 4.1 Fast	70.87%	70.87%	89.55%
45	Claude Opus 4.8 (Reasoning, Low)	70.56%	70.56%	92.14%
46	Claude Opus 4	70.37%	70.37%	87.69%
47	Claude Haiku 4.5	70.35%	70.35%	85.14%
48	Claude Opus 4.8 (Reasoning)	70.27%	70.27%	92.22%
49	GPT-5.4 (Reasoning, Low)	70.02%	70.02%	91.41%
50	GPT-4o, May 13th (temp=1)	69.88%	69.88%	83.80%
51	Claude 3.5 Sonnet	69.67%	69.67%	84.24%
52	MiniMax M3	69.18%	69.18%	90.88%
53	Z.AI GLM 4.7	69.16%	69.16%	88.69%
54	MiniMax M2.7	68.90%	68.90%	89.10%
55	DeepSeek-V2 Chat	68.78%	68.78%	84.83%
56	Claude Opus 4.7	68.08%	68.08%	89.93%
57	DeepSeek V3 (2025-03-24)	67.94%	67.94%	81.99%
58	Grok 4 Fast	67.91%	67.91%	86.15%
59	GPT-4o, Aug. 6th (temp=1)	67.91%	67.91%	82.62%
60	Z.AI GLM 5	67.78%	67.78%	91.23%
61	Qwen 3.5 Plus (2026-04-20)	67.53%	67.53%	91.51%
62	Qwen 3.5 35B	67.42%	67.42%	88.00%
63	Writer: Palmyra X5	67.19%	67.19%	79.57%
64	GPT-5.2	67.10%	67.10%	90.26%
65	Gemini 2.5 Flash Lite (Reasoning)	66.81%	66.81%	85.75%
66	GPT-4.1	66.78%	66.78%	88.68%
67	DeepSeek V3 (2024-12-26)	66.39%	66.39%	83.68%
68	DeepSeek V3.1	66.15%	66.15%	82.39%
69	Z.AI GLM 4.6	65.85%	65.85%	89.11%
70	Z.AI GLM 4.7 Flash	65.63%	65.63%	84.82%
71	Qwen3 235B A22B Instruct 2507	65.42%	65.42%	80.10%
72	Gemini 3 Flash (Preview)	65.14%	65.14%	85.35%
73	o4 Mini	64.61%	64.61%	88.35%
74	DeepSeek V4 Flash (Reasoning)	64.50%	64.50%	89.01%
75	Gemini 3 Pro (Preview)	64.47%	64.47%	88.79%
76	Mistral Large 3	64.41%	64.41%	85.43%
77	Xiaomi MIMO v2.5 Pro	64.29%	64.29%	87.36%
78	Qwen 3.5 Plus (2026-02-15)	64.21%	64.21%	85.96%
79	Mistral Small 3.2 24B	64.08%	64.08%	78.58%
80	Z.AI GLM 4.5	63.79%	63.79%	86.27%
81	Aion 2.0	63.77%	63.77%	89.21%
82	DeepSeek V4 Pro	63.74%	63.74%	82.63%
83	Stealth: Hunter Alpha	63.63%	63.63%	87.34%
84	Llama 3.1 70B	63.45%	63.45%	78.40%
85	Qwen 3.5 Flash	63.19%	63.19%	86.38%
86	Grok 4	63.09%	63.09%	88.12%
87	Mistral Large 2	63.05%	63.05%	82.41%
88	MiniMax M2.5	62.69%	62.69%	88.71%
89	Gemini 3.1 Flash Lite (Reasoning)	62.26%	62.26%	86.41%
90	Mistral Small 4	62.17%	62.17%	76.46%
91	Gemini 3.1 Flash Lite	61.21%	61.21%	85.75%
92	Gemma 3 12B	61.05%	61.05%	78.41%
93	Qwen 3.5 9B	60.98%	60.98%	86.05%
94	Gemini 2.5 Pro	60.89%	60.89%	88.53%
95	Xiaomi MIMO v2.5	60.75%	60.75%	85.05%
96	Gemini 3.5 Flash (Reasoning, Minimal)	60.38%	60.38%	86.47%
97	Mistral Small 4 (Reasoning)	60.28%	60.28%	82.39%
98	Gemini 2.5 Flash (Reasoning)	59.97%	59.97%	86.51%
99	Gemini 2.5 Flash Lite	59.96%	59.96%	81.08%
100	Grok 4.20	59.71%	59.71%	81.70%
101	Hermes 3 405B	59.17%	59.17%	82.86%
102	Gemini 3.1 Flash Lite (Preview)	59.04%	59.04%	85.87%
103	GPT-4o Mini (temp=0)	58.84%	58.84%	78.29%
104	ByteDance Seed 2.0 Mini	58.77%	58.77%	86.91%
105	Cohere Command R+ (Aug. 2024)	58.70%	58.70%	69.03%
106	GPT-4.1 Mini	58.59%	58.59%	83.20%
107	GPT-5.4	58.11%	58.11%	84.32%
108	GPT-5 Nano	57.57%	57.57%	82.60%
109	Gemini 2.5 Flash	57.47%	57.47%	80.60%
110	Nemotron 3 Super	57.43%	57.43%	84.56%
111	GPT-5.4 Mini (Reasoning)	57.38%	57.38%	90.65%
112	DeepSeek V4 Flash	57.32%	57.32%	82.02%
113	GPT-4o Mini (temp=1)	56.50%	56.50%	79.08%
114	Stealth: Healer Alpha	56.03%	56.03%	85.93%
115	GPT-OSS 120B	55.03%	55.03%	86.44%
116	Inception Mercury 2	54.41%	54.41%	83.85%
117	Grok 4.20 (Beta)	53.89%	53.89%	83.85%
118	DeepSeek V3.2	53.75%	53.75%	82.25%
119	Hermes 3 70B	53.00%	53.00%	72.57%
120	Claude 3 Haiku	51.15%	51.15%	71.19%
121	Ministral 3 14B	50.83%	50.83%	72.54%
122	Llama 3.1 Nemotron 70B	50.62%	50.62%	74.70%
123	Cydonia 24B V4.1	50.36%	50.36%	75.09%
124	Mistral Large	49.87%	49.87%	80.15%
125	Grok 4.3	49.02%	49.02%	78.66%
126	Mistral Medium 3.1	48.60%	48.60%	77.83%
127	Mistral Small Creative	48.15%	48.15%	73.27%
128	Gemma 3 27B	47.98%	47.98%	77.85%
129	ByteDance Seed 1.6 Flash	47.15%	47.15%	73.27%
130	Qwen 3 32B	46.83%	46.83%	82.21%
131	GPT-5.4 Mini	46.32%	46.32%	82.43%
132	Stealth: Aurora Alpha	44.19%	44.19%	83.79%
133	Z.AI GLM 4.5 Air	44.11%	44.11%	83.12%
134	Nemotron 3 Nano	43.47%	43.47%	77.73%
135	Rocinante 12B	41.51%	41.51%	54.54%
136	Skyfall 36B V2	41.44%	41.44%	65.76%
137	GPT-4.1 Nano	40.88%	40.88%	71.94%
138	Inception Mercury	39.68%	39.68%	79.50%
139	Arcee AI: Trinity Large (Preview)	38.52%	38.52%	73.33%
140	ByteDance Seed 2.0 Lite	36.85%	36.85%	84.80%
141	Mistral NeMO	34.11%	34.11%	65.04%
142	Llama 3.1 8B	34.03%	34.03%	63.35%
143	GPT-5.4 Mini (Reasoning, Low)	33.99%	33.99%	85.75%
144	GPT-5.4 Nano (Reasoning, Low)	31.65%	31.65%	79.48%
145	Qwen 2.5 72B	31.55%	31.55%	75.46%
146	Ministral 3 8B	31.34%	31.34%	71.76%
147	WizardLM 2 8x22b	28.27%	28.27%	71.06%
148	GPT-5.4 Nano (Reasoning)	27.15%	27.15%	81.36%
149	Gemma 3 4B	26.37%	26.37%	68.57%
150	Ministral 3B	24.45%	24.45%	61.29%
151	LFM2 24B	24.12%	24.12%	58.77%
152	Arcee AI: Trinity Mini	23.57%	23.57%	70.90%
153	GPT-5.4 Nano	20.94%	20.94%	74.40%
154	Ministral 3 3B	15.87%	15.87%	67.22%
155	Ministral 8B	15.27%	15.27%	64.87%