Paragraph Counting

Subcategory of Utility. 155 models scored.

Model Leaderboard

All models ranked by their Paragraph Counting subcategory score.

#	Model	Paragraph Counting	Utility	Overall
1	Qwen3.7 Max	100.00%	99.54%	95.75%
2	Claude Opus 4.6 (Reasoning)	100.00%	98.93%	95.02%
3	Qwen3.6 Max Preview	100.00%	98.34%	94.54%
4	Gemini 3.1 Pro (Preview)	100.00%	99.91%	94.37%
5	Z.AI GLM 5.1	100.00%	97.51%	94.37%
6	Z.AI GLM 5 Turbo	100.00%	96.36%	94.27%
7	Gemini 3.5 Flash (Reasoning)	100.00%	98.86%	94.08%
8	Claude Sonnet 4.6 (Reasoning)	100.00%	97.88%	93.66%
9	Grok 4.3 (Reasoning)	100.00%	92.94%	93.60%
10	GPT-5.4 (Reasoning)	100.00%	96.89%	93.24%
11	Claude Opus 4.7 (Reasoning)	100.00%	97.87%	93.23%
12	GPT-5.5 (Reasoning)	100.00%	96.60%	92.98%
13	GPT-5 Mini	100.00%	98.39%	92.62%
14	GPT-5.5 (Reasoning, Low)	100.00%	96.36%	92.59%
15	GPT-5.1	100.00%	95.33%	92.54%
16	Claude Opus 4.6	100.00%	90.72%	92.35%
17	Claude Opus 4.8 (Reasoning)	100.00%	99.26%	92.22%
18	Claude Opus 4.8 (Reasoning, Low)	100.00%	98.00%	92.14%
19	GPT-5	100.00%	93.53%	91.93%
20	Qwen 3.5 397B A17B	100.00%	97.50%	91.73%
21	Gemma 4 31B (Reasoning)	100.00%	96.32%	91.71%
22	Qwen 3.5 122B	100.00%	96.36%	91.53%
23	Qwen 3.5 Plus (2026-04-20)	100.00%	96.42%	91.51%
24	Gemma 4 26B (Reasoning)	100.00%	95.69%	91.49%
25	Grok 4.20 (Beta, Reasoning)	100.00%	95.41%	91.49%
26	GPT-5.4 (Reasoning, Low)	100.00%	95.32%	91.41%
27	Grok 4.20 (Reasoning)	100.00%	92.61%	91.39%
28	Z.AI GLM 5	100.00%	94.11%	91.23%
29	Claude Sonnet 4.6	100.00%	88.52%	91.15%
30	MoonshotAI: Kimi K2.5	100.00%	96.63%	91.04%
31	MiniMax M3	100.00%	93.59%	90.88%
32	Qwen 3.5 27B	100.00%	95.67%	90.85%
33	ByteDance Seed 1.6	100.00%	90.83%	90.70%
34	Qwen 3.6 Flash	100.00%	96.09%	90.65%
35	GPT-5.4 Mini (Reasoning)	100.00%	94.44%	90.65%
36	Gemini 3 Flash (Preview, Reasoning)	100.00%	97.20%	90.50%
37	o4 Mini High	100.00%	98.67%	90.29%
38	GPT-5.2	100.00%	96.22%	90.26%
39	DeepSeek V4 Pro (Reasoning)	100.00%	93.24%	90.10%
40	Claude Opus 4.7	100.00%	95.77%	89.93%
41	Qwen 3.6 27B	100.00%	94.32%	89.72%
42	Claude Opus 4.5	100.00%	89.84%	89.69%
43	Grok 4.1 Fast	100.00%	84.12%	89.55%
44	Aion 2.0	100.00%	90.91%	89.21%
45	Z.AI GLM 4.6	100.00%	88.58%	89.11%
46	MiniMax M2.7	100.00%	95.50%	89.10%
47	GPT-5.5	100.00%	81.88%	89.09%
48	Qwen 3.6 35B	100.00%	96.20%	89.05%
49	DeepSeek V4 Flash (Reasoning)	100.00%	87.53%	89.01%
50	Gemini 3 Pro (Preview)	100.00%	96.14%	88.79%
51	Claude Sonnet 4	100.00%	84.02%	88.72%
52	MiniMax M2.5	100.00%	90.42%	88.71%
53	Z.AI GLM 4.7	100.00%	94.31%	88.69%
54	GPT-4.1	100.00%	90.57%	88.68%
55	Gemini 2.5 Pro	100.00%	92.18%	88.53%
56	o4 Mini	100.00%	96.31%	88.35%
57	Grok 4	100.00%	89.67%	88.12%
58	Claude Sonnet 4.5	100.00%	83.78%	88.03%
59	Qwen 3.5 35B	100.00%	96.42%	88.00%
60	Claude Opus 4	100.00%	88.81%	87.69%
61	Xiaomi MIMO v2.5 Pro	100.00%	82.62%	87.36%
62	ByteDance Seed 2.0 Mini	100.00%	91.88%	86.91%
63	Gemma 4 31B	100.00%	86.69%	86.91%
64	Gemini 2.5 Flash (Reasoning)	100.00%	82.25%	86.51%
65	Gemini 3.5 Flash (Reasoning, Minimal)	100.00%	83.90%	86.47%
66	GPT-OSS 120B	100.00%	92.03%	86.44%
67	Gemini 3.1 Flash Lite (Reasoning)	100.00%	92.32%	86.41%
68	Qwen 3.5 Flash	100.00%	96.11%	86.38%
69	Z.AI GLM 4.5	100.00%	79.19%	86.27%
70	Grok 4 Fast	100.00%	76.76%	86.15%
71	Qwen 3.5 9B	100.00%	94.02%	86.05%
72	Qwen 3.5 Plus (2026-02-15)	100.00%	86.65%	85.96%
73	Gemini 3.1 Flash Lite (Preview)	100.00%	94.00%	85.87%
74	Gemma 4 26B	100.00%	83.17%	85.84%
75	Gemini 3.1 Flash Lite	100.00%	92.77%	85.75%
76	GPT-5.4 Mini (Reasoning, Low)	100.00%	88.49%	85.75%
77	Gemini 2.5 Flash Lite (Reasoning)	100.00%	89.63%	85.75%
78	Mistral Large 3	100.00%	84.91%	85.43%
79	GPT-4o, May 13th (temp=0)	100.00%	83.13%	85.36%
80	Gemini 3 Flash (Preview)	100.00%	86.39%	85.35%
81	DeepSeek-V2 Chat	100.00%	83.82%	84.83%
82	Z.AI GLM 4.7 Flash	100.00%	88.98%	84.82%
83	ByteDance Seed 2.0 Lite	100.00%	92.23%	84.80%
84	Nemotron 3 Super	100.00%	95.29%	84.56%
85	GPT-5.4	100.00%	81.95%	84.32%
86	Claude 3.5 Sonnet	100.00%	76.75%	84.24%
87	Inception Mercury 2	100.00%	92.86%	83.85%
88	GPT-4o, May 13th (temp=1)	100.00%	80.69%	83.80%
89	Stealth: Aurora Alpha	100.00%	92.59%	83.79%
90	DeepSeek V3 (2024-12-26)	100.00%	81.87%	83.68%
91	GPT-4.1 Mini	100.00%	82.30%	83.20%
92	Z.AI GLM 4.5 Air	100.00%	76.57%	83.12%
93	DeepSeek V4 Pro	100.00%	77.57%	82.63%
94	GPT-4o, Aug. 6th (temp=1)	100.00%	82.44%	82.62%
95	GPT-5 Nano	100.00%	93.91%	82.60%
96	GPT-4o, Aug. 6th (temp=0)	100.00%	82.11%	82.45%
97	GPT-5.4 Mini	100.00%	79.37%	82.43%
98	Mistral Large 2	100.00%	69.19%	82.41%
99	Mistral Small 4 (Reasoning)	100.00%	85.61%	82.39%
100	DeepSeek V3.2	100.00%	81.58%	82.25%
101	Qwen 3 32B	100.00%	81.66%	82.21%
102	DeepSeek V4 Flash	100.00%	83.26%	82.02%
103	Grok 4.20	100.00%	84.11%	81.70%
104	GPT-5.4 Nano (Reasoning)	100.00%	93.34%	81.36%
105	Gemini 2.5 Flash Lite	100.00%	80.14%	81.08%
106	Gemini 2.5 Flash	100.00%	61.45%	80.60%
107	Mistral Large	100.00%	73.04%	80.15%
108	Qwen3 235B A22B Instruct 2507	100.00%	83.15%	80.10%
109	Writer: Palmyra X5	100.00%	79.71%	79.57%
110	Inception Mercury	100.00%	87.38%	79.50%
111	GPT-5.4 Nano (Reasoning, Low)	100.00%	91.42%	79.48%
112	GPT-4o Mini (temp=1)	100.00%	82.16%	79.08%
113	Mistral Small 3.2 24B	100.00%	73.17%	78.58%
114	Gemma 3 12B	100.00%	79.28%	78.41%
115	Llama 3.1 70B	100.00%	81.03%	78.40%
116	GPT-4o Mini (temp=0)	100.00%	81.43%	78.29%
117	Gemma 3 27B	100.00%	76.82%	77.85%
118	Mistral Medium 3.1	100.00%	80.13%	77.83%
119	Nemotron 3 Nano	100.00%	86.00%	77.73%
120	Mistral Small 4	100.00%	78.28%	76.46%
121	Qwen 2.5 72B	100.00%	76.43%	75.46%
122	Llama 3.1 Nemotron 70B	100.00%	88.31%	74.70%
123	GPT-5.4 Nano	100.00%	78.57%	74.40%
124	ByteDance Seed 1.6 Flash	100.00%	84.16%	73.27%
125	Mistral Small Creative	100.00%	76.28%	73.27%
126	Ministral 3 14B	100.00%	79.03%	72.54%
127	GPT-4.1 Nano	100.00%	68.45%	71.94%
128	Ministral 3 8B	100.00%	74.43%	71.76%
129	Ministral 3 3B	100.00%	72.38%	67.22%
130	LFM2 24B	100.00%	69.48%	58.77%
131	MoonshotAI: Kimi K2.6	96.67%	97.42%	92.31%
132	Stealth: Hunter Alpha	96.67%	84.63%	87.34%
133	Grok 4.20 (Beta)	96.67%	82.15%	83.85%
134	Claude 3 Haiku	96.67%	68.47%	71.19%
135	Stealth: Healer Alpha	93.33%	82.30%	85.93%
136	Xiaomi MIMO v2.5	93.33%	81.15%	85.05%
137	DeepSeek V3.1	93.33%	76.65%	82.39%
138	DeepSeek V3 (2025-03-24)	93.33%	80.62%	81.99%
139	Cydonia 24B V4.1	93.33%	69.32%	75.09%
140	WizardLM 2 8x22b	90.00%	67.14%	71.06%
141	Cohere Command R+ (Aug. 2024)	90.00%	59.51%	69.03%
142	Llama 3.1 8B	90.00%	74.82%	63.35%
143	Hermes 3 405B	86.67%	69.02%	82.86%
144	Grok 4.3	86.67%	66.41%	78.66%
145	Arcee AI: Trinity Mini	80.00%	59.94%	70.90%
146	Arcee AI: Trinity Large (Preview)	73.33%	60.74%	73.33%
147	Claude Haiku 4.5	70.00%	72.48%	85.14%
148	Skyfall 36B V2	70.00%	52.53%	65.76%
149	Hermes 3 70B	66.67%	61.15%	72.57%
150	Gemma 3 4B	53.33%	60.30%	68.57%
151	Mistral NeMO	40.00%	51.55%	65.04%
152	Rocinante 12B	36.67%	48.47%	54.54%
153	Claude 3.7 Sonnet	33.33%	62.54%	83.39%
154	Ministral 8B	33.33%	46.82%	64.87%
155	Ministral 3B	33.33%	49.17%	61.29%