Structural Integrity

Subcategory of Text Editing. 154 models scored.

Model Leaderboard

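All models ranked by their Structural Integrity subcategory score, highest first. The Text Editing and Overall columns show each model's parent-category score and overall score.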

Rank	Model	Structural Integrity	Text Editing	Overall
1	Qwen3.7 Max	100.00%	98.08%	95.75%
2	Claude Opus 4.6 (Reasoning)	100.00%	98.86%	95.02%
3	Qwen3.6 Max Preview	100.00%	98.58%	94.54%
4	Gemini 3.1 Pro (Preview)	100.00%	98.51%	94.37%
5	Z.AI GLM 5.1	100.00%	98.90%	94.37%
6	Z.AI GLM 5 Turbo	100.00%	98.17%	94.27%
7	Claude Sonnet 4.6 (Reasoning)	100.00%	98.30%	93.66%
8	GPT-5.4 (Reasoning)	100.00%	98.42%	93.24%
9	Claude Opus 4.7 (Reasoning)	100.00%	97.58%	93.23%
10	GPT-5.5 (Reasoning)	100.00%	98.79%	92.98%
11	GPT-5 Mini	100.00%	97.13%	92.62%
12	GPT-5.5 (Reasoning, Low)	100.00%	98.59%	92.59%
13	GPT-5.1	100.00%	98.54%	92.54%
14	Claude Opus 4.6	100.00%	98.35%	92.35%
15	Claude Opus 4.8 (Reasoning)	100.00%	98.78%	92.22%
16	Claude Opus 4.8 (Reasoning, Low)	100.00%	98.71%	92.14%
17	GPT-5	100.00%	98.90%	91.93%
18	Qwen 3.5 397B A17B	100.00%	98.05%	91.73%
19	Gemma 4 31B (Reasoning)	100.00%	98.83%	91.71%
20	Qwen 3.5 Plus (2026-04-20)	100.00%	97.70%	91.51%
21	Gemma 4 26B (Reasoning)	100.00%	98.26%	91.49%
22	Grok 4.20 (Beta, Reasoning)	100.00%	98.69%	91.49%
23	GPT-5.4 (Reasoning, Low)	100.00%	98.01%	91.41%
24	Grok 4.20 (Reasoning)	100.00%	98.83%	91.39%
25	Z.AI GLM 5	100.00%	98.59%	91.23%
26	Claude Sonnet 4.6	100.00%	96.37%	91.15%
27	MiniMax M3	100.00%	97.51%	90.88%
28	Qwen 3.5 27B	100.00%	98.69%	90.85%
29	ByteDance Seed 1.6	100.00%	98.40%	90.70%
30	Gemini 3 Flash (Preview, Reasoning)	100.00%	98.12%	90.50%
31	GPT-5.2	100.00%	97.54%	90.26%
32	DeepSeek V4 Pro (Reasoning)	100.00%	98.56%	90.10%
33	Claude Opus 4.7	100.00%	97.55%	89.93%
34	Claude Opus 4.5	100.00%	97.69%	89.69%
35	Grok 4.1 Fast	100.00%	97.87%	89.55%
36	Z.AI GLM 4.6	100.00%	97.78%	89.11%
37	GPT-5.5	100.00%	98.20%	89.09%
38	Gemini 3 Pro (Preview)	100.00%	98.86%	88.79%
39	Claude Sonnet 4	100.00%	99.13%	88.72%
40	MiniMax M2.5	100.00%	96.02%	88.71%
41	Z.AI GLM 4.7	100.00%	98.22%	88.69%
42	GPT-4.1	100.00%	94.40%	88.68%
43	Gemini 2.5 Pro	100.00%	98.58%	88.53%
44	Grok 4	100.00%	98.76%	88.12%
45	Claude Sonnet 4.5	100.00%	99.02%	88.03%
46	Claude Opus 4	100.00%	97.25%	87.69%
47	Xiaomi MIMO v2.5 Pro	100.00%	95.96%	87.36%
48	Gemma 4 31B	100.00%	98.56%	86.91%
49	Gemini 2.5 Flash (Reasoning)	100.00%	98.12%	86.51%
50	Gemini 3.5 Flash (Reasoning, Minimal)	100.00%	98.26%	86.47%
51	Gemini 3.1 Flash Lite (Reasoning)	100.00%	96.90%	86.41%
52	Grok 4 Fast	100.00%	97.26%	86.15%
53	Qwen 3.5 Plus (2026-02-15)	100.00%	98.10%	85.96%
54	Stealth: Healer Alpha	100.00%	96.04%	85.93%
55	Gemini 3.1 Flash Lite (Preview)	100.00%	96.46%	85.87%
56	Gemma 4 26B	100.00%	97.04%	85.84%
57	Gemini 3.1 Flash Lite	100.00%	97.35%	85.75%
58	GPT-5.4 Mini (Reasoning, Low)	100.00%	92.63%	85.75%
59	Gemini 2.5 Flash Lite (Reasoning)	100.00%	94.54%	85.75%
60	Mistral Large 3	100.00%	94.09%	85.43%
61	GPT-4o, May 13th (temp=0)	100.00%	95.35%	85.36%
62	Gemini 3 Flash (Preview)	100.00%	97.54%	85.35%
63	Claude Haiku 4.5	100.00%	96.81%	85.14%
64	Xiaomi MIMO v2.5	100.00%	95.34%	85.05%
65	GPT-5.4	100.00%	96.73%	84.32%
66	Claude 3.5 Sonnet	100.00%	96.57%	84.24%
67	Grok 4.20 (Beta)	100.00%	95.49%	83.85%
68	GPT-4o, May 13th (temp=1)	100.00%	92.41%	83.80%
69	Claude 3.7 Sonnet	100.00%	97.12%	83.39%
70	GPT-4.1 Mini	100.00%	95.62%	83.20%
71	Z.AI GLM 4.5 Air	100.00%	94.38%	83.12%
72	Hermes 3 405B	100.00%	89.14%	82.86%
73	DeepSeek V4 Pro	100.00%	97.98%	82.63%
74	GPT-5.4 Mini	100.00%	90.60%	82.43%
75	Mistral Large 2	100.00%	94.16%	82.41%
76	DeepSeek V3.2	100.00%	95.78%	82.25%
77	DeepSeek V4 Flash	100.00%	93.25%	82.02%
78	Grok 4.20	100.00%	95.63%	81.70%
79	GPT-5.4 Nano (Reasoning)	100.00%	83.32%	81.36%
80	Gemini 2.5 Flash Lite	100.00%	92.13%	81.08%
81	Gemini 2.5 Flash	100.00%	97.83%	80.60%
82	Mistral Large	100.00%	95.14%	80.15%
83	Qwen3 235B A22B Instruct 2507	100.00%	91.75%	80.10%
84	Writer: Palmyra X5	100.00%	91.20%	79.57%
85	GPT-5.4 Nano (Reasoning, Low)	100.00%	82.23%	79.48%
86	GPT-4o Mini (temp=1)	100.00%	85.78%	79.08%
87	Grok 4.3	100.00%	90.19%	78.66%
88	Mistral Small 3.2 24B	100.00%	89.48%	78.58%
89	Gemma 3 12B	100.00%	85.18%	78.41%
90	GPT-4o Mini (temp=0)	100.00%	84.62%	78.29%
91	Gemma 3 27B	100.00%	86.63%	77.85%
92	Mistral Medium 3.1	100.00%	93.77%	77.83%
93	Mistral Small 4	100.00%	91.00%	76.46%
94	Qwen 2.5 72B	100.00%	89.18%	75.46%
95	Cydonia 24B V4.1	100.00%	86.15%	75.09%
96	GPT-5.4 Nano	100.00%	79.22%	74.40%
97	Arcee AI: Trinity Large (Preview)	100.00%	86.62%	73.33%
98	Mistral Small Creative	100.00%	90.31%	73.27%
99	Ministral 3 14B	100.00%	86.20%	72.54%
100	GPT-4.1 Nano	100.00%	76.06%	71.94%
101	Ministral 3 8B	100.00%	78.52%	71.76%
102	Gemma 3 4B	100.00%	78.38%	68.57%
103	Ministral 8B	100.00%	77.52%	64.87%
104	Grok 4.3 (Reasoning)	98.81%	97.64%	93.60%
105	MoonshotAI: Kimi K2.6	98.81%	98.00%	92.31%
106	Qwen 3.5 122B	98.81%	96.31%	91.53%
107	Qwen 3.5 35B	98.81%	94.95%	88.00%
108	Stealth: Hunter Alpha	98.81%	95.53%	87.34%
109	Z.AI GLM 4.5	98.81%	95.32%	86.27%
110	ByteDance Seed 2.0 Lite	98.81%	95.03%	84.80%
111	DeepSeek V3 (2024-12-26)	98.81%	93.58%	83.68%
112	Qwen 3 32B	98.81%	89.95%	82.21%
113	MoonshotAI: Kimi K2.5	98.81%	97.79%	91.04%
114	Qwen 3.6 Flash	98.81%	96.09%	90.65%
115	GPT-5.4 Mini (Reasoning)	98.81%	95.78%	90.65%
116	o4 Mini High	98.81%	94.36%	90.29%
117	Mistral Small 4 (Reasoning)	98.81%	90.58%	82.39%
118	DeepSeek V3 (2025-03-24)	98.81%	89.57%	81.99%
119	MiniMax M2.7	97.62%	92.14%	89.10%
120	Qwen 3.6 27B	97.62%	93.97%	89.72%
121	Qwen 3.6 35B	97.62%	95.10%	89.05%
122	o4 Mini	97.62%	90.61%	88.35%
123	GPT-OSS 120B	97.62%	91.73%	86.44%
124	DeepSeek-V2 Chat	97.62%	90.90%	84.83%
125	Llama 3.1 70B	97.62%	92.10%	78.40%
126	Ministral 3B	97.62%	70.91%	61.29%
127	ByteDance Seed 1.6 Flash	97.62%	91.64%	73.27%
128	Ministral 3 3B	96.43%	69.80%	67.22%
129	Gemini 3.5 Flash (Reasoning)	96.43%	97.78%	94.08%
130	ByteDance Seed 2.0 Mini	96.43%	91.08%	86.91%
131	Qwen 3.5 Flash	96.43%	92.80%	86.38%
132	Z.AI GLM 4.7 Flash	96.43%	85.82%	84.82%
133	GPT-4o, Aug. 6th (temp=0)	96.43%	93.77%	82.45%
134	Cohere Command R+ (Aug. 2024)	95.39%	68.40%	69.03%
135	GPT-5 Nano	95.24%	82.74%	82.60%
136	DeepSeek V3.1	95.24%	87.27%	82.39%
137	Aion 2.0	95.24%	95.34%	89.21%
138	DeepSeek V4 Flash (Reasoning)	95.24%	96.15%	89.01%
139	Llama 3.1 Nemotron 70B	95.04%	87.26%	74.70%
140	Inception Mercury 2	94.05%	85.26%	83.85%
141	Llama 3.1 8B	94.05%	75.45%	63.35%
142	Inception Mercury	93.32%	79.53%	79.50%
143	Mistral NeMO	93.21%	73.69%	65.04%
144	WizardLM 2 8x22b	93.12%	88.13%	71.06%
145	GPT-4o, Aug. 6th (temp=1)	92.86%	86.72%	82.62%
146	Qwen 3.5 9B	91.67%	85.35%	86.05%
147	Skyfall 36B V2	91.15%	76.69%	65.76%
148	LFM2 24B	90.94%	71.56%	58.77%
149	Nemotron 3 Super	90.48%	86.34%	84.56%
150	Nemotron 3 Nano	87.00%	75.81%	77.73%
151	Arcee AI: Trinity Mini	85.71%	73.88%	70.90%
152	Claude 3 Haiku	81.03%	64.36%	71.19%
153	Hermes 3 70B	78.57%	63.34%	72.57%
154	Rocinante 12B	72.78%	56.31%	54.54%