Categories
NC Bench evaluates models across 8 categories and 23 subcategories.
Category Distribution
The number of scenarios in each category. A scenario may belong to more than one category.
Tooling (13)
Creative Writing (18)
Language (9)
Utility (32)
Reasoning (20)
Text Editing (18)
Rule Following (12)
Hallucination (28)
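As a quick consistency check on the figures on this page, the per-category counts can be tallied in a few lines. This is only an illustrative sketch: the dictionary and variable names are hypothetical, and the values are copied by hand from this page. Because a scenario may appear in more than one category, the scenario total counts category memberships and is an upper bound on the number of distinct scenarios, not necessarily the exact number.

```python
# Scenario counts per category, copied from the distribution list on this page.
scenario_counts = {
    "Tooling": 13,
    "Creative Writing": 18,
    "Language": 9,
    "Utility": 32,
    "Reasoning": 20,
    "Text Editing": 18,
    "Rule Following": 12,
    "Hallucination": 28,
}

# Subcategory counts per category, as stated in each category's own section.
subcategory_counts = {
    "Tooling": 1,
    "Creative Writing": 6,
    "Language": 2,
    "Utility": 5,
    "Reasoning": 2,
    "Text Editing": 3,
    "Rule Following": 1,
    "Hallucination": 3,
}

# Sum of per-category scenario counts (memberships, not distinct scenarios).
total_memberships = sum(scenario_counts.values())
# Sum of per-category subcategory counts.
total_subcategories = sum(subcategory_counts.values())

print(len(scenario_counts))   # 8 categories
print(total_subcategories)    # 23 subcategories
print(total_memberships)      # 150 category memberships
```

The first two printed values match the "8 categories and 23 subcategories" stated at the top of the page.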
Creative Writing
18 scenarios · 6 subcategories
Top Models
| Score | Model |
| --- | --- |
| 87.20% | GPT-5.1 |
| 86.93% | Qwen 3.5 397B A17B |
| 86.87% | GPT-5 |
Subcategories
| Score | Subcategory |
| --- | --- |
| 79.49% | AI-isms |
| 67.74% | Prose Variety |
| 74.35% | Dialogue |
| 87.65% | Purple Prose |
| 84.74% | Mechanical Style |
| 77.40% | Clichés |
Tooling
13 scenarios · 1 subcategory
Top Models
| Score | Model |
| --- | --- |
| 100.00% | Claude Opus 4.6 (Reasoning) |
| 100.00% | Claude Sonnet 4.6 (Reasoning) |
| 100.00% | Claude Opus 4.6 |
Subcategories
| Score | Subcategory |
| --- | --- |
| 95.16% | XML |
Language
9 scenarios · 2 subcategories
Top Models
| Score | Model |
| --- | --- |
| 100.00% | Claude Sonnet 4.6 |
| 100.00% | DeepSeek-V2 Chat |
| 99.66% | Claude Opus 4.5 |
Subcategories
| Score | Subcategory |
| --- | --- |
| 81.76% | Comprehension |
| 84.10% | Generation |
Utility
32 scenarios · 5 subcategories
Top Models
| Score | Model |
| --- | --- |
| 99.91% | Gemini 3.1 Pro (Preview) |
| 98.93% | Claude Opus 4.6 (Reasoning) |
| 98.67% | o4 Mini High |
Subcategories
| Score | Subcategory |
| --- | --- |
| 60.30% | Word Counting |
| 85.34% | Sentence Counting |
| 94.07% | Paragraph Counting |
| 71.12% | Structural Counting |
| 97.58% | Data Extraction |
Reasoning
20 scenarios · 2 subcategories
Top Models
| Score | Model |
| --- | --- |
| 98.05% | Gemini 3 Flash (Preview, Reasoning) |
| 96.91% | Gemini 2.5 Pro |
| 96.01% | Gemini 3.1 Pro (Preview) |
Text Editing
18 scenarios · 3 subcategories
Top Models
| Score | Model |
| --- | --- |
| 99.13% | Claude Sonnet 4 |
| 99.02% | Claude Sonnet 4.5 |
| 98.90% | GPT-5 |
Subcategories
| Score | Subcategory |
| --- | --- |
| 81.25% | Transformation |
| 92.40% | Preservation |
| 98.02% | Structural Integrity |
Rule Following
12 scenarios · 1 subcategory
Top Models
| Score | Model |
| --- | --- |
| 91.21% | Gemini 3.1 Pro (Preview) |
| 89.78% | Claude Opus 4.6 (Reasoning) |
| 85.73% | Claude Sonnet 4.6 (Reasoning) |
Subcategories
| Score | Subcategory |
| --- | --- |
| 60.47% | Constraint Adherence |
Hallucination
28 scenarios · 3 subcategories
Top Models
| Score | Model |
| --- | --- |
| 100.00% | Claude 3.5 Haiku |
| 99.93% | Stealth: Aurora Alpha |
| 99.06% | o4 Mini High |
Subcategories
| Score | Subcategory |
| --- | --- |
| 49.53% | False Positives |
| 95.73% | Content Invention |
| 98.79% | Output Corruption |