Tests
This page lists each test and its scenarios, along with the best-scoring model and its score for every scenario.
Bad Writing Habits
Detects common prose quality anti-patterns in AI-generated creative writing, including passive voice, past progressive overuse, weak dialogue tags, filter words, purple prose, cliches, AI-ism words/adverbs/names, and more.
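As a rough illustration of what detecting such patterns can look like, the sketch below flags two of them (filter words and adverb-laden dialogue tags) with simple heuristics. The word list, the regular expression, and the function names are assumptions made for this example; they are not the benchmark's actual detection rules.

```python
import re

# Hypothetical word list, for illustration only.
FILTER_WORDS = {"saw", "heard", "felt", "noticed", "realized", "seemed"}

# Matches a closing quote followed by "he/she/they said <adverb ending in -ly>".
ADVERB_TAG = re.compile(r'["”]\s+(?:he|she|they)\s+said\s+(\w+ly)\b', re.IGNORECASE)


def count_filter_words(text: str) -> int:
    """Count occurrences of the assumed filter words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(1 for word in words if word in FILTER_WORDS)


def find_adverb_tags(text: str) -> list:
    """Return the adverbs attached to 'said' dialogue tags."""
    return ADVERB_TAG.findall(text)
```

A real detector would need far larger lists and some handling of context, since a word such as "felt" is not always a filter word.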
Codex Violation Detection
Detects factual inconsistencies between a story bible and prose passages. The model must output structured XML identifying each violation with paragraph number and substring.
| Scenario | Best Model | Score |
|---|---|---|
| matrix | ||
| Large codex (40 entries), long passage (1,019 words) | Grok 4 | 97.89% |
| Large codex (40 entries), short passage (165 words) | Claude Opus 4.5 | 100.00% |
| Small codex (7 entries), long passage (734 words) | Claude Opus 4.5 | 100.00% |
| Small codex (7 entries), short passage (165 words) | Gemini 3.1 Pro (Preview) | 100.00% |
| tiers | ||
| 5 codex entries | Qwen 3.5 Flash | 100.00% |
| 10 codex entries | Stealth: Aurora Alpha | 100.00% |
| 20 codex entries | Claude Opus 4.6 (Reasoning) | 100.00% |
| 40 codex entries | Grok 4 | 100.00% |
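The exact XML schema is not shown on this page. As a sketch, assuming a hypothetical `<violation paragraph="N">substring</violation>` element, a response could be parsed and each reported substring checked against the paragraph it cites:

```python
import xml.etree.ElementTree as ET


def parse_violations(xml_text):
    """Return (paragraph_number, substring) pairs from the assumed XML format."""
    root = ET.fromstring(xml_text)
    return [
        (int(v.get("paragraph")), (v.text or "").strip())
        for v in root.iter("violation")
    ]


def grounded(violations, paragraphs):
    """Keep violations whose substring occurs in the cited 1-based paragraph."""
    return [
        (number, snippet)
        for number, snippet in violations
        if snippet and 1 <= number <= len(paragraphs) and snippet in paragraphs[number - 1]
    ]
```

Checking that the substring really appears in the cited paragraph filters out answers that name the right paragraph but quote text that is not there.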
Codex Extraction
Evaluates a model's ability to extract structured codex entries (characters, locations, objects, lore) from prose passages and return them as well-formed XML.
| Scenario | Best Model | Score |
|---|---|---|
| Long: The Spire of Echoes (Dense) | Gemini 3 Pro (Preview) | 98.80% |
| Medium: The Hollow (Inferred) | Claude Opus 4.6 | 99.14% |
| Medium: Through the Thornveil (Scattered) | Claude Opus 4.6 (Reasoning) | 98.74% |
| Short: The Rusty Lantern (Explicit) | Qwen 3.5 Plus (2026-02-15) | 99.48% |
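A well-formedness check is the first gate for a test like this. The sketch below assumes a hypothetical `<entry type="...">` element, since the real schema is not given here. It returns `None` for malformed XML and otherwise counts entries per type:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def entries_by_type(xml_text):
    """Count entries per type, or return None if the XML is not well-formed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    # The "type" attribute name is an assumption made for this example.
    return Counter(entry.get("type", "unknown") for entry in root.iter("entry"))
```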
Codex Red Herring (False Positive Detection)
Tests whether models correctly report "no violations" when a codex is fully consistent with the prose passage. Models that hallucinate violations (false positives) fail. The test uses a 2×2 matrix of text length × codex size, run once with basic entries and once with detailed entries.
| Scenario | Best Model | Score |
|---|---|---|
| basic entries | ||
| Long text (~1594 words), big codex (51 entries) | GPT-5.1 | 100.00% |
| Long text (~1594 words), small codex (11 entries) | LFM2 24B | 100.00% |
| Short text (~524 words), big codex (51 entries) | Grok 4.1 Fast | 100.00% |
| Short text (~524 words), small codex (11 entries) | Claude Opus 4.6 | 100.00% |
| detailed entries | ||
| Long text (~1594 words), big codex (51 detailed entries) | Aion 2.0 | 100.00% |
| Long text (~1594 words), small codex (11 detailed entries) | Inception Mercury 2 | 100.00% |
| Short text (~524 words), big codex (51 detailed entries) | GPT-5.4 (Reasoning) | 100.00% |
| Short text (~524 words), small codex (11 detailed entries) | ByteDance Seed 1.6 | 100.00% |
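The eight scenarios above come from crossing two text lengths, two codex sizes, and two entry styles. A minimal sketch of that enumeration, using the word and entry counts from the table (the function and constant names are illustrative):

```python
from itertools import product

TEXT_LENGTHS = {"Long": 1594, "Short": 524}   # approximate word counts
CODEX_SIZES = {"big": 51, "small": 11}        # number of codex entries
ENTRY_STYLES = ["basic", "detailed"]


def scenarios():
    """Build one label per combination of entry style, text length, and codex size."""
    labels = []
    for style, (length, words), (size, entries) in product(
        ENTRY_STYLES, TEXT_LENGTHS.items(), CODEX_SIZES.items()
    ):
        detail = " detailed" if style == "detailed" else ""
        labels.append(
            f"{length} text (~{words} words), {size} codex ({entries}{detail} entries)"
        )
    return labels
```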
Data extraction
Extract key details from a given block of text.
| Scenario | Best Model | Score |
|---|---|---|
| All valid emails | MoonshotAI: Kimi K2.5 | 100.00% |
| Contextual pronoun | Claude Sonnet 4.6 | 100.00% |
| Fruits excluding citrus | Gemma 3 12B | 100.00% |
| Future event time | Z.AI GLM 4.7 Flash | 100.00% |
| Guess the pet | GPT-5 Mini | 100.00% |
| Highest-rated movie | DeepSeek V3.1 | 100.00% |
| Indirect birth year | GPT-5.4 (Reasoning, Low) | 100.00% |
| What instrument does Lucy play? | GPT-5 | 100.00% |
| What's the color of the car? | Claude Opus 4.5 | 100.00% |
| What's the correct time? | Claude Sonnet 4 | 100.00% |
| Who's the sister? | Llama 3.1 70B | 100.00% |
| Who's the tallest? | Hermes 3 405B | 100.00% |
Dialogue tags
Various tasks related to dialogue tags in text.
| Scenario | Best Model | Score |
|---|---|---|
| dialogue-200 | ||
| Write 200 words with 10% dialogue | Gemini 3.1 Pro (Preview) | 100.00% |
| Write 200 words with 50% dialogue | Gemini 3.1 Pro (Preview) | 100.00% |
| Write 200 words with 90% dialogue | Gemini 3.1 Pro (Preview) | 98.93% |
| dialogue-500 | ||
| Write 500 words with 30% dialogue | Gemini 3.1 Pro (Preview) | 96.01% |
| Write 500 words with 50% dialogue | Gemini 3.1 Pro (Preview) | 100.00% |
| Write 500 words with 70% dialogue | Gemini 3.1 Pro (Preview) | 99.97% |
| Ungrouped | ||
| Write unattributed dialogue | GPT-4.1 | 100.00% |
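How the dialogue percentage is measured is not described here. One plausible approach, sketched below, treats any words inside double quotes (straight or curly) as dialogue and divides by the total word count:

```python
import re

# Matches text inside straight or curly double quotes.
QUOTED = re.compile(r'"([^"]*)"|“([^”]*)”')


def dialogue_ratio(text):
    """Return the fraction of words that sit inside double quotes."""
    total = len(text.split())
    if total == 0:
        return 0.0
    spoken = sum(len((a or b).split()) for a, b in QUOTED.findall(text))
    return spoken / total
```

This heuristic ignores single-quoted dialogue and unbalanced quotes, so it is only a first approximation.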
Language Comprehension
Does the model understand more than just English?
| Scenario | Best Model | Score |
|---|---|---|
| Asking for directions (Dutch) | GPT-4.1 | 100.00% |
| Asking for directions (German) | Ministral 3 3B | 100.00% |
| Friend got new kittens (German) | Claude Opus 4.6 | 100.00% |
| Friend got new kittens (Tagalog) | Gemini 3 Pro (Preview) | 100.00% |
Language Writing
Can the model generate text in different languages?
| Scenario | Best Model | Score |
|---|---|---|
| Character dialogue (French) in a story | Gemini 3 Flash (Preview) | 100.00% |
| Character dialogue (German) in a story | GPT-5 Nano | 100.00% |
| Character dialogue (Hindi) in a story | Claude Sonnet 4.6 | 100.00% |
| Character dialogue (Italian) in a story | Claude Opus 4.6 (Reasoning) | 100.00% |
| Character dialogue (Spanish) in a story | Claude Opus 4.5 | 100.00% |
Novel outline
Handle questions about the outline of a novel in various formats.
| Scenario | Best Model | Score |
|---|---|---|
| outline-count | ||
| Count acts | Gemini 3 Flash (Preview, Reasoning) | 100.00% |
| Count chapters | Stealth: Hunter Alpha | 100.00% |
| Count scenes | Qwen 3.5 35B | 100.00% |
| pov-count | ||
| Count points of view for Jack and Olivia | Writer: Palmyra X5 | 100.00% |
| Count points of view for Jack Harper | Qwen 3.5 122B | 100.00% |
| Count points of view for Olivia | Qwen 3.5 Flash | 100.00% |
Tool usage within Novelcrafter
Output messages related to tool usage within Novelcrafter.
| Scenario | Best Model | Score |
|---|---|---|
| Create alternate prose sections | Claude 3.5 Sonnet | 100.00% |
N-Length Sentences
Write sentences with exactly N words.
| Scenario | Best Model | Score |
|---|---|---|
| Write sentences with 5 words each | Llama 3.1 Nemotron 70B | 100.00% |
| Write sentences with 10 words each | Nemotron 3 Super | 100.00% |
| Write sentences with 20 words each | o4 Mini | 100.00% |
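A check for this kind of task can be sketched as follows, assuming sentences end with `.`, `!`, or `?` and words are separated by whitespace. The benchmark's own tokenization rules are not stated, so this is an approximation:

```python
import re


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def fraction_with_n_words(text, n):
    """Return the share of sentences that contain exactly n words."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if len(s.split()) == n)
    return hits / len(sentences)
```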
Text Replacement
Tests deterministic text transformations: renaming characters/locations, expanding contractions, tense rewriting, POV shifts, gender swaps, combined transformations, and word avoidance. Scored by checking each expected change independently.
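Scoring each expected change independently means a response earns partial credit for every change it gets right. A minimal sketch of that idea follows; the `ExpectedChange` structure and the substring-based check are assumptions for illustration, not the benchmark's implementation:

```python
from dataclasses import dataclass


@dataclass
class ExpectedChange:
    must_contain: str = ""       # text that should appear after the transformation
    must_not_contain: str = ""   # text that should no longer appear


def score(output, changes):
    """Return the fraction of expected changes satisfied by the output."""
    if not changes:
        return 0.0
    passed = 0
    for change in changes:
        ok = True
        if change.must_contain and change.must_contain not in output:
            ok = False
        if change.must_not_contain and change.must_not_contain in output:
            ok = False
        passed += ok
    return passed / len(changes)
```

For example, a rename plus a contraction expansion would be two separate `ExpectedChange` items, so completing only one of them yields a score of 0.5.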
Voice/dialogue sheets
Extract dialogue from given text as voice sheets.
| Scenario | Best Model | Score |
|---|---|---|
| Multiple speakers | Gemini 3 Pro (Preview) | 100.00% |
| Simple | Claude Opus 4 | 100.00% |
| Simple (1-shot) | DeepSeek V3 (2025-03-24) | 100.00% |
| Simple (5-shot) | Grok 4.1 Fast | 100.00% |
| Unattributed dialogue | Gemini 3.1 Pro (Preview) | 100.00% |
Write N of X
Write exactly N words/sentences/paragraphs...
| Scenario | Best Model | Score |
|---|---|---|
| paragraphs | ||
| 1 paragraph summary | Gemini 2.5 Flash Lite (Reasoning) | 100.00% |
| 3 paragraph summary | Ministral 3 14B | 100.00% |
| 5 paragraph summary | GPT-4.1 Mini | 100.00% |
| sentences | ||
| 1 sentence summary | DeepSeek-V2 Chat | 100.00% |
| 3 sentence summary | Grok 4 | 100.00% |
| 10 sentence summary | GPT-4o Mini (temp=1) | 100.00% |
| 20 sentence summary | ByteDance Seed 2.0 Lite | 100.00% |
| 50 sentence summary | Gemini 3.1 Pro (Preview) | 100.00% |
| words | ||
| 10 word summary | ByteDance Seed 2.0 Lite | 100.00% |
| 20 word summary | GPT-5.4 (Reasoning) | 100.00% |
| 50 word summary | GPT-4o Mini (temp=0) | 100.00% |
| 100 word summary | Gemini 3 Pro (Preview) | 100.00% |
| 200 word summary | Z.AI GLM 5 | 100.00% |
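Checking an exact count depends on how each unit is delimited. The sketch below assumes whitespace-separated words, sentences ending in `.`, `!`, or `?`, and paragraphs separated by blank lines; the benchmark may count differently.

```python
import re


def count_units(text, unit):
    """Count words, sentences, or paragraphs under the stated assumptions."""
    text = text.strip()
    if not text:
        return 0
    if unit == "words":
        return len(text.split())
    if unit == "sentences":
        return len([s for s in re.split(r"(?<=[.!?])\s+", text) if s])
    if unit == "paragraphs":
        return len([p for p in re.split(r"\n\s*\n", text) if p.strip()])
    raise ValueError(f"unknown unit: {unit}")


def meets_target(text, unit, n):
    """Return True when the text contains exactly n of the given unit."""
    return count_units(text, unit) == n
```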