Tests
This page shows the performance of each test and its scenarios.
Bad Writing Habits
Detects common prose quality anti-patterns in AI-generated creative writing, including passive voice, past progressive overuse, weak dialogue tags, filter words, purple prose, cliches, AI-ism words/adverbs/names, and more.
Codex Violation Detection
Detects factual inconsistencies between a story bible and prose passages. The model must output structured XML identifying each violation with paragraph number and substring.
| Scenario | Best Model | Score |
|---|---|---|
| matrix | | |
| Large codex (40 entries), long passage (1,019 words) | Qwen3.6 Max Preview | 98.08% |
| Large codex (40 entries), short passage (165 words) | Claude Sonnet 4.5 | 100.00% |
| Small codex (7 entries), long passage (734 words) | Claude Opus 4.5 | 100.00% |
| Small codex (7 entries), short passage (165 words) | Qwen 3.5 Plus (2026-04-20) | 100.00% |
| tiers | | |
| 5 codex entries | Qwen3.6 Max Preview | 100.00% |
| 10 codex entries | Claude Opus 4.6 | 100.00% |
| 20 codex entries | Claude Opus 4.6 | 100.00% |
| 40 codex entries | Qwen 3.5 Plus (2026-04-20) | 100.00% |
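
The structured output described above can be checked mechanically. The sketch below is only an illustration: the page does not show the XML schema, so the `<violations>`, `<violation paragraph="…">`, and `<substring>` names are assumptions, as are the helper names `parse_violations` and `is_grounded`. It parses a response and confirms that each quoted substring really occurs in the paragraph it cites.

```python
import xml.etree.ElementTree as ET


def parse_violations(xml_text):
    """Return (paragraph_number, substring) pairs from a hypothetical <violations> document."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for node in root.iter("violation"):
        paragraph = int(node.get("paragraph"))
        substring = (node.findtext("substring") or "").strip()
        found.append((paragraph, substring))
    return found


def is_grounded(violation, paragraphs):
    """True when the quoted substring occurs in the cited paragraph (1-based numbering)."""
    number, substring = violation
    return 1 <= number <= len(paragraphs) and substring in paragraphs[number - 1]


# Invented sample data, not taken from the benchmark.
paragraphs = [
    "Mara drew her silver dagger.",
    "Her green eyes caught the torchlight.",
]
response = """
<violations>
  <violation paragraph="2"><substring>green eyes</substring></violation>
</violations>
"""
violations = parse_violations(response)
```

A grounding check like this catches responses that cite the wrong paragraph or paraphrase instead of quoting, independent of whether the violation itself is correct.
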
Codex Extraction
Evaluates a model's ability to extract structured codex entries (characters, locations, objects, lore) from prose passages and return them as well-formed XML.
| Scenario | Best Model | Score |
|---|---|---|
| Long: The Spire of Echoes (Dense) | Gemini 3 Pro (Preview) | 98.80% |
| Medium: The Hollow (Inferred) | Claude Opus 4.6 | 99.14% |
| Medium: Through the Thornveil (Scattered) | Z.AI GLM 5.1 | 98.92% |
| Short: The Rusty Lantern (Explicit) | Z.AI GLM 5.1 | 99.49% |
Codex Red Herring (False Positive Detection)
Tests whether models correctly report "no violations" when a codex is fully consistent with the prose passage. Models that hallucinate false violations (false positives) fail. Uses a 2×2 matrix of text length × codex size, with bare and detailed-entry variants.
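
Because every passage in this test is consistent with its codex, any reported violation counts as a false positive. The following sketch shows one way such a rate could be computed. The scenario keys, the sample report, and the function name `false_positive_rate` are hypothetical and not taken from the benchmark.

```python
def false_positive_rate(reports):
    """Share of consistent scenarios in which the model reported any violation."""
    if not reports:
        return 0.0
    flagged = sum(1 for violations in reports.values() if violations)
    return flagged / len(reports)


# Hypothetical results keyed by (text length, codex size); every passage is consistent,
# so a non-empty list is a hallucinated violation.
reports = {
    ("short", "small"): [],
    ("short", "large"): [],
    ("long", "small"): ["paragraph 3: 'blue cloak'"],
    ("long", "large"): [],
}
rate = false_positive_rate(reports)
```
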
Data extraction
Extract key details from a given block of text.
| Scenario | Best Model | Score |
|---|---|---|
| All valid emails | Z.AI GLM 4.5 | 100.00% |
| Contextual pronoun | Gemini 3.1 Flash Lite | 100.00% |
| Fruits excluding citrus | Ministral 3 14B | 100.00% |
| Future event time | Gemini 3 Pro (Preview) | 100.00% |
| Guess the pet | Mistral Large 3 | 100.00% |
| Highest-rated movie | Nemotron 3 Nano | 100.00% |
| Indirect birth year | Llama 3.1 Nemotron 70B | 100.00% |
| What instrument does Lucy play? | DeepSeek V3.2 | 100.00% |
| What's the color of the car? | GPT-5.4 Mini | 100.00% |
| What's the correct time? | Claude Sonnet 4 | 100.00% |
| Who's the sister? | Stealth: Healer Alpha | 100.00% |
| Who's the tallest? | Mistral Small 3.2 24B | 100.00% |
Dialogue tags
Various tasks related to dialogue tags in text.
| Scenario | Best Model | Score |
|---|---|---|
| dialogue-200 | | |
| Write 200 words with 10% dialogue | Gemini 3.1 Pro (Preview) | 100.00% |
| Write 200 words with 50% dialogue | Gemini 3.1 Pro (Preview) | 100.00% |
| Write 200 words with 90% dialogue | Gemini 3.1 Pro (Preview) | 98.93% |
| dialogue-500 | | |
| Write 500 words with 30% dialogue | Z.AI GLM 5 Turbo | 96.52% |
| Write 500 words with 50% dialogue | Gemini 3.1 Pro (Preview) | 100.00% |
| Write 500 words with 70% dialogue | Gemini 3.1 Pro (Preview) | 99.97% |
| Ungrouped | | |
| Write unattributed dialogue | Claude Sonnet 4.5 | 100.00% |
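
The percentage targets in the scenarios above imply some measure of how much of a passage is dialogue. The page does not say how this is measured; the sketch below assumes a simple word-based ratio in which dialogue is any text between double quotes (straight or curly). The name `dialogue_share` is hypothetical.

```python
import re

# Matches text between straight or curly double quotes (assumed dialogue delimiters).
QUOTED = re.compile(r'"[^"]*"|“[^”]*”')


def dialogue_share(text):
    """Fraction of whitespace-separated words that fall inside quoted spans."""
    total = len(text.split())
    if total == 0:
        return 0.0
    spoken = sum(len(match.group(0).split()) for match in QUOTED.finditer(text))
    return spoken / total


sample = '"Where are we going?" asked Tom. "Home," she said.'
share = dialogue_share(sample)  # 5 quoted words out of 9
```

A real scorer would also need to handle single-quoted dialogue and quotations nested inside speech, which this sketch ignores.
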
Language Comprehension
Does the model understand more than just English?
| Scenario | Best Model | Score |
|---|---|---|
| Asking for directions (Dutch) | Claude Opus 4.7 | 100.00% |
| Asking for directions (German) | Mistral Large 3 | 100.00% |
| Friend got new kittens (German) | Mistral Large 2 | 100.00% |
| Friend got new kittens (Tagalog) | ByteDance Seed 1.6 | 100.00% |
Language Writing
Can the model generate text in different languages?
| Scenario | Best Model | Score |
|---|---|---|
| Character dialogue (French) in a story | ByteDance Seed 2.0 Lite | 100.00% |
| Character dialogue (German) in a story | Grok 4.20 | 100.00% |
| Character dialogue (Hindi) in a story | GPT-5.4 Mini | 100.00% |
| Character dialogue (Italian) in a story | Gemma 4 31B | 100.00% |
| Character dialogue (Spanish) in a story | Qwen 3.5 27B | 100.00% |
Novel outline
Handle questions about the outline of a novel in various formats.
| Scenario | Best Model | Score |
|---|---|---|
| outline-count | | |
| Count acts | Claude 3.7 Sonnet | 100.00% |
| Count chapters | Cohere Command R+ (Aug. 2024) | 100.00% |
| Count scenes | Grok 4.3 (Reasoning) | 100.00% |
| pov-count | | |
| Count points of view for Jack and Olivia | ByteDance Seed 1.6 Flash | 100.00% |
| Count points of view for Jack Harper | GPT-5.5 (Reasoning, Low) | 100.00% |
| Count points of view for Olivia | GPT-5.5 (Reasoning, Low) | 100.00% |
Tool usage within Novelcrafter
Output messages related to tool usage within Novelcrafter.
| Scenario | Best Model | Score |
|---|---|---|
| Create alternate prose sections | Ministral 3 8B | 100.00% |
N-Length Sentences
Write sentences with exactly N words.
| Scenario | Best Model | Score |
|---|---|---|
| Write sentences with 5 words each | GPT-5.4 Mini (Reasoning) | 100.00% |
| Write sentences with 10 words each | Qwen 3.5 122B | 100.00% |
| Write sentences with 20 words each | GPT-5.4 Mini (Reasoning) | 100.00% |
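
A constraint like "exactly N words per sentence" can be verified with a small checker. The sketch below assumes sentences end in `.`, `!`, or `?` followed by whitespace and that words are whitespace-separated; the benchmark's real tokenization rules are not stated on this page, and the function names are hypothetical.

```python
import re


def sentence_word_counts(text):
    """Split on sentence-ending punctuation and count the words in each sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def all_have_n_words(text, n):
    """True when the text has at least one sentence and every sentence has n words."""
    counts = sentence_word_counts(text)
    return bool(counts) and all(count == n for count in counts)


sample = "The old dog slept soundly. Rain fell on the roof."
counts = sentence_word_counts(sample)
```

Abbreviations such as "Dr." would be split incorrectly by this simple rule, so a production checker would need a more careful sentence splitter.
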
Text Replacement
Tests deterministic text transformations: renaming characters/locations, expanding contractions, tense rewriting, POV shifts, gender swaps, combined transformations, and word avoidance. Scored by checking each expected change independently.
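
Scoring each expected change independently means a model earns partial credit when it applies some transformations and misses others. The sketch below illustrates that idea under assumptions: each check lists strings that must be present and strings that must be absent, and the score is the fraction of checks satisfied. The check format, the sample text, and the name `score_replacement` are hypothetical.

```python
def score_replacement(output, checks):
    """Fraction of checks whose required strings appear and forbidden strings do not."""
    if not checks:
        return 0.0
    passed = 0
    for check in checks:
        has_required = all(s in output for s in check.get("present", []))
        lacks_forbidden = all(s not in output for s in check.get("absent", []))
        if has_required and lacks_forbidden:
            passed += 1
    return passed / len(checks)


# Invented example: two of the three requested changes were applied.
output = "Elena did not see the harbor at Westmere."
checks = [
    {"present": ["Elena"], "absent": ["Anna"]},         # character rename applied
    {"present": ["did not"], "absent": ["didn't"]},     # contraction expanded
    {"present": ["Eastmere"], "absent": ["Westmere"]},  # location rename missed
]
score = score_replacement(output, checks)
```
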
Voice/dialogue sheets
Extract dialogue from given text as voice sheets.
| Scenario | Best Model | Score |
|---|---|---|
| Multiple speakers | Claude Sonnet 4.5 | 100.00% |
| Simple | GPT-4o, May 13th (temp=0) | 100.00% |
| Simple (1-shot) | GPT-5.5 | 100.00% |
| Simple (5-shot) | Claude 3 Haiku | 100.00% |
| Unattributed dialogue | Grok 4.3 (Reasoning) | 100.00% |
Write N of X
Write exactly N words/sentences/paragraphs...
| Scenario | Best Model | Score |
|---|---|---|
| paragraphs | | |
| 1 paragraph summary | Gemini 3.1 Flash Lite (Reasoning) | 100.00% |
| 3 paragraph summary | Stealth: Aurora Alpha | 100.00% |
| 5 paragraph summary | Claude Sonnet 4.5 | 100.00% |
| sentences | | |
| 1 sentence summary | DeepSeek V3.1 | 100.00% |
| 3 sentence summary | Hermes 3 405B | 100.00% |
| 10 sentence summary | Grok 4.20 (Reasoning) | 100.00% |
| 20 sentence summary | Gemini 3.1 Pro (Preview) | 100.00% |
| 50 sentence summary | ByteDance Seed 1.6 | 100.00% |
| words | | |
| 10 word summary | Qwen3.6 Max Preview | 100.00% |
| 20 word summary | Inception Mercury 2 | 100.00% |
| 50 word summary | Qwen 3.5 122B | 100.00% |
| 100 word summary | Gemma 4 31B (Reasoning) | 100.00% |
| 200 word summary | Qwen 3.5 Flash | 100.00% |