Tests
This page lists each test with its scenarios and, where results are available, the best-scoring model for each scenario.
Bad Writing Habits
Detects common prose quality anti-patterns in AI-generated creative writing, including passive voice, past progressive overuse, weak dialogue tags, filter words, purple prose, cliches, AI-ism words/adverbs/names, and more.
Codex Violation Detection
Detects factual inconsistencies between a story bible and prose passages. The model must output structured XML identifying each violation with paragraph number and substring.
| Scenario | Best Model | Score |
|---|---|---|
| matrix | | |
| Large codex (40 entries), long passage (1,019 words) | Qwen3.6 Max Preview | 98.08% |
| Large codex (40 entries), short passage (165 words) | Claude Sonnet 4.5 | 100.00% |
| Small codex (7 entries), long passage (734 words) | Claude Opus 4.5 | 100.00% |
| Small codex (7 entries), short passage (165 words) | Z.AI GLM 5 | 100.00% |
| tiers | | |
| 5 codex entries | Hermes 3 405B | 100.00% |
| 10 codex entries | Qwen 3.6 27B | 100.00% |
| 20 codex entries | Claude Opus 4.6 (Reasoning) | 100.00% |
| 40 codex entries | Qwen 3.5 Plus (2026-04-20) | 100.00% |
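To make the expected output format concrete, here is a minimal sketch of how a response with a paragraph number and substring per violation could be parsed and sanity-checked. The tag and attribute names (`violations`, `violation`, `paragraph`) are assumptions for illustration; the page does not give the actual schema or scoring rule.

```python
import xml.etree.ElementTree as ET


def parse_violations(xml_text):
    """Return (paragraph_number, substring) pairs from a hypothetical XML response.

    Assumed shape (not taken from the benchmark itself):
    <violations><violation paragraph="2">quoted text</violation></violations>
    """
    root = ET.fromstring(xml_text)
    found = []
    for node in root.iter("violation"):
        paragraph = int(node.get("paragraph"))
        substring = (node.text or "").strip()
        found.append((paragraph, substring))
    return found


def validate(xml_text, paragraphs):
    """Check that each reported substring really occurs in the cited paragraph."""
    results = []
    for number, substring in parse_violations(xml_text):
        in_range = 1 <= number <= len(paragraphs)
        located = in_range and substring in paragraphs[number - 1]
        results.append((number, substring, located))
    return results


sample = '<violations><violation paragraph="2">blue eyes</violation></violations>'
passage = ["Mara walked in.", "Her blue eyes narrowed."]
print(validate(sample, passage))
```

A check like this only confirms that a reported violation points at real text; deciding whether it is a true inconsistency still needs the reference answers.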
Codex Extraction
Evaluates a model's ability to extract structured codex entries (characters, locations, objects, lore) from prose passages and return them as well-formed XML.
| Scenario | Best Model | Score |
|---|---|---|
| Long: The Spire of Echoes (Dense) | Claude Opus 4.8 (Reasoning) | 99.21% |
| Medium: The Hollow (Inferred) | Claude Opus 4.8 (Reasoning) | 99.41% |
| Medium: Through the Thornveil (Scattered) | Gemini 3.5 Flash (Reasoning, Minimal) | 99.08% |
| Short: The Rusty Lantern (Explicit) | Z.AI GLM 5.1 | 99.49% |
Codex Red Herring (False Positive Detection)
Tests whether models correctly report "no violations" when a codex is fully consistent with the prose passage. Models that hallucinate false violations (false positives) fail. Uses a 2×2 matrix of text length × codex size, with bare and detailed-entry variants.
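The scenario grid described above can be sketched as follows. The labels (`short`, `long`, `small`, `large`, `bare`, `detailed`) and the assumption that both entry variants apply to every cell of the 2×2 matrix are illustrative guesses, as is the pass rule of "zero reported violations".

```python
from itertools import product

# Hypothetical labels; the page only names the dimensions, not their values.
TEXT_LENGTHS = ("short", "long")
CODEX_SIZES = ("small", "large")
ENTRY_STYLES = ("bare", "detailed")


def build_scenarios():
    """Enumerate every combination of text length, codex size, and entry style."""
    return [
        {"text_length": t, "codex_size": c, "entry_style": s}
        for t, c, s in product(TEXT_LENGTHS, CODEX_SIZES, ENTRY_STYLES)
    ]


def passes(reported_violations):
    """A consistent passage should yield no violations at all (assumed rule)."""
    return len(reported_violations) == 0


for scenario in build_scenarios():
    print(scenario)
```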
Data extraction
Extract key details from a given block of text.
| Scenario | Best Model | Score |
|---|---|---|
| All valid emails | Claude Haiku 4.5 | 100.00% |
| Contextual pronoun | Gemini 2.5 Flash Lite | 100.00% |
| Fruits excluding citrus | Ministral 3 14B | 100.00% |
| Future event time | Gemini 3 Pro (Preview) | 100.00% |
| Guess the pet | GPT-OSS 120B | 100.00% |
| Highest-rated movie | Skyfall 36B V2 | 100.00% |
| Indirect birth year | Gemini 3 Flash (Preview) | 100.00% |
| What instrument does Lucy play? | Nemotron 3 Super | 100.00% |
| What's the color of the car? | GPT-5.4 Mini | 100.00% |
| Who's the sister? | Claude Opus 4.8 (Reasoning) | 100.00% |
| Who's the tallest? | GPT-4.1 Mini | 100.00% |
Dialogue tags
Various tasks related to dialogue tags in text.
| Scenario | Best Model | Score |
|---|---|---|
| dialogue-200 | | |
| Write 200 words with 10% dialogue | Gemini 3.1 Pro (Preview) | 100.00% |
| Write 200 words with 50% dialogue | Gemini 3.1 Pro (Preview) | 100.00% |
| Write 200 words with 90% dialogue | Gemini 3.1 Pro (Preview) | 98.93% |
| dialogue-500 | | |
| Write 500 words with 30% dialogue | Qwen3.7 Max | 100.00% |
| Write 500 words with 50% dialogue | Gemini 3.1 Pro (Preview) | 100.00% |
| Write 500 words with 70% dialogue | Gemini 3.1 Pro (Preview) | 99.97% |
| Ungrouped | | |
| Write unattributed dialogue | Writer: Palmyra X5 | 100.00% |
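One way to estimate the dialogue percentage targeted by the scenarios above is to compare the words inside quotation marks with the total word count. This is a rough sketch, not the benchmark's scoring method: it assumes dialogue is marked with straight double quotes and counts whitespace-separated tokens as words.

```python
import re

# Matches text between straight double quotes (curly quotes are not handled).
QUOTED = re.compile(r'"([^"]*)"')


def dialogue_share(text):
    """Return the fraction of words that appear inside double-quoted spans."""
    total = len(text.split())
    if total == 0:
        return 0.0
    spoken = sum(len(span.split()) for span in QUOTED.findall(text))
    return spoken / total


example = '"Come here," she said. "Now."'
print(f"{dialogue_share(example):.0%} dialogue")
```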
Language Comprehension
Does the model understand more than just English?
| Scenario | Best Model | Score |
|---|---|---|
| Asking for directions (Dutch) | Gemini 3.1 Flash Lite (Reasoning) | 100.00% |
| Asking for directions (German) | Z.AI GLM 5.1 | 100.00% |
| Friend got new kittens (German) | Inception Mercury | 100.00% |
| Friend got new kittens (Tagalog) | Hermes 3 70B | 100.00% |
Language Writing
Can the model generate text in different languages?
| Scenario | Best Model | Score |
|---|---|---|
| Character dialogue (French) in a story | Qwen 3.5 27B | 100.00% |
| Character dialogue (German) in a story | DeepSeek-V2 Chat | 100.00% |
| Character dialogue (Hindi) in a story | GPT-4o Mini (temp=1) | 100.00% |
| Character dialogue (Italian) in a story | Gemma 4 31B | 100.00% |
| Character dialogue (Spanish) in a story | Gemini 3.1 Flash Lite | 100.00% |
Novel outline
Handle questions about the outline of a novel in various formats.
| Scenario | Best Model | Score |
|---|---|---|
| outline-count | | |
| Count acts | Claude Opus 4.7 | 100.00% |
| Count chapters | GPT-5.4 (Reasoning, Low) | 100.00% |
| Count scenes | Gemini 3 Pro (Preview) | 100.00% |
| pov-count | | |
| Count points of view for Jack and Olivia | Gemini 3.1 Pro (Preview) | 100.00% |
| Count points of view for Jack Harper | Claude Sonnet 4.5 | 100.00% |
| Count points of view for Olivia | Gemini 2.5 Flash Lite (Reasoning) | 100.00% |
Tool usage within Novelcrafter
Output messages related to tool usage within Novelcrafter.
| Scenario | Best Model | Score |
|---|---|---|
| Create alternate prose sections | GPT-5 Nano | 100.00% |
Relationship tree
Extracts a deterministic XML family and relationship tree from cumulative literary prose.
| Scenario | Best Model | Score |
|---|---|---|
| Core relationship tree | GPT-5.4 (Reasoning) | 98.71% |
| Family relationship tree | GPT-5.4 (Reasoning) | 92.77% |
N-Length Sentences
Write sentences with exactly N words.
| Scenario | Best Model | Score |
|---|---|---|
| Write sentences with 5 words each | GPT-5.4 Mini (Reasoning) | 100.00% |
| Write sentences with 10 words each | Gemma 4 26B (Reasoning) | 100.00% |
| Write sentences with 20 words each | GPT-OSS 120B | 100.00% |
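A constraint like "exactly N words per sentence" can be checked mechanically. The sketch below is an assumption about how such a check might look, not the benchmark's actual scorer: it splits sentences on terminal punctuation followed by whitespace and counts whitespace-separated tokens as words.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows ., !, or ? (a deliberately naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def fraction_with_n_words(text, n):
    """Return the share of sentences that contain exactly n words."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    hits = sum(1 for sentence in sentences if len(sentence.split()) == n)
    return hits / len(sentences)


draft = "The cat sat down quietly. Birds sang in the trees. It rained."
print(fraction_with_n_words(draft, 5))
```

Abbreviations such as "Dr." would be split incorrectly by this rule, so a production check would need a more careful sentence tokenizer.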
Text Replacement
Tests deterministic text transformations: renaming characters/locations, expanding contractions, tense rewriting, POV shifts, gender swaps, combined transformations, and word avoidance. Scored by checking each expected change independently.
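Scoring each expected change independently could look like the sketch below. The function name, the two lists of expectations, and the equal weighting of checks are hypothetical; the page does not describe the scorer beyond stating that changes are checked one by one.

```python
def score_replacements(output, must_contain, must_not_contain):
    """Score an output by checking every expected change on its own.

    Each required string that is present and each forbidden string that is
    absent counts as one passed check (equal weighting is an assumption).
    """
    checks = [item in output for item in must_contain]
    checks += [item not in output for item in must_not_contain]
    if not checks:
        return 0.0
    return sum(checks) / len(checks)


# Rename Anna -> Elena and expand contractions; the model missed the contraction.
result = score_replacements(
    "Elena didn't go to Northgate.",
    must_contain=["Elena", "did not", "Northgate"],
    must_not_contain=["Anna", "didn't"],
)
print(result)
```

Checking changes independently gives partial credit: a model that renames the character correctly but misses a contraction still scores above zero.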
Voice/dialogue sheets
Extract dialogue from given text as voice sheets.
| Scenario | Best Model | Score |
|---|---|---|
| Multiple speakers | Gemini 3.5 Flash (Reasoning, Minimal) | 100.00% |
| Simple | Claude Sonnet 4.6 (Reasoning) | 100.00% |
| Simple (1-shot) | GPT-5.5 | 100.00% |
| Simple (5-shot) | GPT-5.1 | 100.00% |
| Unattributed dialogue | GPT-5 Mini | 100.00% |
Write N of X
Write exactly N words/sentences/paragraphs...
| Scenario | Best Model | Score |
|---|---|---|
| paragraphs | | |
| 1 paragraph summary | Qwen 3.5 Flash | 100.00% |
| 3 paragraph summary | ByteDance Seed 2.0 Lite | 100.00% |
| 5 paragraph summary | Claude Sonnet 4.5 | 100.00% |
| sentences | | |
| 1 sentence summary | Mistral Medium 3.1 | 100.00% |
| 3 sentence summary | GPT-5.4 (Reasoning) | 100.00% |
| 10 sentence summary | Grok 4.20 (Reasoning) | 100.00% |
| 20 sentence summary | GPT-5 Mini | 100.00% |
| 50 sentence summary | Grok 4.3 (Reasoning) | 100.00% |
| words | | |
| 10 word summary | o4 Mini | 100.00% |
| 20 word summary | Grok 4.3 (Reasoning) | 100.00% |
| 50 word summary | Z.AI GLM 4.7 | 100.00% |
| 100 word summary | Z.AI GLM 5 Turbo | 100.00% |
| 200 word summary | Qwen3.7 Max | 100.00% |
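The three units in the table above can be counted with one small helper. This is a sketch under stated assumptions rather than the benchmark's implementation: words are whitespace-separated tokens, sentences end in `.`, `!`, or `?`, and paragraphs are separated by blank lines.

```python
import re


def count_units(text, unit):
    """Count words, sentences, or paragraphs using simple, assumed rules."""
    stripped = text.strip()
    if not stripped:
        return 0
    if unit == "words":
        return len(stripped.split())
    if unit == "sentences":
        # A run of non-terminal characters followed by terminal punctuation.
        return len(re.findall(r"[^.!?]+[.!?]+", stripped))
    if unit == "paragraphs":
        return len(re.split(r"\n\s*\n", stripped))
    raise ValueError(f"unknown unit: {unit}")


def meets_target(text, unit, n):
    """True when the text contains exactly n of the requested unit."""
    return count_units(text, unit) == n


summary = "The heist fails. Mara escapes.\n\nShe returns home."
print({unit: count_units(summary, unit) for unit in ("words", "sentences", "paragraphs")})
```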