Tests
This page lists each test, its scenarios, and the best-scoring model for each scenario.
Bad Writing Habits
Detects common prose quality anti-patterns in AI-generated creative writing, including passive voice, past progressive overuse, weak dialogue tags, filter words, purple prose, cliches, AI-ism words/adverbs/names, and more.
Codex Violation Detection
Detects factual inconsistencies between a story bible and prose passages. The model must output structured XML identifying each violation with paragraph number and substring.
| Scenario | Best Model | Score |
|---|---|---|
| matrix | ||
| Large codex (40 entries), long passage (1,019 words) | Grok 4 | 97.89% |
| Large codex (40 entries), short passage (165 words) | Claude Sonnet 4.5 | 100.00% |
| Small codex (7 entries), long passage (734 words) | Grok 4.20 (Beta, Reasoning) | 100.00% |
| Small codex (7 entries), short passage (165 words) | Gemini 3 Flash (Preview, Reasoning) | 100.00% |
| tiers | ||
| 5 codex entries | GPT-5 Mini | 100.00% |
| 10 codex entries | Gemini 3 Flash (Preview) | 100.00% |
| 20 codex entries | Claude Opus 4.5 | 100.00% |
| 40 codex entries | Grok 4 Fast | 100.00% |
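
The page does not show the XML schema the model must emit, so the following is only a sketch of how such output might be checked. The element and attribute names (`violations`, `violation`, `paragraph`, `substring`) are hypothetical, and the check assumes paragraphs are numbered from 1.

```python
import xml.etree.ElementTree as ET


def check_violations(xml_text: str, paragraphs: list[str]) -> list[tuple[int, str, bool]]:
    """Return (paragraph number, substring, found) for each reported violation.

    Assumes a hypothetical schema:
      <violations>
        <violation paragraph="1"><substring>...</substring></violation>
      </violations>
    """
    root = ET.fromstring(xml_text)
    results = []
    for violation in root.iter("violation"):
        # A missing attribute falls back to 0, which is never a valid paragraph.
        number = int(violation.get("paragraph", "0"))
        substring = (violation.findtext("substring") or "").strip()
        in_range = 1 <= number <= len(paragraphs)
        # The quoted substring must literally occur in the cited paragraph.
        found = in_range and substring != "" and substring in paragraphs[number - 1]
        results.append((number, substring, found))
    return results
```

A reported violation whose substring does not occur in the cited paragraph comes back with `found` set to `False`, which makes hallucinated quotes easy to spot.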
Codex Extraction
Evaluates a model's ability to extract structured codex entries (characters, locations, objects, lore) from prose passages and return them as well-formed XML.
| Scenario | Best Model | Score |
|---|---|---|
| Long: The Spire of Echoes (Dense) | Gemini 3 Pro (Preview) | 98.80% |
| Medium: The Hollow (Inferred) | Claude Opus 4.6 | 99.14% |
| Medium: Through the Thornveil (Scattered) | Claude Opus 4.6 (Reasoning) | 98.74% |
| Short: The Rusty Lantern (Explicit) | Qwen 3.5 Plus (2026-02-15) | 99.48% |
Codex Red Herring (False Positive Detection)
Tests whether models correctly report "no violations" when a codex is fully consistent with the prose passage. Models that hallucinate false violations (false positives) fail. Uses a 2×2 matrix of text length × codex size, with bare and detailed-entry variants.
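
The scenario grid and pass rule described above can be sketched as follows. The labels are hypothetical, and the sketch assumes every cell of the 2×2 matrix is run in both the bare and the detailed-entry variant, which the page does not state explicitly.

```python
from itertools import product

# Hypothetical labels for the axes named in the description.
TEXT_LENGTHS = ("short", "long")
CODEX_SIZES = ("small", "large")
ENTRY_STYLES = ("bare", "detailed")


def scenarios() -> list[tuple[str, str, str]]:
    """Enumerate text length x codex size x entry style combinations."""
    return list(product(TEXT_LENGTHS, CODEX_SIZES, ENTRY_STYLES))


def passes(reported_violations: list[str]) -> bool:
    """The codex is fully consistent, so any reported violation is a false positive."""
    return len(reported_violations) == 0
```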
Data extraction
Extract key details from a given block of text.
| Scenario | Best Model | Score |
|---|---|---|
| All valid emails | ByteDance Seed 1.6 Flash | 100.00% |
| Contextual pronoun | Stealth: Aurora Alpha | 100.00% |
| Fruits excluding citrus | Qwen 3.5 122B | 100.00% |
| Future event time | Qwen 3.5 397B A17B | 100.00% |
| Guess the pet | Gemini 3 Pro (Preview) | 100.00% |
| Highest-rated movie | DeepSeek V3.1 | 100.00% |
| Indirect birth year | Claude Sonnet 4 | 100.00% |
| What instrument does Lucy play? | Qwen 3 32B | 100.00% |
| What's the color of the car? | Gemini 3.1 Pro (Preview) | 100.00% |
| What's the correct time? | GPT-4o Mini (temp=0) | 100.00% |
| Who's the sister? | Claude Haiku 4.5 | 100.00% |
| Who's the tallest? | Gemini 2.5 Flash (Reasoning) | 100.00% |
Dialogue tags
Various tasks related to dialogue tags in text.
| Scenario | Best Model | Score |
|---|---|---|
| dialogue-200 | ||
| Write 200 words with 10% dialogue | Gemini 3.1 Pro (Preview) | 100.00% |
| Write 200 words with 50% dialogue | Gemini 3.1 Pro (Preview) | 100.00% |
| Write 200 words with 90% dialogue | Gemini 3.1 Pro (Preview) | 98.93% |
| dialogue-500 | ||
| Write 500 words with 30% dialogue | Z.AI GLM 5 Turbo | 96.52% |
| Write 500 words with 50% dialogue | Gemini 3.1 Pro (Preview) | 100.00% |
| Write 500 words with 70% dialogue | Gemini 3.1 Pro (Preview) | 99.97% |
| Ungrouped | ||
| Write unattributed dialogue | Ministral 3 14B | 100.00% |
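
The page does not say how the dialogue percentage is measured. One plausible approach, shown here only as a sketch, is to count the words inside double quotation marks (straight or curly) and divide by the total word count.

```python
import re

# Matches text inside straight or curly double quotes (an assumption about
# how dialogue is delimited; single-quoted dialogue is not handled).
QUOTED = re.compile(r'"([^"]*)"|“([^”]*)”')


def dialogue_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated words that fall inside quotes."""
    total = len(text.split())
    if total == 0:
        return 0.0
    spoken = sum(
        len((match.group(1) or match.group(2) or "").split())
        for match in QUOTED.finditer(text)
    )
    return spoken / total
```

A 200-word passage targeting 50% dialogue would then be expected to return a value close to 0.5.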
Language Comprehension
Does the model understand more than just English?
| Scenario | Best Model | Score |
|---|---|---|
| Asking for directions (Dutch) | GPT-5.4 Nano (Reasoning) | 100.00% |
| Asking for directions (German) | Claude Sonnet 4.6 | 100.00% |
| Friend got new kittens (German) | Claude 3.5 Haiku | 100.00% |
| Friend got new kittens (Tagalog) | Mistral Small Creative | 100.00% |
Language Writing
Can the model generate text in different languages?
| Scenario | Best Model | Score |
|---|---|---|
| Character dialogue (French) in a story | Claude Sonnet 4.6 | 100.00% |
| Character dialogue (German) in a story | GPT-4o Mini (temp=1) | 100.00% |
| Character dialogue (Hindi) in a story | DeepSeek-V2 Chat | 100.00% |
| Character dialogue (Italian) in a story | Claude Opus 4.6 | 100.00% |
| Character dialogue (Spanish) in a story | Gemini 3 Flash (Preview, Reasoning) | 100.00% |
Novel outline
Handle questions about the outline of a novel in various formats.
| Scenario | Best Model | Score |
|---|---|---|
| outline-count | ||
| Count acts | Mistral Large | 100.00% |
| Count chapters | GPT-5.4 Mini | 100.00% |
| Count scenes | GPT-5 Nano | 100.00% |
| pov-count | ||
| Count points of view for Jack and Olivia | Gemini 2.5 Pro | 100.00% |
| Count points of view for Jack Harper | o4 Mini High | 100.00% |
| Count points of view for Olivia | Ministral 3 14B | 100.00% |
Tool usage within Novelcrafter
Output messages related to tool usage within Novelcrafter.
| Scenario | Best Model | Score |
|---|---|---|
| Create alternate prose sections | ByteDance Seed 2.0 Lite | 100.00% |
N-Length Sentences
Write sentences with exactly N words.
| Scenario | Best Model | Score |
|---|---|---|
| Write sentences with 5 words each | Llama 3.1 Nemotron 70B | 100.00% |
| Write sentences with 10 words each | o4 Mini | 100.00% |
| Write sentences with 20 words each | GPT-5 | 100.00% |
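
A check for this kind of constraint might look like the sketch below. It assumes sentences end in `.`, `!`, or `?` followed by whitespace and that words are whitespace-separated; the benchmark's actual tokenization is not described on this page.

```python
import re


def sentence_lengths(text: str) -> list[int]:
    """Split on sentence-ending punctuation and return the word count of each sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def all_exactly(text: str, n: int) -> bool:
    """True only if there is at least one sentence and every sentence has n words."""
    lengths = sentence_lengths(text)
    return bool(lengths) and all(length == n for length in lengths)
```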
Text Replacement
Tests deterministic text transformations: renaming characters/locations, expanding contractions, tense rewriting, POV shifts, gender swaps, combined transformations, and word avoidance. Scored by checking each expected change independently.
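Scoring each expected change independently could work roughly as in this sketch. The representation of a change as an `(old, new)` pair and the pass rule (the new text is present and the old text is gone) are assumptions for illustration, not the benchmark's documented method.

```python
def score_replacements(output: str, changes: list[tuple[str, str]]) -> float:
    """Return the fraction of expected changes that were applied.

    Each change is a hypothetical (old, new) pair. It counts as applied when
    the new text appears in the output and the old text no longer does.
    """
    if not changes:
        return 0.0
    passed = sum(1 for old, new in changes if new in output and old not in output)
    return passed / len(changes)
```

Because every change is checked on its own, a model that renames a character correctly but misses a contraction still earns partial credit.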
Voice/dialogue sheets
Extract dialogue from given text as voice sheets.
| Scenario | Best Model | Score |
|---|---|---|
| Multiple speakers | Qwen 3.5 Plus (2026-02-15) | 100.00% |
| Simple | GPT-4o Mini (temp=0) | 100.00% |
| Simple (1-shot) | DeepSeek-V2 Chat | 100.00% |
| Simple (5-shot) | Z.AI GLM 4.5 | 100.00% |
| Unattributed dialogue | Claude 3.5 Sonnet | 100.00% |
Write N of X
Write exactly N words/sentences/paragraphs...
| Scenario | Best Model | Score |
|---|---|---|
| paragraphs | ||
| 1 paragraph summary | DeepSeek V3 (2025-03-24) | 100.00% |
| 3 paragraph summary | ByteDance Seed 2.0 Lite | 100.00% |
| 5 paragraph summary | MoonshotAI: Kimi K2.5 | 100.00% |
| sentences | ||
| 1 sentence summary | Mistral Large 3 | 100.00% |
| 3 sentence summary | Z.AI GLM 5 Turbo | 100.00% |
| 10 sentence summary | Mistral Small 3.2 24B | 100.00% |
| 20 sentence summary | Claude Opus 4.6 (Reasoning) | 100.00% |
| 50 sentence summary | Qwen 3.5 35B | 100.00% |
| words | ||
| 10 word summary | GPT-5.1 | 100.00% |
| 20 word summary | Z.AI GLM 4.7 | 100.00% |
| 50 word summary | Z.AI GLM 5 Turbo | 100.00% |
| 100 word summary | Z.AI GLM 4.7 | 100.00% |
| 200 word summary | Gemini 3.1 Pro (Preview) | 100.00% |
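
Counting words and paragraphs for constraints like these can be sketched as below, assuming words are whitespace-separated and paragraphs are separated by blank lines. The page does not specify how the benchmark counts either unit.

```python
import re


def count_units(text: str) -> dict[str, int]:
    """Count whitespace-separated words and blank-line-separated paragraphs."""
    stripped = text.strip()
    paragraphs = [p for p in re.split(r"\n\s*\n", stripped) if p.strip()]
    return {"words": len(stripped.split()), "paragraphs": len(paragraphs)}
```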