Tests

This page lists the best-scoring model and its score for each test, broken down by scenario.

Bad Writing Habits

Detects common prose quality anti-patterns in AI-generated creative writing, including passive voice, past progressive overuse, weak dialogue tags, filter words, purple prose, cliches, AI-ism words/adverbs/names, and more.

Scenario	Best Model	Score
Detailed Writing Rules
Fantasy: entering an ancient ruin	GPT-5.4 (Reasoning)	92.13%
Horror: alone in an eerie place at night	GPT-5.4 (Reasoning)	92.58%
Literary fiction: old friends reunite	Claude Opus 4	91.78%
Mystery: examining a crime scene	GPT-5.4 (Reasoning, Low)	93.24%
Romance: separated couple reunites	GPT-5.4 (Reasoning, Low)	92.41%
Thriller: chase through city streets	GPT-5.4	92.82%
genre
Fantasy: entering an ancient ruin	GPT-5.5 (Reasoning, Low)	90.13%
Horror: alone in an eerie place at night	GPT-5.5 (Reasoning, Low)	88.53%
Literary fiction: old friends reunite	GPT-5.5	89.76%
Mystery: examining a crime scene	GPT-5.4 (Reasoning)	92.47%
Romance: separated couple reunites	GPT-5.4	92.16%
Thriller: chase through city streets	GPT-5.4	92.04%
Novelcrafter Default Prompt
Fantasy: entering an ancient ruin	GPT-5.4	90.45%
Horror: alone in an eerie place at night	GPT-5.4 (Reasoning)	91.68%
Literary fiction: old friends reunite	Grok 4.20 (Reasoning)	94.33%
Mystery: examining a crime scene	GPT-5.4 (Reasoning)	92.51%
Romance: separated couple reunites	GPT-5.4 (Reasoning, Low)	90.19%
Thriller: chase through city streets	GPT-5.4 (Reasoning)	92.77%

Codex Violation Detection

Detects factual inconsistencies between a story bible and prose passages. The model must output structured XML identifying each violation with paragraph number and substring.

Scenario	Best Model	Score
matrix
Large codex (40 entries), long passage (1,019 words)	Qwen3.6 Max Preview	98.08%
Large codex (40 entries), short passage (165 words)	Claude Sonnet 4.5	100.00%
Small codex (7 entries), long passage (734 words)	Claude Opus 4.5	100.00%
Small codex (7 entries), short passage (165 words)	Z.AI GLM 5	100.00%
tiers
5 codex entries	Hermes 3 405B	100.00%
10 codex entries	Qwen 3.6 27B	100.00%
20 codex entries	Claude Opus 4.6 (Reasoning)	100.00%
40 codex entries	Qwen 3.5 Plus (2026-04-20)	100.00%
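
The structured output described for this test can be pictured with a small parsing sketch. The XML schema here (a `violations` root, `violation` elements with a `paragraph` attribute and a `substring` child) is an assumption made for illustration, as are the helper names and the sample passage; the benchmark's actual format is not given on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical model output; element and attribute names are assumptions.
SAMPLE_OUTPUT = """
<violations>
  <violation paragraph="2"><substring>her green eyes</substring></violation>
  <violation paragraph="3"><substring>the northern gate</substring></violation>
</violations>
"""


def parse_violations(xml_text):
    """Return (paragraph_number, substring) pairs from the model's XML."""
    root = ET.fromstring(xml_text.strip())
    return [
        (int(node.get("paragraph")), (node.findtext("substring") or "").strip())
        for node in root.iter("violation")
    ]


def locate(violations, paragraphs):
    """Check that each reported substring occurs in the paragraph it names.

    Paragraph numbers are assumed to be 1-based.
    """
    results = []
    for number, substring in violations:
        in_range = 1 <= number <= len(paragraphs)
        found = in_range and substring in paragraphs[number - 1]
        results.append((number, substring, found))
    return results


if __name__ == "__main__":
    passage = [
        "Mara arrived at dusk.",
        "She blinked, her green eyes adjusting.",
        "They left by the southern gate.",
    ]
    for row in locate(parse_violations(SAMPLE_OUTPUT), passage):
        print(row)
```

A check like this only confirms that a reported span exists in the passage; deciding whether the span truly contradicts the codex would still need reference answers.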

Codex Extraction

Evaluates a model's ability to extract structured codex entries (characters, locations, objects, lore) from prose passages and return them as well-formed XML.

Scenario	Best Model	Score
Long: The Spire of Echoes (Dense)	Claude Opus 4.8 (Reasoning)	99.21%
Medium: The Hollow (Inferred)	Claude Opus 4.8 (Reasoning)	99.41%
Medium: Through the Thornveil (Scattered)	Gemini 3.5 Flash (Reasoning, Minimal)	99.08%
Short: The Rusty Lantern (Explicit)	Z.AI GLM 5.1	99.49%

Codex Red Herring (False Positive Detection)

Tests whether models correctly report "no violations" when a codex is fully consistent with the prose passage. Models that hallucinate false violations (false positives) fail. Uses a 2×2 matrix of text length × codex size, with bare and detailed-entry variants.

Scenario	Best Model	Score
basic entries
Long text (~1594 words), big codex (51 entries)	GPT-5 Nano	100.00%
Long text (~1594 words), small codex (11 entries)	Claude Opus 4.6 (Reasoning)	100.00%
Short text (~524 words), big codex (51 entries)	Claude Opus 4.8 (Reasoning, Low)	100.00%
Short text (~524 words), small codex (11 entries)	Gemma 4 26B (Reasoning)	100.00%
detailed entries
Long text (~1594 words), big codex (51 detailed entries)	Xiaomi MIMO v2.5	100.00%
Long text (~1594 words), small codex (11 detailed entries)	Z.AI GLM 5 Turbo	100.00%
Short text (~524 words), big codex (51 detailed entries)	GPT-5.4 Mini (Reasoning, Low)	100.00%
Short text (~524 words), small codex (11 detailed entries)	Grok 4	100.00%
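
The 2×2 matrix with two entry variants yields the eight scenarios listed above. A minimal sketch of that enumeration follows, using the word and entry counts from the table; the function and constant names are hypothetical and not taken from the benchmark.

```python
from itertools import product

# Values taken from the scenario table; names are illustrative only.
TEXT_LENGTHS = {"Long": 1594, "Short": 524}
CODEX_SIZES = {"big": 51, "small": 11}
VARIANTS = ("basic", "detailed")


def scenario_names():
    """Build one label per (variant, text length, codex size) combination."""
    names = []
    for variant, (length, words), (size, entries) in product(
        VARIANTS, TEXT_LENGTHS.items(), CODEX_SIZES.items()
    ):
        detail = "detailed " if variant == "detailed" else ""
        names.append(
            f"{length} text (~{words} words), {size} codex ({entries} {detail}entries)"
        )
    return names


if __name__ == "__main__":
    for name in scenario_names():
        print(name)
```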

Data extraction

Extract key details from a given block of text.

Scenario	Best Model	Score
All valid emails	Claude Haiku 4.5	100.00%
Contextual pronoun	Gemini 2.5 Flash Lite	100.00%
Fruits excluding citrus	Ministral 3 14B	100.00%
Future event time	Gemini 3 Pro (Preview)	100.00%
Guess the pet	GPT-OSS 120B	100.00%
Highest-rated movie	Skyfall 36B V2	100.00%
Indirect birth year	Gemini 3 Flash (Preview)	100.00%
What instrument does Lucy play?	Nemotron 3 Super	100.00%
What's the color of the car?	GPT-5.4 Mini	100.00%
Who's the sister?	Claude Opus 4.8 (Reasoning)	100.00%
Who's the tallest?	GPT-4.1 Mini	100.00%

Dialogue tags

Various tasks related to dialogue tags in text.

Scenario	Best Model	Score
dialogue-200
Write 200 words with 10% dialogue	Gemini 3.1 Pro (Preview)	100.00%
Write 200 words with 50% dialogue	Gemini 3.1 Pro (Preview)	100.00%
Write 200 words with 90% dialogue	Gemini 3.1 Pro (Preview)	98.93%
dialogue-500
Write 500 words with 30% dialogue	Qwen3.7 Max	100.00%
Write 500 words with 50% dialogue	Gemini 3.1 Pro (Preview)	100.00%
Write 500 words with 70% dialogue	Gemini 3.1 Pro (Preview)	99.97%
Ungrouped
Write unattributed dialogue	Writer: Palmyra X5	100.00%
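
One way to measure a target such as "50% dialogue" is the share of words that fall inside quotation marks. The sketch below assumes straight double quotes and whitespace-delimited words; how the benchmark actually measures the ratio is not stated here, and the function name is hypothetical.

```python
import re


def dialogue_ratio(text):
    """Return the fraction of words that sit inside straight double quotes.

    Assumption: dialogue is delimited by paired '"' characters and words
    are whitespace-separated tokens.
    """
    total_words = len(text.split())
    if total_words == 0:
        return 0.0
    quoted_spans = re.findall(r'"([^"]*)"', text)
    dialogue_words = sum(len(span.split()) for span in quoted_spans)
    return dialogue_words / total_words


if __name__ == "__main__":
    sample = 'She waved. "Come in," he said.'
    print(f"{dialogue_ratio(sample):.2f}")
```

Curly quotes, single-quote dialogue, and dialogue spanning paragraphs would need extra handling in a real checker.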

Language Comprehension

Does the model understand more than just English?

Scenario	Best Model	Score
Asking for directions (Dutch)	Gemini 3.1 Flash Lite (Reasoning)	100.00%
Asking for directions (German)	Z.AI GLM 5.1	100.00%
Friend got new kittens (German)	Inception Mercury	100.00%
Friend got new kittens (Tagalog)	Hermes 3 70B	100.00%

Language Writing

Can the model generate text in different languages?

Scenario	Best Model	Score
Character dialogue (French) in a story	Qwen 3.5 27B	100.00%
Character dialogue (German) in a story	DeepSeek-V2 Chat	100.00%
Character dialogue (Hindi) in a story	GPT-4o Mini (temp=1)	100.00%
Character dialogue (Italian) in a story	Gemma 4 31B	100.00%
Character dialogue (Spanish) in a story	Gemini 3.1 Flash Lite	100.00%

Novel outline

Handle questions about the outline of a novel in various formats.

Scenario	Best Model	Score
outline-count
Count acts	Claude Opus 4.7	100.00%
Count chapters	GPT-5.4 (Reasoning, Low)	100.00%
Count scenes	Gemini 3 Pro (Preview)	100.00%
pov-count
Count points of view for Jack and Olivia	Gemini 3.1 Pro (Preview)	100.00%
Count points of view for Jack Harper	Claude Sonnet 4.5	100.00%
Count points of view for Olivia	Gemini 2.5 Flash Lite (Reasoning)	100.00%

Tool usage within Novelcrafter

Output messages related to tool usage within Novelcrafter.

Scenario	Best Model	Score
Create alternate prose sections	GPT-5 Nano	100.00%

Relationship tree

Extracts a deterministic XML family and relationship tree from cumulative literary prose.

Scenario	Best Model	Score
Core relationship tree	GPT-5.4 (Reasoning)	98.71%
Family relationship tree	GPT-5.4 (Reasoning)	92.77%

N-Length Sentences

Write sentences with exactly N words.

Scenario	Best Model	Score
Write sentences with 5 words each	GPT-5.4 Mini (Reasoning)	100.00%
Write sentences with 10 words each	Gemma 4 26B (Reasoning)	100.00%
Write sentences with 20 words each	GPT-OSS 120B	100.00%
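
A check for "exactly N words per sentence" could look like the sketch below. It assumes sentences end in `.`, `!`, or `?` followed by whitespace and that words are whitespace-separated; the benchmark's own tokenization and scoring are not described on this page, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def sentence_lengths(text):
    """Word count of each sentence, counting whitespace-separated tokens."""
    return [len(sentence.split()) for sentence in split_sentences(text)]


def fraction_with_n_words(text, n):
    """Share of sentences whose word count is exactly n."""
    lengths = sentence_lengths(text)
    if not lengths:
        return 0.0
    return sum(1 for length in lengths if length == n) / len(lengths)


if __name__ == "__main__":
    sample = "The old gate creaked open. Rain fell on cold stone. Nobody came."
    print(sentence_lengths(sample))
    print(fraction_with_n_words(sample, 5))
```

Abbreviations such as "Dr." would be split incorrectly by this splitter, which is one reason a production checker would likely use a more careful segmenter.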

Text Replacement

Tests deterministic text transformations: renaming characters/locations, expanding contractions, tense rewriting, POV shifts, gender swaps, combined transformations, and word avoidance. Scored by checking each expected change independently.

Scenario	Best Model	Score
Generic Prompt
Avoid said/asked/replied/answered	Mistral Medium 3.1	100.00%
Character rename: Elena->Mirabel, Gregor->Aldric	Mistral Large 3	100.00%
Combined: 3rd person past → 1st person present	Z.AI GLM 5 Turbo	99.83%
Expand all contractions	Z.AI GLM 4.5 Air	100.00%
Location rename: market square, outer ring, bridge, northern mines	Mistral Small 3.2 24B	100.00%
Multi-character gender swap: Priya(F)->Rohan(M), Mara unchanged	Claude 3.7 Sonnet	100.00%
Passive voice → active voice	Claude Opus 4.8 (Reasoning)	98.46%
POV shift: 3rd person to 1st person (Elena's perspective)	MiniMax M3	100.00%
Tense rewriting: past to present	Claude Sonnet 4.5	99.91%
Specific Prompt
Avoid said/asked/replied/answered	GPT-5.4 (Reasoning)	100.00%
Character rename: Elena->Mirabel, Gregor->Aldric	Gemma 4 26B	100.00%
Combined: 3rd person past → 1st person present	Gemini 3.5 Flash (Reasoning)	100.00%
Expand all contractions	Claude 3.7 Sonnet	100.00%
Location rename: market square, outer ring, bridge, northern mines	GPT-5.5	100.00%
Multi-character gender swap: Priya(F)->Rohan(M), Mara unchanged	Claude Sonnet 4.6 (Reasoning)	100.00%
Passive voice → active voice	Gemini 3.1 Pro (Preview)	99.23%
POV shift: 3rd person to 1st person (Elena's perspective)	GPT-5.4 Mini (Reasoning)	100.00%
Tense rewriting: past to present	Mistral Large 2	100.00%
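
Scoring each expected change independently can be sketched for the character-rename scenario as follows. Every rename contributes two checks (old name gone, new name present) and the score is the fraction of checks that pass. This is an illustrative assumption about the scoring, not the benchmark's implementation, and the helper names are hypothetical.

```python
import re


def contains_word(text, word):
    """True if `word` appears in `text` as a whole word."""
    return re.search(rf"\b{re.escape(word)}\b", text) is not None


def score_rename(output, renames):
    """Score a rename task as the fraction of independent checks passed.

    For each old -> new pair, one check requires the old name to be absent
    and another requires the new name to be present.
    """
    checks = []
    for old, new in renames.items():
        checks.append(not contains_word(output, old))
        checks.append(contains_word(output, new))
    return sum(checks) / len(checks) if checks else 0.0


if __name__ == "__main__":
    renames = {"Elena": "Mirabel", "Gregor": "Aldric"}
    print(score_rename("Mirabel met Gregor at the bridge.", renames))
    print(score_rename("Mirabel met Aldric at the bridge.", renames))
```

Partial credit falls out naturally: an output that renames one character but misses the other passes half of the checks.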

Voice/dialogue sheets

Extract dialogue from given text as voice sheets.

Scenario	Best Model	Score
Multiple speakers	Gemini 3.5 Flash (Reasoning, Minimal)	100.00%
Simple	Claude Sonnet 4.6 (Reasoning)	100.00%
Simple (1-shot)	GPT-5.5	100.00%
Simple (5-shot)	GPT-5.1	100.00%
Unattributed dialogue	GPT-5 Mini	100.00%

Write N of X

Write exactly N words/sentences/paragraphs...

Scenario	Best Model	Score
paragraphs
1 paragraph summary	Qwen 3.5 Flash	100.00%
3 paragraph summary	ByteDance Seed 2.0 Lite	100.00%
5 paragraph summary	Claude Sonnet 4.5	100.00%
sentences
1 sentence summary	Mistral Medium 3.1	100.00%
3 sentence summary	GPT-5.4 (Reasoning)	100.00%
10 sentence summary	Grok 4.20 (Reasoning)	100.00%
20 sentence summary	GPT-5 Mini	100.00%
50 sentence summary	Grok 4.3 (Reasoning)	100.00%
words
10 word summary	o4 Mini	100.00%
20 word summary	Grok 4.3 (Reasoning)	100.00%
50 word summary	Z.AI GLM 4.7	100.00%
100 word summary	Z.AI GLM 5 Turbo	100.00%
200 word summary	Qwen3.7 Max	100.00%
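
The three unit types in this test (words, sentences, paragraphs) can be counted with one small helper. The sketch assumes whitespace-separated words, sentences ending in `.`, `!`, or `?`, and paragraphs separated by blank lines; these rules and the function names are assumptions, since the page does not say how the benchmark counts.

```python
import re


def count_units(text, unit):
    """Count words, sentences, or paragraphs in `text` under simple rules."""
    text = text.strip()
    if not text:
        return 0
    if unit == "words":
        return len(text.split())
    if unit == "sentences":
        return len([s for s in re.split(r"(?<=[.!?])\s+", text) if s])
    if unit == "paragraphs":
        return len([p for p in re.split(r"\n\s*\n", text) if p.strip()])
    raise ValueError(f"unknown unit: {unit}")


def meets_target(text, unit, n):
    """True if the text contains exactly n of the requested unit."""
    return count_units(text, unit) == n


if __name__ == "__main__":
    summary = "First point. Second point.\n\nThird point."
    for unit in ("words", "sentences", "paragraphs"):
        print(unit, count_units(summary, unit))
```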