google/gemini-2.5-flash

Gemini 2.5 Flash (Reasoning)

Release Date

Jun 17th, 2025

Context Size

1M tokens

Reasoning

Yes

Benchmark Cost

$5.73

Speed

198.2 tok/s

Categories

Creative Writing: 76.3%
Tooling: 100.0%
Language: 86.1%
Utility: 82.2%
Reasoning: 93.8%
Text Editing: 98.1%
Rule Following: 60.0%
Hallucination: 95.6%

Subcategories

Chart of subcategory scores (axis 20% to 100%): AI-isms, Prose Variety, Dialogue, Purple Prose, Mechanical Style, Clichés, XML, Comprehension, Generation, Word Counting, Sentence Counting, Paragraph Counting, Structural Counting, Data Extraction, Deduction, Attention, Transformation, Preservation, Structural Integrity, Constraint Adherence, False Positives, Content Invention, Output Corruption

Bad Writing Habits

Detects common prose quality anti-patterns in AI-generated creative writing, including passive voice, past progressive overuse, weak dialogue tags, filter words, purple prose, cliches, AI-ism words/adverbs/names, and more.

Scenario #1 #2 #3 #4 #5 Total
Detailed Writing Rules
Creative Writing, Hallucination
76 72 72 72 72 73%
Creative Writing, Hallucination
83 79 78 76 73 78%
Creative Writing, Hallucination
87 82 80 79 76 81%
Creative Writing, Hallucination
84 80 78 76 71 78%
Creative Writing, Hallucination
82 80 78 77 75 78%
Creative Writing, Hallucination
85 83 80 78 78 81%
Detailed Writing Rules: 77.98%
genre
Creative Writing, Hallucination
75 71 68 68 66 70%
Creative Writing, Hallucination
78 76 74 67 64 72%
Creative Writing, Hallucination
82 75 72 70 69 74%
Creative Writing, Hallucination
83 81 80 79 69 78%
Creative Writing, Hallucination
77 75 73 71 70 73%
Creative Writing, Hallucination
85 83 81 79 77 81%
genre: 74.62%
Novelcrafter Default Prompt
Creative Writing, Hallucination
75 75 73 72 71 73%
Creative Writing, Hallucination
81 80 80 75 73 78%
Creative Writing, Hallucination
78 75 73 70 69 73%
Creative Writing, Hallucination
85 81 80 77 76 80%
Creative Writing, Hallucination
75 74 73 73 72 74%
Creative Writing, Hallucination
89 86 85 83 71 83%
Novelcrafter Default Prompt: 76.65%
76.42%
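
The section total above appears consistent with an unweighted mean of the three group averages. The source does not state its aggregation formula, so the following is only a minimal check under that assumption, using the group averages listed in this section:

```python
from statistics import mean

# Group averages as listed in the Bad Writing Habits section.
group_averages = {
    "Detailed Writing Rules": 77.98,
    "genre": 74.62,
    "Novelcrafter Default Prompt": 76.65,
}

# Assumption: the section total is the plain mean of the group averages,
# rounded to two decimals. The page does not document this.
overall = round(mean(list(group_averages.values())), 2)
print(overall)
```

Under this assumption the result matches the 76.42% shown for the section.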

Codex Extraction

Evaluates a model's ability to extract structured codex entries (characters, locations, objects, lore) from prose passages and return them as well-formed XML.
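
To make "well-formed XML" concrete, here is a minimal sketch of how a returned codex could be parsed and sanity-checked. The element and attribute names (`codex`, `entry`, `type`, `name`) are hypothetical; the benchmark's actual schema is not given in the source.

```python
import xml.etree.ElementTree as ET

# Entry kinds named in the benchmark description.
ALLOWED_TYPES = {"character", "location", "object", "lore"}

def parse_codex(xml_text):
    """Return (type, name) pairs; raise ValueError on malformed output.

    The <entry type="..."><name>...</name></entry> layout is an assumed,
    illustrative schema, not the benchmark's documented format.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc
    entries = []
    for entry in root.iter("entry"):
        kind = entry.get("type")
        name = (entry.findtext("name") or "").strip()
        if kind not in ALLOWED_TYPES or not name:
            raise ValueError(f"invalid entry: type={kind!r}, name={name!r}")
        entries.append((kind, name))
    return entries

sample = """<codex>
  <entry type="character"><name>Mira</name></entry>
  <entry type="location"><name>Harbor Gate</name></entry>
</codex>"""
print(parse_codex(sample))
```

A response with an unclosed tag or an unknown entry type would raise `ValueError` here, which is one plausible way such output could be counted as a failure.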

Scenario #1 #2 #3 #4 #5 Total
Tooling, Reasoning
96 96 96 94 94 95%
Tooling, Reasoning
94 94 93 91 90 92%
Tooling, Reasoning
97 97 96 88 69 89%
Tooling, Reasoning
100 94 91 86 83 91%
91.93%

Codex Red Herring (False Positive Detection)

Tests whether models correctly report "no violations" when a codex is fully consistent with the prose passage. Models that hallucinate false violations (false positives) fail. Uses a 2×2 matrix of text length × codex size, with bare and detailed-entry variants.
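
The scenario layout described above (a 2×2 matrix in two variants, eight scenarios in total) can be sketched as follows. The axis labels `short`/`long` and `small`/`large` are assumptions for illustration, and the pass/fail helper only captures the stated rule that any reported violation on a consistent codex is a false positive.

```python
from itertools import product

TEXT_LENGTHS = ("short", "long")    # hypothetical axis labels
CODEX_SIZES = ("small", "large")    # hypothetical axis labels
VARIANTS = ("basic entries", "detailed entries")

# Two variants x (2 text lengths x 2 codex sizes) = 8 scenarios.
scenarios = [
    {"variant": variant, "text_length": length, "codex_size": size}
    for variant, length, size in product(VARIANTS, TEXT_LENGTHS, CODEX_SIZES)
]

def is_false_positive(reported_violations):
    """Every codex here is consistent, so any reported violation is wrong."""
    return len(reported_violations) > 0

for scenario in scenarios:
    print(scenario)
```

This yields four scenarios per variant, which lines up with the four result rows listed under each of "basic entries" and "detailed entries".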

Scenario #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Total
basic entries
Hallucination
100 100 100 100 100 100 100 100 25 25 85%
Hallucination
100 100 100 100 100 100 100 100 100 100 100%
Hallucination
100 100 100 100 100 100 100 100 100 25 93%
Hallucination
100 100 100 100 100 100 100 100 100 100 100%
basic entries: 94.38%
detailed entries
Hallucination
100 100 100 100 100 100 100 100 25 25 85%
Hallucination
100 100 25 25 25 25 25 25 25 25 40%
Hallucination
100 100 100 100 100 100 100 100 100 25 93%
Hallucination
100 100 100 100 100 100 100 100 100 100 100%
detailed entries: 79.38%
86.88%

Codex Violation Detection

Detects factual inconsistencies between a story bible and prose passages. The model must output structured XML identifying each violation with paragraph number and substring.
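
As an illustration of the output contract described above, the sketch below parses a violations report and checks that each quoted substring really occurs in the paragraph it cites. The XML layout (`violations`, `violation`, a `paragraph` attribute, a `substring` child) is a hypothetical schema; the source only says the output carries a paragraph number and a substring.

```python
import xml.etree.ElementTree as ET

def locate_violations(xml_text, paragraphs):
    """Return (paragraph_number, substring, found) for each reported violation.

    Paragraph numbers are assumed to be 1-based. The tag and attribute
    names are illustrative, not the benchmark's documented format.
    """
    root = ET.fromstring(xml_text)
    results = []
    for violation in root.iter("violation"):
        number = int(violation.get("paragraph"))
        substring = violation.findtext("substring") or ""
        in_range = 1 <= number <= len(paragraphs)
        found = in_range and bool(substring) and substring in paragraphs[number - 1]
        results.append((number, substring, found))
    return results

paragraphs = [
    "Mira has green eyes.",
    "Her blue eyes narrowed at the gate.",
]
response = (
    '<violations><violation paragraph="2">'
    "<substring>blue eyes</substring>"
    "</violation></violations>"
)
print(locate_violations(response, paragraphs))
```

A report that cites the wrong paragraph, or a substring that does not appear verbatim, comes back with `found` set to `False`.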

Scenario #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Total
matrix
Tooling, Reasoning
97 96 95 95 95 95 95 95 93 91 95%
Tooling, Reasoning
100 98 96 96 96 94 93 93 93 91 95%
Tooling, Reasoning
100 100 100 97 97 95 91 91 87 87 95%
Tooling, Reasoning
100 100 100 100 100 100 100 100 100 100 100%
matrix: 96.12%
tiers
Tooling, Reasoning
100 100 100 100 100 100 100 100 100 100 100%
Tooling, Reasoning
100 100 100 100 100 100 100 100 100 100 100%
Tooling, Reasoning
100 97 97 97 96 92 89 89 88 88 93%
Tooling, Reasoning
100 100 100 100 100 100 97 95 95 92 98%
tiers: 97.82%
96.97%

Dialogue tags

Various tasks related to dialogue tags in text.

Scenario #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Total
dialogue-200
Rule Following, Utility
64 50 50 43 30 26 12 0 0 0 27%
Rule Following, Utility
100 98 77 50 50 50 50 50 49 0 57%
Rule Following, Utility
53 50 50 50 49 49 49 38 37 18 44%
dialogue-200: 42.94%
dialogue-500
Rule Following, Utility
60 50 50 47 41 29 28 22 7 0 33%
Rule Following, Utility
84 50 48 42 22 17 10 1 0 0 27%
Rule Following, Utility
44 35 13 8 4 4 2 1 0 0 11%
dialogue-500: 23.91%
Ungrouped
Rule Following
100 100 100 100 100 100 100 100 61 14 87%
41.14%

Language Comprehension

Does the model understand more than just English?

Scenario #1 #2 #3 #4 #5 Total
Language
100 100 100 100 0 80%
Language
100 100 100 100 0 80%
Language
100 100 0 0 0 40%
Language
100 100 100 100 100 100%
75.00%

N-Length Sentences

Write sentences with exactly N words
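
A minimal sketch of how an exact word count might be verified. It assumes words are whitespace-separated tokens; the benchmark's real tokenization rules (hyphenated words, numbers, contractions) are not described in the source.

```python
def word_count(sentence):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(sentence.split())

def has_exactly_n_words(sentence, n):
    """True when the sentence has exactly n whitespace-separated tokens."""
    return word_count(sentence) == n

print(has_exactly_n_words("The quick brown fox jumps.", 5))
print(has_exactly_n_words("The quick brown fox jumps high.", 5))
```

Under this definition a hyphenated compound such as "well-known" counts as one word, which is a design choice a stricter checker might make differently.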

Scenario #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Total
Utility
100 99 99 99 99 97 97 96 95 95 98%
Utility
97 89 87 86 85 84 78 75 71 65 82%
Utility
51 34 12 10 9 6 3 1 0 0 13%
64.00%

Novel outline

Handle questions about the outline of a novel in various formats

Scenario #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Total
outline-count
Utility
100 100 100 100 100 100 100 100 100 100 100%
Utility
100 100 100 100 100 100 100 100 100 100 100%
Utility
100 100 100 100 100 100 100 100 100 100 100%
outline-count: 100.00%
pov-count
Utility
100 100 100 100 100 100 100 50 50 0 80%
Utility
100 100 100 100 100 100 100 100 100 100 100%
Utility
100 100 100 100 100 100 100 100 100 100 100%
pov-count: 93.33%
96.67%

Text Replacement

Tests deterministic text transformations: renaming characters/locations, expanding contractions, tense rewriting, POV shifts, gender swaps, combined transformations, and word avoidance. Scored by checking each expected change independently.
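
One way to read "checking each expected change independently" is as a list of separate checks whose pass rate becomes the score. The sketch below assumes simple substring checks and a percentage score; both the function name and the check format are hypothetical, not the benchmark's actual scorer.

```python
def score_replacement(output, must_contain, must_not_contain):
    """Score a transformed text as the percentage of independent checks passed.

    must_contain: strings expected after the transformation.
    must_not_contain: strings that should have been replaced or avoided.
    Each check is evaluated on its own, so one miss does not fail the rest.
    """
    checks = [expected in output for expected in must_contain]
    checks += [banned not in output for banned in must_not_contain]
    if not checks:
        return 100.0
    return 100 * sum(checks) / len(checks)

# Rename "Anna" to "Elena" and expand the contraction "didn't".
full = score_replacement(
    "Elena did not go to Port Varn.",
    must_contain=["Elena", "did not"],
    must_not_contain=["Anna", "didn't"],
)
partial = score_replacement(
    "Elena didn't go to Port Varn.",
    must_contain=["Elena", "did not"],
    must_not_contain=["Anna", "didn't"],
)
print(full, partial)
```

With independent checks, a model that renames the character but leaves the contraction in place still earns partial credit, which would explain scores such as 89% or 95% rather than all-or-nothing results.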

Scenario #1 #2 #3 #4 #5 #6 #7 Total
Generic Prompt
Text Editing
100 100 100 100 100 100 100 100%
Text Editing
100 100 100 100 100 100 100 100%
Text Editing
99 99 98 98 98 98 98 98%
Text Editing
100 100 100 100 100 100 100 100%
Text Editing
100 100 100 100 100 100 100 100%
Text Editing
99 99 99 99 99 99 95 99%
Text Editing, Hallucination
96 95 95 95 94 94 93 94%
Text Editing
100 100 100 100 100 100 100 100%
Text Editing
96 96 93 89 89 81 81 89%
Generic Prompt: 97.85%
Specific Prompt
Text Editing
100 100 100 100 100 100 100 100%
Text Editing
100 100 100 100 100 100 100 100%
Text Editing
99 99 97 97 93 90 89 95%
Text Editing
100 100 100 100 100 100 100 100%
Text Editing
100 100 100 100 100 100 100 100%
Text Editing
100 100 100 100 100 100 100 100%
Text Editing, Hallucination
98 98 97 97 97 96 94 97%
Text Editing
100 100 100 100 100 100 100 100%
Text Editing
100 100 100 100 100 100 100 100%
Specific Prompt: 99.08%
98.46%

Tool usage within Novelcrafter

Output messages that are related to tool usage within Novelcrafter

Scenario #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Total
Tooling
100 100 100 100 100 100 100 100 100 100 100%
100.00%

Voice/dialogue sheets

Extract dialogue from given text as voice sheets.

Scenario #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 Total
Rule Following
100 100 100 100 100 100 0 0 0 0 60%
Rule Following
100 0 0 0 0 0 0 0 0 0 10%
1-shot Rule Following
0 0 0 0 0 0 0 0 0 0 0%
Few-shot Rule Following
100 100 0 0 0 0 0 0 0 0 20%
Rule Following
100 100 100 100 100 100 100 100 100 100 100%
38.00%