Gemma 3 4B - NC Bench

google/gemma-3-4b-it

via OpenRouter

Release Date

Mar 12th, 2025

Context Size

128k

Reasoning

Benchmark Cost

$0.11

Speed

75.8 tok/s

Subcategories

Bad Writing Habits

Detects common prose quality anti-patterns in AI-generated creative writing, including passive voice, past progressive overuse, weak dialogue tags, filter words, purple prose, cliches, AI-ism words/adverbs/names, and more.

Scenario	#1	#2	#3	#4	#5	Total
Detailed Writing Rules
Fantasy: entering an ancient ruin Creative Writing Hallucination	79	74	71	71	71	73%
Horror: alone in an eerie place at night Creative Writing Hallucination	79	72	71	69	66	71%
Literary fiction: old friends reunite Creative Writing Hallucination	78	73	71	69	64	71%
Mystery: examining a crime scene Creative Writing Hallucination	76	74	74	73	70	73%
Romance: separated couple reunites Creative Writing Hallucination	74	73	72	70	70	72%
Thriller: chase through city streets Creative Writing Hallucination	81	80	77	71	70	76%
Detailed Writing Rules						72.79%
genre
Fantasy: entering an ancient ruin Creative Writing Hallucination	78	69	67	67	67	70%
Horror: alone in an eerie place at night Creative Writing Hallucination	77	72	71	70	68	72%
Literary fiction: old friends reunite Creative Writing Hallucination	76	75	71	70	68	72%
Mystery: examining a crime scene Creative Writing Hallucination	76	75	74	72	70	73%
Romance: separated couple reunites Creative Writing Hallucination	72	72	70	70	63	69%
Thriller: chase through city streets Creative Writing Hallucination	80	80	79	76	74	78%
genre						72.28%
Novelcrafter Default Prompt
Fantasy: entering an ancient ruin Creative Writing Hallucination	78	77	75	73	72	75%
Horror: alone in an eerie place at night Creative Writing Hallucination	76	75	73	71	64	72%
Literary fiction: old friends reunite Creative Writing Hallucination	75	74	74	71	69	72%
Mystery: examining a crime scene Creative Writing Hallucination	77	77	75	74	67	74%
Romance: separated couple reunites Creative Writing Hallucination	74	74	72	67	66	70%
Thriller: chase through city streets Creative Writing Hallucination	80	76	75	75	75	76%
Novelcrafter Default Prompt						73.35%
72.81%

Codex Extraction

Evaluates a model's ability to extract structured codex entries (characters, locations, objects, lore) from prose passages and return them as well-formed XML.

Scenario	#1	#2	#3	#4	#5	Total
Long: The Spire of Echoes (Dense) Tooling Reasoning	89	87	86	86	84	87%
Medium: The Hollow (Inferred) Tooling Reasoning	83	82	81	80	79	81%
Medium: Through the Thornveil (Scattered) Tooling Reasoning	80	79	78	78	76	78%
Short: The Rusty Lantern (Explicit) Tooling Reasoning	90	90	88	86	85	88%
83.33%
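The "well-formed XML" requirement above can be illustrated with a small check. This is a minimal sketch, not the benchmark's scorer: the `codex`, `entry`, and `name` element names are hypothetical, since the report does not show the expected schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical model outputs; element names are assumptions for illustration.
good = '<codex><entry type="character"><name>Mira</name></entry></codex>'
bad = '<codex><entry type="character"><name>Mira</entry></codex>'  # mismatched tag

print(is_well_formed(good), is_well_formed(bad))
```

A parse check like this only covers well-formedness; whether the extracted entries are correct would need a separate comparison against reference entries.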

Codex Red Herring (False Positive Detection)

Tests whether models correctly report "no violations" when a codex is fully consistent with the prose passage. Models that hallucinate false violations (false positives) fail. Uses a 2×2 matrix of text length × codex size, with basic-entry and detailed-entry variants.

Scenario	#1	#2	#3	#4	#5	#6	#7	#8	#9	#10	Total
basic entries
Long text (~1594 words), big codex (51 entries) Hallucination	1	1	1	1	1	1	1	1	1	1	1%
Long text (~1594 words), small codex (11 entries) Hallucination	4	3	3	3	3	3	3	2	2	2	3%
Short text (~524 words), big codex (51 entries) Hallucination	5	5	5	3	3	2	2	1	1	1	3%
Short text (~524 words), small codex (11 entries) Hallucination	7	7	7	6	6	6	6	6	6	4	6%
basic entries											3.13%
detailed entries
Long text (~1594 words), big codex (51 detailed entries) Hallucination	2	1	1	1	1	1	1	1	1	1	1%
Long text (~1594 words), small codex (11 detailed entries) Hallucination	4	3	3	3	3	3	3	3	2	2	3%
Short text (~524 words), big codex (51 detailed entries) Hallucination	2	2	2	1	1	1	1	1	1	1	1%
Short text (~524 words), small codex (11 detailed entries) Hallucination	6	6	5	5	5	5	5	4	4	3	5%
detailed entries											2.48%
2.80%
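The 2×2 matrix with two entry variants yields the eight scenarios listed in the table. A short sketch of how such a grid could be enumerated, assuming the word counts and entry counts shown in the scenario names (the function and variable names are hypothetical):

```python
from itertools import product

# Values taken from the scenario names in the table above.
text_lengths = {"Long": 1594, "Short": 524}  # approximate word counts
codex_sizes = {"big": 51, "small": 11}       # number of codex entries
variants = ["basic", "detailed"]


def scenario_grid():
    """Enumerate variant x text length x codex size as scenario labels."""
    rows = []
    for variant, (length, words), (size, entries) in product(
        variants, text_lengths.items(), codex_sizes.items()
    ):
        kind = "entries" if variant == "basic" else "detailed entries"
        rows.append(f"{length} text (~{words} words), {size} codex ({entries} {kind})")
    return rows


for label in scenario_grid():
    print(label)
```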

Codex Violation Detection

Detects factual inconsistencies between a story bible and prose passages. The model must output structured XML identifying each violation with paragraph number and substring.

Scenario	#1	#2	#3	#4	#5	#6	#7	#8	#9	#10	Total
matrix
Large codex (40 entries), long passage (1,019 words) Tooling Reasoning	43	40	38	37	29	23	21	3	3	3	24%
Large codex (40 entries), short passage (165 words) Tooling Reasoning	40	38	37	36	35	33	33	33	20	17	32%
Small codex (7 entries), long passage (734 words) Tooling Reasoning	61	59	54	49	49	47	44	43	42	42	49%
Small codex (7 entries), short passage (165 words) Tooling Reasoning	48	44	40	40	40	40	39	39	39	38	41%
matrix											36.54%
tiers
5 codex entries Tooling Reasoning	57	57	57	44	42	42	42	42	42	42	46%
10 codex entries Tooling Reasoning	64	58	56	49	49	44	42	39	33	33	47%
20 codex entries Tooling Reasoning	48	40	38	38	37	36	36	36	35	32	38%
40 codex entries Tooling Reasoning	42	40	39	38	37	33	33	33	33	33	36%
tiers											41.78%
39.16%
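To make the expected output shape concrete, here is a sketch of parsing a violation report and checking that each cited substring really occurs in the cited paragraph. The `violations`, `violation`, and `substring` element names and the `paragraph` attribute are assumptions; the report does not publish the actual schema or scoring code.

```python
import xml.etree.ElementTree as ET


def parse_violations(xml_text):
    """Return (paragraph_number, substring) pairs from a hypothetical report."""
    root = ET.fromstring(xml_text)
    return [
        (int(v.get("paragraph")), (v.findtext("substring") or "").strip())
        for v in root.iter("violation")
    ]


def grounded(violations, paragraphs):
    """Keep violations whose substring occurs in the cited paragraph (1-indexed)."""
    return [
        (p, s)
        for p, s in violations
        if 1 <= p <= len(paragraphs) and s and s in paragraphs[p - 1]
    ]


paragraphs = ["The inn stood by the river.", "Her green eyes narrowed."]
report = (
    "<violations>"
    '<violation paragraph="2"><substring>green eyes</substring></violation>'
    '<violation paragraph="1"><substring>blue door</substring></violation>'
    "</violations>"
)
print(grounded(parse_violations(report), paragraphs))
```

The second reported violation is dropped because "blue door" does not appear in paragraph 1, which is the kind of ungrounded citation a scorer would likely penalize.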

Data extraction

Extract key details from a given block of text.

Scenario	#1	#2	#3	#4	#5	#6	#7	#8	#9	#10	Total
All valid emails Utility	100	100	100	100	100	100	100	100	100	100	100%
Contextual pronoun Utility	100	100	100	100	100	100	100	100	100	100	100%
Fruits excluding citrus Utility	100	100	100	100	100	100	100	100	100	100	100%
Future event time Reasoning	100	100	100	100	100	100	100	100	100	100	100%
Guess the pet Reasoning	100	100	100	100	100	100	100	100	100	100	100%
Highest-rated movie Reasoning	100	100	100	100	100	100	100	100	100	100	100%
Indirect birth year Reasoning	100	100	100	100	100	100	100	100	100	100	100%
What instrument does Lucy play? Reasoning	100	100	100	100	100	100	100	100	100	100	100%
What's the color of the car? Utility	100	100	100	100	100	100	100	100	100	100	100%
Who's the sister? Reasoning	100	100	100	100	100	100	100	100	100	100	100%
Who's the tallest? Reasoning	100	100	100	100	100	100	100	100	100	100	100%
100.00%

Dialogue tags

Tasks involving dialogue in prose: writing passages of a given length with a target share of dialogue, and writing unattributed dialogue.

Scenario	#1	#2	#3	#4	#5	#6	#7	#8	#9	#10	Total
dialogue-200
Write 200 words with 10% dialogue Rule Following Utility	50	49	14	3	2	1	0	0	0	0	12%
Write 200 words with 50% dialogue Rule Following Utility	0	0	0	0	0	0	0	0	0	0	0%
Write 200 words with 90% dialogue Rule Following Utility	50	50	50	50	50	50	50	48	43	18	46%
dialogue-200											19.21%
dialogue-500
Write 500 words with 30% dialogue Rule Following Utility	38	2	0	0	0	0	0	0	0	0	4%
Write 500 words with 50% dialogue Rule Following Utility	0	0	0	0	0	0	0	0	0	0	0%
Write 500 words with 70% dialogue Rule Following Utility	0	0	0	0	0	0	0	0	0	0	0%
dialogue-500											1.31%
Ungrouped
Write unattributed dialogue Rule Following	100	61	61	61	14	14	14	14	1	1	34%
13.63%
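One plausible way to measure the share of dialogue in a passage is to count the words inside quotation marks against all words. This is only a sketch under that assumption; the benchmark's real metric and tolerance are not described in this report.

```python
import re


def dialogue_share(text: str) -> float:
    """Fraction of whitespace-separated words inside double-quoted spans."""
    total = len(text.split())
    if total == 0:
        return 0.0
    # Accept straight and curly double quotes as span delimiters.
    quoted = re.findall(r'["\u201c]([^"\u201d]*)["\u201d]', text)
    inside = sum(len(q.split()) for q in quoted)
    return inside / total


sample = '"Stay here," she said. The door creaked open.'
print(dialogue_share(sample))
```

In the sample, 2 of the 8 words sit inside quotes, so the share is 0.25. Single-quoted dialogue and nested quotes are not handled here.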

Language Comprehension

Does the model understand more than just English?

Scenario	#1	#2	#3	#4	#5	Total
Asking for directions (Dutch) Language	100	100	100	0	0	60%
Asking for directions (German) Language	100	100	100	100	100	100%
Friend got new kittens (German) Language	100	0	0	0	0	20%
Friend got new kittens (Tagalog) Language	100	100	100	100	100	100%
70.00%

Language Writing

Can the model generate text in different languages?

Scenario	#1	#2	#3	#4	#5	Total
Character dialogue (French) in a story Language	100	100	100	100	80	96%
Character dialogue (German) in a story Language	100	100	100	92	86	96%
Character dialogue (Hindi) in a story Language	58	50	50	45	0	41%
Character dialogue (Italian) in a story Language	100	100	85	73	0	71%
Character dialogue (Spanish) in a story Language	100	100	83	62	0	69%
74.56%

N-Length Sentences

Write sentences with exactly N words each.

Scenario	#1	#2	#3	#4	#5	#6	#7	#8	#9	#10	Total
Write sentences with 5 words each Utility	100	100	100	98	98	98	94	93	89	77	95%
Write sentences with 10 words each Utility	88	88	87	80	77	75	72	65	63	52	75%
Write sentences with 20 words each Utility	12	10	2	1	0	0	0	0	0	0	3%
57.31%
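A check for "exactly N words per sentence" could look like the following. It is a rough sketch that assumes sentences end with `.`, `!`, or `?` and that words are whitespace-separated; the benchmark's own tokenization is not documented here.

```python
import re


def sentence_word_counts(text):
    """Split on sentence-ending punctuation and count words per sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def share_with_n_words(text, n):
    """Fraction of sentences that contain exactly n words."""
    counts = sentence_word_counts(text)
    if not counts:
        return 0.0
    return sum(c == n for c in counts) / len(counts)


sample = "The old door creaked open. Rain fell on the roof. She ran."
print(sentence_word_counts(sample), share_with_n_words(sample, 5))
```

Abbreviations such as "Dr." would be split as sentence ends by this simple rule, so a production scorer would need a more careful splitter.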

Novel outline

Handle questions about the outline of a novel in various formats.

Scenario	#1	#2	#3	#4	#5	#6	#7	#8	#9	#10	Total
outline-count
Count acts Utility	100	100	100	100	100	100	100	100	100	100	100%
Count chapters Utility	100	100	100	100	100	100	100	100	100	100	100%
Count scenes Utility	0	0	0	0	0	0	0	0	0	0	0%
outline-count											66.67%
pov-count
Count points of view for Jack and Olivia Utility	50	50	50	50	0	0	0	0	0	0	20%
Count points of view for Jack Harper Utility	100	0	0	0	0	0	0	0	0	0	10%
Count points of view for Olivia Utility	0	0	0	0	0	0	0	0	0	0	0%
pov-count											10.00%
38.33%

Relationship tree

Extracts a deterministic XML family and relationship tree from cumulative literary prose.

Scenario	#1	#2	#3	#4	#5	Total
Core relationship tree Tooling Reasoning Hallucination	51	50	46	45	43	47%
Family relationship tree Tooling Reasoning Hallucination	40	39	36	35	30	36%
41.60%

Text Replacement

Tests deterministic text transformations: renaming characters/locations, expanding contractions, tense rewriting, POV shifts, gender swaps, combined transformations, and word avoidance. Scored by checking each expected change independently.

Scenario	#1	#2	#3	#4	#5	#6	#7	Total
Generic Prompt
Avoid said/asked/replied/answered Text Editing	100	97	97	94	94	92	86	94%
Character rename: Elena->Mirabel, Gregor->Aldric Text Editing	100	100	100	100	100	100	100	100%
Combined: 3rd person past → 1st person present Text Editing	70	70	70	70	67	67	67	69%
Expand all contractions Text Editing	89	78	78	78	78	77	76	79%
Location rename: market square, outer ring, bridge, northern mines Text Editing	96	96	96	96	96	96	96	96%
Multi-character gender swap: Priya(F)->Rohan(M), Mara unchanged Text Editing	89	89	89	89	89	89	89	89%
Passive voice → active voice Text Editing Hallucination	80	79	78	76	76	76	74	77%
POV shift: 3rd person to 1st person (Elena's perspective) Text Editing	97	97	95	95	95	95	93	95%
Tense rewriting: past to present Text Editing	67	67	67	67	67	67	67	67%
Generic Prompt								85.09%
Specific Prompt
Avoid said/asked/replied/answered Text Editing	100	100	100	100	97	94	92	98%
Character rename: Elena->Mirabel, Gregor->Aldric Text Editing	100	100	100	100	100	100	100	100%
Combined: 3rd person past → 1st person present Text Editing	98	96	96	96	96	96	95	96%
Expand all contractions Text Editing	72	72	72	72	72	72	72	72%
Location rename: market square, outer ring, bridge, northern mines Text Editing	100	100	100	100	100	100	100	100%
Multi-character gender swap: Priya(F)->Rohan(M), Mara unchanged Text Editing	100	100	100	100	100	100	100	100%
Passive voice → active voice Text Editing Hallucination	80	80	80	80	80	80	80	80%
POV shift: 3rd person to 1st person (Elena's perspective) Text Editing	100	100	100	100	100	100	100	100%
Tense rewriting: past to present Text Editing	97	97	97	97	93	93	93	95%
Specific Prompt								93.45%
89.27%
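"Checking each expected change independently" can be sketched as a per-change pass/fail that is then averaged. The pass rule below (old text gone, new text present) is an assumption chosen for illustration, not the benchmark's documented rule.

```python
def score_replacements(output, expected):
    """Score 0-100: share of (old, new) pairs where old is gone and new is present."""
    if not expected:
        return 0.0
    passed = sum(1 for old, new in expected if old not in output and new in output)
    return 100 * passed / len(expected)


# Rename pairs taken from the "Character rename" scenario name.
expected = [("Elena", "Mirabel"), ("Gregor", "Aldric")]
output = "Mirabel waved at Gregor."
print(score_replacements(output, expected))
```

Here the first rename was applied and the second was missed, so the sketch scores 50.0. Partial credit of this kind would explain row totals such as 67% or 89% that repeat across runs.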

Tool usage within Novelcrafter

Produce output messages for tool usage within Novelcrafter.

Scenario	#1	#2	#3	#4	#5	#6	#7	#8	#9	#10	Total
Create alternate prose sections Tooling	100	100	100	100	100	100	100	100	100	100	100%
100.00%

Voice/dialogue sheets

Extract dialogue from given text as voice sheets.

Scenario	#1	#2	#3	#4	#5	#6	#7	#8	#9	#10	Total
Multiple speakers Rule Following	0	0	0	0	0	0	0	0	0	0	0%
Simple Rule Following	100	100	100	100	0	0	0	0	0	0	40%
Simple (1-shot) 1-shot Rule Following	100	0	0	0	0	0	0	0	0	0	10%
Simple (5-shot) Few-shot Rule Following	100	100	100	100	100	100	100	100	100	100	100%
Unattributed dialogue Rule Following	0	0	0	0	0	0	0	0	0	0	0%
30.00%

Write N of X

Write exactly N words/sentences/paragraphs...

Scenario	#1	#2	#3	#4	#5	#6	#7	#8	#9	#10	Total
paragraphs
1 paragraph summary Utility	100	100	100	100	100	100	100	100	100	100	100%
3 paragraph summary Utility	100	100	100	100	100	100	0	0	0	0	60%
5 paragraph summary Utility	0	0	0	0	0	0	0	0	0	0	0%
paragraphs											53.33%
sentences
1 sentence summary Utility	100	100	100	100	100	100	100	100	100	100	100%
3 sentence summary Utility	100	100	100	100	100	100	100	100	100	100	100%
10 sentence summary Utility	100	100	100	100	100	100	100	100	100	98	100%
20 sentence summary Utility	100	100	100	100	100	100	100	100	100	100	100%
50 sentence summary Utility	100	100	100	100	100	100	92	54	9	0	75%
sentences											95.04%
words
10 word summary Utility	54	27	2	0	0	0	0	0	0	0	8%
20 word summary Utility	0	0	0	0	0	0	0	0	0	0	0%
50 word summary Utility	9	0	0	0	0	0	0	0	0	0	1%
100 word summary Utility	0	0	0	0	0	0	0	0	0	0	0%
200 word summary Utility	98	0	0	0	0	0	0	0	0	0	10%
words											3.80%
50.32%
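The subcategory figure of 50.32% appears consistent with averaging over all 13 scenarios rather than over the three group subtotals. The sketch below reproduces both candidates from the group subtotals and scenario counts in the table; the aggregation rule itself is an inference from the numbers, not something the report states.

```python
# (group subtotal in percent, number of scenarios in the group), from the table above.
groups = {
    "paragraphs": (53.33, 3),
    "sentences": (95.04, 5),
    "words": (3.80, 5),
}


def scenario_weighted_mean(groups):
    """Mean over all scenarios, i.e. group subtotals weighted by scenario count."""
    total = sum(score * n for score, n in groups.values())
    count = sum(n for _, n in groups.values())
    return total / count


def plain_group_mean(groups):
    """Unweighted mean of the group subtotals."""
    return sum(score for score, _ in groups.values()) / len(groups)


print(round(scenario_weighted_mean(groups), 2), round(plain_group_mean(groups), 2))
```

The weighted mean rounds to 50.32, matching the reported figure, while the unweighted group mean would be 50.72.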