NC Bench
Run: 2026-03-18-01KM09FSX63E2TJARWCPRHN4C4
Run Details
Model: GPT-5.4 Mini (Reasoning, Low)
Test: Dialogue tags
Scenario: Write 500 words with 50% dialogue
Duration: 6.2s
Cost: $0.004504
Provider: OpenAI
Evaluation summary
Matches word count: 0.00%
Dialogue to Total Word Ratio: 0.00%
Overall: 0.0000%
Model input
User/Human
You are a author of comedic heist books. Please write a 500 word scene that has 50% words of dialogue between two thieves named Molly and Bill who are planning to break into a children's party to steal all the presents. Only include the prose of the scene without any markdown or commentary.
Model response
Messages
Meta
AI
Molly sat on the hood of a dented blue van, holding a glitter-covered invitation between two fingers as if it might bite. Across from her, Bill was wrestling a sheet of wrapping paper out of the windshield wiper like it had declared war. “Tell me again,” Molly said, “why the best job we could find is stealing from a room full of sugar-drunk six-year-olds?” Bill tugged the paper free. “Because, unlike banks, children’s parties are full of unattended loot and no one expects a professional operation.” “Professional?” Molly said. “You’re wearing a penguin tie and trainers with one sole flapping like a tongue.” “It’s called disguise,” Bill said. “Also, they’re supposed to be clown shoes.” “They’re not clown shoes. They’re shoes in emotional distress.” Bill folded the invitation and squinted at it. “It says here the presents will be stacked in the sunroom, right next to the cake table.” “Mm-hm. And the cake table is beside the back door.” “And the back door leads to the garden.” “And the garden has a temporary fence because the host thinks rabbits are stealing carrots.” Bill nodded slowly. “A rabbit could get through that fence.” “A rabbit could get through your security plan.” He put a hand to his chest. “I resent that.” “You’ve named three jobs after breakfast foods.” “That is not a security flaw.” “It is when your last one was called Operation Waffle and ended with you stuck in a revolving door.” Bill pointed at her. “You still got the necklace.” “We stole a necklace from a man in a bathrobe because you shouted ‘fire drill’ at the wrong floor.” “It worked.” “At extracting us, yes. At subtlety, no.” Bill glanced toward the house at the end of the lane, where a cluster of balloons bobbed over the hedge. “Listen, this is a perfect score. Parents are inside gossiping, children are outside shrieking, and the presents are all tagged with ribbons.” “Ribbons make them easier to carry?” “They make them easier to count.” Molly slid off the hood and dusted off her coat. 
“You’re weirdly sincere about this.” “I take family entertainment seriously.” “You take theft seriously.” “Only the organized kind.” She peered through the hedge. “How many guests?” “According to the invite, twenty-two children, nine adults, one magician, one mascot, and a pony.” “A pony?” “A small pony.” “That doesn’t make it less alarming.” Bill checked his watch. “At 3:15, the entertainer starts balloon animals in the back room. At 3:20, the birthday child opens the big pile. At 3:22, everyone is distracted by wrapping paper, helium, and emotional overload.” “You’ve really thought this through.” “I prepared a flow chart.” Molly stared at him. “You made a flow chart for robbing toddlers.” “Borrowing from toddlers,” he corrected. “And I hate the term robbing. It sounds so rude.” “It is rude.” “It’s also efficient.” She sighed, then smiled in spite of herself. “All right. We go in through the garden, past the rabbit fence, past the pony, and past the magician if he isn’t too observant.” “He’ll be too busy pretending he sawed a chair in half.” “And if we’re caught?” Bill straightened his tie. “We look betrayed and say we’re the entertainment.” Molly laughed once, sharp and quiet. “That’s actually terrible.” “It’s brilliant.” “It’s terrible because it’s brilliant.” He opened the van door and reached for a pair of cardboard gift boxes. “You take the bags. I’ll handle decoys.” “Decoys?” He handed her a plush dinosaur wearing sunglasses. “You walk out carrying this, nobody questions your motives.” Molly took it and blinked. “Why do you own this?” “I don’t own it. I liberated it from a claw machine.” She tucked the dinosaur under her arm. “Bill, if this goes wrong, I want it noted that I protested.” “Noted.” “And if it goes right?” “We’ll be legends.” She looked at the balloons, the hedge, the house bursting with music and innocence and expensive paper. Then she looked back at Bill. “All right,” she said. “But if there’s a sing-along, I’m leaving you behind.” He grinned. 
“Deal. Now come on. The cake table won’t empty itself.”
created_at: "2026-03-18T10:57:28.499Z" (3/18/2026, 10:57:28 AM)
model: "openai/gpt-5.4-mini-20260317"
app_id: 182717
external_user: (null)
streamed: true
cancelled: false
latency: 1206
moderation_latency: (null)
generation_time: 5958
tokens_prompt: 77
tokens_completion: 1222
native_tokens_prompt: 66
native_tokens_completion: 990
native_tokens_completion_images: 0
native_tokens_reasoning: 53
native_tokens_cached: 0
num_media_prompt: (null)
num_input_audio_prompt: (null)
num_media_completion: 0
num_search_results: (null)
origin: "https://ncbench.com/"
is_byok: false
finish_reason: "stop"
native_finish_reason: "completed"
usage: 0.0045045
router: (null)
provider_responses:
0:
endpoint_id: "9ee065b2-3d1c-43bc-bdd7-28af3b148282"
id: "resp_064d050b3d256f530169ba851894b8819783ebcfc8d1623706"
is_byok: false
latency: 178
model_permaslug: "openai/gpt-5.4-mini-20260317"
provider_name: "OpenAI"
status: 200
user_agent: "langchainjs-openai/1.0.0 ((node/v24.14.0; linux; x64))"
http_referer: (null)
api_type: "completions"
id: "gen-1773831448-fRk8o8FKmPIwBsbi601t"
upstream_id: "resp_064d050b3d256f530169ba851894b8819783ebcfc8d1623706"
total_cost: 0.0045045
cache_discount: (null)
upstream_inference_cost: 0
provider_name: "OpenAI"
Evaluation details
Matches word count
Result: 0.00%
Details: n/a
Meta Data: neededClean: false, words: 685
Dialogue to Total Word Ratio
Result: 0.00%
Details: Ratio: 70.91%, Deviation: 20.91%
Meta Data: neededClean: false, wordsTotal: 691, wordsDialogue: 490
Overall: 0.0000%
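The reported ratio and deviation can be reproduced from the two counts in the meta data. The sketch below is illustrative only: the function names are hypothetical, the 50% target is taken from the scenario text, and the quote-based word counter is an assumption about how dialogue might be detected, not NC Bench's actual evaluator. It does not attempt to reproduce the evaluator's own counts (685, 691, 490) or the rule that maps a deviation to a score.

```python
import re

TARGET_RATIO = 0.50  # from the scenario: "Write 500 words with 50% dialogue"


def ratio_and_deviation(words_dialogue, words_total, target=TARGET_RATIO):
    """Return (dialogue ratio, absolute deviation from the target ratio)."""
    ratio = words_dialogue / words_total
    return ratio, abs(ratio - target)


def count_dialogue_words(text):
    """Hypothetical counter: treat words inside curly double quotes as dialogue.

    This is an assumption for illustration; the real evaluator may tokenize
    and detect dialogue differently.
    """
    quoted_spans = re.findall(r"“([^”]*)”", text)
    dialogue = sum(len(span.split()) for span in quoted_spans)
    total = len(text.split())
    return dialogue, total


# Counts as reported in the evaluation meta data above.
ratio, deviation = ratio_and_deviation(490, 691)
print(f"Ratio: {ratio:.2%}, Deviation: {deviation:.2%}")
# prints "Ratio: 70.91%, Deviation: 20.91%"
```

With the reported counts, 490 / 691 gives 70.91%, which is 20.91 percentage points above the 50% target and agrees with the Details line for this evaluator.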