NC Bench
Run: 2026-03-06-01KK1J7D2AV2WJ64BQ4B7M8MEK
Run Details
Model: GPT-5.4 (Reasoning)
Test: Dialogue tags
Scenario: Write 200 words with 90% dialogue
Duration: 12.0s
Cost: $0.005205
Provider: OpenAI
Evaluation summary
Score | Evaluator
60.28% | Matches word count
76.18% | Dialogue to Total Word Ratio
68.2276% | Overall
Model input
User/Human
You are a author of comedic heist books. Please write a 200 word scene that has 90% words of dialogue between two thieves named Molly and Bill who are planning to break into a children's party to steal all the presents. Only include the prose of the scene without any markdown or commentary.
Model response
Messages
Meta
AI
“Bill, this is our finest hour,” Molly whispered. “A suburban fortress, brightly frosted, guarded by balloon arches and inattentive parents.” “Finest? It smells like juice boxes and moral decay,” Bill said. “I still say robbing grown-ups has dignity.” “Grown-ups label things,” Molly said. “Children tear labels off. Untraceable loot.” “What exactly are we stealing?” “Everything with a bow.” “That’s not a plan, Molly. That’s a festive mood.” “Listen carefully. We arrive disguised as entertainers.” “I can’t juggle.” “You don’t juggle. You loom.” “I do loom professionally.” “Good. I’ll do the talking. You carry the sack.” “A sack makes me feel traditional.” “Then there’s the gift table. We create a diversion.” “With what?” “Cake catastrophe.” “That’s reckless.” “That’s classic.” “And if someone asks who invited us?” “You say you’re Balloon Kevin.” “I look nothing like a Kevin.” “Exactly. It’s disarming.” “What if there’s a clown?” “We bribe him.” “With what?” “Cupcakes. Clowns are vain, not principled.” “And the escape?” “Through the bouncy castle.” “That’s not an exit. That’s a wobbling confession.” “Bill, greatness requires bounce.” “I hate when you say things that sound embroidered on cushions.” “Hush. Do you want presents or principles?” “How many presents?” “All of them.” “…I’m listening.” “Excellent. Practice smiling.” “I have one, but children fear it.” “Perfect. We’ll call it authority.”
created_at: "2026-03-06T12:33:42.996Z" (3/6/2026, 12:33:42 PM)
model: "openai/gpt-5.4-20260305"
app_id: 182717
external_user: (null)
streamed: true
cancelled: false
latency: 364
moderation_latency: (null)
generation_time: 11947
tokens_prompt: 77
tokens_completion: 489
native_tokens_prompt: 66
native_tokens_completion: 336
native_tokens_completion_images: 0
native_tokens_reasoning: 0
native_tokens_cached: 0
num_media_prompt: (null)
num_input_audio_prompt: (null)
num_media_completion: 0
num_search_results: (null)
origin: "https://ncbench.com/"
is_byok: false
finish_reason: "stop"
native_finish_reason: "completed"
usage: 0.005205
router: (null)
provider_responses:
  0:
    endpoint_id: "9ff5625c-403f-4d7f-b895-58ac7295062c"
    id: "resp_0268da9d9ce09ed80169aac9a70c7c8194831d4c2bf39fb64a"
    is_byok: false
    latency: 138
    model_permaslug: "openai/gpt-5.4-20260305"
    provider_name: "OpenAI"
    status: 200
user_agent: "langchainjs-openai/1.0.0 ((node/v24.14.0; linux; x64))"
http_referer: (null)
api_type: "completions"
id: "gen-1772800422-1goguWupphjMZdav3ZCY"
upstream_id: "resp_0268da9d9ce09ed80169aac9a70c7c8194831d4c2bf39fb64a"
total_cost: 0.005205
cache_discount: (null)
upstream_inference_cost: 0
provider_name: "OpenAI"
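The token counts, generation time, and cost in the metadata can be cross-checked with a little arithmetic. The sketch below is only illustrative: the per-million-token rates are assumptions chosen because they reproduce the reported total, not prices stated anywhere in this run, and it also assumes billing uses the native token counts and that generation_time is in milliseconds.

```python
# Values copied from the run metadata.
native_tokens_prompt = 66
native_tokens_completion = 336
generation_time_ms = 11947  # assumed to be milliseconds
reported_cost_usd = 0.005205

# ASSUMPTION: hypothetical rates in USD per million tokens. They are not
# given in the run metadata; they are picked because they match total_cost.
ASSUMED_INPUT_USD_PER_MTOK = 2.50
ASSUMED_OUTPUT_USD_PER_MTOK = 15.00


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost from token counts under the assumed rates."""
    input_cost = prompt_tokens * ASSUMED_INPUT_USD_PER_MTOK / 1_000_000
    output_cost = completion_tokens * ASSUMED_OUTPUT_USD_PER_MTOK / 1_000_000
    return input_cost + output_cost


estimated_cost_usd = estimate_cost_usd(native_tokens_prompt, native_tokens_completion)

# Output throughput: completion tokens per second of generation time.
tokens_per_second = native_tokens_completion / (generation_time_ms / 1000)

print(f"estimated cost: ${estimated_cost_usd:.6f}")  # $0.005205
print(f"reported cost:  ${reported_cost_usd:.6f}")
print(f"throughput:     {tokens_per_second:.1f} tokens/s")
```

Under these assumptions the estimate agrees with the reported total_cost, which suggests the cost was computed from the native token counts (66 and 336) rather than the normalized counts (77 and 489).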
Evaluation details
Result | Evaluator | Details | Meta Data
60.28% | Matches word count | n/a | neededClean: false, words: 215
76.18% | Dialogue to Total Word Ratio | Ratio: 97.22%, Deviation: 7.22% | neededClean: false, wordsTotal: 216, wordsDialogue: 210
68.2276% | Overall
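The ratio and deviation reported for the dialogue evaluator follow directly from the word counts in its metadata. The sketch below recomputes them, assuming the deviation is the absolute difference from the 90% target in the scenario and that the overall score is the unweighted mean of the two evaluator scores. Both are inferences from the numbers shown, not documented behavior of the benchmark, and the function names are hypothetical. How the 60.28% and 76.18% scores are derived from the raw counts is not shown in the run, so the sketch does not attempt it.

```python
def dialogue_ratio_pct(words_dialogue: int, words_total: int) -> float:
    """Percentage of all words that fall inside dialogue."""
    if words_total <= 0:
        raise ValueError("words_total must be positive")
    return 100.0 * words_dialogue / words_total


def deviation_pct(ratio_pct: float, target_pct: float) -> float:
    """Absolute distance between the measured ratio and the target (assumed definition)."""
    return abs(ratio_pct - target_pct)


# Values copied from the evaluation details.
ratio = dialogue_ratio_pct(words_dialogue=210, words_total=216)
deviation = deviation_pct(ratio, target_pct=90.0)

# ASSUMPTION: overall score is the plain average of the two evaluator scores.
overall = (60.28 + 76.18) / 2

print(f"ratio:     {ratio:.2f}%")      # 97.22%
print(f"deviation: {deviation:.2f}%")  # 7.22%
print(f"overall:   {overall:.2f}%")    # 68.23%
```

The averaged value, 68.23%, matches the reported 68.2276% to within the rounding of the two displayed scores.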