NC Bench
Dialogue tags
Various tasks related to dialogue tags in text.
Write 500 words with 70% dialogue
Tags: 0-shot, Creative writing, Rule following
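The page does not say how the 70% dialogue target is measured. As a rough illustration only, one way such a requirement could be checked is to count the words inside double-quoted spans and divide by the total word count. The function name `dialogue_share` and the quoting convention are assumptions made for this sketch, not NC Bench's scoring method.

```python
import re


def dialogue_share(text: str) -> float:
    """Return the fraction of words that sit inside double-quoted spans.

    Assumption for this sketch: dialogue is marked with straight or curly
    double quotes, and a "word" is any whitespace-separated token.
    """
    total_words = len(text.split())
    if total_words == 0:
        return 0.0
    # Capture the text between an opening and a closing double quote.
    quoted_spans = re.findall(r'["“]([^"”]*)["”]', text)
    dialogue_words = sum(len(span.split()) for span in quoted_spans)
    return dialogue_words / total_words


sample = '"Where are you going?" she asked. "Out," he said.'
share = dialogue_share(sample)
print(f"{share:.0%}")  # 5 of 9 words are quoted, so this prints 56%
```

A real grader would also need to handle single-quoted dialogue, nested quotes, and the 500-word length target, none of which this sketch attempts.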
| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7 | Run 8 | Run 9 | Run 10 | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| o4 Mini High | 100% | 95% | 83% | 64% | 60% | 56% | 46% | 44% | 29% | 0% | 58% |
| GPT-4o, Aug. 6th (temp=0) | 99% | 97% | 83% | 65% | 46% | 46% | 46% | 46% | 35% | 1% | 56% |
| Claude 3.7 Sonnet | 72% | 70% | 66% | 64% | 59% | 50% | 49% | 41% | 35% | 22% | 53% |
| GPT-4o, Aug. 6th (temp=1) | 99% | 93% | 86% | 50% | 44% | 43% | 43% | 2% | 0% | 0% | 46% |
| Claude 3.5 Sonnet | 98% | 92% | 73% | 50% | 46% | 44% | 41% | 3% | 3% | 0% | 45% |
| DeepSeek-V2 Chat | 89% | 84% | 50% | 44% | 44% | 40% | 19% | 17% | 11% | 5% | 40% |
| Claude Opus 4.6 | 56% | 50% | 50% | 50% | 50% | 45% | 30% | 26% | 18% | 18% | 39% |
| Claude 3.5 Sonnet (new) | 50% | 50% | 50% | 50% | 45% | 43% | 19% | 16% | 5% | 0% | 33% |
| GPT-4o Mini (temp=0) | 50% | 50% | 50% | 49% | 41% | 38% | 34% | 14% | 0% | 0% | 32% |
| GPT-4 Turbo | 100% | 91% | 59% | 49% | 14% | 0% | 0% | 0% | 0% | 0% | 31% |
| Llama 3.1 405B | 50% | 46% | 38% | 37% | 32% | 31% | 22% | 19% | 12% | 2% | 29% |
| Gemini 3 Flash (Preview) | 51% | 50% | 50% | 49% | 38% | 18% | 18% | 14% | 0% | 0% | 29% |
| Llama 3.1 Euryale 70B v2.2 | 75% | 64% | 50% | 47% | 32% | 6% | 2% | 0% | 0% | 0% | 28% |
| Claude Opus 4.5 | 50% | 50% | 48% | 45% | 41% | 14% | 10% | 10% | 5% | 3% | 28% |
| Cohere Command R+ (Aug. 2024) | 79% | 50% | 44% | 32% | 31% | 12% | 10% | 10% | 4% | 0% | 27% |
| Llama 3.1 70B | 71% | 47% | 47% | 45% | 45% | 11% | 3% | 1% | 0% | 0% | 27% |
| Claude Opus 4 | 48% | 47% | 43% | 38% | 38% | 18% | 18% | 17% | 1% | 0% | 27% |
| Claude 3.5 Haiku | 50% | 50% | 48% | 45% | 37% | 10% | 8% | 0% | 0% | 0% | 25% |
| Mistral Small Creative | 50% | 49% | 45% | 39% | 27% | 14% | 12% | 12% | 0% | 0% | 25% |
| EVA Qwen 2.5 14B | 83% | 50% | 50% | 37% | 6% | 3% | 0% | 0% | 0% | 0% | 23% |
| Sao10K L3.1 70B Hanami x1 | 49% | 49% | 47% | 47% | 20% | 3% | 2% | 0% | 0% | 0% | 22% |
| Mistral Large 2 | 50% | 49% | 45% | 36% | 17% | 9% | 4% | 4% | 0% | 0% | 21% |
| Qwen 2.5 72B | 50% | 47% | 37% | 35% | 30% | 12% | 0% | 0% | 0% | 0% | 21% |
| Llama 3.1 8B | 48% | 47% | 37% | 35% | 18% | 10% | 7% | 0% | 0% | 0% | 20% |
| Claude 3 Haiku | 49% | 44% | 30% | 25% | 22% | 14% | 13% | 1% | 0% | 0% | 20% |
| o4 Mini | 56% | 55% | 49% | 29% | 6% | 0% | 0% | 0% | 0% | 0% | 19% |
| Gemini 2.5 Flash Lite | 47% | 45% | 43% | 22% | 18% | 10% | 0% | 0% | 0% | 0% | 19% |
| Claude 3.0 Sonnet | 50% | 45% | 42% | 21% | 10% | 8% | 3% | 1% | 0% | 0% | 18% |
| Llama 3.2 3B | 43% | 38% | 36% | 26% | 23% | 11% | 0% | 0% | 0% | 0% | 18% |
| Claude Sonnet 4.5 | 46% | 43% | 37% | 18% | 17% | 16% | 1% | 0% | 0% | 0% | 18% |
| GPT-4o Mini (temp=1) | 50% | 49% | 48% | 24% | 1% | 0% | 0% | 0% | 0% | 0% | 17% |
| Qwen 2 72B | 37% | 34% | 29% | 27% | 24% | 16% | 4% | 1% | 0% | 0% | 17% |
| Gemma 2 27B | 50% | 35% | 33% | 32% | 14% | 3% | 2% | 2% | 0% | 0% | 17% |
| Llama 3.2 90B (Vision) | 99% | 35% | 21% | 7% | 6% | 1% | 1% | 0% | 0% | 0% | 17% |
| MythoMist 7B | 85% | 43% | 34% | 6% | 0% | 0% | 0% | 0% | 0% | 0% | 17% |
| Ministral 3B | 67% | 43% | 24% | 18% | 15% | 0% | 0% | 0% | 0% | 0% | 17% |
| GPT-4o, May 13th (temp=1) | 54% | 45% | 41% | 26% | 0% | 0% | 0% | 0% | 0% | 0% | 17% |
| Llama 3.1 Nemotron 70B | 49% | 43% | 38% | 23% | 5% | 2% | 2% | 1% | 1% | 0% | 16% |
| Inflection 3 (Productivity) | 48% | 43% | 40% | 27% | 2% | 1% | 0% | 0% | 0% | 0% | 16% |
| GPT-4.1 Mini | 50% | 48% | 47% | 12% | 2% | 1% | 0% | 0% | 0% | 0% | 16% |
| Rocinante 12B | 50% | 46% | 18% | 15% | 8% | 7% | 6% | 0% | 0% | 0% | 15% |
| Magnum v2 72B | 67% | 45% | 36% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 15% |
| Llama 3 TenyxChat-DaybreakStorywriter 70B | 49% | 45% | 43% | 5% | 3% | 1% | 0% | 0% | 0% | 0% | 15% |
| Hermes 3 70B | 45% | 39% | 23% | 22% | 8% | 4% | 2% | 1% | 0% | 0% | 14% |
| Mistral NeMO | 50% | 43% | 41% | 5% | 0% | 0% | 0% | 0% | 0% | 0% | 14% |
| Llama 3.2 1B | 50% | 46% | 23% | 13% | 1% | 0% | 0% | 0% | 0% | 0% | 13% |
| Gemini Pro 1.5 | 49% | 49% | 11% | 7% | 5% | 4% | 4% | 2% | 0% | 0% | 13% |
| GPT-4.1 | 50% | 43% | 38% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 13% |
| MoonshotAI: Kimi K2.5 | 48% | 45% | 15% | 8% | 7% | 6% | 0% | 0% | 0% | 0% | 13% |
| Llama 3 70B | 47% | 31% | 23% | 8% | 8% | 5% | 1% | 0% | 0% | 0% | 12% |
| Gemini Flash 1.5 | 41% | 39% | 26% | 11% | 5% | 1% | 0% | 0% | 0% | 0% | 12% |
| AI21 Jamba 1.5 Large | 50% | 49% | 22% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 12% |
| Phi-3 Medium 128k | 50% | 30% | 30% | 10% | 0% | 0% | 0% | 0% | 0% | 0% | 12% |
| Gemini 2.5 Pro | 47% | 45% | 14% | 14% | 0% | 0% | 0% | 0% | 0% | 0% | 12% |
| AI21 Jamba 1.5 Mini | 41% | 34% | 24% | 15% | 2% | 0% | 0% | 0% | 0% | 0% | 12% |
| Magnum 72B | 45% | 34% | 17% | 14% | 1% | 1% | 0% | 0% | 0% | 0% | 11% |
| Z.AI GLM 4.6 | 40% | 30% | 21% | 14% | 5% | 0% | 0% | 0% | 0% | 0% | 11% |
| Inflection 3 (PI) | 50% | 44% | 14% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 11% |
| Hermes 3 405B | 44% | 36% | 15% | 11% | 1% | 0% | 0% | 0% | 0% | 0% | 11% |
| Claude 2.0 | 58% | 25% | 19% | 2% | 1% | 0% | 0% | 0% | 0% | 0% | 11% |
| Llama 3.2 11B (Vision) | 43% | 26% | 25% | 2% | 2% | 1% | 1% | 1% | 0% | 0% | 10% |
| Qwen 2 7B | 47% | 37% | 7% | 5% | 4% | 1% | 0% | 0% | 0% | 0% | 10% |
| Phi-3.5 Mini 128k | 49% | 38% | 9% | 3% | 1% | 0% | 0% | 0% | 0% | 0% | 10% |
| Claude Haiku 4.5 | 64% | 19% | 6% | 5% | 4% | 1% | 1% | 0% | 0% | 0% | 10% |
| Lumimaid v0.2 8B | 50% | 47% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 10% |
| MythoMax 13B | 50% | 33% | 5% | 3% | 0% | 0% | 0% | 0% | 0% | 0% | 9% |
| WizardLM 2 8x22b | 49% | 19% | 13% | 3% | 0% | 0% | 0% | 0% | 0% | 0% | 8% |
| Writer: Palmyra X5 | 49% | 30% | 2% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 8% |
| Z.AI GLM 4.7 | 41% | 18% | 18% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 8% |
| Llama 3 Euryale 70B v2.1 | 43% | 30% | 2% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 8% |
| Gemini 3 Pro (Preview) | 71% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 7% |
| Cohere Command R+ (Apr. 2024) | 40% | 18% | 12% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 7% |
| GPT-4.1 Nano | 24% | 15% | 14% | 13% | 1% | 1% | 0% | 0% | 0% | 0% | 7% |
| Claude Sonnet 4 | 50% | 6% | 3% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 6% |
| Mistral Large | 34% | 18% | 4% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 6% |
| Mistral Nemo 12B Celeste | 48% | 2% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 5% |
| lzlv 70B | 50% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 5% |
| MN GRAND Gutenberg Lyra4 12B Madness | 48% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 5% |
| Ministral 8B | 49% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 5% |
| Hermes 2 Theta 8B | 49% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 5% |
| Toppy M 7B | 45% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 5% |
| Fimbulvetr 11B v2 | 46% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 5% |
| Gemini 2.5 Flash | 30% | 13% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 4% |
| Z.AI GLM 4.7 Flash | 38% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 4% |
| Gemma 2 9B | 31% | 3% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 4% |
| Phi-3 Mini 128k | 20% | 7% | 2% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 3% |
| AI21 Jamba | 12% | 9% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 2% |
| Z.AI GLM 4.5 | 14% | 4% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 2% |
| GPT-4o, May 13th (temp=0) | 13% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 1% |
| Goliath 120B | 10% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 1% |
| Liquid: LFM 40B MoE | 10% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 1% |
| Mistral Medium | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Claude 2.1 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
Average across all models: 16.06%
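The page does not define the Total column, but the listed figures are consistent with Total being the mean of the ten run scores, rounded to a whole percent. A minimal sketch of that arithmetic, using the o4 Mini High and GPT-4o, Aug. 6th (temp=0) rows from the table; the function name `total_score` and the rounding rule are assumptions for illustration.

```python
def total_score(runs: list[int]) -> int:
    """Mean of the per-run percentages, rounded to the nearest whole percent.

    Assumption: the leaderboard rounds the arithmetic mean; the page itself
    does not document how Total is computed.
    """
    return round(sum(runs) / len(runs))


# Run 1 through Run 10, copied from the table.
o4_mini_high = [100, 95, 83, 64, 60, 56, 46, 44, 29, 0]
gpt_4o_aug_temp0 = [99, 97, 83, 65, 46, 46, 46, 46, 35, 1]

print(total_score(o4_mini_high))      # 577 / 10 = 57.7, rounds to 58
print(total_score(gpt_4o_aug_temp0))  # 564 / 10 = 56.4, rounds to 56
```

Both results match the Total values shown for those rows (58% and 56%). Each row's run scores are listed in descending order, so the run columns look like a ranking of scores rather than the order in which the runs were executed.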