Count dialogue tags

Test: Dialogue tags

Avg. Score: 85.7%
Scenarios: 1

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Mistral Small 4 | 100.0% | $0.0002 | 2.9s | 100%
2 | Gemini 3.1 Flash Lite (Preview) | 100.0% | $0.0004 | 2.0s | 100%
3 | GPT-4o Mini (temp=0) | 100.0% | $0.0002 | 3.4s | 100%
4 | Gemini 3.1 Flash Lite | 100.0% | $0.0004 | 2.0s | 100%
5 | Mistral Small 3.2 24B | 100.0% | $0.0001 | 4.1s | 100%
6 | Claude 3 Haiku | 100.0% | $0.0004 | 3.4s | 100%
7 | Grok 4 Fast | 100.0% | $0.0003 | 4.2s | 100%
8 | Gemini 3.1 Flash Lite (Reasoning) | 100.0% | $0.0004 | 3.6s | 100%
9 | DeepSeek V4 Flash (Reasoning) | 100.0% | $0.0001 | 6.3s | 100%
10 | DeepSeek V4 Flash | 100.0% | $0.0001 | 6.8s | 100%
11 | Gemma 3 12B | 100.0% | $0.0000 | 7.4s | 100%
12 | Gemini 3 Flash (Preview) | 100.0% | $0.0009 | 3.2s | 100%
13 | Ministral 3 14B | 100.0% | $0.0001 | 8.2s | 100%
14 | Grok 4.20 | 100.0% | $0.0007 | 4.9s | 100%
15 | Qwen3 235B A22B Instruct 2507 | 100.0% | $0.0001 | 8.3s | 100%
16 | Mistral Medium 3.1 | 100.0% | $0.0006 | 6.1s | 100%
17 | Mistral Large 3 | 100.0% | $0.0005 | 7.2s | 100%
18 | Llama 3.1 Nemotron 70B | 100.0% | $0.0001 | 9.6s | 100%
19 | DeepSeek V3 (2024-12-26) | 100.0% | $0.0004 | 10.1s | 100%
20 | Grok 4.1 Fast | 100.0% | $0.0003 | 10.2s | 100%
21 | Gemma 3 27B | 100.0% | $0.0001 | 12.5s | 100%
22 | DeepSeek V3 (2025-03-24) | 100.0% | $0.0003 | 14.6s | 100%
23 | Mistral Small 4 (Reasoning) | 100.0% | $0.0010 | 11.0s | 100%
24 | GPT-4o Mini (temp=1) | 100.0% | $0.0002 | 15.5s | 100%
25 | Gemini 3.5 Flash (Reasoning, Minimal) | 100.0% | $0.0025 | 2.6s | 100%
26 | DeepSeek V3.1 | 100.0% | $0.0002 | 15.4s | 100%
27 | Gemma 4 31B | 100.0% | $0.0001 | 15.9s | 100%
28 | DeepSeek-V2 Chat | 100.0% | $0.0001 | 16.1s | 100%
29 | Stealth: Hunter Alpha | 100.0% | $0.0000 | 16.9s | 100%
30 | GPT-4.1 | 100.0% | $0.0023 | 4.7s | 100%
31 | Hermes 3 405B | 100.0% | $0.0000 | 17.7s | 100%
32 | Gemma 4 26B | 100.0% | $0.0001 | 17.2s | 100%
33 | Mistral Large 2 | 100.0% | $0.0019 | 8.1s | 100%
34 | GPT-4o, Aug. 6th (temp=0) | 100.0% | $0.0030 | 3.8s | 100%
35 | GPT-4o, Aug. 6th (temp=1) | 100.0% | $0.0030 | 3.7s | 100%
36 | Writer: Palmyra X5 | 100.0% | $0.0021 | 8.8s | 100%
37 | Aion 2.0 | 100.0% | $0.0011 | 16.7s | 100%
38 | Z.AI GLM 4.7 Flash | 100.0% | $0.0006 | 24.6s | 100%
39 | Z.AI GLM 5 Turbo | 100.0% | $0.0033 | 9.9s | 100%
40 | DeepSeek V4 Pro | 100.0% | $0.0009 | 25.1s | 100%
41 | Claude Sonnet 4 | 100.0% | $0.0044 | 7.2s | 100%
42 | GPT-4o, May 13th (temp=0) | 100.0% | $0.0048 | 5.2s | 100%
43 | Claude Sonnet 4.5 | 100.0% | $0.0045 | 7.3s | 100%
44 | GPT-4o, May 13th (temp=1) | 100.0% | $0.0049 | 5.5s | 100%
45 | Claude 3.5 Sonnet | 100.0% | $0.0049 | 6.2s | 100%
46 | Claude 3.7 Sonnet | 100.0% | $0.0050 | 7.1s | 100%
47 | GPT-5.4 | 100.0% | $0.0048 | 9.6s | 100%
48 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.0048 | 10.1s | 100%
49 | GPT-5.1 | 100.0% | $0.0049 | 12.4s | 100%
50 | GPT-5.4 (Reasoning, Low) | 100.0% | $0.0060 | 10.3s | 100%
51 | Qwen 3.6 35B | 100.0% | $0.0044 | 24.7s | 100%
52 | Claude Opus 4.5 | 100.0% | $0.0077 | 8.7s | 100%
53 | Gemini 2.5 Flash Lite | 96.1% | $0.0001 | 1.4s | 76%
54 | Claude Opus 4.6 | 100.0% | $0.0079 | 9.5s | 100%
55 | GPT-5.4 (Reasoning) | 100.0% | $0.0073 | 15.2s | 100%
56 | Llama 3.1 70B | 96.1% | $0.0003 | 3.2s | 76%
57 | Mistral Large | 100.0% | $0.0091 | 8.4s | 100%
58 | GPT-5.5 | 100.0% | $0.0091 | 9.1s | 100%
59 | Z.AI GLM 4.5 | 96.1% | $0.0005 | 5.8s | 76%
60 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.0091 | 10.3s | 100%
61 | Grok 4.20 (Beta) | 96.1% | $0.0016 | 1.9s | 76%
62 | Grok 4 | 100.0% | $0.0080 | 18.1s | 100%
63 | Z.AI GLM 5.1 | 100.0% | $0.0060 | 30.6s | 100%
64 | Z.AI GLM 4.7 | 100.0% | $0.0027 | 51.3s | 100%
65 | Claude Opus 4.7 (Reasoning) | 100.0% | $0.011 | 6.8s | 100%
66 | Gemma 4 31B (Reasoning) | 100.0% | $0.0004 | 1.1m | 100%
67 | Xiaomi MIMO v2.5 Pro | 96.1% | $0.0014 | 10.3s | 76%
68 | Claude Opus 4.7 | 100.0% | $0.011 | 7.2s | 100%
69 | Z.AI GLM 4.6 | 100.0% | $0.0033 | 51.6s | 100%
70 | MoonshotAI: Kimi K2.5 | 100.0% | $0.0053 | 40.7s | 100%
71 | GPT-5.5 (Reasoning, Low) | 100.0% | $0.011 | 11.1s | 100%
72 | Gemma 4 26B (Reasoning) | 100.0% | $0.0011 | 1.2m | 100%
73 | Z.AI GLM 5 | 100.0% | $0.0049 | 50.1s | 100%
74 | GPT-5.5 (Reasoning) | 100.0% | $0.012 | 11.2s | 100%
75 | Gemini 3.5 Flash (Reasoning) | 100.0% | $0.013 | 7.5s | 100%
76 | Xiaomi MIMO v2.5 | 92.1% | $0.0009 | 6.1s | 69%
77 | Claude Haiku 4.5 | 92.1% | $0.0016 | 4.0s | 69%
78 | Claude Sonnet 4.6 (Reasoning) | 96.1% | $0.0054 | 8.4s | 76%
79 | Mistral Small Creative | 88.2% | $0.0001 | 2.2s | 64%
80 | Gemini 2.5 Flash Lite (Reasoning) | 92.1% | $0.0006 | 20.1s | 69%
81 | Gemini 3 Pro (Preview) | 100.0% | $0.017 | 13.7s | 100%
82 | Claude Sonnet 4.6 | 92.1% | $0.0046 | 7.7s | 69%
83 | Qwen 3.5 Plus (2026-04-20) | 100.0% | $0.0086 | 59.6s | 100%
84 | GPT-5.2 | 92.1% | $0.0046 | 9.8s | 69%
85 | GPT-5 Nano | 96.1% | $0.0023 | 58.8s | 76%
86 | GPT-4.1 Mini | 87.4% | $0.0004 | 3.3s | 45%
87 | DeepSeek V4 Pro (Reasoning) | 100.0% | $0.0085 | 1.6m | 100%
88 | DeepSeek V3.2 | 90.0% | $0.0002 | 13.7s | 40%
89 | Claude Opus 4 | 100.0% | $0.023 | 15.8s | 100%
90 | Cohere Command R+ (Aug. 2024) | 87.4% | $0.0030 | 7.0s | 45%
91 | Inception Mercury 2 | 79.5% | $0.0009 | 1.9s | 44%
92 | Gemini 2.5 Pro | 96.1% | $0.015 | 14.6s | 76%
93 | Gemini 2.5 Flash (Reasoning) | 87.4% | $0.0035 | 7.1s | 45%
94 | Z.AI GLM 4.5 Air | 83.5% | $0.0005 | 15.4s | 44%
95 | GPT-5 Mini | 90.0% | $0.0022 | 14.1s | 40%
96 | Gemini 3.1 Pro (Preview) | 100.0% | $0.024 | 23.5s | 100%
97 | Qwen 3.6 27B | 96.1% | $0.010 | 54.4s | 76%
98 | Qwen 3.5 27B | 100.0% | $0.016 | 1.3m | 100%
99 | o4 Mini High | 96.1% | $0.014 | 34.7s | 76%
100 | ByteDance Seed 1.6 | 90.0% | $0.0023 | 27.7s | 40%
101 | GPT-5.4 Mini (Reasoning) | 78.2% | $0.0020 | 3.7s | 37%
102 | Qwen 3.5 122B | 100.0% | $0.022 | 56.5s | 100%
103 | Grok 4.3 (Reasoning) | 100.0% | $0.018 | 1.4m | 100%
104 | Stealth: Healer Alpha | 74.8% | $0.0000 | 6.3s | 32%
105 | MoonshotAI: Kimi K2.6 | 100.0% | $0.015 | 1.6m | 100%
106 | Gemini 2.5 Flash | 74.4% | $0.0009 | 2.5s | 31%
107 | Qwen 3.6 Flash | 90.0% | $0.0063 | 22.6s | 40%
108 | Qwen 3.5 Plus (2026-02-15) | 81.5% | $0.0008 | 13.3s | 26%
109 | GPT-5 | 100.0% | $0.027 | 51.7s | 100%
110 | GPT-4.1 Nano | 69.7% | $0.0001 | 2.9s | 23%
111 | Grok 4.20 (Reasoning) | 90.0% | $0.0067 | 36.2s | 40%
112 | Grok 4.3 | 68.9% | $0.0006 | 4.0s | 19%
113 | GPT-5.4 Mini | 67.0% | $0.0014 | 2.2s | 22%
114 | Qwen 3 32B | 67.0% | $0.0004 | 23.9s | 22%
115 | Hermes 3 70B | 64.9% | $0.0001 | 12.7s | 17%
116 | MiniMax M2.5 | 86.2% | $0.0044 | 1.2m | 39%
117 | GPT-OSS 120B | 68.8% | $0.0002 | 29.8s | 18%
118 | Mistral NeMO | 47.9% | $0.0000 | 4.0s | 24%
119 | Arcee AI: Trinity Mini | 51.9% | $0.0001 | 3.9s | 19%
120 | Grok 4.20 (Beta, Reasoning) | 90.0% | $0.019 | 14.1s | 40%
121 | Qwen 2.5 72B | 57.5% | $0.0001 | 7.5s | 14%
122 | o4 Mini | 72.1% | $0.0063 | 15.9s | 22%
123 | Qwen3.6 Max Preview | 100.0% | $0.027 | 1.7m | 100%
124 | ByteDance Seed 1.6 Flash | 52.3% | $0.0003 | 6.9s | 15%
125 | Qwen3.7 Max | 100.0% | $0.034 | 1.2m | 100%
126 | LFM2 24B | 48.9% | $0.0000 | 7.1s | 9%
127 | Ministral 3 8B | 37.0% | $0.0001 | 2.0s | 12%
128 | Arcee AI: Trinity Large (Preview) | 47.6% | $0.0000 | 6.8s | 3%
129 | Nemotron 3 Nano | 66.1% | $0.0009 | 50.4s | 10%
130 | Rocinante 12B | 48.9% | $0.0002 | 14.5s | 4%
131 | Inception Mercury | 42.5% | $0.0002 | 2.2s | 4%
132 | Qwen 3.5 Flash | 70.0% | $0.0030 | 51.7s | 8%
133 | Ministral 3B | 40.2% | $0.0000 | 1.8s | 2%
134 | Gemma 3 4B | 33.8% | $0.0000 | 3.4s | 5%
135 | Ministral 3 3B | 29.8% | $0.0000 | 1.5s | 7%
136 | Llama 3.1 8B | 33.8% | $0.0001 | 1.3s | 1%
137 | GPT-5.4 Mini (Reasoning, Low) | 31.1% | $0.0016 | 3.2s | 7%
138 | MiniMax M2.7 | 86.1% | $0.0081 | 2.1m | 38%
139 | Nemotron 3 Super | 56.1% | $0.0000 | 56.2s | 5%
140 | ByteDance Seed 2.0 Mini | 60.0% | $0.0011 | 1.0m | 2%
141 | Qwen 3.5 9B | 71.4% | $0.0012 | 1.8m | 12%
142 | WizardLM 2 8x22b | 25.6% | $0.0005 | 12.6s | 6%
143 | Qwen 3.5 397B A17B | 100.0% | $0.026 | 3.1m | 100%
144 | Ministral 8B | 19.9% | $0.0000 | 2.1s | 1%
145 | Stealth: Aurora Alpha | 49.7% | | 3.3s | 11%
146 | GPT-5.4 Nano | 12.9% | $0.0004 | 2.1s | 0%
147 | GPT-5.4 Nano (Reasoning, Low) | 9.1% | $0.0004 | 2.0s | 1%
148 | GPT-5.4 Nano (Reasoning) | 8.9% | $0.0004 | 2.3s | 0%
149 | Qwen 3.5 35B | 70.0% | $0.016 | 53.3s | 8%
150 | ByteDance Seed 2.0 Lite | 0.0% | $0.0028 | 33.4s | 0%

Individual Scenarios

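Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 92.1%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 92.1%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 92.1%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 92.1%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 92.1%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 61 | 88.2%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 14 | 87.4%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 14 | 87.4%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 14 | 87.4%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 1 | 86.2%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 0 | 86.1%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 14 | 83.5%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 14 | 1 | 81.5%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 61 | 14 | 79.5%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 61 | 0 | 78.2%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 14 | 14 | 74.8%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 61 | 61 | 1 | 74.4%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 0 | 0 | 72.1%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 14 | 0 | 0 | 71.4%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 61 | 14 | 1 | 69.7%
Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 14 | 14 | 1 | 68.9%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 14 | 14 | 0 | 68.8%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 61 | 61 | 61 | 61 | 14 | 14 | 67.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 61 | 61 | 61 | 61 | 14 | 14 | 67.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 0 | 0 | 0 | 66.1%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 14 | 14 | 1 | 64.9%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 61 | 61 | 14 | 14 | 14 | 14 | 57.5%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 61 | 0 | 0 | 0 | 0 | 56.1%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 61 | 61 | 61 | 14 | 14 | 14 | 0 | 52.3%
Arcee AI: Trinity Mini | 100 | 100 | 61 | 61 | 61 | 61 | 61 | 14 | 1 | 1 | 51.9%
Stealth: Aurora Alpha | 100 | 100 | 100 | 61 | 61 | 61 | 14 | 1 | 0 | 0 | 49.7%
LFM2 24B | 100 | 100 | 100 | 61 | 61 | 14 | 14 | 14 | 14 | 14 | 48.9%
Rocinante 12B | 100 | 100 | 100 | 100 | 61 | 14 | 14 | 1 | 0 | 0 | 48.9%
Mistral NeMO | 100 | 61 | 61 | 61 | 61 | 61 | 61 | 14 | 1 | 0 | 47.9%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 61 | 14 | 1 | 1 | 0 | 0 | 47.6%
Inception Mercury | 100 | 100 | 100 | 61 | 61 | 1 | 1 | 1 | 0 | 0 | 42.5%
Ministral 3B | 100 | 100 | 100 | 61 | 14 | 14 | 14 | 1 | 0 | 0 | 40.2%
Ministral 3 8B | 100 | 61 | 61 | 61 | 61 | 14 | 14 | 0 | 0 | 0 | 37.0%
Gemma 3 4B | 100 | 61 | 61 | 61 | 14 | 14 | 14 | 14 | 1 | 1 | 33.8%
Llama 3.1 8B | 100 | 100 | 61 | 61 | 14 | 1 | 1 | 1 | 0 | 0 | 33.8%
GPT-5.4 Mini (Reasoning, Low) | 61 | 61 | 61 | 61 | 14 | 14 | 14 | 14 | 14 | 1 | 31.1%
Ministral 3 3B | 61 | 61 | 61 | 61 | 14 | 14 | 14 | 14 | 1 | 0 | 29.8%
WizardLM 2 8x22b | 100 | 61 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 1 | 25.6%
Ministral 8B | 61 | 61 | 61 | 14 | 1 | 1 | 1 | 0 | 0 | 0 | 19.9%
GPT-5.4 Nano | 100 | 14 | 14 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 12.9%
GPT-5.4 Nano (Reasoning, Low) | 61 | 14 | 14 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 9.1%
GPT-5.4 Nano (Reasoning) | 61 | 14 | 14 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8.9%
ByteDance Seed 2.0 Lite | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%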