Count dialogue tags

Test: Dialogue tags

Avg. Score: 84.9%
Scenarios: 1

Overall Performance

| Rank | Model | Score | Avg. Cost | Avg. Time | Stability |
|---|---|---|---|---|---|
| 1 | GPT-4o Mini (temp=0) | 100.0% | $0.0002 | 3.4s | 100% |
| 2 | Mistral Small 3.2 24B | 100.0% | $0.0001 | 4.1s | 100% |
| 3 | Grok 4 Fast | 100.0% | $0.0003 | 4.2s | 100% |
| 4 | Claude 3 Haiku | 100.0% | $0.0004 | 3.4s | 100% |
| 5 | Gemma 3 12B | 100.0% | $0.0000 | 7.4s | 100% |
| 6 | Ministral 3 14B | 100.0% | $0.0001 | 8.2s | 100% |
| 7 | Gemini 3 Flash (Preview) | 100.0% | $0.0009 | 3.2s | 100% |
| 8 | Z.AI GLM 4.5 | 100.0% | $0.0005 | 5.8s | 100% |
| 9 | Mistral Medium 3.1 | 100.0% | $0.0006 | 6.1s | 100% |
| 10 | Mistral Large 3 | 100.0% | $0.0005 | 7.2s | 100% |
| 11 | Grok 4.1 Fast | 100.0% | $0.0003 | 10.2s | 100% |
| 12 | DeepSeek V3 (2024-12-26) | 100.0% | $0.0004 | 10.1s | 100% |
| 13 | Gemma 3 27B | 100.0% | $0.0001 | 12.5s | 100% |
| 14 | Claude 3.5 Haiku | 100.0% | $0.0012 | 5.0s | 100% |
| 15 | GPT-4o Mini (temp=1) | 100.0% | $0.0002 | 15.5s | 100% |
| 16 | DeepSeek V3 (2025-03-24) | 100.0% | $0.0003 | 14.6s | 100% |
| 17 | DeepSeek-V2 Chat | 100.0% | $0.0001 | 16.1s | 100% |
| 18 | DeepSeek V3.1 | 100.0% | $0.0002 | 15.4s | 100% |
| 19 | Hermes 3 405B | 100.0% | $0.0000 | 17.7s | 100% |
| 20 | GPT-4.1 | 100.0% | $0.0023 | 4.7s | 100% |
| 21 | Mistral Large 2 | 100.0% | $0.0019 | 8.1s | 100% |
| 22 | Writer: Palmyra X5 | 100.0% | $0.0021 | 8.8s | 100% |
| 23 | GPT-4o, Aug. 6th (temp=0) | 100.0% | $0.0030 | 3.8s | 100% |
| 24 | Z.AI GLM 4.7 Flash | 100.0% | $0.0006 | 24.6s | 100% |
| 25 | Claude Sonnet 4 | 100.0% | $0.0044 | 7.2s | 100% |
| 26 | GPT-4o, May 13th (temp=0) | 100.0% | $0.0048 | 5.2s | 100% |
| 27 | Claude Sonnet 4.5 | 100.0% | $0.0045 | 7.3s | 100% |
| 28 | GPT-4o, May 13th (temp=1) | 100.0% | $0.0049 | 5.5s | 100% |
| 29 | Claude 3.5 Sonnet | 100.0% | $0.0049 | 6.2s | 100% |
| 30 | Claude 3.7 Sonnet | 100.0% | $0.0050 | 7.1s | 100% |
| 31 | Llama 3.1 70B | 96.1% | $0.0003 | 3.2s | 76% |
| 32 | Claude Opus 4.5 | 100.0% | $0.0077 | 8.7s | 100% |
| 33 | Llama 3.1 Nemotron 70B | 96.1% | $0.0001 | 9.6s | 76% |
| 34 | Claude Opus 4.6 | 100.0% | $0.0079 | 9.5s | 100% |
| 35 | Z.AI GLM 4.7 | 100.0% | $0.0027 | 51.3s | 100% |
| 36 | Mistral Large | 100.0% | $0.0091 | 8.4s | 100% |
| 37 | Grok 4 | 100.0% | $0.0080 | 18.1s | 100% |
| 38 | Z.AI GLM 4.6 | 100.0% | $0.0033 | 51.6s | 100% |
| 39 | GPT-4o, Aug. 6th (temp=1) | 96.1% | $0.0030 | 3.7s | 76% |
| 40 | Gemini 2.5 Flash Lite | 92.1% | $0.0001 | 1.4s | 69% |
| 41 | MoonshotAI: Kimi K2.5 | 100.0% | $0.0053 | 40.7s | 100% |
| 42 | Z.AI GLM 5 | 100.0% | $0.0049 | 50.1s | 100% |
| 43 | Claude Haiku 4.5 | 92.1% | $0.0016 | 4.0s | 69% |
| 44 | GPT-5.1 | 96.1% | $0.0049 | 12.4s | 76% |
| 45 | Claude Sonnet 4.6 | 92.1% | $0.0046 | 7.7s | 69% |
| 46 | Gemini 2.5 Pro | 100.0% | $0.015 | 14.6s | 100% |
| 47 | Gemini 3 Pro (Preview) | 100.0% | $0.017 | 13.7s | 100% |
| 48 | Mistral Small Creative | 83.5% | $0.0001 | 2.2s | 44% |
| 49 | DeepSeek V3.2 | 90.0% | $0.0002 | 13.7s | 40% |
| 50 | Gemini 2.5 Flash | 78.3% | $0.0009 | 2.5s | 38% |
| 51 | Cohere Command R+ (Aug. 2024) | 86.2% | $0.0030 | 7.0s | 39% |
| 52 | GPT-4.1 Mini | 82.7% | $0.0004 | 3.3s | 31% |
| 53 | Mistral NeMO | 75.6% | $0.0000 | 4.0s | 36% |
| 54 | ByteDance Seed 1.6 | 90.0% | $0.0023 | 27.7s | 40% |
| 55 | Claude Opus 4 | 100.0% | $0.023 | 15.8s | 100% |
| 56 | o4 Mini High | 96.1% | $0.014 | 34.7s | 76% |
| 57 | GPT-5 Nano | 91.4% | $0.0023 | 58.8s | 48% |
| 58 | Gemini 3.1 Pro (Preview) | 100.0% | $0.024 | 23.5s | 100% |
| 59 | GPT-5 Mini | 78.2% | $0.0022 | 14.1s | 37% |
| 60 | Qwen 3.5 Plus (2026-02-15) | 81.5% | $0.0008 | 13.3s | 26% |
| 61 | Hermes 3 70B | 73.5% | $0.0001 | 12.7s | 26% |
| 62 | GPT-5.2 | 78.8% | $0.0046 | 9.8s | 31% |
| 63 | GPT-5 | 100.0% | $0.027 | 51.7s | 100% |
| 64 | GPT-4.1 Nano | 60.2% | $0.0001 | 2.9s | 12% |
| 65 | Minimax M2.5 | 82.2% | $0.0044 | 1.2m | 38% |
| 66 | ByteDance Seed 1.6 Flash | 52.3% | $0.0003 | 6.9s | 15% |
| 67 | Qwen 2.5 72B | 52.8% | $0.0001 | 7.5s | 7% |
| 68 | o4 Mini | 67.4% | $0.0063 | 15.9s | 14% |
| 69 | Ministral 3 8B | 37.0% | $0.0001 | 2.0s | 12% |
| 70 | Rocinante 12B | 42.5% | $0.0002 | 14.5s | 4% |
| 71 | Gemma 3 4B | 33.8% | $0.0000 | 3.4s | 5% |
| 72 | Arcee AI: Trinity Mini | 33.8% | $0.0001 | 3.9s | 5% |
| 73 | Arcee AI: Trinity Large (Preview) | 37.8% | $0.0000 | 6.8s | 1% |
| 74 | Ministral 3 3B | 29.8% | $0.0000 | 1.5s | 7% |
| 75 | Llama 3.1 8B | 29.9% | $0.0001 | 1.3s | 2% |
| 76 | Ministral 3B | 23.8% | $0.0000 | 1.8s | 0% |
| 77 | Stealth: Aurora Alpha | 50.9% | | 3.3s | 13% |
| 78 | Ministral 8B | 15.0% | $0.0000 | 2.1s | 0% |
| 79 | WizardLM 2 8x22b | 7.7% | $0.0005 | 12.6s | 0% |
| 80 | Qwen 3.5 397B A17B | 92.1% | $0.026 | 3.1m | 69% |

Average score: 84.94%

Individual Scenarios

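| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 96.1% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 92.1% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 92.1% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 92.1% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 92.1% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 14 | 91.4% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 1 | 86.2% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 14 | 83.5% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 14 | 14 | 82.7% |
| Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 1 | 82.2% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 14 | 1 | 81.5% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 14 | 14 | 78.8% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 61 | 1 | 78.3% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 61 | 0 | 78.2% |
| Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 61 | 61 | 14 | 75.6% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 61 | 14 | 0 | 73.5% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 61 | 14 | 0 | 0 | 67.4% |
| GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 61 | 14 | 14 | 14 | 1 | 60.2% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 61 | 14 | 14 | 14 | 14 | 14 | 52.8% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 61 | 61 | 61 | 14 | 14 | 14 | 0 | 52.3% |
| Stealth: Aurora Alpha | 100 | 100 | 100 | 61 | 61 | 61 | 14 | 14 | 0 | 0 | 50.9% |
| Rocinante 12B | 100 | 100 | 100 | 61 | 61 | 1 | 1 | 1 | 0 | 0 | 42.5% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 61 | 14 | 1 | 1 | 1 | 0 | 0 | 37.8% |
| Ministral 3 8B | 100 | 61 | 61 | 61 | 61 | 14 | 14 | 0 | 0 | 0 | 37.0% |
| Arcee AI: Trinity Mini | 100 | 61 | 61 | 61 | 14 | 14 | 14 | 14 | 1 | 1 | 33.8% |
| Gemma 3 4B | 100 | 61 | 61 | 61 | 14 | 14 | 14 | 14 | 1 | 1 | 33.8% |
| Llama 3.1 8B | 100 | 61 | 61 | 61 | 14 | 1 | 1 | 1 | 0 | 0 | 29.9% |
| Ministral 3 3B | 61 | 61 | 61 | 61 | 14 | 14 | 14 | 14 | 1 | 0 | 29.8% |
| Ministral 3B | 100 | 61 | 61 | 14 | 1 | 1 | 1 | 0 | 0 | 0 | 23.8% |
| Ministral 8B | 61 | 61 | 14 | 14 | 1 | 0 | 0 | 0 | 0 | 0 | 15.0% |
| WizardLM 2 8x22b | 61 | 14 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 7.7% |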
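
The Avg column appears to be the arithmetic mean of the ten per-run scores in each row. The sketch below checks that reading on three rows copied from the table; the `runs` dictionary and the `average_score` helper are illustrative names, not part of the benchmark. Because the per-run cells are displayed as rounded integers, a recomputed mean can differ from the published figure by about 0.1 (the Qwen 3.5 397B A17B row recomputes to 92.2 against a displayed 92.1).

```python
from statistics import mean

# Per-run scores as displayed in the table (rounded to whole numbers).
runs = {
    "Llama 3.1 70B": [100] * 9 + [61],
    "GPT-5 Nano": [100] * 9 + [14],
    "WizardLM 2 8x22b": [61, 14, 1, 1, 0, 0, 0, 0, 0, 0],
}


def average_score(scores):
    """Mean of the per-run scores, rounded to one decimal like the Avg column."""
    return round(mean(scores), 1)


if __name__ == "__main__":
    for model, scores in runs.items():
        print(f"{model}: {average_score(scores)}%")
```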