Matches word count

Test: Dialogue tags

Avg. Score: 40.5%
Scenarios: 6

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | GPT-5 Mini | 95.2% | $0.011 | 58.5s | 58%
2 | Claude Opus 4.6 | 86.8% | $0.014 | 15.5s | 44%
3 | Gemini 3.1 Flash Lite (Preview) | 78.4% | $0.0007 | 2.9s | 36%
4 | GPT-4o Mini (temp=0) | 77.7% | $0.0003 | 8.5s | 34%
5 | Inception Mercury 2 | 81.6% | $0.0042 | 6.8s | 28%
6 | Claude Sonnet 4.6 | 70.9% | $0.0079 | 12.8s | 33%
7 | GPT-4o, Aug. 6th (temp=0) | 68.9% | $0.0053 | 6.3s | 18%
8 | Z.AI GLM 5 Turbo | 87.5% | $0.035 | 1.4m | 36%
9 | Claude Opus 4.5 | 67.4% | $0.014 | 14.3s | 19%
10 | GPT-5 | 90.1% | $0.053 | 1.5m | 40%
11 | GPT-4o, Aug. 6th (temp=1) | 64.6% | $0.0054 | 6.3s | 12%
12 | Gemini 3 Flash (Preview) | 59.8% | $0.0015 | 5.0s | 12%
13 | Grok 4.20 (Beta) | 57.9% | $0.0029 | 3.2s | 12%
14 | GPT-4.1 | 61.4% | $0.0044 | 7.8s | 9%
15 | Claude Opus 4.6 (Reasoning) | 84.2% | $0.080 | 42.3s | 39%
16 | o4 Mini High | 88.5% | $0.050 | 1.9m | 40%
17 | MiniMax M2.5 | 88.9% | $0.017 | 3.1m | 42%
18 | GPT-4o Mini (temp=1) | 59.1% | $0.0003 | 8.4s | 7%
19 | Grok 4 Fast | 53.3% | $0.0004 | 6.8s | 10%
20 | Inception Mercury | 57.6% | $0.0006 | 9.1s | 6%
21 | Grok 4 | 60.0% | $0.012 | 27.3s | 13%
22 | Nemotron 3 Super | 80.6% | $0.0000 | 2.8m | 29%
23 | o4 Mini | 72.7% | $0.026 | 1.1m | 18%
24 | Gemini 3.1 Pro (Preview) | 98.7% | $0.157 | 2.1m | 80%
25 | MiniMax M2.7 | 95.0% | $0.023 | 4.7m | 57%
26 | Claude 3.7 Sonnet | 50.7% | $0.0085 | 11.6s | 9%
27 | Gemini 2.5 Flash Lite | 41.3% | $0.0002 | 2.6s | 5%
28 | GPT-5.4 Nano (Reasoning) | 54.7% | $0.0042 | 21.2s | 1%
29 | Grok 4.1 Fast | 45.4% | $0.0004 | 10.3s | 3%
30 | Claude Opus 4 | 59.0% | $0.044 | 25.5s | 18%
31 | Gemini 3 Flash (Preview, Reasoning) | 58.0% | $0.019 | 33.7s | 7%
32 | Llama 3.1 70B | 40.6% | $0.0005 | 5.2s | 3%
33 | Mistral Medium 3.1 | 44.0% | $0.0012 | 13.1s | 3%
34 | Llama 3.1 8B | 38.3% | $0.0001 | 2.0s | 1%
35 | Claude Haiku 4.5 | 39.2% | $0.0026 | 6.6s | 2%
36 | GPT-4.1 Mini | 40.2% | $0.0009 | 6.9s | 1%
37 | GPT-5.4 Mini (Reasoning) | 45.3% | $0.0077 | 9.4s | 0%
38 | Stealth: Aurora Alpha | 67.7% |  | 9.2s | 9%
39 | Claude 3 Haiku | 36.0% | $0.0007 | 4.7s | 1%
40 | GPT-4.1 Nano | 37.3% | $0.0002 | 6.0s | 0%
41 | Arcee AI: Trinity Large (Preview) | 37.8% | $0.0000 | 12.2s | 1%
42 | DeepSeek V3 (2024-12-26) | 37.2% | $0.0006 | 19.3s | 3%
43 | Hermes 3 405B | 35.7% | $0.0000 | 26.5s | 3%
44 | Mistral Small 4 | 31.2% | $0.0003 | 5.7s | 1%
45 | GPT-5.4 Nano (Reasoning, Low) | 34.5% | $0.0020 | 10.4s | 0%
46 | GPT-5.4 Mini (Reasoning, Low) | 32.9% | $0.0032 | 4.6s | 0%
47 | GPT-5 Nano | 59.7% | $0.0040 | 1.6m | 4%
48 | Mistral Large 3 | 32.6% | $0.0009 | 12.1s | 0%
49 | LFM2 24B | 31.4% | $0.0001 | 12.6s | 1%
50 | Gemini 2.5 Flash | 30.2% | $0.0015 | 4.0s | 0%
51 | GPT-5.4 Mini | 30.5% | $0.0029 | 4.0s | 0%
52 | DeepSeek V3 (2025-03-24) | 33.3% | $0.0006 | 17.8s | 0%
53 | DeepSeek-V2 Chat | 36.0% | $0.0002 | 29.9s | 0%
54 | GPT-4o, May 13th (temp=1) | 34.9% | $0.0088 | 15.6s | 0%
55 | GPT-5.4 (Reasoning, Low) | 42.5% | $0.017 | 23.5s | 0%
56 | Stealth: Hunter Alpha | 29.6% | $0.0000 | 17.1s | 0%
57 | DeepSeek V3.1 | 33.2% | $0.0006 | 28.3s | 0%
58 | GPT-5.4 (Reasoning) | 53.9% | $0.030 | 39.6s | 1%
59 | GPT-5.2 | 53.3% | $0.027 | 38.3s | 0%
60 | GPT-5.1 | 56.5% | $0.030 | 52.9s | 2%
61 | Qwen 2.5 72B | 27.2% | $0.0003 | 15.6s | 0%
62 | GPT-4o, May 13th (temp=0) | 31.9% | $0.0090 | 14.3s | 0%
63 | Stealth: Healer Alpha | 27.3% | $0.0000 | 19.6s | 0%
64 | Gemini 3 Pro (Preview) | 43.1% | $0.031 | 23.6s | 3%
65 | Mistral Large | 33.5% | $0.015 | 13.3s | 0%
66 | Hermes 3 70B | 25.9% | $0.0002 | 20.8s | 0%
67 | Mistral Small 3.2 24B | 21.1% | $0.0002 | 9.0s | 0%
68 | DeepSeek V3.2 | 34.3% | $0.0005 | 47.8s | 0%
69 | Gemma 3 12B | 21.3% | $0.0001 | 11.9s | 0%
70 | Ministral 8B | 18.6% | $0.0000 | 4.0s | 0%
71 | Ministral 3 3B | 17.2% | $0.0001 | 2.0s | 0%
72 | Aion 2.0 | 25.7% | $0.0016 | 24.7s | 0%
73 | Gemini 2.5 Flash Lite (Reasoning) | 25.2% | $0.0022 | 24.1s | 0%
74 | Qwen 3.5 Plus (2026-02-15) | 22.7% | $0.0013 | 22.1s | 0%
75 | Arcee AI: Trinity Mini | 16.2% | $0.0001 | 5.6s | 0%
76 | Claude 3.5 Haiku | 17.6% | $0.0020 | 7.4s | 0%
77 | Ministral 3B | 13.5% | $0.0000 | 2.5s | 0%
78 | Ministral 3 8B | 13.5% | $0.0001 | 3.3s | 0%
79 | GPT-5.4 Nano | 14.6% | $0.0011 | 5.2s | 0%
80 | Claude Sonnet 4.5 | 20.8% | $0.0080 | 11.8s | 0%
81 | GPT-5.4 | 24.1% | $0.0096 | 18.7s | 0%
82 | Ministral 3 14B | 13.8% | $0.0002 | 6.4s | 0%
83 | Qwen 3.5 27B | 57.5% | $0.025 | 1.9m | 6%
84 | Qwen 3.5 Flash | 41.7% | $0.0050 | 1.4m | 0%
85 | Claude 3.5 Sonnet | 26.4% | $0.0092 | 32.1s | 1%
86 | Mistral Small 4 (Reasoning) | 18.6% | $0.0019 | 22.1s | 0%
87 | ByteDance Seed 2.0 Lite | 26.5% | $0.0036 | 42.6s | 0%
88 | Cohere Command R+ (Aug. 2024) | 15.7% | $0.0054 | 12.4s | 0%
89 | Qwen3 235B A22B Instruct 2507 | 14.4% | $0.0003 | 18.7s | 0%
90 | Writer: Palmyra X5 | 14.3% | $0.0038 | 12.6s | 0%
91 | Z.AI GLM 4.5 | 11.1% | $0.0009 | 8.9s | 0%
92 | ByteDance Seed 1.6 Flash | 11.7% | $0.0006 | 12.2s | 0%
93 | Claude Sonnet 4 | 15.1% | $0.0077 | 11.4s | 0%
94 | Qwen 3.5 35B | 46.6% | $0.024 | 1.3m | 1%
95 | Claude Sonnet 4.6 (Reasoning) | 71.3% | $0.117 | 1.3m | 27%
96 | Llama 3.1 Nemotron 70B | 9.8% | $0.0002 | 15.7s | 0%
97 | Mistral NeMO | 6.5% | $0.0001 | 6.6s | 0%
98 | Gemma 3 4B | 5.3% | $0.0000 | 5.5s | 0%
99 | Qwen 3 32B | 10.8% | $0.0005 | 21.3s | 0%
100 | Qwen 3.5 122B | 47.9% | $0.033 | 1.3m | 2%
101 | Grok 4.20 (Beta, Reasoning) | 45.7% | $0.053 | 38.2s | 3%
102 | Mistral Small Creative | 3.7% | $0.0002 | 3.8s | 0%
103 | Gemini 2.5 Pro | 28.2% | $0.029 | 25.0s | 0%
104 | Rocinante 12B | 9.8% | $0.0003 | 25.1s | 0%
105 | Gemma 3 27B | 5.2% | $0.0001 | 15.7s | 0%
106 | Mistral Large 2 | 2.6% | $0.0031 | 13.0s | 0%
107 | Z.AI GLM 4.7 Flash | 18.3% | $0.0014 | 1.1m | 0%
108 | ByteDance Seed 1.6 | 27.5% | $0.0073 | 1.4m | 0%
109 | Gemini 2.5 Flash (Reasoning) | 12.4% | $0.015 | 29.3s | 0%
110 | WizardLM 2 8x22b | 0.0% | $0.0007 | 22.8s | 0%
111 | Z.AI GLM 4.7 | 40.6% | $0.0069 | 2.4m | 4%
112 | Z.AI GLM 5 | 28.2% | $0.010 | 1.8m | 0%
113 | Nemotron 3 Nano | 70.4% | $0.0042 | 4.8m | 11%
114 | Z.AI GLM 4.6 | 11.5% | $0.0044 | 1.3m | 0%
115 | Qwen 3.5 9B | 25.8% | $0.0017 | 2.5m | 0%
116 | MoonshotAI: Kimi K2.5 | 49.8% | $0.024 | 3.2m | 4%
117 | ByteDance Seed 2.0 Mini | 25.8% | $0.0029 | 3.3m | 0%
118 | Qwen 3.5 397B A17B | 62.6% | $0.039 | 4.9m | 8%
Average score across all models: 40.51%

Individual Scenarios

dialogue-200

Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100100100100.0%
GPT-5.1100100100100100100100100100100100.0%
GPT-5100100100100100100100100100100100.0%
Stealth: Aurora Alpha100100100100100100100100100100100.0%
GPT-5 Nano100100100100100100100100100100100.0%
Nemotron 3 Super100100100100100100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100100100100100100.0%
Inception Mercury 2100100100100100100100100100100100.0%
GPT-5.2100100100100100100100100100100100.0%
MiniMax M2.7100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
Claude Opus 4.6 (Reasoning)100100100100100100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100100100100100100.0%
o4 Mini High100100100100100100100100100100100.0%
o4 Mini100100100100100100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning)1001001001001001001001001009999.9%
GPT-5.4 Mini (Reasoning)1001001001001001001001001009999.9%
Gemini 3.1 Flash Lite (Preview)1001001001001001001001001009999.8%
Claude Opus 4.510010010010010010010099989899.4%
Gemini 3 Flash (Preview, Reasoning)100100100100100100100100989699.4%
MoonshotAI: Kimi K2.510010010010010010010099999499.1%
Claude Opus 4.6100100100100100999999999098.7%
GPT-4o Mini (temp=0)10010010010010010010099989098.6%
GPT-4o Mini (temp=1)100100100100100100100100998698.4%
MiniMax M2.51001001001001001001001001008198.1%
GPT-4.1100100100100100999999968697.9%
GPT-4o, Aug. 6th (temp=1)10010010010099999998949097.8%
Gemini 3 Flash (Preview)10010010010099999898949097.7%
Claude Sonnet 4.6100100999999999998948697.2%
GPT-4o, Aug. 6th (temp=0)10010010010010010010099986095.6%
Qwen 3.5 122B10010010010010010010099944393.6%
GPT-5.4 (Reasoning, Low)10010010010010010010098863591.9%
Qwen 3.5 27B100100100100100100100100992091.9%
Nemotron 3 Nano100100100100100100100100100090.0%
Inception Mercury100100100100100100100100100090.0%
Grok 4100100100100100100999686088.1%
Grok 4 Fast10099999898949486811486.3%
Grok 4.1 Fast100100999494817575756886.1%
Claude Opus 4100100100100100999981681085.6%
Llama 3.1 70B100100989898949081682785.4%
Claude Sonnet 4.510010010010096907568606085.0%
Grok 4.20 (Beta, Reasoning)100100100100100999996272784.8%
GPT-4.1 Nano1001001009996868668605284.8%
Claude Haiku 4.51001001009998969486601084.2%
Qwen 3.5 35B100100100100100100100966080.2%
Qwen 3.5 Flash100100100100100100100990080.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100100430074.4%
Grok 4.20 (Beta)100100100989690865220174.3%
Qwen 3.5 9B1001001001001009999430074.1%
Gemini 3 Pro (Preview)100100100999696684335474.1%
GPT-5.4 Mini (Reasoning, Low)10010099989486686035074.0%
GPT-4.1 Mini10010099999986814327073.6%
Claude 3 Haiku999998989694864310072.2%
Z.AI GLM 4.71001001001009675685210070.1%
Claude Sonnet 41009996909086684320670.0%
DeepSeek V3 (2025-03-24)100999894907575600069.2%
Arcee AI: Trinity Large (Preview)999999969475682727068.3%
Claude 3.5 Haiku9986817575686843432066.0%
Llama 3.1 8B100100999896755261062.7%
Qwen 3.5 Plus (2026-02-15)1001009690867552141061.5%
DeepSeek V3.1100989690686052201158.7%
GPT-4o, May 13th (temp=1)10010099969486620058.3%
LFM2 24B100999975606060146057.5%
Gemini 2.5 Flash100969490754335272056.3%
Gemini 2.5 Pro99969690754327140054.2%
Hermes 3 405B100999081756814100053.8%
Gemma 3 12B100949490683514146652.1%
DeepSeek V3.21001001009675272000051.9%
Ministral 3 3B10010099989610000050.2%
Gemini 2.5 Flash Lite100999999940000049.1%
GPT-5.4 Mini10094818175351420048.3%
Claude 3.5 Sonnet100997568433527106046.3%
DeepSeek V3 (2024-12-26)10010099942727600045.3%
Z.AI GLM 4.59494907552271400044.6%
Mistral Large1009998754327200044.5%
GPT-4o, May 13th (temp=0)999998814320000044.1%
Stealth: Healer Alpha1009481605252100044.0%
Mistral Medium 3.11009694863514600043.1%
ByteDance Seed 2.0 Mini99965252433535140042.6%
Qwen 2.5 72B10010090863510100042.2%
Hermes 3 70B100999686276200041.7%
Gemini 2.5 Flash Lite (Reasoning)1001009494204000041.2%
DeepSeek-V2 Chat9996944335351000041.2%
GPT-5.49490686035272062240.5%
ByteDance Seed 2.0 Lite100999081206200039.9%
Z.AI GLM 5100988686271000039.9%
Mistral Large 310099989600000039.2%
Stealth: Hunter Alpha998152434327000034.6%
Mistral Small 3.2 24B9986603514141000031.9%
ByteDance Seed 1.6100100752760000030.8%
Mistral Small 4100100752000000029.6%
Mistral Small 4 (Reasoning)10099681000000027.8%
Ministral 3 8B1009868000000026.6%
Llama 3.1 Nemotron 70B100994310100000026.1%
GPT-5.4 Nano1009452000000024.6%
Gemma 3 4B1009827642000023.6%
Ministral 3B94524320200000022.9%
Aion 2.01008143400000022.8%
Claude 3.7 Sonnet684343271410622021.6%
Z.AI GLM 4.7 Flash1006827000000019.5%
Gemma 3 27B868114600000018.9%
Gemini 2.5 Flash (Reasoning)946027000000018.1%
Z.AI GLM 4.6945227000000017.3%
Mistral NeMO96520000000014.8%
ByteDance Seed 1.6 Flash86600000000014.7%
Rocinante 12B98430000000014.1%
Ministral 8B99270000000012.6%
Qwen 3 32B96200000000011.6%
Writer: Palmyra X586272000000011.6%
Qwen3 235B A22B Instruct 2507860000000008.7%
Ministral 3 14B4310000000005.3%
Mistral Large 22720110000004.9%
Cohere Command R+ (Aug. 2024)274000000003.1%
Mistral Small Creative40000000000.4%
WizardLM 2 8x22b00000000000.0%
Arcee AI: Trinity Mini00000000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
GPT-5.1100100100100100100100100100100100.0%
MiniMax M2.5100100100100100100100100100100100.0%
GPT-5 Nano100100100100100100100100100100100.0%
GPT-5100100100100100100100100100100100.0%
GPT-5.2100100100100100100100100100100100.0%
Stealth: Aurora Alpha100100100100100100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100100100100100100.0%
GPT-5.4 Mini (Reasoning)1001001001001001001001001009999.9%
MiniMax M2.7100100100100100100100100999999.8%
GPT-5.4 Nano (Reasoning)100100100100100100100100999999.8%
GPT-4o, Aug. 6th (temp=1)100100100100100100100100999999.8%
Nemotron 3 Super1001001001001001001001001009899.8%
Claude Opus 4.610010010010010010010099999999.6%
o4 Mini High1001001001001001001001001009699.6%
Gemini 3.1 Flash Lite (Preview)100100100100100100100100999699.4%
Claude Opus 4.6 (Reasoning)10010010010010010010099999699.3%
Inception Mercury 21001001001001001001001001009099.0%
GPT-4o Mini (temp=1)10010010010010010010099989098.7%
o4 Mini100100100100100100100100998698.5%
Claude Sonnet 4.6 (Reasoning)100100100100100999898948697.5%
Qwen 3.5 397B A17B10010010010010010010099997597.2%
Gemini 3 Pro (Preview)10099999999999694909096.5%
Qwen 3.5 35B10010010010010010010099907596.5%
Qwen 3.5 27B1001001001001001009998907596.2%
GPT-4o, Aug. 6th (temp=0)1001001001001001009998906094.7%
Claude Opus 4.59998989696949494908694.4%
GPT-4.1100100100100100999999991090.5%
Nemotron 3 Nano100100100100100100100100100090.0%
Grok 4.1 Fast100100100100100100999994689.7%
GPT-5.4 (Reasoning, Low)100100100100100100100100523588.7%
GPT-4o Mini (temp=0)1001001009996908686686088.6%
GPT-5.4 Nano (Reasoning, Low)100100100100100100989081087.0%
DeepSeek-V2 Chat100100100100100100948681086.1%
Claude Sonnet 4.69996949090908686755286.0%
Inception Mercury10010010010010099948175084.9%
Grok 4.20 (Beta, Reasoning)1001001001009996908175684.8%
Gemini 3 Flash (Preview, Reasoning)100100100100100100100100271083.7%
GPT-4.1 Nano10010099999696908660082.8%
Arcee AI: Trinity Large (Preview)1001001009490908675682082.4%
Stealth: Hunter Alpha10010099999894867568081.9%
MoonshotAI: Kimi K2.5100100100999490908160081.5%
Grok 4.20 (Beta)1001001001009894867552080.5%
Gemini 3 Flash (Preview)9999999898909068201477.6%
GPT-4.1 Mini999898949086818120074.8%
Claude 3 Haiku100100999881817560272074.2%
Z.AI GLM 4.71001001001009996962714073.2%
Mistral Medium 3.11001001001001009996144171.3%
GPT-5.4 Mini1009994868181756820671.1%
Grok 4100100100100987560436468.6%
Mistral Large1001001001001001008110068.2%
DeepSeek V3.210098948675686843351468.1%
GPT-4o, May 13th (temp=1)1001001009999947500066.8%
DeepSeek V3 (2024-12-26)10010090868181524327066.2%
Qwen 3.5 Flash100100999996906060065.2%
Mistral Large 3100999894816860520065.2%
Llama 3.1 70B1009898868675523510464.3%
ByteDance Seed 1.6100100100100100984300064.1%
GPT-5.4 Mini (Reasoning, Low)10010010096866868106163.5%
DeepSeek V3 (2025-03-24)100100100907560432720061.7%
Hermes 3 405B948181757568524343261.5%
GPT-4o, May 13th (temp=0)100100999481686800061.1%
Qwen 3.5 122B10010010098984327276059.9%
Llama 3.1 8B100999999994343140059.7%
Claude Opus 499908681757543142056.7%
Mistral Small 410099999896601000056.2%
ByteDance Seed 2.0 Mini10010099967568000053.8%
Gemini 2.5 Pro86818175685252350053.2%
Claude Haiku 4.590909081684320206051.1%
GPT-5.4989486818168000051.0%
Qwen 2.5 72B100999998522014102049.4%
DeepSeek V3.1100100998168202020049.1%
Z.AI GLM 4.6999694866820600047.0%
Grok 4 Fast9896906035352700044.2%
Mistral Small 3.2 24B10010099686014000044.1%
Gemini 2.5 Flash Lite1001009875680000044.1%
GPT-5.4 Nano1009894686014400043.8%
ByteDance Seed 2.0 Lite10010010010000000040.0%
Z.AI GLM 4.7 Flash10010010060270000038.7%
LFM2 24B1009086682010620038.3%
Claude 3.5 Haiku949075682710642037.6%
Stealth: Healer Alpha9998986044000036.3%
Writer: Palmyra X599989052104000035.3%
Qwen3 235B A22B Instruct 250710099756800000034.3%
Qwen 3.5 Plus (2026-02-15)1008168352720000033.2%
Hermes 3 70B10010043352720200032.8%
Gemini 2.5 Flash Lite (Reasoning)10099604340000030.7%
Z.AI GLM 596946043101000030.4%
Aion 2.09994862000000029.9%
Claude 3.5 Sonnet7568524335141000029.8%
Claude 3.7 Sonnet816043352720640027.7%
Gemini 2.5 Flash (Reasoning)1009875000000027.3%
Ministral 3B9686682000000027.1%
ByteDance Seed 1.6 Flash10010052000000025.2%
Qwen 3.5 9B10068433500000024.6%
Gemini 2.5 Flash1001001410106000024.0%
Ministral 8B96944000000019.4%
Ministral 3 14B94900000000018.4%
Qwen 3 32B90684000000016.2%
Mistral Small 4 (Reasoning)964310100000015.0%
Claude Sonnet 4.552272020141110013.7%
Llama 3.1 Nemotron 70B1002010000000013.0%
Mistral Small Creative682727600000012.9%
Rocinante 12B9640000000010.0%
Arcee AI: Trinity Mini10000000000010.0%
Cohere Command R+ (Aug. 2024)68201010000009.9%
Z.AI GLM 4.552202021000009.5%
Gemma 3 12B816000000008.9%
Claude Sonnet 443141062000007.6%
Mistral NeMO750000000007.5%
Ministral 3 3B14141062000004.7%
Gemma 3 27B200000000002.0%
Ministral 3 8B20000000000.2%
Mistral Large 200000000000.0%
Gemma 3 4B00000000000.0%
WizardLM 2 8x22b00000000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
GPT-5.2100100100100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
MiniMax M2.7100100100100100100100100100100100.0%
o4 Mini High10010010010010010010010010010099.9%
GPT-4o, Aug. 6th (temp=0)100100100100100999999999699.2%
MiniMax M2.5100100100100100100100100999099.0%
GPT-5.4 Nano (Reasoning)1001001001001001001001001008698.6%
Inception Mercury 21001001001001001001001001008698.6%
Qwen 3.5 27B100100100100100100100100949098.4%
Gemini 3.1 Flash Lite (Preview)100100100100100999996949498.2%
Gemini 3 Flash (Preview, Reasoning)1001001009999999896908696.8%
GPT-4o Mini (temp=0)10010010010099999898966895.7%
Qwen 3.5 397B A17B100100100100100100100100817595.6%
GPT-4o Mini (temp=1)1001001001001001009998965294.4%
GPT-5.4 (Reasoning)100100100100100100100100904393.4%
GPT-4o, Aug. 6th (temp=1)100100999998949086818192.9%
Claude Opus 4.610010010010099989090906092.8%
GPT-4.1100100999999999986815291.4%
GPT-5100100100100100100100100100490.4%
GPT-5 Nano100100100100100100100100100090.0%
Nemotron 3 Nano100100100100100100100100100090.0%
Claude Opus 4.6 (Reasoning)1001001009996969075756089.2%
GPT-5.110010010010010010010010068687.4%
Qwen 3.5 Flash10010010010010098969481086.8%
GPT-4o, May 13th (temp=0)100100100100100100999968086.6%
Nemotron 3 Super100100100100100999060431480.7%
Stealth: Aurora Alpha100100100100100100100940079.4%
Qwen 3.5 122B100100100100100100100900079.0%
Z.AI GLM 5 Turbo100100100100999981750075.4%
Inception Mercury1001001001009894864327475.2%
GPT-5.4 (Reasoning, Low)10010010010010075685243674.5%
GPT-5.4 Mini (Reasoning)100100100100100100604314071.8%
Grok 4.20 (Beta)100100100100999686350071.7%
Grok 41001001001009998525210071.0%
Qwen 3.5 35B100100100100999481274070.5%
MoonshotAI: Kimi K2.5100100100999696683510070.3%
Claude Opus 4.5999896908686862720469.3%
Grok 4.1 Fast1001009898908686274068.9%
Grok 4.20 (Beta, Reasoning)10010010010099969000068.5%
Claude Opus 4969490868675686010667.2%
o4 Mini10010010010010068522720266.9%
Llama 3.1 70B1009998948168524314065.0%
Gemini 3 Flash (Preview)9996868175757535141064.7%
GPT-5.4 Mini1009494908675682010063.7%
DeepSeek V3.2100100989686757500063.0%
GPT-4o, May 13th (temp=1)10010099967568681410063.0%
GPT-5.4 Mini (Reasoning, Low)100100996860605227201460.2%
Ministral 8B1001001009690753520059.9%
DeepSeek V3.1100999894865235350059.8%
Mistral Small 4 (Reasoning)1001001001009896000059.3%
Llama 3.1 8B10010010094757535140059.3%
Z.AI GLM 4.71001009681686043350058.4%
Claude Sonnet 4.69868686860525235353557.1%
GPT-4.1 Nano100999486814327274056.2%
Arcee AI: Trinity Large (Preview)1009086817552432010055.8%
ByteDance Seed 1.69999968675681400053.8%
Hermes 3 405B1009081816852272014053.4%
Mistral Medium 3.110096946060605264053.2%
GPT-5.4100948675525252200053.1%
Mistral Large1009994818175000053.1%
Hermes 3 70B1001009681523535274053.0%
GPT-4.1 Mini100100989881351020052.3%
Claude Sonnet 4.6 (Reasoning)999481524343272020048.1%
Claude 3.5 Sonnet100867575523527272048.0%
Gemini 3 Pro (Preview)10099999635271044047.4%
Z.AI GLM 51009999906014421047.0%
DeepSeek-V2 Chat1009081756852000046.7%
Claude 3 Haiku10096947543351400045.7%
GPT-5.4 Nano (Reasoning, Low)100100100100524200045.7%
DeepSeek V3 (2025-03-24)10010010094602000045.6%
Qwen 2.5 72B10010010099432100044.5%
DeepSeek V3 (2024-12-26)10010099755210600044.3%
Mistral Small 41009994812720664243.9%
Gemini 2.5 Flash Lite90756852434343200043.6%
Grok 4 Fast10099866043202020043.1%
Qwen 3.5 Plus (2026-02-15)908175605252220041.5%
Claude Haiku 4.5999475754310640040.6%
Gemini 2.5 Flash10099989460000039.6%
Ministral 3 14B1009086202020221034.2%
Mistral Small 3.2 24B96948152106000033.9%
LFM2 24B99966043270000032.6%
Stealth: Healer Alpha10099902700000031.6%
ByteDance Seed 2.0 Lite9990353566420027.8%
Qwen3 235B A22B Instruct 250786605235351000027.0%
Z.AI GLM 4.7 Flash1009075400000026.9%
Stealth: Hunter Alpha989668421000026.9%
Mistral Large 39952525210000025.6%
Claude 3.7 Sonnet906827202020441025.5%
Ministral 3 3B9694271020000022.8%
ByteDance Seed 1.6 Flash1008627000000021.3%
GPT-5.4 Nano1008110000000019.2%
Writer: Palmyra X5996810410000018.2%
ByteDance Seed 2.0 Mini757527400000018.1%
Qwen 3 32B99810000000018.1%
Qwen 3.5 9B100604200000016.6%
Llama 3.1 Nemotron 70B98524000000015.4%
Gemini 2.5 Pro606010600000013.6%
Ministral 3B75600000000013.5%
Gemini 2.5 Flash Lite (Reasoning)99351000000013.5%
Z.AI GLM 4.590274000000012.1%
Aion 2.0100101000000011.0%
Ministral 3 8B10020000000010.2%
Claude Sonnet 4.55227640000008.9%
Mistral Small Creative43351000000008.8%
Gemma 3 12B3514400000005.3%
Rocinante 12B3514100000005.0%
Z.AI GLM 4.6432000000004.6%
Claude Sonnet 4206110000003.0%
Cohere Command R+ (Aug. 2024)270000000002.7%
Claude 3.5 Haiku1010110000002.2%
Gemini 2.5 Flash (Reasoning)64000000001.0%
Mistral Large 260000000000.6%
Arcee AI: Trinity Mini60000000000.6%
Gemma 3 27B00000000000.0%
WizardLM 2 8x22b00000000000.0%
Mistral NeMO00000000000.0%
Gemma 3 4B00000000000.0%

dialogue-500

Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Z.AI GLM 5 Turbo100100100100100100100100100100100.0%
MiniMax M2.71001001001001001001001001009999.9%
Gemini 3.1 Pro (Preview)1001001001001001001001001002092.0%
o4 Mini High100100100100100100100100100090.0%
Claude Sonnet 4.610099999694949081686088.1%
MiniMax M2.510010010010010096969490087.6%
Claude Sonnet 4.6 (Reasoning)100100100100100999690353585.5%
GPT-5 Mini10010010010010010010010014081.4%
GPT-51001001001001001001001000080.0%
Claude 3.7 Sonnet10010099969690868114076.4%
Claude Opus 4.51009998968681686860175.7%
Inception Mercury 2100100100100999860350069.2%
Gemini 3 Flash (Preview)100100999998988140067.8%
Nemotron 3 Super100100100100100987500067.3%
Grok 4.20 (Beta)100999998908143432065.6%
o4 Mini100100100100908135206063.3%
Grok 41009999907560603514063.3%
GPT-4o, Aug. 6th (temp=0)99868686868175100061.1%
Stealth: Aurora Alpha100100100100100902000061.1%
GPT-4o Mini (temp=0)10010090756860523527060.8%
Nemotron 3 Nano100100100100100100000060.0%
Claude Opus 4.6 (Reasoning)100100999996861000059.1%
Gemini 3.1 Flash Lite (Preview)99999686755235272057.1%
Claude Opus 4.6100100100998668600056.0%
Claude Opus 494949081755220140052.1%
Qwen 3.5 397B A17B100999894520000044.2%
Mistral Medium 3.110010010098350000043.3%
Gemma 3 12B100998675600000042.1%
Grok 4 Fast100999086351000041.2%
Hermes 3 405B100908168202000036.2%
Qwen 3.5 27B100100995200000035.1%
Gemini 2.5 Flash Lite10099945260000035.1%
LFM2 24B1001009035104200034.0%
GPT-5 Nano1001001003500000033.5%
DeepSeek V3 (2024-12-26)9998944300000033.4%
Arcee AI: Trinity Mini9996686800000033.2%
Mistral Large 386756843430000031.6%
Claude Haiku 4.510094524360000029.5%
Z.AI GLM 510090525200000029.4%
MoonshotAI: Kimi K2.51009694000000029.0%
Ministral 3 8B10010052000000025.2%
ByteDance Seed 2.0 Lite10098272040000024.9%
Mistral Small 4907568600000024.0%
Z.AI GLM 4.7 Flash1008652000000023.8%
Cohere Command R+ (Aug. 2024)969443000000023.3%
Gemini 3 Flash (Preview, Reasoning)10075431400000023.3%
ByteDance Seed 2.0 Mini10098141000000022.2%
Gemini 2.5 Pro1007543000000021.8%
Inception Mercury1001000000000020.0%
GPT-5.1100990000000019.9%
Qwen 3.5 122B100990000000019.9%
Gemini 2.5 Flash Lite (Reasoning)100940000000019.4%
Gemini 2.5 Flash906820000000017.9%
Z.AI GLM 4.752434320141000017.5%
Qwen 3.5 Flash686035000000016.4%
GPT-4o, Aug. 6th (temp=1)994310400000015.6%
GPT-4o Mini (temp=1)816010000000015.2%
Grok 4.1 Fast96434000000014.3%
Claude 3 Haiku100352000000013.7%
Gemini 3 Pro (Preview)100350000000013.5%
DeepSeek-V2 Chat99204200000012.5%
GPT-4.196270000000012.3%
Ministral 3 3B99140000000011.4%
Claude Sonnet 4.5811410200000010.7%
Gemma 3 27B9960000000010.5%
GPT-5.4 (Reasoning)10000000000010.0%
DeepSeek V3.210000000000010.0%
Qwen 3 32B10000000000010.0%
Llama 3.1 8B10000000000010.0%
Qwen 2.5 72B9460000000010.0%
Qwen 3.5 35B990000000009.9%
GPT-4.1 Mini990000000009.9%
Ministral 8B990000000009.9%
GPT-5.2980000000009.8%
Gemini 2.5 Flash (Reasoning)43272000000009.1%
DeepSeek V3 (2025-03-24)861000000008.7%
Qwen3 235B A22B Instruct 2507860000000008.7%
Hermes 3 70B860000000008.6%
Ministral 3B860000000008.6%
Aion 2.0754210000008.2%
Gemma 3 4B754000000007.9%
Claude Sonnet 4750000000007.5%
Claude 3.5 Sonnet682000000007.0%
Llama 3.1 70B6010000000007.0%
Mistral Large5210000000006.2%
Grok 4.20 (Beta, Reasoning)4314000000005.8%
Ministral 3 14B520000000005.2%
Mistral Small 3.2 24B430000000004.3%
Stealth: Hunter Alpha430000000004.3%
Llama 3.1 Nemotron 70B430000000004.3%
ByteDance Seed 1.6352000000003.7%
Mistral Large 2200000000002.0%
DeepSeek V3.1140000000001.5%
Rocinante 12B60000000000.6%
ByteDance Seed 1.6 Flash40000000000.4%
Mistral Small Creative20000000000.2%
Stealth: Healer Alpha20000000000.2%
Z.AI GLM 4.600000000000.0%
Z.AI GLM 4.500000000000.0%
Mistral NeMO00000000000.0%
GPT-4o, May 13th (temp=1)00000000000.0%
Writer: Palmyra X500000000000.0%
Qwen 3.5 9B00000000000.0%
WizardLM 2 8x22b00000000000.0%
Qwen 3.5 Plus (2026-02-15)00000000000.0%
Claude 3.5 Haiku00000000000.0%
Arcee AI: Trinity Large (Preview)00000000000.0%
GPT-4.1 Nano00000000000.0%
Mistral Small 4 (Reasoning)00000000000.0%
GPT-4o, May 13th (temp=0)00000000000.0%
GPT-5.4 (Reasoning, Low)00000000000.0%
GPT-5.4 Mini (Reasoning)00000000000.0%
GPT-5.4 Mini (Reasoning, Low)00000000000.0%
GPT-5.400000000000.0%
GPT-5.4 Mini00000000000.0%
GPT-5.4 Nano (Reasoning)00000000000.0%
GPT-5.4 Nano (Reasoning, Low)00000000000.0%
GPT-5.4 Nano00000000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
GPT-5100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100100100.0%
Z.AI GLM 5 Turbo1001001001001001001001001009699.6%
Claude Opus 4.61001001001001001009996966895.8%
MiniMax M2.7100100100100100100100100100090.0%
o4 Mini100100100100100100100962079.8%
Claude Sonnet 4.6 (Reasoning)10099989090756052432072.8%
Claude Opus 4.6 (Reasoning)100100100100999894350072.6%
o4 Mini High100100100100100100100140071.4%
MiniMax M2.510010010010010010010060070.6%
Nemotron 3 Super10010099999875602010066.1%
Claude 3.7 Sonnet100100100999686202020064.2%
Gemini 3.1 Flash Lite (Preview)989890686860605243063.8%
Claude Sonnet 4.6969075757575684335063.4%
Inception Mercury 210010010010096943560063.1%
GPT-4o Mini (temp=0)10099969481812740058.2%
Grok 4 Fast10010010010090682000057.8%
Nemotron 3 Nano1001001001007560000053.5%
GPT-4.110010096948643440052.6%
Inception Mercury100100100100981000049.9%
Grok 41009686757552200048.7%
Claude Opus 49990686868523500048.1%
Gemini 2.5 Flash Lite1009486686010600042.4%
Stealth: Aurora Alpha1001001009420000039.6%
Arcee AI: Trinity Mini10010075683514000039.2%
Aion 2.01009996602010000038.5%
Qwen 3.5 397B A17B10094866800000034.8%
GPT-4o, Aug. 6th (temp=1)100908152201000034.5%
Claude Opus 4.5989486431010220034.5%
Gemini 2.5 Flash10098816000000033.9%
GPT-5.1100100100000000030.0%
GPT-5.4 Nano (Reasoning)100100100000000030.0%
Qwen 3.5 9B10010098000000029.7%
Grok 4.20 (Beta)100996814102000029.3%
GPT-4o Mini (temp=1)1008681640000027.7%
Claude Haiku 4.59996681000000027.4%
Cohere Command R+ (Aug. 2024)969075200000026.4%
DeepSeek V3.1998675000000026.1%
Mistral Medium 3.110010060000000026.1%
GPT-5 Nano10010043000000024.4%
Llama 3.1 8B969043200000023.2%
Qwen 3.5 35B1008143000000022.5%
Gemini 3 Flash (Preview, Reasoning)908143000000021.6%
Gemini 2.5 Pro1009810000000020.7%
Mistral Large1001000000000020.0%
GPT-5.4 (Reasoning)1001000000000020.0%
Z.AI GLM 4.7100866000000019.2%
Gemini 3 Pro (Preview)756052000000018.7%
Gemini 2.5 Flash (Reasoning)756843000000018.7%
Hermes 3 70B686843000000018.0%
DeepSeek V3 (2024-12-26)8135271000000015.3%
Mistral Small 4816010000000015.1%
MoonshotAI: Kimi K2.5993514000000014.8%
Qwen 3.5 27B962720000000014.3%
Writer: Palmyra X586520000000013.8%
Llama 3.1 70B98350000000013.3%
Grok 4.20 (Beta, Reasoning)5235271400000012.8%
DeepSeek V3.290350000000012.6%
Gemma 3 12B100140000000011.4%
GPT-4.1 Mini86270000000011.4%
Rocinante 12B100100000000010.9%
Mistral Large 310062000000010.8%
Stealth: Hunter Alpha602720000000010.8%
DeepSeek V3 (2025-03-24)9940000000010.3%
GPT-5.210000000000010.0%
Arcee AI: Trinity Large (Preview)990000000009.9%
Stealth: Healer Alpha990000000009.9%
Ministral 3 3B980000000009.8%
Mistral Small 4 (Reasoning)944000000009.7%
DeepSeek-V2 Chat902000000009.3%
Hermes 3 405B900000000009.0%
Gemini 3 Flash (Preview)7514000000009.0%
Qwen 3.5 122B861000000008.7%
Qwen 3 32B860000000008.6%
ByteDance Seed 1.6 Flash860000000008.6%
ByteDance Seed 2.0 Mini811000000008.2%
Ministral 3 8B810000000008.1%
Grok 4.1 Fast750000000007.5%
Claude 3 Haiku684000000007.2%
Gemini 2.5 Flash Lite (Reasoning)604100000006.5%
LFM2 24B604000000006.4%
Claude Sonnet 4.5602000000006.2%
ByteDance Seed 1.6526000000005.8%
Claude Sonnet 4270000000002.7%
Mistral Small 3.2 24B270000000002.7%
Z.AI GLM 51410000000002.4%
Qwen 3.5 Flash200000000002.0%
ByteDance Seed 2.0 Lite140000000001.4%
Z.AI GLM 4.7 Flash100000000001.0%
Mistral Large 2100000000001.0%
Claude 3.5 Sonnet20000000000.2%
Z.AI GLM 4.520000000000.2%
GPT-4o, Aug. 6th (temp=0)10000000000.1%
Mistral NeMO00000000000.0%
Qwen3 235B A22B Instruct 250700000000000.0%
Ministral 3B00000000000.0%
Ministral 3 14B00000000000.0%
Qwen 2.5 72B00000000000.0%
WizardLM 2 8x22b00000000000.0%
Qwen 3.5 Plus (2026-02-15)00000000000.0%
Ministral 8B00000000000.0%
GPT-5.4 Mini00000000000.0%
Gemma 3 4B00000000000.0%
GPT-4.1 Nano00000000000.0%
GPT-4o, May 13th (temp=0)00000000000.0%
GPT-4o, May 13th (temp=1)00000000000.0%
Llama 3.1 Nemotron 70B00000000000.0%
Gemma 3 27B00000000000.0%
Mistral Small Creative00000000000.0%
Claude 3.5 Haiku00000000000.0%
GPT-5.4 (Reasoning, Low)00000000000.0%
GPT-5.400000000000.0%
GPT-5.4 Mini (Reasoning)00000000000.0%
Z.AI GLM 4.600000000000.0%
GPT-5.4 Mini (Reasoning, Low)00000000000.0%
GPT-5.4 Nano (Reasoning, Low)00000000000.0%
GPT-5.4 Nano00000000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
GPT-5 Mini100100100100100100100100100090.0%
Claude 3.7 Sonnet10010010010098988681686089.1%
Claude Opus 4.6 (Reasoning)100100999696948668684385.1%
MiniMax M2.710010010010010010099992080.0%
MiniMax M2.5100100100100999996816078.1%
Claude Opus 4.6100100999996868168272077.7%
o4 Mini High1001001001001009968350070.3%
GPT-510010010010010010010000070.0%
Nemotron 3 Super100100100100100999610069.5%
GPT-4o Mini (temp=0)1001009490907568270064.5%
GPT-4o, Aug. 6th (temp=0)10010098966043434343062.7%
Inception Mercury 210010010010010094200059.6%
Gemini 3.1 Flash Lite (Preview)1001009460524343272052.2%
Z.AI GLM 5 Turbo10010010099992000050.0%
Grok 4 Fast10010010068432720140047.3%
GPT-4o, Aug. 6th (temp=1)100100100966010400046.9%
Claude Opus 49996904343352741044.0%
Aion 2.010010010068606000043.4%
Gemini 3 Flash (Preview)1001008181524000041.9%
Stealth: Healer Alpha100999490350000041.8%
Gemini 2.5 Flash Lite (Reasoning)99969486202000039.7%
Nemotron 3 Nano1001001008160000038.7%
Claude Sonnet 4.6100524343352714144033.3%
Gemini 2.5 Flash Lite10099904300000033.3%
Claude Opus 4.590909014144410030.9%
Cohere Command R+ (Aug. 2024)10099523541000029.0%
o4 Mini10010075200000027.7%
Claude 3.5 Sonnet1009875000000027.3%
Mistral Medium 3.1999875100000027.3%
Qwen 3.5 122B1009868000000026.6%
Stealth: Aurora Alpha1008181000000026.3%
Grok 4.20 (Beta)90686020200000025.9%
Inception Mercury10099352020000025.7%
ByteDance Seed 2.0 Lite988668000000025.2%
Claude Sonnet 4.6 (Reasoning)814343352014000023.8%
Gemini 3 Flash (Preview, Reasoning)1009935000000023.4%
GPT-4.1999635400000023.3%
Mistral Large 31009827420000023.0%
GPT-4o, May 13th (temp=1)1006052000000021.2%
GPT-4o Mini (temp=1)989810000000020.5%
Grok 49960351000000020.4%
DeepSeek-V2 Chat1001002000000020.2%
Z.AI GLM 59075201400000020.0%
Ministral 3 14B99990000000019.8%
LFM2 24B99960000000019.5%
GPT-4.1 Mini99960000000019.5%
Stealth: Hunter Alpha8668271000000019.1%
DeepSeek V3 (2024-12-26)756043600000018.5%
Mistral Small 4946027200000018.4%
Rocinante 12B99810000000018.0%
Grok 4.20 (Beta, Reasoning)818110000000017.2%
Qwen 2.5 72B994327200000017.1%
Mistral NeMO100680000000016.8%
Llama 3.1 8B813535000000015.1%
Arcee AI: Trinity Mini100430000000014.3%
Ministral 3 8B68276400000010.6%
GPT-5 Nano10010000000010.1%
Arcee AI: Trinity Large (Preview)10000000000010.0%
Ministral 8B10000000000010.0%
Mistral Small 3.2 24B980000000009.8%
Qwen 3.5 9B960000000009.6%
ByteDance Seed 2.0 Mini960000000009.6%
Gemini 2.5 Flash940000000009.4%
Mistral Large6027000000008.7%
Qwen 3.5 27B861000000008.7%
Llama 3.1 70B860000000008.6%
Ministral 3B860000000008.6%
Gemini 3 Pro (Preview)860000000008.6%
Gemma 3 12B811000000008.2%
Qwen3 235B A22B Instruct 2507752000000007.7%
Mistral Large 2680000000006.8%
Writer: Palmyra X5680000000006.8%
ByteDance Seed 1.6680000000006.8%
Grok 4.1 Fast600000000006.0%
Gemini 2.5 Pro3520200000005.7%
Z.AI GLM 4.735101000000005.5%
MoonshotAI: Kimi K2.5430000000004.4%
Ministral 3 3B430000000004.3%
DeepSeek V3 (2025-03-24)430000000004.3%
DeepSeek V3.1430000000004.3%
Qwen 3.5 397B A17B350000000003.5%
Claude 3 Haiku270000000002.7%
Claude Haiku 4.5202100000002.3%
GPT-5.1140000000001.4%
Hermes 3 70B64200000001.2%
Hermes 3 405B40000000000.4%
Qwen 3.5 Plus (2026-02-15)20000000000.2%
DeepSeek V3.210000000000.1%
Qwen 3 32B00000000000.0%
GPT-4o, May 13th (temp=0)00000000000.0%
Claude Sonnet 4.500000000000.0%
GPT-4.1 Nano00000000000.0%
Z.AI GLM 4.500000000000.0%
Llama 3.1 Nemotron 70B00000000000.0%
Gemini 2.5 Flash (Reasoning)00000000000.0%
Claude Sonnet 400000000000.0%
Qwen 3.5 35B00000000000.0%
Mistral Small 4 (Reasoning)00000000000.0%
Gemma 3 27B00000000000.0%
Z.AI GLM 4.7 Flash00000000000.0%
ByteDance Seed 1.6 Flash00000000000.0%
Z.AI GLM 4.600000000000.0%
Mistral Small Creative00000000000.0%
GPT-5.4 Mini00000000000.0%
Qwen 3.5 Flash00000000000.0%
WizardLM 2 8x22b00000000000.0%
GPT-5.4 Mini (Reasoning, Low)00000000000.0%
Gemma 3 4B00000000000.0%
GPT-5.4 (Reasoning)00000000000.0%
GPT-5.4 (Reasoning, Low)00000000000.0%
GPT-5.4 Mini (Reasoning)00000000000.0%
GPT-5.200000000000.0%
GPT-5.400000000000.0%
Claude 3.5 Haiku00000000000.0%
GPT-5.4 Nano (Reasoning)00000000000.0%
GPT-5.4 Nano (Reasoning, Low)00000000000.0%
GPT-5.4 Nano00000000000.0%