Dialogue to Total Word Ratio

Test: Dialogue tags

Avg. Score: 35.8%
Scenarios: 6

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Z.AI GLM 5 Turbo | 82.3% | $0.035 | 1.4m | 38%
2 | Gemini 3.1 Pro (Preview) | 99.6% | $0.157 | 2.1m | 95%
3 | Qwen3.7 Max | 89.3% | $0.074 | 2.5m | 56%
4 | Inception Mercury 2 | 57.7% | $0.0042 | 6.8s | 11%
5 | GPT-5 Mini | 67.3% | $0.011 | 58.5s | 12%
6 | Claude Sonnet 4.6 | 55.4% | $0.0079 | 12.8s | 11%
7 | Nemotron 3 Super | 74.2% | $0.0000 | 2.8m | 22%
8 | GPT-5.5 | 57.0% | $0.021 | 19.2s | 12%
9 | Gemini 3.5 Flash (Reasoning) | 90.1% | $0.137 | 56.9s | 54%
10 | Gemini 2.5 Flash (Reasoning) | 54.5% | $0.015 | 29.3s | 10%
11 | Qwen 3.6 Flash | 56.5% | $0.015 | 49.1s | 11%
12 | GPT-5.4 Nano (Reasoning, Low) | 45.8% | $0.0020 | 10.4s | 4%
13 | Qwen 3.6 35B | 60.0% | $0.013 | 1.3m | 11%
14 | GPT-5.4 Nano (Reasoning) | 48.6% | $0.0042 | 21.2s | 3%
15 | Claude Sonnet 4 | 44.5% | $0.0077 | 11.4s | 7%
16 | o4 Mini | 61.7% | $0.026 | 1.1m | 13%
17 | Claude Sonnet 4.5 | 46.4% | $0.0080 | 11.8s | 5%
18 | Claude Opus 4.6 | 49.3% | $0.014 | 15.5s | 6%
19 | Inception Mercury | 42.6% | $0.0006 | 9.1s | 3%
20 | GPT-5.4 (Reasoning, Low) | 52.0% | $0.017 | 23.5s | 6%
21 | Z.AI GLM 5.1 | 83.2% | $0.048 | 3.9m | 42%
22 | GPT-OSS 120B | 58.3% | $0.0015 | 1.9m | 10%
23 | GPT-5.4 (Reasoning) | 57.9% | $0.030 | 39.6s | 11%
24 | GPT-5 | 73.4% | $0.053 | 1.5m | 20%
25 | Claude Opus 4.5 | 47.8% | $0.014 | 14.3s | 5%
26 | GPT-5.5 (Reasoning) | 62.2% | $0.049 | 29.5s | 14%
27 | GPT-5.2 | 55.2% | $0.027 | 38.3s | 9%
28 | GPT-5.4 | 40.3% | $0.0096 | 18.7s | 8%
29 | GPT-5.1 | 58.2% | $0.030 | 52.9s | 11%
30 | GPT-4o, Aug. 6th (temp=0) | 30.6% | $0.0053 | 6.3s | 12%
31 | Grok 4.20 (Reasoning) | 64.1% | $0.027 | 2.0m | 16%
32 | Gemini 3 Flash (Preview, Reasoning) | 51.5% | $0.019 | 33.7s | 4%
33 | Claude Haiku 4.5 | 36.9% | $0.0026 | 6.6s | 0%
34 | Gemini 2.5 Flash Lite (Reasoning) | 38.3% | $0.0022 | 24.1s | 1%
35 | GPT-5.5 (Reasoning, Low) | 52.7% | $0.033 | 24.6s | 6%
36 | Qwen 3.5 Flash | 49.6% | $0.0050 | 1.4m | 5%
37 | ByteDance Seed 1.6 | 49.1% | $0.0073 | 1.4m | 6%
38 | GPT-5.4 Mini | 32.0% | $0.0029 | 4.0s | 2%
39 | Ministral 3 14B | 32.5% | $0.0002 | 6.4s | 0%
40 | Qwen 3.5 27B | 62.1% | $0.025 | 1.9m | 10%
41 | Mistral NeMO | 32.5% | $0.0001 | 6.6s | 0%
42 | Claude Opus 4.6 (Reasoning) | 69.3% | $0.080 | 42.3s | 20%
43 | Gemma 3 12B | 31.1% | $0.0001 | 11.9s | 0%
44 | Gemini 2.5 Pro | 46.6% | $0.029 | 25.0s | 4%
45 | Claude Opus 4.7 (Reasoning) | 40.1% | $0.020 | 11.3s | 2%
46 | Mistral Small 4 (Reasoning) | 32.8% | $0.0019 | 22.1s | 1%
47 | Grok 4.20 (Beta, Reasoning) | 57.9% | $0.053 | 38.2s | 10%
48 | Grok 4.1 Fast | 28.6% | $0.0004 | 10.3s | 0%
49 | Z.AI GLM 4.6 | 43.6% | $0.0044 | 1.3m | 3%
50 | MoonshotAI: Kimi K2.5 | 66.1% | $0.024 | 3.2m | 18%
51 | Mistral Small 4 | 26.5% | $0.0003 | 5.7s | 0%
52 | GPT-5.4 Mini (Reasoning) | 29.9% | $0.0077 | 9.4s | 1%
53 | Gemma 4 26B (Reasoning) | 64.2% | $0.0035 | 3.8m | 14%
54 | Gemma 4 26B | 27.2% | $0.0002 | 16.6s | 0%
55 | ByteDance Seed 2.0 Lite | 34.6% | $0.0036 | 42.6s | 1%
56 | o4 Mini High | 64.0% | $0.050 | 1.9m | 16%
57 | Ministral 3B | 23.1% | $0.0000 | 2.5s | 0%
58 | Qwen 3.5 35B | 50.3% | $0.024 | 1.3m | 4%
59 | Hermes 3 70B | 26.0% | $0.0002 | 20.8s | 0%
60 | Llama 3.1 70B | 22.3% | $0.0005 | 5.2s | 0%
61 | GPT-5.4 Mini (Reasoning, Low) | 22.8% | $0.0032 | 4.6s | 1%
62 | LFM2 24B | 23.5% | $0.0001 | 12.6s | 0%
63 | DeepSeek V3 (2024-12-26) | 25.2% | $0.0006 | 19.3s | 0%
64 | Mistral Medium 3.1 | 23.8% | $0.0012 | 13.1s | 0%
65 | ByteDance Seed 1.6 Flash | 23.2% | $0.0006 | 12.2s | 0%
66 | Qwen3 235B A22B Instruct 2507 | 24.3% | $0.0003 | 18.7s | 0%
67 | Z.AI GLM 5 | 47.3% | $0.010 | 1.8m | 3%
68 | DeepSeek-V2 Chat | 26.4% | $0.0002 | 29.9s | 0%
69 | Qwen 3.5 Plus (2026-02-15) | 25.2% | $0.0013 | 22.1s | 0%
70 | Qwen 3.5 122B | 52.0% | $0.033 | 1.3m | 5%
71 | Grok 4.3 (Reasoning) | 68.4% | $0.037 | 3.2m | 17%
72 | Aion 2.0 | 25.3% | $0.0016 | 24.7s | 0%
73 | Qwen 3 32B | 23.5% | $0.0005 | 21.3s | 0%
74 | Writer: Palmyra X5 | 23.6% | $0.0038 | 12.6s | 0%
75 | Llama 3.1 Nemotron 70B | 21.9% | $0.0002 | 15.7s | 0%
76 | GPT-5.4 Nano | 19.9% | $0.0011 | 5.2s | 0%
77 | Mistral Small 3.2 24B | 20.3% | $0.0002 | 9.0s | 0%
78 | Xiaomi MIMO v2.5 Pro | 24.8% | $0.0032 | 20.8s | 0%
79 | Grok 4 Fast | 19.7% | $0.0004 | 6.8s | 0%
80 | Rocinante 12B | 23.6% | $0.0003 | 25.1s | 0%
81 | Gemma 4 31B (Reasoning) | 63.0% | $0.0024 | 4.2m | 14%
82 | Claude 3.5 Sonnet | 29.0% | $0.0092 | 32.1s | 2%
83 | Ministral 3 3B | 17.8% | $0.0001 | 2.0s | 0%
84 | Mistral Small Creative | 18.2% | $0.0002 | 3.8s | 0%
85 | Arcee AI: Trinity Large (Preview) | 19.9% | $0.0000 | 12.2s | 0%
86 | Gemini 2.5 Flash Lite | 17.8% | $0.0002 | 2.6s | 0%
87 | GPT-4o, Aug. 6th (temp=1) | 21.7% | $0.0054 | 6.3s | 0%
88 | Stealth: Hunter Alpha | 20.9% | $0.0000 | 17.1s | 0%
89 | Ministral 8B | 17.9% | $0.0000 | 4.0s | 0%
90 | Z.AI GLM 4.5 | 19.3% | $0.0009 | 8.9s | 0%
91 | Ministral 3 8B | 17.0% | $0.0001 | 3.3s | 0%
92 | Xiaomi MIMO v2.5 | 20.1% | $0.0022 | 12.5s | 0%
93 | Grok 4.3 | 18.2% | $0.0011 | 7.2s | 0%
94 | GPT-4o, May 13th (temp=0) | 23.7% | $0.0090 | 14.3s | 1%
95 | DeepSeek V3.1 | 22.5% | $0.0006 | 28.3s | 0%
96 | DeepSeek V3 (2025-03-24) | 19.8% | $0.0006 | 17.8s | 0%
97 | GPT-4o Mini (temp=0) | 16.5% | $0.0003 | 8.5s | 0%
98 | Mistral Large 2 | 19.2% | $0.0031 | 13.0s | 0%
99 | Gemma 3 4B | 15.3% | $0.0000 | 5.5s | 0%
100 | Llama 3.1 8B | 14.3% | $0.0001 | 2.0s | 0%
101 | Z.AI GLM 4.7 | 47.5% | $0.0069 | 2.4m | 4%
102 | Claude 3.7 Sonnet | 21.3% | $0.0085 | 11.6s | 0%
103 | Hermes 3 405B | 19.5% | $0.0000 | 26.5s | 0%
104 | Qwen 2.5 72B | 17.1% | $0.0003 | 15.6s | 0%
105 | Gemini 3 Flash (Preview) | 15.4% | $0.0015 | 5.0s | 0%
106 | ByteDance Seed 2.0 Mini | 52.3% | $0.0029 | 3.3m | 7%
107 | Stealth: Healer Alpha | 17.3% | $0.0000 | 19.6s | 0%
108 | Cohere Command R+ (Aug. 2024) | 18.7% | $0.0054 | 12.4s | 0%
109 | Gemma 4 31B | 18.2% | $0.0002 | 24.3s | 0%
110 | Claude 3 Haiku | 13.5% | $0.0007 | 4.7s | 0%
111 | Mistral Large | 23.6% | $0.015 | 13.3s | 0%
112 | WizardLM 2 8x22b | 17.2% | $0.0007 | 22.8s | 0%
113 | GPT-4.1 Nano | 13.0% | $0.0002 | 6.0s | 0%
114 | Stealth: Aurora Alpha | 48.9% | | 9.2s | 10%
115 | Z.AI GLM 4.7 Flash | 26.9% | $0.0014 | 1.1m | 0%
116 | GPT-4o, May 13th (temp=1) | 19.6% | $0.0088 | 15.6s | 0%
117 | Gemma 3 27B | 13.9% | $0.0001 | 15.7s | 0%
118 | GPT-4.1 Mini | 12.4% | $0.0009 | 6.9s | 0%
119 | DeepSeek V3.2 | 21.3% | $0.0005 | 47.8s | 0%
120 | Mistral Large 3 | 13.2% | $0.0009 | 12.1s | 0%
121 | Arcee AI: Trinity Mini | 10.8% | $0.0001 | 5.6s | 0%
122 | Claude Opus 4.7 | 24.2% | $0.020 | 12.0s | 0%
123 | Z.AI GLM 4.5 Air | 12.8% | $0.0007 | 17.8s | 0%
124 | Gemini 2.5 Flash | 10.0% | $0.0015 | 4.0s | 0%
125 | Grok 4 | 21.3% | $0.012 | 27.3s | 0%
126 | GPT-5 Nano | 32.6% | $0.0040 | 1.6m | 0%
127 | Gemini 3.1 Flash Lite (Reasoning) | 8.8% | $0.0007 | 4.2s | 0%
128 | GPT-4o Mini (temp=1) | 9.5% | $0.0003 | 8.4s | 0%
129 | DeepSeek V4 Flash | 10.0% | $0.0002 | 14.6s | 0%
130 | MiniMax M2.5 | 53.9% | $0.017 | 3.1m | 6%
131 | Gemini 3.1 Flash Lite (Preview) | 7.1% | $0.0007 | 2.9s | 0%
132 | Gemini 3.1 Flash Lite | 7.6% | $0.0007 | 5.2s | 0%
133 | DeepSeek V4 Flash (Reasoning) | 15.5% | $0.0002 | 41.4s | 0%
134 | Grok 4.20 | 7.1% | $0.0013 | 9.3s | 0%
135 | Gemini 3.5 Flash (Reasoning, Minimal) | 7.1% | $0.0041 | 3.7s | 0%
136 | Claude Sonnet 4.6 (Reasoning) | 71.1% | $0.117 | 1.3m | 22%
137 | Grok 4.20 (Beta) | 5.6% | $0.0029 | 3.2s | 0%
138 | DeepSeek V4 Pro | 9.2% | $0.0009 | 28.7s | 0%
139 | Qwen 3.5 9B | 35.6% | $0.0017 | 2.5m | 1%
140 | Gemini 3 Pro (Preview) | 25.4% | $0.031 | 23.6s | 0%
141 | MiniMax M2.7 | 64.6% | $0.023 | 4.7m | 14%
142 | GPT-4.1 | 4.3% | $0.0044 | 7.8s | 0%
143 | Qwen 3.5 Plus (2026-04-20) | 38.6% | $0.020 | 2.2m | 1%
144 | Qwen 3.6 27B | 42.1% | $0.029 | 2.1m | 1%
145 | Claude Opus 4 | 25.1% | $0.044 | 25.5s | 0%
146 | DeepSeek V4 Pro (Reasoning) | 48.2% | $0.019 | 3.7m | 4%
147 | MoonshotAI: Kimi K2.6 | 80.8% | $0.054 | 7.0m | 34%
148 | Nemotron 3 Nano | 42.3% | $0.0042 | 4.8m | 3%
149 | Qwen 3.5 397B A17B | 56.2% | $0.039 | 4.9m | 6%
150 | Qwen3.6 Max Preview | 52.4% | $0.072 | 4.4m | 4%
Average score: 35.84%

Individual Scenarios

dialogue-200

Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
Gemini 3.5 Flash (Reasoning)100100100100100100100100100100100.0%
Z.AI GLM 5.1100100100100100100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100100100100100100.0%
GPT-5100100100100100100100100100100100.0%
Gemma 4 26B (Reasoning)100100100100100100100100100100100.0%
Gemma 4 31B (Reasoning)100100100100100100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100100100100100100.0%
Grok 4.20 (Reasoning)100100100100100100100100100100100.0%
Claude Opus 4.6 (Reasoning)100100100100100100100100100100100.0%
GPT-5.5 (Reasoning, Low)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100100100100100100.0%
GPT-5 Mini1001001001001001001001001009999.9%
Qwen 3.5 27B1001001001001001001001001009999.9%
Qwen 3.5 122B1001001001001001001001001009999.9%
Gemini 3 Flash (Preview, Reasoning)1001001001001001001001001009999.9%
GPT-5.11001001001001001001001001009999.9%
Z.AI GLM 4.71001001001001001001001001009999.9%
GPT-5.4 Nano (Reasoning)1001001001001001001001001009899.8%
Qwen 3.5 397B A17B1001001001001001001001001009799.7%
MoonshotAI: Kimi K2.61001001001001001001001001009799.7%
GPT-5.2100100100100100100100100999699.4%
o4 Mini High1001001001001001001001001009399.3%
ByteDance Seed 2.0 Mini100100100100100100100100998998.7%
Claude Sonnet 4.6 (Reasoning)1001001001001001001001001008298.2%
Z.AI GLM 510010010010010010010095948297.1%
GPT-5.5100100100100100999996928296.9%
Claude Sonnet 41001001009998989492929196.5%
MoonshotAI: Kimi K2.51001001001001001001001001005595.5%
DeepSeek V4 Pro (Reasoning)10010010010010010010098797695.3%
Nemotron 3 Nano1001001001001001001001001005295.1%
Qwen3.7 Max1001001001001001001001001003793.7%
GPT-5 Nano1001001001001001001001001003793.7%
Qwen 3.6 27B100100100100100100100100993793.6%
MiniMax M2.7100100100100100100100100993793.5%
o4 Mini1001001001001001001001001003293.2%
Claude Opus 4.51001001001001001009594796593.2%
Qwen 3.5 Flash10010010010010010010099953793.1%
ByteDance Seed 2.0 Lite100100100100100999998923792.5%
GPT-5.4 (Reasoning, Low)10010010010010010010010099590.2%
GPT-5.5 (Reasoning)100100100100100100100100100290.2%
GPT-5.41001001009998989190883489.8%
ByteDance Seed 1.61001001001001001009898573788.9%
Qwen 3.6 Flash10010010010010010010099453788.0%
GPT-OSS 120B1001001001001001001009684087.9%
Qwen3.6 Max Preview100100100100100100100100373787.4%
Grok 4.3 (Reasoning)100100100100100100100100373787.4%
GPT-5.4 Nano (Reasoning, Low)100100100100100100999283087.3%
Qwen 3.5 9B10010010010010010010099383787.3%
Qwen 3.5 35B10010010010010010010094373786.7%
Gemma 4 26B10010010010098959269653785.4%
Qwen 3.6 35B10010010010010010010065373783.9%
Claude Sonnet 4.510099979695939259524582.9%
Gemini 2.5 Pro10010010010010099999928082.3%
Qwen 3.5 Plus (2026-04-20)1001001001001001009937373781.0%
Inception Mercury1001001001001001001007337081.0%
Nemotron 3 Super100100100100100100100992280.4%
Inception Mercury 2100100100100100100100990079.8%
Stealth: Aurora Alpha1001001001001001001005237178.9%
Claude Opus 4.79589848379797873675978.6%
Claude Sonnet 4.69898968885737360525277.5%
Claude Opus 4.7 (Reasoning)10010094948579675959173.8%
Z.AI GLM 4.7 Flash10010010097959084610072.7%
Qwen3 235B A22B Instruct 25071009897938784774337071.5%
Claude Opus 410010094929078704034069.7%
Claude Haiku 4.510010010097968955490068.5%
Gemini 3 Pro (Preview)100100100100100373737373768.4%
Claude Opus 4.610010010099999835178065.7%
Z.AI GLM 4.61001001001001009921210064.0%
Gemma 3 12B100100989088827040063.3%
Gemma 4 31B8583817966593737373760.0%
Mistral Small 4 (Reasoning)10099999694771930058.8%
Aion 2.0100100100999689110058.5%
MiniMax M2.510010010099523737133054.0%
GPT-5.4 Mini (Reasoning)10010010010053372000051.0%
Gemini 2.5 Flash Lite (Reasoning)10010010097937000049.7%
Gemini 3 Flash (Preview)9996373737373737373749.0%
Qwen 3.5 Plus (2026-02-15)5858574437373737373743.8%
Gemini 3.1 Flash Lite (Reasoning)9637373737373737373742.7%
Gemini 3.1 Flash Lite (Preview)9537373737373737373742.7%
Grok 4.310099746437321420042.1%
Ministral 3 3B10096846941191000041.9%
Writer: Palmyra X5100989472451000040.9%
Xiaomi MIMO v2.5 Pro10010010010000000040.0%
DeepSeek V3.2100100998310000038.3%
Gemini 3.1 Flash Lite3737373737373737373736.8%
Gemini 2.5 Flash (Reasoning)1001008551245000036.6%
Gemini 3.5 Flash (Reasoning, Minimal)373737373737373737033.1%
Claude 3.5 Sonnet100965852131000032.0%
Xiaomi MIMO v2.5100100100100000030.0%
Stealth: Hunter Alpha10089714000000030.0%
Rocinante 12B1009990000000028.9%
Z.AI GLM 4.51009680000000027.6%
Qwen 3 32B10010075000000027.6%
GPT-5.4 Mini (Reasoning, Low)1001002221100000025.3%
Mistral NeMO1009941000000024.0%
Mistral Large94832015144000023.1%
Mistral Small 4956029850000019.7%
Ministral 3 14B100450000000014.4%
Arcee AI: Trinity Large (Preview)85460000000013.1%
LFM2 24B97280000000012.6%
Grok 4.1 Fast10000000000010.0%
DeepSeek V3 (2024-12-26)4935000000008.4%
Claude 3.7 Sonnet820000000008.2%
GPT-5.4 Nano678000000007.5%
GPT-4o, May 13th (temp=1)660000000006.6%
GPT-5.4 Mini621000000006.3%
DeepSeek V4 Pro561000000005.7%
GPT-4o, Aug. 6th (temp=1)458000000005.3%
Hermes 3 70B510000000005.1%
Stealth: Healer Alpha281000000003.0%
DeepSeek V3 (2025-03-24)194000000002.4%
DeepSeek V4 Flash (Reasoning)98100000001.7%
Llama 3.1 8B50000000000.5%
GPT-4.1 Mini30000000000.3%
GPT-4.111000000000.2%
Grok 4.2020000000000.2%
WizardLM 2 8x22b00000000000.0%
Mistral Small Creative00000000000.0%
Grok 400000000000.0%
Mistral Large 200000000000.0%
Ministral 8B00000000000.0%
GPT-4o, May 13th (temp=0)00000000000.0%
Z.AI GLM 4.5 Air00000000000.0%
Hermes 3 405B00000000000.0%
ByteDance Seed 1.6 Flash00000000000.0%
Mistral Large 300000000000.0%
Ministral 3B00000000000.0%
Claude 3 Haiku00000000000.0%
Ministral 3 8B00000000000.0%
Grok 4 Fast00000000000.0%
Qwen 2.5 72B00000000000.0%
Gemma 3 27B00000000000.0%
Llama 3.1 Nemotron 70B00000000000.0%
Mistral Medium 3.100000000000.0%
Llama 3.1 70B00000000000.0%
Mistral Small 3.2 24B00000000000.0%
DeepSeek V3.100000000000.0%
Cohere Command R+ (Aug. 2024)00000000000.0%
GPT-4o, Aug. 6th (temp=0)00000000000.0%
DeepSeek-V2 Chat00000000000.0%
Gemma 3 4B00000000000.0%
Grok 4.20 (Beta)00000000000.0%
Gemini 2.5 Flash00000000000.0%
DeepSeek V4 Flash00000000000.0%
Gemini 2.5 Flash Lite00000000000.0%
GPT-4o Mini (temp=1)00000000000.0%
Arcee AI: Trinity Mini00000000000.0%
GPT-4o Mini (temp=0)00000000000.0%
GPT-4.1 Nano00000000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Qwen3.7 Max100100100100100100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
Gemini 3.5 Flash (Reasoning)100100100100100100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100100100100100100.0%
Qwen 3.5 397B A17B1001001001001001001001001009999.9%
Qwen 3.6 35B10010010010010010010010010010099.9%
Gemma 4 31B (Reasoning)1001001001001001001001001009999.9%
GPT-5.5 (Reasoning)1001001001001001001001001009799.7%
Claude Sonnet 4.6 (Reasoning)10010010010010010010099999999.6%
Qwen 3.5 27B10010010010010010010099999899.6%
Z.AI GLM 5.1100100100100100100100100989499.2%
Qwen 3.5 35B10010010010010010010099978998.5%
Grok 4.20 (Beta, Reasoning)10010010010010010010099998298.0%
Qwen 3.5 122B1001001001001001009999958597.9%
GPT-5.5 (Reasoning, Low)100100100100100999997939097.8%
Z.AI GLM 5 Turbo100100100100100100100100996696.5%
MoonshotAI: Kimi K2.5100100100100100100100100986195.8%
MiniMax M2.51001001001001001001001001005295.2%
Inception Mercury 2100100100100100100100100965294.8%
GPT-OSS 120B10010010010010010010099994194.0%
GPT-5.51001001001001001009993885293.1%
MiniMax M2.7100100100100100100100100982392.1%
Qwen3.6 Max Preview100100100100100100100100100090.0%
GPT-5100100100100100100100100100090.0%
Gemma 4 26B (Reasoning)100100100100100100100100100090.0%
GPT-5.4 Nano (Reasoning)10010010010010010010010099089.9%
Grok 4.3 (Reasoning)10010010010010010010010099089.9%
Gemini 3 Flash (Preview, Reasoning)10010010010010010010010099089.8%
MoonshotAI: Kimi K2.61001001001001001001009898089.7%
Gemini 2.5 Pro10010010010099979488714589.2%
GPT-5.110010010010010010010010092089.2%
Grok 4.20 (Reasoning)1001001001001001009385803289.0%
Gemini 2.5 Flash (Reasoning)1001001001009999989878087.2%
Claude Opus 4.6 (Reasoning)100100100999998979223080.8%
GPT-5.2100100100100100100996441080.3%
GPT-5 Mini1001001001001001001001000080.0%
Z.AI GLM 4.61001001001009998867737079.7%
Nemotron 3 Super1001001001001009999990079.6%
Z.AI GLM 51001001001009997878132079.5%
o4 Mini100100100100100100100880078.8%
Z.AI GLM 4.71001001001001009694942078.5%
GPT-5.4 (Reasoning, Low)100100100100999993796578.1%
Mistral Small 4100100100989384796128074.2%
Mistral Small 4 (Reasoning)1001001001009897884213174.0%
Qwen 3.6 Flash1001001001001009693433073.5%
Qwen 3.5 Flash1001001001009790716214073.4%
GPT-5.4 Nano (Reasoning, Low)1001001001001009990376073.2%
GPT-4o, May 13th (temp=0)100100100988382626144273.1%
GPT-4o, Aug. 6th (temp=0)100100979079725941413171.1%
o4 Mini High100100100100100999700069.6%
Stealth: Aurora Alpha10010010010099999600069.3%
Ministral 3 14B10010010097967979218168.0%
Claude Sonnet 4.61001001001001009054322067.8%
ByteDance Seed 1.6 Flash10099999997908184167.8%
Qwen 3.6 27B1001009999917773326067.7%
Qwen 3.5 Plus (2026-04-20)100100100100100996700066.6%
LFM2 24B10010099928768473513064.1%
ByteDance Seed 2.0 Mini10010010099978138203063.9%
Inception Mercury1001001009994795600062.7%
Nemotron 3 Nano1001009996884644300060.3%
Cohere Command R+ (Aug. 2024)100100988573716340059.3%
GPT-5.410010010087856628230058.8%
Hermes 3 70B100100948380605091157.8%
Claude Opus 4.61001001001008786110057.5%
Mistral Large10010010099735918140056.3%
Gemini 3 Pro (Preview)1001001001008665000055.1%
ByteDance Seed 1.610097949483404010055.0%
GPT-5.4 Mini (Reasoning)10099969275412300052.7%
DeepSeek V4 Pro (Reasoning)1001001001001000000050.0%
Mistral Small Creative1001009998921000049.1%
Gemini 2.5 Flash Lite (Reasoning)100100100994513700046.4%
Claude Haiku 4.510010094855918000045.7%
Qwen 3 32B10099878143191010044.0%
Qwen 3.5 9B1001008475673100043.0%
Mistral Large 310010091763223400042.6%
Hermes 3 405B989796642828000041.0%
Claude Opus 4.5100939268403000039.5%
Llama 3.1 8B100999764241000038.5%
Claude Sonnet 41009980453811310037.6%
Ministral 3B100927366370000036.8%
DeepSeek V3 (2025-03-24)1008986352621200035.8%
Ministral 3 8B10078686614131120035.2%
Grok 4.1 Fast978762383520700034.7%
Claude Opus 4.7 (Reasoning)87686660321000031.3%
Mistral Medium 3.1100987027122000031.0%
Qwen 3.5 Plus (2026-02-15)100100100330000030.5%
Rocinante 12B89805046330000029.7%
DeepSeek V3 (2024-12-26)92855437223000029.3%
GPT-5 Nano10010092000000029.2%
ByteDance Seed 2.0 Lite10094534100000028.8%
Stealth: Healer Alpha1008875631000027.3%
GPT-5.4 Mini (Reasoning, Low)100643728253000025.7%
GPT-4o, Aug. 6th (temp=1)1008965100000025.6%
Mistral NeMO10079393121000025.2%
WizardLM 2 8x22b80564141166200024.2%
Z.AI GLM 4.7 Flash8064552800000022.7%
Claude Sonnet 4.59995131110000022.0%
Xiaomi MIMO v2.5100980000000019.8%
Grok 4 Fast9271151170000019.5%
Llama 3.1 70B866433320000018.8%
GPT-4o, May 13th (temp=1)997613000000018.8%
Llama 3.1 Nemotron 70B716833320000017.9%
Ministral 3 3B100680000000016.8%
GPT-5.4 Mini10038141111000016.5%
Xiaomi MIMO v2.5 Pro97670000000016.4%
Claude 3 Haiku484641200000013.8%
Mistral Large 298276200000013.3%
DeepSeek V4 Pro100280000000012.8%
Arcee AI: Trinity Large (Preview)80370000000011.7%
Claude 3.5 Sonnet6520171020000011.5%
GPT-4.1 Mini99140000000011.4%
Gemini 3.5 Flash (Reasoning, Minimal)4947000000009.6%
Z.AI GLM 4.5 Air960000000009.6%
GPT-5.4 Nano7311000000008.5%
DeepSeek V3.14023840000007.5%
Gemini 2.5 Flash Lite649000000007.2%
Ministral 8B3317300000005.3%
Aion 2.0470000000004.7%
Z.AI GLM 4.5244100000002.9%
Qwen 2.5 72B250000000002.5%
Stealth: Hunter Alpha190000000001.9%
Claude Opus 4.7160000000001.6%
Gemini 3 Flash (Preview)70000000000.7%
Writer: Palmyra X561000000000.7%
Claude 3.7 Sonnet22000000000.3%
Gemma 3 12B21000000000.3%
Mistral Small 3.2 24B10000000000.1%
DeepSeek-V2 Chat00000000000.0%
Gemini 3.1 Flash Lite00000000000.0%
Grok 4.300000000000.0%
DeepSeek V4 Flash (Reasoning)00000000000.0%
GPT-4o Mini (temp=1)00000000000.0%
Gemma 4 26B00000000000.0%
DeepSeek V3.200000000000.0%
Grok 400000000000.0%
Gemini 2.5 Flash00000000000.0%
Claude Opus 400000000000.0%
Arcee AI: Trinity Mini00000000000.0%
Grok 4.20 (Beta)00000000000.0%
DeepSeek V4 Flash00000000000.0%
Gemma 3 27B00000000000.0%
Gemma 4 31B00000000000.0%
GPT-4.100000000000.0%
Grok 4.2000000000000.0%
Qwen3 235B A22B Instruct 250700000000000.0%
GPT-4.1 Nano00000000000.0%
GPT-4o Mini (temp=0)00000000000.0%
Gemma 3 4B00000000000.0%
Gemini 3.1 Flash Lite (Reasoning)00000000000.0%
Gemini 3.1 Flash Lite (Preview)00000000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Claude 3.7 Sonnet1001001001001001001001001009899.7%
Gemini 2.5 Flash Lite1001001001001001009999999799.4%
GPT-4o Mini (temp=0)10010010010010010010099999599.2%
DeepSeek-V2 Chat100100999999989898979798.6%
Claude Opus 4.6 (Reasoning)1001001001001001001001001008698.5%
Mistral NeMO1001001001001001009998959298.3%
Grok 4.3 (Reasoning)10010010010099989797969598.2%
Gemini 3.1 Pro (Preview)1001001001001001001001001007997.9%
MoonshotAI: Kimi K2.610010010010010010010099938397.4%
Claude Sonnet 4.6100100100100100999898938697.4%
Claude Opus 4.7 (Reasoning)100100969595959594949295.7%
Claude Haiku 4.5100100100100100999999896995.5%
Claude Sonnet 4.6 (Reasoning)1001001001001001009897926395.0%
Claude Opus 4.610010010010099999894807394.2%
Claude Sonnet 4.510099999998979791877394.2%
Gemma 3 12B10010010010010010010098893191.7%
Gemma 3 4B1001001001001001009996853791.6%
Grok 4.20 (Reasoning)9999989796938584827590.9%
DeepSeek V4 Flash (Reasoning)10010010010099999891823790.6%
GPT-5.5 (Reasoning)1001001009998979696793790.2%
Writer: Palmyra X51001001001001001001009998089.6%
Z.AI GLM 5100100999797949282696789.6%
Qwen 3.5 27B100100100100100100999988088.5%
Gemma 4 26B (Reasoning)1001001009999999581673787.5%
Gemini 2.5 Flash (Reasoning)100100999999989876703787.5%
MoonshotAI: Kimi K2.5100100100100100100948681987.0%
Claude Opus 4.51009999999997979485086.9%
Grok 41001001009997898074684785.4%
Z.AI GLM 4.610010010010096969387373784.6%
Nemotron 3 Super1001001001009999987868084.1%
DeepSeek V3.210010010010099979575373783.9%
Qwen3.7 Max1001001009488886666666683.5%
Gemma 3 27B1001001009898988978373783.4%
Gemini 2.5 Flash Lite (Reasoning)10099999793838080633783.1%
DeepSeek V3.1100100100100100989465373782.9%
GPT-5.4 Mini1001001009693929171532482.0%
Qwen 3.6 35B100100100979695936766081.4%
GPT-5.4 (Reasoning, Low)100100999998966666523781.3%
Qwen 3.5 122B1001001001009590878452080.8%
Stealth: Hunter Alpha1001001009795938776371579.9%
MiniMax M2.710010010099999796920078.3%
Gemini 3.5 Flash (Reasoning)10010099968479797967078.1%
Gemma 4 26B10099999998967437373777.6%
Qwen 3.5 Flash1001001001001009897770077.2%
GPT-5.2999797969693786637075.9%
Qwen3.6 Max Preview100100100100100100665237075.4%
Aion 2.01001001009486837765301775.2%
Claude Opus 4100100989392807437373774.7%
Qwen 3.5 35B10010010099989683660074.2%
Z.AI GLM 5.110097966967666665655174.1%
Qwen 3.5 397B A17B10010010099999666660072.7%
Z.AI GLM 4.5100100100100100973737372072.6%
GPT-5.4 (Reasoning)10099979787785137373772.1%
GPT-OSS 120B98979797948884630072.0%
Gemini 2.5 Pro1001001009890565337373770.6%
Llama 3.1 Nemotron 70B100100100999795423737170.6%
Qwen3 235B A22B Instruct 25071001001001009997373737070.6%
Grok 4.1 Fast9998897978735757412870.0%
Grok 4.20 (Beta, Reasoning)100999895958862610069.8%
Qwen 2.5 72B10010010010010010059380069.6%
Qwen 3.6 Flash100100100937565595237668.7%
Inception Mercury 299999895908884230067.5%
Qwen 3.5 Plus (2026-02-15)100100979695373737373767.3%
GPT-5.110099978078703737373767.1%
Gemma 4 31B (Reasoning)9790888068675453373767.1%
Xiaomi MIMO v2.5 Pro10010098938770373737065.8%
GPT-5.5100100999867373737373764.7%
Z.AI GLM 5 Turbo100100100736758563737062.6%
GPT-5.5 (Reasoning, Low)1001009996955144371062.3%
GPT-4.1 Nano10099999898893700062.0%
Xiaomi MIMO v2.59999969593923700061.1%
Grok 4.31009695956050373729060.0%
ByteDance Seed 1.6100100100999797600059.9%
DeepSeek V4 Flash10097878237373737373758.7%
Mistral Small 3.2 24B9997979587832610058.5%
DeepSeek V3 (2024-12-26)1001001009990504130058.4%
Z.AI GLM 4.7 Flash100100100953737373737057.9%
Arcee AI: Trinity Large (Preview)100100949191683400057.6%
Gemini 2.5 Flash10099975837373737373757.6%
Z.AI GLM 4.71009896953737373737057.3%
Llama 3.1 70B1009893923737373737557.1%
DeepSeek V4 Pro (Reasoning)100100836537373737373756.9%
Claude 3.5 Sonnet9695843737373737373753.2%
GPT-5.4 Nano (Reasoning)100100948661403700051.7%
Inception Mercury979592848363000051.4%
Gemini 3 Flash (Preview, Reasoning)1009797787267000051.1%
Z.AI GLM 4.5 Air10010099918037000050.7%
GPT-5.4 Nano (Reasoning, Low)1001001009654372000050.6%
GPT-5.4 Mini (Reasoning)948669555246373723049.8%
Ministral 8B9999979556371300049.6%
Claude Opus 4.7100100373737373737373749.4%
Gemma 4 31B9271703737373737373749.0%
o4 Mini High9896786353523770048.3%
DeepSeek V3 (2025-03-24)10094736549454200046.7%
Arcee AI: Trinity Mini999378665524231311046.1%
Stealth: Healer Alpha96868558373632256046.0%
GPT-5.4 Mini (Reasoning, Low)9999897637331962045.9%
Ministral 3B10093926043373000045.4%
o4 Mini10010092654628910044.0%
GPT-5.4 Nano100955037373737379043.8%
Mistral Large 21009998734815311043.8%
Claude Sonnet 49637373737373737373742.8%
GPT-4o Mini (temp=1)9237373737373737373742.3%
Gemini 3 Flash (Preview)8837373737373737373741.9%
Mistral Medium 3.11009983613229660041.6%
ByteDance Seed 2.0 Mini100100999800000039.7%
Grok 4.201003737373737373737039.4%
Qwen 3.5 9B1001007966460000039.1%
Qwen 3.6 27B1001008065380000038.3%
GPT-4o, May 13th (temp=0)3737373737373737373736.8%
GPT-5.43737373737373737373736.8%
GPT-4o, May 13th (temp=1)3737373737373737373736.8%
GPT-4.1 Mini3737373737373737373736.8%
DeepSeek V4 Pro3737373737373737373736.8%
GPT-4o, Aug. 6th (temp=1)3737373737373737373736.8%
GPT-4o, Aug. 6th (temp=0)3737373737373737373736.8%
Qwen 3.5 Plus (2026-04-20)9893848200000035.7%
GPT-5100947837371000034.6%
Stealth: Aurora Alpha9688534424171300033.6%
Grok 4.20 (Beta)10099993700000033.4%
GPT-5 Nano9696962600000031.5%
Rocinante 12B999893910000030.0%
Hermes 3 70B10094742910000029.8%
ByteDance Seed 2.0 Lite9793821200000028.5%
LFM2 24B80605434273000025.9%
GPT-4.13737373737373700025.8%
MiniMax M2.5100100371700000025.4%
GPT-5 Mini10010044100000024.5%
Hermes 3 405B9690302020000023.9%
Grok 4 Fast9463361273211021.9%
Qwen 3 32B9474151387000021.1%
WizardLM 2 8x22b1008812911000021.0%
Nemotron 3 Nano74730000000014.7%
Mistral Small 4 (Reasoning)97211000000011.8%
ByteDance Seed 1.6 Flash2828262500000010.8%
Mistral Small Creative9474210000010.6%
Claude 3 Haiku10011000000010.2%
Gemini 3 Pro (Preview)980000000009.8%
Cohere Command R+ (Aug. 2024)6320100000008.4%
Ministral 3 3B4335100000007.8%
Ministral 3 14B3513430000005.5%
Mistral Large 3440000000004.4%
Mistral Small 4373000000004.0%
Llama 3.1 8B311000000003.2%
Mistral Large250000000002.5%
Ministral 3 8B90000000000.9%
Gemini 3.5 Flash (Reasoning, Minimal)00000000000.0%
Gemini 3.1 Flash Lite (Reasoning)00000000000.0%
Gemini 3.1 Flash Lite (Preview)00000000000.0%
Gemini 3.1 Flash Lite00000000000.0%

dialogue-500

Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
Qwen3.7 Max100100100100100100100100100100100.0%
Gemini 3.5 Flash (Reasoning)100100100100100100100100999499.3%
Z.AI GLM 5.11001001009999999888876994.0%
Z.AI GLM 5 Turbo100100100100100100100100884393.0%
GPT-5 Mini1001001009994929286857992.6%
GPT-510010010010010010010090771087.7%
o4 Mini10010010010096958772632183.3%
MoonshotAI: Kimi K2.61001001009993838280652883.0%
Claude Sonnet 4.6 (Reasoning)9999969590847372482878.4%
Grok 4.3 (Reasoning)1001001001009993866638078.1%
Claude Opus 4.6 (Reasoning)100100999688767166481776.1%
Nemotron 3 Super1001009896919087870074.9%
o4 Mini High100100100100987979500070.6%
Gemma 4 26B (Reasoning)1009995898375544215065.4%
Gemma 4 31B (Reasoning)98979780766960165059.8%
MiniMax M2.51009999989897420059.8%
Gemini 2.5 Flash (Reasoning)99999994815729130057.3%
Qwen 3.5 27B99989892914824172057.0%
Qwen 3.6 Flash1001009987555038150054.4%
Grok 4.20 (Reasoning)959576666051271311049.5%
Nemotron 3 Nano10099716565632610049.1%
Inception Mercury 29996897244381599047.1%
ByteDance Seed 2.0 Mini1007671635952241311246.9%
Gemini 3 Flash (Preview, Reasoning)100989390810000046.1%
MoonshotAI: Kimi K2.51009477755553100045.6%
GPT-5.59692887048331243044.6%
Claude Sonnet 410099888533191500044.0%
Mistral Large988988737012000043.0%
MiniMax M2.710088836436332300042.6%
Stealth: Aurora Alpha8884605755501610041.0%
Grok 4.20 (Beta, Reasoning)9871625449301110037.6%
Qwen 3.6 35B100938666167300037.2%
Qwen 3.5 Flash9794918400000036.7%
ByteDance Seed 2.0 Lite90826563499000035.8%
DeepSeek V4 Pro (Reasoning)100100894200000033.1%
Z.AI GLM 4.7100865739386000032.6%
GPT-OSS 120B99985136330000031.7%
GPT-5.110095704243000031.5%
Claude Opus 4.610086544492211029.9%
Ministral 3 14B10086812130000029.1%
Claude Opus 4.597924035204000028.7%
GPT-4o, May 13th (temp=0)9045433230201552028.4%
Qwen 3.5 122B9983791560000028.3%
Qwen 3.5 397B A17B9695353050000026.1%
Qwen 3.5 Plus (2026-04-20)9998332230000025.6%
ByteDance Seed 1.69071481995440025.0%
Inception Mercury100722320150000023.1%
GPT-5.5 (Reasoning)98473526124100022.3%
GPT-5.5 (Reasoning, Low)797652410000021.2%
GPT-4o, May 13th (temp=1)9361261180000020.1%
GPT-5.4 (Reasoning)10049351321100020.0%
Gemini 2.5 Pro98933100000019.5%
Claude 3 Haiku985930610000019.3%
Ministral 8B99811000000018.1%
Claude Opus 4.7 (Reasoning)97820000000017.9%
GPT-5.4 (Reasoning, Low)49393130210000017.0%
Mistral NeMO895312310000015.8%
GPT-5.493505311000015.3%
GPT-5.4 Nano (Reasoning, Low)753523855100015.3%
Xiaomi MIMO v2.5 Pro92504200000014.7%
GPT-5 Nano544940310000014.7%
Hermes 3 70B82631000000014.7%
Gemini 3 Pro (Preview)91540000000014.4%
Claude Sonnet 4.68725161060000014.4%
Gemini 2.5 Flash Lite (Reasoning)7834151110000013.9%
Claude Opus 4.798361000000013.6%
Claude Sonnet 4.57323221230000013.3%
Z.AI GLM 581454000000013.1%
GPT-5.288340000000012.2%
Ministral 3 3B93125530000011.8%
Qwen 3.6 27B94190000000011.4%
WizardLM 2 8x22b9843100000010.6%
Z.AI GLM 4.69971000000010.6%
Qwen 3.5 35B886200000009.7%
GPT-5.4 Mini5724930000009.2%
Rocinante 12B860000000008.6%
Arcee AI: Trinity Large (Preview)860000000008.6%
Qwen 3.5 Plus (2026-02-15)5233000000008.6%
Qwen3.6 Max Preview810000000008.1%
Mistral Small 4657000000007.2%
Qwen 3.5 9B4220500000006.6%
Mistral Small 4 (Reasoning)3524600000006.4%
GPT-5.4 Nano507520000006.4%
Hermes 3 405B610000000006.1%
GPT-4o, Aug. 6th (temp=1)570000000005.8%
Mistral Small 3.2 24B523000000005.5%
Ministral 3B435100000005.0%
Mistral Medium 3.1480000000004.8%
GPT-5.4 Mini (Reasoning, Low)351000000003.7%
GPT-4.1 Mini2214000000003.6%
DeepSeek V3 (2025-03-24)240000000002.4%
Stealth: Healer Alpha170000000001.7%
GPT-5.4 Nano (Reasoning)141000000001.5%
Llama 3.1 8B64000000001.0%
Qwen 3 32B41000000000.5%
Llama 3.1 70B31000000000.4%
Claude 3.5 Sonnet21000000000.3%
Z.AI GLM 4.7 Flash20000000000.2%
Claude Haiku 4.520000000000.2%
Qwen 2.5 72B11000000000.2%
Stealth: Hunter Alpha10000000000.1%
ByteDance Seed 1.6 Flash00000000000.0%
Writer: Palmyra X500000000000.0%
GPT-4.100000000000.0%
Claude Opus 400000000000.0%
GPT-5.4 Mini (Reasoning)00000000000.0%
LFM2 24B00000000000.0%
Arcee AI: Trinity Mini00000000000.0%
DeepSeek V3.100000000000.0%
Mistral Large 200000000000.0%
Gemma 4 31B00000000000.0%
Mistral Small Creative00000000000.0%
Qwen3 235B A22B Instruct 250700000000000.0%
Grok 400000000000.0%
GPT-4.1 Nano00000000000.0%
Ministral 3 8B00000000000.0%
Grok 4.300000000000.0%
Cohere Command R+ (Aug. 2024)00000000000.0%
DeepSeek V3 (2024-12-26)00000000000.0%
Grok 4 Fast00000000000.0%
Xiaomi MIMO v2.500000000000.0%
Llama 3.1 Nemotron 70B00000000000.0%
Grok 4.1 Fast00000000000.0%
DeepSeek V3.200000000000.0%
Gemma 3 12B00000000000.0%
Mistral Large 300000000000.0%
Z.AI GLM 4.500000000000.0%
DeepSeek V4 Pro00000000000.0%
GPT-4o, Aug. 6th (temp=0)00000000000.0%
DeepSeek V4 Flash00000000000.0%
Gemma 3 4B00000000000.0%
Claude 3.7 Sonnet00000000000.0%
Aion 2.000000000000.0%
DeepSeek-V2 Chat00000000000.0%
Gemini 3 Flash (Preview)00000000000.0%
GPT-4o Mini (temp=0)00000000000.0%
Gemini 3.1 Flash Lite (Reasoning)00000000000.0%
Gemini 3.1 Flash Lite (Preview)00000000000.0%
Gemini 3.1 Flash Lite00000000000.0%
Gemma 4 26B00000000000.0%
Gemini 3.5 Flash (Reasoning, Minimal)00000000000.0%
Grok 4.2000000000000.0%
Gemini 2.5 Flash Lite00000000000.0%
Z.AI GLM 4.5 Air00000000000.0%
GPT-4o Mini (temp=1)00000000000.0%
Gemini 2.5 Flash00000000000.0%
DeepSeek V4 Flash (Reasoning)00000000000.0%
Grok 4.20 (Beta)00000000000.0%
Gemma 3 27B00000000000.0%
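Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 95 | 77 | 96.9%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 99 | 96 | 95 | 93 | 88 | 86 | 71 | 92.7%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 29 | 91.3%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 97 | 0 | 89.5%
Z.AI GLM 5.1 | 100 | 100 | 100 | 99 | 97 | 95 | 95 | 91 | 58 | 0 | 83.5%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 71 | 27 | 9 | 80.7%
GPT-5 Mini | 100 | 100 | 100 | 99 | 71 | 58 | 43 | 0 | 0 | 0 | 57.1%
MiniMax M2.5 | 100 | 100 | 100 | 99 | 88 | 51 | 26 | 0 | 0 | 0 | 56.3%
MoonshotAI: Kimi K2.5 | 99 | 99 | 90 | 79 | 50 | 49 | 47 | 32 | 15 | 0 | 56.0%
o4 Mini | 100 | 100 | 99 | 90 | 75 | 42 | 32 | 0 | 0 | 0 | 53.8%
MiniMax M2.7 | 100 | 99 | 98 | 97 | 73 | 66 | 1 | 0 | 0 | 0 | 53.5%
Ministral 3 14B | 100 | 97 | 95 | 93 | 81 | 58 | 1 | 0 | 0 | 0 | 52.7%
Gemma 4 31B (Reasoning) | 100 | 96 | 96 | 95 | 94 | 16 | 10 | 4 | 0 | 0 | 51.2%
Inception Mercury 2 | 100 | 100 | 89 | 85 | 80 | 33 | 2 | 0 | 0 | 0 | 48.8%
Nemotron 3 Super | 100 | 98 | 90 | 77 | 65 | 53 | 1 | 0 | 0 | 0 | 48.4%
Claude Opus 4.6 | 88 | 85 | 83 | 78 | 47 | 43 | 28 | 1 | 0 | 0 | 45.5%
o4 Mini High | 100 | 90 | 82 | 82 | 44 | 23 | 0 | 0 | 0 | 0 | 42.1%
Grok 4.3 (Reasoning) | 95 | 93 | 84 | 48 | 47 | 41 | 5 | 0 | 0 | 0 | 41.3%
Stealth: Aurora Alpha | 84 | 72 | 62 | 51 | 45 | 42 | 23 | 21 | 2 | 0 | 40.1%
Claude Opus 4.5 | 98 | 97 | 64 | 55 | 41 | 33 | 0 | 0 | 0 | 0 | 38.7%
ByteDance Seed 2.0 Mini | 98 | 89 | 87 | 37 | 33 | 23 | 11 | 0 | 0 | 0 | 38.0%
Qwen 3.6 35B | 100 | 91 | 72 | 51 | 38 | 25 | 2 | 0 | 0 | 0 | 37.9%
GPT-5.4 Nano (Reasoning, Low) | 98 | 82 | 64 | 52 | 43 | 13 | 11 | 0 | 0 | 0 | 36.2%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 96 | 34 | 20 | 8 | 2 | 0 | 0 | 0 | 36.0%
Claude Sonnet 4 | 96 | 94 | 89 | 45 | 9 | 8 | 3 | 2 | 0 | 0 | 34.6%
Ministral 8B | 96 | 78 | 71 | 59 | 39 | 2 | 0 | 0 | 0 | 0 | 34.6%
Claude Sonnet 4.6 | 97 | 88 | 59 | 30 | 26 | 20 | 6 | 6 | 4 | 3 | 33.8%
ByteDance Seed 1.6 Flash | 87 | 79 | 47 | 36 | 32 | 26 | 26 | 1 | 0 | 0 | 33.4%
Grok 4.20 (Reasoning) | 100 | 100 | 84 | 25 | 18 | 2 | 0 | 0 | 0 | 0 | 32.9%
Claude Opus 4.6 (Reasoning) | 95 | 72 | 61 | 56 | 33 | 7 | 1 | 0 | 0 | 0 | 32.6%
Nemotron 3 Nano | 100 | 97 | 91 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 32.5%
Qwen 3.5 9B | 96 | 79 | 78 | 42 | 17 | 7 | 6 | 0 | 0 | 0 | 32.4%
GPT-5.4 Mini | 100 | 95 | 40 | 37 | 30 | 11 | 9 | 0 | 0 | 0 | 32.2%
GPT-5.1 | 100 | 98 | 81 | 30 | 3 | 0 | 0 | 0 | 0 | 0 | 31.2%
GPT-5.4 Nano (Reasoning) | 97 | 86 | 51 | 44 | 20 | 10 | 2 | 1 | 0 | 0 | 31.2%
Claude Sonnet 4.5 | 96 | 92 | 90 | 27 | 1 | 0 | 0 | 0 | 0 | 0 | 30.6%
Qwen 3.6 27B | 97 | 84 | 66 | 58 | 0 | 0 | 0 | 0 | 0 | 0 | 30.5%
WizardLM 2 8x22b | 100 | 83 | 65 | 42 | 7 | 6 | 1 | 0 | 0 | 0 | 30.3%
GPT-4o, Aug. 6th (temp=0) | 88 | 76 | 74 | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 29.9%
GPT-5.4 (Reasoning) | 100 | 67 | 60 | 20 | 19 | 15 | 8 | 1 | 0 | 0 | 29.1%
Ministral 3 3B | 96 | 87 | 51 | 44 | 6 | 0 | 0 | 0 | 0 | 0 | 28.5%
Claude Sonnet 4.6 (Reasoning) | 100 | 95 | 56 | 10 | 8 | 8 | 2 | 1 | 1 | 1 | 28.1%
Gemma 4 26B (Reasoning) | 82 | 82 | 71 | 31 | 4 | 2 | 0 | 0 | 0 | 0 | 27.3%
Qwen 3.6 Flash | 97 | 90 | 39 | 14 | 12 | 11 | 6 | 0 | 0 | 0 | 27.0%
GPT-5 Nano | 100 | 77 | 50 | 42 | 0 | 0 | 0 | 0 | 0 | 0 | 26.9%
GPT-OSS 120B | 100 | 99 | 45 | 14 | 4 | 0 | 0 | 0 | 0 | 0 | 26.1%
GPT-5.5 (Reasoning, Low) | 97 | 54 | 52 | 24 | 22 | 2 | 2 | 0 | 0 | 0 | 25.3%
Hermes 3 70B | 93 | 83 | 44 | 26 | 2 | 0 | 0 | 0 | 0 | 0 | 25.0%
Hermes 3 405B | 98 | 76 | 62 | 9 | 2 | 1 | 1 | 0 | 0 | 0 | 24.9%
Qwen3.6 Max Preview | 100 | 82 | 67 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 24.9%
DeepSeek V4 Pro (Reasoning) | 100 | 96 | 48 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 24.4%
GPT-5.4 | 100 | 84 | 28 | 28 | 1 | 0 | 0 | 0 | 0 | 0 | 24.1%
LFM2 24B | 99 | 86 | 36 | 18 | 1 | 0 | 0 | 0 | 0 | 0 | 24.0%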
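GPT-5.4 Nano | 100 | 73 | 38 | 28 | 0 | 0 | 0 | 0 | 0 | 0 | 23.9%
GPT-5.5 (Reasoning) | 95 | 89 | 20 | 20 | 9 | 3 | 1 | 1 | 1 | 0 | 23.9%
Ministral 3B | 97 | 96 | 32 | 6 | 6 | 2 | 0 | 0 | 0 | 0 | 23.9%
GPT-4o, May 13th (temp=1) | 94 | 88 | 37 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 23.5%
Ministral 3 8B | 93 | 80 | 59 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 23.4%
ByteDance Seed 1.6 | 97 | 65 | 44 | 13 | 5 | 3 | 3 | 1 | 0 | 0 | 23.0%
Rocinante 12B | 100 | 65 | 47 | 7 | 4 | 3 | 0 | 0 | 0 | 0 | 22.6%
Mistral Large 2 | 100 | 95 | 9 | 9 | 7 | 0 | 0 | 0 | 0 | 0 | 21.9%
Qwen 3.5 35B | 96 | 90 | 33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 21.9%
Gemini 3 Flash (Preview, Reasoning) | 97 | 88 | 33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21.7%
Mistral NeMO | 64 | 62 | 44 | 40 | 5 | 3 | 0 | 0 | 0 | 0 | 21.7%
Qwen 3.5 Plus (2026-04-20) | 92 | 72 | 37 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 21.6%
Llama 3.1 8B | 100 | 60 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21.1%
GPT-5.2 | 98 | 74 | 34 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 21.0%
ByteDance Seed 2.0 Lite | 73 | 51 | 44 | 31 | 10 | 0 | 0 | 0 | 0 | 0 | 21.0%
Arcee AI: Trinity Large (Preview) | 85 | 84 | 30 | 7 | 3 | 0 | 0 | 0 | 0 | 0 | 21.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 77 | 28 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.5%
DeepSeek V3 (2024-12-26) | 97 | 93 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19.5%
Qwen 3.5 397B A17B | 84 | 78 | 24 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 18.7%
Qwen 3 32B | 100 | 49 | 26 | 10 | 2 | 0 | 0 | 0 | 0 | 0 | 18.6%
Grok 4.20 (Beta, Reasoning) | 54 | 54 | 50 | 8 | 5 | 5 | 4 | 3 | 2 | 0 | 18.5%
Mistral Small 3.2 24B | 86 | 75 | 10 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 17.4%
Mistral Small 4 | 95 | 78 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.3%
Qwen 3.5 27B | 99 | 72 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.1%
Z.AI GLM 4.7 | 97 | 32 | 30 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 16.7%
GPT-5.4 (Reasoning, Low) | 94 | 23 | 16 | 15 | 12 | 4 | 0 | 0 | 0 | 0 | 16.5%
Mistral Large | 53 | 48 | 31 | 17 | 6 | 2 | 1 | 1 | 0 | 0 | 15.8%
Cohere Command R+ (Aug. 2024) | 79 | 65 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 14.9%
GPT-5.5 | 64 | 31 | 27 | 23 | 3 | 0 | 0 | 0 | 0 | 0 | 14.8%
GPT-4o, Aug. 6th (temp=1) | 93 | 30 | 10 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 13.5%
Gemini 2.5 Pro | 67 | 34 | 13 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 11.9%
Claude 3.5 Sonnet | 62 | 39 | 8 | 5 | 1 | 1 | 1 | 0 | 0 | 0 | 11.7%
Mistral Small 4 (Reasoning) | 100 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11.3%
Stealth: Hunter Alpha | 100 | 11 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11.1%
Gemini 3.1 Flash Lite (Reasoning) | 99 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Stealth: Healer Alpha | 95 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.5%
Z.AI GLM 4.5 | 93 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.3%
GPT-4.1 Mini | 60 | 15 | 11 | 4 | 3 | 0 | 0 | 0 | 0 | 0 | 9.2%
Mistral Large 3 | 50 | 31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.2%
DeepSeek V3 (2025-03-24) | 64 | 13 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8.0%
Inception Mercury | 72 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.5%
Llama 3.1 70B | 67 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.5%
Qwen 3.5 Flash | 67 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.8%
Xiaomi MIMO v2.5 Pro | 66 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.7%
Qwen 2.5 72B | 27 | 25 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.2%
Qwen 3.5 122B | 32 | 12 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.1%
Mistral Medium 3.1 | 48 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.9%
GPT-5.4 Mini (Reasoning, Low) | 43 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.8%
Z.AI GLM 5 | 22 | 15 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4.6%
Claude Opus 4 | 32 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.2%
Llama 3.1 Nemotron 70B | 21 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.1%
GPT-5.4 Mini (Reasoning) | 17 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.9%
GPT-4.1 Nano | 23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.4%
Grok 4 Fast | 18 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.2%
GPT-4o, May 13th (temp=0) | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.4%
Qwen 3.5 Plus (2026-02-15) | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.8%
Gemini 3.1 Flash Lite | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.7%
Claude Haiku 4.5 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.6%
Z.AI GLM 4.7 Flash | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4%
Grok 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3%
Z.AI GLM 4.6 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3%
Claude 3 Haiku | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3%
Xiaomi MIMO v2.5 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2%
Mistral Small Creative | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2%
Grok 4.1 Fast | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2%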
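Claude Opus 4.7 (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Writer: Palmyra X5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 4.5 Air | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V4 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek-V2 Chat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen3 235B A22B Instruct 2507 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3.1 Flash Lite (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.7 Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Opus 4.7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V4 Flash (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash Lite | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Aion 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 31B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V4 Pro | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 26B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.20 (Beta) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%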
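Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99.9%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 99 | 96 | 92 | 51 | 44 | 0 | 78.1%
Qwen3.7 Max | 100 | 100 | 100 | 96 | 96 | 76 | 71 | 70 | 60 | 9 | 77.8%
Grok 4 Fast | 99 | 97 | 95 | 94 | 88 | 85 | 84 | 65 | 34 | 8 | 74.8%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 96 | 79 | 71 | 68 | 41 | 10 | 0 | 66.4%
Claude 3.5 Sonnet | 99 | 99 | 95 | 92 | 87 | 87 | 82 | 6 | 5 | 0 | 65.3%
Mistral Medium 3.1 | 100 | 95 | 91 | 91 | 89 | 46 | 42 | 38 | 14 | 0 | 60.5%
DeepSeek-V2 Chat | 100 | 87 | 80 | 79 | 79 | 68 | 38 | 33 | 21 | 10 | 59.6%
Grok 4.1 Fast | 100 | 96 | 82 | 75 | 61 | 54 | 50 | 30 | 14 | 6 | 56.9%
o4 Mini High | 100 | 100 | 91 | 74 | 66 | 57 | 23 | 20 | 11 | 0 | 54.3%
Z.AI GLM 5 Turbo | 100 | 99 | 99 | 90 | 70 | 38 | 7 | 1 | 0 | 0 | 50.5%
Llama 3.1 70B | 99 | 95 | 94 | 91 | 91 | 23 | 5 | 2 | 0 | 0 | 49.9%
GPT-5 Mini | 100 | 100 | 100 | 99 | 99 | 0 | 0 | 0 | 0 | 0 | 49.8%
Mistral Small Creative | 100 | 99 | 89 | 78 | 54 | 28 | 23 | 23 | 1 | 0 | 49.5%
Z.AI GLM 5.1 | 97 | 89 | 86 | 85 | 80 | 49 | 1 | 0 | 0 | 0 | 48.6%
GPT-5.5 (Reasoning) | 96 | 94 | 71 | 63 | 57 | 44 | 40 | 3 | 1 | 0 | 46.9%
GPT-5.4 Mini | 99 | 98 | 82 | 77 | 46 | 24 | 16 | 8 | 8 | 0 | 45.8%
GPT-4o, Aug. 6th (temp=0) | 100 | 93 | 67 | 40 | 40 | 40 | 40 | 35 | 1 | 1 | 45.6%
DeepSeek V3.1 | 99 | 98 | 89 | 79 | 42 | 19 | 8 | 7 | 1 | 0 | 44.2%
GPT-4o, Aug. 6th (temp=1) | 100 | 89 | 88 | 85 | 71 | 0 | 0 | 0 | 0 | 0 | 43.3%
ByteDance Seed 1.6 | 100 | 97 | 86 | 81 | 54 | 10 | 0 | 0 | 0 | 0 | 43.0%
Ministral 3 8B | 98 | 81 | 55 | 54 | 42 | 42 | 24 | 20 | 10 | 1 | 42.8%
GPT-5.2 | 92 | 90 | 85 | 81 | 53 | 10 | 6 | 5 | 1 | 0 | 42.2%
Grok 4 | 93 | 58 | 57 | 54 | 47 | 43 | 26 | 25 | 10 | 8 | 41.9%
Claude Sonnet 4.6 | 100 | 100 | 84 | 54 | 32 | 22 | 15 | 3 | 2 | 0 | 41.2%
Mistral Small 3.2 24B | 98 | 92 | 76 | 71 | 56 | 5 | 3 | 0 | 0 | 0 | 40.0%
Llama 3.1 Nemotron 70B | 98 | 87 | 76 | 74 | 45 | 9 | 4 | 1 | 1 | 0 | 39.6%
GPT-5 | 100 | 85 | 71 | 58 | 38 | 24 | 10 | 0 | 0 | 0 | 38.5%
GPT-OSS 120B | 100 | 96 | 81 | 48 | 29 | 20 | 3 | 2 | 2 | 0 | 38.1%
Claude 3 Haiku | 97 | 88 | 60 | 50 | 44 | 27 | 2 | 0 | 0 | 0 | 37.0%
Mistral Small 4 | 97 | 69 | 66 | 57 | 52 | 23 | 1 | 1 | 0 | 0 | 36.6%
Mistral Large 2 | 99 | 98 | 89 | 34 | 17 | 9 | 8 | 5 | 0 | 0 | 35.9%
Claude Sonnet 4.5 | 93 | 84 | 75 | 35 | 34 | 33 | 2 | 0 | 0 | 0 | 35.6%
DeepSeek V3 (2024-12-26) | 96 | 59 | 57 | 38 | 32 | 31 | 27 | 11 | 5 | 0 | 35.5%
Mistral Small 4 (Reasoning) | 97 | 90 | 48 | 41 | 37 | 32 | 3 | 0 | 0 | 0 | 34.8%
MiniMax M2.5 | 97 | 80 | 67 | 54 | 16 | 13 | 2 | 0 | 0 | 0 | 32.7%
GPT-5.4 Mini (Reasoning, Low) | 100 | 80 | 64 | 47 | 9 | 7 | 5 | 3 | 0 | 0 | 31.6%
Gemma 3 12B | 77 | 72 | 69 | 41 | 37 | 10 | 6 | 3 | 1 | 0 | 31.5%
Stealth: Aurora Alpha | 96 | 87 | 61 | 47 | 11 | 2 | 0 | 0 | 0 | 0 | 30.4%
GPT-5.1 | 99 | 82 | 64 | 34 | 10 | 7 | 4 | 1 | 0 | 0 | 30.1%
Inception Mercury | 96 | 96 | 72 | 29 | 2 | 0 | 0 | 0 | 0 | 0 | 29.7%
DeepSeek V4 Pro (Reasoning) | 99 | 90 | 62 | 35 | 7 | 2 | 0 | 0 | 0 | 0 | 29.5%
Cohere Command R+ (Aug. 2024) | 88 | 65 | 59 | 49 | 25 | 9 | 0 | 0 | 0 | 0 | 29.5%
Qwen 3 32B | 98 | 95 | 79 | 14 | 6 | 1 | 0 | 0 | 0 | 0 | 29.4%
GPT-5.4 (Reasoning, Low) | 100 | 98 | 45 | 45 | 2 | 1 | 1 | 0 | 0 | 0 | 29.2%
GPT-5.4 Nano | 99 | 73 | 65 | 32 | 16 | 6 | 0 | 0 | 0 | 0 | 29.0%
Qwen3.6 Max Preview | 100 | 99 | 85 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 28.4%
GPT-5.5 | 97 | 85 | 65 | 32 | 1 | 0 | 0 | 0 | 0 | 0 | 28.1%
Claude Opus 4.6 (Reasoning) | 86 | 83 | 77 | 17 | 11 | 6 | 0 | 0 | 0 | 0 | 28.0%
MiniMax M2.7 | 96 | 87 | 58 | 37 | 1 | 0 | 0 | 0 | 0 | 0 | 27.8%
Ministral 3B | 86 | 74 | 48 | 36 | 30 | 1 | 0 | 0 | 0 | 0 | 27.5%
Claude Sonnet 4.6 (Reasoning) | 93 | 79 | 35 | 34 | 16 | 10 | 6 | 1 | 0 | 0 | 27.5%
Qwen 3.6 Flash | 82 | 65 | 53 | 43 | 32 | 0 | 0 | 0 | 0 | 0 | 27.5%
ByteDance Seed 1.6 Flash | 97 | 90 | 72 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 27.2%
ByteDance Seed 2.0 Mini | 100 | 100 | 37 | 20 | 10 | 0 | 0 | 0 | 0 | 0 | 26.7%
GPT-5.4 (Reasoning) | 75 | 56 | 42 | 30 | 27 | 23 | 8 | 1 | 0 | 0 | 26.2%
Ministral 3 14B | 93 | 77 | 48 | 21 | 13 | 0 | 0 | 0 | 0 | 0 | 25.2%
Mistral Large 3 | 53 | 51 | 47 | 42 | 40 | 10 | 0 | 0 | 0 | 0 | 24.3%
Qwen 2.5 72B | 93 | 70 | 31 | 24 | 23 | 1 | 0 | 0 | 0 | 0 | 24.3%
Hermes 3 70B | 89 | 79 | 46 | 15 | 9 | 0 | 0 | 0 | 0 | 0 | 23.9%
Grok 4.20 (Beta, Reasoning) | 98 | 48 | 41 | 28 | 22 | 0 | 0 | 0 | 0 | 0 | 23.7%
DeepSeek V3 (2025-03-24) | 98 | 98 | 12 | 11 | 10 | 5 | 2 | 0 | 0 | 0 | 23.5%
GPT-5.4 Mini (Reasoning) | 85 | 73 | 50 | 11 | 5 | 4 | 3 | 1 | 0 | 0 | 23.2%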
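MoonshotAI: Kimi K2.6 | 89 | 46 | 30 | 27 | 14 | 10 | 9 | 0 | 0 | 0 | 22.5%
Gemini 2.5 Flash (Reasoning) | 88 | 70 | 27 | 16 | 9 | 7 | 5 | 2 | 0 | 0 | 22.5%
Grok 4.20 (Reasoning) | 79 | 67 | 59 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 22.4%
Z.AI GLM 4.6 | 81 | 60 | 41 | 29 | 10 | 1 | 0 | 0 | 0 | 0 | 22.1%
Claude Opus 4.7 (Reasoning) | 84 | 71 | 36 | 24 | 2 | 2 | 1 | 1 | 0 | 0 | 22.0%
Rocinante 12B | 100 | 76 | 30 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 21.9%
Llama 3.1 8B | 74 | 71 | 37 | 20 | 8 | 5 | 0 | 0 | 0 | 0 | 21.4%
Hermes 3 405B | 88 | 71 | 31 | 21 | 0 | 0 | 0 | 0 | 0 | 0 | 21.1%
Qwen 3.5 397B A17B | 94 | 81 | 15 | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 19.9%
Claude 3.7 Sonnet | 50 | 45 | 34 | 28 | 20 | 19 | 2 | 1 | 0 | 0 | 19.8%
Qwen 3.6 35B | 100 | 59 | 39 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19.7%
Arcee AI: Trinity Mini | 98 | 75 | 8 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 18.7%
GPT-5.4 Nano (Reasoning) | 99 | 69 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 17.3%
GPT-5.4 | 89 | 42 | 16 | 8 | 8 | 5 | 2 | 2 | 1 | 0 | 17.2%
o4 Mini | 89 | 57 | 12 | 12 | 1 | 0 | 0 | 0 | 0 | 0 | 17.1%
WizardLM 2 8x22b | 98 | 38 | 26 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 16.8%
Z.AI GLM 4.5 Air | 95 | 68 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 16.8%
MoonshotAI: Kimi K2.5 | 95 | 29 | 16 | 15 | 11 | 0 | 0 | 0 | 0 | 0 | 16.7%
Gemini 2.5 Flash Lite (Reasoning) | 88 | 74 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16.3%
Stealth: Healer Alpha | 78 | 48 | 20 | 10 | 4 | 1 | 0 | 0 | 0 | 0 | 16.2%
Grok 4.3 (Reasoning) | 99 | 19 | 17 | 14 | 6 | 1 | 0 | 0 | 0 | 0 | 15.6%
Gemma 4 26B (Reasoning) | 92 | 54 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.3%
GPT-4o Mini (temp=1) | 99 | 48 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14.7%
LFM2 24B | 74 | 37 | 22 | 4 | 2 | 1 | 0 | 0 | 0 | 0 | 14.1%
GPT-4.1 Nano | 48 | 30 | 29 | 25 | 2 | 1 | 0 | 0 | 0 | 0 | 13.5%
Aion 2.0 | 96 | 22 | 10 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 13.5%
GPT-4.1 Mini | 96 | 23 | 4 | 3 | 2 | 1 | 0 | 0 | 0 | 0 | 12.9%
GPT-5.4 Nano (Reasoning, Low) | 100 | 19 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12.3%
GPT-4o, May 13th (temp=1) | 82 | 29 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12.1%
Claude Sonnet 4 | 99 | 11 | 6 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 11.8%
Claude Haiku 4.5 | 84 | 13 | 7 | 3 | 2 | 1 | 1 | 0 | 0 | 0 | 11.1%
Qwen 3.6 27B | 93 | 8 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.8%
Qwen 3.5 27B | 51 | 34 | 16 | 4 | 3 | 0 | 0 | 0 | 0 | 0 | 10.7%
Qwen 3.5 35B | 79 | 15 | 8 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 10.6%
Qwen 3.5 Flash | 59 | 44 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.3%
Writer: Palmyra X5 | 97 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.1%
Mistral NeMO | 87 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.7%
GPT-5.5 (Reasoning, Low) | 72 | 10 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 9.6%
Xiaomi MIMO v2.5 | 79 | 10 | 3 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 9.6%
Gemini 3.1 Flash Lite | 83 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.3%
Inception Mercury 2 | 32 | 29 | 17 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 8.0%
Z.AI GLM 4.7 Flash | 76 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.6%
Arcee AI: Trinity Large (Preview) | 75 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.5%
Grok 4.3 | 63 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.0%
Gemini 2.5 Pro | 34 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.2%
DeepSeek V3.2 | 32 | 12 | 5 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 5.4%
Xiaomi MIMO v2.5 Pro | 39 | 9 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.2%
Qwen 3.5 9B | 49 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0%
Gemini 3 Pro (Preview) | 45 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.7%
Z.AI GLM 4.5 | 28 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.6%
Qwen3 235B A22B Instruct 2507 | 29 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.5%
Claude Opus 4.6 | 32 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.3%
Claude Opus 4 | 19 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.1%
Grok 4.20 | 22 | 8 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.1%
Gemini 2.5 Flash | 26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.7%
GPT-4o, May 13th (temp=0) | 26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.7%
Stealth: Hunter Alpha | 12 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.5%
Claude Opus 4.7 | 9 | 7 | 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 2.3%
Nemotron 3 Nano | 23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.3%
DeepSeek V4 Flash | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.1%
Qwen 3.5 Plus (2026-04-20) | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0%
Mistral Large | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.9%
Gemini 3 Flash (Preview) | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.9%
DeepSeek V4 Flash (Reasoning) | 4 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.7%
ByteDance Seed 2.0 Lite | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.7%
Gemma 4 26B | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4%
Gemma 3 27B | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3%
Ministral 3 3B | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2%
Gemini 2.5 Flash Lite | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2%
Grok 4.20 (Beta) | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1%
Qwen 3.5 122B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1%
Claude Opus 4.5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1%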
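Z.AI GLM 4.7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V4 Pro | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 8B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 31B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 4 31B (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview, Reasoning) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3.5 Flash (Reasoning, Minimal) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3.1 Flash Lite (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3.1 Flash Lite (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%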