Matches word count

Test: Dialogue tags

Avg. Score: 39.3%
Scenarios: 6

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | GPT-5 Mini | 95.7% | $0.011 | 58.5s | 63%
2 | Claude Sonnet 4.6 | 81.1% | $0.0079 | 12.8s | 45%
3 | Claude Opus 4.6 | 83.6% | $0.014 | 15.5s | 40%
4 | GPT-4o Mini (temp=0) | 75.7% | $0.0003 | 8.5s | 30%
5 | Claude Opus 4.5 | 78.8% | $0.014 | 14.3s | 33%
6 | GPT-4o, Aug. 6th (temp=0) | 69.0% | $0.0053 | 6.3s | 18%
7 | Gemini 3 Flash (Preview) | 64.8% | $0.0015 | 5.0s | 18%
8 | GPT-5 | 89.2% | $0.053 | 1.5m | 41%
9 | GPT-4o, Aug. 6th (temp=1) | 61.6% | $0.0054 | 6.3s | 10%
10 | GPT-4.1 | 58.5% | $0.0044 | 7.8s | 9%
11 | Minimax M2.5 | 88.4% | $0.017 | 3.1m | 40%
12 | GPT-4o Mini (temp=1) | 57.6% | $0.0003 | 8.4s | 6%
13 | o4 Mini High | 86.3% | $0.050 | 1.9m | 37%
14 | Grok 4 | 59.0% | $0.012 | 27.3s | 14%
15 | Claude Opus 4 | 68.2% | $0.044 | 25.5s | 22%
16 | Claude 3.7 Sonnet | 51.5% | $0.0085 | 11.6s | 12%
17 | Gemini 3.1 Pro (Preview) | 98.7% | $0.157 | 2.1m | 80%
18 | Grok 4 Fast | 48.8% | $0.0004 | 6.8s | 7%
19 | Claude Haiku 4.5 | 47.7% | $0.0026 | 6.6s | 8%
20 | o4 Mini | 70.5% | $0.026 | 1.1m | 17%
21 | Grok 4.1 Fast | 51.3% | $0.0004 | 10.3s | 4%
22 | Gemini 2.5 Flash Lite | 42.9% | $0.0002 | 2.6s | 5%
23 | Llama 3.1 70B | 40.5% | $0.0005 | 5.2s | 4%
24 | Llama 3.1 8B | 39.4% | $0.0001 | 2.0s | 1%
25 | Claude 3 Haiku | 37.2% | $0.0007 | 4.7s | 1%
26 | Mistral Medium 3.1 | 39.8% | $0.0012 | 13.1s | 2%
27 | GPT-4.1 Mini | 38.9% | $0.0009 | 6.9s | 0%
28 | Arcee AI: Trinity Large (Preview) | 39.1% | $0.0000 | 12.2s | 1%
29 | Stealth: Aurora Alpha | 67.7% |  | 9.2s | 9%
30 | DeepSeek V3 (2024-12-26) | 38.6% | $0.0006 | 19.3s | 3%
31 | GPT-4.1 Nano | 37.3% | $0.0002 | 6.0s | 0%
32 | Hermes 3 405B | 37.4% | $0.0000 | 26.5s | 4%
33 | DeepSeek V3 (2025-03-24) | 34.3% | $0.0006 | 17.8s | 0%
34 | Gemini 2.5 Flash | 30.0% | $0.0015 | 4.0s | 0%
35 | GPT-5 Nano | 59.6% | $0.0040 | 1.6m | 3%
36 | DeepSeek-V2 Chat | 37.3% | $0.0002 | 29.9s | 0%
37 | Mistral Large 3 | 30.6% | $0.0009 | 12.1s | 0%
38 | DeepSeek V3.1 | 35.4% | $0.0006 | 28.3s | 0%
39 | GPT-5.2 | 53.0% | $0.027 | 38.3s | 1%
40 | Gemma 3 12B | 27.5% | $0.0001 | 11.9s | 0%
41 | GPT-5.1 | 56.1% | $0.030 | 52.9s | 3%
42 | GPT-4o, May 13th (temp=1) | 33.2% | $0.0088 | 15.6s | 0%
43 | Qwen 2.5 72B | 27.3% | $0.0003 | 15.6s | 0%
44 | GPT-4o, May 13th (temp=0) | 31.8% | $0.0090 | 14.3s | 0%
45 | Mistral Small 3.2 24B | 23.8% | $0.0002 | 9.0s | 0%
46 | Mistral Large | 33.7% | $0.015 | 13.3s | 0%
47 | Claude 3.5 Haiku | 22.3% | $0.0020 | 7.4s | 0%
48 | Claude Sonnet 4.5 | 26.4% | $0.0080 | 11.8s | 1%
49 | Hermes 3 70B | 25.5% | $0.0002 | 20.8s | 0%
50 | Ministral 8B | 18.8% | $0.0000 | 4.0s | 0%
51 | Ministral 3 3B | 17.4% | $0.0001 | 2.0s | 0%
52 | Gemini 3 Pro (Preview) | 42.6% | $0.031 | 23.6s | 1%
53 | Arcee AI: Trinity Mini | 17.5% | $0.0001 | 5.6s | 0%
54 | Claude Sonnet 4 | 22.6% | $0.0077 | 11.4s | 0%
55 | Claude 3.5 Sonnet | 28.3% | $0.0092 | 32.1s | 1%
56 | DeepSeek V3.2 | 29.2% | $0.0005 | 47.8s | 0%
57 | Gemini 2.5 Pro | 37.7% | $0.029 | 25.0s | 1%
58 | Z.AI GLM 4.5 | 15.4% | $0.0009 | 8.9s | 0%
59 | Ministral 3 8B | 12.9% | $0.0001 | 3.3s | 0%
60 | Ministral 3B | 12.4% | $0.0000 | 2.5s | 0%
61 | Ministral 3 14B | 13.6% | $0.0002 | 6.4s | 0%
62 | Qwen 3.5 Plus (2026-02-15) | 18.9% | $0.0013 | 22.1s | 0%
63 | ByteDance Seed 1.6 Flash | 11.4% | $0.0006 | 12.2s | 0%
64 | Llama 3.1 Nemotron 70B | 11.8% | $0.0002 | 15.7s | 0%
65 | Cohere Command R+ (Aug. 2024) | 13.4% | $0.0054 | 12.4s | 0%
66 | Gemma 3 4B | 6.8% | $0.0000 | 5.5s | 0%
67 | Mistral NeMO | 7.0% | $0.0001 | 6.6s | 0%
68 | Writer: Palmyra X5 | 10.9% | $0.0038 | 12.6s | 0%
69 | Gemma 3 27B | 7.3% | $0.0001 | 15.7s | 0%
70 | Rocinante 12B | 10.5% | $0.0003 | 25.1s | 0%
71 | Mistral Small Creative | 2.7% | $0.0002 | 3.8s | 0%
72 | Mistral Large 2 | 3.9% | $0.0031 | 13.0s | 0%
73 | Z.AI GLM 4.7 Flash | 20.9% | $0.0014 | 1.1m | 0%
74 | Z.AI GLM 4.7 | 43.6% | $0.0069 | 2.4m | 4%
75 | ByteDance Seed 1.6 | 25.6% | $0.0073 | 1.4m | 0%
76 | WizardLM 2 8x22b | 0.0% | $0.0007 | 22.8s | 0%
77 | Z.AI GLM 4.6 | 15.7% | $0.0044 | 1.3m | 0%
78 | Z.AI GLM 5 | 24.2% | $0.010 | 1.8m | 0%
79 | MoonshotAI: Kimi K2.5 | 45.0% | $0.024 | 3.2m | 5%
80 | Qwen 3.5 397B A17B | 63.3% | $0.039 | 4.9m | 7%

Individual Scenarios

dialogue-200

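Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.9%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 99.8%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 99.8%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.6%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 96 | 99.4%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 90 | 98.9%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 86 | 98.6%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 96 | 90 | 86 | 97.1%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 99 | 98 | 94 | 90 | 90 | 86 | 95.7%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 99 | 86 | 68 | 95.2%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 99 | 98 | 98 | 90 | 81 | 81 | 94.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 99 | 99 | 99 | 99 | 96 | 94 | 86 | 86 | 86 | 94.5%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 98 | 96 | 52 | 94.4%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 94 | 75 | 68 | 93.4%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 99 | 96 | 96 | 94 | 94 | 94 | 60 | 93.2%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 81 | 52 | 93.2%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 96 | 96 | 90 | 75 | 68 | 92.5%
Claude Haiku 4.5 | 100 | 100 | 100 | 99 | 99 | 98 | 98 | 98 | 75 | 27 | 89.3%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 98 | 86 | 0 | 88.2%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 96 | 90 | 75 | 68 | 20 | 84.9%
GPT-4.1 Nano | 100 | 100 | 99 | 99 | 96 | 90 | 86 | 68 | 52 | 43 | 83.3%
Llama 3.1 70B | 100 | 100 | 99 | 98 | 94 | 86 | 81 | 75 | 68 | 27 | 82.8%
Grok 4 Fast | 100 | 98 | 98 | 94 | 94 | 81 | 81 | 75 | 43 | 35 | 79.9%
Claude 3.5 Haiku | 99 | 90 | 86 | 81 | 75 | 75 | 75 | 75 | 68 | 52 | 77.8%
Claude 3 Haiku | 99 | 99 | 98 | 98 | 98 | 94 | 90 | 52 | 14 | 0 | 74.1%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 94 | 90 | 90 | 43 | 14 | 0 | 73.3%
GPT-4.1 Mini | 100 | 100 | 100 | 94 | 94 | 90 | 86 | 27 | 27 | 0 | 71.8%
Gemini 2.5 Pro | 100 | 100 | 99 | 98 | 96 | 86 | 60 | 43 | 14 | 0 | 69.7%
Z.AI GLM 4.7 | 100 | 100 | 100 | 99 | 98 | 96 | 68 | 27 | 0 | 0 | 68.8%
Arcee AI: Trinity Large (Preview) | 99 | 99 | 99 | 98 | 96 | 86 | 68 | 27 | 14 | 0 | 68.7%
DeepSeek V3 (2025-03-24) | 100 | 100 | 99 | 99 | 86 | 81 | 75 | 43 | 0 | 0 | 68.4%
DeepSeek V3.1 | 100 | 99 | 94 | 86 | 81 | 75 | 68 | 60 | 14 | 6 | 68.4%
Gemma 3 12B | 100 | 98 | 96 | 96 | 68 | 60 | 52 | 20 | 20 | 20 | 63.0%
Gemini 2.5 Flash | 100 | 100 | 99 | 90 | 68 | 60 | 60 | 52 | 0 | 0 | 63.0%
Llama 3.1 8B | 100 | 100 | 99 | 99 | 98 | 81 | 35 | 2 | 0 | 0 | 61.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 86 | 81 | 68 | 68 | 52 | 14 | 1 | 0 | 57.1%
GPT-4o, May 13th (temp=1) | 100 | 100 | 99 | 96 | 81 | 81 | 6 | 0 | 0 | 0 | 56.4%
Z.AI GLM 4.5 | 99 | 96 | 96 | 96 | 75 | 60 | 35 | 0 | 0 | 0 | 55.7%
Hermes 3 405B | 100 | 100 | 90 | 81 | 81 | 68 | 20 | 10 | 0 | 0 | 55.1%
Claude 3.5 Sonnet | 100 | 99 | 90 | 75 | 52 | 43 | 43 | 10 | 10 | 0 | 52.2%
DeepSeek V3.2 | 100 | 100 | 96 | 96 | 75 | 43 | 0 | 0 | 0 | 0 | 51.1%
Ministral 3 3B | 100 | 100 | 99 | 98 | 96 | 10 | 0 | 0 | 0 | 0 | 50.2%
Gemini 2.5 Flash Lite | 100 | 100 | 99 | 99 | 96 | 0 | 0 | 0 | 0 | 0 | 49.4%
DeepSeek V3 (2024-12-26) | 100 | 99 | 99 | 94 | 52 | 10 | 4 | 0 | 0 | 0 | 45.7%
Mistral Large | 100 | 100 | 98 | 81 | 43 | 27 | 0 | 0 | 0 | 0 | 44.9%
DeepSeek-V2 Chat | 99 | 98 | 94 | 52 | 52 | 35 | 10 | 0 | 0 | 0 | 43.9%
GPT-4o, May 13th (temp=0) | 99 | 98 | 98 | 81 | 43 | 14 | 0 | 0 | 0 | 0 | 43.4%
Qwen 2.5 72B | 100 | 100 | 90 | 86 | 43 | 0 | 0 | 0 | 0 | 0 | 42.1%
Hermes 3 70B | 100 | 98 | 98 | 75 | 14 | 2 | 0 | 0 | 0 | 0 | 38.7%
Mistral Small 3.2 24B | 99 | 86 | 81 | 52 | 35 | 14 | 2 | 0 | 0 | 0 | 37.0%
Mistral Large 3 | 100 | 100 | 90 | 75 | 0 | 0 | 0 | 0 | 0 | 0 | 36.5%
Claude 3.7 Sonnet | 75 | 75 | 68 | 43 | 14 | 14 | 14 | 10 | 10 | 0 | 32.4%
Llama 3.1 Nemotron 70B | 100 | 94 | 68 | 35 | 10 | 0 | 0 | 0 | 0 | 0 | 30.6%
Z.AI GLM 5 | 100 | 75 | 68 | 52 | 10 | 0 | 0 | 0 | 0 | 0 | 30.5%
Gemma 3 4B | 100 | 99 | 60 | 20 | 6 | 4 | 1 | 0 | 0 | 0 | 29.0%
Mistral Medium 3.1 | 96 | 75 | 68 | 35 | 4 | 2 | 0 | 0 | 0 | 0 | 28.0%
ByteDance Seed 1.6 | 99 | 98 | 68 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 26.8%
Z.AI GLM 4.6 | 98 | 90 | 52 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 26.1%
Gemma 3 27B | 94 | 90 | 43 | 14 | 2 | 0 | 0 | 0 | 0 | 0 | 24.4%
Ministral 3 8B | 100 | 96 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 23.1%
Z.AI GLM 4.7 Flash | 100 | 94 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 22.9%
Ministral 3B | 90 | 35 | 35 | 35 | 6 | 0 | 0 | 0 | 0 | 0 | 20.2%
Mistral NeMO | 96 | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16.5%
ByteDance Seed 1.6 Flash | 81 | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14.9%
Rocinante 12B | 98 | 43 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14.1%
Ministral 8B | 99 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13.4%
Mistral Large 2 | 43 | 35 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8.9%
Ministral 3 14B | 52 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.6%
Writer: Palmyra X5 | 52 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.4%
Cohere Command R+ (Aug. 2024) | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.1%
Mistral Small Creative | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%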
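
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99.9%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 99.8%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.6%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 96 | 99.4%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 96 | 99.4%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 81 | 97.9%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 75 | 97.2%
Claude Opus 4.5 | 100 | 100 | 99 | 99 | 99 | 99 | 98 | 96 | 90 | 90 | 97.1%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 96 | 96 | 96 | 96 | 90 | 75 | 95.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 99 | 98 | 98 | 90 | 43 | 92.8%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 27 | 92.5%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 98 | 14 | 91.1%
Claude Sonnet 4.6 | 99 | 99 | 99 | 96 | 96 | 94 | 90 | 86 | 86 | 60 | 90.6%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 99 | 99 | 96 | 86 | 86 | 68 | 60 | 89.5%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 96 | 86 | 0 | 87.8%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 99 | 99 | 96 | 94 | 68 | 60 | 43 | 85.9%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 98 | 94 | 86 | 86 | 86 | 75 | 20 | 84.5%
GPT-4.1 Nano | 100 | 100 | 100 | 99 | 99 | 99 | 94 | 90 | 60 | 1 | 84.2%
Claude 3 Haiku | 100 | 100 | 99 | 99 | 81 | 81 | 75 | 75 | 43 | 27 | 78.1%
GPT-4.1 Mini | 100 | 99 | 98 | 94 | 90 | 86 | 81 | 68 | 35 | 0 | 75.1%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 99 | 96 | 14 | 1 | 1 | 71.1%
Z.AI GLM 4.7 | 100 | 100 | 100 | 98 | 98 | 94 | 90 | 20 | 10 | 0 | 70.9%
MoonshotAI: Kimi K2.5 | 100 | 99 | 86 | 86 | 86 | 75 | 68 | 52 | 43 | 0 | 69.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 99 | 96 | 94 | 81 | 60 | 43 | 20 | 0 | 69.4%
Claude Opus 4 | 100 | 99 | 99 | 96 | 94 | 90 | 68 | 35 | 6 | 2 | 68.9%
Llama 3.1 70B | 100 | 99 | 99 | 98 | 90 | 86 | 43 | 35 | 27 | 6 | 68.3%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 99 | 75 | 2 | 0 | 0 | 67.7%
Gemini 2.5 Pro | 96 | 96 | 94 | 90 | 90 | 86 | 75 | 35 | 2 | 0 | 66.5%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 99 | 98 | 90 | 75 | 0 | 0 | 0 | 66.3%
Llama 3.1 8B | 100 | 100 | 99 | 98 | 96 | 86 | 68 | 6 | 0 | 0 | 65.3%
Grok 4 | 100 | 100 | 100 | 100 | 99 | 60 | 52 | 35 | 6 | 0 | 65.2%
Hermes 3 405B | 94 | 81 | 81 | 81 | 75 | 68 | 52 | 52 | 52 | 4 | 64.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 99 | 98 | 68 | 52 | 43 | 43 | 27 | 0 | 63.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 99 | 94 | 81 | 68 | 68 | 0 | 0 | 0 | 61.1%
Claude Haiku 4.5 | 98 | 96 | 90 | 81 | 81 | 60 | 52 | 35 | 14 | 0 | 60.9%
ByteDance Seed 1.6 | 100 | 100 | 99 | 99 | 99 | 98 | 10 | 0 | 0 | 0 | 60.4%
DeepSeek V3.1 | 100 | 100 | 99 | 94 | 81 | 68 | 43 | 2 | 0 | 0 | 58.7%
Z.AI GLM 4.6 | 100 | 100 | 99 | 99 | 96 | 60 | 20 | 2 | 0 | 0 | 57.7%
Mistral Large 3 | 100 | 99 | 96 | 86 | 75 | 60 | 35 | 10 | 0 | 0 | 56.2%
DeepSeek V3.2 | 96 | 86 | 81 | 81 | 52 | 43 | 35 | 27 | 20 | 10 | 53.2%
Mistral Small 3.2 24B | 100 | 100 | 100 | 90 | 52 | 52 | 0 | 0 | 0 | 0 | 49.4%
Qwen 2.5 72B | 100 | 99 | 99 | 98 | 52 | 20 | 14 | 10 | 2 | 0 | 49.4%
Claude 3.5 Haiku | 100 | 96 | 86 | 75 | 75 | 14 | 14 | 14 | 6 | 0 | 48.2%
Gemini 2.5 Flash Lite | 100 | 98 | 94 | 86 | 81 | 0 | 0 | 0 | 0 | 0 | 45.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 94 | 27 | 0 | 0 | 0 | 0 | 0 | 42.1%
Claude 3.7 Sonnet | 90 | 81 | 68 | 35 | 35 | 27 | 10 | 6 | 4 | 0 | 35.7%
Grok 4 Fast | 96 | 96 | 60 | 35 | 27 | 14 | 10 | 0 | 0 | 0 | 33.8%
Hermes 3 70B | 100 | 100 | 43 | 43 | 27 | 20 | 2 | 0 | 0 | 0 | 33.6%
Claude 3.5 Sonnet | 86 | 75 | 68 | 43 | 35 | 14 | 10 | 4 | 0 | 0 | 33.6%
Writer: Palmyra X5 | 100 | 94 | 75 | 27 | 1 | 0 | 0 | 0 | 0 | 0 | 29.7%
Gemini 2.5 Flash | 100 | 100 | 43 | 35 | 2 | 2 | 0 | 0 | 0 | 0 | 28.2%
Qwen 3.5 Plus (2026-02-15) | 100 | 75 | 43 | 27 | 20 | 4 | 0 | 0 | 0 | 0 | 27.0%
Ministral 3B | 96 | 81 | 52 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 25.6%
Claude Sonnet 4.5 | 90 | 43 | 43 | 35 | 20 | 10 | 6 | 4 | 4 | 0 | 25.6%
ByteDance Seed 1.6 Flash | 100 | 100 | 43 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 24.7%
Z.AI GLM 5 | 75 | 68 | 43 | 14 | 6 | 0 | 0 | 0 | 0 | 0 | 20.7%
Ministral 8B | 98 | 96 | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 20.1%
Claude Sonnet 4 | 68 | 35 | 27 | 27 | 20 | 14 | 1 | 0 | 0 | 0 | 19.4%
Ministral 3 14B | 94 | 86 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18.0%
Z.AI GLM 4.5 | 68 | 35 | 20 | 14 | 6 | 0 | 0 | 0 | 0 | 0 | 14.4%
Llama 3.1 Nemotron 70B | 99 | 27 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13.6%
Gemma 3 12B | 98 | 14 | 6 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 12.1%
Arcee AI: Trinity Mini | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Rocinante 12B | 94 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.4%
Cohere Command R+ (Aug. 2024) | 68 | 10 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8.1%
Mistral NeMO | 75 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.5%
Mistral Small Creative | 60 | 6 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.0%
Ministral 3 3B | 20 | 14 | 14 | 10 | 2 | 0 | 0 | 0 | 0 | 0 | 6.1%
Gemma 3 27B | 27 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.9%
Ministral 3 8B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1%
Mistral Large 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%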
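
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 98 | 99.7%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 99 | 98 | 99.5%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 94 | 99.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 99 | 96 | 99.1%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 90 | 99.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 99 | 99 | 99 | 99 | 98 | 81 | 97.4%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 98 | 96 | 81 | 97.3%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 81 | 97.2%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 96 | 52 | 94.6%
GPT-4.1 | 100 | 100 | 99 | 99 | 99 | 96 | 96 | 96 | 68 | 68 | 92.1%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 99 | 99 | 96 | 96 | 94 | 86 | 75 | 60 | 90.5%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 94 | 90 | 14 | 89.8%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 96 | 90 | 4 | 88.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 68 | 0 | 86.6%
Claude Opus 4 | 99 | 99 | 99 | 99 | 98 | 98 | 90 | 86 | 68 | 6 | 84.2%
Claude Opus 4.5 | 99 | 98 | 98 | 98 | 96 | 96 | 90 | 75 | 35 | 10 | 79.4%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 94 | 0 | 0 | 79.4%
Gemini 3 Flash (Preview) | 100 | 100 | 99 | 94 | 90 | 86 | 81 | 68 | 27 | 27 | 77.4%
Grok 4.1 Fast | 100 | 100 | 100 | 99 | 99 | 98 | 98 | 35 | 27 | 6 | 76.1%
Z.AI GLM 4.7 | 100 | 100 | 99 | 94 | 90 | 86 | 81 | 35 | 0 | 0 | 68.6%
o4 Mini | 100 | 100 | 100 | 100 | 96 | 68 | 52 | 43 | 10 | 0 | 66.9%
Grok 4 | 100 | 100 | 100 | 100 | 98 | 94 | 35 | 35 | 2 | 0 | 66.3%
Llama 3.1 70B | 100 | 100 | 90 | 90 | 86 | 81 | 52 | 52 | 6 | 0 | 65.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 94 | 94 | 90 | 75 | 68 | 20 | 6 | 0 | 64.7%
Claude Sonnet 4.6 | 99 | 86 | 75 | 68 | 68 | 60 | 52 | 52 | 35 | 35 | 63.1%
DeepSeek V3.2 | 100 | 99 | 99 | 98 | 96 | 86 | 43 | 0 | 0 | 0 | 62.2%
Arcee AI: Trinity Large (Preview) | 100 | 98 | 90 | 86 | 86 | 68 | 52 | 20 | 20 | 0 | 62.1%
Mistral Medium 3.1 | 100 | 99 | 96 | 81 | 75 | 75 | 60 | 20 | 10 | 0 | 61.6%
Ministral 8B | 100 | 100 | 100 | 96 | 90 | 75 | 35 | 2 | 0 | 0 | 59.9%
Llama 3.1 8B | 100 | 96 | 94 | 90 | 81 | 52 | 43 | 35 | 0 | 0 | 59.2%
DeepSeek V3.1 | 100 | 100 | 99 | 81 | 75 | 68 | 43 | 6 | 0 | 0 | 57.3%
Claude 3.5 Sonnet | 100 | 94 | 86 | 75 | 75 | 52 | 43 | 35 | 4 | 0 | 56.4%
GPT-4.1 Nano | 100 | 99 | 94 | 86 | 81 | 43 | 27 | 27 | 4 | 0 | 56.2%
GPT-4o, May 13th (temp=1) | 100 | 100 | 96 | 68 | 68 | 52 | 52 | 10 | 10 | 0 | 55.5%
Hermes 3 405B | 100 | 90 | 86 | 81 | 68 | 52 | 27 | 20 | 14 | 0 | 54.0%
Mistral Large | 100 | 100 | 96 | 81 | 81 | 75 | 1 | 0 | 0 | 0 | 53.4%
Hermes 3 70B | 100 | 100 | 96 | 90 | 43 | 43 | 27 | 27 | 4 | 0 | 53.1%
Gemini 2.5 Flash Lite | 90 | 75 | 75 | 75 | 68 | 52 | 35 | 35 | 0 | 0 | 50.6%
GPT-4.1 Mini | 100 | 100 | 99 | 98 | 52 | 43 | 10 | 0 | 0 | 0 | 50.1%
Z.AI GLM 5 | 100 | 100 | 100 | 81 | 60 | 43 | 10 | 4 | 1 | 0 | 49.9%
Claude Haiku 4.5 | 100 | 96 | 86 | 75 | 75 | 35 | 20 | 10 | 1 | 0 | 49.9%
ByteDance Seed 1.6 | 99 | 96 | 94 | 81 | 75 | 27 | 14 | 0 | 0 | 0 | 48.7%
DeepSeek V3 (2025-03-24) | 100 | 100 | 98 | 98 | 60 | 10 | 0 | 0 | 0 | 0 | 46.6%
Claude 3 Haiku | 100 | 99 | 94 | 75 | 43 | 35 | 14 | 0 | 0 | 0 | 46.0%
DeepSeek-V2 Chat | 100 | 86 | 86 | 75 | 68 | 43 | 0 | 0 | 0 | 0 | 45.9%
Qwen 2.5 72B | 100 | 100 | 99 | 99 | 43 | 6 | 1 | 0 | 0 | 0 | 44.9%
DeepSeek V3 (2024-12-26) | 100 | 100 | 96 | 68 | 60 | 10 | 10 | 4 | 0 | 0 | 44.7%
Gemini 2.5 Flash | 100 | 99 | 96 | 75 | 52 | 0 | 0 | 0 | 0 | 0 | 42.2%
Gemini 3 Pro (Preview) | 100 | 98 | 94 | 86 | 10 | 10 | 6 | 2 | 0 | 0 | 40.5%
Grok 4 Fast | 100 | 96 | 86 | 68 | 20 | 14 | 4 | 2 | 0 | 0 | 39.1%
Ministral 3 14B | 100 | 90 | 86 | 52 | 27 | 10 | 0 | 0 | 0 | 0 | 36.7%
Mistral Small 3.2 24B | 99 | 98 | 81 | 60 | 10 | 0 | 0 | 0 | 0 | 0 | 34.8%
Claude 3.7 Sonnet | 90 | 68 | 35 | 35 | 27 | 20 | 10 | 4 | 4 | 0 | 29.3%
Z.AI GLM 4.7 Flash | 100 | 90 | 86 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 28.7%
Qwen 3.5 Plus (2026-02-15) | 68 | 60 | 60 | 52 | 27 | 14 | 2 | 1 | 0 | 0 | 28.5%
Ministral 3 3B | 96 | 94 | 35 | 14 | 2 | 0 | 0 | 0 | 0 | 0 | 24.1%
Gemini 2.5 Pro | 86 | 81 | 27 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 22.2%
Mistral Large 3 | 98 | 52 | 27 | 27 | 1 | 0 | 0 | 0 | 0 | 0 | 20.5%
Llama 3.1 Nemotron 70B | 99 | 68 | 10 | 6 | 2 | 2 | 2 | 0 | 0 | 0 | 19.0%
ByteDance Seed 1.6 Flash | 100 | 81 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18.7%
Writer: Palmyra X5 | 86 | 86 | 6 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 18.1%
Z.AI GLM 4.5 | 86 | 52 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.9%
Ministral 3B | 75 | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13.5%
Claude Sonnet 4.5 | 75 | 35 | 14 | 6 | 2 | 0 | 0 | 0 | 0 | 0 | 13.3%
Gemma 3 12B | 86 | 27 | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 12.1%
Claude Sonnet 4 | 52 | 43 | 6 | 6 | 4 | 4 | 2 | 1 | 0 | 0 | 11.8%
Ministral 3 8B | 100 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.2%
Mistral Small Creative | 43 | 35 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.4%
Claude 3.5 Haiku | 35 | 27 | 10 | 4 | 2 | 1 | 1 | 0 | 0 | 0 | 8.0%
Z.AI GLM 4.6 | 60 | 4 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 6.8%
Rocinante 12B | 43 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.4%
Cohere Command R+ (Aug. 2024) | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.0%
Mistral Large 2 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.6%
Arcee AI: Trinity Mini | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4%
Gemma 3 27B | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%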

dialogue-500

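Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 96 | 90 | 98.3%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 20 | 92.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 99 | 99 | 96 | 94 | 90 | 20 | 89.9%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 90 | 81 | 0 | 87.1%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 96 | 94 | 90 | 90 | 0 | 87.1%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 98 | 60 | 0 | 85.6%
GPT-5 | 100 | 100 | 100 | 100 | 99 | 99 | 98 | 96 | 0 | 0 | 79.2%
Claude 3.7 Sonnet | 100 | 100 | 99 | 86 | 86 | 81 | 75 | 35 | 0 | 0 | 66.3%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 90 | 20 | 0 | 0 | 0 | 61.1%
o4 Mini | 100 | 100 | 100 | 100 | 98 | 52 | 35 | 14 | 6 | 0 | 60.5%
GPT-4o, Aug. 6th (temp=0) | 99 | 86 | 86 | 86 | 86 | 81 | 68 | 10 | 0 | 0 | 60.4%
Claude Opus 4 | 100 | 100 | 99 | 98 | 68 | 68 | 43 | 10 | 0 | 0 | 58.6%
Grok 4 | 96 | 81 | 81 | 75 | 68 | 60 | 60 | 43 | 4 | 0 | 56.9%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 98 | 68 | 60 | 10 | 0 | 0 | 0 | 53.6%
Gemma 3 12B | 100 | 100 | 98 | 98 | 68 | 27 | 1 | 0 | 0 | 0 | 49.1%
GPT-4o Mini (temp=0) | 100 | 100 | 94 | 43 | 43 | 35 | 27 | 27 | 20 | 0 | 49.0%
Claude Haiku 4.5 | 100 | 99 | 81 | 75 | 68 | 6 | 0 | 0 | 0 | 0 | 42.9%
Hermes 3 405B | 100 | 99 | 90 | 75 | 27 | 20 | 0 | 0 | 0 | 0 | 41.3%
Grok 4 Fast | 100 | 99 | 96 | 52 | 43 | 20 | 0 | 0 | 0 | 0 | 41.0%
Qwen 3.5 397B A17B | 100 | 99 | 98 | 96 | 14 | 0 | 0 | 0 | 0 | 0 | 40.7%
Claude Opus 4.6 | 99 | 99 | 94 | 35 | 27 | 27 | 0 | 0 | 0 | 0 | 38.1%
DeepSeek V3 (2024-12-26) | 100 | 94 | 90 | 81 | 0 | 0 | 0 | 0 | 0 | 0 | 36.5%
Arcee AI: Trinity Mini | 100 | 96 | 90 | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 35.4%
Mistral Medium 3.1 | 100 | 100 | 81 | 68 | 4 | 0 | 0 | 0 | 0 | 0 | 35.3%
Gemini 2.5 Flash Lite | 100 | 94 | 90 | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 34.4%
GPT-5 Nano | 100 | 99 | 99 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 30.8%
Mistral Large 3 | 98 | 81 | 52 | 52 | 20 | 0 | 0 | 0 | 0 | 0 | 30.3%
Gemini 2.5 Pro | 99 | 94 | 81 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 30.2%
Z.AI GLM 4.7 Flash | 100 | 100 | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 29.0%
Grok 4.1 Fast | 100 | 86 | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25.4%
Z.AI GLM 5 | 99 | 90 | 20 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 22.0%
Z.AI GLM 4.7 | 96 | 86 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21.1%
MoonshotAI: Kimi K2.5 | 94 | 81 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21.0%
Ministral 3 8B | 100 | 100 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21.0%
GPT-5.1 | 100 | 94 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19.4%
Gemini 3 Pro (Preview) | 100 | 86 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18.6%
Claude Sonnet 4.5 | 90 | 52 | 14 | 14 | 10 | 0 | 0 | 0 | 0 | 0 | 18.1%
Gemma 3 27B | 100 | 60 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16.1%
DeepSeek-V2 Chat | 99 | 52 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 15.6%
Cohere Command R+ (Aug. 2024) | 81 | 60 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.1%
Claude 3 Haiku | 100 | 43 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14.6%
Gemma 3 4B | 90 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11.8%
GPT-4o, Aug. 6th (temp=1) | 86 | 27 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11.8%
Ministral 3 3B | 100 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11.4%
Claude Sonnet 4 | 86 | 10 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 10.7%
Gemini 2.5 Flash | 100 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.1%
Llama 3.1 8B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
GPT-4o Mini (temp=1) | 81 | 14 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.9%
DeepSeek V3 (2025-03-24) | 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.8%
DeepSeek V3.1 | 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.8%
Qwen 2.5 72B | 94 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.7%
Ministral 8B | 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.6%
GPT-5.2 | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.0%
GPT-4.1 Mini | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.0%
Ministral 3B | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.0%
DeepSeek V3.2 | 86 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.6%
GPT-4.1 | 81 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.2%
Llama 3.1 Nemotron 70B | 75 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.5%
Mistral Small 3.2 24B | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.9%
ByteDance Seed 1.6 | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.8%
Mistral Large | 43 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.3%
Hermes 3 70B | 52 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.2%
Mistral Large 2 | 52 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.2%
Llama 3.1 70B | 35 | 14 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.1%
Z.AI GLM 4.6 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.5%
Claude 3.5 Sonnet | 20 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.0%
Z.AI GLM 4.5 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.7%
Ministral 3 14B | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.0%
ByteDance Seed 1.6 Flash | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0%
Mistral Small Creative | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.6%
Rocinante 12B | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Writer: Palmyra X5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Large (Preview) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%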
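
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.8%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 94 | 99.3%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 99 | 68 | 96.5%
Claude Sonnet 4.6 | 100 | 98 | 90 | 90 | 90 | 90 | 86 | 81 | 75 | 0 | 80.3%
o4 Mini | 100 | 100 | 100 | 99 | 99 | 98 | 86 | 81 | 0 | 0 | 76.4%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 14 | 0 | 0 | 71.4%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 6 | 0 | 0 | 70.6%
Claude 3.7 Sonnet | 100 | 99 | 98 | 96 | 81 | 60 | 52 | 4 | 2 | 0 | 59.2%
GPT-4o Mini (temp=0) | 100 | 94 | 94 | 94 | 81 | 68 | 14 | 4 | 0 | 0 | 54.8%
Claude Opus 4 | 100 | 99 | 94 | 94 | 86 | 35 | 20 | 14 | 1 | 0 | 54.3%
Grok 4 Fast | 100 | 100 | 100 | 100 | 99 | 20 | 4 | 0 | 0 | 0 | 52.3%
Claude Opus 4.5 | 100 | 100 | 99 | 81 | 43 | 35 | 35 | 20 | 4 | 0 | 51.7%
Grok 4 | 100 | 94 | 86 | 60 | 43 | 43 | 2 | 0 | 0 | 0 | 42.9%
Arcee AI: Trinity Mini | 99 | 99 | 96 | 60 | 52 | 10 | 0 | 0 | 0 | 0 | 41.6%
Gemini 2.5 Flash Lite | 99 | 99 | 86 | 81 | 20 | 14 | 0 | 0 | 0 | 0 | 40.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 94 | 2 | 0 | 0 | 0 | 0 | 0 | 39.6%
GPT-4.1 | 100 | 96 | 68 | 52 | 43 | 6 | 0 | 0 | 0 | 0 | 36.6%
Qwen 3.5 397B A17B | 100 | 94 | 86 | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 34.8%
Claude Haiku 4.5 | 99 | 94 | 90 | 52 | 10 | 0 | 0 | 0 | 0 | 0 | 34.5%
Gemini 2.5 Flash | 99 | 90 | 86 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 30.3%
Cohere Command R+ (Aug. 2024) | 96 | 94 | 90 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 28.1%
GPT-5 Nano | 100 | 100 | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 26.8%
GPT-5.1 | 99 | 94 | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 26.0%
GPT-4o Mini (temp=1) | 99 | 81 | 68 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 81 | 52 | 14 | 1 | 0 | 0 | 0 | 0 | 0 | 24.9%
Gemini 3 Flash (Preview) | 96 | 86 | 43 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 22.6%
Llama 3.1 8B | 68 | 60 | 43 | 43 | 0 | 0 | 0 | 0 | 0 | 0 | 21.5%
Mistral Medium 3.1 | 100 | 100 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21.4%
Mistral Large | 100 | 99 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.3%
Gemini 2.5 Pro | 100 | 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19.6%
Gemini 3 Pro (Preview) | 96 | 81 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18.7%
Hermes 3 70B | 75 | 68 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.8%
Z.AI GLM 4.7 | 90 | 43 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16.9%
Llama 3.1 70B | 100 | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16.8%
DeepSeek V3 (2024-12-26) | 86 | 35 | 20 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 15.6%
Rocinante 12B | 100 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12.7%
Mistral Large 3 | 98 | 20 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12.4%
MoonshotAI: Kimi K2.5 | 100 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12.1%
Grok 4.1 Fast | 100 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.7%
Gemma 3 12B | 60 | 43 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.4%
DeepSeek V3 (2025-03-24) | 60 | 43 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.4%
GPT-5.2 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Ministral 3 3B | 99 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.9%
DeepSeek-V2 Chat | 90 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.7%
Hermes 3 405B | 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.6%
DeepSeek V3.1 | 60 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.5%
ByteDance Seed 1.6 Flash | 94 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.4%
Arcee AI: Trinity Large (Preview) | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.0%
Ministral 3 8B | 86 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.7%
Claude Sonnet 4 | 86 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.6%
Claude Sonnet 4.5 | 75 | 6 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.5%
GPT-4.1 Mini | 75 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.1%
Claude 3 Haiku | 75 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.7%
Z.AI GLM 5 | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.8%
Writer: Palmyra X5 | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.0%
Mistral Small 3.2 24B | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.0%
Z.AI GLM 4.5 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.5%
Z.AI GLM 4.7 Flash | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.7%
ByteDance Seed 1.6 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.0%
Mistral Large 2 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.0%
DeepSeek V3.2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1%
GPT-4o, Aug. 6th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 2.5 72B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 8B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 14B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Llama 3.1 Nemotron 70B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small Creative | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 4.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%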
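
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 99 | 98 | 0 | 89.4%
Claude 3.7 Sonnet | 100 | 100 | 100 | 99 | 98 | 98 | 90 | 81 | 52 | 43 | 86.1%
Claude Opus 4.6 | 100 | 100 | 100 | 99 | 90 | 81 | 60 | 52 | 35 | 35 | 75.3%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 52 | 1 | 0 | 75.0%
GPT-5 | 100 | 100 | 99 | 98 | 98 | 96 | 90 | 0 | 0 | 0 | 68.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 98 | 96 | 68 | 52 | 52 | 52 | 52 | 0 | 66.9%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 98 | 81 | 75 | 68 | 27 | 0 | 0 | 64.9%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 68 | 27 | 14 | 0 | 0 | 60.9%
Gemini 3 Flash (Preview) | 100 | 100 | 98 | 94 | 75 | 35 | 35 | 27 | 0 | 0 | 56.4%
Claude Opus 4.5 | 100 | 100 | 96 | 90 | 81 | 27 | 20 | 20 | 10 | 6 | 55.1%
Claude Sonnet 4.6 | 100 | 81 | 75 | 68 | 60 | 60 | 43 | 35 | 20 | 6 | 55.0%
Claude Opus 4 | 96 | 86 | 81 | 75 | 75 | 35 | 35 | 14 | 1 | 0 | 49.9%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 98 | 98 | 86 | 4 | 2 | 0 | 0 | 0 | 48.7%
Grok 4 Fast | 100 | 100 | 98 | 90 | 52 | 14 | 10 | 2 | 0 | 0 | 46.6%
Gemini 2.5 Flash Lite | 94 | 90 | 86 | 43 | 35 | 20 | 0 | 0 | 0 | 0 | 36.9%
Grok 4 | 90 | 86 | 86 | 81 | 0 | 0 | 0 | 0 | 0 | 0 | 34.5%
Mistral Large 3 | 100 | 99 | 43 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 27.7%
GPT-4.1 | 100 | 86 | 75 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 26.3%
Stealth: Aurora Alpha | 100 | 81 | 81 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 26.3%
Cohere Command R+ (Aug. 2024) | 100 | 96 | 20 | 20 | 14 | 0 | 0 | 0 | 0 | 0 | 25.1%
Claude 3.5 Sonnet | 98 | 96 | 52 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 24.6%
o4 Mini | 100 | 98 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21.8%
Mistral Medium 3.1 | 75 | 75 | 60 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 21.4%
GPT-4o, May 13th (temp=1) | 99 | 60 | 52 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21.1%
DeepSeek-V2 Chat | 100 | 100 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21.0%
Rocinante 12B | 100 | 90 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.5%
DeepSeek V3 (2024-12-26) | 75 | 75 | 27 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 19.8%
GPT-4o Mini (temp=1) | 99 | 96 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19.7%
Ministral 3 14B | 100 | 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19.6%
Llama 3.1 8B | 90 | 86 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19.1%
GPT-4.1 Mini | 96 | 94 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19.0%
Mistral NeMO | 100 | 81 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18.1%
Gemma 3 12B | 100 | 43 | 27 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 18.0%
Qwen 2.5 72B | 99 | 43 | 35 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 18.0%
Arcee AI: Trinity Mini | 98 | 81 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.9%
Gemini 2.5 Pro | 90 | 60 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.8%
Z.AI GLM 5 | 81 | 68 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 15.3%
Z.AI GLM 4.7 | 81 | 35 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.1%
Ministral 3 8B | 75 | 52 | 14 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 14.5%
Mistral Large | 68 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.3%
Arcee AI: Trinity Large (Preview) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
GPT-5 Nano | 99 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Qwen 3.5 397B A17B | 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.8%
Ministral 8B | 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.8%
Gemini 3 Pro (Preview) | 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.6%
MoonshotAI: Kimi K2.5 | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.0%
Claude Haiku 4.5 | 43 | 35 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.8%
Grok 4.1 Fast | 86 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.7%
ByteDance Seed 1.6 | 86 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.6%
DeepSeek V3.1 | 86 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.6%
Mistral Small 3.2 24B | 86 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.6%
DeepSeek V3 (2025-03-24) | 75 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.6%
Mistral Large 2 | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.8%
Ministral 3B | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.0%
Gemini 2.5 Flash | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.0%
Writer: Palmyra X5 | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.0%
Hermes 3 70B | 43 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.8%
Llama 3.1 70B | 43 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.3%
Claude 3 Haiku | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.7%
Ministral 3 3B | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.7%
GPT-5.1 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.4%
Qwen 3.5 Plus (2026-02-15) | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.6%
Hermes 3 405B | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2%
Claude Sonnet 4.5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 4.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Llama 3.1 Nemotron 70B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude Sonnet 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 4.7 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 4.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
ByteDance Seed 1.6 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small Creative | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%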
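
The Score column under Overall Performance appears to be the unweighted mean of a model's six scenario averages. The page does not state this; it is an inference from the listed numbers. A minimal Python sketch that checks the assumption against three rows copied from the tables above (`overall_score` is a hypothetical helper name, not part of the benchmark):

```python
from statistics import mean

# Per-scenario averages (percent), copied from the six scenario tables in page order.
scenario_avgs = {
    "Gemini 3.1 Pro (Preview)": [100.0, 100.0, 100.0, 92.0, 100.0, 100.0],
    "GPT-5 Mini": [99.9, 100.0, 99.7, 85.6, 99.8, 89.4],
    "Claude Sonnet 4.6": [99.4, 90.6, 63.1, 98.3, 80.3, 55.0],
}

# Scores as listed in the Overall Performance table.
listed_scores = {
    "Gemini 3.1 Pro (Preview)": 98.7,
    "GPT-5 Mini": 95.7,
    "Claude Sonnet 4.6": 81.1,
}


def overall_score(avgs):
    """Assumed aggregation: plain mean of the scenario averages, to one decimal."""
    return round(mean(avgs), 1)


for model, avgs in scenario_avgs.items():
    print(f"{model}: computed {overall_score(avgs)}%, listed {listed_scores[model]}%")
```

For these three models the computed value matches the listed score. Rank is evidently not ordered by Score alone (rank 17 has the highest score), so this sketch says nothing about how Rank or Stability is derived.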