Dialogue to Total Word Ratio

Test: Dialogue tags
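
The page does not spell out how the ratio is measured. As a rough sketch only, one way to compute a dialogue-to-total word ratio is to count the words inside quotation marks and divide by the total word count. The function name `dialogue_ratio` and the rule that dialogue means text between double quotes are assumptions for illustration, not the benchmark's actual scoring code.

```python
import re

# Assumption: dialogue is any span between straight or curly double quotes.
QUOTED = re.compile('["\u201c]([^"\u201d]*)["\u201d]')


def dialogue_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated words inside quoted spans."""
    total_words = len(text.split())
    if total_words == 0:
        return 0.0
    dialogue_words = sum(len(span.split()) for span in QUOTED.findall(text))
    return dialogue_words / total_words


# Two of the four words are spoken.
print(dialogue_ratio('"Hello there," she said.'))  # 0.5
```

A real scorer would also need rules for nested quotes, single-quote dialogue, and unterminated quotations, which this sketch ignores.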

Avg. Score: 30.3%
Scenarios: 6

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Claude Sonnet 4.6 | 55.4% | $0.0079 | 12.8s | 11%
2 | Gemini 3.1 Pro (Preview) | 99.6% | $0.157 | 2.1m | 95%
3 | GPT-5 Mini | 67.3% | $0.011 | 58.5s | 12%
4 | Claude Sonnet 4 | 44.5% | $0.0077 | 11.4s | 7%
5 | Claude Sonnet 4.5 | 46.4% | $0.0080 | 11.8s | 5%
6 | Claude Opus 4.6 | 49.3% | $0.014 | 15.5s | 6%
7 | Claude Opus 4.5 | 47.8% | $0.014 | 14.3s | 5%
8 | o4 Mini | 61.7% | $0.026 | 1.1m | 13%
9 | GPT-4o, Aug. 6th (temp=0) | 30.6% | $0.0053 | 6.3s | 12%
10 | GPT-5.2 | 55.8% | $0.027 | 38.3s | 10%
11 | Claude 3.5 Haiku | 34.5% | $0.0020 | 7.4s | 3%
12 | Claude Haiku 4.5 | 36.9% | $0.0026 | 6.6s | 0%
13 | GPT-5.1 | 58.2% | $0.030 | 52.9s | 11%
14 | GPT-5 | 73.4% | $0.053 | 1.5m | 20%
15 | Ministral 3 14B | 32.5% | $0.0002 | 6.4s | 0%
16 | Mistral NeMO | 32.5% | $0.0001 | 6.6s | 0%
17 | Gemma 3 12B | 31.1% | $0.0001 | 11.9s | 0%
18 | Gemini 2.5 Pro | 46.6% | $0.029 | 25.0s | 4%
19 | Grok 4.1 Fast | 28.6% | $0.0004 | 10.3s | 0%
20 | ByteDance Seed 1.6 | 49.1% | $0.0073 | 1.4m | 6%
21 | Ministral 3B | 23.1% | $0.0000 | 2.5s | 0%
22 | Llama 3.1 70B | 22.3% | $0.0005 | 5.2s | 0%
23 | Hermes 3 70B | 26.1% | $0.0002 | 20.8s | 0%
24 | ByteDance Seed 1.6 Flash | 23.2% | $0.0006 | 12.2s | 0%
25 | Mistral Medium 3.1 | 23.8% | $0.0012 | 13.1s | 0%
26 | DeepSeek V3 (2024-12-26) | 25.2% | $0.0006 | 19.3s | 0%
27 | Z.AI GLM 4.6 | 43.6% | $0.0044 | 1.3m | 3%
28 | Grok 4 Fast | 19.7% | $0.0004 | 6.8s | 0%
29 | Mistral Small 3.2 24B | 20.3% | $0.0002 | 9.0s | 0%
30 | Qwen 3.5 Plus (2026-02-15) | 25.2% | $0.0013 | 22.1s | 0%
31 | Ministral 3 3B | 17.8% | $0.0001 | 2.0s | 0%
32 | Writer: Palmyra X5 | 23.6% | $0.0038 | 12.6s | 0%
33 | Mistral Small Creative | 18.2% | $0.0002 | 3.8s | 0%
34 | Gemini 2.5 Flash Lite | 17.8% | $0.0002 | 2.6s | 0%
35 | Ministral 8B | 17.9% | $0.0000 | 4.0s | 0%
36 | DeepSeek-V2 Chat | 26.4% | $0.0002 | 29.9s | 0%
37 | GPT-4o, Aug. 6th (temp=1) | 21.7% | $0.0054 | 6.3s | 0%
38 | Arcee AI: Trinity Large (Preview) | 19.9% | $0.0000 | 12.2s | 0%
39 | Ministral 3 8B | 17.0% | $0.0001 | 3.3s | 0%
40 | Z.AI GLM 4.5 | 19.3% | $0.0009 | 8.9s | 0%
41 | Llama 3.1 Nemotron 70B | 20.6% | $0.0002 | 15.7s | 0%
42 | Claude 3.5 Sonnet | 29.0% | $0.0092 | 32.1s | 2%
43 | GPT-4o, May 13th (temp=0) | 23.7% | $0.0090 | 14.3s | 1%
44 | DeepSeek V3 (2025-03-24) | 19.7% | $0.0006 | 17.8s | 0%
45 | Llama 3.1 8B | 14.3% | $0.0001 | 2.0s | 0%
46 | GPT-4o Mini (temp=0) | 16.5% | $0.0003 | 8.5s | 0%
47 | Gemma 3 4B | 15.3% | $0.0000 | 5.5s | 0%
48 | Rocinante 12B | 21.6% | $0.0003 | 25.1s | 0%
49 | Mistral Large 2 | 19.2% | $0.0031 | 13.0s | 0%
50 | Gemini 3 Flash (Preview) | 15.4% | $0.0015 | 5.0s | 0%
51 | DeepSeek V3.1 | 22.4% | $0.0006 | 28.3s | 0%
52 | Claude 3.7 Sonnet | 21.3% | $0.0085 | 11.6s | 0%
53 | Qwen 2.5 72B | 17.1% | $0.0003 | 15.6s | 0%
54 | Claude 3 Haiku | 13.5% | $0.0007 | 4.7s | 0%
55 | Cohere Command R+ (Aug. 2024) | 18.7% | $0.0054 | 12.4s | 0%
56 | o4 Mini High | 64.0% | $0.050 | 1.9m | 16%
57 | GPT-4.1 Nano | 13.0% | $0.0002 | 6.0s | 0%
58 | Hermes 3 405B | 19.5% | $0.0000 | 26.5s | 0%
59 | Mistral Large | 23.6% | $0.015 | 13.3s | 0%
60 | Stealth: Aurora Alpha | 48.9% | (not listed) | 9.2s | 10%
61 | GPT-4.1 Mini | 12.8% | $0.0009 | 6.9s | 0%
62 | WizardLM 2 8x22b | 17.2% | $0.0007 | 22.8s | 0%
63 | GPT-4o, May 13th (temp=1) | 19.6% | $0.0088 | 15.6s | 0%
64 | Z.AI GLM 5 | 47.3% | $0.010 | 1.8m | 3%
65 | Arcee AI: Trinity Mini | 10.8% | $0.0001 | 5.6s | 0%
66 | Mistral Large 3 | 13.2% | $0.0009 | 12.1s | 0%
67 | Gemma 3 27B | 13.9% | $0.0001 | 15.7s | 0%
68 | Gemini 2.5 Flash | 10.0% | $0.0015 | 4.0s | 0%
69 | GPT-4o Mini (temp=1) | 9.5% | $0.0003 | 8.4s | 0%
70 | MoonshotAI: Kimi K2.5 | 66.1% | $0.024 | 3.2m | 18%
71 | DeepSeek V3.2 | 21.3% | $0.0005 | 47.8s | 0%
72 | Grok 4 | 21.3% | $0.012 | 27.3s | 0%
73 | Z.AI GLM 4.7 Flash | 26.9% | $0.0014 | 1.1m | 0%
74 | Z.AI GLM 4.7 | 47.5% | $0.0069 | 2.4m | 4%
75 | GPT-4.1 | 4.3% | $0.0044 | 7.8s | 0%
76 | Gemini 3 Pro (Preview) | 25.4% | $0.031 | 23.6s | 0%
77 | GPT-5 Nano | 32.6% | $0.0040 | 1.6m | 0%
78 | Claude Opus 4 | 25.1% | $0.044 | 25.5s | 0%
79 | Minimax M2.5 | 53.9% | $0.017 | 3.1m | 6%
80 | Qwen 3.5 397B A17B | 56.2% | $0.039 | 4.9m | 6%
Avg. Score: 30.33%
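
The Score column appears to be the unweighted mean of a model's six per-scenario averages. The sketch below checks that reading against two rows, using the Avg values listed for each model in the six scenario tables on this page. The helper name `overall_score` is hypothetical, and the rounding to one decimal is an assumption based on how the scores are displayed.

```python
from statistics import mean

# Per-scenario Avg values copied from the six scenario tables
# (three under dialogue-200, three under dialogue-500).
scenario_avgs = {
    "Gemini 3.1 Pro (Preview)": [100.0, 100.0, 97.9, 100.0, 100.0, 99.9],
    "GPT-5": [100.0, 90.0, 34.6, 87.7, 89.5, 38.5],
}


def overall_score(avgs: list[float]) -> float:
    """Unweighted mean of the scenario averages, rounded to one decimal."""
    return round(mean(avgs), 1)


for model, avgs in scenario_avgs.items():
    print(model, overall_score(avgs))
# Gemini 3.1 Pro (Preview) 99.6
# GPT-5 73.4
```

Both results match the Score values listed for these models in the overall table.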

Individual Scenarios

dialogue-200

Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
GPT-5100100100100100100100100100100100.0%
GPT-5 Mini1001001001001001001001001009999.9%
GPT-5.11001001001001001001001001009999.9%
Z.AI GLM 4.71001001001001001001001001009999.9%
Qwen 3.5 397B A17B1001001001001001001001001009799.7%
GPT-5.2100100100100100100100100999699.4%
o4 Mini High1001001001001001001001001009399.3%
Z.AI GLM 510010010010010010010095948297.1%
Claude Sonnet 41001001009998989492929196.5%
MoonshotAI: Kimi K2.51001001001001001001001001005595.5%
GPT-5 Nano1001001001001001001001001003793.7%
o4 Mini1001001001001001001001001003293.2%
Claude Opus 4.51001001001001001009594796593.2%
ByteDance Seed 1.61001001001001001009898573788.9%
Claude Sonnet 4.510099979695939259524582.9%
Gemini 2.5 Pro10010010010010099999928082.3%
Claude 3.5 Haiku10010010010094937971401379.0%
Stealth: Aurora Alpha1001001001001001001005237178.9%
Claude Sonnet 4.69898968885737360525277.5%
Z.AI GLM 4.7 Flash10010010097959084610072.7%
Claude Opus 410010094929078704034069.7%
Claude Haiku 4.510010010097968955490068.5%
Gemini 3 Pro (Preview)100100100100100373737373768.4%
Claude Opus 4.610010010099999835178065.7%
Z.AI GLM 4.61001001001001009921210064.0%
Gemma 3 12B100100989088827040063.3%
Minimax M2.510010010099523737133054.0%
Gemini 3 Flash (Preview)9996373737373737373749.0%
Qwen 3.5 Plus (2026-02-15)5858574437373737373743.8%
Ministral 3 3B10096846941191000041.9%
Writer: Palmyra X5100989472451000040.9%
DeepSeek V3.2100100998310000038.3%
Claude 3.5 Sonnet100965852131000032.0%
Rocinante 12B1009990000000028.9%
Z.AI GLM 4.51009680000000027.6%
Mistral NeMO1009941000000024.0%
Mistral Large94832015144000023.1%
Ministral 3 14B100450000000014.4%
Arcee AI: Trinity Large (Preview)85460000000013.1%
Grok 4.1 Fast10000000000010.0%
DeepSeek V3 (2024-12-26)4935000000008.4%
Claude 3.7 Sonnet820000000008.2%
GPT-4o, May 13th (temp=1)660000000006.6%
GPT-4o, Aug. 6th (temp=1)458000000005.3%
Hermes 3 70B510000000005.1%
DeepSeek V3 (2025-03-24)190000000002.0%
Llama 3.1 8B50000000000.5%
GPT-4.1 Mini30000000000.3%
GPT-4.111000000000.2%
WizardLM 2 8x22b00000000000.0%
Mistral Small Creative00000000000.0%
Grok 400000000000.0%
Mistral Large 200000000000.0%
Ministral 8B00000000000.0%
GPT-4o, May 13th (temp=0)00000000000.0%
Hermes 3 405B00000000000.0%
ByteDance Seed 1.6 Flash00000000000.0%
Mistral Large 300000000000.0%
Ministral 3B00000000000.0%
Claude 3 Haiku00000000000.0%
Ministral 3 8B00000000000.0%
Grok 4 Fast00000000000.0%
Llama 3.1 Nemotron 70B00000000000.0%
Qwen 2.5 72B00000000000.0%
Gemma 3 27B00000000000.0%
Mistral Medium 3.100000000000.0%
Llama 3.1 70B00000000000.0%
Mistral Small 3.2 24B00000000000.0%
DeepSeek V3.100000000000.0%
Cohere Command R+ (Aug. 2024)00000000000.0%
GPT-4o, Aug. 6th (temp=0)00000000000.0%
DeepSeek-V2 Chat00000000000.0%
Gemma 3 4B00000000000.0%
Gemini 2.5 Flash00000000000.0%
Gemini 2.5 Flash Lite00000000000.0%
GPT-4o Mini (temp=1)00000000000.0%
Arcee AI: Trinity Mini00000000000.0%
GPT-4o Mini (temp=0)00000000000.0%
GPT-4.1 Nano00000000000.0%
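
Each scenario row lists ten run scores (# 1 to # 10) followed by an average. Assuming the ten values are individual runs, apparently listed from highest to lowest, and that Avg is their plain mean, the Ministral 3 3B row of the table above can be checked as follows. Splitting that row's run-together digits into the ten scores shown here is an interpretation, and `run_average` is a hypothetical helper.

```python
# Ministral 3 3B, first dialogue-200 table: ten run scores as read from the row.
runs = [100, 96, 84, 69, 41, 19, 10, 0, 0, 0]


def run_average(scores: list[int]) -> float:
    """Plain mean of the per-run scores."""
    return sum(scores) / len(scores)


print(run_average(runs))  # 41.9
```

The result agrees with the 41.9% shown in the Avg column. Other rows can differ by about a tenth of a point, presumably because the displayed run scores are rounded to whole numbers.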
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
Qwen 3.5 397B A17B1001001001001001001001001009999.9%
MoonshotAI: Kimi K2.5100100100100100100100100986195.8%
Minimax M2.51001001001001001001001001005295.2%
GPT-5100100100100100100100100100090.0%
Gemini 2.5 Pro10010010010099979488714589.2%
GPT-5.110010010010010010010010092089.2%
GPT-5.2100100100100100100996441080.3%
GPT-5 Mini1001001001001001001001000080.0%
Z.AI GLM 4.61001001001009998867737079.7%
Z.AI GLM 51001001001009997878132079.5%
o4 Mini100100100100100100100880078.8%
Z.AI GLM 4.71001001001001009694942078.5%
GPT-4o, May 13th (temp=0)100100100988382626144273.1%
GPT-4o, Aug. 6th (temp=0)100100979079725941413171.1%
o4 Mini High100100100100100999700069.6%
Stealth: Aurora Alpha10010010010099999600069.3%
Ministral 3 14B10010010097967979218168.0%
Claude Sonnet 4.61001001001001009054322067.8%
ByteDance Seed 1.6 Flash10099999997908184167.8%
Cohere Command R+ (Aug. 2024)100100988573716340059.3%
Hermes 3 70B100100948380605091157.8%
Claude Opus 4.61001001001008786110057.5%
Mistral Large10010010099735918140056.3%
Gemini 3 Pro (Preview)1001001001008665000055.1%
ByteDance Seed 1.610097949483404010055.0%
Mistral Small Creative1001009998921000049.1%
Claude Haiku 4.510010094855918000045.7%
Mistral Large 310010091763223400042.6%
Hermes 3 405B989796642828000041.0%
Claude Opus 4.5100939268403000039.5%
Llama 3.1 8B100999764241000038.5%
Claude Sonnet 41009980453811310037.6%
Ministral 3B100927366370000036.8%
DeepSeek V3 (2025-03-24)1008986352621200035.8%
Ministral 3 8B10078686614131120035.2%
Grok 4.1 Fast978762383520700034.7%
Mistral Medium 3.1100987027122000031.0%
Qwen 3.5 Plus (2026-02-15)100100100330000030.5%
Rocinante 12B89805046330000029.7%
DeepSeek V3 (2024-12-26)92855437223000029.3%
GPT-5 Nano10010092000000029.2%
GPT-4o, Aug. 6th (temp=1)1008965100000025.6%
Mistral NeMO10079393121000025.2%
WizardLM 2 8x22b80564141166200024.2%
Z.AI GLM 4.7 Flash8064552800000022.7%
Claude Sonnet 4.59995131110000022.0%
Claude 3.5 Haiku1008724520000021.9%
Grok 4 Fast9271151170000019.5%
Llama 3.1 70B866433320000018.8%
GPT-4o, May 13th (temp=1)997613000000018.8%
Llama 3.1 Nemotron 70B716833320000017.9%
Ministral 3 3B100680000000016.8%
Claude 3 Haiku484641200000013.8%
Mistral Large 298276200000013.3%
Arcee AI: Trinity Large (Preview)80370000000011.7%
Claude 3.5 Sonnet6520171020000011.5%
GPT-4.1 Mini99140000000011.4%
DeepSeek V3.14023840000007.5%
Gemini 2.5 Flash Lite649000000007.2%
Ministral 8B3317300000005.3%
Z.AI GLM 4.5244100000002.9%
Qwen 2.5 72B250000000002.5%
Gemini 3 Flash (Preview)70000000000.7%
Writer: Palmyra X561000000000.7%
Claude 3.7 Sonnet22000000000.3%
Gemma 3 12B21000000000.3%
Mistral Small 3.2 24B10000000000.1%
DeepSeek-V2 Chat00000000000.0%
GPT-4o Mini (temp=1)00000000000.0%
DeepSeek V3.200000000000.0%
Grok 400000000000.0%
Gemini 2.5 Flash00000000000.0%
Claude Opus 400000000000.0%
Arcee AI: Trinity Mini00000000000.0%
Gemma 3 27B00000000000.0%
GPT-4.100000000000.0%
GPT-4.1 Nano00000000000.0%
GPT-4o Mini (temp=0)00000000000.0%
Gemma 3 4B00000000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Claude 3.7 Sonnet1001001001001001001001001009899.7%
Gemini 2.5 Flash Lite1001001001001001009999999799.4%
GPT-4o Mini (temp=0)10010010010010010010099999599.2%
DeepSeek-V2 Chat100100999999989898979798.6%
Mistral NeMO1001001001001001009998959298.3%
Gemini 3.1 Pro (Preview)1001001001001001001001001007997.9%
Claude Sonnet 4.6100100100100100999898938697.4%
Claude Haiku 4.5100100100100100999999896995.5%
Claude Opus 4.610010010010099999894807394.2%
Claude Sonnet 4.510099999998979791877394.2%
Gemma 3 12B10010010010010010010098893191.7%
Gemma 3 4B1001001001001001009996853791.6%
Writer: Palmyra X51001001001001001001009998089.6%
Z.AI GLM 5100100999797949282696789.6%
MoonshotAI: Kimi K2.5100100100100100100948681987.0%
Claude Opus 4.51009999999997979485086.9%
Grok 41001001009997898074684785.4%
Z.AI GLM 4.610010010010096969387373784.6%
DeepSeek V3.210010010010099979575373783.9%
Gemma 3 27B1001001009898988978373783.4%
DeepSeek V3.1100100100100100989465373782.9%
GPT-5.29997979696937866373779.6%
Claude Opus 4100100989392807437373774.7%
Qwen 3.5 397B A17B10010010099999666660072.7%
Z.AI GLM 4.5100100100100100973737372072.6%
Gemini 2.5 Pro1001001009890565337373770.6%
Llama 3.1 Nemotron 70B100100100999795423737170.6%
Grok 4.1 Fast9998897978735757412870.0%
Qwen 2.5 72B10010010010010010059380069.6%
Qwen 3.5 Plus (2026-02-15)100100979695373737373767.3%
GPT-5.110099978078703737373767.1%
GPT-4.1 Nano10099999898893700062.0%
ByteDance Seed 1.6100100100999797600059.9%
Mistral Small 3.2 24B9997979587832610058.5%
DeepSeek V3 (2024-12-26)1001001009990504130058.4%
Z.AI GLM 4.7 Flash100100100953737373737057.9%
Arcee AI: Trinity Large (Preview)100100949191683400057.6%
Gemini 2.5 Flash10099975837373737373757.6%
Z.AI GLM 4.71009896953737373737057.3%
Llama 3.1 70B1009893923737373737557.1%
Claude 3.5 Sonnet9695843737373737373753.2%
Ministral 8B9999979556371300049.6%
o4 Mini High9896786353523770048.3%
DeepSeek V3 (2025-03-24)10094736549454200046.7%
Arcee AI: Trinity Mini999378665524231311046.1%
Ministral 3B10093926043373000045.4%
o4 Mini10010092654628910044.0%
Mistral Large 21009998734815311043.8%
Claude Sonnet 49637373737373737373742.8%
GPT-4o Mini (temp=1)9237373737373737373742.3%
Gemini 3 Flash (Preview)8837373737373737373741.9%
Mistral Medium 3.11009983613229660041.6%
GPT-4o, May 13th (temp=0)3737373737373737373736.8%
GPT-4o, Aug. 6th (temp=1)3737373737373737373736.8%
GPT-4o, Aug. 6th (temp=0)3737373737373737373736.8%
GPT-4o, May 13th (temp=1)3737373737373737373736.8%
GPT-4.1 Mini3737373737373737373736.8%
GPT-5100947837371000034.6%
Stealth: Aurora Alpha9688534424171300033.6%
GPT-5 Nano9696962600000031.5%
Rocinante 12B1009893910000030.0%
Hermes 3 70B10094742910000029.8%
Claude 3.5 Haiku997253381411540029.6%
GPT-4.13737373737373700025.8%
Minimax M2.5100100371700000025.4%
GPT-5 Mini10010044100000024.5%
Hermes 3 405B9690302020000023.9%
Grok 4 Fast9463361273211021.9%
WizardLM 2 8x22b1008812911000021.0%
ByteDance Seed 1.6 Flash2828262500000010.8%
Mistral Small Creative9474210000010.6%
Claude 3 Haiku10011000000010.2%
Gemini 3 Pro (Preview)980000000009.8%
Cohere Command R+ (Aug. 2024)6320100000008.4%
Ministral 3 3B4335100000007.8%
Ministral 3 14B3513430000005.5%
Mistral Large 3440000000004.4%
Llama 3.1 8B311000000003.2%
Mistral Large250000000002.5%
Ministral 3 8B90000000000.9%

dialogue-500

Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
GPT-5 Mini1001001009994929286857992.6%
GPT-510010010010010010010090771087.7%
o4 Mini10010010010096958772632183.3%
o4 Mini High100100100100987979500070.6%
Minimax M2.51009999989897420059.8%
MoonshotAI: Kimi K2.51009477755553100045.6%
Claude Sonnet 410099888533191500044.0%
Mistral Large988988737012000043.0%
Stealth: Aurora Alpha8884605755501610041.0%
Z.AI GLM 4.7100865739386000032.6%
GPT-5.110095704243000031.5%
Claude Opus 4.610086544492211029.9%
Ministral 3 14B10086812130000029.1%
Claude Opus 4.597924035204000028.7%
GPT-4o, May 13th (temp=0)9045433230201552028.4%
Qwen 3.5 397B A17B9695353050000026.1%
ByteDance Seed 1.69071481995440025.0%
GPT-4o, May 13th (temp=1)9361261180000020.1%
Gemini 2.5 Pro98933100000019.5%
Claude 3 Haiku985930610000019.3%
Ministral 8B99811000000018.1%
Mistral NeMO895312310000015.8%
GPT-5 Nano544940310000014.7%
Hermes 3 70B82631000000014.7%
Gemini 3 Pro (Preview)91540000000014.4%
Claude Sonnet 4.68725161060000014.4%
Claude Sonnet 4.57323221230000013.3%
Z.AI GLM 581454000000013.1%
GPT-5.288340000000012.2%
Ministral 3 3B93125530000011.8%
WizardLM 2 8x22b9843100000010.6%
Z.AI GLM 4.69971000000010.6%
Claude 3.5 Haiku413718300000010.0%
Rocinante 12B860000000008.6%
Arcee AI: Trinity Large (Preview)860000000008.6%
Qwen 3.5 Plus (2026-02-15)5233000000008.6%
Hermes 3 405B610000000006.1%
GPT-4o, Aug. 6th (temp=1)570000000005.8%
Mistral Small 3.2 24B523000000005.5%
Ministral 3B435100000005.0%
Mistral Medium 3.1480000000004.8%
GPT-4.1 Mini2214000000003.6%
DeepSeek V3 (2025-03-24)240000000002.4%
Llama 3.1 8B64000000001.0%
Llama 3.1 70B31000000000.4%
Claude 3.5 Sonnet21000000000.3%
Z.AI GLM 4.7 Flash20000000000.2%
Claude Haiku 4.520000000000.2%
Qwen 2.5 72B11000000000.2%
ByteDance Seed 1.6 Flash00000000000.0%
Writer: Palmyra X500000000000.0%
GPT-4.100000000000.0%
Claude Opus 400000000000.0%
Arcee AI: Trinity Mini00000000000.0%
DeepSeek V3.100000000000.0%
Mistral Large 200000000000.0%
Mistral Small Creative00000000000.0%
Grok 400000000000.0%
GPT-4.1 Nano00000000000.0%
Ministral 3 8B00000000000.0%
Cohere Command R+ (Aug. 2024)00000000000.0%
DeepSeek V3 (2024-12-26)00000000000.0%
Grok 4 Fast00000000000.0%
Llama 3.1 Nemotron 70B00000000000.0%
Grok 4.1 Fast00000000000.0%
DeepSeek V3.200000000000.0%
Gemma 3 12B00000000000.0%
Mistral Large 300000000000.0%
Z.AI GLM 4.500000000000.0%
GPT-4o, Aug. 6th (temp=0)00000000000.0%
Gemma 3 4B00000000000.0%
Claude 3.7 Sonnet00000000000.0%
DeepSeek-V2 Chat00000000000.0%
Gemini 3 Flash (Preview)00000000000.0%
GPT-4o Mini (temp=0)00000000000.0%
Gemini 2.5 Flash Lite00000000000.0%
GPT-4o Mini (temp=1)00000000000.0%
Gemini 2.5 Flash00000000000.0%
Gemma 3 27B00000000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100100100100100100.0%
GPT-51001001001001001001009997089.5%
GPT-5 Mini1001001009971584300057.1%
Minimax M2.51001001009988512600056.3%
MoonshotAI: Kimi K2.5999990795049473215056.0%
o4 Mini100100999075423200053.8%
Ministral 3 14B1009795938158100052.7%
Claude Opus 4.68885837847432810045.5%
o4 Mini High1009082824423000042.1%
Stealth: Aurora Alpha84726251454223212040.1%
Claude Opus 4.5989764554133000038.7%
Claude Sonnet 49694894598320034.6%
Ministral 8B96787159392000034.6%
Claude Sonnet 4.6978859302620664333.8%
ByteDance Seed 1.6 Flash8779473632262610033.4%
GPT-5.110098813030000031.2%
Claude Sonnet 4.59692902710000030.6%
WizardLM 2 8x22b10083654276100030.3%
GPT-4o, Aug. 6th (temp=0)8876746000000029.9%
Ministral 3 3B9687514460000028.5%
GPT-5 Nano10077504200000026.9%
Hermes 3 70B9383442620000025.0%
Hermes 3 405B987662921100024.9%
Ministral 3B979632662000023.9%
GPT-4o, May 13th (temp=1)9488371600000023.5%
Ministral 3 8B938059200000023.4%
ByteDance Seed 1.69765441353310023.0%
Rocinante 12B1006547743000022.6%
Mistral Large 2100959970000021.9%
Mistral NeMO6462444053000021.7%
Llama 3.1 8B1006050000000021.1%
GPT-5.2987434500000021.0%
Arcee AI: Trinity Large (Preview)858430730000021.0%
DeepSeek V3 (2024-12-26)97934000000019.5%
Qwen 3.5 397B A17B847824100000018.7%
Mistral Small 3.2 24B867510300000017.4%
Claude 3.5 Haiku825516653100016.8%
Z.AI GLM 4.7973230700000016.7%
Mistral Large5348311762110015.8%
Cohere Command R+ (Aug. 2024)79655100000014.9%
GPT-4o, Aug. 6th (temp=1)933010200000013.5%
GPT-4.1 Mini871511430000011.9%
Gemini 2.5 Pro673413300000011.9%
Claude 3.5 Sonnet62398511100011.7%
Z.AI GLM 4.5930000000009.3%
Mistral Large 35031000000008.2%
DeepSeek V3 (2025-03-24)6413110000008.0%
Llama 3.1 70B678000000007.5%
Qwen 2.5 72B27251000000006.2%
Mistral Medium 3.1481000000004.9%
Z.AI GLM 52215810000004.6%
Claude Opus 4320000000003.2%
Llama 3.1 Nemotron 70B212000000002.5%
GPT-4.1 Nano230000000002.4%
Grok 4 Fast182200000002.2%
GPT-4o, May 13th (temp=0)140000000001.4%
Qwen 3.5 Plus (2026-02-15)71000000000.8%
Claude Haiku 4.551000000000.6%
Z.AI GLM 4.7 Flash40000000000.4%
Grok 430000000000.3%
Z.AI GLM 4.621000000000.3%
Claude 3 Haiku30000000000.3%
Mistral Small Creative20000000000.2%
Grok 4.1 Fast10000000000.2%
Writer: Palmyra X500000000000.0%
Gemini 3 Pro (Preview)00000000000.0%
DeepSeek V3.100000000000.0%
DeepSeek-V2 Chat00000000000.0%
Gemma 3 12B00000000000.0%
GPT-4.100000000000.0%
Claude 3.7 Sonnet00000000000.0%
Arcee AI: Trinity Mini00000000000.0%
GPT-4o Mini (temp=0)00000000000.0%
Gemini 2.5 Flash Lite00000000000.0%
DeepSeek V3.200000000000.0%
GPT-4o Mini (temp=1)00000000000.0%
Gemma 3 27B00000000000.0%
Gemma 3 4B00000000000.0%
Gemini 3 Flash (Preview)00000000000.0%
Gemini 2.5 Flash00000000000.0%
Model # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 Avg ▼
Gemini 3.1 Pro (Preview)1001001001001001001001001009999.9%
Grok 4 Fast999795948885846534874.8%
Claude 3.5 Sonnet9999959287878265065.3%
Mistral Medium 3.11009591918946423814060.5%
DeepSeek-V2 Chat10087807979683833211059.6%
Grok 4.1 Fast1009682756154503014656.9%
o4 Mini High10010091746657232011054.3%
Llama 3.1 70B999594919123520049.9%
GPT-5 Mini10010010099990000049.8%
Claude 3.5 Haiku9999968975201610049.7%
Mistral Small Creative100998978542823231049.5%
GPT-4o, Aug. 6th (temp=0)100936740404040351145.6%
DeepSeek V3.1999889794219870044.2%
GPT-4o, Aug. 6th (temp=1)100898885710000043.3%
ByteDance Seed 1.61009786815410000043.0%
Ministral 3 8B988155544242242010142.8%
GPT-5.2929085815310651042.2%
Grok 4935857544743262510841.9%
Claude Sonnet 4.6100100845432221532041.2%
Mistral Small 3.2 24B98927671565300040.0%
GPT-510085715838241000038.5%
Claude 3 Haiku978860504427200037.0%
Mistral Large 299988934179850035.9%
Claude Sonnet 4.5938475353433200035.6%
DeepSeek V3 (2024-12-26)96595738323127115035.5%
Minimax M2.5978067541613200032.7%
Llama 3.1 Nemotron 70B9887764594411032.6%
Gemma 3 12B777269413710631031.5%
GPT-5.199826834107410030.4%
Stealth: Aurora Alpha96876147112000030.4%
Cohere Command R+ (Aug. 2024)88626149258000029.5%
Ministral 3B86744836301000027.5%
ByteDance Seed 1.6 Flash9790721200000027.2%
Ministral 3 14B93774821130000025.2%
Mistral Large 3535147424010000024.3%
Qwen 2.5 72B93703124231000024.3%
Hermes 3 70B8979461590000023.9%
DeepSeek V3 (2025-03-24)98981211105200023.5%
Z.AI GLM 4.681604129101000022.1%
Llama 3.1 8B7471372085000021.4%
Hermes 3 405B8871312100000021.1%
Qwen 3.5 397B A17B948115710000019.9%
Claude 3.7 Sonnet504534282019210019.8%
Arcee AI: Trinity Mini98758510000018.7%
o4 Mini8957121210000017.1%
WizardLM 2 8x22b983826600000016.8%
MoonshotAI: Kimi K2.595291615110000016.7%
GPT-4o Mini (temp=1)99480000000014.7%
GPT-4.1 Nano4830292521000013.5%
GPT-4.1 Mini96234321000012.9%
GPT-4o, May 13th (temp=1)82299000000012.1%
Claude Sonnet 499116200000011.8%
Claude Haiku 4.584137321100011.1%
Writer: Palmyra X59730000000010.1%
Mistral NeMO8710000000009.7%
Rocinante 12B353017132000009.7%
Z.AI GLM 4.7 Flash760000000007.6%
Arcee AI: Trinity Large (Preview)750000000007.5%
Gemini 2.5 Pro3427000000006.2%
DeepSeek V3.23212521110005.4%
Gemini 3 Pro (Preview)452000000004.7%
Z.AI GLM 4.5288000000003.6%
Claude Opus 4.6321100000003.3%
Claude Opus 41913000000003.1%
Gemini 2.5 Flash260000000002.7%
GPT-4o, May 13th (temp=0)260000000002.7%
Mistral Large90000000000.9%
Gemini 3 Flash (Preview)71000000000.9%
Gemma 3 27B30000000000.3%
Ministral 3 3B20000000000.2%
Gemini 2.5 Flash Lite11000000000.2%
Claude Opus 4.510000000000.1%
Z.AI GLM 4.700000000000.0%
Z.AI GLM 500000000000.0%
Qwen 3.5 Plus (2026-02-15)00000000000.0%
Ministral 8B00000000000.0%
GPT-4.100000000000.0%
Gemma 3 4B00000000000.0%
GPT-4o Mini (temp=0)00000000000.0%
GPT-5 Nano00000000000.0%