Dialogue tag variety (said vs. fancy)

Test: Bad Writing Habits

Avg. Score: 61.6%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Claude Sonnet 4.6 | 99.4% | $0.031 | 39.3s | 95%
2 | Minimax M2.5 | 92.4% | $0.0034 | 1.3m | 60%
3 | Mistral Large | 85.7% | $0.014 | 30.9s | 42%
4 | Mistral Large 2 | 86.8% | $0.013 | 29.4s | 39%
5 | Ministral 8B | 80.0% | $0.0004 | 10.4s | 35%
6 | Claude Sonnet 4.5 | 88.4% | $0.035 | 38.1s | 45%
7 | Z.AI GLM 5 | 88.4% | $0.0084 | 1.2m | 42%
8 | Qwen 3.5 397B A17B | 95.7% | $0.014 | 3.0m | 66%
9 | Mistral Large 3 | 82.1% | $0.0033 | 30.3s | 35%
10 | GPT-5 Mini | 81.8% | $0.0100 | 57.4s | 44%
11 | Mistral Small Creative | 78.8% | $0.0007 | 9.1s | 29%
12 | Ministral 3 3B | 79.1% | $0.0005 | 11.1s | 29%
13 | Claude Opus 4.5 | 91.0% | $0.070 | 53.4s | 56%
14 | Ministral 3B | 77.5% | $0.0001 | 8.1s | 28%
15 | Claude Haiku 4.5 | 77.2% | $0.011 | 21.6s | 32%
16 | ByteDance Seed 1.6 Flash | 77.0% | $0.0013 | 27.3s | 29%
17 | Ministral 3 8B | 75.1% | $0.0008 | 19.6s | 27%
18 | Writer: Palmyra X5 | 77.5% | $0.011 | 22.0s | 28%
19 | Mistral Medium 3.1 | 75.6% | $0.0048 | 36.5s | 25%
20 | Ministral 3 14B | 69.6% | $0.0007 | 11.7s | 22%
21 | Claude Sonnet 4 | 80.1% | $0.032 | 43.7s | 31%
22 | DeepSeek V3 (2025-03-24) | 68.8% | $0.0014 | 39.4s | 17%
23 | ByteDance Seed 1.6 | 85.2% | $0.013 | 2.5m | 34%
24 | Mistral NeMO | 62.9% | $0.0005 | 10.1s | 14%
25 | GPT-5 | 90.2% | $0.065 | 2.8m | 55%
26 | Claude Opus 4.6 | 86.0% | $0.078 | 1.2m | 37%
27 | Grok 4.1 Fast | 66.1% | $0.0018 | 37.8s | 16%
28 | Gemini 2.5 Pro | 73.3% | $0.036 | 36.2s | 23%
29 | Llama 3.1 70B | 64.3% | $0.0015 | 29.4s | 11%
30 | Z.AI GLM 4.6 | 65.4% | $0.0065 | 51.5s | 14%
31 | DeepSeek V3 (2024-12-26) | 63.6% | $0.0021 | 54.6s | 14%
32 | Claude 3.5 Sonnet | 70.7% | $0.048 | 35.5s | 19%
33 | Gemini 2.5 Flash | 56.9% | $0.0052 | 10.6s | 9%
34 | GPT-4.1 | 63.7% | $0.018 | 44.7s | 15%
35 | o4 Mini | 59.3% | $0.015 | 25.7s | 11%
36 | Qwen 2.5 72B | 56.5% | $0.0010 | 36.7s | 10%
37 | Z.AI GLM 4.5 | 59.4% | $0.0051 | 42.1s | 10%
38 | GPT-4.1 Nano | 51.8% | $0.0007 | 13.3s | 7%
39 | Arcee AI: Trinity Large (Preview) | 56.3% | $0.0000 | 43.6s | 9%
40 | Grok 4 Fast | 52.9% | $0.0017 | 24.1s | 8%
41 | Claude 3.5 Haiku | 54.2% | $0.0035 | 10.8s | 3%
42 | WizardLM 2 8x22b | 66.5% | $0.0026 | 1.8m | 16%
43 | Arcee AI: Trinity Mini | 50.0% | $0.0003 | 9.2s | 5%
44 | GPT-4o, Aug. 6th (temp=0) | 54.9% | $0.023 | 22.7s | 13%
45 | Gemma 3 27B | 56.7% | $0.0006 | 52.6s | 9%
46 | GPT-4o, May 13th (temp=0) | 57.7% | $0.035 | 14.1s | 10%
47 | o4 Mini High | 61.2% | $0.025 | 47.2s | 10%
48 | DeepSeek V3.2 | 64.4% | $0.0014 | 1.9m | 14%
49 | DeepSeek-V2 Chat | 55.7% | $0.0021 | 53.3s | 8%
50 | MoonshotAI: Kimi K2.5 | 77.1% | $0.019 | 3.2m | 29%
51 | GPT-5.1 | 72.0% | $0.054 | 1.8m | 23%
52 | GPT-4.1 Mini | 43.7% | $0.0027 | 19.0s | 8%
53 | DeepSeek V3.1 | 60.6% | $0.0020 | 1.8m | 13%
54 | Claude 3 Haiku | 40.3% | $0.0025 | 14.9s | 6%
55 | Gemini 2.5 Flash Lite | 40.0% | $0.0009 | 9.5s | 4%
56 | Hermes 3 405B | 48.9% | $0.0032 | 53.2s | 5%
57 | Rocinante 12B | 46.1% | $0.0014 | 38.4s | 3%
58 | Claude 3.7 Sonnet | 56.4% | $0.042 | 46.7s | 12%
59 | Z.AI GLM 4.7 Flash | 49.2% | $0.0017 | 1.2m | 8%
60 | Gemini 3.1 Pro (Preview) | 79.9% | $0.107 | 1.8m | 28%
61 | Z.AI GLM 4.7 | 51.7% | $0.010 | 1.4m | 8%
62 | Stealth: Aurora Alpha | 35.7% | $0.0000 | 9.8s | 0%
63 | GPT-5 Nano | 47.2% | $0.0042 | 1.4m | 9%
64 | Gemini 3 Flash (Preview) | 37.4% | $0.0078 | 19.6s | 3%
65 | GPT-5.2 | 61.4% | $0.056 | 1.5m | 17%
66 | Llama 3.1 8B | 45.2% | $0.0003 | 1.3m | 5%
67 | Llama 3.1 Nemotron 70B | 33.9% | $0.0038 | 31.7s | 1%
68 | GPT-4o Mini (temp=0) | 34.6% | $0.0012 | 34.8s | 0%
69 | Gemma 3 12B | 31.2% | $0.0004 | 41.3s | 3%
70 | Cohere Command R+ (Aug. 2024) | 38.6% | $0.020 | 52.5s | 3%
71 | Qwen 3.5 Plus (2026-02-15) | 30.5% | $0.0060 | 31.5s | 0%
72 | Grok 4 | 52.5% | $0.048 | 1.7m | 9%
73 | Claude Opus 4 | 85.3% | $0.209 | 1.4m | 36%
74 | Gemma 3 4B | 19.7% | $0.0002 | 20.0s | 0%
75 | Hermes 3 70B | 30.8% | $0.0010 | 1.2m | 0%
76 | GPT-4o Mini (temp=1) | 20.1% | $0.0012 | 34.8s | 0%
77 | GPT-4o, May 13th (temp=1) | 25.2% | $0.033 | 14.4s | 0%
78 | GPT-4o, Aug. 6th (temp=1) | 19.5% | $0.018 | 24.4s | 0%
79 | Gemini 3 Pro (Preview) | 31.2% | $0.055 | 54.4s | 4%
80 | Mistral Small 3.2 24B | 68.9% | $0.0069 | 5.7m | 15%

Individual Scenarios

Detailed Writing Rules

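Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 97 | 88 | 97.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 79 | 95.9%
Claude Sonnet 4.5 | 100 | 100 | 100 | 94 | 79 | 94.6%
GPT-5 | 100 | 100 | 86 | 82 | 77 | 89.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 42 | 88.4%
Gemini 3.1 Pro (Preview) | 100 | 100 | 83 | 82 | 50 | 83.0%
Minimax M2.5 | 100 | 100 | 100 | 79 | 18 | 79.4%
Claude Opus 4.6 | 100 | 100 | 100 | 80 | 15 | 79.0%
Z.AI GLM 5 | 100 | 100 | 100 | 50 | 43 | 78.6%
Mistral Large 2 | 100 | 100 | 100 | 79 | 0 | 75.7%
Ministral 8B | 100 | 100 | 79 | 73 | 0 | 70.3%
Writer: Palmyra X5 | 94 | 93 | 88 | 64 | 0 | 67.9%
Mistral Medium 3.1 | 100 | 99 | 95 | 21 | 12 | 65.4%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 94 | 7 | 0 | 60.3%
Claude 3.5 Sonnet | 91 | 79 | 67 | 59 | 0 | 59.1%
Ministral 3 14B | 76 | 70 | 59 | 45 | 45 | 58.9%
Mistral Large 3 | 91 | 78 | 67 | 42 | 11 | 57.7%
GPT-5.1 | 100 | 100 | 47 | 37 | 0 | 56.7%
GPT-5.2 | 81 | 71 | 68 | 31 | 29 | 55.9%
ByteDance Seed 1.6 Flash | 100 | 73 | 67 | 21 | 10 | 54.1%
Qwen 2.5 72B | 71 | 63 | 53 | 50 | 33 | 54.1%
Claude Sonnet 4 | 100 | 100 | 70 | 0 | 0 | 54.0%
Ministral 3 8B | 76 | 69 | 67 | 55 | 0 | 53.3%
MoonshotAI: Kimi K2.5 | 100 | 100 | 59 | 0 | 0 | 51.8%
Rocinante 12B | 100 | 100 | 59 | 0 | 0 | 51.8%
GPT-4.1 | 93 | 83 | 81 | 0 | 0 | 51.4%
GPT-5 Mini | 85 | 84 | 68 | 17 | 0 | 51.0%
Mistral Small 3.2 24B | 100 | 77 | 69 | 0 | 0 | 49.3%
Grok 4.1 Fast | 100 | 100 | 39 | 0 | 0 | 47.8%
Mistral NeMO | 100 | 67 | 30 | 25 | 0 | 44.3%
o4 Mini | 64 | 50 | 39 | 32 | 7 | 38.5%
Gemini 2.5 Flash | 83 | 69 | 32 | 0 | 0 | 36.9%
WizardLM 2 8x22b | 76 | 67 | 31 | 0 | 0 | 34.6%
Mistral Large | 94 | 79 | 0 | 0 | 0 | 34.6%
Ministral 3 3B | 88 | 79 | 0 | 0 | 0 | 33.2%
Claude 3 Haiku | 50 | 39 | 39 | 32 | 0 | 32.0%
Grok 4 | 63 | 35 | 30 | 21 | 0 | 29.7%
Ministral 3B | 83 | 59 | 0 | 0 | 0 | 28.5%
Hermes 3 70B | 53 | 53 | 25 | 0 | 0 | 26.3%
Llama 3.1 Nemotron 70B | 91 | 39 | 0 | 0 | 0 | 26.0%
GPT-4o, May 13th (temp=0) | 63 | 39 | 19 | 0 | 0 | 24.2%
GPT-4o, May 13th (temp=1) | 88 | 20 | 2 | 0 | 0 | 21.8%
Claude Haiku 4.5 | 47 | 47 | 12 | 0 | 0 | 21.2%
Grok 4 Fast | 73 | 25 | 7 | 0 | 0 | 21.0%
o4 Mini High | 100 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek-V2 Chat | 100 | 0 | 0 | 0 | 0 | 20.0%
Claude 3.7 Sonnet | 100 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V3 (2024-12-26) | 79 | 17 | 0 | 0 | 0 | 19.0%
DeepSeek V3 (2025-03-24) | 67 | 25 | 0 | 0 | 0 | 18.3%
DeepSeek V3.1 | 59 | 25 | 7 | 0 | 0 | 18.2%
Arcee AI: Trinity Large (Preview) | 91 | 0 | 0 | 0 | 0 | 18.2%
Mistral Small Creative | 82 | 0 | 0 | 0 | 0 | 16.4%
GPT-4o, Aug. 6th (temp=1) | 73 | 0 | 0 | 0 | 0 | 14.6%
Z.AI GLM 4.7 Flash | 67 | 0 | 0 | 0 | 0 | 13.3%
Hermes 3 405B | 35 | 30 | 0 | 0 | 0 | 12.9%
DeepSeek V3.2 | 64 | 0 | 0 | 0 | 0 | 12.9%
Gemini 2.5 Pro | 63 | 0 | 0 | 0 | 0 | 12.6%
Gemini 3 Flash (Preview) | 59 | 3 | 0 | 0 | 0 | 12.4%
GPT-5 Nano | 52 | 0 | 0 | 0 | 0 | 10.3%
GPT-4.1 Mini | 43 | 7 | 0 | 0 | 0 | 10.0%
Llama 3.1 70B | 50 | 0 | 0 | 0 | 0 | 10.0%
Llama 3.1 8B | 50 | 0 | 0 | 0 | 0 | 10.0%
Z.AI GLM 4.7 | 28 | 17 | 4 | 0 | 0 | 9.7%
Z.AI GLM 4.5 | 31 | 7 | 0 | 0 | 0 | 7.6%
Z.AI GLM 4.6 | 25 | 0 | 0 | 0 | 0 | 5.0%
Cohere Command R+ (Aug. 2024) | 17 | 0 | 0 | 0 | 0 | 3.3%
Gemini 3 Pro (Preview) | 7 | 0 | 0 | 0 | 0 | 1.4%
Gemma 3 12B | 2 | 0 | 0 | 0 | 0 | 0.4%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Haiku | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash Lite | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0.0%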
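
Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-5 | 100 | 100 | 100 | 100 | 83 | 96.7%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 76 | 95.2%
Claude Opus 4.5 | 100 | 100 | 97 | 88 | 67 | 90.3%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 39 | 87.8%
GPT-5 Mini | 100 | 100 | 100 | 89 | 23 | 82.5%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Opus 4 | 100 | 100 | 99 | 94 | 0 | 78.8%
Minimax M2.5 | 100 | 100 | 100 | 71 | 0 | 74.2%
Writer: Palmyra X5 | 100 | 100 | 100 | 50 | 0 | 70.0%
Ministral 3B | 100 | 100 | 100 | 25 | 0 | 65.0%
DeepSeek V3.1 | 100 | 100 | 100 | 7 | 0 | 61.4%
GPT-5.1 | 100 | 100 | 50 | 39 | 17 | 61.1%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 0 | 0 | 60.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 0 | 0 | 60.0%
Z.AI GLM 5 | 100 | 100 | 100 | 0 | 0 | 60.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 0 | 0 | 60.0%
Hermes 3 405B | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Small Creative | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-5.2 | 94 | 79 | 79 | 25 | 0 | 55.3%
Grok 4.1 Fast | 100 | 100 | 39 | 0 | 0 | 47.8%
o4 Mini | 100 | 100 | 25 | 0 | 0 | 45.0%
Z.AI GLM 4.7 Flash | 88 | 73 | 50 | 14 | 0 | 44.8%
Rocinante 12B | 100 | 85 | 30 | 7 | 0 | 44.5%
Ministral 3 14B | 100 | 59 | 50 | 0 | 0 | 41.8%
Arcee AI: Trinity Large (Preview) | 100 | 73 | 32 | 3 | 0 | 41.7%
DeepSeek V3 (2025-03-24) | 100 | 100 | 7 | 0 | 0 | 41.4%
Claude Opus 4.6 | 100 | 100 | 0 | 0 | 0 | 40.0%
Grok 4 Fast | 100 | 100 | 0 | 0 | 0 | 40.0%
Mistral Small 3.2 24B | 100 | 100 | 0 | 0 | 0 | 40.0%
Mistral NeMO | 100 | 100 | 0 | 0 | 0 | 40.0%
Z.AI GLM 4.5 | 100 | 91 | 0 | 0 | 0 | 38.2%
Claude 3.5 Sonnet | 100 | 45 | 39 | 0 | 0 | 36.7%
Hermes 3 70B | 100 | 67 | 7 | 0 | 0 | 34.8%
DeepSeek V3.2 | 100 | 63 | 0 | 0 | 0 | 32.6%
Ministral 8B | 100 | 25 | 25 | 0 | 0 | 30.0%
WizardLM 2 8x22b | 100 | 50 | 0 | 0 | 0 | 30.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 39 | 0 | 0 | 0 | 27.8%
Claude 3 Haiku | 100 | 39 | 0 | 0 | 0 | 27.8%
Ministral 3 8B | 100 | 39 | 0 | 0 | 0 | 27.8%
GPT-4.1 | 100 | 25 | 0 | 0 | 0 | 25.0%
Mistral Medium 3.1 | 100 | 17 | 0 | 0 | 0 | 23.3%
Gemini 3 Flash (Preview) | 76 | 20 | 17 | 0 | 0 | 22.4%
Grok 4 | 100 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V3 (2024-12-26) | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=0) | 100 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek-V2 Chat | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 0 | 0 | 0 | 0 | 20.0%
ByteDance Seed 1.6 Flash | 100 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 8B | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4.1 Nano | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4.1 Mini | 83 | 0 | 0 | 0 | 0 | 16.7%
Claude 3.7 Sonnet | 56 | 0 | 0 | 0 | 0 | 11.2%
GPT-5 Nano | 39 | 0 | 0 | 0 | 0 | 7.8%
Qwen 2.5 72B | 17 | 17 | 0 | 0 | 0 | 6.7%
Gemma 3 27B | 25 | 0 | 0 | 0 | 0 | 5.0%
Gemma 3 4B | 25 | 0 | 0 | 0 | 0 | 5.0%
o4 Mini High | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 4.7 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash Lite | 0 | 0 | 0 | 0 | 0 | 0.0%
Llama 3.1 Nemotron 70B | 0 | 0 | 0 | 0 | 0 | 0.0%
Cohere Command R+ (Aug. 2024) | 0 | 0 | 0 | 0 | 0 | 0.0%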
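
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 99 | 99.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 99 | 99.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 98 | 99.7%
Gemma 3 27B | 100 | 100 | 100 | 100 | 97 | 99.5%
Mistral Small Creative | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-4.1 | 100 | 100 | 100 | 100 | 91 | 98.2%
Mistral NeMO | 100 | 100 | 100 | 100 | 91 | 98.2%
Grok 4 | 100 | 100 | 100 | 99 | 92 | 98.2%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 90 | 98.1%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 86 | 97.2%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemma 3 12B | 100 | 100 | 100 | 91 | 90 | 96.2%
GPT-5 Mini | 100 | 100 | 100 | 98 | 82 | 96.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 79 | 95.7%
Ministral 3B | 100 | 100 | 100 | 100 | 79 | 95.7%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 76 | 95.2%
Gemini 2.5 Pro | 100 | 100 | 100 | 88 | 88 | 95.0%
DeepSeek V3.1 | 100 | 100 | 93 | 91 | 89 | 94.8%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 73 | 94.6%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 70 | 94.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 66 | 93.2%
GPT-5 Nano | 100 | 100 | 100 | 85 | 54 | 87.7%
GPT-4o Mini (temp=0) | 100 | 92 | 89 | 81 | 75 | 87.4%
Grok 4 Fast | 100 | 100 | 100 | 92 | 36 | 85.6%
GPT-4.1 Nano | 100 | 91 | 91 | 81 | 32 | 79.1%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 83 | 69 | 28 | 76.1%
Gemini 3 Flash (Preview) | 97 | 95 | 92 | 48 | 44 | 75.4%
Arcee AI: Trinity Mini | 100 | 100 | 79 | 76 | 21 | 75.1%
GPT-4.1 Mini | 100 | 100 | 77 | 47 | 42 | 73.1%
Qwen 3.5 Plus (2026-02-15) | 100 | 91 | 85 | 47 | 35 | 71.8%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 35 | 0 | 66.9%
Gemini 2.5 Flash | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 8B | 100 | 100 | 88 | 0 | 0 | 57.5%
GPT-4o, May 13th (temp=1) | 100 | 80 | 61 | 27 | 13 | 56.1%
Gemini 2.5 Flash Lite | 100 | 81 | 50 | 22 | 17 | 54.0%
Rocinante 12B | 100 | 100 | 45 | 17 | 0 | 52.3%
Gemini 3 Pro (Preview) | 100 | 79 | 40 | 23 | 5 | 49.5%
Gemma 3 4B | 100 | 55 | 47 | 32 | 0 | 46.7%
Llama 3.1 Nemotron 70B | 100 | 88 | 20 | 14 | 7 | 45.6%
Claude 3.5 Haiku | 100 | 50 | 17 | 7 | 0 | 34.8%
GPT-4o, Aug. 6th (temp=1) | 99 | 50 | 17 | 0 | 0 | 33.1%
GPT-4o Mini (temp=1) | 36 | 25 | 17 | 15 | 0 | 18.7%
Stealth: Aurora Alpha | 52 | 39 | 0 | 0 | 0 | 18.3%
Hermes 3 70B | 47 | 0 | 0 | 0 | 0 | 9.5%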
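
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 99 | 99.7%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 92 | 98.5%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 92 | 98.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 87 | 97.3%
Mistral Large | 100 | 100 | 100 | 89 | 88 | 95.4%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 91 | 83 | 94.9%
Z.AI GLM 4.5 | 100 | 100 | 100 | 96 | 75 | 94.1%
Claude Opus 4 | 100 | 100 | 100 | 100 | 70 | 94.0%
o4 Mini | 100 | 100 | 95 | 88 | 83 | 93.2%
Mistral Small Creative | 100 | 100 | 100 | 88 | 76 | 92.7%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 62 | 92.4%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 59 | 91.8%
Claude 3 Haiku | 100 | 100 | 100 | 96 | 50 | 89.2%
o4 Mini High | 100 | 100 | 90 | 79 | 76 | 89.0%
Rocinante 12B | 100 | 100 | 97 | 73 | 67 | 87.4%
DeepSeek V3.2 | 100 | 100 | 100 | 76 | 61 | 87.4%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 71 | 59 | 86.0%
GPT-5 Mini | 95 | 90 | 89 | 83 | 72 | 86.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 23 | 84.6%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 73 | 45 | 83.6%
Ministral 8B | 100 | 100 | 100 | 100 | 17 | 83.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 56 | 48 | 80.8%
DeepSeek V3 (2024-12-26) | 100 | 90 | 84 | 79 | 48 | 80.3%
Claude 3.5 Sonnet | 100 | 100 | 100 | 88 | 7 | 78.9%
Claude 3.7 Sonnet | 100 | 100 | 95 | 69 | 27 | 78.3%
Gemma 3 27B | 100 | 88 | 85 | 75 | 36 | 76.7%
Grok 4 Fast | 100 | 81 | 79 | 63 | 59 | 76.3%
GPT-4.1 | 100 | 100 | 100 | 80 | 0 | 76.0%
Ministral 3 8B | 100 | 100 | 91 | 73 | 7 | 74.3%
Llama 3.1 8B | 88 | 88 | 83 | 59 | 50 | 73.5%
Gemini 2.5 Pro | 100 | 100 | 100 | 55 | 0 | 71.0%
Ministral 3 14B | 100 | 88 | 88 | 70 | 0 | 69.0%
Hermes 3 70B | 100 | 100 | 100 | 45 | 0 | 68.9%
Mistral NeMO | 100 | 100 | 91 | 36 | 17 | 68.8%
Ministral 3 3B | 100 | 100 | 79 | 32 | 25 | 67.2%
GPT-4o, May 13th (temp=0) | 100 | 100 | 91 | 25 | 0 | 63.2%
Z.AI GLM 4.7 Flash | 100 | 100 | 69 | 43 | 0 | 62.3%
Hermes 3 405B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash | 86 | 75 | 75 | 56 | 0 | 58.5%
Z.AI GLM 4.7 | 83 | 79 | 47 | 43 | 39 | 58.0%
Arcee AI: Trinity Mini | 100 | 67 | 53 | 45 | 25 | 57.9%
Ministral 3B | 100 | 100 | 79 | 7 | 0 | 57.1%
Qwen 2.5 72B | 100 | 67 | 50 | 44 | 15 | 55.0%
DeepSeek V3.1 | 100 | 79 | 42 | 39 | 0 | 51.9%
Grok 4 | 89 | 85 | 39 | 32 | 0 | 49.2%
GPT-4o, Aug. 6th (temp=0) | 59 | 53 | 52 | 39 | 28 | 46.3%
GPT-4o Mini (temp=0) | 99 | 65 | 63 | 0 | 0 | 45.5%
Llama 3.1 Nemotron 70B | 67 | 67 | 55 | 25 | 0 | 42.6%
Cohere Command R+ (Aug. 2024) | 81 | 55 | 39 | 25 | 0 | 39.9%
Gemma 3 12B | 79 | 75 | 28 | 5 | 0 | 37.1%
GPT-4.1 Mini | 75 | 57 | 29 | 22 | 0 | 36.6%
GPT-4.1 Nano | 73 | 63 | 30 | 17 | 0 | 36.6%
Gemini 3 Pro (Preview) | 92 | 25 | 23 | 15 | 0 | 31.0%
Qwen 3.5 Plus (2026-02-15) | 63 | 47 | 36 | 0 | 0 | 29.1%
GPT-5 Nano | 48 | 25 | 15 | 12 | 0 | 19.8%
GPT-4o, May 13th (temp=1) | 52 | 18 | 0 | 0 | 0 | 14.2%
GPT-4o, Aug. 6th (temp=1) | 36 | 35 | 0 | 0 | 0 | 14.1%
GPT-4o Mini (temp=1) | 25 | 0 | 0 | 0 | 0 | 5.0%
Gemini 3 Flash (Preview) | 17 | 4 | 3 | 0 | 0 | 4.7%
Gemini 2.5 Flash Lite | 20 | 0 | 0 | 0 | 0 | 3.9%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Haiku | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0.0%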
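
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 99 | 99.7%
GPT-5.2 | 100 | 100 | 100 | 100 | 95 | 99.1%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 94 | 98.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 92 | 98.5%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 91 | 98.2%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 91 | 98.2%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 86 | 97.2%
o4 Mini | 100 | 100 | 100 | 94 | 92 | 97.1%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 84 | 96.9%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 83 | 96.7%
Minimax M2.5 | 100 | 100 | 100 | 98 | 84 | 96.5%
o4 Mini High | 100 | 100 | 97 | 92 | 89 | 95.6%
Llama 3.1 70B | 100 | 100 | 100 | 88 | 88 | 95.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 75 | 95.0%
Mistral Medium 3.1 | 100 | 100 | 96 | 95 | 81 | 94.6%
GPT-5 Mini | 100 | 100 | 100 | 88 | 81 | 93.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 94 | 73 | 93.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 91 | 74 | 93.1%
Claude Haiku 4.5 | 100 | 100 | 100 | 92 | 73 | 93.0%
DeepSeek V3.1 | 100 | 100 | 95 | 93 | 75 | 92.7%
Llama 3.1 8B | 100 | 100 | 100 | 83 | 73 | 91.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 81 | 74 | 91.1%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 73 | 69 | 88.4%
Ministral 3B | 100 | 100 | 100 | 81 | 61 | 88.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 96 | 44 | 88.1%
Gemini 2.5 Pro | 100 | 100 | 85 | 80 | 73 | 87.6%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 76 | 62 | 87.5%
Ministral 3 14B | 100 | 100 | 100 | 97 | 39 | 87.3%
Z.AI GLM 4.6 | 93 | 91 | 89 | 83 | 79 | 87.1%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 29 | 85.8%
Mistral Large | 100 | 100 | 100 | 81 | 47 | 85.7%
GPT-5 Nano | 100 | 85 | 81 | 76 | 75 | 83.4%
Hermes 3 405B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o Mini (temp=0) | 100 | 100 | 96 | 88 | 15 | 79.8%
Ministral 3 8B | 100 | 100 | 100 | 79 | 20 | 79.6%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 56 | 41 | 79.5%
Gemini 3 Pro (Preview) | 100 | 91 | 72 | 64 | 61 | 77.7%
Ministral 8B | 100 | 100 | 100 | 53 | 35 | 77.7%
Mistral NeMO | 100 | 100 | 100 | 88 | 0 | 77.5%
Grok 4 Fast | 100 | 100 | 100 | 85 | 0 | 77.1%
Gemma 3 27B | 100 | 100 | 100 | 54 | 28 | 76.4%
Gemini 3 Flash (Preview) | 100 | 89 | 87 | 73 | 31 | 75.9%
Grok 4 | 100 | 100 | 97 | 45 | 31 | 74.6%
Mistral Small 3.2 24B | 100 | 100 | 88 | 79 | 3 | 73.8%
Z.AI GLM 4.7 | 100 | 100 | 73 | 58 | 28 | 71.9%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 35 | 0 | 66.9%
GPT-4o Mini (temp=1) | 80 | 73 | 54 | 52 | 48 | 61.3%
GPT-4.1 Nano | 100 | 70 | 59 | 56 | 12 | 59.5%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 64 | 31 | 0 | 59.1%
Rocinante 12B | 100 | 100 | 81 | 0 | 0 | 56.2%
Claude 3 Haiku | 100 | 100 | 47 | 29 | 0 | 55.1%
Hermes 3 70B | 100 | 100 | 63 | 0 | 0 | 52.6%
Ministral 3 3B | 100 | 100 | 55 | 4 | 0 | 51.7%
GPT-4o, May 13th (temp=1) | 83 | 67 | 50 | 41 | 0 | 48.3%
GPT-4.1 Mini | 100 | 75 | 23 | 22 | 0 | 43.8%
Gemma 3 12B | 76 | 56 | 39 | 31 | 0 | 40.5%
Cohere Command R+ (Aug. 2024) | 79 | 71 | 20 | 0 | 0 | 33.9%
Arcee AI: Trinity Mini | 100 | 45 | 20 | 0 | 0 | 32.9%
Gemma 3 4B | 79 | 29 | 22 | 21 | 0 | 30.0%
Llama 3.1 Nemotron 70B | 100 | 25 | 0 | 0 | 0 | 25.0%
GPT-4o, Aug. 6th (temp=1) | 79 | 43 | 0 | 0 | 0 | 24.3%
Stealth: Aurora Alpha | 100 | 0 | 0 | 0 | 0 | 20.0%
Gemini 2.5 Flash Lite | 56 | 0 | 0 | 0 | 0 | 11.3%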
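
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 88 | 97.5%
DeepSeek V3.1 | 100 | 100 | 100 | 99 | 88 | 97.2%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 83 | 96.7%
Ministral 3 8B | 100 | 100 | 100 | 88 | 73 | 92.1%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 59 | 91.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 55 | 91.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 54 | 90.8%
Ministral 8B | 100 | 100 | 100 | 99 | 50 | 89.7%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 83 | 62 | 89.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 25 | 85.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 25 | 85.0%
Claude Opus 4 | 100 | 100 | 100 | 73 | 39 | 82.4%
GPT-5.1 | 100 | 100 | 86 | 70 | 55 | 82.2%
GPT-5 Mini | 100 | 100 | 100 | 93 | 14 | 81.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4 | 100 | 100 | 100 | 96 | 0 | 79.2%
Mistral Large 2 | 100 | 100 | 100 | 94 | 0 | 78.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 85 | 7 | 78.5%
Mistral Large | 100 | 100 | 94 | 79 | 20 | 78.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 50 | 39 | 77.8%
Mistral Medium 3.1 | 100 | 100 | 81 | 62 | 45 | 77.5%
Mistral Small 3.2 24B | 100 | 100 | 100 | 83 | 0 | 76.7%
Gemma 3 27B | 100 | 88 | 85 | 63 | 45 | 76.2%
DeepSeek V3.2 | 100 | 100 | 100 | 77 | 0 | 75.4%
Claude Opus 4.5 | 100 | 100 | 81 | 67 | 0 | 69.5%
o4 Mini | 100 | 100 | 100 | 45 | 0 | 68.9%
Claude 3.7 Sonnet | 100 | 88 | 73 | 35 | 33 | 65.7%
Mistral Large 3 | 100 | 100 | 100 | 25 | 0 | 65.0%
Llama 3.1 70B | 100 | 100 | 100 | 25 | 0 | 65.0%
WizardLM 2 8x22b | 100 | 100 | 72 | 39 | 0 | 62.2%
Mistral Small Creative | 100 | 96 | 55 | 53 | 0 | 60.8%
o4 Mini High | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 0 | 0 | 60.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 0 | 0 | 60.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 0 | 0 | 60.0%
Z.AI GLM 4.7 | 100 | 99 | 75 | 7 | 0 | 56.2%
Z.AI GLM 4.5 | 100 | 88 | 73 | 20 | 0 | 56.0%
Ministral 3B | 100 | 100 | 73 | 0 | 0 | 54.6%
Ministral 3 14B | 100 | 83 | 76 | 14 | 0 | 54.6%
Hermes 3 405B | 100 | 100 | 50 | 17 | 0 | 53.3%
Mistral NeMO | 100 | 100 | 25 | 21 | 0 | 49.2%
DeepSeek-V2 Chat | 100 | 63 | 55 | 25 | 0 | 48.6%
Gemini 3.1 Pro (Preview) | 67 | 67 | 55 | 50 | 3 | 48.2%
Grok 4 Fast | 100 | 100 | 17 | 17 | 0 | 46.7%
Grok 4.1 Fast | 100 | 100 | 25 | 7 | 0 | 46.4%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 25 | 7 | 0 | 46.4%
Arcee AI: Trinity Large (Preview) | 100 | 93 | 32 | 0 | 0 | 45.0%
DeepSeek V3 (2024-12-26) | 100 | 45 | 39 | 32 | 0 | 43.2%
Grok 4 | 94 | 75 | 25 | 7 | 0 | 40.3%
Qwen 2.5 72B | 100 | 100 | 0 | 0 | 0 | 40.0%
GPT-4.1 Nano | 100 | 100 | 0 | 0 | 0 | 40.0%
Rocinante 12B | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemma 3 4B | 100 | 79 | 12 | 0 | 0 | 38.1%
GPT-4.1 | 100 | 55 | 29 | 0 | 0 | 36.7%
Llama 3.1 Nemotron 70B | 100 | 59 | 17 | 0 | 0 | 35.2%
Gemini 3 Flash (Preview) | 100 | 56 | 7 | 0 | 0 | 32.7%
Cohere Command R+ (Aug. 2024) | 100 | 35 | 14 | 7 | 7 | 32.5%
Gemma 3 12B | 76 | 67 | 14 | 0 | 0 | 31.2%
GPT-4.1 Mini | 100 | 39 | 0 | 0 | 0 | 27.8%
Hermes 3 70B | 100 | 32 | 0 | 0 | 0 | 26.5%
Llama 3.1 8B | 100 | 0 | 0 | 0 | 0 | 20.0%
Qwen 3.5 Plus (2026-02-15) | 91 | 0 | 0 | 0 | 0 | 18.2%
GPT-4o, May 13th (temp=0) | 89 | 0 | 0 | 0 | 0 | 17.8%
Gemini 3 Pro (Preview) | 32 | 31 | 0 | 0 | 0 | 12.7%
Claude 3 Haiku | 25 | 25 | 0 | 0 | 0 | 10.0%
GPT-5 Nano | 11 | 0 | 0 | 0 | 0 | 2.2%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%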

Genre

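Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 94 | 91 | 83 | 75 | 88.8%
Mistral Large 3 | 100 | 91 | 89 | 80 | 62 | 84.4%
ByteDance Seed 1.6 Flash | 100 | 100 | 94 | 81 | 37 | 82.5%
Ministral 8B | 100 | 100 | 100 | 73 | 28 | 80.2%
Z.AI GLM 5 | 100 | 100 | 100 | 99 | 2 | 80.1%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 0 | 80.0%
ByteDance Seed 1.6 | 100 | 100 | 94 | 88 | 0 | 76.4%
Mistral Large 2 | 81 | 71 | 67 | 67 | 64 | 69.9%
Ministral 3 3B | 100 | 100 | 94 | 50 | 0 | 68.9%
Mistral Large | 96 | 63 | 59 | 59 | 56 | 66.7%
Claude Haiku 4.5 | 100 | 70 | 59 | 47 | 39 | 62.9%
Claude Opus 4.5 | 100 | 67 | 50 | 42 | 35 | 58.6%
GPT-4o, May 13th (temp=0) | 100 | 73 | 47 | 32 | 2 | 50.8%
GPT-4.1 | 100 | 100 | 50 | 0 | 0 | 50.0%
GPT-5 Mini | 91 | 56 | 50 | 39 | 0 | 47.3%
Mistral Medium 3.1 | 100 | 89 | 42 | 0 | 0 | 46.3%
Ministral 3B | 100 | 79 | 50 | 0 | 0 | 45.7%
Qwen 2.5 72B | 100 | 97 | 20 | 0 | 0 | 43.4%
Mistral Small Creative | 100 | 50 | 39 | 25 | 0 | 42.8%
GPT-5.2 | 58 | 39 | 39 | 35 | 32 | 40.5%
Ministral 3 14B | 69 | 67 | 65 | 0 | 0 | 40.2%
MoonshotAI: Kimi K2.5 | 83 | 65 | 47 | 0 | 0 | 39.0%
Claude Opus 4.6 | 82 | 36 | 36 | 25 | 7 | 37.2%
WizardLM 2 8x22b | 100 | 73 | 7 | 0 | 0 | 36.0%
Claude Sonnet 4.5 | 88 | 53 | 25 | 12 | 0 | 35.6%
Claude 3 Haiku | 75 | 52 | 47 | 0 | 0 | 35.0%
Mistral NeMO | 88 | 79 | 7 | 0 | 0 | 34.6%
GPT-5 | 63 | 46 | 41 | 18 | 0 | 33.6%
GPT-4o, Aug. 6th (temp=0) | 73 | 45 | 25 | 14 | 0 | 31.3%
GPT-5 Nano | 90 | 59 | 0 | 0 | 0 | 29.8%
Claude 3.7 Sonnet | 57 | 56 | 33 | 2 | 0 | 29.8%
Gemini 2.5 Pro | 100 | 25 | 0 | 0 | 0 | 25.0%
Ministral 3 8B | 100 | 18 | 2 | 0 | 0 | 24.1%
GPT-4.1 Mini | 70 | 50 | 0 | 0 | 0 | 24.0%
Hermes 3 405B | 59 | 55 | 3 | 0 | 0 | 23.4%
Claude Opus 4 | 92 | 19 | 0 | 0 | 0 | 22.2%
Arcee AI: Trinity Large (Preview) | 63 | 35 | 11 | 0 | 0 | 21.8%
Hermes 3 70B | 76 | 32 | 0 | 0 | 0 | 21.7%
Stealth: Aurora Alpha | 56 | 32 | 17 | 0 | 0 | 21.1%
GPT-4o, May 13th (temp=1) | 43 | 43 | 17 | 0 | 0 | 20.5%
Gemini 3.1 Pro (Preview) | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral Small 3.2 24B | 100 | 0 | 0 | 0 | 0 | 20.0%
Gemini 2.5 Flash Lite | 53 | 47 | 0 | 0 | 0 | 20.0%
Claude Sonnet 4 | 67 | 25 | 0 | 0 | 0 | 18.3%
DeepSeek V3.2 | 88 | 0 | 0 | 0 | 0 | 17.5%
Qwen 3.5 Plus (2026-02-15) | 64 | 10 | 3 | 0 | 0 | 15.5%
GPT-5.1 | 30 | 30 | 14 | 2 | 0 | 14.9%
Writer: Palmyra X5 | 62 | 4 | 3 | 0 | 0 | 13.7%
DeepSeek V3 (2024-12-26) | 41 | 22 | 0 | 0 | 0 | 12.6%
Claude 3.5 Sonnet | 59 | 0 | 0 | 0 | 0 | 11.8%
Llama 3.1 8B | 55 | 0 | 0 | 0 | 0 | 11.0%
Grok 4.1 Fast | 35 | 20 | 0 | 0 | 0 | 10.8%
DeepSeek V3 (2025-03-24) | 45 | 0 | 0 | 0 | 0 | 8.9%
Z.AI GLM 4.7 Flash | 35 | 7 | 0 | 0 | 0 | 8.4%
DeepSeek V3.1 | 39 | 0 | 0 | 0 | 0 | 7.8%
Llama 3.1 Nemotron 70B | 39 | 0 | 0 | 0 | 0 | 7.8%
GPT-4o Mini (temp=0) | 22 | 15 | 0 | 0 | 0 | 7.3%
Gemma 3 27B | 25 | 2 | 0 | 0 | 0 | 5.4%
Z.AI GLM 4.5 | 18 | 7 | 0 | 0 | 0 | 5.1%
o4 Mini High | 25 | 0 | 0 | 0 | 0 | 5.0%
Z.AI GLM 4.6 | 25 | 0 | 0 | 0 | 0 | 5.0%
Gemma 3 4B | 20 | 0 | 0 | 0 | 0 | 3.9%
o4 Mini | 7 | 0 | 0 | 0 | 0 | 1.4%
Rocinante 12B | 7 | 0 | 0 | 0 | 0 | 1.4%
Grok 4 Fast | 7 | 0 | 0 | 0 | 0 | 1.4%
Gemini 2.5 Flash | 2 | 0 | 0 | 0 | 0 | 0.4%
Z.AI GLM 4.7 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek-V2 Chat | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Haiku | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0.0%
Llama 3.1 70B | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0.0%
Cohere Command R+ (Aug. 2024) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%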
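
Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 88 | 67 | 90.8%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5 | 100 | 100 | 100 | 67 | 32 | 79.8%
Claude Sonnet 4 | 100 | 100 | 100 | 45 | 39 | 76.7%
Mistral Small Creative | 100 | 100 | 100 | 76 | 0 | 75.2%
Mistral NeMO | 100 | 100 | 100 | 39 | 0 | 67.8%
Z.AI GLM 5 | 100 | 100 | 94 | 25 | 7 | 65.3%
Claude Sonnet 4.5 | 100 | 100 | 100 | 25 | 0 | 65.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 7 | 0 | 61.4%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude Opus 4 | 100 | 100 | 100 | 0 | 0 | 60.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 0 | 0 | 60.0%
Grok 4 Fast | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Large 2 | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 70B | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude 3 Haiku | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-4.1 Nano | 100 | 100 | 100 | 0 | 0 | 60.0%
Rocinante 12B | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3 14B | 100 | 100 | 50 | 7 | 0 | 51.4%
WizardLM 2 8x22b | 100 | 100 | 25 | 25 | 0 | 50.0%
DeepSeek V3.2 | 100 | 100 | 39 | 0 | 0 | 47.8%
Hermes 3 70B | 100 | 100 | 39 | 0 | 0 | 47.8%
Mistral Small 3.2 24B | 100 | 100 | 25 | 0 | 0 | 45.0%
o4 Mini High | 100 | 100 | 7 | 0 | 0 | 41.4%
DeepSeek V3 (2025-03-24) | 100 | 100 | 7 | 0 | 0 | 41.4%
Claude Haiku 4.5 | 100 | 100 | 7 | 0 | 0 | 41.4%
Qwen 2.5 72B | 100 | 88 | 14 | 0 | 0 | 40.2%
Grok 4.1 Fast | 100 | 100 | 0 | 0 | 0 | 40.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 0 | 0 | 0 | 40.0%
GPT-4.1 Mini | 100 | 100 | 0 | 0 | 0 | 40.0%
DeepSeek V3.1 | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemma 3 4B | 100 | 100 | 0 | 0 | 0 | 40.0%
ByteDance Seed 1.6 Flash | 100 | 50 | 17 | 7 | 0 | 34.8%
Mistral Medium 3.1 | 100 | 45 | 0 | 0 | 0 | 28.9%
Llama 3.1 8B | 100 | 39 | 0 | 0 | 0 | 27.8%
Gemma 3 27B | 100 | 25 | 0 | 0 | 0 | 25.0%
GPT-5.2 | 76 | 25 | 7 | 0 | 0 | 21.6%
Grok 4 | 100 | 0 | 0 | 0 | 0 | 20.0%
Z.AI GLM 4.6 | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-5 Nano | 100 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V3 (2024-12-26) | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral Large 3 | 100 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek-V2 Chat | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=1) | 100 | 0 | 0 | 0 | 0 | 20.0%
Hermes 3 405B | 100 | 0 | 0 | 0 | 0 | 20.0%
Ministral 3 8B | 100 | 0 | 0 | 0 | 0 | 20.0%
Cohere Command R+ (Aug. 2024) | 100 | 0 | 0 | 0 | 0 | 20.0%
Claude 3.7 Sonnet | 72 | 0 | 0 | 0 | 0 | 14.3%
Arcee AI: Trinity Large (Preview) | 45 | 25 | 0 | 0 | 0 | 13.9%
Writer: Palmyra X5 | 59 | 7 | 0 | 0 | 0 | 13.2%
GPT-5.1 | 50 | 0 | 0 | 0 | 0 | 10.0%
Z.AI GLM 4.7 Flash | 39 | 0 | 0 | 0 | 0 | 7.8%
GPT-4o, May 13th (temp=0) | 35 | 0 | 0 | 0 | 0 | 6.9%
Qwen 3.5 Plus (2026-02-15) | 25 | 0 | 0 | 0 | 0 | 5.0%
Gemini 3 Pro (Preview) | 7 | 0 | 0 | 0 | 0 | 1.4%
GPT-4.1 | 7 | 0 | 0 | 0 | 0 | 1.4%
Stealth: Aurora Alpha | 7 | 0 | 0 | 0 | 0 | 1.4%
o4 Mini | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 4.7 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 4.5 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0.0%
Llama 3.1 Nemotron 70B | 0 | 0 | 0 | 0 | 0 | 0.0%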
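
Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-5.1 | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 96 | 99.3%
Gemma 3 27B | 100 | 100 | 100 | 100 | 91 | 98.2%
GPT-5.2 | 100 | 100 | 100 | 96 | 89 | 97.1%
Grok 4 | 100 | 100 | 100 | 94 | 91 | 97.1%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 85 | 97.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 83 | 96.7%
Grok 4.1 Fast | 100 | 100 | 97 | 94 | 91 | 96.6%
Mistral Large | 100 | 100 | 100 | 100 | 82 | 96.4%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 81 | 96.2%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 80 | 96.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 80 | 96.0%
GPT-5 Mini | 100 | 100 | 100 | 96 | 83 | 96.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 79 | 95.7%
Ministral 3 3B | 100 | 100 | 100 | 100 | 76 | 95.2%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 75 | 95.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 91 | 82 | 94.6%
Mistral NeMO | 100 | 100 | 100 | 100 | 71 | 94.2%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 70 | 94.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 70 | 94.0%
Minimax M2.5 | 100 | 100 | 100 | 96 | 73 | 93.8%
o4 Mini | 100 | 100 | 100 | 91 | 76 | 93.4%
Mistral Large 3 | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 8B | 100 | 100 | 100 | 100 | 65 | 93.0%
Gemini 2.5 Flash | 100 | 100 | 97 | 96 | 67 | 92.1%
Z.AI GLM 4.5 | 100 | 100 | 100 | 89 | 67 | 91.1%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 97 | 56 | 90.6%
Z.AI GLM 4.7 | 100 | 100 | 100 | 91 | 61 | 90.4%
o4 Mini High | 100 | 100 | 100 | 98 | 47 | 89.0%
GPT-4.1 Mini | 100 | 93 | 91 | 81 | 73 | 87.6%
Grok 4 Fast | 100 | 94 | 89 | 85 | 47 | 83.4%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 85 | 28 | 82.7%
Qwen 2.5 72B | 100 | 100 | 88 | 85 | 36 | 81.9%
Claude 3.7 Sonnet | 100 | 85 | 80 | 76 | 69 | 81.8%
DeepSeek V3.2 | 100 | 100 | 90 | 64 | 50 | 80.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 54 | 41 | 79.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 83 | 7 | 78.1%
Gemini 2.5 Flash Lite | 86 | 85 | 80 | 64 | 61 | 75.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 50 | 25 | 75.0%
Gemini 3 Flash (Preview) | 97 | 82 | 80 | 59 | 56 | 75.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 39 | 25 | 72.8%
Hermes 3 70B | 100 | 100 | 100 | 56 | 7 | 72.7%
GPT-5 Nano | 100 | 96 | 64 | 61 | 37 | 71.6%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 67 | 50 | 34 | 70.1%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 25 | 17 | 68.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 35 | 0 | 66.9%
Z.AI GLM 4.6 | 100 | 100 | 82 | 29 | 20 | 66.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 27 | 0 | 65.4%
Z.AI GLM 4.7 Flash | 100 | 89 | 47 | 41 | 37 | 63.0%
Rocinante 12B | 100 | 100 | 100 | 14 | 0 | 62.7%
WizardLM 2 8x22b | 100 | 100 | 50 | 39 | 12 | 60.2%
Llama 3.1 8B | 100 | 100 | 59 | 39 | 0 | 59.6%
Gemini 2.5 Pro | 100 | 86 | 65 | 25 | 18 | 58.8%
DeepSeek V3.1 | 100 | 68 | 55 | 42 | 25 | 58.0%
Gemma 3 12B | 82 | 75 | 71 | 32 | 25 | 57.0%
GPT-4o, May 13th (temp=1) | 79 | 75 | 62 | 62 | 0 | 55.4%
GPT-4o Mini (temp=0) | 99 | 91 | 54 | 22 | 0 | 53.4%
Llama 3.1 Nemotron 70B | 100 | 88 | 79 | 0 | 0 | 53.2%
Gemma 3 4B | 100 | 70 | 62 | 22 | 0 | 50.7%
Gemini 3 Pro (Preview) | 96 | 64 | 43 | 25 | 20 | 49.5%
Hermes 3 405B | 100 | 100 | 25 | 7 | 0 | 46.4%
Llama 3.1 70B | 100 | 100 | 25 | 0 | 0 | 45.0%
GPT-4o Mini (temp=1) | 97 | 56 | 15 | 0 | 0 | 33.6%
Qwen 3.5 Plus (2026-02-15) | 77 | 31 | 29 | 25 | 0 | 32.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 30 | 12 | 0 | 0 | 28.4%
Claude 3 Haiku | 100 | 0 | 0 | 0 | 0 | 20.0%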
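
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 97 | 99.5%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 97 | 99.5%
GPT-5.2 | 100 | 100 | 100 | 100 | 94 | 98.7%
Ministral 3 3B | 100 | 100 | 100 | 100 | 93 | 98.6%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 88 | 97.5%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 88 | 97.5%
GPT-5.1 | 100 | 100 | 100 | 96 | 89 | 96.9%
Mistral Large | 100 | 100 | 100 | 94 | 88 | 96.4%
Claude Opus 4 | 100 | 100 | 100 | 100 | 81 | 96.2%
Writer: Palmyra X5 | 100 | 100 | 100 | 93 | 86 | 95.8%
Claude Sonnet 4.6 | 100 | 100 | 100 | 88 | 85 | 94.6%
Minimax M2.5 | 100 | 100 | 100 | 97 | 75 | 94.4%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 8B | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Medium 3.1 | 100 | 100 | 100 | 93 | 70 | 92.6%
GPT-5 Mini | 100 | 100 | 100 | 100 | 62 | 92.5%
ByteDance Seed 1.6 Flash | 100 | 100 | 89 | 87 | 79 | 90.9%
Mistral Large 3 | 100 | 100 | 100 | 94 | 55 | 89.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 94 | 83 | 62 | 87.9%
GPT-4o, May 13th (temp=0) | 100 | 100 | 99 | 89 | 50 | 87.6%
GPT-4.1 | 100 | 100 | 100 | 81 | 50 | 86.2%
o4 Mini High | 100 | 100 | 99 | 62 | 59 | 83.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 59 | 53 | 82.5%
Gemini 2.5 Pro | 100 | 97 | 79 | 69 | 59 | 80.8%
Hermes 3 405B | 100 | 100 | 100 | 85 | 11 | 79.3%
Llama 3.1 Nemotron 70B | 100 | 100 | 88 | 83 | 17 | 77.5%
Ministral 3 14B | 100 | 100 | 83 | 63 | 39 | 77.1%
Mistral Small Creative | 100 | 100 | 100 | 50 | 34 | 76.7%
Llama 3.1 70B | 100 | 100 | 100 | 79 | 0 | 75.7%
o4 Mini | 100 | 96 | 80 | 53 | 47 | 75.5%
Claude Haiku 4.5 | 100 | 100 | 76 | 73 | 25 | 74.8%
Ministral 8B | 100 | 91 | 91 | 76 | 14 | 74.4%
GPT-4.1 Mini | 100 | 100 | 83 | 79 | 7 | 73.8%
Claude 3.5 Sonnet | 100 | 100 | 100 | 32 | 30 | 72.5%
Grok 4 | 100 | 100 | 70 | 69 | 21 | 72.0%
Stealth: Aurora Alpha | 100 | 100 | 90 | 59 | 0 | 69.8%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 44 | 0 | 68.7%
Z.AI GLM 4.5 | 100 | 75 | 73 | 53 | 42 | 68.6%
DeepSeek-V2 Chat | 100 | 100 | 73 | 67 | 0 | 67.9%
GPT-4o Mini (temp=0) | 100 | 88 | 75 | 71 | 0 | 66.7%
Gemma 3 27B | 100 | 93 | 65 | 64 | 11 | 66.6%
Claude 3.7 Sonnet | 100 | 82 | 64 | 61 | 25 | 66.3%
WizardLM 2 8x22b | 100 | 100 | 91 | 39 | 0 | 66.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 56 | 45 | 17 | 63.5%
Grok 4.1 Fast | 91 | 64 | 56 | 45 | 44 | 60.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 0 | 0 | 60.0%
Arcee AI: Trinity Large (Preview) | 97 | 96 | 90 | 7 | 0 | 58.1%
Claude 3 Haiku | 79 | 73 | 50 | 39 | 35 | 55.0%
Llama 3.1 8B | 100 | 79 | 50 | 39 | 7 | 54.9%
Rocinante 12B | 100 | 100 | 73 | 0 | 0 | 54.6%
Mistral NeMO | 100 | 100 | 43 | 25 | 0 | 53.6%
DeepSeek V3.2 | 94 | 89 | 67 | 2 | 0 | 50.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 50 | 0 | 0 | 50.0%
Grok 4 Fast | 100 | 83 | 52 | 0 | 0 | 47.2%
GPT-4o, Aug. 6th (temp=0) | 73 | 71 | 67 | 20 | 0 | 46.1%
Qwen 2.5 72B | 100 | 100 | 28 | 0 | 0 | 45.6%
GPT-5 Nano | 81 | 61 | 37 | 22 | 10 | 42.1%
Z.AI GLM 4.7 Flash | 65 | 63 | 61 | 10 | 0 | 39.9%
Z.AI GLM 4.6 | 89 | 35 | 32 | 30 | 12 | 39.7%
GPT-4.1 Nano | 100 | 50 | 39 | 0 | 0 | 37.8%
GPT-4o Mini (temp=1) | 91 | 50 | 31 | 0 | 0 | 34.4%
Z.AI GLM 4.7 | 51 | 50 | 32 | 15 | 7 | 31.1%
DeepSeek V3.1 | 100 | 35 | 20 | 0 | 0 | 30.8%
GPT-4o, May 13th (temp=1) | 100 | 53 | 0 | 0 | 0 | 30.6%
Gemma 3 12B | 83 | 42 | 21 | 0 | 0 | 29.2%
Gemini 2.5 Flash | 62 | 55 | 17 | 7 | 0 | 28.2%
Gemini 2.5 Flash Lite | 71 | 42 | 28 | 0 | 0 | 28.1%
Hermes 3 70B | 100 | 17 | 0 | 0 | 0 | 23.3%
Claude 3.5 Haiku | 100 | 7 | 7 | 0 | 0 | 22.9%
GPT-4o, Aug. 6th (temp=1) | 50 | 35 | 17 | 2 | 0 | 20.6%
Gemma 3 4B | 89 | 11 | 0 | 0 | 0 | 20.1%
Arcee AI: Trinity Mini | 89 | 7 | 0 | 0 | 0 | 19.3%
Gemini 3 Flash (Preview) | 47 | 39 | 0 | 0 | 0 | 17.1%
Gemini 3 Pro (Preview) | 23 | 10 | 7 | 0 | 0 | 8.1%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0.0%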
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Qwen 3.5 397B A17B100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
Mistral Large100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Claude Sonnet 4.5100100100999899.5%
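Claude Opus 4 | 100 | 100 | 100 | 100 | 97 | 99.4%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 97 | 97 | 98.9%
GPT-5 Mini | 100 | 100 | 100 | 100 | 93 | 98.6%
Gemma 3 27B | 100 | 100 | 100 | 95 | 93 | 97.8%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 98 | 82 | 96.1%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 79 | 95.7%
Stealth: Aurora Alpha | 100 | 100 | 100 | 97 | 81 | 95.5%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 73 | 94.6%
Ministral 3B | 100 | 100 | 100 | 100 | 70 | 94.0%
Minimax M2.5 | 100 | 100 | 100 | 83 | 76 | 91.9%
Claude 3.7 Sonnet | 100 | 100 | 100 | 97 | 59 | 91.3%
GPT-5 Nano | 100 | 100 | 97 | 88 | 67 | 90.3%
GPT-5 | 100 | 100 | 97 | 91 | 60 | 89.7%
Ministral 3 8B | 100 | 100 | 100 | 78 | 70 | 89.5%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 46 | 89.2%
Claude Haiku 4.5 | 99 | 91 | 89 | 85 | 65 | 85.9%
Llama 3.1 70B | 100 | 94 | 91 | 83 | 59 | 85.6%
Claude Sonnet 4 | 100 | 100 | 97 | 79 | 52 | 85.5%
Qwen 2.5 72B | 100 | 100 | 100 | 81 | 39 | 84.0%
Mistral Large 2 | 100 | 100 | 100 | 94 | 25 | 83.9%
GPT-4.1 Nano | 100 | 100 | 100 | 63 | 50 | 82.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 67 | 39 | 81.1%
Ministral 8B | 100 | 96 | 91 | 84 | 33 | 81.0%
Mistral Large 3 | 100 | 100 | 83 | 73 | 47 | 80.8%
Mistral Medium 3.1 | 100 | 100 | 97 | 90 | 14 | 80.4%
Writer: Palmyra X5 | 100 | 99 | 82 | 80 | 39 | 80.1%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 91 | 0 | 78.2%
GPT-4.1 | 100 | 100 | 100 | 77 | 7 | 76.8%
o4 Mini High | 100 | 100 | 94 | 70 | 15 | 75.9%
Claude 3.5 Sonnet | 100 | 100 | 97 | 46 | 36 | 75.8%
GPT-5.1 | 96 | 91 | 85 | 82 | 16 | 73.8%
Ministral 3 14B | 100 | 100 | 100 | 67 | 0 | 73.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 50 | 11 | 72.2%
Gemini 2.5 Flash Lite | 100 | 100 | 96 | 56 | 0 | 70.4%
DeepSeek V3.2 | 100 | 79 | 73 | 48 | 41 | 68.2%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 39 | 0 | 67.8%
DeepSeek V3.1 | 97 | 93 | 75 | 54 | 17 | 67.1%
GPT-4o, Aug. 6th (temp=0) | 100 | 96 | 79 | 47 | 14 | 67.1%
Mistral Small 3.2 24B | 100 | 100 | 100 | 20 | 12 | 66.5%
MoonshotAI: Kimi K2.5 | 100 | 94 | 72 | 25 | 22 | 62.6%
Gemini 2.5 Pro | 99 | 82 | 61 | 50 | 21 | 62.5%
Llama 3.1 Nemotron 70B | 100 | 79 | 67 | 39 | 17 | 60.2%
GPT-5.2 | 79 | 75 | 64 | 59 | 21 | 59.8%
GPT-4o Mini (temp=0) | 100 | 81 | 67 | 27 | 12 | 57.4%
DeepSeek V3 (2024-12-26) | 100 | 100 | 47 | 39 | 0 | 57.3%
Gemini 2.5 Flash | 100 | 97 | 73 | 12 | 1 | 56.7%
o4 Mini | 88 | 85 | 56 | 52 | 0 | 56.1%
Mistral NeMO | 100 | 89 | 39 | 32 | 17 | 55.5%
Z.AI GLM 4.6 | 93 | 64 | 56 | 28 | 0 | 48.3%
GPT-4o Mini (temp=1) | 99 | 86 | 52 | 0 | 0 | 47.5%
Hermes 3 405B | 99 | 71 | 52 | 14 | 0 | 47.2%
Rocinante 12B | 100 | 79 | 32 | 0 | 0 | 42.2%
GPT-4.1 Mini | 81 | 59 | 47 | 14 | 7 | 41.5%
Gemini 3.1 Pro (Preview) | 100 | 100 | 0 | 0 | 0 | 40.0%
Z.AI GLM 4.5 | 100 | 67 | 14 | 11 | 0 | 38.4%
Grok 4.1 Fast | 81 | 39 | 39 | 21 | 0 | 36.0%
Llama 3.1 8B | 100 | 73 | 0 | 0 | 0 | 34.6%
Gemma 3 12B | 91 | 59 | 17 | 0 | 0 | 33.4%
Z.AI GLM 4.7 | 83 | 37 | 27 | 18 | 0 | 33.0%
DeepSeek-V2 Chat | 59 | 55 | 48 | 0 | 0 | 32.4%
Grok 4 | 100 | 53 | 4 | 0 | 0 | 31.4%
WizardLM 2 8x22b | 67 | 62 | 25 | 0 | 0 | 30.7%
Qwen 3.5 Plus (2026-02-15) | 82 | 55 | 15 | 0 | 0 | 30.3%
GPT-4o, May 13th (temp=1) | 77 | 28 | 11 | 7 | 0 | 24.6%
Z.AI GLM 4.7 Flash | 57 | 30 | 25 | 10 | 0 | 24.5%
Gemini 3 Pro (Preview) | 50 | 32 | 25 | 0 | 0 | 21.5%
GPT-4o, Aug. 6th (temp=1) | 67 | 32 | 0 | 0 | 0 | 19.8%
Gemini 3 Flash (Preview) | 73 | 0 | 0 | 0 | 0 | 14.6%
Hermes 3 70B | 70 | 0 | 0 | 0 | 0 | 14.0%
Claude 3 Haiku | 35 | 22 | 0 | 0 | 0 | 11.3%
Gemma 3 4B | 39 | 3 | 0 | 0 | 0 | 8.4%
Grok 4 Fast | 25 | 7 | 0 | 0 | 0 | 6.4%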
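The Avg column appears to be the arithmetic mean of the five numbered scores (#1 to #5). The tables do not state this, so treat it as an inference from the rows. The displayed scores seem to be rounded to whole numbers, so a recomputed mean can differ from the listed Avg by a few tenths of a point. For example, 100, 100, 100, 97, 97 averages to 98.8, while the table lists 98.9%. A minimal Python sketch of that consistency check, assuming this interpretation of the columns:

```python
from statistics import mean


def average_score(scores: list[int]) -> float:
    """Mean of the five #1..#5 scores, rounded to one decimal like the Avg column.

    Assumption: Avg is a plain arithmetic mean of the five scores.
    """
    if len(scores) != 5:
        raise ValueError("expected exactly five scores (#1 to #5)")
    return round(mean(scores), 1)


def within_rounding(scores: list[int], listed_avg: float, tolerance: float = 0.55) -> bool:
    """True if the unrounded mean is within `tolerance` points of the listed Avg.

    Assumption: each displayed score is its true value rounded to an integer
    (error <= 0.5, so the mean's error is also <= 0.5), and the listed Avg adds
    up to 0.05 of its own rounding, hence the default tolerance of 0.55.
    """
    return abs(mean(scores) - listed_avg) <= tolerance


# Three rows copied from the table above: (scores, listed Avg).
rows = {
    "Claude Opus 4": ([100, 100, 100, 100, 97], 99.4),
    "DeepSeek V3 (2025-03-24)": ([100, 100, 100, 97, 97], 98.9),
    "GPT-5.1": ([96, 91, 85, 82, 16], 73.8),
}

for model, (scores, listed) in rows.items():
    print(f"{model}: recomputed {average_score(scores)} vs listed {listed}")
```

With the rows above, all three recomputed means fall within the rounding tolerance of the listed values.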
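Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 79 | 95.7%
GPT-5 | 100 | 100 | 100 | 100 | 70 | 94.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 3B | 100 | 100 | 100 | 100 | 55 | 91.0%
Ministral 3 8B | 100 | 100 | 100 | 79 | 67 | 89.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 45 | 88.9%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 83 | 73 | 50 | 81.3%
Mistral Large 2 | 100 | 100 | 100 | 97 | 7 | 80.9%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 14B | 100 | 100 | 100 | 94 | 0 | 78.9%
Claude Opus 4 | 100 | 100 | 100 | 79 | 14 | 78.4%
Gemini 2.5 Pro | 100 | 100 | 100 | 88 | 0 | 77.5%
ByteDance Seed 1.6 | 100 | 100 | 100 | 67 | 17 | 76.7%
Mistral Medium 3.1 | 100 | 100 | 86 | 67 | 29 | 76.3%
Mistral Small 3.2 24B | 100 | 100 | 100 | 79 | 0 | 75.7%
Mistral Large 3 | 100 | 100 | 100 | 39 | 39 | 75.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 73 | 0 | 74.6%
Writer: Palmyra X5 | 100 | 100 | 70 | 69 | 30 | 73.8%
Gemma 3 27B | 100 | 100 | 100 | 67 | 0 | 73.3%
DeepSeek V3.1 | 100 | 100 | 76 | 59 | 0 | 67.0%
Ministral 3B | 100 | 100 | 100 | 25 | 7 | 66.4%
Mistral NeMO | 100 | 100 | 92 | 25 | 0 | 63.4%
Mistral Small Creative | 100 | 100 | 100 | 12 | 0 | 62.4%
Claude Opus 4.5 | 100 | 77 | 73 | 53 | 0 | 60.6%
Llama 3.1 70B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 0 | 0 | 60.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 0 | 0 | 60.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 99 | 0 | 0 | 59.7%
Grok 4 | 100 | 89 | 70 | 39 | 0 | 59.7%
Stealth: Aurora Alpha | 100 | 88 | 63 | 42 | 0 | 58.5%
Grok 4.1 Fast | 100 | 73 | 57 | 55 | 0 | 57.0%
Z.AI GLM 4.5 | 100 | 100 | 79 | 0 | 0 | 55.7%
Mistral Large | 100 | 100 | 79 | 0 | 0 | 55.7%
DeepSeek V3.2 | 100 | 100 | 67 | 7 | 0 | 54.8%
GPT-5 Nano | 100 | 70 | 67 | 25 | 0 | 52.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 59 | 50 | 50 | 0 | 51.8%
Hermes 3 405B | 100 | 100 | 59 | 0 | 0 | 51.8%
Qwen 2.5 72B | 100 | 100 | 32 | 25 | 0 | 51.5%
GPT-5.2 | 100 | 88 | 45 | 25 | 0 | 51.4%
GPT-4o, May 13th (temp=0) | 100 | 91 | 59 | 0 | 0 | 50.1%
Rocinante 12B | 100 | 100 | 39 | 0 | 0 | 47.8%
Claude 3.7 Sonnet | 89 | 64 | 46 | 35 | 0 | 46.8%
DeepSeek-V2 Chat | 100 | 100 | 25 | 7 | 0 | 46.4%
Z.AI GLM 4.6 | 100 | 100 | 25 | 0 | 0 | 45.0%
Cohere Command R+ (Aug. 2024) | 100 | 79 | 17 | 14 | 7 | 43.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 7 | 0 | 0 | 41.4%
Llama 3.1 8B | 100 | 100 | 7 | 0 | 0 | 41.4%
GPT-4.1 Nano | 100 | 100 | 7 | 0 | 0 | 41.4%
Hermes 3 70B | 100 | 73 | 25 | 7 | 0 | 41.0%
GPT-4.1 | 100 | 88 | 17 | 0 | 0 | 40.8%
Claude Haiku 4.5 | 100 | 100 | 0 | 0 | 0 | 40.0%
Claude 3.5 Sonnet | 100 | 83 | 0 | 0 | 0 | 36.7%
Gemma 3 12B | 55 | 47 | 45 | 32 | 0 | 35.7%
DeepSeek V3 (2025-03-24) | 100 | 25 | 25 | 0 | 0 | 30.0%
Grok 4 Fast | 100 | 50 | 0 | 0 | 0 | 30.0%
Gemma 3 4B | 76 | 70 | 0 | 0 | 0 | 29.2%
Arcee AI: Trinity Mini | 100 | 39 | 0 | 0 | 0 | 27.8%
GPT-5.1 | 63 | 50 | 15 | 0 | 0 | 25.7%
Z.AI GLM 4.7 Flash | 79 | 25 | 18 | 0 | 0 | 24.4%
o4 Mini High | 100 | 7 | 0 | 0 | 0 | 21.4%
Claude 3 Haiku | 73 | 25 | 7 | 0 | 0 | 21.0%
o4 Mini | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o Mini (temp=1) | 88 | 0 | 0 | 0 | 0 | 17.5%
GPT-4.1 Mini | 67 | 7 | 7 | 0 | 0 | 16.2%
Gemini 3 Flash (Preview) | 45 | 30 | 0 | 0 | 0 | 14.9%
GPT-4o, May 13th (temp=1) | 73 | 0 | 0 | 0 | 0 | 14.6%
Qwen 3.5 Plus (2026-02-15) | 45 | 0 | 0 | 0 | 0 | 8.9%
GPT-4o Mini (temp=0) | 43 | 0 | 0 | 0 | 0 | 8.6%
Z.AI GLM 4.7 | 7 | 0 | 0 | 0 | 0 | 1.4%
Gemini 3 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%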

Novelcrafter Default Prompt

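Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 94 | 98.9%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 89 | 97.9%
Claude Opus 4 | 100 | 100 | 100 | 100 | 89 | 97.8%
Claude Opus 4.5 | 100 | 100 | 100 | 96 | 89 | 96.9%
Mistral Large 2 | 100 | 100 | 100 | 100 | 83 | 96.7%
Claude Sonnet 4.6 | 100 | 100 | 100 | 88 | 88 | 95.0%
Mistral Large | 100 | 100 | 100 | 96 | 63 | 91.8%
Z.AI GLM 4.7 | 100 | 100 | 91 | 86 | 77 | 90.9%
Claude Sonnet 4.5 | 100 | 100 | 99 | 88 | 25 | 82.2%
ByteDance Seed 1.6 Flash | 100 | 100 | 85 | 77 | 42 | 80.8%
Ministral 3B | 94 | 91 | 83 | 73 | 59 | 80.2%
Llama 3.1 70B | 100 | 83 | 79 | 71 | 50 | 76.6%
Grok 4.1 Fast | 100 | 100 | 100 | 81 | 0 | 76.2%
Ministral 3 8B | 100 | 100 | 79 | 50 | 50 | 75.7%
GPT-5.1 | 100 | 100 | 100 | 74 | 0 | 74.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 91 | 80 | 0 | 74.3%
Mistral Small Creative | 100 | 92 | 81 | 73 | 20 | 73.2%
DeepSeek V3 (2024-12-26) | 100 | 100 | 73 | 70 | 7 | 70.0%
Mistral Medium 3.1 | 100 | 100 | 67 | 63 | 18 | 69.6%
Minimax M2.5 | 100 | 100 | 73 | 45 | 30 | 69.6%
Ministral 8B | 100 | 88 | 83 | 67 | 0 | 67.5%
GPT-5 | 100 | 100 | 97 | 34 | 0 | 66.1%
Gemini 2.5 Pro | 100 | 100 | 100 | 7 | 0 | 61.4%
Mistral NeMO | 100 | 100 | 73 | 25 | 7 | 61.0%
Grok 4 Fast | 100 | 94 | 36 | 32 | 25 | 57.5%
Grok 4 | 100 | 100 | 63 | 17 | 0 | 55.9%
Ministral 3 3B | 97 | 83 | 59 | 39 | 0 | 55.7%
GPT-4.1 | 100 | 85 | 50 | 25 | 0 | 52.1%
DeepSeek V3 (2025-03-24) | 100 | 83 | 45 | 32 | 0 | 52.1%
Z.AI GLM 4.6 | 100 | 100 | 59 | 0 | 0 | 51.8%
Claude Sonnet 4 | 77 | 63 | 47 | 39 | 20 | 49.0%
Claude 3.5 Sonnet | 100 | 59 | 39 | 25 | 0 | 44.6%
GPT-4o, Aug. 6th (temp=0) | 100 | 83 | 35 | 0 | 0 | 43.6%
Ministral 3 14B | 88 | 63 | 55 | 0 | 0 | 41.1%
o4 Mini High | 83 | 59 | 39 | 20 | 0 | 40.2%
Gemini 3 Flash (Preview) | 100 | 100 | 0 | 0 | 0 | 40.0%
DeepSeek-V2 Chat | 97 | 69 | 12 | 9 | 0 | 37.5%
GPT-4.1 Nano | 94 | 88 | 0 | 0 | 0 | 36.4%
Claude Haiku 4.5 | 63 | 47 | 35 | 32 | 0 | 35.3%
GPT-4o, May 13th (temp=0) | 50 | 48 | 47 | 31 | 0 | 35.1%
WizardLM 2 8x22b | 89 | 80 | 0 | 0 | 0 | 33.9%
o4 Mini | 100 | 36 | 17 | 0 | 0 | 30.5%
GPT-5 Mini | 75 | 41 | 17 | 14 | 0 | 29.3%
Llama 3.1 Nemotron 70B | 67 | 67 | 0 | 0 | 0 | 26.7%
Writer: Palmyra X5 | 75 | 55 | 0 | 0 | 0 | 26.0%
Hermes 3 405B | 67 | 63 | 0 | 0 | 0 | 25.9%
Qwen 2.5 72B | 69 | 29 | 28 | 0 | 0 | 25.2%
DeepSeek V3.2 | 59 | 50 | 0 | 0 | 0 | 21.8%
GPT-4.1 Mini | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-5 Nano | 88 | 7 | 0 | 0 | 0 | 18.9%
DeepSeek V3.1 | 76 | 17 | 0 | 0 | 0 | 18.5%
Llama 3.1 8B | 88 | 0 | 0 | 0 | 0 | 17.5%
Rocinante 12B | 88 | 0 | 0 | 0 | 0 | 17.5%
Mistral Small 3.2 24B | 49 | 0 | 0 | 0 | 0 | 9.7%
Stealth: Aurora Alpha | 43 | 0 | 0 | 0 | 0 | 8.6%
GPT-4o, May 13th (temp=1) | 25 | 17 | 0 | 0 | 0 | 8.3%
Cohere Command R+ (Aug. 2024) | 25 | 14 | 0 | 0 | 0 | 7.7%
Z.AI GLM 4.5 | 36 | 0 | 0 | 0 | 0 | 7.3%
Arcee AI: Trinity Large (Preview) | 36 | 0 | 0 | 0 | 0 | 7.1%
GPT-4o, Aug. 6th (temp=1) | 32 | 0 | 0 | 0 | 0 | 6.5%
GPT-5.2 | 25 | 0 | 0 | 0 | 0 | 5.0%
Claude 3 Haiku | 14 | 7 | 0 | 0 | 0 | 4.2%
Gemma 3 12B | 12 | 0 | 0 | 0 | 0 | 2.4%
Z.AI GLM 4.7 Flash | 7 | 2 | 0 | 0 | 0 | 1.8%
Gemini 2.5 Flash | 7 | 0 | 0 | 0 | 0 | 1.4%
Gemma 3 4B | 2 | 0 | 0 | 0 | 0 | 0.4%
Gemini 3 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.7 Sonnet | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Haiku | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash Lite | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0.0%
Hermes 3 70B | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0.0%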
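Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 94 | 59 | 90.7%
GPT-5 | 100 | 100 | 98 | 83 | 53 | 87.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3B | 100 | 100 | 100 | 100 | 0 | 80.0%
ByteDance Seed 1.6 Flash | 100 | 94 | 94 | 79 | 20 | 77.4%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 50 | 0 | 70.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 50 | 0 | 70.0%
GPT-5 Mini | 100 | 100 | 60 | 57 | 29 | 69.3%
Mistral Large | 100 | 100 | 100 | 35 | 0 | 66.9%
Mistral Large 3 | 100 | 100 | 100 | 7 | 0 | 61.4%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 0 | 0 | 60.0%
DeepSeek V3.1 | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3 3B | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-4.1 Nano | 100 | 100 | 100 | 0 | 0 | 60.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude Sonnet 4 | 100 | 100 | 94 | 0 | 0 | 58.9%
Claude Sonnet 4.5 | 100 | 100 | 50 | 39 | 0 | 57.8%
Gemini 3 Pro (Preview) | 100 | 100 | 59 | 25 | 0 | 56.8%
Claude Opus 4 | 100 | 91 | 83 | 0 | 0 | 54.9%
Llama 3.1 8B | 100 | 100 | 73 | 0 | 0 | 54.6%
GPT-4.1 | 100 | 100 | 50 | 0 | 0 | 50.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 39 | 0 | 0 | 47.8%
Ministral 8B | 100 | 100 | 25 | 7 | 0 | 46.4%
Z.AI GLM 4.7 Flash | 100 | 100 | 25 | 0 | 0 | 45.0%
Grok 4 Fast | 100 | 100 | 7 | 0 | 0 | 41.4%
Z.AI GLM 4.5 | 100 | 50 | 25 | 25 | 7 | 41.4%
o4 Mini High | 100 | 100 | 0 | 0 | 0 | 40.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 0 | 0 | 0 | 40.0%
Llama 3.1 70B | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemini 2.5 Flash Lite | 100 | 100 | 0 | 0 | 0 | 40.0%
Ministral 3 14B | 100 | 100 | 0 | 0 | 0 | 40.0%
DeepSeek V3 (2024-12-26) | 100 | 88 | 0 | 0 | 0 | 37.5%
DeepSeek V3.2 | 100 | 73 | 0 | 0 | 0 | 34.6%
Cohere Command R+ (Aug. 2024) | 100 | 59 | 0 | 0 | 0 | 31.8%
Claude Opus 4.6 | 100 | 50 | 7 | 0 | 0 | 31.4%
Z.AI GLM 4.7 | 100 | 50 | 0 | 0 | 0 | 30.0%
Gemini 3 Flash (Preview) | 100 | 32 | 7 | 0 | 0 | 27.9%
Hermes 3 405B | 100 | 25 | 7 | 0 | 0 | 26.4%
GPT-5.1 | 94 | 31 | 0 | 0 | 0 | 25.2%
o4 Mini | 100 | 7 | 0 | 0 | 0 | 21.4%
GPT-4o, Aug. 6th (temp=0) | 100 | 7 | 0 | 0 | 0 | 21.4%
Claude 3.5 Sonnet | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral Medium 3.1 | 100 | 0 | 0 | 0 | 0 | 20.0%
Arcee AI: Trinity Large (Preview) | 100 | 0 | 0 | 0 | 0 | 20.0%
Claude 3 Haiku | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral NeMO | 100 | 0 | 0 | 0 | 0 | 20.0%
Rocinante 12B | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=1) | 83 | 0 | 0 | 0 | 0 | 16.7%
Mistral Large 2 | 73 | 7 | 0 | 0 | 0 | 16.0%
DeepSeek-V2 Chat | 67 | 0 | 0 | 0 | 0 | 13.3%
GPT-4.1 Mini | 25 | 25 | 0 | 0 | 0 | 10.0%
Claude 3.7 Sonnet | 39 | 0 | 0 | 0 | 0 | 7.8%
Stealth: Aurora Alpha | 35 | 0 | 0 | 0 | 0 | 6.9%
Grok 4 | 32 | 0 | 0 | 0 | 0 | 6.5%
GPT-5 Nano | 25 | 0 | 0 | 0 | 0 | 5.0%
Grok 4.1 Fast | 7 | 0 | 0 | 0 | 0 | 1.4%
Gemma 3 12B | 7 | 0 | 0 | 0 | 0 | 1.4%
Llama 3.1 Nemotron 70B | 7 | 0 | 0 | 0 | 0 | 1.4%
GPT-5.2 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 2.5 72B | 0 | 0 | 0 | 0 | 0 | 0.0%
Hermes 3 70B | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0.0%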
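Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 99 | 99.7%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 97 | 99.5%
GPT-4.1 | 100 | 100 | 100 | 100 | 97 | 99.5%
Gemma 3 27B | 100 | 100 | 100 | 100 | 97 | 99.5%
Ministral 3 8B | 100 | 100 | 100 | 100 | 97 | 99.5%
Minimax M2.5 | 100 | 100 | 100 | 100 | 96 | 99.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 96 | 99.2%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 91 | 98.2%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 90 | 98.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 97 | 90 | 97.5%
GPT-5 Mini | 100 | 100 | 100 | 100 | 85 | 97.1%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 83 | 96.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 83 | 96.7%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemini 3 Flash (Preview) | 100 | 100 | 96 | 91 | 89 | 95.3%
Claude Sonnet 4 | 100 | 100 | 100 | 93 | 82 | 95.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 62 | 92.4%
DeepSeek V3.1 | 100 | 100 | 100 | 85 | 77 | 92.3%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 82 | 73 | 91.1%
GPT-4.1 Nano | 100 | 100 | 97 | 97 | 53 | 89.6%
Qwen 2.5 72B | 100 | 94 | 93 | 89 | 70 | 89.3%
Claude Haiku 4.5 | 100 | 100 | 100 | 71 | 69 | 88.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 98 | 39 | 87.4%
Rocinante 12B | 100 | 100 | 100 | 100 | 35 | 86.9%
Mistral NeMO | 100 | 100 | 100 | 100 | 31 | 86.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 94 | 70 | 63 | 85.5%
GPT-5 Nano | 100 | 100 | 100 | 62 | 55 | 83.5%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 14 | 82.9%
Grok 4 | 100 | 100 | 100 | 100 | 13 | 82.5%
Gemini 2.5 Flash Lite | 100 | 100 | 96 | 67 | 46 | 81.7%
Hermes 3 405B | 100 | 100 | 100 | 59 | 47 | 81.1%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 47 | 33 | 76.0%
Arcee AI: Trinity Mini | 100 | 100 | 76 | 69 | 10 | 70.9%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 41 | 0 | 68.3%
Llama 3.1 8B | 100 | 100 | 100 | 7 | 0 | 61.4%
ByteDance Seed 1.6 | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 70B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash | 100 | 97 | 50 | 45 | 0 | 58.4%
GPT-5.2 | 100 | 69 | 57 | 34 | 28 | 57.5%
Hermes 3 70B | 100 | 81 | 76 | 17 | 0 | 54.7%
Claude 3 Haiku | 100 | 93 | 71 | 0 | 0 | 52.8%
GPT-4o, Aug. 6th (temp=0) | 85 | 59 | 57 | 20 | 18 | 47.8%
Claude 3.5 Haiku | 100 | 100 | 25 | 0 | 0 | 45.0%
GPT-4o Mini (temp=1) | 68 | 54 | 50 | 43 | 0 | 42.9%
Gemma 3 12B | 76 | 67 | 52 | 0 | 0 | 38.9%
GPT-4o, Aug. 6th (temp=1) | 97 | 50 | 36 | 0 | 0 | 36.6%
GPT-4o, May 13th (temp=1) | 100 | 21 | 20 | 20 | 14 | 34.9%
Gemma 3 4B | 32 | 31 | 30 | 12 | 0 | 21.0%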
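Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 94 | 98.9%
Mistral Large 3 | 100 | 100 | 100 | 100 | 94 | 98.9%
Minimax M2.5 | 100 | 100 | 100 | 100 | 92 | 98.5%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 91 | 98.2%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 88 | 97.5%
Writer: Palmyra X5 | 100 | 100 | 100 | 93 | 93 | 97.3%
Claude Opus 4 | 100 | 100 | 100 | 100 | 86 | 97.2%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 83 | 96.7%
o4 Mini | 100 | 100 | 100 | 100 | 79 | 95.7%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 79 | 95.7%
Mistral Large | 100 | 100 | 100 | 100 | 76 | 95.2%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 98 | 76 | 94.8%
Ministral 8B | 100 | 100 | 100 | 100 | 70 | 94.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 65 | 93.1%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 59 | 91.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 57 | 91.4%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 54 | 90.8%
Ministral 3 8B | 100 | 100 | 100 | 83 | 67 | 90.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 93 | 53 | 89.3%
Z.AI GLM 4.6 | 100 | 100 | 100 | 81 | 64 | 89.1%
Z.AI GLM 4.7 | 100 | 100 | 100 | 81 | 63 | 88.8%
GPT-5 | 100 | 100 | 100 | 100 | 44 | 88.7%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 39 | 87.8%
Claude 3 Haiku | 100 | 100 | 100 | 67 | 62 | 85.7%
DeepSeek V3.1 | 100 | 100 | 100 | 81 | 47 | 85.5%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 28 | 85.5%
Claude Haiku 4.5 | 100 | 100 | 100 | 70 | 56 | 85.3%
Ministral 3B | 100 | 100 | 100 | 59 | 47 | 81.1%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 0 | 80.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5 Mini | 100 | 89 | 75 | 63 | 62 | 77.8%
GPT-5 Nano | 100 | 97 | 94 | 83 | 9 | 76.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 61 | 18 | 75.8%
o4 Mini High | 100 | 100 | 100 | 62 | 11 | 74.6%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 88 | 83 | 0 | 74.2%
GPT-5.2 | 100 | 100 | 100 | 70 | 0 | 74.0%
Gemma 3 27B | 100 | 100 | 83 | 67 | 0 | 70.0%
Ministral 3 14B | 100 | 100 | 100 | 25 | 20 | 68.9%
GPT-4.1 Mini | 99 | 81 | 71 | 59 | 34 | 68.7%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 39 | 4 | 68.5%
Gemini 2.5 Flash | 100 | 97 | 79 | 65 | 0 | 68.2%
Llama 3.1 8B | 100 | 100 | 79 | 59 | 0 | 67.5%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 79 | 48 | 7 | 66.7%
Hermes 3 405B | 100 | 97 | 94 | 17 | 14 | 64.4%
GPT-4.1 Nano | 94 | 93 | 67 | 53 | 0 | 61.4%
Grok 4 | 91 | 83 | 83 | 28 | 17 | 60.5%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 70B | 100 | 100 | 97 | 0 | 0 | 59.5%
Mistral NeMO | 100 | 100 | 97 | 0 | 0 | 59.5%
GPT-4.1 | 91 | 70 | 69 | 39 | 28 | 59.4%
Ministral 3 3B | 100 | 73 | 63 | 59 | 0 | 59.0%
Grok 4 Fast | 99 | 79 | 59 | 56 | 0 | 58.5%
Qwen 2.5 72B | 100 | 100 | 81 | 0 | 0 | 56.2%
Gemini 3 Pro (Preview) | 100 | 61 | 38 | 37 | 25 | 52.0%
Z.AI GLM 4.7 Flash | 99 | 77 | 51 | 25 | 0 | 50.4%
Hermes 3 70B | 97 | 88 | 45 | 15 | 0 | 48.9%
Gemini 3 Flash (Preview) | 91 | 89 | 50 | 11 | 0 | 48.2%
Arcee AI: Trinity Mini | 100 | 73 | 59 | 2 | 0 | 46.8%
Llama 3.1 Nemotron 70B | 100 | 94 | 20 | 7 | 0 | 44.2%
Stealth: Aurora Alpha | 97 | 75 | 25 | 12 | 0 | 41.8%
Gemma 3 12B | 91 | 76 | 28 | 0 | 0 | 39.0%
Rocinante 12B | 100 | 39 | 30 | 7 | 0 | 35.2%
GPT-4o Mini (temp=0) | 72 | 48 | 17 | 15 | 12 | 32.7%
Qwen 3.5 Plus (2026-02-15) | 71 | 52 | 20 | 0 | 0 | 28.5%
GPT-4o Mini (temp=1) | 77 | 28 | 18 | 0 | 0 | 24.7%
Gemini 2.5 Flash Lite | 56 | 50 | 7 | 0 | 0 | 22.7%
Gemma 3 4B | 73 | 1 | 0 | 0 | 0 | 14.8%
GPT-4o, May 13th (temp=1) | 18 | 18 | 7 | 0 | 0 | 8.6%
GPT-4o, Aug. 6th (temp=1) | 7 | 0 | 0 | 0 | 0 | 1.4%
Claude 3.5 Haiku | 0 | 0 | 0 | 0 | 0 | 0.0%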
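Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 97 | 99.5%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 97 | 99.5%
Claude Opus 4 | 100 | 100 | 100 | 100 | 94 | 98.9%
Ministral 3 3B | 100 | 100 | 100 | 100 | 94 | 98.7%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 93 | 98.7%
Mistral Small Creative | 100 | 100 | 100 | 100 | 93 | 98.6%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 92 | 98.4%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 91 | 98.1%
o4 Mini | 100 | 100 | 100 | 100 | 90 | 98.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 88 | 97.5%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 88 | 97.5%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 88 | 97.5%
GPT-4o Mini (temp=0) | 100 | 100 | 97 | 96 | 94 | 97.5%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 85 | 97.0%
Minimax M2.5 | 100 | 100 | 100 | 96 | 85 | 96.4%
o4 Mini High | 100 | 100 | 100 | 95 | 83 | 95.8%
GPT-5 Mini | 100 | 100 | 100 | 100 | 79 | 95.7%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 77 | 95.4%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 73 | 94.6%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 90 | 80 | 94.1%
Gemini 3 Pro (Preview) | 100 | 100 | 97 | 90 | 81 | 93.6%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 | 100 | 100 | 100 | 97 | 64 | 92.3%
GPT-5 Nano | 100 | 100 | 100 | 100 | 55 | 91.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 79 | 76 | 90.9%
Gemini 2.5 Pro | 100 | 100 | 100 | 83 | 69 | 90.5%
Claude 3.7 Sonnet | 100 | 100 | 100 | 73 | 72 | 89.1%
Stealth: Aurora Alpha | 100 | 94 | 91 | 73 | 70 | 85.7%
Llama 3.1 70B | 100 | 100 | 88 | 79 | 59 | 85.0%
Qwen 2.5 72B | 100 | 100 | 100 | 89 | 28 | 83.5%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 15 | 83.1%
Gemma 3 27B | 100 | 100 | 93 | 76 | 46 | 83.0%
Grok 4 | 100 | 100 | 100 | 83 | 31 | 82.9%
Ministral 3B | 100 | 100 | 100 | 100 | 10 | 82.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 7 | 81.4%
Ministral 8B | 100 | 100 | 100 | 100 | 7 | 81.4%
Grok 4 Fast | 100 | 100 | 100 | 61 | 42 | 80.5%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 8B | 100 | 100 | 100 | 85 | 12 | 79.4%
GPT-4o, May 13th (temp=1) | 100 | 97 | 91 | 80 | 25 | 78.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 99 | 92 | 0 | 78.2%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 85 | 0 | 77.0%
GPT-4.1 Mini | 100 | 94 | 79 | 76 | 36 | 76.9%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 83 | 0 | 76.7%
DeepSeek V3.1 | 100 | 100 | 100 | 69 | 4 | 74.6%
Gemini 2.5 Flash | 100 | 100 | 68 | 44 | 39 | 70.2%
Rocinante 12B | 100 | 100 | 100 | 40 | 7 | 69.5%
Z.AI GLM 4.7 Flash | 100 | 99 | 69 | 60 | 0 | 65.7%
Arcee AI: Trinity Large (Preview) | 100 | 91 | 76 | 56 | 0 | 64.7%
Cohere Command R+ (Aug. 2024) | 100 | 93 | 76 | 32 | 21 | 64.4%
Mistral Small 3.2 24B | 100 | 100 | 77 | 43 | 0 | 63.9%
Mistral NeMO | 100 | 100 | 63 | 28 | 14 | 60.9%
Gemma 3 12B | 100 | 94 | 57 | 41 | 0 | 58.5%
Arcee AI: Trinity Mini | 88 | 70 | 50 | 7 | 0 | 42.9%
GPT-5.2 | 80 | 60 | 50 | 24 | 0 | 42.8%
Gemini 2.5 Flash Lite | 88 | 67 | 31 | 7 | 7 | 39.9%
GPT-4o, Aug. 6th (temp=0) | 100 | 62 | 17 | 17 | 0 | 39.0%
Llama 3.1 Nemotron 70B | 100 | 79 | 7 | 0 | 0 | 37.1%
Hermes 3 405B | 100 | 25 | 18 | 0 | 0 | 28.7%
Gemma 3 4B | 85 | 18 | 15 | 7 | 0 | 25.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 25 | 0 | 0 | 0 | 25.0%
Claude 3 Haiku | 63 | 57 | 0 | 0 | 0 | 24.0%
Hermes 3 70B | 63 | 0 | 0 | 0 | 0 | 12.6%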
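Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 97 | 99.5%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 88 | 97.5%
Mistral NeMO | 100 | 100 | 100 | 100 | 88 | 97.5%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 83 | 96.7%
Grok 4.1 Fast | 100 | 100 | 100 | 93 | 88 | 96.1%
GPT-5 Mini | 100 | 100 | 100 | 91 | 85 | 95.2%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 73 | 94.6%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 67 | 93.3%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 55 | 91.0%
Mistral Small Creative | 100 | 100 | 100 | 88 | 59 | 89.3%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 39 | 87.8%
Ministral 3 8B | 100 | 100 | 100 | 100 | 39 | 87.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 92 | 45 | 87.4%
DeepSeek V3.2 | 100 | 94 | 94 | 79 | 43 | 82.1%
Claude Opus 4 | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 73 | 7 | 76.0%
Ministral 3B | 100 | 100 | 94 | 83 | 0 | 75.6%
GPT-5.1 | 100 | 94 | 93 | 63 | 25 | 74.9%
Z.AI GLM 4.5 | 100 | 100 | 96 | 79 | 0 | 74.9%
DeepSeek V3.1 | 100 | 100 | 100 | 67 | 0 | 73.3%
Ministral 8B | 100 | 100 | 73 | 59 | 25 | 71.4%
Mistral Large 3 | 100 | 100 | 100 | 25 | 25 | 70.0%
Ministral 3 14B | 100 | 100 | 67 | 50 | 32 | 69.8%
o4 Mini High | 100 | 91 | 83 | 73 | 0 | 69.5%
Z.AI GLM 5 | 100 | 100 | 100 | 39 | 0 | 67.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 39 | 0 | 67.8%
Gemma 3 27B | 100 | 100 | 73 | 45 | 20 | 67.5%
DeepSeek V3 (2025-03-24) | 100 | 100 | 67 | 59 | 7 | 66.6%
Llama 3.1 70B | 100 | 100 | 100 | 25 | 0 | 65.0%
Grok 4 | 100 | 100 | 88 | 35 | 0 | 64.4%
Z.AI GLM 4.7 Flash | 100 | 100 | 91 | 21 | 0 | 62.4%
Gemma 3 12B | 100 | 81 | 76 | 47 | 0 | 60.7%
Claude Sonnet 4 | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 0 | 0 | 60.0%
Qwen 2.5 72B | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3 3B | 100 | 100 | 94 | 0 | 0 | 58.9%
GPT-5 Nano | 100 | 85 | 67 | 30 | 0 | 56.3%
GPT-5.2 | 100 | 91 | 41 | 31 | 11 | 54.8%
GPT-4.1 | 100 | 88 | 39 | 25 | 17 | 53.6%
GPT-4o, Aug. 6th (temp=0) | 100 | 73 | 73 | 0 | 0 | 49.2%
Gemini 2.5 Flash Lite | 100 | 100 | 39 | 0 | 0 | 47.8%
ByteDance Seed 1.6 Flash | 100 | 100 | 7 | 7 | 0 | 42.9%
Gemini 3 Flash (Preview) | 96 | 63 | 39 | 0 | 0 | 39.7%
Grok 4 Fast | 100 | 67 | 25 | 0 | 0 | 38.3%
GPT-4.1 Nano | 100 | 88 | 0 | 0 | 0 | 37.5%
Arcee AI: Trinity Mini | 100 | 83 | 0 | 0 | 0 | 36.7%
Claude 3.7 Sonnet | 100 | 76 | 0 | 0 | 0 | 35.2%
o4 Mini | 100 | 59 | 0 | 0 | 0 | 31.8%
Gemini 3 Pro (Preview) | 81 | 50 | 25 | 0 | 0 | 31.2%
Arcee AI: Trinity Large (Preview) | 100 | 43 | 7 | 0 | 0 | 30.0%
Llama 3.1 8B | 100 | 50 | 0 | 0 | 0 | 30.0%
GPT-4.1 Mini | 73 | 67 | 0 | 0 | 0 | 27.9%
Cohere Command R+ (Aug. 2024) | 73 | 55 | 0 | 0 | 0 | 25.6%
Claude 3 Haiku | 73 | 39 | 0 | 0 | 0 | 22.4%
Stealth: Aurora Alpha | 100 | 7 | 0 | 0 | 0 | 21.4%
Gemma 3 4B | 50 | 45 | 10 | 0 | 0 | 20.9%
Hermes 3 405B | 100 | 0 | 0 | 0 | 0 | 20.0%
Qwen 3.5 Plus (2026-02-15) | 89 | 0 | 0 | 0 | 0 | 17.9%
GPT-4o, Aug. 6th (temp=1) | 45 | 25 | 7 | 0 | 0 | 15.4%
Llama 3.1 Nemotron 70B | 7 | 0 | 0 | 0 | 0 | 1.4%
GPT-4o, May 13th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Hermes 3 70B | 0 | 0 | 0 | 0 | 0 | 0.0%
Rocinante 12B | 0 | 0 | 0 | 0 | 0 | 0.0%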