Dialogue tag variety (said vs. fancy)

Test: Bad Writing Habits

Avg. Score: 64.7%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Claude Sonnet 4.6 | 99.4% | $0.031 | 39.3s | 95%
2 | Claude Sonnet 4.6 (Reasoning) | 99.4% | $0.060 | 1.2m | 93%
3 | Z.AI GLM 5 Turbo | 93.9% | $0.0081 | 33.2s | 56%
4 | MiniMax M2.5 | 92.4% | $0.0034 | 1.3m | 60%
5 | MiniMax M2.7 | 89.5% | $0.0040 | 1.1m | 54%
6 | Mistral Large | 85.7% | $0.014 | 30.9s | 42%
7 | Mistral Large 2 | 86.8% | $0.013 | 29.4s | 39%
8 | Ministral 8B | 80.0% | $0.0004 | 10.4s | 35%
9 | Claude Sonnet 4.5 | 88.4% | $0.035 | 38.1s | 45%
10 | Z.AI GLM 5 | 88.4% | $0.0084 | 1.2m | 42%
11 | Qwen 3.5 397B A17B | 95.7% | $0.014 | 3.0m | 66%
12 | Mistral Large 3 | 82.1% | $0.0033 | 30.3s | 35%
13 | GPT-5 Mini | 81.8% | $0.0100 | 57.4s | 44%
14 | Mistral Small Creative | 78.8% | $0.0007 | 9.1s | 29%
15 | Ministral 3 3B | 79.1% | $0.0005 | 11.1s | 29%
16 | Qwen 3.5 35B | 86.5% | $0.018 | 1.0m | 41%
17 | Qwen 3.5 9B | 85.3% | $0.0011 | 1.4m | 40%
18 | Claude Opus 4.5 | 91.0% | $0.070 | 53.4s | 56%
19 | Ministral 3B | 77.5% | $0.0001 | 8.1s | 28%
20 | Qwen 3.5 Flash | 82.4% | $0.0025 | 47.5s | 30%
21 | Claude Haiku 4.5 | 77.2% | $0.011 | 21.6s | 32%
22 | ByteDance Seed 1.6 Flash | 77.0% | $0.0013 | 27.3s | 29%
23 | GPT-5.4 Nano (Reasoning) | 75.3% | $0.0061 | 24.5s | 32%
24 | Mistral Small 4 (Reasoning) | 76.3% | $0.0022 | 30.2s | 30%
25 | GPT-5.4 Mini (Reasoning, Low) | 77.8% | $0.015 | 16.8s | 30%
26 | Ministral 3 8B | 75.1% | $0.0008 | 19.6s | 27%
27 | Mistral Small 4 | 74.4% | $0.0014 | 18.2s | 26%
28 | Writer: Palmyra X5 | 77.5% | $0.011 | 22.0s | 28%
29 | GPT-5.4 Mini | 76.8% | $0.015 | 16.8s | 27%
30 | GPT-5.4 Nano (Reasoning, Low) | 72.6% | $0.0055 | 20.6s | 28%
31 | Qwen3 235B A22B Instruct 2507 | 77.6% | $0.0011 | 59.2s | 29%
32 | Qwen 3.5 27B | 84.3% | $0.020 | 1.6m | 40%
33 | Mistral Medium 3.1 | 75.6% | $0.0048 | 36.5s | 25%
34 | GPT-5.4 | 87.7% | $0.049 | 1.4m | 44%
35 | Ministral 3 14B | 69.6% | $0.0007 | 11.7s | 22%
36 | Claude Sonnet 4 | 80.1% | $0.032 | 43.7s | 31%
37 | Inception Mercury | 73.6% | $0.011 | 17.6s | 21%
38 | GPT-5.4 Nano | 69.9% | $0.0057 | 26.3s | 23%
39 | GPT-5.4 Mini (Reasoning) | 75.3% | $0.022 | 28.1s | 24%
40 | Qwen 3.5 122B | 80.3% | $0.025 | 1.1m | 28%
41 | GPT-5.4 (Reasoning, Low) | 85.1% | $0.055 | 1.4m | 41%
42 | Stealth: Healer Alpha | 67.2% | $0.0000 | 23.7s | 16%
43 | DeepSeek V3 (2025-03-24) | 68.8% | $0.0014 | 39.4s | 17%
44 | ByteDance Seed 1.6 | 85.2% | $0.013 | 2.5m | 34%
45 | Mistral NeMO | 62.9% | $0.0005 | 10.1s | 14%
46 | Claude Opus 4.6 (Reasoning) | 88.5% | $0.088 | 1.4m | 43%
47 | Stealth: Hunter Alpha | 68.6% | $0.0000 | 55.0s | 19%
48 | GPT-5 | 90.2% | $0.065 | 2.8m | 55%
49 | Claude Opus 4.6 | 86.0% | $0.078 | 1.2m | 37%
50 | Aion 2.0 | 70.8% | $0.0064 | 1.3m | 24%
51 | Grok 4.1 Fast | 66.1% | $0.0018 | 37.8s | 16%
52 | Gemini 2.5 Pro | 73.3% | $0.036 | 36.2s | 23%
53 | Llama 3.1 70B | 64.3% | $0.0015 | 29.4s | 11%
54 | Gemini 3.1 Flash Lite (Preview) | 61.5% | $0.0030 | 8.4s | 9%
55 | Qwen 3 32B | 61.9% | $0.0015 | 54.6s | 18%
56 | Z.AI GLM 4.6 | 65.4% | $0.0065 | 51.5s | 14%
57 | ByteDance Seed 2.0 Lite | 77.4% | $0.012 | 2.2m | 24%
58 | DeepSeek V3 (2024-12-26) | 63.6% | $0.0021 | 54.6s | 14%
59 | Gemini 2.5 Flash | 56.9% | $0.0052 | 10.6s | 9%
60 | Claude 3.5 Sonnet | 70.3% | $0.048 | 35.4s | 19%
61 | Grok 4.20 (Beta, Reasoning) | 67.6% | $0.039 | 34.0s | 17%
62 | LFM2 24B | 57.5% | $0.0002 | 28.4s | 10%
63 | GPT-4.1 | 63.7% | $0.018 | 44.7s | 15%
64 | o4 Mini | 59.3% | $0.015 | 25.7s | 11%
65 | GPT-5.4 (Reasoning) | 87.7% | $0.089 | 2.6m | 47%
66 | Qwen 2.5 72B | 56.5% | $0.0010 | 36.7s | 10%
67 | Z.AI GLM 4.5 | 59.4% | $0.0051 | 42.1s | 10%
68 | GPT-4.1 Nano | 51.8% | $0.0007 | 13.3s | 7%
69 | Arcee AI: Trinity Large (Preview) | 56.3% | $0.0000 | 43.6s | 9%
70 | Grok 4 Fast | 52.9% | $0.0017 | 24.1s | 8%
71 | WizardLM 2 8x22b | 66.5% | $0.0026 | 1.8m | 16%
72 | Claude 3.5 Haiku | 54.2% | $0.0035 | 10.8s | 3%
73 | Arcee AI: Trinity Mini | 50.0% | $0.0003 | 9.2s | 5%
74 | Gemma 3 27B | 56.7% | $0.0006 | 52.6s | 9%
75 | GPT-4o, Aug. 6th (temp=0) | 54.9% | $0.023 | 22.7s | 13%
76 | GPT-4o, May 13th (temp=0) | 57.7% | $0.035 | 14.1s | 10%
77 | DeepSeek V3.2 | 64.4% | $0.0014 | 1.9m | 14%
78 | o4 Mini High | 61.2% | $0.025 | 47.2s | 10%
79 | DeepSeek-V2 Chat | 55.7% | $0.0021 | 53.3s | 8%
80 | MoonshotAI: Kimi K2.5 | 77.1% | $0.019 | 3.2m | 29%
81 | GPT-5.1 | 72.0% | $0.054 | 1.8m | 23%
82 | GPT-4.1 Mini | 43.7% | $0.0027 | 19.0s | 8%
83 | DeepSeek V3.1 | 60.6% | $0.0020 | 1.8m | 13%
84 | Grok 4.20 (Beta) | 48.1% | $0.018 | 15.8s | 7%
85 | Claude 3 Haiku | 40.3% | $0.0025 | 14.9s | 6%
86 | Gemini 2.5 Flash Lite | 40.0% | $0.0009 | 9.5s | 4%
87 | Hermes 3 405B | 48.9% | $0.0032 | 53.2s | 5%
88 | Claude 3.7 Sonnet | 56.4% | $0.042 | 46.7s | 12%
89 | Rocinante 12B | 46.1% | $0.0014 | 38.4s | 3%
90 | Nemotron 3 Super | 52.7% | $0.0000 | 1.4m | 8%
91 | Inception Mercury 2 | 38.2% | $0.0032 | 7.0s | 4%
92 | Z.AI GLM 4.7 Flash | 49.2% | $0.0017 | 1.2m | 8%
93 | Gemini 3 Flash (Preview, Reasoning) | 45.3% | $0.012 | 30.1s | 5%
94 | Gemini 2.5 Flash Lite (Reasoning) | 43.3% | $0.0028 | 30.8s | 3%
95 | Gemini 3.1 Pro (Preview) | 79.9% | $0.107 | 1.8m | 28%
96 | Gemini 2.5 Flash (Reasoning) | 41.7% | $0.011 | 21.5s | 2%
97 | Z.AI GLM 4.7 | 51.7% | $0.010 | 1.4m | 8%
98 | Stealth: Aurora Alpha | 35.7% | $0.0000 | 9.8s | 0%
99 | GPT-5 Nano | 47.2% | $0.0042 | 1.4m | 9%
100 | Gemini 3 Flash (Preview) | 37.4% | $0.0078 | 19.6s | 3%
101 | GPT-5.2 | 61.4% | $0.056 | 1.5m | 17%
102 | Llama 3.1 8B | 45.2% | $0.0003 | 1.3m | 5%
103 | Llama 3.1 Nemotron 70B | 33.9% | $0.0038 | 31.7s | 1%
104 | GPT-4o Mini (temp=0) | 34.6% | $0.0012 | 34.8s | 0%
105 | ByteDance Seed 2.0 Mini | 77.8% | $0.0045 | 4.9m | 23%
106 | Gemma 3 12B | 31.2% | $0.0004 | 41.3s | 3%
107 | Cohere Command R+ (Aug. 2024) | 38.6% | $0.020 | 52.5s | 3%
108 | Qwen 3.5 Plus (2026-02-15) | 30.5% | $0.0060 | 31.5s | 0%
109 | Grok 4 | 52.5% | $0.048 | 1.7m | 9%
110 | Nemotron 3 Nano | 30.7% | $0.0010 | 1.1m | 2%
111 | Claude Opus 4 | 85.3% | $0.209 | 1.4m | 36%
112 | Gemma 3 4B | 19.7% | $0.0002 | 20.0s | 0%
113 | Hermes 3 70B | 30.8% | $0.0010 | 1.2m | 0%
114 | GPT-4o Mini (temp=1) | 20.1% | $0.0012 | 34.8s | 0%
115 | GPT-4o, May 13th (temp=1) | 25.2% | $0.033 | 14.4s | 0%
116 | GPT-4o, Aug. 6th (temp=1) | 19.5% | $0.018 | 24.4s | 0%
117 | Gemini 3 Pro (Preview) | 31.2% | $0.055 | 54.4s | 4%
118 | Mistral Small 3.2 24B | 68.9% | $0.0069 | 5.7m | 15%
Average score: 64.74%

Individual Scenarios

Detailed Writing Rules

Model #1 #2 #3 #4 #5 Avg
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
MiniMax M2.71001001001009198.2%
ByteDance Seed 1.6100100100978897.0%
Claude Opus 4.6 (Reasoning)100100100919196.5%
GPT-5.4 (Reasoning, Low)100100100968496.0%
Claude Opus 41001001001007995.9%
Claude Sonnet 4.5100100100947994.6%
GPT-5.4 Mini (Reasoning)100100100966993.0%
GPT-5.4 Mini (Reasoning, Low)100100100966492.0%
GPT-5.4100100100965890.8%
ByteDance Seed 2.0 Lite100100100836790.0%
Qwen 3.5 35B1001001001004789.5%
GPT-510010086827789.0%
Qwen 3.5 397B A17B1001001001004288.4%
Gemini 3.1 Pro (Preview)10010083825083.0%
MiniMax M2.5100100100791879.4%
Claude Opus 4.6100100100801579.0%
Z.AI GLM 5100100100504378.6%
ByteDance Seed 2.0 Mini10010010091078.2%
Mistral Large 210010010079075.7%
GPT-5.4 Mini10010070623673.6%
Ministral 8B1001007973070.3%
Writer: Palmyra X594938864067.9%
Mistral Medium 3.11009995211265.4%
Qwen 3.5 9B1009667352865.1%
Qwen3 235B A22B Instruct 2507100998825062.2%
GPT-5.4 Nano89777667061.6%
Qwen 3.5 27B1009362371260.9%
Qwen 3.5 122B100100984060.3%
GPT-4o, Aug. 6th (temp=0)100100947060.3%
Claude 3.5 Sonnet91796759059.1%
Ministral 3 14B767059454558.9%
Mistral Large 3917867421157.7%
Qwen 3.5 Flash100100840056.8%
GPT-5.11001004737056.7%
GPT-5.2817168312955.9%
ByteDance Seed 1.6 Flash1007367211054.1%
Qwen 2.5 72B716353503354.1%
Qwen 3 32B10094760054.1%
Claude Sonnet 4100100700054.0%
Ministral 3 8B76696755053.3%
MoonshotAI: Kimi K2.5100100590051.8%
Rocinante 12B100100590051.8%
GPT-4.19383810051.4%
Mistral Small 497895021051.4%
GPT-5 Mini85846817051.0%
Mistral Small 3.2 24B10077690049.3%
GPT-5.4 Nano (Reasoning)1005041351447.9%
Grok 4.1 Fast100100390047.8%
Mistral NeMO100673025044.3%
o4 Mini64503932738.5%
Gemini 2.5 Flash8369320036.9%
Mistral Small 4 (Reasoning)7362470036.5%
GPT-5.4 Nano (Reasoning, Low)1007720035.9%
WizardLM 2 8x22b7667310034.6%
Mistral Large947900034.6%
Ministral 3 3B887900033.2%
Claude 3 Haiku50393932032.0%
Grok 463353021029.7%
Ministral 3B835900028.5%
Aion 2.0944700028.4%
Hermes 3 70B5353250026.3%
Llama 3.1 Nemotron 70B913900026.0%
Stealth: Healer Alpha1002800025.5%
Inception Mercury1002700025.4%
GPT-4o, May 13th (temp=0)6339190024.2%
Nemotron 3 Super5932250023.3%
GPT-4o, May 13th (temp=1)882020021.8%
Grok 4.20 (Beta)6128190021.6%
Claude Haiku 4.54747120021.2%
Grok 4 Fast732570021.0%
Gemini 3 Flash (Preview, Reasoning)100000020.0%
o4 Mini High100000020.0%
DeepSeek-V2 Chat100000020.0%
Claude 3.7 Sonnet100000020.0%
DeepSeek V3 (2024-12-26)791700019.0%
DeepSeek V3 (2025-03-24)672500018.3%
DeepSeek V3.1592570018.2%
Arcee AI: Trinity Large (Preview)91000018.2%
Mistral Small Creative82000016.4%
GPT-4o, Aug. 6th (temp=1)73000014.6%
Z.AI GLM 4.7 Flash67000013.3%
Hermes 3 405B353000012.9%
DeepSeek V3.264000012.9%
Gemini 2.5 Pro63000012.6%
Gemini 3 Flash (Preview)59300012.4%
GPT-5 Nano52000010.3%
GPT-4.1 Mini43700010.0%
Llama 3.1 70B50000010.0%
Llama 3.1 8B50000010.0%
Z.AI GLM 4.728174009.7%
Stealth: Hunter Alpha4700009.5%
Z.AI GLM 4.53170007.6%
Z.AI GLM 4.62500005.0%
LFM2 24B2500005.0%
Grok 4.20 (Beta, Reasoning)1770004.8%
Gemini 3.1 Flash Lite (Preview)1700003.3%
Cohere Command R+ (Aug. 2024)1700003.3%
Gemini 3 Pro (Preview)700001.4%
Gemini 2.5 Flash Lite (Reasoning)700001.4%
Gemma 3 12B200000.4%
Gemini 2.5 Flash (Reasoning)000000.0%
Qwen 3.5 Plus (2026-02-15)000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Claude 3.5 Haiku000000.0%
Gemini 2.5 Flash Lite000000.0%
GPT-4o Mini (temp=1)000000.0%
GPT-4o Mini (temp=0)000000.0%
Gemma 3 27B000000.0%
Nemotron 3 Nano000000.0%
GPT-4.1 Nano000000.0%
Arcee AI: Trinity Mini000000.0%
Gemma 3 4B000000.0%

Model #1 #2 #3 #4 #5 Avg
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
Gemini 2.5 Pro100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Qwen 3.5 9B1001001001009799.5%
GPT-51001001001008396.7%
Qwen 3.5 397B A17B1001001001008396.7%
Claude Sonnet 41001001001007695.2%
Claude Opus 4.510010097886790.3%
GPT-5.41009794945989.1%
Claude Sonnet 4.51001001001003987.8%
GPT-5.4 Nano (Reasoning)100100100675083.3%
GPT-5 Mini100100100892382.5%
Qwen 3.5 Flash100100100991282.2%
Stealth: Hunter Alpha100100100891781.2%
Z.AI GLM 5 Turbo100100100100080.0%
Z.AI GLM 4.6100100100100080.0%
Mistral Large 3100100100100080.0%
Llama 3.1 70B100100100100080.0%
Arcee AI: Trinity Mini100100100100080.0%
Claude Opus 41001009994078.8%
GPT-5.4 (Reasoning)1007973676777.0%
GPT-5.4 (Reasoning, Low)1001009779075.2%
MiniMax M2.710010010067774.8%
MiniMax M2.510010010071074.2%
Writer: Palmyra X510010010050070.0%
Qwen 3 32B908973572566.9%
Ministral 3B10010010025065.0%
Qwen 3.5 35B10010010012062.4%
DeepSeek V3.11001001007061.4%
GPT-5.110010050391761.1%
Gemini 3.1 Pro (Preview)1001001000060.0%
Z.AI GLM 51001001000060.0%
MoonshotAI: Kimi K2.51001001000060.0%
ByteDance Seed 1.61001001000060.0%
Gemini 2.5 Flash (Reasoning)1001001000060.0%
Gemini 2.5 Flash Lite (Reasoning)1001001000060.0%
Hermes 3 405B1001001000060.0%
Inception Mercury1001001000060.0%
Mistral Small Creative1001001000060.0%
GPT-5.4 Nano (Reasoning, Low)100100880057.5%
GPT-5.4 Mini (Reasoning, Low)969139391756.3%
GPT-5.294797925055.3%
Grok 4.1 Fast100100390047.8%
o4 Mini100100250045.0%
ByteDance Seed 2.0 Lite100100250045.0%
Z.AI GLM 4.7 Flash88735014044.8%
Rocinante 12B10085307044.5%
GPT-5.4 Nano10079327043.6%
Mistral Small 4 (Reasoning)10059500041.8%
Ministral 3 14B10059500041.8%
GPT-5.4 Mini (Reasoning)1009477041.7%
Arcee AI: Trinity Large (Preview)10073323041.7%
DeepSeek V3 (2025-03-24)10010070041.4%
Claude Opus 4.610010000040.0%
Grok 4 Fast10010000040.0%
Mistral Small 3.2 24B10010000040.0%
Mistral NeMO10010000040.0%
LFM2 24B10010000040.0%
Qwen 3.5 122B1009400038.9%
Z.AI GLM 4.51009100038.2%
Claude 3.5 Sonnet10045390036.7%
Hermes 3 70B1006770034.8%
DeepSeek V3.21006300032.6%
Stealth: Healer Alpha1005900031.8%
WizardLM 2 8x22b1005000030.0%
Ministral 8B10025250030.0%
GPT-4o, Aug. 6th (temp=0)1003900027.8%
Ministral 3 8B1003900027.8%
Claude 3 Haiku1003900027.8%
Mistral Small 41002570026.4%
GPT-5.4 Mini7043140025.3%
Claude Opus 4.6 (Reasoning)1002500025.0%
Aion 2.01002500025.0%
GPT-4.11002500025.0%
Mistral Medium 3.11001700023.3%
Gemini 3 Flash (Preview)7620170022.4%
Grok 4.20 (Beta, Reasoning)100000020.0%
Grok 4100000020.0%
GPT-4o, May 13th (temp=0)100000020.0%
DeepSeek-V2 Chat100000020.0%
DeepSeek V3 (2024-12-26)100000020.0%
GPT-4o, Aug. 6th (temp=1)100000020.0%
ByteDance Seed 1.6 Flash100000020.0%
GPT-4.1 Nano100000020.0%
Llama 3.1 8B100000020.0%
GPT-4.1 Mini83000016.7%
Claude 3.7 Sonnet56000011.2%
GPT-5 Nano3900007.8%
Qwen 2.5 72B17170006.7%
Gemini 3 Flash (Preview, Reasoning)2500005.0%
Gemma 3 27B2500005.0%
Gemma 3 4B2500005.0%
o4 Mini High000000.0%
Gemini 3 Pro (Preview)000000.0%
Z.AI GLM 4.7000000.0%
Qwen 3.5 Plus (2026-02-15)000000.0%
Gemini 3.1 Flash Lite (Preview)000000.0%
Nemotron 3 Super000000.0%
Grok 4.20 (Beta)000000.0%
Inception Mercury 2000000.0%
GPT-4o, May 13th (temp=1)000000.0%
Stealth: Aurora Alpha000000.0%
Gemini 2.5 Flash Lite000000.0%
Gemini 2.5 Flash000000.0%
GPT-4o Mini (temp=1)000000.0%
Gemma 3 12B000000.0%
GPT-4o Mini (temp=0)000000.0%
Nemotron 3 Nano000000.0%
Llama 3.1 Nemotron 70B000000.0%
Cohere Command R+ (Aug. 2024)000000.0%

Model #1 #2 #3 #4 #5 Avg
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Aion 2.0100100100100100100.0%
MiniMax M2.7100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
MiniMax M2.5100100100100100100.0%
o4 Mini100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Inception Mercury100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Ministral 8B100100100100100100.0%
MoonshotAI: Kimi K2.51001001001009999.7%
Z.AI GLM 4.61001001001009999.7%
DeepSeek-V2 Chat1001001001009899.7%
Gemma 3 27B1001001001009799.5%
Mistral Small Creative1001001001009699.3%
LFM2 24B1001001001009398.7%
GPT-4.11001001001009198.2%
Gemini 2.5 Flash (Reasoning)1001001001009198.2%
ByteDance Seed 2.0 Lite1001001001009198.2%
Mistral NeMO1001001001009198.2%
Grok 4100100100999298.2%
Gemini 3.1 Pro (Preview)1001001001009098.1%
Qwen 3.5 Flash1001001001008997.9%
Mistral Small 3.2 24B1001001001008697.2%
DeepSeek V3.21001001001008396.7%
Gemma 3 12B100100100919096.2%
GPT-5 Mini100100100988296.0%
GPT-5.4 Nano (Reasoning)100100100948495.7%
Llama 3.1 70B1001001001007995.7%
Ministral 3B1001001001007995.7%
Claude Haiku 4.51001001001007695.2%
Gemini 2.5 Pro100100100888895.0%
Qwen 3 32B100100100898594.8%
DeepSeek V3.110010093918994.8%
Grok 4.1 Fast1001001001007394.6%
Nemotron 3 Super1001001001007394.6%
Stealth: Hunter Alpha100100100868694.4%
Z.AI GLM 4.71001001001007094.0%
Stealth: Healer Alpha1001001001006893.6%
Claude 3.7 Sonnet1001001001006693.2%
Mistral Small 41001001001003987.8%
GPT-5 Nano100100100855487.7%
GPT-4o Mini (temp=0)1009289817587.4%
Grok 4 Fast100100100923685.6%
GPT-5.4 Nano (Reasoning, Low)958276727079.3%
GPT-4.1 Nano1009191813279.1%
GPT-4o, Aug. 6th (temp=0)10010083692876.1%
Gemini 3 Flash (Preview)979592484475.4%
Arcee AI: Trinity Mini10010079762175.1%
GPT-4.1 Mini10010077474273.1%
Qwen 3.5 Plus (2026-02-15)1009185473571.8%
Gemini 3.1 Flash Lite (Preview)1001007962068.1%
Cohere Command R+ (Aug. 2024)10010010035066.9%
Gemini 3 Flash (Preview, Reasoning)1009356453064.7%
Gemini 2.5 Flash Lite (Reasoning)998561501862.6%
Gemini 2.5 Flash1001001000060.0%
Llama 3.1 8B100100880057.5%
Nemotron 3 Nano86848325757.2%
GPT-4o, May 13th (temp=1)1008061271356.1%
Gemini 2.5 Flash Lite1008150221754.0%
Rocinante 12B1001004517052.3%
Gemini 3 Pro (Preview)100794023549.5%
Inception Mercury 210073670047.9%
Gemma 3 4B100554732046.7%
Llama 3.1 Nemotron 70B100882014745.6%
Claude 3.5 Haiku10050177034.8%
GPT-4o, Aug. 6th (temp=1)9950170033.1%
GPT-4o Mini (temp=1)36251715018.7%
Stealth: Aurora Alpha523900018.3%
Hermes 3 70B4700009.5%

Model #1 #2 #3 #4 #5 Avg
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
GPT-5.2100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
MiniMax M2.7100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
GPT-5.4100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Inception Mercury100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Grok 4.1 Fast1001001001009999.7%
Claude Sonnet 41001001001009298.5%
WizardLM 2 8x22b1001001001009298.5%
ByteDance Seed 2.0 Mini1001001001009198.2%
Mistral Small 41001001001008897.5%
MoonshotAI: Kimi K2.51001001001008797.3%
Grok 4.20 (Beta, Reasoning)1001001001008396.7%
Grok 4.20 (Beta)1001001001008196.2%
Mistral Large100100100898895.4%
DeepSeek V3 (2025-03-24)100100100918394.9%
Z.AI GLM 4.5100100100967594.1%
Claude Opus 41001001001007094.0%
o4 Mini10010095888393.2%
Mistral Small Creative100100100887692.7%
Mistral Medium 3.11001001001006292.4%
Mistral Small 3.2 24B1001001001005991.8%
Qwen 3.5 27B100100100946491.6%
Stealth: Hunter Alpha1001001001005390.6%
Stealth: Healer Alpha100100100797089.7%
Claude 3 Haiku100100100965089.2%
Qwen 3.5 9B1001001001004689.2%
o4 Mini High10010090797689.0%
Qwen 3.5 122B100100100836188.9%
Rocinante 12B10010097736787.4%
DeepSeek V3.2100100100766187.4%
ByteDance Seed 1.6 Flash100100100715986.0%
GPT-5 Mini959089837286.0%
Aion 2.0100100100646285.2%
Mistral Small 4 (Reasoning)100100100893584.8%
Gemini 3.1 Pro (Preview)1001001001002384.6%
GPT-5.4 Nano (Reasoning)10010087756084.3%
Arcee AI: Trinity Large (Preview)100100100734583.6%
GPT-5.4 Nano (Reasoning, Low)1009182747083.4%
Ministral 8B1001001001001783.3%
DeepSeek-V2 Chat100100100564880.8%
DeepSeek V3 (2024-12-26)1009084794880.3%
Claude 3.5 Sonnet10010010088778.9%
Claude 3.7 Sonnet10010095692778.3%
Nemotron 3 Super10010083703477.4%
Gemma 3 27B1008885753676.7%
Grok 4 Fast1008179635976.3%
GPT-4.110010010080076.0%
Ministral 3 8B1001009173774.3%
Llama 3.1 8B888883595073.5%
Qwen 3.5 Flash1001008380072.6%
Gemini 2.5 Pro10010010055071.0%
Ministral 3 14B100888870069.0%
Hermes 3 70B10010010045068.9%
Mistral NeMO10010091361768.8%
Ministral 3 3B10010079322567.2%
GPT-4o, May 13th (temp=0)1001009125063.2%
Z.AI GLM 4.7 Flash1001006943062.3%
Hermes 3 405B1001001000060.0%
Gemini 2.5 Flash86757556058.5%
Z.AI GLM 4.7837947433958.0%
Arcee AI: Trinity Mini1006753452557.9%
Ministral 3B100100797057.1%
Qwen 2.5 72B1006750441555.0%
LFM2 24B94916320053.6%
DeepSeek V3.1100794239051.9%
Gemini 3 Flash (Preview, Reasoning)100804225750.8%
Grok 489853932049.2%
GPT-4o, Aug. 6th (temp=0)595352392846.3%
GPT-4o Mini (temp=0)9965630045.5%
Qwen 3 32B735050252143.8%
Llama 3.1 Nemotron 70B67675525042.6%
Gemini 3.1 Flash Lite (Preview)9181350041.4%
Cohere Command R+ (Aug. 2024)81553925039.9%
Gemini 2.5 Flash (Reasoning)81453932039.4%
Gemma 3 12B7975285037.1%
GPT-4.1 Mini75572922036.6%
GPT-4.1 Nano73633017036.6%
Gemini 3 Pro (Preview)92252315031.0%
Qwen 3.5 Plus (2026-02-15)6347360029.1%
GPT-5 Nano48251512019.8%
Gemini 2.5 Flash Lite (Reasoning)59777016.1%
GPT-4o, May 13th (temp=1)521800014.2%
GPT-4o, Aug. 6th (temp=1)363500014.1%
GPT-4o Mini (temp=1)2500005.0%
Gemini 3 Flash (Preview)1743004.7%
Gemini 2.5 Flash Lite2000003.9%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Claude 3.5 Haiku000000.0%
Nemotron 3 Nano000000.0%
Gemma 3 4B000000.0%

Model #1 #2 #3 #4 #5 Avg
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
MiniMax M2.7100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
GPT-4.1100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Qwen 2.5 72B1001001001009999.7%
GPT-5.21001001001009599.1%
ByteDance Seed 2.0 Mini1001001001009498.9%
Gemini 3.1 Pro (Preview)1001001001009498.8%
DeepSeek V3 (2024-12-26)1001001001009298.5%
ByteDance Seed 1.61001001001009198.2%
Grok 4.1 Fast1001001001009198.2%
GPT-5.4 Nano (Reasoning)1001001001008797.3%
GPT-4o, May 13th (temp=0)1001001001008697.2%
o4 Mini100100100949297.1%
WizardLM 2 8x22b1001001001008496.9%
DeepSeek V3.21001001001008396.7%
MiniMax M2.5100100100988496.5%
o4 Mini High10010097928995.6%
Llama 3.1 70B100100100888895.0%
Mistral Small Creative1001001001007595.0%
Mistral Medium 3.110010096958194.6%
Aion 2.01001001001006993.9%
GPT-5 Mini100100100888193.6%
Gemini 2.5 Flash100100100947393.5%
Z.AI GLM 4.7 Flash100100100917493.1%
Claude Haiku 4.5100100100927393.0%
DeepSeek V3.110010095937592.7%
Mistral Small 41001001001006192.1%
Qwen 3.5 122B1001001001005791.5%
Gemini 3 Flash (Preview, Reasoning)100100100827591.4%
Llama 3.1 8B100100100837391.3%
Claude 3.7 Sonnet100100100817491.1%
Grok 4.20 (Beta, Reasoning)100100100994588.7%
DeepSeek V3 (2025-03-24)100100100736988.4%
LFM2 24B1001001001004288.4%
Ministral 3B100100100816188.3%
DeepSeek-V2 Chat100100100964488.1%
Gemini 2.5 Pro10010085807387.6%
GPT-4o, Aug. 6th (temp=0)100100100766287.5%
ByteDance Seed 2.0 Lite100100100795987.5%
Ministral 3 14B100100100973987.3%
GPT-5.4 Nano (Reasoning, Low)100100100696787.2%
Z.AI GLM 4.6939189837987.1%
ByteDance Seed 1.6 Flash1001001001002985.8%
Mistral Large100100100814785.7%
Grok 4.20 (Beta)1009189757285.6%
Gemini 2.5 Flash (Reasoning)10010098834485.0%
Inception Mercury1001001001002585.0%
Stealth: Healer Alpha100100100913084.2%
GPT-5 Nano1008581767583.4%
Hermes 3 405B100100100100080.0%
Mistral Small 4 (Reasoning)10010094673980.0%
GPT-4o Mini (temp=0)10010096881579.8%
Ministral 3 8B100100100792079.6%
MoonshotAI: Kimi K2.5100100100564179.5%
GPT-5.4 Nano1009768636177.8%
Gemini 3 Pro (Preview)1009172646177.7%
Ministral 8B100100100533577.7%
Qwen 3 32B1009183635077.5%
Mistral NeMO10010010088077.5%
Grok 4 Fast10010010085077.1%
Nemotron 3 Super10010088752276.9%
Gemma 3 27B100100100542876.4%
Gemini 3 Flash (Preview)1008987733175.9%
Grok 410010097453174.6%
Mistral Small 3.2 24B1001008879373.8%
Z.AI GLM 4.710010073582871.9%
Gemini 3.1 Flash Lite (Preview)1001007768069.0%
Arcee AI: Trinity Large (Preview)10010010035066.9%
GPT-4o Mini (temp=1)807354524861.3%
GPT-4.1 Nano1007059561259.5%
Qwen 3.5 Plus (2026-02-15)1001006431059.1%
Rocinante 12B100100810056.2%
Claude 3 Haiku1001004729055.1%
Hermes 3 70B100100630052.6%
Nemotron 3 Nano100776322052.3%
Ministral 3 3B100100554051.7%
GPT-4o, May 13th (temp=1)83675041048.3%
GPT-4.1 Mini100752322043.8%
Inception Mercury 28382430041.6%
Gemma 3 12B76563931040.5%
Cohere Command R+ (Aug. 2024)7971200033.9%
Arcee AI: Trinity Mini10045200032.9%
Gemma 3 4B79292221030.0%
Gemini 2.5 Flash Lite (Reasoning)636300025.2%
Llama 3.1 Nemotron 70B1002500025.0%
GPT-4o, Aug. 6th (temp=1)794300024.3%
Stealth: Aurora Alpha100000020.0%
Gemini 2.5 Flash Lite56000011.3%

Model #1 #2 #3 #4 #5 Avg
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Aion 2.0100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Qwen 3.5 Flash100100100979398.0%
Qwen3 235B A22B Instruct 25071001001001008997.8%
Z.AI GLM 51001001001008897.5%
GPT-5.4 Mini1001001001008897.5%
DeepSeek V3.1100100100998897.2%
Writer: Palmyra X51001001001008396.7%
Z.AI GLM 5 Turbo1001001001008196.2%
Claude Opus 4.6 (Reasoning)1001001001007394.6%
GPT-5.41009997898694.4%
Ministral 3 8B100100100887392.1%
Claude Haiku 4.51001001001005991.8%
Claude Sonnet 4.51001001001005591.0%
GPT-5.21001001001005490.8%
Ministral 8B100100100995089.7%
ByteDance Seed 1.6 Flash100100100836289.0%
Qwen 3.5 35B10010093726786.2%
Qwen 3.5 9B100100100893985.7%
Gemini 2.5 Flash Lite1001001001002585.0%
Ministral 3 3B1001001001002585.0%
Claude Opus 4100100100733982.4%
GPT-5.110010086705582.2%
GPT-5 Mini100100100931481.3%
MiniMax M2.710010010097780.9%
ByteDance Seed 1.6100100100100080.0%
Z.AI GLM 4.7 Flash100100100100080.0%
ByteDance Seed 2.0 Lite100100100100080.0%
Inception Mercury100100100100080.0%
Claude Sonnet 410010010096079.2%
Mistral Large 210010010094078.9%
DeepSeek V3 (2025-03-24)10010010085778.5%
Mistral Large10010094792078.5%
MoonshotAI: Kimi K2.5100100100503977.8%
GPT-5.4 (Reasoning, Low)10010079575377.8%
Mistral Medium 3.110010081624577.5%
Mistral Small 3.2 24B10010010083076.7%
Gemma 3 27B1008885634576.2%
DeepSeek V3.210010010077075.4%
GPT-5.4 Nano10010010076075.2%
GPT-5.4 Mini (Reasoning, Low)10010082692274.5%
Stealth: Hunter Alpha100100100391771.1%
Claude Opus 4.51001008167069.5%
o4 Mini10010010045068.9%
Mistral Small 41001008359068.5%
Qwen 3 32B1001009432766.6%
Claude 3.7 Sonnet1008873353365.7%
GPT-5.4 Nano (Reasoning)1001007744765.5%
Mistral Large 310010010025065.0%
Llama 3.1 70B10010010025065.0%
WizardLM 2 8x22b1001007239062.2%
Mistral Small Creative100965553060.8%
o4 Mini High1001001000060.0%
Z.AI GLM 4.61001001000060.0%
Gemini 2.5 Pro1001001000060.0%
Claude 3.5 Haiku1001001000060.0%
Gemini 2.5 Flash1001001000060.0%
Arcee AI: Trinity Mini1001001000060.0%
Grok 4.20 (Beta, Reasoning)1007656391757.5%
Gemini 2.5 Flash (Reasoning)100100830056.7%
Z.AI GLM 4.710099757056.2%
Z.AI GLM 4.5100887320056.0%
Ministral 3B100100730054.6%
Ministral 3 14B100837614054.6%
Hermes 3 405B1001005017053.3%
Mistral Small 4 (Reasoning)1009439141251.8%
Stealth: Healer Alpha1001003914050.5%
Mistral NeMO1001002521049.2%
DeepSeek-V2 Chat100635525048.6%
Gemini 3.1 Pro (Preview)67675550348.2%
GPT-5.4 Nano (Reasoning, Low)686157312147.7%
Grok 4 Fast1001001717046.7%
Grok 4.1 Fast100100257046.4%
ByteDance Seed 2.0 Mini100100257046.4%
GPT-4o, Aug. 6th (temp=0)100100257046.4%
Arcee AI: Trinity Large (Preview)10093320045.0%
DeepSeek V3 (2024-12-26)100453932043.2%
Grok 49475257040.3%
Qwen 2.5 72B10010000040.0%
GPT-4.1 Nano10010000040.0%
Rocinante 12B10010000040.0%
Gemma 3 4B10079120038.1%
GPT-4.110055290036.7%
Llama 3.1 Nemotron 70B10059170035.2%
Gemini 3 Flash (Preview)1005670032.7%
Cohere Command R+ (Aug. 2024)10035147732.5%
Gemma 3 12B7667140031.2%
Grok 4.20 (Beta)10029270031.1%
GPT-4.1 Mini1003900027.8%
Hermes 3 70B1003200026.5%
Llama 3.1 8B100000020.0%
Gemini 3.1 Flash Lite (Preview)633500019.5%
Qwen 3.5 Plus (2026-02-15)91000018.2%
GPT-4o, May 13th (temp=0)89000017.8%
Nemotron 3 Super79000015.7%
Gemini 3 Pro (Preview)323100012.7%
Claude 3 Haiku252500010.0%
Nemotron 3 Nano4500008.9%
GPT-5 Nano1100002.2%
Gemini 3 Flash (Preview, Reasoning)000000.0%
Inception Mercury 2000000.0%
GPT-4o, May 13th (temp=1)000000.0%
Stealth: Aurora Alpha000000.0%
GPT-4o, Aug. 6th (temp=1)000000.0%
GPT-4o Mini (temp=1)000000.0%
GPT-4o Mini (temp=0)000000.0%
LFM2 24B000000.0%
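
The Avg column in the per-scenario tables appears to be the arithmetic mean of the five run scores (#1 to #5). The Python sketch below checks that reading against one row from the first table under "Detailed Writing Rules" (MiniMax M2.7: 100, 100, 100, 100, 91, listed with an average of 98.2%). The function name `run_average` is hypothetical and not part of the benchmark. The displayed run scores look rounded to whole numbers, so some rows may differ from the published average by a tenth or two.

```python
from statistics import mean


def run_average(scores):
    """Return the mean of per-run scores, rounded to one decimal place.

    Assumption: the leaderboard's Avg column is a plain arithmetic mean
    of the five runs; the source page does not state this explicitly.
    """
    if not scores:
        raise ValueError("at least one run score is required")
    return round(mean(scores), 1)


# Run scores for MiniMax M2.7 as read from the first scenario table.
minimax_m27 = [100, 100, 100, 100, 91]
print(f"{run_average(minimax_m27):.1f}%")  # prints "98.2%"
```

The same check on the Mistral Large row of that table (94, 79, 0, 0, 0) gives 34.6, which also matches the listed value.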

genre

Model #1 #2 #3 #4 #5 Avg
Claude Sonnet 4.6100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100948896.4%
MiniMax M2.51009491837588.8%
Mistral Large 31009189806284.4%
GPT-5.4 Mini (Reasoning, Low)999185825983.2%
ByteDance Seed 1.6 Flash10010094813782.5%
Ministral 8B100100100732880.2%
Z.AI GLM 510010010099280.1%
Qwen 3.5 397B A17B100100100100080.0%
GPT-5.4 (Reasoning)1008579705677.9%
ByteDance Seed 1.61001009488076.4%
GPT-5.4 (Reasoning, Low)1008482754176.4%
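Qwen 3.5 9B | 100 | 100 | 91 | 80 | 0 | 74.3%
GPT-5.4 Mini | 96 | 94 | 73 | 56 | 50 | 73.9%
MiniMax M2.7 | 100 | 100 | 85 | 62 | 12 | 71.9%
Z.AI GLM 5 Turbo | 100 | 100 | 97 | 53 | 7 | 71.5%
Mistral Large 2 | 81 | 71 | 67 | 67 | 64 | 69.9%
GPT-5.4 | 100 | 96 | 78 | 57 | 17 | 69.6%
Ministral 3 3B | 100 | 100 | 94 | 50 | 0 | 68.9%
Mistral Large | 96 | 63 | 59 | 59 | 56 | 66.7%
Claude Haiku 4.5 | 100 | 70 | 59 | 47 | 39 | 62.9%
Claude Opus 4.6 (Reasoning) | 100 | 85 | 77 | 25 | 7 | 58.9%
Claude Opus 4.5 | 100 | 67 | 50 | 42 | 35 | 58.6%
GPT-5.4 Mini (Reasoning) | 89 | 69 | 67 | 32 | 25 | 56.5%
Qwen 3.5 Flash | 100 | 100 | 67 | 0 | 0 | 53.3%
Mistral Small 4 | 100 | 63 | 52 | 36 | 11 | 52.5%
GPT-4o, May 13th (temp=0) | 100 | 73 | 47 | 32 | 2 | 50.8%
GPT-4.1 | 100 | 100 | 50 | 0 | 0 | 50.0%
Mistral Small 4 (Reasoning) | 83 | 73 | 47 | 46 | 0 | 49.9%
Qwen3 235B A22B Instruct 2507 | 99 | 73 | 43 | 35 | 0 | 49.9%
Inception Mercury | 100 | 100 | 40 | 0 | 0 | 48.0%
GPT-5 Mini | 91 | 56 | 50 | 39 | 0 | 47.3%
Mistral Medium 3.1 | 100 | 89 | 42 | 0 | 0 | 46.3%
ByteDance Seed 2.0 Lite | 100 | 59 | 39 | 25 | 7 | 46.0%
Ministral 3B | 100 | 79 | 50 | 0 | 0 | 45.7%
Qwen 2.5 72B | 100 | 97 | 20 | 0 | 0 | 43.4%
Qwen 3.5 122B | 100 | 100 | 17 | 0 | 0 | 43.3%
Mistral Small Creative | 100 | 50 | 39 | 25 | 0 | 42.8%
LFM2 24B | 100 | 83 | 30 | 0 | 0 | 42.7%
Qwen 3 32B | 73 | 68 | 56 | 12 | 0 | 42.0%
GPT-5.2 | 58 | 39 | 39 | 35 | 32 | 40.5%
Ministral 3 14B | 69 | 67 | 65 | 0 | 0 | 40.2%
Qwen 3.5 35B | 100 | 100 | 0 | 0 | 0 | 40.0%
MoonshotAI: Kimi K2.5 | 83 | 65 | 47 | 0 | 0 | 39.0%
Claude Opus 4.6 | 82 | 36 | 36 | 25 | 7 | 37.2%
WizardLM 2 8x22b | 100 | 73 | 7 | 0 | 0 | 36.0%
Claude Sonnet 4.5 | 88 | 53 | 25 | 12 | 0 | 35.6%
Claude 3 Haiku | 75 | 52 | 47 | 0 | 0 | 35.0%
Mistral NeMO | 88 | 79 | 7 | 0 | 0 | 34.6%
GPT-5 | 63 | 46 | 41 | 18 | 0 | 33.6%
Aion 2.0 | 85 | 56 | 15 | 7 | 0 | 32.7%
GPT-4o, Aug. 6th (temp=0) | 73 | 45 | 25 | 14 | 0 | 31.3%
GPT-5 Nano | 90 | 59 | 0 | 0 | 0 | 29.8%
Claude 3.7 Sonnet | 57 | 56 | 33 | 2 | 0 | 29.8%
Qwen 3.5 27B | 45 | 36 | 35 | 21 | 4 | 28.1%
ByteDance Seed 2.0 Mini | 100 | 21 | 14 | 0 | 0 | 26.9%
Inception Mercury 2 | 47 | 41 | 36 | 3 | 0 | 25.4%
Gemini 2.5 Pro | 100 | 25 | 0 | 0 | 0 | 25.0%
Ministral 3 8B | 100 | 18 | 2 | 0 | 0 | 24.1%
GPT-4.1 Mini | 70 | 50 | 0 | 0 | 0 | 24.0%
Hermes 3 405B | 59 | 55 | 3 | 0 | 0 | 23.4%
Claude Opus 4 | 92 | 19 | 0 | 0 | 0 | 22.2%
Arcee AI: Trinity Large (Preview) | 63 | 35 | 11 | 0 | 0 | 21.8%
Hermes 3 70B | 76 | 32 | 0 | 0 | 0 | 21.7%
Stealth: Aurora Alpha | 56 | 32 | 17 | 0 | 0 | 21.1%
GPT-4o, May 13th (temp=1) | 43 | 43 | 17 | 0 | 0 | 20.5%
Gemini 3.1 Pro (Preview) | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral Small 3.2 24B | 100 | 0 | 0 | 0 | 0 | 20.0%
Gemini 2.5 Flash Lite | 53 | 47 | 0 | 0 | 0 | 20.0%
Claude Sonnet 4 | 67 | 25 | 0 | 0 | 0 | 18.3%
DeepSeek V3.2 | 88 | 0 | 0 | 0 | 0 | 17.5%
Nemotron 3 Nano | 30 | 25 | 17 | 12 | 0 | 16.7%
Qwen 3.5 Plus (2026-02-15) | 64 | 10 | 3 | 0 | 0 | 15.5%
GPT-5.1 | 30 | 30 | 14 | 2 | 0 | 14.9%
GPT-5.4 Nano | 72 | 0 | 0 | 0 | 0 | 14.3%
Grok 4.20 (Beta, Reasoning) | 69 | 0 | 0 | 0 | 0 | 13.8%
Writer: Palmyra X5 | 62 | 4 | 3 | 0 | 0 | 13.7%
GPT-5.4 Nano (Reasoning) | 41 | 25 | 0 | 0 | 0 | 13.2%
DeepSeek V3 (2024-12-26) | 41 | 22 | 0 | 0 | 0 | 12.6%
Claude 3.5 Sonnet | 59 | 0 | 0 | 0 | 0 | 11.8%
GPT-5.4 Nano (Reasoning, Low) | 37 | 20 | 0 | 0 | 0 | 11.4%
Stealth: Hunter Alpha | 45 | 12 | 0 | 0 | 0 | 11.4%
Grok 4.20 (Beta) | 42 | 14 | 0 | 0 | 0 | 11.1%
Llama 3.1 8B | 55 | 0 | 0 | 0 | 0 | 11.0%
Grok 4.1 Fast | 35 | 20 | 0 | 0 | 0 | 10.8%
DeepSeek V3 (2025-03-24) | 45 | 0 | 0 | 0 | 0 | 8.9%
Z.AI GLM 4.7 Flash | 35 | 7 | 0 | 0 | 0 | 8.4%
DeepSeek V3.1 | 39 | 0 | 0 | 0 | 0 | 7.8%
Llama 3.1 Nemotron 70B | 39 | 0 | 0 | 0 | 0 | 7.8%
GPT-4o Mini (temp=0) | 22 | 15 | 0 | 0 | 0 | 7.3%
Gemma 3 27B | 25 | 2 | 0 | 0 | 0 | 5.4%
Z.AI GLM 4.5 | 18 | 7 | 0 | 0 | 0 | 5.1%
o4 Mini High | 25 | 0 | 0 | 0 | 0 | 5.0%
Z.AI GLM 4.6 | 25 | 0 | 0 | 0 | 0 | 5.0%
Gemma 3 4B | 20 | 0 | 0 | 0 | 0 | 3.9%
o4 Mini | 7 | 0 | 0 | 0 | 0 | 1.4%
Gemini 2.5 Flash (Reasoning) | 7 | 0 | 0 | 0 | 0 | 1.4%
Rocinante 12B | 7 | 0 | 0 | 0 | 0 | 1.4%
Grok 4 Fast | 7 | 0 | 0 | 0 | 0 | 1.4%
Gemini 2.5 Flash | 2 | 0 | 0 | 0 | 0 | 0.4%
Gemini 3 Flash (Preview, Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 4.7 | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Healer Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash Lite (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek-V2 Chat | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Super | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Haiku | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0.0%
Llama 3.1 70B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0.0%
Cohere Command R+ (Aug. 2024) | 0 | 0 | 0 | 0 | 0 | 0.0%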
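Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 88 | 67 | 90.8%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large | 100 | 100 | 100 | 100 | 0 | 80.0%
LFM2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5 | 100 | 100 | 100 | 67 | 32 | 79.8%
Claude Sonnet 4 | 100 | 100 | 100 | 45 | 39 | 76.7%
Mistral Small Creative | 100 | 100 | 100 | 76 | 0 | 75.2%
MiniMax M2.7 | 100 | 100 | 100 | 39 | 0 | 67.8%
Mistral NeMO | 100 | 100 | 100 | 39 | 0 | 67.8%
Z.AI GLM 5 | 100 | 100 | 94 | 25 | 7 | 65.3%
Claude Sonnet 4.5 | 100 | 100 | 100 | 25 | 0 | 65.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 7 | 0 | 61.4%
Mistral Small 4 | 100 | 100 | 100 | 7 | 0 | 61.4%
Qwen 3.5 122B | 100 | 100 | 100 | 0 | 0 | 60.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 0 | 0 | 60.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude Opus 4 | 100 | 100 | 100 | 0 | 0 | 60.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 0 | 0 | 60.0%
Grok 4 Fast | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Large 2 | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 0 | 0 | 60.0%
Inception Mercury | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 70B | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-4.1 Nano | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude 3 Haiku | 100 | 100 | 100 | 0 | 0 | 60.0%
Rocinante 12B | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 79 | 67 | 50 | 0 | 59.0%
Ministral 3 14B | 100 | 100 | 50 | 7 | 0 | 51.4%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 50 | 0 | 0 | 50.0%
WizardLM 2 8x22b | 100 | 100 | 25 | 25 | 0 | 50.0%
DeepSeek V3.2 | 100 | 100 | 39 | 0 | 0 | 47.8%
Hermes 3 70B | 100 | 100 | 39 | 0 | 0 | 47.8%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 25 | 7 | 0 | 46.4%
Mistral Small 3.2 24B | 100 | 100 | 25 | 0 | 0 | 45.0%
GPT-5.4 (Reasoning) | 97 | 50 | 39 | 25 | 0 | 42.3%
o4 Mini High | 100 | 100 | 7 | 0 | 0 | 41.4%
Claude Haiku 4.5 | 100 | 100 | 7 | 0 | 0 | 41.4%
DeepSeek V3 (2025-03-24) | 100 | 100 | 7 | 0 | 0 | 41.4%
Mistral Small 4 (Reasoning) | 83 | 73 | 50 | 0 | 0 | 41.3%
Qwen 2.5 72B | 100 | 88 | 14 | 0 | 0 | 40.2%
Qwen 3.5 27B | 100 | 100 | 0 | 0 | 0 | 40.0%
Grok 4.1 Fast | 100 | 100 | 0 | 0 | 0 | 40.0%
Qwen 3.5 9B | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 0 | 0 | 0 | 40.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 0 | 0 | 0 | 40.0%
GPT-4.1 Mini | 100 | 100 | 0 | 0 | 0 | 40.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 0 | 0 | 0 | 40.0%
DeepSeek V3.1 | 100 | 100 | 0 | 0 | 0 | 40.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemma 3 4B | 100 | 100 | 0 | 0 | 0 | 40.0%
ByteDance Seed 1.6 Flash | 100 | 50 | 17 | 7 | 0 | 34.8%
Aion 2.0 | 100 | 39 | 7 | 0 | 0 | 29.2%
Mistral Medium 3.1 | 100 | 45 | 0 | 0 | 0 | 28.9%
Llama 3.1 8B | 100 | 39 | 0 | 0 | 0 | 27.8%
Gemma 3 27B | 100 | 25 | 0 | 0 | 0 | 25.0%
GPT-5.4 (Reasoning, Low) | 73 | 50 | 0 | 0 | 0 | 24.6%
GPT-5.4 Nano | 50 | 47 | 25 | 0 | 0 | 24.3%
GPT-5.4 | 97 | 17 | 0 | 0 | 0 | 22.8%
GPT-5.2 | 76 | 25 | 7 | 0 | 0 | 21.6%
Z.AI GLM 4.6 | 100 | 0 | 0 | 0 | 0 | 20.0%
Grok 4 | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral Large 3 | 100 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek-V2 Chat | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=1) | 100 | 0 | 0 | 0 | 0 | 20.0%
DeepSeek V3 (2024-12-26) | 100 | 0 | 0 | 0 | 0 | 20.0%
Hermes 3 405B | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-5 Nano | 100 | 0 | 0 | 0 | 0 | 20.0%
Qwen 3 32B | 100 | 0 | 0 | 0 | 0 | 20.0%
Ministral 3 8B | 100 | 0 | 0 | 0 | 0 | 20.0%
Cohere Command R+ (Aug. 2024) | 100 | 0 | 0 | 0 | 0 | 20.0%
Claude 3.7 Sonnet | 72 | 0 | 0 | 0 | 0 | 14.3%
Arcee AI: Trinity Large (Preview) | 45 | 25 | 0 | 0 | 0 | 13.9%
Writer: Palmyra X5 | 59 | 7 | 0 | 0 | 0 | 13.2%
GPT-5.1 | 50 | 0 | 0 | 0 | 0 | 10.0%
Z.AI GLM 4.7 Flash | 39 | 0 | 0 | 0 | 0 | 7.8%
GPT-4o, May 13th (temp=0) | 35 | 0 | 0 | 0 | 0 | 6.9%
Inception Mercury 2 | 25 | 7 | 0 | 0 | 0 | 6.4%
GPT-5.4 Mini (Reasoning) | 25 | 0 | 0 | 0 | 0 | 5.0%
Gemini 3 Flash (Preview, Reasoning) | 25 | 0 | 0 | 0 | 0 | 5.0%
Qwen 3.5 Plus (2026-02-15) | 25 | 0 | 0 | 0 | 0 | 5.0%
Nemotron 3 Nano | 25 | 0 | 0 | 0 | 0 | 5.0%
Gemini 3 Pro (Preview) | 7 | 0 | 0 | 0 | 0 | 1.4%
GPT-4.1 | 7 | 0 | 0 | 0 | 0 | 1.4%
GPT-5.4 Mini (Reasoning, Low) | 7 | 0 | 0 | 0 | 0 | 1.4%
Stealth: Aurora Alpha | 7 | 0 | 0 | 0 | 0 | 1.4%
Z.AI GLM 4.7 | 0 | 0 | 0 | 0 | 0 | 0.0%
o4 Mini | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Z.AI GLM 4.5 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3.1 Flash Lite (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Super | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.20 (Beta) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Mini | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Llama 3.1 Nemotron 70B | 0 | 0 | 0 | 0 | 0 | 0.0%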
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-5.1 | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 96 | 99.3%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 95 | 99.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 91 | 98.2%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 90 | 98.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 88 | 97.7%
GPT-5.2 | 100 | 100 | 100 | 96 | 89 | 97.1%
Grok 4 | 100 | 100 | 100 | 94 | 91 | 97.1%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 85 | 97.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-4.1 | 100 | 100 | 100 | 100 | 83 | 96.7%
Grok 4.1 Fast | 100 | 100 | 97 | 94 | 91 | 96.6%
Mistral Large | 100 | 100 | 100 | 100 | 82 | 96.4%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 81 | 96.2%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 80 | 96.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 80 | 96.0%
GPT-5 Mini | 100 | 100 | 100 | 96 | 83 | 96.0%
Inception Mercury | 100 | 100 | 100 | 100 | 79 | 95.7%
Mistral Small 4 | 100 | 100 | 100 | 100 | 79 | 95.7%
Ministral 3 8B | 100 | 100 | 100 | 100 | 79 | 95.7%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 89 | 88 | 95.4%
Ministral 3 3B | 100 | 100 | 100 | 100 | 76 | 95.2%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 91 | 82 | 94.6%
Mistral NeMO | 100 | 100 | 100 | 100 | 71 | 94.2%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 70 | 94.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 70 | 94.0%
MiniMax M2.5 | 100 | 100 | 100 | 96 | 73 | 93.8%
Claude 3.5 Sonnet | 100 | 100 | 100 | 75 | | 93.8%
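o4 Mini | 100 | 100 | 100 | 91 | 76 | 93.4%
Mistral Large 3 | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 8B | 100 | 100 | 100 | 100 | 65 | 93.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 64 | 92.9%
Qwen 3 32B | 100 | 100 | 100 | 100 | 62 | 92.4%
Gemini 2.5 Flash | 100 | 100 | 97 | 96 | 67 | 92.1%
LFM2 24B | 100 | 100 | 100 | 93 | 67 | 91.9%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 91 | 65 | 91.3%
Z.AI GLM 4.5 | 100 | 100 | 100 | 89 | 67 | 91.1%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 97 | 56 | 90.6%
Z.AI GLM 4.7 | 100 | 100 | 100 | 91 | 61 | 90.4%
Aion 2.0 | 100 | 100 | 100 | 100 | 46 | 89.2%
o4 Mini High | 100 | 100 | 100 | 98 | 47 | 89.0%
GPT-4.1 Mini | 100 | 93 | 91 | 81 | 73 | 87.6%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 35 | 86.9%
Grok 4 Fast | 100 | 94 | 89 | 85 | 47 | 83.4%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 85 | 28 | 82.7%
Stealth: Healer Alpha | 100 | 100 | 100 | 99 | 12 | 82.2%
Qwen 2.5 72B | 100 | 100 | 88 | 85 | 36 | 81.9%
Claude 3.7 Sonnet | 100 | 85 | 80 | 76 | 69 | 81.8%
Grok 4.20 (Beta) | 100 | 94 | 85 | 83 | 45 | 81.5%
Inception Mercury 2 | 100 | 100 | 100 | 71 | 35 | 81.1%
DeepSeek V3.2 | 100 | 100 | 90 | 64 | 50 | 80.9%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 97 | 57 | 50 | 80.9%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 54 | 41 | 79.0%
GPT-5.4 Nano | 100 | 100 | 75 | 60 | 58 | 78.5%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 83 | 7 | 78.1%
Gemini 2.5 Flash Lite | 86 | 85 | 80 | 64 | 61 | 75.3%
GPT-5.4 Nano (Reasoning, Low) | 83 | 81 | 78 | 68 | 67 | 75.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 50 | 25 | 75.0%
Gemini 3 Flash (Preview) | 97 | 82 | 80 | 59 | 56 | 75.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 39 | 35 | 74.7%
Claude 3.5 Haiku | 100 | 100 | 100 | 39 | 25 | 72.8%
Hermes 3 70B | 100 | 100 | 100 | 56 | 7 | 72.7%
GPT-5 Nano | 100 | 96 | 64 | 61 | 37 | 71.6%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 67 | 50 | 34 | 70.1%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 25 | 17 | 68.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 35 | 0 | 66.9%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 69 | 45 | 17 | 66.1%
Z.AI GLM 4.6 | 100 | 100 | 82 | 29 | 20 | 66.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 27 | 0 | 65.4%
Z.AI GLM 4.7 Flash | 100 | 89 | 47 | 41 | 37 | 63.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 59 | 55 | 0 | 62.8%
Rocinante 12B | 100 | 100 | 100 | 14 | 0 | 62.7%
WizardLM 2 8x22b | 100 | 100 | 50 | 39 | 12 | 60.2%
Nemotron 3 Nano | 100 | 61 | 53 | 52 | 32 | 59.8%
Llama 3.1 8B | 100 | 100 | 59 | 39 | 0 | 59.6%
Gemini 2.5 Pro | 100 | 86 | 65 | 25 | 18 | 58.8%
DeepSeek V3.1 | 100 | 68 | 55 | 42 | 25 | 58.0%
Gemma 3 12B | 82 | 75 | 71 | 32 | 25 | 57.0%
Nemotron 3 Super | 100 | 69 | 43 | 39 | 34 | 56.9%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 83 | 0 | 0 | 56.7%
GPT-4o, May 13th (temp=1) | 79 | 75 | 62 | 62 | 0 | 55.4%
GPT-4o Mini (temp=0) | 99 | 91 | 54 | 22 | 0 | 53.4%
Llama 3.1 Nemotron 70B | 100 | 88 | 79 | 0 | 0 | 53.2%
Gemma 3 4B | 100 | 70 | 62 | 22 | 0 | 50.7%
Gemini 3 Pro (Preview) | 96 | 64 | 43 | 25 | 20 | 49.5%
Hermes 3 405B | 100 | 100 | 25 | 7 | 0 | 46.4%
Llama 3.1 70B | 100 | 100 | 25 | 0 | 0 | 45.0%
Gemini 2.5 Flash (Reasoning) | 100 | 73 | 39 | 0 | 0 | 42.4%
GPT-4o Mini (temp=1) | 97 | 56 | 15 | 0 | 0 | 33.6%
Qwen 3.5 Plus (2026-02-15) | 77 | 31 | 29 | 25 | 0 | 32.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 30 | 12 | 0 | 0 | 28.4%
Claude 3 Haiku | 100 | 0 | 0 | 0 | 0 | 20.0%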
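Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 97 | 99.5%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 97 | 99.5%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 95 | 99.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 94 | 98.7%
Ministral 3 3B | 100 | 100 | 100 | 100 | 93 | 98.6%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 91 | 98.2%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 88 | 97.5%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 88 | 97.5%
GPT-5.1 | 100 | 100 | 100 | 96 | 89 | 96.9%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 83 | 96.7%
Mistral Large | 100 | 100 | 100 | 94 | 88 | 96.4%
Claude Opus 4 | 100 | 100 | 100 | 100 | 81 | 96.2%
Writer: Palmyra X5 | 100 | 100 | 100 | 93 | 86 | 95.8%
Claude Sonnet 4.6 | 100 | 100 | 100 | 88 | 85 | 94.6%
MiniMax M2.5 | 100 | 100 | 100 | 97 | 75 | 94.4%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 8B | 100 | 100 | 100 | 100 | 67 | 93.3%
Mistral Medium 3.1 | 100 | 100 | 100 | 93 | 70 | 92.6%
GPT-5 Mini | 100 | 100 | 100 | 100 | 62 | 92.5%
Inception Mercury | 100 | 100 | 100 | 100 | 59 | 91.8%
ByteDance Seed 1.6 Flash | 100 | 100 | 89 | 87 | 79 | 90.9%
Qwen 3.5 122B | 100 | 100 | 100 | 83 | 67 | 90.0%
Mistral Large 3 | 100 | 100 | 100 | 94 | 55 | 89.8%
Grok 4.20 (Beta) | 100 | 100 | 97 | 94 | 56 | 89.6%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 91 | 52 | 88.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 94 | 83 | 62 | 87.9%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 39 | 87.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 99 | 89 | 50 | 87.6%
Grok 4.20 (Beta, Reasoning) | 100 | 97 | 91 | 91 | 53 | 86.6%
GPT-4.1 | 100 | 100 | 100 | 81 | 50 | 86.2%
Aion 2.0 | 100 | 100 | 79 | 79 | 67 | 84.8%
o4 Mini High | 100 | 100 | 99 | 62 | 59 | 83.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 59 | 53 | 82.5%
Gemini 2.5 Pro | 100 | 97 | 79 | 69 | 59 | 80.8%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 405B | 100 | 100 | 100 | 85 | 11 | 79.3%
GPT-5.4 Nano | 100 | 100 | 100 | 83 | 11 | 78.9%
Mistral Small 4 | 100 | 100 | 79 | 69 | 47 | 78.8%
Llama 3.1 Nemotron 70B | 100 | 100 | 88 | 83 | 17 | 77.5%
Ministral 3 14B | 100 | 100 | 83 | 63 | 39 | 77.1%
Mistral Small Creative | 100 | 100 | 100 | 50 | 34 | 76.7%
Llama 3.1 70B | 100 | 100 | 100 | 79 | 0 | 75.7%
o4 Mini | 100 | 96 | 80 | 53 | 47 | 75.5%
Claude Haiku 4.5 | 100 | 100 | 76 | 73 | 25 | 74.8%
Ministral 8B | 100 | 91 | 91 | 76 | 14 | 74.4%
GPT-4.1 Mini | 100 | 100 | 83 | 79 | 7 | 73.8%
Nemotron 3 Super | 97 | 88 | 79 | 67 | 35 | 72.9%
Claude 3.5 Sonnet | 100 | 100 | 100 | 32 | 30 | 72.5%
Grok 4 | 100 | 100 | 70 | 69 | 21 | 72.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 54 | 0 | 70.8%
Qwen 3 32B | 100 | 100 | 69 | 59 | 25 | 70.7%
Qwen 3.5 27B | 100 | 80 | 77 | 53 | 39 | 69.8%
Stealth: Aurora Alpha | 100 | 100 | 90 | 59 | 0 | 69.8%
GPT-5.4 Nano (Reasoning) | 94 | 86 | 84 | 45 | 36 | 68.9%
Inception Mercury 2 | 100 | 90 | 73 | 63 | 18 | 68.9%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 44 | 0 | 68.7%
Z.AI GLM 4.5 | 100 | 75 | 73 | 53 | 42 | 68.6%
Qwen 3.5 9B | 100 | 100 | 100 | 42 | 0 | 68.4%
Stealth: Healer Alpha | 97 | 88 | 82 | 42 | 32 | 68.2%
DeepSeek-V2 Chat | 100 | 100 | 73 | 67 | 0 | 67.9%
GPT-4o Mini (temp=0) | 100 | 88 | 75 | 71 | 0 | 66.7%
Gemma 3 27B | 100 | 93 | 65 | 64 | 11 | 66.6%
Claude 3.7 Sonnet | 100 | 82 | 64 | 61 | 25 | 66.3%
WizardLM 2 8x22b | 100 | 100 | 91 | 39 | 0 | 66.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 56 | 45 | 17 | 63.5%
Grok 4.1 Fast | 91 | 64 | 56 | 45 | 44 | 60.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 0 | 0 | 60.0%
Qwen 3.5 35B | 100 | 100 | 64 | 30 | 0 | 58.9%
Arcee AI: Trinity Large (Preview) | 97 | 96 | 90 | 7 | 0 | 58.1%
Claude 3 Haiku | 79 | 73 | 50 | 39 | 35 | 55.0%
Llama 3.1 8B | 100 | 79 | 50 | 39 | 7 | 54.9%
Rocinante 12B | 100 | 100 | 73 | 0 | 0 | 54.6%
Mistral NeMO | 100 | 100 | 43 | 25 | 0 | 53.6%
Gemini 3 Flash (Preview, Reasoning) | 91 | 62 | 62 | 47 | 0 | 52.4%
DeepSeek V3.2 | 94 | 89 | 67 | 2 | 0 | 50.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 50 | 0 | 0 | 50.0%
Gemini 2.5 Flash (Reasoning) | 85 | 67 | 67 | 25 | 0 | 48.6%
Grok 4 Fast | 100 | 83 | 52 | 0 | 0 | 47.2%
GPT-4o, Aug. 6th (temp=0) | 73 | 71 | 67 | 20 | 0 | 46.1%
Qwen 2.5 72B | 100 | 100 | 28 | 0 | 0 | 45.6%
GPT-5 Nano | 81 | 61 | 37 | 22 | 10 | 42.1%
Z.AI GLM 4.7 Flash | 65 | 63 | 61 | 10 | 0 | 39.9%
Z.AI GLM 4.6 | 89 | 35 | 32 | 30 | 12 | 39.7%
LFM2 24B | 89 | 56 | 31 | 14 | 0 | 38.0%
GPT-4.1 Nano | 100 | 50 | 39 | 0 | 0 | 37.8%
GPT-4o Mini (temp=1) | 91 | 50 | 31 | 0 | 0 | 34.4%
Z.AI GLM 4.7 | 51 | 50 | 32 | 15 | 7 | 31.1%
DeepSeek V3.1 | 100 | 35 | 20 | 0 | 0 | 30.8%
GPT-4o, May 13th (temp=1) | 100 | 53 | 0 | 0 | 0 | 30.6%
Gemma 3 12B | 83 | 42 | 21 | 0 | 0 | 29.2%
Gemini 2.5 Flash | 62 | 55 | 17 | 7 | 0 | 28.2%
Gemini 2.5 Flash Lite | 71 | 42 | 28 | 0 | 0 | 28.1%
Hermes 3 70B | 100 | 17 | 0 | 0 | 0 | 23.3%
Claude 3.5 Haiku | 100 | 7 | 7 | 0 | 0 | 22.9%
Stealth: Hunter Alpha | 45 | 28 | 21 | 11 | 0 | 21.0%
GPT-4o, Aug. 6th (temp=1) | 50 | 35 | 17 | 2 | 0 | 20.6%
Gemma 3 4B | 89 | 11 | 0 | 0 | 0 | 20.1%
Gemini 2.5 Flash Lite (Reasoning) | 47 | 30 | 15 | 7 | 0 | 19.9%
Arcee AI: Trinity Mini | 89 | 7 | 0 | 0 | 0 | 19.3%
Gemini 3 Flash (Preview) | 47 | 39 | 0 | 0 | 0 | 17.1%
Nemotron 3 Nano | 80 | 0 | 0 | 0 | 0 | 16.0%
Gemini 3 Pro (Preview) | 23 | 10 | 7 | 0 | 0 | 8.1%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0.0%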
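Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 99 | 99.9%
Claude Sonnet 4.5 | 100 | 100 | 100 | 99 | 98 | 99.5%
Claude Opus 4 | 100 | 100 | 100 | 100 | 97 | 99.4%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 97 | 97 | 98.9%
GPT-5 Mini | 100 | 100 | 100 | 100 | 93 | 98.6%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 93 | 98.6%
Gemma 3 27B | 100 | 100 | 100 | 95 | 93 | 97.8%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 97 | 91 | 97.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 98 | 82 | 96.1%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 79 | 95.7%
Stealth: Aurora Alpha | 100 | 100 | 100 | 97 | 81 | 95.5%
MiniMax M2.7 | 100 | 100 | 100 | 89 | 86 | 95.1%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 73 | 94.6%
Ministral 3B | 100 | 100 | 100 | 100 | 70 | 94.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 69 | 93.8%
MiniMax M2.5 | 100 | 100 | 100 | 83 | 76 | 91.9%
Claude 3.7 Sonnet | 100 | 100 | 100 | 97 | 59 | 91.3%
GPT-5 Nano | 100 | 100 | 97 | 88 | 67 | 90.3%
GPT-5 | 100 | 100 | 97 | 91 | 60 | 89.7%
Ministral 3 8B | 100 | 100 | 100 | 78 | 70 | 89.5%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 46 | 89.2%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 44 | 88.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 88 | 46 | 86.7%
Mistral Small 4 | 100 | 100 | 100 | 81 | 52 | 86.4%
Claude Haiku 4.5 | 99 | 91 | 89 | 85 | 65 | 85.9%
Llama 3.1 70B | 100 | 94 | 91 | 83 | 59 | 85.6%
Claude Sonnet 4 | 100 | 100 | 97 | 79 | 52 | 85.5%
Qwen 2.5 72B | 100 | 100 | 100 | 81 | 39 | 84.0%
Mistral Large 2 | 100 | 100 | 100 | 94 | 25 | 83.9%
GPT-5.4 Nano (Reasoning, Low) | 100 | 91 | 80 | 77 | 68 | 83.3%
GPT-4.1 Nano | 100 | 100 | 100 | 63 | 50 | 82.6%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 81 | 25 | 81.2%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 67 | 39 | 81.1%
Ministral 8B | 100 | 96 | 91 | 84 | 33 | 81.0%
Qwen 3 32B | 100 | 100 | 85 | 69 | 50 | 80.9%
Mistral Large 3 | 100 | 100 | 83 | 73 | 47 | 80.8%
Mistral Medium 3.1 | 100 | 100 | 97 | 90 | 14 | 80.4%
Writer: Palmyra X5 | 100 | 99 | 82 | 80 | 39 | 80.1%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 91 | 0 | 78.2%
GPT-4.1 | 100 | 100 | 100 | 77 | 7 | 76.8%
o4 Mini High | 100 | 100 | 94 | 70 | 15 | 75.9%
Claude 3.5 Sonnet | 100 | 100 | 97 | 46 | 36 | 75.8%
GPT-5.1 | 96 | 91 | 85 | 82 | 16 | 73.8%
Ministral 3 14B | 100 | 100 | 100 | 67 | 0 | 73.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 50 | 11 | 72.2%
Qwen 3.5 122B | 100 | 100 | 83 | 73 | 0 | 71.3%
Stealth: Healer Alpha | 100 | 100 | 70 | 44 | 39 | 70.5%
Gemini 2.5 Flash Lite | 100 | 100 | 96 | 56 | 0 | 70.4%
Aion 2.0 | 100 | 89 | 63 | 59 | 39 | 70.1%
Stealth: Hunter Alpha | 100 | 100 | 77 | 73 | 0 | 70.0%
DeepSeek V3.2 | 100 | 79 | 73 | 48 | 41 | 68.2%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 39 | 0 | 67.8%
Qwen3 235B A22B Instruct 2507 | 97 | 69 | 63 | 61 | 47 | 67.5%
DeepSeek V3.1 | 97 | 93 | 75 | 54 | 17 | 67.1%
GPT-4o, Aug. 6th (temp=0) | 100 | 96 | 79 | 47 | 14 | 67.1%
Nemotron 3 Super | 100 | 100 | 81 | 52 | 2 | 67.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 20 | 12 | 66.5%
Inception Mercury 2 | 100 | 85 | 64 | 59 | 7 | 63.0%
MoonshotAI: Kimi K2.5 | 100 | 94 | 72 | 25 | 22 | 62.6%
Gemini 2.5 Pro | 99 | 82 | 61 | 50 | 21 | 62.5%
Qwen 3.5 9B | 100 | 91 | 70 | 31 | 15 | 61.5%
Llama 3.1 Nemotron 70B | 100 | 79 | 67 | 39 | 17 | 60.2%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-5.2 | 79 | 75 | 64 | 59 | 21 | 59.8%
GPT-5.4 Nano | 78 | 74 | 57 | 53 | 31 | 58.6%
Inception Mercury | 100 | 100 | 91 | 0 | 0 | 58.2%
GPT-5.4 Nano (Reasoning) | 97 | 93 | 57 | 31 | 13 | 58.2%
GPT-4o Mini (temp=0) | 100 | 81 | 67 | 27 | 12 | 57.4%
DeepSeek V3 (2024-12-26) | 100 | 100 | 47 | 39 | 0 | 57.3%
Gemini 2.5 Flash | 100 | 97 | 73 | 12 | 1 | 56.7%
o4 Mini | 88 | 85 | 56 | 52 | 0 | 56.1%
Mistral NeMO | 100 | 89 | 39 | 32 | 17 | 55.5%
Z.AI GLM 4.6 | 93 | 64 | 56 | 28 | 0 | 48.3%
GPT-4o Mini (temp=1) | 99 | 86 | 52 | 0 | 0 | 47.5%
Hermes 3 405B | 99 | 71 | 52 | 14 | 0 | 47.2%
Nemotron 3 Nano | 85 | 76 | 39 | 15 | 11 | 45.1%
Rocinante 12B | 100 | 79 | 32 | 0 | 0 | 42.2%
GPT-4.1 Mini | 81 | 59 | 47 | 14 | 7 | 41.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 2 | 0 | 0 | 40.4%
Gemini 3.1 Pro (Preview) | 100 | 100 | 0 | 0 | 0 | 40.0%
Z.AI GLM 4.5 | 100 | 67 | 14 | 11 | 0 | 38.4%
Grok 4.1 Fast | 81 | 39 | 39 | 21 | 0 | 36.0%
Llama 3.1 8B | 100 | 73 | 0 | 0 | 0 | 34.6%
Gemma 3 12B | 91 | 59 | 17 | 0 | 0 | 33.4%
Z.AI GLM 4.7 | 83 | 37 | 27 | 18 | 0 | 33.0%
DeepSeek-V2 Chat | 59 | 55 | 48 | 0 | 0 | 32.4%
Grok 4 | 100 | 53 | 4 | 0 | 0 | 31.4%
WizardLM 2 8x22b | 67 | 62 | 25 | 0 | 0 | 30.7%
LFM2 24B | 63 | 32 | 32 | 25 | 0 | 30.5%
Qwen 3.5 Plus (2026-02-15) | 82 | 55 | 15 | 0 | 0 | 30.3%
GPT-4o, May 13th (temp=1) | 77 | 28 | 11 | 7 | 0 | 24.6%
Z.AI GLM 4.7 Flash | 57 | 30 | 25 | 10 | 0 | 24.5%
Gemini 3 Pro (Preview) | 50 | 32 | 25 | 0 | 0 | 21.5%
Grok 4.20 (Beta, Reasoning) | 53 | 39 | 14 | 0 | 0 | 21.2%
GPT-4o, Aug. 6th (temp=1) | 67 | 32 | 0 | 0 | 0 | 19.8%
Grok 4.20 (Beta) | 65 | 25 | 7 | 1 | 0 | 19.6%
Gemini 3 Flash (Preview) | 73 | 0 | 0 | 0 | 0 | 14.6%
Hermes 3 70B | 70 | 0 | 0 | 0 | 0 | 14.0%
Claude 3 Haiku | 35 | 22 | 0 | 0 | 0 | 11.3%
Gemma 3 4B | 39 | 3 | 0 | 0 | 0 | 8.4%
Grok 4 Fast | 25 | 7 | 0 | 0 | 0 | 6.4%
Gemini 2.5 Flash (Reasoning) | 22 | 0 | 0 | 0 | 0 | 4.4%
Gemini 3 Flash (Preview, Reasoning) | 21 | 0 | 0 | 0 | 0 | 4.2%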
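Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 79 | 95.7%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 79 | 95.7%
GPT-5 | 100 | 100 | 100 | 100 | 70 | 94.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 3B | 100 | 100 | 100 | 100 | 55 | 91.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 79 | 73 | 90.3%
Ministral 3 8B | 100 | 100 | 100 | 79 | 67 | 89.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 45 | 88.9%
GPT-5.4 Mini | 100 | 100 | 100 | 75 | 61 | 87.1%
Inception Mercury | 100 | 100 | 100 | 100 | 25 | 85.0%
Qwen 3.5 9B | 100 | 100 | 100 | 59 | 50 | 81.8%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 83 | 73 | 50 | 81.3%
Mistral Large 2 | 100 | 100 | 100 | 97 | 7 | 80.9%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 91 | 88 | 79 | 42 | 79.9%
Ministral 3 14B | 100 | 100 | 100 | 94 | 0 | 78.9%
Claude Opus 4 | 100 | 100 | 100 | 79 | 14 | 78.4%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 85 | 81 | 25 | 78.3%
Aion 2.0 | 100 | 100 | 100 | 76 | 14 | 77.9%
Gemini 2.5 Pro | 100 | 100 | 100 | 88 | 0 | 77.5%
ByteDance Seed 1.6 | 100 | 100 | 100 | 67 | 17 | 76.7%
Mistral Medium 3.1 | 100 | 100 | 86 | 67 | 29 | 76.3%
Mistral Small 3.2 24B | 100 | 100 | 100 | 79 | 0 | 75.7%
Mistral Large 3 | 100 | 100 | 100 | 39 | 39 | 75.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 73 | 0 | 74.6%
GPT-5.4 | 100 | 100 | 76 | 67 | 29 | 74.3%
Writer: Palmyra X5 | 100 | 100 | 70 | 69 | 30 | 73.8%
Gemma 3 27B | 100 | 100 | 100 | 67 | 0 | 73.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 89 | 56 | 0 | 69.1%
DeepSeek V3.1 | 100 | 100 | 76 | 59 | 0 | 67.0%
Ministral 3B | 100 | 100 | 100 | 25 | 7 | 66.4%
Qwen 3.5 27B | 100 | 100 | 88 | 39 | 0 | 65.3%
Mistral NeMO | 100 | 100 | 92 | 25 | 0 | 63.4%
Mistral Small Creative | 100 | 100 | 100 | 12 | 0 | 62.4%
Claude Opus 4.5 | 100 | 77 | 73 | 53 | 0 | 60.6%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 70B | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Small 4 | 100 | 100 | 100 | 0 | 0 | 60.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 0 | 0 | 60.0%
Qwen 3.5 122B | 100 | 88 | 73 | 39 | 0 | 59.9%
ByteDance Seed 1.6 Flash | 100 | 100 | 99 | 0 | 0 | 59.7%
Grok 4 | 100 | 89 | 70 | 39 | 0 | 59.7%
Stealth: Aurora Alpha | 100 | 88 | 63 | 42 | 0 | 58.5%
GPT-5.4 (Reasoning) | 100 | 83 | 75 | 30 | 0 | 57.7%
Grok 4.1 Fast | 100 | 73 | 57 | 55 | 0 | 57.0%
Z.AI GLM 4.5 | 100 | 100 | 79 | 0 | 0 | 55.7%
Mistral Large | 100 | 100 | 79 | 0 | 0 | 55.7%
DeepSeek V3.2 | 100 | 100 | 67 | 7 | 0 | 54.8%
GPT-5 Nano | 100 | 70 | 67 | 25 | 0 | 52.3%
Hermes 3 405B | 100 | 100 | 59 | 0 | 0 | 51.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 59 | 50 | 50 | 0 | 51.8%
Qwen 2.5 72B | 100 | 100 | 32 | 25 | 0 | 51.5%
GPT-5.2 | 100 | 88 | 45 | 25 | 0 | 51.4%
GPT-5.4 Mini (Reasoning, Low) | 100 | 89 | 59 | 7 | 0 | 51.1%
GPT-5.4 (Reasoning, Low) | 100 | 83 | 70 | 0 | 0 | 50.7%
GPT-4o, May 13th (temp=0) | 100 | 91 | 59 | 0 | 0 | 50.1%
Rocinante 12B | 100 | 100 | 39 | 0 | 0 | 47.8%
Claude 3.7 Sonnet | 89 | 64 | 46 | 35 | 0 | 46.8%
Grok 4.20 (Beta) | 79 | 61 | 50 | 43 | 0 | 46.5%
DeepSeek-V2 Chat | 100 | 100 | 25 | 7 | 0 | 46.4%
Z.AI GLM 4.6 | 100 | 100 | 25 | 0 | 0 | 45.0%
Inception Mercury 2 | 100 | 57 | 50 | 14 | 0 | 44.1%
Cohere Command R+ (Aug. 2024) | 100 | 79 | 17 | 14 | 7 | 43.2%
GPT-5.4 Mini (Reasoning) | 100 | 73 | 35 | 0 | 0 | 41.5%
Llama 3.1 Nemotron 70B | 100 | 100 | 7 | 0 | 0 | 41.4%
Llama 3.1 8B | 100 | 100 | 7 | 0 | 0 | 41.4%
GPT-4.1 Nano | 100 | 100 | 7 | 0 | 0 | 41.4%
Hermes 3 70B | 100 | 73 | 25 | 7 | 0 | 41.0%
GPT-4.1 | 100 | 88 | 17 | 0 | 0 | 40.8%
Stealth: Healer Alpha | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 0 | 0 | 0 | 40.0%
Claude Haiku 4.5 | 100 | 100 | 0 | 0 | 0 | 40.0%
ByteDance Seed 2.0 Mini | 100 | 94 | 0 | 0 | 0 | 38.9%
Nemotron 3 Super | 100 | 50 | 39 | 0 | 0 | 37.8%
Stealth: Hunter Alpha | 100 | 59 | 25 | 0 | 0 | 36.8%
Claude 3.5 Sonnet | 100 | 83 | 0 | 0 | 0 | 36.7%
Gemma 3 12B | 55 | 47 | 45 | 32 | 0 | 35.7%
Grok 4 Fast | 100 | 50 | 0 | 0 | 0 | 30.0%
DeepSeek V3 (2025-03-24) | 100 | 25 | 25 | 0 | 0 | 30.0%
Gemma 3 4B | 76 | 70 | 0 | 0 | 0 | 29.2%
Arcee AI: Trinity Mini | 100 | 39 | 0 | 0 | 0 | 27.8%
GPT-5.4 Nano | 67 | 39 | 14 | 7 | 7 | 26.7%
LFM2 24B | 100 | 25 | 7 | 0 | 0 | 26.4%
GPT-5.1 | 63 | 50 | 15 | 0 | 0 | 25.7%
Z.AI GLM 4.7 Flash | 79 | 25 | 18 | 0 | 0 | 24.4%
Grok 4.20 (Beta, Reasoning) | 59 | 45 | 7 | 0 | 0 | 22.2%
ByteDance Seed 2.0 Lite | 100 | 7 | 0 | 0 | 0 | 21.4%
o4 Mini High | 100 | 7 | 0 | 0 | 0 | 21.4%
Claude 3 Haiku | 73 | 25 | 7 | 0 | 0 | 21.0%
o4 Mini | 100 | 0 | 0 | 0 | 0 | 20.0%
Nemotron 3 Nano | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o Mini (temp=1) | 88 | 0 | 0 | 0 | 0 | 17.5%
GPT-4.1 Mini | 67 | 7 | 7 | 0 | 0 | 16.2%
Qwen 3 32B | 63 | 17 | 0 | 0 | 0 | 15.9%
Gemini 3 Flash (Preview) | 45 | 30 | 0 | 0 | 0 | 14.9%
GPT-4o, May 13th (temp=1) | 73 | 0 | 0 | 0 | 0 | 14.6%
Qwen 3.5 Plus (2026-02-15) | 45 | 0 | 0 | 0 | 0 | 8.9%
GPT-4o Mini (temp=0) | 43 | 0 | 0 | 0 | 0 | 8.6%
Z.AI GLM 4.7 | 7 | 0 | 0 | 0 | 0 | 1.4%
Gemini 3 Flash (Preview, Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%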

Novelcrafter Default Prompt

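Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 98 | 99.7%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 94 | 98.9%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 89 | 97.9%
Claude Opus 4 | 100 | 100 | 100 | 100 | 89 | 97.8%
Claude Opus 4.5 | 100 | 100 | 100 | 96 | 89 | 96.9%
Mistral Large 2 | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-5.4 | 100 | 100 | 100 | 97 | 85 | 96.6%
Qwen 3.5 Flash | 100 | 100 | 100 | 97 | 84 | 96.2%
Claude Sonnet 4.6 | 100 | 100 | 100 | 88 | 88 | 95.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 97 | 75 | 94.5%
Qwen 3.5 27B | 100 | 100 | 89 | 89 | 88 | 93.0%
Mistral Large | 100 | 100 | 100 | 96 | 63 | 91.8%
Z.AI GLM 4.7 | 100 | 100 | 91 | 86 | 77 | 90.9%
GPT-5.4 (Reasoning) | 100 | 100 | 92 | 74 | 71 | 87.5%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 83 | 53 | 87.3%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 65 | 56 | 84.1%
Qwen 3.5 122B | 100 | 100 | 100 | 70 | 50 | 84.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 18 | 83.7%
Claude Sonnet 4.5 | 100 | 100 | 99 | 88 | 25 | 82.2%
ByteDance Seed 1.6 Flash | 100 | 100 | 85 | 77 | 42 | 80.8%
Ministral 3B | 94 | 91 | 83 | 73 | 59 | 80.2%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 70B | 100 | 83 | 79 | 71 | 50 | 76.6%
Grok 4.1 Fast | 100 | 100 | 100 | 81 | 0 | 76.2%
Ministral 3 8B | 100 | 100 | 79 | 50 | 50 | 75.7%
GPT-5.1 | 100 | 100 | 100 | 74 | 0 | 74.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 91 | 80 | 0 | 74.3%
Mistral Small Creative | 100 | 92 | 81 | 73 | 20 | 73.2%
MiniMax M2.7 | 97 | 88 | 67 | 53 | 47 | 70.4%
DeepSeek V3 (2024-12-26) | 100 | 100 | 73 | 70 | 7 | 70.0%
Mistral Medium 3.1 | 100 | 100 | 67 | 63 | 18 | 69.6%
MiniMax M2.5 | 100 | 100 | 73 | 45 | 30 | 69.6%
Ministral 8B | 100 | 88 | 83 | 67 | 0 | 67.5%
GPT-5 | 100 | 100 | 97 | 34 | 0 | 66.1%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 50 | 41 | 20 | 62.2%
Gemini 2.5 Pro | 100 | 100 | 100 | 7 | 0 | 61.4%
GPT-5.4 Nano (Reasoning) | 100 | 88 | 79 | 41 | 0 | 61.3%
Mistral NeMO | 100 | 100 | 73 | 25 | 7 | 61.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 94 | 0 | 0 | 58.9%
GPT-5.4 Mini (Reasoning, Low) | 98 | 83 | 57 | 27 | 25 | 58.2%
Grok 4 Fast | 100 | 94 | 36 | 32 | 25 | 57.5%
Grok 4 | 100 | 100 | 63 | 17 | 0 | 55.9%
Ministral 3 3B | 97 | 83 | 59 | 39 | 0 | 55.7%
Inception Mercury | 100 | 78 | 48 | 23 | 14 | 52.7%
Mistral Small 4 | 100 | 86 | 43 | 32 | 0 | 52.3%
GPT-4.1 | 100 | 85 | 50 | 25 | 0 | 52.1%
DeepSeek V3 (2025-03-24) | 100 | 83 | 45 | 32 | 0 | 52.1%
Z.AI GLM 4.6 | 100 | 100 | 59 | 0 | 0 | 51.8%
Claude Sonnet 4 | 77 | 63 | 47 | 39 | 20 | 49.0%
GPT-5.4 Mini | 80 | 78 | 65 | 12 | 9 | 48.9%
Qwen 3 32B | 100 | 85 | 47 | 0 | 0 | 46.4%
Claude 3.5 Sonnet | 100 | 59 | 39 | 25 | 0 | 44.6%
GPT-4o, Aug. 6th (temp=0) | 100 | 83 | 35 | 0 | 0 | 43.6%
GPT-5.4 Nano | 52 | 46 | 46 | 39 | 27 | 41.8%
Ministral 3 14B | 88 | 63 | 55 | 0 | 0 | 41.1%
o4 Mini High | 83 | 59 | 39 | 20 | 0 | 40.2%
Gemini 3 Flash (Preview) | 100 | 100 | 0 | 0 | 0 | 40.0%
DeepSeek-V2 Chat | 97 | 69 | 12 | 9 | 0 | 37.5%
Nemotron 3 Super | 100 | 35 | 25 | 25 | 0 | 36.9%
LFM2 24B | 100 | 73 | 11 | 0 | 0 | 36.8%
Aion 2.0 | 56 | 48 | 47 | 32 | 0 | 36.6%
GPT-4.1 Nano | 94 | 88 | 0 | 0 | 0 | 36.4%
Claude Haiku 4.5 | 63 | 47 | 35 | 32 | 0 | 35.3%
GPT-4o, May 13th (temp=0) | 50 | 48 | 47 | 31 | 0 | 35.1%
WizardLM 2 8x22b | 89 | 80 | 0 | 0 | 0 | 33.9%
o4 Mini | 100 | 36 | 17 | 0 | 0 | 30.5%
Qwen3 235B A22B Instruct 2507 | 80 | 70 | 0 | 0 | 0 | 30.0%
GPT-5 Mini | 75 | 41 | 17 | 14 | 0 | 29.3%
Stealth: Healer Alpha | 100 | 47 | 0 | 0 | 0 | 29.3%
GPT-5.4 Mini (Reasoning) | 54 | 39 | 37 | 13 | 0 | 28.5%
Llama 3.1 Nemotron 70B | 67 | 67 | 0 | 0 | 0 | 26.7%
Writer: Palmyra X5 | 75 | 55 | 0 | 0 | 0 | 26.0%
Hermes 3 405B | 67 | 63 | 0 | 0 | 0 | 25.9%
Nemotron 3 Nano | 79 | 50 | 0 | 0 | 0 | 25.7%
Qwen 2.5 72B | 69 | 29 | 28 | 0 | 0 | 25.2%
Stealth: Hunter Alpha | 70 | 39 | 11 | 0 | 0 | 24.0%
DeepSeek V3.2 | 59 | 50 | 0 | 0 | 0 | 21.8%
Inception Mercury 2 | 56 | 52 | 0 | 0 | 0 | 21.7%
GPT-4.1 Mini | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-5 Nano | 88 | 7 | 0 | 0 | 0 | 18.9%
DeepSeek V3.1 | 76 | 17 | 0 | 0 | 0 | 18.5%
Llama 3.1 8B | 88 | 0 | 0 | 0 | 0 | 17.5%
Rocinante 12B | 88 | 0 | 0 | 0 | 0 | 17.5%
Mistral Small 3.2 24B | 49 | 0 | 0 | 0 | 0 | 9.7%
Stealth: Aurora Alpha | 43 | 0 | 0 | 0 | 0 | 8.6%
GPT-4o, May 13th (temp=1) | 25 | 17 | 0 | 0 | 0 | 8.3%
Cohere Command R+ (Aug. 2024) | 25 | 14 | 0 | 0 | 0 | 7.7%
Z.AI GLM 4.5 | 36 | 0 | 0 | 0 | 0 | 7.3%
Arcee AI: Trinity Large (Preview) | 36 | 0 | 0 | 0 | 0 | 7.1%
GPT-4o, Aug. 6th (temp=1) | 32 | 0 | 0 | 0 | 0 | 6.5%
GPT-5.2 | 25 | 0 | 0 | 0 | 0 | 5.0%
Claude 3 Haiku | 14 | 7 | 0 | 0 | 0 | 4.2%
Gemma 3 12B | 12 | 0 | 0 | 0 | 0 | 2.4%
Z.AI GLM 4.7 Flash | 7 | 2 | 0 | 0 | 0 | 1.8%
Gemini 2.5 Flash | 7 | 0 | 0 | 0 | 0 | 1.4%
Gemma 3 4B | 2 | 0 | 0 | 0 | 0 | 0.4%
Gemini 3 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash Lite (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.20 (Beta) | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.5 Haiku | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3.7 Sonnet | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash Lite | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0.0%
Hermes 3 70B | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0.0%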
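Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 94 | 59 | 90.7%
GPT-5 | 100 | 100 | 98 | 83 | 53 | 87.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 67 | 67 | 86.7%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 29 | 85.8%
GPT-5.4 Nano (Reasoning, Low) | 100 | 94 | 73 | 73 | 73 | 82.7%
GPT-5.4 Nano | 100 | 100 | 81 | 70 | 56 | 81.5%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 0 | 80.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3B | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 59 | 39 | 79.6%
Inception Mercury | 100 | 100 | 100 | 91 | 0 | 78.2%
ByteDance Seed 1.6 Flash | 100 | 94 | 94 | 79 | 20 | 77.4%
MiniMax M2.7 | 100 | 100 | 83 | 50 | 39 | 74.4%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 50 | 0 | 70.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 50 | 0 | 70.0%
GPT-5 Mini | 100 | 100 | 60 | 57 | 29 | 69.3%
Qwen 3.5 27B | 100 | 100 | 88 | 59 | 0 | 69.3%
Mistral Large | 100 | 100 | 100 | 35 | 0 | 66.9%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 25 | 0 | 65.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 7 | 7 | 62.9%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 7 | 0 | 61.4%
Mistral Large 3 | 100 | 100 | 100 | 7 | 0 | 61.4%
Qwen 3.5 122B | 100 | 100 | 100 | 0 | 0 | 60.0%
Aion 2.0 | 100 | 100 | 100 | 0 | 0 | 60.0%
Stealth: Hunter Alpha | 100 | 100 | 50 | 50 | 0 | 60.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 0 | 0 | 60.0%
DeepSeek V3.1 | 100 | 100 | 100 | 0 | 0 | 60.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-4.1 Nano | 100 | 100 | 100 | 0 | 0 | 60.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3 3B | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude Sonnet 4 | 100 | 100 | 94 | 0 | 0 | 58.9%
Claude Sonnet 4.5 | 100 | 100 | 50 | 39 | 0 | 57.8%
Gemini 3 Pro (Preview) | 100 | 100 | 59 | 25 | 0 | 56.8%
Claude Opus 4 | 100 | 91 | 83 | 0 | 0 | 54.9%
Llama 3.1 8B | 100 | 100 | 73 | 0 | 0 | 54.6%
GPT-4.1 | 100 | 100 | 50 | 0 | 0 | 50.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 39 | 0 | 0 | 47.8%
GPT-5.4 (Reasoning, Low) | 91 | 73 | 55 | 14 | 0 | 46.5%
Ministral 8B | 100 | 100 | 25 | 7 | 0 | 46.4%
Z.AI GLM 4.7 Flash | 100 | 100 | 25 | 0 | 0 | 45.0%
GPT-5.4 | 100 | 83 | 25 | 0 | 0 | 41.7%
Z.AI GLM 4.5 | 100 | 50 | 25 | 25 | 7 | 41.4%
Grok 4 Fast | 100 | 100 | 7 | 0 | 0 | 41.4%
Qwen 3 32B | 100 | 79 | 25 | 0 | 0 | 40.7%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 0 | 0 | 0 | 40.0%
o4 Mini High | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 0 | 0 | 0 | 40.0%
Qwen 3.5 Flash | 100 | 100 | 0 | 0 | 0 | 40.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 0 | 0 | 0 | 40.0%
Nemotron 3 Super | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemini 2.5 Flash Lite | 100 | 100 | 0 | 0 | 0 | 40.0%
Llama 3.1 70B | 100 | 100 | 0 | 0 | 0 | 40.0%
Nemotron 3 Nano | 100 | 100 | 0 | 0 | 0 | 40.0%
Mistral Small 4 | 100 | 100 | 0 | 0 | 0 | 40.0%
Ministral 3 14B | 100 | 100 | 0 | 0 | 0 | 40.0%
DeepSeek V3 (2024-12-26) | 100 | 88 | 0 | 0 | 0 | 37.5%
GPT-5.4 (Reasoning) | 100 | 79 | 0 | 0 | 0 | 35.7%
DeepSeek V3.2 | 100 | 73 | 0 | 0 | 0 | 34.6%
Cohere Command R+ (Aug. 2024) | 100 | 59 | 0 | 0 | 0 | 31.8%
Claude Opus 4.6 | 100 | 50 | 7 | 0 | 0 | 31.4%
Z.AI GLM 4.7 | 100 | 50 | 0 | 0 | 0 | 30.0%
Gemini 3 Flash (Preview) | 100 | 32 | 7 | 0 | 0 | 27.9%
Hermes 3 405B | 100 | 25 | 7 | 0 | 0 | 26.4%
GPT-5.1 | 94 | 31 | 0 | 0 | 0 | 25.2%
GPT-5.4 Mini (Reasoning) | 73 | 50 | 0 | 0 | 0 | 24.6%
o4 Mini | 100 | 7 | 0 | 0 | 0 | 21.4%
GPT-4o, Aug. 6th (temp=0) | 100 | 7 | 0 | 0 | 0 | 21.4%
Qwen3 235B A22B Instruct 2507 | 100 | 7 | 0 | 0 | 0 | 21.4%
Claude 3.5 Sonnet | 100 | 0 | 0 | 0 | 0 | 20.0%
Grok 4.20 (Beta) | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral Medium 3.1 | 100 | 0 | 0 | 0 | 0 | 20.0%
Arcee AI: Trinity Large (Preview) | 100 | 0 | 0 | 0 | 0 | 20.0%
Claude 3 Haiku | 100 | 0 | 0 | 0 | 0 | 20.0%
Mistral NeMO | 100 | 0 | 0 | 0 | 0 | 20.0%
Rocinante 12B | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-4o, May 13th (temp=1) | 83 | 0 | 0 | 0 | 0 | 16.7%
Inception Mercury 2 | 73 | 7 | 0 | 0 | 0 | 16.0%
Mistral Large 2 | 73 | 7 | 0 | 0 | 0 | 16.0%
DeepSeek-V2 Chat | 67 | 0 | 0 | 0 | 0 | 13.3%
GPT-4.1 Mini | 25 | 25 | 0 | 0 | 0 | 10.0%
Claude 3.7 Sonnet | 39 | 0 | 0 | 0 | 0 | 7.8%
GPT-5.4 Mini (Reasoning, Low) | 39 | 0 | 0 | 0 | 0 | 7.8%
Stealth: Aurora Alpha | 35 | 0 | 0 | 0 | 0 | 6.9%
Grok 4 | 32 | 0 | 0 | 0 | 0 | 6.5%
GPT-5 Nano | 25 | 0 | 0 | 0 | 0 | 5.0%
GPT-5.4 Mini | 21 | 0 | 0 | 0 | 0 | 4.2%
Grok 4.1 Fast | 7 | 0 | 0 | 0 | 0 | 1.4%
Gemma 3 12B | 7 | 0 | 0 | 0 | 0 | 1.4%
Llama 3.1 Nemotron 70B | 7 | 0 | 0 | 0 | 0 | 1.4%
GPT-5.2 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o, Aug. 6th (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 2.5 72B | 0 | 0 | 0 | 0 | 0 | 0.0%
Hermes 3 70B | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0.0%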
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 99 | 99.7%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 97 | 99.5%
GPT-4.1 | 100 | 100 | 100 | 100 | 97 | 99.5%
Gemma 3 27B | 100 | 100 | 100 | 100 | 97 | 99.5%
Ministral 3 8B | 100 | 100 | 100 | 100 | 97 | 99.5%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 96 | 99.2%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 96 | 99.2%
Mistral Small 4 | 100 | 100 | 100 | 100 | 96 | 99.2%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.9%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 91 | 98.2%
LFM2 24B | 100 | 100 | 100 | 100 | 91 | 98.2%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 90 | 98.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 97 | 90 | 97.5%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 88 | 97.5%
Aion 2.0 | 100 | 100 | 100 | 100 | 87 | 97.3%
GPT-5 Mini | 100 | 100 | 100 | 100 | 85 | 97.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 83 | 96.7%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 83 | 96.7%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemini 3 Flash (Preview) | 100 | 100 | 96 | 91 | 89 | 95.3%
Claude Sonnet 4 | 100 | 100 | 100 | 93 | 82 | 95.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 62 | 92.4%
DeepSeek V3.1 | 100 | 100 | 100 | 85 | 77 | 92.3%
Qwen 3 32B | 100 | 100 | 100 | 100 | 59 | 91.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 82 | 73 | 91.1%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 50 | 90.0%
GPT-4.1 Nano | 100 | 100 | 97 | 97 | 53 | 89.6%
Qwen 2.5 72B | 100 | 94 | 93 | 89 | 70 | 89.3%
Grok 4.20 (Beta) | 100 | 100 | 99 | 76 | 71 | 89.1%
Claude Haiku 4.5 | 100 | 100 | 100 | 71 | 69 | 88.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 98 | 39 | 87.4%
Rocinante 12B | 100 | 100 | 100 | 100 | 35 | 86.9%
Mistral NeMO | 100 | 100 | 100 | 100 | 31 | 86.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 94 | 70 | 63 | 85.5%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 67 | 59 | 85.2%
GPT-5 Nano | 100 | 100 | 100 | 62 | 55 | 83.5%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 14 | 82.9%
Grok 4 | 100 | 100 | 100 | 100 | 13 | 82.5%
Gemini 2.5 Flash Lite | 100 | 100 | 96 | 67 | 46 | 81.7%
Hermes 3 405B | 100 | 100 | 100 | 59 | 47 | 81.1%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 47 | 33 | 76.0%
Arcee AI: Trinity Mini | 100 | 100 | 76 | 69 | 10 | 70.9%
Nemotron 3 Nano | 93 | 85 | 69 | 67 | 41 | 70.8%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 41 | 0 | 68.3%
Llama 3.1 8B | 100 | 100 | 100 | 7 | 0 | 61.4%
ByteDance Seed 1.6 | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 70B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash Lite (Reasoning) | 93 | 59 | 52 | 50 | 43 | 59.4%
Gemini 2.5 Flash | 100 | 97 | 50 | 45 | 0 | 58.4%
GPT-5.2 | 100 | 69 | 57 | 34 | 28 | 57.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 67 | 18 | 0 | 56.9%
Hermes 3 70B | 100 | 81 | 76 | 17 | 0 | 54.7%
Claude 3 Haiku | 100 | 93 | 71 | 0 | 0 | 52.8%
GPT-4o, Aug. 6th (temp=0) | 85 | 59 | 57 | 20 | 18 | 47.8%
Claude 3.5 Haiku | 100 | 100 | 25 | 0 | 0 | 45.0%
GPT-4o Mini (temp=1) | 68 | 54 | 50 | 43 | 0 | 42.9%
Gemma 3 12B | 76 | 67 | 52 | 0 | 0 | 38.9%
GPT-4o, Aug. 6th (temp=1) | 97 | 50 | 36 | 0 | 0 | 36.6%
GPT-4o, May 13th (temp=1) | 100 | 21 | 20 | 20 | 14 | 34.9%
Gemma 3 4B | 32 | 31 | 30 | 12 | 0 | 21.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100.0%
Z.AI GLM 5 Turbo 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100.0%
Qwen 3.5 122B 100 100 100 100 100 100.0%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100.0%
Qwen 3.5 27B 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning) 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100.0%
Claude Sonnet 4 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Mini 100 100 100 100 100 100.0%
Qwen 3.5 9B 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Lite 100 100 100 100 100 100.0%
Nemotron 3 Super 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100.0%
GPT-5.4 Mini 100 100 100 100 100 100.0%
Mistral Large 2 100 100 100 100 100 100.0%
Mistral Small 4 (Reasoning) 100 100 100 100 100 100.0%
DeepSeek V3.2 100 100 100 100 100 100.0%
GPT-5.4 Nano (Reasoning) 100 100 100 100 100 100.0%
GPT-5.4 Nano (Reasoning, Low) 100 100 100 100 100 100.0%
Mistral Small 3.2 24B 100 100 100 100 100 100.0%
Mistral Medium 3.1 100 100 100 100 100 100.0%
Mistral Small 4 100 100 100 100 100 100.0%
GPT-5.4 Nano 100 100 100 100 100 100.0%
Mistral Small Creative 100 100 100 100 100 100.0%
Mistral Large 3 100 100 100 100 94 98.9%
DeepSeek V3 (2025-03-24) 100 100 100 100 94 98.9%
MiniMax M2.5 100 100 100 100 92 98.5%
Claude Opus 4.6 100 100 100 100 91 98.2%
Gemini 2.5 Pro 100 100 100 100 88 97.5%
Writer: Palmyra X5 100 100 100 93 93 97.3%
Claude Opus 4 100 100 100 100 86 97.2%
Stealth: Healer Alpha 100 100 100 100 86 97.2%
MiniMax M2.7 100 100 100 100 85 97.1%
DeepSeek V3 (2024-12-26) 100 100 100 100 83 96.7%
Grok 4.20 (Beta) 100 100 99 94 88 96.1%
o4 Mini 100 100 100 100 79 95.7%
ByteDance Seed 1.6 Flash 100 100 100 100 79 95.7%
Mistral Large 100 100 100 100 76 95.2%
Aion 2.0 100 100 100 100 75 95.0%
Qwen 3.5 397B A17B 100 100 100 98 76 94.8%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 70 94.0%
Ministral 8B 100 100 100 100 70 94.0%
Claude 3.7 Sonnet 100 100 100 100 65 93.1%
Gemini 3.1 Pro (Preview) 100 100 100 100 59 91.8%
MoonshotAI: Kimi K2.5 100 100 100 100 57 91.4%
Z.AI GLM 4.5 100 100 100 100 54 90.8%
Qwen 3.5 Flash 100 100 100 100 52 90.4%
Ministral 3 8B 100 100 100 83 67 90.0%
WizardLM 2 8x22b 100 100 100 93 53 89.3%
Z.AI GLM 4.6 100 100 100 81 64 89.1%
Z.AI GLM 4.7 100 100 100 81 63 88.8%
GPT-5 100 100 100 100 44 88.7%
LFM2 24B 100 100 100 93 47 88.1%
Claude 3.5 Sonnet 100 100 100 100 39 87.8%
Stealth: Hunter Alpha 100 100 100 86 47 86.7%
Claude 3 Haiku 100 100 100 67 62 85.7%
DeepSeek V3.1 100 100 100 81 47 85.5%
Grok 4.1 Fast 100 100 100 100 28 85.5%
Claude Haiku 4.5 100 100 100 70 56 85.3%
Qwen3 235B A22B Instruct 2507 100 100 100 81 45 85.2%
Gemini 3.1 Flash Lite (Preview) 100 100 100 59 55 82.8%
Qwen 3 32B 100 95 83 79 56 82.7%
Ministral 3B 100 100 100 59 47 81.1%
Z.AI GLM 5 100 100 100 100 0 80.0%
ByteDance Seed 1.6 100 100 100 100 0 80.0%
GPT-5 Mini 100 89 75 63 62 77.8%
GPT-5 Nano 100 97 94 83 9 76.9%
Nemotron 3 Nano 100 100 73 67 43 76.5%
DeepSeek-V2 Chat 100 100 100 61 18 75.8%
o4 Mini High 100 100 100 62 11 74.6%
Inception Mercury 100 100 88 83 0 74.2%
Arcee AI: Trinity Large (Preview) 100 100 88 83 0 74.2%
GPT-5.2 100 100 100 70 0 74.0%
Gemma 3 27B 100 100 83 67 0 70.0%
Ministral 3 14B 100 100 100 25 20 68.9%
GPT-4.1 Mini 99 81 71 59 34 68.7%
GPT-4o, May 13th (temp=0) 100 100 100 39 4 68.5%
Gemini 2.5 Flash 100 97 79 65 0 68.2%
Llama 3.1 8B 100 100 79 59 0 67.5%
GPT-4o, Aug. 6th (temp=0) 100 100 79 48 7 66.7%
Inception Mercury 2 100 100 97 25 2 64.8%
Hermes 3 405B 100 97 94 17 14 64.4%
Qwen 3.5 35B 100 85 68 38 25 63.3%
GPT-4.1 Nano 94 93 67 53 0 61.4%
Grok 4 91 83 83 28 17 60.5%
Cohere Command R+ (Aug. 2024) 100 100 100 0 0 60.0%
Llama 3.1 70B 100 100 97 0 0 59.5%
Mistral NeMO 100 100 97 0 0 59.5%
GPT-4.1 91 70 69 39 28 59.4%
Ministral 3 3B 100 73 63 59 0 59.0%
Grok 4 Fast 99 79 59 56 0 58.5%
Qwen 2.5 72B 100 100 81 0 0 56.2%
Gemini 3 Pro (Preview) 100 61 38 37 25 52.0%
Gemini 3 Flash (Preview, Reasoning) 100 99 59 0 0 51.7%
Z.AI GLM 4.7 Flash 99 77 51 25 0 50.4%
Hermes 3 70B 97 88 45 15 0 48.9%
Gemini 3 Flash (Preview) 91 89 50 11 0 48.2%
Arcee AI: Trinity Mini 100 73 59 2 0 46.8%
Gemini 2.5 Flash Lite (Reasoning) 100 79 55 0 0 46.7%
Llama 3.1 Nemotron 70B 100 94 20 7 0 44.2%
Stealth: Aurora Alpha 97 75 25 12 0 41.8%
Gemini 2.5 Flash (Reasoning) 100 91 14 4 0 41.7%
Gemma 3 12B 91 76 28 0 0 39.0%
Rocinante 12B 100 39 30 7 0 35.2%
GPT-4o Mini (temp=0) 72 48 17 15 12 32.7%
Qwen 3.5 Plus (2026-02-15) 71 52 20 0 0 28.5%
GPT-4o Mini (temp=1) 77 28 18 0 0 24.7%
Gemini 2.5 Flash Lite 56 50 7 0 0 22.7%
Gemma 3 4B 73 1 0 0 0 14.8%
GPT-4o, May 13th (temp=1) 18 18 7 0 0 8.6%
GPT-4o, Aug. 6th (temp=1) 7 0 0 0 0 1.4%
Claude 3.5 Haiku 0 0 0 0 0 0.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100.0%
Z.AI GLM 5 Turbo 100 100 100 100 100 100.0%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100.0%
GPT-5 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100.0%
Qwen 3.5 122B 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100.0%
Z.AI GLM 5 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100.0%
Qwen 3.5 27B 100 100 100 100 100 100.0%
ByteDance Seed 1.6 100 100 100 100 100 100.0%
Claude Opus 4.5 100 100 100 100 100 100.0%
Aion 2.0 100 100 100 100 100 100.0%
Z.AI GLM 4.6 100 100 100 100 100 100.0%
Qwen 3.5 35B 100 100 100 100 100 100.0%
Qwen 3.5 Flash 100 100 100 100 100 100.0%
Z.AI GLM 4.5 100 100 100 100 100 100.0%
Qwen 3.5 9B 100 100 100 100 100 100.0%
Stealth: Healer Alpha 100 100 100 100 100 100.0%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 100 100.0%
GPT-5.4 Mini (Reasoning, Low) 100 100 100 100 100 100.0%
Mistral Large 3 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Lite 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100.0%
GPT-5.4 Mini 100 100 100 100 100 100.0%
Mistral Large 2 100 100 100 100 100 100.0%
DeepSeek V3.2 100 100 100 100 100 100.0%
Mistral Large 100 100 100 100 100 100.0%
Writer: Palmyra X5 100 100 100 100 100 100.0%
GPT-5.4 Nano (Reasoning, Low) 100 100 100 100 100 100.0%
Mistral Medium 3.1 100 100 100 100 100 100.0%
GPT-5.4 Nano 100 100 100 100 100 100.0%
Ministral 3 14B 100 100 100 100 100 100.0%
MoonshotAI: Kimi K2.5 100 100 100 100 97 99.5%
Gemini 3 Flash (Preview, Reasoning) 100 100 100 100 97 99.5%
DeepSeek V3 (2025-03-24) 100 100 100 100 97 99.5%
Mistral Small 4 (Reasoning) 100 100 100 100 97 99.3%
Claude Opus 4 100 100 100 100 94 98.9%
Ministral 3 3B 100 100 100 100 94 98.7%
Z.AI GLM 4.7 100 100 100 100 93 98.7%
Mistral Small Creative 100 100 100 100 93 98.6%
Grok 4.1 Fast 100 100 100 100 92 98.4%
Mistral Small 4 100 100 100 100 91 98.2%
Gemini 3.1 Pro (Preview) 100 100 100 100 91 98.1%
o4 Mini 100 100 100 100 90 98.0%
Claude Sonnet 4.5 100 100 100 100 88 97.5%
Claude Haiku 4.5 100 100 100 100 88 97.5%
WizardLM 2 8x22b 100 100 100 100 88 97.5%
GPT-4o Mini (temp=0) 100 100 97 96 94 97.5%
Stealth: Hunter Alpha 100 100 99 96 91 97.3%
ByteDance Seed 1.6 Flash 100 100 100 100 85 97.0%
MiniMax M2.5 100 100 100 96 85 96.4%
MiniMax M2.7 100 100 100 100 80 96.0%
GPT-5.4 Mini (Reasoning) 100 100 100 100 80 95.9%
o4 Mini High 100 100 100 95 83 95.8%
GPT-5 Mini 100 100 100 100 79 95.7%
Claude Sonnet 4 100 100 100 100 77 95.4%
GPT-4.1 Nano 100 100 100 100 73 94.6%
LFM2 24B 100 100 100 90 82 94.4%
Qwen 3.5 Plus (2026-02-15) 100 100 100 90 80 94.1%
Gemini 3 Pro (Preview) 100 100 97 90 81 93.6%
Claude 3.5 Sonnet 100 100 100 100 67 93.3%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 62 92.4%
GPT-4.1 100 100 100 97 64 92.3%
Qwen3 235B A22B Instruct 2507 100 100 100 94 64 91.7%
Nemotron 3 Super 100 100 100 94 62 91.2%
GPT-5 Nano 100 100 100 100 55 91.0%
DeepSeek-V2 Chat 100 100 100 79 76 90.9%
GPT-5.4 Nano (Reasoning) 100 100 100 100 54 90.9%
Gemini 2.5 Pro 100 100 100 83 69 90.5%
Claude 3.7 Sonnet 100 100 100 73 72 89.1%
Stealth: Aurora Alpha 100 94 91 73 70 85.7%
Llama 3.1 70B 100 100 88 79 59 85.0%
Qwen 2.5 72B 100 100 100 89 28 83.5%
GPT-4o, May 13th (temp=0) 100 100 100 100 15 83.1%
Gemma 3 27B 100 100 93 76 46 83.0%
Grok 4 100 100 100 83 31 82.9%
Ministral 3B 100 100 100 100 10 82.0%
Ministral 8B 100 100 100 100 7 81.4%
Llama 3.1 8B 100 100 100 100 7 81.4%
Grok 4 Fast 100 100 100 61 42 80.5%
ByteDance Seed 2.0 Mini 100 100 100 100 0 80.0%
Claude 3.5 Haiku 100 100 100 100 0 80.0%
Ministral 3 8B 100 100 100 85 12 79.4%
Inception Mercury 2 100 100 100 74 22 79.2%
GPT-4o, May 13th (temp=1) 100 97 91 80 25 78.8%
Qwen 3 32B 100 100 94 64 35 78.7%
DeepSeek V3 (2024-12-26) 100 100 99 92 0 78.2%
GPT-4o Mini (temp=1) 100 100 100 85 0 77.0%
GPT-4.1 Mini 100 94 79 76 36 76.9%
Gemini 3 Flash (Preview) 100 100 100 83 0 76.7%
DeepSeek V3.1 100 100 100 69 4 74.6%
Gemini 2.5 Flash 100 100 68 44 39 70.2%
Rocinante 12B 100 100 100 40 7 69.5%
Z.AI GLM 4.7 Flash 100 99 69 60 0 65.7%
Arcee AI: Trinity Large (Preview) 100 91 76 56 0 64.7%
Cohere Command R+ (Aug. 2024) 100 93 76 32 21 64.4%
Mistral Small 3.2 24B 100 100 77 43 0 63.9%
Mistral NeMO 100 100 63 28 14 60.9%
Inception Mercury 100 87 62 28 24 60.3%
Gemma 3 12B 100 94 57 41 0 58.5%
Grok 4.20 (Beta) 100 59 28 20 9 43.1%
Arcee AI: Trinity Mini 88 70 50 7 0 42.9%
GPT-5.2 80 60 50 24 0 42.8%
Gemini 2.5 Flash Lite 88 67 31 7 7 39.9%
Nemotron 3 Nano 100 37 37 17 5 39.2%
GPT-4o, Aug. 6th (temp=0) 100 62 17 17 0 39.0%
Llama 3.1 Nemotron 70B 100 79 7 0 0 37.1%
Gemini 2.5 Flash (Reasoning) 80 59 41 2 0 36.4%
Gemini 2.5 Flash Lite (Reasoning) 100 56 0 0 0 31.3%
Hermes 3 405B 100 25 18 0 0 28.7%
Gemma 3 4B 85 18 15 7 0 25.0%
GPT-4o, Aug. 6th (temp=1) 100 25 0 0 0 25.0%
Claude 3 Haiku 63 57 0 0 0 24.0%
Hermes 3 70B 63 0 0 0 0 12.6%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning) 100 100 100 100 100 100.0%
Gemini 3.1 Pro (Preview) 100 100 100 100 100 100.0%
Z.AI GLM 5 Turbo 100 100 100 100 100 100.0%
Claude Sonnet 4.6 (Reasoning) 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning) 100 100 100 100 100 100.0%
GPT-5 100 100 100 100 100 100.0%
Qwen 3.5 397B A17B 100 100 100 100 100 100.0%
GPT-5.4 (Reasoning, Low) 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100.0%
Qwen 3.5 27B 100 100 100 100 100 100.0%
ByteDance Seed 1.6 100 100 100 100 100 100.0%
Z.AI GLM 4.6 100 100 100 100 100 100.0%
MiniMax M2.5 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100.0%
Claude Sonnet 4.5 100 100 100 100 100 100.0%
Stealth: Hunter Alpha 100 100 100 100 100 100.0%
Qwen 3.5 9B 100 100 100 100 100 100.0%
ByteDance Seed 2.0 Lite 100 100 100 100 100 100.0%
GPT-5.4 100 100 100 100 100 100.0%
Claude 3.5 Sonnet 100 100 100 100 100 100.0%
Mistral Large 2 100 100 100 100 100 100.0%
Mistral Small 4 (Reasoning) 100 100 100 100 100 100.0%
Gemini 2.5 Flash 100 100 100 100 100 100.0%
Mistral Large 100 100 100 100 100 100.0%
Mistral Medium 3.1 100 100 100 100 100 100.0%
Qwen 3.5 122B 100 100 100 100 98 99.7%
DeepSeek-V2 Chat 100 100 100 100 97 99.5%
Claude Haiku 4.5 100 100 100 100 88 97.5%
Mistral NeMO 100 100 100 100 88 97.5%
Claude Opus 4.5 100 100 100 100 83 96.7%
Qwen 3.5 35B 100 100 100 94 88 96.4%
Grok 4.1 Fast 100 100 100 93 88 96.1%
GPT-5.4 Nano 100 100 100 100 80 96.0%
GPT-5 Mini 100 100 100 91 85 95.2%
Mistral Small 3.2 24B 100 100 100 100 73 94.6%
MoonshotAI: Kimi K2.5 100 100 100 100 67 93.3%
Gemini 3.1 Flash Lite (Preview) 100 100 100 100 67 93.3%
WizardLM 2 8x22b 100 100 100 100 67 93.3%
Writer: Palmyra X5 100 100 100 100 55 91.0%
GPT-5.4 Nano (Reasoning, Low) 100 100 100 100 54 90.8%
GPT-5.4 Nano (Reasoning) 100 100 100 93 58 90.2%
Mistral Small 4 100 100 100 100 50 90.0%
Mistral Small Creative 100 100 100 88 59 89.3%
Claude Opus 4.6 100 100 100 100 39 87.8%
Ministral 3 8B 100 100 100 100 39 87.8%
MiniMax M2.7 100 100 100 88 50 87.5%
DeepSeek V3 (2024-12-26) 100 100 100 92 45 87.4%
Qwen3 235B A22B Instruct 2507 100 100 100 83 39 84.4%
DeepSeek V3.2 100 94 94 79 43 82.1%
Grok 4.20 (Beta, Reasoning) 100 100 100 100 7 81.4%
Claude Opus 4 100 100 100 100 0 80.0%
Gemini 2.5 Flash (Reasoning) 100 100 100 100 0 80.0%
Gemini 2.5 Flash Lite (Reasoning) 100 100 100 100 0 80.0%
GPT-5.4 Mini (Reasoning, Low) 100 99 71 67 47 76.8%
Z.AI GLM 4.7 100 100 100 73 7 76.0%
Ministral 3B 100 100 94 83 0 75.6%
GPT-5.1 100 94 93 63 25 74.9%
Z.AI GLM 4.5 100 100 96 79 0 74.9%
DeepSeek V3.1 100 100 100 67 0 73.3%
GPT-5.4 Mini 100 91 83 45 39 71.6%
Ministral 8B 100 100 73 59 25 71.4%
Mistral Large 3 100 100 100 25 25 70.0%
Inception Mercury 100 100 100 50 0 70.0%
Ministral 3 14B 100 100 67 50 32 69.8%
o4 Mini High 100 91 83 73 0 69.5%
Aion 2.0 100 100 100 39 7 69.2%
GPT-5.4 Mini (Reasoning) 100 100 100 25 21 69.2%
Z.AI GLM 5 100 100 100 39 0 67.8%
ByteDance Seed 2.0 Mini 100 100 100 39 0 67.8%
Stealth: Healer Alpha 100 100 100 39 0 67.8%
GPT-4o, May 13th (temp=0) 100 100 100 39 0 67.8%
Gemini 3 Flash (Preview, Reasoning) 100 100 85 52 0 67.6%
Gemma 3 27B 100 100 73 45 20 67.5%
DeepSeek V3 (2025-03-24) 100 100 67 59 7 66.6%
Llama 3.1 70B 100 100 100 25 0 65.0%
Grok 4 100 100 88 35 0 64.4%
Z.AI GLM 4.7 Flash 100 100 91 21 0 62.4%
Gemma 3 12B 100 81 76 47 0 60.7%
Claude Sonnet 4 100 100 100 0 0 60.0%
Nemotron 3 Super 100 100 100 0 0 60.0%
Claude 3.5 Haiku 100 100 100 0 0 60.0%
Qwen 2.5 72B 100 100 100 0 0 60.0%
Ministral 3 3B 100 100 94 0 0 58.9%
GPT-5 Nano 100 85 67 30 0 56.3%
GPT-5.2 100 91 41 31 11 54.8%
GPT-4.1 100 88 39 25 17 53.6%
GPT-4o, Aug. 6th (temp=0) 100 73 73 0 0 49.2%
Gemini 2.5 Flash Lite 100 100 39 0 0 47.8%
Qwen 3 32B 100 91 47 0 0 47.7%
ByteDance Seed 1.6 Flash 100 100 7 7 0 42.9%
Qwen 3.5 Flash 100 100 0 0 0 40.0%
Gemini 3 Flash (Preview) 96 63 39 0 0 39.7%
Grok 4 Fast 100 67 25 0 0 38.3%
GPT-4.1 Nano 100 88 0 0 0 37.5%
Arcee AI: Trinity Mini 100 83 0 0 0 36.7%
Grok 4.20 (Beta) 100 76 0 0 0 35.2%
Claude 3.7 Sonnet 100 76 0 0 0 35.2%
o4 Mini 100 59 0 0 0 31.8%
Gemini 3 Pro (Preview) 81 50 25 0 0 31.2%
Arcee AI: Trinity Large (Preview) 100 43 7 0 0 30.0%
Llama 3.1 8B 100 50 0 0 0 30.0%
GPT-4.1 Mini 73 67 0 0 0 27.9%
Inception Mercury 2 100 36 0 0 0 27.3%
Cohere Command R+ (Aug. 2024) 73 55 0 0 0 25.6%
LFM2 24B 100 14 0 0 0 22.7%
Claude 3 Haiku 73 39 0 0 0 22.4%
Stealth: Aurora Alpha 100 7 0 0 0 21.4%
Gemma 3 4B 50 45 10 0 0 20.9%
Hermes 3 405B 100 0 0 0 0 20.0%
Nemotron 3 Nano 100 0 0 0 0 20.0%
Qwen 3.5 Plus (2026-02-15) 89 0 0 0 0 17.9%
GPT-4o, Aug. 6th (temp=1) 45 25 7 0 0 15.4%
Llama 3.1 Nemotron 70B 7 0 0 0 0 1.4%
GPT-4o, May 13th (temp=1) 0 0 0 0 0 0.0%
GPT-4o Mini (temp=1) 0 0 0 0 0 0.0%
GPT-4o Mini (temp=0) 0 0 0 0 0 0.0%
Hermes 3 70B 0 0 0 0 0 0.0%
Rocinante 12B 0 0 0 0 0 0.0%