Subject-first sentence starts

Test: Bad Writing Habits

Avg. Score: 35.7%
Scenarios: 18

Overall Performance

| Rank | Model | Score | Avg. Cost | Avg. Time | Stability |
|---|---|---|---|---|---|
| 1 | Writer: Palmyra X5 | 83.2% | $0.011 | 22.0s | 50% |
| 2 | Qwen3 235B A22B Instruct 2507 | 81.0% | $0.0011 | 59.2s | 48% |
| 3 | Rocinante 12B | 77.8% | $0.0014 | 38.4s | 38% |
| 4 | Llama 3.1 8B | 71.0% | $0.0003 | 1.3m | 29% |
| 5 | Mistral Small 4 (Reasoning) | 60.6% | $0.0022 | 30.2s | 25% |
| 6 | GPT-5.4 | 70.2% | $0.049 | 1.4m | 39% |
| 7 | Mistral Small 4 | 56.1% | $0.0014 | 18.2s | 24% |
| 8 | Ministral 3 14B | 50.5% | $0.0007 | 11.7s | 25% |
| 9 | Claude Sonnet 4.5 | 63.6% | $0.035 | 38.1s | 27% |
| 10 | Grok 4.1 Fast | 54.6% | $0.0018 | 37.8s | 24% |
| 11 | GPT-5.4 (Reasoning, Low) | 67.7% | $0.055 | 1.4m | 36% |
| 12 | Mistral Small Creative | 49.3% | $0.0007 | 9.1s | 22% |
| 13 | GPT-5.4 Mini | 50.0% | $0.015 | 16.8s | 25% |
| 14 | Z.AI GLM 5 | 58.9% | $0.0084 | 1.2m | 26% |
| 15 | Claude Haiku 4.5 | 52.5% | $0.011 | 21.6s | 21% |
| 16 | GPT-5.4 Mini (Reasoning, Low) | 47.8% | $0.015 | 16.8s | 23% |
| 17 | Claude 3.5 Haiku | 51.4% | $0.0035 | 10.8s | 17% |
| 18 | Z.AI GLM 5 Turbo | 49.6% | $0.0081 | 33.2s | 22% |
| 19 | Mistral Medium 3.1 | 43.3% | $0.0048 | 36.5s | 25% |
| 20 | Hermes 3 70B | 57.9% | $0.0010 | 1.2m | 20% |
| 21 | Grok 4 Fast | 45.4% | $0.0017 | 24.1s | 20% |
| 22 | MiniMax M2.5 | 52.7% | $0.0034 | 1.3m | 23% |
| 23 | Hermes 3 405B | 53.5% | $0.0032 | 53.2s | 19% |
| 24 | Grok 4.20 (Beta) | 38.9% | $0.018 | 15.8s | 25% |
| 25 | Claude Sonnet 4.6 | 53.3% | $0.031 | 39.3s | 21% |
| 26 | Claude Sonnet 4 | 53.0% | $0.032 | 43.7s | 22% |
| 27 | Llama 3.1 Nemotron 70B | 45.5% | $0.0038 | 31.7s | 18% |
| 28 | Arcee AI: Trinity Large (Preview) | 44.0% | $0.0000 | 43.6s | 19% |
| 29 | Llama 3.1 70B | 44.4% | $0.0015 | 29.4s | 17% |
| 30 | Claude 3 Haiku | 45.6% | $0.0025 | 14.9s | 14% |
| 31 | Claude Opus 4.5 | 58.8% | $0.070 | 53.4s | 27% |
| 32 | Gemini 2.5 Flash Lite | 34.5% | $0.0009 | 9.5s | 18% |
| 33 | GPT-5.4 Mini (Reasoning) | 45.0% | $0.022 | 28.1s | 19% |
| 34 | ByteDance Seed 1.6 Flash | 38.0% | $0.0013 | 27.3s | 18% |
| 35 | Ministral 8B | 37.6% | $0.0004 | 10.4s | 15% |
| 36 | GPT-4o, Aug. 6th (temp=1) | 44.8% | $0.018 | 24.4s | 17% |
| 37 | Stealth: Healer Alpha | 34.3% | $0.0000 | 23.7s | 18% |
| 38 | MiniMax M2.7 | 46.3% | $0.0040 | 1.1m | 17% |
| 39 | Ministral 3 8B | 38.5% | $0.0008 | 19.6s | 14% |
| 40 | DeepSeek V3 (2025-03-24) | 42.0% | $0.0014 | 39.4s | 14% |
| 41 | GPT-4.1 | 39.4% | $0.018 | 44.7s | 21% |
| 42 | Gemma 3 12B | 36.6% | $0.0004 | 41.3s | 17% |
| 43 | Mistral Large 2 | 40.1% | $0.013 | 29.4s | 16% |
| 44 | Mistral Large 3 | 36.9% | $0.0033 | 30.3s | 16% |
| 45 | Stealth: Hunter Alpha | 40.0% | $0.0000 | 55.0s | 17% |
| 46 | Gemini 2.5 Flash Lite (Reasoning) | 33.3% | $0.0028 | 30.8s | 17% |
| 47 | GPT-4o Mini (temp=1) | 38.9% | $0.0012 | 34.8s | 13% |
| 48 | Gemma 3 27B | 39.8% | $0.0006 | 52.6s | 14% |
| 49 | GPT-4.1 Nano | 34.2% | $0.0007 | 13.3s | 12% |
| 50 | Claude Sonnet 4.6 (Reasoning) | 53.7% | $0.060 | 1.2m | 22% |
| 51 | Claude Opus 4.6 | 56.4% | $0.078 | 1.2m | 25% |
| 52 | Claude Opus 4.6 (Reasoning) | 58.5% | $0.088 | 1.4m | 26% |
| 53 | Cohere Command R+ (Aug. 2024) | 45.9% | $0.020 | 52.5s | 13% |
| 54 | Mistral Large | 37.3% | $0.014 | 30.9s | 13% |
| 55 | GPT-5.4 Nano | 28.8% | $0.0057 | 26.3s | 16% |
| 56 | Gemini 2.5 Flash | 28.1% | $0.0052 | 10.6s | 13% |
| 57 | LFM2 24B | 28.6% | $0.0002 | 28.4s | 13% |
| 58 | GPT-5.4 Nano (Reasoning, Low) | 26.5% | $0.0055 | 20.6s | 15% |
| 59 | Gemma 3 4B | 29.4% | $0.0002 | 20.0s | 11% |
| 60 | Mistral NeMO | 28.6% | $0.0005 | 10.1s | 10% |
| 61 | GPT-5.4 Nano (Reasoning) | 26.8% | $0.0061 | 24.5s | 15% |
| 62 | Z.AI GLM 4.6 | 28.9% | $0.0065 | 51.5s | 17% |
| 63 | Grok 4.20 (Beta, Reasoning) | 34.7% | $0.039 | 34.0s | 18% |
| 64 | Gemini 2.5 Flash (Reasoning) | 32.2% | $0.011 | 21.5s | 11% |
| 65 | Qwen 3 32B | 32.1% | $0.0015 | 54.6s | 13% |
| 66 | GPT-5.1 | 51.4% | $0.054 | 1.8m | 22% |
| 67 | GPT-5.4 (Reasoning) | 65.0% | $0.089 | 2.6m | 29% |
| 68 | GPT-4o, May 13th (temp=1) | 30.5% | $0.033 | 14.4s | 15% |
| 69 | Ministral 3B | 23.3% | $0.0001 | 8.1s | 10% |
| 70 | o4 Mini | 25.8% | $0.015 | 25.7s | 14% |
| 71 | WizardLM 2 8x22b | 38.5% | $0.0026 | 1.8m | 16% |
| 72 | Z.AI GLM 4.5 | 29.0% | $0.0051 | 42.1s | 12% |
| 73 | GPT-4.1 Mini | 24.8% | $0.0027 | 19.0s | 10% |
| 74 | DeepSeek V3.2 | 33.7% | $0.0014 | 1.9m | 17% |
| 75 | Gemini 2.5 Pro | 29.5% | $0.036 | 36.2s | 16% |
| 76 | Aion 2.0 | 28.8% | $0.0064 | 1.3m | 15% |
| 77 | Claude 3.7 Sonnet | 32.1% | $0.042 | 46.7s | 17% |
| 78 | Gemini 3.1 Flash Lite (Preview) | 21.3% | $0.0030 | 8.4s | 8% |
| 79 | Qwen 3.5 Plus (2026-02-15) | 22.7% | $0.0060 | 31.5s | 12% |
| 80 | o4 Mini High | 28.1% | $0.025 | 47.2s | 14% |
| 81 | MoonshotAI: Kimi K2.5 | 43.7% | $0.019 | 3.2m | 22% |
| 82 | Grok 4 | 43.0% | $0.048 | 1.7m | 16% |
| 83 | Ministral 3 3B | 18.7% | $0.0005 | 11.1s | 5% |
| 84 | Claude 3.5 Sonnet | 32.1% | $0.048 | 35.5s | 11% |
| 85 | DeepSeek V3.1 | 26.6% | $0.0020 | 1.8m | 13% |
| 86 | Gemini 3 Flash (Preview) | 17.4% | $0.0078 | 19.6s | 7% |
| 87 | DeepSeek V3 (2024-12-26) | 22.0% | $0.0021 | 54.6s | 7% |
| 88 | DeepSeek-V2 Chat | 21.7% | $0.0021 | 53.3s | 6% |
| 89 | GPT-4o, Aug. 6th (temp=0) | 17.5% | $0.023 | 22.7s | 8% |
| 90 | GPT-4o, May 13th (temp=0) | 18.6% | $0.035 | 14.1s | 8% |
| 91 | Gemini 3 Flash (Preview, Reasoning) | 17.3% | $0.012 | 30.1s | 5% |
| 92 | Z.AI GLM 4.7 Flash | 16.8% | $0.0017 | 1.2m | 8% |
| 93 | GPT-5 Mini | 17.7% | $0.0100 | 57.4s | 7% |
| 94 | Arcee AI: Trinity Mini | 11.6% | $0.0003 | 9.2s | 0% |
| 95 | Qwen 3.5 397B A17B | 33.8% | $0.014 | 3.0m | 15% |
| 96 | Qwen 3.5 Flash | 12.9% | $0.0025 | 47.5s | 2% |
| 97 | Z.AI GLM 4.7 | 16.6% | $0.010 | 1.4m | 8% |
| 98 | Qwen 2.5 72B | 8.8% | $0.0010 | 36.7s | 2% |
| 99 | GPT-4o Mini (temp=0) | 9.5% | $0.0012 | 34.8s | 0% |
| 100 | Nemotron 3 Super | 13.3% | $0.0000 | 1.4m | 5% |
| 101 | Gemini 3 Pro (Preview) | 20.1% | $0.055 | 54.4s | 9% |
| 102 | Claude Opus 4 | 57.5% | $0.209 | 1.4m | 28% |
| 103 | Stealth: Aurora Alpha | 0.7% | $0.0000 | 9.8s | 0% |
| 104 | Inception Mercury 2 | 1.1% | $0.0032 | 7.0s | 0% |
| 105 | GPT-5.2 | 22.9% | $0.056 | 1.5m | 11% |
| 106 | ByteDance Seed 2.0 Lite | 22.0% | $0.012 | 2.2m | 6% |
| 107 | GPT-5 Nano | 13.5% | $0.0042 | 1.4m | 2% |
| 108 | Qwen 3.5 35B | 14.4% | $0.018 | 1.0m | 1% |
| 109 | Inception Mercury | 1.2% | $0.011 | 17.6s | 0% |
| 110 | Nemotron 3 Nano | 7.5% | $0.0010 | 1.1m | 0% |
| 111 | Qwen 3.5 9B | 7.6% | $0.0011 | 1.4m | 0% |
| 112 | GPT-5 | 29.2% | $0.065 | 2.8m | 13% |
| 113 | Qwen 3.5 122B | 8.2% | $0.025 | 1.1m | 0% |
| 114 | Qwen 3.5 27B | 6.7% | $0.020 | 1.6m | 0% |
| 115 | ByteDance Seed 1.6 | 11.8% | $0.013 | 2.5m | 0% |
| 116 | Gemini 3.1 Pro (Preview) | 23.9% | $0.107 | 1.8m | 6% |
| 117 | ByteDance Seed 2.0 Mini | 22.7% | $0.0045 | 4.9m | 9% |
| 118 | Mistral Small 3.2 24B | 25.5% | $0.0069 | 5.7m | 9% |

Average score: 35.67%

Individual Scenarios

Detailed Writing Rules

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Writer: Palmyra X510010096888092.7%
Rocinante 12B10010098837891.8%
GPT-5.4 (Reasoning)1009187866084.7%
GPT-5.4 (Reasoning, Low)1008980717182.5%
Qwen3 235B A22B Instruct 250710010098872782.5%
Claude Sonnet 4.6 (Reasoning)979281746281.2%
Claude Sonnet 4.6967978776378.4%
Claude Sonnet 4.51008771676477.9%
Gemma 3 4B1007876423967.1%
Hermes 3 70B1009660562366.9%
GPT-5.4817567605166.8%
Claude Opus 4.5838172652965.9%
Claude Opus 4.6 (Reasoning)1007271393563.2%
Llama 3.1 Nemotron 70B857363413559.2%
MiniMax M2.7886051504258.2%
Cohere Command R+ (Aug. 2024)998240363458.2%
Claude Opus 4757471422457.3%
Claude Opus 4.6767050464457.3%
WizardLM 2 8x22b976953392757.0%
Gemma 3 12B877353413156.9%
GPT-5.4 Mini777055414156.6%
Z.AI GLM 51007867201355.5%
Llama 3.1 8B1001005618054.8%
GPT-5.4 Mini (Reasoning)756455463454.6%
Claude 3.5 Haiku1006737373154.5%
Claude Haiku 4.51006143363354.4%
Hermes 3 405B10010045141154.0%
MoonshotAI: Kimi K2.5746054482953.2%
GPT-5.4 Mini (Reasoning, Low)696356463053.1%
Aion 2.0645955433350.8%
Z.AI GLM 5 Turbo786747431650.2%
Gemini 2.5 Pro636161451148.1%
Gemini 2.5 Flash704842413046.3%
Gemma 3 27B595655431846.2%
GPT-4o Mini (temp=1)904542381546.0%
Arcee AI: Trinity Large (Preview)685047392245.0%
Gemini 2.5 Flash Lite524943423744.7%
Ministral 8B796046231544.5%
Gemini 2.5 Flash (Reasoning)81773714943.7%
Ministral 3 14B755043281843.1%
Claude Sonnet 4685449291142.1%
Stealth: Hunter Alpha883433312041.4%
DeepSeek V3 (2025-03-24)94562626040.3%
GPT-5.1544938362540.2%
DeepSeek V3.2545243391240.1%
Gemini 2.5 Flash Lite (Reasoning)773832302039.5%
LFM2 24B86612221037.9%
Mistral Small 4 (Reasoning)555544171537.3%
GPT-4o, Aug. 6th (temp=1)75582918036.2%
Grok 4.20 (Beta)644539211035.9%
Mistral Small Creative534643171334.6%
ByteDance Seed 1.6 Flash79342923834.6%
Mistral Small 3.2 24B81352928034.6%
GPT-4.162533027034.4%
Stealth: Healer Alpha80403814034.4%
Gemini 3 Flash (Preview, Reasoning)483935281633.2%
Ministral 3B6560370032.4%
Z.AI GLM 4.6494128232232.4%
Mistral Large 2393732291831.2%
Claude 3 Haiku732621211430.9%
Llama 3.1 70B10031107029.6%
Mistral Medium 3.1464231151329.2%
GPT-5.242413716528.2%
MiniMax M2.573262516028.0%
Grok 4 Fast433131211328.0%
Gemini 3 Pro (Preview)49362720727.6%
Claude 3.7 Sonnet50402820027.6%
Gemini 3 Flash (Preview)38373624027.0%
Mistral Large 3463120191726.4%
GPT-4o, May 13th (temp=1)602616141225.7%
Mistral Large48312720125.4%
Grok 4.1 Fast48272617624.8%
Claude 3.5 Sonnet7023148724.1%
ByteDance Seed 2.0 Lite51312316024.1%
DeepSeek V3.133322520924.0%
DeepSeek-V2 Chat542218111022.9%
GPT-5.4 Nano432818141122.9%
Ministral 3 8B6039122022.6%
Ministral 3 3B45411310021.8%
Qwen 3.5 397B A17B4031288021.4%
GPT-5.4 Nano (Reasoning, Low)27272420420.4%
GPT-5.4 Nano (Reasoning)37351411520.3%
Z.AI GLM 4.728272314218.9%
ByteDance Seed 2.0 Mini552580017.4%
o4 Mini High462597017.4%
DeepSeek V3 (2024-12-26)3825213017.4%
Mistral Small 434171512917.3%
GPT-54018109716.9%
GPT-4.1 Mini4123180016.2%
Grok 43028106215.3%
Gemini 3.1 Flash Lite (Preview)3717135315.1%
GPT-4o, Aug. 6th (temp=0)2823174014.3%
Mistral NeMO322780013.5%
GPT-4.1 Nano2520162012.3%
Arcee AI: Trinity Mini431700012.0%
Qwen 3.5 9B391800011.4%
Z.AI GLM 4.7 Flash2515140010.9%
Qwen 2.5 72B241953210.7%
Gemini 3.1 Pro (Preview)2316113010.5%
Grok 4.20 (Beta, Reasoning)231310209.7%
Qwen 3.5 Plus (2026-02-15)21147309.0%
Z.AI GLM 4.523135409.0%
Qwen 3.5 Flash4500009.0%
GPT-4o, May 13th (temp=0)24165008.8%
GPT-5 Mini2385007.2%
Qwen 3.5 35B3120006.5%
GPT-5 Nano1796006.3%
Qwen 3 32B1884106.3%
o4 Mini11106216.3%
GPT-4o Mini (temp=0)2731006.1%
Nemotron 3 Super17130006.0%
Nemotron 3 Nano1570004.4%
Qwen 3.5 27B1900003.8%
ByteDance Seed 1.61400002.8%
Stealth: Aurora Alpha500001.0%
Inception Mercury 2400000.9%
Qwen 3.5 122B000000.0%
Inception Mercury000000.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Gemma 3 27B1001001001008597.0%
MiniMax M2.7100100100949297.0%
Claude Sonnet 4.5100100100997695.1%
Claude Opus 4.6 (Reasoning)1001001001006993.9%
Claude Haiku 4.510010099947793.9%
GPT-4o Mini (temp=1)10010095957693.0%
Claude Opus 4100100100885789.1%
Rocinante 12B100100100856088.9%
Claude Sonnet 4.6999089847687.5%
Claude 3.5 Haiku10010092905487.2%
GPT-5.4 (Reasoning)979382807986.2%
Mistral Small 4 (Reasoning)1009889786285.3%
Claude Opus 4.61001001001002184.3%
Z.AI GLM 5 Turbo979389766684.2%
GPT-4o, Aug. 6th (temp=1)10010095685784.1%
GPT-5.41008681777483.8%
GPT-5.4 (Reasoning, Low)948584777482.8%
Claude 3 Haiku10010090852980.7%
Claude Sonnet 410010088595480.0%
Grok 41009690654679.5%
Claude Sonnet 4.6 (Reasoning)10010074595878.0%
Stealth: Hunter Alpha908483745677.5%
DeepSeek V3 (2025-03-24)10010090603777.4%
Llama 3.1 8B1001009977876.8%
Z.AI GLM 51008574625675.4%
WizardLM 2 8x22b958969635974.7%
Claude Opus 4.510010082444273.7%
Mistral Large 21008973564873.1%
Gemma 3 4B998181515072.4%
Z.AI GLM 4.5949481553171.0%
DeepSeek V3.2907266665870.6%
Mistral Small Creative998579523770.3%
MiniMax M2.5927368665170.1%
GPT-5.4 Mini (Reasoning, Low)857367645969.8%
Arcee AI: Trinity Large (Preview)1009177481666.4%
Ministral 3 14B966866504665.3%
GPT-5.1907171524165.2%
Mistral Large 310010048452663.8%
Mistral Large937169532762.7%
Aion 2.0816762604362.7%
Ministral 8B827465603162.4%
Gemini 2.5 Flash Lite827460414159.6%
Cohere Command R+ (Aug. 2024)100827243059.5%
Mistral Medium 3.1876057484459.1%
Hermes 3 70B100926238158.5%
Gemma 3 12B716763593258.3%
GPT-5.4 Mini766051484856.8%
Grok 4.1 Fast914947474455.6%
GPT-4.1865951383754.1%
Gemini 2.5 Pro676060493454.0%
Gemini 3 Pro (Preview)875751482453.5%
MoonshotAI: Kimi K2.5776159402853.0%
Llama 3.1 70B776560491353.0%
GPT-5.4 Mini (Reasoning)805350483352.7%
Claude 3.7 Sonnet676454473252.6%
DeepSeek V3.1928039312152.6%
Gemini 2.5 Flash (Reasoning)81805843052.4%
Gemini 2.5 Flash1004843363351.8%
DeepSeek V3 (2024-12-26)1007247191851.2%
ByteDance Seed 1.6 Flash635552513551.1%
LFM2 24B79715343650.7%
GPT-4.1 Nano725650393550.4%
GPT-4o, May 13th (temp=1)685953363249.6%
Grok 4.20 (Beta)554848464548.4%
Grok 4 Fast71696529648.2%
Ministral 3 8B754848373448.2%
GPT-4.1 Mini665958322648.1%
Claude 3.5 Sonnet100484741147.4%
Mistral NeMO100593935347.2%
GPT-5.4 Nano (Reasoning, Low)695746322946.5%
Mistral Small 468645640045.6%
GPT-5.4 Nano664947441945.2%
Gemini 2.5 Flash Lite (Reasoning)85803524044.9%
Hermes 3 405B93843311044.2%
Stealth: Healer Alpha615347411743.7%
Llama 3.1 Nemotron 70B100532917641.1%
DeepSeek-V2 Chat100641913439.8%
Z.AI GLM 4.6514836322939.2%
ByteDance Seed 2.0 Mini554340292738.9%
Nemotron 3 Super73504616938.8%
Z.AI GLM 4.7743232282838.7%
Qwen 3 32B52494539137.2%
GPT-5554636331537.0%
GPT-5.2574336321736.9%
Arcee AI: Trinity Mini10041236334.5%
Ministral 3 3B694025161432.6%
Ministral 3B54383624932.0%
GPT-5.4 Nano (Reasoning)51333232931.4%
Qwen 3.5 Plus (2026-02-15)582725251329.7%
Grok 4.20 (Beta, Reasoning)553818151227.4%
Gemini 3.1 Pro (Preview)423025201927.4%
Gemini 3 Flash (Preview)39322823926.3%
o4 Mini403423161625.8%
GPT-4o Mini (temp=0)373028181525.4%
Z.AI GLM 4.7 Flash333022201724.4%
o4 Mini High392925151123.7%
ByteDance Seed 1.6252323151219.3%
ByteDance Seed 2.0 Lite64121010019.3%
Gemini 3 Flash (Preview, Reasoning)353370015.1%
GPT-5 Mini30121010012.6%
GPT-4o, May 13th (temp=0)54520012.4%
GPT-4o, Aug. 6th (temp=0)351642011.4%
Qwen 2.5 72B36966011.2%
GPT-5 Nano2317122111.1%
Qwen 3.5 397B A17B161586610.1%
Gemini 3.1 Flash Lite (Preview)1662004.9%
Mistral Small 3.2 24B1311003.1%
Qwen 3.5 35B1500003.0%
Qwen 3.5 122B1310002.9%
Qwen 3.5 Flash900001.8%
Nemotron 3 Nano900001.8%
Qwen 3.5 27B000000.0%
Qwen 3.5 9B000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Inception Mercury000000.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Writer: Palmyra X51001001001007895.7%
Qwen3 235B A22B Instruct 250710010099826789.7%
Llama 3.1 8B100100100834084.5%
Claude Opus 4100100100625082.5%
Rocinante 12B10010010052270.8%
Mistral Small 4959368601366.1%
Mistral Small 4 (Reasoning)79776857958.1%
GPT-5.4686762443454.8%
Claude Opus 4.5836643433153.5%
Claude Haiku 4.51005346353453.4%
Claude Sonnet 41005647372553.0%
Mistral Large 2856747342852.3%
Ministral 3 8B1009129231952.3%
Claude Sonnet 4.5757261291750.9%
Claude Sonnet 4.6 (Reasoning)655646454150.7%
Hermes 3 405B88664539047.5%
Hermes 3 70B74694742046.4%
Mistral Medium 3.168615545045.8%
MiniMax M2.571565045645.8%
Mistral Large 31006224191744.6%
Ministral 3 14B85584331544.2%
Llama 3.1 70B824841241742.4%
Claude Sonnet 4.674585512040.0%
GPT-5.4 (Reasoning, Low)595044281739.8%
Claude Opus 4.6574848242339.8%
Mistral Large75454227438.6%
Z.AI GLM 5544844271537.6%
Llama 3.1 Nemotron 70B69434131036.9%
Grok 4.1 Fast71484811035.8%
MoonshotAI: Kimi K2.5544128272635.0%
GPT-5975483032.4%
ByteDance Seed 1.6 Flash553130232332.3%
Qwen 3.5 397B A17B50453429031.6%
Ministral 8B46453320930.8%
Arcee AI: Trinity Large (Preview)886050030.6%
Claude 3.5 Haiku48433427030.4%
Z.AI GLM 5 Turbo413733221930.3%
Gemma 3 12B61352920029.2%
DeepSeek V3 (2025-03-24)5549310027.1%
Gemma 3 27B493725121127.0%
MiniMax M2.740342824726.9%
GPT-4.16041270025.5%
GPT-4o, Aug. 6th (temp=1)42392717025.0%
Qwen 3 32B4536356024.5%
WizardLM 2 8x22b4339313023.3%
Grok 4 Fast5331186522.8%
Grok 4.20 (Beta)47351610021.5%
GPT-5.4 Nano (Reasoning, Low)39241815720.7%
GPT-4o Mini (temp=1)4323237620.6%
GPT-5.4 (Reasoning)3933310020.6%
Gemini 2.5 Flash (Reasoning)100000020.0%
Qwen 3.5 35B4440140019.6%
Claude 3 Haiku7114130019.6%
ByteDance Seed 2.0 Lite3930263019.5%
Claude Opus 4.6 (Reasoning)474170019.0%
Mistral Small Creative4437112019.0%
DeepSeek V3 (2024-12-26)6020100018.1%
GPT-5.14525143017.2%
Claude 3.5 Sonnet4816119016.8%
LFM2 24B4916135016.8%
Gemma 3 4B25231414516.2%
Mistral NeMO3918160014.7%
GPT-4.1 Nano3022192014.5%
GPT-4o, May 13th (temp=1)531900014.3%
DeepSeek-V2 Chat3814117013.8%
GPT-5.4 Nano2318129513.3%
GPT-5.4 Mini2620127013.1%
GPT-5.4 Mini (Reasoning)431800012.2%
Claude 3.7 Sonnet342600012.0%
Ministral 3B2518130011.1%
Gemini 2.5 Flash Lite311154010.3%
GPT-4o, May 13th (temp=0)311630010.2%
GPT-5 Nano51000010.2%
Cohere Command R+ (Aug. 2024)411000010.2%
GPT-5.4 Nano (Reasoning)2713100010.0%
GPT-4.1 Mini28145009.6%
Mistral Small 3.2 24B4150009.2%
GPT-5.4 Mini (Reasoning, Low)23166009.1%
DeepSeek V3.223106007.9%
Z.AI GLM 4.62396007.7%
Z.AI GLM 4.53600007.3%
Ministral 3 3B3410007.1%
ByteDance Seed 1.63130007.0%
Aion 2.019131006.6%
Qwen 3.5 27B19140006.4%
Qwen 3.5 Plus (2026-02-15)17150006.4%
Gemini 2.5 Flash3100006.3%
Nemotron 3 Super3100006.1%
o4 Mini15115006.1%
Stealth: Healer Alpha1594105.8%
Grok 42600005.1%
o4 Mini High1294005.0%
Qwen 3.5 Flash2230004.9%
Gemini 3 Pro (Preview)13110004.7%
GPT-5.22300004.5%
Qwen 2.5 72B1700003.4%
Arcee AI: Trinity Mini1160003.4%
GPT-4o, Aug. 6th (temp=0)1600003.1%
Gemini 2.5 Flash Lite (Reasoning)753002.9%
Gemini 3.1 Pro (Preview)1300002.6%
GPT-4o Mini (temp=0)920002.2%
Gemini 2.5 Pro1000002.1%
Z.AI GLM 4.7 Flash1000002.1%
Grok 4.20 (Beta, Reasoning)710001.7%
GPT-5 Mini711001.6%
Gemini 3 Flash (Preview, Reasoning)700001.4%
Gemini 3.1 Flash Lite (Preview)200000.5%
ByteDance Seed 2.0 Mini110000.3%
Qwen 3.5 9B100000.2%
Stealth: Hunter Alpha100000.2%
Nemotron 3 Nano000000.1%
Qwen 3.5 122B000000.0%
Z.AI GLM 4.7000000.0%
Gemini 3 Flash (Preview)000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
DeepSeek V3.1000000.0%
Inception Mercury000000.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Writer: Palmyra X5100100100100100100.0%
Qwen3 235B A22B Instruct 25071001001001008396.7%
GPT-5.4 (Reasoning, Low)100100100927393.0%
Claude Opus 4.6 (Reasoning)10010090864684.6%
Claude Sonnet 4.510010094804583.7%
Stealth: Hunter Alpha1009379664576.8%
Claude Sonnet 4.61009877604676.2%
GPT-5.11009668675076.1%
Hermes 3 405B1009672585476.0%
Claude Sonnet 41007772625773.7%
Mistral Small 4 (Reasoning)1009565575273.7%
GPT-5.41009465634573.4%
Rocinante 12B10010088463173.0%
Grok 4.1 Fast958474535171.4%
Claude Sonnet 4.6 (Reasoning)959074484871.1%
Claude Opus 4.610010059514571.0%
Z.AI GLM 51007265645270.5%
GPT-5.4 (Reasoning)977271634469.4%
Claude Opus 4938273613568.8%
MoonshotAI: Kimi K2.51007065633967.4%
Mistral Small Creative1007167573466.0%
GPT-5.4 Mini827362625166.0%
Llama 3.1 8B10010060501665.0%
Mistral Small 41006657574164.2%
MiniMax M2.7817979453363.4%
Mistral Large92756663760.7%
Z.AI GLM 5 Turbo796663603059.8%
Ministral 3 14B95776354959.6%
Hermes 3 70B1001006530059.0%
Claude Opus 4.5909045392958.5%
Gemma 3 27B1007157442058.4%
MiniMax M2.5818167312457.1%
Claude 3.5 Haiku926555471654.9%
GPT-5.4 Mini (Reasoning)595959514554.6%
Arcee AI: Trinity Large (Preview)876256422454.2%
GPT-5.4 Mini (Reasoning, Low)746252412951.5%
Stealth: Healer Alpha676665362351.4%
Qwen 3 32B94604939048.3%
DeepSeek V3.1854739363247.9%
Mistral Medium 3.1726959251347.6%
GPT-5846935331246.4%
Claude 3.7 Sonnet814140383146.2%
WizardLM 2 8x22b86784025046.0%
Claude 3.5 Sonnet100574519745.7%
Gemma 3 4B605344432545.0%
Mistral Large 279625330044.8%
Claude Haiku 4.564595148144.5%
Grok 4665936312843.9%
GPT-4.1 Nano67574943043.1%
DeepSeek V3 (2025-03-24)1004336211342.5%
Gemma 3 12B88504828042.5%
ByteDance Seed 1.6 Flash1004823212042.3%
Grok 4.20 (Beta)585234333342.0%
Gemini 2.5 Flash Lite785143191741.6%
Qwen 3.5 397B A17B73675711041.6%
Gemini 2.5 Flash (Reasoning)68664724341.5%
Aion 2.0665043251540.0%
GPT-5.4 Nano594138322338.7%
Gemini 2.5 Flash Lite (Reasoning)674137301738.4%
GPT-4o, Aug. 6th (temp=1)544238381437.1%
Mistral Small 3.2 24B8455440036.6%
Mistral Large 3843428201636.5%
Gemini 2.5 Flash45453734032.1%
GPT-5.4 Nano (Reasoning, Low)585318161331.8%
Ministral 3 8B593228251131.1%
GPT-4o Mini (temp=1)57503712031.1%
Ministral 8B8742145029.8%
LFM2 24B623024141228.3%
GPT-4o, May 13th (temp=1)69342215027.9%
Gemini 3 Pro (Preview)59402512127.3%
o4 Mini High7922188025.6%
DeepSeek V3.240383415125.6%
Cohere Command R+ (Aug. 2024)635344024.9%
Mistral NeMO423022191124.9%
GPT-4.14139380023.6%
Grok 4 Fast48332311123.2%
Claude 3 Haiku7711109322.0%
Gemini 3.1 Flash Lite (Preview)37262216921.9%
Arcee AI: Trinity Mini752390021.2%
Z.AI GLM 4.76128122020.7%
Z.AI GLM 4.65232115220.5%
Ministral 3B5226169020.4%
Gemini 3 Flash (Preview)37311813019.9%
Ministral 3 3B38241612819.5%
Gemini 2.5 Pro51231111219.4%
GPT-4o, May 13th (temp=0)4429194019.1%
o4 Mini3530217018.6%
GPT-5.4 Nano (Reasoning)443841017.5%
Gemini 3 Flash (Preview, Reasoning)3823188017.4%
Llama 3.1 70B3129250017.1%
Z.AI GLM 4.7 Flash3932103017.0%
Z.AI GLM 4.5472592016.5%
GPT-5.22826199016.3%
DeepSeek V3 (2024-12-26)4025100015.0%
Qwen 3.5 Plus (2026-02-15)2726201014.8%
Grok 4.20 (Beta, Reasoning)3421108014.7%
GPT-4o, Aug. 6th (temp=0)3521100013.3%
Qwen 2.5 72B47840011.9%
Gemini 3.1 Pro (Preview)2317153111.7%
Llama 3.1 Nemotron 70B46500010.3%
GPT-5 Mini22127008.1%
ByteDance Seed 1.620180007.5%
Qwen 3.5 35B2800005.5%
Nemotron 3 Super2200004.4%
DeepSeek-V2 Chat984004.1%
Qwen 3.5 122B1600003.3%
ByteDance Seed 2.0 Mini1200002.4%
Qwen 3.5 27B1000002.0%
GPT-4.1 Mini900001.7%
GPT-4o Mini (temp=0)300000.6%
ByteDance Seed 2.0 Lite300000.6%
Qwen 3.5 Flash000000.0%
Qwen 3.5 9B000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
GPT-5 Nano000000.0%
Inception Mercury000000.0%
Nemotron 3 Nano000000.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Rocinante 12B100100100786187.8%
Writer: Palmyra X51009682764379.4%
Llama 3.1 8B10010066605776.7%
Hermes 3 70B1009393772076.6%
Cohere Command R+ (Aug. 2024)1001008074070.8%
Claude 3 Haiku838171673868.1%
Mistral Medium 3.1868474513766.4%
Claude Opus 4987854474364.0%
Claude 3.5 Haiku1001006554063.8%
GPT-5.4 (Reasoning, Low)746861585362.7%
Claude Opus 4.5868660433562.1%
Claude Opus 4.6 (Reasoning)756761574661.2%
Claude Sonnet 41007261422059.0%
Qwen3 235B A22B Instruct 2507806161602357.1%
GPT-5.4806138373449.9%
Claude Opus 4.6826248351949.4%
Grok 4.1 Fast886932292749.0%
Mistral NeMO75665145648.7%
MiniMax M2.5605149373446.1%
Mistral Small Creative100584210142.3%
Z.AI GLM 5814332322142.0%
Ministral 3 8B81694510041.0%
Grok 4 Fast615339331940.9%
Mistral Small 4 (Reasoning)585837292340.8%
Mistral Small 3.2 24B100353430140.0%
Llama 3.1 70B100432817137.7%
Claude Sonnet 4.6 (Reasoning)58564329037.2%
Claude Sonnet 4.566483633036.6%
GPT-4o, Aug. 6th (temp=1)60514131036.6%
Mistral Large8176164035.5%
Grok 4.20 (Beta)57433838035.3%
Qwen 3.5 397B A17B544336231333.8%
Claude Haiku 4.5574137191333.5%
Mistral Large 360523815033.0%
Ministral 3 14B71453118032.9%
Qwen 3.5 Plus (2026-02-15)8948260032.5%
MiniMax M2.774463211032.4%
GPT-4o, May 13th (temp=0)474629271432.4%
Ministral 8B63473119032.0%
Mistral Small 47155290031.1%
Claude Sonnet 4.610035119031.0%
Gemma 3 27B55432320930.0%
GPT-5.4 Mini413634261129.7%
Gemini 2.5 Flash Lite49402722929.5%
Hermes 3 405B54502813029.0%
Stealth: Hunter Alpha5045389128.6%
Z.AI GLM 5 Turbo6357105528.1%
Gemma 3 12B363431211627.7%
GPT-4.1 Nano48332925027.0%
o4 Mini712119121126.7%
Stealth: Healer Alpha8527174026.6%
MoonshotAI: Kimi K2.54643329026.1%
Arcee AI: Trinity Large (Preview)46413112025.9%
GPT-4o, May 13th (temp=1)423123171325.2%
Gemma 3 4B61282311024.5%
ByteDance Seed 1.6 Flash48292520024.4%
Llama 3.1 Nemotron 70B56251816624.1%
Ministral 3 3B932620024.1%
DeepSeek V3.24541330023.9%
LFM2 24B54302214023.9%
GPT-5.4 (Reasoning)502916141023.7%
ByteDance Seed 2.0 Mini6523179022.6%
GPT-5.4 Nano (Reasoning, Low)4826199922.3%
Gemini 2.5 Flash6030160021.2%
Gemini 2.5 Flash Lite (Reasoning)4637176021.1%
Qwen 3 32B90640020.1%
WizardLM 2 8x22b40231918019.8%
GPT-5 Nano741184019.3%
Grok 434321212519.1%
DeepSeek V3 (2025-03-24)38271513018.3%
GPT-5.14233115018.2%
Mistral Large 235241614017.9%
Arcee AI: Trinity Mini4518176017.1%
Claude 3.7 Sonnet562061016.7%
Claude 3.5 Sonnet3731150016.6%
GPT-4o, Aug. 6th (temp=0)37171414016.3%
GPT-5.4 Nano24201713415.5%
Qwen 3.5 Flash471594015.0%
GPT-5.4 Mini (Reasoning, Low)2823116013.6%
Z.AI GLM 4.6402030012.6%
Ministral 3B51730012.2%
ByteDance Seed 1.6303000012.0%
GPT-4.1281765011.2%
Grok 4.20 (Beta, Reasoning)391430011.1%
GPT-5.4 Mini (Reasoning)2120130010.9%
Qwen 3.5 27B252500010.0%
Gemini 2.5 Flash (Reasoning)4440009.6%
Aion 2.031140009.0%
GPT-5.4 Nano (Reasoning)20126428.8%
o4 Mini High4210008.6%
Gemini 3 Flash (Preview, Reasoning)3254008.3%
GPT-5 Mini27140008.1%
Gemini 3 Pro (Preview)15128107.2%
GPT-4o Mini (temp=1)2560006.2%
Z.AI GLM 4.519120006.1%
DeepSeek V3 (2024-12-26)16101005.4%
Gemini 3.1 Pro (Preview)2430005.3%
Qwen 3.5 35B2700005.3%
Gemini 2.5 Pro2020004.5%
GPT-51074004.4%
GPT-4.1 Mini955003.7%
Z.AI GLM 4.7 Flash1331003.5%
Nemotron 3 Super1700003.4%
Gemini 3 Flash (Preview)1400002.7%
Z.AI GLM 4.7750002.6%
Qwen 2.5 72B533002.5%
DeepSeek-V2 Chat1100002.1%
DeepSeek V3.1630001.8%
Gemini 3.1 Flash Lite (Preview)430001.6%
ByteDance Seed 2.0 Lite500001.1%
Qwen 3.5 122B500001.1%
GPT-4o Mini (temp=0)500000.9%
GPT-5.2300000.6%
Qwen 3.5 9B000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Inception Mercury000000.0%
Nemotron 3 Nano000000.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Qwen3 235B A22B Instruct 25071001001001009498.7%
Writer: Palmyra X51001001001008196.2%
Rocinante 12B100100100938295.0%
Z.AI GLM 5100100100844986.6%
Llama 3.1 8B100100100784885.2%
Hermes 3 405B10010093913684.1%
Claude Sonnet 4.6 (Reasoning)10010083765582.7%
Claude Haiku 4.51009084694377.2%
Claude Opus 4.5928475646475.9%
Hermes 3 70B10010085741875.3%
Claude Sonnet 4.61009078564674.2%
Claude Sonnet 4.51009465574071.4%
Claude 3.5 Haiku1009071484370.6%
GPT-5.4 (Reasoning, Low)897770635069.6%
Claude Sonnet 4978264544769.0%
Mistral Small 410010078432368.7%
MiniMax M2.71009349494867.7%
GPT-5.4847660595767.2%
Mistral Medium 3.1918867483666.2%
Claude Opus 4.6 (Reasoning)838357534964.9%
Cohere Command R+ (Aug. 2024)1007573512364.6%
GPT-4o, Aug. 6th (temp=1)1008355424264.5%
Z.AI GLM 5 Turbo797069594464.3%
Claude Opus 4.6827672533764.0%
Arcee AI: Trinity Large (Preview)858570572063.5%
WizardLM 2 8x22b866960504962.8%
Qwen 3.5 397B A17B1006755454261.9%
Grok 4.1 Fast747061594561.6%
GPT-4o Mini (temp=1)967053483860.9%
Mistral Large 3726765642859.2%
Gemma 3 27B886052444056.8%
GPT-4.1876561412756.2%
Mistral Small 4 (Reasoning)98755748256.1%
Claude 3 Haiku1001004529054.8%
MiniMax M2.51007143302754.1%
GPT-5.1785951483454.1%
Gemini 2.5 Flash Lite706853502753.9%
GPT-4.1 Nano767368272653.7%
Mistral Large99866120053.4%
Gemini 2.5 Flash Lite (Reasoning)635353534453.2%
Claude Opus 4796850382652.3%
GPT-4o, May 13th (temp=1)755857462351.7%
DeepSeek V3.2695752413450.5%
Mistral Large 2905331312946.8%
MoonshotAI: Kimi K2.5625440383846.2%
Grok 4 Fast605651312645.0%
Mistral Small Creative816536281344.6%
DeepSeek V3 (2025-03-24)696743271644.4%
Qwen 3 32B75524742043.3%
GPT-5.4 (Reasoning)544940373442.8%
Z.AI GLM 4.6645235342642.3%
Gemini 2.5 Flash675343311541.9%
Grok 4675532312441.7%
Z.AI GLM 4.5604747282641.6%
Stealth: Healer Alpha69654528041.4%
Llama 3.1 70B604340382441.1%
Ministral 3B754831271940.2%
o4 Mini High605242281539.5%
Aion 2.0663635312638.9%
GPT-5575044261638.6%
Ministral 3 14B912927252038.4%
GPT-4.1 Mini605232301738.1%
Ministral 3 3B9749319337.9%
Claude 3.5 Sonnet6865560037.7%
GPT-5.4 Mini (Reasoning, Low)704825232337.7%
Stealth: Hunter Alpha584533262336.9%
Gemini 2.5 Pro56554911234.5%
GPT-5.4 Nano564335211634.2%
Gemma 3 4B753431161534.1%
Llama 3.1 Nemotron 70B574837141333.8%
Z.AI GLM 4.7 Flash575423191333.2%
Grok 4.20 (Beta)523634281533.1%
Ministral 3 8B8166130032.2%
o4 Mini63413117731.7%
Claude 3.7 Sonnet68373214831.6%
LFM2 24B403834272031.6%
DeepSeek V3.1474126231630.7%
Mistral Small 3.2 24B60452222029.7%
Grok 4.20 (Beta, Reasoning)483523212029.2%
ByteDance Seed 1.6 Flash45413513427.5%
Gemma 3 12B44343423227.4%
Qwen 3.5 Plus (2026-02-15)43363213125.0%
Ministral 8B5337311024.6%
GPT-5.4 Nano (Reasoning, Low)31303021223.1%
GPT-5.4 Mini372522161523.1%
GPT-5.4 Nano (Reasoning)40332216021.9%
Gemini 3.1 Pro (Preview)333121131021.7%
DeepSeek-V2 Chat42252219021.6%
Mistral NeMO4340210020.9%
Gemini 3 Pro (Preview)34212018018.6%
GPT-4o, Aug. 6th (temp=0)5918150018.3%
Gemini 2.5 Flash (Reasoning)25242213016.8%
GPT-5.4 Mini (Reasoning)3429135116.5%
Nemotron 3 Super2924150013.7%
Qwen 3.5 9B67000013.4%
Z.AI GLM 4.720171613013.2%
Gemini 3 Flash (Preview, Reasoning)2519192012.8%
GPT-4o Mini (temp=0)2516150011.4%
ByteDance Seed 2.0 Mini282040010.4%
DeepSeek V3 (2024-12-26)2199408.7%
Qwen 2.5 72B19166008.1%
GPT-5.23070007.4%
GPT-5 Nano12109406.9%
Gemini 3 Flash (Preview)1875306.4%
Arcee AI: Trinity Mini13101004.7%
ByteDance Seed 1.61751004.7%
ByteDance Seed 2.0 Lite1640004.0%
GPT-5 Mini1820003.9%
Gemini 3.1 Flash Lite (Preview)760002.6%
GPT-4o, May 13th (temp=0)930002.4%
Nemotron 3 Nano700001.4%
Qwen 3.5 Flash200000.4%
Qwen 3.5 122B000000.0%
Qwen 3.5 27B000000.0%
Qwen 3.5 35B000000.0%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Inception Mercury000000.0%

genre

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
GPT-5.4 (Reasoning, Low)1009089887989.3%
GPT-5.4 (Reasoning)1009486857788.4%
GPT-5.4 Mini (Reasoning, Low)1009184837185.8%
Rocinante 12B100100100645082.7%
GPT-5.4 Mini (Reasoning)888679767681.2%
Llama 3.1 8B10010098623879.6%
Llama 3.1 Nemotron 70B10010091514978.2%
GPT-5.4887575716374.4%
GPT-5.4 Mini848166666672.6%
GPT-5.11008563575271.5%
Writer: Palmyra X51008863594470.9%
Llama 3.1 70B898278484368.0%
Mistral Small 4987572452562.8%
WizardLM 2 8x22b878253533862.5%
GPT-4o Mini (temp=1)936564503060.4%
GPT-4o, Aug. 6th (temp=1)937666481459.5%
MiniMax M2.5917052483559.2%
Claude Sonnet 4.5796856524159.2%
Grok 4 Fast907752403759.1%
Qwen 3.5 397B A17B857869382559.0%
Gemini 3 Flash (Preview, Reasoning)836557464058.4%
Gemini 3.1 Pro (Preview)787454503658.3%
Claude Sonnet 4.6 (Reasoning)706861414156.5%
Gemini 3.1 Flash Lite (Preview)765453504956.4%
GPT-4.1656359553956.2%
Qwen3 235B A22B Instruct 2507926358531255.5%
Claude Sonnet 4.6786449473855.1%
Hermes 3 405B737251502855.0%
ByteDance Seed 1.6 Flash100776529054.4%
DeepSeek V3 (2025-03-24)1005045433454.4%
Z.AI GLM 4.5786451413854.4%
Claude Opus 4.6 (Reasoning)666654483453.8%
Z.AI GLM 5 Turbo746747471750.5%
Claude Sonnet 4816337343149.4%
DeepSeek V3.2646258441648.9%
Grok 4.20 (Beta)705343423448.3%
GPT-4o, May 13th (temp=1)874638383248.1%
GPT-5.4 Nano585651502447.6%
Gemini 2.5 Flash (Reasoning)685949402147.6%
Gemini 2.5 Pro684943423447.1%
GPT-5 Mini564645444046.3%
Claude Opus 4.672714731946.1%
Gemini 3 Flash (Preview)796637242345.9%
Cohere Command R+ (Aug. 2024)70604843845.8%
Qwen 3.5 Plus (2026-02-15)665939313045.0%
Grok 4695043422145.0%
Z.AI GLM 5824746311544.1%
Grok 4.1 Fast644543402744.0%
Grok 4.20 (Beta, Reasoning)535251481643.9%
GPT-5.2525146383043.2%
Claude 3.7 Sonnet64545444043.2%
GPT-5634539353242.8%
GPT-5.4 Nano (Reasoning)504745442742.6%
MoonshotAI: Kimi K2.5754336322642.5%
Hermes 3 70B534340383441.7%
Claude 3 Haiku66535330040.4%
Stealth: Hunter Alpha574438372540.2%
ByteDance Seed 2.0 Lite69624325040.0%
Mistral Small 3.2 24B10052420038.8%
Mistral Medium 3.1504136362637.8%
DeepSeek-V2 Chat74572523737.1%
Claude Haiku 4.564623016936.1%
Ministral 3 14B685432161035.9%
GPT-4.1 Nano6863376035.0%
LFM2 24B593734281634.7%
GPT-5.4 Nano (Reasoning, Low)413939282634.5%
Stealth: Healer Alpha504331271633.5%
GPT-4.1 Mini70362928032.7%
Claude Opus 4693023231932.7%
Mistral Large 3413928262531.9%
Gemini 2.5 Flash Lite (Reasoning)593923231731.9%
Qwen 3 32B57503911031.6%
GPT-4o, May 13th (temp=0)554228181331.2%
Gemma 3 12B46393931131.0%
Z.AI GLM 4.6643029201130.8%
DeepSeek V3 (2024-12-26)535227111130.7%
Gemini 2.5 Flash574519161530.5%
Qwen 3.5 35B55512716029.7%
Claude Opus 4.554523011029.3%
ByteDance Seed 1.66360230029.0%
Gemini 3 Pro (Preview)62382914028.6%
Gemma 3 4B46432623328.2%
Mistral Large 24945318728.0%
o4 Mini51471716727.7%
Z.AI GLM 4.7423128261127.7%
Nemotron 3 Super40402424927.5%
Aion 2.059332619027.4%
Mistral Small 4 (Reasoning)52302817826.9%
GPT-4o Mini (temp=0)363229231226.8%
GPT-4o, Aug. 6th (temp=0)42383119426.7%
Claude 3.5 Sonnet7931184126.6%
Mistral Large363524231226.1%
Mistral NeMO423519161325.0%
DeepSeek V3.14342229824.9%
o4 Mini High323128221124.6%
Gemma 3 27B36332821424.5%
Nemotron 3 Nano36302919423.9%
ByteDance Seed 2.0 Mini45412310023.9%
Claude 3.5 Haiku51282516023.8%
Mistral Small Creative46371610923.6%
Qwen 3.5 122B5747120023.2%
Qwen 3.5 Flash4237233021.2%
Gemini 2.5 Flash Lite312519161320.9%
Ministral 8B4232240019.5%
Z.AI GLM 4.7 Flash32231514618.0%
Ministral 3B4334111017.9%
Qwen 2.5 72B35221611016.8%
Ministral 3 8B413040014.9%
Arcee AI: Trinity Large (Preview)372600012.7%
Qwen 3.5 9B3577009.9%
Inception Mercury 2181413008.9%
MiniMax M2.71698006.6%
Qwen 3.5 27B2020004.4%
Ministral 3 3B1400002.8%
GPT-5 Nano553002.7%
Arcee AI: Trinity Mini1200002.3%
Inception Mercury400000.8%
Stealth: Aurora Alpha000000.0%
Model | #1 #2 #3 #4 #5 | Avg
GPT-5.4 | 100 100 100 100 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 100 100 100 100 | 100.0%
Writer: Palmyra X5 | 100 100 100 100 100 | 100.0%
GPT-5.4 (Reasoning) | 100 100 100 100 96 | 99.2%
GPT-5.4 (Reasoning, Low) | 100 100 100 97 91 | 97.5%
GPT-5.1 | 100 100 100 98 86 | 96.8%
Grok 4.1 Fast | 100 100 100 98 62 | 92.0%
Claude Opus 4.6 (Reasoning) | 100 100 98 81 81 | 91.8%
Rocinante 12B | 100 100 100 81 75 | 91.3%
Claude Sonnet 4.5 | 100 98 86 85 74 | 88.6%
Claude Sonnet 4.6 (Reasoning) | 100 100 95 73 68 | 87.3%
GPT-5.4 Mini (Reasoning) | 100 100 86 83 64 | 86.5%
Claude Opus 4.5 | 100 100 100 79 52 | 86.2%
Claude Sonnet 4 | 100 95 90 76 65 | 85.3%
MiniMax M2.5 | 100 100 100 88 30 | 83.6%
Claude Haiku 4.5 | 100 97 90 79 51 | 83.5%
GPT-4o, Aug. 6th (temp=1) | 100 100 97 76 43 | 83.3%
Claude Opus 4.6 | 100 96 75 74 71 | 83.3%
Mistral Small 4 (Reasoning) | 100 100 84 83 49 | 83.0%
DeepSeek V3 (2025-03-24) | 100 90 87 87 49 | 82.7%
GPT-4o Mini (temp=1) | 92 85 85 76 73 | 82.3%
Z.AI GLM 5 | 100 100 88 67 55 | 82.0%
Z.AI GLM 5 Turbo | 100 84 81 73 65 | 80.6%
GPT-5.4 Mini (Reasoning, Low) | 100 100 79 77 42 | 79.8%
GPT-5.4 Mini | 94 86 84 73 62 | 79.8%
Claude 3.5 Haiku | 100 100 100 60 34 | 78.8%
DeepSeek V3.2 | 91 89 80 68 65 | 78.8%
Mistral Small 4 | 90 82 74 71 70 | 77.4%
Mistral Small Creative | 97 91 81 79 33 | 76.1%
Grok 4 | 100 90 89 63 37 | 76.0%
Llama 3.1 8B | 100 100 100 47 27 | 74.8%
Ministral 3 8B | 100 100 66 59 42 | 73.5%
Gemma 3 27B | 82 81 74 65 65 | 73.4%
LFM2 24B | 100 84 83 63 37 | 73.4%
GPT-5.4 Nano (Reasoning) | 100 79 68 62 56 | 73.0%
GPT-5.2 | 83 77 76 66 63 | 72.8%
Grok 4 Fast | 100 92 72 70 25 | 71.7%
Gemma 3 12B | 79 78 71 64 61 | 70.4%
Claude 3.5 Sonnet | 100 93 60 56 40 | 70.0%
GPT-4.1 Nano | 100 100 76 39 34 | 69.6%
Gemini 2.5 Flash | 100 72 65 57 48 | 68.2%
Arcee AI: Trinity Large (Preview) | 100 97 58 53 29 | 67.5%
Z.AI GLM 4.5 | 79 69 64 64 62 | 67.4%
GPT-4.1 | 97 90 70 48 31 | 67.3%
Stealth: Healer Alpha | 85 82 80 46 38 | 66.3%
Claude Sonnet 4.6 | 84 69 63 59 55 | 66.0%
Ministral 3 14B | 87 85 55 53 49 | 65.8%
Qwen 3 32B | 100 90 57 55 23 | 65.0%
Cohere Command R+ (Aug. 2024) | 100 77 64 54 27 | 64.3%
Grok 4.20 (Beta) | 86 76 59 58 41 | 64.0%
Aion 2.0 | 90 65 60 52 52 | 63.8%
Gemma 3 4B | 83 65 57 55 54 | 63.0%
Ministral 8B | 90 82 72 46 25 | 62.9%
GPT-5.4 Nano (Reasoning, Low) | 73 69 67 62 42 | 62.5%
MiniMax M2.7 | 100 91 51 39 29 | 62.1%
Hermes 3 70B | 94 87 57 36 34 | 61.7%
Ministral 3B | 100 97 77 31 0 | 60.9%
Gemini 2.5 Flash (Reasoning) | 91 78 62 48 25 | 60.9%
Claude Opus 4 | 78 77 60 54 26 | 59.1%
GPT-5.4 Nano | 79 60 52 52 46 | 57.6%
Qwen 3.5 Plus (2026-02-15) | 73 65 54 49 46 | 57.4%
Hermes 3 405B | 100 85 58 43 0 | 57.3%
GPT-5 | 71 63 62 54 33 | 56.8%
Gemini 3 Flash (Preview, Reasoning) | 74 71 58 52 28 | 56.5%
ByteDance Seed 1.6 Flash | 100 75 56 28 22 | 56.2%
GPT-4o, May 13th (temp=1) | 84 78 59 33 28 | 56.2%
MoonshotAI: Kimi K2.5 | 83 61 60 50 26 | 56.1%
GPT-5 Mini | 61 59 55 55 49 | 55.9%
Llama 3.1 Nemotron 70B | 100 75 52 51 0 | 55.6%
Grok 4.20 (Beta, Reasoning) | 86 61 45 44 42 | 55.5%
Gemini 2.5 Pro | 88 55 51 49 31 | 54.8%
Claude 3 Haiku | 100 91 31 28 18 | 53.7%
Gemini 3.1 Pro (Preview) | 89 72 40 37 30 | 53.6%
Z.AI GLM 4.7 | 84 65 54 38 26 | 53.5%
Claude 3.7 Sonnet | 80 56 53 41 36 | 53.3%
Gemini 2.5 Flash Lite (Reasoning) | 67 62 56 51 30 | 53.3%
DeepSeek-V2 Chat | 100 74 63 20 6 | 52.6%
Mistral Medium 3.1 | 65 61 49 47 34 | 51.4%
Mistral NeMO | 95 79 50 32 0 | 51.3%
Gemini 2.5 Flash Lite | 63 60 53 41 38 | 51.2%
ByteDance Seed 2.0 Mini | 76 74 67 18 18 | 50.6%
Mistral Large | 76 58 57 49 8 | 49.4%
GPT-4.1 Mini | 59 55 53 41 36 | 48.8%
Mistral Large 2 | 62 54 48 41 35 | 47.8%
Stealth: Hunter Alpha | 70 63 54 35 14 | 47.2%
DeepSeek V3.1 | 52 51 45 42 40 | 46.0%
Gemini 3 Pro (Preview) | 82 53 50 33 13 | 45.9%
Llama 3.1 70B | 100 40 38 34 17 | 45.6%
Gemini 3 Flash (Preview) | 73 64 44 25 19 | 45.0%
o4 Mini High | 72 70 38 26 17 | 44.6%
Qwen 3.5 397B A17B | 57 53 49 31 31 | 44.2%
o4 Mini | 57 56 37 29 26 | 41.0%
Ministral 3 3B | 74 71 36 16 2 | 39.8%
Z.AI GLM 4.7 Flash | 51 47 43 32 24 | 39.5%
Mistral Large 3 | 65 46 38 36 3 | 37.5%
GPT-4o Mini (temp=0) | 64 56 36 18 13 | 37.5%
DeepSeek V3 (2024-12-26) | 100 49 31 6 0 | 37.3%
Z.AI GLM 4.6 | 58 47 34 24 16 | 35.7%
Nemotron 3 Super | 56 40 34 28 17 | 35.0%
ByteDance Seed 2.0 Lite | 72 37 36 25 0 | 34.2%
WizardLM 2 8x22b | 69 45 30 15 11 | 33.9%
Qwen 3.5 Flash | 73 39 35 16 5 | 33.8%
GPT-4o, May 13th (temp=0) | 58 48 29 10 9 | 30.9%
Arcee AI: Trinity Mini | 54 42 38 12 8 | 30.8%
Mistral Small 3.2 24B | 48 47 44 6 0 | 28.9%
GPT-5 Nano | 35 35 31 18 14 | 26.5%
ByteDance Seed 1.6 | 42 35 31 18 0 | 25.5%
Gemini 3.1 Flash Lite (Preview) | 46 43 25 9 5 | 25.5%
Qwen 3.5 122B | 45 37 21 14 7 | 24.8%
GPT-4o, Aug. 6th (temp=0) | 66 20 12 9 3 | 22.0%
Qwen 3.5 35B | 38 24 24 12 5 | 20.4%
Qwen 3.5 9B | 36 26 19 13 0 | 18.8%
Qwen 3.5 27B | 36 26 10 4 0 | 15.2%
Nemotron 3 Nano | 24 18 11 9 1 | 12.5%
Qwen 2.5 72B | 39 1 0 0 0 | 7.9%
Inception Mercury 2 | 0 0 0 0 0 | 0.0%
Stealth: Aurora Alpha | 0 0 0 0 0 | 0.0%
Inception Mercury | 0 0 0 0 0 | 0.0%
Model | #1 #2 #3 #4 #5 | Avg
Llama 3.1 8B | 100 100 100 97 56 | 90.7%
Qwen3 235B A22B Instruct 2507 | 100 92 81 65 54 | 78.5%
Mistral Large | 97 90 82 68 32 | 74.1%
Mistral Small 4 | 92 87 85 59 41 | 72.7%
GPT-5.4 | 90 85 74 64 31 | 68.9%
Writer: Palmyra X5 | 100 96 57 52 37 | 68.3%
Grok 4.20 (Beta, Reasoning) | 89 74 68 62 46 | 67.9%
GPT-5.4 Mini | 79 76 69 65 47 | 67.4%
GPT-5.4 Mini (Reasoning, Low) | 78 69 67 56 54 | 64.9%
Llama 3.1 Nemotron 70B | 85 80 54 53 51 | 64.4%
Mistral Large 2 | 100 73 70 52 27 | 64.3%
Grok 4 Fast | 87 81 66 49 35 | 63.4%
Llama 3.1 70B | 100 100 63 40 12 | 63.2%
GPT-5.4 (Reasoning) | 75 68 63 58 51 | 62.9%
GPT-5.4 (Reasoning, Low) | 84 61 60 58 47 | 62.0%
GPT-5.4 Mini (Reasoning) | 66 63 61 57 56 | 60.7%
Claude 3.5 Haiku | 100 73 64 38 26 | 60.1%
MiniMax M2.5 | 88 68 58 55 19 | 57.8%
Mistral Small Creative | 81 71 50 47 36 | 57.2%
Ministral 3 14B | 72 67 60 51 19 | 53.7%
Grok 4 | 74 57 52 44 38 | 53.0%
GPT-5.1 | 65 61 61 50 23 | 52.1%
Claude Opus 4 | 72 65 58 46 18 | 51.7%
Grok 4.1 Fast | 92 53 40 39 34 | 51.6%
DeepSeek V3 (2025-03-24) | 71 66 43 42 36 | 51.6%
Ministral 8B | 89 76 41 26 21 | 50.6%
Mistral Medium 3.1 | 85 65 54 34 14 | 50.5%
Claude Sonnet 4.5 | 100 60 53 22 17 | 50.3%
Grok 4.20 (Beta) | 71 64 60 34 20 | 49.8%
Claude 3 Haiku | 100 70 33 29 11 | 48.5%
Rocinante 12B | 100 77 46 17 1 | 48.3%
Mistral Small 4 (Reasoning) | 100 63 42 34 0 | 47.9%
Claude Opus 4.5 | 83 75 45 35 2 | 47.7%
GPT-5.4 Nano (Reasoning) | 69 57 47 35 29 | 47.3%
Hermes 3 70B | 60 48 47 38 35 | 45.5%
Hermes 3 405B | 100 40 34 25 25 | 44.6%
Ministral 3 8B | 100 42 34 27 15 | 43.5%
Gemini 3.1 Pro (Preview) | 86 56 38 30 3 | 42.6%
MoonshotAI: Kimi K2.5 | 76 72 27 18 17 | 42.1%
Claude Opus 4.6 | 60 42 41 39 28 | 41.9%
Z.AI GLM 5 Turbo | 88 53 42 17 0 | 40.2%
Claude Haiku 4.5 | 78 50 35 24 8 | 39.1%
GPT-5 Mini | 60 46 41 27 19 | 38.6%
Qwen 3.5 397B A17B | 48 47 47 43 8 | 38.5%
Z.AI GLM 5 | 57 51 49 21 14 | 38.3%
Qwen 3 32B | 91 53 27 9 0 | 35.9%
o4 Mini | 47 47 46 33 5 | 35.6%
ByteDance Seed 1.6 Flash | 53 42 37 27 19 | 35.5%
Arcee AI: Trinity Large (Preview) | 100 40 24 5 3 | 34.4%
GPT-5.4 Nano (Reasoning, Low) | 50 38 31 28 23 | 33.8%
LFM2 24B | 51 43 34 31 8 | 33.6%
Claude Sonnet 4.6 | 69 28 27 21 20 | 33.3%
WizardLM 2 8x22b | 100 40 14 9 0 | 32.7%
Claude Sonnet 4 | 49 44 40 29 0 | 32.4%
Gemma 3 12B | 53 50 35 15 8 | 32.3%
GPT-4.1 | 51 43 37 16 9 | 31.4%
DeepSeek V3 (2024-12-26) | 54 38 36 20 0 | 29.8%
ByteDance Seed 2.0 Mini | 100 30 19 0 0 | 29.8%
Claude Sonnet 4.6 (Reasoning) | 63 36 28 16 4 | 29.3%
Mistral NeMO | 58 25 20 18 17 | 27.7%
Z.AI GLM 4.5 | 35 35 25 21 18 | 27.0%
Nemotron 3 Nano | 50 48 30 8 0 | 26.9%
Ministral 3B | 54 34 21 19 7 | 26.7%
GPT-5.2 | 48 27 24 20 15 | 26.6%
GPT-4o Mini (temp=1) | 34 27 27 23 21 | 26.1%
GPT-4o, Aug. 6th (temp=1) | 43 36 26 12 11 | 25.6%
Mistral Large 3 | 61 41 25 1 0 | 25.6%
o4 Mini High | 56 26 25 11 5 | 24.4%
Claude Opus 4.6 (Reasoning) | 34 30 28 20 7 | 23.8%
Cohere Command R+ (Aug. 2024) | 64 37 11 7 0 | 23.8%
Stealth: Hunter Alpha | 42 27 22 21 0 | 22.4%
MiniMax M2.7 | 45 33 20 12 0 | 21.9%
GPT-4o, May 13th (temp=0) | 46 25 21 16 0 | 21.4%
GPT-5.4 Nano | 33 26 21 19 5 | 20.8%
Gemini 2.5 Flash Lite (Reasoning) | 33 29 24 11 0 | 19.3%
Mistral Small 3.2 24B | 46 41 3 3 0 | 18.6%
Gemini 2.5 Flash (Reasoning) | 56 22 8 0 0 | 17.3%
Qwen 3.5 Plus (2026-02-15) | 37 20 14 7 1 | 15.7%
Gemini 2.5 Pro | 29 22 16 12 0 | 15.6%
GPT-5 | 25 25 19 5 4 | 15.6%
Qwen 2.5 72B | 27 21 18 9 0 | 15.1%
DeepSeek-V2 Chat | 34 25 8 5 0 | 14.5%
GPT-4.1 Mini | 33 24 16 0 0 | 14.4%
Qwen 3.5 35B | 25 24 20 0 0 | 13.8%
Gemini 2.5 Flash Lite | 27 16 14 7 4 | 13.7%
Z.AI GLM 4.6 | 24 23 17 4 0 | 13.5%
Qwen 3.5 27B | 41 27 0 0 0 | 13.5%
Z.AI GLM 4.7 Flash | 29 14 13 12 0 | 13.4%
GPT-4o, Aug. 6th (temp=0) | 32 20 14 0 0 | 13.3%
Stealth: Healer Alpha | 23 19 16 9 0 | 13.2%
Ministral 3 3B | 38 22 3 2 0 | 13.0%
GPT-4o, May 13th (temp=1) | 28 25 9 0 0 | 12.4%
Claude 3.5 Sonnet | 47 5 5 0 0 | 11.6%
Gemma 3 4B | 23 23 6 6 0 | 11.4%
Claude 3.7 Sonnet | 29 16 12 0 0 | 11.2%
DeepSeek V3.2 | 30 14 3 0 0 | 9.3%
GPT-4.1 Nano | 24 9 6 5 0 | 8.8%
Gemini 2.5 Flash | 23 14 3 2 0 | 8.6%
Gemini 3.1 Flash Lite (Preview) | 17 11 8 6 0 | 8.5%
Nemotron 3 Super | 18 11 7 7 0 | 8.4%
Qwen 3.5 Flash | 31 4 3 0 0 | 7.6%
GPT-5 Nano | 14 10 8 3 3 | 7.5%
GPT-4o Mini (temp=0) | 17 14 0 0 0 | 6.1%
Aion 2.0 | 13 9 8 0 0 | 6.1%
Inception Mercury 2 | 30 0 0 0 0 | 6.0%
Qwen 3.5 122B | 17 7 0 0 0 | 4.9%
Gemini 3 Pro (Preview) | 16 7 0 0 0 | 4.6%
DeepSeek V3.1 | 10 9 3 0 0 | 4.5%
Gemma 3 27B | 14 3 2 0 0 | 3.7%
Z.AI GLM 4.7 | 17 0 0 0 0 | 3.5%
ByteDance Seed 2.0 Lite | 7 5 0 0 0 | 2.5%
Gemini 3 Flash (Preview) | 2 1 0 0 0 | 0.7%
Gemini 3 Flash (Preview, Reasoning) | 3 0 0 0 0 | 0.6%
Arcee AI: Trinity Mini | 1 0 0 0 0 | 0.2%
ByteDance Seed 1.6 | 0 0 0 0 0 | 0.0%
Qwen 3.5 9B | 0 0 0 0 0 | 0.0%
Stealth: Aurora Alpha | 0 0 0 0 0 | 0.0%
Inception Mercury | 0 0 0 0 0 | 0.0%
Model | #1 #2 #3 #4 #5 | Avg
GPT-5.4 | 100 100 100 90 75 | 92.9%
Claude Sonnet 4.5 | 100 100 96 95 70 | 92.4%
Grok 4 Fast | 100 98 98 88 78 | 92.3%
Writer: Palmyra X5 | 100 100 100 100 58 | 91.5%
GPT-5.4 (Reasoning) | 100 100 100 93 58 | 90.2%
GPT-5.4 (Reasoning, Low) | 100 100 91 86 71 | 89.7%
Mistral Small 4 (Reasoning) | 100 89 89 75 75 | 85.7%
Mistral Small 4 | 99 89 82 76 62 | 81.7%
Grok 4 | 100 100 74 69 57 | 80.1%
Ministral 3 14B | 100 100 84 63 40 | 77.4%
GPT-5.4 Mini | 97 79 75 64 53 | 73.5%
Qwen3 235B A22B Instruct 2507 | 100 98 74 59 35 | 73.1%
Grok 4.1 Fast | 100 75 71 62 52 | 72.0%
Qwen 3.5 397B A17B | 100 83 68 66 31 | 69.9%
GPT-5.1 | 96 70 58 58 54 | 67.3%
Claude Opus 4 | 91 75 66 65 38 | 67.0%
Claude Sonnet 4.6 | 100 83 71 47 31 | 66.5%
Gemini 3.1 Flash Lite (Preview) | 82 72 71 57 50 | 66.2%
Gemini 2.5 Flash (Reasoning) | 88 84 72 44 41 | 65.8%
MiniMax M2.5 | 100 65 60 55 47 | 65.4%
GPT-5.4 Mini (Reasoning, Low) | 70 68 67 62 58 | 65.1%
MoonshotAI: Kimi K2.5 | 86 65 63 55 52 | 64.3%
Claude 3.5 Sonnet | 97 85 56 49 33 | 64.0%
Llama 3.1 8B | 100 100 100 16 4 | 63.9%
Rocinante 12B | 100 96 88 25 8 | 63.6%
GPT-5.4 Mini (Reasoning) | 92 84 61 50 30 | 63.4%
Z.AI GLM 5 Turbo | 92 68 64 50 40 | 62.7%
Hermes 3 70B | 100 100 76 20 16 | 62.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 70 49 46 32 | 59.4%
Grok 4.20 (Beta, Reasoning) | 78 64 57 56 33 | 57.6%
Claude Opus 4.6 (Reasoning) | 72 68 59 51 36 | 57.3%
Llama 3.1 Nemotron 70B | 96 82 43 41 21 | 56.7%
Claude Opus 4.5 | 100 57 44 42 31 | 54.9%
GPT-5 | 71 60 57 53 34 | 54.8%
Mistral NeMO | 92 59 49 40 33 | 54.5%
Gemma 3 12B | 96 70 61 29 9 | 53.3%
Stealth: Hunter Alpha | 83 54 48 46 30 | 52.1%
GPT-4.1 Nano | 83 56 51 46 23 | 51.8%
Mistral Large 2 | 94 75 60 22 8 | 51.7%
Qwen 3 32B | 100 80 36 25 17 | 51.6%
Claude Sonnet 4 | 73 69 54 43 17 | 51.4%
Claude Sonnet 4.6 (Reasoning) | 100 89 33 18 14 | 50.7%
Stealth: Healer Alpha | 76 55 52 38 31 | 50.6%
Mistral Large 3 | 92 53 48 45 13 | 50.2%
Claude Opus 4.6 | 86 62 47 43 8 | 49.3%
Mistral Large | 64 62 57 53 8 | 49.0%
GPT-4.1 | 76 72 55 20 19 | 48.3%
Gemini 3 Flash (Preview) | 80 69 44 27 18 | 47.6%
GPT-5.4 Nano | 63 51 49 48 26 | 47.4%
Mistral Small Creative | 71 57 55 53 0 | 47.1%
Mistral Medium 3.1 | 77 47 47 34 31 | 47.0%
GPT-5.4 Nano (Reasoning, Low) | 65 59 56 26 25 | 46.2%
Hermes 3 405B | 100 55 27 24 23 | 45.6%
Gemini 3.1 Pro (Preview) | 100 75 40 12 0 | 45.3%
DeepSeek V3 (2025-03-24) | 79 70 62 14 0 | 45.1%
GPT-5.4 Nano (Reasoning) | 49 48 43 42 40 | 44.5%
MiniMax M2.7 | 94 42 31 26 23 | 43.2%
Claude 3.7 Sonnet | 81 50 49 22 11 | 42.7%
GPT-4o, Aug. 6th (temp=1) | 56 55 37 34 30 | 42.5%
Z.AI GLM 5 | 55 47 45 32 32 | 42.3%
DeepSeek V3.2 | 80 52 40 37 0 | 41.8%
Arcee AI: Trinity Large (Preview) | 62 46 36 31 30 | 41.1%
Claude Haiku 4.5 | 60 54 52 22 14 | 40.2%
GPT-5.2 | 62 54 38 28 18 | 40.0%
Ministral 8B | 63 48 37 34 15 | 39.4%
Gemma 3 27B | 63 58 32 24 16 | 38.7%
Claude 3.5 Haiku | 100 35 28 25 0 | 37.5%
Mistral Small 3.2 24B | 71 69 43 4 0 | 37.4%
Grok 4.20 (Beta) | 52 48 47 19 17 | 36.8%
Ministral 3 8B | 77 69 38 0 0 | 36.7%
GPT-4o Mini (temp=1) | 69 49 46 18 1 | 36.7%
WizardLM 2 8x22b | 65 48 38 16 11 | 35.4%
Z.AI GLM 4.6 | 59 34 29 27 26 | 35.1%
ByteDance Seed 2.0 Mini | 52 51 46 25 0 | 34.7%
Claude 3 Haiku | 71 46 40 16 0 | 34.6%
ByteDance Seed 1.6 | 87 49 20 11 0 | 33.5%
DeepSeek-V2 Chat | 75 43 33 11 0 | 32.6%
ByteDance Seed 1.6 Flash | 57 38 38 21 8 | 32.3%
Qwen 3.5 122B | 53 48 43 12 0 | 31.3%
Gemini 2.5 Pro | 67 31 23 20 14 | 31.0%
Ministral 3B | 53 43 29 27 0 | 30.4%
Gemini 2.5 Flash Lite | 55 48 24 20 2 | 29.8%
o4 Mini High | 39 39 32 23 15 | 29.4%
Gemini 2.5 Flash | 64 32 26 19 5 | 29.1%
Qwen 3.5 Flash | 46 46 27 25 0 | 28.8%
o4 Mini | 40 36 35 28 6 | 28.8%
ByteDance Seed 2.0 Lite | 87 41 10 0 0 | 27.7%
Aion 2.0 | 53 43 32 9 0 | 27.7%
GPT-5 Mini | 45 35 23 20 14 | 27.5%
Qwen 3.5 Plus (2026-02-15) | 47 35 27 18 7 | 26.7%
Ministral 3 3B | 48 39 27 12 0 | 25.2%
Z.AI GLM 4.7 | 52 39 23 12 0 | 25.1%
Gemini 3 Flash (Preview, Reasoning) | 80 26 17 0 0 | 24.6%
Llama 3.1 70B | 51 30 28 7 5 | 24.2%
Nemotron 3 Nano | 79 20 15 2 0 | 23.1%
GPT-4.1 Mini | 36 35 31 11 0 | 22.6%
LFM2 24B | 62 26 20 5 0 | 22.5%
Cohere Command R+ (Aug. 2024) | 57 24 18 10 0 | 21.9%
GPT-4o, Aug. 6th (temp=0) | 49 38 14 5 4 | 21.8%
Nemotron 3 Super | 45 29 16 12 5 | 21.3%
DeepSeek V3.1 | 54 35 13 4 0 | 21.1%
Qwen 3.5 35B | 71 18 9 5 0 | 20.7%
Z.AI GLM 4.5 | 56 21 12 11 0 | 19.9%
GPT-4o, May 13th (temp=1) | 31 21 20 14 1 | 17.2%
GPT-4o, May 13th (temp=0) | 43 25 16 0 0 | 17.0%
Qwen 3.5 27B | 40 22 14 5 0 | 16.2%
Gemini 3 Pro (Preview) | 26 25 13 9 5 | 15.5%
Gemma 3 4B | 30 22 19 7 0 | 15.5%
DeepSeek V3 (2024-12-26) | 46 11 10 2 0 | 13.7%
Qwen 2.5 72B | 31 17 13 7 0 | 13.6%
Z.AI GLM 4.7 Flash | 22 17 13 11 5 | 13.5%
Qwen 3.5 9B | 26 26 3 0 0 | 11.0%
GPT-5 Nano | 23 10 9 8 0 | 9.8%
GPT-4o Mini (temp=0) | 21 14 0 0 0 | 7.0%
Arcee AI: Trinity Mini | 34 0 0 0 0 | 6.7%
Inception Mercury | 9 0 0 0 0 | 1.9%
Inception Mercury 2 | 7 0 0 0 0 | 1.4%
Stealth: Aurora Alpha | 0 0 0 0 0 | 0.0%
Model | #1 #2 #3 #4 #5 | Avg
Rocinante 12B | 100 100 100 87 69 | 91.3%
GPT-5.4 | 100 92 84 80 79 | 87.0%
Llama 3.1 8B | 100 100 100 93 9 | 80.4%
GPT-5.4 Mini (Reasoning, Low) | 100 86 84 73 54 | 79.5%
GPT-5.4 (Reasoning) | 92 83 82 75 56 | 77.5%
GPT-5.4 Mini (Reasoning) | 93 93 80 72 42 | 75.9%
Qwen3 235B A22B Instruct 2507 | 86 85 77 70 46 | 72.7%
GPT-5.4 Mini | 92 88 72 52 50 | 70.9%
Claude 3.5 Haiku | 100 100 75 51 19 | 69.0%
Mistral Small 4 | 100 95 54 49 46 | 68.8%
GPT-5.4 (Reasoning, Low) | 76 70 66 60 53 | 64.7%
Ministral 3 14B | 88 77 65 52 38 | 64.0%
MiniMax M2.7 | 79 68 62 55 49 | 62.8%
Mistral Small 4 (Reasoning) | 92 63 58 43 42 | 59.9%
Llama 3.1 70B | 100 80 49 47 13 | 57.9%
Grok 4 | 87 79 42 40 38 | 57.3%
Mistral Large 3 | 100 72 45 36 31 | 56.8%
Writer: Palmyra X5 | 100 87 60 32 0 | 55.9%
GPT-4.1 | 71 65 60 49 33 | 55.5%
Grok 4.1 Fast | 81 60 50 50 36 | 55.3%
Claude Sonnet 4.5 | 85 72 68 39 10 | 54.9%
Grok 4 Fast | 75 69 52 41 34 | 54.4%
Claude Opus 4.5 | 80 61 53 42 32 | 53.8%
Hermes 3 405B | 100 52 48 45 18 | 52.7%
Claude Opus 4.6 | 90 49 49 43 28 | 51.9%
GPT-5.1 | 69 53 49 42 38 | 50.3%
Claude Opus 4 | 92 59 43 27 19 | 48.1%
ByteDance Seed 1.6 Flash | 63 57 56 42 20 | 47.5%
DeepSeek V3 (2025-03-24) | 100 68 34 34 0 | 47.3%
Gemini 3.1 Pro (Preview) | 90 81 33 27 0 | 46.2%
Grok 4.20 (Beta) | 72 62 46 26 25 | 46.1%
Qwen 3.5 397B A17B | 68 59 56 30 10 | 44.5%
Claude Haiku 4.5 | 77 60 37 34 14 | 44.4%
Claude 3 Haiku | 64 58 56 22 20 | 44.0%
Z.AI GLM 5 | 77 52 44 31 15 | 44.0%
Grok 4.20 (Beta, Reasoning) | 61 55 38 36 30 | 43.8%
Cohere Command R+ (Aug. 2024) | 66 43 39 37 33 | 43.6%
MoonshotAI: Kimi K2.5 | 60 56 51 36 0 | 40.5%
Qwen 3.5 35B | 72 43 41 24 19 | 40.0%
Mistral Small Creative | 74 45 42 38 0 | 39.7%
Claude Opus 4.6 (Reasoning) | 58 52 37 29 22 | 39.5%
Z.AI GLM 5 Turbo | 54 53 42 35 14 | 39.4%
GPT-5.4 Nano (Reasoning) | 62 47 42 30 14 | 39.2%
Ministral 3 8B | 69 56 43 27 0 | 39.2%
MiniMax M2.5 | 90 42 34 27 0 | 38.7%
Ministral 8B | 66 59 51 10 4 | 37.8%
Gemini 2.5 Pro | 58 46 34 24 24 | 37.2%
GPT-5 Mini | 50 40 31 30 27 | 35.7%
GPT-4o Mini (temp=1) | 58 36 33 30 19 | 35.5%
Claude Sonnet 4 | 63 55 28 24 0 | 34.3%
Mistral Medium 3.1 | 43 41 39 27 20 | 34.0%
GPT-5.4 Nano (Reasoning, Low) | 46 39 37 36 11 | 33.9%
Qwen 3 32B | 56 52 37 21 0 | 33.2%
Arcee AI: Trinity Large (Preview) | 56 46 30 21 7 | 31.9%
Gemini 2.5 Flash (Reasoning) | 65 31 22 21 17 | 31.5%
DeepSeek-V2 Chat | 78 27 24 20 5 | 30.8%
DeepSeek V3 (2024-12-26) | 51 30 30 26 13 | 30.1%
GPT-4o, May 13th (temp=0) | 61 29 28 20 10 | 29.8%
Mistral Large | 50 47 46 6 0 | 29.6%
Gemini 2.5 Flash Lite | 56 39 38 11 0 | 29.0%
Hermes 3 70B | 72 33 27 11 0 | 28.8%
Mistral Large 2 | 69 36 26 11 1 | 28.7%
Mistral Small 3.2 24B | 49 41 21 18 14 | 28.3%
Z.AI GLM 4.5 | 48 32 26 24 9 | 28.0%
Claude 3.5 Sonnet | 31 30 28 25 24 | 27.6%
Gemini 3.1 Flash Lite (Preview) | 49 44 44 0 0 | 27.5%
LFM2 24B | 42 38 26 17 13 | 27.4%
GPT-5.4 Nano | 35 34 26 23 19 | 27.4%
o4 Mini | 49 47 27 9 3 | 27.1%
GPT-4.1 Nano | 75 35 19 0 0 | 26.0%
Gemini 2.5 Flash Lite (Reasoning) | 37 34 26 18 14 | 26.0%
Llama 3.1 Nemotron 70B | 51 29 19 17 11 | 25.5%
Stealth: Hunter Alpha | 37 34 22 17 15 | 25.0%
Aion 2.0 | 51 41 19 14 0 | 24.9%
DeepSeek V3.1 | 61 28 25 8 0 | 24.4%
o4 Mini High | 54 34 27 0 0 | 23.0%
Ministral 3B | 40 38 28 9 0 | 22.9%
GPT-4.1 Mini | 45 36 17 16 1 | 22.8%
GPT-5.2 | 35 29 19 16 13 | 22.4%
DeepSeek V3.2 | 47 40 17 4 3 | 22.2%
Gemma 3 27B | 48 31 21 11 0 | 22.0%
GPT-4o, May 13th (temp=1) | 56 36 17 0 0 | 21.8%
Ministral 3 3B | 37 33 22 8 0 | 19.9%
GPT-4o, Aug. 6th (temp=1) | 34 30 25 9 0 | 19.6%
GPT-5 | 36 22 18 11 10 | 19.3%
Mistral NeMO | 55 20 19 0 0 | 18.8%
Gemini 2.5 Flash | 45 23 15 7 3 | 18.5%
Z.AI GLM 4.6 | 31 28 25 7 0 | 18.3%
Claude Sonnet 4.6 (Reasoning) | 38 28 10 9 6 | 18.2%
Claude 3.7 Sonnet | 38 27 20 1 0 | 17.0%
Stealth: Healer Alpha | 45 30 9 0 0 | 16.8%
WizardLM 2 8x22b | 58 12 8 2 0 | 16.1%
ByteDance Seed 1.6 | 37 18 13 9 3 | 15.9%
Qwen 3.5 Flash | 44 21 9 0 0 | 14.8%
GPT-4o, Aug. 6th (temp=0) | 38 27 6 0 0 | 14.3%
ByteDance Seed 2.0 Lite | 30 16 12 5 2 | 12.8%
ByteDance Seed 2.0 Mini | 27 25 10 2 0 | 12.7%
GPT-4o Mini (temp=0) | 43 20 0 0 0 | 12.7%
Stealth: Aurora Alpha | 60 0 0 0 0 | 12.0%
Gemma 3 4B | 31 18 6 3 0 | 11.8%
Gemini 3 Flash (Preview) | 28 15 10 0 0 | 10.5%
Qwen 2.5 72B | 31 13 8 0 0 | 10.5%
Qwen 3.5 27B | 42 8 0 0 0 | 9.9%
Nemotron 3 Super | 30 16 0 0 0 | 9.2%
Z.AI GLM 4.7 | 31 7 3 0 0 | 8.3%
Inception Mercury | 41 0 0 0 0 | 8.2%
Claude Sonnet 4.6 | 31 5 4 0 0 | 8.1%
Gemma 3 12B | 23 12 5 0 0 | 8.0%
Gemini 3 Flash (Preview, Reasoning) | 17 12 7 3 0 | 7.7%
Gemini 3 Pro (Preview) | 14 10 9 6 0 | 7.7%
Qwen 3.5 122B | 20 8 4 3 0 | 7.0%
Nemotron 3 Nano | 14 6 4 0 0 | 4.9%
Z.AI GLM 4.7 Flash | 10 8 7 0 0 | 4.8%
Qwen 3.5 9B | 19 0 0 0 0 | 3.9%
Arcee AI: Trinity Mini | 19 0 0 0 0 | 3.8%
GPT-5 Nano | 10 3 2 0 0 | 3.0%
Inception Mercury 2 | 12 0 0 0 0 | 2.5%
Qwen 3.5 Plus (2026-02-15) | 10 1 0 0 0 | 2.2%
Model | #1 #2 #3 #4 #5 | Avg
Llama 3.1 8B | 100 100 100 100 100 | 100.0%
GPT-5.4 (Reasoning) | 100 100 100 93 85 | 95.6%
GPT-5.4 | 100 98 96 93 82 | 93.8%
Writer: Palmyra X5 | 100 100 100 100 64 | 92.6%
Qwen3 235B A22B Instruct 2507 | 100 100 100 80 78 | 91.7%
Mistral Small 4 (Reasoning) | 100 100 80 78 75 | 86.5%
GPT-5.4 (Reasoning, Low) | 100 87 75 71 66 | 79.8%
Claude Sonnet 4.5 | 100 93 69 67 66 | 79.1%
GPT-5.1 | 100 87 81 62 62 | 78.4%
Claude 3 Haiku | 95 88 70 69 67 | 77.9%
Hermes 3 70B | 100 100 90 63 17 | 74.2%
Claude 3.5 Haiku | 100 85 68 58 57 | 73.6%
GPT-4o Mini (temp=1) | 100 79 71 66 52 | 73.5%
Z.AI GLM 4.5 | 85 79 79 66 57 | 72.9%
Z.AI GLM 5 | 100 91 63 62 45 | 72.2%
Claude Opus 4.5 | 100 90 68 63 39 | 72.1%
Gemini 2.5 Flash Lite | 100 82 76 66 35 | 71.9%
Llama 3.1 70B | 90 82 77 56 51 | 71.2%
Claude Sonnet 4 | 95 93 81 43 40 | 70.7%
GPT-5.4 Mini (Reasoning) | 100 84 67 67 35 | 70.7%
MoonshotAI: Kimi K2.5 | 87 79 70 63 51 | 70.1%
Claude Haiku 4.5 | 100 71 62 60 52 | 68.9%
GPT-4.1 Mini | 100 87 56 52 44 | 67.9%
MiniMax M2.5 | 100 77 65 51 46 | 67.7%
Z.AI GLM 5 Turbo | 100 80 68 52 37 | 67.5%
MiniMax M2.7 | 89 89 66 52 33 | 65.7%
Claude Sonnet 4.6 (Reasoning) | 87 68 60 53 53 | 64.1%
Rocinante 12B | 100 62 57 52 48 | 63.6%
Mistral Small 4 | 100 96 51 42 29 | 63.5%
DeepSeek V3 (2025-03-24) | 99 67 66 46 34 | 62.4%
GPT-5.4 Mini | 81 78 75 41 36 | 62.4%
GPT-4.1 | 99 63 57 51 41 | 62.1%
Gemini 2.5 Flash Lite (Reasoning) | 79 74 61 59 36 | 61.8%
GPT-5.4 Mini (Reasoning, Low) | 78 68 57 54 50 | 61.3%
Claude Opus 4 | 100 69 55 43 36 | 60.6%
Stealth: Hunter Alpha | 83 64 59 49 47 | 60.6%
Grok 4.20 (Beta, Reasoning) | 73 67 61 51 47 | 59.9%
Hermes 3 405B | 100 88 53 47 8 | 59.2%
Ministral 3 8B | 88 74 48 45 40 | 58.9%
GPT-4.1 Nano | 99 83 73 26 11 | 58.4%
o4 Mini High | 79 77 48 44 41 | 57.8%
Qwen 3.5 397B A17B | 84 68 55 49 33 | 57.7%
GPT-4o, Aug. 6th (temp=1) | 92 85 54 40 18 | 57.7%
Llama 3.1 Nemotron 70B | 78 62 58 48 39 | 56.9%
Gemini 3.1 Pro (Preview) | 85 60 51 46 40 | 56.4%
Grok 4 Fast | 81 78 57 43 17 | 55.2%
Claude 3.5 Sonnet | 91 65 53 36 23 | 53.6%
Grok 4 | 85 59 49 36 34 | 52.5%
Gemini 2.5 Flash (Reasoning) | 74 70 44 40 34 | 52.3%
GPT-4o, Aug. 6th (temp=0) | 92 77 41 32 16 | 51.7%
Claude Opus 4.6 (Reasoning) | 75 58 43 41 38 | 51.0%
Ministral 8B | 100 70 58 16 11 | 51.0%
o4 Mini | 69 65 51 36 31 | 50.6%
GPT-5.4 Nano | 69 53 51 49 30 | 50.5%
Gemma 3 12B | 75 53 48 40 36 | 50.3%
LFM2 24B | 77 63 49 36 20 | 49.2%
Grok 4.1 Fast | 65 53 48 41 38 | 49.1%
GPT-5.4 Nano (Reasoning) | 58 56 54 41 36 | 49.1%
Claude Sonnet 4.6 | 96 52 40 30 27 | 48.9%
GPT-5.2 | 61 59 45 43 36 | 48.8%
Claude 3.7 Sonnet | 70 68 49 33 21 | 48.2%
GPT-5.4 Nano (Reasoning, Low) | 62 56 51 43 26 | 47.6%
Mistral Medium 3.1 | 63 57 43 41 32 | 47.1%
Grok 4.20 (Beta) | 60 56 50 39 29 | 46.9%
Claude Opus 4.6 | 61 46 46 43 25 | 44.0%
Ministral 3 3B | 90 50 45 22 7 | 42.9%
Z.AI GLM 4.6 | 71 54 37 31 16 | 41.8%
Gemini 2.5 Pro | 51 48 48 36 26 | 41.8%
Z.AI GLM 4.7 Flash | 64 56 45 31 12 | 41.7%
Mistral NeMO | 84 62 38 12 12 | 41.6%
Qwen 3.5 Plus (2026-02-15) | 53 51 46 38 20 | 41.6%
GPT-4o, May 13th (temp=1) | 65 49 40 34 12 | 39.9%
Ministral 3 14B | 80 42 36 29 11 | 39.7%
Mistral Large 3 | 61 49 46 29 12 | 39.5%
Gemini 2.5 Flash | 71 63 36 19 8 | 39.2%
Mistral Small Creative | 64 48 43 34 6 | 39.0%
Qwen 3.5 9B | 67 58 46 23 0 | 38.8%
Gemma 3 27B | 66 47 39 35 8 | 38.8%
DeepSeek V3.2 | 59 44 36 32 20 | 38.1%
Cohere Command R+ (Aug. 2024) | 77 33 32 28 19 | 38.0%
DeepSeek V3.1 | 54 54 36 35 10 | 37.8%
Aion 2.0 | 55 53 42 39 0 | 37.8%
Arcee AI: Trinity Large (Preview) | 55 50 45 37 0 | 37.5%
DeepSeek-V2 Chat | 76 34 34 23 12 | 35.8%
ByteDance Seed 1.6 Flash | 54 36 35 27 25 | 35.4%
Stealth: Healer Alpha | 52 42 37 31 11 | 34.5%
Qwen 3.5 35B | 86 44 17 13 8 | 33.6%
Qwen 3 32B | 51 49 43 23 0 | 33.2%
Mistral Large | 51 43 37 27 3 | 32.3%
ByteDance Seed 2.0 Lite | 100 39 20 2 0 | 32.0%
Nemotron 3 Super | 54 37 36 30 0 | 31.5%
WizardLM 2 8x22b | 100 53 4 0 0 | 31.4%
GPT-5 | 56 33 30 25 9 | 30.6%
GPT-4o, May 13th (temp=0) | 70 54 25 3 0 | 30.5%
Mistral Large 2 | 43 42 34 31 0 | 30.0%
DeepSeek V3 (2024-12-26) | 96 19 17 17 0 | 29.8%
Qwen 3.5 Flash | 49 37 35 14 8 | 28.5%
GPT-5 Mini | 53 39 26 16 9 | 28.4%
Ministral 3B | 68 20 18 17 2 | 25.0%
Mistral Small 3.2 24B | 72 21 19 0 0 | 22.5%
Gemma 3 4B | 48 29 18 14 0 | 21.8%
Gemini 3 Flash (Preview) | 33 24 19 17 13 | 21.3%
Arcee AI: Trinity Mini | 40 31 20 9 5 | 21.1%
Gemini 3 Flash (Preview, Reasoning) | 43 18 17 16 10 | 20.7%
Qwen 3.5 27B | 34 27 18 17 7 | 20.6%
Nemotron 3 Nano | 31 27 20 15 5 | 19.6%
Gemini 3.1 Flash Lite (Preview) | 30 27 22 15 0 | 18.6%
GPT-4o Mini (temp=0) | 25 23 14 13 11 | 17.2%
ByteDance Seed 2.0 Mini | 33 21 16 9 6 | 16.7%
Gemini 3 Pro (Preview) | 32 24 18 4 0 | 15.5%
Qwen 2.5 72B | 49 16 11 0 0 | 15.4%
ByteDance Seed 1.6 | 45 12 10 6 0 | 14.5%
Qwen 3.5 122B | 37 15 7 6 0 | 12.9%
GPT-5 Nano | 23 15 12 8 0 | 11.7%
Z.AI GLM 4.7 | 28 14 12 2 0 | 11.2%
Inception Mercury | 6 0 0 0 0 | 1.3%
Inception Mercury 2 | 0 0 0 0 0 | 0.0%
Stealth: Aurora Alpha | 0 0 0 0 0 | 0.0%

Novelcrafter Default Prompt

Model | #1 #2 #3 #4 #5 | Avg
Hermes 3 70B | 100 100 94 81 42 | 83.4%
Qwen3 235B A22B Instruct 2507 | 84 82 77 75 70 | 77.4%
Rocinante 12B | 100 93 73 60 60 | 77.2%
GPT-5.4 | 95 86 84 62 54 | 76.4%
Writer: Palmyra X5 | 100 83 78 65 43 | 74.0%
Llama 3.1 8B | 100 100 74 56 23 | 70.6%
GPT-5.4 (Reasoning) | 85 80 71 66 49 | 70.1%
Claude Sonnet 4.6 | 100 79 77 46 23 | 64.9%
GPT-5.4 (Reasoning, Low) | 82 82 76 43 32 | 63.0%
Claude Opus 4.6 (Reasoning) | 84 73 58 49 48 | 62.5%
Llama 3.1 Nemotron 70B | 100 100 73 27 13 | 62.4%
Claude Opus 4.6 | 76 67 66 46 41 | 59.2%
ByteDance Seed 2.0 Lite | 85 85 79 21 12 | 56.3%
Hermes 3 405B | 98 82 63 23 16 | 56.1%
WizardLM 2 8x22b | 80 62 57 55 21 | 55.2%
Mistral Small Creative | 94 70 48 33 29 | 55.0%
GPT-5.1 | 72 55 49 43 41 | 52.3%
Claude 3 Haiku | 73 64 43 40 35 | 51.2%
Claude Sonnet 4.5 | 69 49 46 45 42 | 50.2%
Gemma 3 12B | 100 51 43 39 16 | 50.0%
Claude Sonnet 4.6 (Reasoning) | 77 63 50 31 24 | 48.9%
Gemma 3 4B | 63 60 53 47 20 | 48.6%
Llama 3.1 70B | 100 43 41 37 21 | 48.4%
Mistral Small 4 | 60 59 51 32 23 | 45.0%
GPT-5.4 Mini | 63 46 42 39 33 | 44.4%
Claude Haiku 4.5 | 78 57 43 27 16 | 44.3%
Z.AI GLM 5 | 85 50 36 25 25 | 44.1%
Claude Opus 4.5 | 82 55 34 26 21 | 43.5%
Gemini 3.1 Flash Lite (Preview) | 60 56 45 27 26 | 42.8%
GPT-5.4 Mini (Reasoning, Low) | 71 41 39 35 23 | 41.9%
Claude Opus 4 | 67 64 42 34 0 | 41.6%
ByteDance Seed 1.6 Flash | 67 58 43 25 12 | 40.8%
Claude Sonnet 4 | 72 42 39 33 18 | 40.7%
Z.AI GLM 4.5 | 66 50 36 27 17 | 39.4%
Grok 4.1 Fast | 75 53 25 21 20 | 39.0%
MiniMax M2.5 | 100 48 39 5 3 | 39.0%
Gemini 2.5 Flash | 58 51 50 26 4 | 37.7%
DeepSeek V3.1 | 63 51 37 21 15 | 37.4%
Ministral 3 14B | 60 43 36 25 20 | 36.9%
Claude 3.7 Sonnet | 70 50 34 20 10 | 36.7%
GPT-4o, Aug. 6th (temp=1) | 65 62 21 21 10 | 35.8%
Grok 4.20 (Beta, Reasoning) | 45 42 35 33 23 | 35.4%
Mistral Small 4 (Reasoning) | 58 48 37 20 11 | 34.7%
DeepSeek V3 (2025-03-24) | 71 71 14 5 3 | 32.8%
Cohere Command R+ (Aug. 2024) | 100 41 21 0 0 | 32.5%
Gemini 2.5 Flash (Reasoning) | 53 39 34 21 14 | 32.4%
GPT-5.4 Mini (Reasoning) | 55 49 30 14 13 | 32.1%
Stealth: Healer Alpha | 50 47 41 17 2 | 31.3%
Gemini 3 Flash (Preview, Reasoning) | 49 39 31 23 12 | 30.9%
Gemini 2.5 Pro | 70 40 32 2 0 | 29.0%
Gemini 2.5 Flash Lite | 46 44 31 21 2 | 28.9%
GPT-4o, Aug. 6th (temp=0) | 74 34 25 7 3 | 28.5%
DeepSeek V3.2 | 37 33 27 25 20 | 28.3%
Grok 4 | 51 33 27 23 7 | 28.2%
ByteDance Seed 2.0 Mini | 49 39 33 19 0 | 28.0%
Stealth: Hunter Alpha | 46 34 33 21 5 | 27.8%
GPT-4o, May 13th (temp=1) | 39 33 26 25 15 | 27.3%
GPT-4o Mini (temp=1) | 47 45 21 13 10 | 27.0%
Gemini 2.5 Flash Lite (Reasoning) | 47 31 29 26 0 | 26.7%
GPT-4.1 | 59 36 19 19 0 | 26.6%
Ministral 3 8B | 81 34 16 0 0 | 26.3%
Grok 4.20 (Beta) | 40 34 25 16 16 | 26.3%
Mistral Medium 3.1 | 46 29 28 16 12 | 26.1%
Z.AI GLM 4.6 | 45 25 23 20 17 | 26.1%
MoonshotAI: Kimi K2.5 | 43 37 29 16 0 | 25.0%
Mistral Large 2 | 48 38 25 13 0 | 24.6%
o4 Mini | 40 38 23 21 0 | 24.4%
GPT-5.4 Nano | 34 29 25 18 14 | 24.2%
Qwen 3.5 397B A17B | 43 31 27 12 7 | 24.1%
o4 Mini High | 64 22 17 15 3 | 24.1%
GPT-4o, May 13th (temp=0) | 41 35 26 15 0 | 23.3%
Grok 4 Fast | 48 27 25 9 6 | 22.9%
GPT-5.2 | 45 21 20 19 5 | 22.2%
Z.AI GLM 5 Turbo | 40 34 23 11 0 | 21.6%
Arcee AI: Trinity Large (Preview) | 42 29 20 9 8 | 21.4%
Gemini 3 Flash (Preview) | 51 29 12 9 3 | 20.7%
Claude 3.5 Sonnet | 38 38 26 0 0 | 20.4%
Claude 3.5 Haiku | 37 31 20 10 0 | 19.7%
MiniMax M2.7 | 42 27 13 8 3 | 18.7%
GPT-5.4 Nano (Reasoning) | 31 24 24 14 0 | 18.5%
Qwen 3.5 35B | 57 35 0 0 0 | 18.5%
Mistral NeMO | 50 30 11 0 0 | 18.1%
GPT-4.1 Mini | 45 28 9 7 2 | 18.1%
Mistral Large 3 | 45 23 16 5 2 | 18.1%
Qwen 3 32B | 43 28 19 0 0 | 17.9%
Gemini 3 Pro (Preview) | 34 26 14 7 4 | 17.1%
GPT-5 | 31 23 17 14 0 | 17.0%
Mistral Large | 38 27 19 0 0 | 16.6%
Gemma 3 27B | 56 21 3 0 0 | 16.1%
GPT-4.1 Nano | 23 17 16 16 7 | 15.7%
Ministral 8B | 36 23 16 0 0 | 14.9%
LFM2 24B | 39 20 11 0 0 | 13.9%
Z.AI GLM 4.7 | 29 23 12 0 0 | 12.9%
GPT-5 Mini | 37 18 7 0 0 | 12.5%
DeepSeek V3 (2024-12-26) | 27 14 13 8 0 | 12.3%
GPT-5.4 Nano (Reasoning, Low) | 21 16 14 7 2 | 11.9%
Aion 2.0 | 40 10 3 2 0 | 10.8%
Qwen 2.5 72B | 24 15 9 0 0 | 9.5%
Ministral 3B | 27 18 1 0 0 | 9.0%
Mistral Small 3.2 24B | 24 13 6 0 0 | 8.7%
Qwen 3.5 Plus (2026-02-15) | 42 1 0 0 0 | 8.6%
Qwen 3.5 Flash | 16 11 7 0 0 | 6.7%
Z.AI GLM 4.7 Flash | 24 4 0 0 0 | 5.5%
DeepSeek-V2 Chat | 9 6 0 0 0 | 3.0%
Qwen 3.5 9B | 13 2 0 0 0 | 3.0%
Gemini 3.1 Pro (Preview) | 10 2 1 0 0 | 2.7%
GPT-4o Mini (temp=0) | 6 5 0 0 0 | 2.2%
Nemotron 3 Nano | 7 0 0 0 0 | 1.5%
Qwen 3.5 27B | 7 0 0 0 0 | 1.4%
Nemotron 3 Super | 7 0 0 0 0 | 1.3%
Ministral 3 3B | 3 2 0 0 0 | 1.0%
Qwen 3.5 122B | 3 1 0 0 0 | 0.8%
ByteDance Seed 1.6 | 0 0 0 0 0 | 0.0%
Inception Mercury 2 | 0 0 0 0 0 | 0.0%
Stealth: Aurora Alpha | 0 0 0 0 0 | 0.0%
GPT-5 Nano | 0 0 0 0 0 | 0.0%
Inception Mercury | 0 0 0 0 0 | 0.0%
Arcee AI: Trinity Mini | 0 0 0 0 0 | 0.0%
Model | #1 #2 #3 #4 #5 | Avg
Qwen3 235B A22B Instruct 2507 | 100 100 100 100 100 | 100.0%
Rocinante 12B | 100 100 100 100 97 | 99.4%
Writer: Palmyra X5 | 100 100 100 100 91 | 98.2%
Claude Sonnet 4.5 | 100 100 100 99 86 | 97.0%
Cohere Command R+ (Aug. 2024) | 100 100 95 91 85 | 94.2%
Claude Opus 4.5 | 99 92 91 84 84 | 89.9%
GPT-4o, Aug. 6th (temp=1) | 100 94 89 87 78 | 89.4%
Claude 3 Haiku | 100 100 97 87 55 | 87.8%
Claude Opus 4.6 (Reasoning) | 100 97 93 77 69 | 87.0%
Claude Sonnet 4.6 (Reasoning) | 100 100 85 80 69 | 86.8%
Gemma 3 27B | 99 91 89 86 65 | 86.2%
Grok 4.1 Fast | 100 100 89 84 57 | 86.2%
Llama 3.1 8B | 100 100 100 79 51 | 86.0%
GPT-5.4 | 100 89 87 83 66 | 85.1%
Claude Sonnet 4 | 100 90 83 83 64 | 84.0%
GPT-5.4 (Reasoning) | 100 88 86 85 60 | 83.9%
Hermes 3 70B | 100 100 100 69 47 | 83.2%
Mistral Small 4 (Reasoning) | 100 100 100 79 34 | 82.5%
Claude Opus 4.6 | 100 100 74 72 64 | 82.0%
Z.AI GLM 5 Turbo | 100 93 80 72 64 | 81.8%
MiniMax M2.5 | 96 93 81 68 57 | 79.2%
GPT-5.4 (Reasoning, Low) | 100 83 77 74 55 | 77.8%
Hermes 3 405B | 100 97 71 59 56 | 76.6%
Claude Opus 4 | 100 100 92 57 31 | 76.0%
MiniMax M2.7 | 100 72 68 68 61 | 73.8%
Stealth: Hunter Alpha | 100 76 75 73 44 | 73.6%
Ministral 8B | 100 99 81 67 3 | 69.9%
Z.AI GLM 5 | 88 82 66 63 44 | 68.6%
GPT-5.1 | 100 81 65 54 39 | 67.7%
Claude 3.5 Haiku | 96 92 79 47 23 | 67.3%
Claude Haiku 4.5 | 100 98 57 42 34 | 66.1%
Mistral Small 4 | 100 84 60 54 19 | 63.5%
Stealth: Healer Alpha | 89 71 71 39 36 | 61.3%
GPT-5.4 Mini | 81 71 55 54 45 | 61.3%
Claude Sonnet 4.6 | 92 71 70 61 0 | 59.1%
Mistral Large 2 | 83 70 65 47 22 | 57.3%
GPT-5.4 Mini (Reasoning, Low) | 69 63 57 51 46 | 57.2%
Llama 3.1 Nemotron 70B | 79 70 58 45 31 | 56.6%
ByteDance Seed 1.6 Flash | 73 65 63 55 27 | 56.5%
Grok 4 | 88 64 47 44 38 | 56.2%
Arcee AI: Trinity Large (Preview) | 75 60 52 47 46 | 56.0%
DeepSeek V3 (2025-03-24) | 89 63 54 40 30 | 55.1%
Ministral 3 14B | 84 70 66 41 7 | 53.5%
Mistral Small Creative | 87 66 46 35 31 | 53.0%
WizardLM 2 8x22b | 81 76 44 34 29 | 52.9%
GPT-4.1 Nano | 61 57 52 51 34 | 51.2%
Grok 4 Fast | 69 66 48 41 31 | 51.0%
ByteDance Seed 2.0 Mini | 70 60 58 35 32 | 50.9%
Z.AI GLM 4.6 | 84 64 48 31 23 | 50.1%
Gemma 3 12B | 65 53 49 42 38 | 49.5%
Grok 4.20 (Beta) | 67 61 50 41 27 | 49.1%
Aion 2.0 | 56 50 50 47 40 | 48.6%
Mistral Large 3 | 79 64 49 23 23 | 47.6%
GPT-4.1 | 50 49 49 48 40 | 47.4%
Ministral 3 8B | 78 52 48 40 12 | 45.9%
Mistral Large | 96 51 41 33 7 | 45.6%
Gemini 2.5 Flash Lite (Reasoning) | 62 61 51 34 13 | 44.4%
Gemini 2.5 Flash Lite | 56 51 45 36 33 | 44.2%
DeepSeek V3.2 | 54 45 41 40 38 | 43.7%
GPT-4.1 Mini | 77 46 39 38 17 | 43.6%
GPT-5.4 Nano | 53 51 47 34 33 | 43.4%
Mistral Medium 3.1 | 59 52 51 43 11 | 43.4%
GPT-4o, May 13th (temp=1) | 51 50 42 36 36 | 43.0%
GPT-5.4 Nano (Reasoning) | 54 50 49 34 27 | 42.8%
Claude 3.7 Sonnet | 81 63 37 24 8 | 42.7%
GPT-4o Mini (temp=1) | 70 60 31 23 21 | 40.9%
Qwen 3.5 Plus (2026-02-15) | 52 40 36 32 29 | 37.8%
Gemini 2.5 Pro | 78 44 38 18 9 | 37.5%
GPT-5.4 Mini (Reasoning) | 39 36 35 35 34 | 35.9%
DeepSeek V3 (2024-12-26) | 100 53 23 3 0 | 35.7%
Claude 3.5 Sonnet | 60 45 45 24 0 | 34.6%
DeepSeek-V2 Chat | 85 60 28 0 0 | 34.5%
GPT-5 | 84 40 35 7 6 | 34.3%
Llama 3.1 70B | 60 57 24 16 11 | 33.5%
Z.AI GLM 4.7 | 51 46 29 20 18 | 32.7%
Ministral 3B | 60 51 24 11 10 | 31.2%
Z.AI GLM 4.7 Flash | 59 42 36 8 5 | 30.3%
Arcee AI: Trinity Mini | 55 51 35 8 0 | 29.8%
Gemini 3 Pro (Preview) | 46 31 28 28 11 | 28.8%
DeepSeek V3.1 | 50 29 26 24 13 | 28.5%
Gemini 3.1 Flash Lite (Preview) | 71 43 20 0 0 | 26.8%
GPT-5.2 | 40 40 29 19 6 | 26.8%
Gemini 2.5 Flash (Reasoning) | 43 35 24 20 12 | 26.7%
Mistral NeMO | 35 27 26 23 21 | 26.4%
Qwen 3 32B | 39 36 28 21 2 | 25.3%
Gemma 3 4B | 45 38 21 18 0 | 24.5%
o4 Mini High | 52 43 17 8 0 | 24.1%
Mistral Small 3.2 24B | 67 33 21 0 0 | 24.1%
Gemini 2.5 Flash | 52 27 21 20 0 | 24.1%
ByteDance Seed 2.0 Lite | 48 33 16 13 5 | 23.1%
GPT-5.4 Nano (Reasoning, Low) | 28 25 21 19 14 | 21.5%
Qwen 3.5 9B | 54 27 26 0 0 | 21.3%
MoonshotAI: Kimi K2.5 | 43 30 15 8 7 | 20.7%
Z.AI GLM 4.5 | 35 30 26 0 0 | 18.1%
LFM2 24B | 27 26 13 11 9 | 17.0%
Ministral 3 3B | 61 16 8 0 0 | 16.9%
o4 Mini | 39 24 14 7 0 | 16.7%
GPT-5 Mini | 33 28 13 0 0 | 14.8%
Gemini 3 Flash (Preview) | 30 28 5 5 4 | 14.4%
Grok 4.20 (Beta, Reasoning) | 54 10 5 3 0 | 14.4%
Qwen 3.5 397B A17B | 20 20 10 7 0 | 11.5%
Gemini 3 Flash (Preview, Reasoning) | 33 13 6 1 0 | 10.7%
GPT-4o, Aug. 6th (temp=0) | 28 22 0 0 0 | 9.9%
Qwen 3.5 Flash | 25 20 2 0 0 | 9.5%
Qwen 2.5 72B | 40 6 1 0 0 | 9.5%
Nemotron 3 Super | 19 17 8 0 0 | 8.8%
Qwen 3.5 35B | 26 16 0 0 0 | 8.4%
Gemini 3.1 Pro (Preview) | 15 13 2 0 0 | 5.9%
Qwen 3.5 122B | 29 0 0 0 0 | 5.8%
GPT-5 Nano | 21 7 0 0 0 | 5.6%
GPT-4o, May 13th (temp=0) | 12 5 0 0 0 | 3.3%
GPT-4o Mini (temp=0) | 7 5 3 0 0 | 3.0%
ByteDance Seed 1.6 | 3 0 0 0 0 | 0.5%
Nemotron 3 Nano | 1 0 0 0 0 | 0.2%
Qwen 3.5 27B | 0 0 0 0 0 | 0.0%
Inception Mercury 2 | 0 0 0 0 0 | 0.0%
Stealth: Aurora Alpha | 0 0 0 0 0 | 0.0%
Inception Mercury | 0 0 0 0 0 | 0.0%
Model | #1 #2 #3 #4 #5 | Avg
Llama 3.1 8B | 100 100 72 51 40 | 72.6%
Writer: Palmyra X5 | 99 82 66 56 18 | 64.2%
Qwen3 235B A22B Instruct 2507 | 87 69 62 61 35 | 62.9%
Llama 3.1 Nemotron 70B | 100 100 74 25 11 | 62.0%
Ministral 3 14B | 100 83 58 36 24 | 60.3%
GPT-5 Nano | 100 100 100 0 0 | 60.1%
Mistral Large 2 | 98 74 60 51 13 | 59.2%
Llama 3.1 70B | 83 63 62 45 42 | 59.0%
Grok 4.1 Fast | 100 92 59 40 0 | 58.2%
Grok 4.20 (Beta, Reasoning) | 90 68 62 45 22 | 57.4%
Mistral Small 4 | 78 68 49 49 28 | 54.5%
Z.AI GLM 5 | 100 58 56 35 21 | 54.0%
Mistral Small 4 (Reasoning) | 84 67 48 41 19 | 51.6%
Mistral Small Creative | 85 64 54 38 12 | 50.6%
Claude Opus 4.5 | 56 54 47 41 21 | 43.8%
Mistral NeMO | 100 100 16 0 0 | 43.1%
Rocinante 12B | 100 100 11 0 0 | 42.2%
Cohere Command R+ (Aug. 2024) | 85 68 56 0 0 | 41.9%
Hermes 3 405B | 80 68 49 0 0 | 39.2%
o4 Mini High | 79 75 35 4 0 | 38.4%
Claude Sonnet 4.6 | 79 58 24 24 6 | 38.2%
Claude Sonnet 4.5 | 95 48 24 23 0 | 38.0%
Ministral 3 8B | 94 46 24 18 4 | 37.5%
Mistral Medium 3.1 | 58 47 41 37 3 | 37.1%
Arcee AI: Trinity Large (Preview) | 100 67 16 0 0 | 36.5%
Claude Sonnet 4 | 66 55 53 5 0 | 36.0%
Grok 4 Fast | 57 46 29 23 21 | 35.1%
Mistral Large 3 | 52 43 40 25 14 | 34.5%
Hermes 3 70B | 89 35 23 20 0 | 33.2%
GPT-5.4 | 39 34 28 24 18 | 28.5%
Claude 3.5 Haiku | 75 65 0 0 0 | 28.1%
Grok 4.20 (Beta) | 43 41 28 14 2 | 25.8%
Claude Opus 4 | 35 33 25 22 11 | 25.1%
Claude Sonnet 4.6 (Reasoning) | 40 27 24 24 5 | 24.1%
MiniMax M2.7 | 50 30 25 14 0 | 23.8%
Claude Haiku 4.5 | 52 49 9 6 0 | 23.1%
Qwen 3 32B | 71 27 9 0 0 | 21.4%
Claude 3.7 Sonnet | 53 25 21 3 0 | 20.2%
MiniMax M2.5 | 70 15 14 1 0 | 20.0%
Claude Opus 4.6 | 40 24 19 16 1 | 19.8%
Mistral Small 3.2 24B | 80 19 0 0 0 | 19.8%
MoonshotAI: Kimi K2.5 | 51 20 16 4 3 | 18.7%
Z.AI GLM 5 Turbo | 42 31 18 2 0 | 18.7%
GPT-4o, May 13th (temp=1) | 73 16 1 0 0 | 18.1%
LFM2 24B | 40 33 16 1 0 | 18.0%
Z.AI GLM 4.6 | 53 23 12 1 0 | 18.0%
Claude Opus 4.6 (Reasoning) | 35 27 24 2 0 | 17.6%
DeepSeek V3 (2024-12-26) | 48 16 13 3 0 | 15.9%
Qwen 3.5 27B | 70 6 0 0 0 | 15.2%
GPT-4o, Aug. 6th (temp=1) | 29 26 19 3 0 | 15.2%
Gemma 3 27B | 55 14 5 0 0 | 14.8%
o4 Mini | 31 16 16 10 0 | 14.4%
GPT-5.4 (Reasoning) | 29 13 13 10 6 | 14.2%
Stealth: Healer Alpha | 54 13 3 0 0 | 14.1%
GPT-5 | 36 33 0 0 0 | 13.9%
GPT-4.1 | 23 22 22 0 0 | 13.1%
GPT-5.4 Mini | 26 19 10 3 2 | 11.9%
Z.AI GLM 4.5 | 48 5 2 0 0 | 10.9%
Mistral Large | 28 19 3 3 0 | 10.6%
ByteDance Seed 2.0 Lite | 49 3 0 0 0 | 10.6%
DeepSeek V3 (2025-03-24) | 35 17 0 0 0 | 10.5%
GPT-4o, May 13th (temp=0) | 41 11 0 0 0 | 10.3%
Stealth: Hunter Alpha | 36 14 0 0 0 | 10.0%
GPT-5.4 (Reasoning, Low) | 18 17 11 0 0 | 9.0%
DeepSeek-V2 Chat | 25 20 0 0 0 | 9.0%
Ministral 8B | 35 10 0 0 0 | 8.9%
Gemini 3.1 Flash Lite (Preview) | 30 11 0 0 0 | 8.2%
ByteDance Seed 1.6 Flash | 25 16 0 0 0 | 8.1%
DeepSeek V3.1 | 39 0 0 0 0 | 7.7%
Grok 4 | 35 2 0 0 0 | 7.3%
Ministral 3B | 16 14 6 0 0 | 7.2%
GPT-5.4 Mini (Reasoning, Low) | 23 10 0 0 0 | 6.6%
GPT-4o Mini (temp=1) | 20 12 0 0 0 | 6.4%
Gemini 2.5 Pro | 20 8 4 0 0 | 6.3%
Gemini 3 Pro (Preview) | 20 11 0 0 0 | 6.3%
Qwen 2.5 72B | 30 0 0 0 0 | 6.0%
GPT-5.4 Mini (Reasoning) | 15 11 2 0 0 | 5.7%
Aion 2.0 | 16 10 0 0 0 | 5.1%
Qwen 3.5 397B A17B | 14 6 5 0 0 | 5.1%
Claude 3 Haiku | 23 3 0 0 0 | 5.0%
Gemma 3 4B | 14 8 0 0 0 | 4.3%
Qwen 3.5 35B | 20 0 0 0 0 | 4.0%
Z.AI GLM 4.7 | 19 0 0 0 0 | 3.8%
Gemma 3 12B | 16 2 0 0 0 | 3.7%
WizardLM 2 8x22b | 8 7 0 0 0 | 3.1%
Z.AI GLM 4.7 Flash | 16 0 0 0 0 | 3.1%
Claude 3.5 Sonnet | 16 0 0 0 0 | 3.1%
GPT-4.1 Mini | 16 0 0 0 0 | 3.1%
GPT-5.1 | 11 3 1 0 0 | 3.1%
DeepSeek V3.2 | 6 5 3 0 0 | 3.0%
Gemini 3 Flash (Preview) | 14 0 0 0 0 | 2.9%
Nemotron 3 Super | 14 0 0 0 0 | 2.7%
Gemini 2.5 Flash Lite (Reasoning) | 9 2 0 0 0 | 2.1%
GPT-4.1 Nano | 5 4 0 0 0 | 1.9%
ByteDance Seed 1.6 | 9 0 0 0 0 | 1.8%
GPT-5.4 Nano (Reasoning, Low) | 9 0 0 0 0 | 1.8%
Ministral 3 3B | 8 0 0 0 0 | 1.6%
Qwen 3.5 Plus (2026-02-15) | 8 0 0 0 0 | 1.5%
Qwen 3.5 Flash | 6 0 0 0 0 | 1.2%
Gemini 2.5 Flash Lite | 3 1 0 0 0 | 0.8%
Gemini 3 Flash (Preview, Reasoning) | 3 0 0 0 0 | 0.6%
Qwen 3.5 122B | 2 1 0 0 0 | 0.5%
GPT-5 Mini | 3 0 0 0 0 | 0.5%
Nemotron 3 Nano | 2 0 0 0 0 | 0.3%
Inception Mercury | 0 0 0 0 0 | 0.1%
Gemini 3.1 Pro (Preview) | 0 0 0 0 0 | 0.0%
GPT-5.2 | 0 0 0 0 0 | 0.0%
ByteDance Seed 2.0 Mini | 0 0 0 0 0 | 0.0%
Gemini 2.5 Flash (Reasoning) | 0 0 0 0 0 | 0.0%
Qwen 3.5 9B | 0 0 0 0 0 | 0.0%
Inception Mercury 2 | 0 0 0 0 0 | 0.0%
Stealth: Aurora Alpha | 0 0 0 0 0 | 0.0%
GPT-4o, Aug. 6th (temp=0) | 0 0 0 0 0 | 0.0%
GPT-5.4 Nano (Reasoning) | 0 0 0 0 0 | 0.0%
Gemini 2.5 Flash | 0 0 0 0 0 | 0.0%
GPT-4o Mini (temp=0) | 0 0 0 0 0 | 0.0%
GPT-5.4 Nano | 0 0 0 0 0 | 0.0%
Arcee AI: Trinity Mini | 0 0 0 0 0 | 0.0%
Model | #1 #2 #3 #4 #5 | Avg
Qwen3 235B A22B Instruct 2507 | 100 93 92 79 76 | 88.0%
Writer: Palmyra X5 | 100 100 100 100 39 | 87.8%
Rocinante 12B | 100 100 100 95 22 | 83.2%
GPT-5.4 (Reasoning) | 100 100 88 46 45 | 76.0%
Cohere Command R+ (Aug. 2024) | 100 100 100 64 10 | 74.9%
GPT-5.4 | 88 77 74 72 59 | 73.8%
Claude Opus 4.6 (Reasoning) | 83 83 83 51 48 | 69.4%
Z.AI GLM 5 | 100 84 78 46 38 | 69.2%
Claude Sonnet 4.6 | 100 90 70 36 36 | 66.5%
GPT-5.1 | 87 81 67 49 43 | 65.3%
Grok 4 | 100 100 78 37 10 | 65.0%
Mistral Small 4 (Reasoning) | 99 73 64 47 34 | 63.4%
Claude Opus 4.6 | 94 81 51 51 34 | 62.3%
GPT-5.4 (Reasoning, Low) | 69 64 62 62 48 | 61.1%
Claude Opus 4 | 82 79 56 54 11 | 56.4%
Arcee AI: Trinity Large (Preview) | 85 65 60 56 15 | 56.3%
Mistral Small Creative | 75 63 57 46 38 | 55.9%
Claude Haiku 4.5 | 92 71 68 25 5 | 52.1%
Stealth: Healer Alpha | 81 65 52 40 17 | 50.9%
Stealth: Hunter Alpha | 89 53 48 34 27 | 50.2%
ByteDance Seed 2.0 Lite | 80 76 52 36 2 | 49.2%
MiniMax M2.7 | 93 64 41 23 15 | 47.1%
MoonshotAI: Kimi K2.5 | 75 74 55 22 10 | 47.0%
Claude Opus 4.5 | 76 66 36 34 20 | 46.5%
GPT-5.4 Mini (Reasoning) | 70 57 44 33 25 | 46.0%
Claude Sonnet 4.6 (Reasoning) | 88 50 39 36 16 | 45.7%
GPT-5 | 94 67 26 22 19 | 45.6%
Claude 3.5 Sonnet | 100 55 48 23 0 | 45.0%
Z.AI GLM 5 Turbo | 81 67 34 32 0 | 42.8%
Claude Sonnet 4 | 83 56 40 35 0 | 42.8%
Grok 4.1 Fast | 71 63 54 25 0 | 42.5%
Hermes 3 405B | 90 71 42 8 0 | 42.3%
GPT-4.1 | 62 51 43 31 22 | 42.1%
Grok 4.20 (Beta, Reasoning) | 68 60 45 28 0 | 40.3%
o4 Mini | 100 46 34 17 2 | 39.9%
WizardLM 2 8x22b | 65 56 31 24 23 | 39.8%
Mistral Large 2 | 72 48 34 34 8 | 39.4%
ByteDance Seed 2.0 Mini | 98 80 18 0 0 | 39.2%
Grok 4 Fast | 80 41 37 33 0 | 38.3%
DeepSeek V3.1 | 72 51 46 16 0 | 37.1%
Ministral 3 14B | 54 51 34 23 23 | 36.8%
Mistral Medium 3.1 | 64 56 41 24 0 | 36.8%
Gemini 2.5 Flash (Reasoning) | 78 49 37 20 0 | 36.7%
o4 Mini High | 57 51 47 23 5 | 36.7%
MiniMax M2.5 | 51 45 39 26 21 | 36.3%
ByteDance Seed 1.6 Flash | 100 45 21 15 0 | 36.0%
GPT-4o, May 13th (temp=0) | 54 54 43 27 0 | 35.7%
Grok 4.20 (Beta) | 52 45 35 33 15 | 35.6%
GPT-5.4 Mini (Reasoning, Low) | 62 40 39 33 0 | 35.1%
DeepSeek V3.2 | 62 46 27 23 11 | 33.9%
Z.AI GLM 4.6 | 47 42 40 26 9 | 33.0%
Claude Sonnet 4.5 | 51 50 49 5 2 | 31.5%
Mistral Small 4 | 77 42 21 9 7 | 31.3%
Hermes 3 70B | 74 43 36 3 0 | 31.2%
Gemini 2.5 Pro | 35 34 31 29 24 | 30.6%
Gemini 2.5 Flash Lite | 61 42 33 14 0 | 30.2%
DeepSeek V3 (2025-03-24) | 42 39 31 27 11 | 30.0%
GPT-5.4 Mini | 43 40 34 28 3 | 29.7%
Llama 3.1 70B | 100 41 2 1 0 | 28.7%
Qwen 3.5 Flash | 47 36 29 17 10 | 28.0%
Ministral 3B | 65 48 27 0 0 | 28.0%
Claude 3.7 Sonnet | 43 40 29 21 1 | 26.9%
Qwen 3 32B | 58 33 31 5 4 | 26.5%
Claude 3.5 Haiku | 75 34 23 0 0 | 26.4%
Qwen 3.5 Plus (2026-02-15) | 53 27 27 23 0 | 26.0%
GPT-4o, Aug. 6th (temp=1) | 47 42 40 1 0 | 25.9%
Ministral 3 8B | 65 29 18 17 0 | 25.8%
Mistral Large | 82 39 8 0 0 | 25.7%
Ministral 8B | 60 50 17 0 0 | 25.4%
ByteDance Seed 1.6 | 53 40 17 13 0 | 24.6%
Gemma 3 27B | 55 34 20 5 5 | 24.0%
Mistral Small 3.2 24B | 71 31 5 0 0 | 21.6%
Gemini 2.5 Flash | 43 24 23 19 0 | 21.5%
Qwen 3.5 397B A17B | 53 26 25 0 0 | 20.8%
Gemini 3.1 Flash Lite (Preview) | 55 18 15 12 0 | 20.1%
Aion 2.0 | 45 27 25 0 0 | 19.5%
Qwen 3.5 122B | 38 29 23 6 0 | 19.3%
LFM2 24B | 36 33 27 0 0 | 19.2%
Claude 3 Haiku | 33 31 28 0 0 | 18.4%
Gemini 2.5 Flash Lite (Reasoning) | 29 27 24 9 0 | 18.0%
Gemini 3 Pro (Preview) | 42 31 16 0 0 | 17.7%
Gemma 3 4B | 53 19 8 7 0 | 17.5%
Nemotron 3 Super | 41 27 9 9 0 | 17.2%
Llama 3.1 Nemotron 70B | 28 27 26 5 0 | 17.1%
GPT-5.4 Nano | 28 25 18 9 0 | 16.0%
DeepSeek V3 (2024-12-26) | 55 23 0 0 0 | 15.5%
Gemini 3 Flash (Preview) | 32 23 11 11 0 | 15.4%
Z.AI GLM 4.7 Flash | 55 13 3 2 0 | 14.4%
Gemma 3 12B | 32 22 16 0 0 | 13.9%
GPT-4o Mini (temp=1) | 28 26 7 5 3 | 13.9%
GPT-4o, Aug. 6th (temp=0) | 58 7 3 0 0 | 13.7%
GPT-4o, May 13th (temp=1) | 25 17 16 7 2 | 13.1%
Llama 3.1 8B | 28 21 10 4 0 | 12.7%
GPT-5.2 | 23 15 12 9 0 | 11.9%
GPT-4.1 Nano | 23 22 9 1 0 | 10.8%
GPT-5.4 Nano (Reasoning, Low) | 16 16 16 2 0 | 9.8%
Inception Mercury | 46 0 0 0 0 | 9.3%
Mistral NeMO | 28 10 4 3 0 | 8.9%
Gemini 3 Flash (Preview, Reasoning) | 18 13 7 3 2 | 8.9%
Mistral Large 3 | 28 14 2 0 0 | 8.8%
GPT-5.4 Nano (Reasoning) | 17 12 12 2 0 | 8.7%
Z.AI GLM 4.7 | 17 11 7 0 0 | 6.9%
GPT-4o Mini (temp=0) | 34 0 0 0 0 | 6.7%
DeepSeek-V2 Chat | 25 8 0 0 0 | 6.6%
Qwen 3.5 35B | 32 0 0 0 0 | 6.3%
Gemini 3.1 Pro (Preview) | 13 9 8 0 0 | 6.1%
GPT-5 Mini | 12 5 1 0 0 | 3.6%
Arcee AI: Trinity Mini | 17 0 0 0 0 | 3.4%
GPT-4.1 Mini | 14 2 0 0 0 | 3.1%
Ministral 3 3B | 13 0 0 0 0 | 2.6%
GPT-5 Nano | 9 2 0 0 0 | 2.2%
Z.AI GLM 4.5 | 3 0 0 0 0 | 0.6%
Qwen 2.5 72B | 1 0 0 0 0 | 0.2%
Qwen 3.5 27B | 0 0 0 0 0 | 0.0%
Qwen 3.5 9B | 0 0 0 0 0 | 0.0%
Inception Mercury 2 | 0 0 0 0 0 | 0.0%
Stealth: Aurora Alpha | 0 0 0 0 0 | 0.0%
Nemotron 3 Nano | 0 0 0 0 0 | 0.0%
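Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Rocinante 12B | 100 | 98 | 96 | 69 | 0 | 72.6%
Mistral Small Creative | 100 | 72 | 65 | 62 | 31 | 66.2%
Mistral Small 4 (Reasoning) | 100 | 72 | 51 | 35 | 25 | 56.6%
Claude Opus 4 | 97 | 58 | 46 | 40 | 36 | 55.4%
Claude Opus 4.6 (Reasoning) | 85 | 67 | 49 | 48 | 20 | 53.8%
Ministral 8B | 83 | 47 | 47 | 42 | 39 | 51.8%
Ministral 3 14B | 75 | 63 | 59 | 46 | 7 | 50.2%
Hermes 3 405B | 100 | 56 | 33 | 28 | 25 | 48.4%
Qwen 3 32B | 77 | 68 | 44 | 37 | 11 | 47.3%
Z.AI GLM 5 | 96 | 54 | 41 | 25 | 17 | 46.8%
Claude Opus 4.6 | 71 | 45 | 42 | 38 | 33 | 45.8%
Arcee AI: Trinity Large (Preview) | 78 | 68 | 57 | 10 | 7 | 44.0%
Qwen3 235B A22B Instruct 2507 | 70 | 52 | 49 | 30 | 19 | 43.9%
Claude 3.5 Haiku | 100 | 60 | 51 | 2 | 0 | 42.5%
Writer: Palmyra X5 | 82 | 59 | 40 | 15 | 12 | 41.4%
MiniMax M2.5 | 55 | 47 | 39 | 36 | 25 | 40.5%
MiniMax M2.7 | 71 | 48 | 37 | 27 | 10 | 38.6%
Mistral Small 4 | 100 | 71 | 16 | 5 | 0 | 38.6%
GPT-5.4 Mini | 46 | 38 | 38 | 34 | 32 | 37.7%
GPT-5 Nano | 100 | 83 | 5 | 0 | 0 | 37.5%
Claude Opus 4.5 | 58 | 49 | 31 | 24 | 23 | 36.9%
Llama 3.1 70B | 56 | 45 | 38 | 34 | 8 | 36.1%
Llama 3.1 Nemotron 70B | 66 | 45 | 41 | 17 | 8 | 35.6%
ByteDance Seed 1.6 Flash | 59 | 41 | 36 | 28 | 14 | 35.4%
GPT-5.4 (Reasoning, Low) | 41 | 41 | 39 | 27 | 26 | 34.7%
Ministral 3 8B | 76 | 38 | 31 | 19 | 0 | 32.9%
Llama 3.1 8B | 69 | 69 | 12 | 5 | 0 | 31.2%
Z.AI GLM 5 Turbo | 90 | 42 | 13 | 10 | 0 | 31.0%
Claude 3 Haiku | 45 | 45 | 35 | 23 | 5 | 30.7%
o4 Mini High | 50 | 43 | 32 | 19 | 0 | 28.9%
GPT-5.4 (Reasoning) | 40 | 31 | 31 | 25 | 13 | 28.0%
Mistral Large 3 | 77 | 44 | 17 | 0 | 0 | 27.8%
GPT-5.4 Mini (Reasoning) | 51 | 47 | 21 | 20 | 0 | 27.6%
Hermes 3 70B | 71 | 31 | 21 | 12 | 0 | 27.2%
Grok 4 Fast | 66 | 31 | 24 | 14 | 0 | 27.2%
Grok 4.20 (Beta, Reasoning) | 59 | 42 | 26 | 0 | 0 | 25.5%
DeepSeek V3 (2024-12-26) | 63 | 29 | 27 | 6 | 0 | 25.1%
ByteDance Seed 2.0 Lite | 45 | 36 | 22 | 18 | 0 | 24.2%
Z.AI GLM 4.6 | 64 | 24 | 19 | 11 | 0 | 23.6%
Grok 4.1 Fast | 35 | 35 | 32 | 8 | 8 | 23.5%
MoonshotAI: Kimi K2.5 | 49 | 39 | 16 | 8 | 0 | 22.3%
GPT-5.4 | 39 | 36 | 28 | 6 | 0 | 21.7%
Claude Sonnet 4.6 (Reasoning) | 49 | 23 | 18 | 12 | 0 | 20.5%
GPT-4.1 Nano | 33 | 23 | 17 | 16 | 12 | 20.2%
Gemini 3.1 Flash Lite (Preview) | 39 | 25 | 21 | 17 | 0 | 20.1%
GPT-4o, Aug. 6th (temp=1) | 37 | 30 | 26 | 7 | 0 | 20.1%
o4 Mini | 43 | 28 | 27 | 0 | 0 | 19.8%
Claude Sonnet 4.5 | 56 | 28 | 14 | 0 | 0 | 19.7%
Gemma 3 27B | 31 | 24 | 22 | 20 | 2 | 19.5%
WizardLM 2 8x22b | 79 | 11 | 6 | 0 | 0 | 19.3%
Claude Haiku 4.5 | 50 | 23 | 15 | 6 | 0 | 18.8%
GPT-4o Mini (temp=1) | 46 | 17 | 16 | 10 | 2 | 18.3%
DeepSeek V3 (2025-03-24) | 35 | 25 | 20 | 10 | 0 | 18.0%
GPT-5 | 45 | 43 | 2 | 0 | 0 | 18.0%
DeepSeek V3.1 | 66 | 17 | 5 | 0 | 0 | 17.6%
Qwen 3.5 397B A17B | 26 | 23 | 21 | 17 | 0 | 17.4%
DeepSeek V3.2 | 30 | 30 | 24 | 0 | 0 | 16.7%
GPT-4o, May 13th (temp=1) | 27 | 22 | 17 | 17 | 0 | 16.5%
GPT-5.1 | 42 | 14 | 9 | 8 | 7 | 16.0%
Grok 4.20 (Beta) | 34 | 27 | 16 | 1 | 0 | 15.6%
Mistral Small 3.2 24B | 50 | 18 | 9 | 0 | 0 | 15.5%
Gemma 3 12B | 32 | 23 | 16 | 5 | 0 | 15.4%
Stealth: Healer Alpha | 48 | 14 | 11 | 0 | 0 | 14.6%
Claude Sonnet 4 | 28 | 27 | 11 | 3 | 2 | 14.2%
Mistral Large | 29 | 25 | 16 | 0 | 0 | 13.9%
Mistral Medium 3.1 | 26 | 22 | 21 | 0 | 0 | 13.7%
Cohere Command R+ (Aug. 2024) | 34 | 19 | 16 | 0 | 0 | 13.6%
ByteDance Seed 2.0 Mini | 55 | 5 | 4 | 0 | 0 | 12.8%
Gemini 2.5 Flash Lite | 31 | 18 | 11 | 3 | 0 | 12.7%
Gemini 2.5 Pro | 39 | 18 | 6 | 0 | 0 | 12.6%
Grok 4 | 21 | 14 | 13 | 7 | 5 | 12.0%
GPT-5.4 Mini (Reasoning, Low) | 34 | 11 | 8 | 6 | 0 | 11.7%
Gemini 2.5 Flash Lite (Reasoning) | 17 | 16 | 14 | 9 | 0 | 11.4%
GPT-4.1 Mini | 30 | 26 | 1 | 0 | 0 | 11.3%
Mistral Large 2 | 57 | 0 | 0 | 0 | 0 | 11.3%
Stealth: Hunter Alpha | 26 | 14 | 12 | 3 | 0 | 11.1%
GPT-4o, Aug. 6th (temp=0) | 39 | 10 | 5 | 0 | 0 | 10.9%
Qwen 3.5 35B | 31 | 23 | 0 | 0 | 0 | 10.8%
GPT-4.1 | 22 | 16 | 14 | 2 | 0 | 10.7%
Mistral NeMO | 42 | 11 | 1 | 0 | 0 | 10.7%
Gemini 3 Pro (Preview) | 41 | 10 | 0 | 0 | 0 | 10.1%
Qwen 3.5 122B | 26 | 18 | 1 | 0 | 0 | 9.0%
Qwen 3.5 Flash | 38 | 5 | 2 | 0 | 0 | 9.0%
Claude Sonnet 4.6 | 17 | 15 | 13 | 0 | 0 | 8.9%
Claude 3.5 Sonnet | 16 | 13 | 11 | 0 | 0 | 7.9%
Z.AI GLM 4.5 | 26 | 11 | 0 | 0 | 0 | 7.3%
Aion 2.0 | 16 | 15 | 3 | 2 | 0 | 7.2%
Gemini 2.5 Flash | 34 | 0 | 0 | 0 | 0 | 6.7%
Claude 3.7 Sonnet | 16 | 14 | 0 | 0 | 0 | 5.9%
Z.AI GLM 4.7 | 29 | 0 | 0 | 0 | 0 | 5.7%
Gemma 3 4B | 16 | 9 | 0 | 0 | 0 | 4.9%
GPT-4o, May 13th (temp=0) | 10 | 7 | 0 | 0 | 0 | 3.5%
ByteDance Seed 1.6 | 13 | 3 | 0 | 0 | 0 | 3.1%
DeepSeek-V2 Chat | 15 | 0 | 0 | 0 | 0 | 2.9%
Gemini 2.5 Flash (Reasoning) | 13 | 0 | 0 | 0 | 0 | 2.6%
GPT-4o Mini (temp=0) | 13 | 0 | 0 | 0 | 0 | 2.5%
Ministral 3 3B | 12 | 0 | 0 | 0 | 0 | 2.3%
GPT-5.2 | 11 | 0 | 0 | 0 | 0 | 2.2%
LFM2 24B | 8 | 1 | 0 | 0 | 0 | 1.9%
Arcee AI: Trinity Mini | 9 | 0 | 0 | 0 | 0 | 1.8%
Qwen 3.5 9B | 6 | 0 | 0 | 0 | 0 | 1.2%
Nemotron 3 Super | 3 | 2 | 0 | 0 | 0 | 1.0%
Qwen 2.5 72B | 4 | 0 | 0 | 0 | 0 | 0.8%
Ministral 3B | 2 | 0 | 0 | 0 | 0 | 0.5%
GPT-5.4 Nano | 1 | 0 | 0 | 0 | 0 | 0.3%
Z.AI GLM 4.7 Flash | 1 | 0 | 0 | 0 | 0 | 0.2%
Gemini 3.1 Pro (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5 Mini | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 27B | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview, Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Qwen 3.5 Plus (2026-02-15) | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 3 Flash (Preview) | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Nano (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Nano (Reasoning, Low) | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%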
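Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 96 | 89 | 62 | 89.4%
Writer: Palmyra X5 | 100 | 100 | 100 | 81 | 61 | 88.3%
Z.AI GLM 5 | 100 | 99 | 96 | 90 | 54 | 87.8%
Hermes 3 70B | 100 | 100 | 93 | 72 | 71 | 87.3%
Rocinante 12B | 100 | 100 | 71 | 58 | 55 | 76.7%
Claude Haiku 4.5 | 100 | 100 | 75 | 60 | 25 | 71.9%
Llama 3.1 8B | 100 | 100 | 65 | 62 | 31 | 71.8%
Grok 4.1 Fast | 100 | 83 | 78 | 57 | 40 | 71.6%
Claude Sonnet 4.5 | 89 | 75 | 75 | 57 | 49 | 69.1%
Arcee AI: Trinity Large (Preview) | 87 | 87 | 59 | 53 | 53 | 67.7%
GPT-5.4 | 86 | 76 | 64 | 54 | 51 | 66.0%
GPT-4.1 Nano | 91 | 73 | 68 | 46 | 45 | 64.4%
Claude Opus 4.6 | 100 | 79 | 73 | 39 | 29 | 64.0%
Claude Opus 4.5 | 88 | 73 | 72 | 44 | 41 | 63.7%
MiniMax M2.5 | 73 | 69 | 58 | 54 | 48 | 60.6%
Mistral Small 4 (Reasoning) | 100 | 84 | 82 | 27 | 9 | 60.5%
GPT-5.4 (Reasoning, Low) | 79 | 68 | 64 | 54 | 34 | 59.6%
Claude Opus 4.6 (Reasoning) | 79 | 67 | 50 | 50 | 44 | 58.2%
Mistral Small 4 | 97 | 84 | 64 | 31 | 14 | 57.8%
Claude Sonnet 4.6 | 85 | 79 | 55 | 47 | 19 | 57.1%
MoonshotAI: Kimi K2.5 | 94 | 89 | 55 | 28 | 16 | 56.4%
GPT-5.4 (Reasoning) | 77 | 74 | 48 | 44 | 39 | 56.4%
Claude 3 Haiku | 83 | 82 | 67 | 19 | 10 | 52.3%
Hermes 3 405B | 70 | 66 | 49 | 47 | 23 | 51.0%
Ministral 3 14B | 87 | 68 | 47 | 31 | 20 | 50.4%
Gemini 2.5 Flash Lite | 78 | 61 | 42 | 38 | 23 | 48.2%
GPT-4o, Aug. 6th (temp=1) | 69 | 56 | 40 | 38 | 36 | 47.9%
Mistral Small Creative | 82 | 56 | 39 | 39 | 20 | 47.3%
Claude Opus 4 | 63 | 54 | 41 | 39 | 36 | 46.7%
Gemini 2.5 Flash Lite (Reasoning) | 77 | 41 | 39 | 35 | 34 | 45.3%
Cohere Command R+ (Aug. 2024) | 71 | 65 | 42 | 28 | 15 | 44.4%
GPT-4.1 | 56 | 45 | 44 | 37 | 36 | 43.7%
Claude 3.7 Sonnet | 59 | 49 | 47 | 39 | 21 | 42.9%
GPT-5.4 Mini | 71 | 46 | 41 | 36 | 20 | 42.8%
Llama 3.1 70B | 90 | 48 | 35 | 23 | 13 | 41.8%
Mistral Small 3.2 24B | 85 | 65 | 30 | 29 | 0 | 41.7%
Llama 3.1 Nemotron 70B | 65 | 50 | 37 | 30 | 28 | 41.7%
GPT-4o, May 13th (temp=1) | 67 | 63 | 41 | 35 | 0 | 41.2%
GPT-4.1 Mini | 74 | 57 | 48 | 21 | 2 | 40.5%
Z.AI GLM 4.6 | 50 | 42 | 40 | 37 | 32 | 40.4%
Mistral Medium 3.1 | 66 | 51 | 31 | 28 | 23 | 39.9%
Grok 4.20 (Beta) | 61 | 45 | 43 | 27 | 24 | 39.8%
Z.AI GLM 5 Turbo | 57 | 55 | 42 | 35 | 8 | 39.5%
Gemma 3 27B | 62 | 56 | 38 | 24 | 15 | 39.0%
Gemma 3 12B | 61 | 49 | 44 | 21 | 17 | 38.6%
Grok 4 Fast | 68 | 42 | 38 | 22 | 22 | 38.5%
Stealth: Hunter Alpha | 65 | 65 | 35 | 26 | 0 | 38.1%
Claude 3.5 Haiku | 73 | 51 | 47 | 10 | 8 | 37.7%
Grok 4 | 63 | 46 | 42 | 20 | 13 | 36.9%
GPT-5.4 Mini (Reasoning, Low) | 48 | 47 | 36 | 31 | 18 | 36.1%
Claude Sonnet 4 | 100 | 55 | 14 | 10 | 0 | 35.8%
DeepSeek V3.1 | 57 | 49 | 44 | 15 | 7 | 34.6%
ByteDance Seed 1.6 Flash | 60 | 52 | 34 | 18 | 8 | 34.4%
Gemini 3.1 Pro (Preview) | 74 | 47 | 32 | 13 | 5 | 34.3%
Claude Sonnet 4.6 (Reasoning) | 49 | 43 | 31 | 26 | 21 | 34.2%
GPT-5.1 | 48 | 39 | 38 | 26 | 18 | 33.7%
Aion 2.0 | 40 | 40 | 30 | 30 | 18 | 31.4%
o4 Mini High | 50 | 38 | 26 | 21 | 14 | 29.7%
Ministral 3 8B | 96 | 26 | 14 | 12 | 0 | 29.6%
Grok 4.20 (Beta, Reasoning) | 42 | 37 | 28 | 21 | 17 | 29.0%
Qwen 3.5 Plus (2026-02-15) | 59 | 43 | 23 | 19 | 0 | 28.8%
DeepSeek-V2 Chat | 43 | 40 | 34 | 16 | 3 | 27.2%
Stealth: Healer Alpha | 63 | 42 | 15 | 11 | 3 | 26.8%
Z.AI GLM 4.7 Flash | 43 | 30 | 27 | 19 | 13 | 26.5%
WizardLM 2 8x22b | 46 | 31 | 25 | 24 | 6 | 26.5%
Gemini 2.5 Pro | 51 | 29 | 22 | 14 | 12 | 25.6%
Claude 3.5 Sonnet | 43 | 42 | 36 | 7 | 0 | 25.5%
GPT-4o, Aug. 6th (temp=0) | 39 | 37 | 22 | 20 | 8 | 25.1%
Ministral 3 3B | 43 | 35 | 24 | 12 | 8 | 24.6%
Gemini 3 Pro (Preview) | 46 | 34 | 30 | 9 | 2 | 24.3%
DeepSeek V3.2 | 41 | 26 | 23 | 19 | 12 | 24.2%
MiniMax M2.7 | 47 | 31 | 17 | 13 | 12 | 24.0%
Z.AI GLM 4.5 | 42 | 27 | 25 | 16 | 11 | 24.0%
o4 Mini | 51 | 33 | 16 | 15 | 5 | 23.8%
GPT-5.4 Mini (Reasoning) | 41 | 30 | 29 | 14 | 5 | 23.8%
GPT-5 Nano | 100 | 14 | 0 | 0 | 0 | 22.9%
Mistral Large | 70 | 17 | 16 | 8 | 1 | 22.5%
GPT-4o Mini (temp=1) | 40 | 22 | 19 | 19 | 11 | 22.3%
Gemini 2.5 Flash | 35 | 28 | 27 | 19 | 1 | 21.9%
Mistral Large 3 | 36 | 29 | 25 | 11 | 8 | 21.8%
Gemini 2.5 Flash (Reasoning) | 38 | 34 | 21 | 15 | 0 | 21.5%
Ministral 8B | 58 | 31 | 10 | 1 | 0 | 19.9%
Gemma 3 4B | 26 | 25 | 14 | 14 | 14 | 18.7%
Mistral NeMO | 33 | 28 | 17 | 15 | 0 | 18.6%
DeepSeek V3 (2025-03-24) | 34 | 31 | 17 | 2 | 0 | 16.9%
ByteDance Seed 2.0 Mini | 30 | 30 | 24 | 0 | 0 | 16.8%
Arcee AI: Trinity Mini | 28 | 23 | 16 | 14 | 0 | 16.2%
Gemini 3.1 Flash Lite (Preview) | 40 | 26 | 7 | 2 | 0 | 15.2%
LFM2 24B | 36 | 20 | 11 | 8 | 0 | 15.1%
ByteDance Seed 2.0 Lite | 39 | 14 | 11 | 9 | 2 | 15.0%
Qwen 3.5 397B A17B | 35 | 35 | 5 | 0 | 0 | 14.9%
Nemotron 3 Nano | 49 | 18 | 7 | 0 | 0 | 14.9%
Z.AI GLM 4.7 | 28 | 15 | 11 | 8 | 5 | 13.4%
Qwen 3.5 35B | 30 | 25 | 9 | 0 | 0 | 12.9%
GPT-5.4 Nano | 33 | 24 | 4 | 1 | 1 | 12.6%
Mistral Large 2 | 31 | 16 | 9 | 8 | 0 | 12.6%
GPT-5 Mini | 37 | 16 | 5 | 4 | 0 | 12.5%
GPT-4o, May 13th (temp=0) | 28 | 20 | 12 | 0 | 0 | 12.0%
Ministral 3B | 27 | 21 | 11 | 2 | 0 | 12.0%
Qwen 3.5 Flash | 38 | 15 | 6 | 0 | 0 | 11.7%
Qwen 3 32B | 21 | 19 | 10 | 0 | 0 | 9.9%
ByteDance Seed 1.6 | 41 | 8 | 0 | 0 | 0 | 9.9%
GPT-5.4 Nano (Reasoning, Low) | 19 | 15 | 7 | 4 | 3 | 9.4%
GPT-5.4 Nano (Reasoning) | 24 | 9 | 0 | 0 | 0 | 6.4%
Gemini 3 Flash (Preview) | 12 | 10 | 4 | 3 | 0 | 5.9%
Qwen 2.5 72B | 15 | 11 | 1 | 0 | 0 | 5.1%
Qwen 3.5 9B | 19 | 0 | 0 | 0 | 0 | 3.8%
DeepSeek V3 (2024-12-26) | 18 | 0 | 0 | 0 | 0 | 3.6%
Gemini 3 Flash (Preview, Reasoning) | 13 | 1 | 0 | 0 | 0 | 2.8%
Nemotron 3 Super | 12 | 0 | 0 | 0 | 0 | 2.3%
GPT-4o Mini (temp=0) | 11 | 0 | 0 | 0 | 0 | 2.2%
GPT-5 | 11 | 0 | 0 | 0 | 0 | 2.2%
GPT-5.2 | 9 | 0 | 0 | 0 | 0 | 1.8%
Qwen 3.5 27B | 5 | 0 | 0 | 0 | 0 | 1.1%
Qwen 3.5 122B | 3 | 1 | 0 | 0 | 0 | 0.8%
Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0.0%
Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Inception Mercury | 0 | 0 | 0 | 0 | 0 | 0.0%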