Purple prose (modifier overload)

Test: Bad Writing Habits

Avg. Score: 95.8%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Stealth: Aurora Alpha | 99.4% | $0.0000 | 9.8s | 95%
2 | Inception Mercury 2 | 98.9% | $0.0032 | 7.0s | 95%
3 | ByteDance Seed 1.6 Flash | 99.1% | $0.0013 | 27.3s | 95%
4 | GPT-4o Mini (temp=0) | 99.0% | $0.0012 | 34.8s | 95%
5 | Grok 4 Fast | 98.3% | $0.0017 | 24.1s | 94%
6 | GPT-4o, Aug. 6th (temp=0) | 98.6% | $0.023 | 22.7s | 95%
7 | Mistral NeMO | 97.8% | $0.0005 | 10.1s | 93%
8 | Nemotron 3 Nano | 98.9% | $0.0010 | 1.1m | 95%
9 | LFM2 24B | 98.2% | $0.0002 | 28.4s | 93%
10 | Ministral 3B | 97.8% | $0.0001 | 8.1s | 91%
11 | o4 Mini | 97.9% | $0.015 | 25.7s | 94%
12 | Qwen 3.5 Flash | 98.3% | $0.0025 | 47.5s | 93%
13 | Qwen 2.5 72B | 98.0% | $0.0010 | 36.7s | 92%
14 | Z.AI GLM 5 Turbo | 97.8% | $0.0081 | 33.2s | 93%
15 | Mistral Large | 98.0% | $0.014 | 30.9s | 93%
16 | Nemotron 3 Super | 98.6% | $0.0000 | 1.4m | 93%
17 | Grok 4.1 Fast | 97.5% | $0.0018 | 37.8s | 92%
18 | o4 Mini High | 98.2% | $0.025 | 47.2s | 94%
19 | Qwen 3.5 9B | 98.5% | $0.0011 | 1.4m | 93%
20 | GPT-5 Mini | 97.7% | $0.0100 | 57.4s | 93%
21 | Cohere Command R+ (Aug. 2024) | 98.0% | $0.020 | 52.5s | 93%
22 | Hermes 3 70B | 97.5% | $0.0010 | 1.2m | 93%
23 | DeepSeek V3 (2025-03-24) | 97.3% | $0.0014 | 39.4s | 90%
24 | Qwen 3.5 Plus (2026-02-15) | 96.7% | $0.0060 | 31.5s | 91%
25 | GPT-4o Mini (temp=1) | 97.4% | $0.0012 | 34.8s | 90%
26 | Arcee AI: Trinity Large (Preview) | 96.9% | $0.0000 | 43.6s | 91%
27 | Arcee AI: Trinity Mini | 96.3% | $0.0003 | 9.2s | 90%
28 | Mistral Large 3 | 97.0% | $0.0033 | 30.3s | 90%
29 | Qwen 3.5 35B | 98.0% | $0.018 | 1.0m | 92%
30 | Mistral Small Creative | 96.0% | $0.0007 | 9.1s | 90%
31 | Mistral Large 2 | 96.7% | $0.013 | 29.4s | 90%
32 | Qwen 3.5 122B | 97.7% | $0.025 | 1.1m | 92%
33 | GPT-5.4 Nano (Reasoning, Low) | 96.3% | $0.0055 | 20.6s | 89%
34 | Qwen 3 32B | 97.4% | $0.0015 | 54.6s | 88%
35 | Ministral 8B | 95.6% | $0.0004 | 10.4s | 88%
36 | Mistral Small 4 | 95.9% | $0.0014 | 18.2s | 88%
37 | Claude 3 Haiku | 96.3% | $0.0025 | 14.9s | 87%
38 | Llama 3.1 Nemotron 70B | 96.1% | $0.0038 | 31.7s | 89%
39 | Grok 4 | 99.0% | $0.048 | 1.7m | 94%
40 | Ministral 3 14B | 96.0% | $0.0007 | 11.7s | 87%
41 | Z.AI GLM 5 | 97.3% | $0.0084 | 1.2m | 90%
42 | Hermes 3 405B | 96.8% | $0.0032 | 53.2s | 89%
43 | Aion 2.0 | 96.4% | $0.0064 | 1.3m | 91%
44 | GPT-4.1 | 96.5% | $0.018 | 44.7s | 90%
45 | Mistral Medium 3.1 | 95.8% | $0.0048 | 36.5s | 89%
46 | Ministral 3 8B | 95.6% | $0.0008 | 19.6s | 87%
47 | GPT-4o, Aug. 6th (temp=1) | 96.6% | $0.018 | 24.4s | 87%
48 | MiniMax M2.7 | 96.6% | $0.0040 | 1.1m | 89%
49 | Qwen 3.5 27B | 97.0% | $0.020 | 1.6m | 92%
50 | GPT-4o, May 13th (temp=0) | 96.4% | $0.035 | 14.1s | 89%
51 | Ministral 3 3B | 95.9% | $0.0005 | 11.1s | 86%
52 | Grok 4.20 (Beta, Reasoning) | 96.6% | $0.039 | 34.0s | 90%
53 | DeepSeek V3 (2024-12-26) | 96.1% | $0.0021 | 54.6s | 88%
54 | DeepSeek-V2 Chat | 95.8% | $0.0021 | 53.3s | 88%
55 | WizardLM 2 8x22b | 96.9% | $0.0026 | 1.8m | 90%
56 | Writer: Palmyra X5 | 95.6% | $0.011 | 22.0s | 87%
57 | Qwen3 235B A22B Instruct 2507 | 96.0% | $0.0011 | 59.2s | 88%
58 | Claude Haiku 4.5 | 95.3% | $0.011 | 21.6s | 87%
59 | Rocinante 12B | 95.8% | $0.0014 | 38.4s | 86%
60 | MiniMax M2.5 | 96.0% | $0.0034 | 1.3m | 88%
61 | Z.AI GLM 4.7 Flash | 96.1% | $0.0017 | 1.2m | 87%
62 | Claude Sonnet 4.5 | 96.0% | $0.035 | 38.1s | 89%
63 | ByteDance Seed 2.0 Lite | 97.2% | $0.012 | 2.2m | 91%
64 | Qwen 3.5 397B A17B | 97.7% | $0.014 | 3.0m | 94%
65 | Claude 3.5 Sonnet | 97.0% | $0.048 | 35.5s | 88%
66 | GPT-5.4 Mini (Reasoning) | 95.6% | $0.022 | 28.1s | 87%
67 | GPT-5.4 Nano | 95.3% | $0.0057 | 26.3s | 85%
68 | ByteDance Seed 1.6 | 97.4% | $0.013 | 2.5m | 91%
69 | Stealth: Hunter Alpha | 95.3% | $0.0000 | 55.0s | 87%
70 | MoonshotAI: Kimi K2.5 | 98.3% | $0.019 | 3.2m | 93%
71 | Stealth: Healer Alpha | 94.9% | $0.0000 | 23.7s | 85%
72 | Claude Sonnet 4 | 95.9% | $0.032 | 43.7s | 88%
73 | Llama 3.1 70B | 96.2% | $0.0015 | 29.4s | 83%
74 | GPT-5.4 Nano (Reasoning) | 95.0% | $0.0061 | 24.5s | 85%
75 | GPT-5 Nano | 95.7% | $0.0042 | 1.4m | 88%
76 | Claude Opus 4.6 | 97.3% | $0.078 | 1.2m | 92%
77 | GPT-5.4 Mini | 94.8% | $0.015 | 16.8s | 85%
78 | GPT-5.2 | 96.7% | $0.056 | 1.5m | 91%
79 | Z.AI GLM 4.7 | 96.0% | $0.010 | 1.4m | 86%
80 | GPT-4o, May 13th (temp=1) | 95.3% | $0.033 | 14.4s | 85%
81 | Mistral Small 4 (Reasoning) | 94.5% | $0.0022 | 30.2s | 83%
82 | GPT-5.4 Mini (Reasoning, Low) | 94.6% | $0.015 | 16.8s | 83%
83 | Z.AI GLM 4.6 | 94.9% | $0.0065 | 51.5s | 85%
84 | Gemini 3 Pro (Preview) | 95.9% | $0.055 | 54.4s | 88%
85 | GPT-4.1 Mini | 93.7% | $0.0027 | 19.0s | 83%
86 | Claude Opus 4.6 (Reasoning) | 96.9% | $0.088 | 1.4m | 91%
87 | Z.AI GLM 4.5 | 94.7% | $0.0051 | 42.1s | 83%
88 | GPT-5 | 97.7% | $0.065 | 2.8m | 93%
89 | Grok 4.20 (Beta) | 94.4% | $0.018 | 15.8s | 82%
90 | Gemini 3 Flash (Preview) | 94.1% | $0.0078 | 19.6s | 81%
91 | Inception Mercury | 95.9% | $0.011 | 17.6s | 78%
92 | Gemini 3.1 Flash Lite (Preview) | 93.5% | $0.0030 | 8.4s | 81%
93 | Gemini 2.5 Flash Lite | 93.4% | $0.0009 | 9.5s | 80%
94 | Claude 3.5 Haiku | 93.6% | $0.0035 | 10.8s | 80%
95 | Claude Sonnet 4.6 | 94.2% | $0.031 | 39.3s | 83%
96 | DeepSeek V3.2 | 95.2% | $0.0014 | 1.9m | 83%
97 | Gemini 2.5 Pro | 94.6% | $0.036 | 36.2s | 83%
98 | GPT-5.4 | 95.3% | $0.049 | 1.4m | 86%
99 | Claude Opus 4.5 | 94.8% | $0.070 | 53.4s | 85%
100 | Gemini 2.5 Flash | 91.5% | $0.0052 | 10.6s | 80%
101 | Gemma 3 12B | 92.9% | $0.0004 | 41.3s | 79%
102 | Claude 3.7 Sonnet | 93.6% | $0.042 | 46.7s | 82%
103 | GPT-5.1 | 95.1% | $0.054 | 1.8m | 85%
104 | Claude Sonnet 4.6 (Reasoning) | 94.4% | $0.060 | 1.2m | 83%
105 | DeepSeek V3.1 | 93.3% | $0.0020 | 1.8m | 81%
106 | Gemini 2.5 Flash Lite (Reasoning) | 90.9% | $0.0028 | 30.8s | 79%
107 | GPT-5.4 (Reasoning, Low) | 93.9% | $0.055 | 1.4m | 82%
108 | ByteDance Seed 2.0 Mini | 95.9% | $0.0045 | 4.9m | 86%
109 | Gemma 3 27B | 91.7% | $0.0006 | 52.6s | 75%
110 | Claude Opus 4 | 96.9% | $0.209 | 1.4m | 91%
111 | Gemini 3 Flash (Preview, Reasoning) | 90.0% | $0.012 | 30.1s | 74%
112 | Llama 3.1 8B | 91.1% | $0.0003 | 1.3m | 72%
113 | GPT-5.4 (Reasoning) | 94.4% | $0.089 | 2.6m | 82%
114 | Mistral Small 3.2 24B | 95.3% | $0.0068 | 5.6m | 83%
115 | GPT-4.1 Nano | 87.6% | $0.0007 | 13.3s | 73%
116 | Gemma 3 4B | 88.7% | $0.0002 | 20.0s | 71%
117 | Gemini 2.5 Flash (Reasoning) | 88.2% | $0.011 | 21.5s | 72%
118 | Gemini 3.1 Pro (Preview) | 86.8% | $0.107 | 1.8m | 73%
95.82%

Individual Scenarios

Detailed Writing Rules

Model | #1 | #2 | #3 | #4 | #5 | Avg
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 98 | 97 | 99.0%
Z.AI GLM 4.5 | 100 | 100 | 99 | 98 | 97 | 98.9%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 94 | 98.8%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 94 | 98.8%
o4 Mini High | 100 | 100 | 100 | 100 | 94 | 98.8%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 94 | 98.8%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 94 | 98.8%
Ministral 3 14B | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral NeMO | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral Medium 3.1 | 100 | 100 | 100 | 99 | 94 | 98.6%
Claude Opus 4 | 100 | 100 | 100 | 98 | 94 | 98.5%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 98 | 94 | 98.4%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 97 | 94 | 98.2%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 96 | 94 | 98.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 96 | 94 | 98.0%
Ministral 8B | 100 | 100 | 100 | 100 | 89 | 97.9%
Ministral 3B | 100 | 100 | 100 | 99 | 90 | 97.8%
Mistral Large 3 | 100 | 100 | 100 | 94 | 94 | 97.7%
Stealth: Healer Alpha | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5 Mini | 100 | 100 | 100 | 94 | 94 | 97.6%
Claude Opus 4.6 | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5 | 100 | 100 | 100 | 94 | 94 | 97.6%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 88 | 97.6%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 88 | 97.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 88 | 97.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 94 | 94 | 97.6%
Mistral Small 4 | 100 | 99 | 97 | 97 | 94 | 97.5%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 99 | 88 | 97.3%
MiniMax M2.7 | 100 | 100 | 100 | 95 | 92 | 97.3%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 98 | 94 | 94 | 97.1%
LFM2 24B | 100 | 100 | 100 | 94 | 90 | 96.8%
Hermes 3 70B | 100 | 100 | 100 | 94 | 90 | 96.8%
GPT-5.4 Mini | 100 | 100 | 96 | 94 | 94 | 96.8%
Llama 3.1 8B | 100 | 100 | 96 | 95 | 92 | 96.5%
ByteDance Seed 1.6 | 100 | 100 | 94 | 94 | 94 | 96.4%
GPT-5.2 | 100 | 100 | 94 | 94 | 94 | 96.4%
Z.AI GLM 4.7 | 100 | 100 | 100 | 94 | 88 | 96.4%
o4 Mini | 100 | 100 | 94 | 94 | 94 | 96.4%
Qwen 3.5 35B | 100 | 100 | 100 | 94 | 88 | 96.4%
Qwen 3.5 Flash | 100 | 100 | 94 | 94 | 94 | 96.4%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 94 | 88 | 96.4%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 82 | 96.4%
DeepSeek V3 (2025-03-24) | 100 | 100 | 94 | 94 | 94 | 96.4%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 94 | 88 | 96.4%
GPT-4.1 | 100 | 100 | 96 | 94 | 92 | 96.3%
Mistral Small 4 (Reasoning) | 100 | 100 | 99 | 94 | 88 | 96.2%
Ministral 3 8B | 100 | 100 | 100 | 94 | 87 | 96.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 94 | 86 | 96.1%
Hermes 3 405B | 100 | 100 | 100 | 100 | 80 | 96.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 94 | 92 | 91 | 95.6%
Writer: Palmyra X5 | 100 | 100 | 100 | 94 | 83 | 95.5%
Gemma 3 12B | 100 | 95 | 94 | 94 | 94 | 95.4%
Gemini 3 Pro (Preview) | 100 | 94 | 94 | 94 | 94 | 95.2%
Qwen 3.5 9B | 100 | 100 | 100 | 94 | 82 | 95.2%
Qwen 3.5 Plus (2026-02-15) | 100 | 94 | 94 | 94 | 94 | 95.2%
Z.AI GLM 4.7 Flash | 100 | 94 | 94 | 94 | 94 | 95.2%
GPT-5.4 | 100 | 100 | 94 | 94 | 88 | 95.2%
Mistral Large 2 | 100 | 94 | 94 | 94 | 94 | 95.2%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 94 | 94 | 88 | 95.2%
Claude Sonnet 4.5 | 99 | 97 | 97 | 94 | 88 | 95.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 99 | 94 | 82 | 95.0%
Gemma 3 27B | 100 | 100 | 100 | 88 | 87 | 95.0%
Claude Sonnet 4.6 | 100 | 100 | 98 | 88 | 88 | 94.8%
Claude Sonnet 4 | 100 | 100 | 94 | 93 | 85 | 94.4%
Mistral Small Creative | 100 | 100 | 94 | 94 | 82 | 94.0%
Qwen 3.5 397B A17B | 100 | 94 | 94 | 94 | 88 | 94.0%
GPT-5.4 Mini (Reasoning) | 100 | 94 | 94 | 94 | 88 | 94.0%
Stealth: Hunter Alpha | 100 | 100 | 94 | 88 | 88 | 94.0%
Qwen 2.5 72B | 100 | 100 | 94 | 88 | 88 | 94.0%
GPT-5.4 (Reasoning, Low) | 94 | 94 | 94 | 94 | 94 | 94.0%
Gemini 2.5 Flash Lite | 100 | 98 | 94 | 90 | 88 | 93.9%
Claude 3.7 Sonnet | 100 | 100 | 100 | 89 | 78 | 93.4%
Claude Haiku 4.5 | 100 | 95 | 94 | 91 | 87 | 93.4%
Gemini 2.5 Flash | 100 | 94 | 93 | 92 | 88 | 93.4%
GPT-4.1 Nano | 100 | 94 | 93 | 91 | 89 | 93.4%
GPT-4.1 Mini | 94 | 94 | 94 | 94 | 90 | 93.1%
MiniMax M2.5 | 100 | 98 | 97 | 92 | 77 | 92.9%
DeepSeek V3.2 | 100 | 100 | 100 | 82 | 82 | 92.8%
Gemini 3.1 Flash Lite (Preview) | 100 | 94 | 94 | 88 | 88 | 92.8%
Claude Opus 4.5 | 100 | 96 | 94 | 92 | 78 | 92.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 60 | 92.0%
WizardLM 2 8x22b | 94 | 94 | 94 | 90 | 87 | 91.7%
Z.AI GLM 4.6 | 100 | 100 | 88 | 88 | 82 | 91.6%
Z.AI GLM 5 | 100 | 98 | 97 | 96 | 66 | 91.5%
Aion 2.0 | 100 | 94 | 88 | 88 | 87 | 91.4%
GPT-5 Nano | 100 | 100 | 99 | 82 | 76 | 91.4%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 94 | 60 | 90.8%
Ministral 3 3B | 100 | 100 | 94 | 88 | 70 | 90.4%
Claude 3.5 Haiku | 100 | 100 | 100 | 78 | 73 | 90.2%
GPT-5.1 | 100 | 94 | 88 | 88 | 76 | 89.2%
Gemini 2.5 Pro | 100 | 88 | 88 | 88 | 82 | 89.2%
Gemini 3 Flash (Preview) | 100 | 94 | 88 | 82 | 76 | 88.0%
Gemini 2.5 Flash (Reasoning) | 94 | 90 | 87 | 86 | 82 | 87.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 88 | 87 | 86 | 78 | 87.8%
Gemini 3 Flash (Preview, Reasoning) | 94 | 94 | 88 | 82 | 76 | 86.8%
DeepSeek V3.1 | 100 | 88 | 88 | 88 | 70 | 86.8%
Gemini 3.1 Pro (Preview) | 94 | 94 | 82 | 82 | 76 | 85.6%
Gemma 3 4B | 92 | 92 | 86 | 83 | 74 | 85.4%
Rocinante 12B | 100 | 94 | 87 | 68 | 67 | 83.2%
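
Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 98 | 99.5%
Claude 3 Haiku | 100 | 100 | 100 | 98 | 98 | 99.1%
Ministral 3 8B | 100 | 100 | 100 | 100 | 96 | 99.1%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 94 | 98.8%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 94 | 98.8%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 94 | 98.8%
LFM2 24B | 100 | 100 | 100 | 100 | 94 | 98.8%
Grok 4 Fast | 100 | 100 | 100 | 100 | 94 | 98.7%
Mistral Large | 100 | 100 | 100 | 100 | 93 | 98.6%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 91 | 98.2%
GPT-5.4 Nano | 100 | 99 | 99 | 96 | 96 | 98.2%
Ministral 3 14B | 100 | 100 | 100 | 98 | 91 | 97.9%
ByteDance Seed 1.6 | 100 | 100 | 100 | 95 | 94 | 97.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 96 | 93 | 97.7%
GPT-5 | 100 | 100 | 100 | 94 | 94 | 97.6%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 94 | 94 | 97.6%
Z.AI GLM 4.7 | 100 | 100 | 100 | 94 | 94 | 97.6%
o4 Mini | 100 | 100 | 100 | 94 | 94 | 97.6%
Qwen 3.5 Flash | 100 | 100 | 100 | 94 | 94 | 97.6%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 94 | 94 | 97.6%
Qwen 3 32B | 100 | 100 | 100 | 94 | 94 | 97.6%
Inception Mercury | 100 | 100 | 100 | 100 | 88 | 97.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 88 | 97.6%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 99 | 98 | 92 | 97.5%
Mistral Medium 3.1 | 100 | 100 | 100 | 94 | 93 | 97.5%
GPT-5.4 Mini | 100 | 100 | 96 | 96 | 95 | 97.4%
Hermes 3 70B | 100 | 100 | 99 | 94 | 94 | 97.3%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 96 | 96 | 94 | 97.3%
Claude Opus 4 | 100 | 100 | 99 | 98 | 89 | 97.2%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 98 | 87 | 97.1%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 96 | 94 | 94 | 96.9%
Rocinante 12B | 100 | 100 | 100 | 96 | 88 | 96.8%
DeepSeek V3.1 | 100 | 100 | 100 | 94 | 89 | 96.6%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 94 | 94 | 94 | 96.4%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 94 | 88 | 96.4%
Grok 4.1 Fast | 100 | 100 | 94 | 94 | 94 | 96.4%
Aion 2.0 | 100 | 100 | 94 | 94 | 94 | 96.4%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 94 | 88 | 96.4%
Inception Mercury 2 | 100 | 100 | 94 | 94 | 94 | 96.4%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 82 | 96.4%
Qwen 2.5 72B | 100 | 100 | 100 | 94 | 88 | 96.4%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 81 | 96.3%
Claude Sonnet 4 | 100 | 100 | 97 | 94 | 91 | 96.3%
Mistral Small 3.2 24B | 100 | 100 | 100 | 94 | 87 | 96.2%
Mistral NeMO | 100 | 100 | 100 | 100 | 81 | 96.2%
Mistral Small Creative | 100 | 100 | 98 | 92 | 91 | 96.2%
Grok 4.20 (Beta) | 100 | 100 | 94 | 94 | 93 | 96.1%
GPT-5.4 (Reasoning) | 100 | 98 | 94 | 94 | 94 | 96.0%
GPT-5.4 Mini (Reasoning) | 100 | 95 | 95 | 95 | 94 | 95.9%
GPT-4o Mini (temp=1) | 100 | 99 | 95 | 94 | 91 | 95.9%
Claude Opus 4.5 | 100 | 98 | 96 | 94 | 91 | 95.8%
Mistral Large 2 | 100 | 100 | 95 | 94 | 89 | 95.6%
GPT-5.1 | 100 | 100 | 94 | 94 | 90 | 95.6%
Z.AI GLM 5 | 100 | 100 | 100 | 98 | 80 | 95.6%
Qwen 3.5 27B | 100 | 94 | 94 | 94 | 94 | 95.2%
o4 Mini High | 100 | 100 | 94 | 94 | 88 | 95.2%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 88 | 88 | 95.2%
Stealth: Hunter Alpha | 100 | 100 | 99 | 94 | 82 | 95.1%
Writer: Palmyra X5 | 100 | 100 | 100 | 94 | 81 | 95.0%
WizardLM 2 8x22b | 100 | 97 | 95 | 94 | 88 | 94.7%
Claude 3.5 Haiku | 100 | 100 | 100 | 88 | 85 | 94.6%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 73 | 94.5%
Claude 3.7 Sonnet | 100 | 95 | 95 | 93 | 89 | 94.4%
MiniMax M2.5 | 100 | 100 | 94 | 93 | 85 | 94.2%
Claude Haiku 4.5 | 100 | 100 | 92 | 90 | 89 | 94.2%
DeepSeek V3 (2024-12-26) | 100 | 100 | 97 | 90 | 84 | 94.1%
Grok 4 | 100 | 100 | 100 | 94 | 77 | 94.1%
DeepSeek V3.2 | 100 | 100 | 94 | 94 | 82 | 94.0%
Stealth: Healer Alpha | 100 | 100 | 94 | 88 | 88 | 94.0%
GPT-5.4 (Reasoning, Low) | 100 | 94 | 93 | 92 | 90 | 93.9%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 94 | 93 | 82 | 93.9%
Grok 4.20 (Beta, Reasoning) | 97 | 96 | 95 | 94 | 86 | 93.7%
GPT-5.4 | 98 | 95 | 94 | 94 | 86 | 93.6%
Llama 3.1 Nemotron 70B | 99 | 99 | 94 | 93 | 82 | 93.6%
Mistral Small 4 | 100 | 100 | 94 | 91 | 82 | 93.5%
Z.AI GLM 4.6 | 100 | 94 | 93 | 92 | 88 | 93.4%
Claude Sonnet 4.5 | 100 | 99 | 95 | 86 | 86 | 93.2%
Claude Sonnet 4.6 | 100 | 100 | 95 | 94 | 77 | 93.1%
GPT-4o, May 13th (temp=1) | 100 | 94 | 93 | 91 | 88 | 93.1%
Claude 3.5 Sonnet | 100 | 99 | 94 | 94 | 77 | 92.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 99 | 93 | 89 | 83 | 92.8%
Gemini 2.5 Pro | 100 | 94 | 94 | 88 | 87 | 92.6%
GPT-5.4 Mini (Reasoning, Low) | 96 | 95 | 95 | 92 | 85 | 92.5%
Gemini 3 Flash (Preview) | 94 | 94 | 94 | 92 | 88 | 92.5%
GPT-5.2 | 100 | 94 | 94 | 88 | 83 | 91.8%
Mistral Small 4 (Reasoning) | 99 | 95 | 94 | 88 | 82 | 91.5%
MiniMax M2.7 | 98 | 94 | 92 | 91 | 81 | 91.2%
Gemini 2.5 Flash Lite | 100 | 100 | 96 | 89 | 70 | 91.1%
GPT-4.1 Nano | 94 | 94 | 93 | 92 | 81 | 91.0%
Mistral Large 3 | 95 | 93 | 92 | 89 | 84 | 90.7%
GPT-4.1 | 100 | 94 | 94 | 91 | 74 | 90.6%
GPT-4.1 Mini | 96 | 94 | 94 | 86 | 82 | 90.4%
Z.AI GLM 4.5 | 95 | 94 | 93 | 87 | 83 | 90.2%
Gemma 3 4B | 100 | 98 | 96 | 80 | 76 | 89.9%
Gemini 3.1 Flash Lite (Preview) | 94 | 94 | 88 | 88 | 76 | 88.0%
Llama 3.1 8B | 100 | 100 | 100 | 70 | 67 | 87.4%
DeepSeek-V2 Chat | 94 | 89 | 88 | 84 | 81 | 87.1%
Gemma 3 12B | 100 | 96 | 94 | 76 | 64 | 86.2%
Gemma 3 27B | 100 | 91 | 90 | 75 | 71 | 85.6%
Ministral 3 3B | 100 | 99 | 82 | 78 | 68 | 85.5%
Gemini 2.5 Flash | 91 | 89 | 86 | 81 | 77 | 84.6%
Gemini 3.1 Pro (Preview) | 100 | 88 | 82 | 82 | 70 | 84.4%
Gemini 2.5 Flash (Reasoning) | 88 | 88 | 86 | 72 | 72 | 81.2%
Gemini 3 Flash (Preview, Reasoning) | 94 | 94 | 76 | 70 | 60 | 78.8%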

Model | #1 | #2 | #3 | #4 | #5 | Avg
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 99.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 99.9%
Mistral Large 2 | 100 | 100 | 100 | 100 | 99 | 99.8%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 99 | 99.8%
Inception Mercury | 100 | 100 | 100 | 100 | 98 | 99.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 97 | 99.3%
Mistral Small Creative | 100 | 100 | 100 | 100 | 96 | 99.3%
Mistral Large 3 | 100 | 100 | 100 | 99 | 97 | 99.2%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 96 | 99.2%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 94 | 98.9%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5 Mini | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5 | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 94 | 98.8%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 94 | 98.8%
Grok 4 Fast | 100 | 100 | 100 | 100 | 94 | 98.8%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 94 | 98.8%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 94 | 98.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 94 | 98.8%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral NeMO | 100 | 100 | 100 | 100 | 94 | 98.8%
WizardLM 2 8x22b | 100 | 100 | 100 | 99 | 94 | 98.7%
Claude Sonnet 4.5 | 100 | 100 | 100 | 99 | 94 | 98.5%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 99 | 98 | 95 | 98.5%
Ministral 3B | 100 | 100 | 100 | 100 | 92 | 98.5%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 98 | 94 | 98.4%
GPT-5.4 | 100 | 100 | 100 | 98 | 94 | 98.3%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 96 | 96 | 98.3%
Ministral 8B | 100 | 100 | 100 | 100 | 91 | 98.2%
GPT-5 Nano | 100 | 100 | 99 | 97 | 95 | 98.2%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 97 | 93 | 98.1%
MiniMax M2.7 | 100 | 100 | 100 | 96 | 94 | 98.1%
Mistral Large | 100 | 100 | 100 | 100 | 90 | 97.9%
Claude Sonnet 4.6 | 100 | 100 | 98 | 98 | 93 | 97.9%
Ministral 3 3B | 100 | 100 | 100 | 100 | 89 | 97.7%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 94 | 94 | 97.6%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 88 | 97.6%
GPT-5.2 | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-4.1 | 100 | 100 | 100 | 94 | 94 | 97.6%
o4 Mini | 100 | 100 | 100 | 94 | 94 | 97.6%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 94 | 94 | 97.6%
Stealth: Healer Alpha | 100 | 100 | 100 | 94 | 94 | 97.6%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 94 | 94 | 97.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 99 | 90 | 97.6%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 99 | 94 | 94 | 97.4%
GPT-5.4 Mini | 100 | 100 | 100 | 98 | 90 | 97.4%
GPT-5.4 Nano (Reasoning) | 100 | 99 | 98 | 97 | 92 | 97.4%
Z.AI GLM 4.5 | 100 | 100 | 99 | 94 | 94 | 97.4%
Mistral Medium 3.1 | 100 | 100 | 100 | 94 | 93 | 97.3%
Hermes 3 405B | 100 | 100 | 100 | 94 | 93 | 97.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 99 | 88 | 97.2%
GPT-4.1 Mini | 100 | 100 | 100 | 94 | 93 | 97.2%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 97 | 95 | 93 | 97.0%
Z.AI GLM 4.7 | 100 | 100 | 97 | 94 | 94 | 97.0%
LFM2 24B | 100 | 100 | 100 | 100 | 84 | 96.8%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 99 | 92 | 91 | 96.5%
o4 Mini High | 100 | 100 | 94 | 94 | 94 | 96.4%
GPT-5.1 | 100 | 100 | 94 | 94 | 94 | 96.4%
Claude Opus 4.6 | 100 | 100 | 94 | 94 | 94 | 96.4%
Qwen 3.5 122B | 100 | 100 | 94 | 94 | 94 | 96.4%
Qwen 3.5 27B | 100 | 100 | 94 | 94 | 94 | 96.4%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 94 | 94 | 94 | 96.4%
Qwen 2.5 72B | 100 | 100 | 100 | 94 | 88 | 96.4%
Claude Sonnet 4 | 100 | 100 | 100 | 94 | 87 | 96.2%
Mistral Small 4 (Reasoning) | 100 | 100 | 95 | 94 | 92 | 96.1%
Hermes 3 70B | 100 | 98 | 94 | 94 | 94 | 96.0%
Claude Opus 4 | 100 | 100 | 96 | 92 | 90 | 95.7%
Rocinante 12B | 100 | 100 | 100 | 91 | 87 | 95.5%
Writer: Palmyra X5 | 100 | 95 | 94 | 94 | 94 | 95.4%
Claude 3.5 Haiku | 100 | 94 | 94 | 94 | 94 | 95.2%
Z.AI GLM 4.6 | 100 | 94 | 94 | 94 | 94 | 95.2%
DeepSeek V3.2 | 100 | 100 | 100 | 94 | 82 | 95.2%
Grok 4.20 (Beta) | 100 | 99 | 94 | 94 | 88 | 95.1%
Llama 3.1 8B | 100 | 100 | 100 | 91 | 84 | 95.0%
Llama 3.1 70B | 100 | 100 | 100 | 88 | 86 | 94.7%
Gemma 3 12B | 100 | 100 | 98 | 91 | 85 | 94.6%
Mistral Small 4 | 99 | 97 | 94 | 94 | 88 | 94.5%
Qwen 3.5 35B | 100 | 100 | 94 | 90 | 88 | 94.4%
Gemini 3 Pro (Preview) | 100 | 100 | 94 | 88 | 88 | 94.0%
Stealth: Hunter Alpha | 100 | 94 | 94 | 94 | 88 | 94.0%
Ministral 3 8B | 99 | 95 | 94 | 94 | 87 | 93.9%
Gemini 2.5 Flash Lite | 100 | 95 | 94 | 93 | 88 | 93.9%
GPT-4o, May 13th (temp=0) | 100 | 94 | 94 | 94 | 87 | 93.7%
Ministral 3 14B | 100 | 99 | 96 | 93 | 77 | 93.1%
MiniMax M2.5 | 98 | 95 | 92 | 90 | 90 | 93.1%
DeepSeek V3.1 | 100 | 94 | 94 | 89 | 88 | 93.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 88 | 76 | 92.8%
GPT-4o, May 13th (temp=1) | 99 | 96 | 96 | 94 | 77 | 92.5%
GPT-5.4 Nano | 95 | 94 | 94 | 93 | 87 | 92.4%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 98 | 82 | 74 | 90.8%
Claude Opus 4.5 | 98 | 92 | 92 | 90 | 79 | 90.5%
Gemini 2.5 Pro | 100 | 100 | 94 | 82 | 76 | 90.4%
Llama 3.1 Nemotron 70B | 99 | 94 | 94 | 94 | 71 | 90.3%
Claude 3 Haiku | 100 | 100 | 94 | 88 | 65 | 89.5%
GPT-4o Mini (temp=1) | 100 | 94 | 94 | 84 | 72 | 88.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 96 | 85 | 85 | 78 | 88.8%
Gemini 3.1 Flash Lite (Preview) | 100 | 94 | 88 | 82 | 76 | 88.0%
Claude Haiku 4.5 | 99 | 95 | 86 | 82 | 78 | 88.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 94 | 82 | 82 | 70 | 85.6%
Gemini 2.5 Flash (Reasoning) | 100 | 94 | 90 | 76 | 67 | 85.4%
GPT-4.1 Nano | 91 | 90 | 85 | 81 | 75 | 84.3%
Gemini 3.1 Pro (Preview) | 88 | 88 | 82 | 82 | 76 | 83.2%
Gemma 3 27B | 93 | 92 | 83 | 78 | 54 | 80.0%
Gemma 3 4B | 92 | 84 | 82 | 79 | 50 | 77.4%
Mistral Small 3.2 24B | 100 | 82 | 66 | 62 | 59 | 73.7%

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 99 | 99.8%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 99 | 99.8%
Z.AI GLM 4.5 | 100 | 100 | 100 | 99 | 99 | 99.6%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 98 | 99.5%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 97 | 99.5%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 97 | 99.4%
Claude Haiku 4.5 | 100 | 100 | 100 | 99 | 98 | 99.3%
Claude Opus 4 | 100 | 100 | 100 | 100 | 94 | 98.9%
GPT-5 Mini | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5 | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 94 | 98.8%
o4 Mini High | 100 | 100 | 100 | 100 | 94 | 98.8%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 94 | 98.8%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 94 | 98.8%
Grok 4 | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 | 100 | 100 | 100 | 100 | 94 | 98.8%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3 32B | 100 | 100 | 100 | 100 | 94 | 98.8%
Inception Mercury | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 94 | 98.8%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 94 | 98.8%
Hermes 3 70B | 100 | 100 | 100 | 99 | 94 | 98.6%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 92 | 98.5%
Mistral Large 3 | 100 | 100 | 100 | 98 | 94 | 98.3%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 91 | 98.1%
Ministral 3 3B | 100 | 100 | 100 | 95 | 94 | 97.8%
Mistral Small 4 | 100 | 100 | 100 | 100 | 89 | 97.8%
Claude Sonnet 4.6 | 100 | 100 | 100 | 95 | 94 | 97.7%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 89 | 97.7%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5.1 | 100 | 100 | 100 | 94 | 94 | 97.6%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 94 | 94 | 97.6%
Grok 4.1 Fast | 100 | 100 | 100 | 94 | 94 | 97.6%
Aion 2.0 | 100 | 100 | 100 | 94 | 94 | 97.6%
Stealth: Hunter Alpha | 100 | 100 | 100 | 94 | 94 | 97.6%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 94 | 94 | 97.6%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 88 | 97.6%
WizardLM 2 8x22b | 100 | 100 | 100 | 94 | 94 | 97.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 97 | 91 | 97.6%
Writer: Palmyra X5 | 100 | 100 | 98 | 96 | 94 | 97.5%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 94 | 94 | 97.5%
Claude Opus 4.5 | 100 | 100 | 100 | 94 | 94 | 97.5%
Mistral Large 2 | 100 | 100 | 100 | 94 | 93 | 97.5%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 94 | 93 | 97.5%
Claude Sonnet 4.5 | 100 | 100 | 100 | 96 | 91 | 97.5%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 99 | 96 | 92 | 97.3%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 99 | 94 | 94 | 97.3%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 94 | 92 | 97.2%
Mistral Medium 3.1 | 100 | 100 | 99 | 95 | 91 | 96.9%
Ministral 8B | 100 | 100 | 97 | 94 | 92 | 96.8%
Rocinante 12B | 100 | 100 | 100 | 94 | 90 | 96.7%
Claude 3 Haiku | 100 | 100 | 97 | 94 | 92 | 96.5%
DeepSeek V3 (2024-12-26) | 100 | 99 | 99 | 94 | 90 | 96.4%
ByteDance Seed 1.6 | 100 | 100 | 94 | 94 | 94 | 96.4%
Z.AI GLM 4.7 | 100 | 100 | 94 | 94 | 94 | 96.4%
Grok 4 Fast | 100 | 100 | 94 | 94 | 94 | 96.4%
GPT-4o, May 13th (temp=0) | 100 | 100 | 94 | 94 | 94 | 96.4%
GPT-5 Nano | 100 | 100 | 94 | 94 | 94 | 96.4%
DeepSeek V3.1 | 100 | 100 | 94 | 94 | 94 | 96.4%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 94 | 88 | 96.4%
Ministral 3 14B | 100 | 100 | 98 | 92 | 91 | 96.3%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 81 | 96.2%
Claude Sonnet 4 | 100 | 100 | 96 | 93 | 88 | 95.5%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 94 | 83 | 95.4%
Z.AI GLM 4.6 | 100 | 100 | 94 | 94 | 88 | 95.2%
GPT-4.1 | 100 | 100 | 94 | 94 | 88 | 95.2%
ByteDance Seed 2.0 Mini | 100 | 94 | 94 | 94 | 94 | 95.2%
Stealth: Healer Alpha | 100 | 100 | 100 | 88 | 88 | 95.2%
DeepSeek V3.2 | 100 | 100 | 94 | 94 | 88 | 95.2%
Arcee AI: Trinity Large (Preview) | 100 | 98 | 98 | 94 | 85 | 95.0%
Gemma 3 12B | 100 | 100 | 99 | 92 | 83 | 95.0%
Claude 3.7 Sonnet | 100 | 100 | 95 | 94 | 85 | 94.9%
DeepSeek V3 (2025-03-24) | 100 | 98 | 94 | 93 | 88 | 94.6%
Z.AI GLM 4.7 Flash | 100 | 97 | 94 | 94 | 87 | 94.5%
Hermes 3 405B | 100 | 100 | 99 | 95 | 78 | 94.3%
Claude 3.5 Haiku | 100 | 100 | 100 | 99 | 70 | 93.9%
GPT-5.4 Nano | 100 | 100 | 94 | 88 | 87 | 93.8%
Ministral 3 8B | 100 | 100 | 94 | 88 | 87 | 93.7%
GPT-4.1 Mini | 100 | 97 | 94 | 88 | 87 | 93.2%
Gemini 2.5 Flash | 100 | 100 | 93 | 89 | 85 | 93.1%
Gemma 3 27B | 100 | 100 | 96 | 85 | 81 | 92.4%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 88 | 70 | 91.6%
GPT-4.1 Nano | 93 | 93 | 91 | 88 | 82 | 89.3%
Gemini 3 Flash (Preview) | 100 | 94 | 94 | 82 | 76 | 89.2%
Gemini 2.5 Flash (Reasoning) | 94 | 86 | 84 | 81 | 79 | 84.7%
Gemini 3.1 Pro (Preview) | 94 | 88 | 82 | 82 | 64 | 82.0%
Gemma 3 4B | 92 | 90 | 84 | 75 | 69 | 81.9%
Gemini 2.5 Flash Lite (Reasoning) | 84 | 83 | 83 | 81 | 76 | 81.3%
Gemini 3 Flash (Preview, Reasoning) | 94 | 88 | 82 | 70 | 60 | 78.8%

Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 99 | 99.9%
Mistral Large | 100 | 100 | 100 | 100 | 99 | 99.8%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 98 | 99.6%
Mistral Small 4 | 100 | 100 | 100 | 100 | 98 | 99.5%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 98 | 99.5%
GPT-5.4 | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-5 | 100 | 100 | 100 | 100 | 94 | 98.8%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 94 | 98.8%
o4 Mini High | 100 | 100 | 100 | 100 | 94 | 98.8%
o4 Mini | 100 | 100 | 100 | 100 | 94 | 98.8%
Grok 4 Fast | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 94 | 98.8%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 94 | 98.8%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 94 | 98.8%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 93 | 98.6%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.6%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 93 | 98.5%
LFM2 24B | 100 | 100 | 100 | 99 | 93 | 98.4%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 96 | 96 | 98.4%
Claude 3 Haiku | 100 | 100 | 100 | 98 | 93 | 98.3%
Hermes 3 70B | 100 | 100 | 100 | 100 | 91 | 98.3%
Claude Opus 4.6 | 100 | 100 | 100 | 97 | 94 | 98.2%
Ministral 3 14B | 100 | 100 | 100 | 96 | 94 | 97.9%
GPT-5.4 Mini | 100 | 100 | 99 | 99 | 93 | 97.9%
Ministral 3B | 100 | 100 | 100 | 96 | 94 | 97.9%
GPT-5.4 Mini (Reasoning, Low) | 100 | 99 | 98 | 96 | 95 | 97.6%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5.1 | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 94 | 94 | 97.6%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 88 | 97.6%
Aion 2.0 | 100 | 100 | 100 | 94 | 94 | 97.6%
Qwen 3.5 35B | 100 | 100 | 100 | 94 | 94 | 97.6%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 88 | 97.6%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 94 | 94 | 97.6%
Claude 3.5 Sonnet | 100 | 100 | 100 | 94 | 94 | 97.6%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 88 | 97.6%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 94 | 93 | 97.5%
Mistral NeMO | 100 | 100 | 99 | 94 | 94 | 97.5%
DeepSeek V3 (2024-12-26) | 100 | 100 | 99 | 94 | 94 | 97.3%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 94 | 93 | 97.3%
Claude Opus 4 | 100 | 100 | 99 | 93 | 93 | 97.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 97 | 88 | 96.9%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 94 | 91 | 96.9%
Claude Sonnet 4 | 100 | 100 | 100 | 94 | 90 | 96.8%
GPT-5 Nano | 100 | 100 | 98 | 94 | 92 | 96.8%
GPT-5.4 Nano (Reasoning, Low) | 99 | 97 | 97 | 97 | 94 | 96.6%
GPT-4.1 | 100 | 100 | 100 | 95 | 88 | 96.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 91 | 91 | 96.5%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 94 | 88 | 96.4%
Qwen 3.5 27B | 100 | 100 | 94 | 94 | 94 | 96.4%
Grok 4.1 Fast | 100 | 100 | 94 | 94 | 94 | 96.4%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 94 | 88 | 96.4%
Z.AI GLM 4.7 | 100 | 100 | 94 | 94 | 94 | 96.4%
Mistral Small Creative | 100 | 100 | 97 | 93 | 92 | 96.4%
Claude Sonnet 4.5 | 100 | 100 | 100 | 94 | 88 | 96.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 94 | 94 | 92 | 96.1%
Writer: Palmyra X5 | 100 | 100 | 100 | 94 | 85 | 95.9%
GPT-5 Mini | 100 | 100 | 98 | 94 | 88 | 95.9%
Claude Sonnet 4.6 (Reasoning) | 98 | 98 | 96 | 95 | 92 | 95.8%
Mistral Large 3 | 100 | 100 | 95 | 94 | 90 | 95.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 94 | 84 | 95.7%
GPT-5.4 Nano | 100 | 95 | 95 | 94 | 94 | 95.6%
Hermes 3 405B | 100 | 100 | 96 | 94 | 88 | 95.5%
GPT-5.2 | 100 | 100 | 94 | 94 | 88 | 95.2%
Z.AI GLM 4.6 | 100 | 94 | 94 | 94 | 94 | 95.2%
Gemini 2.5 Pro | 100 | 100 | 94 | 94 | 88 | 95.2%
Mistral Small 3.2 24B | 100 | 94 | 94 | 94 | 94 | 95.2%
Arcee AI: Trinity Mini | 100 | 100 | 94 | 94 | 88 | 95.2%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 88 | 88 | 95.1%
Llama 3.1 Nemotron 70B | 100 | 98 | 95 | 92 | 90 | 94.9%
GPT-5.4 Nano (Reasoning) | 100 | 98 | 95 | 91 | 90 | 94.9%
DeepSeek V3.1 | 100 | 94 | 94 | 94 | 92 | 94.7%
Gemini 2.5 Flash Lite (Reasoning) | 99 | 98 | 96 | 94 | 86 | 94.6%
Rocinante 12B | 100 | 100 | 95 | 94 | 84 | 94.5%
Mistral Large 2 | 100 | 100 | 100 | 90 | 82 | 94.5%
WizardLM 2 8x22b | 100 | 100 | 96 | 89 | 88 | 94.4%
Ministral 3 8B | 100 | 100 | 94 | 94 | 84 | 94.4%
Qwen 3 32B | 100 | 100 | 100 | 94 | 76 | 94.2%
Grok 4.20 (Beta) | 100 | 98 | 96 | 93 | 82 | 93.9%
Mistral Small 4 (Reasoning) | 98 | 96 | 93 | 91 | 91 | 93.9%
GPT-4.1 Mini | 100 | 100 | 94 | 87 | 85 | 93.1%
Mistral Medium 3.1 | 97 | 96 | 95 | 92 | 85 | 93.0%
Gemini 2.5 Flash Lite | 100 | 100 | 94 | 85 | 85 | 92.8%
Claude Opus 4.5 | 99 | 96 | 93 | 88 | 87 | 92.6%
Claude Sonnet 4.6 | 99 | 97 | 92 | 88 | 87 | 92.6%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 88 | 88 | 82 | 91.6%
Ministral 3 3B | 100 | 100 | 100 | 90 | 67 | 91.4%
Llama 3.1 70B | 100 | 93 | 90 | 90 | 83 | 91.0%
Gemini 2.5 Flash | 96 | 94 | 90 | 88 | 85 | 90.8%
Gemma 3 27B | 100 | 93 | 93 | 85 | 80 | 90.2%
MiniMax M2.7 | 100 | 100 | 91 | 91 | 67 | 90.0%
Claude Haiku 4.5 | 96 | 91 | 90 | 90 | 80 | 89.4%
Gemma 3 12B | 100 | 94 | 92 | 81 | 79 | 89.0%
Ministral 8B | 100 | 99 | 87 | 86 | 73 | 88.9%
Gemini 2.5 Flash (Reasoning) | 92 | 91 | 89 | 89 | 83 | 88.8%
GPT-4.1 Nano | 99 | 87 | 86 | 86 | 83 | 88.3%
Claude 3.7 Sonnet | 100 | 98 | 84 | 81 | 79 | 88.2%
Z.AI GLM 4.5 | 94 | 93 | 91 | 88 | 67 | 86.7%
Llama 3.1 8B | 100 | 100 | 92 | 77 | 60 | 85.7%
Gemini 3.1 Pro (Preview) | 94 | 88 | 83 | 82 | 82 | 85.7%
Gemini 3 Flash (Preview, Reasoning) | 94 | 94 | 82 | 76 | 76 | 84.4%
Inception Mercury | 100 | 100 | 100 | 59 | 53 | 82.4%
Claude 3.5 Haiku | 100 | 97 | 89 | 85 | 40 | 82.3%
Gemma 3 4B | 77 | 73 | 71 | 66 | 58 | 69.1%

Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 99.9%
LFM2 24B | 100 | 100 | 100 | 100 | 99 | 99.9%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 99 | 99.9%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 99 | 99.8%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 97 | 99.5%
Rocinante 12B | 100 | 100 | 100 | 100 | 97 | 99.4%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 94 | 98.8%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 94 | 98.8%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4.1 | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 94 | 98.8%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral Large 2 | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3 32B | 100 | 100 | 100 | 100 | 94 | 98.8%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 94 | 98.8%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 94 | 98.8%
Hermes 3 70B | 100 | 100 | 100 | 100 | 94 | 98.8%
Ministral 3 14B | 100 | 100 | 100 | 100 | 94 | 98.8%
Ministral 3 8B | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 94 | 98.8%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral NeMO | 100 | 100 | 100 | 100 | 94 | 98.8%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral Small 4 | 100 | 100 | 100 | 100 | 94 | 98.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 93 | 98.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 93 | 98.6%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 99 | 94 | 98.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 91 | 98.1%
DeepSeek-V2 Chat | 100 | 100 | 100 | 96 | 94 | 98.0%
Mistral Large 3 | 100 | 100 | 100 | 96 | 94 | 97.9%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 90 | 97.9%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 94 | 94 | 97.6%
Claude Opus 4.6 | 100 | 100 | 100 | 94 | 94 | 97.6%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 94 | 94 | 97.6%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 88 | 97.6%
o4 Mini | 100 | 100 | 100 | 100 | 88 | 97.6%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 88 | 97.6%
Grok 4 Fast | 100 | 100 | 100 | 94 | 94 | 97.6%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 88 | 97.6%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 88 | 97.6%
Grok 4.20 (Beta) | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 88 | 97.6%
DeepSeek V3.2 | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 94 | 94 | 97.6%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 88 | 97.6%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 94 | 94 | 97.6%
o4 Mini High | 100 | 100 | 99 | 94 | 94 | 97.4%
Mistral Large | 100 | 100 | 100 | 94 | 91 | 97.0%
GPT-4.1 Nano | 100 | 100 | 100 | 97 | 88 | 96.9%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 94 | 94 | 94 | 96.4%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 94 | 88 | 96.4%
GPT-5.1 | 100 | 100 | 100 | 94 | 88 | 96.4%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 94 | 88 | 96.4%
ByteDance Seed 1.6 | 100 | 100 | 100 | 94 | 88 | 96.4%
GPT-5.2 | 100 | 100 | 94 | 94 | 94 | 96.4%
Aion 2.0 | 100 | 100 | 94 | 94 | 94 | 96.4%
Z.AI GLM 4.6 | 100 | 100 | 100 | 94 | 88 | 96.4%
Gemini 2.5 Pro | 100 | 100 | 100 | 94 | 88 | 96.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 94 | 94 | 94 | 96.4%
GPT-4.1 Mini | 100 | 100 | 100 | 94 | 88 | 96.4%
GPT-5 Nano | 100 | 100 | 100 | 94 | 88 | 96.4%
Ministral 8B | 100 | 100 | 94 | 94 | 94 | 96.4%
Writer: Palmyra X5 | 100 | 100 | 94 | 94 | 93 | 96.2%
Hermes 3 405B | 100 | 100 | 94 | 94 | 93 | 96.2%
Claude 3.5 Haiku | 100 | 100 | 100 | 95 | 85 | 96.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 99 | 81 | 95.9%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 94 | 85 | 95.9%
Gemma 3 4B | 100 | 100 | 96 | 94 | 88 | 95.6%
GPT-5 | 100 | 100 | 94 | 94 | 88 | 95.2%
MoonshotAI: Kimi K2.5 | 100 | 94 | 94 | 94 | 94 | 95.2%
Gemini 3 Pro (Preview) | 100 | 100 | 94 | 94 | 88 | 95.2%
MiniMax M2.5 | 100 | 100 | 94 | 94 | 88 | 95.2%
Gemini 3 Flash (Preview) | 100 | 100 | 94 | 94 | 88 | 95.2%
Z.AI GLM 4.7 Flash | 100 | 100 | 94 | 94 | 88 | 95.2%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 88 | 88 | 95.1%
GPT-5.4 Nano | 100 | 100 | 96 | 88 | 88 | 94.4%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 94 | 94 | 82 | 94.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 94 | 94 | 82 | 94.0%
Gemma 3 12B | 100 | 100 | 100 | 96 | 73 | 93.9%
Gemma 3 27B | 100 | 100 | 100 | 100 | 69 | 93.7%
DeepSeek V3.1 | 94 | 94 | 94 | 94 | 88 | 92.8%
Mistral Medium 3.1 | 100 | 100 | 94 | 88 | 81 | 92.6%
Mistral Small Creative | 100 | 94 | 94 | 92 | 82 | 92.4%
Llama 3.1 8B | 100 | 100 | 91 | 88 | 82 | 92.1%
Inception Mercury | 100 | 100 | 100 | 100 | 60 | 92.0%
Grok 4.1 Fast | 100 | 94 | 94 | 88 | 82 | 91.6%
Gemini 2.5 Flash | 100 | 96 | 94 | 84 | 82 | 91.1%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 94 | 82 | 62 | 87.5%
Gemini 3.1 Pro (Preview) | 94 | 88 | 88 | 70 | 70 | 82.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 94 | 82 | 64 | 60 | 80.0%

genre

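Model | #1 | #2 | #3 | #4 | #5 | Avg
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 99 | 99.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 97 | 99.4%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 96 | 99.3%
Ministral 3B | 100 | 100 | 100 | 100 | 96 | 99.2%
Mistral Large 3 | 100 | 100 | 100 | 100 | 95 | 99.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 94 | 98.8%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 94 | 98.8%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 94 | 98.8%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 94 | 98.8%
LFM2 24B | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 99 | 94 | 98.7%
Llama 3.1 70B | 100 | 100 | 100 | 99 | 93 | 98.5%
Mistral Large | 100 | 100 | 100 | 99 | 94 | 98.5%
Claude 3 Haiku | 100 | 100 | 100 | 98 | 94 | 98.4%
Ministral 3 3B | 100 | 100 | 100 | 97 | 94 | 98.3%
GPT-4.1 Mini | 100 | 100 | 100 | 97 | 94 | 98.3%
Mistral Small 4 | 100 | 100 | 100 | 97 | 94 | 98.3%
Hermes 3 70B | 100 | 100 | 100 | 97 | 94 | 98.1%
GPT-5 | 100 | 100 | 100 | 94 | 94 | 97.6%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 94 | 94 | 97.6%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 88 | 97.6%
Qwen 3.5 Flash | 100 | 100 | 100 | 94 | 94 | 97.6%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 88 | 97.6%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5 Nano | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 94 | 94 | 97.6%
Qwen 3 32B | 100 | 100 | 100 | 94 | 94 | 97.6%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 94 | 94 | 97.6%
Mistral NeMO | 100 | 100 | 100 | 100 | 88 | 97.6%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 94 | 93 | 97.5%
Z.AI GLM 4.7 | 100 | 100 | 98 | 94 | 93 | 97.1%
Rocinante 12B | 100 | 99 | 98 | 94 | 94 | 97.0%
Claude 3.5 Sonnet | 100 | 100 | 99 | 95 | 90 | 96.8%
Z.AI GLM 5 | 100 | 100 | 100 | 94 | 90 | 96.8%
Claude Opus 4 | 100 | 100 | 99 | 94 | 90 | 96.7%
Ministral 3 8B | 100 | 100 | 100 | 94 | 90 | 96.7%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 96 | 94 | 94 | 96.7%
Mistral Large 2 | 100 | 100 | 100 | 95 | 88 | 96.6%
Gemma 3 4B | 100 | 100 | 98 | 94 | 90 | 96.4%
Qwen 3.5 397B A17B | 100 | 100 | 94 | 94 | 94 | 96.4%
ByteDance Seed 1.6 | 100 | 100 | 94 | 94 | 94 | 96.4%
o4 Mini High | 100 | 100 | 100 | 94 | 88 | 96.4%
Qwen 2.5 72B | 100 | 100 | 100 | 94 | 88 | 96.4%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 94 | 88 | 96.4%
Arcee AI: Trinity Mini | 100 | 100 | 94 | 94 | 94 | 96.4%
Ministral 8B | 100 | 100 | 100 | 98 | 84 | 96.3%
Gemma 3 27B | 100 | 99 | 98 | 94 | 91 | 96.3%
Writer: Palmyra X5 | 100 | 98 | 96 | 94 | 92 | 96.0%
Claude Opus 4.5 | 100 | 97 | 97 | 94 | 93 | 96.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 94 | 86 | 95.9%
Claude Sonnet 4 | 100 | 98 | 97 | 96 | 89 | 95.9%
Grok 4.20 (Beta) | 100 | 99 | 97 | 95 | 88 | 95.7%
GPT-5.4 Nano (Reasoning, Low) | 100 | 99 | 97 | 94 | 88 | 95.5%
GPT-5.2 | 100 | 100 | 100 | 96 | 82 | 95.5%
Gemma 3 12B | 100 | 100 | 95 | 93 | 88 | 95.3%
GPT-5 Mini | 100 | 100 | 94 | 94 | 88 | 95.2%
Qwen 3.5 122B | 100 | 100 | 100 | 94 | 82 | 95.2%
Aion 2.0 | 100 | 94 | 94 | 94 | 94 | 95.2%
o4 Mini | 100 | 100 | 94 | 94 | 88 | 95.2%
Grok 4 Fast | 100 | 100 | 94 | 94 | 88 | 95.2%
Qwen 3.5 Plus (2026-02-15) | 100 | 94 | 94 | 94 | 94 | 95.2%
DeepSeek V3.2 | 100 | 100 | 94 | 94 | 88 | 95.2%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 96 | 94 | 91 | 91 | 94.3%
Mistral Small Creative | 100 | 100 | 94 | 93 | 84 | 94.3%
Claude Haiku 4.5 | 100 | 100 | 94 | 93 | 84 | 94.2%
MiniMax M2.7 | 100 | 100 | 94 | 90 | 87 | 94.1%
DeepSeek V3 (2024-12-26) | 100 | 99 | 93 | 93 | 86 | 94.1%
Stealth: Healer Alpha | 100 | 100 | 94 | 94 | 82 | 94.0%
Claude Opus 4.6 (Reasoning) | 100 | 94 | 94 | 94 | 87 | 93.9%
Ministral 3 14B | 98 | 98 | 95 | 95 | 83 | 93.8%
Claude 3.7 Sonnet | 100 | 99 | 97 | 86 | 86 | 93.6%
Z.AI GLM 5 Turbo | 100 | 94 | 93 | 92 | 89 | 93.5%
Grok 4.20 (Beta, Reasoning) | 98 | 94 | 92 | 92 | 88 | 92.8%
Qwen 3.5 27B | 100 | 94 | 94 | 88 | 88 | 92.8%
Gemini 3 Pro (Preview) | 100 | 100 | 94 | 88 | 82 | 92.8%
GPT-4o, May 13th (temp=1) | 100 | 100 | 98 | 94 | 71 | 92.6%
GPT-5.4 Nano | 100 | 100 | 99 | 94 | 69 | 92.4%
GPT-5.4 Mini (Reasoning) | 94 | 94 | 93 | 93 | 87 | 92.4%
GPT-4.1 | 100 | 94 | 93 | 93 | 82 | 92.4%
GPT-5.4 Nano (Reasoning) | 96 | 94 | 93 | 90 | 87 | 92.1%
Mistral Small 4 (Reasoning) | 94 | 94 | 94 | 91 | 86 | 91.7%
Gemini 3.1 Flash Lite (Preview) | 100 | 94 | 88 | 88 | 88 | 91.6%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 82 | 76 | 91.6%
Mistral Medium 3.1 | 100 | 100 | 94 | 88 | 76 | 91.6%
Z.AI GLM 4.5 | 100 | 92 | 90 | 88 | 86 | 91.3%
Claude Sonnet 4.5 | 100 | 93 | 93 | 87 | 81 | 91.0%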
DeepSeek-V2 Chat1009189888791.0%
GPT-5.4 (Reasoning)999795837890.4%
Stealth: Hunter Alpha100100100886490.4%
GPT-4o, May 13th (temp=0)949488888890.4%
Gemini 2.5 Pro949492888289.9%
GPT-5.4 Mini (Reasoning, Low)979188888489.4%
Llama 3.1 8B1009494797688.5%
Gemini 3 Flash (Preview, Reasoning)948888888288.0%
GPT-5.4959586847987.9%
GPT-4.1 Nano939289858087.8%
Claude 3.5 Haiku999188837787.6%
Gemini 2.5 Flash898888878787.6%
MiniMax M2.5979487837587.2%
GPT-5.1938887858287.2%
GPT-5.4 Mini969291847287.0%
Claude Sonnet 4.6949286807886.0%
Gemini 2.5 Flash Lite888887867885.3%
Claude Sonnet 4.6 (Reasoning)949184817484.9%
DeepSeek V3.1948888827084.4%
GPT-5.4 (Reasoning, Low)878585847983.8%
Gemini 2.5 Flash (Reasoning)948782706679.5%
Gemini 3.1 Pro (Preview)948681746279.5%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
MoonshotAI: Kimi K2.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
GPT-4o Mini (temp=0)10010010010010099.9%
Claude 3 Haiku1001001001009999.9%
GPT-4o, Aug. 6th (temp=1)1001001001009999.8%
Llama 3.1 70B100100100989899.2%
Rocinante 12B1001001001009498.9%
ByteDance Seed 1.61001001001009498.8%
Qwen 3.5 Flash1001001001009498.8%
ByteDance Seed 1.6 Flash1001001001009498.8%
Hermes 3 70B1001001001009498.8%
Mistral NeMO1001001001009498.8%
Mistral Large100100100989598.5%
Ministral 3 3B100100100999298.3%
Ministral 3B10010099979598.2%
ByteDance Seed 2.0 Lite100100100989198.0%
GPT-4o, May 13th (temp=0)10010098979497.8%
o4 Mini100100100949497.6%
Inception Mercury 2100100100949497.6%
GPT-4o, Aug. 6th (temp=0)1001001001008897.6%
Hermes 3 405B100100100998997.5%
ByteDance Seed 2.0 Mini1001001001008797.4%
Mistral Large 310010099988997.2%
MiniMax M2.51009997959597.2%
Gemini 3 Flash (Preview)10010098949497.1%
Claude Opus 4.6 (Reasoning)10010097959497.1%
GPT-5.2100100100949197.0%
LFM2 24B100100100929296.8%
Gemini 2.5 Flash Lite1009996969496.7%
GPT-4.11009997949496.7%
Ministral 3 14B10010098949296.7%
GPT-4o Mini (temp=1)10010097949296.5%
Grok 4.1 Fast100100100948896.4%
GPT-5 Mini100100100948896.4%
GPT-5100100100948896.4%
Qwen 3.5 397B A17B10010094949496.4%
o4 Mini High100100100948896.4%
Aion 2.0100100100948896.4%
Qwen 3.5 35B10010094949496.4%
Qwen 3.5 9B100100100948896.4%
Gemini 3.1 Flash Lite (Preview)100100100948896.4%
Qwen 3 32B10010094949496.4%
Grok 410010097929196.1%
Z.AI GLM 510010097978796.1%
Mistral Small 41009695949495.8%
Mistral Small Creative100100100908995.7%
Arcee AI: Trinity Large (Preview)10010094929295.5%
Llama 3.1 Nemotron 70B1009695949295.5%
GPT-4o, May 13th (temp=1)100100100948395.3%
Cohere Command R+ (Aug. 2024)10010094948895.2%
DeepSeek V3.210010094948895.2%
Mistral Small 3.2 24B1009494949495.2%
Qwen 3.5 122B1009494949495.1%
Arcee AI: Trinity Mini1009895948895.1%
Claude Haiku 4.510010093919094.8%
MiniMax M2.710010095938694.7%
Qwen 3.5 Plus (2026-02-15)10010095948294.2%
DeepSeek V3 (2025-03-24)10010093918794.1%
Writer: Palmyra X51009996898694.1%
Qwen 3.5 27B1009994888893.8%
Gemini 2.5 Flash Lite (Reasoning)1009595948493.6%
Inception Mercury100100100947393.5%
Ministral 3 8B1009593908993.4%
Claude Opus 4.61009692918893.2%
Z.AI GLM 5 Turbo1009292929093.2%
Gemma 3 4B1009493898993.1%
Ministral 8B1009593898993.1%
Claude Sonnet 4.6 (Reasoning)1009891888893.1%
Claude Opus 410010094917992.9%
Z.AI GLM 4.7949494948892.8%
Mistral Small 4 (Reasoning)979493928892.7%
GPT-5.4 Nano1009494938292.6%
Mistral Large 2989594908692.6%
Mistral Medium 3.11009491898792.3%
Claude Opus 4.5979494888792.0%
GPT-5.11009890868591.9%
Z.AI GLM 4.61009594888291.8%
Z.AI GLM 4.7 Flash10010094828291.7%
Gemini 3 Flash (Preview, Reasoning)10010094887691.6%
Gemini 3 Pro (Preview)1009494888291.6%
Claude Sonnet 4.51009988868591.6%
GPT-5.4 Nano (Reasoning, Low)989695848391.5%
GPT-5 Nano1009488888591.0%
Stealth: Hunter Alpha100100100767590.3%
Qwen3 235B A22B Instruct 2507979493907689.9%
Claude Sonnet 4939190898589.5%
GPT-4.1 Mini959290907989.4%
Z.AI GLM 4.5969493897489.4%
Gemini 2.5 Pro1009488828289.2%
GPT-5.4 Mini (Reasoning, Low)969389897588.5%
DeepSeek V3 (2024-12-26)959493867388.1%
GPT-5.4 Mini (Reasoning)949088868287.9%
Llama 3.1 8B1009594806987.5%
Grok 4.20 (Beta, Reasoning)1009289817687.3%
Claude Sonnet 4.6929088847986.7%
Claude 3.7 Sonnet949486827786.6%
GPT-5.4 (Reasoning, Low)1008686837986.6%
Gemini 2.5 Flash1009483817486.2%
GPT-5.4 Mini1009088757585.8%
Claude 3.5 Haiku949485837385.6%
DeepSeek-V2 Chat958989857085.5%
Gemma 3 12B978887837285.3%
Gemma 3 27B1008986846584.8%
Gemini 3.1 Pro (Preview)1009481767284.5%
Stealth: Healer Alpha948888827084.3%
DeepSeek V3.11008882817084.1%
GPT-5.4 Nano (Reasoning)998988756883.8%
Claude 3.5 Sonnet918988786983.2%
GPT-5.4918582817582.9%
Grok 4.20 (Beta)1009279766682.6%
GPT-4.1 Nano888786746279.4%
GPT-5.4 (Reasoning)888278767379.3%
Gemini 2.5 Flash (Reasoning)1008976725778.7%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Grok 4100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
Inception Mercury100100100100100100.0%
MoonshotAI: Kimi K2.51001001001009999.8%
Nemotron 3 Super1001001001009899.6%
Claude Opus 4.6 (Reasoning)1001001001009498.8%
Claude Sonnet 41001001001009498.8%
Qwen 3.5 Flash1001001001009498.8%
Grok 4 Fast1001001001009498.8%
Inception Mercury 21001001001009498.8%
Mistral Large 210010099989698.8%
Claude 3.5 Sonnet100100100999498.6%
Nemotron 3 Nano100100100989498.5%
Qwen 2.5 72B1001001001009298.4%
Qwen 3 32B1001001001009098.0%
ByteDance Seed 1.61001001001008997.9%
Mistral Large 3100100100989197.8%
Aion 2.0100100100959497.7%
Qwen 3.5 397B A17B100100100949497.6%
o4 Mini High100100100949497.6%
o4 Mini100100100949497.6%
ByteDance Seed 2.0 Mini100100100949497.6%
GPT-4.11001001001008897.5%
Claude Opus 4.6100100100959197.3%
GPT-5100100100998897.3%
DeepSeek V3 (2024-12-26)10010099949397.3%
Arcee AI: Trinity Mini100100100949297.2%
MiniMax M2.7100100100949297.1%
Llama 3.1 Nemotron 70B1009998949497.1%
ByteDance Seed 1.6 Flash10010097949497.0%
LFM2 24B10010099949296.9%
Ministral 3 14B10010098959196.8%
Claude Sonnet 4.5100100100929196.6%
Z.AI GLM 5 Turbo100100100988596.5%
Mistral Medium 3.110010096959296.5%
Mistral Large10010099949096.5%
Cohere Command R+ (Aug. 2024)1009896949496.4%
GPT-5 Mini10010094949496.4%
Grok 4.1 Fast100100100948896.4%
Qwen 3.5 9B100100100948896.4%
ByteDance Seed 2.0 Lite100100100948896.4%
GPT-4o Mini (temp=1)100100100948896.4%
DeepSeek-V2 Chat100100100909096.0%
Z.AI GLM 510010098948896.0%
Mistral Small 3.2 24B10010094949195.8%
GPT-5.4 Nano (Reasoning, Low)1009897929295.8%
DeepSeek V3 (2025-03-24)100100100898895.4%
Qwen 3.5 27B1009494949495.2%
Gemini 3 Pro (Preview)1009494949495.2%
Stealth: Healer Alpha10010094948895.2%
Gemini 3.1 Flash Lite (Preview)10010094948895.2%
Gemini 3 Flash (Preview)1009494949495.2%
GPT-4o, Aug. 6th (temp=0)1009494949495.2%
Mistral NeMO1009494949495.2%
Qwen 3.5 35B100100100948295.1%
MiniMax M2.51009894949095.0%
Ministral 3 3B1009998938594.8%
Claude Opus 410010097898894.8%
Llama 3.1 8B1009897978294.7%
Mistral Small 4 (Reasoning)1009997948294.4%
Ministral 8B10010099957794.3%
Claude Haiku 4.510010096878794.1%
Qwen 3.5 Plus (2026-02-15)1009494948894.0%
GPT-4o Mini (temp=0)949494949494.0%
Grok 4.20 (Beta, Reasoning)989594948893.9%
Z.AI GLM 4.710010093888893.8%
Z.AI GLM 4.51009592929193.8%
GPT-4o, May 13th (temp=0)1009494948693.6%
DeepSeek V3.210010094888693.6%
GPT-5.2999896948093.5%
Claude Opus 4.51009494918993.5%
Gemini 2.5 Pro10010094898593.4%
Ministral 3B10010090898893.3%
GPT-5.4999692918893.2%
Arcee AI: Trinity Large (Preview)1009992898593.1%
Gemma 3 27B1009794938192.9%
Claude 3 Haiku10010094898192.9%
Qwen3 235B A22B Instruct 25071009493898892.9%
Claude 3.7 Sonnet10010094888292.8%
Gemini 3 Flash (Preview, Reasoning)949494948892.8%
Z.AI GLM 4.7 Flash949494948892.8%
Hermes 3 70B1009894888492.8%
WizardLM 2 8x22b10010094947592.7%
GPT-4o, May 13th (temp=1)10010094927792.6%
DeepSeek V3.11009994888292.6%
Ministral 3 8B1009795947692.4%
GPT-5 Nano10010094927692.3%
GPT-5.4 Nano10010095907692.3%
Writer: Palmyra X51009693918192.3%
Stealth: Hunter Alpha949494918892.2%
Mistral Small 4999592918492.2%
GPT-5.4 (Reasoning, Low)969492908992.1%
Gemini 2.5 Flash Lite (Reasoning)1009794887991.9%
GPT-5.4 Nano (Reasoning)10010092887891.7%
Qwen 3.5 122B949494888891.6%
GPT-4o, Aug. 6th (temp=1)100100100797891.5%
GPT-4.1 Mini949493908691.5%
Gemma 3 12B989491888791.5%
GPT-5.4 Mini (Reasoning)989793868491.4%
Z.AI GLM 4.6949492888891.2%
Grok 4.20 (Beta)999492888191.0%
Mistral Small Creative1009488878490.8%
Claude Sonnet 4.6969592868490.6%
Hermes 3 405B10010094926690.5%
Rocinante 12B10010088877790.3%
Claude Sonnet 4.6 (Reasoning)1009991877590.3%
GPT-5.11009391877990.1%
Gemini 2.5 Flash (Reasoning)1009789838190.0%
Claude 3.5 Haiku969489868489.9%
Gemma 3 4B999692827989.5%
GPT-5.4 (Reasoning)989290877488.3%
Gemini 2.5 Flash959384848287.6%
Gemini 3.1 Pro (Preview)999491846787.1%
Gemini 2.5 Flash Lite1009893786586.7%
Llama 3.1 70B1009794883783.1%
GPT-5.4 Mini888482807581.8%
GPT-5.4 Mini (Reasoning, Low)888777756878.9%
GPT-4.1 Nano977575676675.9%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Qwen 3.5 122B100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
o4 Mini High100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
Mistral Large100100100100100100.0%
Inception Mercury100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
LFM2 24B100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)10010010010010099.9%
Z.AI GLM 51001001001009999.8%
Ministral 3B1001001001009899.6%
DeepSeek V3 (2024-12-26)1001001001009699.2%
Qwen 3.5 397B A17B1001001001009498.8%
Qwen 3.5 27B1001001001009498.8%
Grok 4.1 Fast1001001001009498.8%
Z.AI GLM 4.61001001001009498.8%
Grok 41001001001009498.8%
Qwen 3.5 Flash1001001001009498.8%
Grok 4 Fast1001001001009498.8%
Stealth: Healer Alpha1001001001009498.8%
ByteDance Seed 2.0 Lite1001001001009498.8%
Stealth: Aurora Alpha1001001001009498.8%
Hermes 3 405B1001001001009498.8%
DeepSeek V3.11001001001009498.8%
DeepSeek V3 (2025-03-24)1001001001009498.8%
Llama 3.1 70B1001001001009498.8%
Nemotron 3 Nano1001001001009498.8%
Arcee AI: Trinity Mini1001001001009498.8%
Rocinante 12B1001001001009498.8%
MiniMax M2.7100100100999498.7%
Qwen3 235B A22B Instruct 250710010099999598.6%
Z.AI GLM 4.51001001001009398.6%
GPT-5.4 Nano (Reasoning, Low)100100100999498.5%
Grok 4.20 (Beta, Reasoning)100100100989498.5%
Arcee AI: Trinity Large (Preview)100100100989498.4%
GPT-5100100100989498.3%
GPT-5.4 Mini (Reasoning)100100100989398.3%
Mistral Large 2100100100989498.3%
Gemma 3 27B1001001001009198.3%
GPT-5.410010099979498.2%
Z.AI GLM 5 Turbo100100100989398.2%
Writer: Palmyra X5100100100989298.0%
Claude 3.5 Haiku1001001001009098.0%
Mistral Small 4100100100969397.8%
Claude Sonnet 4.5100100100949497.7%
ByteDance Seed 1.6100100100949497.6%
Gemini 3 Flash (Preview, Reasoning)100100100949497.6%
Claude Opus 4.5100100100949497.6%
o4 Mini100100100949497.6%
Qwen 3.5 35B100100100949497.6%
GPT-5 Nano100100100949497.6%
Hermes 3 70B100100100949497.6%
Mistral NeMO100100100949497.6%
Cohere Command R+ (Aug. 2024)1001001001008797.4%
Llama 3.1 Nemotron 70B10010099949497.4%
DeepSeek-V2 Chat100100100949397.3%
Claude Sonnet 4.610010099998897.2%
Qwen 3.5 Plus (2026-02-15)100100100988897.2%
Ministral 3 3B100100100949197.0%
GPT-5.4 Mini (Reasoning, Low)10010098949296.9%
Llama 3.1 8B10010097949396.9%
Gemini 2.5 Flash10010099949296.9%
Claude 3.7 Sonnet100100100939296.8%
Gemini 2.5 Flash Lite100100100949096.8%
Claude Opus 41009796959496.5%
GPT-5 Mini10010094949496.4%
Gemini 3 Pro (Preview)10010094949496.4%
GPT-4.110010094949496.4%
Gemini 2.5 Pro100100100948896.4%
Gemini 3.1 Flash Lite (Preview)10010094949496.4%
Nemotron 3 Super100100100948896.4%
DeepSeek V3.2100100100948896.4%
Mistral Small Creative10010094949496.4%
GPT-4o Mini (temp=1)1009994949496.3%
GPT-5.4 Mini100100100938896.2%
Qwen 3 32B100100100948595.8%
Mistral Medium 3.1100100100908895.5%
GPT-5.4 Nano100100100898895.3%
Aion 2.0100100100888895.2%
Z.AI GLM 4.7 Flash1009494949495.2%
GPT-5.210010094948895.1%
Ministral 3 14B989896948995.0%
Claude Haiku 4.51009994908894.3%
Stealth: Hunter Alpha10010094948294.0%
Claude Opus 4.61009494948894.0%
Z.AI GLM 4.710010094888894.0%
GPT-4o, Aug. 6th (temp=1)10010093918593.9%
Claude 3.5 Sonnet100100100868493.9%
GPT-5.4 Nano (Reasoning)1009494919093.6%
GPT-5.4 (Reasoning)1009694908893.6%
Claude Opus 4.6 (Reasoning)979494948893.5%
Mistral Small 4 (Reasoning)1009494909093.5%
Ministral 8B1009693898893.2%
Claude Sonnet 41009494948292.9%
Gemini 3 Flash (Preview)1009494888892.8%
Claude Sonnet 4.6 (Reasoning)1009694878692.5%
Claude 3 Haiku10010097946992.1%
GPT-5.1989491888891.8%
Grok 4.20 (Beta)1009694878291.6%
Gemini 2.5 Flash (Reasoning)979491888791.2%
Ministral 3 8B1009493897990.9%
Gemini 3.1 Pro (Preview)989491888090.1%
GPT-5.4 (Reasoning, Low)1009288888290.0%
Gemma 3 12B1009794797889.7%
GPT-4o, May 13th (temp=1)10010087847689.3%
Gemini 2.5 Flash Lite (Reasoning)1009488838189.1%
GPT-4.1 Mini979088878088.5%
ByteDance Seed 2.0 Mini1009488887088.0%
Gemma 3 4B959489827186.1%
GPT-4.1 Nano948584817583.8%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Grok 4.1 Fast100100100100100100.0%
Grok 4100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
GPT-4o Mini (temp=0)10010010010010099.9%
LFM2 24B1001001001009999.8%
ByteDance Seed 2.0 Lite1001001001009899.6%
Qwen 3 32B1001001001009598.9%
Qwen 3.5 122B1001001001009498.8%
ByteDance Seed 1.61001001001009498.8%
o4 Mini1001001001009498.8%
Qwen 3.5 35B1001001001009498.8%
Grok 4 Fast1001001001009498.8%
Llama 3.1 70B1001001001009498.8%
MoonshotAI: Kimi K2.51001001001009498.8%
Mistral Medium 3.11001001001009498.7%
GPT-5.210010099999698.7%
Ministral 8B1001001001009398.6%
Mistral Large 2100100100979698.5%
Qwen 3.5 Plus (2026-02-15)1009999989498.1%
Ministral 3B100100100979398.0%
Z.AI GLM 510010097979497.7%
Hermes 3 70B100100100989097.7%
Qwen 3.5 397B A17B100100100949497.6%
Inception Mercury 2100100100949497.6%
Qwen 2.5 72B100100100949497.6%
ByteDance Seed 1.6 Flash100100100949497.6%
Claude Opus 4100100100969197.5%
o4 Mini High100100100998897.4%
Mistral NeMO10010099949497.4%
MiniMax M2.71009796969597.0%
Z.AI GLM 5 Turbo1009896959596.9%
GPT-4.110010096959496.9%
Claude Opus 4.6 (Reasoning)10010099988796.8%
Claude Opus 4.61001001001008496.8%
DeepSeek-V2 Chat10010096949496.7%
Gemini 3 Pro (Preview)100100100948896.4%
Qwen 3.5 Flash100100100948896.4%
GPT-4o, Aug. 6th (temp=0)10010094949496.4%
WizardLM 2 8x22b10010094949496.4%
Ministral 3 8B100100100998396.3%
Z.AI GLM 4.710010094949396.2%
Qwen 3.5 27B10010094949396.2%
DeepSeek V3 (2025-03-24)1001001001008095.9%
GPT-510010097948895.8%
GPT-4o, May 13th (temp=1)10010095949195.8%
Mistral Large10010098918995.6%
Qwen 3.5 9B100100100888895.2%
GPT-5 Nano1009896948995.2%
Claude Sonnet 4.61009795919194.9%
GPT-5 Mini10010094928894.8%
Claude Haiku 4.5100100100898594.8%
Mistral Small Creative10010096908894.6%
Qwen3 235B A22B Instruct 250710010094908894.5%
Llama 3.1 Nemotron 70B1009694948894.5%
Mistral Large 310010098948094.4%
GPT-4o Mini (temp=1)100100100987394.2%
Aion 2.010010094948294.0%
Z.AI GLM 4.610010094888894.0%
GPT-4o, May 13th (temp=0)949494949494.0%
Z.AI GLM 4.7 Flash10010094888894.0%
Hermes 3 405B1009794928593.8%
Gemini 2.5 Pro10010094918393.7%
Rocinante 12B10010092898693.5%
Mistral Small 41009694908593.0%
GPT-5.4 Nano (Reasoning, Low)10010093898493.0%
MiniMax M2.510010092878592.9%
DeepSeek V3.21009494898892.9%
Mistral Small 3.2 24B1009994937792.7%
Claude 3.5 Haiku1009994888292.6%
Stealth: Hunter Alpha10010094888192.5%
Ministral 3 14B989494928492.5%
Gemini 3 Flash (Preview)1009494888692.4%
Writer: Palmyra X51009995937592.4%
Claude Sonnet 4.51009693888492.2%
Claude Sonnet 41009890888592.2%
GPT-4o, Aug. 6th (temp=1)10010094917592.0%
Stealth: Healer Alpha1009494888291.5%
GPT-5.4 Nano1009290898691.4%
Claude 3.5 Sonnet10010091897691.2%
Mistral Small 4 (Reasoning)969390908791.2%
Ministral 3 3B999490908391.1%
GPT-5.4 Nano (Reasoning)949191908891.0%
Gemma 3 12B1009888848390.6%
GPT-5.1969594917690.4%
Grok 4.20 (Beta, Reasoning)959292918390.4%
Claude Opus 4.510010094857390.3%
Arcee AI: Trinity Large (Preview)979290878590.2%
Gemini 2.5 Flash1009490857989.7%
Grok 4.20 (Beta)959594877789.7%
DeepSeek V3 (2024-12-26)10010087817989.3%
Arcee AI: Trinity Mini959492887789.1%
GPT-4.1 Mini999287838288.6%
Z.AI GLM 4.5949389887888.5%
Inception Mercury10010094886088.4%
Gemini 3 Flash (Preview, Reasoning)949488828288.0%
DeepSeek V3.1949388828287.7%
Claude 3.7 Sonnet939286867987.1%
Claude Sonnet 4.6 (Reasoning)939387847786.9%
Gemini 3.1 Flash Lite (Preview)1008888887086.8%
GPT-5.4908987858286.7%
GPT-5.4 (Reasoning)929185858086.6%
Claude 3 Haiku949489797786.5%
Gemini 2.5 Flash Lite (Reasoning)1009786737385.8%
Gemini 3.1 Pro (Preview)939390777685.7%
GPT-5.4 Mini (Reasoning)908986838085.6%
Gemini 2.5 Flash Lite949486847085.6%
GPT-5.4 Mini898989818085.5%
GPT-5.4 Mini (Reasoning, Low)908887847484.6%
Gemini 2.5 Flash (Reasoning)918684817182.7%
Gemma 3 27B928886786782.2%
GPT-5.4 (Reasoning, Low)878584797481.7%
Gemma 3 4B939188814078.6%
GPT-4.1 Nano898981755978.5%
Llama 3.1 8B909074575473.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6100100100100100100.0%
o4 Mini High100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Mistral Large 3100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Inception Mercury 2100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Hermes 3 70B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Ministral 8B100100100100100100.0%
Claude Opus 410010010010010099.9%
DeepSeek V3 (2025-03-24)1001001001009999.9%
Claude 3.5 Haiku1001001001009799.3%
WizardLM 2 8x22b1001001001009699.1%
Z.AI GLM 5 Turbo1001001001009498.8%
GPT-5.11001001001009498.8%
Qwen 3.5 397B A17B1001001001009498.8%
Qwen 3.5 122B1001001001009498.8%
GPT-5.4 Mini (Reasoning)1001001001009498.8%
Gemini 3 Flash (Preview, Reasoning)1001001001009498.8%
MiniMax M2.51001001001009498.8%
Qwen 3.5 Flash1001001001009498.8%
Qwen 3.5 9B1001001001009498.8%
GPT-4.1 Mini1001001001009498.8%
Qwen 2.5 72B1001001001009498.8%
Llama 3.1 Nemotron 70B1001001001009498.8%
Arcee AI: Trinity Large (Preview)1001001001009498.8%
Cohere Command R+ (Aug. 2024)1001001001009498.8%
Ministral 3B1001001001009498.8%
LFM2 24B1001001001009498.8%
Rocinante 12B1001001001009498.8%
Ministral 3 8B1001001001009398.7%
Gemma 3 12B100100100989498.5%
GPT-4o, May 13th (temp=1)100100100989498.4%
Ministral 3 14B100100100979498.1%
GPT-5.4 Nano (Reasoning, Low)100100100969498.1%
Grok 4.20 (Beta)1009999989497.9%
Writer: Palmyra X5100100100949497.6%
GPT-5 Mini100100100949497.6%
GPT-5100100100949497.6%
Z.AI GLM 5100100100949497.6%
ByteDance Seed 1.6100100100949497.6%
Aion 2.0100100100949497.6%
Gemini 2.5 Pro100100100949497.6%
o4 Mini100100100949497.6%
Grok 4100100100949497.6%
Claude Sonnet 4.5100100100949497.6%
Stealth: Hunter Alpha100100100949497.6%
Z.AI GLM 4.5100100100949497.6%
Qwen 3.5 Plus (2026-02-15)1001001001008897.6%
GPT-4o, May 13th (temp=0)100100100949497.6%
GPT-5.4100100100949497.6%
GPT-4o, Aug. 6th (temp=0)100100100949497.6%
Qwen 3 32B100100100949497.6%
Mistral Large100100100949497.6%
Qwen3 235B A22B Instruct 2507100100100949497.6%
Llama 3.1 70B1001001001008897.6%
Mistral Small Creative100100100949497.6%
Claude 3 Haiku100100100949497.6%
Claude 3.7 Sonnet100100100949497.5%
Mistral Small 4 (Reasoning)10010099949497.5%
Claude Sonnet 4.6 (Reasoning)10010099949296.9%
Gemma 3 4B1009999968996.5%
Claude Haiku 4.510010099939196.5%
GPT-4.1 Nano10010096949296.4%
Claude Sonnet 4.610010096949296.4%
Grok 4.20 (Beta, Reasoning)10010094949496.4%
Grok 4.1 Fast10010094949496.4%
MiniMax M2.710010094949496.4%
ByteDance Seed 2.0 Lite100100100948896.4%
GPT-4o, Aug. 6th (temp=1)100100100948896.4%
Mistral Large 2100100100948896.4%
DeepSeek V3.1100100100948896.4%
Mistral Small 3.2 24B10010094949496.4%
Mistral Medium 3.1100100100948896.4%
Nemotron 3 Nano10010094949496.4%
Gemini 2.5 Flash Lite100100100948896.3%
GPT-4.110010094949296.0%
GPT-5.4 Nano (Reasoning)1009996948895.5%
Gemini 2.5 Flash989694949495.3%
Gemini 2.5 Flash (Reasoning)10010094948895.3%
Claude Opus 4.6 (Reasoning)100100100948295.2%
GPT-5.4 (Reasoning)10010094948895.2%
MoonshotAI: Kimi K2.510010094948895.2%
Qwen 3.5 27B10010094948895.2%
Z.AI GLM 4.61009494949495.2%
Z.AI GLM 4.710010094948895.2%
ByteDance Seed 2.0 Mini10010094948895.2%
Grok 4 Fast100100100888895.2%
Gemini 3.1 Flash Lite (Preview)10010094948895.2%
Nemotron 3 Super10010094948895.2%
DeepSeek V3.21001001001007695.2%
Arcee AI: Trinity Mini10010094948895.2%
GPT-5.4 Nano10010094929095.2%
Mistral NeMO10010098898894.8%
GPT-5.4 Mini (Reasoning, Low)1009994928894.6%
Gemma 3 27B1009694948894.4%
Gemini 2.5 Flash Lite (Reasoning)10010094908894.4%
GPT-5.4 (Reasoning, Low)10010096948194.2%
Gemini 3 Pro (Preview)1009494948894.0%
Stealth: Healer Alpha10010094888894.0%
GPT-5.210010094948293.9%
Mistral Small 4100100100828292.8%
Inception Mercury1001001001006092.0%
Gemini 3.1 Pro (Preview)959494888290.5%
GPT-5 Nano1008888888890.4%
Llama 3.1 8B1009688868290.3%

Novelcrafter Default Prompt

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Qwen 3.5 27B100100100100100100.0%
o4 Mini High100100100100100100.0%
Grok 4100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Inception Mercury 2100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Mistral NeMO100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B1001001001009999.8%
Llama 3.1 70B1001001001009699.3%
GPT-4o Mini (temp=1)1001001001009699.1%
Ministral 3 8B1001001001009599.1%
Z.AI GLM 5 Turbo1001001001009498.8%
Qwen 3.5 122B1001001001009498.8%
Grok 4.20 (Beta, Reasoning)1001001001009498.8%
ByteDance Seed 1.61001001001009498.8%
Grok 4.1 Fast1001001001009498.8%
GPT-4.11001001001009498.8%
Claude 3.5 Haiku1001001001009498.8%
GPT-5.4 Mini1001001001009498.8%
Mistral Large1001001001009498.8%
Llama 3.1 Nemotron 70B1001001001009498.8%
Ministral 3 3B1001001001009498.8%
Claude 3 Haiku1001001001009498.8%
Mistral Small Creative1001001001009298.5%
Claude Opus 4100100100979598.5%
Claude Sonnet 4100100100989498.4%
Mistral Large 3100100100989498.4%
Writer: Palmyra X5100100100989398.3%
Mistral Small 4100100100999398.3%
Mistral Medium 3.11001001001009098.1%
GPT-5.4 Nano (Reasoning)10010098979497.8%
GPT-5100100100949497.6%
Z.AI GLM 4.7100100100949497.6%
Qwen 3.5 35B100100100949497.6%
ByteDance Seed 2.0 Mini1001001001008897.6%
Qwen 3.5 Flash100100100949497.6%
GPT-5.4 Mini (Reasoning, Low)100100100949497.6%
Z.AI GLM 4.7 Flash100100100949497.6%
Nemotron 3 Super1001001001008897.6%
GPT-5.4 Nano (Reasoning, Low)1001001001008897.6%
WizardLM 2 8x22b100100100949497.6%
Ministral 3B100100100949497.6%
Mistral Small 3.2 24B1001001001008897.5%
Arcee AI: Trinity Large (Preview)1001001001008797.5%
Gemma 3 27B100100100998897.4%
GPT-5.110010099949497.3%
DeepSeek V3 (2025-03-24)10010098949497.2%
Qwen3 235B A22B Instruct 2507100100100939297.1%
Claude Sonnet 4.510010099949397.1%
MiniMax M2.7100100100978897.0%
Mistral Large 2100100100978897.0%
Gemma 3 4B10010096949496.9%
Gemini 2.5 Flash Lite100100100939096.7%
GPT-5 Mini10010094949496.4%
GPT-5.4 (Reasoning, Low)10010094949496.4%
MoonshotAI: Kimi K2.5100100100948896.4%
GPT-5.210010094949496.4%
o4 Mini10010094949496.4%
Qwen 3.5 Plus (2026-02-15)100100100948896.4%
Gemini 3.1 Flash Lite (Preview)10010094949496.4%
GPT-4o, May 13th (temp=1)10010094949496.4%
Hermes 3 405B100100100948896.4%
Qwen 3 32B10010094949496.4%
Qwen 2.5 72B100100100948896.4%
GPT-5.4 Nano10010094949496.4%
GPT-4o, Aug. 6th (temp=1)10010094949496.4%
Gemma 3 12B10010094949496.2%
MiniMax M2.510010099919196.2%
Claude Sonnet 4.6 (Reasoning)10010094949195.7%
GPT-4.1 Mini10010095948795.3%
Llama 3.1 8B1009995948895.3%
Qwen 3.5 397B A17B10010094948895.2%
GPT-5.4 Mini (Reasoning)10010094948895.2%
Z.AI GLM 4.610010094948895.2%
Stealth: Hunter Alpha10010094948895.2%
ByteDance Seed 2.0 Lite100100100888895.2%
DeepSeek V3.21009494949495.2%
GPT-5.410010094948895.1%
Z.AI GLM 4.510010094938894.9%
DeepSeek V3.11009894948894.9%
DeepSeek V3 (2024-12-26)1009994928994.8%
Claude Haiku 4.510010096918594.5%
Mistral Small 4 (Reasoning)100100100997294.2%
DeepSeek-V2 Chat10010097918394.0%
Ministral 8B1009494929094.0%
Aion 2.01009494948894.0%
Grok 4.20 (Beta)1009994918594.0%
GPT-5 Nano10010094888793.8%
GPT-5.4 (Reasoning)969494948893.3%
Ministral 3 14B1009995898393.0%
Claude Opus 4.6 (Reasoning)1009494948292.8%
Gemini 3 Pro (Preview)949494948892.8%
Gemini 2.5 Pro1009494948292.8%
GPT-4o, May 13th (temp=0)1009494888892.8%
Claude Sonnet 4.610010094858492.5%
Claude Opus 4.510010094888192.5%
Claude 3.7 Sonnet999692898592.3%
Inception Mercury100100100946591.8%
Hermes 3 70B949493898891.7%
Gemini 2.5 Flash1009494888291.6%
Z.AI GLM 5959491898891.5%
GPT-4.1 Nano959494898190.6%
Gemini 3 Flash (Preview)10010094827690.4%
Gemini 3 Flash (Preview, Reasoning)1009494887690.4%
Arcee AI: Trinity Mini1009488888290.4%
Claude Opus 4.6949392888490.3%
Stealth: Healer Alpha1008888888289.2%
Gemini 2.5 Flash Lite (Reasoning)1009288877989.1%
Gemini 2.5 Flash (Reasoning)948882828285.6%
Gemini 3.1 Pro (Preview)948276767680.8%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Z.AI GLM 5100100100100100100.0%
o4 Mini100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Grok 4 Fast100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
Inception Mercury100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
GPT-5.210010010010010099.9%
ByteDance Seed 1.6 Flash1001001001009999.9%
GPT-5.41001001001009899.6%
GPT-5.4 (Reasoning, Low)10010099999899.2%
GPT-5 Mini1001001001009498.8%
GPT-51001001001009498.8%
Qwen 3.5 27B1001001001009498.8%
Grok 4.1 Fast1001001001009498.8%
Aion 2.01001001001009498.8%
Gemini 3 Pro (Preview)1001001001009498.8%
Qwen 3.5 9B1001001001009498.8%
Inception Mercury 21001001001009498.8%
Stealth: Aurora Alpha1001001001009498.8%
Qwen 3 32B1001001001009498.8%
Hermes 3 70B1001001001009498.8%
Mistral NeMO100100100999498.7%
Llama 3.1 70B100100100999498.6%
Z.AI GLM 4.7 Flash1001001001009298.5%
Mistral Medium 3.1100100100989498.3%
Hermes 3 405B100100100989498.3%
GPT-4.110010099989398.1%
Z.AI GLM 4.510010098989498.0%
GPT-4o, Aug. 6th (temp=1)1001001001008997.9%
GPT-4o, May 13th (temp=0)100100100949497.6%
GPT-5.1100100100949497.6%
Claude Opus 4.6100100100949497.6%
MoonshotAI: Kimi K2.5100100100949497.6%
ByteDance Seed 1.6100100100949497.6%
o4 Mini High100100100949497.6%
Qwen 3.5 Flash100100100949497.6%
Grok 4.20 (Beta, Reasoning)10010098969497.5%
Rocinante 12B100100100988897.3%
Gemini 2.5 Flash Lite100100100959297.3%
GPT-5.4 (Reasoning)100100100949397.3%
Cohere Command R+ (Aug. 2024)1001001001008697.2%
Ministral 3 14B10010099949296.9%
Grok 4100100100949096.9%
MiniMax M2.510010097949396.8%
MiniMax M2.7100100100958996.8%
Claude Opus 4.6 (Reasoning)10010098949196.6%
GPT-5.4 Nano10010096939396.5%
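Stealth: Healer Alpha | 100 | 100 | 94 | 94 | 94 | 96.4%
Qwen 3.5 122B | 100 | 100 | 94 | 94 | 94 | 96.4%
Stealth: Hunter Alpha | 100 | 100 | 100 | 94 | 88 | 96.4%
Gemini 3 Flash (Preview) | 100 | 100 | 94 | 94 | 94 | 96.4%
Nemotron 3 Nano | 100 | 100 | 94 | 94 | 94 | 96.4%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 97 | 94 | 91 | 96.3%
DeepSeek V3 (2025-03-24) | 100 | 99 | 97 | 94 | 91 | 96.2%
Z.AI GLM 5 Turbo | 99 | 98 | 96 | 94 | 94 | 96.2%
Qwen 2.5 72B | 100 | 100 | 99 | 94 | 88 | 96.1%
Claude 3.5 Sonnet | 100 | 100 | 96 | 94 | 91 | 96.1%
LFM2 24B | 100 | 100 | 100 | 92 | 88 | 96.1%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 92 | 88 | 96.1%
GPT-5.4 Mini (Reasoning, Low) | 100 | 99 | 97 | 97 | 86 | 95.9%
Gemma 3 4B | 100 | 100 | 96 | 93 | 88 | 95.5%
GPT-5.4 Nano (Reasoning) | 100 | 99 | 94 | 93 | 91 | 95.3%
Claude Sonnet 4.5 | 99 | 97 | 96 | 94 | 90 | 95.3%
Qwen 3.5 397B A17B | 100 | 94 | 94 | 94 | 94 | 95.2%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 94 | 82 | 95.2%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 99 | 76 | 95.1%
GPT-4o Mini (temp=1) | 100 | 100 | 95 | 94 | 86 | 95.0%
Ministral 3B | 100 | 100 | 95 | 92 | 88 | 95.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 94 | 94 | 87 | 94.9%
GPT-5 Nano | 100 | 100 | 100 | 94 | 81 | 94.9%
Claude Opus 4.5 | 100 | 98 | 94 | 94 | 88 | 94.8%
Ministral 3 3B | 100 | 97 | 97 | 91 | 88 | 94.8%
Arcee AI: Trinity Mini | 99 | 94 | 94 | 94 | 93 | 94.7%
Claude Sonnet 4.6 (Reasoning) | 99 | 99 | 98 | 92 | 85 | 94.6%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 87 | 85 | 94.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 94 | 94 | 85 | 94.5%
DeepSeek-V2 Chat | 100 | 99 | 96 | 89 | 88 | 94.3%
GPT-5.4 Mini | 100 | 97 | 94 | 92 | 88 | 94.2%
Z.AI GLM 4.7 | 100 | 94 | 94 | 94 | 88 | 94.0%
Gemini 2.5 Pro | 100 | 94 | 94 | 94 | 88 | 94.0%
GPT-4.1 Mini | 100 | 94 | 93 | 92 | 91 | 94.0%
Claude Haiku 4.5 | 100 | 100 | 93 | 89 | 88 | 93.8%
Gemma 3 12B | 100 | 100 | 97 | 92 | 79 | 93.7%
Writer: Palmyra X5 | 100 | 100 | 92 | 90 | 85 | 93.5%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 94 | 94 | 79 | 93.4%
Mistral Large | 100 | 98 | 94 | 88 | 84 | 92.8%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 82 | 82 | 92.8%
Nemotron 3 Super | 100 | 94 | 94 | 88 | 88 | 92.8%
Claude Sonnet 4.6 | 100 | 93 | 93 | 90 | 88 | 92.7%
Mistral Small 4 (Reasoning) | 100 | 94 | 93 | 89 | 86 | 92.6%
DeepSeek V3.1 | 100 | 100 | 88 | 88 | 87 | 92.4%
Mistral Small Creative | 100 | 94 | 94 | 93 | 81 | 92.3%
DeepSeek V3 (2024-12-26) | 100 | 94 | 94 | 93 | 80 | 92.2%
Mistral Small 4 | 100 | 97 | 94 | 88 | 83 | 92.2%
Z.AI GLM 4.6 | 100 | 97 | 88 | 88 | 87 | 92.0%
Claude 3.5 Haiku | 100 | 100 | 94 | 87 | 78 | 91.7%
Gemini 3.1 Flash Lite (Preview) | 100 | 94 | 94 | 88 | 82 | 91.6%
Claude 3.7 Sonnet | 98 | 95 | 91 | 86 | 86 | 91.1%
WizardLM 2 8x22b | 100 | 94 | 94 | 88 | 79 | 91.0%
Claude Sonnet 4 | 100 | 96 | 94 | 88 | 76 | 90.7%
DeepSeek V3.2 | 94 | 94 | 94 | 88 | 82 | 90.4%
Claude Opus 4 | 100 | 94 | 92 | 87 | 78 | 90.2%
Grok 4.20 (Beta) | 100 | 94 | 88 | 88 | 78 | 89.7%
GPT-4.1 Nano | 95 | 93 | 89 | 87 | 82 | 89.5%
Qwen3 235B A22B Instruct 2507 | 100 | 96 | 92 | 78 | 74 | 88.2%
Gemini 3.1 Pro (Preview) | 94 | 88 | 88 | 88 | 82 | 88.0%
Ministral 8B | 97 | 88 | 85 | 85 | 84 | 87.9%
Ministral 3 8B | 100 | 95 | 89 | 85 | 65 | 86.8%
Mistral Large 2 | 94 | 94 | 90 | 79 | 77 | 86.8%
Llama 3.1 8B | 94 | 94 | 85 | 82 | 76 | 86.1%
Gemini 2.5 Flash | 100 | 88 | 88 | 83 | 69 | 85.6%
Mistral Large 3 | 94 | 88 | 87 | 81 | 76 | 85.3%
Gemini 2.5 Flash (Reasoning) | 100 | 94 | 86 | 75 | 71 | 85.2%
Gemma 3 27B | 100 | 93 | 92 | 81 | 59 | 84.9%
ByteDance Seed 2.0 Mini | 99 | 88 | 75 | 70 | 70 | 80.3%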

Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 99 | 99.9%
MiniMax M2.7 | 100 | 100 | 100 | 98 | 98 | 99.2%
LFM2 24B | 100 | 100 | 100 | 98 | 98 | 99.2%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 96 | 99.1%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 95 | 98.9%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5 Mini | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 94 | 98.8%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.2 | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 94 | 98.8%
Grok 4 Fast | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 94 | 98.8%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 94 | 98.8%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 | 100 | 100 | 100 | 100 | 94 | 98.8%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 94 | 98.8%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 94 | 98.8%
Inception Mercury | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 94 | 98.8%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral Small Creative | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5 Nano | 100 | 100 | 100 | 99 | 94 | 98.7%
Stealth: Healer Alpha | 100 | 100 | 100 | 99 | 94 | 98.6%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 93 | 98.6%
Rocinante 12B | 100 | 100 | 100 | 99 | 94 | 98.6%
Mistral Large | 100 | 100 | 100 | 99 | 94 | 98.6%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 99 | 94 | 98.5%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 98 | 94 | 98.5%
Mistral Large 2 | 100 | 100 | 99 | 98 | 95 | 98.5%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 93 | 98.5%
GPT-4.1 | 100 | 100 | 100 | 98 | 94 | 98.4%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 92 | 98.4%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 97 | 94 | 98.3%
Hermes 3 70B | 100 | 100 | 100 | 96 | 95 | 98.1%
Gemma 3 27B | 100 | 100 | 99 | 95 | 94 | 97.7%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 94 | 94 | 97.6%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 88 | 97.6%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 88 | 97.6%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 94 | 94 | 97.6%
DeepSeek V3.2 | 100 | 100 | 100 | 94 | 94 | 97.6%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 94 | 94 | 97.6%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 94 | 94 | 97.6%
Claude Haiku 4.5 | 100 | 100 | 100 | 97 | 91 | 97.5%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 99 | 88 | 97.4%
Arcee AI: Trinity Mini | 100 | 100 | 98 | 96 | 91 | 97.1%
Mistral Large 3 | 100 | 100 | 100 | 94 | 91 | 97.1%
GPT-4o, May 13th (temp=1) | 100 | 100 | 97 | 94 | 94 | 97.0%
Gemma 3 12B | 100 | 100 | 99 | 94 | 91 | 97.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 96 | 88 | 96.9%
Ministral 3 8B | 100 | 100 | 98 | 97 | 89 | 96.9%
Claude 3 Haiku | 100 | 100 | 100 | 96 | 88 | 96.8%
Ministral 3 3B | 100 | 100 | 100 | 94 | 89 | 96.5%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 98 | 85 | 96.5%
Mistral Small 4 | 100 | 100 | 98 | 94 | 91 | 96.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 94 | 88 | 96.5%
Z.AI GLM 5 | 100 | 100 | 99 | 96 | 87 | 96.4%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 94 | 94 | 94 | 96.4%
Stealth: Hunter Alpha | 100 | 100 | 100 | 94 | 88 | 96.4%
Z.AI GLM 4.7 Flash | 100 | 100 | 94 | 94 | 94 | 96.4%
WizardLM 2 8x22b | 100 | 100 | 94 | 94 | 94 | 96.4%
GPT-5.4 Nano (Reasoning, Low) | 100 | 97 | 97 | 94 | 94 | 96.3%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 81 | 96.2%
GPT-5.4 Mini (Reasoning, Low) | 99 | 98 | 96 | 94 | 94 | 96.1%
GPT-5.4 Nano | 100 | 98 | 97 | 93 | 93 | 96.1%
Z.AI GLM 4.6 | 100 | 98 | 94 | 94 | 94 | 96.0%
Claude Opus 4 | 100 | 100 | 94 | 94 | 91 | 95.8%
Qwen 3 32B | 100 | 100 | 100 | 100 | 79 | 95.8%
Mistral NeMO | 100 | 97 | 94 | 94 | 94 | 95.8%
Z.AI GLM 4.5 | 100 | 100 | 100 | 90 | 88 | 95.6%
Mistral Medium 3.1 | 100 | 100 | 100 | 90 | 89 | 95.6%
Writer: Palmyra X5 | 100 | 97 | 95 | 94 | 90 | 95.2%
Qwen 3.5 27B | 100 | 100 | 94 | 94 | 88 | 95.2%
Qwen 3.5 35B | 100 | 100 | 100 | 94 | 82 | 95.2%
ByteDance Seed 2.0 Mini | 100 | 100 | 94 | 94 | 88 | 95.2%
Qwen 3.5 Plus (2026-02-15) | 100 | 94 | 94 | 94 | 94 | 95.2%
Mistral Small 3.2 24B | 100 | 94 | 94 | 94 | 94 | 95.2%
Claude Opus 4.5 | 100 | 100 | 98 | 94 | 84 | 95.1%
GPT-4o Mini (temp=0) | 100 | 94 | 94 | 94 | 93 | 95.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 96 | 90 | 88 | 94.8%
Claude Sonnet 4 | 100 | 100 | 96 | 91 | 86 | 94.6%
Hermes 3 405B | 100 | 98 | 94 | 91 | 90 | 94.6%
MiniMax M2.5 | 100 | 99 | 99 | 88 | 86 | 94.3%
Gemini 3.1 Pro (Preview) | 100 | 94 | 94 | 94 | 88 | 94.0%
Z.AI GLM 4.7 | 100 | 94 | 94 | 94 | 88 | 94.0%
Aion 2.0 | 100 | 100 | 94 | 93 | 82 | 93.8%
Ministral 8B | 100 | 100 | 91 | 89 | 89 | 93.8%
Llama 3.1 70B | 100 | 94 | 94 | 92 | 88 | 93.6%
Ministral 3B | 100 | 100 | 100 | 87 | 80 | 93.5%
Mistral Small 4 (Reasoning) | 100 | 100 | 98 | 94 | 73 | 92.9%
Gemini 2.5 Flash (Reasoning) | 100 | 99 | 94 | 86 | 83 | 92.5%
Gemini 2.5 Flash | 100 | 94 | 94 | 90 | 80 | 91.6%
Claude 3.7 Sonnet | 100 | 99 | 95 | 84 | 78 | 91.2%
Claude Sonnet 4.6 | 95 | 93 | 90 | 89 | 88 | 90.9%
Gemma 3 4B | 100 | 94 | 93 | 86 | 79 | 90.7%
Gemini 2.5 Flash Lite (Reasoning) | 94 | 92 | 90 | 90 | 86 | 90.3%
Ministral 3 14B | 100 | 100 | 97 | 86 | 65 | 89.5%
GPT-4.1 Mini | 100 | 90 | 90 | 86 | 77 | 88.7%
Gemini 2.5 Flash Lite | 100 | 98 | 94 | 78 | 69 | 87.7%
GPT-4.1 Nano | 91 | 89 | 80 | 78 | 70 | 81.6%

Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 99 | 99.8%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 99 | 99.8%
Ministral 3 3B | 100 | 100 | 100 | 100 | 99 | 99.8%
Ministral 3 14B | 100 | 100 | 100 | 100 | 97 | 99.4%
Hermes 3 405B | 100 | 100 | 100 | 100 | 96 | 99.3%
Ministral 8B | 100 | 100 | 100 | 99 | 96 | 98.9%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 94 | 98.9%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5 Mini | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5 | 100 | 100 | 100 | 100 | 94 | 98.8%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.8%
o4 Mini High | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4.1 | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude Opus 4 | 100 | 100 | 100 | 100 | 94 | 98.8%
Grok 4 Fast | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral Large 3 | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 94 | 98.8%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5 Nano | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 94 | 98.8%
o4 Mini | 100 | 100 | 100 | 99 | 94 | 98.7%
Qwen 2.5 72B | 100 | 100 | 100 | 99 | 94 | 98.6%
Mistral Large | 100 | 100 | 100 | 100 | 93 | 98.6%
Mistral NeMO | 100 | 100 | 100 | 100 | 93 | 98.6%
Ministral 3 8B | 100 | 100 | 100 | 97 | 94 | 98.2%
Claude Sonnet 4 | 100 | 100 | 98 | 98 | 94 | 98.1%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 96 | 95 | 98.1%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 90 | 98.1%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 90 | 98.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 99 | 97 | 94 | 98.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 89 | 97.9%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5.2 | 100 | 100 | 100 | 94 | 94 | 97.6%
Claude Opus 4.5 | 100 | 100 | 100 | 94 | 94 | 97.6%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 88 | 97.6%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 94 | 94 | 97.6%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 94 | 94 | 97.6%
Inception Mercury 2 | 100 | 100 | 100 | 94 | 94 | 97.6%
Mistral Large 2 | 100 | 100 | 100 | 94 | 94 | 97.6%
DeepSeek V3.1 | 100 | 100 | 100 | 94 | 94 | 97.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 94 | 94 | 97.6%
Claude 3.7 Sonnet | 100 | 100 | 100 | 95 | 93 | 97.6%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 94 | 94 | 97.5%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 99 | 88 | 97.4%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 98 | 95 | 94 | 97.4%
Claude Haiku 4.5 | 100 | 100 | 100 | 94 | 93 | 97.4%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 87 | 97.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 98 | 88 | 97.2%
GPT-4o, May 13th (temp=1) | 100 | 100 | 98 | 94 | 94 | 97.2%
Hermes 3 70B | 100 | 100 | 100 | 94 | 91 | 97.1%
Llama 3.1 70B | 100 | 100 | 100 | 94 | 90 | 96.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 94 | 89 | 96.6%
GPT-5.1 | 100 | 100 | 100 | 94 | 88 | 96.4%
Claude Opus 4.6 | 100 | 100 | 94 | 94 | 94 | 96.4%
Aion 2.0 | 100 | 100 | 94 | 94 | 94 | 96.4%
Z.AI GLM 4.7 | 100 | 100 | 100 | 94 | 88 | 96.4%
ByteDance Seed 2.0 Mini | 100 | 100 | 94 | 94 | 94 | 96.4%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 94 | 88 | 96.4%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 94 | 88 | 96.4%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 94 | 88 | 96.4%
GPT-5.4 | 100 | 100 | 100 | 94 | 88 | 96.4%
GPT-4.1 Mini | 100 | 100 | 95 | 94 | 93 | 96.4%
GPT-5.4 Nano | 100 | 100 | 94 | 94 | 93 | 96.1%
DeepSeek V3 (2024-12-26) | 100 | 99 | 94 | 94 | 94 | 96.1%
Grok 4.20 (Beta) | 100 | 99 | 94 | 94 | 94 | 96.1%
Gemini 2.5 Flash Lite | 100 | 100 | 97 | 94 | 89 | 95.9%
Mistral Small Creative | 100 | 100 | 100 | 93 | 86 | 95.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 99 | 94 | 86 | 95.8%
Claude 3 Haiku | 100 | 100 | 96 | 94 | 88 | 95.7%
DeepSeek V3 (2025-03-24) | 100 | 100 | 94 | 93 | 91 | 95.6%
Gemma 3 27B | 100 | 100 | 95 | 95 | 88 | 95.6%
Arcee AI: Trinity Mini | 100 | 98 | 94 | 94 | 90 | 95.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 88 | 88 | 95.2%
Grok 4.1 Fast | 100 | 100 | 94 | 94 | 88 | 95.2%
Gemini 3 Pro (Preview) | 100 | 100 | 94 | 94 | 88 | 95.2%
Mistral Medium 3.1 | 100 | 100 | 94 | 93 | 89 | 95.1%
DeepSeek-V2 Chat | 99 | 96 | 94 | 94 | 88 | 94.2%
Gemini 3.1 Pro (Preview) | 100 | 94 | 94 | 94 | 88 | 94.0%
DeepSeek V3.2 | 100 | 100 | 94 | 88 | 88 | 94.0%
Mistral Small 4 | 100 | 100 | 92 | 91 | 86 | 93.6%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 94 | 91 | 82 | 93.4%
ByteDance Seed 1.6 | 100 | 100 | 100 | 88 | 76 | 92.8%
Gemini 2.5 Flash | 100 | 94 | 91 | 91 | 88 | 92.7%
GPT-4.1 Nano | 100 | 97 | 90 | 88 | 86 | 92.0%
Gemma 3 4B | 97 | 95 | 94 | 87 | 82 | 91.2%
Gemma 3 12B | 100 | 100 | 94 | 79 | 76 | 89.8%
Mistral Small 3.2 24B | 100 | 100 | 88 | 88 | 59 | 87.0%

Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 99 | 99.9%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 99 | 99.8%
Mistral Large 2 | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 99 | 99.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 98 | 99.5%
Claude Sonnet 4 | 100 | 100 | 100 | 99 | 98 | 99.5%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 97 | 99.3%
Claude Haiku 4.5 | 100 | 100 | 100 | 99 | 98 | 99.3%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 96 | 99.3%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 95 | 99.1%
Hermes 3 70B | 100 | 100 | 100 | 98 | 96 | 98.9%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 94 | 98.8%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 94 | 98.8%
o4 Mini | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 94 | 98.8%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 94 | 98.8%
Grok 4 Fast | 100 | 100 | 100 | 100 | 94 | 98.8%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 94 | 98.8%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.2 | 100 | 100 | 100 | 99 | 95 | 98.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 93 | 98.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 92 | 98.5%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 92 | 98.3%
GPT-5.4 Nano (Reasoning) | 100 | 98 | 98 | 98 | 96 | 98.1%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 99 | 92 | 98.1%
MiniMax M2.5 | 100 | 100 | 98 | 97 | 95 | 98.0%
Claude Opus 4.6 | 100 | 100 | 100 | 95 | 94 | 97.9%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 88 | 97.7%
Mistral Small Creative | 100 | 100 | 100 | 97 | 91 | 97.7%
Mistral Large | 100 | 100 | 99 | 95 | 94 | 97.7%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 94 | 94 | 97.6%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 94 | 94 | 97.6%
Aion 2.0 | 100 | 100 | 100 | 94 | 94 | 97.6%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 88 | 97.6%
Stealth: Aurora Alpha | 100 | 100 | 100 | 94 | 94 | 97.6%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 88 | 97.6%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 94 | 94 | 97.5%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 94 | 93 | 97.4%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 99 | 95 | 91 | 97.1%
GPT-5.4 (Reasoning) | 100 | 100 | 97 | 94 | 94 | 97.0%
GPT-5.4 | 100 | 99 | 97 | 94 | 94 | 96.7%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 98 | 95 | 90 | 96.5%
Ministral 3 14B | 100 | 100 | 99 | 94 | 89 | 96.5%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 94 | 88 | 96.4%
Z.AI GLM 4.7 | 100 | 100 | 100 | 94 | 88 | 96.4%
Mistral NeMO | 100 | 100 | 94 | 94 | 94 | 96.4%
Grok 4.20 (Beta) | 100 | 100 | 94 | 94 | 94 | 96.4%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 98 | 92 | 91 | 96.2%
Ministral 3 8B | 100 | 100 | 94 | 93 | 93 | 96.0%
Claude Opus 4 | 100 | 100 | 95 | 94 | 91 | 96.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 94 | 85 | 95.7%
Hermes 3 405B | 100 | 100 | 100 | 94 | 85 | 95.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 78 | 95.7%
Llama 3.1 70B | 100 | 100 | 94 | 94 | 90 | 95.6%
Gemini 2.5 Flash | 100 | 100 | 94 | 92 | 92 | 95.6%
GPT-5.4 Mini | 100 | 98 | 96 | 95 | 88 | 95.5%
Claude 3.5 Haiku | 100 | 100 | 96 | 96 | 85 | 95.5%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 97 | 94 | 85 | 95.3%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 94 | 94 | 88 | 95.2%
Z.AI GLM 4.7 Flash | 100 | 100 | 94 | 94 | 88 | 95.2%
Llama 3.1 Nemotron 70B | 100 | 94 | 94 | 94 | 93 | 95.1%
Mistral Small 4 | 100 | 98 | 97 | 94 | 86 | 95.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 93 | 92 | 89 | 94.9%
GPT-5.4 Nano (Reasoning, Low) | 99 | 98 | 98 | 94 | 85 | 94.8%
Mistral Small 3.2 24B | 95 | 95 | 94 | 94 | 94 | 94.5%
MiniMax M2.7 | 100 | 100 | 98 | 98 | 76 | 94.5%
DeepSeek-V2 Chat | 100 | 100 | 94 | 93 | 85 | 94.4%
GPT-5 Nano | 100 | 100 | 94 | 92 | 86 | 94.3%
Mistral Medium 3.1 | 100 | 100 | 97 | 91 | 83 | 94.1%
Gemini 3.1 Pro (Preview) | 100 | 94 | 94 | 94 | 88 | 94.0%
GPT-4.1 Mini | 100 | 100 | 99 | 86 | 84 | 93.8%
Claude 3 Haiku | 100 | 93 | 93 | 92 | 91 | 93.7%
Claude Sonnet 4.6 | 100 | 100 | 96 | 93 | 79 | 93.6%
Ministral 3 3B | 100 | 96 | 95 | 94 | 83 | 93.5%
Claude Opus 4.5 | 100 | 97 | 94 | 90 | 86 | 93.5%
Z.AI GLM 4.6 | 100 | 100 | 94 | 91 | 82 | 93.4%
Ministral 8B | 100 | 94 | 94 | 93 | 85 | 93.2%
Ministral 3B | 100 | 100 | 100 | 88 | 77 | 93.1%
LFM2 24B | 97 | 97 | 94 | 93 | 85 | 93.0%
Claude 3.7 Sonnet | 100 | 95 | 93 | 91 | 87 | 93.0%
DeepSeek V3.1 | 98 | 94 | 94 | 92 | 87 | 93.0%
Gemini 2.5 Flash Lite | 100 | 94 | 94 | 88 | 88 | 92.7%
GPT-4.1 | 100 | 94 | 94 | 94 | 80 | 92.5%
Gemma 3 12B | 98 | 97 | 95 | 94 | 77 | 92.4%
Claude Sonnet 4.6 (Reasoning) | 94 | 93 | 93 | 91 | 90 | 92.2%
Stealth: Healer Alpha | 100 | 92 | 89 | 88 | 88 | 91.4%
Writer: Palmyra X5 | 100 | 100 | 89 | 88 | 79 | 91.2%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 94 | 90 | 87 | 81 | 90.3%
Z.AI GLM 4.5 | 94 | 94 | 88 | 88 | 86 | 90.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 94 | 84 | 72 | 90.0%
Gemma 3 27B | 98 | 94 | 94 | 90 | 73 | 89.8%
Gemini 2.5 Flash (Reasoning) | 94 | 91 | 91 | 88 | 85 | 89.8%
Rocinante 12B | 98 | 98 | 89 | 89 | 71 | 89.0%
Qwen 3 32B | 100 | 100 | 90 | 86 | 62 | 87.7%
Llama 3.1 8B | 100 | 100 | 93 | 86 | 40 | 83.8%
Gemma 3 4B | 100 | 99 | 76 | 72 | 68 | 83.1%
GPT-4.1 Nano | 97 | 79 | 78 | 77 | 75 | 81.2%

Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 99 | 99.8%
Mistral Small 4 | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 98 | 99.6%
Hermes 3 70B | 100 | 100 | 100 | 100 | 97 | 99.5%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 95 | 99.1%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 94 | 98.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 94 | 98.8%
o4 Mini High | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.2 | 100 | 100 | 100 | 100 | 94 | 98.8%
Aion 2.0 | 100 | 100 | 100 | 100 | 94 | 98.8%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 94 | 98.8%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 94 | 98.8%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 94 | 98.8%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4.1 | 100 | 100 | 100 | 100 | 94 | 98.8%
Grok 4 | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude Opus 4 | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral Large 3 | 100 | 100 | 100 | 100 | 94 | 98.8%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 94 | 98.8%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5 Nano | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral Large 2 | 100 | 100 | 100 | 100 | 94 | 98.8%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 94 | 98.8%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral Large | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 94 | 98.8%
Gemma 3 27B | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 94 | 98.8%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 94 | 98.8%
Mistral NeMO | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 94 | 98.8%
Gemma 3 12B | 100 | 100 | 100 | 100 | 94 | 98.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 93 | 98.5%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 92 | 98.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 90 | 97.9%
Claude Sonnet 4.6 | 100 | 100 | 100 | 98 | 91 | 97.9%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 94 | 94 | 97.6%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5 Mini | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5.1 | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 94 | 94 | 97.6%
Z.AI GLM 5 | 100 | 100 | 100 | 94 | 94 | 97.6%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 88 | 97.6%
Claude Sonnet 4 | 100 | 100 | 100 | 94 | 94 | 97.6%
Gemini 2.5 Pro | 100 | 100 | 100 | 94 | 94 | 97.6%
Claude Sonnet 4.5 | 100 | 100 | 100 | 94 | 94 | 97.6%
Stealth: Hunter Alpha | 100 | 100 | 100 | 94 | 94 | 97.6%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 94 | 94 | 97.6%
Z.AI GLM 4.5 | 100 | 100 | 100 | 94 | 94 | 97.6%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 94 | 94 | 97.6%
Claude 3.5 Sonnet | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 94 | 94 | 97.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 94 | 94 | 97.6%
GPT-5.4 Mini | 100 | 100 | 100 | 94 | 94 | 97.6%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 94 | 94 | 97.6%
Writer: Palmyra X5 | 100 | 100 | 100 | 94 | 94 | 97.6%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 94 | 94 | 97.6%
LFM2 24B | 100 | 100 | 100 | 100 | 88 | 97.6%
Rocinante 12B | 100 | 100 | 100 | 100 | 88 | 97.6%
Ministral 3 14B | 100 | 100 | 100 | 100 | 86 | 97.2%
Claude 3.5 Haiku | 100 | 100 | 100 | 98 | 86 | 96.9%
Llama 3.1 8B | 100 | 100 | 96 | 94 | 93 | 96.5%
GPT-4.1 Nano | 100 | 100 | 99 | 97 | 86 | 96.5%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 94 | 88 | 96.4%
o4 Mini | 100 | 100 | 100 | 94 | 88 | 96.4%
Grok 4 Fast | 100 | 100 | 94 | 94 | 94 | 96.4%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 94 | 88 | 96.4%
Nemotron 3 Super | 100 | 100 | 100 | 94 | 88 | 96.4%
GPT-5.4 | 100 | 100 | 94 | 94 | 94 | 96.4%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 94 | 88 | 96.4%
Gemini 2.5 Flash | 100 | 100 | 94 | 94 | 94 | 96.4%
Nemotron 3 Nano | 100 | 100 | 100 | 94 | 88 | 96.4%
Mistral Small Creative | 100 | 100 | 94 | 94 | 94 | 96.4%
WizardLM 2 8x22b | 100 | 100 | 100 | 94 | 88 | 96.4%
Mistral Medium 3.1 | 100 | 100 | 94 | 94 | 92 | 96.0%
Claude Opus 4.5 | 100 | 100 | 98 | 94 | 88 | 96.0%
GPT-5.4 (Reasoning) | 100 | 100 | 94 | 94 | 88 | 95.2%
ByteDance Seed 1.6 | 100 | 100 | 100 | 88 | 88 | 95.2%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 94 | 82 | 95.2%
Grok 4.20 (Beta) | 100 | 100 | 94 | 94 | 88 | 95.2%
DeepSeek V3.2 | 100 | 94 | 94 | 94 | 94 | 95.2%
GPT-5.4 Nano (Reasoning) | 100 | 94 | 94 | 94 | 94 | 95.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 97 | 94 | 85 | 95.2%
GPT-5 | 94 | 94 | 94 | 94 | 94 | 94.0%
Qwen 3.5 122B | 100 | 100 | 94 | 88 | 88 | 94.0%
Grok 4.1 Fast | 100 | 100 | 94 | 88 | 88 | 94.0%
ByteDance Seed 2.0 Mini | 100 | 94 | 94 | 94 | 88 | 94.0%
Stealth: Healer Alpha | 100 | 100 | 94 | 88 | 88 | 94.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 94 | 94 | 94 | 88 | 94.0%
Qwen 3.5 Flash | 100 | 100 | 94 | 94 | 76 | 92.8%
Gemini 3 Flash (Preview) | 100 | 94 | 94 | 88 | 88 | 92.8%
ByteDance Seed 2.0 Lite | 100 | 100 | 94 | 88 | 82 | 92.8%
Inception Mercury | 100 | 100 | 100 | 100 | 60 | 92.0%
Gemini 3.1 Pro (Preview) | 100 | 94 | 94 | 88 | 82 | 91.6%