Filter word density

Test: Bad Writing Habits

Avg. Score: 86.5%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Mistral Small Creative | 99.8% | $0.0007 | 9.1s | 97%
2 | Ministral 3 14B | 99.8% | $0.0007 | 11.7s | 97%
3 | Mistral Small 4 | 99.8% | $0.0014 | 18.2s | 98%
4 | Writer: Palmyra X5 | 100.0% | $0.011 | 22.0s | 99%
5 | Mistral Medium 3.1 | 99.6% | $0.0048 | 36.5s | 97%
6 | Mistral Large 3 | 98.8% | $0.0033 | 30.3s | 92%
7 | Mistral Small 4 (Reasoning) | 98.9% | $0.0022 | 30.2s | 91%
8 | Qwen3 235B A22B Instruct 2507 | 99.8% | $0.0011 | 59.2s | 96%
9 | o4 Mini | 99.1% | $0.015 | 25.7s | 91%
10 | GPT-5.4 Nano (Reasoning, Low) | 97.5% | $0.0055 | 20.6s | 88%
11 | GPT-5.4 Mini | 97.8% | $0.015 | 16.8s | 88%
12 | Mistral Large 2 | 98.7% | $0.013 | 29.4s | 90%
13 | GPT-5.4 Mini (Reasoning, Low) | 98.3% | $0.015 | 16.8s | 87%
14 | Mistral Large | 98.6% | $0.014 | 30.9s | 90%
15 | DeepSeek V3 (2025-03-24) | 98.2% | $0.0014 | 39.4s | 86%
16 | Ministral 3 8B | 97.2% | $0.0008 | 19.6s | 81%
17 | GPT-5.4 Nano (Reasoning) | 97.1% | $0.0061 | 24.5s | 85%
18 | GPT-5.4 Nano | 97.3% | $0.0057 | 26.3s | 84%
19 | Qwen 3.5 9B | 99.5% | $0.0011 | 1.4m | 95%
20 | ByteDance Seed 1.6 Flash | 97.1% | $0.0013 | 27.3s | 81%
21 | o4 Mini High | 99.2% | $0.025 | 47.2s | 94%
22 | GPT-4o Mini (temp=1) | 95.9% | $0.0012 | 34.8s | 82%
23 | Stealth: Hunter Alpha | 97.4% | $0.0000 | 55.0s | 85%
24 | Grok 4.20 (Beta) | 96.0% | $0.018 | 15.8s | 83%
25 | GPT-4.1 | 98.2% | $0.018 | 44.7s | 88%
26 | Ministral 8B | 96.1% | $0.0004 | 10.4s | 69%
27 | GPT-4.1 Mini | 95.5% | $0.0027 | 19.0s | 73%
28 | GPT-5.4 Mini (Reasoning) | 96.9% | $0.022 | 28.1s | 83%
29 | Z.AI GLM 5 Turbo | 95.1% | $0.0081 | 33.2s | 80%
30 | Qwen 3 32B | 96.5% | $0.0015 | 54.6s | 79%
31 | GPT-5 Mini | 95.6% | $0.0100 | 57.4s | 84%
32 | Grok 4 Fast | 94.9% | $0.0017 | 24.1s | 68%
33 | Ministral 3B | 93.1% | $0.0001 | 8.1s | 64%
34 | Z.AI GLM 4.7 Flash | 96.1% | $0.0017 | 1.2m | 80%
35 | Qwen 3.5 Flash | 95.0% | $0.0025 | 47.5s | 73%
36 | Z.AI GLM 4.7 | 97.0% | $0.010 | 1.4m | 85%
37 | Grok 4.1 Fast | 96.6% | $0.0018 | 37.8s | 67%
38 | Claude Sonnet 4.5 | 96.4% | $0.035 | 38.1s | 83%
39 | Qwen 3.5 122B | 96.3% | $0.025 | 1.1m | 85%
40 | Claude Sonnet 4.6 | 96.6% | $0.031 | 39.3s | 79%
41 | GPT-5.4 | 99.8% | $0.049 | 1.4m | 97%
42 | Grok 4.20 (Beta, Reasoning) | 94.4% | $0.039 | 34.0s | 82%
43 | LFM2 24B | 92.2% | $0.0002 | 28.4s | 65%
44 | GPT-5.4 (Reasoning, Low) | 99.7% | $0.055 | 1.4m | 97%
45 | Gemma 3 27B | 93.1% | $0.0006 | 52.6s | 70%
46 | Ministral 3 3B | 92.0% | $0.0005 | 11.1s | 59%
47 | GPT-4o, Aug. 6th (temp=1) | 92.3% | $0.018 | 24.4s | 70%
48 | Aion 2.0 | 93.8% | $0.0064 | 1.3m | 77%
49 | Stealth: Healer Alpha | 90.7% | $0.0000 | 23.7s | 63%
50 | GPT-4o Mini (temp=0) | 89.2% | $0.0012 | 34.8s | 65%
51 | GPT-4.1 Nano | 89.6% | $0.0007 | 13.3s | 57%
52 | GPT-5.1 | 99.7% | $0.054 | 1.8m | 96%
53 | Gemini 3 Pro (Preview) | 97.1% | $0.055 | 54.4s | 84%
54 | GPT-5.2 | 98.8% | $0.056 | 1.5m | 93%
55 | DeepSeek-V2 Chat | 92.2% | $0.0021 | 53.3s | 63%
56 | Gemini 2.5 Pro | 93.3% | $0.036 | 36.2s | 72%
57 | Qwen 3.5 35B | 93.9% | $0.018 | 1.0m | 70%
58 | DeepSeek V3 (2024-12-26) | 91.6% | $0.0021 | 54.6s | 63%
59 | Gemini 2.5 Flash (Reasoning) | 87.6% | $0.011 | 21.5s | 60%
60 | Z.AI GLM 5 | 90.7% | $0.0084 | 1.2m | 66%
61 | Claude Opus 4.6 | 98.2% | $0.078 | 1.2m | 88%
62 | Mistral NeMO | 86.4% | $0.0005 | 10.1s | 48%
63 | Claude Sonnet 4.6 (Reasoning) | 97.5% | $0.060 | 1.2m | 79%
64 | Qwen 3.5 Plus (2026-02-15) | 85.2% | $0.0060 | 31.5s | 58%
65 | Rocinante 12B | 88.9% | $0.0014 | 38.4s | 52%
66 | DeepSeek V3.2 | 91.9% | $0.0014 | 1.9m | 69%
67 | Qwen 3.5 27B | 94.3% | $0.020 | 1.6m | 70%
68 | Claude Haiku 4.5 | 86.4% | $0.011 | 21.6s | 53%
69 | Gemma 3 12B | 85.3% | $0.0004 | 41.3s | 55%
70 | Hermes 3 405B | 90.1% | $0.0032 | 53.2s | 52%
71 | Gemini 2.5 Flash | 83.7% | $0.0052 | 10.6s | 46%
72 | Claude Opus 4.6 (Reasoning) | 97.7% | $0.088 | 1.4m | 85%
73 | Gemini 3 Flash (Preview) | 81.3% | $0.0078 | 19.6s | 48%
74 | Arcee AI: Trinity Mini | 80.8% | $0.0003 | 9.2s | 42%
75 | Gemma 3 4B | 79.9% | $0.0002 | 20.0s | 46%
76 | MiniMax M2.5 | 86.4% | $0.0034 | 1.3m | 54%
77 | Z.AI GLM 4.6 | 84.1% | $0.0065 | 51.5s | 51%
78 | Claude 3 Haiku | 81.4% | $0.0025 | 14.9s | 41%
79 | DeepSeek V3.1 | 87.8% | $0.0020 | 1.8m | 60%
80 | MiniMax M2.7 | 84.4% | $0.0040 | 1.1m | 52%
81 | GPT-5.4 (Reasoning) | 99.8% | $0.089 | 2.6m | 98%
82 | Grok 4 | 95.0% | $0.048 | 1.7m | 68%
83 | GPT-4o, May 13th (temp=1) | 84.3% | $0.033 | 14.4s | 47%
84 | GPT-5 | 98.3% | $0.065 | 2.8m | 90%
85 | Gemini 2.5 Flash Lite | 75.4% | $0.0009 | 9.5s | 39%
86 | Qwen 3.5 397B A17B | 92.6% | $0.014 | 3.0m | 70%
87 | Gemini 3.1 Pro (Preview) | 97.7% | $0.107 | 1.8m | 83%
88 | Cohere Command R+ (Aug. 2024) | 81.8% | $0.020 | 52.5s | 38%
89 | Gemini 3.1 Flash Lite (Preview) | 70.4% | $0.0030 | 8.4s | 32%
90 | GPT-4o, May 13th (temp=0) | 79.2% | $0.035 | 14.1s | 37%
91 | Gemini 3 Flash (Preview, Reasoning) | 74.2% | $0.012 | 30.1s | 37%
92 | GPT-4o, Aug. 6th (temp=0) | 77.8% | $0.023 | 22.7s | 34%
93 | Claude Opus 4.5 | 85.4% | $0.070 | 53.4s | 55%
94 | MoonshotAI: Kimi K2.5 | 91.0% | $0.019 | 3.2m | 63%
95 | ByteDance Seed 2.0 Lite | 83.2% | $0.012 | 2.2m | 46%
96 | Arcee AI: Trinity Large (Preview) | 73.3% | $0.0000 | 43.6s | 25%
97 | Claude 3.5 Haiku | 67.2% | $0.0035 | 10.8s | 23%
98 | Hermes 3 70B | 73.8% | $0.0010 | 1.2m | 31%
99 | Z.AI GLM 4.5 | 70.5% | $0.0051 | 42.1s | 26%
100 | WizardLM 2 8x22b | 79.3% | $0.0026 | 1.8m | 31%
101 | Qwen 2.5 72B | 66.2% | $0.0010 | 36.7s | 24%
102 | Claude 3.7 Sonnet | 73.0% | $0.042 | 46.7s | 34%
103 | Gemini 2.5 Flash Lite (Reasoning) | 62.8% | $0.0028 | 30.8s | 21%
104 | Claude 3.5 Sonnet | 71.2% | $0.048 | 35.5s | 27%
105 | ByteDance Seed 1.6 | 74.2% | $0.013 | 2.5m | 33%
106 | GPT-5 Nano | 61.0% | $0.0042 | 1.4m | 25%
107 | Claude Sonnet 4 | 64.0% | $0.032 | 43.7s | 18%
108 | Claude Opus 4 | 93.7% | $0.209 | 1.4m | 72%
109 | Inception Mercury 2 | 40.8% | $0.0032 | 7.0s | 10%
110 | Llama 3.1 70B | 44.9% | $0.0015 | 29.4s | 10%
111 | Nemotron 3 Super | 51.5% | $0.0000 | 1.4m | 15%
112 | Inception Mercury | 47.2% | $0.011 | 17.6s | 2%
113 | Stealth: Aurora Alpha | 33.9% | $0.0000 | 9.8s | 4%
114 | ByteDance Seed 2.0 Mini | 75.4% | $0.0045 | 4.9m | 36%
115 | Llama 3.1 Nemotron 70B | 37.5% | $0.0038 | 31.7s | 6%
116 | Llama 3.1 8B | 40.0% | $0.0003 | 1.3m | 9%
117 | Mistral Small 3.2 24B | 77.6% | $0.0068 | 5.6m | 26%
118 | Nemotron 3 Nano | 25.6% | $0.0010 | 1.1m | 2%
Avg. Score: 86.53%

Individual Scenarios

Detailed Writing Rules

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
GPT-5.4100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
GPT-5.4 Nano (Reasoning)1001001001009999.7%
Qwen 3 32B1001001001009899.6%
Ministral 3 8B1001001001009799.4%
Qwen 3.5 35B1001001001009599.0%
GPT-4o Mini (temp=1)1001001001009198.2%
LFM2 24B100100100989297.9%
WizardLM 2 8x22b100100100999097.8%
GPT-4o, Aug. 6th (temp=1)1001001001008997.7%
Aion 2.01001001001008496.8%
Ministral 8B100100100948896.5%
DeepSeek V3 (2024-12-26)1001001001008296.5%
MoonshotAI: Kimi K2.5100100100968696.5%
Grok 4.20 (Beta, Reasoning)1009996929195.7%
Arcee AI: Trinity Mini100100100958395.6%
Qwen 3.5 Plus (2026-02-15)100100100948495.5%
Z.AI GLM 5100100100928495.3%
GPT-5 Mini100100100987895.1%
Grok 4.20 (Beta)100100100888795.1%
Grok 41001001001007394.5%
Claude Sonnet 4.6 (Reasoning)1001001001007194.1%
Z.AI GLM 4.710010093918493.7%
MiniMax M2.7969595929093.7%
ByteDance Seed 1.6100100100947193.0%
Claude Sonnet 4.5100100100877692.6%
Z.AI GLM 5 Turbo1009897927592.3%
GPT-4.1 Nano1009895917692.1%
Claude Opus 4.6 (Reasoning)100100100916992.0%
Gemma 3 27B100100100936391.3%
Claude Opus 41001001001005591.1%
DeepSeek V3.2100100100777389.9%
Claude 3 Haiku100100100856289.4%
Gemini 2.5 Pro1001001001004288.4%
GPT-4.1 Mini100100100875287.9%
ByteDance Seed 2.0 Lite100100100983787.1%
Stealth: Healer Alpha1001001001003486.8%
Claude Opus 4.510010090786686.8%
Stealth: Hunter Alpha10010089755583.8%
Ministral 3B100100100100981.8%
Gemini 3 Flash (Preview, Reasoning)10010084744981.4%
Rocinante 12B1009695882480.6%
Hermes 3 70B10010010086077.1%
Gemini 2.5 Flash (Reasoning)1009976664076.3%
DeepSeek-V2 Chat100100100701176.1%
Ministral 3 3B100100100522775.7%
Mistral NeMO1009188881175.6%
GPT-4o Mini (temp=0)878275736175.5%
Claude 3.5 Sonnet1008675684975.5%
Gemini 3.1 Flash Lite (Preview)1007469676274.3%
Gemini 3 Flash (Preview)968880632770.8%
Gemma 3 4B887971714470.3%
Z.AI GLM 4.61007974633269.4%
GPT-4o, May 13th (temp=1)10010010046069.2%
GPT-4o, Aug. 6th (temp=0)100948862068.9%
Hermes 3 405B10010056462665.7%
Nemotron 3 Super1008654453864.8%
Z.AI GLM 4.5100837365064.2%
DeepSeek V3.1948869601064.1%
Claude Haiku 4.591837766063.4%
Inception Mercury 21007363403562.3%
Claude Sonnet 41007752331956.2%
GPT-4o, May 13th (temp=0)945958392354.5%
ByteDance Seed 2.0 Mini1001005411053.0%
GPT-5 Nano716856402852.7%
MiniMax M2.5100864136052.6%
Gemini 2.5 Flash100893825451.2%
Claude 3.7 Sonnet100814824050.4%
Arcee AI: Trinity Large (Preview)9589602049.2%
Gemma 3 12B85553327039.9%
Nemotron 3 Nano8166330035.9%
Llama 3.1 70B1007500035.0%
Claude 3.5 Haiku6859480034.8%
Gemini 2.5 Flash Lite63464218033.8%
Stealth: Aurora Alpha10034169031.8%
Llama 3.1 8B6151370029.9%
Qwen 2.5 72B6050155026.1%
Gemini 2.5 Flash Lite (Reasoning)6129150020.9%
Mistral Small 3.2 24B3832210018.3%
Cohere Command R+ (Aug. 2024)71000014.3%
Inception Mercury000000.0%
Llama 3.1 Nemotron 70B000000.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Aion 2.0100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
GPT-5.4100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B100100100100100100.0%
GPT-51001001001009799.4%
WizardLM 2 8x22b1001001001009699.2%
Z.AI GLM 4.71001001001009598.9%
Z.AI GLM 51001001001009498.9%
Stealth: Hunter Alpha100100100989498.4%
Ministral 3 3B1001001001009298.4%
GPT-4.1 Mini1001001001009098.0%
Qwen 3.5 122B1001001001008797.5%
Grok 4.20 (Beta, Reasoning)100100100998997.5%
ByteDance Seed 2.0 Lite1001001001008797.4%
Z.AI GLM 5 Turbo1001001001008697.2%
DeepSeek V3.11001001001008597.1%
Ministral 3B1001001001008396.6%
Z.AI GLM 4.6100100100919196.5%
GPT-5.4 Nano (Reasoning)1001001001008196.2%
GPT-4o Mini (temp=1)100100100928895.9%
DeepSeek-V2 Chat100100100948595.8%
Qwen 3.5 35B1001001001007595.1%
DeepSeek V3 (2024-12-26)1001001001007595.0%
DeepSeek V3.2100100100898594.7%
MiniMax M2.5100100100878594.3%
Claude Opus 4.510010098878694.3%
Claude Sonnet 41001001001007194.1%
GPT-5.4 Nano100100100898093.8%
Mistral Large 2100100100826990.2%
GPT-4o, May 13th (temp=1)10010094797289.0%
Stealth: Healer Alpha10010099935488.9%
Z.AI GLM 4.51009588887188.6%
GPT-4o Mini (temp=0)1009189897188.0%
Hermes 3 405B1001001001003687.1%
GPT-4.1 Nano100100100864886.8%
Gemini 2.5 Pro10010081777486.3%
Arcee AI: Trinity Large (Preview)100100100963185.5%
MiniMax M2.7100100100725585.4%
Gemini 3 Flash (Preview)10010098834384.8%
GPT-4o, Aug. 6th (temp=1)10010082766083.5%
Gemini 3 Flash (Preview, Reasoning)10010083745883.0%
Ministral 8B1001001001001382.6%
Gemma 3 27B1009382815381.7%
Gemini 2.5 Flash (Reasoning)1008881706781.3%
Gemini 2.5 Flash1009980636380.9%
Mistral Small 3.2 24B100100100100080.0%
Arcee AI: Trinity Mini100100100792080.0%
Gemini 3.1 Flash Lite (Preview)10010073646179.7%
Gemma 3 12B959379645878.0%
ByteDance Seed 2.0 Mini100100100711877.8%
Claude 3.5 Sonnet1009388623876.3%
GPT-4o, Aug. 6th (temp=0)978983623072.3%
Hermes 3 70B100919073070.8%
Claude 3.5 Haiku100987861067.5%
ByteDance Seed 1.610010070382666.9%
GPT-4o, May 13th (temp=0)1009264452565.4%
Gemma 3 4B757166615164.8%
Mistral NeMO100767059060.9%
Nemotron 3 Super908344361854.3%
Claude 3.7 Sonnet100945812754.2%
Cohere Command R+ (Aug. 2024)100684436049.5%
GPT-5 Nano100634439049.2%
Gemini 2.5 Flash Lite894542241142.2%
Claude 3 Haiku79464621539.5%
Qwen 2.5 72B9251391036.9%
Llama 3.1 70B1002200024.4%
Stealth: Aurora Alpha801700019.4%
Nemotron 3 Nano3633170017.3%
Inception Mercury 2373330014.5%
Gemini 2.5 Flash Lite (Reasoning)42960011.4%
Llama 3.1 8B54000010.7%
Llama 3.1 Nemotron 70B1100002.1%
Inception Mercury000000.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
GPT-5.2100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Aion 2.0100100100100100100.0%
MiniMax M2.7100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
Stealth: Healer Alpha100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Gemini 2.5 Flash Lite (Reasoning)100100100100100100.0%
Mistral Large 3100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3.2100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
Gemma 3 12B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Mistral Small 4100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Arcee AI: Trinity Mini100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Ministral 8B100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)10010010010010099.9%
Mistral Medium 3.11001001001009999.7%
GPT-5.4 Nano (Reasoning, Low)1001001001009899.7%
Gemma 3 27B1001001001009799.4%
Ministral 3B1001001001009799.4%
Claude Haiku 4.51001001001009699.3%
Qwen 3.5 397B A17B1001001001009699.2%
Inception Mercury1001001001009699.1%
GPT-4.1 Nano1001001001009599.0%
Gemini 3 Flash (Preview, Reasoning)1001001001009498.7%
o4 Mini High1001001001009498.7%
Gemini 2.5 Flash (Reasoning)1001001001009398.6%
Qwen 3.5 122B1001001001009398.5%
MoonshotAI: Kimi K2.51001001001009198.2%
GPT-5 Mini1001001001009198.1%
Stealth: Aurora Alpha1001001001009098.1%
LFM2 24B1001001001008997.9%
GPT-5.4 Nano1001001001008897.7%
Z.AI GLM 4.5100100100989197.6%
ByteDance Seed 1.61001001001008897.6%
Gemini 2.5 Pro1001001001008897.6%
Hermes 3 405B1001001001008897.6%
Mistral Large1001001001008797.4%
DeepSeek-V2 Chat1001001001008596.9%
Z.AI GLM 4.61001001001008396.7%
GPT-5.4 Nano (Reasoning)1001001001008096.1%
Grok 4.20 (Beta, Reasoning)1001001001008096.0%
Qwen 2.5 72B1001001001008096.0%
GPT-51001001001007695.3%
Gemini 3 Flash (Preview)100100100997795.2%
GPT-4o, May 13th (temp=0)100100100908595.0%
Rocinante 12B100100100878694.7%
Nemotron 3 Super1001001001007194.3%
Claude Opus 4.5100100100887993.5%
MiniMax M2.5100100100966993.1%
ByteDance Seed 2.0 Mini1001001001006292.4%
GPT-4o, May 13th (temp=1)1001001001006192.2%
GPT-4o Mini (temp=1)100100100847591.8%
Qwen 3.5 27B1001001001005691.3%
Inception Mercury 2100100100995390.3%
DeepSeek V3.1100100100915188.5%
Hermes 3 70B10010091856688.2%
Gemini 2.5 Flash Lite10010089874383.9%
Gemini 3.1 Flash Lite (Preview)1009895912681.9%
Mistral NeMO100100100981181.8%
GPT-5 Nano1009073706980.4%
Cohere Command R+ (Aug. 2024)100100100100080.0%
Llama 3.1 Nemotron 70B1007575666375.8%
Llama 3.1 70B1008571433867.5%
Nemotron 3 Nano1001008635064.2%
Llama 3.1 8B997875382863.5%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
GPT-5.2100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
GPT-5.4100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
GPT-4.1 Nano100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Ministral 8B100100100100100100.0%
Rocinante 12B100100100100100100.0%
Grok 410010010010010099.9%
Gemini 3 Pro (Preview)1001001001009999.7%
GPT-4o, May 13th (temp=0)1001001001009899.6%
GPT-5.4 Nano (Reasoning, Low)1001001001009899.5%
Qwen 3.5 397B A17B1001001001009799.3%
ByteDance Seed 1.6 Flash1001001001009699.3%
GPT-4o Mini (temp=0)1001001001009699.2%
Claude Opus 41001001001009699.1%
Cohere Command R+ (Aug. 2024)1001001001009599.0%
MiniMax M2.51001001001009599.0%
Ministral 3 3B1001001001009598.9%
Claude Opus 4.51001001001009498.8%
GPT-5.4 Mini (Reasoning, Low)1001001001009498.7%
GPT-4o, May 13th (temp=1)1001001001009498.7%
Mistral Small 41001001001009498.7%
Gemini 2.5 Flash (Reasoning)100100100989398.2%
Grok 4.20 (Beta)1001001001009098.0%
Gemma 3 12B1001001001009098.0%
Claude 3 Haiku1001001001009097.9%
Z.AI GLM 5 Turbo100100100999097.8%
Gemini 2.5 Pro1001001001008997.7%
Aion 2.01001001001008897.6%
Mistral Large 31001001001008897.6%
WizardLM 2 8x22b100100100949397.5%
GPT-4.1 Mini100100100989097.5%
Qwen 3.5 27B1001001001008797.4%
Qwen 3.5 122B1001001001008797.4%
GPT-5100100100979097.2%
Z.AI GLM 510010098949397.0%
Claude Sonnet 4.51001001001008396.7%
Mistral Small 4 (Reasoning)1001001001008296.5%
o4 Mini High1001001001008296.4%
Qwen 3.5 Plus (2026-02-15)1001001001008296.4%
LFM2 24B1001001001008196.1%
DeepSeek-V2 Chat1001001001008096.0%
GPT-5.4 Nano (Reasoning)1001001001007695.3%
DeepSeek V3.210010098918594.9%
Claude Haiku 4.5100100100947894.3%
DeepSeek V3 (2025-03-24)1001001001007093.9%
GPT-4o Mini (temp=1)100100100897793.2%
Hermes 3 70B1001001001006693.1%
Qwen 3 32B100100100897592.8%
Arcee AI: Trinity Large (Preview)100100100996392.4%
GPT-5 Mini100100100877391.9%
Grok 4.20 (Beta, Reasoning)10010099877091.2%
Grok 4 Fast100100100866390.0%
GPT-4o, Aug. 6th (temp=0)10010098836889.7%
Ministral 3B100100100905589.1%
GPT-4o, Aug. 6th (temp=1)100100100875488.1%
DeepSeek V3.110010088846186.7%
MiniMax M2.710010083767186.1%
Claude 3.5 Sonnet10010085737085.5%
MoonshotAI: Kimi K2.51009793864985.1%
ByteDance Seed 2.0 Lite100100100665984.9%
Stealth: Healer Alpha1009487717184.7%
Z.AI GLM 4.510010082815282.8%
GPT-5 Nano1009387746082.6%
Claude Sonnet 410010093861979.5%
Gemini 3 Flash (Preview)10010095762479.0%
Claude 3.5 Haiku1001009588778.0%
Gemini 2.5 Flash Lite999594574177.2%
Arcee AI: Trinity Mini998973714475.4%
Z.AI GLM 4.610010082672174.0%
Claude 3.7 Sonnet10010083393871.9%
Gemini 3 Flash (Preview, Reasoning)877672693467.9%
Gemma 3 4B1008679492567.7%
Qwen 2.5 72B10010062522467.6%
Mistral Small 3.2 24B10010010024064.8%
Gemini 2.5 Flash Lite (Reasoning)747168594563.2%
Nemotron 3 Nano1008754521361.1%
ByteDance Seed 1.6967670431660.2%
Inception Mercury1001001000060.0%
ByteDance Seed 2.0 Mini1001007117057.6%
Inception Mercury 293855548557.4%
Nemotron 3 Super100944833055.2%
Gemini 3.1 Flash Lite (Preview)938473151055.1%
Llama 3.1 70B63636154048.4%
Stealth: Aurora Alpha8281700046.4%
Llama 3.1 Nemotron 70B10048170032.9%
Llama 3.1 8B7648390032.4%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
o4 Mini High100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
MiniMax M2.7100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
GPT-5.4100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Inception Mercury100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Mistral NeMO100100100100100100.0%
LFM2 24B100100100100100100.0%
Qwen 3.5 397B A17B1001001001009999.8%
GPT-5.4 Mini (Reasoning, Low)1001001001009999.8%
Ministral 8B1001001001009999.7%
GPT-4o, May 13th (temp=1)1001001001009899.6%
Stealth: Healer Alpha1001001001009899.5%
Arcee AI: Trinity Large (Preview)1001001001009899.5%
GPT-4.11001001001009598.9%
GPT-5.21001001001009398.6%
Ministral 3 3B1001001001009398.5%
GPT-4o, Aug. 6th (temp=1)1001001001009298.4%
Stealth: Hunter Alpha1001001001009198.3%
Claude Opus 4.61001001001009198.1%
Claude Opus 4.51001001001008997.9%
Z.AI GLM 51001001001008997.7%
Claude 3.7 Sonnet1001001001008897.7%
GPT-5.4 Mini (Reasoning)100100100969097.4%
GPT-4o Mini (temp=0)100100100978997.1%
ByteDance Seed 1.61001001001008697.1%
GPT-5.4 Nano (Reasoning, Low)100100100958896.6%
Mistral Small 3.2 24B1001001001008296.4%
Gemini 2.5 Pro100100100968696.3%
DeepSeek V3.21001001001008196.1%
GPT-4.1 Nano1001001001007995.8%
Claude 3.5 Haiku100100100928695.5%
Claude Haiku 4.5100100100928595.4%
Z.AI GLM 5 Turbo100100100918294.6%
Cohere Command R+ (Aug. 2024)100100100908294.3%
Aion 2.0100100100957694.1%
Gemini 3 Pro (Preview)100100100907993.7%
Z.AI GLM 4.71001001001006893.5%
Arcee AI: Trinity Mini1001001001006893.5%
DeepSeek V3.110010096898293.5%
GPT-4o, Aug. 6th (temp=0)100100100996893.3%
GPT-5 Mini100100100897793.3%
Grok 4.20 (Beta, Reasoning)100100100996893.3%
Claude 3.5 Sonnet1001001001006192.2%
Gemini 2.5 Flash100100100887091.6%
Gemini 2.5 Flash (Reasoning)100100100886790.9%
MiniMax M2.510010096847290.4%
Gemma 3 4B1001001001005190.1%
Ministral 3B1001001001005090.0%
Inception Mercury 21001001001004589.1%
Qwen 3.5 27B1001001001004589.0%
Ministral 3 8B1001001001004589.0%
Gemini 3.1 Flash Lite (Preview)100100100815887.7%
Qwen 3.5 Plus (2026-02-15)100100100854886.6%
Gemma 3 12B10010079747184.8%
Gemini 3 Flash (Preview, Reasoning)1009593766084.8%
Gemini 2.5 Flash Lite1008987686782.4%
ByteDance Seed 2.0 Lite100100100100981.7%
Rocinante 12B100100100713681.4%
Z.AI GLM 4.61009486784881.0%
Qwen 2.5 72B10010084813980.8%
Stealth: Aurora Alpha10010089813580.8%
Llama 3.1 70B10010090663778.5%
GPT-5 Nano10010089851377.4%
Z.AI GLM 4.510010010080076.0%
Gemini 3 Flash (Preview)928987604474.5%
Nemotron 3 Super868071616071.4%
Hermes 3 70B1001009550069.0%
Gemini 2.5 Flash Lite (Reasoning)977155454562.6%
Nemotron 3 Nano100766617051.7%
Llama 3.1 8B10075660048.1%
Llama 3.1 Nemotron 70B100763315545.9%
ByteDance Seed 2.0 Mini604638131133.7%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
GPT-5.4100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Mistral Small 4100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
GPT-4.1 Nano100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Ministral 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
DeepSeek V3 (2024-12-26)1001001001009999.7%
GPT-4o, Aug. 6th (temp=1)1001001001009899.6%
LFM2 24B1001001001009899.5%
DeepSeek V3.21001001001009799.5%
GPT-5.21001001001009799.4%
Mistral Large1001001001009799.4%
Claude Opus 4.6 (Reasoning)1001001001009799.4%
MiniMax M2.51001001001009699.3%
GPT-5.11001001001009298.5%
Mistral Medium 3.11001001001009198.1%
Stealth: Hunter Alpha1001001001009098.1%
Gemma 3 27B1001001001008797.5%
Rocinante 12B1001001001008797.4%
Gemini 3 Pro (Preview)100100100988997.2%
MoonshotAI: Kimi K2.51001001001008697.1%
GPT-4o Mini (temp=0)1001001001008697.1%
Mistral Small Creative1001001001008697.1%
Ministral 3 14B1001001001008697.1%
Qwen 3 32B100100100959097.1%
Qwen 3.5 397B A17B1001001001008496.9%
Hermes 3 70B1001001001008396.7%
Mistral Large 2100100100958996.7%
MiniMax M2.71001001001008296.4%
Mistral Large 31001001001008096.0%
Qwen 3.5 122B1001001001008096.0%
Claude Opus 4.61001001001008095.9%
GPT-4o, Aug. 6th (temp=0)1001001001007995.8%
Ministral 3 3B1001001001007795.4%
DeepSeek V3.1100100100908795.4%
Gemini 2.5 Pro10010093938995.1%
GPT-5 Mini100100100987594.7%
Z.AI GLM 5 Turbo1001001001007194.3%
Z.AI GLM 4.6100100100878494.1%
Stealth: Healer Alpha100100100997093.9%
Gemma 3 12B100100100927493.1%
Claude Haiku 4.5100100100937192.8%
GPT-4o, May 13th (temp=0)1001001001006392.7%
Grok 4.20 (Beta)10010096887892.5%
Claude Opus 4.510010096868092.4%
Qwen 3.5 Plus (2026-02-15)100100100976492.3%
Grok 4100100100946792.1%
Gemini 2.5 Flash Lite10010096956992.0%
GPT-4.1 Mini100100100916891.7%
Grok 4.20 (Beta, Reasoning)100100100867091.1%
Qwen 2.5 72B100100100975790.8%
Qwen 3.5 35B1001001001005490.8%
Aion 2.010010089887590.4%
Claude Sonnet 4100100100916090.1%
Z.AI GLM 4.510010092916689.8%
GPT-5.4 Nano (Reasoning, Low)1008988858489.2%
DeepSeek-V2 Chat1001001001004488.7%
WizardLM 2 8x22b100100100974388.1%
Z.AI GLM 4.7 Flash10010098865587.6%
Mistral NeMO100100100954187.2%
Claude 3.5 Sonnet1009488856586.4%
Gemini 2.5 Flash Lite (Reasoning)10010097716085.5%
Arcee AI: Trinity Large (Preview)100100100923585.3%
GPT-4o, May 13th (temp=1)10010087873982.6%
GPT-5 Nano1008077757481.1%
Claude 3 Haiku10010089813580.9%
Mistral Small 3.2 24B100100100100080.0%
ByteDance Seed 2.0 Lite1009087634877.6%
Cohere Command R+ (Aug. 2024)100100100661476.2%
Arcee AI: Trinity Mini1008875665276.0%
Claude 3.7 Sonnet1008988544975.9%
Gemini 3.1 Flash Lite (Preview)898571625472.1%
Gemini 3 Flash (Preview)938872664272.0%
Gemini 3 Flash (Preview, Reasoning)958582732471.6%
Claude 3.5 Haiku1008882731371.2%
Gemma 3 4B867772684769.8%
Llama 3.1 70B1007559545267.8%
ByteDance Seed 1.61006961554866.6%
Llama 3.1 Nemotron 70B10010010027065.4%
ByteDance Seed 2.0 Mini1001007050063.9%
Inception Mercury1001006646062.4%
Nemotron 3 Super827558443859.2%
Llama 3.1 8B1001005733358.7%
Stealth: Aurora Alpha77553129038.6%
Inception Mercury 269604312036.9%
Nemotron 3 Nano1006780035.1%

Genre

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5100100100100100100.0%
o4 Mini High100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
GPT-5.4100100100100100100.0%
Mistral Large 2100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude Sonnet 4.61001001001009999.8%
GPT-5.4 Nano1001001001009999.8%
Mistral Large1001001001009999.7%
Grok 4.20 (Beta)1001001001009799.4%
Claude Opus 4.6 (Reasoning)1001001001009699.2%
GPT-5.4 (Reasoning, Low)1001001001009598.9%
o4 Mini1001001001009298.5%
Ministral 8B1001001001009198.3%
GPT-5.4 (Reasoning)1001001001009198.2%
Grok 4 Fast1001001001008997.8%
Stealth: Hunter Alpha1001001001008897.6%
Mistral Small 4 (Reasoning)1001001001008897.6%
Mistral Large 31001001001008797.5%
ByteDance Seed 2.0 Lite1001001001008396.7%
Qwen 3.5 35B100100100958796.3%
Claude Sonnet 4.51001001001008296.3%
GPT-4.1 Mini1001001001008196.1%
GPT-5.2100100100938695.8%
GPT-5.4 Nano (Reasoning)1001001001007795.4%
Grok 410010097967894.2%
GPT-5 Mini100100100888394.1%
GPT-5.4 Mini (Reasoning)100100100868093.2%
Grok 4.20 (Beta, Reasoning)1001001001006693.1%
GPT-4.1100100100897392.4%
Gemini 3 Pro (Preview)100100100877592.4%
Z.AI GLM 4.7 Flash10010099966491.8%
DeepSeek V3.11009695847990.9%
GPT-5.4 Nano (Reasoning, Low)1009998896389.8%
Claude Opus 410010089867489.7%
DeepSeek-V2 Chat100100100816388.7%
ByteDance Seed 2.0 Mini100100100895488.5%
GPT-5.4 Mini (Reasoning, Low)100100100915188.4%
GPT-4o Mini (temp=1)100100100885488.3%
Mistral NeMO100100100736888.1%
GPT-4o, Aug. 6th (temp=1)10010096885688.0%
Mistral Small 3.2 24B1001001001003787.5%
Qwen 3.5 122B10010093796587.4%
Ministral 3B100100100765987.0%
GPT-5.4 Mini1009386856886.6%
LFM2 24B100100100983286.0%
ByteDance Seed 1.6 Flash100100100804885.5%
Z.AI GLM 5 Turbo10010086796185.2%
Rocinante 12B100100100754884.5%
Gemini 3.1 Pro (Preview)1009988795584.1%
Z.AI GLM 4.710010086686583.7%
Hermes 3 405B1001001001001583.0%
Claude 3 Haiku1009891883682.5%
Qwen 3.5 397B A17B1009891725182.4%
MoonshotAI: Kimi K2.510010087626181.8%
Cohere Command R+ (Aug. 2024)100100100604180.3%
Qwen 3.5 Flash1009379605978.1%
Aion 2.01008579656077.7%
GPT-4o Mini (temp=0)10010075554474.7%
Gemini 3 Flash (Preview, Reasoning)1009974702974.1%
Gemma 3 27B1001008870772.9%
Qwen 3.5 27B1001007171669.5%
Gemma 3 12B1007169683668.8%
Gemini 2.5 Pro1008785521467.6%
DeepSeek V3 (2024-12-26)10010056413967.2%
GPT-4o, May 13th (temp=0)1008369542967.1%
Gemini 2.5 Flash Lite847365654566.6%
MiniMax M2.51008268462463.9%
Gemini 3.1 Flash Lite (Preview)1008652413662.9%
Z.AI GLM 5917970452261.3%
DeepSeek V3.21006551494061.0%
Gemini 3 Flash (Preview)100867239360.2%
Claude 3.7 Sonnet1001007222058.7%
Gemini 2.5 Flash (Reasoning)94746357057.6%
Ministral 3 3B10010042211455.4%
Claude Opus 4.5956360421655.3%
Hermes 3 70B100926019054.1%
Stealth: Healer Alpha100784534352.0%
GPT-4.1 Nano100943526050.8%
Claude Haiku 4.581726819248.2%
Gemini 2.5 Flash81635046048.1%
ByteDance Seed 1.688853124045.4%
Gemma 3 4B94524324844.2%
GPT-4o, May 13th (temp=1)8975240037.4%
Z.AI GLM 4.67454510035.8%
Qwen 3.5 Plus (2026-02-15)48484027032.3%
GPT-4o, Aug. 6th (temp=0)7765180031.9%
GPT-5 Nano8754170031.6%
MiniMax M2.778363111031.2%
Arcee AI: Trinity Mini58523310030.5%
Qwen 2.5 72B8626240027.2%
Arcee AI: Trinity Large (Preview)7929160024.6%
Llama 3.1 8B961600022.4%
Llama 3.1 Nemotron 70B921370022.3%
Claude Sonnet 442312011020.8%
Claude 3.5 Haiku68900015.3%
Z.AI GLM 4.5492700015.2%
Gemini 2.5 Flash Lite (Reasoning)372770014.1%
WizardLM 2 8x22b481100011.6%
Claude 3.5 Sonnet4300008.6%
Inception Mercury2300004.7%
Llama 3.1 70B1700003.4%
Nemotron 3 Super1600003.2%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Nemotron 3 Nano000000.0%

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Stealth: Healer Alpha100100100100100100.0%
GPT-5.4100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 8B100100100100100100.0%
Claude Opus 4.6 (Reasoning)10010010010010099.9%
Gemini 3.1 Pro (Preview)1001001001009999.8%
Claude Opus 4.61001001001009899.6%
MoonshotAI: Kimi K2.51001001001009799.3%
Z.AI GLM 4.7 Flash1001001001009498.9%
Claude Sonnet 4.61001001001009498.8%
Z.AI GLM 5 Turbo1001001001009198.2%
o4 Mini High1001001001008797.3%
Gemini 2.5 Pro100100100988897.1%
GPT-4.11001001001008597.0%
GPT-51001001001008496.8%
DeepSeek-V2 Chat1001001001008296.4%
Qwen 3.5 27B1001001001007995.8%
GPT-5.4 Nano (Reasoning, Low)1001001001007995.8%
Stealth: Hunter Alpha1001001001007795.2%
Grok 4.20 (Beta, Reasoning)100100100958195.2%
Qwen 3.5 122B100100100958195.2%
GPT-4.1 Nano100100100898594.7%
Grok 4.20 (Beta)100100100928294.7%
Mistral Large100100100898594.7%
GPT-5.2100100100957894.6%
ByteDance Seed 1.6 Flash100100100997494.6%
Qwen 3 32B1001001001007394.6%
GPT-4.1 Mini10010097888694.3%
Qwen 3.5 Flash10010091918994.1%
Mistral Large 21001001001007094.0%
Ministral 3 8B1001001001006893.6%
Ministral 3 3B1001001001006693.1%
GPT-5.4 Mini (Reasoning)100100100877993.0%
GPT-5.4 Mini1009994937992.9%
o4 Mini100100100917392.8%
Mistral Large 3100100100857992.7%
GPT-4o Mini (temp=1)1001001001006292.3%
Z.AI GLM 51001001001006192.2%
Z.AI GLM 4.710010097947092.2%
GPT-5.4 Mini (Reasoning, Low)10010099857191.2%
ByteDance Seed 2.0 Lite10010093917090.8%
Gemini 3 Pro (Preview)100100100925990.2%
Gemma 3 12B10010086867990.2%
Aion 2.01009689838390.1%
GPT-5.4 Nano100100100816990.0%
Claude Sonnet 4.5100100100876189.7%
GPT-4o, Aug. 6th (temp=1)100100100885989.3%
GPT-5 Mini1009291907088.7%
Qwen 3.5 35B1009984837588.1%
MiniMax M2.5100100100786288.0%
Claude Opus 4100100100825587.3%
GPT-5.4 Nano (Reasoning)1009493786586.2%
DeepSeek V3.210010089786185.4%
Ministral 3B1001001001001783.3%
Gemini 3 Flash (Preview)1009685746083.0%
Qwen 3.5 397B A17B10010073696481.2%
Claude Opus 4.51009693654880.2%
Rocinante 12B898886716680.0%
Grok 4 Fast10010091832479.7%
ByteDance Seed 2.0 Mini10010077616079.6%
DeepSeek V3 (2024-12-26)10010092871879.3%
Claude Haiku 4.5898581796379.3%
LFM2 24B10010010072074.5%
MiniMax M2.7928778694273.6%
Gemini 3.1 Flash Lite (Preview)10010094571673.4%
Arcee AI: Trinity Mini10010089572073.0%
ByteDance Seed 1.61001008671071.4%
Gemini 2.5 Flash Lite1008171624271.1%
Mistral NeMO1001007569068.6%
DeepSeek V3.1999858503668.2%
Gemma 3 4B917662605267.9%
Gemma 3 27B857665545466.8%
GPT-4o, May 13th (temp=1)1009776302766.0%
Z.AI GLM 4.6807770544665.4%
Qwen 3.5 Plus (2026-02-15)716966634863.4%
Hermes 3 405B1001009620063.1%
Cohere Command R+ (Aug. 2024)1007776382462.9%
Gemini 2.5 Flash996456544262.8%
Claude 3.5 Haiku1008848362459.0%
Mistral Small 3.2 24B1007067261154.8%
GPT-4o Mini (temp=0)886054391751.5%
Gemini 2.5 Flash (Reasoning)1005148371951.3%
Claude Sonnet 4766766331350.8%
Grok 410080690049.8%
Claude 3.5 Sonnet9473700047.4%
Hermes 3 70B90813727046.9%
Claude 3.7 Sonnet847532251846.8%
Gemini 2.5 Flash Lite (Reasoning)92454238043.4%
Gemini 3 Flash (Preview, Reasoning)76585220041.2%
Claude 3 Haiku71643629040.3%
Z.AI GLM 4.587493231039.7%
Arcee AI: Trinity Large (Preview)90493210036.1%
Qwen 2.5 72B10043114031.7%
GPT-5 Nano6732270025.2%
GPT-4o, Aug. 6th (temp=0)363440014.9%
Nemotron 3 Super54000010.7%
Llama 3.1 8B4800009.5%
WizardLM 2 8x22b4400008.9%
GPT-4o, May 13th (temp=0)800001.5%
Inception Mercury 2000000.0%
Stealth: Aurora Alpha000000.0%
Inception Mercury000000.0%
Llama 3.1 70B000000.0%
Nemotron 3 Nano000000.0%
Llama 3.1 Nemotron 70B000000.0%

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Z.AI GLM 5 Turbo100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
o4 Mini High100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Aion 2.0100100100100100100.0%
Z.AI GLM 4.6100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3.1100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Small 4100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Ministral 8B100100100100100100.0%
Rocinante 12B100100100100100100.0%
Claude Sonnet 4.61001001001009999.8%
Gemini 3 Flash (Preview)1001001001009999.7%
Gemini 2.5 Flash Lite1001001001009999.7%
Cohere Command R+ (Aug. 2024)1001001001009899.6%
Gemini 2.5 Flash100100100999999.6%
GPT-4.11001001001009799.5%
GPT-5 Mini1001001001009799.4%
DeepSeek V3 (2024-12-26)1001001001009799.4%
Ministral 3 3B1001001001009799.3%
MiniMax M2.71001001001009699.3%
GPT-4.1 Nano1001001001009498.9%
GPT-5.4 Nano (Reasoning)1001001001009498.8%
Claude Haiku 4.51001001001009298.5%
GPT-5.21001001001009198.3%
Gemma 3 4B1001001001009198.1%
Gemini 3 Pro (Preview)1001001001009098.1%
Gemini 2.5 Flash (Reasoning)1001001001009098.1%
Stealth: Healer Alpha1001001001008997.9%
Gemini 3 Flash (Preview, Reasoning)1001001001008997.9%
Mistral NeMO1001001001008997.8%
Gemini 2.5 Pro1001001001008897.7%
Qwen 2.5 72B1001001001008897.5%
GPT-5.41001001001008797.3%
GPT-51001001001008697.2%
Claude Opus 4.6 (Reasoning)1001001001008697.1%
Arcee AI: Trinity Mini1001001001008697.1%
Mistral Medium 3.1100100100929196.7%
Qwen 3.5 122B1001001001008196.2%
Qwen 3.5 Flash1001001001008096.1%
Claude 3.5 Haiku100100100909096.0%
Claude Opus 4.61001001001008095.9%
Grok 4.20 (Beta, Reasoning)100100100997995.6%
Gemma 3 12B100100100997995.4%
Claude Opus 4.510010095908994.9%
Claude 3.7 Sonnet100100100878794.8%
DeepSeek V3.21001001001007294.5%
GPT-4o Mini (temp=0)100100100888293.9%
Qwen 3.5 27B1001001001007093.9%
GPT-4o, Aug. 6th (temp=0)100100100887692.7%
MiniMax M2.51001001001006392.7%
Claude Opus 41001001001006292.3%
GPT-5.4 Mini100100100808092.2%
LFM2 24B1001001001005490.7%
Gemini 3.1 Pro (Preview)100100100896190.1%
GPT-5.4 Mini (Reasoning)100100100994889.4%
Arcee AI: Trinity Large (Preview)10010092796988.0%
Claude 3.5 Sonnet100100100796088.0%
Claude 3 Haiku10010095875787.8%
GPT-5.4 Nano10010092855887.0%
Qwen 3 32B1001001001003386.6%
Z.AI GLM 510010096765785.8%
Z.AI GLM 4.51001001001002284.4%
Qwen 3.5 Plus (2026-02-15)100100100685083.6%
ByteDance Seed 2.0 Lite100100100921681.5%
Gemini 2.5 Flash Lite (Reasoning)1009285715681.0%
Hermes 3 70B10010094713580.2%
ByteDance Seed 2.0 Mini100100100701977.8%
WizardLM 2 8x22b10010083613876.5%
Qwen 3.5 35B100999479074.4%
Ministral 3B100100100482474.3%
Claude Sonnet 4.6 (Reasoning)10010079751774.2%
Qwen 3.5 397B A17B959184652471.8%
Inception Mercury10010010054070.7%
Gemini 3.1 Flash Lite (Preview)10010073363568.8%
Llama 3.1 70B1001006666066.4%
GPT-5 Nano100776559861.6%
Nemotron 3 Super1004938352449.0%
Llama 3.1 Nemotron 70B100541915037.5%
Llama 3.1 8B1003119181736.8%
Inception Mercury 26449479033.9%
Nemotron 3 Nano52271913022.2%
Stealth: Aurora Alpha1919160010.8%

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Aion 2.0100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Large100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Mistral Small 4100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Ministral 8B100100100100100100.0%
Gemini 3.1 Pro (Preview)10010010010010099.9%
GPT-5.4 Mini (Reasoning, Low)1001001001009999.8%
DeepSeek V3.21001001001009899.6%
GPT-5.21001001001009799.5%
Stealth: Hunter Alpha1001001001009799.4%
Mistral Large 31001001001009799.4%
GPT-5.4 Nano (Reasoning, Low)1001001001009799.4%
GPT-5.41001001001009699.2%
Claude Opus 41001001001009699.2%
Mistral Medium 3.11001001001009599.0%
GPT-5.4 Nano100100100999698.9%
Ministral 3 8B1001001001009498.8%
GPT-51001001001009498.7%
Gemini 2.5 Pro1001001001009398.6%
GPT-5.4 Nano (Reasoning)10010099989698.5%
Qwen 3.5 9B1001001001009298.5%
Qwen 3.5 397B A17B1001001001009098.1%
GPT-4.11001001001009098.1%
Z.AI GLM 4.7100100100959597.9%
Z.AI GLM 5 Turbo100100100979197.5%
GPT-5.11001001001008797.3%
Rocinante 12B1001001001008697.1%
o4 Mini High1001001001008597.0%
Gemini 2.5 Flash (Reasoning)1001001001008597.0%
Gemini 2.5 Flash Lite (Reasoning)10010099958996.7%
Qwen3 235B A22B Instruct 25071001001001008096.0%
Ministral 3B10010096968595.4%
Qwen 3.5 27B100100100978095.4%
Z.AI GLM 4.61001001001007695.3%
Ministral 3 3B1001001001007695.2%
Stealth: Healer Alpha1001001001007595.1%
Gemma 3 27B1001001001007595.0%
GPT-5 Mini100100100888694.8%
Z.AI GLM 4.7 Flash100100100868594.3%
o4 Mini1001001001007194.3%
Grok 4100100100878494.3%
MiniMax M2.5100100100898394.3%
Mistral Small 3.2 24B1001001001007094.1%
Grok 4.1 Fast1001001001007094.0%
GPT-4o Mini (temp=1)10010098977493.6%
Gemini 2.5 Flash100100100917793.6%
Gemma 3 4B100100100868193.3%
Mistral Small 4 (Reasoning)100100100996793.2%
GPT-4o, Aug. 6th (temp=0)100100100927493.1%
Mistral NeMO10010099996292.0%
DeepSeek V3 (2025-03-24)100100100827791.8%
Arcee AI: Trinity Large (Preview)100100100857491.8%
Grok 4.20 (Beta, Reasoning)10010089858591.8%
Gemini 3 Pro (Preview)1001001001005991.8%
GPT-4o Mini (temp=0)100100100936691.7%
Claude Opus 4.61001001001005891.6%
Hermes 3 405B100100100827591.4%
ByteDance Seed 1.6 Flash1001001001005791.3%
DeepSeek V3.11001001001005190.3%
Claude Sonnet 4.51009690828290.0%
GPT-4o, Aug. 6th (temp=1)10010099885989.1%
Claude Sonnet 4.6100100100944988.5%
GPT-4o, May 13th (temp=0)10010098836088.1%
Qwen 3 32B100100100726888.0%
Llama 3.1 70B1001001001003887.6%
GPT-4o, May 13th (temp=1)100100100805887.4%
Gemini 3 Flash (Preview)10010090836387.2%
Cohere Command R+ (Aug. 2024)100100100765886.8%
Gemini 2.5 Flash Lite10010083806986.6%
DeepSeek V3 (2024-12-26)10010097825486.4%
Qwen 3.5 35B10010092726786.2%
Qwen 3.5 122B1009481767585.2%
Grok 4.20 (Beta)1008982806983.9%
Z.AI GLM 5928988826182.5%
Claude Haiku 4.510010082814481.4%
Claude 3.7 Sonnet1009473725879.3%
Qwen 3.5 Flash100100100791178.2%
WizardLM 2 8x22b1009475714877.7%
MiniMax M2.71009268666277.5%
Arcee AI: Trinity Mini10010010068073.4%
Z.AI GLM 4.51009381603373.4%
GPT-5 Nano1009163624873.0%
Qwen 3.5 Plus (2026-02-15)969272624072.6%
ByteDance Seed 2.0 Mini10010082423772.2%
GPT-4.1 Nano1009862583770.8%
Llama 3.1 Nemotron 70B1007166664369.2%
Gemini 3 Flash (Preview, Reasoning)968373563067.6%
Grok 4 Fast898468633267.0%
Claude Opus 4.5100967160766.7%
Claude 3.5 Haiku988671441963.7%
LFM2 24B867467484263.3%
Gemini 3.1 Flash Lite (Preview)1006355484862.8%
MoonshotAI: Kimi K2.5977066493262.7%
Claude 3.5 Sonnet91866859561.7%
Qwen 2.5 72B896256453357.1%
ByteDance Seed 2.0 Lite876866411655.4%
Claude Sonnet 49789840054.0%
Hermes 3 70B636055371045.1%
Nemotron 3 Super100633322043.8%
ByteDance Seed 1.69870446043.5%
Inception Mercury1002400024.8%
Llama 3.1 8B454100017.1%
Stealth: Aurora Alpha51500011.2%
Nemotron 3 Nano24175009.0%
Inception Mercury 2151212308.4%

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
o4 Mini High100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
GPT-5.4100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Hermes 3 405B100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3.2100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Grok 4 Fast1001001001009899.6%
o4 Mini1001001001009599.0%
GPT-5.21001001001009599.0%
Aion 2.01001001001009498.8%
Stealth: Healer Alpha100100100989598.5%
DeepSeek V3 (2024-12-26)1001001001009298.4%
GPT-5.4 Mini (Reasoning, Low)100100100999298.2%
Mistral Large1001001001009198.2%
Ministral 3 3B1001001001009198.2%
LFM2 24B1001001001009198.2%
GPT-4o, May 13th (temp=0)1001001001009198.1%
GPT-5 Mini1001001001008797.5%
Claude Sonnet 4.61001001001008697.3%
Mistral Large 31001001001008697.1%
GPT-5.4 Nano (Reasoning, Low)10010099988997.1%
GPT-5.4 Nano (Reasoning)100100100939296.9%
Gemma 3 27B1001001001008596.9%
Z.AI GLM 4.71001001001008396.6%
GPT-4.11001001001008396.5%
Arcee AI: Trinity Mini1001001001008296.4%
Qwen 3.5 122B10010099928895.9%
Claude Haiku 4.5100100100968295.8%
GPT-5.4 Nano100100100938595.6%
MiniMax M2.71001001001007895.6%
Ministral 3B100100100898995.6%
Gemini 3 Pro (Preview)100100100997995.5%
GPT-4o Mini (temp=1)1001001001007895.5%
Gemma 3 4B100100100928595.4%
Ministral 3 8B1001001001007695.2%
Stealth: Hunter Alpha100100100938094.6%
MoonshotAI: Kimi K2.5100100100977594.4%
GPT-4o, Aug. 6th (temp=0)10010099878694.4%
Gemini 3 Flash (Preview)10010097908394.1%
Claude 3 Haiku1001001001006893.5%
Qwen 3.5 9B100100100878193.5%
Claude 3.7 Sonnet1001001001006693.2%
Hermes 3 70B100100100858193.1%
GPT-5.4 Mini100100100838293.0%
Grok 4.20 (Beta)1001001001006492.9%
Z.AI GLM 5 Turbo100100100897392.5%
Arcee AI: Trinity Large (Preview)1001001001006292.4%
Z.AI GLM 5100100100837992.4%
Claude Sonnet 4100100100976391.9%
Mistral NeMO1001001001005991.8%
MiniMax M2.5100100100877191.7%
Gemini 2.5 Flash1001001001005891.6%
Z.AI GLM 4.7 Flash10010094847991.5%
GPT-5100100100886991.4%
ByteDance Seed 1.6 Flash1001001001005390.6%
GPT-5.4 Mini (Reasoning)10010099807490.5%
Gemma 3 12B10010089877690.4%
Grok 4.20 (Beta, Reasoning)1009789897489.8%
Gemini 2.5 Flash (Reasoning)10010091906789.5%
Claude 3.5 Sonnet100100100766989.1%
Gemini 2.5 Pro100100100816388.8%
Claude 3.5 Haiku1001001001004188.2%
Qwen 2.5 72B10010098865287.2%
Gemini 3 Flash (Preview, Reasoning)100100100755585.9%
Gemini 3.1 Pro (Preview)100100100686185.8%
Qwen 3.5 35B10010086845885.4%
GPT-4.1 Nano100100100903585.0%
Z.AI GLM 4.5100100100972584.4%
Qwen 3.5 Flash100100100753782.4%
Gemini 2.5 Flash Lite (Reasoning)1009881795282.2%
GPT-4o, Aug. 6th (temp=1)10010098902281.8%
Gemini 2.5 Flash Lite10010094713580.1%
Rocinante 12B100100100100080.0%
Z.AI GLM 4.6100100100901080.0%
ByteDance Seed 1.610010083803679.9%
ByteDance Seed 2.0 Lite10010081665279.6%
DeepSeek V3.110010098652577.5%
Qwen 3.5 27B989190663876.7%
Qwen 3.5 Plus (2026-02-15)10010086504275.6%
Qwen 3.5 397B A17B998577625375.1%
Ministral 8B10010010068073.6%
Claude Opus 4.510010064623371.8%
WizardLM 2 8x22b100948354066.3%
Gemini 3.1 Flash Lite (Preview)948868611264.6%
GPT-4o Mini (temp=0)777167632861.2%
ByteDance Seed 2.0 Mini100827348360.9%
GPT-5 Nano858467343460.9%
Inception Mercury100100390047.8%
Llama 3.1 8B764440393346.5%
Llama 3.1 70B9483503046.1%
Nemotron 3 Super7148380031.3%
Llama 3.1 Nemotron 70B10033200030.7%
Stealth: Aurora Alpha473400016.2%
Inception Mercury 24300008.6%
Nemotron 3 Nano730002.0%

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
GPT-5.1100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
Gemini 2.5 Pro100100100100100100.0%
o4 Mini100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Rocinante 12B100100100100100100.0%
GPT-510010010010010099.9%
Cohere Command R+ (Aug. 2024)1001001001009899.6%
Gemma 3 27B1001001001009699.3%
GPT-5.4 Mini1001001001009699.3%
GPT-5.41001001001009699.2%
GPT-4o Mini (temp=1)1001001001009699.1%
GPT-5.4 (Reasoning)1001001001009498.8%
Mistral Large 21001001001009398.6%
Ministral 8B1001001001009298.5%
Mistral Large 31001001001009298.4%
Ministral 3B1001001001008997.8%
Ministral 3 8B100100100959497.8%
Gemma 3 12B100100100999097.7%
Z.AI GLM 4.71001001001008697.2%
DeepSeek V3.11001001001008597.0%
Qwen 3.5 Flash100100100988696.8%
o4 Mini High100100100958796.5%
Mistral NeMO100100100948796.2%
GPT-4.1100100100968596.0%
GPT-5.4 Nano (Reasoning)100100100968496.0%
ByteDance Seed 1.6 Flash1001001001007995.8%
Stealth: Healer Alpha100100100958395.7%
Claude Opus 4100100100968295.5%
GPT-4o Mini (temp=0)100100100978195.5%
GPT-5.4 Nano (Reasoning, Low)100100100928194.6%
Gemini 2.5 Flash (Reasoning)100100100957694.2%
GPT-5 Mini100100100927994.1%
GPT-5.210010095918293.6%
GPT-4o, Aug. 6th (temp=1)1001001001006893.5%
Stealth: Hunter Alpha10010095947693.1%
DeepSeek-V2 Chat10010097888193.1%
Qwen 3.5 122B10010096878192.8%
Claude Opus 4.6100100100897592.7%
Claude Sonnet 4.510010099947192.6%
Qwen 3 32B100100100986492.5%
Z.AI GLM 4.61009793888392.3%
DeepSeek V3 (2024-12-26)100100100916691.3%
Claude Opus 4.6 (Reasoning)10010099946391.1%
Grok 410010090838291.1%
Mistral Large100100100787790.9%
Z.AI GLM 4.7 Flash10010091877690.8%
GPT-4.1 Nano100100100965289.6%
Qwen 3.5 27B1009890807989.5%
Grok 4.20 (Beta, Reasoning)1009388838289.2%
GPT-5.4 Nano10010096945489.0%
GPT-4o, May 13th (temp=1)1009694935988.5%
Claude Sonnet 4.6 (Reasoning)10010092846388.0%
Grok 4.20 (Beta)10010094806187.1%
Claude Sonnet 4.6100100100914186.4%
DeepSeek V3 (2025-03-24)100100100735986.3%
Qwen 2.5 72B100100100795386.2%
Qwen 3.5 35B100100100814685.3%
Aion 2.010010091765584.4%
Hermes 3 405B10010081717084.3%
Qwen 3.5 397B A17B10010099804184.0%
DeepSeek V3.2100100100971782.9%
LFM2 24B979789665680.8%
Gemini 2.5 Flash100100100100080.0%
Gemma 3 4B1009487625279.2%
Gemini 2.5 Flash Lite949084646078.5%
Grok 4 Fast100999796078.3%
Claude Opus 4.5988581656278.2%
Claude Haiku 4.51008686625677.9%
GPT-4o, Aug. 6th (temp=0)1009872685177.8%
MiniMax M2.710010083535277.5%
Arcee AI: Trinity Large (Preview)1001009486075.9%
Gemini 2.5 Flash Lite (Reasoning)10010084573875.8%
Ministral 3 3B1001008982074.2%
Z.AI GLM 51008766585773.7%
Qwen 3.5 Plus (2026-02-15)797773686171.9%
ByteDance Seed 2.0 Mini10010077631971.7%
Mistral Small 3.2 24B1001007371068.9%
Arcee AI: Trinity Mini87867674064.6%
GPT-5 Nano100988040063.5%
MiniMax M2.510010055511063.2%
Claude 3 Haiku1009368292162.1%
Hermes 3 70B100866247058.8%
Gemini 3 Flash (Preview)96726340855.9%
Inception Mercury100944613050.7%
Llama 3.1 Nemotron 70B78737130050.5%
MoonshotAI: Kimi K2.598715924050.2%
Claude 3.7 Sonnet717065222049.5%
Claude 3.5 Haiku1005936331648.7%
ByteDance Seed 1.6696851451148.6%
ByteDance Seed 2.0 Lite73685443047.5%
Llama 3.1 8B76755915045.0%
Z.AI GLM 4.576745124044.8%
Grok 4.1 Fast96942011044.3%
Gemini 3.1 Flash Lite (Preview)585846361342.0%
Llama 3.1 70B8282380040.5%
WizardLM 2 8x22b10010000040.0%
Gemini 3 Flash (Preview, Reasoning)6342310027.1%
Claude 3.5 Sonnet5441240023.6%
Claude Sonnet 42818104011.7%
Nemotron 3 Super1183004.4%
Stealth: Aurora Alpha900001.7%
Inception Mercury 2300000.6%
Nemotron 3 Nano000000.0%

Novelcrafter Default Prompt

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-5.4100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Ministral 3B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)1001001001009999.8%
LFM2 24B1001001001009899.6%
Claude Sonnet 4.61001001001009799.4%
Qwen 3.5 27B1001001001009598.9%
Z.AI GLM 4.71001001001009598.9%
Mistral Small 4 (Reasoning)100100100999598.9%
Qwen 3 32B1001001001009498.7%
Rocinante 12B1001001001008897.6%
Mistral Large 21001001001008797.5%
Qwen 3.5 122B1001001001008396.6%
GPT-5.4 Mini (Reasoning, Low)100100100938996.5%
GPT-4o Mini (temp=1)1001001001007995.8%
Stealth: Hunter Alpha1001001001007695.2%
Gemini 2.5 Flash (Reasoning)100100100948094.9%
Ministral 3 8B1001001001007494.7%
Gemma 3 27B1001001001007394.5%
GPT-5 Mini100100100868494.1%
ByteDance Seed 2.0 Mini1001001001007093.8%
Gemini 3 Pro (Preview)100100100937493.4%
Qwen 3.5 Plus (2026-02-15)10010098976792.3%
Grok 4.20 (Beta)10010093848392.2%
Claude Sonnet 4.51001001001005691.3%
WizardLM 2 8x22b100100100866390.0%
Z.AI GLM 4.7 Flash1001001001004989.8%
GPT-4o Mini (temp=0)1009489887589.1%
ByteDance Seed 1.61001001001004188.2%
GPT-4o, Aug. 6th (temp=0)100100100746387.4%
Ministral 8B100100100983887.1%
DeepSeek V3.110010086805684.5%
Gemini 2.5 Pro10010081756684.2%
DeepSeek V3.21009475746481.4%
GPT-4.1 Mini10010099852281.1%
Z.AI GLM 5 Turbo1009586813980.2%
Claude Opus 4.51009694812178.3%
DeepSeek-V2 Chat10010097741978.1%
DeepSeek V3 (2024-12-26)10010075713976.9%
Gemini 3.1 Flash Lite (Preview)1008379714976.4%
Gemini 3 Flash (Preview)1009274664575.5%
Aion 2.0998869635875.3%
GPT-4o, Aug. 6th (temp=1)1007370696375.1%
Z.AI GLM 51009790731575.1%
Stealth: Healer Alpha10010077712474.2%
Claude Haiku 4.510010081611571.2%
Gemini 3 Flash (Preview, Reasoning)999276443569.3%
GPT-4o, May 13th (temp=0)977873494368.0%
Arcee AI: Trinity Mini1008978521366.4%
Gemma 3 4B998173611365.4%
Mistral NeMO1001009426464.8%
GPT-4.1 Nano1007769671164.6%
ByteDance Seed 2.0 Lite1001009924064.6%
Claude 3 Haiku1008983361364.3%
Claude Opus 41007167453964.3%
Hermes 3 405B10010010016063.2%
Qwen 2.5 72B998969312061.6%
Nemotron 3 Super897848464461.0%
Z.AI GLM 4.6988563342260.6%
MiniMax M2.595836652059.3%
Gemma 3 12B1006844424259.3%
Gemini 2.5 Flash987151353457.8%
Hermes 3 70B10010039331557.4%
Claude 3.5 Sonnet100100739056.4%
Cohere Command R+ (Aug. 2024)100787619756.0%
GPT-4o, May 13th (temp=1)716765431652.2%
GPT-5 Nano786943421649.7%
MiniMax M2.7725451341144.5%
Claude 3.7 Sonnet98524115842.9%
Inception Mercury 259514918035.3%
Llama 3.1 Nemotron 70B886850032.0%
Mistral Small 3.2 24B10042180031.9%
Arcee AI: Trinity Large (Preview)10029270031.3%
Gemini 2.5 Flash Lite765070026.5%
Stealth: Aurora Alpha762950021.9%
Gemini 2.5 Flash Lite (Reasoning)81700017.5%
Claude Sonnet 4543300017.3%
Z.AI GLM 4.565000013.0%
Claude 3.5 Haiku481300012.1%
Llama 3.1 70B49300010.5%
Nemotron 3 Nano3800007.5%
Llama 3.1 8B2000004.1%
Inception Mercury600001.1%

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5 Mini100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
o4 Mini High100100100100100100.0%
GPT-5.2100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Stealth: Hunter Alpha100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 99.9%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.4%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.4%
Mistral Small 4 | 100 | 100 | 100 | 100 | 96 | 99.3%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 96 | 99.3%
Ministral 3 8B | 100 | 100 | 100 | 100 | 96 | 99.2%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 98 | 96 | 98.8%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 93 | 98.6%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 93 | 98.6%
Aion 2.0 | 100 | 100 | 100 | 100 | 92 | 98.5%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 92 | 98.5%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 97 | 94 | 98.3%
Hermes 3 405B | 100 | 100 | 100 | 99 | 89 | 97.6%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 85 | 96.9%
Ministral 8B | 100 | 100 | 100 | 100 | 84 | 96.8%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 83 | 96.6%
Ministral 3B | 100 | 100 | 100 | 100 | 81 | 96.1%
Claude Sonnet 4.5 | 100 | 100 | 100 | 94 | 87 | 96.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 80 | 96.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 80 | 96.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 80 | 96.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 96 | 83 | 95.9%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 94 | 86 | 95.9%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 76 | 95.2%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 98 | 78 | 95.1%
DeepSeek V3.1 | 100 | 100 | 100 | 97 | 76 | 94.6%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 73 | 94.6%
GPT-4.1 Mini | 100 | 100 | 100 | 93 | 77 | 94.0%
Gemma 3 27B | 100 | 100 | 97 | 87 | 80 | 92.7%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 97 | 66 | 92.6%
WizardLM 2 8x22b | 100 | 100 | 96 | 92 | 67 | 91.2%
Claude Sonnet 4.6 | 100 | 100 | 100 | 85 | 67 | 90.3%
Claude Opus 4 | 100 | 100 | 98 | 75 | 74 | 89.5%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 93 | 81 | 72 | 89.3%
MiniMax M2.7 | 100 | 100 | 96 | 96 | 52 | 88.9%
Z.AI GLM 4.6 | 100 | 100 | 84 | 82 | 77 | 88.7%
Ministral 3 3B | 100 | 100 | 100 | 100 | 32 | 86.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 73 | 54 | 85.5%
Stealth: Healer Alpha | 100 | 98 | 82 | 77 | 69 | 85.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 84 | 42 | 85.2%
Claude Opus 4.5 | 100 | 100 | 100 | 93 | 31 | 84.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 14 | 82.9%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 95 | 64 | 55 | 82.8%
Mistral NeMO | 100 | 100 | 100 | 96 | 0 | 79.2%
Gemini 2.5 Flash Lite | 100 | 92 | 75 | 54 | 52 | 74.8%
Hermes 3 70B | 100 | 100 | 86 | 82 | 0 | 73.6%
Gemini 2.5 Flash (Reasoning) | 97 | 93 | 83 | 51 | 34 | 71.5%
Z.AI GLM 4.5 | 100 | 92 | 87 | 44 | 30 | 70.7%
Claude 3.5 Sonnet | 100 | 100 | 68 | 50 | 35 | 70.5%
Gemma 3 4B | 92 | 86 | 81 | 80 | 10 | 69.6%
Rocinante 12B | 100 | 100 | 70 | 55 | 0 | 64.9%
Gemma 3 12B | 100 | 98 | 77 | 27 | 22 | 64.7%
Llama 3.1 8B | 100 | 92 | 81 | 48 | 0 | 64.2%
GPT-4o, May 13th (temp=1) | 100 | 75 | 62 | 44 | 35 | 63.2%
Claude 3.7 Sonnet | 88 | 88 | 60 | 47 | 23 | 61.3%
Gemini 2.5 Flash | 100 | 100 | 54 | 47 | 0 | 60.0%
Nemotron 3 Super | 100 | 56 | 51 | 42 | 22 | 54.2%
Arcee AI: Trinity Mini | 100 | 76 | 75 | 16 | 0 | 53.3%
Gemini 2.5 Flash Lite (Reasoning) | 82 | 82 | 61 | 30 | 0 | 51.1%
Mistral Small 3.2 24B | 100 | 100 | 0 | 0 | | 50.0%
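Arcee AI: Trinity Large (Preview) | 100 | 100 | 27 | 0 | 0 | 45.3%
Claude Sonnet 4 | 71 | 51 | 48 | 36 | 19 | 44.9%
Claude 3 Haiku | 88 | 84 | 51 | 0 | 0 | 44.5%
Claude 3.5 Haiku | 86 | 86 | 51 | 0 | 0 | 44.4%
Inception Mercury | 100 | 42 | 0 | 0 | 0 | 28.5%
GPT-4o, May 13th (temp=0) | 75 | 55 | 0 | 0 | 0 | 26.2%
GPT-4o, Aug. 6th (temp=0) | 67 | 48 | 0 | 0 | 0 | 22.9%
GPT-5 Nano | 40 | 32 | 26 | 5 | 2 | 21.2%
Qwen 2.5 72B | 30 | 25 | 19 | 19 | 0 | 18.9%
Llama 3.1 Nemotron 70B | 30 | 21 | 10 | 0 | 0 | 12.3%
Stealth: Aurora Alpha | 36 | 14 | 0 | 0 | 0 | 10.0%
Inception Mercury 2 | 11 | 0 | 0 | 0 | 0 | 2.3%
Llama 3.1 70B | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%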
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 99 | 99.8%
LFM2 24B | 100 | 100 | 100 | 100 | 98 | 99.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 98 | 99.5%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 97 | 99.4%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 97 | 99.4%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 97 | 99.4%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 97 | 99.4%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 97 | 99.4%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 97 | 99.3%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 95 | 99.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 95 | 99.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 95 | 99.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 95 | 99.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 93 | 98.6%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 93 | 98.5%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 92 | 98.5%
GPT-5 | 100 | 100 | 100 | 100 | 92 | 98.4%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 91 | 98.1%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 90 | 98.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 89 | 97.8%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 88 | 97.5%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 87 | 97.4%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 86 | 97.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 86 | 97.1%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 85 | 97.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 85 | 96.9%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 83 | 96.7%
Hermes 3 70B | 100 | 100 | 100 | 100 | 83 | 96.7%
Claude Opus 4.6 | 100 | 100 | 100 | 91 | 88 | 95.8%
Aion 2.0 | 100 | 100 | 100 | 95 | 82 | 95.4%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 76 | 95.2%
Claude 3.7 Sonnet | 100 | 100 | 100 | 90 | 83 | 94.7%
Ministral 3B | 100 | 100 | 100 | 92 | 79 | 94.1%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 70 | 94.0%
Mistral NeMO | 100 | 100 | 100 | 96 | 74 | 93.9%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 68 | 93.5%
Gemma 3 4B | 100 | 100 | 100 | 89 | 79 | 93.5%
Qwen 2.5 72B | 100 | 100 | 100 | 94 | 68 | 92.4%
Gemma 3 12B | 100 | 100 | 100 | 100 | 60 | 91.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 59 | 91.8%
Z.AI GLM 5 | 100 | 100 | 100 | 99 | 58 | 91.4%
Claude Opus 4 | 100 | 100 | 100 | 100 | 54 | 90.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 45 | 89.1%
ByteDance Seed 1.6 | 100 | 100 | 100 | 83 | 50 | 86.6%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 73 | 55 | 85.6%
Claude 3.5 Haiku | 100 | 100 | 90 | 81 | 54 | 84.9%
Inception Mercury 2 | 100 | 100 | 73 | 69 | 60 | 80.3%
Nemotron 3 Super | 100 | 100 | 100 | 59 | 33 | 78.3%
Llama 3.1 70B | 100 | 93 | 79 | 70 | 3 | 69.0%
Inception Mercury | 100 | 100 | 100 | 28 | 10 | 67.7%
Stealth: Aurora Alpha | 100 | 98 | 90 | 50 | 0 | 67.5%
Llama 3.1 Nemotron 70B | 81 | 78 | 73 | 70 | 33 | 66.9%
GPT-5 Nano | 100 | 77 | 76 | 74 | 4 | 66.2%
Llama 3.1 8B | 100 | 100 | 31 | 0 | 0 | 46.2%
Nemotron 3 Nano | 78 | 36 | 33 | 28 | 0 | 34.9%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 99 | 99.8%
o4 Mini | 100 | 100 | 100 | 100 | 97 | 99.4%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 97 | 99.4%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 97 | 99.4%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 96 | 99.3%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 96 | 99.1%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 95 | 99.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 94 | 98.9%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 94 | 98.6%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 93 | 98.6%
Gemma 3 27B | 100 | 100 | 100 | 100 | 93 | 98.6%
GPT-5.1 | 100 | 100 | 100 | 100 | 93 | 98.5%
Hermes 3 405B | 100 | 100 | 100 | 100 | 93 | 98.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 92 | 98.5%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 90 | 98.1%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 90 | 98.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 89 | 97.9%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 89 | 97.8%
Gemini 2.5 Pro | 100 | 100 | 100 | 94 | 94 | 97.7%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 88 | 97.7%
GPT-5 Mini | 100 | 100 | 100 | 100 | 88 | 97.5%
Grok 4 Fast | 100 | 100 | 100 | 97 | 90 | 97.4%
GPT-4.1 | 100 | 100 | 100 | 100 | 87 | 97.4%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 86 | 97.3%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 86 | 97.1%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 84 | 96.7%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 82 | 96.5%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 82 | 96.3%
Ministral 3B | 100 | 100 | 100 | 100 | 81 | 96.1%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 95 | 94 | 92 | 96.1%
Grok 4 | 100 | 100 | 100 | 92 | 88 | 96.1%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 92 | 88 | 95.9%
MiniMax M2.5 | 100 | 100 | 100 | 94 | 85 | 95.8%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 98 | 80 | 95.6%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 92 | 86 | 95.5%
Mistral NeMO | 100 | 100 | 100 | 100 | 76 | 95.2%
Mistral Large | 100 | 100 | 100 | 100 | 75 | 95.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 73 | 94.5%
Qwen 3 32B | 100 | 100 | 100 | 92 | 80 | 94.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 70 | 94.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 96 | 73 | 93.7%
Z.AI GLM 5 Turbo | 100 | 100 | 92 | 89 | 86 | 93.4%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 82 | 82 | 92.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 98 | 61 | 91.9%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 85 | 70 | 91.0%
GPT-4.1 Nano | 100 | 100 | 100 | 88 | 62 | 90.1%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 94 | 81 | 75 | 89.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 49 | 89.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 90 | 59 | 89.8%
Ministral 3 3B | 100 | 100 | 100 | 90 | 57 | 89.4%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 94 | 52 | 89.3%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 83 | 60 | 88.6%
LFM2 24B | 100 | 100 | 100 | 89 | 54 | 88.4%
Z.AI GLM 4.5 | 100 | 94 | 89 | 82 | 76 | 88.1%
MiniMax M2.7 | 100 | 91 | 89 | 89 | 67 | 87.1%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 83 | 51 | 86.9%
Ministral 3 8B | 100 | 100 | 92 | 92 | 50 | 86.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 90 | 73 | 70 | 86.5%
Gemma 3 4B | 100 | 100 | 97 | 68 | 61 | 85.1%
Inception Mercury | 100 | 100 | 100 | 94 | 30 | 84.8%
Claude Opus 4.5 | 99 | 90 | 86 | 71 | 71 | 83.6%
Claude 3 Haiku | 100 | 100 | 100 | 91 | 27 | 83.5%
Mistral Small 3.2 24B | 100 | 100 | 100 | 90 | 25 | 83.1%
GPT-5 Nano | 100 | 100 | 91 | 71 | 45 | 81.4%
Nemotron 3 Super | 100 | 100 | 100 | 73 | 33 | 81.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 93 | 7 | 79.9%
Gemini 2.5 Flash Lite | 96 | 95 | 79 | 72 | 51 | 78.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 81 | 80 | 28 | 77.6%
Claude 3.7 Sonnet | 100 | 75 | 71 | 70 | 67 | 76.5%
Claude Haiku 4.5 | 100 | 100 | 100 | 79 | 0 | 75.9%
Claude 3.5 Haiku | 100 | 98 | 98 | 71 | 9 | 75.4%
DeepSeek V3.1 | 100 | 100 | 65 | 54 | 45 | 72.7%
Gemini 3 Flash (Preview) | 100 | 83 | 74 | 54 | 36 | 69.4%
Rocinante 12B | 100 | 100 | 100 | 33 | 0 | 66.6%
ByteDance Seed 1.6 | 100 | 100 | 48 | 44 | 39 | 66.2%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 61 | 43 | 26 | 66.0%
Inception Mercury 2 | 96 | 61 | 58 | 54 | 53 | 64.5%
Gemini 3 Flash (Preview, Reasoning) | 100 | 66 | 66 | 51 | 39 | 64.4%
Llama 3.1 8B | 100 | 94 | 61 | 51 | 0 | 61.3%
Hermes 3 70B | 100 | 59 | 56 | 29 | 22 | 53.1%
Qwen 2.5 72B | 63 | 59 | 54 | 42 | 36 | 50.7%
Claude Sonnet 4 | 100 | 100 | 20 | 19 | 2 | 48.1%
Llama 3.1 70B | 96 | 71 | 48 | 24 | 0 | 47.8%
Llama 3.1 Nemotron 70B | 100 | 68 | 51 | 13 | 0 | 46.3%
Claude 3.5 Sonnet | 79 | 63 | 59 | 7 | 0 | 41.7%
Stealth: Aurora Alpha | 93 | 37 | 16 | 4 | 0 | 30.0%
Nemotron 3 Nano | 66 | 16 | 8 | 5 | 4 | 19.9%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 99 | 99.9%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 98 | 99.5%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 96 | 99.2%
Grok 4 Fast | 100 | 100 | 100 | 100 | 95 | 98.9%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 94 | 98.8%
Ministral 3 3B | 100 | 100 | 100 | 100 | 94 | 98.7%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 99 | 93 | 98.4%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 92 | 98.4%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 92 | 98.3%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 91 | 98.3%
Claude Opus 4.5 | 100 | 100 | 100 | 96 | 95 | 98.2%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 91 | 98.2%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 91 | 98.2%
Stealth: Aurora Alpha | 100 | 100 | 100 | 99 | 91 | 98.0%
LFM2 24B | 100 | 100 | 100 | 100 | 89 | 97.9%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 88 | 97.7%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 88 | 97.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 87 | 97.4%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 85 | 97.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 96 | 87 | 96.7%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 83 | 96.7%
Ministral 3 8B | 100 | 100 | 100 | 100 | 82 | 96.5%
Gemma 3 4B | 100 | 100 | 100 | 100 | 81 | 96.1%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 79 | 95.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 78 | 95.6%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 77 | 95.5%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 75 | 95.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 71 | 94.3%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 87 | 82 | 93.8%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 68 | 93.5%
Claude Opus 4 | 100 | 100 | 95 | 94 | 75 | 92.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 89 | 73 | 92.5%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 85 | 77 | 92.4%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 62 | 92.4%
ByteDance Seed 1.6 | 100 | 100 | 100 | 87 | 74 | 92.0%
Mistral NeMO | 100 | 100 | 100 | 88 | 71 | 91.7%
Stealth: Healer Alpha | 100 | 100 | 100 | 78 | 78 | 91.2%
MiniMax M2.7 | 100 | 100 | 99 | 90 | 67 | 91.1%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 54 | 90.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 54 | 90.7%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 80 | 71 | 90.3%
Aion 2.0 | 100 | 100 | 94 | 92 | 64 | 90.1%
Claude Sonnet 4.6 | 100 | 100 | 100 | 89 | 60 | 89.9%
Hermes 3 70B | 100 | 100 | 100 | 78 | 71 | 89.9%
Claude Haiku 4.5 | 100 | 100 | 100 | 98 | 46 | 88.7%
DeepSeek V3.2 | 100 | 100 | 94 | 79 | 63 | 87.0%
Gemini 2.5 Flash Lite | 100 | 93 | 89 | 83 | 68 | 86.5%
Gemma 3 12B | 100 | 100 | 100 | 67 | 63 | 86.1%
Gemini 2.5 Flash (Reasoning) | 94 | 94 | 91 | 75 | 67 | 84.1%
Gemini 3 Flash (Preview) | 100 | 87 | 84 | 81 | 68 | 84.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 19 | 83.7%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 76 | 75 | 67 | 83.5%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 91 | 24 | 82.9%
DeepSeek V3.1 | 100 | 100 | 100 | 72 | 41 | 82.6%
Inception Mercury | 100 | 100 | 100 | 100 | 0 | 80.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 62 | 36 | 79.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 90 | 72 | 33 | 79.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 93 | 92 | 2 | 77.4%
Rocinante 12B | 100 | 100 | 98 | 54 | 24 | 75.2%
Claude Sonnet 4 | 100 | 100 | 100 | 74 | 0 | 74.7%
Gemini 3 Flash (Preview, Reasoning) | 96 | 95 | 67 | 47 | 41 | 69.1%
Llama 3.1 8B | 100 | 100 | 98 | 44 | 0 | 68.5%
GPT-5 Nano | 100 | 91 | 81 | 30 | 29 | 66.1%
Nemotron 3 Super | 100 | 94 | 82 | 30 | 13 | 64.0%
Llama 3.1 70B | 100 | 88 | 78 | 21 | 17 | 60.7%
Nemotron 3 Nano | 100 | 61 | 53 | 39 | 0 | 50.7%
Llama 3.1 Nemotron 70B | 100 | 61 | 51 | 0 | 0 | 42.4%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 98 | 99.6%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 98 | 99.5%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.2%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 96 | 99.2%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 95 | 98.9%
Mistral Small 4 | 100 | 100 | 100 | 100 | 94 | 98.9%
Ministral 3B | 100 | 100 | 100 | 98 | 97 | 98.8%
Aion 2.0 | 100 | 100 | 100 | 100 | 94 | 98.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 99 | 94 | 98.7%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 93 | 98.5%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 92 | 98.5%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 90 | 98.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 96 | 94 | 98.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 86 | 97.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 86 | 97.3%
Grok 4 | 100 | 100 | 100 | 100 | 86 | 97.1%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 86 | 97.1%
Gemma 3 12B | 100 | 100 | 100 | 98 | 86 | 96.9%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 84 | 96.8%
Ministral 8B | 100 | 100 | 100 | 100 | 83 | 96.5%
Mistral Small 4 (Reasoning) | 100 | 100 | 97 | 92 | 92 | 96.3%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 97 | 84 | 96.2%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 81 | 96.1%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 80 | 96.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 98 | 91 | 90 | 95.7%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 78 | 95.6%
GPT-5 | 100 | 100 | 100 | 94 | 83 | 95.4%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 77 | 95.4%
Qwen 3.5 35B | 100 | 100 | 100 | 98 | 78 | 95.1%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 75 | 95.0%
Z.AI GLM 5 | 100 | 100 | 98 | 96 | 81 | 94.9%
Claude Sonnet 4.5 | 100 | 100 | 100 | 95 | 77 | 94.4%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 71 | 94.3%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 88 | 82 | 94.1%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 88 | 82 | 93.9%
Qwen 3.5 Flash | 100 | 100 | 100 | 93 | 76 | 93.8%
Claude Sonnet 4.6 | 100 | 100 | 98 | 92 | 79 | 93.6%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 65 | 93.0%
DeepSeek V3.2 | 100 | 100 | 100 | 87 | 76 | 92.5%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 94 | 68 | 92.3%
Qwen 3 32B | 100 | 100 | 100 | 99 | 62 | 92.2%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 97 | 64 | 92.2%
MiniMax M2.7 | 100 | 100 | 100 | 88 | 72 | 92.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 58 | 91.6%
Gemini 2.5 Pro | 100 | 100 | 100 | 85 | 73 | 91.5%
Mistral NeMO | 100 | 99 | 97 | 83 | 76 | 91.0%
Hermes 3 405B | 100 | 100 | 100 | 83 | 71 | 91.0%
Gemma 3 27B | 100 | 100 | 100 | 76 | 74 | 90.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 45 | 89.0%
Stealth: Healer Alpha | 100 | 100 | 90 | 86 | 69 | 88.8%
GPT-5 Mini | 100 | 98 | 91 | 88 | 65 | 88.4%
LFM2 24B | 100 | 88 | 88 | 86 | 79 | 88.3%
Gemma 3 4B | 100 | 100 | 98 | 71 | 70 | 87.6%
Cohere Command R+ (Aug. 2024) | 98 | 97 | 93 | 80 | 69 | 87.3%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 86 | 84 | 59 | 85.7%
MoonshotAI: Kimi K2.5 | 100 | 100 | 86 | 81 | 61 | 85.7%
WizardLM 2 8x22b | 100 | 100 | 92 | 91 | 44 | 85.5%
Z.AI GLM 4.6 | 100 | 100 | 100 | 74 | 52 | 85.2%
Hermes 3 70B | 100 | 100 | 100 | 75 | 49 | 84.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 65 | 56 | 84.2%
Gemini 3 Flash (Preview) | 100 | 94 | 82 | 74 | 66 | 83.5%
MiniMax M2.5 | 100 | 86 | 85 | 72 | 69 | 82.4%
Mistral Small 3.2 24B | 100 | 100 | 100 | 94 | 18 | 82.3%
Claude Opus 4.5 | 100 | 96 | 83 | 77 | 54 | 81.9%
Claude 3.5 Sonnet | 100 | 100 | 86 | 76 | 33 | 79.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 53 | 30 | 76.5%
Claude 3.5 Haiku | 100 | 86 | 78 | 59 | 54 | 75.2%
GPT-5 Nano | 100 | 100 | 100 | 56 | 15 | 74.1%
ByteDance Seed 2.0 Lite | 100 | 89 | 86 | 79 | 0 | 70.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 87 | 67 | 62 | 33 | 69.8%
Gemini 3 Flash (Preview, Reasoning) | 100 | 80 | 80 | 46 | 34 | 67.9%
Inception Mercury | 100 | 100 | 100 | 33 | 0 | 66.6%
Claude 3.7 Sonnet | 100 | 88 | 57 | 53 | 32 | 65.8%
ByteDance Seed 2.0 Mini | 100 | 86 | 65 | 57 | 0 | 61.6%
Arcee AI: Trinity Large (Preview) | 100 | 88 | 71 | 41 | 0 | 60.0%
Z.AI GLM 4.5 | 98 | 70 | 63 | 39 | 17 | 57.5%
Inception Mercury 2 | 74 | 69 | 64 | 63 | 14 | 56.9%
Llama 3.1 8B | 100 | 94 | 54 | 30 | 0 | 55.4%
Llama 3.1 70B | 89 | 71 | 71 | 24 | 21 | 55.4%
Nemotron 3 Nano | 98 | 79 | 50 | 19 | 0 | 49.3%
ByteDance Seed 1.6 | 92 | 88 | 35 | 16 | 5 | 47.2%
Nemotron 3 Super | 88 | 58 | 54 | 33 | 0 | 46.5%
Gemini 3.1 Flash Lite (Preview) | 82 | 54 | 46 | 33 | 0 | 43.0%
Llama 3.1 Nemotron 70B | 100 | 70 | 33 | 10 | 0 | 42.5%
Stealth: Aurora Alpha | 50 | 40 | 17 | 16 | 12 | 27.1%
Claude Sonnet 4 | 59 | 33 | 15 | 0 | 0 | 21.3%