Pronoun-first sentence starts

Test: Bad Writing Habits

Avg. Score: 73.6%
Scenarios: 18

Overall Performance

| Rank | Model | Score | Avg. Cost | Avg. Time | Stability |
|---|---|---|---|---|---|
| 1 | Claude 3.5 Haiku | 97.7% | $0.0035 | 10.8s | 87% |
| 2 | Llama 3.1 Nemotron 70B | 97.4% | $0.0038 | 31.7s | 83% |
| 3 | Llama 3.1 8B | 98.4% | $0.0003 | 1.3m | 87% |
| 4 | GPT-5.4 Mini (Reasoning) | 95.5% | $0.022 | 28.1s | 79% |
| 5 | Llama 3.1 70B | 96.0% | $0.0015 | 29.4s | 71% |
| 6 | GPT-5.4 Mini (Reasoning, Low) | 93.3% | $0.015 | 16.8s | 73% |
| 7 | GPT-5.4 Nano (Reasoning, Low) | 89.7% | $0.0055 | 20.6s | 71% |
| 8 | Claude 3 Haiku | 90.0% | $0.0025 | 14.9s | 62% |
| 9 | GPT-5.4 Mini | 91.0% | $0.015 | 16.8s | 64% |
| 10 | GPT-5.4 Nano (Reasoning) | 87.7% | $0.0061 | 24.5s | 66% |
| 11 | Mistral Small 4 (Reasoning) | 89.7% | $0.0022 | 30.2s | 63% |
| 12 | Mistral Small Creative | 87.8% | $0.0007 | 9.1s | 58% |
| 13 | Grok 4.1 Fast | 90.5% | $0.0018 | 37.8s | 61% |
| 14 | Ministral 3B | 88.2% | $0.0001 | 8.1s | 56% |
| 15 | Ministral 3 14B | 86.8% | $0.0007 | 11.7s | 57% |
| 16 | DeepSeek V3 (2025-03-24) | 91.5% | $0.0014 | 39.4s | 57% |
| 17 | Grok 4.20 (Beta) | 87.3% | $0.018 | 15.8s | 61% |
| 18 | Z.AI GLM 4.5 | 90.5% | $0.0051 | 42.1s | 58% |
| 19 | GPT-5.4 Nano | 84.9% | $0.0057 | 26.3s | 61% |
| 20 | Z.AI GLM 5 Turbo | 89.5% | $0.0081 | 33.2s | 56% |
| 21 | Ministral 3 3B | 85.9% | $0.0005 | 11.1s | 52% |
| 22 | Mistral Small 4 | 85.4% | $0.0014 | 18.2s | 53% |
| 23 | Grok 4.20 (Beta, Reasoning) | 88.7% | $0.039 | 34.0s | 67% |
| 24 | Mistral Medium 3.1 | 86.9% | $0.0048 | 36.5s | 54% |
| 25 | Mistral Large 3 | 86.6% | $0.0033 | 30.3s | 51% |
| 26 | Claude Haiku 4.5 | 87.1% | $0.011 | 21.6s | 50% |
| 27 | Claude Sonnet 4 | 90.6% | $0.032 | 43.7s | 58% |
| 28 | GPT-5.2 | 93.6% | $0.056 | 1.5m | 75% |
| 29 | Ministral 3 8B | 83.7% | $0.0008 | 19.6s | 49% |
| 30 | Rocinante 12B | 86.6% | $0.0014 | 38.4s | 49% |
| 31 | Grok 4 Fast | 83.1% | $0.0017 | 24.1s | 50% |
| 32 | Claude Sonnet 4.5 | 91.5% | $0.035 | 38.1s | 55% |
| 33 | GPT-5.4 (Reasoning, Low) | 93.7% | $0.055 | 1.4m | 71% |
| 34 | Qwen 2.5 72B | 83.5% | $0.0010 | 36.7s | 51% |
| 35 | Mistral Large | 85.9% | $0.014 | 30.9s | 51% |
| 36 | Mistral Large 2 | 85.8% | $0.013 | 29.4s | 49% |
| 37 | GPT-5.4 | 92.0% | $0.049 | 1.4m | 69% |
| 38 | Ministral 8B | 81.3% | $0.0004 | 10.4s | 44% |
| 39 | Qwen 3.5 Plus (2026-02-15) | 82.2% | $0.0060 | 31.5s | 49% |
| 40 | Z.AI GLM 5 | 86.8% | $0.0084 | 1.2m | 54% |
| 41 | Writer: Palmyra X5 | 83.8% | $0.011 | 22.0s | 43% |
| 42 | Claude 3.7 Sonnet | 88.2% | $0.042 | 46.7s | 55% |
| 43 | Qwen 3 32B | 81.1% | $0.0015 | 54.6s | 41% |
| 44 | GPT-4o, May 13th (temp=1) | 81.1% | $0.033 | 14.4s | 42% |
| 45 | GPT-4.1 Mini | 78.4% | $0.0027 | 19.0s | 35% |
| 46 | Claude Opus 4.5 | 88.6% | $0.070 | 53.4s | 56% |
| 47 | LFM2 24B | 77.9% | $0.0002 | 28.4s | 36% |
| 48 | Qwen3 235B A22B Instruct 2507 | 81.3% | $0.0011 | 59.2s | 38% |
| 49 | GPT-4o, Aug. 6th (temp=1) | 78.4% | $0.018 | 24.4s | 38% |
| 50 | GPT-5 Nano | 80.9% | $0.0042 | 1.4m | 43% |
| 51 | Inception Mercury 2 | 75.7% | $0.0032 | 7.0s | 28% |
| 52 | Hermes 3 405B | 77.8% | $0.0032 | 53.2s | 36% |
| 53 | MiniMax M2.7 | 79.4% | $0.0040 | 1.1m | 38% |
| 54 | ByteDance Seed 1.6 Flash | 75.8% | $0.0013 | 27.3s | 31% |
| 55 | MiniMax M2.5 | 80.5% | $0.0034 | 1.3m | 37% |
| 56 | Claude 3.5 Sonnet | 82.6% | $0.048 | 35.5s | 37% |
| 57 | Cohere Command R+ (Aug. 2024) | 77.8% | $0.020 | 52.5s | 36% |
| 58 | GPT-4o, Aug. 6th (temp=0) | 75.1% | $0.023 | 22.7s | 32% |
| 59 | Arcee AI: Trinity Large (Preview) | 73.9% | $0.0000 | 43.6s | 30% |
| 60 | Hermes 3 70B | 76.7% | $0.0010 | 1.2m | 33% |
| 61 | Stealth: Aurora Alpha | 70.6% | $0.0000 | 9.8s | 24% |
| 62 | GPT-5.4 (Reasoning) | 91.8% | $0.089 | 2.6m | 64% |
| 63 | Gemma 3 12B | 70.5% | $0.0004 | 41.3s | 26% |
| 64 | Claude Opus 4.6 | 81.7% | $0.078 | 1.2m | 48% |
| 65 | Nemotron 3 Nano | 72.6% | $0.0010 | 1.1m | 28% |
| 66 | GPT-4o Mini (temp=1) | 68.7% | $0.0012 | 34.8s | 23% |
| 67 | Claude Opus 4.6 (Reasoning) | 83.1% | $0.088 | 1.4m | 49% |
| 68 | Gemini 3 Flash (Preview) | 63.9% | $0.0078 | 19.6s | 27% |
| 69 | DeepSeek V3 (2024-12-26) | 70.7% | $0.0021 | 54.6s | 25% |
| 70 | Gemini 2.5 Flash | 66.9% | $0.0052 | 10.6s | 19% |
| 71 | Gemini 2.5 Flash (Reasoning) | 67.3% | $0.011 | 21.5s | 24% |
| 72 | MoonshotAI: Kimi K2.5 | 81.0% | $0.019 | 3.2m | 48% |
| 73 | Gemini 2.5 Flash Lite | 61.6% | $0.0009 | 9.5s | 21% |
| 74 | DeepSeek-V2 Chat | 68.9% | $0.0021 | 53.3s | 22% |
| 75 | Claude Sonnet 4.6 | 70.1% | $0.031 | 39.3s | 26% |
| 76 | Gemini 3 Flash (Preview, Reasoning) | 62.3% | $0.012 | 30.1s | 27% |
| 77 | GPT-4.1 | 69.0% | $0.018 | 44.7s | 23% |
| 78 | Stealth: Hunter Alpha | 66.0% | $0.0000 | 55.0s | 22% |
| 79 | Stealth: Healer Alpha | 59.7% | $0.0000 | 23.7s | 21% |
| 80 | WizardLM 2 8x22b | 70.3% | $0.0026 | 1.8m | 29% |
| 81 | GPT-5.1 | 79.1% | $0.054 | 1.8m | 36% |
| 82 | Arcee AI: Trinity Mini | 57.4% | $0.0003 | 9.2s | 17% |
| 83 | GPT-4o, May 13th (temp=0) | 66.4% | $0.035 | 14.1s | 19% |
| 84 | Gemma 3 27B | 61.1% | $0.0006 | 52.6s | 18% |
| 85 | o4 Mini | 61.2% | $0.015 | 25.7s | 17% |
| 86 | Gemini 2.5 Flash Lite (Reasoning) | 57.3% | $0.0028 | 30.8s | 18% |
| 87 | Nemotron 3 Super | 63.8% | $0.0000 | 1.4m | 21% |
| 88 | Grok 4 | 73.8% | $0.048 | 1.7m | 28% |
| 89 | Z.AI GLM 4.6 | 57.8% | $0.0065 | 51.5s | 21% |
| 90 | Aion 2.0 | 63.0% | $0.0064 | 1.3m | 19% |
| 91 | GPT-5 Mini | 59.6% | $0.0100 | 57.4s | 20% |
| 92 | GPT-4o Mini (temp=0) | 56.7% | $0.0012 | 34.8s | 14% |
| 93 | DeepSeek V3.2 | 63.0% | $0.0014 | 1.9m | 22% |
| 94 | Gemma 3 4B | 52.5% | $0.0002 | 20.0s | 11% |
| 95 | o4 Mini High | 59.5% | $0.025 | 47.2s | 16% |
| 96 | GPT-4.1 Nano | 46.4% | $0.0007 | 13.3s | 15% |
| 97 | Claude Sonnet 4.6 (Reasoning) | 65.6% | $0.060 | 1.2m | 22% |
| 98 | Qwen 3.5 Flash | 48.5% | $0.0025 | 47.5s | 17% |
| 99 | Gemini 3.1 Pro (Preview) | 75.8% | $0.107 | 1.8m | 36% |
| 100 | Mistral NeMO | 38.9% | $0.0005 | 10.1s | 16% |
| 101 | Z.AI GLM 4.7 Flash | 47.9% | $0.0017 | 1.2m | 16% |
| 102 | Qwen 3.5 397B A17B | 65.1% | $0.014 | 3.0m | 24% |
| 103 | Gemini 3 Pro (Preview) | 56.8% | $0.055 | 54.4s | 19% |
| 104 | Gemini 2.5 Pro | 49.0% | $0.036 | 36.2s | 16% |
| 105 | Z.AI GLM 4.7 | 49.8% | $0.010 | 1.4m | 16% |
| 106 | Claude Opus 4 | 87.3% | $0.209 | 1.4m | 44% |
| 107 | Gemini 3.1 Flash Lite (Preview) | 36.1% | $0.0030 | 8.4s | 11% |
| 108 | DeepSeek V3.1 | 51.7% | $0.0020 | 1.8m | 15% |
| 109 | Qwen 3.5 9B | 45.8% | $0.0011 | 1.4m | 11% |
| 110 | Qwen 3.5 35B | 43.3% | $0.018 | 1.0m | 16% |
| 111 | Inception Mercury | 39.4% | $0.011 | 17.6s | 6% |
| 112 | ByteDance Seed 1.6 | 55.5% | $0.013 | 2.5m | 17% |
| 113 | Qwen 3.5 122B | 44.3% | $0.025 | 1.1m | 13% |
| 114 | Qwen 3.5 27B | 46.4% | $0.020 | 1.6m | 15% |
| 115 | ByteDance Seed 2.0 Lite | 50.7% | $0.012 | 2.2m | 13% |
| 116 | GPT-5 | 51.6% | $0.065 | 2.8m | 17% |
| 117 | ByteDance Seed 2.0 Mini | 43.3% | $0.0045 | 4.9m | 15% |
| 118 | Mistral Small 3.2 24B | 42.5% | $0.0069 | 5.7m | 10% |

Individual Scenarios

Detailed Writing Rules

Model #1 #2 #3 #4 #5 Avg
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
Claude Opus 4.6100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
MiniMax M2.7100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
MiniMax M2.5100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
Nemotron 3 Super100100100100100100.0%
GPT-5.4100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Mistral Small 4100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Hermes 3 70B100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B100100100100100100.0%
Ministral 3 3B1001001001009999.8%
Claude 3.5 Sonnet1001001001009999.8%
Claude 3 Haiku1001001001009999.7%
Claude Sonnet 4.61001001001009899.6%
Qwen 3.5 Plus (2026-02-15)1001001001009899.5%
GPT-5.4 Nano (Reasoning)1001001001009799.5%
Gemini 2.5 Flash1001001001009699.2%
GPT-4.1 Mini1001001001009598.9%
Aion 2.0100100100999598.8%
Gemini 2.5 Pro100100100979698.6%
Grok 4 Fast1001001001009298.4%
DeepSeek V3.110010097979798.2%
Gemini 2.5 Flash Lite (Reasoning)1001001001008997.8%
Arcee AI: Trinity Mini1001001001008997.8%
Gemma 3 4B1001001001008897.7%
GPT-4.1 Nano100100100959397.5%
Qwen 3 32B1001001001008897.5%
Hermes 3 405B1001001001008496.9%
Stealth: Healer Alpha100100100948796.2%
ByteDance Seed 1.6 Flash1001001001007294.5%
Cohere Command R+ (Aug. 2024)100100100937994.4%
Ministral 8B1001001001007294.4%
GPT-5.21001001001007094.1%
DeepSeek V3.210010095958093.9%
Gemini 3 Flash (Preview, Reasoning)1001001001006793.4%
Gemini 3 Flash (Preview)1001001001006693.3%
WizardLM 2 8x22b1001001001006392.7%
GPT-51001001001005691.2%
Stealth: Hunter Alpha100100100847190.9%
Gemma 3 27B100100100876890.9%
Qwen 3.5 122B10010090897390.6%
Z.AI GLM 4.6100100100975490.1%
Gemini 2.5 Flash (Reasoning)1001001001004689.2%
ByteDance Seed 1.610010093787589.2%
GPT-5 Mini10010097757188.8%
Arcee AI: Trinity Large (Preview)100100100865487.9%
ByteDance Seed 2.0 Lite1001001001002785.5%
GPT-4o, May 13th (temp=0)100100100831880.1%
Qwen 3.5 9B10010085595579.9%
Gemini 3.1 Flash Lite (Preview)100100100464377.7%
Qwen 3.5 Flash10010093454376.2%
Inception Mercury10010075554875.3%
Qwen 3.5 35B1009672664074.7%
Z.AI GLM 4.7 Flash10010081472470.4%
Gemini 3.1 Pro (Preview)858172555168.7%
Qwen 3.5 27B988474492365.7%
Gemini 3 Pro (Preview)1009851442663.8%
Mistral Small 3.2 24B938566222057.1%
Mistral NeMO1001002011647.5%
Z.AI GLM 4.71004233282645.8%
ByteDance Seed 2.0 Mini794234333043.6%

Model #1 #2 #3 #4 #5 Avg
Claude Opus 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
MiniMax M2.7100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Mistral Large 3100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Stealth: Hunter Alpha1001001001009999.8%
Writer: Palmyra X51001001001009999.7%
Z.AI GLM 51001001001009899.7%
Mistral Large1001001001009799.3%
MoonshotAI: Kimi K2.51001001001009799.3%
Z.AI GLM 5 Turbo1001001001009598.9%
Claude Opus 4.51001001001009498.9%
Claude 3.7 Sonnet1001001001009498.9%
Claude Opus 4.61001001001009498.7%
ByteDance Seed 1.6 Flash100100100999498.6%
GPT-5.4 Mini (Reasoning, Low)1001001001009198.2%
Ministral 3 3B1001001001009098.1%
Qwen 3 32B1001001001008997.9%
Grok 4.1 Fast100100100989297.9%
Hermes 3 405B1001001001008997.8%
Ministral 3 14B100100100969397.7%
GPT-5.2100100100988797.2%
Inception Mercury 2100100100978897.1%
Ministral 8B100100100978897.0%
Llama 3.1 8B1001001001008496.9%
Ministral 3B100100100978796.7%
Stealth: Aurora Alpha1001001001008396.5%
Grok 4 Fast10010098968896.5%
Mistral Small 4 (Reasoning)100100100928996.1%
GPT-4o, May 13th (temp=1)10010097948795.7%
LFM2 24B10010098968195.1%
Qwen 3.5 Plus (2026-02-15)100100100888494.5%
Mistral Small 4100100100888093.5%
Qwen 3.5 397B A17B999893918793.4%
WizardLM 2 8x22b10010095908393.4%
GPT-5.4 Nano10010095947693.1%
GPT-5.4 Nano (Reasoning, Low)100100100887692.7%
GPT-4o, Aug. 6th (temp=1)100100100887692.7%
Ministral 3 8B10010097858192.6%
Grok 4.20 (Beta)100100100907292.5%
Llama 3.1 70B100100100946792.2%
GPT-4.1100100100877091.3%
DeepSeek V3 (2024-12-26)100100100806889.4%
Cohere Command R+ (Aug. 2024)10010096965589.3%
Aion 2.010010086787788.1%
Grok 410010092776987.6%
GPT-5 Nano10010088826687.2%
GPT-5.11008484848387.0%
Gemini 2.5 Flash1009994746787.0%
Gemma 3 27B10010084767086.0%
Rocinante 12B1001001001002985.9%
Claude Sonnet 4.610010080796584.6%
Gemini 3.1 Pro (Preview)1009892755584.0%
Gemini 3 Flash (Preview)918585837583.8%
Qwen 2.5 72B1009385805983.5%
Grok 4.20 (Beta, Reasoning)918684797683.2%
DeepSeek V3.21009078757383.1%
DeepSeek-V2 Chat10010087646282.6%
Hermes 3 70B10010098921781.3%
Nemotron 3 Super1008985676180.4%
GPT-5.4 Nano (Reasoning)989785615880.1%
Arcee AI: Trinity Mini10010080606080.1%
o4 Mini10010088772177.2%
Gemini 2.5 Flash Lite (Reasoning)958580774676.5%
o4 Mini High1007573706576.3%
Stealth: Healer Alpha1008681575676.0%
GPT-4o Mini (temp=0)927473676674.3%
Claude Sonnet 4.6 (Reasoning)1008079595274.0%
GPT-5 Mini1008071655073.3%
Gemini 2.5 Flash (Reasoning)968379713572.7%
Z.AI GLM 4.6937968675472.5%
Arcee AI: Trinity Large (Preview)10010080572472.4%
Gemma 3 4B988564545470.9%
Gemma 3 12B989573454370.8%
DeepSeek V3.11007570664170.5%
Inception Mercury100999063070.3%
GPT-4o, Aug. 6th (temp=0)888762595169.5%
Gemini 2.5 Pro887268595067.5%
GPT-4.1 Nano1007361594467.4%
Gemini 3 Flash (Preview, Reasoning)100887254062.8%
Nemotron 3 Nano100827551562.5%
GPT-5767462515062.4%
ByteDance Seed 1.6847160573862.1%
Gemini 2.5 Flash Lite896060514460.6%
Z.AI GLM 4.7 Flash757157551454.4%
Gemini 3 Pro (Preview)706455353251.2%
GPT-4o, May 13th (temp=0)95874526050.7%
Qwen 3.5 122B784947393249.1%
Qwen 3.5 9B100100450049.0%
Qwen 3.5 Flash615949482347.9%
Qwen 3.5 35B79746125047.8%
Mistral NeMO72575444045.6%
Qwen 3.5 27B645043382243.4%
Z.AI GLM 4.768554641042.0%
ByteDance Seed 2.0 Mini40232013620.3%
Mistral Small 3.2 24B5430160020.0%
ByteDance Seed 2.0 Lite4428145018.3%
Gemini 3.1 Flash Lite (Preview)36201511016.4%

Model #1 #2 #3 #4 #5 Avg
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
GPT-5.2100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
GPT-5.4 Mini (Reasoning)1001001001009899.6%
Llama 3.1 70B1001001001009899.5%
Llama 3.1 8B1001001001009699.2%
GPT-5.4 (Reasoning, Low)1001001001009498.8%
Mistral Large 31001001001008296.5%
Gemini 2.5 Flash (Reasoning)1001001001008296.4%
GPT-5.4 Nano (Reasoning)1001001001008296.3%
Claude Sonnet 41001001001007695.2%
GPT-5.4 Nano10010096908794.6%
Stealth: Aurora Alpha100100100908394.5%
GPT-5.4 Nano (Reasoning, Low)1009898908394.0%
Qwen3 235B A22B Instruct 25071001001001006893.6%
Z.AI GLM 4.51001001001006593.1%
Ministral 3 14B10010097848092.3%
Mistral Small Creative1009696858392.0%
Claude Sonnet 4.51001001001005891.5%
Rocinante 12B1001001001005791.4%
Qwen 2.5 72B1001001001005791.3%
Writer: Palmyra X5100100100975991.2%
GPT-4o, Aug. 6th (temp=0)100100100995691.1%
Claude Opus 4.61001001001004989.8%
DeepSeek V3 (2025-03-24)10010096936089.7%
GPT-5.410010098866589.7%
Claude Sonnet 4.6 (Reasoning)100100100925589.4%
Mistral Small 4 (Reasoning)100100100826389.1%
Ministral 3 3B10010092787388.7%
Grok 4.20 (Beta)10010088847288.7%
Ministral 3B1001001001004288.4%
MiniMax M2.510010096885687.9%
MoonshotAI: Kimi K2.5100100100696486.6%
Grok 4.20 (Beta, Reasoning)10010095785285.1%
Claude Haiku 4.51001001001002584.9%
Claude 3 Haiku969484846684.8%
Claude Opus 4.510010087726384.5%
Z.AI GLM 5 Turbo100100100803983.9%
Mistral Small 4100100100843383.5%
Ministral 8B100100100852882.6%
Mistral Medium 3.1100100100595482.6%
GPT-5.4 Mini10010072706581.2%
ByteDance Seed 1.6 Flash100100100881380.2%
GPT-4o, May 13th (temp=0)100100100100080.0%
Claude Opus 4100100100100079.9%
Grok 4.1 Fast10010088654679.8%
o4 Mini High10010010096079.2%
Ministral 3 8B1009379744979.1%
Qwen 3 32B1009895821377.5%
GPT-4.1 Mini988582605676.2%
Nemotron 3 Nano1008570626175.4%
ByteDance Seed 1.610010066604674.4%
Qwen 3.5 Plus (2026-02-15)1008479683873.8%
Mistral Large10010067525073.7%
MiniMax M2.7100100100462273.6%
Grok 4 Fast10010064534672.7%
Claude Opus 4.6 (Reasoning)1007965635271.8%
Claude 3.7 Sonnet1008367524970.0%
GPT-5 Nano988263633568.3%
WizardLM 2 8x22b1009440393762.1%
Mistral Small 3.2 24B10010061222160.7%
Z.AI GLM 51006653463860.5%
GPT-4.11001001001060.3%
GPT-4o, Aug. 6th (temp=1)10010045391760.2%
Mistral Large 21009344362960.2%
GPT-5.11001001000060.0%
o4 Mini100100971059.6%
GPT-4o Mini (temp=1)1001007417659.4%
GPT-4o, May 13th (temp=1)100896838058.9%
Qwen 3.5 397B A17B896253433857.0%
ByteDance Seed 2.0 Mini100100810056.2%
Cohere Command R+ (Aug. 2024)100964930055.2%
DeepSeek-V2 Chat100856526055.2%
ByteDance Seed 2.0 Lite100846417053.1%
DeepSeek V3 (2024-12-26)100100610052.3%
DeepSeek V3.210088640050.5%
Hermes 3 405B676450353249.6%
Claude 3.5 Sonnet766444441648.9%
Qwen 3.5 9B96844816048.8%
Gemini 3.1 Pro (Preview)100554937048.2%
Z.AI GLM 4.61007720201045.4%
Hermes 3 70B99583831045.1%
Gemini 3 Flash (Preview)83603736043.1%
Gemini 3 Flash (Preview, Reasoning)797530151142.0%
Z.AI GLM 4.7 Flash81645212041.9%
LFM2 24B1004928171541.8%
Gemini 2.5 Flash10010080041.6%
Grok 410047464039.5%
GPT-5 Mini10057305138.5%
Mistral NeMO9452332036.3%
Qwen 3.5 122B52453835735.4%
GPT-573422820834.2%
Qwen 3.5 27B10037300033.4%
Gemini 2.5 Flash Lite (Reasoning)100252018333.2%
Gemini 2.5 Flash Lite9542158032.1%
Qwen 3.5 Flash702618151428.8%
Stealth: Hunter Alpha7735129026.7%
Nemotron 3 Super695030024.4%
Qwen 3.5 35B615720023.9%
Aion 2.0981530023.4%
Arcee AI: Trinity Large (Preview)100500021.1%
Stealth: Healer Alpha36232212419.5%
Gemini 2.5 Pro691100016.0%
Gemini 3 Pro (Preview)67000013.5%
Gemma 3 12B23155008.6%
Gemma 3 27B2498008.2%
Arcee AI: Trinity Mini1777006.1%
GPT-4o Mini (temp=0)2400004.9%
Gemini 3.1 Flash Lite (Preview)2030004.6%
DeepSeek V3.11400002.9%
Inception Mercury770002.8%
Z.AI GLM 4.7700001.4%
GPT-4.1 Nano220000.9%
Gemma 3 4B000000.0%

Model #1 #2 #3 #4 #5 Avg
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
GPT-5.2100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Mistral Small 4100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Ministral 8B100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
LFM2 24B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)1001001001009899.6%
Claude 3 Haiku1001001001009799.4%
MiniMax M2.71001001001009398.6%
GPT-5 Mini100100100999298.2%
GPT-5.11001001001009198.1%
Nemotron 3 Super1001001001008997.8%
GPT-4o, Aug. 6th (temp=1)1001001001008997.8%
Stealth: Hunter Alpha1001001001008897.6%
GPT-4o, May 13th (temp=1)1001001001008797.3%
GPT-4o, Aug. 6th (temp=0)1001001001008797.3%
GPT-5.4 Mini (Reasoning)1001001001008396.6%
Aion 2.0100100100929096.5%
Claude Opus 4.61001001001008095.9%
Gemini 2.5 Flash Lite100100100918494.9%
Mistral Small 3.2 24B1001001001007194.3%
MiniMax M2.51001001001007194.3%
Grok 4 Fast100100100908194.3%
DeepSeek V3.11001001001007194.1%
Hermes 3 70B100100100947593.6%
Qwen 3.5 397B A17B10010096927793.0%
GPT-4o Mini (temp=0)100100100857892.5%
Qwen 2.5 72B100100100807891.7%
Claude Sonnet 4.610010096927091.7%
o4 Mini High1001001001005791.4%
Claude Sonnet 4.6 (Reasoning)1001001001005691.2%
Gemini 2.5 Flash100100100827591.2%
WizardLM 2 8x22b1009896926991.2%
Rocinante 12B10010099906089.7%
Gemini 2.5 Flash (Reasoning)1009788817788.6%
ByteDance Seed 2.0 Lite1009791837388.6%
o4 Mini100100100736086.7%
Gemini 3 Pro (Preview)1008883827886.2%
DeepSeek V3.2989388767586.1%
Cohere Command R+ (Aug. 2024)10010096785685.9%
Mistral Small Creative10010088766285.1%
Gemini 3 Flash (Preview)1009979766583.8%
Gemma 3 12B10010099605182.0%
Hermes 3 405B10010098703780.9%
GPT-51008980775980.9%
Z.AI GLM 4.71009181745580.2%
ByteDance Seed 1.610010080713777.6%
Z.AI GLM 4.7 Flash908884843776.5%
GPT-4o Mini (temp=1)918971665774.7%
Qwen 3.5 122B1009291672274.5%
Mistral Large1008468615573.8%
Gemini 3 Flash (Preview, Reasoning)1008278753373.4%
Mistral Large 2100100100452073.0%
Qwen 3.5 Flash10010079602372.3%
Mistral NeMO937271645671.2%
Arcee AI: Trinity Mini10010070444171.0%
Z.AI GLM 4.6968782454470.8%
Stealth: Healer Alpha928469594770.0%
GPT-4.1888072594969.6%
Arcee AI: Trinity Large (Preview)897167605167.5%
ByteDance Seed 2.0 Mini908058524565.1%
Gemini 3.1 Pro (Preview)10010042414064.6%
Gemma 3 4B1007263454264.3%
Gemini 2.5 Flash Lite (Reasoning)1008174312061.3%
Mistral Large 3886865473660.8%
Qwen 3.5 9B1001001000060.0%
Gemini 2.5 Pro806855494759.7%
Qwen 3.5 35B1006457413459.1%
Inception Mercury1001004646058.5%
Gemma 3 27B1007849431256.4%
Gemini 3.1 Flash Lite (Preview)1009042291555.3%
Qwen 3.5 27B78625747048.7%
GPT-4.1 Nano655044332643.7%

Model #1 #2 #3 #4 #5 Avg
Claude 3.5 Haiku1001001001009498.7%
Mistral Small Creative100100100989698.7%
Llama 3.1 70B1001001001009298.3%
GPT-5.4 Nano (Reasoning)1001001001009198.2%
GPT-5.4 Nano10010099989197.6%
Llama 3.1 8B1001001001007594.9%
Grok 4.1 Fast100100100997194.0%
Claude Sonnet 4.5100100100986592.6%
Claude 3 Haiku100100100817992.0%
Z.AI GLM 4.5100100100916791.6%
DeepSeek V3 (2025-03-24)10010097827991.6%
GPT-5.4100100100847391.5%
GPT-5.210010095866789.5%
GPT-5.4 Nano (Reasoning, Low)100100100924887.9%
GPT-5.4 (Reasoning, Low)1009691777287.2%
Llama 3.1 Nemotron 70B10010099725184.3%
Claude Sonnet 4100100100744684.0%
Mistral Small 4 (Reasoning)100100100665383.8%
GPT-5.4 Mini (Reasoning)1008584795981.2%
GPT-5 Nano10010095733480.4%
GPT-5.4 Mini1009085675879.9%
Inception Mercury 2969377676479.3%
Ministral 3 3B10010010096079.1%
Ministral 3 8B10010068646078.4%
Stealth: Aurora Alpha10010078605177.9%
Ministral 3 14B1009993544177.5%
Grok 4.20 (Beta, Reasoning)1009369625776.0%
Rocinante 12B1001009673374.6%
Cohere Command R+ (Aug. 2024)10010090523074.3%
Ministral 8B1009487513573.5%
GPT-5.4 Mini (Reasoning, Low)1008881761772.4%
Mistral Large 31001009167071.5%
Claude Haiku 4.51008985463571.0%
Writer: Palmyra X51007670565270.8%
Qwen 3.5 Plus (2026-02-15)988767534970.7%
Z.AI GLM 5 Turbo938765524267.8%
Claude Opus 4.61006562614666.8%
Mistral Large1007953493863.7%
Mistral Small 41008552453363.0%
Mistral Medium 3.1877062563862.6%
GPT-4o, May 13th (temp=0)100767062061.6%
Qwen 2.5 72B90816967061.4%
ByteDance Seed 1.610010071241061.0%
Ministral 3B1009844432060.9%
Grok 4.20 (Beta)847749474560.5%
Z.AI GLM 5100928317960.3%
GPT-5.4 (Reasoning)787064612359.4%
Hermes 3 70B100905147057.5%
Claude Opus 4.6 (Reasoning)848344403457.3%
Claude 3.5 Sonnet1009955161657.3%
Claude 3.7 Sonnet968551441157.3%
Nemotron 3 Nano100100756156.4%
Stealth: Hunter Alpha100796428054.3%
Qwen 3 32B87796144054.1%
Qwen3 235B A22B Instruct 2507856552392352.6%
MiniMax M2.5756460511252.4%
GPT-4.1 Mini1007735242051.1%
Claude Opus 4100736615050.9%
MoonshotAI: Kimi K2.5757347322049.4%
GPT-4o, Aug. 6th (temp=1)100814620049.3%
Claude Opus 4.5855840332949.0%
GPT-4o, May 13th (temp=1)100664520947.8%
Grok 4 Fast634744423446.1%
Arcee AI: Trinity Mini84624039045.1%
Claude Sonnet 4.6774747371745.0%
Arcee AI: Trinity Large (Preview)83625027044.4%
Mistral Small 3.2 24B100483818041.0%
DeepSeek-V2 Chat10010000040.0%
GPT-4o, Aug. 6th (temp=0)84484125039.7%
Grok 4664235351839.1%
LFM2 24B67634514037.6%
Qwen 3.5 27B71433834037.3%
Mistral Large 278472624034.9%
Mistral NeMO6757415234.2%
WizardLM 2 8x22b65323125030.5%
Stealth: Healer Alpha9443102029.8%
Gemini 2.5 Flash Lite46393427029.2%
Aion 2.05446433029.2%
Z.AI GLM 4.653353026028.6%
o4 Mini6352270028.4%
ByteDance Seed 1.6 Flash50443013027.3%
Qwen 3.5 9B705800025.4%
Qwen 3.5 Flash6049162025.3%
DeepSeek V3.25343263024.8%
Qwen 3.5 397B A17B363326191024.5%
Qwen 3.5 35B54302513024.4%
GPT-5 Mini37373315024.3%
GPT-5.164222014123.9%
MiniMax M2.7625420023.7%
Qwen 3.5 122B703140020.9%
Z.AI GLM 4.7 Flash4236200019.7%
Inception Mercury702700019.4%
Gemini 2.5 Flash (Reasoning)434260018.2%
Claude Sonnet 4.6 (Reasoning)242015111015.9%
Gemini 3.1 Pro (Preview)541500013.8%
Gemini 3 Flash (Preview)68000013.5%
Nemotron 3 Super412000012.3%
Gemma 3 12B491300012.2%
GPT-4o Mini (temp=1)332410011.4%
Gemini 2.5 Flash282620011.3%
DeepSeek V3.125220009.4%
GPT-524128008.5%
Z.AI GLM 4.724160008.0%
GPT-4.13242007.6%
Hermes 3 405B20170007.3%
Gemini 2.5 Flash Lite (Reasoning)3600007.1%
Gemma 3 4B2810005.9%
GPT-4.1 Nano2330005.3%
Gemini 2.5 Pro2060005.2%
Gemma 3 27B1160003.4%
Gemini 3 Pro (Preview)700001.3%
GPT-4o Mini (temp=0)600001.1%
DeepSeek V3 (2024-12-26)500001.0%
o4 Mini High300000.6%
Gemini 3 Flash (Preview, Reasoning)000000.0%
ByteDance Seed 2.0 Mini000000.0%
Gemini 3.1 Flash Lite (Preview)000000.0%
ByteDance Seed 2.0 Lite000000.0%

Model #1 #2 #3 #4 #5 Avg
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
MiniMax M2.7100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
GPT-4.1100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Inception Mercury 2100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Mistral Small 4100100100100100100.0%
GPT-5.4 Mini (Reasoning)10010010010010099.9%
MoonshotAI: Kimi K2.51001001001009999.8%
Claude Haiku 4.51001001001009999.8%
Mistral Medium 3.11001001001009899.6%
Qwen 3 32B1001001001009899.5%
Grok 4100100100999899.5%
Rocinante 12B1001001001009799.5%
Z.AI GLM 4.5100100100999799.3%
Grok 4.20 (Beta)1001001001009699.2%
Qwen 3.5 Plus (2026-02-15)1001001001009599.0%
Gemma 3 4B1001001001009498.8%
GPT-5.4 Mini (Reasoning, Low)1001001001009398.5%
Claude Opus 4.6100100100979698.5%
Mistral Small Creative1001001001009298.4%
Grok 4.20 (Beta, Reasoning)1001001001009298.3%
Stealth: Aurora Alpha1001001001009198.2%
Ministral 8B1001001001009098.0%
MiniMax M2.51001001001008997.7%
GPT-5.2100100100959497.7%
Z.AI GLM 51001001001008797.3%
Hermes 3 405B1001001001008597.0%
Ministral 3 3B1001001001008396.7%
Ministral 3B1001001001008296.3%
Claude Opus 4.6 (Reasoning)1001001001008196.3%
ByteDance Seed 1.6 Flash100100100958495.8%
Mistral Small 4 (Reasoning)1001001001007695.2%
Aion 2.0100100100898594.7%
Ministral 3 8B1001001001007294.5%
Ministral 3 14B1001001001007294.4%
GPT-4o, Aug. 6th (temp=1)100100100937793.9%
Mistral Large1001001001007093.9%
GPT-5.4 Nano (Reasoning)1009792909093.8%
LFM2 24B100100100937693.8%
Claude Sonnet 4.6 (Reasoning)100100100947593.8%
Llama 3.1 8B1001001001006893.7%
Z.AI GLM 5 Turbo1009992908593.3%
Stealth: Hunter Alpha100100100808091.9%
WizardLM 2 8x22b10010097936891.5%
GPT-5.4 Nano (Reasoning, Low)100100100965890.8%
Gemini 2.5 Flash (Reasoning)100100100866590.1%
Qwen 3.5 397B A17B1009793916889.9%
Grok 4 Fast100100100816288.5%
GPT-4o, Aug. 6th (temp=0)100100100766487.9%
Gemini 2.5 Flash Lite (Reasoning)10010092846287.6%
Qwen 2.5 72B10010088807087.6%
GPT-4o, May 13th (temp=1)1009190747385.5%
Gemini 3 Flash (Preview)989788747085.3%
Gemini 3.1 Pro (Preview)949386816483.4%
Gemini 2.5 Flash1008783826483.3%
Nemotron 3 Super1008987835482.6%
DeepSeek V3.21009685686482.5%
Cohere Command R+ (Aug. 2024)100100100684382.2%
GPT-4o, May 13th (temp=0)1009286844982.1%
Z.AI GLM 4.61009785656181.6%
GPT-4o Mini (temp=1)1008979716781.2%
GPT-5.4 Nano938781806480.9%
Gemini 2.5 Flash Lite909085786280.8%
ByteDance Seed 1.61008783815280.5%
Hermes 3 70B1008787754478.4%
Gemma 3 12B10010072615978.3%
GPT-5 Nano928476736477.5%
Arcee AI: Trinity Mini10010087792077.1%
DeepSeek V3 (2024-12-26)100100100542976.7%
Claude Sonnet 4.61009471625075.4%
Gemini 3 Flash (Preview, Reasoning)949182624775.3%
GPT-5.1807977646272.4%
DeepSeek-V2 Chat10010062484671.3%
o4 Mini938481484770.7%
GPT-4o Mini (temp=0)847770575668.6%
DeepSeek V3.1948070524367.6%
Gemini 2.5 Pro827568684467.4%
GPT-4.1 Nano1006563594866.9%
Stealth: Healer Alpha1007365623466.8%
o4 Mini High867473722666.1%
Arcee AI: Trinity Large (Preview)10010068471064.8%
Gemma 3 27B777770544264.1%
Qwen 3.5 Flash968542412658.1%
Z.AI GLM 4.7 Flash936462441656.1%
ByteDance Seed 2.0 Mini878437242050.5%
GPT-5 Mini605651493550.0%
GPT-574696033548.2%
Z.AI GLM 4.777705826948.1%
Gemini 3 Pro (Preview)100764317047.3%
Qwen 3.5 9B1006437211347.0%
ByteDance Seed 2.0 Lite69535235041.8%
Qwen 3.5 122B71633423038.2%
Mistral NeMO7770420037.8%
Qwen 3.5 35B55493929435.0%
Inception Mercury59542722733.8%
Qwen 3.5 27B71352921031.5%
Gemini 3.1 Flash Lite (Preview)58523011030.3%
Mistral Small 3.2 24B600001.3%

genre

Model #1 #2 #3 #4 #5 Avg
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
GPT-5.4100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
LFM2 24B100100100100100100.0%
DeepSeek V3.21001001001009999.7%
GPT-4o, May 13th (temp=1)1001001001009999.7%
GPT-4o Mini (temp=0)1001001001009899.7%
GPT-4.11001001001009899.6%
Claude Sonnet 41001001001009799.5%
Rocinante 12B1001001001009799.4%
Ministral 3 14B1001001001009699.2%
Grok 4.20 (Beta, Reasoning)1001001001009699.2%
Writer: Palmyra X51001001001009599.1%
Gemini 2.5 Flash1001001001009599.0%
Hermes 3 70B1001001001009498.8%
GPT-5.11001001001009498.7%
Z.AI GLM 51001001001009298.4%
GPT-5.4 Nano (Reasoning, Low)100100100969698.4%
DeepSeek V3 (2024-12-26)1001001001009198.1%
GPT-5.4 Nano (Reasoning)1001001001008997.9%
Inception Mercury 21001001001008897.5%
Claude 3.5 Sonnet1001001001008797.3%
GPT-5.2100100100969097.1%
GPT-5 Mini10010099988796.7%
MiniMax M2.51001001001008496.7%
GPT-4o, Aug. 6th (temp=1)1001001001008296.3%
Gemma 3 27B1001001001008196.3%
o4 Mini100100100948796.2%
Mistral Small 4 (Reasoning)100100100938896.1%
Z.AI GLM 4.7100100100998296.1%
ByteDance Seed 1.6 Flash1001001001007995.9%
Gemma 3 12B1001001001007995.8%
Ministral 8B100100100898895.5%
GPT-4.1 Mini1001001001007595.0%
MoonshotAI: Kimi K2.5100100100938395.0%
Gemini 3 Flash (Preview)10010092929094.8%
Ministral 3 3B100100100928294.7%
GPT-5.4 Nano10010098918294.1%
Ministral 3 8B100100100957393.7%
MiniMax M2.710010097927893.5%
Gemini 2.5 Flash (Reasoning)100100100848293.2%
Claude 3 Haiku1001001001006593.0%
Mistral Small Creative10010094878493.0%
Stealth: Healer Alpha100100100887792.9%
Llama 3.1 Nemotron 70B1001001001006392.6%
Arcee AI: Trinity Large (Preview)1001001001006392.5%
Gemini 3 Flash (Preview, Reasoning)10010097897492.1%
GPT-51009493908392.1%
GPT-4o, May 13th (temp=0)100100100966391.9%
Aion 2.010010087878591.7%
Gemini 2.5 Flash Lite1009692878491.6%
Stealth: Aurora Alpha100100100887091.5%
Ministral 3B1001001001005791.4%
Claude Opus 4.6 (Reasoning)1001001001005791.3%
Nemotron 3 Super10010095857591.0%
o4 Mini High999889848490.8%
WizardLM 2 8x22b100100100827290.8%
Stealth: Hunter Alpha1009992877490.3%
Gemini 3 Pro (Preview)1009592896788.8%
Nemotron 3 Nano10010090905887.6%
Claude Sonnet 4.610010088787187.4%
Arcee AI: Trinity Mini10010082777687.0%
Gemma 3 4B10010096706987.0%
DeepSeek V3.11009187817185.8%
Qwen 3.5 397B A17B10010088726885.6%
Gemini 2.5 Pro969387817085.3%
Qwen 3.5 27B10010090676684.6%
Qwen 3.5 Flash1009386736282.7%
Qwen 3.5 122B998580757181.8%
GPT-4.1 Nano968787726781.6%
Z.AI GLM 4.6959381605677.2%
Claude Opus 4.61008481763675.5%
ByteDance Seed 2.0 Mini999961565674.3%
Qwen 3.5 35B1007969626073.9%
GPT-4o, Aug. 6th (temp=0)1008280733273.4%
ByteDance Seed 2.0 Lite1009570702972.7%
Claude Sonnet 4.6 (Reasoning)1009467583470.5%
Gemini 3.1 Flash Lite (Preview)929279474370.4%
Gemini 2.5 Flash Lite (Reasoning)858372585269.9%
Inception Mercury1001007468068.3%
Qwen 3.5 9B1007971601665.2%
Z.AI GLM 4.7 Flash897755544463.9%
Mistral NeMO494536241634.1%
Mistral Small 3.2 24B7239390030.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Grok 4 Fast1001001001009999.9%
Claude 3.7 Sonnet1001001001009899.6%
Z.AI GLM 5 Turbo1001001001009899.6%
Grok 4.20 (Beta)1001001001009799.5%
Grok 41001001001009799.4%
Rocinante 12B1001001001009699.0%
GPT-5.4 Mini100100100989799.0%
GPT-4.1 Mini1001001001009498.8%
GPT-5.4 Mini (Reasoning)100100100989498.4%
Qwen3 235B A22B Instruct 250710010099999398.2%
Writer: Palmyra X51001001001009098.0%
Claude 3 Haiku1001001001008096.0%
Hermes 3 405B10010096968795.7%
Arcee AI: Trinity Large (Preview)100100100938495.3%
LFM2 24B100100100918595.1%
Claude Haiku 4.51001001001007494.8%
GPT-5.4 Mini (Reasoning, Low)1009996958394.6%
GPT-4o Mini (temp=1)100100100888594.6%
Mistral Small Creative1009594918993.6%
Ministral 3B100100100947493.6%
Mistral Medium 3.11001001001006492.9%
Qwen 3 32B1001001001006492.7%
Hermes 3 70B100100100877792.7%
MoonshotAI: Kimi K2.510010097957192.4%
Ministral 3 8B10010098857591.6%
Mistral Large10010095887190.7%
ByteDance Seed 1.6 Flash10010098837190.4%
Mistral Large 210010090897089.9%
Claude Opus 4.6 (Reasoning)100100100777289.8%
Grok 4.20 (Beta, Reasoning)10010086818089.5%
WizardLM 2 8x22b100100100826489.2%
Cohere Command R+ (Aug. 2024)100100100736988.4%
GPT-4o, Aug. 6th (temp=0)100100100855587.9%
Ministral 3 14B949291807887.1%
Ministral 8B1009490806886.4%
Qwen 2.5 72B100100100696286.3%
GPT-5.1939089847085.2%
GPT-5.4 Nano (Reasoning, Low)958786847385.0%
Arcee AI: Trinity Mini10010084726884.8%
Z.AI GLM 51008882787684.7%
GPT-4o, May 13th (temp=1)10010094696084.5%
Mistral Large 310010085726584.5%
DeepSeek V3 (2024-12-26)10010075726682.5%
Mistral Small 4 (Reasoning)10010084685881.9%
GPT-5 Nano1009189676181.5%
GPT-5.2888583806880.8%
Nemotron 3 Nano938881756480.3%
DeepSeek V3.2848180797680.0%
Inception Mercury 21009279705879.8%
Gemini 3.1 Pro (Preview)1008478785779.4%
Stealth: Aurora Alpha999686665079.1%
GPT-4.1938877756278.9%
Mistral Small 41009478665678.8%
MiniMax M2.5999282664677.1%
GPT-4o Mini (temp=0)1009381605177.0%
Gemma 3 12B1008567655875.2%
MiniMax M2.710010064555474.5%
Gemma 3 27B928985753174.5%
Qwen 3.5 Plus (2026-02-15)887473696874.4%
Stealth: Healer Alpha928872655273.8%
GPT-5.4 Nano897874656273.6%
DeepSeek-V2 Chat100100100551173.3%
GPT-5 Mini957871625271.7%
GPT-5.4 Nano (Reasoning)787674656571.6%
Claude Sonnet 4.6807573725671.2%
Claude Opus 4.6898175565370.6%
Gemini 2.5 Flash837671694969.5%
Stealth: Hunter Alpha727070675266.1%
Ministral 3 3B927876582666.1%
o4 Mini High906760503760.6%
Nemotron 3 Super856860494060.6%
Gemini 2.5 Flash (Reasoning)897354473860.1%
Z.AI GLM 4.7736961563959.8%
Claude Sonnet 4.6 (Reasoning)756565573158.5%
Aion 2.0736455544658.4%
Inception Mercury85816957058.2%
Qwen 3.5 397B A17B696055535157.7%
Z.AI GLM 4.6676157484755.8%
DeepSeek V3.1716559512854.7%
Gemma 3 4B726764363354.4%
Gemini 3 Flash (Preview, Reasoning)827256372253.7%
Gemini 3 Pro (Preview)686056483753.5%
o4 Mini806059491252.0%
GPT-4.1 Nano645958433251.1%
Gemini 2.5 Flash Lite825342413851.0%
Gemini 2.5 Flash Lite (Reasoning)634947403947.5%
Z.AI GLM 4.7 Flash615943412746.0%
Gemini 2.5 Pro625249412545.8%
Qwen 3.5 Flash58535250944.3%
Gemini 3 Flash (Preview)624642363443.9%
GPT-5675037242340.1%
Qwen 3.5 122B524236343339.2%
ByteDance Seed 1.6734934271238.9%
GPT-4o, May 13th (temp=0)565147181838.0%
Qwen 3.5 35B583732302536.4%
Mistral NeMO574239251736.0%
ByteDance Seed 2.0 Mini54504911032.8%
Qwen 3.5 9B494138211031.8%
ByteDance Seed 2.0 Lite65403013029.5%
Qwen 3.5 27B5942287227.7%
Gemini 3.1 Flash Lite (Preview)41403510525.9%
Mistral Small 3.2 24B3500007.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
GPT-5.4 Mini (Reasoning)100100100100100100.0%
GPT-5.11001001001009599.1%
Z.AI GLM 510010099968395.5%
Gemma 3 12B1001001001007695.2%
Claude 3.5 Haiku1001001001006993.8%
GPT-5.2100100100976091.5%
GPT-5.4 Mini (Reasoning, Low)100100100727188.6%
GPT-4o, May 13th (temp=1)1001001001004388.6%
Llama 3.1 8B100100100806388.6%
Claude Opus 4.6 (Reasoning)959392907288.4%
GPT-5.4 (Reasoning)100100100745986.5%
Grok 4.20 (Beta, Reasoning)1009391747486.4%
Llama 3.1 70B1001001001002785.4%
Llama 3.1 Nemotron 70B1009082787584.9%
Mistral Large10010075736682.9%
Claude Opus 410010083834081.3%
DeepSeek V3 (2025-03-24)100100100100681.3%
GPT-5.4 Nano (Reasoning)918481766379.0%
Arcee AI: Trinity Large (Preview)1008371706678.1%
GPT-5.4 (Reasoning, Low)10010071605677.5%
Hermes 3 70B10010090801376.6%
WizardLM 2 8x22b1009887623576.5%
MiniMax M2.710010010082076.4%
Qwen 2.5 72B10010073595076.3%
Mistral Large 2100100100562576.1%
Claude Sonnet 4.5100100100671175.5%
Claude Opus 4.510010072623874.3%
Claude Opus 4.610010093562274.2%
Mistral Small 3.2 24B100100100521473.2%
GPT-5.4 Mini1008270585472.8%
Grok 4.1 Fast1009278642972.5%
Claude Sonnet 41008169575071.4%
GPT-5.4 Nano (Reasoning, Low)1007169664269.7%
Z.AI GLM 4.5949468573569.5%
Claude Haiku 4.51007978491363.5%
Ministral 3B1006561543763.5%
Grok 4.20 (Beta)747158575663.2%
Claude 3.7 Sonnet1006559433861.1%
MoonshotAI: Kimi K2.51007257433060.3%
Cohere Command R+ (Aug. 2024)1007455512060.0%
Claude Sonnet 4.610099950058.9%
GPT-5.4 Nano706459514557.9%
Grok 4 Fast100746941457.7%
Nemotron 3 Nano915849474357.5%
Writer: Palmyra X51001007114057.1%
Claude 3 Haiku1006047453056.5%
Claude 3.5 Sonnet1001004137256.1%
Hermes 3 405B846757353355.1%
Qwen 3.5 Plus (2026-02-15)907143392954.4%
Mistral Small 4905352472954.2%
Gemini 3 Pro (Preview)757264431353.3%
GPT-5100585751053.3%
Mistral Medium 3.1888564161153.0%
Rocinante 12B10082728052.3%
Gemini 3 Flash (Preview, Reasoning)826150392852.1%
MiniMax M2.51009327141349.5%
GPT-5.4635352403949.5%
Claude Sonnet 4.6 (Reasoning)735447462348.6%
Mistral NeMO1001002220048.4%
DeepSeek V3 (2024-12-26)716057292248.0%
GPT-5 Mini73736726047.9%
Z.AI GLM 5 Turbo92785614047.8%
GPT-4o, Aug. 6th (temp=1)1005133302347.5%
Grok 4705549332045.2%
LFM2 24B685243322744.7%
Gemini 2.5 Flash (Reasoning)100812616044.5%
GPT-4o Mini (temp=1)100473929744.3%
GPT-4o, May 13th (temp=0)10077420043.7%
Mistral Small Creative66594537742.8%
GPT-4o, Aug. 6th (temp=0)9078400041.6%
Mistral Large 372585223041.0%
Ministral 3 3B1004530161341.0%
ByteDance Seed 2.0 Lite63494638840.9%
ByteDance Seed 1.6 Flash100652810040.5%
Gemini 3.1 Pro (Preview)66524141040.0%
Stealth: Healer Alpha9175320039.5%
ByteDance Seed 2.0 Mini89742310039.0%
Ministral 3 14B545149241738.9%
Mistral Small 4 (Reasoning)58574332037.9%
Ministral 8B51494536036.1%
ByteDance Seed 1.67753335033.5%
Gemini 2.5 Flash Lite80531110531.7%
Ministral 3 8B55532814531.2%
GPT-4o Mini (temp=0)10027208031.0%
Qwen3 235B A22B Instruct 250764592011030.9%
Gemini 2.5 Flash Lite (Reasoning)6052343029.7%
Qwen 3 32B762018161629.2%
DeepSeek-V2 Chat1002660026.4%
Qwen 3.5 35B6745160025.7%
Gemini 3 Flash (Preview)39322927025.6%
GPT-4.18233100025.1%
Nemotron 3 Super4834339024.9%
GPT-5 Nano303029211124.0%
GPT-4.1 Mini6920129723.4%
Qwen 3.5 27B713760022.6%
Arcee AI: Trinity Mini6033200022.5%
Stealth: Hunter Alpha6131113021.1%
Qwen 3.5 397B A17B5528180020.4%
Aion 2.0462040013.9%
Inception Mercury 2332960013.7%
Stealth: Aurora Alpha3814123013.4%
Gemma 3 27B3022110012.7%
Gemini 2.5 Flash322030010.9%
Z.AI GLM 4.62018140010.4%
Z.AI GLM 4.7 Flash51000010.3%
Qwen 3.5 Flash51000010.1%
Qwen 3.5 9B4040008.6%
Inception Mercury19170007.2%
Gemini 2.5 Pro2350005.6%
DeepSeek V3.22300004.6%
GPT-4.1 Nano1660004.4%
Gemini 3.1 Flash Lite (Preview)2000004.0%
Z.AI GLM 4.71600003.2%
o4 Mini1040002.7%
Qwen 3.5 122B300000.6%
o4 Mini High000000.0%
DeepSeek V3.1000000.0%
Gemma 3 4B000000.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
GPT-5.4 (Reasoning)100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
Grok 4 Fast100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Mistral Large100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Ministral 8B100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
Mistral Small 410010010010010099.9%
GPT-5.4 Mini (Reasoning, Low)10010010010010099.9%
Qwen 2.5 72B1001001001009999.8%
Cohere Command R+ (Aug. 2024)1001001001009999.8%
Llama 3.1 70B1001001001009999.8%
GPT-5.2100100100999698.9%
Gemma 3 12B1001001001009498.9%
GPT-5.4 Nano1001001001009498.8%
Z.AI GLM 5100100100989498.4%
Nemotron 3 Nano1001001001009198.3%
Z.AI GLM 4.51001001001009198.2%
Claude 3.5 Haiku1001001001009198.2%
GPT-4o, May 13th (temp=1)1001001001009198.2%
Mistral Small 4 (Reasoning)1001001001009098.0%
GPT-5.4 Nano (Reasoning)100100100969498.0%
Gemini 3.1 Pro (Preview)1001001001008897.6%
ByteDance Seed 1.6 Flash1001001001008897.6%
Aion 2.0100100100949497.5%
Hermes 3 405B1001001001008797.3%
Qwen3 235B A22B Instruct 25071001001001008597.0%
Writer: Palmyra X5100100100949197.0%
Inception Mercury 21001001001008597.0%
GPT-4.1 Mini1001001001008496.9%
Mistral Small Creative1001001001008296.4%
MiniMax M2.5100100100988396.3%
Gemma 3 4B100100100919196.3%
DeepSeek-V2 Chat100100100938796.0%
WizardLM 2 8x22b10010097958795.8%
Arcee AI: Trinity Large (Preview)100100100918795.6%
GPT-5 Nano100100100918595.3%
DeepSeek V3.210010097928795.2%
DeepSeek V3 (2024-12-26)1001001001007695.2%
Claude Opus 4.6 (Reasoning)10010098908895.1%
Gemini 2.5 Flash (Reasoning)1001001001007595.0%
ByteDance Seed 2.0 Lite10010095948795.0%
o4 Mini High10010099928394.8%
Z.AI GLM 5 Turbo100100100938094.7%
Ministral 3 8B100100100888594.6%
GPT-4o Mini (temp=0)10010099947894.3%
Qwen 3.5 Plus (2026-02-15)10010098928194.1%
Gemma 3 27B1001001001006793.5%
Claude Sonnet 4.610010099986993.4%
GPT-4o, Aug. 6th (temp=1)1001001001006693.2%
Gemini 2.5 Flash10010094917892.5%
LFM2 24B1001001001006292.3%
Qwen 3 32B1001001001006292.3%
GPT-5 Mini10010089888191.6%
Gemini 2.5 Flash Lite100100100896891.5%
Grok 4.20 (Beta)10010095897090.9%
Mistral Large 310010089887189.5%
Nemotron 3 Super1009187858589.4%
Gemini 2.5 Flash Lite (Reasoning)1009985837888.9%
GPT-4.1100100100816388.8%
Ministral 3 3B1001001001004488.8%
Qwen 3.5 397B A17B979290867688.1%
Gemini 2.5 Pro1009694796987.7%
MiniMax M2.710010093826287.4%
Gemini 3 Pro (Preview)1009987797287.2%
Mistral Medium 3.11009182818086.7%
GPT-5.11008783807885.7%
Claude Opus 4.61009892765984.8%
Stealth: Healer Alpha968887777584.5%
Rocinante 12B100100100655684.3%
GPT-4o Mini (temp=1)10010089765383.6%
o4 Mini919179777682.9%
ByteDance Seed 1.610010082755882.9%
Stealth: Hunter Alpha10010099575081.1%
DeepSeek V3.11009891734280.8%
Hermes 3 70B100100100682879.2%
ByteDance Seed 2.0 Mini10010087545378.7%
Z.AI GLM 4.61009375615977.6%
Z.AI GLM 4.7909079705877.5%
Gemini 3 Flash (Preview)1009876713676.1%
Z.AI GLM 4.7 Flash998779594673.9%
Claude Sonnet 4.6 (Reasoning)1008177634172.4%
Stealth: Aurora Alpha100938662068.1%
GPT-5807972703867.7%
Gemini 3 Flash (Preview, Reasoning)1007770483064.8%
Qwen 3.5 Flash896462563160.3%
Qwen 3.5 9B1006254424159.9%
GPT-4.1 Nano787061502857.4%
Qwen 3.5 122B757445414155.2%
Mistral Small 3.2 24B100925331055.1%
Gemini 3.1 Flash Lite (Preview)795447423251.0%
Qwen 3.5 35B646151502950.8%
Arcee AI: Trinity Mini625249443648.5%
Mistral NeMO100443831744.0%
Qwen 3.5 27B555238332039.7%
Inception Mercury10069150036.7%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Llama 3.1 8B100100100100100100.0%
Llama 3.1 Nemotron 70B1001001001009899.6%
Mistral Large1001001001008997.7%
GPT-5.4 Mini10010096949196.1%
Llama 3.1 70B10010098919095.9%
Claude 3.5 Haiku10010093796687.5%
Mistral Large 31009893745483.8%
GPT-5.4 Mini (Reasoning)10010080726282.8%
GPT-5.4 Mini (Reasoning, Low)1008679796982.5%
DeepSeek V3 (2025-03-24)10010084725381.9%
GPT-5.4 Nano (Reasoning, Low)1009489655881.1%
Rocinante 12B10010078656181.0%
Ministral 3 3B1008375756880.0%
Mistral Small Creative100100100603579.1%
Mistral Small 4 (Reasoning)979567605574.8%
GPT-5.4928474625272.8%
Claude 3 Haiku997862585470.3%
GPT-5.4 Nano (Reasoning)959174612970.1%
Ministral 3B958269525169.5%
Mistral Large 210010076412368.2%
Qwen 2.5 72B996161605467.2%
GPT-5.4 Nano807260565263.8%
GPT-5.4 (Reasoning)876862564563.5%
Grok 4 Fast757262555363.4%
Ministral 3 8B1008945393862.5%
GPT-5.4 (Reasoning, Low)846864573661.9%
GPT-4o, May 13th (temp=1)100877539060.0%
GPT-5.2775755545158.9%
Grok 4.1 Fast706656492954.1%
Grok 4.20 (Beta, Reasoning)715949493953.4%
Z.AI GLM 5 Turbo80745449051.4%
Qwen 3 32B1006045381451.4%
Ministral 8B846758252251.1%
Mistral Small 41007241261350.6%
MiniMax M2.7595747474250.5%
LFM2 24B956350241750.0%
Claude 3.7 Sonnet685857422449.9%
Grok 4.20 (Beta)725449462148.5%
Gemma 3 27B100100420048.3%
Claude Sonnet 4.51004735302547.4%
WizardLM 2 8x22b787244202046.7%
Z.AI GLM 567635932946.1%
Mistral Small 3.2 24B655858351245.7%
Claude Opus 4.558585747745.5%
Claude Haiku 4.5784442372344.9%
Mistral Medium 3.175733835244.6%
Claude Sonnet 4945434311044.6%
Hermes 3 405B100674212044.1%
Ministral 3 14B645741332042.8%
Qwen 3.5 Plus (2026-02-15)83613426942.3%
Cohere Command R+ (Aug. 2024)75733112438.7%
Hermes 3 70B1009300038.6%
MoonshotAI: Kimi K2.556555422037.3%
Claude Opus 4.67663266034.2%
Mistral NeMO7749413034.0%
Gemini 3 Flash (Preview)543531222232.8%
Claude Opus 46766221031.3%
ByteDance Seed 1.68142330031.0%
Gemini 3 Flash (Preview, Reasoning)6860231030.4%
Z.AI GLM 4.55147369028.5%
Gemini 3.1 Pro (Preview)434227181028.1%
DeepSeek V3 (2024-12-26)8920148026.1%
GPT-5 Nano6730253024.9%
GPT-4o, Aug. 6th (temp=1)873500024.3%
DeepSeek V3.26527188324.2%
GPT-4o, Aug. 6th (temp=0)694200022.1%
Arcee AI: Trinity Large (Preview)902000022.1%
Grok 4561614131022.0%
GPT-4o, May 13th (temp=0)852000021.0%
Qwen3 235B A22B Instruct 25074639141020.1%
MiniMax M2.55028144019.2%
GPT-4.1 Mini31291711017.6%
DeepSeek-V2 Chat5715140017.3%
Writer: Palmyra X53330230017.1%
Gemini 2.5 Flash5118160017.1%
Claude Opus 4.6 (Reasoning)3228222016.8%
Claude 3.5 Sonnet27241511015.5%
GPT-4.123202014015.3%
GPT-5 Mini3127108015.2%
o4 Mini3530100015.0%
Gemini 2.5 Pro2220188414.5%
Gemma 3 12B462320014.2%
Stealth: Aurora Alpha3127130014.2%
Qwen 3.5 397B A17B3024170014.2%
Stealth: Healer Alpha332890014.0%
GPT-5.129191111013.9%
Gemini 2.5 Flash Lite3220120012.9%
GPT-4o Mini (temp=1)2020107612.5%
Inception Mercury352700012.5%
ByteDance Seed 1.6 Flash60000012.0%
Nemotron 3 Nano262560011.2%
Stealth: Hunter Alpha2413127011.2%
Gemini 3 Pro (Preview)351280010.9%
Z.AI GLM 4.7282400010.3%
Aion 2.042810010.3%
Qwen 3.5 122B22129309.2%
GPT-4.1 Nano3590008.8%
Claude Sonnet 4.63842008.6%
ByteDance Seed 2.0 Lite4100008.1%
Gemini 2.5 Flash Lite (Reasoning)20119008.0%
Arcee AI: Trinity Mini24160008.0%
DeepSeek V3.118184008.0%
Qwen 3.5 Flash3500007.1%
Qwen 3.5 27B2270005.7%
GPT-4o Mini (temp=0)2800005.6%
Inception Mercury 22010004.2%
GPT-52000004.0%
Qwen 3.5 9B1100002.2%
Z.AI GLM 4.6910002.1%
Gemma 3 4B500001.0%
Qwen 3.5 35B300000.6%
Nemotron 3 Super110000.5%
Claude Sonnet 4.6 (Reasoning)000000.0%
o4 Mini High000000.0%
ByteDance Seed 2.0 Mini000000.0%
Gemini 2.5 Flash (Reasoning)000000.0%
Gemini 3.1 Flash Lite (Preview)000000.0%
Z.AI GLM 4.7 Flash000000.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
GPT-5.4 (Reasoning)100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
GPT-5.4 Mini (Reasoning)1001001001009999.8%
Claude Sonnet 41001001001009999.7%
Claude Opus 41001001001009899.7%
GPT-5.4 (Reasoning, Low)1001001001009899.7%
GPT-5.41001001001009799.3%
ByteDance Seed 1.6 Flash100100100989798.9%
Mistral Small Creative10010098989598.2%
Mistral Large 31001001001009198.2%
Mistral Large1001001001009098.1%
Claude Haiku 4.5100100100969397.8%
Claude Opus 4.51001001001008797.3%
Grok 4.20 (Beta, Reasoning)100100100949297.0%
Mistral Medium 3.110010097949396.8%
Grok 4.1 Fast10010099958996.5%
Gemini 2.5 Flash1001001001007995.8%
Rocinante 12B1001001001007995.8%
Grok 4100100100968295.7%
GPT-5.210010098918995.7%
Arcee AI: Trinity Large (Preview)1001001001007995.7%
Z.AI GLM 5 Turbo100100100898895.5%
DeepSeek V3 (2025-03-24)1001001001007695.3%
Mistral Large 21001001001007595.0%
Mistral Small 4 (Reasoning)100100100997494.6%
Ministral 3 14B100100100987394.3%
Cohere Command R+ (Aug. 2024)10010095928093.4%
Qwen 3 32B10010099877391.9%
Writer: Palmyra X510010098926791.3%
GPT-5.4 Mini (Reasoning, Low)10010097847390.9%
GPT-5 Nano100100100817390.9%
LFM2 24B100100100896490.6%
MiniMax M2.510010088838290.6%
Ministral 3 8B10010099985590.3%
MiniMax M2.71009185858589.4%
GPT-5.11009994866789.0%
Z.AI GLM 510010094806988.5%
Arcee AI: Trinity Mini100100100766588.3%
GPT-5.4 Nano (Reasoning, Low)1009085828287.8%
DeepSeek-V2 Chat10010091846287.3%
GPT-4o, Aug. 6th (temp=0)979791876387.0%
Hermes 3 405B100100100755586.0%
Qwen3 235B A22B Instruct 250710010094835285.8%
Nemotron 3 Super1009181807785.8%
Grok 4 Fast1001001001002685.1%
GPT-4o, May 13th (temp=0)10010081796484.8%
Qwen 2.5 72B1009890746084.4%
Nemotron 3 Nano1009090726783.7%
Ministral 8B1009992675983.3%
MoonshotAI: Kimi K2.51008787756582.7%
Gemini 2.5 Flash (Reasoning)1009281716882.5%
GPT-4o, May 13th (temp=1)1009182716782.3%
GPT-4o Mini (temp=1)1008781736280.7%
Gemini 2.5 Flash Lite (Reasoning)1009676676380.6%
Gemini 3.1 Pro (Preview)989391615880.4%
Ministral 3 3B1009479764979.7%
GPT-5.4 Nano (Reasoning)858282777279.6%
Gemma 3 12B1009785644979.0%
Stealth: Hunter Alpha1008780675978.5%
GPT-5.4 Nano978179756078.3%
Qwen 3.5 Plus (2026-02-15)1007675726978.2%
Ministral 3B10010075724077.3%
Gemma 3 4B1008979664876.4%
GPT-4.110010063615676.0%
Aion 2.01009268635275.0%
o4 Mini1008681614775.0%
Mistral Small 41007268686274.1%
DeepSeek V3.2897976655973.6%
ByteDance Seed 2.0 Lite1008780732072.0%
Qwen 3.5 397B A17B847574635970.9%
GPT-4o, Aug. 6th (temp=1)898070704370.3%
GPT-4o Mini (temp=0)787369686170.0%
Claude Opus 4.6 (Reasoning)797972714569.3%
Stealth: Healer Alpha957572713268.9%
Gemini 2.5 Flash Lite1007967524768.9%
Inception Mercury 2987758535067.2%
Z.AI GLM 4.61008764572666.8%
Claude Opus 4.6929161444366.5%
WizardLM 2 8x22b1001007953066.3%
o4 Mini High926867663666.0%
Gemini 2.5 Pro858057524964.5%
GPT-5 Mini827067564163.1%
Gemini 3 Flash (Preview, Reasoning)987950433861.7%
Gemma 3 27B818050423357.3%
DeepSeek V3.1757459333154.5%
GPT-4.1 Nano85766044754.2%
Gemini 3 Pro (Preview)615452514452.5%
ByteDance Seed 2.0 Mini625752423349.5%
Qwen 3.5 9B86685636049.2%
Gemini 3 Flash (Preview)625943403748.1%
Z.AI GLM 4.7615749462046.6%
Qwen 3.5 27B716538322446.0%
Hermes 3 70B84625722045.0%
Stealth: Aurora Alpha79773126042.6%
ByteDance Seed 1.6685047201539.8%
Z.AI GLM 4.7 Flash65514137539.7%
Claude Sonnet 4.6603935332838.9%
Claude Sonnet 4.6 (Reasoning)585142231838.2%
Gemini 3.1 Flash Lite (Preview)484339292937.5%
Qwen 3.5 35B7758416236.7%
GPT-5453730281731.4%
Qwen 3.5 Flash453433231529.8%
Mistral NeMO553125201829.8%
Qwen 3.5 122B6231129022.6%
Mistral Small 3.2 24B302500011.1%
Inception Mercury3880009.3%

Novelcrafter Default Prompt

Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
MoonshotAI: Kimi K2.5100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
GPT-5.2100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
MiniMax M2.7100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
Gemini 2.5 Flash (Reasoning)100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-5 Nano100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
Qwen 3 32B100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Inception Mercury100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Mistral Small 4100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Hermes 3 70B100100100100100100.0%
Ministral 3 14B100100100100100100.0%
GPT-4.1 Nano100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Ministral 8B100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
LFM2 24B100100100100100100.0%
Rocinante 12B100100100100100100.0%
GPT-5.110010010010010099.9%
Z.AI GLM 51001001001009899.7%
Ministral 3 3B1001001001009899.5%
Claude Haiku 4.51001001001009799.5%
Arcee AI: Trinity Large (Preview)1001001001009699.2%
GPT-4o Mini (temp=0)1001001001009599.0%
MiniMax M2.51001001001009298.5%
Claude Opus 4.61001001001009198.3%
Claude Sonnet 41001001001009198.2%
Ministral 3 8B1001001001009198.2%
GPT-4.11001001001009198.1%
Gemini 2.5 Flash100100100959497.9%
Qwen 3.5 27B1001001001008897.6%
Gemma 3 4B1001001001008797.3%
GPT-5.4 Nano100100100998697.1%
Inception Mercury 210010097969196.9%
Gemini 2.5 Flash Lite (Reasoning)1001001001008496.7%
DeepSeek V3.21001001001008296.4%
Stealth: Hunter Alpha1001001001008096.1%
Arcee AI: Trinity Mini1009999978395.8%
GPT-4o, May 13th (temp=0)1001001001007795.5%
GPT-5 Mini100100100997695.0%
Gemini 3 Pro (Preview)100100100967894.8%
Claude Sonnet 4.6100100100898494.6%
Z.AI GLM 4.71001001001007394.6%
Stealth: Aurora Alpha100100100878594.4%
DeepSeek V3.1100100100878393.9%
o4 Mini100100100996592.7%
Z.AI GLM 4.7 Flash10010092918092.7%
Qwen 3.5 122B100100100966692.5%
GPT-5100100100936792.1%
Qwen 3.5 Flash10010094907591.8%
ByteDance Seed 1.6 Flash10010095887691.7%
WizardLM 2 8x22b100100100926591.5%
Qwen 3.5 35B100100100827591.3%
Nemotron 3 Super100100100857191.3%
Aion 2.010010094877190.3%
Stealth: Healer Alpha10010099796889.3%
Gemini 3.1 Flash Lite (Preview)10010097975289.2%
o4 Mini High10010093816988.7%
Gemini 2.5 Flash Lite10010095737188.0%
Gemini 2.5 Pro1009891836186.6%
Z.AI GLM 4.61009675746281.5%
ByteDance Seed 2.0 Mini1007674747379.6%
Qwen 3.5 9B1008282634574.5%
Mistral Small 3.2 24B1006149431854.2%
Mistral NeMO100634132047.3%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
GPT-5.4100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Claude 3.5 Haiku100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Mistral Small 4100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Rocinante 12B100100100100100100.0%
Mistral Large 310010010010010099.9%
Claude Opus 4.51001001001009999.8%
Llama 3.1 70B1001001001009899.6%
Claude 3 Haiku1001001001009799.5%
Hermes 3 70B1001001001009799.4%
Mistral Large 21001001001009799.3%
LFM2 24B100100100989899.1%
MiniMax M2.71001001001009398.6%
Z.AI GLM 51001001001009198.2%
Claude Opus 4.6100100100989398.1%
Z.AI GLM 4.51001001001009098.0%
Grok 4.20 (Beta)1001001001009097.9%
Writer: Palmyra X5100100100959497.8%
Mistral Small 4 (Reasoning)1001001001008997.8%
DeepSeek V3 (2025-03-24)1001001001008997.8%
GPT-5.11001001001008797.3%
Ministral 3B1001001001008797.3%
Ministral 3 8B10010096959296.8%
Mistral Large1001001001008396.7%
GPT-5 Nano100100100919096.3%
Qwen3 235B A22B Instruct 2507100100100928895.9%
GPT-4o, Aug. 6th (temp=1)100100100968395.7%
Llama 3.1 Nemotron 70B1001001001007795.3%
GPT-5.4 Mini1001001001007695.1%
Qwen 3 32B10010099987794.8%
Arcee AI: Trinity Large (Preview)1001001001007494.8%
MiniMax M2.510010096948394.6%
Mistral Small Creative100100100957594.0%
Inception Mercury 2100100100927693.6%
Claude Opus 4.6 (Reasoning)100100100937293.1%
ByteDance Seed 1.6 Flash100100100887692.7%
GPT-5.210010091888492.6%
Gemma 3 27B1009896937291.8%
GPT-5.4 Mini (Reasoning, Low)10010094927391.6%
GPT-4.1 Mini1009892858291.4%
Gemma 3 12B10010099886891.0%
Qwen 3.5 397B A17B10010095837490.4%
GPT-4o Mini (temp=1)1009894856788.6%
Ministral 3 14B100100100855788.5%
Hermes 3 405B100100100994288.3%
Ministral 8B100100100776287.8%
GPT-4o, Aug. 6th (temp=0)1009997925087.5%
GPT-4o, May 13th (temp=1)10010098766387.3%
Grok 41009695776887.1%
Nemotron 3 Nano979492797286.8%
Qwen 3.5 Plus (2026-02-15)1009285847086.3%
GPT-4.110010098943986.2%
Cohere Command R+ (Aug. 2024)100100100775386.1%
GPT-5.4 Mini (Reasoning)10010090726886.0%
Ministral 3 3B1008783827685.6%
GPT-5.4 Nano (Reasoning)1009591706484.2%
DeepSeek V3 (2024-12-26)100100100645383.3%
Claude Sonnet 4.6 (Reasoning)989692675581.8%
GPT-5.4 Nano909088835080.2%
Gemini 2.5 Flash1008377736880.2%
MoonshotAI: Kimi K2.51008880676479.8%
Qwen 2.5 72B1009983704479.3%
Gemini 3 Flash (Preview, Reasoning)928977705877.2%
Grok 4.20 (Beta, Reasoning)929084753975.9%
Stealth: Aurora Alpha969488643375.0%
Stealth: Hunter Alpha907775706374.7%
Aion 2.0928576615674.0%
Qwen 3.5 Flash898873665073.3%
DeepSeek V3.11007370705072.6%
GPT-5.4 Nano (Reasoning, Low)907269676572.6%
Qwen 3.5 27B1007776683170.3%
Z.AI GLM 4.6877672664769.6%
Gemini 2.5 Flash (Reasoning)918658565168.2%
Stealth: Healer Alpha1009264493668.2%
Gemini 3 Pro (Preview)767268675267.0%
Gemini 3 Flash (Preview)838259565166.5%
DeepSeek V3.2937258534964.8%
Gemini 2.5 Flash Lite937371464164.7%
WizardLM 2 8x22b756766654964.6%
GPT-4o Mini (temp=0)907159573963.3%
Z.AI GLM 4.7776763624462.5%
o4 Mini777566643162.4%
GPT-5 Mini956360533160.2%
GPT-4.1 Nano755959544658.5%
Gemini 2.5 Flash Lite (Reasoning)916749483558.0%
GPT-4o, May 13th (temp=0)776361463556.4%
o4 Mini High918146372455.8%
DeepSeek-V2 Chat10010033301555.5%
Qwen 3.5 35B1006644353155.3%
Claude Sonnet 4.6876949402955.0%
GPT-5706951383552.4%
Nemotron 3 Super1006254331152.0%
Qwen 3.5 122B89665437550.1%
Arcee AI: Trinity Mini746556301447.8%
Qwen 3.5 9B100952516047.3%
Gemini 2.5 Pro595147423346.3%
Z.AI GLM 4.7 Flash65645820342.1%
Inception Mercury8174346039.0%
Gemma 3 4B615630202037.3%
ByteDance Seed 2.0 Lite60583614835.2%
Gemini 3.1 Flash Lite (Preview)47423130029.8%
Mistral NeMO621674017.9%
ByteDance Seed 2.0 Mini4223220017.3%
Mistral Small 3.2 24B5220100016.5%
ByteDance Seed 1.6700001.4%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
GPT-5.1100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
GPT-5.2100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
MiniMax M2.5100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
Claude 3.5 Haiku1001001001009799.4%
Llama 3.1 8B1001001001009398.5%
GPT-5.4 (Reasoning)1001001001007595.0%
Grok 4.20 (Beta, Reasoning)100100100967894.8%
Claude 3.5 Sonnet100100100878794.7%
Gemini 3.1 Pro (Preview)100100100908093.9%
Claude Sonnet 4.51001001001006793.4%
Claude Opus 41001001001006593.0%
Mistral Small 4 (Reasoning)10010092887290.4%
GPT-5.4 Nano (Reasoning, Low)1009693797488.4%
Llama 3.1 70B1001001001003587.1%
Z.AI GLM 5 Turbo1001001001003486.9%
Claude Opus 4.610010099765786.4%
Z.AI GLM 51001001001002284.5%
Claude 3.7 Sonnet10010091834984.5%
Claude Opus 4.6 (Reasoning)100100100783983.4%
GPT-4o, Aug. 6th (temp=1)100100100634782.0%
Qwen 2.5 72B100100100100080.0%
Claude Sonnet 4.6 (Reasoning)100100100622276.9%
Mistral Small 3.2 24B10010010080075.9%
Ministral 3 14B1009368625575.7%
Mistral Medium 3.11007574655974.6%
GPT-5 Nano1009663635174.5%
Z.AI GLM 4.510010067594473.9%
GPT-5.4 Nano (Reasoning)797976726273.6%
Qwen 3.5 Plus (2026-02-15)10010087601472.2%
GPT-5.4917269685971.6%
Grok 4.1 Fast10010094313070.9%
Writer: Palmyra X5948774524770.7%
Mistral Large 2959467524069.6%
DeepSeek V3 (2025-03-24)10010010047069.3%
Grok 4.20 (Beta)937870683769.1%
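MiniMax M2.7 | 100 | 100 | 100 | 33 | 12 | 68.9%
Claude Sonnet 4 | 100 | 100 | 65 | 50 | 20 | 67.0%
Ministral 3 3B | 100 | 82 | 75 | 57 | 18 | 66.5%
GPT-5.4 Nano | 100 | 74 | 70 | 47 | 41 | 66.3%
Rocinante 12B | 97 | 71 | 64 | 59 | 33 | 64.8%
Gemini 3 Pro (Preview) | 100 | 92 | 91 | 32 | 0 | 63.0%
Mistral Small 4 | 90 | 83 | 56 | 45 | 37 | 62.0%
MoonshotAI: Kimi K2.5 | 100 | 63 | 54 | 52 | 38 | 61.3%
Hermes 3 70B | 83 | 70 | 68 | 45 | 40 | 61.2%
GPT-4.1 | 100 | 90 | 48 | 44 | 20 | 60.3%
Arcee AI: Trinity Large (Preview) | 100 | 84 | 78 | 26 | 11 | 59.8%
Ministral 3 8B | 94 | 91 | 68 | 45 | 0 | 59.6%
Qwen3 235B A22B Instruct 2507 | 100 | 70 | 61 | 49 | 13 | 58.7%
Mistral Small Creative | 87 | 81 | 69 | 55 | 0 | 58.3%
Claude Haiku 4.5 | 100 | 100 | 45 | 42 | 0 | 57.4%
Ministral 3B | 100 | 100 | 52 | 18 | 13 | 56.5%
Grok 4 Fast | 75 | 72 | 53 | 48 | 28 | 55.2%
Claude 3 Haiku | 100 | 85 | 51 | 29 | 8 | 54.6%
ByteDance Seed 1.6 Flash | 96 | 72 | 60 | 32 | 12 | 54.5%
Nemotron 3 Super | 100 | 62 | 51 | 46 | 9 | 53.7%
Gemini 3 Flash (Preview, Reasoning) | 80 | 53 | 45 | 42 | 41 | 52.2%
Mistral Large 3 | 100 | 91 | 52 | 11 | 0 | 50.9%
Hermes 3 405B | 95 | 63 | 62 | 33 | 0 | 50.6%
Mistral NeMO | 100 | 92 | 42 | 17 | 0 | 50.2%
Gemma 3 12B | 100 | 100 | 23 | 13 | 7 | 48.5%
o4 Mini High | 100 | 100 | 40 | 0 | 0 | 48.0%
Gemini 2.5 Flash (Reasoning) | 100 | 80 | 36 | 20 | 0 | 47.2%
DeepSeek-V2 Chat | 100 | 100 | 33 | 0 | 0 | 46.7%
Gemini 2.5 Flash | 100 | 96 | 37 | 0 | 0 | 46.7%
Mistral Large | 100 | 100 | 32 | 0 | 0 | 46.4%
Gemini 3.1 Flash Lite (Preview) | 89 | 81 | 60 | 1 | 0 | 46.2%
GPT-4.1 Mini | 89 | 57 | 30 | 27 | 12 | 43.1%
GPT-4o, May 13th (temp=1) | 100 | 46 | 36 | 26 | 1 | 41.8%
Gemini 2.5 Flash Lite | 100 | 83 | 25 | 0 | 0 | 41.5%
LFM2 24B | 76 | 63 | 42 | 15 | 0 | 39.2%
Z.AI GLM 4.7 | 100 | 52 | 18 | 17 | 5 | 38.4%
GPT-5 | 100 | 87 | 4 | 0 | 0 | 38.1%
GPT-5.4 Mini | 65 | 40 | 34 | 31 | 20 | 37.9%
GPT-4o, Aug. 6th (temp=0) | 71 | 45 | 41 | 33 | 0 | 37.7%
Qwen 3 32B | 70 | 36 | 36 | 30 | 10 | 36.5%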
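Gemini 3 Flash (Preview) | 46 | 44 | 39 | 35 | 17 | 36.2%
Cohere Command R+ (Aug. 2024) | 82 | 76 | 15 | 3 | 0 | 34.9%
ByteDance Seed 2.0 Mini | 63 | 54 | 53 | 1 | 0 | 34.5%
WizardLM 2 8x22b | 98 | 50 | 24 | 0 | 0 | 34.4%
GPT-4o Mini (temp=1) | 75 | 49 | 34 | 8 | 0 | 33.2%
Z.AI GLM 4.7 Flash | 80 | 48 | 18 | 11 | 0 | 31.3%
Stealth: Aurora Alpha | 100 | 35 | 20 | 0 | 0 | 31.0%
Qwen 3.5 27B | 57 | 51 | 41 | 6 | 0 | 30.9%
Inception Mercury | 100 | 28 | 8 | 0 | 0 | 27.1%
Nemotron 3 Nano | 63 | 47 | 25 | 0 | 0 | 27.0%
Ministral 8B | 100 | 22 | 11 | 0 | 0 | 26.6%
ByteDance Seed 2.0 Lite | 41 | 29 | 29 | 27 | 0 | 25.1%
Qwen 3.5 9B | 58 | 31 | 28 | 0 | 0 | 23.5%
DeepSeek V3 (2024-12-26) | 67 | 26 | 20 | 0 | 0 | 22.5%
Aion 2.0 | 82 | 30 | 0 | 0 | 0 | 22.4%
GPT-4o, May 13th (temp=0) | 100 | 7 | 0 | 0 | 0 | 21.5%
Gemini 2.5 Flash Lite (Reasoning) | 60 | 36 | 10 | 0 | 0 | 21.2%
Qwen 3.5 Flash | 63 | 39 | 3 | 0 | 0 | 20.9%
Gemma 3 27B | 45 | 30 | 30 | 0 | 0 | 20.9%
Stealth: Hunter Alpha | 100 | 2 | 0 | 0 | 0 | 20.4%
o4 Mini | 100 | 0 | 0 | 0 | 0 | 20.0%
Qwen 3.5 35B | 68 | 24 | 0 | 0 | 0 | 18.3%
ByteDance Seed 1.6 | 65 | 24 | 0 | 0 | 0 | 18.0%
Qwen 3.5 397B A17B | 28 | 18 | 14 | 11 | 2 | 14.6%
Qwen 3.5 122B | 49 | 9 | 0 | 0 | 0 | 11.7%
Z.AI GLM 4.6 | 30 | 28 | 0 | 0 | 0 | 11.7%
Inception Mercury 2 | 57 | 0 | 0 | 0 | 0 | 11.4%
Stealth: Healer Alpha | 46 | 0 | 0 | 0 | 0 | 9.2%
GPT-5 Mini | 46 | 0 | 0 | 0 | 0 | 9.2%
Grok 4 | 35 | 0 | 0 | 0 | 0 | 7.0%
Arcee AI: Trinity Mini | 33 | 0 | 0 | 0 | 0 | 6.6%
DeepSeek V3.2 | 27 | 0 | 0 | 0 | 0 | 5.5%
Gemini 2.5 Pro | 20 | 7 | 0 | 0 | 0 | 5.3%
GPT-4o Mini (temp=0) | 16 | 4 | 0 | 0 | 0 | 4.1%
DeepSeek V3.1 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0.0%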
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 99.9%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 99.9%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 99 | 99 | 99.6%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 98 | 99.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 98 | 99.6%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.6%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 97 | 99.4%
Mistral Large 2 | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-5.1 | 100 | 100 | 100 | 98 | 98 | 99.2%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 94 | 98.9%
Gemma 3 27B | 100 | 100 | 100 | 100 | 93 | 98.5%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 93 | 98.5%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 90 | 98.1%
LFM2 24B | 100 | 100 | 100 | 100 | 89 | 97.9%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 88 | 97.6%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 88 | 97.6%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 87 | 97.3%
GPT-5 Mini | 100 | 100 | 100 | 94 | 92 | 97.3%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 85 | 97.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 85 | 97.0%
GPT-5.4 Nano | 100 | 100 | 100 | 98 | 85 | 96.7%
Ministral 3 8B | 100 | 100 | 100 | 98 | 84 | 96.5%
Claude 3.5 Haiku | 100 | 100 | 100 | 95 | 87 | 96.3%
Rocinante 12B | 100 | 100 | 100 | 97 | 82 | 95.9%
Mistral Small Creative | 100 | 100 | 100 | 91 | 87 | 95.5%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 76 | 95.3%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 71 | 94.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 99 | 71 | 94.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 68 | 93.7%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 66 | 93.1%
Qwen 2.5 72B | 100 | 100 | 98 | 92 | 76 | 93.1%
Mistral Large 3 | 100 | 100 | 100 | 97 | 68 | 93.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 64 | 92.9%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 90 | 73 | 92.5%
Mistral Large | 100 | 100 | 100 | 93 | 67 | 92.0%
Qwen 3 32B | 100 | 100 | 100 | 90 | 70 | 91.9%
Claude Haiku 4.5 | 100 | 100 | 100 | 99 | 59 | 91.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 93 | 84 | 79 | 91.3%
Mistral Medium 3.1 | 100 | 100 | 96 | 88 | 72 | 91.1%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 55 | 91.1%
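Gemini 3 Flash (Preview) | 100 | 100 | 95 | 78 | 75 | 89.4%
GPT-4.1 Mini | 100 | 100 | 100 | 82 | 62 | 88.8%
Gemini 2.5 Flash | 100 | 100 | 85 | 80 | 79 | 88.7%
Aion 2.0 | 100 | 100 | 100 | 83 | 60 | 88.7%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 84 | 58 | 88.3%
Qwen 3.5 397B A17B | 100 | 100 | 84 | 78 | 76 | 87.5%
MiniMax M2.7 | 100 | 97 | 85 | 83 | 72 | 87.2%
Z.AI GLM 4.7 | 100 | 100 | 100 | 87 | 42 | 85.7%
Hermes 3 70B | 100 | 100 | 93 | 76 | 54 | 84.6%
Claude Sonnet 4.6 | 100 | 100 | 84 | 79 | 59 | 84.4%
GPT-5 | 99 | 89 | 80 | 77 | 76 | 84.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 87 | 34 | 84.1%
GPT-4.1 | 100 | 100 | 100 | 77 | 42 | 83.7%
Stealth: Healer Alpha | 90 | 90 | 87 | 87 | 64 | 83.5%
Inception Mercury | 100 | 100 | 78 | 75 | 55 | 81.7%
ByteDance Seed 2.0 Mini | 100 | 93 | 79 | 74 | 59 | 81.0%
Gemini 2.5 Flash Lite (Reasoning) | 89 | 82 | 79 | 78 | 72 | 80.0%
WizardLM 2 8x22b | 100 | 100 | 87 | 82 | 27 | 79.1%
DeepSeek V3.2 | 100 | 85 | 76 | 68 | 65 | 79.1%
ByteDance Seed 2.0 Lite | 100 | 100 | 85 | 62 | 48 | 79.1%
Gemini 2.5 Flash (Reasoning) | 100 | 85 | 83 | 63 | 61 | 78.5%
Qwen 3.5 Flash | 100 | 92 | 89 | 59 | 52 | 78.3%
Qwen 3.5 27B | 100 | 100 | 97 | 70 | 24 | 78.2%
Stealth: Hunter Alpha | 98 | 82 | 80 | 65 | 61 | 77.1%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 89 | 49 | 47 | 77.1%
Gemini 2.5 Flash Lite | 100 | 100 | 81 | 68 | 36 | 77.0%
Qwen 3.5 122B | 100 | 85 | 75 | 71 | 53 | 76.8%
Gemma 3 4B | 100 | 97 | 94 | 62 | 31 | 76.8%
Qwen 3.5 9B | 100 | 95 | 87 | 86 | 15 | 76.4%
Gemini 3 Flash (Preview, Reasoning) | 87 | 77 | 73 | 70 | 69 | 75.0%
Z.AI GLM 4.6 | 100 | 79 | 77 | 63 | 53 | 74.4%
Gemini 2.5 Pro | 94 | 90 | 68 | 67 | 50 | 73.8%
o4 Mini High | 100 | 75 | 73 | 71 | 45 | 72.7%
DeepSeek V3.1 | 100 | 80 | 60 | 48 | 45 | 66.6%
Qwen 3.5 35B | 94 | 87 | 74 | 71 | 5 | 66.1%
Arcee AI: Trinity Mini | 79 | 71 | 71 | 62 | 33 | 63.4%
GPT-4.1 Nano | 81 | 74 | 61 | 60 | 36 | 62.6%
Gemini 3.1 Flash Lite (Preview) | 59 | 53 | 52 | 49 | 37 | 49.8%
ByteDance Seed 1.6 | 71 | 54 | 49 | 35 | 34 | 48.6%
Mistral NeMO | 51 | 50 | 41 | 26 | 16 | 36.8%
Mistral Small 3.2 24B | 100 | 38 | 0 | 0 | 0 | 27.7%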
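Model | #1 | #2 | #3 | #4 | #5 | Avg
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 99 | 95 | 98.8%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 84 | 96.8%
Z.AI GLM 5 Turbo | 100 | 99 | 95 | 89 | 83 | 93.2%
GPT-5.2 | 100 | 100 | 100 | 100 | 62 | 92.3%
Gemini 3.1 Pro (Preview) | 97 | 95 | 89 | 87 | 82 | 89.9%
Mistral Large 3 | 100 | 100 | 100 | 87 | 58 | 89.0%
Claude 3.5 Haiku | 100 | 100 | 87 | 80 | 77 | 88.8%
Ministral 3 3B | 100 | 100 | 100 | 100 | 13 | 82.6%
GPT-5.4 | 100 | 100 | 74 | 73 | 64 | 82.4%
GPT-5.4 Nano (Reasoning) | 94 | 85 | 83 | 77 | 69 | 81.5%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 77 | 67 | 60 | 81.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 85 | 13 | 79.7%
Mistral Large 2 | 100 | 100 | 76 | 61 | 58 | 79.0%
GPT-5.4 Nano | 100 | 82 | 74 | 72 | 66 | 78.7%
Mistral Small 4 (Reasoning) | 88 | 85 | 83 | 80 | 55 | 78.5%
Llama 3.1 70B | 100 | 100 | 95 | 84 | 10 | 77.9%
GPT-5.4 Mini (Reasoning) | 100 | 82 | 80 | 72 | 53 | 77.4%
Mistral Small 4 | 100 | 100 | 85 | 71 | 31 | 77.2%
Z.AI GLM 4.5 | 100 | 100 | 76 | 57 | 52 | 76.9%
GPT-5.4 Mini | 100 | 81 | 79 | 75 | 43 | 75.4%
Claude 3 Haiku | 83 | 80 | 77 | 69 | 61 | 74.2%
Ministral 3 14B | 89 | 88 | 73 | 68 | 51 | 74.0%
Claude Sonnet 4 | 100 | 82 | 75 | 75 | 24 | 71.1%
Ministral 8B | 98 | 81 | 62 | 55 | 54 | 69.9%
GPT-5.4 Mini (Reasoning, Low) | 85 | 83 | 58 | 55 | 53 | 66.8%
Claude 3.7 Sonnet | 100 | 100 | 75 | 48 | 7 | 66.0%
Rocinante 12B | 100 | 100 | 80 | 41 | 0 | 64.2%
GPT-5 Nano | 100 | 93 | 85 | 40 | 0 | 63.8%
Claude Haiku 4.5 | 100 | 100 | 47 | 36 | 33 | 63.2%
GPT-5.4 (Reasoning, Low) | 100 | 61 | 57 | 50 | 48 | 63.1%
Mistral Small Creative | 92 | 83 | 69 | 37 | 31 | 62.6%
Grok 4.1 Fast | 100 | 56 | 55 | 53 | 49 | 62.5%
Grok 4.20 (Beta) | 86 | 71 | 62 | 54 | 38 | 62.0%
Grok 4.20 (Beta, Reasoning) | 76 | 73 | 56 | 52 | 48 | 60.9%
Hermes 3 405B | 100 | 100 | 58 | 45 | 2 | 60.8%
Ministral 3 8B | 100 | 79 | 69 | 37 | 8 | 58.6%
Z.AI GLM 5 | 80 | 63 | 53 | 50 | 40 | 57.3%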
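Arcee AI: Trinity Large (Preview) | 75 | 71 | 70 | 44 | 25 | 56.9%
Qwen 3 32B | 100 | 71 | 65 | 47 | 0 | 56.5%
Gemini 3 Flash (Preview) | 87 | 81 | 46 | 37 | 23 | 54.7%
Hermes 3 70B | 100 | 76 | 70 | 20 | 0 | 53.1%
Inception Mercury 2 | 77 | 65 | 62 | 55 | 0 | 51.8%
DeepSeek V3 (2024-12-26) | 100 | 75 | 74 | 0 | 0 | 49.8%
GPT-5.4 (Reasoning) | 67 | 66 | 47 | 41 | 29 | 49.7%
GPT-4o, Aug. 6th (temp=1) | 89 | 74 | 69 | 4 | 1 | 47.3%
Claude Sonnet 4.5 | 100 | 66 | 66 | 2 | 0 | 46.8%
GPT-4o, Aug. 6th (temp=0) | 96 | 68 | 59 | 9 | 0 | 46.6%
Claude Opus 4.5 | 63 | 55 | 43 | 36 | 34 | 46.2%
Claude Opus 4.6 (Reasoning) | 58 | 53 | 50 | 45 | 22 | 45.9%
Qwen 3.5 Plus (2026-02-15) | 72 | 43 | 41 | 40 | 29 | 44.9%
Claude Opus 4.6 | 71 | 54 | 52 | 22 | 13 | 42.2%
Mistral Large | 70 | 68 | 52 | 20 | 0 | 41.9%
Gemini 3 Flash (Preview, Reasoning) | 64 | 57 | 41 | 37 | 7 | 41.3%
GPT-4o, May 13th (temp=1) | 78 | 64 | 47 | 16 | 0 | 41.0%
Qwen 2.5 72B | 100 | 52 | 37 | 9 | 6 | 40.9%
Writer: Palmyra X5 | 100 | 61 | 22 | 17 | 0 | 40.1%
DeepSeek V3 (2025-03-24) | 60 | 54 | 33 | 27 | 23 | 39.5%
Grok 4 Fast | 58 | 56 | 45 | 20 | 12 | 38.2%
Mistral Small 3.2 24B | 99 | 43 | 40 | 9 | 0 | 38.2%
Claude Opus 4 | 53 | 48 | 41 | 38 | 0 | 35.9%
MoonshotAI: Kimi K2.5 | 54 | 49 | 48 | 24 | 2 | 35.3%
Qwen3 235B A22B Instruct 2507 | 100 | 38 | 27 | 12 | 0 | 35.2%
Z.AI GLM 4.6 | 91 | 34 | 18 | 15 | 12 | 34.2%
Gemma 3 12B | 59 | 40 | 39 | 23 | 6 | 33.2%
GPT-4.1 Mini | 68 | 51 | 40 | 0 | 0 | 31.9%
GPT-5.1 | 53 | 33 | 27 | 21 | 8 | 28.5%
Nemotron 3 Super | 57 | 56 | 15 | 11 | 0 | 27.9%
Qwen 3.5 397B A17B | 50 | 33 | 19 | 14 | 12 | 25.7%
Claude Sonnet 4.6 (Reasoning) | 41 | 28 | 27 | 20 | 12 | 25.4%
ByteDance Seed 1.6 Flash | 47 | 46 | 27 | 6 | 0 | 25.3%
LFM2 24B | 52 | 39 | 29 | 0 | 0 | 24.2%
Cohere Command R+ (Aug. 2024) | 84 | 23 | 7 | 0 | 0 | 22.7%
Claude 3.5 Sonnet | 34 | 26 | 20 | 20 | 5 | 20.9%
DeepSeek V3.2 | 55 | 37 | 12 | 0 | 0 | 20.8%
Qwen 3.5 35B | 42 | 34 | 27 | 0 | 0 | 20.7%
Qwen 3.5 9B | 49 | 45 | 0 | 0 | 0 | 18.8%
ByteDance Seed 1.6 | 73 | 20 | 0 | 0 | 0 | 18.7%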
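Gemma 3 27B | 30 | 27 | 17 | 15 | 0 | 17.8%
Z.AI GLM 4.7 | 39 | 24 | 15 | 9 | 0 | 17.5%
Grok 4 | 43 | 32 | 10 | 1 | 0 | 17.1%
DeepSeek-V2 Chat | 64 | 14 | 8 | 0 | 0 | 17.1%
MiniMax M2.7 | 43 | 42 | 0 | 0 | 0 | 17.0%
GPT-4.1 | 69 | 14 | 0 | 0 | 0 | 16.6%
ByteDance Seed 2.0 Lite | 46 | 20 | 15 | 0 | 0 | 16.2%
Stealth: Healer Alpha | 51 | 16 | 14 | 0 | 0 | 16.1%
Mistral NeMO | 48 | 32 | 0 | 0 | 0 | 16.0%
Qwen 3.5 27B | 56 | 12 | 4 | 0 | 0 | 14.5%
Gemini 3 Pro (Preview) | 38 | 20 | 12 | 0 | 0 | 14.1%
Stealth: Hunter Alpha | 32 | 24 | 12 | 0 | 0 | 13.7%
Arcee AI: Trinity Mini | 43 | 12 | 0 | 0 | 0 | 11.1%
Stealth: Aurora Alpha | 22 | 20 | 13 | 0 | 0 | 10.9%
o4 Mini | 53 | 0 | 0 | 0 | 0 | 10.5%
Gemini 2.5 Flash (Reasoning) | 39 | 12 | 0 | 0 | 0 | 10.4%
Claude Sonnet 4.6 | 24 | 11 | 7 | 2 | 0 | 8.7%
GPT-4o Mini (temp=1) | 20 | 13 | 10 | 0 | 0 | 8.7%
WizardLM 2 8x22b | 34 | 4 | 0 | 0 | 0 | 7.7%
MiniMax M2.5 | 28 | 6 | 3 | 0 | 0 | 7.4%
Qwen 3.5 Flash | 27 | 9 | 0 | 0 | 0 | 7.2%
ByteDance Seed 2.0 Mini | 17 | 14 | 0 | 0 | 0 | 6.3%
Gemma 3 4B | 17 | 14 | 0 | 0 | 0 | 6.2%
o4 Mini High | 20 | 7 | 4 | 0 | 0 | 6.2%
Gemini 2.5 Flash | 23 | 7 | 0 | 0 | 0 | 6.0%
GPT-5 Mini | 28 | 0 | 0 | 0 | 0 | 5.6%
DeepSeek V3.1 | 28 | 0 | 0 | 0 | 0 | 5.5%
Inception Mercury | 27 | 0 | 0 | 0 | 0 | 5.4%
Gemini 3.1 Flash Lite (Preview) | 26 | 0 | 0 | 0 | 0 | 5.1%
Z.AI GLM 4.7 Flash | 23 | 2 | 0 | 0 | 0 | 5.1%
GPT-4o, May 13th (temp=0) | 25 | 0 | 0 | 0 | 0 | 5.1%
Aion 2.0 | 15 | 5 | 0 | 0 | 0 | 4.2%
Qwen 3.5 122B | 15 | 2 | 0 | 0 | 0 | 3.5%
Gemini 2.5 Flash Lite | 17 | 0 | 0 | 0 | 0 | 3.5%
GPT-4.1 Nano | 17 | 0 | 0 | 0 | 0 | 3.4%
Nemotron 3 Nano | 16 | 0 | 0 | 0 | 0 | 3.3%
Gemini 2.5 Pro | 4 | 0 | 0 | 0 | 0 | 0.8%
Gemini 2.5 Flash Lite (Reasoning) | 4 | 0 | 0 | 0 | 0 | 0.7%
GPT-5 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%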
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 99 | 99.8%
Mistral Large 3 | 100 | 100 | 100 | 100 | 98 | 99.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 98 | 99.6%
Ministral 3 14B | 100 | 100 | 100 | 100 | 97 | 99.5%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 95 | 98.9%
Ministral 3 3B | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-4.1 Mini | 100 | 100 | 100 | 98 | 93 | 98.2%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 91 | 98.1%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 96 | 94 | 97.9%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 96 | 93 | 97.9%
GPT-5.2 | 100 | 100 | 100 | 100 | 88 | 97.6%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 88 | 97.6%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 98 | 89 | 97.5%
Stealth: Hunter Alpha | 100 | 100 | 100 | 93 | 92 | 97.1%
Qwen 3 32B | 100 | 100 | 96 | 95 | 93 | 96.6%
Claude 3.5 Sonnet | 100 | 100 | 100 | 99 | 83 | 96.4%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 82 | 96.3%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 80 | 96.0%
Ministral 3B | 100 | 100 | 100 | 98 | 82 | 96.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 78 | 95.6%
Mistral Large | 100 | 100 | 100 | 97 | 79 | 95.4%
Hermes 3 70B | 100 | 100 | 97 | 90 | 89 | 95.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 76 | 95.1%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 90 | 83 | 94.6%
MiniMax M2.5 | 100 | 100 | 100 | 93 | 79 | 94.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 99 | 87 | 85 | 94.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 68 | 93.6%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 84 | 81 | 93.0%
Z.AI GLM 5 | 100 | 100 | 100 | 92 | 70 | 92.4%
GPT-5 Nano | 100 | 100 | 100 | 84 | 78 | 92.3%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 62 | 92.3%
Mistral Small Creative | 100 | 100 | 100 | 94 | 67 | 92.1%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 55 | 91.0%
MiniMax M2.7 | 100 | 100 | 100 | 88 | 62 | 90.1%
Grok 4 | 100 | 100 | 98 | 82 | 67 | 89.4%
Z.AI GLM 4.6 | 100 | 100 | 100 | 90 | 57 | 89.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 91 | 83 | 71 | 89.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 87 | 58 | 89.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 88 | 78 | 77 | 88.6%
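Ministral 3 8B | 100 | 100 | 95 | 90 | 56 | 88.2%
Writer: Palmyra X5 | 100 | 100 | 92 | 85 | 60 | 87.5%
DeepSeek-V2 Chat | 100 | 100 | 100 | 68 | 67 | 86.9%
Gemini 2.5 Flash | 100 | 100 | 83 | 74 | 73 | 86.2%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 88 | 77 | 63 | 85.7%
Stealth: Aurora Alpha | 100 | 100 | 100 | 75 | 53 | 85.6%
Gemma 3 12B | 100 | 100 | 83 | 80 | 66 | 85.6%
GPT-5.4 Nano (Reasoning, Low) | 100 | 88 | 87 | 79 | 74 | 85.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 75 | 52 | 85.4%
GPT-4.1 | 100 | 100 | 100 | 92 | 33 | 85.1%
GPT-5.1 | 100 | 93 | 87 | 84 | 61 | 84.9%
GPT-4o, May 13th (temp=0) | 100 | 100 | 76 | 73 | 72 | 84.2%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 89 | 81 | 45 | 82.8%
Ministral 8B | 100 | 100 | 100 | 59 | 50 | 81.8%
Rocinante 12B | 100 | 100 | 100 | 79 | 25 | 80.8%
Qwen 2.5 72B | 100 | 100 | 94 | 60 | 49 | 80.6%
Nemotron 3 Super | 100 | 89 | 82 | 77 | 47 | 79.0%
Gemini 3 Flash (Preview) | 96 | 90 | 90 | 71 | 49 | 78.9%
Gemma 3 27B | 87 | 85 | 78 | 75 | 69 | 78.8%
Z.AI GLM 4.7 | 88 | 87 | 86 | 82 | 50 | 78.3%
Nemotron 3 Nano | 100 | 100 | 70 | 67 | 54 | 78.2%
Gemini 3 Pro (Preview) | 98 | 95 | 90 | 55 | 48 | 77.1%
Gemini 2.5 Flash (Reasoning) | 90 | 88 | 84 | 77 | 47 | 77.0%
Aion 2.0 | 100 | 84 | 78 | 72 | 51 | 77.0%
GPT-5.4 Nano | 89 | 86 | 80 | 65 | 62 | 76.5%
GPT-4o Mini (temp=1) | 100 | 100 | 81 | 79 | 22 | 76.4%
Stealth: Healer Alpha | 95 | 93 | 79 | 60 | 50 | 75.3%
Gemma 3 4B | 100 | 100 | 90 | 46 | 39 | 75.1%
GPT-5.4 Nano (Reasoning) | 91 | 85 | 74 | 65 | 59 | 74.6%
Gemini 3 Flash (Preview, Reasoning) | 84 | 82 | 78 | 68 | 60 | 74.3%
o4 Mini High | 100 | 92 | 88 | 60 | 25 | 72.8%
Inception Mercury 2 | 100 | 100 | 99 | 64 | 1 | 72.8%
GPT-4.1 Nano | 100 | 85 | 78 | 54 | 44 | 72.2%
ByteDance Seed 1.6 Flash | 84 | 83 | 66 | 66 | 54 | 70.7%
o4 Mini | 100 | 94 | 76 | 74 | 8 | 70.1%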
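DeepSeek V3.2 | 75 | 74 | 71 | 67 | 61 | 69.6%
DeepSeek V3.1 | 84 | 70 | 67 | 63 | 39 | 64.6%
Claude Sonnet 4.6 | 100 | 80 | 79 | 30 | 30 | 63.9%
WizardLM 2 8x22b | 92 | 73 | 69 | 44 | 33 | 62.2%
Qwen 3.5 397B A17B | 92 | 89 | 83 | 32 | 0 | 59.1%
Qwen 3.5 Flash | 82 | 72 | 62 | 45 | 29 | 58.1%
Qwen 3.5 9B | 94 | 84 | 40 | 39 | 32 | 57.8%
Qwen 3.5 27B | 75 | 59 | 58 | 49 | 40 | 56.5%
Mistral Small 3.2 24B | 78 | 78 | 70 | 41 | 14 | 56.0%
Gemini 3.1 Flash Lite (Preview) | 84 | 77 | 45 | 44 | 30 | 55.9%
Gemini 2.5 Pro | 65 | 62 | 50 | 46 | 36 | 52.1%
ByteDance Seed 2.0 Lite | 70 | 70 | 58 | 58 | 4 | 52.0%
ByteDance Seed 2.0 Mini | 64 | 53 | 52 | 51 | 36 | 51.0%
DeepSeek V3 (2024-12-26) | 100 | 48 | 39 | 32 | 23 | 48.3%
GPT-5 | 77 | 57 | 57 | 44 | 4 | 47.9%
Z.AI GLM 4.7 Flash | 91 | 59 | 51 | 38 | 0 | 47.7%
GPT-5 Mini | 76 | 54 | 42 | 32 | 27 | 46.3%
Qwen 3.5 122B | 79 | 59 | 43 | 23 | 22 | 45.0%
Claude Sonnet 4.6 (Reasoning) | 91 | 56 | 44 | 24 | 6 | 44.1%
GPT-4o Mini (temp=0) | 53 | 52 | 40 | 34 | 33 | 42.3%
ByteDance Seed 1.6 | 78 | 71 | 29 | 14 | 13 | 40.8%
Qwen 3.5 35B | 66 | 44 | 44 | 40 | 0 | 38.5%
Mistral NeMO | 59 | 45 | 43 | 11 | 9 | 33.2%
Inception Mercury | 23 | 0 | 0 | 0 | 0 | 4.5%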