"Not X but Y" pattern overuse

Test: Bad Writing Habits

Avg. Score: 85.1%
Scenarios: 18

Overall Performance

Rank ▲ | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Claude 3 Haiku | 100.0% | $0.0025 | 14.9s | 100%
2 | Qwen 2.5 72B | 99.6% | $0.0010 | 36.7s | 95%
3 | GPT-4o Mini (temp=0) | 99.6% | $0.0012 | 34.8s | 93%
4 | o4 Mini High | 100.0% | $0.025 | 47.2s | 100%
5 | Mistral NeMO | 98.0% | $0.0005 | 10.1s | 77%
6 | Llama 3.1 70B | 98.1% | $0.0015 | 29.4s | 81%
7 | Grok 4.20 (Beta) | 98.5% | $0.018 | 15.8s | 81%
8 | GPT-4o, Aug. 6th (temp=0) | 98.6% | $0.023 | 22.7s | 84%
9 | Hermes 3 405B | 98.6% | $0.0032 | 53.2s | 82%
10 | Gemini 3 Flash (Preview) | 96.5% | $0.0078 | 19.6s | 75%
11 | Gemini 3 Flash (Preview, Reasoning) | 97.3% | $0.012 | 30.1s | 79%
12 | GPT-4o Mini (temp=1) | 97.0% | $0.0012 | 34.8s | 71%
13 | Stealth: Aurora Alpha | 95.7% | $0.0000 | 9.8s | 63%
14 | Llama 3.1 8B | 98.2% | $0.0003 | 1.3m | 77%
15 | ByteDance Seed 2.0 Lite | 99.7% | $0.012 | 2.2m | 96%
16 | GPT-4o, May 13th (temp=0) | 97.6% | $0.035 | 14.1s | 74%
17 | Hermes 3 70B | 96.8% | $0.0010 | 1.2m | 76%
18 | o4 Mini | 96.1% | $0.015 | 25.7s | 68%
19 | Mistral Large | 95.2% | $0.014 | 30.9s | 63%
20 | GPT-4o, May 13th (temp=1) | 96.0% | $0.033 | 14.4s | 64%
21 | Ministral 8B | 90.7% | $0.0004 | 10.4s | 49%
22 | Mistral Large 2 | 93.5% | $0.013 | 29.4s | 57%
23 | Grok 4.1 Fast | 92.9% | $0.0018 | 37.8s | 53%
24 | Inception Mercury 2 | 90.3% | $0.0032 | 7.0s | 47%
25 | Ministral 3 14B | 89.6% | $0.0007 | 11.7s | 47%
26 | Rocinante 12B | 92.6% | $0.0014 | 38.4s | 51%
27 | GPT-5.4 Mini (Reasoning) | 93.2% | $0.022 | 28.1s | 56%
28 | Inception Mercury | 91.6% | $0.011 | 17.6s | 48%
29 | Llama 3.1 Nemotron 70B | 90.7% | $0.0038 | 31.7s | 48%
30 | Qwen 3.5 Flash | 91.7% | $0.0025 | 47.5s | 50%
31 | Gemma 3 12B | 91.4% | $0.0004 | 41.3s | 47%
32 | Mistral Large 3 | 90.1% | $0.0033 | 30.3s | 47%
33 | Qwen 3.5 27B | 95.6% | $0.020 | 1.6m | 69%
34 | DeepSeek V3 (2024-12-26) | 91.9% | $0.0021 | 54.6s | 51%
35 | Z.AI GLM 4.7 | 93.7% | $0.010 | 1.4m | 61%
36 | Qwen 3.5 9B | 92.5% | $0.0011 | 1.4m | 57%
37 | Gemini 3.1 Flash Lite (Preview) | 87.0% | $0.0030 | 8.4s | 41%
38 | Cohere Command R+ (Aug. 2024) | 92.4% | $0.020 | 52.5s | 56%
39 | Ministral 3 8B | 88.1% | $0.0008 | 19.6s | 41%
40 | Grok 4.20 (Beta, Reasoning) | 93.9% | $0.039 | 34.0s | 55%
41 | Qwen 3.5 122B | 94.2% | $0.025 | 1.1m | 56%
42 | Z.AI GLM 5 Turbo | 88.3% | $0.0081 | 33.2s | 44%
43 | GPT-5.4 Mini (Reasoning, Low) | 87.6% | $0.015 | 16.8s | 42%
44 | Z.AI GLM 4.5 | 89.1% | $0.0051 | 42.1s | 43%
45 | ByteDance Seed 1.6 | 97.2% | $0.013 | 2.5m | 70%
46 | GPT-5.4 Mini | 86.5% | $0.015 | 16.8s | 43%
47 | Mistral Small Creative | 85.1% | $0.0007 | 9.1s | 35%
48 | Gemini 3.1 Pro (Preview) | 99.9% | $0.107 | 1.8m | 98%
49 | Ministral 3B | 84.2% | $0.0001 | 8.1s | 35%
50 | Z.AI GLM 4.7 Flash | 88.9% | $0.0017 | 1.2m | 47%
51 | DeepSeek-V2 Chat | 88.4% | $0.0021 | 53.3s | 42%
52 | GPT-4.1 Nano | 84.7% | $0.0007 | 13.3s | 32%
53 | Qwen 3.5 Plus (2026-02-15) | 82.1% | $0.0060 | 31.5s | 42%
54 | Ministral 3 3B | 82.7% | $0.0005 | 11.1s | 32%
55 | Arcee AI: Trinity Large (Preview) | 85.9% | $0.0000 | 43.6s | 37%
56 | GPT-5.4 Nano (Reasoning) | 83.0% | $0.0061 | 24.5s | 37%
57 | GPT-4o, Aug. 6th (temp=1) | 87.1% | $0.018 | 24.4s | 37%
58 | Gemini 3 Pro (Preview) | 92.1% | $0.055 | 54.4s | 57%
59 | Mistral Medium 3.1 | 85.5% | $0.0048 | 36.5s | 36%
60 | Qwen 3.5 35B | 90.0% | $0.018 | 1.0m | 44%
61 | Z.AI GLM 5 | 89.0% | $0.0084 | 1.2m | 44%
62 | Claude 3.5 Sonnet | 89.5% | $0.048 | 35.5s | 50%
63 | MiniMax M2.7 | 86.3% | $0.0040 | 1.1m | 41%
64 | GPT-4.1 Mini | 82.0% | $0.0027 | 19.0s | 30%
65 | MiniMax M2.5 | 87.5% | $0.0034 | 1.3m | 40%
66 | Mistral Small 4 (Reasoning) | 81.2% | $0.0022 | 30.2s | 31%
67 | GPT-5.4 Nano (Reasoning, Low) | 80.5% | $0.0055 | 20.6s | 30%
68 | ByteDance Seed 2.0 Mini | 98.9% | $0.0045 | 4.9m | 89%
69 | Qwen 3 32B | 83.0% | $0.0015 | 54.6s | 33%
70 | GPT-5.1 | 92.4% | $0.054 | 1.8m | 62%
71 | Mistral Small 3.2 24B | 100.0% | $0.0069 | 5.7m | 100%
72 | Arcee AI: Trinity Mini | 74.9% | $0.0003 | 9.2s | 26%
73 | Gemma 3 4B | 78.3% | $0.0002 | 20.0s | 25%
74 | Z.AI GLM 4.6 | 82.9% | $0.0065 | 51.5s | 32%
75 | Gemma 3 27B | 81.9% | $0.0006 | 52.6s | 28%
76 | DeepSeek V3 (2025-03-24) | 81.5% | $0.0014 | 39.4s | 25%
77 | WizardLM 2 8x22b | 85.6% | $0.0026 | 1.8m | 41%
78 | Gemini 2.5 Pro | 83.6% | $0.036 | 36.2s | 37%
79 | Qwen 3.5 397B A17B | 92.9% | $0.014 | 3.0m | 58%
80 | Claude Sonnet 4.5 | 84.3% | $0.035 | 38.1s | 34%
81 | ByteDance Seed 1.6 Flash | 75.9% | $0.0013 | 27.3s | 25%
82 | GPT-5.4 Nano | 77.8% | $0.0057 | 26.3s | 23%
83 | Claude Haiku 4.5 | 77.7% | $0.011 | 21.6s | 24%
84 | Grok 4 Fast | 73.3% | $0.0017 | 24.1s | 23%
85 | Claude 3.5 Haiku | 76.7% | $0.0035 | 10.8s | 16%
86 | Claude Sonnet 4 | 83.6% | $0.032 | 43.7s | 31%
87 | Qwen3 235B A22B Instruct 2507 | 77.7% | $0.0011 | 59.2s | 26%
88 | Gemini 2.5 Flash Lite | 72.2% | $0.0009 | 9.5s | 18%
89 | DeepSeek V3.1 | 82.8% | $0.0020 | 1.8m | 35%
90 | Claude Opus 4.5 | 88.2% | $0.070 | 53.4s | 44%
91 | GPT-4.1 | 79.3% | $0.018 | 44.7s | 26%
92 | GPT-5.4 (Reasoning, Low) | 87.8% | $0.055 | 1.4m | 45%
93 | Gemini 2.5 Flash | 70.9% | $0.0052 | 10.6s | 19%
94 | Mistral Small 4 | 71.2% | $0.0014 | 18.2s | 19%
95 | Claude Sonnet 4.6 | 80.4% | $0.031 | 39.3s | 26%
96 | Nemotron 3 Nano | 76.2% | $0.0010 | 1.1m | 23%
97 | GPT-5.4 | 85.3% | $0.049 | 1.4m | 41%
98 | Writer: Palmyra X5 | 70.8% | $0.011 | 22.0s | 16%
99 | LFM2 24B | 67.7% | $0.0002 | 28.4s | 15%
100 | Claude 3.7 Sonnet | 78.2% | $0.042 | 46.7s | 25%
101 | GPT-5 Mini | 69.0% | $0.0100 | 57.4s | 19%
102 | Grok 4 | 82.3% | $0.048 | 1.7m | 33%
103 | Claude Opus 4.6 | 81.3% | $0.078 | 1.2m | 37%
104 | Claude Opus 4.6 (Reasoning) | 84.8% | $0.088 | 1.4m | 38%
105 | Gemini 2.5 Flash (Reasoning) | 60.8% | $0.011 | 21.5s | 11%
106 | Stealth: Healer Alpha | 58.3% | $0.0000 | 23.7s | 9%
107 | Aion 2.0 | 66.0% | $0.0064 | 1.3m | 18%
108 | DeepSeek V3.2 | 71.2% | $0.0014 | 1.9m | 21%
109 | GPT-5 | 89.3% | $0.065 | 2.8m | 47%
110 | GPT-5.4 (Reasoning) | 89.6% | $0.089 | 2.6m | 54%
111 | GPT-5.2 | 76.9% | $0.056 | 1.5m | 26%
112 | Claude Sonnet 4.6 (Reasoning) | 74.6% | $0.060 | 1.2m | 21%
113 | Stealth: Hunter Alpha | 56.0% | $0.0000 | 55.0s | 8%
114 | Gemini 2.5 Flash Lite (Reasoning) | 47.5% | $0.0028 | 30.8s | 5%
115 | Nemotron 3 Super | 54.8% | $0.0000 | 1.4m | 7%
116 | MoonshotAI: Kimi K2.5 | 59.6% | $0.019 | 3.2m | 10%
117 | Claude Opus 4 | 86.4% | $0.209 | 1.4m | 37%
118 | GPT-5 Nano | 14.2% | $0.0042 | 1.4m | 0%
Avg. Score: 85.13%

Individual Scenarios

Detailed Writing Rules

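Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 96 | 99.3%
Hermes 3 405B | 100 | 100 | 100 | 100 | 90 | 98.1%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 90 | 97.9%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 85 | 97.1%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 81 | 96.2%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 78 | 95.6%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 78 | 95.5%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 77 | 95.3%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 73 | 94.7%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 62 | 92.4%
Mistral Large | 100 | 100 | 100 | 91 | 68 | 91.9%
Mistral Large 2 | 100 | 100 | 100 | 100 | 58 | 91.6%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 51 | 90.3%
Mistral Large 3 | 100 | 100 | 100 | 90 | 50 | 88.1%
Rocinante 12B | 100 | 100 | 100 | 100 | 38 | 87.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 37 | 87.5%
Nemotron 3 Nano | 100 | 100 | 100 | 69 | 67 | 87.2%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 33 | 86.6%
Gemma 3 12B | 100 | 100 | 100 | 100 | 31 | 86.2%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 30 | 86.0%
Grok 4 | 100 | 100 | 100 | 100 | 24 | 84.9%
GPT-5.1 | 100 | 100 | 100 | 78 | 41 | 83.9%
Gemma 3 27B | 100 | 100 | 100 | 100 | 17 | 83.3%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 16 | 83.2%
Z.AI GLM 4.7 Flash | 100 | 100 | 98 | 74 | 43 | 83.1%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 9 | 81.8%
Gemini 3 Pro (Preview) | 100 | 100 | 97 | 73 | 33 | 80.6%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5 | 100 | 100 | 100 | 83 | 0 | 76.5%
DeepSeek V3.1 | 100 | 100 | 100 | 69 | 0 | 73.8%
Claude Opus 4.6 | 100 | 100 | 87 | 79 | 0 | 73.2%
Qwen 3 32B | 100 | 100 | 79 | 65 | 19 | 72.5%
Claude Sonnet 4.5 | 100 | 100 | 100 | 55 | 6 | 72.1%
MiniMax M2.7 | 100 | 100 | 100 | 59 | 0 | 71.9%
Claude 3.5 Sonnet | 100 | 100 | 62 | 48 | 47 | 71.4%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 47 | 0 | 69.4%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 45 | 0 | 69.0%
Claude Opus 4 | 100 | 100 | 100 | 42 | 0 | 68.5%
Aion 2.0 | 100 | 100 | 100 | 41 | 0 | 68.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 37 | 0 | 67.3%
Claude Sonnet 4.6 | 100 | 100 | 100 | 34 | 0 | 66.9%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 83 | 37 | 0 | 63.9%
Ministral 3 3B | 100 | 100 | 64 | 52 | 0 | 63.2%
GPT-4.1 Mini | 100 | 100 | 100 | 11 | 0 | 62.3%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 4 | 0 | 60.9%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 91 | 13 | 0 | 60.9%
GPT-4.1 | 100 | 100 | 100 | 0 | 0 | 60.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 0 | 0 | 60.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 0 | 0 | 60.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3 14B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash Lite | 100 | 100 | 60 | 40 | 0 | 59.9%
GPT-5.4 | 100 | 86 | 66 | 45 | 0 | 59.3%
Mistral Small 4 (Reasoning) | 100 | 100 | 94 | 0 | 0 | 58.7%
Z.AI GLM 4.6 | 100 | 97 | 95 | 0 | 0 | 58.2%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 86 | 0 | 0 | 57.3%
Claude 3.7 Sonnet | 100 | 100 | 80 | 0 | 0 | 56.0%
LFM2 24B | 100 | 94 | 86 | 0 | 0 | 56.0%
Gemini 2.5 Flash (Reasoning) | 100 | 65 | 51 | 43 | 16 | 55.0%
GPT-5.4 (Reasoning) | 100 | 79 | 60 | 36 | 0 | 54.9%
GPT-5.4 (Reasoning, Low) | 100 | 94 | 77 | 0 | 0 | 54.1%
Claude Opus 4.5 | 100 | 100 | 66 | 0 | 0 | 53.3%
Z.AI GLM 5 | 100 | 100 | 63 | 0 | 0 | 52.6%
Arcee AI: Trinity Mini | 100 | 100 | 63 | 0 | 0 | 52.6%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 48 | 3 | 0 | 50.3%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 24 | 18 | 0 | 48.4%
Claude Sonnet 4 | 100 | 100 | 32 | 8 | 0 | 48.0%
GPT-5.4 Mini | 100 | 79 | 29 | 15 | 7 | 46.1%
Ministral 8B | 100 | 97 | 25 | 0 | 0 | 44.5%
WizardLM 2 8x22b | 100 | 100 | 19 | 0 | 0 | 43.9%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 19 | 0 | 0 | 43.7%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 0 | 0 | 0 | 40.0%
Ministral 3 8B | 100 | 100 | 0 | 0 | 0 | 40.0%
Mistral Small Creative | 100 | 92 | 0 | 0 | 0 | 38.4%
GPT-5.2 | 100 | 69 | 14 | 0 | 0 | 36.6%
Mistral Small 4 | 100 | 67 | 0 | 0 | 0 | 33.5%
Gemini 2.5 Flash | 100 | 35 | 28 | 0 | 0 | 32.5%
Claude Haiku 4.5 | 100 | 62 | 0 | 0 | 0 | 32.5%
Nemotron 3 Super | 100 | 35 | 18 | 0 | 0 | 30.7%
Mistral Medium 3.1 | 100 | 46 | 0 | 0 | 0 | 29.2%
ByteDance Seed 1.6 Flash | 58 | 55 | 13 | 0 | 0 | 25.3%
GPT-5 Mini | 92 | 25 | 0 | 0 | 0 | 23.5%
Gemini 2.5 Pro | 74 | 27 | 0 | 0 | 0 | 20.2%
Claude 3.5 Haiku | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-5.4 Nano | 100 | 0 | 0 | 0 | 0 | 20.0%
Gemma 3 4B | 100 | 0 | 0 | 0 | 0 | 20.0%
MoonshotAI: Kimi K2.5 | 98 | 0 | 0 | 0 | 0 | 19.7%
Stealth: Healer Alpha | 77 | 0 | 0 | 0 | 0 | 15.4%
Stealth: Hunter Alpha | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemini 2.5 Flash Lite (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%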
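Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 99 | 99.7%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 94 | 98.7%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 92 | 98.5%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 88 | 97.7%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 88 | 97.6%
Mistral Large 2 | 100 | 100 | 100 | 100 | 87 | 97.4%
Mistral Large | 100 | 100 | 100 | 100 | 84 | 96.8%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 74 | 94.9%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 94 | 80 | 94.8%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 69 | 93.8%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 52 | 90.4%
Grok 4 Fast | 100 | 100 | 100 | 100 | 45 | 89.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 37 | 87.4%
Qwen 3 32B | 100 | 100 | 100 | 100 | 35 | 87.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 27 | 85.4%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 13 | 82.7%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 57 | 56 | 82.5%
Nemotron 3 Nano | 100 | 100 | 100 | 70 | 30 | 80.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 0 | 80.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 97 | 0 | 79.4%
Hermes 3 70B | 100 | 100 | 100 | 60 | 30 | 77.8%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 86 | 69 | 17 | 74.5%
Mistral Small 4 (Reasoning) | 100 | 100 | 83 | 48 | 38 | 73.7%
Ministral 3 14B | 100 | 100 | 100 | 48 | 0 | 69.6%
Gemma 3 12B | 100 | 100 | 100 | 42 | 0 | 68.4%
GPT-4.1 Nano | 100 | 100 | 100 | 23 | 15 | 67.6%
WizardLM 2 8x22b | 100 | 99 | 52 | 46 | 36 | 66.7%
Z.AI GLM 5 | 100 | 100 | 100 | 34 | 0 | 66.7%
MiniMax M2.7 | 100 | 77 | 72 | 40 | 29 | 63.6%
GPT-5.1 | 100 | 100 | 100 | 11 | 0 | 62.2%
Qwen 3.5 Plus (2026-02-15) | 100 | 94 | 80 | 32 | 0 | 61.2%
Claude Sonnet 4.6 | 100 | 100 | 100 | 3 | 2 | 60.8%
Z.AI GLM 4.5 | 100 | 100 | 100 | 2 | 0 | 60.5%
GPT-4.1 | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 0 | 0 | 60.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 0 | 0 | 60.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-5.4 Nano | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3 3B | 100 | 100 | 100 | 0 | 0 | 60.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 0 | 0 | 59.9%
DeepSeek V3.2 | 100 | 87 | 50 | 49 | 0 | 57.2%
DeepSeek V3.1 | 100 | 100 | 81 | 0 | 0 | 56.2%
Z.AI GLM 4.6 | 100 | 100 | 76 | 0 | 0 | 55.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 49 | 25 | 0 | 54.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 45 | 29 | 0 | 54.7%
GPT-5.4 Mini | 100 | 100 | 72 | 1 | 0 | 54.6%
Claude Opus 4 | 100 | 100 | 65 | 7 | 0 | 54.4%
Inception Mercury | 100 | 100 | 60 | 0 | 0 | 51.9%
Inception Mercury 2 | 100 | 100 | 51 | 0 | 0 | 50.1%
GPT-5.4 Nano (Reasoning, Low) | 100 | 52 | 39 | 38 | 18 | 49.4%
MoonshotAI: Kimi K2.5 | 100 | 87 | 59 | 0 | 0 | 49.3%
LFM2 24B | 100 | 66 | 65 | 0 | 0 | 46.3%
DeepSeek-V2 Chat | 100 | 100 | 18 | 11 | 0 | 45.8%
Aion 2.0 | 100 | 71 | 53 | 2 | 0 | 45.2%
Grok 4 | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemma 3 4B | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemma 3 27B | 100 | 99 | 0 | 0 | 0 | 39.8%
Qwen3 235B A22B Instruct 2507 | 100 | 92 | 0 | 0 | 0 | 38.4%
Stealth: Healer Alpha | 66 | 63 | 61 | 0 | 0 | 38.0%
GPT-5.4 | 100 | 51 | 39 | 0 | 0 | 38.0%
GPT-4.1 Mini | 100 | 86 | 1 | 0 | 0 | 37.3%
ByteDance Seed 1.6 Flash | 65 | 56 | 50 | 0 | 0 | 34.2%
Stealth: Hunter Alpha | 100 | 61 | 5 | 0 | 0 | 33.3%
GPT-5 Mini | 100 | 49 | 7 | 0 | 0 | 31.1%
Nemotron 3 Super | 68 | 43 | 22 | 0 | 0 | 26.7%
Gemini 2.5 Flash (Reasoning) | 100 | 24 | 0 | 0 | 0 | 24.8%
Arcee AI: Trinity Mini | 47 | 38 | 14 | 0 | 0 | 19.7%
GPT-5.2 | 58 | 27 | 0 | 0 | 0 | 17.0%
Gemini 2.5 Flash Lite (Reasoning) | 82 | 0 | 0 | 0 | 0 | 16.5%
Gemini 2.5 Flash Lite | 1 | 0 | 0 | 0 | 0 | 0.2%
GPT-5 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%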
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 99.9%
LFM2 24B | 100 | 100 | 100 | 100 | 96 | 99.2%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 82 | 96.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 78 | 95.6%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 77 | 95.5%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 95 | 76 | 94.2%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 64 | 92.8%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 62 | 92.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 61 | 92.1%
Stealth: Healer Alpha | 100 | 100 | 100 | 98 | 62 | 92.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 59 | 91.8%
GPT-5 Mini | 100 | 100 | 100 | 82 | 76 | 91.7%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 55 | 91.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 54 | 90.8%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 52 | 90.5%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 50 | 90.0%
Mistral Small 4 | 100 | 100 | 100 | 94 | 55 | 89.8%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 47 | 89.4%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 45 | 88.9%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 98 | 38 | 87.1%
Aion 2.0 | 100 | 100 | 100 | 100 | 35 | 86.9%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 35 | 86.9%
Qwen 3 32B | 100 | 100 | 100 | 76 | 57 | 86.6%
GPT-4.1 Mini | 100 | 100 | 100 | 77 | 49 | 85.2%
Gemma 3 12B | 100 | 100 | 100 | 100 | 24 | 84.8%
GPT-5.4 Nano | 100 | 100 | 100 | 93 | 29 | 84.4%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 21 | 84.2%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 98 | 14 | 82.5%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 12 | 82.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 85 | 18 | 80.5%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 86 | 0 | 77.3%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 77 | 5 | 76.4%
Stealth: Hunter Alpha | 100 | 83 | 61 | 60 | 0 | 60.9%
GPT-5 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 99 | 99.9%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 99 | 99.7%
Ministral 8B | 100 | 100 | 100 | 100 | 92 | 98.5%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 92 | 98.4%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 99 | 92 | 98.2%
Mistral Large | 100 | 100 | 100 | 100 | 88 | 97.7%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 87 | 97.4%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 84 | 96.9%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 83 | 96.7%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 83 | 96.6%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 82 | 96.4%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 92 | 90 | 96.4%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 82 | 96.3%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 82 | 96.3%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 75 | 95.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 71 | 94.2%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 69 | 93.8%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 68 | 93.7%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 87 | 78 | 92.9%
Gemma 3 4B | 100 | 100 | 100 | 100 | 63 | 92.5%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 59 | 91.8%
Mistral Small 4 | 100 | 100 | 100 | 100 | 59 | 91.7%
LFM2 24B | 100 | 100 | 100 | 89 | 67 | 91.1%
Ministral 3 14B | 100 | 100 | 100 | 100 | 51 | 90.2%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 49 | 89.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 48 | 89.7%
Mistral Small 4 (Reasoning) | 100 | 100 | 94 | 78 | 72 | 88.9%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 42 | 88.5%
Gemini 3 Pro (Preview) | 100 | 100 | 89 | 75 | 72 | 87.3%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 81 | 51 | 86.2%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 30 | 86.0%
Nemotron 3 Super | 100 | 100 | 100 | 95 | 17 | 82.4%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 8 | 81.7%
GPT-5.1 | 100 | 100 | 81 | 70 | 55 | 81.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Grok 4 | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 0 | 80.0%
Grok 4.1 Fast | 100 | 100 | 100 | 95 | 0 | 79.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 88 | 0 | 77.7%
GPT-5 | 100 | 100 | 100 | 88 | 0 | 77.6%
DeepSeek V3.1 | 100 | 100 | 100 | 50 | 36 | 77.1%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 46 | 39 | 77.1%
Claude Sonnet 4 | 100 | 100 | 100 | 81 | 0 | 76.3%
Claude 3.5 Sonnet | 100 | 100 | 100 | 72 | 0 | 74.4%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 83 | 67 | 21 | 74.2%
Llama 3.1 8B | 100 | 100 | 100 | 68 | 0 | 73.6%
GPT-5 Mini | 100 | 100 | 91 | 71 | 0 | 72.3%
Ministral 3 3B | 100 | 100 | 100 | 61 | 0 | 72.1%
Claude Opus 4.6 | 100 | 100 | 100 | 59 | 0 | 71.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 35 | 11 | 69.1%
Gemini 2.5 Flash | 100 | 100 | 100 | 42 | 0 | 68.5%
GPT-5.4 Nano | 100 | 75 | 70 | 65 | 15 | 65.0%
Ministral 3B | 100 | 100 | 100 | 19 | 0 | 63.8%
Mistral Small Creative | 100 | 100 | 92 | 13 | 0 | 61.1%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 0 | 0 | 60.0%
Qwen 3 32B | 100 | 100 | 79 | 17 | 0 | 59.3%
GPT-4.1 | 100 | 100 | 37 | 0 | 0 | 47.3%
GPT-5.2 | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 0 | 0 | 0 | 40.0%
MoonshotAI: Kimi K2.5 | 100 | 58 | 17 | 0 | 0 | 34.9%
Arcee AI: Trinity Large (Preview) | 100 | 72 | 0 | 0 | 0 | 34.5%
Claude 3.5 Haiku | 100 | 0 | 0 | 0 | 0 | 20.0%
GPT-5 Nano | 72 | 0 | 0 | 0 | 0 | 14.3%
Grok 4 Fast | 57 | 0 | 0 | 0 | 0 | 11.5%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 99 | 99.7%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 96 | 99.3%
GPT-4.1 | 100 | 100 | 100 | 100 | 95 | 99.1%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 94 | 98.9%
Hermes 3 405B | 100 | 100 | 100 | 100 | 92 | 98.4%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 86 | 97.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 86 | 97.2%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 86 | 97.1%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 86 | 97.1%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 84 | 96.7%
Ministral 3 3B | 100 | 100 | 100 | 100 | 81 | 96.1%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 80 | 96.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 72 | 94.4%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 71 | 94.2%
GPT-5 Mini | 100 | 100 | 100 | 100 | 65 | 93.1%
Gemma 3 4B | 100 | 100 | 100 | 100 | 64 | 92.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 63 | 92.5%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 97 | 66 | 92.5%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 62 | 92.3%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 60 | 92.1%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 55 | 91.0%
Ministral 3B | 100 | 100 | 100 | 100 | 55 | 91.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 80 | 73 | 90.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 50 | 89.9%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 48 | 89.6%
Ministral 3 14B | 100 | 100 | 100 | 96 | 50 | 89.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 45 | 89.0%
Grok 4 | 100 | 100 | 100 | 100 | 41 | 88.1%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 32 | 86.5%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 30 | 86.1%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 22 | 84.4%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 18 | 83.6%
GPT-5.2 | 100 | 100 | 100 | 100 | 11 | 82.1%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 7 | 81.5%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
LFM2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Stealth: Healer Alpha | 100 | 100 | 85 | 67 | 35 | 77.4%
DeepSeek V3.1 | 100 | 100 | 100 | 73 | 0 | 74.6%
Claude Haiku 4.5 | 100 | 100 | 100 | 54 | 0 | 70.8%
Nemotron 3 Super | 100 | 100 | 56 | 48 | 44 | 69.6%
Mistral Large 2 | 100 | 100 | 100 | 18 | 5 | 64.7%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 80 | 36 | 0 | 63.3%
GPT-5.4 Nano | 100 | 100 | 93 | 8 | 0 | 60.2%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 65 | 13 | 0 | 55.7%
Aion 2.0 | 100 | 100 | 30 | 21 | 0 | 50.2%
GPT-5 Nano | 100 | 80 | 47 | 0 | 0 | 45.5%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-5 | 100 | 100 | 100 | 100 | 94 | 98.9%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 86 | 97.1%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 85 | 97.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 85 | 97.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 94 | 89 | 96.7%
GPT-5.4 | 100 | 100 | 100 | 98 | 85 | 96.6%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 81 | 96.2%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 80 | 95.9%
GPT-5.4 Nano | 100 | 100 | 100 | 90 | 89 | 95.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 75 | 95.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 74 | 94.8%
GPT-4.1 Mini | 100 | 100 | 100 | 95 | 77 | 94.3%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 67 | 93.4%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 61 | 92.3%
o4 Mini | 100 | 100 | 100 | 100 | 58 | 91.7%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 55 | 90.9%
Claude Opus 4.6 | 100 | 100 | 100 | 88 | 64 | 90.3%
Gemma 3 27B | 100 | 100 | 100 | 100 | 51 | 90.1%
Grok 4 Fast | 100 | 100 | 100 | 82 | 67 | 89.8%
Gemma 3 4B | 100 | 100 | 100 | 100 | 45 | 88.9%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 42 | 88.4%
Mistral Large 3 | 100 | 100 | 100 | 100 | 38 | 87.5%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 35 | 86.9%
Aion 2.0 | 100 | 100 | 100 | 100 | 34 | 86.9%
Grok 4 | 100 | 100 | 100 | 100 | 31 | 86.3%
GPT-4.1 | 100 | 100 | 100 | 100 | 29 | 85.9%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 98 | 28 | 85.2%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 23 | 84.6%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 17 | 83.3%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 16 | 83.3%
GPT-5.2 | 100 | 100 | 100 | 92 | 21 | 82.6%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 10 | 81.9%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 96 | 91 | 9 | 79.1%
Writer: Palmyra X5 | 100 | 100 | 100 | 91 | 0 | 78.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 84 | 0 | 76.8%
DeepSeek V3.2 | 100 | 100 | 100 | 78 | 0 | 75.5%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 86 | 48 | 40 | 74.6%
Gemini 2.5 Flash (Reasoning) | 100 | 99 | 94 | 67 | 0 | 72.1%
Ministral 3 3B | 100 | 100 | 100 | 47 | 12 | 71.7%
WizardLM 2 8x22b | 100 | 100 | 60 | 31 | 22 | 62.7%
Stealth: Hunter Alpha | 100 | 100 | 47 | 44 | 0 | 58.1%
Stealth: Healer Alpha | 100 | 100 | 48 | 0 | 0 | 49.5%
Gemini 2.5 Flash | 100 | 100 | 46 | 0 | 0 | 49.2%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 20 | 0 | 0 | 44.1%
GPT-5 Nano | 100 | 66 | 31 | 18 | 0 | 42.9%
LFM2 24B | 100 | 62 | 0 | 0 | 0 | 32.3%

genre

Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 86 | 97.3%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 82 | 96.5%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 81 | 96.1%
Rocinante 12B | 100 | 100 | 100 | 100 | 79 | 95.9%
Claude 3.5 Sonnet | 100 | 100 | 100 | 91 | 88 | 95.8%
Ministral 3 3B | 100 | 100 | 100 | 100 | 78 | 95.6%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 71 | 94.1%
o4 Mini | 100 | 100 | 100 | 100 | 53 | 90.7%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 50 | 89.9%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 42 | 88.4%
Mistral Small Creative | 100 | 100 | 92 | 91 | 54 | 87.4%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 27 | 85.3%
Ministral 8B | 100 | 100 | 100 | 100 | 25 | 85.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 21 | 84.2%
DeepSeek-V2 Chat | 100 | 100 | 100 | 80 | 35 | 82.8%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 14 | 82.7%
GPT-5.1 | 100 | 100 | 97 | 59 | 56 | 82.5%
Ministral 3 14B | 100 | 100 | 100 | 100 | 11 | 82.1%
Arcee AI: Trinity Mini | 100 | 100 | 94 | 91 | 20 | 81.1%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 2 | 80.4%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-5.2 | 100 | 100 | 90 | 53 | 51 | 78.7%
Mistral Large 3 | 100 | 100 | 100 | 86 | 5 | 78.2%
Mistral Large 2 | 100 | 100 | 100 | 54 | 16 | 74.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 63 | 0 | 72.6%
GPT-5 | 100 | 87 | 78 | 77 | 19 | 72.4%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 40 | 20 | 72.1%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 60 | 0 | 71.9%
Claude Sonnet 4.5 | 100 | 100 | 100 | 34 | 25 | 71.8%
Qwen 3.5 27B | 100 | 100 | 100 | 59 | 0 | 71.8%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 81 | 73 | 0 | 70.7%
MiniMax M2.7 | 100 | 100 | 100 | 53 | 0 | 70.6%
Mistral Large | 100 | 100 | 100 | 48 | 0 | 69.6%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 46 | 0 | 69.3%
Qwen 3.5 9B | 100 | 100 | 55 | 47 | 41 | 68.5%
Inception Mercury 2 | 100 | 100 | 100 | 26 | 13 | 67.7%
Mistral Medium 3.1 | 100 | 100 | 79 | 48 | 3 | 65.9%
Z.AI GLM 4.6 | 100 | 100 | 100 | 25 | 0 | 65.0%
GPT-4.1 Nano | 100 | 100 | 99 | 18 | 8 | 64.8%
Gemma 3 4B | 100 | 100 | 100 | 14 | 6 | 64.0%
Ministral 3B | 100 | 100 | 68 | 40 | 0 | 61.6%
Qwen 3.5 Flash | 100 | 100 | 100 | 7 | 0 | 61.4%
Grok 4 Fast | 100 | 93 | 82 | 28 | 0 | 60.6%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 0 | 0 | 60.0%
DeepSeek V3.1 | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash | 100 | 89 | 77 | 30 | 0 | 59.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 70 | 22 | 0 | 58.4%
Qwen 3.5 397B A17B | 100 | 100 | 59 | 33 | 0 | 58.4%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 92 | 0 | 0 | 58.3%
DeepSeek V3.2100100870057.5%
Grok 4.20 (Beta, Reasoning)1001006018055.6%
GPT-5.4 Mini10010036261555.3%
GPT-5.4100874132052.0%
Claude Sonnet 4.6 (Reasoning)100100580051.5%
Qwen3 235B A22B Instruct 2507100100560051.3%
GPT-4.1100100560051.2%
GPT-5.4 (Reasoning, Low)10087630050.0%
Claude Haiku 4.5100100500049.9%
Z.AI GLM 4.5100100450049.2%
Gemma 3 27B100992420048.5%
Gemini 2.5 Flash Lite100100400047.9%
GPT-5.4 Mini (Reasoning)10073568047.4%
Z.AI GLM 5 Turbo100100370047.4%
Claude Sonnet 4.6100594830047.4%
Claude Opus 4.610084367045.5%
Qwen 3.5 122B100100200044.0%
Claude Opus 410010080041.6%
Qwen 3.5 35B10066400041.3%
Grok 410010000040.0%
DeepSeek V3 (2025-03-24)10010000040.0%
Nemotron 3 Nano10010000040.0%
Qwen 3 32B1009400038.8%
GPT-4.1 Mini100451912035.2%
MiniMax M2.51007400034.8%
Claude Opus 4.510053200034.6%
Mistral Small 4 (Reasoning)10036320033.8%
GPT-5.4 Mini (Reasoning, Low)10053150033.6%
GPT-5 Mini10040150030.9%
Aion 2.06647210026.6%
ByteDance Seed 1.6 Flash1001100022.1%
Claude 3.7 Sonnet921600021.6%
MoonshotAI: Kimi K2.5100500020.9%
Claude Sonnet 4100000020.0%
Stealth: Hunter Alpha100000020.0%
Stealth: Healer Alpha90000018.0%
Gemini 2.5 Flash Lite (Reasoning)74000014.8%
Nemotron 3 Super441500011.7%
Mistral Small 4381500010.7%
Gemini 2.5 Flash (Reasoning)2100004.1%
GPT-5 Nano000000.0%
Writer: Palmyra X5000000.0%
LFM2 24B000000.0%
Model #1 #2 #3 #4 #5 Avg
Gemini 3.1 Pro (Preview)100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
GPT-4.1100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Gemini 3 Flash (Preview)100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
Claude 3.5 Sonnet100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Large100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Hermes 3 70B100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Claude 3 Haiku1001001001009999.8%
Mistral NeMO1001001001008697.2%
Ministral 3 14B1001001001008396.6%
ByteDance Seed 1.6 Flash1001001001007494.8%
Z.AI GLM 51001001001006793.4%
Arcee AI: Trinity Large (Preview)1001001001006593.0%
Mistral Medium 3.11001001001006392.6%
Gemini 3 Flash (Preview, Reasoning)100100100907292.4%
DeepSeek V3 (2025-03-24)1001001001006192.3%
Llama 3.1 70B1001001001005991.9%
GPT-5.4 Mini (Reasoning, Low)1001001001005190.2%
Z.AI GLM 5 Turbo1001001001004889.7%
Claude Opus 4.610010095896189.1%
GPT-5.4 (Reasoning)10010087837589.0%
Qwen 3.5 27B1001001001004488.8%
Z.AI GLM 4.7100100100826188.6%
GPT-4o Mini (temp=1)1001001001004288.5%
Ministral 8B100100100736186.8%
Inception Mercury 210010095853783.5%
Claude Sonnet 4.6100100100100080.0%
Claude 3.5 Haiku100100100100080.0%
Claude Opus 4.5100100100722579.4%
Claude Opus 4.6 (Reasoning)10010091545279.3%
GPT-5.410010010092078.3%
Gemini 2.5 Pro10010067655777.9%
Z.AI GLM 4.7 Flash10010085484575.8%
GPT-5.110010073693475.4%
Qwen 3.5 Flash10010010076075.1%
GPT-510010010076075.1%
GPT-5.4 Mini10010010073074.7%
Claude 3.7 Sonnet10010091602174.6%
Z.AI GLM 4.510010010073074.6%
o4 Mini100100100601274.4%
Grok 4.1 Fast10010059584873.0%
Grok 410010010057071.4%
Claude Sonnet 4.6 (Reasoning)10010067513470.4%
Cohere Command R+ (Aug. 2024)10010010052070.3%
Mistral Small Creative10010010049069.9%
Ministral 3B1001008064068.6%
Claude Opus 410010085421568.3%
Z.AI GLM 4.610010058483367.8%
Mistral Small 41001007152064.5%
Arcee AI: Trinity Mini938073561663.7%
Gemma 3 27B10010010013062.6%
Gemini 3 Pro (Preview)10010010012062.3%
Gemini 2.5 Flash1001005843861.9%
GPT-4.1 Mini1001001009061.7%
Gemma 3 12B1001001000060.0%
Llama 3.1 Nemotron 70B1001001000060.0%
Ministral 3 8B1001001000060.0%
Rocinante 12B1001001000060.0%
Inception Mercury1001006234059.1%
Qwen 3.5 9B100776849058.9%
MiniMax M2.7100100910058.2%
Grok 4.20 (Beta, Reasoning)10096940058.0%
Qwen3 235B A22B Instruct 250710010036252457.0%
Mistral Small 4 (Reasoning)100100830056.6%
DeepSeek V3.1100706249056.3%
MiniMax M2.5100100560051.2%
Qwen 3.5 Plus (2026-02-15)1005954241750.9%
Gemini 3.1 Flash Lite (Preview)100100480049.6%
Stealth: Aurora Alpha100100269648.2%
Claude Sonnet 4.5100100350047.0%
Claude Haiku 4.5100100260045.1%
GPT-5.4 (Reasoning, Low)90614818043.3%
GPT-5.210094150041.8%
Qwen 3 32B10010040040.8%
GPT-5.4 Nano (Reasoning, Low)10010030040.6%
Gemma 3 4B10010000040.0%
Grok 4 Fast10069190037.7%
Stealth: Healer Alpha1008300036.6%
Gemini 2.5 Flash (Reasoning)91332921035.0%
Gemini 2.5 Flash Lite (Reasoning)8677110034.7%
Qwen 3.5 35B1005700031.4%
GPT-5.4 Nano585100021.9%
MoonshotAI: Kimi K2.5832200021.1%
GPT-4.1 Nano100500021.1%
Gemini 2.5 Flash Lite100000020.0%
Writer: Palmyra X5100000020.0%
Nemotron 3 Super84000016.8%
GPT-5.4 Nano (Reasoning)3531150016.2%
DeepSeek V3.2612000016.1%
Aion 2.079000015.8%
Stealth: Hunter Alpha67000013.4%
LFM2 24B59000011.8%
Nemotron 3 Nano3200006.4%
GPT-5 Mini000000.0%
GPT-5 Nano000000.0%
Model #1 #2 #3 #4 #5 Avg
Gemini 3.1 Pro (Preview)100100100100100100.0%
GPT-5.1100100100100100100.0%
GPT-5100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Claude Sonnet 4.6100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
MiniMax M2.7100100100100100100.0%
MiniMax M2.5100100100100100100.0%
GPT-4.1100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Qwen 3.5 9B100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Small 4 (Reasoning)100100100100100100.0%
DeepSeek V3.1100100100100100100.0%
GPT-5.4 Nano (Reasoning)100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Mistral Large100100100100100100.0%
Inception Mercury100100100100100100.0%
GPT-5.4 Nano (Reasoning, Low)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
Gemma 3 27B100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Arcee AI: Trinity Mini100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Ministral 8B100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
GPT-4.1 Nano1001001001009999.9%
Claude 3.5 Sonnet1001001001009999.8%
Rocinante 12B1001001001009999.8%
Gemini 3 Pro (Preview)1001001001009298.4%
GPT-5.21001001001008897.7%
Claude Opus 4.6 (Reasoning)1001001001008797.4%
GPT-5.4 Mini (Reasoning)100100100949197.0%
GPT-5.4 (Reasoning)1001001001008496.8%
DeepSeek V3 (2025-03-24)1001001001008196.3%
Z.AI GLM 4.7 Flash1001001001007795.5%
DeepSeek-V2 Chat1001001001007795.5%
GPT-5.4 Nano1001001001007695.2%
Z.AI GLM 4.71001001001007394.7%
Gemini 3 Flash (Preview)1001001001007394.6%
Qwen 3.5 35B1001001001007394.5%
Ministral 3B1001001001007294.4%
Claude Sonnet 41001001001007194.1%
Inception Mercury 21001001001006893.6%
GPT-4o Mini (temp=0)1001001001006693.2%
DeepSeek V3.2100100100828192.6%
Qwen 3.5 Plus (2026-02-15)1001001001006292.4%
Mistral Small Creative1001001001005691.3%
ByteDance Seed 1.6 Flash1001001001005190.3%
GPT-5.4 Mini100100100826689.6%
Writer: Palmyra X51001001001004088.0%
GPT-5 Mini100100100984287.8%
Z.AI GLM 5 Turbo100100100924487.3%
Hermes 3 70B1001001001003687.2%
Stealth: Hunter Alpha100100100944287.2%
Arcee AI: Trinity Large (Preview)1001001001003687.1%
Aion 2.01001001001003587.1%
Z.AI GLM 4.51001001001003586.9%
Grok 4 Fast100100100923585.3%
Cohere Command R+ (Aug. 2024)1001001001002685.3%
Gemini 2.5 Flash Lite (Reasoning)100100100804184.1%
Qwen 3 32B1001001001001282.5%
Llama 3.1 Nemotron 70B1001001001001282.4%
Gemini 2.5 Pro100100100882482.4%
GPT-5.410010087626282.1%
Gemini 2.5 Flash Lite100100100100480.7%
Claude 3.5 Haiku100100100100280.3%
GPT-5.4 Mini (Reasoning, Low)100100100100080.1%
Grok 4.20 (Beta, Reasoning)100100100100080.0%
MoonshotAI: Kimi K2.5100100100100080.0%
Z.AI GLM 4.6100100100100080.0%
Gemini 2.5 Flash (Reasoning)100100100100080.0%
GPT-4o, May 13th (temp=1)100100100100080.0%
GPT-4.1 Mini100100100100080.0%
Qwen3 235B A22B Instruct 2507100100100100080.0%
Ministral 3 3B10010010079075.7%
LFM2 24B1001009945068.8%
Nemotron 3 Nano10010010025065.0%
Claude 3.7 Sonnet10010010022064.3%
Claude Opus 4.61001006845563.7%
Mistral Small 41001001006061.2%
Claude Sonnet 4.6 (Reasoning)1001001003060.6%
Z.AI GLM 51001001000060.0%
Stealth: Healer Alpha100100750054.9%
Nemotron 3 Super966200031.6%
GPT-5 Nano1000002.0%
Model #1 #2 #3 #4 #5 Avg
Gemini 3.1 Pro (Preview)100100100100100100.0%
GPT-5100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Claude Opus 4100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
Mistral Large 2100100100100100100.0%
Mistral Large100100100100100100.0%
Qwen3 235B A22B Instruct 2507100100100100100100.0%
Inception Mercury100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Rocinante 12B100100100100100100.0%
GPT-4.1 Nano1001001001009899.7%
Nemotron 3 Nano1001001001009899.6%
GPT-5.41001001001009899.5%
GPT-5.11001001001009799.5%
Cohere Command R+ (Aug. 2024)1001001001009799.4%
Claude 3.5 Sonnet1001001001008897.6%
Grok 4.20 (Beta)1001001001008797.4%
GPT-4o, Aug. 6th (temp=0)1001001001008597.1%
Z.AI GLM 4.71001001001008596.9%
Z.AI GLM 4.7 Flash1001001001008396.6%
LFM2 24B1001001001008396.5%
DeepSeek-V2 Chat1001001001007795.4%
MiniMax M2.51001001001007795.4%
Gemini 3 Flash (Preview)1001001001007595.1%
Qwen 3.5 Plus (2026-02-15)1001001001007494.9%
Gemini 2.5 Pro1001001001007394.6%
WizardLM 2 8x22b1001001001007394.6%
Hermes 3 70B1001001001007294.4%
GPT-5.4 Mini (Reasoning)1001001001007094.0%
Qwen 3.5 9B1001001001006893.6%
Mistral Large 31001001001006893.5%
Inception Mercury 21001001001006793.5%
Z.AI GLM 51001001001006793.4%
Qwen 3.5 27B1001001001006593.1%
Gemini 2.5 Flash Lite1001001001006392.6%
o4 Mini1001001001006292.3%
Qwen 3.5 122B1001001001005791.5%
Z.AI GLM 5 Turbo1001001001005791.3%
Qwen 3.5 Flash1001001001005591.1%
GPT-5.4 Nano (Reasoning)100100100926491.1%
Z.AI GLM 4.51001001001005390.6%
Gemini 3 Pro (Preview)100100100866690.4%
GPT-5.4 (Reasoning, Low)100100100955790.3%
GPT-5.4 (Reasoning)1001001001004989.8%
Gemini 3.1 Flash Lite (Preview)1001001001004589.1%
Claude Opus 4.6 (Reasoning)100100100983887.0%
Ministral 3 8B100100100983285.9%
Z.AI GLM 4.6100100100666385.7%
Gemma 3 12B1001001001002785.5%
MiniMax M2.7100100100864185.4%
Claude Haiku 4.51001001001002785.4%
GPT-5.4 Mini (Reasoning, Low)1001001001002685.1%
Claude Opus 4.6100100100715384.8%
Ministral 3 3B100100100784183.8%
Grok 4.1 Fast1001001001001883.6%
Claude Sonnet 4.51001001001001883.6%
GPT-5.4 Nano (Reasoning, Low)100100100654782.5%
GPT-5 Mini10010094813581.9%
Hermes 3 405B100100100911881.8%
Claude 3.7 Sonnet100100100100581.0%
Claude Sonnet 4100100100100480.7%
GPT-5.210010091585480.6%
Gemma 3 27B100100100100080.0%
Ministral 8B100100100841078.8%
Qwen 3.5 397B A17B100100100692278.2%
GPT-5.4 Mini100100100612176.3%
GPT-4.110010010065273.4%
Qwen 3.5 35B10010010062072.3%
DeepSeek V3.1100100100421571.2%
Aion 2.010010010053070.6%
Mistral Small 4 (Reasoning)1001009153068.7%
Ministral 3B10010010037067.4%
Gemini 2.5 Flash10010072412066.7%
Gemini 2.5 Flash Lite (Reasoning)10010010033066.6%
Gemini 2.5 Flash (Reasoning)10010010021064.1%
DeepSeek V3.210010010019063.9%
Stealth: Healer Alpha1009273361663.5%
DeepSeek V3 (2025-03-24)10010010015063.0%
Mistral Medium 3.11001001009061.8%
DeepSeek V3 (2024-12-26)10010067281461.6%
Arcee AI: Trinity Mini1001006441060.9%
Grok 41001008319060.6%
Claude Sonnet 4.61001001000060.0%
Qwen 3 32B1001001000060.0%
GPT-5.4 Nano1001001000060.0%
Nemotron 3 Super100100920058.5%
ByteDance Seed 1.6 Flash100100350047.0%
Grok 4 Fast10087230041.9%
Stealth: Hunter Alpha10010000040.0%
Claude 3.5 Haiku10010000040.0%
MoonshotAI: Kimi K2.510064290038.6%
Claude Sonnet 4.6 (Reasoning)1006684035.6%
Writer: Palmyra X555383736033.0%
Mistral Small 410024180028.5%
GPT-5 Nano1200002.4%
Model #1 #2 #3 #4 #5 Avg
Claude Sonnet 4.6 (Reasoning)100100100100100100.0%
GPT-5.1100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
Claude Opus 4.5100100100100100100.0%
Grok 4.1 Fast100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
Claude Sonnet 4100100100100100100.0%
MiniMax M2.5100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Qwen 3.5 Plus (2026-02-15)100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
DeepSeek-V2 Chat100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
GPT-5.4100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
Stealth: Aurora Alpha100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
Gemini 2.5 Flash Lite100100100100100100.0%
Gemini 2.5 Flash100100100100100100.0%
Mistral Large100100100100100100.0%
Writer: Palmyra X5100100100100100100.0%
Inception Mercury100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Mistral Medium 3.1100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
ByteDance Seed 1.6 Flash100100100100100100.0%
Mistral Small Creative100100100100100100.0%
Ministral 3 14B100100100100100100.0%
GPT-4.1 Nano100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
WizardLM 2 8x22b100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Ministral 3 3B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Ministral 8B100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Rocinante 12B100100100100100100.0%
Claude Opus 4.61001001001009999.8%
GPT-5.4 Nano1001001001009999.8%
Claude 3.5 Sonnet1001001001009498.9%
DeepSeek V3.11001001001009398.6%
Qwen 3 32B1001001001009198.2%
Gemini 3.1 Pro (Preview)1001001001008997.8%
Hermes 3 70B1001001001008797.5%
Gemini 3 Flash (Preview)1001001001008697.1%
Llama 3.1 Nemotron 70B1001001001008697.1%
Claude Haiku 4.51001001001008196.2%
GPT-5.21001001001008095.9%
DeepSeek V3 (2025-03-24)1001001001007795.3%
Qwen 3.5 27B1001001001007695.2%
GPT-51001001001007294.4%
Aion 2.01001001001006893.6%
Mistral Small 4 (Reasoning)1001001001006893.5%
GPT-4.1 Mini1001001001006693.2%
Z.AI GLM 4.7 Flash1001001001005791.4%
GPT-5.4 Nano (Reasoning, Low)1001001001005190.2%
Qwen 3.5 9B1001001001004689.3%
Claude Opus 4.6 (Reasoning)1001001001004488.9%
Gemini 3.1 Flash Lite (Preview)1001001001003687.3%
GPT-5.4 (Reasoning)1001001001003586.9%
GPT-5.4 Nano (Reasoning)100100100765886.7%
Gemini 2.5 Flash Lite (Reasoning)100100100785285.9%
MiniMax M2.71001001001003085.9%
GPT-5 Mini1001001001002885.5%
GPT-4o, May 13th (temp=0)1001001001001683.2%
Gemma 3 27B1001001001001583.1%
Z.AI GLM 5 Turbo1001001001001382.6%
Qwen3 235B A22B Instruct 2507100100100664381.9%
GPT-4.1100100100614781.6%
Z.AI GLM 4.6100100100100080.1%
o4 Mini100100100100080.0%
Claude 3.5 Haiku100100100100080.0%
GPT-4o, Aug. 6th (temp=1)100100100100080.0%
DeepSeek V3.2100100100100080.0%
Ministral 3B100100100100080.0%
MoonshotAI: Kimi K2.510010010098079.6%
Gemini 2.5 Pro10010010088077.5%
Gemini 2.5 Flash (Reasoning)10010010064774.2%
Arcee AI: Trinity Mini10010076691171.3%
Nemotron 3 Super10010010053070.6%
Gemma 3 4B100100100222168.6%
Claude Sonnet 4.610010010031066.2%
Mistral Small 4100938054165.6%
Stealth: Hunter Alpha1009243423662.6%
LFM2 24B10099760055.2%
Stealth: Healer Alpha1001004618052.9%
Nemotron 3 Nano100774336051.1%
GPT-5 Nano887200031.9%
Model #1 #2 #3 #4 #5 Avg
Claude Opus 4.6 (Reasoning)100100100100100100.0%
Gemini 3.1 Pro (Preview)100100100100100100.0%
GPT-5.1100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
GPT-5.4 (Reasoning, Low)100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
Gemini 3 Flash (Preview, Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
MiniMax M2.5100100100100100100.0%
o4 Mini100100100100100100.0%
Grok 4100100100100100100.0%
Claude Sonnet 4.5100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
Grok 4 Fast100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
GPT-5.4 Mini (Reasoning, Low)100100100100100100.0%
Mistral Large 3100100100100100100.0%
GPT-4o, May 13th (temp=0)100100100100100100.0%
Claude Haiku 4.5100100100100100100.0%
Z.AI GLM 4.7 Flash100100100100100100.0%
ByteDance Seed 2.0 Lite100100100100100100.0%
GPT-5.4100100100100100100.0%
Grok 4.20 (Beta)100100100100100100.0%
Inception Mercury 2100100100100100100.0%
GPT-4o, May 13th (temp=1)100100100100100100.0%
Claude 3.7 Sonnet100100100100100100.0%
GPT-4.1 Mini100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-4o, Aug. 6th (temp=1)100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-5.4 Mini100100100100100100.0%
Mistral Large 2100100100100100100.0%
DeepSeek V3.1100100100100100100.0%
DeepSeek V3 (2025-03-24)100100100100100100.0%
Mistral Large100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Nemotron 3 Nano100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
Arcee AI: Trinity Large (Preview)100100100100100100.0%
Hermes 3 70B100100100100100100.0%
Ministral 3 14B100100100100100100.0%
GPT-4.1 Nano100100100100100100.0%
Ministral 3 8B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Arcee AI: Trinity Mini100100100100100100.0%
Gemma 3 4B100100100100100100.0%
Mistral NeMO100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Ministral 3B100100100100100100.0%
Qwen 3 32B1001001001009298.3%
Ministral 8B1001001001009198.3%
Stealth: Aurora Alpha1001001001008997.9%
WizardLM 2 8x22b1001001001008997.8%
Mistral Medium 3.11001001001008997.8%
Inception Mercury1001001001008897.6%
GPT-5100100100948896.4%
Gemini 3 Flash (Preview)1001001001008096.0%
Grok 4.1 Fast1001001001008095.9%
GPT-4.11001001001007795.5%
GPT-5.2100100100898694.9%
Z.AI GLM 51001001001006392.6%
ByteDance Seed 1.6 Flash1001001001006192.3%
Mistral Small Creative1001001001005991.8%
Gemini 2.5 Pro1001001001004989.8%
DeepSeek V3 (2024-12-26)1001001001004789.5%
Claude Opus 41001001001004789.3%
Qwen3 235B A22B Instruct 25071001001001003987.8%
Qwen 3.5 27B1001001001003787.4%
Gemma 3 27B1001001001003787.3%
GPT-5.4 Nano (Reasoning, Low)1001001001003687.2%
Claude Opus 4.5100100100904587.0%
Z.AI GLM 4.7100100100953986.9%
Cohere Command R+ (Aug. 2024)1001001001003286.3%
Gemini 3 Pro (Preview)100100100953285.4%
Claude Sonnet 41001001001002584.9%
Writer: Palmyra X51001001001002484.8%
Rocinante 12B100100100893284.1%
Claude Sonnet 4.6 (Reasoning)100100100655584.1%
Ministral 3 3B1001001001001983.9%
GPT-5.4 (Reasoning)100100100932583.6%
Claude Sonnet 4.6100100100655183.1%
Qwen 3.5 Plus (2026-02-15)10010089605981.6%
Gemini 2.5 Flash100100100100280.5%
Z.AI GLM 5 Turbo100100100100080.0%
Z.AI GLM 4.6100100100100080.0%
Claude 3.5 Haiku100100100100080.0%
Mistral Small 4 (Reasoning)100100100100080.0%
LFM2 24B10010010097079.5%
GPT-5 Mini100100100574079.3%
Qwen 3.5 397B A17B100100100672879.1%
Claude Opus 4.61001009691077.3%
MiniMax M2.7100100100751177.2%
GPT-5.4 Nano (Reasoning)100100100591875.5%
Claude 3.5 Sonnet10010010072074.4%
Aion 2.010010010056872.8%
Gemini 2.5 Flash Lite1001008558068.6%
DeepSeek-V2 Chat10010010026065.2%
Nemotron 3 Super10010010025065.0%
GPT-5.4 Nano10010010011062.2%
Qwen 3.5 9B1001001008061.5%
Gemini 2.5 Flash (Reasoning)100785224050.7%
MoonshotAI: Kimi K2.5100100460049.2%
Qwen 3.5 Flash100100386048.8%
Stealth: Hunter Alpha100100340046.8%
Gemini 2.5 Flash Lite (Reasoning)10083440045.5%
DeepSeek V3.29483443044.9%
Stealth: Healer Alpha10010000040.0%
Mistral Small 41008200036.4%
GPT-5 Nano3300006.6%

Novelcrafter Default Prompt

Model #1 #2 #3 #4 #5 Avg
Gemini 3.1 Pro (Preview)100100100100100100.0%
Z.AI GLM 5 Turbo100100100100100100.0%
GPT-5.4 (Reasoning)100100100100100100.0%
Qwen 3.5 122B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
Qwen 3.5 27B100100100100100100.0%
ByteDance Seed 1.6100100100100100100.0%
GPT-5.4 Mini (Reasoning)100100100100100100.0%
o4 Mini High100100100100100100.0%
Gemini 3 Pro (Preview)100100100100100100.0%
Z.AI GLM 4.7100100100100100100.0%
o4 Mini100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
ByteDance Seed 2.0 Mini100100100100100100.0%
Gemini 3.1 Flash Lite (Preview)100100100100100100.0%
DeepSeek V3 (2024-12-26)100100100100100100.0%
Hermes 3 405B100100100100100100.0%
GPT-4o, Aug. 6th (temp=0)100100100100100100.0%
GPT-4o Mini (temp=1)100100100100100100.0%
Mistral Small 3.2 24B100100100100100100.0%
Gemma 3 12B100100100100100100.0%
Llama 3.1 70B100100100100100100.0%
GPT-4o Mini (temp=0)100100100100100100.0%
Qwen 2.5 72B100100100100100100.0%
Llama 3.1 Nemotron 70B100100100100100100.0%
GPT-5.4 Nano100100100100100100.0%
Hermes 3 70B100100100100100100.0%
Claude 3 Haiku100100100100100100.0%
Cohere Command R+ (Aug. 2024)100100100100100100.0%
Mistral NeMO100100100100100100.0%
Llama 3.1 8B100100100100100100.0%
Claude Opus 4.6 (Reasoning)10010010010010099.9%
Qwen 3.5 9B1001001001009999.7%
Gemini 3 Flash (Preview)1001001001009999.7%
Stealth: Aurora Alpha1001001001009498.8%
Gemini 3 Flash (Preview, Reasoning)1001001001009398.5%
GPT-5.4 Nano (Reasoning)1001001001009298.5%
ByteDance Seed 2.0 Lite1001001001009198.1%
GPT-5.11001001001008897.7%
Qwen 3 32B100100100979097.5%
GPT-5.4 Nano (Reasoning, Low)1001001001008496.9%
Qwen 3.5 Plus (2026-02-15)100100100937593.6%
Qwen 3.5 397B A17B1001001001006693.1%
GPT-4o, May 13th (temp=0)1001001001004789.5%
LFM2 24B1001001001004188.2%
Qwen 3.5 Flash1001001001002084.1%
Z.AI GLM 4.7 Flash1001001001001583.0%
Grok 4.20 (Beta)1001001001001583.0%
GPT-5.4 (Reasoning, Low)1009783785682.7%
Grok 410010097902381.8%
GPT-5.4 Mini100100100684081.5%
GPT-4o, May 13th (temp=1)100100100100581.0%
Mistral Large 2100100100100280.4%
Grok 4.1 Fast100100100100080.0%
Inception Mercury 2100100100100080.0%
Claude 3.5 Haiku100100100100080.0%
Inception Mercury100100100100080.0%
Gemma 3 27B100100100100080.0%
Gemma 3 4B100100100100080.0%
Ministral 8B100100100100080.0%
Rocinante 12B100100100100080.0%
Claude Sonnet 4.6100100100100080.0%
GPT-5.410010010097079.3%
Ministral 3B100100100534178.7%
Z.AI GLM 4.610010010093078.5%
GPT-51001009785076.4%
Claude Opus 4.61001008478072.4%
DeepSeek-V2 Chat10010010062072.3%
Claude Opus 4.510010010059071.8%
GPT-5.4 Mini (Reasoning, Low)10010010052070.3%
GPT-4.110010010051070.2%
Z.AI GLM 4.510010010048069.6%
Z.AI GLM 510010010034066.9%
Ministral 3 14B1001009338066.3%
Gemini 2.5 Pro10010010030065.9%
MiniMax M2.71001009137065.7%
GPT-5.210010010026065.3%
Claude 3.7 Sonnet10010010012062.4%
Grok 4 Fast1001007932062.2%
Mistral Large1001001007061.5%
Claude Haiku 4.51001001005061.1%
WizardLM 2 8x22b1001001005061.1%
Arcee AI: Trinity Large (Preview)1001001004060.7%
Ministral 3 8B1001005547060.5%
MiniMax M2.51001001000060.0%
Claude 3.5 Sonnet1001001000060.0%
Nemotron 3 Nano1001001000060.0%
GPT-4.1 Nano1001001000060.0%
Arcee AI: Trinity Mini1001001000060.0%
Mistral Small 4 (Reasoning)100100970059.5%
Qwen3 235B A22B Instruct 25071001005638058.7%
Mistral Small Creative100100910058.1%
Nemotron 3 Super100100840056.9%
DeepSeek V3.11001005117053.5%
Stealth: Healer Alpha100787118053.3%
DeepSeek V3.2100686020049.5%
Mistral Medium 3.1100100450048.9%
Claude Sonnet 4100100190043.9%
Gemini 2.5 Flash Lite10094100040.8%
GPT-4.1 Mini10010010040.1%
Aion 2.010057440040.0%
Claude Opus 410010000040.0%
ByteDance Seed 1.6 Flash10010000040.0%
Claude Sonnet 4.6 (Reasoning)1009800039.6%
Mistral Small 410067310039.6%
GPT-5 Mini100353121237.7%
Gemini 2.5 Flash (Reasoning)1007900035.8%
GPT-4o, Aug. 6th (temp=1)1003600027.2%
Ministral 3 3B10017152026.9%
Writer: Palmyra X51001900023.8%
DeepSeek V3 (2025-03-24)1001200022.4%
Mistral Large 386000017.2%
Gemini 2.5 Flash54820012.9%
MoonshotAI: Kimi K2.5491000011.6%
Claude Sonnet 4.54600009.2%
Gemini 2.5 Flash Lite (Reasoning)2800005.6%
Stealth: Hunter Alpha000000.0%
GPT-5 Nano000000.0%
Model #1 #2 #3 #4 #5 Avg
Gemini 3.1 Pro (Preview)100100100100100100.0%
Qwen 3.5 397B A17B100100100100100100.0%
Grok 4.20 (Beta, Reasoning)100100100100100100.0%
Z.AI GLM 5100100100100100100.0%
o4 Mini High100100100100100100.0%
o4 Mini100100100100100100.0%
Qwen 3.5 35B100100100100100100.0%
Claude Opus 4100100100100100100.0%
Qwen 3.5 Flash100100100100100100.0%
Z.AI GLM 4.5100100100100100100.0%
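GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 91 | 98.2%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 89 | 97.8%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 88 | 97.6%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 88 | 97.6%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 85 | 97.1%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 85 | 97.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 85 | 97.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 80 | 96.0%
Qwen 3 32B | 100 | 100 | 100 | 92 | 85 | 95.4%
Mistral Small Creative | 100 | 100 | 100 | 100 | 74 | 94.7%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 70 | 94.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 70 | 93.9%
GPT-5.1 | 100 | 100 | 100 | 100 | 70 | 93.9%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 66 | 93.2%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 58 | 91.6%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 55 | 91.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 85 | 70 | 90.9%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 52 | 90.3%
GPT-5.4 | 100 | 100 | 100 | 98 | 53 | 90.2%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 48 | 89.5%
Ministral 3 14B | 100 | 100 | 100 | 76 | 68 | 88.8%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 43 | 88.6%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 99 | 90 | 49 | 87.4%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 36 | 87.2%
Ministral 3 8B | 100 | 100 | 100 | 100 | 32 | 86.4%
Hermes 3 70B | 100 | 100 | 100 | 74 | 55 | 85.8%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 27 | 85.4%
Grok 4 | 100 | 100 | 100 | 63 | 60 | 84.6%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 21 | 84.2%
Ministral 8B | 100 | 100 | 100 | 73 | 44 | 83.5%
Claude Opus 4.5 | 100 | 80 | 80 | 79 | 75 | 82.7%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 59 | 51 | 81.9%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 0 | 80.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 0 | 80.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 58 | 40 | 79.6%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 94 | 56 | 46 | 79.1%
Claude Sonnet 4.6 | 100 | 100 | 98 | 96 | 0 | 78.8%
GPT-5.2 | 100 | 100 | 100 | 92 | 0 | 78.5%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 96 | 67 | 28 | 78.2%
Mistral Small 4 | 100 | 100 | 100 | 91 | 0 | 78.2%
Gemini 2.5 Pro | 100 | 100 | 100 | 90 | 0 | 78.1%
Mistral Large 3 | 100 | 100 | 100 | 89 | 0 | 77.9%
Z.AI GLM 4.6 | 100 | 100 | 95 | 74 | 19 | 77.7%
MiniMax M2.7 | 100 | 100 | 100 | 71 | 0 | 74.2%
Mistral Large 2 | 100 | 100 | 100 | 36 | 35 | 74.1%
Gemma 3 4B | 100 | 100 | 100 | 32 | 25 | 71.5%
MiniMax M2.5 | 100 | 100 | 71 | 38 | 29 | 67.5%
Claude Opus 4.6 | 100 | 100 | 100 | 27 | 0 | 65.5%
ByteDance Seed 1.6 Flash | 100 | 66 | 57 | 55 | 47 | 64.9%
Writer: Palmyra X5 | 100 | 100 | 92 | 33 | 0 | 64.9%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 49 | 39 | 35 | 64.8%
Grok 4 Fast | 100 | 98 | 82 | 42 | 0 | 64.6%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 19 | 0 | 63.7%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-4.1 | 100 | 100 | 100 | 0 | 0 | 60.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 0 | 0 | 60.0%
Inception Mercury | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3 3B | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3B | 100 | 100 | 100 | 0 | 0 | 60.0%
LFM2 24B | 100 | 100 | 100 | 0 | 0 | 60.0%
GPT-5 | 100 | 100 | 57 | 35 | 4 | 59.1%
Inception Mercury 2 | 100 | 100 | 89 | 5 | 0 | 58.7%
Gemma 3 27B | 100 | 92 | 85 | 16 | 0 | 58.6%
Arcee AI: Trinity Mini | 100 | 100 | 82 | 0 | 0 | 56.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 75 | 0 | 0 | 54.9%
Qwen 3.5 Plus (2026-02-15) | 100 | 76 | 46 | 24 | 20 | 53.3%
Claude Haiku 4.5 | 100 | 100 | 33 | 22 | 0 | 51.0%
Claude 3.7 Sonnet | 100 | 100 | 38 | 12 | 0 | 49.9%
Z.AI GLM 4.7 Flash | 100 | 72 | 66 | 9 | 0 | 49.3%
WizardLM 2 8x22b | 100 | 75 | 62 | 0 | 0 | 47.5%
Gemini 2.5 Flash Lite | 100 | 100 | 32 | 0 | 0 | 46.4%
GPT-4.1 Nano | 100 | 100 | 15 | 0 | 0 | 43.1%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 0 | 0 | 0 | 40.0%
Stealth: Healer Alpha | 100 | 100 | 0 | 0 | 0 | 40.0%
Stealth: Hunter Alpha | 100 | 54 | 46 | 0 | 0 | 40.0%
Aion 2.0 | 100 | 65 | 32 | 0 | 0 | 39.4%
GPT-5 Mini | 100 | 79 | 0 | 0 | 0 | 35.8%
DeepSeek V3.2 | 100 | 42 | 30 | 0 | 0 | 34.4%
Gemini 2.5 Flash (Reasoning) | 100 | 14 | 0 | 0 | 0 | 22.8%
Nemotron 3 Super | 100 | 0 | 0 | 0 | 0 | 20.0%
Nemotron 3 Nano | 100 | 0 | 0 | 0 | 0 | 20.0%
Gemini 2.5 Flash Lite (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%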
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-4.1 Mini | 100 | 100 | 100 | 99 | 98 | 99.4%
GPT-5 Mini | 100 | 100 | 100 | 100 | 95 | 98.9%
Mistral NeMO | 100 | 100 | 100 | 100 | 95 | 98.9%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 93 | 98.7%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 90 | 97.9%
GPT-5.2 | 100 | 100 | 100 | 100 | 86 | 97.3%
Grok 4 | 100 | 100 | 100 | 100 | 85 | 97.1%
LFM2 24B | 100 | 100 | 100 | 100 | 78 | 95.7%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 73 | 94.5%
Ministral 3B | 100 | 100 | 100 | 100 | 65 | 93.0%
Mistral Large 3 | 100 | 100 | 100 | 94 | 70 | 92.8%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 63 | 92.7%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 57 | 91.4%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 56 | 91.1%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 54 | 90.7%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 51 | 90.3%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 50 | 90.1%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 49 | 89.8%
Gemma 3 4B | 100 | 100 | 100 | 100 | 40 | 87.9%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 91 | 48 | 87.9%
GPT-4.1 | 100 | 100 | 100 | 100 | 37 | 87.3%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 29 | 85.9%
GPT-5.4 | 100 | 100 | 100 | 71 | 56 | 85.4%
Grok 4 Fast | 100 | 100 | 100 | 62 | 55 | 83.3%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 15 | 83.1%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 13 | 82.5%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 3 | 80.6%
Aion 2.0 | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 82 | 0 | 76.4%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 65 | 8 | 74.6%
Claude Haiku 4.5 | 100 | 100 | 57 | 43 | 36 | 67.1%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 88 | 69 | 57 | 9 | 64.6%
Nemotron 3 Super | 100 | 100 | 0 | 0 | 0 | 40.0%
GPT-5 Nano | 47 | 27 | 0 | 0 | 0 | 14.9%
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 98 | 99.6%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 98 | 99.6%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 97 | 99.3%
Hermes 3 70B | 100 | 100 | 100 | 100 | 95 | 99.1%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 95 | 99.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 93 | 98.7%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 93 | 98.5%
Claude Opus 4 | 100 | 100 | 100 | 100 | 89 | 97.8%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 88 | 97.7%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 87 | 97.4%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 87 | 97.3%
Mistral Small 4 | 100 | 100 | 100 | 100 | 86 | 97.2%
Ministral 3B | 100 | 100 | 100 | 100 | 85 | 96.9%
Ministral 8B | 100 | 100 | 100 | 100 | 83 | 96.7%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 80 | 96.1%
GPT-5 Mini | 100 | 100 | 100 | 100 | 80 | 96.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 79 | 95.7%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 74 | 94.9%
GPT-5.2 | 100 | 100 | 100 | 100 | 74 | 94.9%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 64 | 92.9%
Gemma 3 4B | 100 | 100 | 100 | 100 | 64 | 92.8%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 62 | 92.4%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 53 | 90.6%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 52 | 90.5%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 51 | 90.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 51 | 90.1%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 48 | 89.7%
GPT-4.1 Nano | 100 | 100 | 100 | 96 | 31 | 85.3%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 25 | 85.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 63 | 54 | 83.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 13 | 82.7%
GPT-5 | 100 | 100 | 100 | 92 | 15 | 81.5%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 78 | 28 | 81.1%
Qwen 3 32B | 100 | 99 | 96 | 86 | 19 | 80.2%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 95 | 0 | 79.0%
Claude Opus 4.6 | 100 | 100 | 90 | 56 | 46 | 78.2%
Ministral 3 8B | 100 | 100 | 100 | 41 | 39 | 76.0%
DeepSeek V3.2 | 100 | 100 | 100 | 77 | 0 | 75.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 54 | 13 | 73.5%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 66 | 0 | 73.2%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 31 | 24 | 71.0%
Mistral NeMO | 100 | 100 | 100 | 54 | 0 | 70.8%
LFM2 24B | 100 | 100 | 83 | 62 | 0 | 69.2%
Grok 4 | 91 | 73 | 71 | 64 | 35 | 66.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 56 | 44 | 30 | 66.0%
Gemini 2.5 Flash | 100 | 100 | 83 | 31 | 0 | 62.8%
GPT-4.1 | 100 | 100 | 100 | 14 | 0 | 62.7%
Gemma 3 27B | 100 | 100 | 87 | 18 | 0 | 60.8%
Claude Sonnet 4.6 | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Small Creative | 100 | 100 | 95 | 0 | 0 | 58.9%
Aion 2.0 | 100 | 68 | 52 | 38 | 0 | 51.4%
Grok 4 Fast | 100 | 56 | 0 | 0 | 0 | 31.3%
GPT-5 Nano | 6 | 0 | 0 | 0 | 0 | 1.3%
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 98 | 99.7%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 97 | 99.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 91 | 98.1%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 90 | 97.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 89 | 97.9%
Qwen 3 32B | 100 | 100 | 100 | 100 | 86 | 97.3%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 86 | 97.3%
Mistral NeMO | 100 | 100 | 100 | 100 | 85 | 97.0%
Mistral Large | 100 | 100 | 100 | 100 | 85 | 96.9%
Ministral 3 8B | 100 | 100 | 100 | 100 | 84 | 96.8%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 75 | 95.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 73 | 94.5%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 72 | 94.4%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 71 | 94.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 70 | 93.9%
GPT-4.1 | 100 | 100 | 100 | 100 | 62 | 92.3%
Ministral 3 14B | 100 | 100 | 100 | 100 | 54 | 90.8%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 48 | 89.7%
LFM2 24B | 100 | 100 | 100 | 80 | 66 | 89.2%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 40 | 88.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 39 | 87.8%
Mistral Large 3 | 100 | 100 | 100 | 100 | 31 | 86.2%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 31 | 86.1%
Grok 4 Fast | 100 | 100 | 100 | 100 | 27 | 85.3%
Stealth: Healer Alpha | 100 | 100 | 100 | 91 | 22 | 82.6%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 71 | 39 | 81.9%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 4 | 80.9%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 0 | 80.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 73 | 17 | 78.1%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 66 | 23 | 77.9%
Gemma 3 4B | 100 | 100 | 100 | 28 | 27 | 71.1%
Nemotron 3 Super | 70 | 49 | 34 | 18 | 0 | 34.2%
GPT-5 Nano | 67 | 25 | 13 | 0 | 0 | 21.1%
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 99 | 99.7%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 97 | 99.4%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 93 | 98.5%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 93 | 98.5%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 93 | 98.5%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 92 | 98.4%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 91 | 98.2%
Mistral Small 4 | 100 | 100 | 100 | 100 | 90 | 98.1%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 89 | 97.8%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 86 | 97.2%
Claude Opus 4 | 100 | 100 | 100 | 100 | 76 | 95.2%
GPT-5.4 | 100 | 100 | 100 | 100 | 74 | 94.8%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 68 | 93.6%
GPT-5.1 | 100 | 100 | 100 | 100 | 68 | 93.5%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 66 | 93.3%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 63 | 92.7%
Claude Sonnet 4.5 | 100 | 100 | 100 | 99 | 61 | 92.1%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 57 | 91.3%
Stealth: Hunter Alpha | 100 | 100 | 100 | 81 | 69 | 90.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 48 | 89.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 44 | 88.8%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 93 | 47 | 88.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 33 | 86.6%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 32 | 86.4%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 22 | 84.5%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 22 | 84.5%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 18 | 83.6%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 13 | 82.7%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 8 | 81.5%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 5 | 81.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 3 | 80.6%
Mistral Small Creative | 100 | 100 | 100 | 90 | 11 | 80.4%
Aion 2.0 | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 14B | 100 | 100 | 100 | 95 | 0 | 79.1%
Stealth: Healer Alpha | 100 | 100 | 100 | 89 | 0 | 77.7%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 88 | 0 | 77.7%
GPT-5 Nano | 100 | 100 | 88 | 73 | 0 | 72.3%
Arcee AI: Trinity Mini | 100 | 100 | 77 | 66 | 14 | 71.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 98 | 54 | 0 | 70.4%
Claude 3.7 Sonnet | 100 | 100 | 100 | 49 | 0 | 69.9%
Claude Opus 4.6 | 100 | 96 | 90 | 33 | 11 | 66.1%
DeepSeek V3.2 | 100 | 100 | 49 | 0 | 0 | 49.8%