Past progressive (was/were + -ing) overuse

Test: Bad Writing Habits

Avg. Score: 90.2%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | GPT-4.1 Mini | 100.0% | $0.0027 | 19.0s | 100%
2 | Grok 4 Fast | 100.0% | $0.0017 | 24.1s | 100%
3 | Grok 4.1 Fast | 100.0% | $0.0018 | 37.8s | 100%
4 | GPT-4o, Aug. 6th (temp=1) | 100.0% | $0.018 | 24.4s | 100%
5 | o4 Mini | 99.8% | $0.015 | 25.7s | 97%
6 | Stealth: Aurora Alpha | 98.7% | $0.0000 | 9.8s | 86%
7 | GPT-4o Mini (temp=1) | 99.5% | $0.0012 | 34.8s | 90%
8 | o4 Mini High | 100.0% | $0.025 | 47.2s | 100%
9 | GPT-4.1 | 99.8% | $0.018 | 44.7s | 96%
10 | GPT-5 Mini | 99.7% | $0.0100 | 57.4s | 96%
11 | GPT-5 Nano | 99.9% | $0.0042 | 1.4m | 98%
12 | GPT-4o Mini (temp=0) | 98.5% | $0.0012 | 34.8s | 83%
13 | Claude 3 Haiku | 96.6% | $0.0025 | 14.9s | 78%
14 | GPT-4o, Aug. 6th (temp=0) | 98.7% | $0.023 | 22.7s | 82%
15 | GPT-4o, May 13th (temp=1) | 98.1% | $0.033 | 14.4s | 77%
16 | Grok 4 | 100.0% | $0.048 | 1.7m | 100%
17 | GPT-5.2 | 100.0% | $0.056 | 1.5m | 100%
18 | DeepSeek-V2 Chat | 96.7% | $0.0021 | 53.3s | 74%
19 | ByteDance Seed 1.6 Flash | 95.9% | $0.0013 | 27.3s | 68%
20 | GPT-5.1 | 100.0% | $0.054 | 1.8m | 100%
21 | Llama 3.1 Nemotron 70B | 96.1% | $0.0038 | 31.7s | 68%
22 | Writer: Palmyra X5 | 95.8% | $0.011 | 22.0s | 67%
23 | DeepSeek V3 (2025-03-24) | 96.1% | $0.0014 | 39.4s | 66%
24 | Qwen 3.5 Plus (2026-02-15) | 94.9% | $0.0060 | 31.5s | 64%
25 | GPT-4o, May 13th (temp=0) | 95.8% | $0.035 | 14.1s | 69%
26 | Arcee AI: Trinity Mini | 93.1% | $0.0003 | 9.2s | 59%
27 | Gemini 2.5 Flash | 94.0% | $0.0052 | 10.6s | 59%
28 | Hermes 3 405B | 95.3% | $0.0032 | 53.2s | 66%
29 | Hermes 3 70B | 95.9% | $0.0010 | 1.2m | 68%
30 | Cohere Command R+ (Aug. 2024) | 96.4% | $0.020 | 52.5s | 68%
31 | Qwen 2.5 72B | 92.6% | $0.0010 | 36.7s | 64%
32 | Gemini 2.5 Flash Lite | 93.0% | $0.0009 | 9.5s | 56%
33 | Claude 3.7 Sonnet | 96.8% | $0.042 | 46.7s | 71%
34 | Z.AI GLM 4.5 | 93.4% | $0.0051 | 42.1s | 59%
35 | Mistral Medium 3.1 | 92.2% | $0.0048 | 36.5s | 59%
36 | DeepSeek V3 (2024-12-26) | 93.3% | $0.0021 | 54.6s | 59%
37 | GPT-5 | 100.0% | $0.065 | 2.8m | 100%
38 | Claude Sonnet 4.5 | 93.7% | $0.035 | 38.1s | 62%
39 | Claude 3.5 Sonnet | 94.3% | $0.048 | 35.5s | 64%
40 | Gemma 3 12B | 90.9% | $0.0004 | 41.3s | 52%
41 | GPT-4.1 Nano | 88.9% | $0.0007 | 13.3s | 49%
42 | Arcee AI: Trinity Large (Preview) | 90.8% | $0.0000 | 43.6s | 51%
43 | Mistral NeMO | 88.0% | $0.0005 | 10.1s | 48%
44 | Gemma 3 4B | 88.5% | $0.0002 | 20.0s | 50%
45 | MoonshotAI: Kimi K2.5 | 97.8% | $0.019 | 3.2m | 80%
46 | Claude Sonnet 4 | 92.3% | $0.032 | 43.7s | 58%
47 | Llama 3.1 8B | 92.5% | $0.0003 | 1.3m | 51%
48 | Mistral Large | 88.9% | $0.014 | 30.9s | 50%
49 | Mistral Large 2 | 89.0% | $0.013 | 29.4s | 49%
50 | Mistral Large 3 | 87.9% | $0.0033 | 30.3s | 46%
51 | WizardLM 2 8x22b | 91.8% | $0.0026 | 1.8m | 57%
52 | ByteDance Seed 1.6 | 95.0% | $0.013 | 2.5m | 64%
53 | Rocinante 12B | 87.6% | $0.0014 | 38.4s | 44%
54 | Llama 3.1 70B | 85.6% | $0.0015 | 29.4s | 40%
55 | Mistral Small Creative | 83.8% | $0.0007 | 9.1s | 37%
56 | Qwen 3.5 397B A17B | 92.8% | $0.014 | 3.0m | 61%
57 | Gemini 3.1 Pro (Preview) | 96.2% | $0.107 | 1.8m | 72%
58 | DeepSeek V3.2 | 86.9% | $0.0014 | 1.9m | 49%
59 | Ministral 3 3B | 82.2% | $0.0005 | 11.1s | 31%
60 | Z.AI GLM 5 | 85.8% | $0.0084 | 1.2m | 41%
61 | Ministral 8B | 79.0% | $0.0004 | 10.4s | 35%
62 | Claude Opus 4.5 | 87.8% | $0.070 | 53.4s | 50%
63 | Claude Haiku 4.5 | 79.4% | $0.011 | 21.6s | 34%
64 | Gemma 3 27B | 80.7% | $0.0006 | 52.6s | 32%
65 | Claude Opus 4.6 | 88.3% | $0.078 | 1.2m | 50%
66 | Ministral 3 14B | 76.4% | $0.0007 | 11.7s | 28%
67 | Ministral 3 8B | 76.7% | $0.0008 | 19.6s | 29%
68 | Minimax M2.5 | 80.5% | $0.0034 | 1.3m | 34%
69 | Gemini 2.5 Pro | 80.0% | $0.036 | 36.2s | 38%
70 | Claude 3.5 Haiku | 76.3% | $0.0035 | 10.8s | 22%
71 | Gemini 3 Flash (Preview) | 72.1% | $0.0078 | 19.6s | 25%
72 | Ministral 3B | 70.1% | $0.0001 | 8.1s | 19%
73 | DeepSeek V3.1 | 76.3% | $0.0020 | 1.8m | 32%
74 | Gemini 3 Pro (Preview) | 76.5% | $0.055 | 54.4s | 30%
75 | Claude Opus 4 | 94.0% | $0.209 | 1.4m | 62%
76 | Z.AI GLM 4.6 | 69.7% | $0.0065 | 51.5s | 22%
77 | Claude Sonnet 4.6 | 71.3% | $0.031 | 39.3s | 22%
78 | Z.AI GLM 4.7 | 66.7% | $0.010 | 1.4m | 17%
79 | Z.AI GLM 4.7 Flash | 60.8% | $0.0017 | 1.2m | 14%
80 | Mistral Small 3.2 24B | 82.8% | $0.0069 | 5.7m | 33%
Average score: 90.16%

Individual Scenarios

Detailed Writing Rules

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 97 | 99.4%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 82 | 96.4%
Mistral Large 3 | 100 | 100 | 100 | 100 | 78 | 95.6%
Mistral Small Creative | 100 | 100 | 100 | 100 | 75 | 95.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 73 | 94.5%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 71 | 94.2%
Mistral Large 2 | 100 | 100 | 100 | 100 | 70 | 94.0%
Mistral Large | 100 | 100 | 100 | 97 | 68 | 92.9%
Gemma 3 27B | 100 | 100 | 100 | 86 | 77 | 92.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 50 | 90.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 48 | 89.7%
Qwen 2.5 72B | 100 | 100 | 100 | 74 | 73 | 89.4%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 47 | 89.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 44 | 88.7%
Z.AI GLM 4.7 | 100 | 100 | 100 | 87 | 44 | 86.2%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 21 | 84.2%
Mistral Small 3.2 24B | 100 | 100 | 89 | 86 | 42 | 83.4%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 86 | 83 | 0 | 73.8%
Ministral 3B | 100 | 100 | 100 | 50 | 0 | 70.0%
Ministral 3 8B | 100 | 100 | 100 | 28 | 8 | 67.1%
Ministral 3 14B | 100 | 100 | 100 | 10 | 2 | 62.3%
Gemini 3 Flash (Preview) | 93 | 84 | 51 | 50 | 0 | 55.5%
Ministral 8B | 91 | 81 | 54 | 50 | 0 | 55.2%
Ministral 3 3B | 100 | 89 | 64 | 0 | 0 | 50.6%
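
The Avg column above appears to be the arithmetic mean of the five per-run scores (# 1 to # 5). That is an inference from the rows, not something the page states. A minimal Python sketch under that assumption, using the GPT-4o, May 13th (temp=1) row; the function name `run_average` is hypothetical:

```python
def run_average(scores):
    """Return the mean of per-run scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)

# Row taken from the table above: GPT-4o, May 13th (temp=1)
print(run_average([100, 100, 100, 100, 97]))  # 99.4
```

Some listed averages differ from this calculation by 0.1 (for example, 94.5% is listed for Claude Sonnet 4, while 100, 100, 100, 100, 73 gives 94.6), which suggests the displayed run scores are rounded before display.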
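
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 98 | 99.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 93 | 98.6%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 88 | 97.6%
GPT-5 Nano | 100 | 100 | 99 | 96 | 93 | 97.5%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 85 | 97.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 85 | 96.9%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 82 | 96.5%
Mistral Large 3 | 100 | 100 | 100 | 93 | 89 | 96.4%
Claude Opus 4.5 | 100 | 100 | 100 | 95 | 83 | 95.7%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 78 | 95.6%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 69 | 93.9%
Mistral Small Creative | 100 | 100 | 100 | 100 | 65 | 93.0%
Minimax M2.5 | 100 | 100 | 99 | 98 | 67 | 93.0%
Ministral 3 3B | 100 | 100 | 100 | 94 | 67 | 92.2%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 59 | 91.8%
Z.AI GLM 4.5 | 100 | 100 | 100 | 80 | 69 | 89.9%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 46 | 89.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 41 | 88.3%
Ministral 3B | 100 | 100 | 100 | 89 | 48 | 87.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 87 | 43 | 85.9%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 25 | 84.9%
Stealth: Aurora Alpha | 100 | 100 | 100 | 72 | 46 | 83.5%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 17 | 83.5%
Gemma 3 4B | 100 | 100 | 100 | 58 | 56 | 82.8%
Claude 3.7 Sonnet | 100 | 100 | 100 | 95 | 13 | 81.7%
Mistral Large 2 | 100 | 100 | 100 | 88 | 19 | 81.4%
Gemini 2.5 Pro | 100 | 100 | 92 | 59 | 49 | 80.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 0 | 80.0%
GPT-4.1 Nano | 100 | 100 | 100 | 64 | 34 | 79.6%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 90 | 1 | 78.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 48 | 43 | 78.2%
Llama 3.1 70B | 100 | 100 | 86 | 54 | 48 | 77.7%
DeepSeek V3.2 | 100 | 100 | 97 | 48 | 18 | 72.6%
Writer: Palmyra X5 | 100 | 100 | 81 | 80 | 0 | 72.2%
Claude 3.5 Haiku | 100 | 100 | 100 | 61 | 0 | 72.2%
Qwen 2.5 72B | 100 | 100 | 69 | 66 | 25 | 72.2%
Hermes 3 405B | 100 | 100 | 100 | 37 | 0 | 67.5%
Rocinante 12B | 100 | 100 | 100 | 28 | 6 | 66.9%
Gemma 3 27B | 100 | 100 | 69 | 60 | 4 | 66.7%
Claude Sonnet 4 | 100 | 100 | 72 | 46 | 0 | 63.6%
Gemini 3 Flash (Preview) | 100 | 100 | 69 | 33 | 0 | 60.4%
Z.AI GLM 4.7 | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 8B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemini 2.5 Flash Lite | 100 | 100 | 80 | 0 | 0 | 56.0%
Ministral 8B | 100 | 80 | 71 | 26 | 0 | 55.4%
Mistral NeMO | 100 | 66 | 64 | 26 | 21 | 55.3%
Z.AI GLM 4.6 | 100 | 74 | 66 | 30 | 0 | 54.1%
Claude Haiku 4.5 | 100 | 88 | 29 | 10 | 0 | 45.2%
DeepSeek V3.1 | 100 | 68 | 33 | 5 | 0 | 41.2%
Ministral 3 8B | 100 | 40 | 28 | 4 | 0 | 34.4%
Z.AI GLM 4.7 Flash | 100 | 37 | 29 | 0 | 0 | 33.3%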

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 98 | 99.6%
Z.AI GLM 4.5 | 100 | 100 | 100 | 97 | 96 | 98.5%
Hermes 3 405B | 100 | 100 | 100 | 100 | 92 | 98.5%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 92 | 98.4%
Claude 3.7 Sonnet | 100 | 100 | 100 | 95 | 95 | 98.0%
Ministral 3B | 100 | 100 | 100 | 100 | 84 | 96.8%
Ministral 3 14B | 100 | 100 | 100 | 100 | 67 | 93.5%
Gemma 3 12B | 100 | 100 | 100 | 97 | 65 | 92.3%
Gemma 3 27B | 100 | 100 | 100 | 84 | 73 | 91.3%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 53 | 90.7%
Mistral Small Creative | 100 | 100 | 100 | 100 | 36 | 87.2%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 85 | 46 | 86.2%
Claude Sonnet 4.5 | 100 | 100 | 100 | 77 | 45 | 84.5%
DeepSeek V3.1 | 100 | 100 | 100 | 65 | 52 | 83.3%
Claude Haiku 4.5 | 100 | 100 | 100 | 53 | 49 | 80.4%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 3 Pro (Preview) | 100 | 100 | 92 | 84 | 0 | 75.4%
Ministral 3 8B | 100 | 100 | 72 | 50 | 28 | 70.2%
Ministral 8B | 100 | 100 | 91 | 32 | 0 | 64.7%
Rocinante 12B | 100 | 100 | 57 | 50 | 0 | 61.4%
Z.AI GLM 4.7 Flash | 100 | 71 | 48 | 38 | 0 | 51.4%
Z.AI GLM 4.7 | 95 | 82 | 47 | 20 | 0 | 48.7%

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 94 | 98.8%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 92 | 98.5%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 92 | 98.3%
Mistral Large 2 | 100 | 100 | 100 | 98 | 91 | 97.7%
Mistral NeMO | 100 | 100 | 100 | 100 | 86 | 97.2%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 82 | 96.4%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 82 | 96.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 80 | 96.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 80 | 95.9%
Gemma 3 4B | 100 | 100 | 100 | 100 | 76 | 95.2%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 68 | 93.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 64 | 92.8%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 64 | 92.8%
Ministral 3 8B | 100 | 100 | 100 | 100 | 62 | 92.4%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 94 | 53 | 89.4%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 44 | 88.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 99 | 41 | 88.1%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 31 | 86.1%
Claude Haiku 4.5 | 100 | 100 | 100 | 67 | 53 | 84.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 20 | 84.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 94 | 11 | 81.1%
Ministral 3B | 100 | 100 | 100 | 100 | 4 | 80.8%
Mistral Small Creative | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 92 | 0 | 78.5%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 91 | 0 | 78.1%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 36 | 28 | 72.7%
GPT-4.1 Nano | 100 | 100 | 100 | 54 | 0 | 70.8%
Mistral Small 3.2 24B | 100 | 100 | 100 | 53 | 0 | 70.6%
Minimax M2.5 | 100 | 100 | 100 | 40 | 0 | 68.1%
Claude Opus 4.5 | 100 | 100 | 86 | 48 | 5 | 68.0%
Z.AI GLM 5 | 100 | 100 | 100 | 36 | 0 | 67.3%
Ministral 8B | 100 | 100 | 83 | 50 | 0 | 66.6%
Claude 3.5 Haiku | 100 | 100 | 100 | 15 | 0 | 63.0%
Gemma 3 27B | 100 | 100 | 100 | 12 | 0 | 62.4%
Ministral 3 14B | 100 | 88 | 64 | 58 | 0 | 62.0%
Gemini 3 Pro (Preview) | 100 | 100 | 56 | 24 | 0 | 55.9%
Gemini 2.5 Pro | 100 | 100 | 26 | 11 | 1 | 47.6%
Gemini 3 Flash (Preview) | 100 | 68 | 46 | 19 | 0 | 46.6%
Z.AI GLM 4.7 | 100 | 78 | 0 | 0 | 0 | 35.5%
Z.AI GLM 4.7 Flash | 90 | 21 | 0 | 0 | 0 | 22.2%

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 93 | 98.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 87 | 97.4%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 84 | 96.9%
Gemma 3 27B | 100 | 100 | 100 | 100 | 84 | 96.9%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 84 | 96.7%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 93 | 86 | 95.7%
Ministral 3 14B | 100 | 100 | 100 | 100 | 77 | 95.3%
Claude Opus 4.5 | 100 | 100 | 100 | 96 | 80 | 95.2%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 76 | 95.2%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 72 | 94.4%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 71 | 94.1%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 58 | 91.5%
Gemma 3 12B | 100 | 100 | 100 | 100 | 53 | 90.6%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 50 | 90.0%
Ministral 8B | 100 | 100 | 100 | 91 | 53 | 88.9%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 40 | 88.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 38 | 87.7%
Ministral 3 8B | 100 | 100 | 88 | 84 | 62 | 86.8%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 30 | 86.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 98 | 31 | 85.8%
Minimax M2.5 | 100 | 100 | 100 | 99 | 26 | 84.9%
Mistral NeMO | 100 | 100 | 100 | 93 | 15 | 81.6%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 0 | 80.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 87 | 11 | 79.4%
Ministral 3B | 100 | 100 | 85 | 79 | 17 | 76.3%
Mistral Small 3.2 24B | 100 | 100 | 100 | 45 | 0 | 69.0%
Rocinante 12B | 100 | 100 | 100 | 43 | 0 | 68.6%
Z.AI GLM 4.6 | 100 | 100 | 100 | 25 | 0 | 64.9%
Llama 3.1 70B | 100 | 100 | 100 | 0 | 0 | 60.0%

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 97 | 99.5%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 87 | 97.4%
Rocinante 12B | 100 | 100 | 100 | 100 | 87 | 97.4%
Ministral 8B | 100 | 100 | 100 | 100 | 84 | 96.8%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 80 | 95.9%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 77 | 95.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 77 | 95.3%
Mistral Large 3 | 100 | 100 | 100 | 100 | 71 | 94.1%
Mistral Small Creative | 100 | 100 | 100 | 100 | 70 | 94.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 68 | 93.6%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 65 | 93.1%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 53 | 90.7%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 49 | 89.9%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 45 | 89.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 43 | 88.6%
DeepSeek V3.2 | 100 | 100 | 100 | 84 | 55 | 87.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 37 | 87.5%
Claude Opus 4.5 | 100 | 100 | 100 | 95 | 40 | 87.1%
Z.AI GLM 5 | 100 | 100 | 100 | 64 | 60 | 84.7%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 22 | 84.3%
Claude Haiku 4.5 | 100 | 100 | 100 | 67 | 52 | 83.7%
Mistral Large 2 | 100 | 100 | 98 | 93 | 19 | 82.0%
Qwen 2.5 72B | 100 | 100 | 100 | 57 | 53 | 81.8%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 99 | 8 | 81.3%
Mistral NeMO | 100 | 100 | 92 | 67 | 41 | 80.1%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4 | 100 | 100 | 100 | 95 | 0 | 78.9%
Hermes 3 70B | 100 | 100 | 100 | 68 | 25 | 78.5%
GPT-4.1 Nano | 100 | 100 | 95 | 93 | 0 | 77.7%
Ministral 3 8B | 100 | 100 | 100 | 40 | 37 | 75.4%
Mistral Large | 100 | 100 | 69 | 59 | 41 | 73.9%
DeepSeek V3.1 | 100 | 100 | 80 | 58 | 21 | 72.0%
Gemini 3 Pro (Preview) | 100 | 100 | 74 | 59 | 26 | 71.8%
Ministral 3 14B | 100 | 100 | 97 | 47 | 0 | 68.8%
Gemma 3 4B | 100 | 100 | 100 | 35 | 0 | 67.1%
Llama 3.1 70B | 100 | 100 | 100 | 35 | 0 | 66.9%
Gemini 2.5 Pro | 100 | 71 | 68 | 53 | 32 | 64.8%
Claude 3.5 Haiku | 100 | 87 | 74 | 61 | 0 | 64.5%
Ministral 3B | 100 | 100 | 100 | 0 | 0 | 60.0%
Gemma 3 12B | 100 | 100 | 48 | 8 | 0 | 51.2%
Z.AI GLM 4.6 | 100 | 84 | 60 | 0 | 0 | 48.8%
Ministral 3 3B | 100 | 100 | 15 | 0 | 0 | 43.0%
Mistral Small 3.2 24B | 100 | 100 | 0 | 0 | 0 | 40.0%
Gemma 3 27B | 100 | 57 | 25 | 11 | 0 | 38.7%
Z.AI GLM 4.7 | 100 | 15 | 10 | 8 | 5 | 27.7%

Genre

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 95 | 98.9%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 93 | 98.6%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 90 | 98.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 84 | 96.7%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 79 | 95.8%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 78 | 95.6%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 77 | 95.4%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 74 | 94.8%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 72 | 94.4%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 70 | 94.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemma 3 27B | 100 | 100 | 100 | 100 | 55 | 91.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 54 | 90.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 53 | 90.6%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 53 | 90.5%
Z.AI GLM 4.6 | 100 | 100 | 100 | 94 | 54 | 89.7%
Gemini 2.5 Flash | 100 | 100 | 100 | 96 | 48 | 88.8%
Gemma 3 12B | 100 | 100 | 100 | 100 | 41 | 88.3%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 35 | 87.1%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 33 | 86.7%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 26 | 85.1%
Ministral 8B | 100 | 100 | 100 | 85 | 33 | 83.8%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 96 | 21 | 83.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 10 | 81.9%
Mistral Medium 3.1 | 100 | 100 | 100 | 98 | 10 | 81.8%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 79 | 28 | 81.3%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 44 | 18 | 72.4%
Minimax M2.5 | 100 | 100 | 100 | 48 | 13 | 72.2%
Claude 3.5 Haiku | 100 | 100 | 100 | 48 | 0 | 69.7%
Z.AI GLM 4.7 Flash | 100 | 95 | 62 | 53 | 30 | 67.8%
Ministral 3 14B | 100 | 100 | 100 | 27 | 0 | 65.4%
Claude Sonnet 4.6 | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3B | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 70B | 100 | 100 | 77 | 20 | 0 | 59.3%
Z.AI GLM 4.7 | 100 | 56 | 55 | 44 | 0 | 51.1%
Mistral Small Creative | 100 | 76 | 54 | 2 | 0 | 46.4%
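
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 99 | 99.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 99 | 99.8%
Mistral Large | 100 | 100 | 100 | 100 | 97 | 99.3%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 93 | 98.7%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 92 | 98.5%
Gemma 3 12B | 100 | 100 | 100 | 100 | 88 | 97.5%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 86 | 97.1%
GPT-4.1 | 100 | 100 | 100 | 100 | 83 | 96.6%
Hermes 3 70B | 100 | 100 | 100 | 100 | 81 | 96.2%
GPT-5 Mini | 100 | 100 | 100 | 100 | 80 | 96.1%
Claude Opus 4 | 100 | 100 | 100 | 100 | 79 | 95.9%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 68 | 93.6%
Qwen 2.5 72B | 100 | 100 | 100 | 84 | 81 | 92.9%
Claude Sonnet 4.5 | 100 | 100 | 98 | 97 | 67 | 92.3%
Z.AI GLM 4.6 | 100 | 100 | 87 | 86 | 79 | 90.4%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 51 | 90.1%
DeepSeek-V2 Chat | 100 | 100 | 100 | 77 | 71 | 89.5%
Gemini 2.5 Flash | 100 | 100 | 97 | 84 | 58 | 87.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 21 | 84.2%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 16 | 83.1%
Mistral Large 3 | 100 | 100 | 100 | 86 | 29 | 83.1%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 5 | 81.1%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3.2 | 100 | 100 | 80 | 70 | 50 | 80.0%
Minimax M2.5 | 100 | 100 | 88 | 59 | 51 | 79.6%
Ministral 3 8B | 100 | 100 | 100 | 74 | 13 | 77.4%
Claude Sonnet 4 | 100 | 97 | 92 | 83 | 14 | 77.2%
Claude 3.5 Sonnet | 100 | 100 | 100 | 56 | 27 | 76.6%
Z.AI GLM 4.5 | 100 | 100 | 94 | 79 | 9 | 76.2%
Qwen 3.5 397B A17B | 100 | 93 | 72 | 53 | 53 | 74.3%
Writer: Palmyra X5 | 100 | 100 | 100 | 55 | 17 | 74.3%
Gemma 3 4B | 100 | 100 | 100 | 68 | 0 | 73.5%
ByteDance Seed 1.6 Flash | 100 | 100 | 80 | 62 | 7 | 69.9%
Mistral Small Creative | 100 | 100 | 100 | 33 | 0 | 66.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 88 | 79 | 46 | 1 | 62.9%
Claude Sonnet 4.6 | 100 | 100 | 65 | 46 | 3 | 62.8%
Rocinante 12B | 100 | 100 | 100 | 14 | 0 | 62.8%
Gemini 2.5 Pro | 100 | 81 | 75 | 48 | 7 | 62.2%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude Haiku 4.5 | 100 | 100 | 99 | 0 | 0 | 59.9%
ByteDance Seed 1.6 | 100 | 90 | 79 | 21 | 0 | 57.8%
Mistral Large 2 | 100 | 100 | 53 | 30 | 2 | 57.1%
GPT-4.1 Nano | 100 | 100 | 82 | 0 | 0 | 56.6%
Ministral 3 3B | 100 | 100 | 74 | 5 | 0 | 55.8%
DeepSeek V3.1 | 100 | 100 | 39 | 37 | 0 | 55.1%
Gemini 3.1 Pro (Preview) | 69 | 59 | 45 | 25 | 21 | 43.7%
Ministral 3 14B | 100 | 55 | 31 | 27 | 0 | 42.7%
Ministral 8B | 100 | 100 | 13 | 0 | 0 | 42.5%
Z.AI GLM 5 | 100 | 100 | 4 | 0 | 0 | 40.8%
Claude Opus 4.6 | 89 | 62 | 51 | 0 | 0 | 40.2%
Gemini 3 Flash (Preview) | 100 | 72 | 26 | 0 | 0 | 39.6%
Z.AI GLM 4.7 | 100 | 57 | 29 | 0 | 0 | 37.1%
Gemma 3 27B | 88 | 85 | 3 | 0 | 0 | 35.1%
Z.AI GLM 4.7 Flash | 100 | 48 | 19 | 0 | 0 | 33.3%
Gemini 3 Pro (Preview) | 100 | 36 | 29 | 0 | 0 | 32.8%
Claude 3.5 Haiku | 41 | 41 | 15 | 0 | 0 | 19.5%
Ministral 3B | 61 | 27 | 0 | 0 | 0 | 17.6%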

Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 96 | 99.2%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 92 | 98.5%
Gemma 3 27B | 100 | 100 | 100 | 100 | 90 | 97.9%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 84 | 96.9%
Mistral NeMO | 100 | 100 | 100 | 100 | 84 | 96.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 82 | 96.4%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 91 | 88 | 95.8%
Gemini 2.5 Pro | 100 | 100 | 100 | 93 | 86 | 95.7%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 75 | 95.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 74 | 94.8%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 70 | 94.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 58 | 91.5%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 54 | 90.8%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 53 | 90.7%
Ministral 8B | 100 | 100 | 100 | 100 | 48 | 89.7%
Claude Opus 4 | 100 | 100 | 100 | 82 | 67 | 89.7%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 76 | 69 | 88.9%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 43 | 88.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 43 | 88.6%
Mistral Medium 3.1 | 100 | 100 | 100 | 94 | 44 | 87.7%
Mistral Large 3 | 100 | 100 | 100 | 100 | 29 | 85.8%
Gemini 3 Flash (Preview) | 100 | 99 | 98 | 92 | 33 | 84.4%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 8 | 81.5%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 89 | 16 | 81.0%
Mistral Large | 100 | 100 | 100 | 64 | 37 | 80.1%
Ministral 3 3B | 100 | 100 | 100 | 65 | 35 | 80.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 99 | 0 | 79.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 88 | 0 | 77.7%
Z.AI GLM 5 | 100 | 100 | 90 | 82 | 3 | 74.9%
Ministral 3B | 100 | 99 | 79 | 62 | 29 | 73.7%
DeepSeek V3.1 | 100 | 100 | 56 | 49 | 44 | 69.8%
Mistral Large 2 | 100 | 100 | 89 | 33 | 17 | 67.7%
Minimax M2.5 | 100 | 96 | 63 | 55 | 9 | 64.7%
Ministral 3 8B | 100 | 100 | 61 | 8 | 0 | 53.6%
Z.AI GLM 4.6 | 94 | 54 | 38 | 3 | 0 | 37.9%
Claude Sonnet 4.6 | 85 | 46 | 28 | 0 | 0 | 31.6%
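
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 97 | 99.5%
ByteDance Seed 1.6 | 100 | 100 | 100 | 97 | 96 | 98.7%
Gemini 3.1 Pro (Preview) | 100 | 100 | 96 | 94 | 92 | 96.5%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 80 | 96.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 79 | 95.8%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 74 | 94.8%
Claude 3 Haiku | 100 | 100 | 100 | 93 | 79 | 94.4%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 69 | 93.9%
Mistral NeMO | 100 | 100 | 100 | 88 | 80 | 93.6%
Stealth: Aurora Alpha | 100 | 100 | 100 | 82 | 80 | 92.4%
Claude Opus 4.5 | 100 | 100 | 100 | 94 | 68 | 92.4%
GPT-4.1 Nano | 100 | 100 | 100 | 77 | 77 | 90.8%
Ministral 8B | 100 | 100 | 100 | 86 | 62 | 89.6%
Mistral Medium 3.1 | 100 | 100 | 100 | 78 | 62 | 88.0%
Mistral Large 3 | 100 | 100 | 100 | 93 | 36 | 85.9%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 92 | 37 | 85.8%
Z.AI GLM 4.5 | 100 | 100 | 100 | 99 | 19 | 83.5%
Mistral Small Creative | 100 | 100 | 100 | 100 | 14 | 82.8%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemma 3 12B | 100 | 100 | 100 | 52 | 46 | 79.6%
Claude Opus 4.6 | 100 | 100 | 88 | 63 | 43 | 78.7%
DeepSeek V3.2 | 100 | 100 | 100 | 91 | 0 | 78.1%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 99 | 77 | 13 | 77.7%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 85 | 0 | 77.0%
Mistral Small 3.2 24B | 100 | 100 | 89 | 64 | 31 | 76.7%
Ministral 3 3B | 100 | 100 | 100 | 56 | 21 | 75.4%
Z.AI GLM 5 | 100 | 100 | 90 | 84 | 0 | 74.9%
Claude Opus 4 | 100 | 100 | 100 | 45 | 8 | 70.7%
Claude Sonnet 4 | 100 | 94 | 80 | 68 | 10 | 70.3%
Mistral Large | 100 | 100 | 100 | 47 | 0 | 69.3%
Ministral 3 14B | 100 | 100 | 100 | 46 | 0 | 69.2%
WizardLM 2 8x22b | 100 | 100 | 73 | 64 | 0 | 67.5%
Z.AI GLM 4.7 | 100 | 100 | 97 | 37 | 0 | 66.9%
Mistral Large 2 | 100 | 82 | 81 | 52 | 0 | 63.0%
Gemini 3 Pro (Preview) | 100 | 100 | 65 | 45 | 0 | 61.9%
Claude Haiku 4.5 | 100 | 100 | 38 | 26 | 0 | 52.8%
Z.AI GLM 4.7 Flash | 100 | 95 | 65 | 0 | 0 | 52.0%
Ministral 3 8B | 100 | 100 | 29 | 29 | 0 | 51.6%
Claude Sonnet 4.6 | 100 | 74 | 49 | 28 | 0 | 50.2%
Minimax M2.5 | 100 | 100 | 50 | 0 | 0 | 50.0%
DeepSeek V3.1 | 100 | 100 | 27 | 13 | 0 | 47.9%
Ministral 3B | 100 | 79 | 29 | 0 | 0 | 41.6%
Qwen 3.5 397B A17B | 77 | 75 | 14 | 14 | 8 | 37.5%
Gemini 2.5 Pro | 96 | 44 | 27 | 19 | 0 | 37.2%
Z.AI GLM 4.6 | 58 | 34 | 0 | 0 | 0 | 18.5%
Gemini 3 Flash (Preview) | 67 | 0 | 0 | 0 | 0 | 13.5%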
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 98 | 99.6%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 97 | 99.5%
Mistral Large 2 | 100 | 100 | 100 | 100 | 96 | 99.2%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 90 | 98.1%
o4 Mini | 100 | 100 | 100 | 100 | 86 | 97.3%
Hermes 3 70B | 100 | 100 | 100 | 100 | 85 | 97.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 74 | 94.8%
Gemma 3 27B | 100 | 100 | 100 | 100 | 70 | 94.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Ministral 3 14B | 100 | 100 | 98 | 90 | 77 | 93.2%
Gemini 2.5 Pro | 100 | 100 | 100 | 97 | 65 | 92.3%
Z.AI GLM 5 | 100 | 100 | 100 | 95 | 63 | 91.7%
Claude 3.5 Sonnet | 100 | 100 | 100 | 87 | 65 | 90.5%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 52 | 90.4%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 49 | 89.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 47 | 89.3%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 46 | 89.3%
DeepSeek-V2 Chat | 100 | 100 | 100 | 88 | 55 | 88.7%
Mistral Medium 3.1 | 100 | 100 | 100 | 87 | 50 | 87.3%
Claude Opus 4 | 100 | 100 | 100 | 100 | 34 | 86.8%
Claude Opus 4.6 | 100 | 100 | 100 | 99 | 28 | 85.5%
Rocinante 12B | 100 | 100 | 100 | 100 | 25 | 84.9%
Ministral 8B | 100 | 100 | 100 | 71 | 52 | 84.6%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 21 | 84.2%
Claude Opus 4.5 | 100 | 100 | 92 | 63 | 60 | 83.1%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 63 | 52 | 83.1%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 15 | 83.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 10 | 81.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3 (2024-12-26) | 100 | 83 | 78 | 71 | 68 | 79.9%
Mistral Large 3 | 100 | 100 | 100 | 84 | 13 | 79.2%
DeepSeek V3.2 | 100 | 100 | 100 | 91 | 0 | 78.3%
Z.AI GLM 4.6 | 100 | 100 | 100 | 55 | 25 | 76.1%
Claude Haiku 4.5 | 100 | 100 | 100 | 46 | 9 | 70.9%
Gemma 3 4B | 100 | 100 | 73 | 25 | 10 | 61.4%
Gemini 3 Pro (Preview) | 100 | 95 | 66 | 24 | 2 | 57.4%
Minimax M2.5 | 88 | 84 | 84 | 0 | 0 | 51.2%
Claude Sonnet 4.6 | 87 | 80 | 31 | 3 | 0 | 40.3%
Ministral 3B | 100 | 85 | 10 | 0 | 0 | 38.9%
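Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 89 | 97.8%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 88 | 97.7%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 84 | 96.8%
DeepSeek V3.2 | 100 | 100 | 100 | 90 | 87 | 95.5%
Mistral Large 2 | 100 | 100 | 100 | 94 | 80 | 94.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 94 | 78 | 94.4%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 69 | 93.9%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 69 | 93.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 94 | 74 | 93.7%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 67 | 93.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 65 | 93.1%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 61 | 92.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 57 | 91.3%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 54 | 90.8%
Claude 3 Haiku | 100 | 100 | 100 | 81 | 72 | 90.5%
Rocinante 12B | 100 | 100 | 100 | 100 | 48 | 89.7%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 44 | 88.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 40 | 88.1%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 35 | 87.1%
Qwen 2.5 72B | 100 | 100 | 100 | 86 | 45 | 86.2%
Gemma 3 4B | 100 | 100 | 88 | 73 | 66 | 85.3%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 53 | 47 | 79.9%
Claude Opus 4 | 100 | 100 | 100 | 94 | 0 | 78.8%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 94 | 0 | 78.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 92 | 0 | 78.4%
Gemma 3 12B | 100 | 100 | 100 | 88 | 0 | 77.7%
Llama 3.1 70B | 100 | 100 | 85 | 62 | 37 | 76.9%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 61 | 14 | 74.9%
DeepSeek V3.1 | 100 | 100 | 89 | 84 | 0 | 74.5%
Qwen 3.5 397B A17B | 100 | 100 | 76 | 50 | 44 | 74.0%
Mistral NeMO | 100 | 100 | 87 | 72 | 0 | 72.0%
Ministral 8B | 100 | 74 | 71 | 56 | 55 | 71.3%
Mistral Large | 100 | 79 | 72 | 46 | 45 | 68.4%
Mistral Small Creative | 100 | 100 | 94 | 35 | 0 | 65.9%
Minimax M2.5 | 100 | 80 | 75 | 57 | 0 | 62.4%
Ministral 3 14B | 87 | 77 | 69 | 47 | 31 | 62.4%
Gemini 3 Flash (Preview) | 100 | 100 | 67 | 41 | 0 | 61.4%
Mistral Large 3 | 100 | 100 | 62 | 39 | 1 | 60.3%
Claude 3.5 Haiku | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude Sonnet 4.6 | 100 | 100 | 93 | 0 | 0 | 58.7%
Z.AI GLM 5 | 100 | 100 | 58 | 27 | 0 | 56.8%
Claude Opus 4.5 | 100 | 100 | 79 | 2 | 0 | 56.3%
Ministral 3 3B | 100 | 100 | 72 | 0 | 0 | 54.4%
Ministral 3B | 100 | 100 | 66 | 0 | 0 | 53.2%
Gemini 2.5 Flash | 100 | 100 | 49 | 0 | 0 | 49.9%
Z.AI GLM 4.6 | 100 | 61 | 48 | 0 | 0 | 41.8%
Gemini 2.5 Pro | 100 | 98 | 8 | 0 | 0 | 41.3%
Gemini 3 Pro (Preview) | 100 | 50 | 37 | 9 | 0 | 39.2%
Claude Opus 4.6 | 91 | 74 | 30 | 0 | 0 | 39.1%
Ministral 3 8B | 64 | 60 | 46 | 0 | 0 | 34.0%
Gemma 3 27B | 100 | 54 | 5 | 0 | 0 | 31.9%
Z.AI GLM 4.7 Flash | 67 | 57 | 25 | 0 | 0 | 29.8%
Z.AI GLM 4.7 | 83 | 35 | 27 | 0 | 0 | 29.1%
Claude Haiku 4.5 | 100 | 22 | 21 | 0 | 0 | 28.5%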

Novelcrafter Default Prompt

Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 99 | 99.8%
Claude Opus 4 | 100 | 100 | 100 | 100 | 97 | 99.5%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 92 | 98.5%
Minimax M2.5 | 100 | 100 | 100 | 100 | 91 | 98.1%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 91 | 98.1%
Mistral Small Creative | 100 | 100 | 100 | 100 | 87 | 97.3%
Ministral 3 14B | 100 | 100 | 100 | 100 | 80 | 96.0%
Rocinante 12B | 100 | 100 | 100 | 95 | 76 | 94.2%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 65 | 93.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 64 | 92.9%
Ministral 3 3B | 100 | 100 | 100 | 100 | 60 | 91.9%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 55 | 91.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 52 | 90.4%
Claude Opus 4.6 | 100 | 100 | 96 | 87 | 61 | 89.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 44 | 88.8%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 36 | 87.3%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 28 | 85.6%
Gemma 3 12B | 100 | 100 | 100 | 100 | 25 | 84.9%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 11 | 82.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 96 | 0 | 79.2%
Z.AI GLM 4.7 Flash | 100 | 100 | 52 | 0 | 0 | 50.4%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 96 | 99.2%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 86 | 97.1%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 97 | 88 | 97.0%
Mistral Large | 100 | 100 | 100 | 100 | 85 | 97.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 83 | 96.5%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 82 | 96.4%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 80 | 96.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 74 | 94.8%
Claude Opus 4.6 | 100 | 100 | 100 | 99 | 60 | 91.9%
Hermes 3 405B | 100 | 100 | 100 | 100 | 58 | 91.6%
Llama 3.1 70B | 100 | 100 | 100 | 95 | 62 | 91.4%
Ministral 3 8B | 100 | 100 | 100 | 100 | 53 | 90.7%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 50 | 90.0%
Rocinante 12B | 100 | 100 | 100 | 89 | 58 | 89.4%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 81 | 63 | 88.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 73 | 69 | 88.5%
GPT-4.1 Nano | 100 | 100 | 100 | 85 | 50 | 87.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 68 | 52 | 84.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 7 | 81.4%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3B | 100 | 100 | 100 | 99 | 0 | 79.8%
ByteDance Seed 1.6 | 100 | 100 | 100 | 73 | 21 | 78.9%
Claude Opus 4 | 100 | 100 | 100 | 79 | 12 | 78.1%
Claude 3 Haiku | 100 | 87 | 73 | 70 | 60 | 78.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 84 | 0 | 76.7%
Claude Opus 4.5 | 100 | 100 | 100 | 50 | 32 | 76.5%
Qwen 2.5 72B | 100 | 100 | 100 | 80 | 0 | 76.0%
Z.AI GLM 5 | 100 | 100 | 80 | 60 | 32 | 74.3%
Ministral 3 14B | 100 | 100 | 100 | 63 | 0 | 72.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 81 | 80 | 0 | 72.2%
Gemma 3 27B | 100 | 100 | 100 | 50 | 0 | 69.9%
Gemini 3 Flash (Preview) | 100 | 100 | 73 | 68 | 0 | 68.3%
Gemma 3 4B | 100 | 100 | 100 | 34 | 0 | 66.8%
Z.AI GLM 4.6 | 100 | 87 | 63 | 61 | 11 | 64.3%
WizardLM 2 8x22b | 100 | 86 | 86 | 38 | 7 | 63.2%
DeepSeek V3.2 | 100 | 66 | 64 | 48 | 24 | 60.4%
Minimax M2.5 | 100 | 100 | 75 | 0 | 0 | 55.0%
DeepSeek V3.1 | 100 | 77 | 66 | 7 | 0 | 50.0%
Claude 3.5 Haiku | 100 | 100 | 29 | 0 | 0 | 45.8%
Z.AI GLM 4.7 | 100 | 67 | 0 | 0 | 0 | 33.3%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 98 | 99.7%
GPT-5 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 92 | 98.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 87 | 97.4%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 85 | 97.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 82 | 96.4%
Gemma 3 4B | 100 | 100 | 100 | 100 | 68 | 93.5%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 61 | 92.2%
Claude Haiku 4.5 | 100 | 100 | 100 | 82 | 78 | 92.1%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 60 | 91.9%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 84 | 60 | 88.8%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 44 | 88.8%
Ministral 3 8B | 100 | 100 | 100 | 98 | 40 | 87.6%
Mistral Small Creative | 100 | 100 | 100 | 98 | 36 | 86.8%
Gemma 3 27B | 100 | 100 | 100 | 66 | 61 | 85.4%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 25 | 84.9%
Minimax M2.5 | 100 | 100 | 100 | 99 | 20 | 83.9%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 7 | 81.4%
Mistral Large 3 | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Large | 100 | 100 | 100 | 77 | 17 | 78.9%
Mistral Large 2 | 100 | 100 | 100 | 87 | 0 | 77.4%
Ministral 3 14B | 100 | 100 | 100 | 73 | 11 | 76.6%
Gemini 2.5 Pro | 100 | 100 | 100 | 64 | 19 | 76.6%
Mistral Medium 3.1 | 100 | 100 | 100 | 38 | 35 | 74.8%
Claude 3.5 Haiku | 100 | 100 | 100 | 37 | 15 | 70.4%
Ministral 3B | 100 | 100 | 100 | 38 | 0 | 67.7%
Ministral 8B | 100 | 100 | 100 | 0 | 0 | 60.0%
Z.AI GLM 4.6 | 100 | 100 | 81 | 0 | 0 | 56.2%
Claude Sonnet 4.6 | 100 | 54 | 41 | 0 | 0 | 39.1%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 93 | 98.7%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 82 | 96.4%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 82 | 96.4%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 81 | 96.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 79 | 95.8%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 71 | 94.2%
Ministral 3 14B | 100 | 100 | 100 | 100 | 69 | 93.9%
Gemma 3 4B | 100 | 100 | 100 | 100 | 67 | 93.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 98 | 65 | 92.7%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 60 | 92.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 58 | 91.6%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 58 | 91.6%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 91 | 66 | 91.4%
Writer: Palmyra X5 | 100 | 100 | 100 | 92 | 61 | 90.5%
Rocinante 12B | 100 | 100 | 100 | 100 | 30 | 86.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 25 | 85.1%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 25 | 84.9%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 23 | 84.7%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 20 | 84.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 12 | 82.4%
Gemma 3 12B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 0 | 80.0%
Hermes 3 405B | 100 | 100 | 100 | 64 | 25 | 77.7%
Z.AI GLM 4.7 | 100 | 100 | 84 | 76 | 26 | 77.2%
Ministral 3B | 100 | 100 | 87 | 74 | 17 | 75.7%
Mistral Large 3 | 100 | 100 | 84 | 78 | 0 | 72.3%
Mistral NeMO | 100 | 93 | 89 | 57 | 0 | 67.9%
DeepSeek V3.1 | 100 | 84 | 81 | 55 | 0 | 64.0%
DeepSeek V3.2 | 100 | 100 | 72 | 14 | 0 | 57.1%
Z.AI GLM 4.6 | 100 | 100 | 81 | 0 | 0 | 56.2%
Z.AI GLM 4.7 Flash | 32 | 0 | 0 | 0 | 0 | 6.3%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 93 | 98.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 89 | 97.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 86 | 97.1%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 86 | 97.1%
Claude Opus 4 | 100 | 100 | 100 | 100 | 84 | 96.9%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 84 | 96.9%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 83 | 96.6%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 83 | 96.6%
Ministral 3 3B | 100 | 100 | 100 | 100 | 76 | 95.3%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 75 | 95.1%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 73 | 94.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 88 | 82 | 94.1%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 64 | 92.9%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 63 | 92.5%
Ministral 8B | 100 | 100 | 100 | 100 | 57 | 91.4%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 37 | 87.5%
Mistral Small Creative | 100 | 100 | 100 | 89 | 45 | 86.8%
Minimax M2.5 | 100 | 100 | 93 | 72 | 68 | 86.5%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 20 | 84.0%
Ministral 3B | 100 | 100 | 100 | 88 | 25 | 82.6%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 6 | 81.2%
Ministral 3 14B | 100 | 100 | 100 | 100 | 6 | 81.2%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 0 | 80.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3.1 | 100 | 100 | 98 | 57 | 31 | 77.1%
Llama 3.1 8B | 100 | 100 | 100 | 58 | 20 | 75.6%
Claude Sonnet 4.6 | 100 | 94 | 48 | 0 | 0 | 48.5%
Model | #1 | #2 | #3 | #4 | #5 | Avg ▼
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 96 | 99.3%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 88 | 97.5%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 82 | 96.3%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 81 | 96.2%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 81 | 96.1%
Claude Opus 4 | 100 | 100 | 100 | 100 | 80 | 96.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 75 | 95.1%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 75 | 95.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 74 | 94.8%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 68 | 93.7%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 8B | 100 | 100 | 100 | 85 | 81 | 93.2%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 64 | 92.8%
Hermes 3 405B | 100 | 100 | 100 | 100 | 56 | 91.2%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 83 | 68 | 90.2%
Mistral Large 2 | 100 | 100 | 100 | 100 | 45 | 89.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 43 | 88.6%
Z.AI GLM 4.7 | 100 | 98 | 84 | 81 | 80 | 88.6%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 42 | 88.4%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 89 | 46 | 86.8%
Claude Opus 4.5 | 100 | 100 | 86 | 74 | 67 | 85.5%
Mistral Large 3 | 100 | 100 | 100 | 94 | 32 | 85.2%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 25 | 85.1%
Mistral NeMO | 100 | 88 | 87 | 85 | 58 | 83.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 17 | 83.4%
Gemini 2.5 Pro | 100 | 86 | 84 | 82 | 61 | 82.7%
Gemma 3 4B | 100 | 100 | 100 | 59 | 53 | 82.5%
DeepSeek V3.2 | 100 | 100 | 100 | 79 | 32 | 82.2%
Ministral 8B | 100 | 100 | 100 | 89 | 22 | 82.1%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 4 | 80.8%
Mistral Large | 100 | 100 | 100 | 87 | 17 | 80.7%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 0 | 80.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 0 | 80.0%
Ministral 3 8B | 100 | 100 | 100 | 93 | 0 | 78.7%
Hermes 3 70B | 100 | 100 | 100 | 61 | 27 | 77.7%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 88 | 0 | 77.5%
Claude Sonnet 4.6 | 100 | 100 | 77 | 66 | 30 | 74.5%
Z.AI GLM 4.5 | 100 | 100 | 100 | 50 | 17 | 73.4%
Gemini 3 Flash (Preview) | 100 | 100 | 82 | 62 | 12 | 71.4%
WizardLM 2 8x22b | 100 | 100 | 82 | 71 | 0 | 70.6%
Mistral Small Creative | 100 | 100 | 74 | 40 | 20 | 66.9%
Mistral Small 3.2 24B | 100 | 100 | 100 | 30 | 0 | 66.0%
Llama 3.1 70B | 100 | 100 | 97 | 29 | 0 | 65.3%
Ministral 3 14B | 100 | 100 | 100 | 0 | 0 | 60.0%