Name drop frequency

Test: Bad Writing Habits

Avg. Score: 66.0%
Scenarios: 18

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | GPT-4.1 Nano | 94.8% | $0.0007 | 13.3s | 81%
2 | Gemma 3 4B | 88.9% | $0.0002 | 20.0s | 65%
3 | Gemini 2.5 Pro | 92.7% | $0.036 | 36.2s | 69%
4 | Grok 4 Fast | 86.6% | $0.0017 | 24.1s | 60%
5 | Z.AI GLM 4.6 | 89.7% | $0.0065 | 51.5s | 59%
6 | DeepSeek V3.1 | 90.6% | $0.0020 | 1.8m | 69%
7 | GPT-5 Nano | 85.9% | $0.0042 | 1.4m | 63%
8 | Stealth: Aurora Alpha | 77.7% | $0.0000 | 9.8s | 48%
9 | GPT-4o Mini (temp=1) | 80.4% | $0.0012 | 34.8s | 48%
10 | Gemini 2.5 Flash | 80.2% | $0.0052 | 10.6s | 42%
11 | Gemini 2.5 Flash Lite | 78.5% | $0.0009 | 9.5s | 41%
12 | GPT-5 Mini | 82.2% | $0.0100 | 57.4s | 47%
13 | DeepSeek V3.2 | 83.5% | $0.0014 | 1.9m | 54%
14 | Arcee AI: Trinity Mini | 76.0% | $0.0003 | 9.2s | 36%
15 | ByteDance Seed 1.6 | 84.3% | $0.013 | 2.5m | 53%
16 | Gemma 3 27B | 76.7% | $0.0006 | 52.6s | 35%
17 | Claude Sonnet 4.6 | 81.4% | $0.031 | 39.3s | 37%
18 | GPT-4o, Aug. 6th (temp=1) | 72.7% | $0.018 | 24.4s | 36%
19 | GPT-4.1 Mini | 71.9% | $0.0027 | 19.0s | 30%
20 | Claude 3.5 Haiku | 69.1% | $0.0035 | 10.8s | 32%
21 | Gemini 3 Flash (Preview) | 68.4% | $0.0078 | 19.6s | 36%
22 | Z.AI GLM 5 | 77.1% | $0.0084 | 1.2m | 37%
23 | Grok 4.1 Fast | 71.8% | $0.0018 | 37.8s | 32%
24 | Minimax M2.5 | 75.3% | $0.0034 | 1.3m | 35%
25 | GPT-4.1 | 71.7% | $0.018 | 44.7s | 36%
26 | Mistral Medium 3.1 | 67.9% | $0.0048 | 36.5s | 31%
27 | Z.AI GLM 4.7 Flash | 71.5% | $0.0017 | 1.2m | 33%
28 | o4 Mini High | 73.6% | $0.025 | 47.2s | 31%
29 | Mistral Small Creative | 62.1% | $0.0007 | 9.1s | 26%
30 | Claude Haiku 4.5 | 67.0% | $0.011 | 21.6s | 27%
31 | Grok 4 | 80.3% | $0.048 | 1.7m | 43%
32 | Gemma 3 12B | 68.7% | $0.0004 | 41.3s | 25%
33 | Qwen 3.5 Plus (2026-02-15) | 64.3% | $0.0060 | 31.5s | 28%
34 | Ministral 3 14B | 59.1% | $0.0007 | 11.7s | 27%
35 | Claude Sonnet 4 | 69.2% | $0.032 | 43.7s | 34%
36 | o4 Mini | 67.0% | $0.015 | 25.7s | 26%
37 | Mistral Large 3 | 61.2% | $0.0033 | 30.3s | 28%
38 | Ministral 3 8B | 59.7% | $0.0008 | 19.6s | 26%
39 | DeepSeek V3 (2024-12-26) | 64.7% | $0.0021 | 54.6s | 26%
40 | Writer: Palmyra X5 | 64.7% | $0.011 | 22.0s | 21%
41 | Ministral 8B | 57.3% | $0.0004 | 10.4s | 22%
42 | Z.AI GLM 4.7 | 69.1% | $0.010 | 1.4m | 29%
43 | GPT-4o Mini (temp=0) | 61.3% | $0.0012 | 34.8s | 22%
44 | Llama 3.1 8B | 67.8% | $0.0003 | 1.3m | 24%
45 | WizardLM 2 8x22b | 69.3% | $0.0026 | 1.8m | 30%
46 | ByteDance Seed 1.6 Flash | 57.8% | $0.0013 | 27.3s | 24%
47 | Mistral NeMO | 55.0% | $0.0005 | 10.1s | 22%
48 | Rocinante 12B | 59.4% | $0.0014 | 38.4s | 24%
49 | Cohere Command R+ (Aug. 2024) | 64.2% | $0.020 | 52.5s | 28%
50 | DeepSeek-V2 Chat | 61.4% | $0.0021 | 53.3s | 25%
51 | Ministral 3B | 53.1% | $0.0001 | 8.1s | 22%
52 | Claude Opus 4.6 | 77.6% | $0.078 | 1.2m | 37%
53 | Mistral Large 2 | 57.5% | $0.013 | 29.4s | 25%
54 | Mistral Large | 57.1% | $0.014 | 30.9s | 25%
55 | Hermes 3 70B | 59.7% | $0.0010 | 1.2m | 25%
56 | DeepSeek V3 (2025-03-24) | 55.6% | $0.0014 | 39.4s | 22%
57 | Llama 3.1 Nemotron 70B | 53.2% | $0.0038 | 31.7s | 24%
58 | GPT-5 | 80.8% | $0.065 | 2.8m | 45%
59 | Claude Sonnet 4.5 | 62.8% | $0.035 | 38.1s | 23%
60 | GPT-4o, Aug. 6th (temp=0) | 55.7% | $0.023 | 22.7s | 22%
61 | Arcee AI: Trinity Large (Preview) | 55.2% | $0.0000 | 43.6s | 19%
62 | Claude Opus 4.5 | 67.7% | $0.070 | 53.4s | 32%
63 | Ministral 3 3B | 47.8% | $0.0005 | 11.1s | 19%
64 | MoonshotAI: Kimi K2.5 | 73.1% | $0.019 | 3.2m | 36%
65 | Llama 3.1 70B | 47.7% | $0.0015 | 29.4s | 19%
66 | GPT-4o, May 13th (temp=1) | 54.0% | $0.033 | 14.4s | 18%
67 | GPT-5.1 | 71.7% | $0.054 | 1.8m | 23%
68 | Gemini 3 Pro (Preview) | 61.3% | $0.055 | 54.4s | 22%
69 | Hermes 3 405B | 50.4% | $0.0032 | 53.2s | 15%
70 | Qwen 2.5 72B | 40.8% | $0.0010 | 36.7s | 21%
71 | Claude 3.5 Sonnet | 55.0% | $0.048 | 35.5s | 21%
72 | Z.AI GLM 4.5 | 42.0% | $0.0051 | 42.1s | 14%
73 | Claude 3 Haiku | 30.6% | $0.0025 | 14.9s | 16%
74 | Qwen 3.5 397B A17B | 57.9% | $0.014 | 3.0m | 22%
75 | GPT-4o, May 13th (temp=0) | 42.3% | $0.035 | 14.1s | 9%
76 | Gemini 3.1 Pro (Preview) | 64.3% | $0.107 | 1.8m | 25%
77 | Claude 3.7 Sonnet | 37.5% | $0.042 | 46.7s | 14%
78 | GPT-5.2 | 28.0% | $0.056 | 1.5m | 6%
79 | Mistral Small 3.2 24B | 52.8% | $0.0069 | 5.7m | 15%
80 | Claude Opus 4 | 56.3% | $0.209 | 1.4m | 22%
Average score: 66.02%

Individual Scenarios

Detailed Writing Rules

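Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
GPT-4.1 Nano | 100 | 100 | 100 | 98 | 94 | 98.3%
Gemma 3 4B | 100 | 100 | 100 | 92 | 83 | 95.2%
Z.AI GLM 4.7 | 100 | 100 | 97 | 83 | 69 | 89.8%
Grok 4 Fast | 99 | 91 | 89 | 86 | 83 | 89.7%
Claude Sonnet 4.6 | 100 | 86 | 83 | 83 | 83 | 87.1%
Gemini 3.1 Pro (Preview) | 100 | 100 | 83 | 83 | 67 | 86.7%
Llama 3.1 8B | 100 | 100 | 83 | 80 | 66 | 85.8%
DeepSeek V3.2 | 100 | 94 | 83 | 83 | 67 | 85.5%
GPT-4.1 Mini | 100 | 100 | 80 | 78 | 67 | 85.0%
Grok 4 | 92 | 91 | 85 | 83 | 68 | 83.8%
GPT-5 | 100 | 100 | 83 | 83 | 50 | 83.3%
GPT-5 Nano | 100 | 100 | 83 | 67 | 67 | 83.3%
Stealth: Aurora Alpha | 100 | 97 | 83 | 67 | 67 | 82.7%
Gemma 3 12B | 100 | 98 | 98 | 67 | 50 | 82.5%
Claude 3.5 Haiku | 100 | 94 | 89 | 64 | 49 | 79.1%
GPT-5.1 | 100 | 83 | 83 | 78 | 50 | 78.9%
Gemini 2.5 Pro | 100 | 100 | 91 | 67 | 33 | 78.3%
Gemini 3 Pro (Preview) | 100 | 100 | 83 | 63 | 43 | 77.8%
ByteDance Seed 1.6 | 83 | 82 | 82 | 71 | 70 | 77.6%
DeepSeek V3.1 | 100 | 84 | 78 | 67 | 50 | 75.8%
Ministral 3 3B | 93 | 87 | 79 | 69 | 50 | 75.7%
Z.AI GLM 4.6 | 96 | 94 | 83 | 67 | 33 | 74.8%
GPT-5 Mini | 100 | 98 | 67 | 50 | 50 | 73.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 67 | 56 | 36 | 71.7%
Grok 4.1 Fast | 82 | 75 | 67 | 67 | 67 | 71.6%
Claude Opus 4.6 | 87 | 82 | 76 | 62 | 50 | 71.4%
Z.AI GLM 5 | 100 | 100 | 62 | 61 | 33 | 71.2%
Gemini 3 Flash (Preview) | 83 | 83 | 71 | 67 | 50 | 70.8%
Ministral 3B | 94 | 93 | 67 | 48 | 46 | 69.5%
Gemini 2.5 Flash Lite | 84 | 82 | 67 | 64 | 50 | 69.4%
GPT-4o Mini (temp=1) | 84 | 83 | 67 | 62 | 50 | 69.0%
Gemini 2.5 Flash | 83 | 81 | 76 | 50 | 50 | 68.2%
Gemma 3 27B | 100 | 79 | 67 | 61 | 33 | 68.0%
Claude Haiku 4.5 | 82 | 67 | 65 | 64 | 59 | 67.2%
Llama 3.1 Nemotron 70B | 100 | 67 | 67 | 61 | 39 | 66.7%
Hermes 3 70B | 88 | 78 | 50 | 50 | 49 | 63.1%
Ministral 3 8B | 90 | 68 | 65 | 50 | 39 | 62.5%
Z.AI GLM 4.7 Flash | 94 | 83 | 58 | 41 | 33 | 61.9%
Mistral Large | 83 | 75 | 50 | 50 | 47 | 61.0%
Mistral Small 3.2 24B | 89 | 81 | 69 | 67 | 0 | 61.0%
Ministral 8B | 100 | 67 | 56 | 48 | 33 | 60.8%
Qwen 3.5 Plus (2026-02-15) | 100 | 78 | 66 | 59 | 0 | 60.6%
Minimax M2.5 | 83 | 77 | 50 | 48 | 42 | 59.9%
ByteDance Seed 1.6 Flash | 67 | 61 | 60 | 56 | 55 | 59.7%
MoonshotAI: Kimi K2.5 | 72 | 68 | 64 | 54 | 33 | 58.2%
o4 Mini | 63 | 60 | 57 | 55 | 52 | 57.5%
WizardLM 2 8x22b | 100 | 67 | 58 | 50 | 0 | 55.0%
Claude Sonnet 4 | 67 | 58 | 56 | 50 | 42 | 54.5%
Mistral Large 2 | 67 | 60 | 56 | 55 | 34 | 54.5%
Hermes 3 405B | 93 | 90 | 68 | 19 | 0 | 54.1%
Llama 3.1 70B | 82 | 71 | 62 | 50 | 0 | 53.2%
Mistral Large 3 | 67 | 55 | 50 | 49 | 45 | 53.2%
Arcee AI: Trinity Mini | 72 | 67 | 63 | 37 | 27 | 53.1%
DeepSeek-V2 Chat | 68 | 53 | 51 | 50 | 39 | 52.3%
Mistral Medium 3.1 | 66 | 61 | 60 | 42 | 27 | 51.3%
Mistral NeMO | 82 | 67 | 59 | 44 | 0 | 50.3%
Claude Opus 4 | 64 | 54 | 50 | 50 | 33 | 50.2%
Rocinante 12B | 94 | 67 | 50 | 33 | 0 | 48.8%
Cohere Command R+ (Aug. 2024) | 100 | 50 | 48 | 46 | 0 | 48.8%
Claude Sonnet 4.5 | 70 | 68 | 44 | 44 | 14 | 47.9%
GPT-4o, Aug. 6th (temp=1) | 67 | 59 | 49 | 45 | 18 | 47.4%
Mistral Small Creative | 67 | 66 | 38 | 33 | 31 | 47.0%
Qwen 2.5 72B | 63 | 55 | 50 | 43 | 19 | 46.0%
DeepSeek V3 (2025-03-24) | 63 | 56 | 42 | 38 | 29 | 45.6%
Writer: Palmyra X5 | 67 | 59 | 53 | 50 | 0 | 45.6%
Ministral 3 14B | 67 | 64 | 49 | 27 | 17 | 44.9%
GPT-4.1 | 54 | 50 | 50 | 35 | 33 | 44.4%
o4 Mini High | 83 | 51 | 48 | 21 | 17 | 44.1%
Claude Opus 4.5 | 67 | 50 | 44 | 33 | 17 | 42.4%
DeepSeek V3 (2024-12-26) | 65 | 54 | 42 | 22 | 16 | 39.6%
GPT-4o Mini (temp=0) | 67 | 56 | 33 | 33 | 0 | 37.9%
Claude 3.5 Sonnet | 45 | 41 | 40 | 32 | 19 | 35.4%
Qwen 3.5 397B A17B | 60 | 47 | 45 | 17 | 0 | 33.7%
GPT-4o, May 13th (temp=0) | 72 | 42 | 24 | 17 | 13 | 33.5%
GPT-5.2 | 83 | 50 | 33 | 0 | 0 | 33.3%
GPT-4o, May 13th (temp=1) | 60 | 50 | 35 | 17 | 0 | 32.3%
GPT-4o, Aug. 6th (temp=0) | 59 | 35 | 34 | 33 | 0 | 32.2%
Z.AI GLM 4.5 | 50 | 33 | 29 | 28 | 0 | 28.1%
Claude 3 Haiku | 41 | 33 | 29 | 0 | 0 | 20.6%
Claude 3.7 Sonnet | 33 | 20 | 17 | 0 | 0 | 14.0%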
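
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 99 | 99.9%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 99 | 99.8%
Z.AI GLM 5 | 100 | 100 | 100 | 97 | 96 | 98.7%
Grok 4 Fast | 100 | 100 | 100 | 96 | 89 | 97.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemma 3 4B | 100 | 100 | 100 | 100 | 83 | 96.7%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 83 | 96.7%
Grok 4.1 Fast | 100 | 100 | 100 | 97 | 84 | 96.1%
GPT-4.1 Mini | 100 | 100 | 100 | 92 | 83 | 95.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 90 | 83 | 94.7%
GPT-5 Mini | 100 | 100 | 100 | 100 | 67 | 93.3%
o4 Mini High | 100 | 100 | 100 | 83 | 83 | 93.3%
Claude Opus 4.5 | 100 | 100 | 100 | 83 | 83 | 93.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 83 | 83 | 93.3%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude 3.5 Haiku | 100 | 100 | 100 | 93 | 72 | 93.1%
Mistral Large 3 | 100 | 100 | 94 | 84 | 70 | 89.7%
Mistral Small Creative | 100 | 100 | 90 | 83 | 71 | 88.8%
Qwen 3.5 397B A17B | 100 | 96 | 89 | 86 | 69 | 88.0%
Gemini 3 Flash (Preview) | 100 | 90 | 83 | 83 | 78 | 86.9%
Minimax M2.5 | 100 | 100 | 100 | 100 | 33 | 86.7%
DeepSeek V3 (2024-12-26) | 100 | 97 | 94 | 74 | 67 | 86.4%
Ministral 3 3B | 100 | 97 | 93 | 85 | 54 | 85.7%
Claude Haiku 4.5 | 100 | 100 | 98 | 95 | 33 | 85.3%
Claude Sonnet 4.5 | 100 | 100 | 83 | 75 | 67 | 85.1%
Qwen 3.5 Plus (2026-02-15) | 97 | 95 | 83 | 83 | 67 | 85.0%
GPT-4.1 | 100 | 100 | 89 | 83 | 50 | 84.5%
DeepSeek V3 (2025-03-24) | 100 | 100 | 98 | 67 | 56 | 84.2%
Stealth: Aurora Alpha | 100 | 99 | 83 | 83 | 54 | 84.0%
Mistral Large | 100 | 100 | 94 | 76 | 50 | 83.9%
Hermes 3 70B | 100 | 100 | 100 | 87 | 28 | 83.1%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 63 | 48 | 82.3%
o4 Mini | 100 | 100 | 71 | 70 | 61 | 80.5%
Claude Sonnet 4 | 100 | 98 | 93 | 62 | 49 | 80.2%
MoonshotAI: Kimi K2.5 | 100 | 93 | 83 | 79 | 46 | 80.2%
GPT-4o Mini (temp=0) | 100 | 100 | 83 | 67 | 50 | 80.0%
Mistral Medium 3.1 | 96 | 83 | 79 | 74 | 67 | 79.8%
DeepSeek-V2 Chat | 100 | 83 | 74 | 70 | 67 | 78.9%
Mistral Large 2 | 95 | 91 | 91 | 67 | 50 | 78.7%
Ministral 3 14B | 99 | 97 | 78 | 63 | 56 | 78.7%
Ministral 8B | 100 | 92 | 92 | 91 | 17 | 78.4%
GPT-5.2 | 100 | 86 | 83 | 83 | 33 | 77.2%
Arcee AI: Trinity Mini | 100 | 100 | 85 | 83 | 17 | 77.0%
Llama 3.1 8B | 100 | 83 | 83 | 60 | 58 | 76.9%
Ministral 3 8B | 82 | 80 | 75 | 70 | 67 | 74.7%
ByteDance Seed 1.6 Flash | 83 | 83 | 76 | 67 | 62 | 74.3%
Mistral NeMO | 100 | 91 | 72 | 67 | 33 | 72.7%
Ministral 3B | 100 | 98 | 75 | 73 | 17 | 72.5%
Cohere Command R+ (Aug. 2024) | 100 | 83 | 71 | 54 | 50 | 71.8%
Mistral Small 3.2 24B | 100 | 95 | 83 | 42 | 37 | 71.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 84 | 83 | 56 | 17 | 68.1%
Z.AI GLM 4.5 | 82 | 78 | 71 | 67 | 33 | 66.1%
GPT-4o, May 13th (temp=1) | 95 | 94 | 70 | 48 | 17 | 64.6%
Claude Opus 4 | 76 | 67 | 67 | 63 | 50 | 64.6%
Claude 3 Haiku | 72 | 68 | 67 | 64 | 51 | 64.3%
Llama 3.1 70B | 94 | 88 | 81 | 58 | 0 | 64.1%
Qwen 2.5 72B | 75 | 71 | 67 | 56 | 50 | 63.7%
Claude 3.5 Sonnet | 100 | 74 | 63 | 50 | 31 | 63.6%
Llama 3.1 Nemotron 70B | 68 | 68 | 67 | 59 | 46 | 61.6%
Claude 3.7 Sonnet | 83 | 81 | 81 | 31 | 17 | 58.8%
GPT-4o, May 13th (temp=0) | 100 | 90 | 36 | 33 | 17 | 55.2%
Hermes 3 405B | 80 | 67 | 48 | 39 | 22 | 51.3%
Rocinante 12B | 81 | 41 | 17 | 0 | 0 | 27.8%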
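
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 83 | 96.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 93 | 82 | 95.1%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 83 | 83 | 93.3%
Gemma 3 4B | 100 | 100 | 100 | 83 | 83 | 93.3%
Llama 3.1 8B | 100 | 100 | 100 | 83 | 67 | 90.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 32 | 86.5%
Mistral Large 2 | 100 | 100 | 96 | 83 | 50 | 85.9%
GPT-5 Mini | 100 | 100 | 100 | 67 | 50 | 83.3%
Claude Opus 4 | 100 | 83 | 83 | 75 | 71 | 82.5%
Gemma 3 27B | 100 | 100 | 83 | 67 | 62 | 82.4%
Mistral Large | 100 | 100 | 78 | 70 | 58 | 81.2%
Z.AI GLM 4.7 | 100 | 100 | 100 | 67 | 30 | 79.3%
DeepSeek V3 (2024-12-26) | 100 | 96 | 89 | 64 | 42 | 78.3%
GPT-4.1 Mini | 100 | 94 | 69 | 66 | 59 | 77.6%
GPT-4o Mini (temp=1) | 100 | 100 | 80 | 67 | 33 | 76.0%
Z.AI GLM 5 | 100 | 100 | 100 | 80 | 0 | 76.0%
Gemini 2.5 Flash | 100 | 100 | 94 | 83 | 0 | 75.4%
Claude 3.5 Sonnet | 100 | 100 | 96 | 41 | 17 | 70.6%
Hermes 3 70B | 100 | 83 | 80 | 48 | 39 | 70.2%
Gemini 2.5 Flash Lite | 100 | 100 | 83 | 46 | 17 | 69.1%
Stealth: Aurora Alpha | 91 | 67 | 67 | 61 | 50 | 67.1%
Mistral Medium 3.1 | 90 | 82 | 67 | 63 | 33 | 67.0%
Grok 4 | 100 | 78 | 75 | 71 | 0 | 64.9%
Rocinante 12B | 82 | 79 | 67 | 50 | 41 | 63.9%
Z.AI GLM 4.6 | 100 | 100 | 93 | 17 | 0 | 61.9%
Claude Opus 4.5 | 100 | 83 | 76 | 35 | 14 | 61.7%
GPT-4o, Aug. 6th (temp=1) | 98 | 91 | 75 | 41 | 0 | 61.1%
Grok 4 Fast | 83 | 83 | 73 | 46 | 16 | 60.2%
ByteDance Seed 1.6 | 100 | 83 | 65 | 33 | 17 | 59.7%
GPT-5 | 83 | 83 | 67 | 50 | 0 | 56.7%
DeepSeek V3.2 | 100 | 100 | 50 | 17 | 17 | 56.7%
Minimax M2.5 | 100 | 83 | 67 | 17 | 16 | 56.5%
Llama 3.1 Nemotron 70B | 92 | 67 | 58 | 33 | 29 | 55.6%
DeepSeek-V2 Chat | 100 | 63 | 57 | 57 | 0 | 55.4%
Z.AI GLM 4.7 Flash | 83 | 67 | 60 | 44 | 18 | 54.5%
Ministral 3 8B | 83 | 66 | 50 | 33 | 26 | 51.7%
Gemini 3 Flash (Preview) | 67 | 50 | 50 | 50 | 33 | 50.0%
Cohere Command R+ (Aug. 2024) | 91 | 77 | 35 | 33 | 12 | 49.4%
Claude Sonnet 4 | 100 | 100 | 33 | 0 | 0 | 46.7%
Arcee AI: Trinity Large (Preview) | 100 | 83 | 50 | 0 | 0 | 46.7%
GPT-4o, May 13th (temp=1) | 67 | 67 | 67 | 32 | 0 | 46.3%
Grok 4.1 Fast | 98 | 74 | 50 | 0 | 0 | 44.3%
Hermes 3 405B | 100 | 55 | 33 | 32 | 0 | 44.0%
Gemini 3 Pro (Preview) | 83 | 67 | 33 | 33 | 0 | 43.3%
MoonshotAI: Kimi K2.5 | 67 | 63 | 33 | 33 | 11 | 41.5%
Ministral 3 14B | 72 | 69 | 50 | 0 | 0 | 38.3%
GPT-4.1 | 100 | 83 | 0 | 0 | 0 | 36.7%
Qwen 3.5 Plus (2026-02-15) | 67 | 50 | 33 | 33 | 0 | 36.7%
WizardLM 2 8x22b | 100 | 83 | 0 | 0 | 0 | 36.7%
Gemini 3.1 Pro (Preview) | 67 | 50 | 25 | 17 | 17 | 35.0%
GPT-5.1 | 100 | 67 | 0 | 0 | 0 | 33.3%
o4 Mini | 100 | 65 | 0 | 0 | 0 | 33.0%
Mistral Large 3 | 67 | 67 | 17 | 14 | 0 | 32.8%
o4 Mini High | 100 | 58 | 0 | 0 | 0 | 31.5%
Writer: Palmyra X5 | 90 | 48 | 17 | 0 | 0 | 30.9%
DeepSeek V3 (2025-03-24) | 57 | 50 | 43 | 3 | 0 | 30.5%
Claude 3 Haiku | 45 | 39 | 33 | 28 | 0 | 29.0%
Claude 3.7 Sonnet | 54 | 50 | 33 | 0 | 0 | 27.4%
Ministral 8B | 98 | 17 | 17 | 0 | 0 | 26.3%
Ministral 3B | 67 | 60 | 0 | 0 | 0 | 25.2%
Mistral NeMO | 35 | 33 | 33 | 17 | 0 | 23.6%
Mistral Small 3.2 24B | 57 | 33 | 22 | 6 | 0 | 23.5%
Claude Haiku 4.5 | 100 | 17 | 0 | 0 | 0 | 23.3%
Claude Opus 4.6 | 67 | 33 | 11 | 0 | 0 | 22.3%
Llama 3.1 70B | 33 | 25 | 24 | 17 | 0 | 19.9%
Ministral 3 3B | 49 | 45 | 0 | 0 | 0 | 18.7%
Mistral Small Creative | 50 | 17 | 17 | 7 | 0 | 18.0%
ByteDance Seed 1.6 Flash | 83 | 6 | 0 | 0 | 0 | 17.9%
Claude 3.5 Haiku | 60 | 22 | 0 | 0 | 0 | 16.4%
Qwen 3.5 397B A17B | 33 | 31 | 17 | 0 | 0 | 16.3%
Claude Sonnet 4.5 | 67 | 13 | 0 | 0 | 0 | 15.9%
Claude Sonnet 4.6 | 33 | 17 | 0 | 0 | 0 | 10.0%
GPT-4o, Aug. 6th (temp=0) | 33 | 17 | 0 | 0 | 0 | 10.0%
Qwen 2.5 72B | 33 | 17 | 0 | 0 | 0 | 10.0%
GPT-4o, May 13th (temp=0) | 41 | 0 | 0 | 0 | 0 | 8.3%
Z.AI GLM 4.5 | 23 | 0 | 0 | 0 | 0 | 4.5%
GPT-5.2 | 0 | 0 | 0 | 0 | 0 | 0.0%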
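
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Claude Sonnet 4.6 | 100 | 100 | 100 | 86 | 80 | 93.1%
Z.AI GLM 4.6 | 100 | 100 | 93 | 77 | 77 | 89.3%
Gemma 3 27B | 100 | 100 | 91 | 91 | 59 | 88.1%
Claude Opus 4.6 | 100 | 100 | 93 | 83 | 64 | 88.0%
GPT-4.1 Nano | 100 | 92 | 85 | 83 | 80 | 87.8%
Gemma 3 4B | 100 | 100 | 100 | 83 | 50 | 86.7%
Gemini 2.5 Pro | 86 | 85 | 83 | 83 | 79 | 83.6%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 83 | 67 | 65 | 82.9%
GPT-5 Nano | 100 | 100 | 83 | 67 | 50 | 80.0%
Mistral Large 3 | 90 | 83 | 79 | 67 | 67 | 76.9%
ByteDance Seed 1.6 | 98 | 83 | 67 | 67 | 63 | 75.7%
DeepSeek V3.1 | 91 | 86 | 83 | 55 | 48 | 72.6%
GPT-5 | 96 | 83 | 83 | 67 | 33 | 72.6%
GPT-4o Mini (temp=1) | 89 | 76 | 73 | 63 | 50 | 70.4%
Gemini 2.5 Flash Lite | 83 | 79 | 79 | 73 | 34 | 69.7%
GPT-5.1 | 83 | 83 | 82 | 50 | 50 | 69.7%
GPT-4.1 | 85 | 72 | 67 | 66 | 50 | 67.8%
Z.AI GLM 5 | 97 | 67 | 62 | 61 | 50 | 67.2%
Mistral Small Creative | 83 | 83 | 79 | 50 | 37 | 66.7%
DeepSeek V3.2 | 83 | 74 | 67 | 67 | 33 | 64.9%
Mistral Large | 89 | 79 | 50 | 50 | 44 | 62.4%
Grok 4 Fast | 83 | 83 | 74 | 33 | 33 | 61.5%
Minimax M2.5 | 83 | 69 | 67 | 50 | 35 | 60.9%
Hermes 3 70B | 100 | 67 | 66 | 66 | 0 | 59.7%
Claude Sonnet 4.5 | 83 | 69 | 52 | 44 | 38 | 57.2%
Gemini 3 Flash (Preview) | 63 | 58 | 50 | 50 | 50 | 54.2%
Qwen 3.5 397B A17B | 78 | 61 | 60 | 45 | 17 | 52.2%
Z.AI GLM 4.7 | 78 | 66 | 50 | 33 | 33 | 52.2%
Stealth: Aurora Alpha | 89 | 67 | 46 | 43 | 17 | 52.1%
Mistral Medium 3.1 | 67 | 67 | 63 | 59 | 0 | 51.1%
MoonshotAI: Kimi K2.5 | 86 | 64 | 56 | 33 | 13 | 50.4%
Qwen 3.5 Plus (2026-02-15) | 74 | 51 | 50 | 50 | 17 | 48.5%
WizardLM 2 8x22b | 68 | 52 | 50 | 36 | 33 | 48.0%
Arcee AI: Trinity Mini | 76 | 67 | 50 | 45 | 0 | 47.5%
Gemini 3.1 Pro (Preview) | 83 | 50 | 50 | 33 | 20 | 47.4%
Gemma 3 12B | 74 | 67 | 50 | 37 | 9 | 47.2%
Claude Haiku 4.5 | 67 | 67 | 50 | 31 | 18 | 46.5%
Claude Opus 4.5 | 67 | 56 | 45 | 44 | 19 | 46.3%
Writer: Palmyra X5 | 83 | 70 | 63 | 14 | 0 | 45.9%
ByteDance Seed 1.6 Flash | 55 | 52 | 52 | 50 | 12 | 44.2%
Ministral 3 14B | 71 | 50 | 42 | 34 | 19 | 43.1%
DeepSeek V3 (2025-03-24) | 52 | 49 | 46 | 38 | 29 | 43.1%
GPT-4o, Aug. 6th (temp=1) | 61 | 52 | 35 | 33 | 26 | 41.3%
Gemini 2.5 Flash | 70 | 58 | 46 | 29 | 0 | 40.5%
Rocinante 12B | 65 | 54 | 48 | 33 | 0 | 40.2%
Gemini 3 Pro (Preview) | 48 | 46 | 46 | 44 | 17 | 40.2%
Mistral Large 2 | 77 | 50 | 40 | 33 | 0 | 40.0%
Ministral 8B | 71 | 54 | 43 | 26 | 0 | 38.7%
Claude 3.5 Haiku | 80 | 55 | 38 | 16 | 0 | 37.9%
Z.AI GLM 4.7 Flash | 59 | 50 | 50 | 17 | 9 | 36.9%
GPT-5 Mini | 67 | 50 | 33 | 17 | 17 | 36.7%
o4 Mini High | 50 | 47 | 34 | 33 | 18 | 36.4%
Claude Sonnet 4 | 96 | 39 | 20 | 10 | 4 | 33.9%
Cohere Command R+ (Aug. 2024) | 77 | 39 | 33 | 18 | 0 | 33.5%
Ministral 3 8B | 57 | 56 | 37 | 17 | 0 | 33.3%
Grok 4.1 Fast | 72 | 33 | 26 | 17 | 17 | 32.8%
Claude Opus 4 | 57 | 33 | 30 | 26 | 17 | 32.8%
Hermes 3 405B | 100 | 42 | 6 | 0 | 0 | 29.7%
Grok 4 | 52 | 41 | 38 | 12 | 5 | 29.5%
Mistral NeMO | 49 | 41 | 17 | 17 | 0 | 24.7%
Llama 3.1 8B | 67 | 50 | 0 | 0 | 0 | 23.3%
Qwen 2.5 72B | 46 | 38 | 16 | 16 | 0 | 23.0%
Ministral 3 3B | 35 | 33 | 23 | 21 | 0 | 22.3%
GPT-4o, May 13th (temp=1) | 50 | 33 | 22 | 0 | 0 | 21.1%
Ministral 3B | 67 | 12 | 8 | 6 | 2 | 18.9%
Claude 3.7 Sonnet | 76 | 17 | 0 | 0 | 0 | 18.6%
Claude 3.5 Sonnet | 39 | 38 | 1 | 0 | 0 | 15.7%
DeepSeek V3 (2024-12-26) | 56 | 0 | 0 | 0 | 0 | 11.2%
DeepSeek-V2 Chat | 40 | 14 | 0 | 0 | 0 | 10.6%
GPT-4.1 Mini | 32 | 17 | 0 | 0 | 0 | 9.7%
o4 Mini | 17 | 17 | 13 | 0 | 0 | 9.3%
GPT-4o Mini (temp=0) | 43 | 0 | 0 | 0 | 0 | 8.7%
GPT-4o, Aug. 6th (temp=0) | 27 | 16 | 0 | 0 | 0 | 8.6%
Llama 3.1 70B | 34 | 3 | 0 | 0 | 0 | 7.5%
GPT-5.2 | 17 | 17 | 0 | 0 | 0 | 6.7%
Z.AI GLM 4.5 | 17 | 0 | 0 | 0 | 0 | 3.3%
Mistral Small 3.2 24B | 17 | 0 | 0 | 0 | 0 | 3.3%
Llama 3.1 Nemotron 70B | 9 | 4 | 0 | 0 | 0 | 2.5%
GPT-4o, May 13th (temp=0) | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0.0%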
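
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 95 | 99.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 94 | 98.7%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 91 | 98.3%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 84 | 96.8%
o4 Mini High | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemma 3 4B | 100 | 100 | 100 | 100 | 83 | 96.7%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 80 | 96.1%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 78 | 95.6%
Claude Opus 4.5 | 100 | 100 | 96 | 94 | 83 | 94.6%
GPT-5 Mini | 100 | 100 | 100 | 83 | 83 | 93.3%
GPT-5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Minimax M2.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Sonnet 4.6 | 100 | 100 | 100 | 83 | 83 | 93.3%
Hermes 3 405B | 100 | 100 | 100 | 100 | 66 | 93.2%
GPT-4.1 | 100 | 100 | 97 | 90 | 79 | 93.1%
Mistral Medium 3.1 | 100 | 100 | 99 | 83 | 76 | 91.8%
Z.AI GLM 4.7 | 100 | 100 | 94 | 90 | 67 | 90.2%
Gemini 3 Pro (Preview) | 100 | 100 | 88 | 83 | 78 | 89.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 78 | 68 | 89.1%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 91 | 50 | 88.3%
Arcee AI: Trinity Mini | 97 | 96 | 92 | 89 | 67 | 88.0%
Claude Opus 4.6 | 100 | 100 | 83 | 83 | 67 | 86.7%
GPT-5.1 | 100 | 100 | 83 | 83 | 67 | 86.7%
Grok 4 | 100 | 100 | 100 | 100 | 33 | 86.7%
Z.AI GLM 4.6 | 100 | 100 | 83 | 83 | 67 | 86.7%
Claude 3.5 Haiku | 100 | 100 | 92 | 73 | 68 | 86.5%
ByteDance Seed 1.6 | 100 | 95 | 83 | 81 | 69 | 85.6%
Writer: Palmyra X5 | 100 | 100 | 92 | 92 | 43 | 85.4%
Claude Sonnet 4 | 100 | 100 | 96 | 67 | 59 | 84.4%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 83 | 33 | 83.3%
GPT-4o, Aug. 6th (temp=0) | 100 | 95 | 81 | 81 | 55 | 82.3%
GPT-5 Nano | 100 | 100 | 93 | 67 | 50 | 82.0%
Mistral Large 2 | 100 | 100 | 83 | 75 | 45 | 80.6%
ByteDance Seed 1.6 Flash | 100 | 83 | 83 | 67 | 67 | 80.0%
Claude Haiku 4.5 | 100 | 100 | 97 | 50 | 50 | 79.5%
Ministral 3B | 100 | 100 | 83 | 64 | 42 | 77.8%
o4 Mini | 100 | 98 | 87 | 61 | 38 | 77.0%
GPT-4o, Aug. 6th (temp=1) | 95 | 83 | 83 | 67 | 42 | 74.1%
GPT-4.1 Mini | 83 | 81 | 81 | 79 | 33 | 71.7%
WizardLM 2 8x22b | 100 | 83 | 83 | 67 | 17 | 70.0%
Claude 3.5 Sonnet | 100 | 99 | 55 | 54 | 34 | 68.2%
Llama 3.1 8B | 100 | 100 | 83 | 57 | 0 | 68.1%
GPT-4o, May 13th (temp=1) | 100 | 92 | 83 | 38 | 11 | 64.9%
Rocinante 12B | 100 | 100 | 100 | 14 | 0 | 62.9%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 80 | 33 | 0 | 62.7%
Qwen 3.5 397B A17B | 100 | 75 | 67 | 50 | 17 | 61.6%
Stealth: Aurora Alpha | 67 | 67 | 63 | 60 | 50 | 61.3%
Claude 3.7 Sonnet | 83 | 75 | 67 | 65 | 15 | 60.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 0 | 0 | 60.0%
Llama 3.1 Nemotron 70B | 100 | 67 | 63 | 33 | 33 | 59.5%
Grok 4.1 Fast | 93 | 83 | 50 | 50 | 17 | 58.6%
Claude Sonnet 4.5 | 93 | 87 | 67 | 44 | 0 | 58.4%
Qwen 3.5 Plus (2026-02-15) | 91 | 83 | 50 | 49 | 0 | 54.7%
Ministral 3 14B | 95 | 89 | 55 | 17 | 17 | 54.4%
Ministral 3 3B | 100 | 100 | 35 | 33 | 0 | 53.7%
Claude Opus 4 | 100 | 100 | 67 | 0 | 0 | 53.3%
Hermes 3 70B | 81 | 70 | 67 | 47 | 0 | 52.9%
Mistral Large | 100 | 83 | 35 | 32 | 0 | 49.9%
Ministral 3 8B | 88 | 84 | 77 | 0 | 0 | 49.8%
Mistral Small 3.2 24B | 83 | 67 | 33 | 30 | 0 | 42.6%
Mistral NeMO | 90 | 50 | 33 | 17 | 17 | 41.3%
Llama 3.1 70B | 50 | 49 | 45 | 42 | 19 | 41.2%
Mistral Large 3 | 100 | 56 | 33 | 12 | 0 | 40.4%
DeepSeek V3 (2025-03-24) | 63 | 48 | 46 | 41 | 0 | 39.7%
Claude 3 Haiku | 75 | 59 | 33 | 23 | 0 | 38.0%
Qwen 2.5 72B | 55 | 49 | 33 | 33 | 11 | 36.3%
Ministral 8B | 99 | 67 | 0 | 0 | 0 | 33.2%
GPT-4o, May 13th (temp=0) | 66 | 33 | 27 | 17 | 11 | 30.8%
Arcee AI: Trinity Large (Preview) | 90 | 40 | 17 | 0 | 0 | 29.3%
Mistral Small Creative | 64 | 62 | 17 | 0 | 0 | 28.6%
Z.AI GLM 4.5 | 67 | 38 | 17 | 0 | 0 | 24.2%
GPT-5.2 | 17 | 17 | 11 | 0 | 0 | 8.9%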
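
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 92 | 98.4%
Grok 4 Fast | 100 | 100 | 100 | 97 | 94 | 98.4%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 99 | 90 | 97.9%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemma 3 27B | 100 | 100 | 100 | 100 | 83 | 96.7%
Grok 4.1 Fast | 100 | 100 | 95 | 93 | 83 | 94.3%
GPT-5 | 100 | 100 | 100 | 83 | 83 | 93.3%
GPT-5 Mini | 100 | 100 | 100 | 83 | 67 | 90.0%
GPT-5 Nano | 100 | 100 | 83 | 83 | 83 | 90.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 83 | 67 | 90.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 83 | 67 | 90.0%
Claude Opus 4.6 | 100 | 100 | 96 | 83 | 67 | 89.3%
Minimax M2.5 | 100 | 99 | 98 | 83 | 62 | 88.6%
o4 Mini | 100 | 88 | 86 | 83 | 81 | 87.6%
Mistral Medium 3.1 | 100 | 100 | 100 | 80 | 50 | 86.0%
Claude Opus 4.5 | 93 | 86 | 83 | 83 | 76 | 84.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 83 | 76 | 63 | 84.4%
MoonshotAI: Kimi K2.5 | 100 | 100 | 83 | 79 | 50 | 82.6%
Gemma 3 4B | 100 | 96 | 83 | 67 | 67 | 82.5%
Claude Haiku 4.5 | 100 | 92 | 89 | 67 | 61 | 81.7%
Mistral NeMO | 100 | 100 | 89 | 86 | 33 | 81.7%
Rocinante 12B | 100 | 100 | 98 | 90 | 17 | 80.9%
o4 Mini High | 100 | 100 | 80 | 75 | 50 | 80.8%
DeepSeek V3.2 | 100 | 100 | 100 | 83 | 17 | 80.0%
Grok 4 | 100 | 94 | 88 | 83 | 33 | 79.6%
Z.AI GLM 5 | 100 | 100 | 91 | 83 | 17 | 78.1%
Hermes 3 70B | 100 | 100 | 100 | 86 | 3 | 77.9%
ByteDance Seed 1.6 | 83 | 83 | 78 | 76 | 67 | 77.4%
ByteDance Seed 1.6 Flash | 92 | 83 | 72 | 67 | 67 | 76.1%
Mistral Small Creative | 100 | 100 | 79 | 50 | 50 | 75.9%
GPT-4.1 Mini | 88 | 83 | 81 | 67 | 56 | 75.1%
GPT-4o, Aug. 6th (temp=1) | 100 | 80 | 78 | 67 | 50 | 75.0%
Gemini 3 Pro (Preview) | 100 | 100 | 67 | 56 | 50 | 74.6%
Claude Sonnet 4.5 | 100 | 83 | 80 | 56 | 50 | 73.9%
GPT-4o, May 13th (temp=1) | 97 | 79 | 77 | 67 | 49 | 73.8%
DeepSeek-V2 Chat | 88 | 83 | 80 | 55 | 52 | 71.7%
GPT-4o Mini (temp=0) | 100 | 88 | 58 | 55 | 55 | 71.5%
Claude Sonnet 4 | 95 | 88 | 83 | 59 | 25 | 70.1%
Gemma 3 12B | 100 | 67 | 67 | 67 | 50 | 70.0%
Claude 3.5 Haiku | 100 | 66 | 63 | 63 | 58 | 70.0%
WizardLM 2 8x22b | 100 | 100 | 67 | 60 | 17 | 68.7%
Stealth: Aurora Alpha | 100 | 72 | 61 | 60 | 50 | 68.6%
Mistral Large 3 | 98 | 86 | 76 | 67 | 17 | 68.6%
DeepSeek V3 (2024-12-26) | 100 | 73 | 68 | 65 | 33 | 67.9%
Ministral 3 14B | 85 | 67 | 67 | 58 | 53 | 66.0%
Cohere Command R+ (Aug. 2024) | 100 | 77 | 67 | 50 | 33 | 65.5%
Mistral Large 2 | 67 | 67 | 67 | 65 | 60 | 65.0%
Mistral Small 3.2 24B | 100 | 100 | 67 | 39 | 17 | 64.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 52 | 50 | 17 | 63.7%
Mistral Large | 83 | 77 | 50 | 50 | 50 | 62.0%
Gemini 3 Flash (Preview) | 73 | 67 | 67 | 50 | 50 | 61.2%
Z.AI GLM 4.5 | 67 | 66 | 61 | 56 | 50 | 60.1%
Z.AI GLM 4.7 | 100 | 83 | 83 | 33 | 0 | 60.0%
Gemini 3.1 Pro (Preview) | 100 | 50 | 50 | 50 | 50 | 60.0%
Ministral 8B | 67 | 66 | 65 | 50 | 44 | 58.5%
Arcee AI: Trinity Mini | 100 | 72 | 50 | 47 | 20 | 57.9%
Qwen 3.5 Plus (2026-02-15) | 76 | 67 | 67 | 50 | 17 | 55.3%
Ministral 3 3B | 81 | 71 | 54 | 38 | 31 | 54.9%
DeepSeek V3 (2025-03-24) | 96 | 66 | 60 | 39 | 7 | 53.5%
Ministral 3B | 76 | 73 | 54 | 34 | 21 | 51.7%
GPT-4.1 | 71 | 57 | 50 | 50 | 17 | 49.0%
Qwen 3.5 397B A17B | 94 | 67 | 33 | 30 | 17 | 48.1%
GPT-4o, Aug. 6th (temp=0) | 83 | 65 | 56 | 33 | 0 | 47.4%
Llama 3.1 8B | 100 | 67 | 50 | 15 | 0 | 46.2%
Qwen 2.5 72B | 67 | 62 | 52 | 33 | 0 | 42.8%
Ministral 3 8B | 67 | 67 | 37 | 33 | 0 | 40.7%
Claude 3.5 Sonnet | 74 | 63 | 31 | 24 | 0 | 38.4%
GPT-4o, May 13th (temp=0) | 93 | 46 | 19 | 17 | 0 | 35.0%
Hermes 3 405B | 56 | 35 | 33 | 31 | 18 | 34.8%
Llama 3.1 Nemotron 70B | 49 | 47 | 39 | 31 | 0 | 33.3%
Claude Opus 4 | 67 | 33 | 17 | 17 | 0 | 26.8%
Claude 3.7 Sonnet | 48 | 31 | 17 | 17 | 0 | 22.4%
Llama 3.1 70B | 47 | 40 | 7 | 1 | 0 | 19.0%
Claude 3 Haiku | 23 | 22 | 17 | 4 | 0 | 13.2%
GPT-5.2 | 33 | 0 | 0 | 0 | 0 | 6.7%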

genre

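Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
GPT-5 Mini | 100 | 100 | 100 | 100 | 93 | 98.6%
Stealth: Aurora Alpha | 100 | 100 | 100 | 95 | 94 | 97.9%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 83 | 96.7%
Gemma 3 4B | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-4.1 Nano | 100 | 100 | 100 | 88 | 83 | 94.0%
Arcee AI: Trinity Mini | 100 | 100 | 91 | 83 | 77 | 90.2%
GPT-4.1 Mini | 100 | 89 | 89 | 87 | 81 | 89.3%
ByteDance Seed 1.6 | 99 | 91 | 89 | 80 | 77 | 87.1%
Llama 3.1 Nemotron 70B | 100 | 100 | 91 | 75 | 64 | 86.0%
WizardLM 2 8x22b | 95 | 92 | 81 | 80 | 76 | 84.8%
GPT-5 | 100 | 83 | 83 | 83 | 67 | 83.3%
Llama 3.1 8B | 100 | 100 | 100 | 64 | 49 | 82.7%
Grok 4 | 91 | 89 | 83 | 83 | 67 | 82.6%
Llama 3.1 70B | 100 | 93 | 80 | 76 | 59 | 81.7%
Grok 4.1 Fast | 94 | 87 | 83 | 73 | 69 | 81.3%
GPT-4o Mini (temp=1) | 100 | 83 | 76 | 76 | 67 | 80.3%
GPT-5.1 | 100 | 100 | 83 | 67 | 50 | 80.0%
Claude Sonnet 4 | 100 | 83 | 81 | 78 | 55 | 79.7%
Gemini 2.5 Flash | 100 | 83 | 83 | 67 | 65 | 79.6%
Gemini 2.5 Pro | 100 | 97 | 93 | 74 | 33 | 79.4%
DeepSeek V3.1 | 96 | 83 | 83 | 67 | 67 | 79.2%
Ministral 8B | 86 | 85 | 83 | 69 | 67 | 78.0%
Grok 4 Fast | 84 | 83 | 76 | 76 | 71 | 77.9%
o4 Mini High | 87 | 83 | 83 | 75 | 60 | 77.6%
GPT-4o, Aug. 6th (temp=0) | 87 | 86 | 81 | 67 | 66 | 77.4%
Qwen 2.5 72B | 91 | 79 | 71 | 67 | 67 | 74.9%
DeepSeek V3.2 | 100 | 91 | 83 | 83 | 17 | 74.8%
Claude 3.5 Haiku | 96 | 85 | 74 | 62 | 57 | 74.6%
GPT-4.1 | 91 | 83 | 73 | 72 | 50 | 73.8%
Claude Opus 4.6 | 100 | 89 | 67 | 62 | 50 | 73.5%
Qwen 3.5 397B A17B | 83 | 76 | 73 | 67 | 65 | 72.7%
Claude Sonnet 4.6 | 100 | 83 | 73 | 65 | 33 | 70.9%
GPT-4o, Aug. 6th (temp=1) | 78 | 76 | 68 | 67 | 59 | 69.5%
Minimax M2.5 | 83 | 80 | 67 | 67 | 50 | 69.4%
Gemini 3 Flash (Preview) | 81 | 72 | 67 | 62 | 61 | 68.7%
Mistral Small 3.2 24B | 87 | 83 | 67 | 67 | 39 | 68.5%
Cohere Command R+ (Aug. 2024) | 89 | 84 | 83 | 50 | 33 | 68.0%
Gemma 3 12B | 80 | 74 | 67 | 64 | 50 | 67.0%
Claude Haiku 4.5 | 83 | 79 | 77 | 48 | 45 | 66.5%
o4 Mini | 73 | 70 | 66 | 63 | 50 | 64.3%
Ministral 3B | 86 | 64 | 63 | 50 | 44 | 61.3%
Ministral 3 3B | 83 | 65 | 59 | 58 | 42 | 61.3%
Gemini 3 Pro (Preview) | 78 | 67 | 61 | 50 | 50 | 61.1%
Z.AI GLM 4.7 | 67 | 67 | 65 | 57 | 50 | 61.0%
Mistral Medium 3.1 | 84 | 71 | 50 | 50 | 50 | 60.9%
ByteDance Seed 1.6 Flash | 83 | 75 | 67 | 41 | 38 | 60.7%
Rocinante 12B | 99 | 80 | 75 | 49 | 0 | 60.5%
DeepSeek V3 (2024-12-26) | 82 | 67 | 61 | 54 | 38 | 60.4%
DeepSeek-V2 Chat | 73 | 68 | 58 | 55 | 46 | 60.0%
GPT-5 Nano | 83 | 83 | 67 | 50 | 17 | 60.0%
Claude Sonnet 4.5 | 79 | 67 | 58 | 50 | 44 | 59.3%
Gemini 3.1 Pro (Preview) | 83 | 67 | 63 | 50 | 33 | 59.3%
GPT-4o, May 13th (temp=1) | 87 | 66 | 57 | 50 | 33 | 58.8%
Mistral NeMO | 80 | 68 | 55 | 50 | 38 | 58.2%
GPT-4o, May 13th (temp=0) | 84 | 60 | 59 | 55 | 33 | 58.0%
Z.AI GLM 5 | 83 | 73 | 67 | 33 | 33 | 58.0%
Gemma 3 27B | 77 | 69 | 63 | 47 | 32 | 57.6%
Arcee AI: Trinity Large (Preview) | 87 | 67 | 66 | 50 | 17 | 57.4%
Hermes 3 70B | 81 | 57 | 53 | 50 | 43 | 56.7%
Gemini 2.5 Flash Lite | 100 | 67 | 67 | 33 | 17 | 56.7%
Ministral 3 8B | 83 | 77 | 62 | 52 | 0 | 55.0%
Z.AI GLM 4.7 Flash | 67 | 53 | 50 | 50 | 50 | 53.9%
DeepSeek V3 (2025-03-24) | 74 | 59 | 57 | 53 | 25 | 53.7%
Claude 3.5 Sonnet | 85 | 73 | 44 | 33 | 28 | 52.8%
Mistral Large 2 | 74 | 60 | 56 | 50 | 23 | 52.4%
Qwen 3.5 Plus (2026-02-15) | 77 | 67 | 51 | 33 | 33 | 52.3%
GPT-4o Mini (temp=0) | 66 | 52 | 50 | 49 | 33 | 50.1%
Claude Opus 4.5 | 83 | 50 | 50 | 33 | 33 | 50.0%
MoonshotAI: Kimi K2.5 | 67 | 50 | 50 | 43 | 33 | 48.6%
Ministral 3 14B | 78 | 57 | 50 | 36 | 17 | 47.4%
Hermes 3 405B | 83 | 76 | 60 | 17 | 0 | 47.2%
Claude Opus 4 | 62 | 58 | 55 | 33 | 28 | 47.2%
Writer: Palmyra X5 | 60 | 50 | 50 | 33 | 33 | 45.4%
Mistral Large | 73 | 50 | 33 | 25 | 23 | 40.9%
Mistral Large 3 | 62 | 52 | 45 | 42 | 0 | 40.2%
Z.AI GLM 4.5 | 54 | 50 | 42 | 28 | 26 | 39.9%
Mistral Small Creative | 56 | 50 | 33 | 33 | 17 | 37.8%
Claude 3.7 Sonnet | 58 | 50 | 30 | 27 | 0 | 33.1%
Claude 3 Haiku | 56 | 27 | 17 | 0 | 0 | 19.8%
GPT-5.2 | 33 | 17 | 0 | 0 | 0 | 10.0%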
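
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 99.9%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 99 | 99.9%
Grok 4 Fast | 100 | 100 | 100 | 100 | 99 | 99.8%
Gemma 3 12B | 100 | 100 | 100 | 100 | 98 | 99.6%
o4 Mini High | 100 | 100 | 100 | 100 | 97 | 99.3%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 93 | 98.7%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 92 | 98.5%
Claude Sonnet 4.5 | 100 | 100 | 99 | 96 | 94 | 97.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 99 | 89 | 97.6%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 94 | 93 | 97.3%
Grok 4.1 Fast | 100 | 100 | 100 | 95 | 88 | 96.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-5 Nano | 100 | 100 | 100 | 100 | 83 | 96.7%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 83 | 96.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 83 | 96.7%
Mistral Large 3 | 100 | 100 | 100 | 98 | 84 | 96.4%
DeepSeek V3 (2024-12-26) | 100 | 100 | 99 | 96 | 86 | 96.2%
Gemma 3 27B | 100 | 100 | 100 | 98 | 83 | 96.2%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 77 | 95.5%
o4 Mini | 100 | 100 | 100 | 97 | 80 | 95.4%
Claude 3.5 Sonnet | 100 | 100 | 100 | 91 | 85 | 95.1%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 89 | 83 | 94.5%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-4.1 | 100 | 100 | 100 | 83 | 70 | 90.6%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 50 | 90.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 50 | 90.0%
Ministral 3 8B | 100 | 93 | 91 | 82 | 79 | 88.8%
Mistral Large 2 | 100 | 99 | 90 | 87 | 67 | 88.6%
Mistral Large | 100 | 96 | 85 | 82 | 78 | 88.1%
Ministral 3B | 99 | 95 | 83 | 81 | 80 | 87.6%
Claude Sonnet 4 | 100 | 100 | 83 | 83 | 67 | 86.7%
Z.AI GLM 4.5 | 100 | 95 | 88 | 84 | 67 | 86.6%
Stealth: Aurora Alpha | 100 | 100 | 92 | 67 | 67 | 85.1%
Ministral 3 14B | 93 | 91 | 88 | 83 | 67 | 84.5%
ByteDance Seed 1.6 Flash | 100 | 100 | 82 | 67 | 67 | 83.1%
WizardLM 2 8x22b | 100 | 83 | 83 | 73 | 73 | 82.4%
DeepSeek V3 (2025-03-24) | 93 | 89 | 88 | 72 | 64 | 81.2%
GPT-4o, Aug. 6th (temp=0) | 100 | 85 | 84 | 69 | 67 | 81.0%
Llama 3.1 Nemotron 70B | 100 | 82 | 79 | 74 | 66 | 80.3%
Mistral NeMO | 100 | 100 | 92 | 71 | 33 | 79.3%
DeepSeek-V2 Chat | 100 | 100 | 85 | 67 | 42 | 78.8%
Mistral Small Creative | 100 | 88 | 83 | 67 | 50 | 77.6%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 50 | 33 | 76.7%
Ministral 8B | 100 | 100 | 82 | 67 | 33 | 76.4%
Arcee AI: Trinity Large (Preview) | 98 | 94 | 67 | 58 | 48 | 72.9%
Ministral 3 3B | 100 | 86 | 82 | 59 | 33 | 72.1%
Mistral Medium 3.1 | 99 | 81 | 67 | 61 | 50 | 71.5%
Claude Opus 4 | 100 | 100 | 83 | 33 | 33 | 70.0%
Llama 3.1 70B | 100 | 78 | 64 | 59 | 49 | 69.8%
Hermes 3 70B | 91 | 90 | 64 | 50 | 49 | 68.8%
Hermes 3 405B | 100 | 100 | 56 | 43 | 32 | 66.2%
Rocinante 12B | 100 | 79 | 66 | 60 | 0 | 61.0%
Qwen 2.5 72B | 93 | 77 | 67 | 33 | 0 | 54.0%
Claude 3.7 Sonnet | 74 | 55 | 55 | 33 | 33 | 50.2%
Claude 3 Haiku | 74 | 63 | 40 | 36 | 33 | 49.2%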
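
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 98 | 99.7%
Grok 4 | 100 | 100 | 100 | 100 | 83 | 96.7%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 83 | 96.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-4.1 Nano | 100 | 100 | 100 | 98 | 72 | 94.1%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 67 | 93.3%
Gemini 2.5 Flash | 100 | 100 | 100 | 83 | 79 | 92.4%
Claude 3.5 Haiku | 100 | 98 | 91 | 82 | 76 | 89.3%
Ministral 3 8B | 100 | 100 | 98 | 97 | 50 | 89.2%
Ministral 3 14B | 100 | 92 | 83 | 81 | 80 | 87.2%
Llama 3.1 Nemotron 70B | 100 | 100 | 88 | 80 | 67 | 87.1%
o4 Mini High | 100 | 100 | 100 | 100 | 33 | 86.7%
Mistral Small Creative | 100 | 100 | 100 | 67 | 67 | 86.7%
Grok 4.1 Fast | 100 | 100 | 100 | 94 | 33 | 85.4%
Z.AI GLM 4.7 | 100 | 83 | 83 | 83 | 67 | 83.3%
Mistral Large 3 | 100 | 83 | 83 | 82 | 67 | 83.0%
Llama 3.1 8B | 100 | 100 | 73 | 68 | 67 | 81.7%
Mistral Medium 3.1 | 100 | 89 | 83 | 67 | 67 | 81.1%
GPT-4o Mini (temp=1) | 100 | 95 | 90 | 83 | 33 | 80.2%
Ministral 8B | 100 | 100 | 95 | 72 | 33 | 80.1%
Claude Sonnet 4 | 100 | 83 | 83 | 83 | 50 | 80.0%
Grok 4 Fast | 100 | 100 | 83 | 83 | 33 | 80.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 0 | 79.9%
GPT-4.1 Mini | 100 | 100 | 99 | 78 | 17 | 78.9%
Qwen 3.5 Plus (2026-02-15) | 90 | 83 | 83 | 67 | 67 | 78.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 83 | 0 | 76.7%
GPT-5 Mini | 100 | 100 | 67 | 67 | 33 | 73.3%
Z.AI GLM 4.7 Flash | 100 | 83 | 67 | 67 | 50 | 73.3%
Claude Haiku 4.5 | 100 | 83 | 80 | 67 | 25 | 71.0%
ByteDance Seed 1.6 | 100 | 100 | 87 | 33 | 33 | 70.6%
DeepSeek V3 (2024-12-26) | 86 | 83 | 83 | 67 | 33 | 70.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 83 | 50 | 12 | 69.0%
Gemini 2.5 Flash Lite | 100 | 91 | 67 | 67 | 17 | 68.2%
Rocinante 12B | 100 | 83 | 67 | 50 | 38 | 67.7%
Gemini 3.1 Pro (Preview) | 83 | 83 | 67 | 67 | 33 | 66.7%
Gemini 3 Flash (Preview) | 83 | 83 | 67 | 66 | 33 | 66.4%
DeepSeek V3 (2025-03-24) | 100 | 80 | 67 | 67 | 17 | 66.1%
WizardLM 2 8x22b | 100 | 67 | 61 | 50 | 50 | 65.5%
Minimax M2.5 | 100 | 100 | 100 | 17 | 0 | 63.3%
Gemma 3 27B | 100 | 83 | 67 | 50 | 17 | 63.3%
GPT-4o Mini (temp=0) | 83 | 83 | 80 | 67 | 0 | 62.6%
Cohere Command R+ (Aug. 2024) | 78 | 67 | 67 | 52 | 50 | 62.5%
DeepSeek-V2 Chat | 100 | 67 | 67 | 67 | 0 | 60.0%
Claude Sonnet 4.6 | 100 | 100 | 33 | 33 | 17 | 56.7%
Llama 3.1 70B | 100 | 100 | 67 | 17 | 0 | 56.7%
Hermes 3 405B | 100 | 89 | 50 | 42 | 0 | 56.3%
Claude 3.5 Sonnet | 83 | 83 | 76 | 17 | 17 | 55.2%
Claude Opus 4 | 92 | 83 | 83 | 16 | 0 | 54.9%
Writer: Palmyra X5 | 100 | 98 | 67 | 0 | 0 | 52.8%
Ministral 3B | 83 | 67 | 60 | 50 | 0 | 52.1%
Arcee AI: Trinity Large (Preview) | 83 | 82 | 33 | 33 | 28 | 52.0%
Claude 3.7 Sonnet | 77 | 67 | 67 | 33 | 0 | 48.7%
GPT-4o, Aug. 6th (temp=0) | 100 | 91 | 33 | 17 | 0 | 48.2%
Claude 3 Haiku | 89 | 58 | 40 | 32 | 16 | 47.2%
ByteDance Seed 1.6 Flash | 99 | 83 | 50 | 0 | 0 | 46.6%
Qwen 3.5 397B A17B | 82 | 67 | 33 | 33 | 17 | 46.4%
Z.AI GLM 5 | 100 | 48 | 41 | 33 | 0 | 44.4%
GPT-5 | 100 | 50 | 33 | 17 | 0 | 40.0%
Mistral Large | 83 | 67 | 50 | 0 | 0 | 40.0%
GPT-4o, May 13th (temp=0) | 100 | 99 | 0 | 0 | 0 | 39.8%
Claude Opus 4.5 | 83 | 67 | 33 | 0 | 0 | 36.7%
Mistral NeMO | 67 | 50 | 33 | 33 | 0 | 36.7%
Mistral Large 2 | 100 | 67 | 17 | 0 | 0 | 36.7%
Claude Opus 4.6 | 100 | 33 | 33 | 17 | 0 | 36.5%
Hermes 3 70B | 76 | 33 | 28 | 23 | 12 | 34.6%
Claude Sonnet 4.5 | 100 | 50 | 6 | 0 | 0 | 31.2%
Gemini 3 Pro (Preview) | 67 | 50 | 33 | 0 | 0 | 30.0%
Mistral Small 3.2 24B | 67 | 50 | 29 | 0 | 0 | 29.1%
Qwen 2.5 72B | 78 | 50 | 17 | 0 | 0 | 28.9%
Ministral 3 3B | 78 | 33 | 17 | 11 | 0 | 27.8%
Z.AI GLM 4.5 | 67 | 33 | 17 | 17 | 0 | 26.7%
GPT-4o, May 13th (temp=1) | 67 | 33 | 12 | 0 | 0 | 22.5%
GPT-5.2 | 50 | 0 | 0 | 0 | 0 | 10.0%
Gemma 3 12B | 17 | 16 | 0 | 0 | 0 | 6.5%
GPT-5.1 | 0 | 0 | 0 | 0 | 0 | 0.0%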
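
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
GPT-4.1 Nano | 100 | 100 | 100 | 86 | 83 | 93.7%
Grok 4 Fast | 99 | 97 | 83 | 83 | 67 | 85.8%
GPT-5 Nano | 100 | 100 | 83 | 67 | 67 | 83.3%
Minimax M2.5 | 100 | 93 | 83 | 83 | 50 | 81.8%
Stealth: Aurora Alpha | 100 | 85 | 79 | 77 | 67 | 81.6%
ByteDance Seed 1.6 | 83 | 83 | 83 | 80 | 50 | 76.0%
Claude Sonnet 4.5 | 82 | 79 | 72 | 67 | 61 | 72.0%
Hermes 3 70B | 100 | 83 | 62 | 55 | 41 | 68.3%
Gemini 2.5 Pro | 74 | 73 | 67 | 57 | 50 | 64.1%
Mistral Medium 3.1 | 100 | 99 | 67 | 50 | 0 | 63.1%
GPT-5 Mini | 83 | 76 | 67 | 50 | 33 | 61.9%
Mistral NeMO | 100 | 81 | 49 | 41 | 33 | 60.8%
Rocinante 12B | 78 | 78 | 66 | 64 | 10 | 59.2%
Mistral Large 3 | 81 | 71 | 67 | 50 | 17 | 57.0%
DeepSeek V3.1 | 75 | 56 | 50 | 50 | 50 | 56.2%
Z.AI GLM 5 | 94 | 84 | 50 | 33 | 17 | 55.5%
Qwen 3.5 Plus (2026-02-15) | 83 | 77 | 50 | 35 | 30 | 55.1%
GPT-4.1 | 94 | 81 | 67 | 33 | 0 | 55.1%
Gemma 3 4B | 72 | 67 | 61 | 52 | 17 | 53.6%
MoonshotAI: Kimi K2.5 | 66 | 58 | 52 | 50 | 40 | 53.1%
Z.AI GLM 4.6 | 74 | 64 | 50 | 45 | 31 | 52.7%
Claude Sonnet 4 | 83 | 66 | 48 | 33 | 29 | 51.9%
GPT-4o Mini (temp=1) | 81 | 77 | 50 | 33 | 17 | 51.5%
WizardLM 2 8x22b | 79 | 72 | 64 | 36 | 0 | 50.2%
GPT-5 | 100 | 83 | 33 | 33 | 0 | 50.0%
Claude Sonnet 4.6 | 83 | 67 | 50 | 50 | 0 | 50.0%
Gemma 3 27B | 65 | 63 | 50 | 33 | 33 | 49.0%
Claude Opus 4.6 | 100 | 61 | 50 | 33 | 0 | 48.9%
ByteDance Seed 1.6 Flash | 79 | 68 | 33 | 33 | 30 | 48.9%
DeepSeek V3.2 | 74 | 50 | 50 | 33 | 33 | 48.1%
GPT-4o, Aug. 6th (temp=1) | 83 | 73 | 50 | 17 | 17 | 48.0%
Mistral Large 2 | 75 | 74 | 47 | 38 | 0 | 46.9%
Mistral Small 3.2 24B | 67 | 67 | 51 | 50 | 0 | 46.9%
Gemma 3 12B | 100 | 67 | 67 | 0 | 0 | 46.7%
Grok 4 | 80 | 67 | 33 | 33 | 17 | 45.9%
Gemini 2.5 Flash Lite | 62 | 50 | 50 | 33 | 33 | 45.6%
Grok 4.1 Fast | 83 | 63 | 44 | 33 | 0 | 44.7%
GPT-4o, May 13th (temp=1) | 61 | 49 | 41 | 33 | 33 | 43.7%
Ministral 3 14B | 67 | 50 | 33 | 33 | 33 | 43.3%
GPT-5.1 | 83 | 50 | 50 | 33 | 0 | 43.3%
GPT-4o Mini (temp=0) | 68 | 51 | 50 | 48 | 0 | 43.3%
Ministral 8B | 75 | 59 | 33 | 31 | 17 | 43.0%
Arcee AI: Trinity Mini | 67 | 57 | 55 | 28 | 0 | 41.2%
Cohere Command R+ (Aug. 2024) | 67 | 56 | 48 | 33 | 0 | 40.7%
Claude 3.5 Haiku | 60 | 47 | 34 | 22 | 18 | 36.3%
Mistral Small Creative | 51 | 50 | 33 | 29 | 17 | 36.1%
Gemini 3 Flash (Preview) | 67 | 33 | 33 | 28 | 17 | 35.5%
Llama 3.1 8B | 59 | 50 | 33 | 28 | 1 | 34.4%
Qwen 3.5 397B A17B | 72 | 39 | 28 | 24 | 0 | 32.8%
DeepSeek V3 (2025-03-24) | 56 | 52 | 33 | 21 | 0 | 32.5%
Mistral Large | 65 | 50 | 30 | 17 | 0 | 32.3%
Ministral 3 3B | 50 | 43 | 38 | 26 | 0 | 31.5%
o4 Mini | 53 | 49 | 22 | 20 | 13 | 31.3%
Claude Opus 4.5 | 52 | 50 | 33 | 17 | 0 | 30.5%
o4 Mini High | 50 | 50 | 30 | 17 | 0 | 29.4%
Ministral 3B | 48 | 45 | 33 | 16 | 0 | 28.6%
Gemini 3.1 Pro (Preview) | 78 | 63 | 0 | 0 | 0 | 28.2%
GPT-4o, Aug. 6th (temp=0) | 49 | 30 | 23 | 17 | 10 | 25.8%
DeepSeek-V2 Chat | 42 | 32 | 29 | 17 | 10 | 25.7%
GPT-4.1 Mini | 50 | 46 | 17 | 10 | 0 | 24.6%
Claude 3.5 Sonnet | 60 | 33 | 27 | 0 | 0 | 23.8%
GPT-5.2 | 33 | 33 | 17 | 17 | 17 | 23.3%
Claude Haiku 4.5 | 39 | 30 | 24 | 22 | 0 | 23.1%
Arcee AI: Trinity Large (Preview) | 83 | 27 | 4 | 0 | 0 | 22.9%
Ministral 3 8B | 58 | 47 | 8 | 0 | 0 | 22.5%
Gemini 2.5 Flash | 33 | 33 | 17 | 15 | 0 | 19.6%
Claude Opus 4 | 76 | 20 | 1 | 0 | 0 | 19.4%
Z.AI GLM 4.7 Flash | 33 | 28 | 17 | 17 | 0 | 19.0%
Hermes 3 405B | 49 | 33 | 7 | 2 | 0 | 18.3%
Qwen 2.5 72B | 27 | 25 | 19 | 17 | 0 | 17.4%
Z.AI GLM 4.7 | 50 | 33 | 0 | 0 | 0 | 16.7%
Writer: Palmyra X5 | 50 | 17 | 17 | 0 | 0 | 16.7%
Llama 3.1 Nemotron 70B | 54 | 18 | 7 | 0 | 0 | 15.8%
Llama 3.1 70B | 48 | 24 | 4 | 0 | 0 | 15.3%
DeepSeek V3 (2024-12-26) | 33 | 17 | 17 | 3 | 0 | 14.0%
GPT-4o, May 13th (temp=0) | 19 | 19 | 0 | 0 | 0 | 7.6%
Claude 3 Haiku | 16 | 11 | 0 | 0 | 0 | 5.4%
Gemini 3 Pro (Preview) | 17 | 8 | 0 | 0 | 0 | 4.9%
Claude 3.7 Sonnet | 17 | 0 | 0 | 0 | 0 | 3.3%
Z.AI GLM 4.5 | 4 | 0 | 0 | 0 | 0 | 0.7%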
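
Model | # 1 | # 2 | # 3 | # 4 | # 5 | Avg
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 97 | 99.4%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 96 | 99.2%
Claude Opus 4 | 100 | 100 | 100 | 98 | 95 | 98.5%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 92 | 98.4%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 83 | 96.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-4.1 | 100 | 100 | 100 | 100 | 83 | 96.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 83 | 96.7%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 83 | 96.7%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 83 | 96.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 83 | 96.7%
GPT-5 Nano | 100 | 100 | 100 | 99 | 83 | 96.5%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 95 | 75 | 93.9%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 96 | 90 | 83 | 93.8%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 67 | 93.3%
GPT-5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 67 | 93.3%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 67 | 93.3%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 67 | 93.3%
Llama 3.1 8B | 100 | 100 | 100 | 83 | 83 | 93.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 90 | 76 | 93.2%
Claude Opus 4.5 | 100 | 100 | 91 | 91 | 83 | 93.0%
Claude 3.5 Haiku | 100 | 100 | 100 | 89 | 74 | 92.6%
Claude 3.5 Sonnet | 100 | 94 | 91 | 87 | 83 | 91.3%
GPT-5.1 | 100 | 100 | 83 | 83 | 83 | 90.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 83 | 67 | 90.0%
Gemma 3 12B | 100 | 100 | 100 | 97 | 50 | 89.3%
Claude Sonnet 4 | 100 | 100 | 100 | 83 | 63 | 89.2%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 95 | 50 | 88.9%
Claude Haiku 4.5 | 100 | 100 | 83 | 82 | 77 | 88.4%
Z.AI GLM 4.7 | 100 | 83 | 83 | 83 | 83 | 86.7%
ByteDance Seed 1.6 | 100 | 100 | 83 | 83 | 67 | 86.7%
Gemini 2.5 Flash | 100 | 100 | 83 | 83 | 67 | 86.7%
Mistral Medium 3.1 | 100 | 100 | 96 | 67 | 67 | 85.8%
o4 Mini | 100 | 93 | 83 | 83 | 67 | 85.2%
Gemini 3.1 Pro (Preview) | 83 | 83 | 83 | 83 | 83 | 83.3%
Qwen 3.5 397B A17B | 100 | 95 | 83 | 67 | 67 | 82.4%
Gemini 3 Pro (Preview) | 100 | 93 | 83 | 67 | 67 | 81.9%
GPT-4o, Aug. 6th (temp=0) | 100 | 94 | 83 | 75 | 52 | 81.0%
Z.AI GLM 4.5 | 92 | 83 | 83 | 79 | 67 | 80.6%
Ministral 3 14B | 97 | 97 | 82 | 78 | 50 | 80.6%
Gemini 3 Flash (Preview) | 85 | 83 | 83 | 83 | 67 | 80.3%
DeepSeek V3 (2024-12-26) | 100 | 94 | 83 | 67 | 50 | 78.8%
Rocinante 12B | 100 | 90 | 86 | 76 | 33 | 77.1%
WizardLM 2 8x22b | 100 | 83 | 75 | 69 | 57 | 76.8%
Hermes 3 405B | 100 | 100 | 90 | 83 | 0 | 74.7%
GPT-4o, May 13th (temp=0) | 96 | 92 | 83 | 83 | 17 | 74.3%
Ministral 3B | 97 | 85 | 67 | 67 | 44 | 71.7%
Cohere Command R+ (Aug. 2024) | 91 | 83 | 83 | 66 | 33 | 71.4%
Mistral Large 3 | 100 | 95 | 70 | 52 | 40 | 71.3%
DeepSeek-V2 Chat | 100 | 100 | 83 | 67 | 0 | 70.0%
Claude 3.7 Sonnet | 100 | 71 | 67 | 53 | 50 | 68.2%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 33 | 0 | 66.7%
Llama 3.1 70B | 100 | 79 | 64 | 45 | 33 | 64.4%
GPT-4o, May 13th (temp=1) | 100 | 83 | 79 | 50 | 0 | 62.5%
Ministral 3 3B | 79 | 74 | 70 | 47 | 38 | 61.5%
DeepSeek V3 (2025-03-24) | 95 | 94 | 67 | 50 | 0 | 61.3%
Hermes 3 70B | 100 | 83 | 67 | 32 | 22 | 60.8%
Llama 3.1 Nemotron 70B | 81 | 77 | 69 | 42 | 33 | 60.6%
Gemma 3 27B | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Small 3.2 24B | 77 | 67 | 50 | 49 | 41 | 56.7%
Qwen 2.5 72B | 81 | 67 | 65 | 50 | 0 | 52.5%
Mistral NeMO | 85 | 50 | 50 | 42 | 17 | 48.8%
Ministral 3 8B | 98 | 62 | 59 | 17 | 0 | 47.2%
Claude 3 Haiku | 56 | 54 | 45 | 36 | 26 | 43.2%
GPT-5.2 | 72 | 50 | 33 | 33 | 17 | 41.1%
Ministral 8B | 100 | 69 | 33 | 0 | 0 | 40.5%
Mistral Small Creative | 100 | 67 | 19 | 0 | 0 | 37.1%
Mistral Large 2 | 64 | 50 | 33 | 0 | 0 | 29.5%
Mistral Large | 54 | 52 | 0 | 0 | 0 | 21.3%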
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
GPT-5 Mini 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100.0%
Z.AI GLM 4.6 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 99 99.8%
GPT-4.1 Nano 100 100 100 100 92 98.3%
Z.AI GLM 5 100 100 100 96 92 97.7%
GPT-4o, Aug. 6th (temp=1) 100 100 100 96 91 97.4%
Gemini 2.5 Flash Lite 100 100 100 100 83 96.7%
Gemini 2.5 Pro 100 100 100 100 83 96.7%
DeepSeek V3.1 100 100 100 100 83 96.7%
Qwen 3.5 397B A17B 100 100 98 94 90 96.5%
MoonshotAI: Kimi K2.5 100 100 99 96 83 95.5%
Gemini 2.5 Flash 100 100 96 95 83 94.7%
GPT-5 100 100 100 83 83 93.3%
Minimax M2.5 100 100 100 83 83 93.3%
Grok 4.1 Fast 100 100 100 83 83 93.3%
Stealth: Aurora Alpha 100 100 100 100 67 93.3%
GPT-5 Nano 100 100 100 83 83 93.3%
ByteDance Seed 1.6 100 97 97 83 83 92.1%
GPT-4o Mini (temp=1) 100 100 98 90 67 90.9%
Claude Opus 4.6 100 100 100 83 67 90.0%
Writer: Palmyra X5 100 100 100 83 67 90.0%
Llama 3.1 8B 100 100 100 100 50 90.0%
DeepSeek V3.2 100 100 97 83 67 89.4%
Qwen 3.5 Plus (2026-02-15) 100 100 97 83 67 89.4%
Claude Sonnet 4 100 98 89 88 70 89.0%
Claude Sonnet 4.5 100 94 89 86 67 87.2%
Grok 4 100 98 86 84 67 87.0%
GPT-5.1 100 100 100 83 50 86.7%
GPT-4.1 100 100 83 83 67 86.7%
Gemma 3 27B 100 100 100 83 50 86.7%
o4 Mini High 100 99 83 83 67 86.4%
Ministral 3 8B 100 100 88 87 50 84.9%
Gemma 3 4B 100 93 82 82 67 84.7%
o4 Mini 88 83 83 83 77 83.1%
DeepSeek V3 (2025-03-24) 100 87 79 79 67 82.4%
GPT-4.1 Mini 90 83 83 81 75 82.4%
Gemma 3 12B 100 93 83 83 50 81.9%
GPT-4o, Aug. 6th (temp=0) 95 83 78 68 67 78.1%
Gemini 3.1 Pro (Preview) 83 81 80 79 67 78.0%
Claude Opus 4.5 100 83 67 67 67 76.7%
Z.AI GLM 4.7 Flash 100 100 67 67 50 76.7%
Hermes 3 70B 100 83 78 62 60 76.6%
Arcee AI: Trinity Mini 100 96 71 67 49 76.5%
GPT-4o, May 13th (temp=1) 100 97 88 50 48 76.4%
Mistral Small Creative 100 94 76 62 50 76.3%
Claude Haiku 4.5 91 83 83 62 50 73.8%
Claude 3.5 Sonnet 84 75 71 71 66 73.4%
Cohere Command R+ (Aug. 2024) 100 75 67 67 58 73.3%
Ministral 8B 100 100 100 50 0 70.0%
Ministral 3 3B 94 83 68 67 33 69.1%
DeepSeek-V2 Chat 100 84 63 50 46 68.6%
Rocinante 12B 100 80 67 53 44 68.5%
Mistral Large 2 96 71 69 61 33 66.1%
Mistral Medium 3.1 100 83 64 63 17 65.6%
Mistral Large 3 98 74 55 50 50 65.5%
GPT-4o Mini (temp=0) 73 67 67 65 52 64.8%
Claude 3.5 Haiku 77 75 68 58 45 64.5%
Claude Opus 4 81 74 70 63 33 64.3%
Gemini 3 Flash (Preview) 83 83 67 50 33 63.3%
Z.AI GLM 4.7 81 67 59 50 50 61.2%
Mistral NeMO 86 83 83 50 0 60.5%
GPT-5.2 83 67 67 50 33 60.0%
WizardLM 2 8x22b 100 100 50 50 0 60.0%
Mistral Large 72 67 67 58 33 59.4%
Z.AI GLM 4.5 99 87 54 33 17 57.9%
Ministral 3 14B 82 56 50 50 50 57.6%
DeepSeek V3 (2024-12-26) 70 67 53 48 46 56.7%
Mistral Small 3.2 24B 99 50 50 50 33 56.5%
Ministral 3B 97 58 58 50 17 56.1%
Gemini 3 Pro (Preview) 67 50 50 50 50 53.3%
Llama 3.1 Nemotron 70B 72 69 64 32 21 51.6%
Claude 3 Haiku 78 58 42 40 26 48.7%
Llama 3.1 70B 83 55 49 43 0 46.2%
Qwen 2.5 72B 94 50 50 33 0 45.4%
GPT-4o, May 13th (temp=0) 65 61 59 17 0 40.3%
Arcee AI: Trinity Large (Preview) 90 50 43 17 0 39.9%
Hermes 3 405B 93 48 30 17 0 37.5%
ByteDance Seed 1.6 Flash 67 33 33 0 0 26.7%
Claude 3.7 Sonnet 36 23 17 17 0 18.3%

Novelcrafter Default Prompt

Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Sonnet 4.6 100 100 100 93 89 96.4%
Z.AI GLM 4.6 100 100 100 88 77 93.1%
Gemini 2.5 Pro 100 100 100 97 67 92.7%
Gemini 2.5 Flash 100 100 100 92 67 91.7%
GPT-5 100 100 86 83 78 89.6%
Gemma 3 4B 100 100 83 83 80 89.3%
GPT-5.1 100 100 100 83 49 86.5%
Claude Opus 4.6 100 83 83 83 67 83.4%
GPT-5 Nano 100 100 83 83 50 83.3%
Llama 3.1 8B 100 100 100 67 47 82.8%
Grok 4 91 91 90 71 70 82.6%
Gemini 2.5 Flash Lite 100 83 80 73 68 81.0%
Stealth: Aurora Alpha 100 90 77 70 67 80.7%
GPT-4.1 Nano 86 83 82 81 69 80.2%
GPT-5 Mini 100 100 83 83 33 80.0%
DeepSeek V3.1 95 86 81 67 67 78.9%
Grok 4 Fast 91 81 78 73 66 77.7%
Gemma 3 12B 100 88 71 67 60 77.1%
Arcee AI: Trinity Mini 100 83 77 76 48 77.0%
Gemini 3 Flash (Preview) 96 76 75 67 65 75.6%
Claude 3.5 Haiku 97 91 73 64 44 74.1%
Llama 3.1 Nemotron 70B 82 73 73 70 66 72.6%
Arcee AI: Trinity Large (Preview) 92 91 78 67 33 72.1%
Minimax M2.5 85 77 68 65 62 71.2%
ByteDance Seed 1.6 85 83 67 67 50 70.4%
Claude Opus 4.5 90 83 74 50 50 69.4%
Mistral Small 3.2 24B 100 100 67 58 13 67.5%
o4 Mini High 83 77 69 56 50 67.2%
Qwen 3.5 397B A17B 83 83 83 50 33 66.5%
WizardLM 2 8x22b 87 78 67 50 47 65.6%
DeepSeek V3 (2024-12-26) 81 73 72 57 43 65.6%
Grok 4.1 Fast 84 80 63 51 50 65.6%
GPT-4.1 96 67 61 50 50 64.6%
Mistral NeMO 83 73 70 63 33 64.5%
Ministral 3 8B 76 70 67 58 50 64.2%
DeepSeek-V2 Chat 67 62 59 53 51 58.5%
DeepSeek V3.2 67 67 59 50 50 58.5%
Ministral 3 14B 79 60 59 50 42 58.1%
Gemini 3 Pro (Preview) 86 54 50 50 50 57.9%
Hermes 3 70B 87 67 56 56 23 57.6%
Ministral 3B 73 69 59 47 35 56.7%
GPT-4o, Aug. 6th (temp=1) 95 65 52 46 25 56.4%
ByteDance Seed 1.6 Flash 62 61 60 50 48 56.2%
Z.AI GLM 5 83 81 78 33 0 55.3%
Claude Opus 4 74 67 52 48 33 55.0%
Writer: Palmyra X5 80 67 63 46 17 54.5%
GPT-4.1 Mini 83 79 46 33 31 54.4%
Gemma 3 27B 72 65 52 42 40 54.2%
Claude Haiku 4.5 67 66 53 50 33 54.0%
Claude 3.7 Sonnet 65 61 55 50 39 53.9%
Ministral 8B 73 65 60 52 17 53.1%
Mistral Medium 3.1 71 50 50 47 47 53.1%
Mistral Large 67 67 63 41 27 52.9%
Llama 3.1 70B 74 64 50 39 33 52.1%
Qwen 2.5 72B 61 54 53 50 41 51.8%
Hermes 3 405B 67 66 52 50 23 51.7%
Claude Sonnet 4 62 62 58 55 18 51.1%
MoonshotAI: Kimi K2.5 81 58 49 33 33 50.9%
Z.AI GLM 4.7 Flash 70 67 50 33 33 50.6%
Ministral 3 3B 57 55 50 49 35 49.2%
GPT-4o, Aug. 6th (temp=0) 71 51 43 41 38 48.8%
o4 Mini 67 50 47 44 33 48.2%
GPT-4o Mini (temp=1) 80 50 44 41 25 47.9%
Claude Sonnet 4.5 65 56 52 38 29 47.9%
DeepSeek V3 (2025-03-24) 70 52 47 45 23 47.1%
Z.AI GLM 4.5 77 53 37 33 33 46.6%
Mistral Large 2 58 51 50 39 33 46.1%
Claude 3.5 Sonnet 55 51 47 44 33 45.9%
Claude 3 Haiku 71 55 50 33 16 45.0%
Mistral Large 3 58 49 47 36 26 43.1%
Mistral Small Creative 63 50 33 32 28 41.1%
Rocinante 12B 76 50 43 37 0 41.0%
Gemini 3.1 Pro (Preview) 58 50 43 33 17 40.2%
Cohere Command R+ (Aug. 2024) 49 47 47 42 16 40.0%
Qwen 3.5 Plus (2026-02-15) 83 50 33 17 17 40.0%
Z.AI GLM 4.7 78 50 33 30 0 38.4%
GPT-4o, May 13th (temp=1) 43 38 31 29 28 33.8%
GPT-4o, May 13th (temp=0) 67 27 27 26 17 32.7%
GPT-4o Mini (temp=0) 59 50 17 13 0 27.8%
GPT-5.2 16 0 0 0 0 3.2%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
GPT-5 Mini 100 100 100 100 100 100.0%
Claude Opus 4.6 100 100 100 100 100 100.0%
GPT-5.1 100 100 100 100 100 100.0%
GPT-5 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 100 100.0%
Z.AI GLM 4.7 100 100 100 100 100 100.0%
Grok 4 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100.0%
Z.AI GLM 4.6 100 100 100 100 100 100.0%
Z.AI GLM 4.7 Flash 100 100 100 100 100 100.0%
Qwen 3.5 Plus (2026-02-15) 100 100 100 100 100 100.0%
GPT-5 Nano 100 100 100 100 100 100.0%
GPT-4.1 Mini 100 100 100 100 100 100.0%
DeepSeek V3.1 100 100 100 100 100 100.0%
Gemini 2.5 Flash 100 100 100 100 100 100.0%
GPT-4.1 Nano 100 100 100 100 100 100.0%
Gemma 3 4B 100 100 100 100 100 100.0%
WizardLM 2 8x22b 100 100 100 100 100 100.0%
Arcee AI: Trinity Mini 100 100 100 100 99 99.8%
Qwen 3.5 397B A17B 100 100 100 100 96 99.3%
Gemini 3 Pro (Preview) 100 100 100 98 96 98.8%
Z.AI GLM 5 100 100 100 100 93 98.6%
MoonshotAI: Kimi K2.5 100 100 100 100 91 98.3%
Gemini 3.1 Pro (Preview) 100 100 100 100 90 98.0%
Grok 4 Fast 100 100 100 100 89 97.7%
Cohere Command R+ (Aug. 2024) 100 100 100 100 87 97.4%
Hermes 3 405B 100 100 100 95 91 97.3%
GPT-4.1 100 100 98 96 91 96.9%
o4 Mini High 100 100 100 100 83 96.7%
Claude Opus 4.5 100 100 100 100 83 96.7%
o4 Mini 100 100 100 100 83 96.7%
ByteDance Seed 1.6 100 100 100 100 83 96.7%
DeepSeek V3.2 100 100 100 100 83 96.7%
Gemini 2.5 Flash Lite 100 100 100 100 83 96.7%
Claude Sonnet 4 100 100 100 99 83 96.4%
Claude Sonnet 4.5 100 100 97 94 80 94.3%
Minimax M2.5 100 100 100 100 67 93.3%
DeepSeek-V2 Chat 100 100 100 86 81 93.3%
Grok 4.1 Fast 100 100 100 83 81 92.9%
Gemma 3 27B 100 100 100 83 79 92.5%
Claude 3.5 Haiku 100 100 100 88 73 92.3%
GPT-4o Mini (temp=1) 100 100 100 90 67 91.3%
GPT-4o, Aug. 6th (temp=1) 100 100 98 85 69 90.5%
Writer: Palmyra X5 100 100 99 83 67 89.9%
Gemini 3 Flash (Preview) 100 100 99 83 67 89.7%
Claude Haiku 4.5 100 100 92 89 67 89.5%
Llama 3.1 Nemotron 70B 100 100 97 88 50 86.8%
Llama 3.1 8B 100 100 100 83 50 86.7%
Mistral Medium 3.1 100 95 84 80 74 86.4%
Claude 3.5 Sonnet 95 88 84 82 76 85.1%
Ministral 8B 95 85 85 82 78 84.8%
Ministral 3 14B 100 90 81 74 70 83.0%
Claude Opus 4 100 100 99 79 33 82.3%
Stealth: Aurora Alpha 100 92 83 67 65 81.4%
DeepSeek V3 (2024-12-26) 100 94 83 76 51 80.6%
Mistral Small 3.2 24B 100 100 100 99 0 79.7%
Z.AI GLM 4.5 100 83 83 81 50 79.4%
Ministral 3B 100 84 79 79 56 79.4%
GPT-4o, May 13th (temp=0) 100 100 79 67 50 79.1%
Gemma 3 12B 100 83 77 67 67 78.8%
Mistral Large 94 84 83 82 50 78.7%
Mistral Small Creative 100 88 82 72 50 78.3%
Ministral 3 8B 98 86 85 77 42 77.6%
Mistral Large 2 91 83 77 69 66 77.1%
Mistral Large 3 93 87 78 67 55 76.1%
Qwen 2.5 72B 83 80 75 67 67 74.4%
DeepSeek V3 (2025-03-24) 86 83 76 72 54 74.2%
GPT-4o, Aug. 6th (temp=0) 100 78 73 67 50 73.6%
GPT-4o, May 13th (temp=1) 100 83 77 74 33 73.5%
Rocinante 12B 91 89 76 67 43 73.3%
Claude 3.7 Sonnet 83 82 79 63 59 73.2%
ByteDance Seed 1.6 Flash 100 83 82 67 17 69.7%
Ministral 3 3B 82 82 70 57 49 68.1%
GPT-4o Mini (temp=0) 100 88 67 50 33 67.5%
Mistral NeMO 100 83 67 66 17 66.5%
GPT-5.2 83 81 75 59 33 66.3%
Llama 3.1 70B 100 63 59 56 41 63.9%
Hermes 3 70B 67 62 57 56 50 58.3%
Arcee AI: Trinity Large (Preview) 100 47 44 33 22 49.2%
Claude 3 Haiku 65 63 52 45 0 45.1%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Grok 4 100 100 100 100 100 100.0%
Z.AI GLM 4.6 100 100 100 100 100 100.0%
DeepSeek V3.1 100 100 100 100 100 100.0%
GPT-4.1 Nano 100 100 100 100 100 100.0%
Gemma 3 4B 100 100 100 100 100 100.0%
Gemini 2.5 Pro 100 100 100 100 83 96.7%
Arcee AI: Trinity Mini 100 100 100 100 83 96.6%
Grok 4 Fast 100 100 97 83 83 92.7%
DeepSeek V3 (2024-12-26) 100 100 90 83 82 91.2%
ByteDance Seed 1.6 100 100 100 100 50 90.0%
DeepSeek V3.2 100 100 100 83 67 90.0%
GPT-5 Mini 100 100 100 100 33 86.7%
Stealth: Aurora Alpha 100 100 83 77 62 84.6%
Gemma 3 27B 100 96 83 75 65 83.7%
Mistral Small Creative 100 100 92 91 33 83.1%
Cohere Command R+ (Aug. 2024) 100 100 98 87 26 82.2%
Mistral Large 2 100 89 72 69 67 79.4%
GPT-4o Mini (temp=1) 100 100 68 67 50 77.0%
MoonshotAI: Kimi K2.5 100 100 83 83 17 76.7%
Z.AI GLM 4.7 Flash 100 83 83 80 33 76.0%
Mistral Large 3 100 100 83 56 40 75.8%
GPT-4.1 Mini 83 83 75 67 67 75.1%
o4 Mini 100 100 100 67 0 73.3%
WizardLM 2 8x22b 100 100 79 67 17 72.5%
Llama 3.1 70B 100 94 69 53 46 72.3%
DeepSeek-V2 Chat 100 96 83 71 0 70.0%
GPT-4o Mini (temp=0) 81 78 64 64 61 69.8%
o4 Mini High 100 100 100 48 0 69.5%
GPT-5 Nano 100 83 75 50 33 68.3%
GPT-4o, May 13th (temp=1) 99 90 83 67 0 67.9%
Ministral 3 14B 100 92 79 50 17 67.6%
Grok 4.1 Fast 83 83 67 67 33 66.7%
GPT-4o, May 13th (temp=0) 93 83 78 67 0 64.3%
GPT-4o, Aug. 6th (temp=0) 100 67 51 50 50 63.6%
Gemini 2.5 Flash 100 92 83 33 0 61.6%
Ministral 3B 90 67 50 50 50 61.4%
Claude Sonnet 4 100 100 67 33 0 60.0%
Z.AI GLM 4.5 100 67 67 67 0 60.0%
Mistral Medium 3.1 83 83 81 50 0 59.6%
Gemini 2.5 Flash Lite 100 100 67 25 0 58.4%
Claude Opus 4.6 100 67 50 50 17 56.7%
Gemini 3 Flash (Preview) 83 67 67 50 17 56.7%
Claude Haiku 4.5 100 100 67 17 0 56.7%
Hermes 3 405B 100 50 50 50 33 56.7%
Mistral Large 100 93 67 17 0 55.2%
Ministral 8B 100 98 67 0 0 52.9%
GPT-4o, Aug. 6th (temp=1) 96 67 50 50 0 52.5%
Rocinante 12B 90 67 61 34 7 51.8%
Mistral NeMO 83 58 50 50 17 51.6%
Ministral 3 8B 100 83 67 0 0 50.0%
Claude 3.5 Haiku 76 62 56 44 8 49.2%
Z.AI GLM 4.7 95 83 67 0 0 49.1%
Z.AI GLM 5 100 75 41 29 0 49.1%
Claude 3 Haiku 70 63 56 33 19 48.2%
Hermes 3 70B 100 83 50 0 0 46.7%
Qwen 3.5 397B A17B 100 83 33 17 0 46.7%
GPT-5 100 67 50 17 0 46.7%
Llama 3.1 8B 100 67 33 32 0 46.4%
Gemma 3 12B 98 67 67 0 0 46.2%
GPT-4.1 83 61 33 33 0 42.3%
Gemini 3 Pro (Preview) 83 67 33 17 0 40.0%
Writer: Palmyra X5 83 50 50 17 0 40.0%
Arcee AI: Trinity Large (Preview) 90 75 26 0 0 38.1%
Qwen 3.5 Plus (2026-02-15) 67 50 50 17 0 36.7%
Llama 3.1 Nemotron 70B 64 33 33 33 17 36.0%
DeepSeek V3 (2025-03-24) 100 50 11 0 0 32.2%
ByteDance Seed 1.6 Flash 100 33 17 10 0 32.0%
Claude 3.7 Sonnet 56 50 33 17 0 31.2%
Claude 3.5 Sonnet 75 67 3 0 0 28.9%
Gemini 3.1 Pro (Preview) 45 33 33 17 0 25.6%
Mistral Small 3.2 24B 82 17 17 0 0 23.1%
Ministral 3 3B 43 33 33 0 0 22.0%
Claude Opus 4 83 17 0 0 0 20.0%
Claude Opus 4.5 50 17 17 17 0 20.0%
Qwen 2.5 72B 91 0 0 0 0 18.2%
Minimax M2.5 33 17 17 17 6 17.9%
Claude Sonnet 4.5 83 0 0 0 0 16.7%
Claude Sonnet 4.6 48 17 17 0 0 16.3%
GPT-5.1 0 0 0 0 0 0.0%
GPT-5.2 0 0 0 0 0 0.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Claude Sonnet 4.6 100 100 100 100 83 96.6%
Gemini 2.5 Pro 100 96 90 83 83 90.6%
ByteDance Seed 1.6 100 100 83 83 83 90.0%
DeepSeek V3.1 100 97 90 83 67 87.4%
Z.AI GLM 4.6 100 100 87 79 59 84.9%
GPT-5 Nano 100 83 83 83 67 83.3%
GPT-4.1 Nano 100 90 76 67 50 76.5%
DeepSeek V3.2 91 78 78 78 51 75.2%
Cohere Command R+ (Aug. 2024) 99 98 87 51 33 73.6%
GPT-5 100 100 67 50 50 73.3%
Claude Opus 4.6 100 98 67 50 50 73.0%
Gemma 3 4B 100 100 67 62 33 72.4%
Mistral Medium 3.1 90 88 67 60 27 66.1%
Grok 4 Fast 96 71 66 50 33 63.3%
GPT-4.1 83 83 50 50 48 62.8%
Mistral Small Creative 95 66 62 50 33 61.3%
Gemini 2.5 Flash Lite 86 69 67 46 33 60.1%
GPT-4o, Aug. 6th (temp=1) 90 67 62 50 33 60.1%
Stealth: Aurora Alpha 67 64 58 50 50 57.7%
Mistral Small 3.2 24B 83 73 67 33 17 54.7%
Claude Sonnet 4.5 75 67 66 42 17 53.4%
GPT-5.1 100 67 50 33 17 53.3%
Z.AI GLM 5 100 83 50 17 17 53.3%
Mistral Large 65 62 55 45 33 52.1%
Gemini 2.5 Flash 67 64 51 41 33 51.3%
Rocinante 12B 83 72 62 38 1 51.3%
o4 Mini High 77 75 67 33 0 50.4%
Qwen 3.5 Plus (2026-02-15) 84 57 51 41 17 49.8%
Mistral NeMO 98 54 50 28 17 49.3%
Ministral 3 8B 74 62 62 46 0 48.8%
Claude Sonnet 4 63 58 49 37 31 47.6%
WizardLM 2 8x22b 67 67 54 50 0 47.4%
Claude Opus 4.5 69 65 50 45 3 46.3%
Minimax M2.5 83 67 50 17 0 43.3%
Claude Haiku 4.5 67 67 38 26 17 43.0%
GPT-4o Mini (temp=1) 56 53 36 33 33 42.4%
Grok 4 67 67 50 26 0 41.9%
Claude 3.5 Haiku 77 69 30 18 12 41.2%
Ministral 8B 68 62 48 5 0 36.6%
Claude Opus 4 52 43 42 33 10 36.2%
Arcee AI: Trinity Large (Preview) 74 60 31 0 0 33.1%
Gemma 3 12B 83 39 25 9 6 32.5%
Ministral 3B 68 54 32 8 0 32.4%
Gemini 3 Flash (Preview) 50 49 33 17 0 29.9%
DeepSeek-V2 Chat 67 50 17 10 0 28.6%
Hermes 3 70B 73 47 16 7 0 28.5%
DeepSeek V3 (2024-12-26) 60 42 30 8 0 27.9%
Claude 3.5 Sonnet 44 42 27 17 7 27.6%
MoonshotAI: Kimi K2.5 56 33 33 11 0 26.9%
GPT-5 Mini 67 33 17 17 0 26.7%
Mistral Large 2 53 32 32 16 0 26.7%
Writer: Palmyra X5 83 47 0 0 0 26.0%
Mistral Large 3 57 44 17 9 0 25.4%
GPT-4o, Aug. 6th (temp=0) 65 59 0 0 0 24.9%
o4 Mini 74 33 17 0 0 24.7%
Arcee AI: Trinity Mini 46 39 26 10 0 23.9%
Qwen 3.5 397B A17B 35 33 33 17 0 23.7%
Z.AI GLM 4.7 Flash 42 32 24 17 0 23.0%
ByteDance Seed 1.6 Flash 81 17 14 0 0 22.4%
Llama 3.1 70B 67 33 0 0 0 20.0%
Ministral 3 14B 49 47 2 0 0 19.8%
Gemma 3 27B 67 16 0 0 0 16.5%
Llama 3.1 8B 50 29 0 0 0 15.7%
Gemini 3.1 Pro (Preview) 31 17 17 10 0 15.0%
Z.AI GLM 4.7 50 24 0 0 0 14.8%
Grok 4.1 Fast 47 17 8 0 0 14.3%
Ministral 3 3B 22 20 16 11 0 13.8%
Gemini 3 Pro (Preview) 43 23 0 0 0 13.2%
GPT-5.2 17 17 17 15 0 13.0%
Hermes 3 405B 29 20 15 0 0 12.8%
GPT-4.1 Mini 57 0 0 0 0 11.4%
Qwen 2.5 72B 25 17 0 0 0 8.4%
DeepSeek V3 (2025-03-24) 26 14 0 0 0 7.9%
GPT-4o, May 13th (temp=1) 17 12 0 0 0 5.8%
Claude 3.7 Sonnet 28 0 0 0 0 5.5%
Llama 3.1 Nemotron 70B 23 1 0 0 0 4.9%
GPT-4o, May 13th (temp=0) 17 0 0 0 0 3.3%
Claude 3 Haiku 13 0 0 0 0 2.6%
Z.AI GLM 4.5 10 0 0 0 0 1.9%
GPT-4o Mini (temp=0) 0 0 0 0 0 0.0%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Gemini 2.5 Pro 100 100 100 100 100 100.0%
Z.AI GLM 4.7 Flash 100 100 100 100 100 100.0%
Grok 4 Fast 100 100 100 100 100 100.0%
DeepSeek V3.1 100 100 100 100 100 100.0%
GPT-4o Mini (temp=1) 100 100 100 100 100 100.0%
Arcee AI: Trinity Mini 100 100 100 100 90 98.1%
GPT-5.1 100 100 100 100 83 96.7%
Minimax M2.5 100 100 100 100 83 96.7%
Grok 4 100 100 100 100 83 96.7%
Claude Sonnet 4.6 100 100 100 100 83 96.7%
ByteDance Seed 1.6 100 100 100 100 83 96.7%
GPT-4.1 Nano 100 100 100 100 83 96.7%
Z.AI GLM 5 100 100 100 100 83 96.7%
Claude Opus 4.6 100 100 100 98 83 96.3%
Gemma 3 27B 100 100 100 97 83 96.1%
MoonshotAI: Kimi K2.5 100 100 100 94 81 95.1%
Mistral Small Creative 100 100 97 93 84 94.8%
GPT-5 Mini 100 100 100 83 83 93.3%
o4 Mini 100 100 100 83 83 93.3%
Claude Opus 4 100 100 100 83 83 93.3%
DeepSeek V3.2 100 100 100 100 67 93.3%
Gemini 2.5 Flash 100 100 100 83 83 93.3%
GPT-4o Mini (temp=0) 100 100 100 83 83 93.3%
Claude Opus 4.5 100 100 95 83 83 92.3%
o4 Mini High 100 100 100 100 61 92.3%
GPT-4.1 100 96 94 86 84 92.0%
DeepSeek V3 (2025-03-24) 100 100 87 83 83 90.8%
Gemma 3 4B 100 99 95 93 67 90.7%
GPT-4o, May 13th (temp=0) 100 100 100 100 50 90.0%
Gemini 2.5 Flash Lite 100 100 83 83 83 90.0%
WizardLM 2 8x22b 100 100 100 83 67 90.0%
Z.AI GLM 4.7 100 100 96 83 70 89.8%
Gemini 3 Pro (Preview) 100 95 83 83 78 87.9%
GPT-4o, Aug. 6th (temp=1) 100 100 100 83 55 87.6%
GPT-5 100 100 100 83 50 86.7%
Z.AI GLM 4.6 100 100 100 100 33 86.6%
GPT-4o, Aug. 6th (temp=0) 100 100 79 78 66 84.6%
GPT-4.1 Mini 100 98 95 79 50 84.3%
Gemini 3 Flash (Preview) 95 94 86 83 59 83.4%
DeepSeek-V2 Chat 100 100 78 67 56 80.0%
ByteDance Seed 1.6 Flash 100 94 89 83 33 79.9%
Claude 3.5 Haiku 86 85 85 80 57 78.7%
Claude Sonnet 4 100 90 89 83 21 76.7%
Qwen 3.5 Plus (2026-02-15) 96 79 78 73 50 75.4%
Claude Haiku 4.5 100 100 87 70 17 74.8%
Grok 4.1 Fast 100 100 88 83 0 74.2%
Writer: Palmyra X5 100 100 95 67 0 72.3%
Claude 3.5 Sonnet 87 77 73 59 54 70.0%
Llama 3.1 8B 100 97 67 58 28 69.8%
Stealth: Aurora Alpha 83 79 75 61 50 69.6%
Llama 3.1 70B 87 83 67 61 50 69.5%
GPT-4o, May 13th (temp=1) 97 92 65 56 33 68.7%
Cohere Command R+ (Aug. 2024) 90 83 83 55 24 67.1%
Mistral NeMO 94 84 79 76 0 66.7%
Ministral 3 8B 93 91 58 48 33 64.7%
Ministral 8B 77 75 67 66 33 63.8%
Mistral Large 100 77 69 53 20 63.8%
GPT-5 Nano 100 67 50 50 50 63.3%
DeepSeek V3 (2024-12-26) 100 100 67 46 2 63.0%
Ministral 3 14B 100 73 53 50 31 61.3%
Claude Sonnet 4.5 100 92 74 20 0 57.2%
Gemini 3.1 Pro (Preview) 76 69 67 43 30 57.0%
Llama 3.1 Nemotron 70B 67 63 55 52 44 56.0%
Claude 3.7 Sonnet 82 72 50 37 33 54.8%
Gemma 3 12B 100 67 49 33 17 53.2%
Rocinante 12B 100 88 67 0 0 50.8%
Arcee AI: Trinity Large (Preview) 80 73 50 37 7 49.4%
Qwen 2.5 72B 76 55 47 32 0 42.2%
Hermes 3 405B 82 72 56 0 0 42.1%
Z.AI GLM 4.5 65 60 50 29 0 40.8%
Mistral Large 3 87 66 47 0 0 40.2%
Hermes 3 70B 83 65 21 7 0 35.1%
Mistral Large 2 83 59 0 0 0 28.5%
Ministral 3 3B 100 17 13 0 0 25.9%
Mistral Small 3.2 24B 49 33 23 23 0 25.6%
Qwen 3.5 397B A17B 50 41 33 0 0 24.9%
Mistral Medium 3.1 100 17 5 0 0 24.4%
Claude 3 Haiku 53 33 32 0 0 23.5%
GPT-5.2 43 29 15 0 0 17.5%
Ministral 3B 38 3 0 0 0 8.1%
Model # 1 # 2 # 3 # 4 # 5 Avg ▼
Gemini 2.5 Pro 100 100 100 100 100 100.0%
Claude Sonnet 4.6 100 100 100 100 100 100.0%
DeepSeek V3.1 100 100 100 100 100 100.0%
DeepSeek V3.2 100 100 100 100 100 99.9%
GPT-5 100 100 100 100 96 99.2%
MoonshotAI: Kimi K2.5 100 100 100 100 93 98.7%
Gemini 2.5 Flash 100 100 100 100 83 96.7%
Z.AI GLM 4.7 Flash 100 100 100 96 83 95.8%
Z.AI GLM 5 100 100 100 99 78 95.4%
Z.AI GLM 4.6 100 100 100 90 86 95.3%
GPT-4.1 Nano 100 100 100 84 83 93.5%
Gemma 3 27B 100 100 100 83 67 90.0%
GPT-5 Mini 100 100 100 82 67 89.7%
Gemini 2.5 Flash Lite 100 100 100 83 61 88.8%
DeepSeek V3 (2024-12-26) 100 100 95 83 60 87.6%
o4 Mini High 100 100 83 83 67 86.7%
GPT-5 Nano 100 100 83 83 67 86.7%
Claude Opus 4.6 100 100 83 83 67 86.7%
GPT-5.1 100 100 97 83 50 86.2%
Writer: Palmyra X5 100 94 83 83 67 85.4%
GPT-4.1 Mini 100 93 83 81 67 84.9%
ByteDance Seed 1.6 100 100 88 83 50 84.4%
Claude Opus 4.5 100 91 83 80 67 84.3%
DeepSeek-V2 Chat 100 83 80 78 77 83.6%
Rocinante 12B 100 100 90 76 50 83.3%
Claude Haiku 4.5 95 93 83 80 63 82.5%
Mistral Small Creative 93 91 83 76 67 82.1%
Grok 4 Fast 99 94 83 67 67 82.1%
GPT-4o, Aug. 6th (temp=1) 100 100 93 83 33 82.0%
Claude Sonnet 4.5 100 84 80 79 67 81.9%
Gemini 3.1 Pro (Preview) 100 100 81 67 59 81.4%
Grok 4.1 Fast 99 91 84 67 59 79.8%
WizardLM 2 8x22b 100 98 83 67 50 79.5%
Minimax M2.5 100 100 74 71 50 78.9%
Mistral Medium 3.1 95 83 79 67 67 78.1%
Mistral Small 3.2 24B 100 100 99 83 0 76.5%
GPT-4o Mini (temp=1) 99 94 73 68 49 76.4%
Hermes 3 70B 100 100 83 50 48 76.2%
DeepSeek V3 (2025-03-24) 88 88 78 72 46 74.3%
GPT-4.1 100 82 72 67 50 74.2%
Z.AI GLM 4.7 100 95 67 59 33 70.9%
Ministral 3 8B 83 77 74 61 50 68.8%
ByteDance Seed 1.6 Flash 100 94 82 67 0 68.6%
Claude Sonnet 4 88 83 76 61 33 68.1%
Grok 4 82 67 67 63 61 68.0%
Claude 3.5 Haiku 81 76 70 56 55 67.5%
GPT-4o, Aug. 6th (temp=0) 89 81 58 53 50 66.4%
Mistral Large 3 80 74 64 64 48 66.2%
o4 Mini 83 67 67 62 50 65.8%
Gemini 3 Pro (Preview) 89 83 77 67 0 63.1%
Arcee AI: Trinity Large (Preview) 78 74 60 51 51 62.6%
Gemma 3 4B 83 67 67 50 45 62.6%
GPT-4o, May 13th (temp=1) 100 65 50 49 46 62.1%
Claude Opus 4 84 75 61 48 42 62.0%
Gemini 3 Flash (Preview) 83 76 67 50 33 61.9%
Stealth: Aurora Alpha 83 66 58 50 50 61.5%
Cohere Command R+ (Aug. 2024) 100 67 64 33 28 58.4%
Gemma 3 12B 83 75 67 67 0 58.3%
Arcee AI: Trinity Mini 87 83 65 50 0 56.9%
Ministral 8B 83 79 56 50 16 56.8%
Mistral NeMO 67 64 54 49 33 53.4%
Mistral Large 2 90 50 48 48 23 51.9%
GPT-4o Mini (temp=0) 72 59 47 43 34 50.9%
Qwen 3.5 397B A17B 69 67 50 50 17 50.6%
Qwen 3.5 Plus (2026-02-15) 83 61 50 33 17 48.8%
Claude 3.5 Sonnet 64 57 54 41 28 48.8%
Z.AI GLM 4.5 69 67 40 33 33 48.5%
Ministral 3 14B 78 67 46 33 17 48.3%
Ministral 3 3B 71 50 50 50 17 47.5%
Llama 3.1 8B 100 67 50 19 0 47.1%
Ministral 3B 87 49 34 30 27 45.5%
Qwen 2.5 72B 53 50 49 38 33 44.8%
Mistral Large 63 51 50 29 24 43.5%
Llama 3.1 70B 57 55 44 31 24 41.9%
Llama 3.1 Nemotron 70B 56 46 43 30 28 40.3%
Hermes 3 405B 55 48 48 44 0 39.0%
Claude 3.7 Sonnet 50 50 33 17 16 33.3%
GPT-4o, May 13th (temp=0) 50 43 39 29 0 32.1%
GPT-5.2 50 33 32 17 0 26.3%
Claude 3 Haiku 37 4 2 0 0 8.7%