Either/Or composite

Test: Novel outline

Avg. Score: 70.7%
Scenarios: 1

Overall Performance

| Rank | Model | Score | Avg. Cost | Avg. Time | Stability |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Flash Lite | 100.0% | $0.0003 | 1.0s | 100% |
| 2 | GPT-5.4 Nano (Reasoning) | 100.0% | $0.0008 | 4.8s | 100% |
| 3 | Z.AI GLM 4.5 | 100.0% | $0.0009 | 8.1s | 100% |
| 4 | Z.AI GLM 5 Turbo | 100.0% | $0.0030 | 7.2s | 100% |
| 5 | Qwen3 235B A22B Instruct 2507 | 100.0% | $0.0004 | 16.4s | 100% |
| 6 | ByteDance Seed 1.6 Flash | 100.0% | $0.0010 | 15.0s | 100% |
| 7 | Gemini 3 Flash (Preview, Reasoning) | 100.0% | $0.0041 | 6.0s | 100% |
| 8 | GPT-5.4 (Reasoning, Low) | 100.0% | $0.0051 | 5.3s | 100% |
| 9 | Qwen 3.6 Flash | 100.0% | $0.0035 | 10.6s | 100% |
| 10 | Qwen 3.5 Plus (2026-02-15) | 100.0% | $0.0021 | 16.3s | 100% |
| 11 | GPT-5.2 | 100.0% | $0.0055 | 6.4s | 100% |
| 12 | Qwen 3.5 Flash | 100.0% | $0.0011 | 20.6s | 100% |
| 13 | Writer: Palmyra X5 | 100.0% | $0.0039 | 12.4s | 100% |
| 14 | Qwen 3.6 35B | 100.0% | $0.0030 | 16.3s | 100% |
| 15 | Grok 4.20 (Beta, Reasoning) | 100.0% | $0.0072 | 4.3s | 100% |
| 16 | Z.AI GLM 5 | 100.0% | $0.0032 | 21.5s | 100% |
| 17 | MoonshotAI: Kimi K2.5 | 100.0% | $0.0033 | 23.3s | 100% |
| 18 | Aion 2.0 | 100.0% | $0.0028 | 24.9s | 100% |
| 19 | GPT-5.4 (Reasoning) | 100.0% | $0.0062 | 18.8s | 100% |
| 20 | Z.AI GLM 4.6 | 100.0% | $0.0023 | 34.1s | 100% |
| 21 | GPT-4o, May 13th (temp=0) | 100.0% | $0.013 | 7.2s | 100% |
| 22 | Qwen 3.5 35B | 100.0% | $0.0078 | 23.2s | 100% |
| 23 | Gemini 2.5 Pro | 100.0% | $0.013 | 9.0s | 100% |
| 24 | Qwen 3.5 122B | 100.0% | $0.0091 | 21.8s | 100% |
| 25 | Claude Sonnet 4.6 (Reasoning) | 100.0% | $0.014 | 9.3s | 100% |
| 26 | GPT-5.5 (Reasoning, Low) | 100.0% | $0.015 | 5.3s | 100% |
| 27 | Gemini 3 Pro (Preview) | 100.0% | $0.015 | 8.7s | 100% |
| 28 | MoonshotAI: Kimi K2.6 | 100.0% | $0.0050 | 39.7s | 100% |
| 29 | GPT-5.4 Nano (Reasoning, Low) | 95.0% | $0.0007 | 3.2s | 70% |
| 30 | GPT-5.5 (Reasoning) | 100.0% | $0.017 | 5.3s | 100% |
| 31 | Gemini 2.5 Flash Lite (Reasoning) | 95.0% | $0.0009 | 5.1s | 70% |
| 32 | Gemini 3.1 Pro (Preview) | 100.0% | $0.016 | 13.7s | 100% |
| 33 | Llama 3.1 Nemotron 70B | 95.0% | $0.0006 | 14.0s | 70% |
| 34 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.019 | 10.4s | 100% |
| 35 | DeepSeek-V2 Chat | 95.0% | $0.0003 | 17.9s | 70% |
| 36 | Qwen 3.5 397B A17B | 100.0% | $0.0077 | 48.3s | 100% |
| 37 | Claude Opus 4.7 (Reasoning) | 100.0% | $0.023 | 4.5s | 100% |
| 38 | Qwen3.7 Max | 100.0% | $0.017 | 26.5s | 100% |
| 39 | Qwen 3.5 27B | 100.0% | $0.0062 | 59.6s | 100% |
| 40 | o4 Mini High | 95.0% | $0.0054 | 13.6s | 70% |
| 41 | DeepSeek V4 Flash (Reasoning) | 90.0% | $0.0003 | 10.5s | 60% |
| 42 | MiniMax M2.7 | 90.0% | $0.0009 | 11.7s | 60% |
| 43 | Claude Opus 4.7 | 100.0% | $0.027 | 4.8s | 100% |
| 44 | GPT-5 Mini | 90.0% | $0.0016 | 11.1s | 60% |
| 45 | GPT-5 | 95.0% | $0.0093 | 14.9s | 70% |
| 46 | Gemma 4 31B (Reasoning) | 95.0% | $0.0005 | 45.5s | 70% |
| 47 | Mistral Large | 85.0% | $0.0037 | 2.0s | 54% |
| 48 | GPT-5.4 Mini (Reasoning) | 85.0% | $0.0032 | 5.8s | 54% |
| 49 | Qwen3.6 Max Preview | 100.0% | $0.017 | 52.2s | 100% |
| 50 | ByteDance Seed 2.0 Lite | 95.0% | $0.0039 | 41.7s | 70% |
| 51 | Qwen 3.5 Plus (2026-04-20) | 95.0% | $0.0058 | 36.0s | 70% |
| 52 | Qwen 3.6 27B | 95.0% | $0.0081 | 32.0s | 70% |
| 53 | Stealth: Hunter Alpha | 80.0% | $0.0000 | 7.5s | 51% |
| 54 | MiniMax M2.5 | 85.0% | $0.0010 | 16.4s | 54% |
| 55 | Mistral Large 3 | 90.0% | $0.0012 | 4.8s | 40% |
| 56 | Gemini 3.5 Flash (Reasoning) | 95.0% | $0.018 | 6.9s | 70% |
| 57 | Nemotron 3 Super | 80.0% | $0.0000 | 13.6s | 51% |
| 58 | Gemini 3.1 Flash Lite (Preview) | 80.0% | $0.0010 | 1.7s | 34% |
| 59 | Z.AI GLM 4.7 | 90.0% | $0.0029 | 51.1s | 60% |
| 60 | GPT-4.1 | 85.0% | $0.0046 | 7.3s | 36% |
| 61 | Gemini 2.5 Flash (Reasoning) | 80.0% | $0.0024 | 3.9s | 34% |
| 62 | Gemma 3 12B | 75.0% | $0.0002 | 9.5s | 38% |
| 63 | Gemini 3.1 Flash Lite (Reasoning) | 75.0% | $0.0009 | 1.6s | 33% |
| 64 | GPT-4o, Aug. 6th (temp=0) | 75.0% | $0.0037 | 1.1s | 38% |
| 65 | GPT-5.1 | 85.0% | $0.0056 | 9.0s | 36% |
| 66 | o4 Mini | 85.0% | $0.0049 | 11.6s | 36% |
| 67 | Qwen 3.5 9B | 95.0% | $0.0011 | 1.5m | 70% |
| 68 | DeepSeek V4 Pro (Reasoning) | 85.0% | $0.0027 | 46.4s | 54% |
| 69 | ByteDance Seed 1.6 | 80.0% | $0.0034 | 34.0s | 51% |
| 70 | GPT-4o Mini (temp=0) | 50.0% | $0.0002 | 1.5s | 50% |
| 71 | Z.AI GLM 5.1 | 90.0% | $0.0060 | 29.6s | 40% |
| 72 | Gemma 4 26B (Reasoning) | 90.0% | $0.0005 | 1.3m | 60% |
| 73 | Grok 4.20 (Reasoning) | 75.0% | $0.0045 | 12.1s | 38% |
| 74 | GPT-4o, Aug. 6th (temp=1) | 75.0% | $0.0059 | 3.9s | 33% |
| 75 | Stealth: Healer Alpha | 70.0% | $0.0000 | 5.0s | 25% |
| 76 | Gemini 3.1 Flash Lite | 75.0% | $0.0009 | 1.6s | 19% |
| 77 | GPT-4.1 Nano | 65.0% | $0.0001 | 4.1s | 27% |
| 78 | Xiaomi MIMO v2.5 | 75.0% | $0.0013 | 5.9s | 19% |
| 79 | GPT-4o Mini (temp=1) | 60.0% | $0.0003 | 1.9s | 30% |
| 80 | Inception Mercury 2 | 55.0% | $0.0005 | 836ms | 35% |
| 81 | Mistral Small 4 (Reasoning) | 70.0% | $0.0013 | 10.5s | 25% |
| 82 | Qwen 2.5 72B | 70.0% | $0.0008 | 14.9s | 25% |
| 83 | Mistral Small 4 | 65.0% | $0.0003 | 2.8s | 18% |
| 84 | Qwen 3 32B | 75.0% | $0.0010 | 36.8s | 33% |
| 85 | Claude Sonnet 4.6 | 80.0% | $0.0087 | 3.6s | 20% |
| 86 | Gemini 2.5 Flash | 60.0% | $0.0005 | 886ms | 20% |
| 87 | Gemma 4 31B | 60.0% | $0.0002 | 2.8s | 20% |
| 88 | GPT-4.1 Mini | 65.0% | $0.0007 | 5.8s | 16% |
| 89 | Llama 3.1 70B | 70.0% | $0.0012 | 17.6s | 20% |
| 90 | GPT-5 Nano | 70.0% | $0.0009 | 26.9s | 26% |
| 91 | Ministral 3 8B | 50.0% | $0.0003 | 2.3s | 28% |
| 92 | GPT-OSS 120B | 55.0% | $0.0003 | 22.8s | 35% |
| 93 | Claude Sonnet 4 | 70.0% | $0.0099 | 5.6s | 25% |
| 94 | Claude Haiku 4.5 | 65.0% | $0.0032 | 3.1s | 10% |
| 95 | Z.AI GLM 4.7 Flash | 70.0% | $0.0010 | 39.3s | 25% |
| 96 | DeepSeek V4 Flash | 60.0% | $0.0001 | 5.3s | 10% |
| 97 | Gemini 3.5 Flash (Reasoning, Minimal) | 60.0% | $0.0041 | 1.5s | 13% |
| 98 | Mistral Large 2 | 50.0% | $0.0048 | 7.9s | 28% |
| 99 | ByteDance Seed 2.0 Mini | 95.0% | $0.0024 | 2.3m | 70% |
| 100 | Claude Opus 4 | 95.0% | $0.047 | 11.4s | 70% |
| 101 | Inception Mercury | 40.0% | $0.0001 | 1.2s | 20% |
| 102 | Ministral 3 3B | 40.0% | $0.0002 | 1.5s | 20% |
| 103 | WizardLM 2 8x22b | 55.0% | $0.0019 | 11.7s | 15% |
| 104 | Xiaomi MIMO v2.5 Pro | 60.0% | $0.0024 | 11.9s | 10% |
| 105 | Mistral Small Creative | 40.0% | $0.0003 | 3.6s | 20% |
| 106 | Arcee AI: Trinity Mini | 35.0% | $0.0002 | 6.7s | 27% |
| 107 | Gemini 3 Flash (Preview) | 40.0% | $0.0013 | 1.7s | 20% |
| 108 | Grok 4.1 Fast | 35.0% | $0.0006 | 7.0s | 27% |
| 109 | Z.AI GLM 4.5 Air | 55.0% | $0.0009 | 12.3s | 8% |
| 110 | Grok 4.20 | 45.0% | $0.0019 | 3.6s | 15% |
| 111 | DeepSeek V3.1 | 55.0% | $0.0005 | 13.6s | 8% |
| 112 | Stealth: Aurora Alpha | 60.0% | | 937ms | 30% |
| 113 | GPT-5.4 Nano | 45.0% | $0.0006 | 1.9s | 8% |
| 114 | DeepSeek V3 (2024-12-26) | 55.0% | $0.0008 | 10.1s | 4% |
| 115 | DeepSeek V3.2 | 50.0% | $0.0005 | 13.5s | 11% |
| 116 | Hermes 3 70B | 55.0% | $0.0006 | 10.8s | 4% |
| 117 | Ministral 3B | 45.0% | $0.0001 | 3.9s | 8% |
| 118 | Grok 4.20 (Beta) | 55.0% | $0.0023 | 13.5s | 8% |
| 119 | GPT-4o, May 13th (temp=1) | 65.0% | $0.014 | 7.2s | 16% |
| 120 | Cohere Command R+ (Aug. 2024) | 40.0% | $0.0046 | 2.4s | 20% |
| 121 | Ministral 3 14B | 35.0% | $0.0004 | 5.0s | 18% |
| 122 | DeepSeek V4 Pro | 45.0% | $0.0006 | 6.3s | 8% |
| 123 | Gemma 4 26B | 60.0% | $0.0003 | 20.4s | 2% |
| 124 | Hermes 3 405B | 30.0% | $0.0000 | 11.9s | 26% |
| 125 | GPT-5.4 Mini (Reasoning, Low) | 35.0% | $0.0021 | 3.2s | 18% |
| 126 | Grok 4 | 70.0% | $0.017 | 20.2s | 25% |
| 127 | LFM2 24B | 35.0% | $0.0001 | 9.9s | 18% |
| 128 | Gemma 3 27B | 40.0% | $0.0002 | 9.7s | 13% |
| 129 | Rocinante 12B | 35.0% | $0.0005 | 13.5s | 18% |
| 130 | Claude Opus 4.5 | 55.0% | $0.016 | 4.9s | 23% |
| 131 | Grok 4.3 (Reasoning) | 65.0% | $0.0058 | 36.0s | 10% |
| 132 | Llama 3.1 8B | 35.0% | $0.0002 | 2.6s | 5% |
| 133 | Nemotron 3 Nano | 35.0% | $0.0005 | 32.9s | 27% |
| 134 | Mistral NeMO | 35.0% | $0.0003 | 3.5s | 5% |
| 135 | GPT-5.4 Mini | 25.0% | $0.0013 | 851ms | 13% |
| 136 | Claude Sonnet 4.5 | 50.0% | $0.0094 | 4.5s | 5% |
| 137 | Claude 3 Haiku | 35.0% | $0.0006 | 1.6s | 0% |
| 138 | GPT-5.4 | 35.0% | $0.0035 | 2.7s | 5% |
| 139 | Arcee AI: Trinity Large (Preview) | 25.0% | $0.0000 | 2.1s | 0% |
| 140 | Grok 4 Fast | 25.0% | $0.0005 | 3.8s | 0% |
| 141 | GPT-5.5 | 25.0% | $0.0083 | 1.4s | 13% |
| 142 | Gemma 3 4B | 20.0% | $0.0001 | 3.5s | 0% |
| 143 | DeepSeek V3 (2025-03-24) | 25.0% | $0.0008 | 9.5s | 0% |
| 144 | Claude Opus 4.6 | 40.0% | $0.012 | 5.5s | 0% |
| 145 | Grok 4.3 | 15.0% | $0.0017 | 955ms | 0% |
| 146 | Ministral 8B | 15.0% | $0.0002 | 8.4s | 0% |
| 147 | Claude 3.7 Sonnet | 20.0% | $0.0062 | 3.5s | 0% |
| 148 | Mistral Medium 3.1 | 10.0% | $0.0013 | 5.3s | 0% |
| 149 | Mistral Small 3.2 24B | 5.0% | $0.0002 | 3.0s | 0% |
| 150 | Claude 3.5 Sonnet | 10.0% | $0.0062 | 5.2s | 0% |

Individual Scenarios

pov-count

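| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 85.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 85.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 85.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 80.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 80.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 75.0% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 75.0% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 75.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 75.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 75.0% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 70.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 70.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 70.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 70.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 70.0% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 70.0% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 70.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 65.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 65.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 65.0% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 65.0% |
| Mistral Small 4 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 65.0% |
| GPT-4.1 Nano | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 65.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 60.0% |
| Gemma 4 31B | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 60.0% |
| Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 60.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Stealth: Aurora Alpha | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 60.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 60.0% |
| GPT-4o Mini (temp=1) | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0% |
| Claude Opus 4.5 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 55.0% |
| GPT-OSS 120B | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 55.0% |
| Inception Mercury 2 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 0 | 55.0% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 55.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 55.0% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 0 | 55.0% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 55.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 50.0% |
| Mistral Large 2 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 50.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 50.0% |
| GPT-4o Mini (temp=0) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0% |
| Ministral 3 8B | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 50.0% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 45.0% |
| Grok 4.20 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 45.0% |
| GPT-5.4 Nano | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 45.0% |
| Ministral 3B | 100 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 45.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Gemini 3 Flash (Preview) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 40.0% |
| Inception Mercury | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 40.0% |
| Gemma 3 27B | 100 | 100 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 40.0% |
| Mistral Small Creative | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 40.0% |
| Cohere Command R+ (Aug. 2024) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 40.0% |
| Ministral 3 3B | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 40.0% |
| Grok 4.1 Fast | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 35.0% |
| GPT-5.4 | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 35.0% |
| Nemotron 3 Nano | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0% |
| Ministral 3 14B | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 35.0% |
| Claude 3 Haiku | 100 | 100 | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 35.0% |
| Arcee AI: Trinity Mini | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 35.0% |
| Mistral NeMO | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 35.0% |
| Llama 3.1 8B | 100 | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 35.0% |
| LFM2 24B | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 35.0% |
| Rocinante 12B | 100 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 35.0% |
| Hermes 3 405B | 50 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 30.0% |
| GPT-5.5 | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0% |
| Grok 4 Fast | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0% |
| GPT-5.4 Mini | 50 | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 25.0% |
| DeepSeek V3 (2025-03-24) | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 25.0% |
| Claude 3.7 Sonnet | 100 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Gemma 3 4B | 50 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Grok 4.3 | 50 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0% |
| Ministral 8B | 100 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0% |
| Claude 3.5 Sonnet | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Mistral Medium 3.1 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Mistral Small 3.2 24B | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0% |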