Matches Regex

Test: Voice/dialogue sheets

Avg. Score: 63.0%
Scenarios: 5

Overall Performance

| Rank | Model | Score | Avg. Cost | Avg. Time | Stability |
|---|---|---|---|---|---|
| 1 | Gemma 4 31B | 100.0% | $0.0001 | 8.3s | 100% |
| 2 | Claude Sonnet 4.6 | 100.0% | $0.0033 | 1.9s | 100% |
| 3 | Claude Sonnet 4 | 100.0% | $0.0033 | 2.8s | 100% |
| 4 | Claude 3.7 Sonnet | 100.0% | $0.0034 | 3.5s | 100% |
| 5 | Claude Opus 4.6 | 100.0% | $0.0055 | 3.9s | 100% |
| 6 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.0060 | 3.1s | 100% |
| 7 | GPT-4.1 | 96.0% | $0.0017 | 2.6s | 61% |
| 8 | GPT-4o, May 13th (temp=0) | 96.0% | $0.0037 | 5.2s | 61% |
| 9 | GPT-4o, Aug. 6th (temp=0) | 92.0% | $0.0022 | 2.4s | 46% |
| 10 | Gemini 2.5 Flash | 88.0% | $0.0005 | 933ms | 35% |
| 11 | ByteDance Seed 1.6 | 94.0% | $0.0013 | 14.2s | 53% |
| 12 | Gemma 4 26B | 88.0% | $0.0001 | 4.5s | 35% |
| 13 | Qwen 3.5 Plus (2026-02-15) | 86.0% | $0.0005 | 6.6s | 31% |
| 14 | Grok 4 Fast | 84.0% | $0.0003 | 3.6s | 27% |
| 15 | Grok 4.1 Fast | 84.0% | $0.0004 | 4.4s | 27% |
| 16 | GPT-4o, May 13th (temp=1) | 90.0% | $0.0037 | 5.2s | 40% |
| 17 | DeepSeek V3 (2025-03-24) | 84.0% | $0.0003 | 6.1s | 27% |
| 18 | Grok 4.20 (Reasoning) | 92.0% | $0.0035 | 12.9s | 46% |
| 19 | Claude Haiku 4.5 | 82.0% | $0.0011 | 1.6s | 23% |
| 20 | Gemini 3 Flash (Preview) | 80.0% | $0.0006 | 1.8s | 20% |
| 21 | Claude Sonnet 4.6 (Reasoning) | 90.0% | $0.0053 | 3.3s | 40% |
| 22 | Mistral Large 3 | 80.0% | $0.0004 | 4.1s | 20% |
| 23 | Gemini 3.1 Flash Lite (Reasoning) | 78.0% | $0.0003 | 1.4s | 17% |
| 24 | GPT-4o, Aug. 6th (temp=1) | 82.0% | $0.0022 | 2.3s | 23% |
| 25 | DeepSeek-V2 Chat | 80.0% | $0.0001 | 8.2s | 20% |
| 26 | Gemini 3.1 Flash Lite | 76.0% | $0.0003 | 1.1s | 15% |
| 27 | Claude Sonnet 4.5 | 84.0% | $0.0033 | 2.8s | 27% |
| 28 | DeepSeek V4 Pro | 82.0% | $0.0004 | 12.1s | 23% |
| 29 | GPT-4.1 Mini | 76.0% | $0.0003 | 2.7s | 15% |
| 30 | Stealth: Healer Alpha | 76.0% | $0.0000 | 5.0s | 15% |
| 31 | Z.AI GLM 4.5 | 76.0% | $0.0004 | 5.3s | 15% |
| 32 | Mistral Small 3.2 24B | 72.0% | $0.0001 | 2.1s | 10% |
| 33 | DeepSeek V3 (2024-12-26) | 74.0% | $0.0003 | 4.6s | 12% |
| 34 | Llama 3.1 70B | 72.0% | $0.0005 | 1.5s | 10% |
| 35 | Writer: Palmyra X5 | 78.0% | $0.0010 | 8.1s | 17% |
| 36 | DeepSeek V3.1 | 76.0% | $0.0002 | 8.4s | 15% |
| 37 | Grok 4.20 | 72.0% | $0.0007 | 1.5s | 10% |
| 38 | Gemini 3.1 Flash Lite (Preview) | 70.0% | $0.0003 | 979ms | 8% |
| 39 | Qwen3 235B A22B Instruct 2507 | 72.0% | $0.0001 | 4.9s | 10% |
| 40 | Claude 3.5 Sonnet | 80.0% | $0.0034 | 4.5s | 20% |
| 41 | Gemini 2.5 Flash Lite (Reasoning) | 70.0% | $0.0003 | 2.7s | 8% |
| 42 | Gemma 4 31B (Reasoning) | 92.0% | $0.0003 | 43.9s | 46% |
| 43 | Gemini 2.5 Flash Lite | 66.0% | $0.0001 | 610ms | 5% |
| 44 | Z.AI GLM 5 Turbo | 78.0% | $0.0030 | 7.7s | 17% |
| 45 | Hermes 3 70B | 68.0% | $0.0002 | 6.0s | 7% |
| 46 | DeepSeek V4 Flash (Reasoning) | 78.0% | $0.0002 | 20.6s | 17% |
| 47 | ByteDance Seed 1.6 Flash | 66.0% | $0.0002 | 4.5s | 5% |
| 48 | GPT-5.4 (Reasoning) | 80.0% | $0.0049 | 5.4s | 20% |
| 49 | Gemini 3 Flash (Preview, Reasoning) | 72.0% | $0.0024 | 4.3s | 10% |
| 50 | Xiaomi MIMO v2.5 | 70.0% | $0.0013 | 6.2s | 8% |
| 51 | Xiaomi MIMO v2.5 Pro | 72.0% | $0.0015 | 7.9s | 10% |
| 52 | Mistral Large 2 | 70.0% | $0.0018 | 4.7s | 8% |
| 53 | Cohere Command R+ (Aug. 2024) | 70.0% | $0.0023 | 3.2s | 8% |
| 54 | Mistral Small 4 | 60.0% | $0.0001 | 1.3s | 2% |
| 55 | DeepSeek V4 Flash | 62.0% | $0.0001 | 3.7s | 3% |
| 56 | GPT-4o Mini (temp=0) | 60.0% | $0.0001 | 3.7s | 2% |
| 57 | Grok 4.20 (Beta) | 60.0% | $0.0009 | 735ms | 2% |
| 58 | Grok 4.3 | 60.0% | $0.0008 | 1.7s | 2% |
| 59 | Claude Opus 4.7 (Reasoning) | 82.0% | $0.0080 | 1.8s | 23% |
| 60 | Qwen 3.5 122B | 88.0% | $0.0073 | 17.5s | 35% |
| 61 | Llama 3.1 8B | 54.0% | $0.0001 | 919ms | 0% |
| 62 | GPT-5 Mini | 72.0% | $0.0018 | 13.0s | 10% |
| 63 | GPT-5.5 (Reasoning, Low) | 78.0% | $0.0065 | 3.3s | 17% |
| 64 | Qwen 3.6 35B | 72.0% | $0.0022 | 12.2s | 10% |
| 65 | GPT-5.4 Mini (Reasoning) | 64.0% | $0.0018 | 4.1s | 4% |
| 66 | Stealth: Hunter Alpha | 70.0% | $0.0000 | 18.7s | 8% |
| 67 | ByteDance Seed 2.0 Lite | 76.0% | $0.0018 | 20.6s | 15% |
| 68 | Claude Opus 4.7 | 80.0% | $0.0080 | 2.2s | 20% |
| 69 | Qwen 3.5 35B | 82.0% | $0.0051 | 17.7s | 23% |
| 70 | Grok 4.3 (Reasoning) | 80.0% | $0.0038 | 20.2s | 20% |
| 71 | Gemma 3 12B | 52.0% | $0.0000 | 4.1s | 0% |
| 72 | Z.AI GLM 4.7 Flash | 70.0% | $0.0005 | 21.5s | 8% |
| 73 | MiniMax M2.7 | 56.0% | $0.0006 | 6.8s | 1% |
| 74 | Qwen 3.6 Flash | 68.0% | $0.0031 | 10.1s | 7% |
| 75 | Arcee AI: Trinity Large (Preview) | 48.0% | $0.0000 | 4.8s | 0% |
| 76 | Claude Opus 4.5 | 70.0% | $0.0055 | 2.9s | 8% |
| 77 | Hermes 3 405B | 58.0% | $0.0000 | 13.3s | 1% |
| 78 | Mistral Small 4 (Reasoning) | 48.0% | $0.0004 | 4.2s | 0% |
| 79 | DeepSeek V3.2 | 50.0% | $0.0002 | 6.7s | 0% |
| 80 | GPT-4o Mini (temp=1) | 60.0% | $0.0001 | 15.0s | 2% |
| 81 | GPT-5.4 Mini (Reasoning, Low) | 48.0% | $0.0010 | 2.2s | 0% |
| 82 | Qwen 3.5 Flash | 70.0% | $0.0012 | 22.0s | 8% |
| 83 | Ministral 3 8B | 40.0% | $0.0001 | 1.2s | 0% |
| 84 | Rocinante 12B | 52.0% | $0.0002 | 9.1s | 0% |
| 85 | GPT-4.1 Nano | 40.0% | $0.0001 | 2.0s | 0% |
| 86 | GPT-5.4 Nano (Reasoning, Low) | 40.0% | $0.0002 | 1.4s | 0% |
| 87 | GPT-5.2 | 54.0% | $0.0025 | 2.1s | 0% |
| 88 | Mistral Small Creative | 38.0% | $0.0001 | 1.2s | 0% |
| 89 | Grok 4.20 (Beta, Reasoning) | 78.0% | $0.0075 | 9.9s | 17% |
| 90 | GPT-5.4 (Reasoning, Low) | 58.0% | $0.0034 | 3.2s | 1% |
| 91 | GPT-5.5 (Reasoning) | 76.0% | $0.0083 | 4.6s | 15% |
| 92 | GPT-5.1 | 64.0% | $0.0041 | 6.9s | 4% |
| 93 | MiniMax M2.5 | 48.0% | $0.0006 | 7.7s | 0% |
| 94 | GPT-5.4 Mini | 40.0% | $0.0009 | 1.1s | 0% |
| 95 | Claude 3 Haiku | 40.0% | $0.0003 | 3.7s | 0% |
| 96 | Inception Mercury | 34.0% | $0.0001 | 961ms | 0% |
| 97 | Aion 2.0 | 66.0% | $0.0015 | 20.9s | 5% |
| 98 | Z.AI GLM 5.1 | 82.0% | $0.0045 | 33.2s | 23% |
| 99 | WizardLM 2 8x22b | 42.0% | $0.0004 | 7.6s | 0% |
| 100 | o4 Mini | 60.0% | $0.0036 | 8.6s | 2% |
| 101 | Nemotron 3 Super | 42.0% | $0.0000 | 9.8s | 0% |
| 102 | Gemma 3 4B | 30.0% | $0.0000 | 1.9s | 0% |
| 103 | GPT-5.4 Nano | 30.0% | $0.0002 | 1.4s | 0% |
| 104 | Gemini 2.5 Flash (Reasoning) | 38.0% | $0.0014 | 2.7s | 0% |
| 105 | GPT-5.5 | 60.0% | $0.0058 | 2.3s | 2% |
| 106 | Gemma 3 27B | 30.0% | $0.0001 | 5.0s | 0% |
| 107 | Inception Mercury 2 | 26.0% | $0.0004 | 823ms | 0% |
| 108 | GPT-5.4 Nano (Reasoning) | 24.0% | $0.0003 | 1.5s | 0% |
| 109 | Qwen 3.6 27B | 76.0% | $0.0056 | 25.4s | 15% |
| 110 | Gemma 4 26B (Reasoning) | 60.0% | $0.0003 | 27.8s | 2% |
| 111 | Llama 3.1 Nemotron 70B | 30.0% | $0.0002 | 6.9s | 0% |
| 112 | Ministral 3 14B | 20.0% | $0.0001 | 1.5s | 0% |
| 113 | Mistral NeMO | 22.0% | $0.0001 | 3.2s | 0% |
| 114 | Z.AI GLM 4.5 Air | 36.0% | $0.0005 | 11.1s | 0% |
| 115 | Mistral Medium 3.1 | 22.0% | $0.0004 | 3.2s | 0% |
| 116 | Qwen 3.5 27B | 66.0% | $0.0044 | 20.4s | 5% |
| 117 | GPT-5.4 | 34.0% | $0.0030 | 2.0s | 0% |
| 118 | Qwen 2.5 72B | 20.0% | $0.0002 | 4.9s | 0% |
| 119 | Qwen 3 32B | 50.0% | $0.0006 | 24.6s | 0% |
| 120 | GPT-OSS 120B | 38.0% | $0.0003 | 18.0s | 0% |
| 121 | GPT-5 Nano | 44.0% | $0.0007 | 21.1s | 0% |
| 122 | Ministral 3B | 10.0% | $0.0000 | 1.1s | 0% |
| 123 | ByteDance Seed 2.0 Mini | 64.0% | $0.0007 | 37.7s | 4% |
| 124 | Claude Opus 4 | 86.0% | $0.017 | 7.0s | 31% |
| 125 | Ministral 8B | 8.0% | $0.0001 | 1.5s | 0% |
| 126 | Qwen3.6 Max Preview | 96.0% | $0.014 | 46.9s | 61% |
| 127 | Nemotron 3 Nano | 22.0% | $0.0002 | 13.0s | 0% |
| 128 | Qwen 3.5 Plus (2026-04-20) | 64.0% | $0.0042 | 29.1s | 4% |
| 129 | Ministral 3 3B | 0.0% | $0.0001 | 835ms | 0% |
| 130 | Z.AI GLM 5 | 56.0% | $0.0035 | 26.7s | 1% |
| 131 | LFM2 24B | 0.0% | $0.0000 | 2.9s | 0% |
| 132 | Arcee AI: Trinity Mini | 8.0% | $0.0001 | 8.1s | 0% |
| 133 | Qwen 3.5 9B | 76.0% | $0.0007 | 1.1m | 15% |
| 134 | Gemini 3.1 Pro (Preview) | 74.0% | $0.013 | 12.9s | 12% |
| 135 | MoonshotAI: Kimi K2.5 | 50.0% | $0.0040 | 30.8s | 0% |
| 136 | Z.AI GLM 4.6 | 54.0% | $0.0025 | 39.8s | 0% |
| 137 | Grok 4 | 68.0% | $0.012 | 17.7s | 7% |
| 138 | Stealth: Aurora Alpha | 34.0% |  | 1.7s | 0% |
| 139 | Mistral Large | 26.0% | $0.0070 | 6.2s | 0% |
| 140 | GPT-5 | 58.0% | $0.011 | 19.2s | 1% |
| 141 | Gemini 3 Pro (Preview) | 68.0% | $0.016 | 10.7s | 7% |
| 142 | Gemini 2.5 Pro | 40.0% | $0.011 | 8.4s | 0% |
| 143 | DeepSeek V4 Pro (Reasoning) | 70.0% | $0.0039 | 1.1m | 8% |
| 144 | Qwen 3.5 397B A17B | 82.0% | $0.0086 | 1.1m | 23% |
| 145 | Z.AI GLM 4.7 | 60.0% | $0.0027 | 1.1m | 2% |
| 146 | o4 Mini High | 58.0% | $0.0052 | 58.3s | 1% |
| 147 | MoonshotAI: Kimi K2.6 | 50.0% | $0.0066 | 1.1m | 0% |

Average score: 63.02%

Individual Scenarios

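| Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| MiniMax M2.5 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Grok 4.3 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| GPT-5.1 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Qwen 3.5 Flash | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| GPT-5.5 (Reasoning) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| MoonshotAI: Kimi K2.5 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Qwen 3.6 Flash | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| GPT-5.2 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| MiniMax M2.7 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Z.AI GLM 4.5 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Stealth: Aurora Alpha | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Gemini 2.5 Flash Lite | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Inception Mercury | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| GPT-5 Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5.5 (Reasoning, Low) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| o4 Mini High | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| o4 Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5.4 Mini (Reasoning, Low) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Mistral Large 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| DeepSeek-V2 Chat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Nemotron 3 Super | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Claude 3.5 Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Grok 4.20 (Beta) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Z.AI GLM 4.5 Air | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Hermes 3 405B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5.4 Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Mistral Small 4 (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Qwen 3 32B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| DeepSeek V4 Flash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Grok 4.20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5.4 Nano (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Mistral Large | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Writer: Palmyra X5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5.4 Nano (Reasoning, Low) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Mistral Small 3.2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Llama 3.1 70B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Mistral Medium 3.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Mistral Small 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Qwen 2.5 72B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Llama 3.1 Nemotron 70B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5.4 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Mistral Small Creative | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Hermes 3 70B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Ministral 3 14B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Ministral 3 8B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Ministral 3 3B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Ministral 8B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Llama 3.1 8B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Ministral 3B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Rocinante 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |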
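
| Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Gemma 3 4B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Llama 3.1 8B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Z.AI GLM 5 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| o4 Mini High | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Z.AI GLM 4.7 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Gemini 2.5 Pro | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| o4 Mini | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Gemma 3 27B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Ministral 3B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Grok 4 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Claude Haiku 4.5 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Nemotron 3 Super | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Inception Mercury | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| GPT-5.4 Nano | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| WizardLM 2 8x22b | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Arcee AI: Trinity Mini | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Ministral 8B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| GPT-5.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| MoonshotAI: Kimi K2.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5.4 (Reasoning, Low) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| MoonshotAI: Kimi K2.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Claude Opus 4.7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Claude Opus 4.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-OSS 120B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Inception Mercury 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Stealth: Aurora Alpha | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Z.AI GLM 4.5 Air | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5.4 Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5.4 Nano (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Mistral Large | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-5.4 Nano (Reasoning, Low) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Mistral Medium 3.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Qwen 2.5 72B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Llama 3.1 Nemotron 70B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Mistral Small Creative | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Ministral 3 14B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Ministral 3 8B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Ministral 3 3B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Mistral NeMO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |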
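
| Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Mistral NeMO | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Stealth: Aurora Alpha | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Mistral Large | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Inception Mercury | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| MiniMax M2.7 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Mistral Medium 3.1 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Ministral 3B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Gemini 2.5 Pro | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Inception Mercury 2 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Nemotron 3 Nano | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| GPT-5.4 Nano | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Gemma 3 4B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Gemini 2.5 Flash (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Qwen 2.5 72B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Ministral 3 14B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Ministral 3 3B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Ministral 8B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |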
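
| Model | # 1 | # 2 | # 3 | # 4 | # 5 | # 6 | # 7 | # 8 | # 9 | # 10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Stealth: Aurora Alpha | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Gemma 3 27B | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Ministral 8B | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Inception Mercury 2 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| Inception Mercury | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20.0% |
| MiniMax M2.5 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Nemotron 3 Nano | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Ministral 3B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0% |
| Qwen 2.5 72B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| Ministral 3 3B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |
| LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0% |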
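Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 80.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 70.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Grok 4.20 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 60.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 50.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 40.0%
Claude Opus 4 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
GPT-5.4 Nano | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.0%
Grok 4.3 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Gemma 3 12B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Mistral Small 4 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
Mistral NeMO | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.0%
GPT-5.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Grok 4.20 (Beta) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Hermes 3 405B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Mini | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-5.4 Nano (Reasoning) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Large | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4o Mini (temp=0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Medium 3.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Llama 3.1 Nemotron 70B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Mistral Small Creative | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 14B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 8B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
WizardLM 2 8x22b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 3B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 8B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%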