No hallucinated violations

Test: Codex Red Herring (False Positive Detection)

Avg. Score: 61.1%
Scenarios: 8

Overall Performance

| Rank | Model | Score | Avg. Cost | Avg. Time | Stability |
|---|---|---|---|---|---|
| 1 | Inception Mercury | 98.8% | $0.0004 | 8.0s | 84% |
| 2 | Inception Mercury 2 | 97.5% | $0.0027 | 3.8s | 78% |
| 3 | Grok 4.1 Fast | 97.9% | $0.0019 | 12.5s | 79% |
| 4 | Nemotron 3 Super | 99.4% | $0.0000 | 1.3m | 89% |
| 5 | GPT-5.4 Mini (Reasoning, Low) | 96.9% | $0.0034 | 4.0s | 76% |
| 6 | ByteDance Seed 1.6 Flash | 95.7% | $0.0008 | 9.1s | 69% |
| 7 | o4 Mini | 97.5% | $0.014 | 25.0s | 78% |
| 8 | Z.AI GLM 5 Turbo | 96.9% | $0.0071 | 16.0s | 73% |
| 9 | GPT-5.4 Nano (Reasoning, Low) | 96.7% | $0.0008 | 3.9s | 66% |
| 10 | GPT-5.4 Nano (Reasoning) | 95.2% | $0.0017 | 11.4s | 66% |
| 11 | Gemini 2.5 Flash Lite (Reasoning) | 94.4% | $0.0023 | 16.6s | 68% |
| 12 | GPT-5.1 | 96.9% | $0.025 | 26.1s | 76% |
| 13 | o4 Mini High | 98.1% | $0.027 | 52.5s | 81% |
| 14 | GPT-5 Mini | 95.0% | $0.0059 | 37.8s | 67% |
| 15 | MiniMax M2.7 | 93.1% | $0.0047 | 34.6s | 66% |
| 16 | DeepSeek V4 Flash (Reasoning) | 92.3% | $0.0009 | 30.8s | 63% |
| 17 | GPT-5 Nano | 95.0% | $0.0035 | 1.1m | 70% |
| 18 | GPT-5.4 Mini (Reasoning) | 92.3% | $0.0090 | 10.8s | 63% |
| 19 | Gemini 2.5 Flash (Reasoning) | 91.3% | $0.0085 | 14.2s | 62% |
| 20 | Stealth: Healer Alpha | 89.4% | $0.0000 | 21.5s | 59% |
| 21 | Grok 4.3 (Reasoning) | 94.4% | $0.014 | 50.4s | 66% |
| 22 | Xiaomi MIMO v2.5 | 90.4% | $0.0075 | 30.3s | 60% |
| 23 | GPT-OSS 120B | 90.6% | $0.0012 | 56.4s | 61% |
| 24 | ByteDance Seed 1.6 | 89.6% | $0.0043 | 32.7s | 56% |
| 25 | Mistral Small 4 (Reasoning) | 87.5% | $0.0025 | 22.6s | 53% |
| 26 | Z.AI GLM 5 | 95.2% | $0.017 | 1.4m | 69% |
| 27 | GPT-5.2 | 89.3% | $0.013 | 14.5s | 55% |
| 28 | Aion 2.0 | 91.9% | $0.0096 | 1.3m | 63% |
| 29 | Grok 4.20 (Beta, Reasoning) | 91.5% | $0.025 | 15.9s | 57% |
| 30 | MiniMax M2.5 | 85.5% | $0.0032 | 25.9s | 50% |
| 31 | Qwen 3.5 Plus (2026-04-20) | 95.0% | $0.020 | 1.9m | 70% |
| 32 | GPT-4.1 | 85.9% | $0.0048 | 1.2s | 43% |
| 33 | Grok 4.20 (Reasoning) | 90.2% | $0.017 | 47.9s | 55% |
| 34 | Xiaomi MIMO v2.5 Pro | 89.8% | $0.015 | 1.1m | 57% |
| 35 | Gemma 4 26B (Reasoning) | 94.4% | $0.0027 | 2.6m | 64% |
| 36 | Grok 4 Fast | 80.3% | $0.0019 | 19.2s | 44% |
| 37 | Qwen 3.6 Flash | 84.8% | $0.016 | 44.4s | 53% |
| 38 | GPT-5.4 (Reasoning, Low) | 82.5% | $0.012 | 8.8s | 44% |
| 39 | Z.AI GLM 5.1 | 94.8% | $0.024 | 2.1m | 66% |
| 40 | Qwen 3 32B | 77.9% | $0.0014 | 37.7s | 44% |
| 41 | Gemma 4 31B | 77.3% | $0.0009 | 11.7s | 39% |
| 42 | Stealth: Hunter Alpha | 80.3% | $0.0000 | 1.0m | 46% |
| 43 | Claude Opus 4.7 (Reasoning) | 94.4% | $0.075 | 13.1s | 66% |
| 44 | Nemotron 3 Nano | 91.5% | $0.0027 | 2.9m | 61% |
| 45 | Arcee AI: Trinity Mini | 79.2% | $0.0009 | 26.3s | 33% |
| 46 | Z.AI GLM 4.5 Air | 81.8% | $0.0027 | 43.1s | 35% |
| 47 | Qwen 3.6 35B | 80.8% | $0.017 | 1.2m | 50% |
| 48 | GPT-5.4 Nano | 69.5% | $0.0005 | 1.5s | 28% |
| 49 | Qwen 3.6 27B | 90.7% | $0.033 | 2.3m | 57% |
| 50 | Gemma 4 31B (Reasoning) | 87.7% | $0.0025 | 3.1m | 52% |
| 51 | GPT-5.4 (Reasoning) | 78.9% | $0.032 | 31.9s | 43% |
| 52 | Z.AI GLM 4.7 Flash | 83.4% | $0.0039 | 2.5m | 49% |
| 53 | Gemma 4 26B | 68.8% | $0.0012 | 39.3s | 27% |
| 54 | GPT-5 | 83.5% | $0.048 | 1.5m | 52% |
| 55 | GPT-5.4 Mini | 55.4% | $0.0020 | 1.2s | 30% |
| 56 | Ministral 3 8B | 74.5% | $0.0013 | 17.9s | 14% |
| 57 | Z.AI GLM 4.6 | 74.6% | $0.015 | 59.0s | 31% |
| 58 | Claude Opus 4.6 (Reasoning) | 96.3% | $0.120 | 1.0m | 74% |
| 59 | Claude Opus 4.6 | 79.7% | $0.049 | 13.2s | 34% |
| 60 | MoonshotAI: Kimi K2.5 | 75.6% | $0.017 | 2.6m | 48% |
| 61 | Z.AI GLM 4.7 | 75.9% | $0.023 | 2.2m | 46% |
| 62 | DeepSeek V4 Pro (Reasoning) | 82.5% | $0.017 | 3.4m | 52% |
| 63 | LFM2 24B | 71.9% | $0.0004 | 39.0s | 13% |
| 64 | ByteDance Seed 2.0 Mini | 79.8% | $0.0033 | 3.1m | 41% |
| 65 | Claude Haiku 4.5 | 51.9% | $0.0080 | 4.2s | 28% |
| 66 | Z.AI GLM 4.5 | 63.4% | $0.0038 | 25.0s | 19% |
| 67 | GPT-5.5 (Reasoning, Low) | 67.7% | $0.025 | 12.0s | 24% |
| 68 | GPT-5.4 | 40.2% | $0.0048 | 3.1s | 34% |
| 69 | GPT-5.5 | 55.9% | $0.013 | 2.9s | 21% |
| 70 | Claude Sonnet 4.6 | 66.2% | $0.033 | 17.4s | 25% |
| 71 | GPT-4.1 Nano | 58.4% | $0.0005 | 4.6s | 8% |
| 72 | Ministral 8B | 60.8% | $0.0007 | 4.4s | 6% |
| 73 | ByteDance Seed 2.0 Lite | 57.9% | $0.0067 | 1.0m | 25% |
| 74 | Cohere Command R+ (Aug. 2024) | 60.6% | $0.017 | 7.4s | 15% |
| 75 | Gemini 3 Flash (Preview, Reasoning) | 67.6% | $0.028 | 49.2s | 20% |
| 76 | Qwen 3.5 9B | 82.8% | $0.0038 | 4.7m | 44% |
| 77 | Gemini 3.1 Flash Lite (Reasoning) | 40.3% | $0.0015 | 1.9s | 19% |
| 78 | Gemini 3.1 Flash Lite | 41.2% | $0.0017 | 3.2s | 18% |
| 79 | Qwen 3.5 Flash | 63.6% | $0.0071 | 2.0m | 23% |
| 80 | Gemini 2.5 Flash | 31.4% | $0.0021 | 1.5s | 23% |
| 81 | DeepSeek V3 (2024-12-26) | 34.3% | $0.0023 | 6.6s | 21% |
| 82 | GPT-4o, Aug. 6th (temp=1) | 45.0% | $0.0094 | 2.4s | 13% |
| 83 | GPT-5.5 (Reasoning) | 66.7% | $0.051 | 24.1s | 23% |
| 84 | Hermes 3 405B | 47.5% | $0.0059 | 5.2s | 9% |
| 85 | Gemini 3.1 Flash Lite (Preview) | 35.7% | $0.0017 | 1.6s | 14% |
| 86 | Claude Opus 4.5 | 52.4% | $0.036 | 4.4s | 20% |
| 87 | Hermes 3 70B | 43.4% | $0.0018 | 12.1s | 7% |
| 88 | Grok 4 | 75.5% | $0.074 | 1.6m | 37% |
| 89 | Claude Sonnet 4 | 45.5% | $0.022 | 4.5s | 13% |
| 90 | DeepSeek-V2 Chat | 30.9% | $0.0023 | 8.3s | 15% |
| 91 | Gemini 3 Flash (Preview) | 25.0% | $0.0034 | 3.2s | 20% |
| 92 | DeepSeek V3 (2025-03-24) | 31.5% | $0.0015 | 15.7s | 15% |
| 93 | Mistral Medium 3.1 | 39.4% | $0.0032 | 4.7s | 5% |
| 94 | DeepSeek V4 Pro | 36.1% | $0.0069 | 42.0s | 17% |
| 95 | GPT-4o, Aug. 6th (temp=0) | 31.4% | $0.011 | 2.6s | 14% |
| 96 | Qwen 2.5 72B | 35.2% | $0.0009 | 11.3s | 6% |
| 97 | GPT-4.1 Mini | 35.6% | $0.0018 | 6.9s | 4% |
| 98 | Claude Sonnet 4.5 | 32.8% | $0.024 | 5.9s | 19% |
| 99 | Grok 4.3 | 36.1% | $0.0077 | 4.4s | 6% |
| 100 | Grok 4.20 (Beta) | 35.3% | $0.0061 | 4.3s | 5% |
| 101 | Claude Opus 4.7 | 52.9% | $0.049 | 3.9s | 14% |
| 102 | Llama 3.1 70B | 33.5% | $0.0031 | 16.4s | 7% |
| 103 | Claude 3 Haiku | 19.3% | $0.0020 | 3.5s | 16% |
| 104 | Rocinante 12B | 32.3% | $0.0015 | 13.3s | 6% |
| 105 | Qwen3.6 Max Preview | 82.8% | $0.072 | 3.8m | 50% |
| 106 | Claude 3.7 Sonnet | 35.5% | $0.022 | 4.3s | 13% |
| 107 | GPT-4o Mini (temp=1) | 28.2% | $0.0008 | 5.6s | 6% |
| 108 | Gemini 2.5 Flash Lite | 32.1% | $0.0010 | 4.0s | 2% |
| 109 | Gemini 2.5 Pro | 70.4% | $0.073 | 52.1s | 21% |
| 110 | Mistral Small 3.2 24B | 31.3% | $0.0010 | 11.9s | 3% |
| 111 | Mistral Large 3 | 32.2% | $0.0040 | 11.7s | 2% |
| 112 | WizardLM 2 8x22b | 32.6% | $0.0041 | 55.3s | 12% |
| 113 | Qwen3 235B A22B Instruct 2507 | 26.3% | $0.0011 | 32.0s | 10% |
| 114 | Writer: Palmyra X5 | 24.0% | $0.0085 | 13.8s | 12% |
| 115 | GPT-4o Mini (temp=0) | 24.5% | $0.0009 | 14.5s | 6% |
| 116 | DeepSeek V3.1 | 24.5% | $0.0019 | 36.6s | 11% |
| 117 | Qwen 3.5 35B | 64.5% | $0.044 | 2.4m | 23% |
| 118 | Gemini 3.1 Pro (Preview) | 76.9% | $0.120 | 1.5m | 45% |
| 119 | Grok 4.20 | 25.6% | $0.0082 | 11.7s | 8% |
| 120 | Claude 3.5 Sonnet | 39.1% | $0.045 | 6.9s | 15% |
| 121 | Ministral 3 14B | 29.3% | $0.0019 | 29.2s | 1% |
| 122 | DeepSeek V3.2 | 21.3% | $0.0018 | 42.8s | 11% |
| 123 | Mistral Large 2 | 32.2% | $0.016 | 11.4s | 2% |
| 124 | DeepSeek V4 Flash | 17.8% | $0.0005 | 12.2s | 6% |
| 125 | Claude Sonnet 4.6 (Reasoning) | 87.5% | $0.145 | 2.2m | 55% |
| 126 | Mistral Large | 31.9% | $0.016 | 12.3s | 2% |
| 127 | Gemma 3 12B | 13.7% | $0.0005 | 11.7s | 10% |
| 128 | MoonshotAI: Kimi K2.6 | 76.9% | $0.047 | 5.2m | 47% |
| 129 | Gemma 3 27B | 12.5% | $0.0007 | 12.6s | 10% |
| 130 | GPT-4o, May 13th (temp=1) | 23.5% | $0.033 | 3.0s | 16% |
| 131 | Llama 3.1 Nemotron 70B | 22.6% | $0.0077 | 21.4s | 5% |
| 132 | Mistral Small 4 | 12.3% | $0.0012 | 7.4s | 6% |
| 133 | Ministral 3B | 9.9% | $0.0004 | 12.4s | 6% |
| 134 | Mistral Small Creative | 9.3% | $0.0011 | 8.9s | 4% |
| 135 | Llama 3.1 8B | 10.0% | $0.0003 | 33.8s | 8% |
| 136 | Gemini 3 Pro (Preview) | 75.6% | $0.144 | 1.7m | 47% |
| 137 | Gemma 3 4B | 5.6% | $0.0003 | 15.0s | 5% |
| 138 | Ministral 3 3B | 6.7% | $0.0013 | 31.1s | 5% |
| 139 | GPT-4o, May 13th (temp=0) | 16.7% | $0.043 | 6.4s | 11% |
| 140 | Mistral NeMO | 14.7% | $0.0018 | 1.6m | 4% |
| 141 | Arcee AI: Trinity Large (Preview) | 15.7% | $0.0000 | 1.7m | 2% |
| 142 | Qwen 3.5 Plus (2026-02-15) | 22.0% | $0.021 | 2.0m | 12% |
| 143 | Qwen 3.5 397B A17B | 61.5% | $0.028 | 5.4m | 23% |
| 144 | Claude Opus 4 | 36.0% | $0.109 | 7.1s | 22% |
| 145 | Qwen 3.5 27B | 68.8% | $0.056 | 4.8m | 19% |
| 146 | Qwen 3.5 122B | 70.1% | $0.135 | 6.4m | 20% |
Average score: 61.12%

Individual Scenarios

basic entries

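| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 20 | 92.0% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 2 | 90.2% |
| GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 1 | 90.1% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 86.7% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 86.7% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 83.3% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 83.3% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 75.0% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 33 | 33 | 73.3% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 25 | 72.5% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 33 | 33 | 70.0% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 33 | 20 | 68.7% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 68.3% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 68.3% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 68.3% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 33 | 33 | 33 | 66.7% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 33 | 33 | 33 | 66.7% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 25 | 7 | 66.5% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 0 | 63.3% |
| Qwen 3.6 35B | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 63.3% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 33 | 61.7% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 33 | 33 | 33 | 33 | 33 | 33 | 60.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 60.0% |
| GPT-5 Mini | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 60.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 33 | 33 | 33 | 33 | 33 | 33 | 60.0% |
| Qwen 3.6 Flash | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 58.3% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 20 | 7 | 57.7% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 55.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 55.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 55.0% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 20 | 5 | 4 | 54.5% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 53.3% |
| Gemini 3 Pro (Preview) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 53.3% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 25 | 52.5% |
| GPT-5.4 Mini | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 51.7% |
| Qwen3.6 Max Preview | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0% |
| GPT-5.4 Mini (Reasoning) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 48.3% |
| Z.AI GLM 4.7 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 46.7% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 46.7% |
| Rocinante 12B | 100 | 100 | 100 | 50 | 25 | 20 | 20 | 20 | 10 | 6 | 45.1% |
| Gemini 2.5 Pro | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 45.0% |
| Qwen 3.5 35B | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 25 | 42.5% |
| Qwen 3.5 9B | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 25 | 42.5% |
| Grok 4 | 100 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 40.0% |
| Qwen 3.5 Flash | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 40.0% |
| GPT-4.1 | 100 | 100 | 50 | 33 | 33 | 20 | 17 | 17 | 17 | 10 | 39.7% |
| Hermes 3 405B | 100 | 100 | 33 | 33 | 25 | 20 | 20 | 17 | 13 | 13 | 37.3% |
| Qwen 3.5 27B | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 35.8% |
| Qwen 3.5 397B A17B | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 34.2% |
| ByteDance Seed 2.0 Lite | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 33.3% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 8 | 8 | 4 | 4 | 3 | 1 | 1 | 32.9% |
| GPT-4.1 Mini | 50 | 50 | 50 | 50 | 50 | 50 | 8 | 8 | 8 | 4 | 32.8% |
| Mistral NeMO | 100 | 100 | 100 | 8 | 6 | 5 | 4 | 3 | 0 | 0 | 32.6% |
| Qwen 3.5 122B | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 32.5% |
| GPT-5.5 | 50 | 50 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 17 | 32.5% |
| Ministral 8B | 100 | 100 | 100 | 6 | 3 | 3 | 2 | 2 | 2 | 2 | 32.0% |
| ByteDance Seed 2.0 Mini | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 20 | 32.0% |
| Claude Opus 4.5 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 17 | 31.7% |
| Grok 4 Fast | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 30.8% |
| Gemini 3 Flash (Preview, Reasoning) | 50 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 20 | 29.5% |
| Z.AI GLM 4.5 | 100 | 50 | 33 | 33 | 25 | 20 | 11 | 8 | 5 | 2 | 28.8% |
| GPT-5.4 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 20 | 20 | 27.3% |
| Claude Haiku 4.5 | 50 | 33 | 25 | 25 | 25 | 25 | 25 | 20 | 20 | 20 | 26.8% |
| Cohere Command R+ (Aug. 2024) | 100 | 33 | 33 | 20 | 17 | 17 | 14 | 14 | 10 | 9 | 26.8% |
| Claude Opus 4.7 | 33 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 20 | 20 | 25.5% |
| Hermes 3 70B | 100 | 50 | 25 | 14 | 13 | 11 | 10 | 9 | 8 | 8 | 24.8% |
| DeepSeek V3 (2024-12-26) | 50 | 50 | 33 | 25 | 25 | 20 | 10 | 9 | 8 | 5 | 23.6% |
| Llama 3.1 70B | 100 | 33 | 33 | 17 | 13 | 10 | 8 | 7 | 6 | 6 | 23.4% |
| Claude 3 Haiku | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 20 | 13 | 23.3% |
| Claude 3.5 Sonnet | 33 | 33 | 25 | 20 | 20 | 20 | 20 | 20 | 17 | 14 | 22.3% |
| DeepSeek V4 Pro | 50 | 50 | 33 | 33 | 20 | 17 | 8 | 5 | 4 | 2 | 22.3% |
| Gemini 2.5 Flash | 33 | 33 | 25 | 25 | 25 | 20 | 17 | 14 | 10 | 10 | 21.3% |
| GPT-4o, Aug. 6th (temp=1) | 50 | 50 | 20 | 17 | 17 | 14 | 14 | 11 | 10 | 7 | 21.0% |
| Claude Sonnet 4 | 33 | 25 | 20 | 20 | 20 | 20 | 17 | 17 | 17 | 17 | 20.5% |
| Grok 4.3 | 100 | 17 | 17 | 14 | 11 | 10 | 9 | 9 | 6 | 3 | 19.7% |
| Gemini 3.1 Flash Lite (Preview) | 25 | 20 | 20 | 20 | 20 | 20 | 20 | 17 | 17 | 10 | 18.8% |
| Claude Opus 4 | 25 | 25 | 25 | 20 | 20 | 17 | 17 | 17 | 11 | 9 | 18.5% |
| Gemma 4 31B | 25 | 25 | 20 | 20 | 20 | 17 | 17 | 17 | 13 | 11 | 18.4% |
| Gemini 3.1 Flash Lite | 20 | 20 | 20 | 20 | 20 | 17 | 17 | 17 | 14 | 14 | 17.9% |
| Gemini 3.1 Flash Lite (Reasoning) | 20 | 20 | 20 | 20 | 17 | 17 | 17 | 17 | 13 | 13 | 17.2% |
| DeepSeek V3 (2025-03-24) | 25 | 25 | 20 | 20 | 17 | 14 | 14 | 13 | 13 | 8 | 16.9% |
| DeepSeek-V2 Chat | 50 | 33 | 17 | 13 | 13 | 11 | 11 | 9 | 5 | 4 | 16.6% |
| WizardLM 2 8x22b | 50 | 50 | 25 | 20 | 7 | 4 | 3 | 3 | 1 | 1 | 16.4% |
| Grok 4.20 (Beta) | 100 | 13 | 11 | 11 | 8 | 5 | 5 | 5 | 3 | 3 | 16.4% |
| GPT-4o, May 13th (temp=1) | 25 | 25 | 20 | 20 | 17 | 14 | 13 | 11 | 8 | 7 | 15.9% |
| Gemma 4 26B | 33 | 25 | 17 | 14 | 13 | 13 | 11 | 11 | 11 | 11 | 15.9% |
| Claude 3.7 Sonnet | 20 | 17 | 17 | 17 | 17 | 17 | 14 | 14 | 13 | 9 | 15.3% |
| Claude Sonnet 4.5 | 20 | 17 | 17 | 17 | 14 | 14 | 14 | 14 | 14 | 11 | 15.3% |
| Gemini 3 Flash (Preview) | 17 | 17 | 14 | 14 | 14 | 14 | 10 | 9 | 9 | 6 | 12.4% |
| GPT-4o Mini (temp=1) | 25 | 13 | 11 | 11 | 10 | 10 | 8 | 8 | 8 | 7 | 11.1% |
| GPT-4o, Aug. 6th (temp=0) | 14 | 14 | 13 | 13 | 11 | 11 | 9 | 8 | 8 | 7 | 10.9% |
| Qwen 3.5 Plus (2026-02-15) | 17 | 14 | 13 | 13 | 11 | 9 | 8 | 8 | 7 | 2 | 10.1% |
| Mistral Small 4 | 50 | 8 | 8 | 7 | 6 | 5 | 5 | 3 | 3 | 3 | 9.8% |
| DeepSeek V3.1 | 17 | 17 | 13 | 13 | 9 | 9 | 7 | 7 | 1 | 1 | 9.3% |
| Mistral Large 2 | 17 | 11 | 11 | 11 | 10 | 8 | 7 | 6 | 6 | 5 | 9.2% |
| Llama 3.1 Nemotron 70B | 13 | 11 | 11 | 10 | 10 | 9 | 9 | 7 | 7 | 6 | 9.2% |
| Gemma 3 27B | 11 | 11 | 11 | 9 | 9 | 9 | 8 | 8 | 7 | 5 | 8.9% |
| Gemma 3 12B | 25 | 9 | 8 | 8 | 8 | 7 | 6 | 6 | 5 | 4 | 8.6% |
| Qwen 2.5 72B | 20 | 14 | 13 | 8 | 7 | 6 | 6 | 5 | 5 | 2 | 8.6% |
| Mistral Large 3 | 14 | 11 | 10 | 10 | 8 | 7 | 7 | 7 | 6 | 4 | 8.3% |
| Llama 3.1 8B | 33 | 11 | 9 | 6 | 5 | 5 | 5 | 4 | 2 | 0 | 8.2% |
| Ministral 3B | 25 | 14 | 9 | 7 | 6 | 5 | 5 | 3 | 3 | 3 | 8.0% |
| Writer: Palmyra X5 | 33 | 10 | 8 | 5 | 4 | 4 | 3 | 3 | 3 | 3 | 7.7% |
| Ministral 3 3B | 33 | 8 | 6 | 4 | 4 | 3 | 3 | 2 | 1 | 0 | 6.4% |
| Mistral Large | 14 | 7 | 7 | 6 | 6 | 6 | 5 | 5 | 5 | 4 | 6.4% |
| DeepSeek V3.2 | 13 | 10 | 7 | 7 | 7 | 5 | 4 | 3 | 2 | 2 | 5.9% |
| GPT-4o, May 13th (temp=0) | 10 | 9 | 8 | 8 | 5 | 4 | 4 | 3 | 2 | 2 | 5.7% |
| Mistral Medium 3.1 | 7 | 7 | 6 | 6 | 6 | 6 | 6 | 5 | 5 | 3 | 5.6% |
| Mistral Small 3.2 24B | 8 | 7 | 6 | 6 | 6 | 5 | 5 | 5 | 4 | 2 | 5.5% |
| Grok 4.20 | 17 | 6 | 5 | 5 | 5 | 4 | 4 | 3 | 3 | 3 | 5.4% |
| GPT-4o Mini (temp=0) | 7 | 7 | 6 | 6 | 6 | 6 | 6 | 5 | 3 | 1 | 5.2% |
| Qwen3 235B A22B Instruct 2507 | 17 | 5 | 5 | 5 | 4 | 4 | 4 | 3 | 3 | 2 | 5.1% |
| DeepSeek V4 Flash | 17 | 6 | 5 | 4 | 3 | 3 | 2 | 2 | 1 | 1 | 4.4% |
| Mistral Small Creative | 5 | 4 | 4 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2.9% |
| Arcee AI: Trinity Large (Preview) | 13 | 3 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 0 | 2.5% |
| Gemma 3 4B | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2.3% |
| Ministral 3 14B | 3 | 3 | 3 | 3 | 2 | 1 | 1 | 1 | 1 | 0 | 1.8% |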
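| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 14 | 91.4% |
| GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 9 | 90.9% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 7 | 90.7% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 3 | 90.3% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 88.3% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 25 | 87.5% |
| Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 83.3% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 78.3% |
| Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 13 | 77.9% |
| Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 25 | 75.8% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 25 | 75.8% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 17 | 17 | 13 | 74.6% |
| Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 13 | 11 | 11 | 73.5% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 73.3% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 25 | 70.8% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 17 | 14 | 13 | 67.7% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 17 | 14 | 11 | 67.5% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 25 | 25 | 63.3% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 14 | 8 | 62.2% |
| Qwen 3.5 122B | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 61.7% |
| Gemma 4 26B | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 61.7% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 50 | 14 | 10 | 8 | 5 | 58.8% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 25 | 11 | 58.6% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 50 | 4 | 4 | 2 | 2 | 56.2% |
| Qwen 3.5 35B | 100 | 100 | 100 | 50 | 50 | 33 | 33 | 33 | 25 | 25 | 55.0% |
| Gemini 3 Pro (Preview) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 53.3% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 53.3% |
| GPT-5.4 Mini | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 25 | 50.8% |
| Claude Haiku 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 48.3% |
| ByteDance Seed 2.0 Lite | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 45.0% |
| Grok 4.20 | 100 | 100 | 100 | 100 | 13 | 13 | 8 | 4 | 4 | 3 | 44.4% |
| Qwen 3.5 Flash | 100 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 25 | 25 | 43.3% |
| GPT-5.4 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 25 | 14 | 42.3% |
| DeepSeek V4 Pro | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 7 | 6 | 36.3% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 33 | 33 | 33 | 33 | 33 | 25 | 17 | 14 | 14 | 33.7% |
| Hermes 3 405B | 100 | 50 | 33 | 25 | 25 | 25 | 20 | 20 | 20 | 17 | 33.5% |
| Claude 3.5 Sonnet | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 32.5% |
| Claude Sonnet 4 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 20 | 32.0% |
| Claude Opus 4 | 50 | 50 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 20 | 32.0% |
| GPT-5.5 | 50 | 50 | 50 | 33 | 33 | 25 | 20 | 20 | 20 | 17 | 31.8% |
| WizardLM 2 8x22b | 100 | 50 | 50 | 50 | 17 | 13 | 11 | 10 | 10 | 8 | 31.8% |
| DeepSeek-V2 Chat | 50 | 50 | 33 | 33 | 25 | 25 | 20 | 20 | 20 | 11 | 28.8% |
| Qwen 2.5 72B | 100 | 50 | 33 | 25 | 20 | 13 | 13 | 13 | 13 | 8 | 28.6% |
| DeepSeek V3.1 | 100 | 50 | 25 | 25 | 17 | 17 | 14 | 11 | 10 | 8 | 27.6% |
| Cohere Command R+ (Aug. 2024) | 100 | 33 | 33 | 20 | 20 | 20 | 17 | 14 | 11 | 2 | 27.1% |
| DeepSeek V3 (2024-12-26) | 50 | 33 | 25 | 25 | 25 | 25 | 25 | 25 | 17 | 11 | 26.1% |
| Mistral NeMO | 100 | 100 | 11 | 8 | 8 | 7 | 7 | 7 | 7 | 2 | 25.6% |
| Claude Opus 4.5 | 33 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 20 | 20 | 25.5% |
| GPT-4o Mini (temp=1) | 100 | 25 | 25 | 25 | 17 | 14 | 14 | 13 | 13 | 8 | 25.4% |
| Claude 3.7 Sonnet | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 20 | 20 | 20 | 25.2% |
| Claude Opus 4.7 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 20 | 20 | 24.0% |
| Rocinante 12B | 100 | 33 | 20 | 17 | 17 | 14 | 13 | 11 | 8 | 6 | 23.9% |
| Grok 4.3 | 100 | 33 | 25 | 14 | 14 | 11 | 11 | 8 | 8 | 7 | 23.2% |
| Llama 3.1 70B | 50 | 33 | 33 | 20 | 20 | 20 | 20 | 13 | 9 | 6 | 22.4% |
| Llama 3.1 Nemotron 70B | 100 | 25 | 20 | 14 | 13 | 13 | 10 | 10 | 7 | 6 | 21.7% |
| DeepSeek V4 Flash | 50 | 50 | 50 | 11 | 10 | 8 | 7 | 7 | 7 | 6 | 20.6% |
| Gemini 3 Flash (Preview) | 25 | 25 | 25 | 20 | 20 | 20 | 17 | 17 | 17 | 17 | 20.2% |
| DeepSeek V3 (2025-03-24) | 33 | 20 | 20 | 20 | 20 | 20 | 20 | 17 | 17 | 14 | 20.1% |
| GPT-4o, May 13th (temp=1) | 50 | 33 | 20 | 17 | 17 | 14 | 14 | 14 | 11 | 8 | 19.9% |
| Claude Opus 4.6 | 33 | 33 | 25 | 20 | 14 | 14 | 14 | 13 | 11 | 11 | 18.9% |
| Gemini 2.5 Flash | 25 | 25 | 25 | 20 | 20 | 20 | 17 | 14 | 13 | 10 | 18.8% |
| Mistral Small 4 | 100 | 14 | 14 | 13 | 10 | 10 | 9 | 8 | 6 | 3 | 18.7% |
| Gemma 3 12B | 33 | 17 | 17 | 17 | 17 | 17 | 14 | 14 | 14 | 14 | 17.4% |
| DeepSeek V3.2 | 33 | 33 | 33 | 25 | 9 | 9 | 9 | 8 | 8 | 5 | 17.4% |
| Claude Sonnet 4.5 | 20 | 20 | 20 | 17 | 17 | 17 | 17 | 17 | 14 | 14 | 17.2% |
| Qwen 3.5 Plus (2026-02-15) | 50 | 20 | 20 | 17 | 13 | 13 | 13 | 13 | 10 | 4 | 17.1% |
| Gemma 3 27B | 20 | 20 | 20 | 20 | 17 | 17 | 17 | 17 | 14 | 10 | 17.1% |
| GPT-4o, Aug. 6th (temp=0) | 20 | 20 | 17 | 17 | 17 | 17 | 14 | 14 | 13 | 13 | 16.0% |
| GPT-4o Mini (temp=0) | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 13 | 13 | 15.8% |
| Claude 3 Haiku | 20 | 20 | 17 | 17 | 17 | 14 | 14 | 14 | 13 | 11 | 15.6% |
| Ministral 3B | 33 | 25 | 13 | 13 | 11 | 11 | 10 | 8 | 8 | 6 | 13.8% |
| Writer: Palmyra X5 | 33 | 20 | 14 | 14 | 13 | 10 | 9 | 9 | 8 | 4 | 13.5% |
| GPT-4.1 Mini | 20 | 20 | 20 | 17 | 13 | 9 | 9 | 9 | 8 | 6 | 13.0% |
| Qwen3 235B A22B Instruct 2507 | 33 | 20 | 14 | 11 | 10 | 9 | 8 | 8 | 7 | 5 | 12.6% |
| Mistral Large 3 | 11 | 11 | 11 | 11 | 11 | 10 | 10 | 10 | 7 | 7 | 9.9% |
| Ministral 3 3B | 25 | 14 | 10 | 9 | 8 | 8 | 8 | 7 | 4 | 1 | 9.5% |
| Llama 3.1 8B | 14 | 11 | 11 | 10 | 10 | 8 | 8 | 6 | 6 | 5 | 9.1% |
| GPT-4o, May 13th (temp=0) | 17 | 17 | 14 | 14 | 13 | 6 | 3 | 3 | 1 | 1 | 8.8% |
| Mistral Large | 11 | 11 | 9 | 9 | 8 | 8 | 8 | 8 | 6 | 6 | 8.5% |
| Mistral Large 2 | 11 | 9 | 9 | 9 | 8 | 7 | 7 | 7 | 6 | 5 | 7.9% |
| Arcee AI: Trinity Large (Preview) | 10 | 10 | 10 | 8 | 8 | 7 | 6 | 6 | 6 | 5 | 7.4% |
| Gemma 3 4B | 7 | 6 | 6 | 6 | 6 | 5 | 5 | 5 | 4 | 4 | 5.4% |
| Mistral Small Creative | 8 | 6 | 6 | 5 | 4 | 4 | 4 | 3 | 2 | 2 | 4.4% |
| Ministral 3 14B | 6 | 6 | 6 | 5 | 5 | 4 | 3 | 3 | 3 | 3 | 4.3% |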
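| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 8 | 90.8% |
| Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 3 | 90.3% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 88.3% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 9 | 85.9% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 9 | 81.9% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 20 | 80.3% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 20 | 80.3% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 75.0% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 11 | 6 | 71.7% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 25 | 20 | 10 | 70.5% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0% |
| GPT-5 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0% |
| MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 68.3% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 68.3% |
| GPT-5.5 (Reasoning) | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 65.0% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 33 | 25 | 64.2% |
| Stealth: Hunter Alpha | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 63.3% |
| Grok 4 Fast | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 63.3% |
| Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 6 | 5 | 4 | 1 | 61.6% |
| Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 33 | 25 | 20 | 17 | 14 | 60.9% |
| Qwen 3.5 122B | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 60.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 60.0% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 50 | 25 | 8 | 5 | 0 | 58.9% |
| Qwen 3.5 35B | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 58.3% |
| Claude Opus 4.5 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0% |
| Qwen 3.5 Flash | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0% |
| ByteDance Seed 2.0 Lite | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0% |
| GPT-5.4 Mini | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0% |
| Claude Haiku 4.5 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 53.3% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 50 | 50 | 33 | 25 | 25 | 20 | 17 | 52.0% |
| WizardLM 2 8x22b | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 51.7% |
| GPT-5.5 (Reasoning, Low) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0% |
| Claude Opus 4 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0% |
| GPT-5.4 (Reasoning) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 50.0% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 50 | 50 | 50 | 25 | 8 | 7 | 5 | 49.4% |
| Claude Sonnet 4.5 | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 48.3% |
| Writer: Palmyra X5 | 100 | 100 | 50 | 50 | 50 | 33 | 33 | 33 | 25 | 7 | 48.2% |
| DeepSeek V4 Pro | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 20 | 13 | 8 | 47.4% |
| Claude 3.7 Sonnet | 100 | 100 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 46.7% |
| GPT-4.1 Nano | 100 | 100 | 100 | 100 | 50 | 3 | 2 | 2 | 2 | 2 | 46.0% |
| GPT-5.4 (Reasoning, Low) | 100 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 45.0% |
| Qwen3 235B A22B Instruct 2507 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 25 | 25 | 45.0% |
| DeepSeek V3 (2024-12-26) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 25 | 14 | 11 | 40.0% |
| Grok 4.3 | 100 | 100 | 33 | 33 | 25 | 25 | 20 | 14 | 7 | 6 | 36.4% |
| GPT-4o, Aug. 6th (temp=0) | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 17 | 33.3% |
| Rocinante 12B | 100 | 50 | 50 | 33 | 25 | 20 | 20 | 17 | 14 | 1 | 33.0% |
| GPT-5.4 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 32.5% |
| Gemini 3 Flash (Preview) | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 32.5% |
| DeepSeek V3.2 | 50 | 50 | 50 | 33 | 33 | 33 | 17 | 17 | 17 | 9 | 30.9% |
| Gemini 3.1 Flash Lite (Preview) | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 20 | 30.3% |
| Gemini 3.1 Flash Lite | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 20 | 30.3% |
| Gemini 3.1 Flash Lite (Reasoning) | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 14 | 28.9% |
| DeepSeek V3 (2025-03-24) | 50 | 50 | 33 | 25 | 25 | 25 | 25 | 20 | 14 | 14 | 28.2% |
| DeepSeek V3.1 | 50 | 50 | 33 | 33 | 33 | 25 | 25 | 13 | 9 | 9 | 28.1% |
| Gemini 2.5 Flash | 50 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 17 | 7 | 25.5% |
| Grok 4.20 | 50 | 33 | 33 | 33 | 33 | 25 | 20 | 17 | 6 | 4 | 25.5% |
| Qwen 2.5 72B | 100 | 33 | 25 | 25 | 20 | 11 | 11 | 10 | 8 | 8 | 25.2% |
| GPT-4o, May 13th (temp=1) | 33 | 33 | 33 | 25 | 25 | 20 | 20 | 20 | 20 | 11 | 24.1% |
| DeepSeek-V2 Chat | 50 | 33 | 25 | 25 | 20 | 20 | 17 | 13 | 13 | 10 | 22.5% |
| Llama 3.1 70B | 100 | 20 | 17 | 14 | 13 | 13 | 11 | 11 | 10 | 7 | 21.5% |
| Claude Sonnet 4.6 | 33 | 25 | 25 | 25 | 20 | 20 | 17 | 17 | 17 | 14 | 21.3% |
| Grok 4.20 (Beta) | 100 | 25 | 20 | 14 | 10 | 10 | 10 | 10 | 6 | 6 | 21.1% |
| Mistral NeMO | 100 | 33 | 13 | 10 | 10 | 10 | 7 | 6 | 4 | 3 | 19.5% |
| GPT-4o Mini (temp=1) | 25 | 25 | 20 | 20 | 17 | 17 | 17 | 11 | 9 | 7 | 16.7% |
| Qwen 3.5 Plus (2026-02-15) | 33 | 33 | 20 | 20 | 17 | 17 | 10 | 8 | 7 | 2 | 16.7% |
| Claude 3 Haiku | 20 | 17 | 17 | 14 | 14 | 14 | 14 | 14 | 11 | 9 | 14.5% |
| DeepSeek V4 Flash | 50 | 20 | 17 | 9 | 9 | 9 | 7 | 6 | 6 | 5 | 13.8% |
| Mistral Medium 3.1 | 25 | 17 | 17 | 11 | 11 | 10 | 10 | 9 | 9 | 9 | 12.8% |
| GPT-4.1 Mini | 17 | 14 | 14 | 13 | 13 | 13 | 11 | 10 | 10 | 9 | 12.3% |
| Mistral Small 3.2 24B | 14 | 14 | 14 | 13 | 13 | 13 | 11 | 11 | 10 | 8 | 12.1% |
| GPT-4o, May 13th (temp=0) | 25 | 25 | 20 | 20 | 20 | 4 | 2 | 1 | 1 | 1 | 12.0% |
| GPT-4o Mini (temp=0) | 14 | 14 | 14 | 14 | 14 | 13 | 13 | 11 | 10 | 2 | 11.9% |
| Mistral Large 3 | 14 | 14 | 14 | 14 | 10 | 10 | 10 | 9 | 8 | 8 | 11.2% |
| Mistral Large | 14 | 14 | 14 | 14 | 11 | 11 | 8 | 8 | 8 | 8 | 11.1% |
| Gemini 2.5 Flash Lite | 20 | 20 | 10 | 10 | 10 | 8 | 8 | 8 | 7 | 7 | 10.9% |
| Gemma 3 12B | 14 | 14 | 13 | 11 | 11 | 11 | 10 | 10 | 8 | 4 | 10.6% |
| Gemma 3 27B | 11 | 11 | 11 | 11 | 10 | 10 | 10 | 10 | 10 | 9 | 10.4% |
| Mistral Small Creative | 33 | 9 | 8 | 8 | 7 | 7 | 7 | 7 | 7 | 6 | 10.0% |
| Llama 3.1 Nemotron 70B | 17 | 11 | 11 | 10 | 10 | 9 | 9 | 8 | 7 | 7 | 9.9% |
| Mistral Large 2 | 17 | 17 | 11 | 9 | 8 | 8 | 8 | 7 | 7 | 7 | 9.8% |
| Ministral 3 3B | 14 | 14 | 14 | 11 | 9 | 8 | 7 | 7 | 6 | 6 | 9.6% |
| Mistral Small 4 | 13 | 13 | 11 | 10 | 9 | 9 | 8 | 8 | 7 | 7 | 9.4% |
| Ministral 3B | 14 | 11 | 10 | 10 | 9 | 8 | 7 | 7 | 7 | 5 | 8.8% |
| Llama 3.1 8B | 17 | 14 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 7.4% |
| Gemma 3 4B | 10 | 9 | 9 | 6 | 6 | 5 | 3 | 3 | 2 | 2 | 5.4% |
| Arcee AI: Trinity Large (Preview) | 8 | 6 | 6 | 6 | 6 | 5 | 5 | 5 | 4 | 3 | 5.4% |
| Ministral 3 14B | 6 | 6 | 6 | 5 | 5 | 5 | 4 | 3 | 3 | 1 | 4.3% |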
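| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Claude Opus 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0% |
| GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0% |
| Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3% |
| Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 5 | 90.5% |
| Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0% |
| Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 88.3% |
| Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0% |
| Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 83.3% |
| DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 83.3% |
| GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 25 | 82.5% |
| Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 25 | 82.5% |
| Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 11 | 0 | 81.1% |
| Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 7 | 80.7% |
| Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0% |
| Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0% |
| Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0% |
| MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0% |
| Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0% |
| Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 73.3% |
| Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 73.3% |
| GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 33 | 71.7% |
| Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 20 | 17 | 17 | 70.3% |
| Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0% |
| Claude Sonnet 4 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 68.3% |
| Grok 4 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 25 | 67.5% |
| GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 65.0% |
| Z.AI GLM 4.7 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 65.0% |
| Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 14 | 13 | 8 | 7 | 64.2% |
| GPT-5 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0% |
| Stealth: Hunter Alpha | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0% |
| Claude Haiku 4.5 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0% |
| DeepSeek-V2 Chat | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 58.3% |
| GPT-5.4 Mini | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 58.3% |
| Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 33 | 20 | 9 | 9 | 8 | 57.9% |
| Hermes 3 405B | 100 | 100 | 100 | 100 | 33 | 33 | 33 | 33 | 25 | 17 | 57.5% |
| Grok 4.20 (Beta) | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 33 | 33 | 20 | 57.0% |
| Qwen 2.5 72B | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 17 | 14 | 8 | 55.6% |
| GPT-5.5 (Reasoning) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0% |
| DeepSeek V4 Pro (Reasoning) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0% |
| Claude Sonnet 4.5 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 55.0% |
| Qwen 3.5 35B | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0% |
| GPT-5.4 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 53.3% |
| GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 13 | 8 | 6 | 4 | 1 | 53.2% |
| Grok 4.20 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 25 | 52.5% |
| Claude 3.5 Sonnet | 100 | 100 | 100 | 50 | 33 | 33 | 25 | 25 | 25 | 20 | 51.2% |
| MoonshotAI: Kimi K2.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0% |
| Claude Opus 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0% |
| Qwen 3.5 Flash | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0% |
| ByteDance Seed 2.0 Lite | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0% |
| DeepSeek V3 (2024-12-26) | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 7 | 49.0% |
| GPT-5.4 (Reasoning) | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 48.3% |
| GPT-4.1 Mini | 100 | 100 | 100 | 100 | 20 | 17 | 14 | 11 | 11 | 9 | 48.2% |
| GPT-4.1 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 47.5% |
| Hermes 3 70B | 100 | 100 | 100 | 50 | 33 | 25 | 25 | 17 | 14 | 9 | 47.3% |
| DeepSeek V3 (2025-03-24) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 45.0% |
| Gemini 2.5 Flash | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 45.0% |
| GPT-4o, Aug. 6th (temp=0) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 25 | 44.2% |
| Rocinante 12B | 100 | 100 | 100 | 50 | 20 | 20 | 20 | 11 | 10 | 8 | 43.9% |
| Grok 4.3 | 100 | 100 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 11 | 41.9% |
| Gemini 3.1 Flash Lite (Reasoning) | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 25 | 25 | 25 | 39.2% |
| Claude Opus 4 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 36.7% |
| Qwen3 235B A22B Instruct 2507 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 9 | 35.9% |
| Writer: Palmyra X5 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 25 | 8 | 6 | 35.5% |
| WizardLM 2 8x22b | 50 | 50 | 50 | 50 | 50 | 33 | 25 | 17 | 17 | 11 | 35.3% |
| Gemini 3.1 Flash Lite | 50 | 50 | 50 | 50 | 33 | 25 | 25 | 25 | 20 | 20 | 34.8% |
| Gemini 3.1 Flash Lite (Preview) | 50 | 50 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 33.3% |
| Gemini 3 Flash (Preview) | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 33.3% |
| DeepSeek V4 Pro | 50 | 50 | 50 | 50 | 33 | 25 | 25 | 20 | 13 | 10 | 32.6% |
| GPT-4o, Aug. 6th (temp=1) | 100 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 20 | 20 | 32.2% |
| Claude 3.7 Sonnet | 50 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 30.8% |
| DeepSeek V3.1 | 50 | 50 | 50 | 50 | 33 | 20 | 17 | 14 | 11 | 7 | 30.3% |
| Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 50 | 50 | 20 | 20 | 20 | 17 | 14 | 9 | 30.0% |
| GPT-4o, May 13th (temp=0) | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 20 | 28.7% |
| GPT-4o, May 13th (temp=1) | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 20 | 17 | 11 | 26.4% |
| DeepSeek V3.2 | 50 | 50 | 50 | 33 | 33 | 14 | 8 | 8 | 7 | 6 | 26.0% |
| GPT-4o Mini (temp=1) | 100 | 25 | 25 | 20 | 17 | 17 | 14 | 14 | 14 | 13 | 25.9% |
| Mistral Medium 3.1 | 50 | 33 | 33 | 33 | 33 | 20 | 14 | 13 | 10 | 9 | 24.9% |
| Claude Sonnet 4.6 | 33 | 25 | 25 | 25 | 25 | 20 | 20 | 20 | 20 | 17 | 23.0% |
| DeepSeek V4 Flash | 50 | 50 | 20 | 17 | 14 | 14 | 13 | 10 | 7 | 6 | 20.0% |
| Gemini 2.5 Flash Lite | 33 | 33 | 25 | 20 | 20 | 20 | 11 | 11 | 8 | 7 | 18.9% |
| Claude 3 Haiku | 25 | 25 | 25 | 20 | 20 | 17 | 17 | 14 | 14 | 11 | 18.8% |
| GPT-4o Mini (temp=0) | 20 | 20 | 20 | 20 | 20 | 20 | 17 | 17 | 17 | 13 | 18.3% |
| Arcee AI: Trinity Large (Preview) | 100 | 20 | 11 | 8 | 8 | 8 | 7 | 6 | 6 | 0 | 17.4% |
| Gemma 3 27B | 25 | 20 | 20 | 17 | 17 | 17 | 14 | 13 | 13 | 11 | 16.5% |
| Ministral 3 14B | 50 | 50 | 10 | 9 | 9 | 8 | 8 | 8 | 7 | 6 | 16.5% |
| Mistral Large 2 | 20 | 20 | 20 | 17 | 17 | 14 | 13 | 13 | 13 | 9 | 15.4% |
| Mistral Large | 20 | 20 | 17 | 17 | 14 | 14 | 13 | 11 | 11 | 9 | 14.6% |
| Mistral Large 3 | 14 | 14 | 14 | 13 | 13 | 13 | 13 | 13 | 13 | 9 | 12.7% |
| Ministral 3B | 50 | 17 | 10 | 9 | 8 | 7 | 7 | 7 | 6 | 4 | 12.4% |
| Mistral Small 3.2 24B | 20 | 20 | 14 | 11 | 11 | 11 | 10 | 8 | 8 | 8 | 12.1% |
| Gemma 3 4B | 14 | 14 | 14 | 13 | 11 | 11 | 11 | 11 | 11 | 8 | 11.9% |
| Mistral NeMO | 17 | 14 | 14 | 14 | 13 | 11 | 10 | 10 | 8 | 6 | 11.7% |
| Mistral Small 4 | 17 | 14 | 11 | 11 | 10 | 9 | 9 | 8 | 7 | 6 | 10.3% |
| Gemma 3 12B | 14 | 11 | 11 | 10 | 10 | 10 | 10 | 10 | 8 | 7 | 10.2% |
| Llama 3.1 8B | 14 | 11 | 10 | 10 | 10 | 9 | 8 | 8 | 7 | 5 | 9.2% |
| Mistral Small Creative | 14 | 11 | 9 | 9 | 8 | 8 | 8 | 7 | 6 | 5 | 8.5% |
| Ministral 3 3B | 13 | 11 | 8 | 8 | 8 | 8 | 8 | 6 | 5 | 4 | 7.9% |
| Ministral 8B | 6 | 5 | 5 | 5 | 5 | 4 | 4 | 4 | 4 | 1 | 4.4% |
| Ministral 3 8B | 6 | 6 | 5 | 4 | 4 | 4 | 3 | 1 | 0 | 0 | 3.3% |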

detailed entries

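Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 93.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 20 | 92.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 5 | 90.5%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 90.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 88.3%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 88.3%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 25 | 87.5%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 4 | 85.4%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 83.3%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 33 | 81.7%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 20 | 80.3%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 25 | 77.5%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 33 | 76.7%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 25 | 75.8%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 33 | 25 | 72.5%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 33 | 71.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 11 | 8 | 70.3%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 1 | 68.4%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 25 | 8 | 66.7%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 25 | 65.8%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 33 | 25 | 25 | 63.3%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 50 | 25 | 17 | 13 | 5 | 60.9%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 33 | 33 | 33 | 33 | 33 | 33 | 60.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 33 | 25 | 25 | 6 | 2 | 59.2%
GPT-5.4 Mini | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 25 | 55.8%
Qwen 3.5 9B | 100 | 100 | 100 | 50 | 33 | 33 | 33 | 33 | 25 | 6 | 51.5%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 33 | 33 | 20 | 17 | 7 | 2 | 51.2%
Gemma 4 31B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Claude Sonnet 4.6 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 20 | 14 | 13 | 49.7%
Claude Haiku 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 48.3%
GPT-5.4 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 25 | 47.5%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 50 | 7 | 3 | 2 | 2 | 1 | 46.5%
Z.AI GLM 4.7 Flash | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 25 | 45.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 50 | 50 | 33 | 33 | 33 | 20 | 20 | 17 | 45.7%
Claude Sonnet 4.5 | 100 | 100 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 17 | 42.5%
GPT-5.5 | 100 | 50 | 50 | 33 | 33 | 33 | 33 | 25 | 25 | 20 | 40.3%
Gemini 2.5 Flash | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 40.0%
DeepSeek V4 Pro | 50 | 50 | 50 | 50 | 50 | 50 | 25 | 25 | 2 | 1 | 35.4%
Qwen 3.5 27B | 50 | 50 | 50 | 33 | 33 | 33 | 25 | 25 | 20 | 20 | 34.0%
Rocinante 12B | 100 | 100 | 50 | 14 | 13 | 13 | 13 | 11 | 10 | 8 | 33.1%
Mistral Small 3.2 24B | 100 | 100 | 100 | 4 | 4 | 4 | 3 | 3 | 2 | 0 | 31.9%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 7 | 5 | 3 | 2 | 1 | 1 | 0 | 31.8%
WizardLM 2 8x22b | 100 | 100 | 33 | 25 | 20 | 14 | 7 | 6 | 6 | 3 | 31.4%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 31.2%
Grok 4.3 | 100 | 100 | 33 | 20 | 17 | 11 | 10 | 4 | 3 | 2 | 30.0%
Gemini 2.5 Pro | 50 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 29.8%
Claude Opus 4.5 | 50 | 50 | 33 | 33 | 33 | 20 | 20 | 17 | 17 | 17 | 29.0%
Gemma 4 26B | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 17 | 17 | 28.3%
GPT-4o, May 13th (temp=1) | 50 | 50 | 33 | 33 | 25 | 20 | 17 | 17 | 14 | 13 | 27.2%
Claude 3.7 Sonnet | 100 | 25 | 20 | 20 | 20 | 20 | 20 | 17 | 14 | 14 | 27.0%
Gemini 3.1 Flash Lite (Reasoning) | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 25 | 20 | 27.0%
Gemini 3.1 Flash Lite | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 20 | 26.8%
Claude Opus 4 | 50 | 33 | 25 | 25 | 25 | 25 | 20 | 20 | 20 | 14 | 25.8%
Claude Sonnet 4 | 33 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 20 | 25.3%
Hermes 3 405B | 33 | 33 | 33 | 25 | 25 | 25 | 20 | 20 | 14 | 13 | 24.2%
DeepSeek-V2 Chat | 50 | 50 | 50 | 20 | 14 | 14 | 11 | 10 | 8 | 1 | 22.9%
Hermes 3 70B | 100 | 20 | 17 | 17 | 14 | 11 | 10 | 10 | 9 | 8 | 21.6%
Qwen3 235B A22B Instruct 2507 | 100 | 25 | 20 | 20 | 17 | 9 | 7 | 7 | 6 | 3 | 21.3%
Gemini 3 Flash (Preview) | 25 | 25 | 25 | 25 | 25 | 20 | 20 | 20 | 11 | 10 | 20.6%
Claude Opus 4.7 | 25 | 25 | 25 | 20 | 20 | 20 | 17 | 17 | 17 | 13 | 19.8%
Claude 3 Haiku | 33 | 25 | 25 | 25 | 25 | 17 | 13 | 13 | 11 | 11 | 19.7%
Gemini 3.1 Flash Lite (Preview) | 25 | 20 | 20 | 20 | 20 | 20 | 20 | 17 | 14 | 7 | 18.3%
Grok 4.20 (Beta) | 100 | 33 | 17 | 10 | 4 | 4 | 3 | 3 | 3 | 2 | 17.9%
GPT-4o, Aug. 6th (temp=0) | 33 | 33 | 33 | 20 | 14 | 13 | 11 | 9 | 6 | 4 | 17.7%
DeepSeek V3 (2024-12-26) | 50 | 33 | 20 | 14 | 13 | 13 | 10 | 10 | 8 | 7 | 17.7%
DeepSeek V4 Flash | 50 | 33 | 25 | 20 | 20 | 8 | 8 | 4 | 3 | 2 | 17.3%
GPT-4o Mini (temp=1) | 20 | 20 | 20 | 20 | 17 | 17 | 17 | 14 | 13 | 7 | 16.4%
Claude 3.5 Sonnet | 20 | 17 | 17 | 17 | 17 | 17 | 14 | 14 | 14 | 14 | 16.0%
DeepSeek V3.2 | 50 | 50 | 20 | 14 | 6 | 3 | 3 | 2 | 1 | 1 | 15.0%
Mistral Medium 3.1 | 33 | 25 | 17 | 17 | 13 | 13 | 9 | 8 | 8 | 3 | 14.4%
Llama 3.1 70B | 33 | 20 | 20 | 17 | 13 | 10 | 10 | 9 | 6 | 6 | 14.3%
Qwen 3.5 Plus (2026-02-15) | 25 | 25 | 20 | 20 | 17 | 14 | 9 | 6 | 3 | 2 | 14.2%
Writer: Palmyra X5 | 33 | 33 | 25 | 11 | 10 | 7 | 5 | 4 | 3 | 2 | 13.4%
Mistral Small 4 | 33 | 20 | 20 | 14 | 13 | 11 | 5 | 4 | 4 | 3 | 12.7%
GPT-4o Mini (temp=0) | 17 | 17 | 14 | 14 | 14 | 11 | 11 | 9 | 8 | 8 | 12.3%
Gemma 3 12B | 17 | 17 | 17 | 13 | 11 | 10 | 8 | 8 | 8 | 3 | 11.0%
Llama 3.1 Nemotron 70B | 17 | 13 | 13 | 11 | 11 | 11 | 11 | 8 | 6 | 6 | 10.6%
LFM2 24B | 50 | 25 | 11 | 4 | 3 | 3 | 2 | 1 | 0 | 0 | 9.8%
DeepSeek V3 (2025-03-24) | 17 | 17 | 13 | 10 | 10 | 9 | 9 | 9 | 3 | 1 | 9.7%
GPT-4.1 Mini | 50 | 9 | 6 | 6 | 6 | 5 | 5 | 3 | 3 | 3 | 9.6%
Llama 3.1 8B | 17 | 17 | 17 | 13 | 13 | 7 | 5 | 4 | 4 | 1 | 9.6%
Mistral Large 3 | 17 | 11 | 11 | 10 | 8 | 7 | 6 | 6 | 4 | 3 | 8.3%
Mistral Large 2 | 17 | 11 | 10 | 8 | 8 | 7 | 7 | 6 | 5 | 3 | 8.2%
Mistral Large | 11 | 10 | 9 | 9 | 8 | 7 | 5 | 4 | 4 | 3 | 7.0%
Gemma 3 27B | 14 | 8 | 8 | 7 | 6 | 5 | 5 | 5 | 4 | 4 | 6.6%
Mistral NeMO | 20 | 14 | 10 | 8 | 7 | 2 | 2 | 2 | 0 | 0 | 6.5%
GPT-4o, May 13th (temp=0) | 9 | 9 | 7 | 7 | 7 | 6 | 6 | 5 | 5 | 4 | 6.5%
Mistral Small Creative | 50 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 6.5%
Grok 4.20 | 8 | 8 | 6 | 6 | 6 | 5 | 3 | 3 | 3 | 3 | 5.0%
DeepSeek V3.1 | 7 | 6 | 5 | 4 | 4 | 3 | 2 | 2 | 1 | 1 | 3.4%
Ministral 3B | 5 | 5 | 4 | 3 | 2 | 2 | 1 | 1 | 1 | 0 | 2.3%
Gemma 3 4B | 4 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 2.1%
Ministral 3 3B | 3 | 3 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1.1%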
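Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 95.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 90.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 33 | 88.3%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 17 | 86.7%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
MiniMax M2.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 85.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 6 | 80.6%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 80.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 75.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 33 | 0 | 73.3%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 73.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 33 | 25 | 25 | 20 | 70.3%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 70.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 17 | 17 | 1 | 68.4%
ByteDance Seed 1.6 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 65.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 33 | 33 | 25 | 20 | 20 | 63.2%
Gemini 2.5 Pro | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0%
Grok 4 | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 60.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 60.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 25 | 59.2%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 33 | 25 | 59.2%
Qwen 3.5 122B | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 33 | 25 | 25 | 58.3%
MoonshotAI: Kimi K2.6 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 55.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 33 | 33 | 25 | 25 | 14 | 14 | 54.5%
MoonshotAI: Kimi K2.5 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 53.3%
Hermes 3 70B | 100 | 100 | 100 | 50 | 50 | 50 | 25 | 17 | 17 | 14 | 52.3%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 33 | 17 | 17 | 17 | 11 | 9 | 50.4%
Qwen 3.5 397B A17B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
Gemma 4 31B | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50.0%
GPT-5.4 Mini | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 33 | 46.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 20 | 17 | 8 | 7 | 7 | 7 | 46.5%
Arcee AI: Trinity Mini | 100 | 100 | 50 | 50 | 50 | 33 | 25 | 20 | 4 | 0 | 43.2%
Claude Haiku 4.5 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 33 | 25 | 20 | 42.8%
GPT-5.5 | 100 | 50 | 50 | 50 | 33 | 33 | 33 | 25 | 25 | 25 | 42.5%
GPT-5.4 Nano | 100 | 50 | 50 | 50 | 50 | 50 | 33 | 20 | 11 | 8 | 42.2%
Hermes 3 405B | 100 | 100 | 33 | 25 | 25 | 25 | 25 | 20 | 20 | 20 | 39.3%
Llama 3.1 70B | 100 | 100 | 50 | 33 | 25 | 20 | 17 | 13 | 13 | 13 | 38.3%
Claude Opus 4.6 | 50 | 50 | 50 | 50 | 33 | 33 | 33 | 25 | 25 | 20 | 37.0%
Claude Sonnet 4 | 100 | 50 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 20 | 36.2%
ByteDance Seed 2.0 Lite | 50 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 35.0%
Qwen 2.5 72B | 100 | 100 | 25 | 20 | 20 | 20 | 17 | 14 | 13 | 11 | 34.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 33 | 20 | 20 | 14 | 13 | 11 | 11 | 5 | 32.7%
Ministral 8B | 100 | 100 | 100 | 4 | 3 | 3 | 3 | 3 | 2 | 1 | 31.9%
Claude 3.7 Sonnet | 50 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 30.8%
DeepSeek V4 Pro | 50 | 50 | 50 | 50 | 33 | 20 | 20 | 17 | 8 | 6 | 30.4%
DeepSeek V3 (2024-12-26) | 50 | 50 | 50 | 33 | 25 | 25 | 17 | 17 | 17 | 17 | 30.0%
Gemini 3.1 Flash Lite | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 25 | 20 | 17 | 29.5%
Claude Opus 4 | 50 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 14 | 28.9%
Claude Opus 4.7 | 33 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 20 | 28.7%
Gemini 2.5 Flash Lite | 100 | 100 | 17 | 14 | 14 | 10 | 9 | 8 | 7 | 2 | 28.1%
Claude Opus 4.5 | 33 | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 20 | 27.8%
Gemini 3.1 Flash Lite (Reasoning) | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 25 | 20 | 27.0%
Claude 3.5 Sonnet | 33 | 33 | 33 | 25 | 25 | 25 | 25 | 25 | 25 | 20 | 27.0%
Grok 4.20 (Beta) | 100 | 100 | 9 | 9 | 9 | 9 | 8 | 8 | 7 | 7 | 26.7%
DeepSeek V3 (2025-03-24) | 50 | 33 | 25 | 25 | 25 | 25 | 25 | 20 | 17 | 17 | 26.2%
GPT-5.4 | 100 | 33 | 33 | 20 | 17 | 17 | 10 | 7 | 7 | 7 | 25.1%
Qwen 3.5 Plus (2026-02-15) | 50 | 50 | 50 | 17 | 14 | 11 | 11 | 11 | 9 | 7 | 23.0%
Ministral 3 8B | 100 | 100 | 3 | 3 | 3 | 2 | 2 | 2 | 1 | 0 | 21.6%
GPT-4o, May 13th (temp=1) | 50 | 25 | 25 | 17 | 17 | 17 | 17 | 17 | 17 | 14 | 21.4%
Gemini 3.1 Flash Lite (Preview) | 25 | 25 | 25 | 25 | 20 | 20 | 20 | 20 | 17 | 17 | 21.3%
Grok 4.3 | 100 | 25 | 17 | 13 | 13 | 11 | 9 | 8 | 8 | 8 | 21.1%
Gemini 3 Flash (Preview) | 25 | 25 | 25 | 20 | 20 | 20 | 17 | 17 | 17 | 14 | 19.9%
DeepSeek V3.1 | 50 | 50 | 20 | 20 | 14 | 14 | 10 | 10 | 8 | 2 | 19.8%
GPT-4.1 Mini | 50 | 50 | 17 | 17 | 17 | 14 | 11 | 8 | 7 | 3 | 19.4%
DeepSeek-V2 Chat | 25 | 25 | 25 | 20 | 20 | 20 | 17 | 17 | 13 | 11 | 19.2%
Mistral Small 3.2 24B | 100 | 14 | 13 | 11 | 11 | 11 | 9 | 8 | 8 | 5 | 19.1%
Claude 3 Haiku | 25 | 25 | 25 | 25 | 20 | 17 | 17 | 17 | 11 | 10 | 19.1%
Gemma 3 12B | 33 | 25 | 25 | 20 | 20 | 14 | 13 | 13 | 13 | 11 | 18.6%
Gemini 2.5 Flash | 20 | 20 | 17 | 17 | 17 | 17 | 17 | 17 | 14 | 10 | 16.4%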
Llama 3.1 8B