Alias accuracy

Test: Relationship tree

Avg. Score: 66.2%
Scenarios: 2

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Grok 4.20 (Beta) | 90.3% | $0.019 | 5.5s | 86%
2 | Grok 4.20 | 89.8% | $0.020 | 7.5s | 85%
3 | Gemma 4 26B | 93.7% | $0.0018 | 27.2s | 81%
4 | Gemma 4 31B | 96.7% | $0.0028 | 1.9m | 91%
5 | Xiaomi MIMO v2.5 | 93.1% | $0.0027 | 1.1m | 80%
6 | GPT-5.1 | 88.7% | $0.025 | 17.7s | 75%
7 | GPT-5.4 (Reasoning, Low) | 92.4% | $0.053 | 35.5s | 77%
8 | GPT-5.4 Mini | 88.3% | $0.0087 | 7.9s | 63%
9 | Gemini 2.5 Pro | 93.7% | $0.099 | 58.2s | 81%
10 | Gemini 3 Flash (Preview, Reasoning) | 88.0% | $0.030 | 40.3s | 68%
11 | GPT-5 | 95.5% | $0.092 | 2.3m | 87%
12 | GPT-5.2 | 89.0% | $0.062 | 50.7s | 73%
13 | GPT-4.1 | 83.7% | $0.024 | 8.5s | 63%
14 | Qwen 3.5 397B A17B | 93.5% | $0.045 | 2.6m | 81%
15 | GPT-5.4 Mini (Reasoning, Low) | 89.1% | $0.014 | 14.4s | 55%
16 | DeepSeek V4 Pro (Reasoning) | 92.5% | $0.020 | 2.3m | 74%
17 | Z.AI GLM 5 Turbo | 95.5% | $0.044 | 3.6m | 87%
18 | Claude Opus 4.6 | 90.1% | $0.154 | 31.4s | 84%
19 | Grok 4.20 (Reasoning) | 85.6% | $0.027 | 40.0s | 63%
20 | Gemini 3.5 Flash (Reasoning, Minimal) | 76.2% | $0.024 | 6.3s | 66%
21 | DeepSeek V4 Flash | 81.9% | $0.0020 | 51.4s | 63%
22 | GPT-5.4 | 77.1% | $0.033 | 17.4s | 67%
23 | Gemini 3.1 Pro (Preview) | 95.5% | $0.172 | 1.5m | 87%
24 | Qwen3.6 Max Preview | 96.1% | $0.084 | 3.7m | 90%
25 | GPT-5.4 Mini (Reasoning) | 95.7% | $0.117 | 2.8m | 88%
26 | Qwen 3.5 122B | 94.3% | $0.035 | 4.0m | 83%
27 | Qwen 3.6 Flash | 87.3% | $0.015 | 51.7s | 54%
28 | Gemini 3.5 Flash (Reasoning) | 90.9% | $0.113 | 49.4s | 70%
29 | GPT-5 Mini | 89.0% | $0.019 | 2.0m | 64%
30 | GPT-5.5 | 83.7% | $0.105 | 37.1s | 72%
31 | ByteDance Seed 1.6 | 86.2% | $0.014 | 1.2m | 55%
32 | Grok 4.20 (Beta, Reasoning) | 81.5% | $0.032 | 30.1s | 55%
33 | ByteDance Seed 2.0 Lite | 89.4% | $0.014 | 3.6m | 74%
34 | Grok 4.3 | 76.3% | $0.0094 | 7.8s | 49%
35 | Gemini 3.1 Flash Lite (Preview) | 78.4% | $0.0046 | 3.2s | 45%
36 | Qwen 3.6 27B | 92.1% | $0.034 | 2.7m | 64%
37 | Claude Opus 4.7 | 89.8% | $0.182 | 24.3s | 73%
38 | GPT-5.4 (Reasoning) | 95.5% | $0.175 | 2.6m | 87%
39 | Gemini 3 Flash (Preview) | 71.6% | $0.0087 | 7.1s | 51%
40 | Z.AI GLM 4.7 | 90.5% | $0.025 | 3.7m | 71%
41 | Qwen 3.5 Plus (2026-04-20) | 88.5% | $0.023 | 3.1m | 67%
42 | Gemini 2.5 Flash | 72.2% | $0.0075 | 6.1s | 49%
43 | Qwen 3.5 Plus (2026-02-15) | 89.9% | $0.021 | 3.9m | 72%
44 | Qwen 3.5 Flash | 90.6% | $0.0040 | 3.8m | 66%
45 | Writer: Palmyra X5 | 77.8% | $0.016 | 18.4s | 46%
46 | MoonshotAI: Kimi K2.6 | 95.5% | $0.082 | 5.0m | 87%
47 | Mistral Large 2 | 79.1% | $0.037 | 20.5s | 48%
48 | Qwen 3.5 35B | 83.1% | $0.015 | 2.0m | 55%
49 | Gemini 3.1 Flash Lite | 68.4% | $0.0031 | 3.3s | 44%
50 | Claude Opus 4.5 | 82.3% | $0.131 | 19.8s | 60%
51 | GPT-5.5 (Reasoning) | 95.5% | $0.251 | 2.0m | 87%
52 | Gemini 3.1 Flash Lite (Reasoning) | 69.5% | $0.0033 | 3.2s | 41%
53 | Claude Sonnet 4.6 | 72.2% | $0.081 | 23.3s | 56%
54 | Claude Opus 4.7 (Reasoning) | 86.1% | $0.184 | 25.0s | 63%
55 | Mistral Small 3.2 24B | 72.0% | $0.0017 | 13.8s | 35%
56 | Z.AI GLM 4.6 | 86.2% | $0.018 | 3.0m | 52%
57 | Mistral Large | 72.4% | $0.036 | 41.1s | 45%
58 | Xiaomi MIMO v2.5 Pro | 79.2% | $0.0076 | 1.9m | 43%
59 | o4 Mini High | 79.6% | $0.059 | 1.3m | 46%
60 | Z.AI GLM 4.5 | 70.9% | $0.0075 | 48.1s | 37%
61 | MiniMax M2.5 | 70.5% | $0.0036 | 45.0s | 36%
62 | GPT-5.5 (Reasoning, Low) | 85.1% | $0.097 | 42.6s | 40%
63 | Gemini 2.5 Flash (Reasoning) | 74.4% | $0.023 | 37.3s | 34%
64 | MoonshotAI: Kimi K2.5 | 88.5% | $0.031 | 5.4m | 66%
65 | Qwen 3.6 35B | 78.4% | $0.014 | 1.1m | 30%
66 | Z.AI GLM 5.1 | 95.5% | $0.081 | 7.3m | 87%
67 | Mistral Small 4 | 66.0% | $0.0030 | 7.4s | 28%
68 | o4 Mini | 73.0% | $0.037 | 45.7s | 34%
69 | Qwen 3.5 27B | 87.3% | $0.029 | 3.7m | 46%
70 | DeepSeek V3.2 | 66.3% | $0.0046 | 33.9s | 29%
71 | GPT-4o, Aug. 6th (temp=0) | 61.8% | $0.032 | 8.4s | 34%
72 | DeepSeek V4 Pro | 64.3% | $0.0083 | 17.3s | 28%
73 | GPT-4o, Aug. 6th (temp=1) | 60.4% | $0.029 | 9.1s | 35%
74 | GPT-4o, May 13th (temp=0) | 59.1% | $0.088 | 5.5s | 46%
75 | Inception Mercury 2 | 64.2% | $0.0096 | 12.8s | 26%
76 | Claude Opus 4.8 (Reasoning, Low) | 95.5% | $0.383 | 1.9m | 87%
77 | Claude Opus 4.8 (Reasoning) | 95.5% | $0.381 | 1.9m | 87%
78 | Claude Opus 4.6 (Reasoning) | 95.5% | $0.356 | 2.6m | 87%
79 | DeepSeek-V2 Chat | 55.8% | $0.0049 | 34.0s | 33%
80 | MiniMax M2.7 | 65.2% | $0.0063 | 21.4s | 22%
81 | Qwen3 235B A22B Instruct 2507 | 67.9% | $0.0017 | 1.0m | 25%
82 | Ministral 8B | 62.0% | $0.0016 | 15.6s | 22%
83 | Claude Sonnet 4 | 64.9% | $0.072 | 17.5s | 34%
84 | Mistral Large 3 | 63.0% | $0.0091 | 17.5s | 22%
85 | DeepSeek V3.1 | 60.4% | $0.0055 | 52.0s | 29%
86 | ByteDance Seed 2.0 Mini | 89.9% | $0.0062 | 8.7m | 77%
87 | GPT-4o, May 13th (temp=1) | 58.4% | $0.086 | 5.4s | 37%
88 | Claude Haiku 4.5 | 45.4% | $0.024 | 8.2s | 36%
89 | DeepSeek V4 Flash (Reasoning) | 72.8% | $0.0041 | 3.5m | 38%
90 | Claude Sonnet 4.5 | 63.9% | $0.077 | 18.6s | 29%
91 | Gemini 2.5 Flash Lite (Reasoning) | 61.2% | $0.0073 | 1.0m | 22%
92 | Grok 4.3 (Reasoning) | 60.2% | $0.013 | 13.9s | 16%
93 | Claude 3 Haiku | 54.9% | $0.0056 | 11.6s | 18%
94 | GPT-4.1 Mini | 55.9% | $0.0046 | 26.0s | 18%
95 | GPT-5.4 Nano | 44.5% | $0.0026 | 11.2s | 23%
96 | GPT-5.4 Nano (Reasoning) | 46.3% | $0.0088 | 41.7s | 27%
97 | GPT-OSS 120B | 63.1% | $0.0040 | 2.2m | 25%
98 | Mistral Small 4 (Reasoning) | 54.7% | $0.0051 | 24.7s | 13%
99 | Z.AI GLM 5 | 78.9% | $0.037 | 4.6m | 37%
100 | Claude Opus 4 | 88.8% | $0.368 | 2.4m | 73%
101 | Gemma 4 31B (Reasoning) | 83.4% | $0.0043 | 5.8m | 38%
102 | Qwen3.7 Max | 79.8% | $0.045 | 3.5m | 26%
103 | Mistral Medium 3.1 | 51.1% | $0.010 | 21.2s | 14%
104 | Gemma 4 26B (Reasoning) | 72.6% | $0.0041 | 3.1m | 18%
105 | Ministral 3 8B | 42.5% | $0.0022 | 17.0s | 17%
106 | MiniMax M3 | 89.2% | $0.036 | 9.3m | 68%
107 | GPT-5 Nano | 44.9% | $0.0080 | 2.2m | 34%
108 | DeepSeek V3 (2024-12-26) | 37.1% | $0.0045 | 40.2s | 21%
109 | GPT-5.4 Nano (Reasoning, Low) | 34.2% | $0.0038 | 13.8s | 19%
110 | Hermes 3 405B | 49.4% | $0.015 | 1.0m | 11%
111 | DeepSeek V3 (2025-03-24) | 41.9% | $0.0037 | 50.3s | 13%
112 | WizardLM 2 8x22b | 44.2% | $0.0098 | 2.3m | 26%
113 | Gemini 2.5 Flash Lite | 30.4% | $0.0023 | 4.8s | 14%
114 | Llama 3.1 70B | 29.8% | $0.0064 | 34.0s | 18%
115 | Ministral 3 14B | 24.8% | $0.0031 | 25.4s | 19%
116 | Aion 2.0 | 59.5% | $0.024 | 3.0m | 14%
117 | Ministral 3 3B | 24.2% | $0.0007 | 9.9s | 12%
118 | Claude Sonnet 4.6 (Reasoning) | 95.5% | $0.448 | 5.9m | 87%
119 | Ministral 3B | 16.9% | $0.0007 | 9.8s | 11%
120 | Z.AI GLM 4.7 Flash | 40.1% | $0.0050 | 2.8m | 14%
121 | Nemotron 3 Super | 0.0% | $0.0000 | 3.2m | -
122 | Z.AI GLM 4.5 Air | 44.2% | $0.0071 | 3.2m | 11%
123 | GPT-4.1 Nano | 11.5% | $0.0009 | 6.9s | 8%
124 | Cydonia 24B V4.1 | 22.1% | $0.0040 | 1.0m | 7%
125 | Mistral NeMO | 16.7% | $0.0024 | 15.4s | 4%
126 | ByteDance Seed 1.6 Flash | 14.4% | $0.0019 | 38.2s | 9%
127 | Qwen 2.5 72B | 19.4% | $0.0056 | 1.2m | 5%
128 | Rocinante 12B | 12.0% | $0.0033 | 32.1s | 5%
129 | Gemma 3 27B | 6.9% | $0.0016 | 27.5s | 5%
130 | Llama 3.1 8B | 10.4% | $0.0005 | 51.5s | 6%
131 | Hermes 3 70B | 9.3% | $0.0045 | 42.7s | 4%
132 | Gemma 3 4B | 3.9% | $0.0007 | 27.5s | 3%
133 | Skyfall 36B V2 | 5.1% | $0.0054 | 26.5s | 0%
134 | LFM2 24B | 0.0% | $0.0004 | 7.8s | 0%
135 | Qwen 3.5 9B | 56.1% | $0.0040 | 7.8m | 22%
136 | Gemma 3 12B | 1.1% | $0.0014 | 30.4s | 0%
137 | Qwen 3 32B | 15.5% | $0.0042 | 2.9m | 8%
138 | Cohere Command R+ (Aug. 2024) | 9.2% | $0.057 | 56.5s | 4%
139 | Arcee AI: Trinity Mini | 3.8% | $0.0037 | 1.9m | 3%
140 | Nemotron 3 Nano | 5.3% | $0.0039 | 2.8m | 1%
Avg. Score: 66.21%

Individual Scenarios

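Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.7 Max | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.8 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.8 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 89 | 97.7%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 81 | 96.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 81 | 96.3%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 73 | 94.6%
Claude Opus 4.7 | 100 | 100 | 100 | 100 | 70 | 94.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 70 | 94.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 70 | 94.0%
Mistral Large | 93 | 93 | 93 | 92 | 88 | 92.0%
Claude Opus 4 | 100 | 100 | 100 | 88 | 70 | 91.6%
GPT-5.2 | 100 | 100 | 100 | 81 | 70 | 90.3%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 81 | 70 | 90.3%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 93 | 58 | 90.1%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 41 | 88.2%
Grok 4.20 (Beta) | 89 | 89 | 88 | 88 | 87 | 88.1%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 70 | 70 | 88.0%
Grok 4.20 | 89 | 88 | 88 | 88 | 87 | 87.9%
Claude Opus 4.6 | 87 | 87 | 87 | 87 | 87 | 87.3%
MiniMax M3 | 100 | 100 | 100 | 81 | 55 | 87.2%
ByteDance Seed 2.0 Mini | 100 | 88 | 87 | 81 | 76 | 86.5%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 27 | 85.4%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 27 | 85.4%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 27 | 85.4%
o4 Mini | 100 | 100 | 100 | 100 | 27 | 85.4%
GPT-4o, Aug. 6th (temp=0) | 100 | 81 | 81 | 81 | 81 | 85.2%
Z.AI GLM 5 | 100 | 100 | 100 | 70 | 55 | 84.9%
GPT-5.5 | 87 | 87 | 87 | 87 | 70 | 83.8%
GPT-5.1 | 100 | 88 | 88 | 70 | 70 | 83.1%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 88 | 27 | 83.0%
Qwen 3.6 35B | 100 | 100 | 100 | 88 | 27 | 83.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 94 | 93 | 27 | 82.8%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 11 | 82.3%
Mistral Large 3 | 93 | 93 | 93 | 93 | 37 | 82.1%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 70 | 70 | 70 | 81.9%
GPT-4.1 | 100 | 100 | 70 | 70 | 70 | 81.9%
GPT-5.4 Mini | 100 | 100 | 100 | 55 | 55 | 81.9%
Mistral Large 2 | 93 | 93 | 93 | 93 | 34 | 81.6%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 88 | 71 | 46 | 80.9%
Mistral Small 4 | 100 | 100 | 93 | 93 | 11 | 79.4%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 70 | 27 | 79.3%
DeepSeek V4 Flash | 100 | 88 | 86 | 63 | 55 | 78.2%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 70 | 15 | 76.9%
GPT-4o, May 13th (temp=1) | 81 | 81 | 81 | 70 | 70 | 76.8%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 82 | 0 | 76.5%
GPT-4o, Aug. 6th (temp=1) | 100 | 81 | 81 | 62 | 56 | 76.3%
Gemini 3.5 Flash (Reasoning, Minimal) | 87 | 87 | 73 | 70 | 59 | 75.3%
Claude 3 Haiku | 100 | 83 | 83 | 66 | 40 | 74.2%
DeepSeek V4 Pro | 100 | 100 | 100 | 52 | 17 | 73.8%
GPT-5.4 | 73 | 73 | 73 | 73 | 73 | 73.2%
Gemini 3.1 Flash Lite (Preview) | 100 | 81 | 70 | 62 | 41 | 70.9%
Hermes 3 405B | 100 | 100 | 100 | 27 | 27 | 70.7%
Gemini 3.1 Flash Lite | 100 | 81 | 70 | 62 | 40 | 70.6%
Claude Opus 4.5 | 70 | 70 | 70 | 70 | 70 | 69.9%
GPT-4o, May 13th (temp=0) | 70 | 70 | 70 | 70 | 70 | 69.9%
Z.AI GLM 4.5 | 100 | 100 | 81 | 36 | 27 | 68.9%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 81 | 62 | 52 | 48 | 68.7%
Grok 4.3 | 88 | 87 | 79 | 50 | 36 | 68.0%
GPT-OSS 120B | 100 | 100 | 100 | 27 | 11 | 67.6%
Gemini 3 Flash (Preview) | 87 | 70 | 70 | 70 | 37 | 66.8%
Claude Sonnet 4.6 | 70 | 70 | 70 | 62 | 62 | 66.7%
Mistral Medium 3.1 | 100 | 100 | 85 | 29 | 19 | 66.7%
Gemini 2.5 Flash | 89 | 84 | 66 | 48 | 45 | 66.4%
DeepSeek V3.2 | 100 | 100 | 39 | 39 | 36 | 62.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 41 | 41 | 11 | 58.6%
Claude Sonnet 4.5 | 100 | 88 | 27 | 27 | 27 | 53.7%
MiniMax M2.5 | 87 | 84 | 46 | 42 | 8 | 53.4%
Ministral 8B | 93 | 84 | 79 | 5 | 4 | 53.0%
GPT-4.1 Mini | 100 | 100 | 27 | 27 | 11 | 53.0%
Qwen 3.5 9B | 83 | 73 | 55 | 27 | 0 | 47.6%
Mistral Small 3.2 24B | 93 | 46 | 44 | 38 | 11 | 46.5%
Grok 4.3 (Reasoning) | 100 | 100 | 27 | 4 | 0 | 46.1%
MiniMax M2.7 | 100 | 84 | 27 | 11 | 4 | 45.2%
DeepSeek V3.1 | 70 | 70 | 42 | 38 | 0 | 43.9%
GPT-5.4 Nano | 100 | 47 | 27 | 27 | 17 | 43.6%
DeepSeek-V2 Chat | 70 | 70 | 38 | 27 | 13 | 43.4%
GPT-5.4 Nano (Reasoning) | 100 | 27 | 27 | 27 | 27 | 41.5%
Claude Haiku 4.5 | 44 | 44 | 39 | 39 | 39 | 40.8%
DeepSeek V3 (2025-03-24) | 70 | 70 | 62 | 0 | 0 | 40.4%
Aion 2.0 | 100 | 100 | 0 | 0 | 0 | 40.0%
Inception Mercury 2 | 100 | 27 | 27 | 24 | 22 | 40.0%
Claude Sonnet 4 | 87 | 27 | 27 | 27 | 27 | 38.8%
Z.AI GLM 4.5 Air | 100 | 27 | 27 | 27 | 11 | 38.4%
DeepSeek V3 (2024-12-26) | 70 | 70 | 27 | 13 | 12 | 38.3%
WizardLM 2 8x22b | 70 | 49 | 39 | 15 | 15 | 37.7%
GPT-5 Nano | 55 | 55 | 50 | 17 | 11 | 37.6%
Qwen 2.5 72B | 87 | 44 | 15 | 4 | 0 | 29.7%
Z.AI GLM 4.7 Flash | 100 | 27 | 11 | 6 | 4 | 29.5%
Ministral 3 3B | 64 | 59 | 9 | 8 | 3 | 28.4%
GPT-5.4 Nano (Reasoning, Low) | 27 | 27 | 27 | 27 | 27 | 26.9%
Ministral 3 8B | 100 | 14 | 6 | 5 | 5 | 25.8%
Mistral NeMO | 55 | 55 | 9 | 4 | 4 | 25.2%
Ministral 3 14B | 49 | 22 | 13 | 13 | 13 | 22.4%
Gemini 2.5 Flash Lite | 27 | 27 | 27 | 17 | 11 | 21.8%
Rocinante 12B | 42 | 32 | 18 | 12 | 0 | 20.7%
Cydonia 24B V4.1 | 55 | 21 | 18 | 9 | 0 | 20.5%
Llama 3.1 8B | 30 | 27 | 19 | 10 | 0 | 17.2%
Llama 3.1 70B | 34 | 33 | 4 | 4 | 4 | 15.9%
Hermes 3 70B | 40 | 16 | 11 | 6 | 0 | 14.5%
ByteDance Seed 1.6 Flash | 27 | 27 | 11 | 3 | 3 | 14.4%
Cohere Command R+ (Aug. 2024) | 29 | 16 | 11 | 5 | 4 | 13.1%
Ministral 3B | 27 | 12 | 8 | 6 | 3 | 11.1%
Qwen 3 32B | 27 | 11 | 4 | 4 | 0 | 9.1%
Skyfall 36B V2 | 40 | 6 | 0 | 0 | 0 | 9.1%
Nemotron 3 Nano | 27 | 11 | 4 | 3 | 0 | 9.1%
GPT-4.1 Nano | 11 | 11 | 11 | 11 | 0 | 9.1%
Gemma 3 4B | 8 | 8 | 8 | 6 | 6 | 7.0%
Gemma 3 27B | 19 | 3 | 0 | 0 | 0 | 4.4%
Gemma 3 12B | 11 | 0 | 0 | 0 | 0 | 2.3%
Arcee AI: Trinity Mini | 4 | 4 | 3 | 0 | 0 | 2.1%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%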
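
Model | #1 | #2 | #3 | #4 | #5 | Avg
Mistral Small 3.2 24B | 100 | 97 | 97 | 97 | 97 | 97.4%
GPT-5.4 Mini (Reasoning, Low) | 100 | 97 | 97 | 92 | 91 | 95.3%
GPT-5.4 Mini | 97 | 97 | 94 | 94 | 92 | 94.7%
Claude Opus 4.5 | 100 | 100 | 91 | 91 | 91 | 94.7%
GPT-5.1 | 94 | 94 | 94 | 94 | 94 | 94.3%
Gemma 4 31B | 97 | 97 | 91 | 91 | 91 | 93.4%
ByteDance Seed 2.0 Mini | 97 | 96 | 94 | 91 | 89 | 93.3%
Claude Opus 4.6 | 94 | 94 | 94 | 91 | 91 | 92.9%
Grok 4.20 (Beta) | 94 | 94 | 92 | 92 | 92 | 92.6%
Qwen 3.5 122B | 97 | 91 | 91 | 91 | 91 | 92.3%
Qwen 3.5 27B | 97 | 91 | 91 | 91 | 91 | 92.3%
Qwen3.6 Max Preview | 97 | 91 | 91 | 91 | 91 | 92.2%
Gemini 3.5 Flash (Reasoning) | 97 | 91 | 91 | 91 | 89 | 91.8%
Grok 4.20 | 94 | 92 | 92 | 92 | 90 | 91.7%
GPT-5.4 Mini (Reasoning) | 92 | 91 | 91 | 91 | 91 | 91.3%
Claude Opus 4.6 (Reasoning) | 91 | 91 | 91 | 91 | 91 | 91.1%
Gemini 3.1 Pro (Preview) | 91 | 91 | 91 | 91 | 91 | 91.1%
GPT-5.4 (Reasoning) | 91 | 91 | 91 | 91 | 91 | 91.1%
Z.AI GLM 5.1 | 91 | 91 | 91 | 91 | 91 | 91.1%
GPT-5.5 (Reasoning) | 91 | 91 | 91 | 91 | 91 | 91.1%
Claude Sonnet 4.6 (Reasoning) | 91 | 91 | 91 | 91 | 91 | 91.1%
Z.AI GLM 5 Turbo | 91 | 91 | 91 | 91 | 91 | 91.1%
MoonshotAI: Kimi K2.6 | 91 | 91 | 91 | 91 | 91 | 91.1%
Claude Opus 4.8 (Reasoning) | 91 | 91 | 91 | 91 | 91 | 91.1%
Claude Opus 4.8 (Reasoning, Low) | 91 | 91 | 91 | 91 | 91 | 91.1%
GPT-5 | 91 | 91 | 91 | 91 | 91 | 91.1%
MiniMax M3 | 91 | 91 | 91 | 91 | 91 | 91.1%
Claude Sonnet 4 | 91 | 91 | 91 | 91 | 90 | 90.9%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 84 | 84 | 84 | 90.4%
GPT-5.4 (Reasoning, Low) | 92 | 92 | 92 | 91 | 85 | 90.2%
Qwen 3.6 Flash | 97 | 96 | 91 | 91 | 71 | 89.1%
ByteDance Seed 2.0 Lite | 91 | 91 | 91 | 91 | 78 | 88.5%
Inception Mercury 2 | 100 | 96 | 91 | 79 | 76 | 88.4%
GPT-5.2 | 91 | 91 | 91 | 91 | 75 | 87.8%
MiniMax M2.5 | 97 | 97 | 97 | 91 | 56 | 87.6%
Gemma 4 26B | 97 | 94 | 92 | 79 | 76 | 87.5%
Gemini 2.5 Pro | 100 | 91 | 91 | 84 | 71 | 87.3%
Qwen 3.5 397B A17B | 91 | 91 | 91 | 84 | 78 | 87.1%
Z.AI GLM 4.7 | 91 | 91 | 91 | 91 | 71 | 87.0%
Z.AI GLM 4.6 | 91 | 91 | 91 | 91 | 71 | 87.0%
Claude Opus 4 | 91 | 91 | 91 | 79 | 79 | 86.1%
Xiaomi MIMO v2.5 | 91 | 91 | 91 | 79 | 79 | 86.1%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 69 | 61 | 86.0%
Qwen 3.5 Plus (2026-02-15) | 91 | 91 | 91 | 78 | 78 | 85.9%
Claude Opus 4.7 | 91 | 91 | 84 | 84 | 78 | 85.6%
DeepSeek V4 Flash | 91 | 91 | 84 | 84 | 78 | 85.6%
GPT-4.1 | 100 | 91 | 91 | 91 | 54 | 85.4%
MiniMax M2.7 | 100 | 100 | 91 | 75 | 61 | 85.3%
DeepSeek V4 Pro (Reasoning) | 91 | 91 | 91 | 91 | 61 | 85.0%
Grok 4.3 | 94 | 94 | 94 | 94 | 46 | 84.6%
ByteDance Seed 1.6 | 100 | 91 | 91 | 91 | 48 | 84.3%
Qwen 3.6 27B | 100 | 100 | 91 | 91 | 39 | 84.3%
GPT-5.5 | 92 | 92 | 91 | 78 | 66 | 83.6%
Grok 4.20 (Reasoning) | 92 | 92 | 90 | 89 | 54 | 83.2%
Xiaomi MIMO v2.5 Pro | 97 | 91 | 79 | 71 | 71 | 81.5%
Qwen 3.5 Flash | 94 | 91 | 91 | 86 | 44 | 81.2%
GPT-5.4 | 92 | 79 | 78 | 78 | 78 | 81.1%
Gemini 3 Flash (Preview, Reasoning) | 91 | 78 | 78 | 78 | 73 | 79.7%
Aion 2.0 | 91 | 91 | 91 | 61 | 61 | 79.0%
Gemini 2.5 Flash | 94 | 89 | 86 | 67 | 55 | 78.1%
GPT-5 Mini | 100 | 91 | 91 | 54 | 54 | 77.9%
Claude Sonnet 4.6 | 100 | 77 | 73 | 69 | 69 | 77.7%
Gemini 3.5 Flash (Reasoning, Minimal) | 78 | 78 | 78 | 78 | 73 | 77.1%
Qwen 3.5 Plus (2026-04-20) | 91 | 91 | 79 | 71 | 54 | 77.0%
DeepSeek V3.1 | 100 | 91 | 78 | 78 | 37 | 76.9%
MoonshotAI: Kimi K2.5 | 91 | 91 | 78 | 71 | 54 | 76.9%
Mistral Large 2 | 97 | 97 | 97 | 47 | 45 | 76.6%
Gemini 3 Flash (Preview) | 84 | 84 | 84 | 77 | 54 | 76.5%
Grok 4.3 (Reasoning) | 100 | 97 | 79 | 61 | 35 | 74.2%
Claude Sonnet 4.5 | 92 | 91 | 91 | 54 | 43 | 74.0%
Qwen 3.6 35B | 96 | 91 | 91 | 91 | 0 | 73.8%
Z.AI GLM 4.5 | 100 | 78 | 71 | 63 | 54 | 73.0%
Z.AI GLM 5 | 91 | 91 | 91 | 91 | 0 | 72.9%
Ministral 8B | 100 | 100 | 71 | 52 | 32 | 70.9%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 70 | 64 | 61 | 56 | 70.4%
GPT-5.5 (Reasoning, Low) | 91 | 91 | 91 | 78 | 0 | 70.3%
DeepSeek V3.2 | 91 | 91 | 84 | 44 | 39 | 69.9%
DeepSeek-V2 Chat | 91 | 84 | 70 | 61 | 35 | 68.1%
Gemini 2.5 Flash (Reasoning) | 100 | 91 | 78 | 69 | 1 | 67.9%
Gemma 4 31B (Reasoning) | 91 | 91 | 91 | 61 | 0 | 66.8%
DeepSeek V4 Flash (Reasoning) | 91 | 91 | 61 | 54 | 35 | 66.3%
Qwen 3.5 35B | 84 | 84 | 61 | 54 | 48 | 66.1%
Gemini 3.1 Flash Lite | 84 | 78 | 73 | 61 | 35 | 66.1%
Grok 4.20 (Beta, Reasoning) | 91 | 71 | 55 | 55 | 54 | 65.2%
Qwen 3.5 9B | 100 | 85 | 84 | 46 | 9 | 64.7%
Gemini 2.5 Flash Lite (Reasoning) | 91 | 79 | 54 | 48 | 47 | 63.7%
o4 Mini | 100 | 61 | 61 | 46 | 35 | 60.5%
Gemma 4 26B (Reasoning) | 100 | 97 | 96 | 5 | 1 | 59.9%
Qwen3.7 Max | 97 | 97 | 91 | 11 | 2 | 59.5%
o4 Mini High | 79 | 71 | 61 | 61 | 25 | 59.2%
Ministral 3 8B | 71 | 71 | 61 | 54 | 40 | 59.1%
GPT-4.1 Mini | 100 | 61 | 61 | 37 | 35 | 58.7%
GPT-OSS 120B | 79 | 61 | 61 | 46 | 46 | 58.6%
Writer: Palmyra X5 | 89 | 58 | 54 | 44 | 33 | 55.7%
DeepSeek V4 Pro | 61 | 61 | 61 | 54 | 37 | 54.7%
Qwen3 235B A22B Instruct 2507 | 97 | 76 | 57 | 35 | 0 | 52.9%
Mistral Large | 97 | 45 | 45 | 40 | 37 | 52.8%
Mistral Small 4 | 91 | 57 | 46 | 36 | 33 | 52.6%
GPT-5 Nano | 61 | 61 | 53 | 46 | 40 | 52.2%
GPT-5.4 Nano (Reasoning) | 54 | 54 | 54 | 54 | 40 | 51.1%
WizardLM 2 8x22b | 84 | 61 | 48 | 46 | 15 | 50.8%
Z.AI GLM 4.7 Flash | 73 | 64 | 49 | 46 | 20 | 50.6%
Z.AI GLM 4.5 Air | 90 | 61 | 61 | 25 | 12 | 49.9%
Claude Haiku 4.5 | 67 | 55 | 44 | 43 | 40 | 49.9%
GPT-4o, May 13th (temp=0) | 48 | 48 | 48 | 48 | 48 | 48.2%
GPT-5.4 Nano | 54 | 54 | 40 | 40 | 40 | 45.4%
GPT-4o, Aug. 6th (temp=1) | 61 | 46 | 46 | 35 | 35 | 44.5%
Mistral Large 3 | 45 | 44 | 44 | 44 | 41 | 43.9%
Llama 3.1 70B | 59 | 59 | 54 | 46 | 1 | 43.6%
DeepSeek V3 (2025-03-24) | 100 | 79 | 38 | 0 | 0 | 43.3%
GPT-5.4 Nano (Reasoning, Low) | 61 | 54 | 42 | 40 | 11 | 41.6%
GPT-4o, May 13th (temp=1) | 54 | 48 | 44 | 29 | 25 | 39.9%
Gemini 2.5 Flash Lite | 91 | 54 | 25 | 20 | 4 | 38.9%
GPT-4o, Aug. 6th (temp=0) | 54 | 35 | 35 | 35 | 35 | 38.4%
DeepSeek V3 (2024-12-26) | 45 | 37 | 35 | 34 | 28 | 35.8%
Claude 3 Haiku | 96 | 40 | 15 | 14 | 14 | 35.6%
Mistral Medium 3.1 | 46 | 34 | 34 | 33 | 32 | 35.6%
Mistral Small 4 (Reasoning) | 58 | 46 | 35 | 25 | 0 | 33.0%
Hermes 3 405B | 54 | 46 | 40 | 0 | 0 | 28.0%
Ministral 3 14B | 36 | 26 | 26 | 25 | 24 | 27.3%
Cydonia 24B V4.1 | 65 | 48 | 3 | 2 | 0 | 23.7%
Ministral 3B | 32 | 31 | 31 | 17 | 2 | 22.6%
Qwen 3 32B | 46 | 25 | 25 | 12 | 0 | 21.8%
Ministral 3 3B | 27 | 26 | 25 | 15 | 7 | 20.0%
GPT-4.1 Nano | 46 | 18 | 5 | 0 | 0 | 13.9%
Gemma 3 27B | 14 | 12 | 9 | 9 | 4 | 9.3%
Qwen 2.5 72B | 18 | 11 | 9 | 5 | 3 | 9.1%
Mistral NeMO | 19 | 15 | 5 | 2 | 0 | 8.1%
Arcee AI: Trinity Mini | 16 | 5 | 3 | 3 | 0 | 5.5%
Cohere Command R+ (Aug. 2024) | 22 | 5 | 0 | 0 | 0 | 5.4%
Hermes 3 70B | 13 | 4 | 2 | 1 | 0 | 4.0%
Llama 3.1 8B | 8 | 6 | 2 | 1 | 0 | 3.5%
Rocinante 12B | 15 | 2 | 1 | 0 | 0 | 3.4%
Nemotron 3 Nano | 8 | 0 | 0 | 0 | 0 | 1.6%
Skyfall 36B V2 | 3 | 3 | 0 | 0 | 0 | 1.1%
Gemma 3 4B | 2 | 1 | 1 | 0 | 0 | 0.8%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Super | 0 | - | - | - | - | 0.0%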