Relationship type accuracy

Test: Relationship tree

Avg. Score: 76.6%
Scenarios: 2

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | LFM2 24B | 100.0% | $0.0004 | 7.8s | 100%
2 | Gemma 3 12B | 100.0% | $0.0014 | 30.4s | 100%
3 | GPT-5.4 Nano (Reasoning) | 95.0% | $0.0088 | 41.7s | 86%
4 | GPT-4o, Aug. 6th (temp=0) | 94.8% | $0.032 | 8.4s | 86%
5 | Gemini 2.5 Flash | 92.3% | $0.0075 | 6.1s | 83%
6 | GPT-5.1 | 94.6% | $0.025 | 17.7s | 85%
7 | Z.AI GLM 4.5 | 94.0% | $0.0075 | 48.1s | 84%
8 | Grok 4.20 | 93.0% | $0.020 | 7.5s | 81%
9 | Grok 4.20 (Beta, Reasoning) | 95.2% | $0.032 | 30.1s | 84%
10 | GPT-5.4 (Reasoning, Low) | 95.2% | $0.053 | 35.5s | 84%
11 | Nemotron 3 Super | 100.0% | $0.0000 | 3.2m |
12 | o4 Mini | 94.3% | $0.037 | 45.7s | 80%
13 | GPT-5 Mini | 95.7% | $0.019 | 2.0m | 86%
14 | GPT-4o, May 13th (temp=0) | 93.7% | $0.088 | 5.5s | 83%
15 | Grok 4.20 (Reasoning) | 91.6% | $0.027 | 40.0s | 77%
16 | Claude Opus 4.5 | 96.0% | $0.131 | 19.8s | 89%
17 | GPT-5.2 | 94.2% | $0.062 | 50.7s | 82%
18 | ByteDance Seed 1.6 | 90.6% | $0.014 | 1.2m | 76%
19 | o4 Mini High | 93.3% | $0.059 | 1.3m | 81%
20 | Mistral Large 3 | 86.9% | $0.0091 | 17.5s | 67%
21 | DeepSeek V3.1 | 88.1% | $0.0055 | 52.0s | 69%
22 | Gemini 2.5 Flash (Reasoning) | 88.7% | $0.023 | 37.3s | 69%
23 | Mistral Large 2 | 89.4% | $0.037 | 20.5s | 66%
24 | Grok 4.20 (Beta) | 86.7% | $0.019 | 5.5s | 64%
25 | Gemini 3 Flash (Preview, Reasoning) | 87.1% | $0.030 | 40.3s | 71%
26 | DeepSeek V3.2 | 85.1% | $0.0046 | 33.9s | 64%
27 | Claude Opus 4.6 | 94.5% | $0.154 | 31.4s | 80%
28 | GPT-5.4 (Reasoning) | 99.5% | $0.175 | 2.6m | 98%
29 | GPT-5 | 94.0% | $0.092 | 2.3m | 84%
30 | Mistral Medium 3.1 | 83.3% | $0.010 | 21.2s | 62%
31 | GPT-5.4 | 83.6% | $0.033 | 17.4s | 65%
32 | Claude 3 Haiku | 83.0% | $0.0056 | 11.6s | 59%
33 | Gemini 3 Flash (Preview) | 75.2% | $0.0087 | 7.1s | 68%
34 | Inception Mercury 2 | 78.0% | $0.0096 | 12.8s | 65%
35 | ByteDance Seed 1.6 Flash | 86.7% | $0.0019 | 38.2s | 56%
36 | DeepSeek V3 (2025-03-24) | 85.7% | $0.0037 | 50.3s | 59%
37 | Mistral Large | 84.2% | $0.036 | 41.1s | 65%
38 | Xiaomi MIMO v2.5 Pro | 87.1% | $0.0076 | 1.9m | 67%
39 | GPT-4.1 | 82.0% | $0.024 | 8.5s | 59%
40 | GPT-4o, May 13th (temp=1) | 87.3% | $0.086 | 5.4s | 63%
41 | GPT-4o, Aug. 6th (temp=1) | 82.9% | $0.029 | 9.1s | 58%
42 | Gemini 2.5 Flash Lite (Reasoning) | 83.3% | $0.0073 | 1.0m | 61%
43 | Xiaomi MIMO v2.5 | 83.8% | $0.0027 | 1.1m | 61%
44 | GPT-5.5 (Reasoning, Low) | 88.0% | $0.097 | 42.6s | 70%
45 | Gemini 2.5 Pro | 88.5% | $0.099 | 58.2s | 72%
46 | Ministral 3 14B | 80.8% | $0.0031 | 25.4s | 57%
47 | GPT-5.5 | 80.7% | $0.105 | 37.1s | 78%
48 | Aion 2.0 | 90.0% | $0.024 | 3.0m | 73%
49 | MiniMax M2.5 | 78.0% | $0.0036 | 45.0s | 61%
50 | Gemini 3.5 Flash (Reasoning, Minimal) | 81.2% | $0.024 | 6.3s | 54%
51 | Claude Sonnet 4.6 | 76.7% | $0.081 | 23.3s | 72%
52 | DeepSeek V4 Flash (Reasoning) | 88.9% | $0.0041 | 3.5m | 71%
53 | Gemini 3.5 Flash (Reasoning) | 84.7% | $0.113 | 49.4s | 69%
54 | Grok 4.3 (Reasoning) | 76.6% | $0.013 | 13.9s | 53%
55 | Gemini 3.1 Flash Lite (Preview) | 66.0% | $0.0046 | 3.2s | 62%
56 | Z.AI GLM 4.7 | 87.2% | $0.025 | 3.7m | 73%
57 | Qwen 3.6 Flash | 78.5% | $0.015 | 51.7s | 54%
58 | Z.AI GLM 5 Turbo | 88.7% | $0.044 | 3.6m | 72%
59 | Claude Opus 4.7 | 83.9% | $0.182 | 24.3s | 73%
60 | DeepSeek V4 Pro (Reasoning) | 82.0% | $0.020 | 2.3m | 61%
61 | Gemini 3.1 Pro (Preview) | 87.8% | $0.172 | 1.5m | 75%
62 | GPT-5.4 Mini (Reasoning) | 89.4% | $0.117 | 2.8m | 74%
63 | Qwen3.7 Max | 88.5% | $0.045 | 3.5m | 67%
64 | Claude Opus 4.7 (Reasoning) | 83.0% | $0.184 | 25.0s | 69%
65 | Gemini 3.1 Flash Lite (Reasoning) | 67.1% | $0.0033 | 3.2s | 50%
66 | MoonshotAI: Kimi K2.6 | 92.1% | $0.082 | 5.0m | 81%
67 | ByteDance Seed 2.0 Lite | 83.5% | $0.014 | 3.6m | 64%
68 | Claude Sonnet 4.5 | 77.6% | $0.077 | 18.6s | 53%
69 | Nemotron 3 Nano | 81.9% | $0.0039 | 2.8m | 56%
70 | GPT-OSS 120B | 79.1% | $0.0040 | 2.2m | 53%
71 | Qwen 3.5 397B A17B | 78.2% | $0.045 | 2.6m | 65%
72 | Gemma 4 31B | 68.3% | $0.0028 | 1.9m | 62%
73 | Qwen 3.5 27B | 83.9% | $0.029 | 3.7m | 63%
74 | DeepSeek V4 Pro | 65.0% | $0.0083 | 17.3s | 51%
75 | Cohere Command R+ (Aug. 2024) | 79.2% | $0.057 | 56.5s | 48%
76 | GPT-5.4 Mini (Reasoning, Low) | 64.1% | $0.014 | 14.4s | 52%
77 | Ministral 8B | 66.6% | $0.0016 | 15.6s | 46%
78 | Gemini 3.1 Flash Lite | 64.6% | $0.0031 | 3.3s | 47%
79 | Z.AI GLM 4.6 | 76.9% | $0.018 | 3.0m | 62%
80 | GPT-5.4 Nano (Reasoning, Low) | 67.9% | $0.0038 | 13.8s | 43%
81 | Gemma 4 26B | 59.1% | $0.0018 | 27.2s | 56%
82 | Qwen 3.5 Plus (2026-02-15) | 81.1% | $0.021 | 3.9m | 65%
83 | MiniMax M2.7 | 63.2% | $0.0063 | 21.4s | 49%
84 | Qwen 3.6 27B | 77.6% | $0.034 | 2.7m | 59%
85 | Qwen 3 32B | 74.7% | $0.0042 | 2.9m | 57%
86 | Qwen 3.6 35B | 70.0% | $0.014 | 1.1m | 49%
87 | Qwen 3.5 Plus (2026-04-20) | 81.5% | $0.023 | 3.1m | 54%
88 | GPT-5.4 Mini | 58.0% | $0.0087 | 7.9s | 51%
89 | Writer: Palmyra X5 | 69.1% | $0.016 | 18.4s | 40%
90 | WizardLM 2 8x22b | 75.9% | $0.0098 | 2.3m | 49%
91 | Qwen3.6 Max Preview | 82.9% | $0.084 | 3.7m | 67%
92 | MoonshotAI: Kimi K2.5 | 86.0% | $0.031 | 5.4m | 68%
93 | Qwen 3.5 Flash | 74.9% | $0.0040 | 3.8m | 58%
94 | GPT-5.5 (Reasoning) | 88.6% | $0.251 | 2.0m | 72%
95 | Qwen3 235B A22B Instruct 2507 | 69.5% | $0.0017 | 1.0m | 37%
96 | Z.AI GLM 5 | 81.6% | $0.037 | 4.6m | 62%
97 | Gemini 2.5 Flash Lite | 64.7% | $0.0023 | 4.8s | 32%
98 | GPT-4.1 Nano | 57.8% | $0.0009 | 6.9s | 40%
99 | Mistral Small 3.2 24B | 53.8% | $0.0017 | 13.8s | 45%
100 | Z.AI GLM 4.7 Flash | 74.3% | $0.0050 | 2.8m | 44%
101 | Mistral Small 4 | 65.4% | $0.0030 | 7.4s | 29%
102 | GPT-4.1 Mini | 55.2% | $0.0046 | 26.0s | 45%
103 | DeepSeek V3 (2024-12-26) | 64.8% | $0.0045 | 40.2s | 35%
104 | Qwen 3.5 122B | 76.3% | $0.035 | 4.0m | 59%
105 | DeepSeek-V2 Chat | 64.1% | $0.0049 | 34.0s | 35%
106 | Qwen 3.5 35B | 65.4% | $0.015 | 2.0m | 49%
107 | GPT-5 Nano | 69.2% | $0.0080 | 2.2m | 44%
108 | Claude Sonnet 4 | 62.2% | $0.072 | 17.5s | 47%
109 | DeepSeek V4 Flash | 61.4% | $0.0020 | 51.4s | 38%
110 | Claude Opus 4.6 (Reasoning) | 94.1% | $0.356 | 2.6m | 83%
111 | Grok 4.3 | 55.6% | $0.0094 | 7.8s | 39%
112 | Claude Opus 4.8 (Reasoning, Low) | 92.6% | $0.383 | 1.9m | 82%
113 | Mistral Small 4 (Reasoning) | 64.2% | $0.0051 | 24.7s | 29%
114 | Claude Opus 4.8 (Reasoning) | 91.1% | $0.381 | 1.9m | 80%
115 | Z.AI GLM 5.1 | 82.2% | $0.081 | 7.3m | 80%
116 | MiniMax M3 | 91.1% | $0.036 | 9.3m | 79%
117 | Ministral 3 8B | 51.7% | $0.0022 | 17.0s | 35%
118 | ByteDance Seed 2.0 Mini | 87.6% | $0.0062 | 8.7m | 70%
119 | Gemma 4 26B (Reasoning) | 65.9% | $0.0041 | 3.1m | 43%
120 | Cydonia 24B V4.1 | 59.8% | $0.0040 | 1.0m | 30%
121 | Llama 3.1 70B | 47.5% | $0.0064 | 34.0s | 41%
122 | GPT-5.4 Nano | 46.3% | $0.0026 | 11.2s | 37%
123 | Gemma 3 4B | 52.3% | $0.0007 | 27.5s | 30%
124 | Gemma 4 31B (Reasoning) | 73.9% | $0.0043 | 5.8m | 54%
125 | Skyfall 36B V2 | 58.9% | $0.0054 | 26.5s | 21%
126 | Claude Haiku 4.5 | 47.2% | $0.024 | 8.2s | 35%
127 | Ministral 3B | 44.7% | $0.0007 | 9.8s | 33%
128 | Hermes 3 405B | 57.4% | $0.015 | 1.0m | 24%
129 | Llama 3.1 8B | 47.0% | $0.0005 | 51.5s | 31%
130 | Ministral 3 3B | 40.8% | $0.0007 | 9.9s | 32%
131 | Qwen 2.5 72B | 44.8% | $0.0056 | 1.2m | 36%
132 | Rocinante 12B | 58.8% | $0.0033 | 32.1s | 11%
133 | Z.AI GLM 4.5 Air | 57.5% | $0.0071 | 3.2m | 38%
134 | Hermes 3 70B | 50.3% | $0.0045 | 42.7s | 20%
135 | Gemma 3 27B | 42.7% | $0.0016 | 27.5s | 23%
136 | Arcee AI: Trinity Mini | 52.8% | $0.0037 | 1.9m | 23%
137 | Qwen 3.5 9B | 64.4% | $0.0040 | 7.8m | 53%
138 | Mistral NeMO | 30.3% | $0.0024 | 15.4s | 21%
139 | Claude Sonnet 4.6 (Reasoning) | 85.2% | $0.448 | 5.9m | 73%
140 | Claude Opus 4 | 65.2% | $0.368 | 2.4m | 49%
Avg. Score: 76.56%

Individual Scenarios

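Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 90 | 98.1%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 87 | 97.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 85 | 97.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 84 | 96.9%
GPT-5 Mini | 100 | 100 | 100 | 100 | 80 | 96.1%
Z.AI GLM 4.5 | 100 | 100 | 100 | 93 | 86 | 95.8%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 76 | 95.2%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 100 | 95 | 80 | 95.0%
o4 Mini High | 100 | 100 | 100 | 95 | 76 | 94.1%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 89 | 81 | 94.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 65 | 93.1%
Gemini 2.5 Flash | 100 | 100 | 93 | 90 | 81 | 92.8%
Claude Opus 4.6 | 100 | 100 | 100 | 93 | 71 | 92.8%
Grok 4.20 | 100 | 100 | 100 | 88 | 74 | 92.4%
ByteDance Seed 1.6 | 100 | 100 | 100 | 88 | 73 | 92.3%
Mistral Large 3 | 100 | 100 | 100 | 90 | 71 | 92.2%
MiniMax M3 | 100 | 100 | 100 | 81 | 80 | 92.1%
GPT-5 | 100 | 98 | 98 | 83 | 82 | 92.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 81 | 79 | 92.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 81 | 79 | 91.9%
GPT-5.2 | 100 | 100 | 100 | 85 | 74 | 91.8%
Mistral Medium 3.1 | 100 | 100 | 100 | 87 | 66 | 90.5%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 75 | 75 | 90.0%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 83 | 83 | 83 | 89.5%
Gemini 2.5 Pro | 100 | 100 | 83 | 83 | 83 | 89.5%
MoonshotAI: Kimi K2.6 | 100 | 100 | 83 | 83 | 82 | 89.4%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 73 | 72 | 89.0%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 83 | 83 | 80 | 89.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 83 | 83 | 80 | 89.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 72 | 72 | 88.9%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 77 | 66 | 88.6%
Aion 2.0 | 100 | 100 | 100 | 72 | 68 | 88.1%
Mistral Large | 100 | 100 | 87 | 86 | 61 | 86.8%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 90 | 44 | 86.7%
Z.AI GLM 4.7 Flash | 100 | 100 | 95 | 75 | 63 | 86.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 72 | 60 | 86.5%
Xiaomi MIMO v2.5 | 100 | 100 | 83 | 76 | 72 | 86.1%
GPT-5.5 (Reasoning, Low) | 100 | 83 | 83 | 83 | 83 | 86.0%
DeepSeek V3.1 | 100 | 100 | 87 | 75 | 65 | 85.4%
Claude Opus 4.8 (Reasoning, Low) | 100 | 83 | 83 | 81 | 80 | 85.1%
Claude Opus 4.8 (Reasoning) | 100 | 83 | 83 | 83 | 77 | 84.9%
Z.AI GLM 5 Turbo | 100 | 83 | 81 | 81 | 79 | 84.7%
Gemini 3.1 Pro (Preview) | 100 | 83 | 83 | 83 | 76 | 84.7%
Qwen3.7 Max | 100 | 83 | 83 | 81 | 76 | 84.3%
Grok 4.20 (Beta) | 100 | 100 | 100 | 67 | 51 | 83.7%
Nemotron 3 Nano | 100 | 100 | 90 | 66 | 63 | 83.7%
Z.AI GLM 4.7 | 100 | 86 | 82 | 80 | 71 | 83.6%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 83 | 75 | 60 | 83.5%
GPT-5.5 (Reasoning) | 83 | 83 | 83 | 83 | 83 | 82.6%
Claude Sonnet 4.6 (Reasoning) | 83 | 83 | 83 | 83 | 83 | 82.6%
Claude Opus 4.7 (Reasoning) | 83 | 83 | 83 | 83 | 83 | 82.6%
GPT-5.5 | 83 | 83 | 83 | 83 | 83 | 82.6%
Qwen 3.5 27B | 100 | 81 | 77 | 76 | 76 | 82.0%
Z.AI GLM 5.1 | 83 | 83 | 83 | 83 | 80 | 82.0%
Ministral 3 14B | 100 | 96 | 92 | 74 | 46 | 81.7%
GPT-4.1 | 100 | 100 | 85 | 67 | 55 | 81.5%
MoonshotAI: Kimi K2.5 | 100 | 83 | 83 | 68 | 66 | 79.9%
Qwen3.6 Max Preview | 83 | 80 | 80 | 80 | 77 | 79.8%
Qwen 3.5 Flash | 100 | 79 | 76 | 73 | 71 | 79.8%
Claude Opus 4.7 | 83 | 83 | 83 | 82 | 68 | 79.5%
DeepSeek V4 Pro (Reasoning) | 100 | 80 | 76 | 72 | 69 | 79.4%
GPT-4o, Aug. 6th (temp=1) | 100 | 91 | 88 | 80 | 37 | 79.0%
Gemini 3 Flash (Preview) | 80 | 80 | 79 | 78 | 78 | 78.8%
Qwen 3.5 397B A17B | 83 | 80 | 79 | 79 | 74 | 78.8%
DeepSeek V3.2 | 100 | 100 | 72 | 64 | 56 | 78.6%
Claude Sonnet 4.6 | 83 | 79 | 79 | 79 | 74 | 78.5%
GPT-5.4 | 79 | 78 | 78 | 78 | 78 | 78.2%
ByteDance Seed 2.0 Lite | 100 | 83 | 80 | 76 | 52 | 78.1%
Qwen 3.5 122B | 90 | 77 | 76 | 74 | 72 | 77.8%
Qwen 3.5 Plus (2026-02-15) | 81 | 80 | 79 | 74 | 74 | 77.5%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 42 | 41 | 76.6%
Qwen 3.5 Plus (2026-04-20) | 100 | 74 | 72 | 71 | 64 | 76.4%
Qwen 3 32B | 100 | 81 | 78 | 62 | 61 | 76.2%
GPT-5 Nano | 94 | 89 | 83 | 57 | 55 | 75.7%
Qwen 3.6 27B | 85 | 82 | 77 | 71 | 64 | 75.7%
Mistral Small 4 | 100 | 100 | 100 | 51 | 25 | 75.3%
Qwen 3.6 Flash | 100 | 76 | 70 | 69 | 61 | 75.0%
Inception Mercury 2 | 78 | 78 | 75 | 71 | 71 | 74.6%
Z.AI GLM 5 | 83 | 83 | 80 | 66 | 61 | 74.4%
GPT-OSS 120B | 100 | 77 | 67 | 67 | 60 | 74.3%
WizardLM 2 8x22b | 100 | 95 | 68 | 65 | 42 | 74.1%
MiniMax M2.5 | 88 | 76 | 74 | 72 | 60 | 74.0%
Grok 4.3 (Reasoning) | 100 | 100 | 58 | 56 | 53 | 73.5%
DeepSeek V3 (2025-03-24) | 100 | 100 | 58 | 58 | 47 | 72.7%
Z.AI GLM 4.6 | 80 | 80 | 78 | 68 | 53 | 71.7%
DeepSeek V4 Pro | 82 | 76 | 71 | 67 | 54 | 70.2%
Gemma 4 31B (Reasoning) | 76 | 72 | 72 | 70 | 60 | 69.9%
Claude 3 Haiku | 84 | 80 | 71 | 67 | 44 | 69.2%
Qwen 3.5 9B | 80 | 74 | 69 | 66 | 56 | 69.0%
Gemini 3.1 Flash Lite (Preview) | 72 | 70 | 67 | 67 | 67 | 68.9%
Qwen 3.5 35B | 75 | 72 | 66 | 66 | 64 | 68.6%
Gemma 4 31B | 74 | 72 | 70 | 64 | 61 | 68.2%
Gemini 3.1 Flash Lite (Reasoning) | 70 | 70 | 70 | 70 | 58 | 67.5%
Gemma 4 26B (Reasoning) | 82 | 76 | 66 | 58 | 52 | 66.9%
GPT-4.1 Nano | 84 | 82 | 58 | 56 | 53 | 66.5%
Gemini 2.5 Flash Lite | 100 | 100 | 71 | 33 | 25 | 65.7%
DeepSeek V4 Flash | 89 | 66 | 65 | 57 | 50 | 65.3%
GPT-5.4 Mini (Reasoning, Low) | 83 | 67 | 65 | 61 | 44 | 64.0%
Z.AI GLM 4.5 Air | 79 | 75 | 69 | 56 | 42 | 64.0%
Claude Sonnet 4 | 76 | 66 | 66 | 66 | 44 | 63.6%
GPT-5.4 Nano (Reasoning, Low) | 100 | 65 | 59 | 55 | 39 | 63.5%
Mistral Small 4 (Reasoning) | 100 | 100 | 50 | 42 | 24 | 63.1%
Qwen 3.6 35B | 66 | 66 | 61 | 61 | 60 | 62.9%
Ministral 8B | 77 | 76 | 68 | 50 | 42 | 62.7%
Claude Opus 4 | 66 | 66 | 66 | 65 | 48 | 62.3%
Qwen3 235B A22B Instruct 2507 | 100 | 63 | 56 | 50 | 42 | 62.2%
Cydonia 24B V4.1 | 100 | 69 | 67 | 38 | 36 | 62.1%
DeepSeek V3 (2024-12-26) | 100 | 58 | 52 | 52 | 46 | 61.9%
Gemini 3.1 Flash Lite | 73 | 67 | 61 | 58 | 49 | 61.7%
Claude Sonnet 4.5 | 76 | 76 | 66 | 45 | 45 | 61.6%
DeepSeek-V2 Chat | 86 | 75 | 52 | 48 | 46 | 61.6%
MiniMax M2.7 | 72 | 63 | 63 | 53 | 52 | 60.7%
Grok 4.3 | 91 | 63 | 58 | 50 | 35 | 59.2%
Gemma 4 26B | 60 | 58 | 58 | 56 | 56 | 57.6%
Gemma 3 4B | 85 | 71 | 56 | 53 | 23 | 57.5%
Llama 3.1 8B | 73 | 63 | 63 | 47 | 41 | 57.1%
GPT-5.4 Mini | 60 | 60 | 59 | 50 | 50 | 55.7%
Gemma 3 27B | 100 | 50 | 43 | 40 | 40 | 54.6%
GPT-4.1 Mini | 74 | 66 | 49 | 48 | 35 | 54.2%
Mistral Small 3.2 24B | 69 | 59 | 54 | 51 | 36 | 53.8%
Arcee AI: Trinity Mini | 100 | 53 | 47 | 36 | 25 | 52.1%
Skyfall 36B V2 | 100 | 67 | 34 | 30 | 25 | 51.2%
Ministral 3 8B | 68 | 59 | 41 | 40 | 36 | 49.0%
Writer: Palmyra X5 | 58 | 52 | 49 | 46 | 36 | 48.2%
Claude Haiku 4.5 | 71 | 60 | 38 | 38 | 34 | 48.2%
Hermes 3 405B | 100 | 43 | 38 | 32 | 28 | 48.0%
Llama 3.1 70B | 56 | 54 | 50 | 42 | 37 | 47.9%
GPT-5.4 Nano | 62 | 61 | 47 | 34 | 24 | 45.4%
Ministral 3 3B | 63 | 40 | 40 | 32 | 26 | 40.0%
Qwen 2.5 72B | 48 | 46 | 37 | 36 | 32 | 39.6%
Mistral NeMO | 48 | 48 | 46 | 25 | 22 | 37.8%
Hermes 3 70B | 43 | 42 | 42 | 28 | 27 | 36.4%
Ministral 3B | 48 | 43 | 36 | 29 | 26 | 36.2%
Rocinante 12B | 100 | 45 | 25 | 8 | 0 | 35.7%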
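Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.8 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | | | | | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 97 | 97 | 98.9%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 97 | 96 | 98.7%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 93 | 98.7%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 97 | 96 | 98.5%
Claude Opus 4.8 (Reasoning) | 100 | 97 | 97 | 97 | 97 | 97.3%
Claude 3 Haiku | 100 | 100 | 95 | 95 | 94 | 96.8%
GPT-5.2 | 98 | 97 | 97 | 96 | 96 | 96.7%
Claude Opus 4.6 | 100 | 97 | 97 | 95 | 93 | 96.2%
GPT-5 | 100 | 98 | 95 | 95 | 92 | 96.1%
GPT-4o, Aug. 6th (temp=0) | 100 | 97 | 97 | 93 | 92 | 95.7%
GPT-5 Mini | 97 | 97 | 95 | 94 | 93 | 95.3%
Grok 4.20 (Beta, Reasoning) | 100 | 96 | 95 | 93 | 93 | 95.2%
MoonshotAI: Kimi K2.6 | 100 | 99 | 99 | 96 | 80 | 94.8%
GPT-5.5 (Reasoning) | 100 | 100 | 99 | 88 | 88 | 94.7%
DeepSeek V4 Flash (Reasoning) | 99 | 97 | 96 | 95 | 85 | 94.3%
Claude Opus 4.5 | 96 | 95 | 94 | 94 | 91 | 93.8%
Claude Sonnet 4.5 | 94 | 94 | 94 | 93 | 93 | 93.5%
Grok 4.20 | 98 | 98 | 95 | 92 | 85 | 93.5%
Grok 4.20 (Reasoning) | 95 | 94 | 94 | 92 | 91 | 93.2%
GPT-5.4 Nano (Reasoning) | 100 | 96 | 96 | 88 | 85 | 93.0%
Qwen3.7 Max | 100 | 100 | 100 | 82 | 81 | 92.7%
Z.AI GLM 5 Turbo | 100 | 96 | 96 | 88 | 83 | 92.7%
o4 Mini High | 98 | 95 | 91 | 91 | 88 | 92.6%
GPT-5.1 | 98 | 93 | 92 | 92 | 86 | 92.3%
Z.AI GLM 4.5 | 98 | 94 | 93 | 91 | 85 | 92.2%
MoonshotAI: Kimi K2.5 | 96 | 95 | 95 | 95 | 79 | 92.1%
Aion 2.0 | 99 | 99 | 94 | 94 | 74 | 91.9%
Gemini 2.5 Flash | 93 | 93 | 92 | 92 | 90 | 91.9%
DeepSeek V3.2 | 96 | 93 | 93 | 89 | 88 | 91.7%
Gemini 3.1 Pro (Preview) | 97 | 92 | 91 | 91 | 83 | 90.9%
Z.AI GLM 4.7 | 95 | 94 | 94 | 90 | 81 | 90.8%
DeepSeek V3.1 | 100 | 91 | 90 | 86 | 86 | 90.8%
MiniMax M3 | 96 | 96 | 95 | 83 | 80 | 90.1%
GPT-4o, May 13th (temp=0) | 97 | 93 | 90 | 89 | 81 | 90.1%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 84 | 84 | 82 | 89.9%
Writer: Palmyra X5 | 97 | 92 | 89 | 89 | 83 | 89.9%
Grok 4.20 (Beta) | 93 | 93 | 91 | 91 | 80 | 89.6%
ByteDance Seed 1.6 | 96 | 92 | 92 | 84 | 81 | 88.9%
GPT-5.4 | 94 | 94 | 94 | 93 | 70 | 88.9%
ByteDance Seed 2.0 Lite | 93 | 91 | 90 | 90 | 80 | 88.9%
Z.AI GLM 5 | 100 | 99 | 92 | 82 | 71 | 88.7%
o4 Mini | 96 | 95 | 92 | 92 | 68 | 88.5%
Claude Opus 4.7 | 94 | 89 | 88 | 86 | 84 | 88.2%
Claude Sonnet 4.6 (Reasoning) | 96 | 96 | 84 | 82 | 80 | 87.8%
Gemini 2.5 Pro | 96 | 95 | 90 | 79 | 77 | 87.5%
GPT-5.4 Mini (Reasoning) | 96 | 95 | 86 | 84 | 74 | 86.9%
GPT-4o, Aug. 6th (temp=1) | 97 | 93 | 90 | 80 | 73 | 86.7%
Qwen 3.5 Plus (2026-04-20) | 99 | 98 | 93 | 74 | 71 | 86.7%
ByteDance Seed 2.0 Mini | 96 | 90 | 89 | 83 | 73 | 86.3%
Qwen3.6 Max Preview | 97 | 93 | 93 | 75 | 72 | 86.0%
Qwen 3.5 27B | 99 | 96 | 85 | 77 | 71 | 85.8%
GPT-4o, May 13th (temp=1) | 100 | 100 | 97 | 87 | 44 | 85.6%
Xiaomi MIMO v2.5 Pro | 98 | 94 | 88 | 82 | 66 | 85.5%
Gemini 3 Flash (Preview, Reasoning) | 94 | 92 | 90 | 78 | 73 | 85.1%
Qwen 3.5 Plus (2026-02-15) | 94 | 94 | 93 | 74 | 69 | 84.7%
DeepSeek V4 Pro (Reasoning) | 100 | 93 | 80 | 76 | 74 | 84.5%
Gemini 2.5 Flash (Reasoning) | 93 | 93 | 90 | 79 | 67 | 84.2%
GPT-OSS 120B | 95 | 95 | 92 | 70 | 66 | 83.9%
Claude Opus 4.7 (Reasoning) | 99 | 90 | 84 | 80 | 65 | 83.4%
GPT-4.1 | 92 | 92 | 90 | 83 | 55 | 82.4%
Z.AI GLM 5.1 | 84 | 84 | 83 | 81 | 81 | 82.4%
Qwen 3.6 Flash | 95 | 92 | 91 | 74 | 58 | 82.1%
Z.AI GLM 4.6 | 92 | 88 | 78 | 76 | 76 | 82.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 58 | 51 | 81.9%
MiniMax M2.5 | 92 | 90 | 86 | 86 | 55 | 81.9%
Rocinante 12B | 100 | 100 | 100 | 100 | 9 | 81.9%
Mistral Large 3 | 95 | 93 | 93 | 68 | 61 | 81.7%
Mistral Large | 92 | 91 | 91 | 69 | 65 | 81.6%
Xiaomi MIMO v2.5 | 100 | 95 | 93 | 60 | 59 | 81.6%
Inception Mercury 2 | 97 | 83 | 82 | 81 | 65 | 81.4%
Gemini 3.5 Flash (Reasoning) | 85 | 83 | 78 | 78 | 78 | 80.4%
Nemotron 3 Nano | 100 | 100 | 83 | 63 | 55 | 80.2%
Gemini 2.5 Flash Lite (Reasoning) | 92 | 91 | 83 | 69 | 67 | 80.0%
Ministral 3 14B | 97 | 94 | 89 | 62 | 58 | 80.0%
Grok 4.3 (Reasoning) | 89 | 88 | 83 | 82 | 56 | 79.7%
Qwen 3.6 27B | 98 | 94 | 76 | 65 | 64 | 79.5%
GPT-5.5 | 83 | 80 | 80 | 78 | 74 | 78.9%
Mistral Large 2 | 93 | 92 | 91 | 61 | 58 | 78.8%
Gemma 4 31B (Reasoning) | 100 | 91 | 70 | 66 | 63 | 77.9%
WizardLM 2 8x22b | 94 | 89 | 86 | 64 | 55 | 77.7%
Qwen 3.5 397B A17B | 99 | 78 | 74 | 73 | 64 | 77.5%
Qwen 3.6 35B | 100 | 91 | 67 | 66 | 63 | 77.2%
Qwen3 235B A22B Instruct 2507 | 100 | 91 | 77 | 60 | 56 | 76.7%
Mistral Medium 3.1 | 84 | 83 | 75 | 74 | 65 | 76.1%
Claude Sonnet 4.6 | 78 | 77 | 76 | 74 | 71 | 75.0%
Qwen 3.5 122B | 100 | 77 | 67 | 65 | 64 | 74.7%
Qwen 3 32B | 87 | 82 | 80 | 61 | 56 | 73.3%
GPT-5.4 Nano (Reasoning, Low) | 89 | 77 | 69 | 64 | 62 | 72.3%
Gemini 3 Flash (Preview) | 83 | 75 | 70 | 66 | 64 | 71.5%
Ministral 8B | 91 | 88 | 81 | 51 | 41 | 70.4%
Qwen 3.5 Flash | 80 | 74 | 74 | 68 | 55 | 70.1%
Gemma 4 31B | 75 | 69 | 66 | 66 | 65 | 68.4%
Claude Opus 4 | 98 | 70 | 62 | 58 | 53 | 68.2%
DeepSeek V3 (2024-12-26) | 88 | 87 | 63 | 53 | 46 | 67.7%
Gemini 3.1 Flash Lite | 97 | 65 | 64 | 58 | 55 | 67.6%
Gemini 3.5 Flash (Reasoning, Minimal) | 83 | 72 | 69 | 58 | 55 | 67.3%
Hermes 3 405B | 100 | 79 | 56 | 53 | 46 | 66.8%
Gemini 3.1 Flash Lite (Reasoning) | 98 | 60 | 59 | 59 | 58 | 66.6%
DeepSeek-V2 Chat | 92 | 88 | 56 | 53 | 44 | 66.6%
Skyfall 36B V2 | 100 | 100 | 53 | 50 | 30 | 66.5%
MiniMax M2.7 | 79 | 79 | 62 | 60 | 48 | 65.6%
Mistral Small 4 (Reasoning) | 97 | 63 | 59 | 56 | 51 | 65.2%
Gemma 4 26B (Reasoning) | 100 | 75 | 72 | 44 | 33 | 64.8%
Hermes 3 70B | 100 | 100 | 54 | 42 | 25 | 64.2%
GPT-5.4 Mini (Reasoning, Low) | 71 | 70 | 63 | 61 | 56 | 64.1%
Gemini 2.5 Flash Lite | 87 | 81 | 70 | 50 | 31 | 63.7%
Gemini 3.1 Flash Lite (Preview) | 68 | 65 | 63 | 60 | 60 | 63.0%
GPT-5 Nano | 82 | 71 | 57 | 53 | 51 | 62.7%
Qwen 3.5 35B | 80 | 73 | 71 | 62 | 25 | 62.2%
Z.AI GLM 4.7 Flash | 84 | 69 | 62 | 58 | 36 | 62.0%
Claude Sonnet 4 | 88 | 60 | 59 | 49 | 48 | 60.8%
Gemma 4 26B | 63 | 63 | 61 | 60 | 57 | 60.7%
GPT-5.4 Mini | 74 | 62 | 59 | 55 | 52 | 60.3%
DeepSeek V4 Pro | 74 | 64 | 63 | 59 | 40 | 59.9%
Qwen 3.5 9B | 71 | 61 | 58 | 57 | 52 | 59.8%
Cydonia 24B V4.1 | 100 | 66 | 52 | 42 | 28 | 57.5%
DeepSeek V4 Flash | 90 | 55 | 51 | 46 | 45 | 57.5%
GPT-4.1 Mini | 59 | 59 | 58 | 56 | 51 | 56.3%
Mistral Small 4 | 67 | 65 | 52 | 50 | 44 | 55.5%
Ministral 3 8B | 81 | 50 | 49 | 47 | 46 | 54.4%
Mistral Small 3.2 24B | 58 | 57 | 54 | 54 | 45 | 53.8%
Arcee AI: Trinity Mini | 100 | 56 | 48 | 43 | 23 | 53.6%
Ministral 3B | 69 | 60 | 51 | 47 | 39 | 53.2%
Grok 4.3 | 62 | 59 | 54 | 53 | 33 | 52.1%
Z.AI GLM 4.5 Air | 69 | 65 | 58 | 55 | 9 | 51.1%
Qwen 2.5 72B | 63 | 53 | 52 | 42 | 40 | 49.9%
GPT-4.1 Nano | 58 | 50 | 46 | 46 | 45 | 49.2%
Llama 3.1 70B | 56 | 53 | 53 | 52 | 23 | 47.2%
Gemma 3 4B | 78 | 65 | 56 | 26 | 10 | 47.2%
GPT-5.4 Nano | 51 | 49 | 47 | 46 | 42 | 47.1%
Claude Haiku 4.5 | 53 | 52 | 45 | 44 | 38 | 46.2%
Ministral 3 3B | 48 | 47 | 43 | 35 | 35 | 41.6%
Llama 3.1 8B | 50 | 39 | 36 | 35 | 25 | 36.9%
Gemma 3 27B | 38 | 34 | 33 | 27 | 22 | 30.8%
Mistral NeMO | 35 | 31 | 18 | 16 | 14 | 22.8%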