Relationship precision

Test: Relationship tree

Avg. Score: 67.7%
Scenarios: 2

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 3.1 Flash Lite (Reasoning) | 100.0% | $0.0033 | 3.2s | 100%
2 | LFM2 24B | 100.0% | $0.0004 | 7.8s | 100%
3 | Gemini 3.1 Flash Lite (Preview) | 100.0% | $0.0046 | 3.2s | 100%
4 | Gemma 3 12B | 100.0% | $0.0014 | 30.4s | 100%
5 | Gemma 4 26B | 96.6% | $0.0018 | 27.2s | 89%
6 | Gemini 2.5 Flash | 94.5% | $0.0075 | 6.1s | 82%
7 | Qwen 3.6 35B | 98.4% | $0.014 | 1.1m | 90%
8 | Qwen 3.6 Flash | 97.1% | $0.015 | 51.7s | 88%
9 | GPT-5.4 (Reasoning, Low) | 97.0% | $0.053 | 35.5s | 85%
10 | Claude Sonnet 4 | 97.3% | $0.072 | 17.5s | 84%
11 | Claude Sonnet 4.5 | 95.3% | $0.077 | 18.6s | 85%
12 | GPT-5 Mini | 96.0% | $0.019 | 2.0m | 86%
13 | Grok 4.20 (Beta, Reasoning) | 92.0% | $0.032 | 30.1s | 76%
14 | GPT-5 | 100.0% | $0.092 | 2.3m | 100%
15 | Nemotron 3 Super | 100.0% | $0.0000 | 3.2m | -
16 | GPT-5.4 Mini (Reasoning, Low) | 88.9% | $0.014 | 14.4s | 68%
17 | GPT-5.2 | 94.5% | $0.062 | 50.7s | 78%
18 | GPT-5.4 Nano (Reasoning) | 88.4% | $0.0088 | 41.7s | 71%
19 | Xiaomi MIMO v2.5 | 88.4% | $0.0027 | 1.1m | 69%
20 | o4 Mini High | 91.7% | $0.059 | 1.3m | 77%
21 | DeepSeek V4 Pro (Reasoning) | 93.1% | $0.020 | 2.3m | 77%
22 | Gemini 3 Flash (Preview) | 83.8% | $0.0087 | 7.1s | 56%
23 | Gemini 3.1 Flash Lite | 88.2% | $0.0031 | 3.3s | 49%
24 | Claude Opus 4.5 | 92.4% | $0.131 | 19.8s | 75%
25 | Nemotron 3 Nano | 93.1% | $0.0039 | 2.8m | 71%
26 | Grok 4.20 (Reasoning) | 85.4% | $0.027 | 40.0s | 58%
27 | Claude Sonnet 4.6 | 84.6% | $0.081 | 23.3s | 66%
28 | GPT-5.4 | 80.9% | $0.033 | 17.4s | 57%
29 | Grok 4.3 | 80.0% | $0.0094 | 7.8s | 50%
30 | o4 Mini | 85.4% | $0.037 | 45.7s | 57%
31 | Mistral Large 3 | 78.2% | $0.0091 | 17.5s | 52%
32 | Grok 4.3 (Reasoning) | 84.6% | $0.013 | 13.9s | 44%
33 | Qwen 3.5 35B | 87.4% | $0.015 | 2.0m | 60%
34 | Xiaomi MIMO v2.5 Pro | 84.8% | $0.0076 | 1.9m | 60%
35 | GPT-5.4 Mini (Reasoning) | 95.6% | $0.117 | 2.8m | 82%
36 | Qwen 3.5 Plus (2026-04-20) | 90.3% | $0.023 | 3.1m | 70%
37 | Qwen3.6 Max Preview | 95.0% | $0.084 | 3.7m | 84%
38 | GPT-OSS 120B | 83.2% | $0.0040 | 2.2m | 58%
39 | Gemini 2.5 Flash (Reasoning) | 82.4% | $0.023 | 37.3s | 45%
40 | Qwen 3.5 122B | 91.0% | $0.035 | 4.0m | 75%
41 | DeepSeek V4 Pro | 78.5% | $0.0083 | 17.3s | 40%
42 | DeepSeek V4 Flash | 79.7% | $0.0020 | 51.4s | 43%
43 | GPT-5.5 (Reasoning, Low) | 82.0% | $0.097 | 42.6s | 60%
44 | GPT-4o, Aug. 6th (temp=0) | 83.1% | $0.032 | 8.4s | 36%
45 | Inception Mercury 2 | 74.8% | $0.0096 | 12.8s | 39%
46 | Claude Opus 4.6 | 83.7% | $0.154 | 31.4s | 64%
47 | Qwen 3.6 27B | 87.7% | $0.034 | 2.7m | 56%
48 | DeepSeek V3.2 | 73.5% | $0.0046 | 33.9s | 41%
49 | GPT-4o, May 13th (temp=1) | 83.3% | $0.086 | 5.4s | 43%
50 | Mistral Large | 74.8% | $0.036 | 41.1s | 46%
51 | Gemma 4 26B (Reasoning) | 89.1% | $0.0041 | 3.1m | 50%
52 | Z.AI GLM 4.5 | 74.3% | $0.0075 | 48.1s | 39%
53 | Qwen 3.5 Plus (2026-02-15) | 87.6% | $0.021 | 3.9m | 58%
54 | Claude Opus 4.7 | 84.4% | $0.182 | 24.3s | 59%
55 | Gemma 4 31B (Reasoning) | 93.1% | $0.0043 | 5.8m | 69%
56 | Z.AI GLM 5 | 90.4% | $0.037 | 4.6m | 64%
57 | Gemma 4 31B | 75.5% | $0.0028 | 1.9m | 44%
58 | Mistral Large 2 | 70.1% | $0.037 | 20.5s | 37%
59 | Z.AI GLM 5 Turbo | 83.6% | $0.044 | 3.6m | 57%
60 | Aion 2.0 | 80.6% | $0.024 | 3.0m | 49%
61 | Z.AI GLM 4.5 Air | 80.9% | $0.0071 | 3.2m | 46%
62 | ByteDance Seed 1.6 | 71.4% | $0.014 | 1.2m | 35%
63 | Qwen 3.5 Flash | 79.6% | $0.0040 | 3.8m | 51%
64 | Claude Opus 4.7 (Reasoning) | 84.7% | $0.184 | 25.0s | 49%
65 | GPT-5.4 Nano (Reasoning, Low) | 59.2% | $0.0038 | 13.8s | 31%
66 | Claude Haiku 4.5 | 67.7% | $0.024 | 8.2s | 25%
67 | Llama 3.1 70B | 61.5% | $0.0064 | 34.0s | 32%
68 | DeepSeek-V2 Chat | 71.6% | $0.0049 | 34.0s | 20%
69 | DeepSeek V3 (2025-03-24) | 69.2% | $0.0037 | 50.3s | 23%
70 | Gemini 3 Flash (Preview, Reasoning) | 67.5% | $0.030 | 40.3s | 27%
71 | DeepSeek V3.1 | 71.1% | $0.0055 | 52.0s | 19%
72 | Gemini 2.5 Pro | 76.7% | $0.099 | 58.2s | 35%
73 | Gemini 3.5 Flash (Reasoning) | 72.9% | $0.113 | 49.4s | 37%
74 | Claude Opus 4 | 96.6% | $0.368 | 2.4m | 86%
75 | GPT-4o, May 13th (temp=0) | 69.6% | $0.088 | 5.5s | 26%
76 | Gemini 3.5 Flash (Reasoning, Minimal) | 62.4% | $0.024 | 6.3s | 20%
77 | GPT-5.5 | 66.6% | $0.105 | 37.1s | 37%
78 | ByteDance Seed 2.0 Lite | 81.0% | $0.014 | 3.6m | 31%
79 | GPT-4o, Aug. 6th (temp=1) | 58.8% | $0.029 | 9.1s | 20%
80 | Qwen 3 32B | 70.0% | $0.0042 | 2.9m | 31%
81 | Qwen 3.5 397B A17B | 72.1% | $0.045 | 2.6m | 35%
82 | GPT-4.1 | 57.6% | $0.024 | 8.5s | 19%
83 | GPT-4.1 Mini | 52.4% | $0.0046 | 26.0s | 19%
84 | ByteDance Seed 2.0 Mini | 88.8% | $0.0062 | 8.7m | 70%
85 | Mistral Small 4 | 56.1% | $0.0030 | 7.4s | 11%
86 | GPT-5 Nano | 63.4% | $0.0080 | 2.2m | 27%
87 | Qwen 3.5 27B | 71.5% | $0.029 | 3.7m | 39%
88 | Hermes 3 70B | 59.5% | $0.0045 | 42.7s | 14%
89 | Gemma 3 27B | 53.1% | $0.0016 | 27.5s | 15%
90 | GPT-5.4 Mini | 43.1% | $0.0087 | 7.9s | 22%
91 | GPT-5.1 | 51.0% | $0.025 | 17.7s | 17%
92 | Rocinante 12B | 57.7% | $0.0033 | 32.1s | 7%
93 | Mistral Small 4 (Reasoning) | 49.2% | $0.0051 | 24.7s | 15%
94 | DeepSeek V4 Flash (Reasoning) | 65.0% | $0.0041 | 3.5m | 28%
95 | DeepSeek V3 (2024-12-26) | 49.0% | $0.0045 | 40.2s | 14%
96 | Cohere Command R+ (Aug. 2024) | 61.3% | $0.057 | 56.5s | 14%
97 | Skyfall 36B V2 | 48.5% | $0.0054 | 26.5s | 8%
98 | ByteDance Seed 1.6 Flash | 45.8% | $0.0019 | 38.2s | 12%
99 | Z.AI GLM 5.1 | 85.1% | $0.081 | 7.3m | 59%
100 | Writer: Palmyra X5 | 41.8% | $0.016 | 18.4s | 14%
101 | MiniMax M2.7 | 42.5% | $0.0063 | 21.4s | 10%
102 | MoonshotAI: Kimi K2.6 | 75.1% | $0.082 | 5.0m | 42%
103 | Qwen3 235B A22B Instruct 2507 | 41.7% | $0.0017 | 1.0m | 15%
104 | Claude Opus 4.8 (Reasoning) | 85.2% | $0.381 | 1.9m | 64%
105 | Qwen 2.5 72B | 47.0% | $0.0056 | 1.2m | 7%
106 | GPT-5.4 (Reasoning) | 67.5% | $0.175 | 2.6m | 38%
107 | Z.AI GLM 4.7 Flash | 44.8% | $0.0050 | 2.8m | 25%
108 | MoonshotAI: Kimi K2.5 | 70.6% | $0.031 | 5.4m | 32%
109 | Grok 4.20 | 30.5% | $0.020 | 7.5s | 14%
110 | Qwen3.7 Max | 60.2% | $0.045 | 3.5m | 26%
111 | Ministral 3 8B | 34.5% | $0.0022 | 17.0s | 7%
112 | GPT-5.4 Nano | 21.8% | $0.0026 | 11.2s | 18%
113 | MiniMax M2.5 | 27.1% | $0.0036 | 45.0s | 18%
114 | GPT-4.1 Nano | 22.8% | $0.0009 | 6.9s | 12%
115 | Ministral 8B | 29.3% | $0.0016 | 15.6s | 7%
116 | Gemini 2.5 Flash Lite (Reasoning) | 37.6% | $0.0073 | 1.0m | 7%
117 | Hermes 3 405B | 38.0% | $0.015 | 1.0m | 7%
118 | Claude 3 Haiku | 25.1% | $0.0056 | 11.6s | 9%
119 | GPT-5.5 (Reasoning) | 68.3% | $0.251 | 2.0m | 35%
120 | Grok 4.20 (Beta) | 27.1% | $0.019 | 5.5s | 5%
121 | Claude Opus 4.6 (Reasoning) | 78.4% | $0.356 | 2.6m | 54%
122 | Z.AI GLM 4.7 | 47.3% | $0.025 | 3.7m | 22%
123 | WizardLM 2 8x22b | 44.7% | $0.0098 | 2.3m | 6%
124 | Z.AI GLM 4.6 | 45.0% | $0.018 | 3.0m | 15%
125 | Gemini 2.5 Flash Lite | 23.6% | $0.0023 | 4.8s | 2%
126 | Ministral 3 14B | 19.3% | $0.0031 | 25.4s | 9%
127 | Mistral Medium 3.1 | 18.1% | $0.010 | 21.2s | 7%
128 | Cydonia 24B V4.1 | 26.6% | $0.0040 | 1.0m | 3%
129 | Mistral Small 3.2 24B | 18.3% | $0.0017 | 13.8s | 3%
130 | Arcee AI: Trinity Mini | 32.6% | $0.0037 | 1.9m | 4%
131 | Gemini 3.1 Pro (Preview) | 47.8% | $0.172 | 1.5m | 16%
132 | Ministral 3B | 7.4% | $0.0007 | 9.8s | 4%
133 | Ministral 3 3B | 6.6% | $0.0007 | 9.9s | 3%
134 | Llama 3.1 8B | 12.6% | $0.0005 | 51.5s | 5%
135 | Claude Opus 4.8 (Reasoning, Low) | 73.7% | $0.383 | 1.9m | 34%
136 | Mistral NeMO | 4.4% | $0.0024 | 15.4s | 3%
137 | Claude Sonnet 4.6 (Reasoning) | 89.8% | $0.448 | 5.9m | 70%
138 | Gemma 3 4B | 3.2% | $0.0007 | 27.5s | 2%
139 | Qwen 3.5 9B | 55.6% | $0.0040 | 7.8m | 23%
140 | MiniMax M3 | 55.9% | $0.036 | 9.3m | 22%
Average score: 67.73%

Individual Scenarios

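Model | #1 | #2 | #3 | #4 | #5 | Avg
Qwen3.6 Max Preview | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.8 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.3 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 31B (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 4 26B | 100 | 100 | 100 | 100 | 100 | 100.0%
Xiaomi MIMO v2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
LFM2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 86 | 97.3%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 84 | 96.9%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 83 | 96.7%
o4 Mini High | 100 | 100 | 100 | 100 | 81 | 96.2%
Claude Opus 4 | 100 | 100 | 100 | 100 | 78 | 95.6%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 77 | 95.4%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 74 | 94.9%
GPT-OSS 120B | 100 | 100 | 100 | 100 | 73 | 94.5%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 70 | 94.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 69 | 93.8%
Gemini 2.5 Flash | 100 | 100 | 100 | 88 | 76 | 92.7%
Qwen 3.5 27B | 100 | 100 | 100 | 82 | 79 | 92.3%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 59 | 91.9%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 91 | 91 | 76 | 91.8%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 79 | 77 | 91.3%
Claude Sonnet 4.6 | 100 | 100 | 89 | 89 | 76 | 91.0%
Xiaomi MIMO v2.5 Pro | 100 | 100 | 100 | 100 | 54 | 90.8%
Qwen 3.6 27B | 100 | 100 | 100 | 100 | 53 | 90.6%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 76 | 76 | 90.5%
Mistral Small 4 | 100 | 100 | 100 | 100 | 52 | 90.4%
ByteDance Seed 1.6 | 100 | 100 | 100 | 80 | 70 | 89.9%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 47 | 89.4%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 42 | 88.4%
Aion 2.0 | 100 | 100 | 100 | 100 | 41 | 88.2%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 39 | 87.8%
Claude Opus 4.7 | 100 | 100 | 100 | 69 | 69 | 87.6%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 68 | 66 | 86.8%
GPT-5.4 (Reasoning) | 100 | 100 | 76 | 76 | 76 | 85.8%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 28 | 85.6%
Qwen 3.5 Flash | 100 | 100 | 100 | 65 | 62 | 85.3%
GPT-5.4 | 100 | 100 | 81 | 80 | 64 | 84.8%
Grok 4.3 | 100 | 100 | 87 | 84 | 51 | 84.5%
GPT-5.5 (Reasoning, Low) | 100 | 91 | 76 | 76 | 76 | 84.1%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 18 | 83.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 16 | 83.3%
DeepSeek V4 Pro | 100 | 100 | 100 | 100 | 16 | 83.1%
Claude Opus 4.6 | 100 | 100 | 73 | 70 | 70 | 82.5%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 12 | 82.4%
MiniMax M3 | 100 | 83 | 82 | 81 | 65 | 82.1%
Qwen 2.5 72B | 100 | 100 | 100 | 66 | 43 | 81.6%
Gemini 3.1 Pro (Preview) | 100 | 76 | 76 | 76 | 76 | 81.1%
Qwen 3 32B | 100 | 100 | 100 | 83 | 22 | 81.0%
DeepSeek V4 Flash | 100 | 100 | 100 | 86 | 18 | 80.9%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 88 | 16 | 80.8%
DeepSeek V4 Flash (Reasoning) | 100 | 100 | 100 | 80 | 23 | 80.6%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 87 | 14 | 80.2%
Gemma 3 27B | 100 | 100 | 100 | 51 | 48 | 79.9%
Qwen 3.5 397B A17B | 100 | 100 | 80 | 80 | 31 | 78.1%
Z.AI GLM 4.5 | 100 | 100 | 76 | 58 | 51 | 76.9%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 41 | 41 | 76.4%
Gemini 3 Flash (Preview) | 100 | 81 | 80 | 73 | 44 | 75.5%
Mistral Small 4 (Reasoning) | 100 | 100 | 67 | 66 | 39 | 74.4%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 61 | 6 | 73.4%
GPT-5.5 | 100 | 76 | 76 | 70 | 37 | 72.0%
Claude Opus 4.8 (Reasoning, Low) | 100 | 100 | 100 | 55 | 1 | 71.3%
Claude Haiku 4.5 | 100 | 100 | 100 | 30 | 21 | 70.2%
Qwen3.7 Max | 100 | 83 | 76 | 76 | 15 | 70.2%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 48 | 0 | 69.7%
Z.AI GLM 4.5 Air | 100 | 100 | 66 | 61 | 16 | 68.5%
Hermes 3 70B | 100 | 100 | 73 | 54 | 13 | 68.0%
GPT-5.1 | 100 | 78 | 75 | 72 | 7 | 66.6%
Qwen 3.5 9B | 100 | 73 | 72 | 48 | 35 | 65.6%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 83 | 42 | 1 | 65.3%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 100 | 64 | 42 | 12 | 63.8%
Gemini 3.5 Flash (Reasoning) | 100 | 83 | 76 | 55 | 1 | 63.1%
DeepSeek V3.2 | 100 | 86 | 50 | 41 | 34 | 62.3%
Mistral Large 3 | 65 | 65 | 65 | 65 | 53 | 62.3%
GPT-5 Nano | 100 | 100 | 68 | 37 | 5 | 62.0%
Llama 3.1 70B | 100 | 72 | 64 | 47 | 27 | 61.9%
Mistral Large 2 | 86 | 86 | 65 | 47 | 22 | 61.1%
Gemma 4 31B | 100 | 85 | 58 | 32 | 25 | 59.8%
Z.AI GLM 4.7 Flash | 85 | 68 | 67 | 59 | 15 | 58.7%
DeepSeek V3 (2024-12-26) | 100 | 71 | 69 | 29 | 23 | 58.4%
DeepSeek V3 (2025-03-24) | 100 | 100 | 41 | 23 | 23 | 57.4%
Mistral Large | 84 | 72 | 56 | 54 | 20 | 57.2%
GPT-4.1 | 100 | 100 | 61 | 13 | 8 | 56.4%
Ministral 3 8B | 100 | 77 | 51 | 45 | 10 | 56.3%
Z.AI GLM 4.7 | 100 | 69 | 51 | 38 | 16 | 55.1%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 43 | 23 | 9 | 55.0%
GPT-4.1 Mini | 100 | 69 | 61 | 22 | 20 | 54.6%
Grok 4.20 (Beta) | 100 | 57 | 48 | 35 | 23 | 52.5%
GPT-5.4 Nano (Reasoning, Low) | 85 | 74 | 43 | 39 | 12 | 50.6%
Qwen3 235B A22B Instruct 2507 | 74 | 69 | 53 | 35 | 21 | 50.5%
Ministral 8B | 87 | 71 | 48 | 27 | 6 | 47.9%
WizardLM 2 8x22b | 100 | 77 | 43 | 10 | 4 | 46.9%
ByteDance Seed 1.6 Flash | 100 | 73 | 44 | 6 | 6 | 45.8%
MiniMax M2.7 | 100 | 63 | 35 | 15 | 9 | 44.4%
GPT-5.4 Mini | 74 | 49 | 43 | 30 | 22 | 43.8%
GPT-4o, Aug. 6th (temp=1) | 100 | 66 | 43 | 1 | 0 | 41.9%
Gemini 2.5 Flash Lite | 100 | 100 | 3 | 0 | 0 | 40.7%
Skyfall 36B V2 | 100 | 78 | 11 | 2 | 2 | 38.6%
Arcee AI: Trinity Mini | 100 | 57 | 24 | 6 | 4 | 38.3%
Z.AI GLM 4.6 | 80 | 46 | 26 | 25 | 7 | 36.6%
Rocinante 12B | 100 | 62 | 6 | 5 | 3 | 35.1%
Writer: Palmyra X5 | 61 | 37 | 37 | 29 | 10 | 34.8%
Hermes 3 405B | 100 | 29 | 21 | 12 | 0 | 32.7%
Ministral 3 14B | 59 | 40 | 33 | 19 | 13 | 32.5%
Grok 4.20 | 51 | 35 | 28 | 26 | 16 | 31.3%
Mistral Small 3.2 24B | 100 | 27 | 9 | 3 | 2 | 28.5%
Mistral Medium 3.1 | 61 | 46 | 13 | 13 | 8 | 28.2%
MiniMax M2.5 | 39 | 38 | 33 | 17 | 11 | 27.5%
Cydonia 24B V4.1 | 100 | 16 | 14 | 6 | 1 | 27.5%
GPT-4.1 Nano | 51 | 38 | 20 | 10 | 10 | 25.8%
GPT-5.4 Nano | 36 | 26 | 24 | 23 | 16 | 25.0%
Claude 3 Haiku | 71 | 22 | 20 | 5 | 3 | 24.1%
Llama 3.1 8B | 62 | 9 | 9 | 6 | 6 | 18.6%
Ministral 3 3B | 18 | 17 | 11 | 8 | 2 | 11.1%
Ministral 3B | 28 | 8 | 8 | 4 | 2 | 10.1%
Mistral NeMO | 12 | 5 | 4 | 1 | 0 | 4.6%
Gemma 3 4B | 10 | 5 | 4 | 1 | 1 | 4.1%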
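
Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | - | - | - | - | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 88 | 97.6%
Gemma 4 26B (Reasoning) | 100 | 100 | 100 | 100 | 87 | 97.4%
Qwen 3.6 35B | 100 | 100 | 100 | 100 | 84 | 96.8%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 82 | 96.5%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 81 | 96.3%
Gemini 3.1 Flash Lite | 100 | 100 | 100 | 100 | 81 | 96.1%
GPT-5 Mini | 100 | 100 | 100 | 94 | 80 | 94.7%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 73 | 94.5%
Qwen 3.6 Flash | 100 | 100 | 100 | 88 | 83 | 94.2%
Mistral Large 3 | 100 | 100 | 91 | 91 | 90 | 94.2%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 96 | 74 | 94.0%
Z.AI GLM 4.5 Air | 100 | 100 | 100 | 93 | 74 | 93.4%
Gemma 4 26B | 100 | 100 | 91 | 91 | 84 | 93.1%
Mistral Large | 100 | 91 | 90 | 90 | 90 | 92.3%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 93 | 67 | 92.1%
Gemma 4 31B | 92 | 91 | 91 | 91 | 91 | 91.2%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 95 | 90 | 71 | 91.2%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 54 | 90.8%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 88 | 66 | 90.7%
Claude Sonnet 4.5 | 100 | 100 | 89 | 87 | 77 | 90.5%
Grok 4.20 (Beta, Reasoning) | 95 | 94 | 93 | 86 | 82 | 90.1%
Qwen3.6 Max Preview | 100 | 100 | 88 | 82 | 80 | 89.9%
GPT-5.2 | 100 | 100 | 95 | 85 | 65 | 89.1%
Qwen 3.5 122B | 100 | 100 | 93 | 79 | 69 | 88.1%
Claude Opus 4.5 | 100 | 95 | 94 | 90 | 62 | 88.1%
o4 Mini High | 100 | 93 | 91 | 79 | 72 | 87.2%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 89 | 72 | 70 | 86.2%
Gemma 4 31B (Reasoning) | 100 | 100 | 92 | 91 | 47 | 86.1%
GPT-5.4 Nano (Reasoning) | 95 | 90 | 87 | 85 | 70 | 85.5%
Claude Opus 4.6 | 91 | 90 | 84 | 80 | 79 | 84.8%
DeepSeek V3.2 | 100 | 100 | 87 | 71 | 66 | 84.7%
Qwen 3.6 27B | 100 | 100 | 94 | 94 | 36 | 84.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 88 | 86 | 47 | 84.1%
Gemini 3.5 Flash (Reasoning) | 100 | 84 | 82 | 79 | 68 | 82.6%
Z.AI GLM 5.1 | 100 | 96 | 84 | 71 | 58 | 81.8%
Claude Opus 4.7 | 100 | 95 | 84 | 79 | 47 | 81.1%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 5 | 81.1%
GPT-5.4 Mini (Reasoning, Low) | 100 | 84 | 82 | 81 | 57 | 80.8%
Z.AI GLM 5 | 100 | 100 | 92 | 64 | 48 | 80.8%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 76 | 67 | 60 | 80.6%
Rocinante 12B | 100 | 100 | 100 | 100 | 2 | 80.3%
GPT-5.5 (Reasoning, Low) | 100 | 83 | 74 | 74 | 69 | 79.9%
Claude Sonnet 4.6 (Reasoning) | 92 | 91 | 83 | 74 | 58 | 79.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 52 | 44 | 79.2%
Mistral Large 2 | 100 | 100 | 91 | 62 | 44 | 79.2%
Xiaomi MIMO v2.5 Pro | 100 | 88 | 76 | 75 | 53 | 78.7%
DeepSeek V4 Flash | 100 | 100 | 82 | 56 | 54 | 78.5%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 84 | 7 | 78.3%
Claude Sonnet 4.6 | 94 | 90 | 81 | 68 | 57 | 78.2%
GPT-5.4 | 100 | 88 | 70 | 63 | 63 | 77.0%
Xiaomi MIMO v2.5 | 100 | 81 | 78 | 70 | 54 | 76.7%
Claude Opus 4.8 (Reasoning, Low) | 92 | 85 | 81 | 73 | 51 | 76.1%
GPT-4o, Aug. 6th (temp=1) | 100 | 89 | 78 | 62 | 49 | 75.6%
Grok 4.3 | 100 | 91 | 81 | 72 | 34 | 75.4%
Z.AI GLM 5 Turbo | 100 | 96 | 76 | 62 | 42 | 75.3%
Qwen 3.5 Plus (2026-02-15) | 93 | 87 | 84 | 81 | 32 | 75.3%
Qwen 3.5 35B | 100 | 86 | 84 | 66 | 37 | 74.7%
DeepSeek V4 Pro | 100 | 90 | 83 | 48 | 48 | 73.9%
Qwen 3.5 Flash | 93 | 82 | 73 | 69 | 53 | 73.9%
Aion 2.0 | 100 | 74 | 72 | 71 | 48 | 73.0%
GPT-OSS 120B | 85 | 84 | 67 | 62 | 60 | 71.8%
Z.AI GLM 4.5 | 100 | 100 | 91 | 40 | 27 | 71.6%
o4 Mini | 100 | 92 | 67 | 53 | 42 | 70.8%
Grok 4.20 (Reasoning) | 89 | 85 | 74 | 65 | 40 | 70.7%
Claude Opus 4.8 (Reasoning) | 92 | 71 | 65 | 62 | 62 | 70.3%
Gemini 3 Flash (Preview, Reasoning) | 95 | 94 | 94 | 36 | 30 | 69.7%
Claude Opus 4.7 (Reasoning) | 100 | 92 | 86 | 39 | 30 | 69.4%
Grok 4.3 (Reasoning) | 100 | 100 | 80 | 56 | 11 | 69.2%
GPT-5.4 Nano (Reasoning, Low) | 100 | 78 | 69 | 60 | 32 | 67.8%
Qwen 3.5 397B A17B | 100 | 83 | 71 | 64 | 12 | 66.1%
GPT-4o, May 13th (temp=0) | 100 | 86 | 80 | 49 | 14 | 65.9%
Claude Haiku 4.5 | 100 | 59 | 57 | 56 | 54 | 65.2%
Claude Opus 4.6 (Reasoning) | 93 | 65 | 63 | 55 | 50 | 65.1%
GPT-5 Nano | 87 | 80 | 68 | 62 | 26 | 64.7%
Inception Mercury 2 | 100 | 67 | 62 | 43 | 38 | 61.8%
GPT-5.5 | 100 | 57 | 54 | 49 | 46 | 61.2%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 91 | 91 | 12 | 11 | 61.1%
Llama 3.1 70B | 100 | 77 | 71 | 38 | 20 | 61.0%
DeepSeek-V2 Chat | 100 | 100 | 82 | 14 | 3 | 59.9%
Qwen 3 32B | 100 | 88 | 45 | 32 | 30 | 58.9%
GPT-4.1 | 100 | 73 | 68 | 42 | 10 | 58.7%
Skyfall 36B V2 | 100 | 100 | 41 | 38 | 13 | 58.4%
MoonshotAI: Kimi K2.6 | 87 | 58 | 47 | 42 | 41 | 55.2%
Gemini 2.5 Pro | 91 | 76 | 50 | 49 | 1 | 53.5%
Z.AI GLM 4.6 | 100 | 76 | 47 | 27 | 17 | 53.3%
ByteDance Seed 1.6 | 83 | 66 | 55 | 49 | 11 | 53.0%
MoonshotAI: Kimi K2.5 | 68 | 64 | 46 | 43 | 38 | 51.8%
Hermes 3 70B | 100 | 100 | 48 | 6 | 1 | 51.1%
Qwen 3.5 27B | 88 | 51 | 48 | 36 | 31 | 50.8%
GPT-4.1 Mini | 89 | 86 | 42 | 21 | 12 | 50.3%
Qwen3.7 Max | 100 | 71 | 48 | 26 | 6 | 50.3%
DeepSeek V4 Flash (Reasoning) | 72 | 64 | 53 | 34 | 24 | 49.4%
GPT-5.4 (Reasoning) | 86 | 60 | 36 | 33 | 29 | 49.1%
Writer: Palmyra X5 | 92 | 85 | 62 | 4 | 2 | 48.8%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 16 | 16 | 0 | 46.3%
GPT-5.5 (Reasoning) | 59 | 49 | 45 | 42 | 35 | 46.1%
Qwen 3.5 9B | 100 | 66 | 26 | 25 | 10 | 45.7%
Hermes 3 405B | 100 | 69 | 32 | 12 | 4 | 43.3%
WizardLM 2 8x22b | 100 | 83 | 13 | 12 | 5 | 42.5%
GPT-5.4 Mini | 88 | 40 | 38 | 27 | 18 | 42.4%
DeepSeek V3.1 | 100 | 81 | 13 | 10 | 7 | 42.2%
MiniMax M2.7 | 89 | 68 | 25 | 15 | 7 | 40.6%
DeepSeek V3 (2024-12-26) | 100 | 59 | 24 | 11 | 4 | 39.6%
Z.AI GLM 4.7 | 69 | 47 | 43 | 31 | 7 | 39.5%
GPT-5.1 | 93 | 41 | 21 | 13 | 10 | 35.4%
Qwen3 235B A22B Instruct 2507 | 100 | 48 | 10 | 4 | 3 | 33.0%
Z.AI GLM 4.7 Flash | 65 | 54 | 22 | 11 | 2 | 31.0%
MiniMax M3 | 39 | 39 | 38 | 18 | 14 | 29.8%
Grok 4.20 | 75 | 58 | 9 | 4 | 2 | 29.7%
Arcee AI: Trinity Mini | 100 | 23 | 12 | 0 | 0 | 27.0%
MiniMax M2.5 | 52 | 38 | 17 | 16 | 11 | 26.6%
Gemma 3 27B | 39 | 34 | 32 | 25 | 3 | 26.3%
Claude 3 Haiku | 81 | 25 | 16 | 5 | 4 | 26.0%
Cydonia 24B V4.1 | 100 | 17 | 11 | 1 | 0 | 25.8%
Mistral Small 4 (Reasoning) | 41 | 34 | 21 | 15 | 10 | 24.0%
Mistral Small 4 | 41 | 26 | 21 | 13 | 8 | 21.7%
Gemini 2.5 Flash Lite (Reasoning) | 57 | 26 | 8 | 5 | 4 | 20.2%
GPT-4.1 Nano | 36 | 27 | 13 | 12 | 11 | 19.8%
GPT-5.4 Nano | 29 | 18 | 16 | 15 | 14 | 18.7%
Gemini 3.1 Pro (Preview) | 32 | 28 | 10 | 1 | 1 | 14.5%
Ministral 3 8B | 18 | 16 | 15 | 8 | 5 | 12.7%
Qwen 2.5 72B | 19 | 16 | 13 | 12 | 2 | 12.4%
Ministral 8B | 21 | 10 | 10 | 8 | 5 | 10.6%
Mistral Small 3.2 24B | 23 | 6 | 5 | 4 | 2 | 8.1%
Mistral Medium 3.1 | 22 | 9 | 4 | 4 | 2 | 8.0%
Llama 3.1 8B | 15 | 8 | 5 | 3 | 2 | 6.6%
Gemini 2.5 Flash Lite | 10 | 9 | 7 | 6 | 1 | 6.5%
Ministral 3 14B | 15 | 11 | 4 | 1 | 0 | 6.0%
Ministral 3B | 10 | 6 | 4 | 3 | 1 | 4.8%
Mistral NeMO | 11 | 4 | 3 | 2 | 1 | 4.2%
Gemma 3 4B | 6 | 2 | 2 | 1 | 1 | 2.2%
Ministral 3 3B | 5 | 3 | 2 | 1 | 1 | 2.1%
Grok 4.20 (Beta) | 7 | 1 | 1 | 1 | 0 | 1.8%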