Relationship recall

Test: Relationship tree

Avg. Score: 19.8%
Scenarios: 2

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | GPT-5.4 (Reasoning) | 96.3% | $0.175 | 2.6m | 87%
2 | GPT-5.5 (Reasoning) | 91.5% | $0.251 | 2.0m | 75%
3 | Claude Opus 4.6 (Reasoning) | 88.7% | $0.356 | 2.6m | 69%
4 | GPT-5.5 (Reasoning, Low) | 65.0% | $0.097 | 42.6s | 20%
5 | GPT-5.4 (Reasoning, Low) | 42.3% | $0.053 | 35.5s | 26%
6 | GPT-5.5 | 62.8% | $0.105 | 37.1s | 16%
7 | Claude Opus 4.6 | 64.2% | $0.154 | 31.4s | 21%
8 | MoonshotAI: Kimi K2.6 | 72.6% | $0.082 | 5.0m | 39%
9 | Gemini 3 Flash (Preview) | 28.3% | $0.0087 | 7.1s | 19%
10 | Gemini 2.5 Pro | 58.0% | $0.099 | 58.2s | 14%
11 | GPT-5.4 | 25.4% | $0.033 | 17.4s | 17%
12 | Claude Opus 4.7 (Reasoning) | 61.7% | $0.184 | 25.0s | 15%
13 | DeepSeek V4 Flash (Reasoning) | 49.8% | $0.0041 | 3.5m | 19%
14 | GPT-5.1 | 29.3% | $0.025 | 17.7s | 11%
15 | Gemini 3 Flash (Preview, Reasoning) | 41.5% | $0.030 | 40.3s | 4%
16 | Claude Sonnet 4.6 | 36.4% | $0.081 | 23.3s | 15%
17 | GPT-5.4 Nano (Reasoning) | 19.5% | $0.0088 | 41.7s | 17%
18 | Gemini 3.1 Pro (Preview) | 65.3% | $0.172 | 1.5m | 14%
19 | Gemini 3.5 Flash (Reasoning) | 49.7% | $0.113 | 49.4s | 10%
20 | Claude Opus 4.7 | 56.1% | $0.182 | 24.3s | 14%
21 | GPT-5.2 | 35.0% | $0.062 | 50.7s | 13%
22 | Xiaomi MIMO v2.5 | 26.5% | $0.0027 | 1.1m | 10%
23 | MoonshotAI: Kimi K2.5 | 60.5% | $0.031 | 5.4m | 24%
24 | Z.AI GLM 5 Turbo | 47.0% | $0.044 | 3.6m | 21%
25 | Grok 4.20 (Beta, Reasoning) | 18.7% | $0.032 | 30.1s | 15%
26 | Claude Opus 4.8 (Reasoning) | 76.4% | $0.381 | 1.9m | 43%
27 | Z.AI GLM 4.6 | 33.1% | $0.018 | 3.0m | 22%
28 | Nemotron 3 Super | 0.0% | $0.0000 | 3.2m |
29 | Inception Mercury 2 | 14.7% | $0.0096 | 12.8s | 8%
30 | ByteDance Seed 1.6 | 18.7% | $0.014 | 1.2m | 13%
31 | Claude Sonnet 4.5 | 18.4% | $0.077 | 18.6s | 16%
32 | Gemini 3.5 Flash (Reasoning, Minimal) | 20.7% | $0.024 | 6.3s | 2%
33 | Grok 4.20 (Reasoning) | 16.9% | $0.027 | 40.0s | 11%
34 | Qwen3.7 Max | 50.8% | $0.045 | 3.5m | 10%
35 | Claude Opus 4.5 | 30.3% | $0.131 | 19.8s | 13%
36 | DeepSeek V4 Pro (Reasoning) | 31.7% | $0.020 | 2.3m | 9%
37 | GPT-5 | 51.8% | $0.092 | 2.3m | 4%
38 | MiniMax M2.5 | 11.6% | $0.0036 | 45.0s | 7%
39 | Gemini 2.5 Flash | 9.3% | $0.0075 | 6.1s | 4%
40 | MiniMax M2.7 | 11.0% | $0.0063 | 21.4s | 4%
41 | Mistral Large 3 | 8.2% | $0.0091 | 17.5s | 6%
42 | Mistral Medium 3.1 | 9.4% | $0.010 | 21.2s | 6%
43 | Gemini 3.1 Flash Lite (Reasoning) | 6.0% | $0.0033 | 3.2s | 5%
44 | Qwen 3.5 397B A17B | 28.3% | $0.045 | 2.6m | 16%
45 | Grok 4.3 | 8.4% | $0.0094 | 7.8s | 4%
46 | Gemini 3.1 Flash Lite (Preview) | 5.8% | $0.0046 | 3.2s | 4%
47 | Ministral 3 8B | 7.6% | $0.0022 | 17.0s | 4%
48 | Ministral 8B | 6.4% | $0.0016 | 15.6s | 4%
49 | o4 Mini High | 21.0% | $0.059 | 1.3m | 11%
50 | DeepSeek V3.2 | 8.7% | $0.0046 | 33.9s | 5%
51 | DeepSeek V4 Pro | 7.4% | $0.0083 | 17.3s | 4%
52 | Gemma 4 26B | 6.5% | $0.0018 | 27.2s | 5%
53 | Gemini 3.1 Flash Lite | 5.9% | $0.0031 | 3.3s | 2%
54 | Qwen 3.6 Flash | 11.4% | $0.015 | 51.7s | 7%
55 | GPT-5.4 Mini (Reasoning, Low) | 6.8% | $0.014 | 14.4s | 5%
56 | GPT-5.4 Mini | 6.1% | $0.0087 | 7.9s | 3%
57 | DeepSeek V4 Flash | 11.8% | $0.0020 | 51.4s | 4%
58 | Ministral 3 14B | 6.9% | $0.0031 | 25.4s | 4%
59 | GPT-5 Mini | 20.3% | $0.019 | 2.0m | 9%
60 | o4 Mini | 12.0% | $0.037 | 45.7s | 8%
61 | Z.AI GLM 4.5 | 8.0% | $0.0075 | 48.1s | 6%
62 | Grok 4.20 | 5.9% | $0.020 | 7.5s | 4%
63 | Mistral Large 2 | 9.2% | $0.037 | 20.5s | 6%
64 | GPT-4.1 | 7.0% | $0.024 | 8.5s | 4%
65 | MiniMax M3 | 59.9% | $0.036 | 9.3m | 44%
66 | Xiaomi MIMO v2.5 Pro | 20.6% | $0.0076 | 1.9m | 5%
67 | Ministral 3 3B | 3.5% | $0.0007 | 9.9s | 3%
68 | Z.AI GLM 5.1 | 68.0% | $0.081 | 7.3m | 26%
69 | GPT-5.4 Mini (Reasoning) | 38.2% | $0.117 | 2.8m | 18%
70 | Ministral 3B | 4.1% | $0.0007 | 9.8s | 2%
71 | Qwen 3.6 35B | 10.4% | $0.014 | 1.1m | 8%
72 | GPT-5.4 Nano | 3.1% | $0.0026 | 11.2s | 2%
73 | GPT-4o, Aug. 6th (temp=0) | 9.3% | $0.032 | 8.4s | 1%
74 | Mistral Small 3.2 24B | 3.5% | $0.0017 | 13.8s | 1%
75 | GPT-4.1 Mini | 3.5% | $0.0046 | 26.0s | 3%
76 | Grok 4.20 (Beta) | 3.3% | $0.019 | 5.5s | 3%
77 | GPT-5.4 Nano (Reasoning, Low) | 2.4% | $0.0038 | 13.8s | 2%
78 | Writer: Palmyra X5 | 3.9% | $0.016 | 18.4s | 3%
79 | Claude Haiku 4.5 | 4.8% | $0.024 | 8.2s | 2%
80 | DeepSeek V3.1 | 9.1% | $0.0055 | 52.0s | 1%
81 | Mistral NeMO | 2.2% | $0.0024 | 15.4s | 0%
82 | GPT-4.1 Nano | 0.2% | $0.0009 | 6.9s | 0%
83 | Gemini 2.5 Flash Lite | 0.2% | $0.0023 | 4.8s | 0%
84 | Z.AI GLM 4.7 | 31.8% | $0.025 | 3.7m | 9%
85 | DeepSeek-V2 Chat | 4.3% | $0.0049 | 34.0s | 1%
86 | Mistral Small 4 | 0.4% | $0.0030 | 7.4s | 0%
87 | LFM2 24B | 0.0% | $0.0004 | 7.8s | 0%
88 | Claude Sonnet 4 | 12.5% | $0.072 | 17.5s | 4%
89 | DeepSeek V3 (2024-12-26) | 3.3% | $0.0045 | 40.2s | 3%
90 | Grok 4.3 (Reasoning) | 1.5% | $0.013 | 13.9s | 1%
91 | Mistral Small 4 (Reasoning) | 2.0% | $0.0051 | 24.7s | 1%
92 | Qwen 3.5 35B | 13.1% | $0.015 | 2.0m | 8%
93 | Claude 3 Haiku | 0.3% | $0.0056 | 11.6s | 0%
94 | Z.AI GLM 5 | 40.4% | $0.037 | 4.6m | 11%
95 | Qwen 3.5 27B | 25.1% | $0.029 | 3.7m | 15%
96 | GPT-OSS 120B | 13.4% | $0.0040 | 2.2m | 7%
97 | Llama 3.1 70B | 3.1% | $0.0064 | 34.0s | 1%
98 | Mistral Large | 6.9% | $0.036 | 41.1s | 4%
99 | Gemini 2.5 Flash Lite (Reasoning) | 5.6% | $0.0073 | 1.0m | 2%
100 | Gemini 2.5 Flash (Reasoning) | 6.1% | $0.023 | 37.3s | 1%
101 | Gemma 3 4B | 0.1% | $0.0007 | 27.5s | 0%
102 | Skyfall 36B V2 | 0.9% | $0.0054 | 26.5s | 0%
103 | Gemma 3 27B | 0.2% | $0.0016 | 27.5s | 0%
104 | ByteDance Seed 1.6 Flash | 1.6% | $0.0019 | 38.2s | 0%
105 | GPT-4o, Aug. 6th (temp=1) | 1.6% | $0.029 | 9.1s | 1%
106 | Rocinante 12B | 0.9% | $0.0033 | 32.1s | 0%
107 | Gemma 3 12B | 0.0% | $0.0014 | 30.4s | 0%
108 | Qwen3 235B A22B Instruct 2507 | 3.1% | $0.0017 | 1.0m | 2%
109 | Hermes 3 70B | 1.2% | $0.0045 | 42.7s | 0%
110 | Cydonia 24B V4.1 | 2.6% | $0.0040 | 1.0m | 1%
111 | Aion 2.0 | 17.7% | $0.024 | 3.0m | 10%
112 | Llama 3.1 8B | 0.4% | $0.0005 | 51.5s | 0%
113 | DeepSeek V3 (2025-03-24) | 0.8% | $0.0037 | 50.3s | 0%
114 | Qwen 3.5 Plus (2026-02-15) | 24.9% | $0.021 | 3.9m | 10%
115 | GPT-4o, May 13th (temp=0) | 9.1% | $0.088 | 5.5s | 1%
116 | Qwen3.6 Max Preview | 34.7% | $0.084 | 3.7m | 10%
117 | Qwen 2.5 72B | 1.7% | $0.0056 | 1.2m | 1%
118 | Gemma 4 31B | 6.8% | $0.0028 | 1.9m | 3%
119 | Hermes 3 405B | 1.8% | $0.015 | 1.0m | 1%
120 | Qwen 3.6 27B | 17.7% | $0.034 | 2.7m | 5%
121 | Claude Opus 4.8 (Reasoning, Low) | 61.4% | $0.383 | 1.9m | 25%
122 | Qwen 3.5 Flash | 16.3% | $0.0040 | 3.8m | 10%
123 | GPT-4o, May 13th (temp=1) | 2.4% | $0.086 | 5.4s | 1%
124 | Qwen 3.5 Plus (2026-04-20) | 11.8% | $0.023 | 3.1m | 8%
125 | WizardLM 2 8x22b | 6.2% | $0.0098 | 2.3m | 2%
126 | ByteDance Seed 2.0 Lite | 20.3% | $0.014 | 3.6m | 1%
127 | Arcee AI: Trinity Mini | 0.5% | $0.0037 | 1.9m | 0%
128 | Gemma 4 26B (Reasoning) | 8.1% | $0.0041 | 3.1m | 3%
129 | Z.AI GLM 4.7 Flash | 3.9% | $0.0050 | 2.8m | 4%
130 | Cohere Command R+ (Aug. 2024) | 0.4% | $0.057 | 56.5s | 0%
131 | GPT-5 Nano | 1.9% | $0.0080 | 2.2m | 1%
132 | Qwen 3 32B | 2.0% | $0.0042 | 2.9m | 2%
133 | Z.AI GLM 4.5 Air | 4.8% | $0.0071 | 3.2m | 2%
134 | Nemotron 3 Nano | 0.4% | $0.0039 | 2.8m | 0%
135 | Qwen 3.5 122B | 12.1% | $0.035 | 4.0m | 6%
136 | Gemma 4 31B (Reasoning) | 19.5% | $0.0043 | 5.8m | 2%
137 | Claude Sonnet 4.6 (Reasoning) | 70.7% | $0.448 | 5.9m | 32%
138 | ByteDance Seed 2.0 Mini | 12.7% | $0.0062 | 8.7m | 11%
139 | Qwen 3.5 9B | 4.6% | $0.0040 | 7.8m | 1%
140 | Claude Opus 4 | 13.6% | $0.368 | 2.4m | 7%
Average score: 19.81%

Individual Scenarios

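Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.8 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 91 | 98.2%
Claude Opus 4.7 | 100 | 100 | 100 | 83 | 83 | 93.1%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 55 | 91.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 55 | 91.0%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 83 | 68 | 90.1%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 49 | 89.8%
Qwen3.7 Max | 100 | 100 | 100 | 68 | 27 | 78.9%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 35 | 27 | 72.3%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 55 | 1 | 71.3%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 55 | 1 | 71.3%
Z.AI GLM 5 Turbo | 100 | 68 | 68 | 49 | 44 | 65.7%
Claude Opus 4.8 (Reasoning, Low) | 100 | 100 | 68 | 55 | 1 | 64.8%
MiniMax M3 | 83 | 68 | 68 | 55 | 44 | 63.3%
Qwen3.6 Max Preview | 100 | 55 | 55 | 55 | 35 | 59.8%
Z.AI GLM 5 | 100 | 100 | 55 | 27 | 16 | 59.5%
GPT-5.4 (Reasoning, Low) | 75 | 75 | 49 | 49 | 44 | 58.4%
Claude Sonnet 4.6 | 100 | 55 | 44 | 44 | 44 | 57.3%
GPT-5.4 Mini (Reasoning) | 91 | 68 | 44 | 44 | 21 | 53.4%
GPT-5.2 | 83 | 75 | 44 | 35 | 24 | 52.0%
Z.AI GLM 4.6 | 55 | 55 | 55 | 44 | 39 | 49.5%
GPT-5.1 | 83 | 61 | 44 | 31 | 24 | 48.4%
Z.AI GLM 4.7 | 91 | 55 | 55 | 35 | 6 | 48.2%
Gemini 3 Flash (Preview) | 55 | 55 | 44 | 39 | 39 | 46.3%
Claude Opus 4.5 | 100 | 44 | 27 | 27 | 27 | 45.0%
DeepSeek V4 Flash (Reasoning) | 100 | 44 | 27 | 24 | 21 | 43.1%
DeepSeek V4 Pro (Reasoning) | 100 | 55 | 27 | 16 | 16 | 42.6%
Xiaomi MIMO v2.5 | 100 | 39 | 27 | 27 | 16 | 41.7%
Qwen 3.5 Plus (2026-02-15) | 68 | 55 | 44 | 21 | 21 | 41.6%
Qwen 3.5 397B A17B | 55 | 44 | 44 | 44 | 21 | 41.4%
GPT-5.4 | 49 | 39 | 39 | 39 | 39 | 41.1%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 55 | 27 | 11 | 8 | 40.3%
ByteDance Seed 2.0 Lite | 100 | 55 | 27 | 7 | 4 | 38.5%
o4 Mini High | 55 | 44 | 31 | 27 | 27 | 36.7%
Gemma 4 31B (Reasoning) | 55 | 55 | 44 | 27 | 2 | 36.6%
Qwen 3.5 27B | 75 | 35 | 27 | 27 | 0 | 32.7%
GPT-5 Mini | 61 | 44 | 27 | 16 | 16 | 32.6%
ByteDance Seed 1.6 | 44 | 39 | 31 | 18 | 18 | 29.9%
Qwen 3.6 27B | 83 | 31 | 18 | 13 | 5 | 29.9%
Grok 4.20 (Beta, Reasoning) | 39 | 27 | 27 | 27 | 21 | 28.2%
Claude Sonnet 4.5 | 27 | 27 | 27 | 27 | 27 | 27.0%
Grok 4.20 (Reasoning) | 44 | 24 | 24 | 24 | 16 | 26.1%
Inception Mercury 2 | 39 | 39 | 24 | 13 | 13 | 25.7%
Qwen 3.5 Flash | 44 | 31 | 27 | 13 | 13 | 25.7%
Xiaomi MIMO v2.5 Pro | 49 | 35 | 24 | 7 | 3 | 23.5%
GPT-OSS 120B | 31 | 31 | 21 | 21 | 10 | 22.5%
DeepSeek V4 Flash | 39 | 24 | 24 | 16 | 7 | 21.8%
Claude Sonnet 4 | 27 | 27 | 27 | 27 | 1 | 21.7%
GPT-5.4 Nano (Reasoning) | 35 | 24 | 24 | 16 | 8 | 21.2%
Claude Opus 4 | 27 | 27 | 27 | 24 | 1 | 21.1%
Qwen 3.5 35B | 27 | 27 | 24 | 21 | 6 | 20.8%
MiniMax M2.7 | 35 | 27 | 16 | 16 | 7 | 19.9%
Qwen 3.5 122B | 31 | 27 | 21 | 16 | 6 | 19.9%
Qwen 3.6 Flash | 31 | 27 | 21 | 11 | 10 | 19.9%
MiniMax M2.5 | 27 | 24 | 21 | 16 | 11 | 19.7%
ByteDance Seed 2.0 Mini | 27 | 24 | 16 | 16 | 16 | 19.5%
Qwen 3.6 35B | 27 | 21 | 16 | 16 | 13 | 18.4%
GPT-4o, Aug. 6th (temp=0) | 49 | 21 | 13 | 5 | 0 | 17.6%
DeepSeek V3.1 | 27 | 24 | 24 | 13 | 0 | 17.6%
o4 Mini | 27 | 21 | 16 | 16 | 8 | 17.4%
GPT-4o, May 13th (temp=0) | 24 | 21 | 21 | 21 | 1 | 17.4%
Qwen 3.5 Plus (2026-04-20) | 31 | 21 | 16 | 13 | 6 | 17.2%
Gemini 2.5 Flash | 27 | 24 | 18 | 10 | 7 | 17.0%
Mistral Large 2 | 24 | 21 | 13 | 13 | 13 | 16.9%
Aion 2.0 | 49 | 35 | 0 | 0 | 0 | 16.8%
Grok 4.3 | 31 | 18 | 13 | 8 | 7 | 15.4%
Gemma 4 26B (Reasoning) | 27 | 27 | 11 | 7 | 2 | 14.9%
Mistral Large 3 | 21 | 13 | 13 | 13 | 13 | 14.8%
DeepSeek V3.2 | 24 | 16 | 13 | 10 | 6 | 13.6%
Ministral 3 8B | 21 | 16 | 16 | 6 | 5 | 12.4%
Mistral Large | 24 | 13 | 8 | 8 | 7 | 12.1%
Ministral 3 14B | 16 | 16 | 11 | 11 | 6 | 11.9%
DeepSeek V4 Pro | 27 | 13 | 8 | 8 | 2 | 11.8%
Gemma 4 31B | 21 | 16 | 11 | 6 | 4 | 11.4%
GPT-4.1 | 24 | 11 | 8 | 8 | 5 | 11.2%
Gemma 4 26B | 13 | 11 | 11 | 10 | 10 | 11.1%
GPT-5.4 Mini | 13 | 13 | 13 | 10 | 6 | 11.1%
Z.AI GLM 4.5 | 18 | 16 | 8 | 7 | 7 | 11.1%
Gemini 3.1 Flash Lite (Reasoning) | 11 | 11 | 11 | 11 | 8 | 10.8%
Mistral Medium 3.1 | 24 | 18 | 7 | 3 | 1 | 10.6%
Gemini 3.1 Flash Lite | 18 | 11 | 11 | 8 | 4 | 10.6%
GPT-5.4 Mini (Reasoning, Low) | 18 | 13 | 11 | 8 | 1 | 10.3%
Gemini 3.1 Flash Lite (Preview) | 16 | 11 | 8 | 8 | 8 | 10.3%
WizardLM 2 8x22b | 24 | 24 | 2 | 1 | 1 | 10.3%
Grok 4.20 | 21 | 11 | 7 | 6 | 5 | 9.9%
Ministral 8B | 18 | 11 | 7 | 6 | 4 | 9.2%
Qwen 3.5 9B | 16 | 10 | 8 | 7 | 1 | 8.4%
Claude Haiku 4.5 | 13 | 13 | 8 | 2 | 2 | 8.0%
DeepSeek-V2 Chat | 24 | 7 | 6 | 2 | 1 | 7.9%
Z.AI GLM 4.5 Air | 24 | 10 | 3 | 2 | 0 | 7.8%
Ministral 3B | 16 | 8 | 6 | 6 | 0 | 7.1%
Gemini 2.5 Flash (Reasoning) | 27 | 2 | 2 | 1 | 1 | 6.5%
Writer: Palmyra X5 | 11 | 8 | 6 | 6 | 1 | 6.4%
Grok 4.20 (Beta) | 10 | 8 | 5 | 5 | 5 | 6.4%
Mistral Small 3.2 24B | 10 | 8 | 7 | 6 | 1 | 6.3%
GPT-4.1 Mini | 8 | 7 | 6 | 5 | 5 | 6.0%
Z.AI GLM 4.7 Flash | 10 | 6 | 6 | 5 | 4 | 5.9%
Llama 3.1 70B | 13 | 8 | 4 | 2 | 1 | 5.7%
DeepSeek V3 (2024-12-26) | 7 | 7 | 6 | 6 | 3 | 5.6%
Gemini 2.5 Flash Lite (Reasoning) | 16 | 4 | 3 | 2 | 2 | 5.5%
Ministral 3 3B | 8 | 7 | 7 | 4 | 1 | 5.4%
Qwen3 235B A22B Instruct 2507 | 8 | 8 | 6 | 2 | 1 | 5.1%
GPT-5.4 Nano | 11 | 4 | 3 | 2 | 1 | 4.4%
Mistral NeMO | 13 | 6 | 1 | 0 | 0 | 4.1%
GPT-4o, May 13th (temp=1) | 8 | 6 | 4 | 1 | 1 | 4.1%
Cydonia 24B V4.1 | 13 | 2 | 2 | 1 | 0 | 3.7%
Mistral Small 4 (Reasoning) | 6 | 4 | 4 | 3 | 0 | 3.3%
GPT-5 Nano | 6 | 5 | 4 | 2 | 0 | 3.3%
Qwen 3 32B | 6 | 4 | 3 | 2 | 1 | 3.2%
Qwen 2.5 72B | 6 | 3 | 2 | 2 | 2 | 3.1%
GPT-5.4 Nano (Reasoning, Low) | 10 | 2 | 1 | 1 | 0 | 3.0%
Hermes 3 405B | 7 | 2 | 2 | 1 | 0 | 2.5%
GPT-4o, Aug. 6th (temp=1) | 6 | 3 | 2 | 0 | 0 | 2.4%
Hermes 3 70B | 8 | 1 | 1 | 1 | 0 | 2.1%
Grok 4.3 (Reasoning) | 3 | 3 | 2 | 2 | 0 | 2.1%
Skyfall 36B V2 | 6 | 2 | 1 | 0 | 0 | 1.8%
Rocinante 12B | 3 | 2 | 2 | 1 | 0 | 1.8%
ByteDance Seed 1.6 Flash | 7 | 1 | 0 | 0 | 0 | 1.6%
DeepSeek V3 (2025-03-24) | 3 | 3 | 2 | 0 | 0 | 1.6%
Arcee AI: Trinity Mini | 5 | 0 | 0 | 0 | 0 | 1.0%
Cohere Command R+ (Aug. 2024) | 2 | 2 | 0 | 0 | 0 | 0.8%
Nemotron 3 Nano | 1 | 1 | 1 | 0 | 0 | 0.7%
Llama 3.1 8B | 2 | 0 | 0 | 0 | 0 | 0.6%
Claude 3 Haiku | 1 | 1 | 0 | 0 | 0 | 0.6%
Mistral Small 4 | 2 | 0 | 0 | 0 | 0 | 0.5%
Gemma 3 27B | 1 | 0 | 0 | 0 | 0 | 0.3%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0.3%
Gemini 2.5 Flash Lite | 1 | 0 | 0 | 0 | 0 | 0.2%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0.1%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%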
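Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5.4 (Reasoning) | 100 | 100 | 96 | 86 | 80 | 92.6%
GPT-5.5 (Reasoning) | 86 | 86 | 86 | 86 | 69 | 82.9%
Claude Opus 4.6 (Reasoning) | 80 | 80 | 80 | 77 | 69 | 77.3%
Claude Opus 4.8 (Reasoning, Low) | 86 | 59 | 54 | 50 | 42 | 58.1%
MiniMax M3 | 80 | 69 | 69 | 33 | 32 | 56.5%
DeepSeek V4 Flash (Reasoning) | 93 | 77 | 48 | 44 | 21 | 56.5%
MoonshotAI: Kimi K2.6 | 90 | 56 | 52 | 40 | 38 | 55.1%
Claude Opus 4.8 (Reasoning) | 93 | 50 | 42 | 42 | 38 | 52.8%
MoonshotAI: Kimi K2.5 | 72 | 61 | 59 | 32 | 20 | 48.6%
Z.AI GLM 5.1 | 59 | 46 | 42 | 40 | 40 | 45.1%
Claude Sonnet 4.6 (Reasoning) | 74 | 44 | 32 | 29 | 29 | 41.4%
Claude Opus 4.6 | 59 | 40 | 38 | 36 | 20 | 38.6%
Gemini 3.1 Pro (Preview) | 80 | 64 | 4 | 3 | 2 | 30.6%
GPT-5.5 (Reasoning, Low) | 69 | 32 | 29 | 21 | 0 | 30.1%
Z.AI GLM 5 Turbo | 61 | 48 | 26 | 7 | 0 | 28.4%
Gemini 3.5 Flash (Reasoning) | 44 | 27 | 27 | 24 | 19 | 28.2%
GPT-5.4 (Reasoning, Low) | 44 | 38 | 20 | 15 | 14 | 26.1%
GPT-5.5 | 35 | 35 | 32 | 16 | 11 | 25.6%
Gemini 2.5 Pro | 56 | 26 | 26 | 16 | 1 | 25.0%
Claude Opus 4.7 (Reasoning) | 50 | 32 | 22 | 9 | 4 | 23.3%
GPT-5.4 Mini (Reasoning) | 40 | 32 | 24 | 18 | 2 | 23.0%
Qwen3.7 Max | 61 | 26 | 26 | 0 | 0 | 22.6%
Z.AI GLM 5 | 50 | 46 | 8 | 4 | 0 | 21.4%
DeepSeek V4 Pro (Reasoning) | 56 | 29 | 11 | 6 | 2 | 20.8%
Claude Opus 4.7 | 33 | 30 | 13 | 10 | 9 | 19.1%
Aion 2.0 | 32 | 22 | 16 | 13 | 11 | 18.7%
GPT-5.2 | 27 | 20 | 19 | 15 | 9 | 18.1%
GPT-5.4 Nano (Reasoning) | 25 | 21 | 19 | 17 | 7 | 17.8%
Xiaomi MIMO v2.5 Pro | 66 | 10 | 8 | 3 | 2 | 17.6%
Qwen 3.5 27B | 29 | 21 | 16 | 11 | 11 | 17.5%
Z.AI GLM 4.6 | 32 | 21 | 15 | 11 | 5 | 16.8%
Claude Opus 4.5 | 27 | 18 | 14 | 9 | 9 | 15.6%
Claude Sonnet 4.6 | 20 | 19 | 17 | 13 | 8 | 15.5%
Z.AI GLM 4.7 | 21 | 18 | 17 | 11 | 9 | 15.4%
Qwen 3.5 397B A17B | 25 | 24 | 13 | 11 | 4 | 15.2%
Gemini 3 Flash (Preview, Reasoning) | 33 | 11 | 8 | 3 | 3 | 11.8%
Xiaomi MIMO v2.5 | 35 | 10 | 9 | 2 | 1 | 11.3%
Gemini 3 Flash (Preview) | 21 | 15 | 7 | 5 | 3 | 10.3%
GPT-5.1 | 20 | 12 | 8 | 7 | 4 | 10.1%
Claude Sonnet 4.5 | 13 | 11 | 11 | 8 | 6 | 9.8%
GPT-5.4 | 12 | 11 | 9 | 9 | 7 | 9.8%
Qwen3.6 Max Preview | 16 | 9 | 8 | 8 | 7 | 9.5%
Grok 4.20 (Beta, Reasoning) | 19 | 13 | 7 | 5 | 1 | 9.2%
Qwen 3.5 Plus (2026-02-15) | 14 | 10 | 6 | 5 | 5 | 8.3%
Mistral Medium 3.1 | 21 | 7 | 7 | 3 | 2 | 8.2%
GPT-5 Mini | 13 | 9 | 9 | 5 | 5 | 8.0%
Grok 4.20 (Reasoning) | 12 | 11 | 9 | 4 | 2 | 7.7%
ByteDance Seed 1.6 | 19 | 10 | 5 | 4 | 0 | 7.5%
Qwen 3.5 Flash | 17 | 7 | 5 | 4 | 1 | 6.9%
o4 Mini | 10 | 9 | 6 | 4 | 3 | 6.5%
Qwen 3.5 Plus (2026-04-20) | 13 | 6 | 6 | 5 | 2 | 6.3%
Claude Opus 4 | 9 | 7 | 7 | 5 | 2 | 6.1%
ByteDance Seed 2.0 Mini | 10 | 9 | 5 | 3 | 2 | 6.0%
Gemini 2.5 Flash Lite (Reasoning) | 19 | 7 | 3 | 1 | 0 | 5.8%
Gemini 2.5 Flash (Reasoning) | 24 | 2 | 2 | 2 | 0 | 5.7%
Qwen 3.6 27B | 10 | 8 | 5 | 3 | 2 | 5.6%
GPT-5 | 13 | 5 | 4 | 3 | 2 | 5.5%
Qwen 3.5 35B | 14 | 6 | 5 | 2 | 0 | 5.4%
o4 Mini High | 8 | 7 | 6 | 4 | 2 | 5.4%
Z.AI GLM 4.5 | 9 | 7 | 4 | 3 | 2 | 4.9%
Qwen 3.5 122B | 9 | 5 | 4 | 4 | 0 | 4.3%
GPT-OSS 120B | 7 | 5 | 4 | 3 | 2 | 4.3%
DeepSeek V3.2 | 6 | 5 | 3 | 3 | 2 | 3.8%
Ministral 8B | 8 | 4 | 3 | 3 | 1 | 3.7%
Inception Mercury 2 | 9 | 6 | 1 | 1 | 1 | 3.6%
MiniMax M2.5 | 6 | 4 | 3 | 2 | 2 | 3.6%
Claude Sonnet 4 | 8 | 2 | 2 | 2 | 2 | 3.3%
GPT-5.4 Mini (Reasoning, Low) | 6 | 5 | 3 | 2 | 1 | 3.2%
Qwen 3.6 Flash | 8 | 2 | 2 | 2 | 1 | 3.0%
DeepSeek V4 Pro | 10 | 1 | 1 | 1 | 1 | 2.9%
GPT-4.1 | 4 | 4 | 3 | 2 | 2 | 2.8%
Ministral 3 8B | 5 | 4 | 3 | 2 | 1 | 2.8%
Gemma 4 31B (Reasoning) | 4 | 3 | 2 | 2 | 0 | 2.4%
Qwen 3.6 35B | 5 | 3 | 3 | 0 | 0 | 2.3%
Gemma 4 31B | 3 | 2 | 2 | 2 | 2 | 2.2%
WizardLM 2 8x22b | 4 | 3 | 2 | 1 | 0 | 2.1%
ByteDance Seed 2.0 Lite | 3 | 3 | 3 | 2 | 0 | 2.1%
MiniMax M2.7 | 4 | 3 | 2 | 1 | 1 | 2.1%
Gemma 4 26B | 2 | 2 | 2 | 2 | 2 | 2.0%
Grok 4.20 | 4 | 3 | 2 | 0 | 0 | 1.9%
Ministral 3 14B | 3 | 3 | 2 | 1 | 1 | 1.9%
Z.AI GLM 4.7 Flash | 5 | 2 | 2 | 0 | 0 | 1.9%
GPT-5.4 Nano | 3 | 2 | 2 | 2 | 1 | 1.8%
DeepSeek V4 Flash | 3 | 2 | 2 | 2 | 1 | 1.8%
GPT-5.4 Nano (Reasoning, Low) | 3 | 2 | 2 | 1 | 1 | 1.8%
Mistral Large | 3 | 2 | 1 | 1 | 1 | 1.7%
Z.AI GLM 4.5 Air | 4 | 2 | 1 | 1 | 0 | 1.7%
Ministral 3 3B | 4 | 2 | 1 | 1 | 0 | 1.6%
Gemini 2.5 Flash | 2 | 2 | 2 | 1 | 1 | 1.6%
Claude Haiku 4.5 | 2 | 2 | 2 | 1 | 1 | 1.6%
Mistral Large 3 | 2 | 2 | 2 | 1 | 1 | 1.5%
Mistral Large 2 | 2 | 2 | 1 | 1 | 1 | 1.5%
Grok 4.3 | 3 | 2 | 1 | 1 | 0 | 1.4%
Cydonia 24B V4.1 | 4 | 2 | 1 | 1 | 0 | 1.4%
Writer: Palmyra X5 | 3 | 3 | 1 | 0 | 0 | 1.4%
Gemini 3.1 Flash Lite | 2 | 2 | 1 | 1 | 1 | 1.3%
Gemini 3.1 Flash Lite (Preview) | 2 | 1 | 1 | 1 | 1 | 1.3%
Gemma 4 26B (Reasoning) | 6 | 0 | 0 | 0 | 0 | 1.3%
Gemini 3.1 Flash Lite (Reasoning) | 2 | 1 | 1 | 1 | 1 | 1.2%
Hermes 3 405B | 5 | 1 | 0 | 0 | 0 | 1.1%
Ministral 3B | 2 | 1 | 1 | 1 | 0 | 1.1%
Qwen3 235B A22B Instruct 2507 | 4 | 1 | 1 | 0 | 0 | 1.1%
GPT-4o, Aug. 6th (temp=0) | 2 | 1 | 1 | 1 | 1 | 1.1%
Gemini 3.5 Flash (Reasoning, Minimal) | 2 | 2 | 1 | 0 | 0 | 1.1%
GPT-5.4 Mini | 2 | 1 | 1 | 1 | 0 | 1.0%
GPT-4.1 Mini | 2 | 1 | 1 | 1 | 0 | 1.0%
DeepSeek V3 (2024-12-26) | 2 | 1 | 1 | 0 | 0 | 1.0%
Grok 4.3 (Reasoning) | 2 | 1 | 1 | 1 | 0 | 0.9%
GPT-4o, May 13th (temp=0) | 2 | 1 | 1 | 0 | 0 | 0.9%
Qwen 3 32B | 3 | 1 | 1 | 0 | 0 | 0.9%
GPT-4o, Aug. 6th (temp=1) | 1 | 1 | 1 | 1 | 0 | 0.9%
Qwen 3.5 9B | 1 | 1 | 1 | 1 | 0 | 0.9%
GPT-4o, May 13th (temp=1) | 1 | 1 | 1 | 1 | 0 | 0.8%
Mistral Small 4 (Reasoning) | 1 | 1 | 1 | 1 | 0 | 0.8%
DeepSeek-V2 Chat | 2 | 1 | 1 | 0 | 0 | 0.7%
Mistral Small 3.2 24B | 1 | 1 | 1 | 0 | 0 | 0.6%
DeepSeek V3.1 | 1 | 1 | 0 | 0 | 0 | 0.6%
GPT-5 Nano | 1 | 1 | 1 | 0 | 0 | 0.5%
Llama 3.1 70B | 1 | 1 | 0 | 0 | 0 | 0.5%
Mistral Small 4 | 1 | 1 | 0 | 0 | 0 | 0.4%
Hermes 3 70B | 2 | 0 | 0 | 0 | 0 | 0.4%
Qwen 2.5 72B | 1 | 0 | 0 | 0 | 0 | 0.4%
Mistral NeMO | 1 | 0 | 0 | 0 | 0 | 0.3%
Grok 4.20 (Beta) | 1 | 0 | 0 | 0 | 0 | 0.2%
Gemini 2.5 Flash Lite | 1 | 0 | 0 | 0 | 0 | 0.2%
Llama 3.1 8B | 0 | 0 | 0 | 0 | 0 | 0.1%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0.1%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0.1%
Skyfall 36B V2 | 0 | 0 | 0 | 0 | 0 | 0.1%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0.0%
Cohere Command R+ (Aug. 2024) | 0 | 0 | 0 | 0 | 0 | 0.0%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0.0%
Rocinante 12B | 0 | 0 | 0 | 0 | 0 | 0.0%
DeepSeek V3 (2025-03-24) | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Super | 0 | | | | | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0.0%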
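
The page does not say how the Overall Performance score is derived. For models that appear in both scenario tables, though, the listed score looks consistent with a plain mean of the two scenario averages. The sketch below checks that reading against three rows copied from the tables. The helper name `overall_score` and the 0.1-point rounding tolerance are assumptions made for illustration, not part of the benchmark.

```python
from statistics import mean


def overall_score(scenario_averages):
    """Hypothetical helper: plain mean of per-scenario averages (percent)."""
    return mean(scenario_averages)


# (scenario averages, listed overall score), copied from the tables.
samples = {
    "GPT-5.4 (Reasoning)": ([100.0, 92.6], 96.3),
    "GPT-5.5 (Reasoning)": ([100.0, 82.9], 91.5),
    "Claude Opus 4.6 (Reasoning)": ([100.0, 77.3], 88.7),
}

for model, (averages, listed) in samples.items():
    computed = overall_score(averages)
    # Assumed tolerance: the published figures are rounded to one decimal.
    matches = abs(computed - listed) <= 0.1
    print(f"{model}: computed {computed:.2f}, listed {listed}, match={matches}")
```

Models that appear in only one scenario table (for example LFM2 24B or Nemotron 3 Super) may be scored differently, so this check should not be read as the benchmark's actual formula.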