Relationship category recall

Test: Relationship tree

Avg. Score: 24.5%
Scenarios: 2

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | GPT-5.4 (Reasoning) | 94.3% | $0.175 | 2.6m | 80%
2 | GPT-5.5 (Reasoning) | 90.0% | $0.251 | 2.0m | 72%
3 | GPT-5.5 | 70.8% | $0.105 | 37.1s | 31%
4 | MoonshotAI: Kimi K2.6 | 80.3% | $0.082 | 5.0m | 56%
5 | GPT-5.5 (Reasoning, Low) | 70.3% | $0.097 | 42.6s | 27%
6 | Claude Opus 4.6 | 72.3% | $0.154 | 31.4s | 33%
7 | Claude Opus 4.6 (Reasoning) | 89.6% | $0.356 | 2.6m | 72%
8 | GPT-5.4 (Reasoning, Low) | 47.0% | $0.053 | 35.5s | 33%
9 | Gemini 3 Flash (Preview) | 34.9% | $0.0087 | 7.1s | 23%
10 | Claude Opus 4.8 (Reasoning) | 84.0% | $0.381 | 1.9m | 61%
11 | Gemini 2.5 Pro | 63.0% | $0.099 | 58.2s | 19%
12 | Claude Opus 4.7 | 63.8% | $0.182 | 24.3s | 25%
13 | Claude Opus 4.7 (Reasoning) | 67.2% | $0.184 | 25.0s | 21%
14 | Gemini 3.5 Flash (Reasoning) | 58.2% | $0.113 | 49.4s | 20%
15 | Claude Sonnet 4.6 | 43.9% | $0.081 | 23.3s | 22%
16 | GPT-5.4 | 29.2% | $0.033 | 17.4s | 23%
17 | Grok 4.20 (Beta, Reasoning) | 27.8% | $0.032 | 30.1s | 25%
18 | DeepSeek V4 Flash (Reasoning) | 54.5% | $0.0041 | 3.5m | 23%
19 | GPT-5.2 | 40.9% | $0.062 | 50.7s | 21%
20 | Gemini 3 Flash (Preview, Reasoning) | 47.3% | $0.030 | 40.3s | 8%
21 | MoonshotAI: Kimi K2.5 | 66.3% | $0.031 | 5.4m | 31%
22 | Z.AI GLM 5 Turbo | 53.5% | $0.044 | 3.6m | 27%
23 | GPT-5.1 | 32.7% | $0.025 | 17.7s | 13%
24 | GPT-5.4 Nano (Reasoning) | 24.7% | $0.0088 | 41.7s | 21%
25 | Xiaomi MIMO v2.5 | 33.8% | $0.0027 | 1.1m | 15%
26 | Z.AI GLM 4.6 | 39.3% | $0.018 | 3.0m | 29%
27 | Grok 4.20 (Reasoning) | 26.5% | $0.027 | 40.0s | 20%
28 | Gemini 3.1 Pro (Preview) | 66.3% | $0.172 | 1.5m | 15%
29 | Claude Sonnet 4.5 | 25.0% | $0.077 | 18.6s | 23%
30 | GPT-5.4 Mini (Reasoning) | 51.1% | $0.117 | 2.8m | 29%
31 | ByteDance Seed 1.6 | 25.4% | $0.014 | 1.2m | 18%
32 | Qwen3.7 Max | 55.8% | $0.045 | 3.5m | 15%
33 | Qwen 3.5 397B A17B | 35.0% | $0.045 | 2.6m | 23%
34 | o4 Mini | 20.7% | $0.037 | 45.7s | 18%
35 | Claude Opus 4.5 | 36.6% | $0.131 | 19.8s | 17%
36 | Inception Mercury 2 | 18.1% | $0.0096 | 12.8s | 10%
37 | Mistral Medium 3.1 | 16.2% | $0.010 | 21.2s | 13%
38 | MiniMax M2.5 | 17.9% | $0.0036 | 45.0s | 14%
39 | MiniMax M2.7 | 17.2% | $0.0063 | 21.4s | 11%
40 | Z.AI GLM 5.1 | 73.7% | $0.081 | 7.3m | 36%
41 | GPT-5 | 56.5% | $0.092 | 2.3m | 9%
42 | DeepSeek V4 Pro (Reasoning) | 36.9% | $0.020 | 2.3m | 12%
43 | Gemma 4 26B | 13.1% | $0.0018 | 27.2s | 13%
44 | Ministral 8B | 14.1% | $0.0016 | 15.6s | 10%
45 | Gemini 3.1 Flash Lite (Reasoning) | 11.9% | $0.0033 | 3.2s | 10%
46 | Gemini 3.5 Flash (Reasoning, Minimal) | 24.1% | $0.024 | 6.3s | 4%
47 | Ministral 3 8B | 14.2% | $0.0022 | 17.0s | 10%
48 | Gemini 3.1 Flash Lite (Preview) | 11.7% | $0.0046 | 3.2s | 10%
49 | Xiaomi MIMO v2.5 Pro | 26.5% | $0.0076 | 1.9m | 14%
50 | Gemini 2.5 Flash | 14.3% | $0.0075 | 6.1s | 8%
51 | Grok 4.3 | 13.4% | $0.0094 | 7.8s | 9%
52 | DeepSeek V3.2 | 14.0% | $0.0046 | 33.9s | 11%
53 | Z.AI GLM 4.7 | 40.9% | $0.025 | 3.7m | 19%
54 | Ministral 3 14B | 12.8% | $0.0031 | 25.4s | 10%
55 | Gemini 3.1 Flash Lite | 12.0% | $0.0031 | 3.3s | 8%
56 | Qwen 3.6 Flash | 17.4% | $0.015 | 51.7s | 12%
57 | o4 Mini High | 27.4% | $0.059 | 1.3m | 15%
58 | GPT-4.1 | 13.4% | $0.024 | 8.5s | 11%
59 | DeepSeek V4 Pro | 13.1% | $0.0083 | 17.3s | 10%
60 | Nemotron 3 Super | 0.0% | $0.0000 | 3.2m |
61 | Mistral Large 3 | 12.0% | $0.0091 | 17.5s | 10%
62 | GPT-5.4 Mini (Reasoning, Low) | 11.4% | $0.014 | 14.4s | 10%
63 | GPT-5 Mini | 27.1% | $0.019 | 2.0m | 13%
64 | Qwen 3.6 35B | 15.1% | $0.014 | 1.1m | 14%
65 | Qwen 3.5 35B | 19.6% | $0.015 | 2.0m | 17%
66 | DeepSeek V4 Flash | 15.5% | $0.0020 | 51.4s | 8%
67 | Z.AI GLM 4.5 | 13.3% | $0.0075 | 48.1s | 10%
68 | MiniMax M3 | 65.5% | $0.036 | 9.3m | 44%
69 | Mistral Large 2 | 13.3% | $0.037 | 20.5s | 10%
70 | Qwen 3.5 27B | 32.1% | $0.029 | 3.7m | 22%
71 | Grok 4.20 | 8.6% | $0.020 | 7.5s | 9%
72 | GPT-5.4 Mini | 9.1% | $0.0087 | 7.9s | 6%
73 | GPT-4o, Aug. 6th (temp=0) | 15.2% | $0.032 | 8.4s | 5%
74 | GPT-5.4 Nano | 7.4% | $0.0026 | 11.2s | 6%
75 | Ministral 3B | 7.5% | $0.0007 | 9.8s | 4%
76 | Qwen3.6 Max Preview | 43.8% | $0.084 | 3.7m | 19%
77 | Claude Sonnet 4 | 17.7% | $0.072 | 17.5s | 9%
78 | Ministral 3 3B | 6.1% | $0.0007 | 9.9s | 5%
79 | Qwen 3.5 Plus (2026-02-15) | 33.7% | $0.021 | 3.9m | 17%
80 | Mistral Small 3.2 24B | 6.6% | $0.0017 | 13.8s | 5%
81 | DeepSeek V3 (2024-12-26) | 7.6% | $0.0045 | 40.2s | 8%
82 | Claude Opus 4.8 (Reasoning, Low) | 69.3% | $0.383 | 1.9m | 33%
83 | GPT-4.1 Mini | 7.0% | $0.0046 | 26.0s | 6%
84 | Mistral Large | 12.5% | $0.036 | 41.1s | 9%
85 | Qwen 3.5 Plus (2026-04-20) | 22.3% | $0.023 | 3.1m | 19%
86 | GPT-5.4 Nano (Reasoning, Low) | 5.6% | $0.0038 | 13.8s | 4%
87 | Grok 4.3 (Reasoning) | 5.5% | $0.013 | 13.9s | 6%
88 | DeepSeek-V2 Chat | 7.5% | $0.0049 | 34.0s | 6%
89 | Claude Haiku 4.5 | 7.6% | $0.024 | 8.2s | 5%
90 | Writer: Palmyra X5 | 5.8% | $0.016 | 18.4s | 7%
91 | Qwen 3.6 27B | 26.3% | $0.034 | 2.7m | 13%
92 | GPT-OSS 120B | 17.1% | $0.0040 | 2.2m | 11%
93 | DeepSeek V3.1 | 11.2% | $0.0055 | 52.0s | 5%
94 | Mistral Small 4 (Reasoning) | 5.9% | $0.0051 | 24.7s | 5%
95 | Grok 4.20 (Beta) | 4.5% | $0.019 | 5.5s | 5%
96 | Gemma 4 31B | 14.1% | $0.0028 | 1.9m | 10%
97 | Aion 2.0 | 24.4% | $0.024 | 3.0m | 14%
98 | Gemini 2.5 Flash (Reasoning) | 10.7% | $0.023 | 37.3s | 4%
99 | Gemini 2.5 Flash Lite (Reasoning) | 10.3% | $0.0073 | 1.0m | 5%
100 | Z.AI GLM 5 | 43.2% | $0.037 | 4.6m | 13%
101 | Llama 3.1 70B | 5.6% | $0.0064 | 34.0s | 4%
102 | Mistral NeMO | 4.3% | $0.0024 | 15.4s | 1%
103 | GPT-4o, Aug. 6th (temp=1) | 4.3% | $0.029 | 9.1s | 4%
104 | Mistral Small 4 | 1.8% | $0.0030 | 7.4s | 0%
105 | GPT-4.1 Nano | 0.6% | $0.0009 | 6.9s | 0%
106 | GPT-4o, May 13th (temp=0) | 14.0% | $0.088 | 5.5s | 4%
107 | Gemini 2.5 Flash Lite | 0.7% | $0.0023 | 4.8s | 0%
108 | Claude 3 Haiku | 1.3% | $0.0056 | 11.6s | 1%
109 | Qwen 3.5 Flash | 22.4% | $0.0040 | 3.8m | 14%
110 | LFM2 24B | 0.0% | $0.0004 | 7.8s | 0%
111 | ByteDance Seed 1.6 Flash | 3.5% | $0.0019 | 38.2s | 1%
112 | Qwen 2.5 72B | 5.0% | $0.0056 | 1.2m | 5%
113 | Qwen3 235B A22B Instruct 2507 | 4.3% | $0.0017 | 1.0m | 4%
114 | Cydonia 24B V4.1 | 5.4% | $0.0040 | 1.0m | 3%
115 | Gemma 3 27B | 0.8% | $0.0016 | 27.5s | 0%
116 | Skyfall 36B V2 | 1.4% | $0.0054 | 26.5s | 0%
117 | Hermes 3 70B | 2.4% | $0.0045 | 42.7s | 1%
118 | Gemma 3 4B | 0.3% | $0.0007 | 27.5s | 0%
119 | Rocinante 12B | 1.4% | $0.0033 | 32.1s | 0%
120 | Gemma 3 12B | 0.0% | $0.0014 | 30.4s | 0%
121 | Hermes 3 405B | 4.0% | $0.015 | 1.0m | 2%
122 | Llama 3.1 8B | 1.2% | $0.0005 | 51.5s | 0%
123 | DeepSeek V3 (2025-03-24) | 1.8% | $0.0037 | 50.3s | 0%
124 | GPT-4o, May 13th (temp=1) | 6.0% | $0.086 | 5.4s | 4%
125 | WizardLM 2 8x22b | 9.8% | $0.0098 | 2.3m | 6%
126 | ByteDance Seed 2.0 Lite | 25.0% | $0.014 | 3.6m | 4%
127 | Gemma 4 26B (Reasoning) | 11.6% | $0.0041 | 3.1m | 6%
128 | Z.AI GLM 4.7 Flash | 7.3% | $0.0050 | 2.8m | 7%
129 | GPT-5 Nano | 4.8% | $0.0080 | 2.2m | 3%
130 | Qwen 3.5 122B | 17.9% | $0.035 | 4.0m | 12%
131 | Z.AI GLM 4.5 Air | 8.5% | $0.0071 | 3.2m | 6%
132 | Arcee AI: Trinity Mini | 0.7% | $0.0037 | 1.9m | 0%
133 | Qwen 3 32B | 5.3% | $0.0042 | 2.9m | 4%
134 | Cohere Command R+ (Aug. 2024) | 1.0% | $0.057 | 56.5s | 0%
135 | Gemma 4 31B (Reasoning) | 25.2% | $0.0043 | 5.8m | 7%
136 | Claude Sonnet 4.6 (Reasoning) | 75.9% | $0.448 | 5.9m | 43%
137 | Nemotron 3 Nano | 0.9% | $0.0039 | 2.8m | 0%
138 | ByteDance Seed 2.0 Mini | 20.8% | $0.0062 | 8.7m | 20%
139 | Claude Opus 4 | 20.0% | $0.368 | 2.4m | 16%
140 | Qwen 3.5 9B | 8.5% | $0.0040 | 7.8m | 4%
Average score: 24.45%

Individual Scenarios

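Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.7 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.8 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude Opus 4.7 | 100 | 100 | 100 | 84 | 84 | 93.7%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 88 | 77 | 93.1%
Z.AI GLM 5.1 | 100 | 100 | 100 | 100 | 59 | 91.7%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 59 | 91.7%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 53 | 90.6%
Qwen3.7 Max | 100 | 100 | 100 | 77 | 32 | 81.8%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 44 | 44 | 77.4%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 59 | 4 | 72.5%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 59 | 4 | 72.5%
Z.AI GLM 5 Turbo | 100 | 71 | 71 | 55 | 53 | 69.8%
Claude Opus 4.8 (Reasoning, Low) | 100 | 100 | 77 | 59 | 4 | 67.9%
MiniMax M3 | 84 | 77 | 71 | 59 | 48 | 67.8%
Qwen3.6 Max Preview | 100 | 59 | 59 | 59 | 44 | 63.9%
GPT-5.4 Mini (Reasoning) | 94 | 77 | 59 | 53 | 32 | 62.9%
Z.AI GLM 5 | 100 | 100 | 59 | 32 | 18 | 61.7%
Claude Sonnet 4.6 | 100 | 59 | 48 | 48 | 48 | 60.7%
GPT-5.4 (Reasoning, Low) | 68 | 68 | 48 | 46 | 46 | 55.4%
GPT-5.2 | 84 | 68 | 48 | 41 | 29 | 54.2%
Z.AI GLM 4.6 | 59 | 59 | 59 | 46 | 42 | 52.7%
Z.AI GLM 4.7 | 92 | 59 | 59 | 39 | 7 | 51.1%
Gemini 3 Flash (Preview) | 59 | 59 | 48 | 45 | 45 | 51.0%
GPT-5.1 | 84 | 56 | 48 | 30 | 24 | 48.7%
Claude Opus 4.5 | 100 | 48 | 32 | 32 | 32 | 48.6%
DeepSeek V4 Pro (Reasoning) | 100 | 59 | 32 | 22 | 22 | 47.0%
Qwen 3.5 Plus (2026-02-15) | 71 | 59 | 48 | 28 | 28 | 46.8%
DeepSeek V4 Flash (Reasoning) | 100 | 42 | 32 | 30 | 29 | 46.6%
Qwen 3.5 397B A17B | 59 | 48 | 48 | 46 | 27 | 45.6%
Xiaomi MIMO v2.5 | 100 | 38 | 32 | 32 | 22 | 44.7%
o4 Mini High | 64 | 53 | 34 | 32 | 32 | 42.9%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 59 | 32 | 13 | 10 | 42.6%
Gemma 4 31B (Reasoning) | 64 | 59 | 48 | 32 | 5 | 41.6%
ByteDance Seed 2.0 Lite | 100 | 59 | 32 | 11 | 5 | 41.3%
GPT-5.4 | 46 | 38 | 38 | 38 | 38 | 39.4%
GPT-5 Mini | 72 | 48 | 32 | 22 | 22 | 39.4%
Qwen 3.6 27B | 88 | 39 | 27 | 23 | 8 | 37.1%
ByteDance Seed 1.6 | 45 | 43 | 40 | 24 | 20 | 34.5%
Qwen 3.5 27B | 68 | 39 | 32 | 32 | 0 | 34.1%
Grok 4.20 (Beta, Reasoning) | 38 | 35 | 35 | 35 | 25 | 33.8%
Grok 4.20 (Reasoning) | 49 | 33 | 33 | 29 | 22 | 33.2%
Claude Sonnet 4.5 | 32 | 32 | 32 | 32 | 32 | 31.6%
Qwen 3.5 Flash | 51 | 36 | 32 | 20 | 17 | 31.1%
Inception Mercury 2 | 48 | 38 | 24 | 22 | 17 | 29.7%
Xiaomi MIMO v2.5 Pro | 59 | 44 | 27 | 11 | 7 | 29.4%
MiniMax M2.7 | 38 | 32 | 22 | 22 | 15 | 25.8%
Qwen 3.5 35B | 32 | 32 | 29 | 27 | 10 | 25.8%
Qwen 3.6 Flash | 38 | 32 | 27 | 17 | 15 | 25.7%
Qwen 3.5 122B | 36 | 32 | 28 | 22 | 10 | 25.7%
Claude Sonnet 4 | 32 | 32 | 32 | 32 | 1 | 25.6%
o4 Mini | 35 | 32 | 22 | 22 | 15 | 25.4%
MiniMax M2.5 | 32 | 30 | 24 | 22 | 19 | 25.3%
GPT-OSS 120B | 37 | 30 | 22 | 22 | 14 | 25.2%
Qwen 3.5 Plus (2026-04-20) | 40 | 27 | 25 | 23 | 10 | 25.0%
ByteDance Seed 2.0 Mini | 32 | 24 | 22 | 22 | 22 | 24.5%
GPT-4o, Aug. 6th (temp=0) | 55 | 32 | 24 | 11 | 1 | 24.4%
GPT-5.4 Nano (Reasoning) | 39 | 24 | 24 | 22 | 12 | 24.4%
Claude Opus 4 | 32 | 32 | 32 | 24 | 2 | 24.3%
GPT-4o, May 13th (temp=0) | 34 | 28 | 28 | 28 | 2 | 24.0%
Qwen 3.6 35B | 32 | 22 | 22 | 22 | 20 | 23.7%
DeepSeek V4 Flash | 38 | 24 | 24 | 20 | 11 | 23.3%
Gemini 2.5 Flash | 32 | 29 | 20 | 14 | 11 | 21.0%
Gemma 4 26B (Reasoning) | 35 | 32 | 19 | 14 | 4 | 20.6%
Mistral Large 2 | 27 | 24 | 17 | 17 | 17 | 20.1%
Aion 2.0 | 55 | 44 | 0 | 0 | 0 | 19.6%
Grok 4.3 | 38 | 20 | 17 | 12 | 11 | 19.5%
DeepSeek V3.1 | 32 | 24 | 24 | 17 | 0 | 19.2%
Ministral 3 8B | 30 | 24 | 23 | 11 | 7 | 18.9%
Gemma 4 31B | 27 | 25 | 19 | 12 | 10 | 18.6%
Mistral Large 3 | 24 | 17 | 17 | 17 | 17 | 18.0%
Mistral Large | 32 | 17 | 15 | 14 | 13 | 17.9%
Gemini 3.1 Flash Lite (Reasoning) | 19 | 19 | 19 | 19 | 15 | 17.9%
Ministral 3 14B | 22 | 21 | 19 | 13 | 12 | 17.6%
Gemma 4 26B | 19 | 19 | 17 | 17 | 17 | 17.5%
Gemini 3.1 Flash Lite | 24 | 19 | 19 | 15 | 10 | 17.4%
Gemini 3.1 Flash Lite (Preview) | 22 | 19 | 15 | 15 | 15 | 17.3%
DeepSeek V4 Pro | 32 | 20 | 15 | 15 | 4 | 17.2%
DeepSeek V3.2 | 24 | 22 | 17 | 14 | 8 | 16.8%
Z.AI GLM 4.5 | 22 | 21 | 15 | 11 | 11 | 16.1%
GPT-4.1 | 24 | 19 | 12 | 12 | 9 | 15.1%
Ministral 8B | 28 | 17 | 14 | 10 | 7 | 15.1%
GPT-5.4 Mini (Reasoning, Low) | 20 | 17 | 16 | 16 | 1 | 14.3%
GPT-5.4 Mini | 17 | 17 | 17 | 11 | 8 | 13.6%
Qwen 3.5 9B | 21 | 18 | 14 | 12 | 2 | 13.3%
Mistral Medium 3.1 | 24 | 21 | 11 | 7 | 2 | 13.1%
Grok 4.20 | 18 | 12 | 11 | 11 | 7 | 11.7%
Ministral 3B | 21 | 18 | 9 | 9 | 1 | 11.5%
DeepSeek-V2 Chat | 24 | 11 | 10 | 7 | 4 | 11.2%
Claude Haiku 4.5 | 17 | 17 | 12 | 6 | 5 | 11.2%
WizardLM 2 8x22b | 24 | 24 | 3 | 3 | 2 | 11.1%
Z.AI GLM 4.5 Air | 29 | 11 | 7 | 5 | 0 | 10.3%
Z.AI GLM 4.7 Flash | 17 | 11 | 9 | 8 | 7 | 10.3%
DeepSeek V3 (2024-12-26) | 12 | 11 | 11 | 10 | 7 | 10.2%
GPT-4.1 Mini | 12 | 11 | 10 | 9 | 7 | 9.7%
Mistral Small 3.2 24B | 14 | 11 | 10 | 10 | 4 | 9.5%
Gemini 2.5 Flash (Reasoning) | 32 | 5 | 4 | 2 | 2 | 9.1%
Gemini 2.5 Flash Lite (Reasoning) | 22 | 7 | 5 | 4 | 4 | 8.5%
Llama 3.1 70B | 17 | 10 | 8 | 5 | 3 | 8.3%
Ministral 3 3B | 12 | 10 | 10 | 7 | 3 | 8.3%
Writer: Palmyra X5 | 12 | 10 | 8 | 8 | 4 | 8.2%
Grok 4.20 (Beta) | 11 | 10 | 7 | 7 | 7 | 8.1%
Qwen 2.5 72B | 13 | 9 | 6 | 6 | 6 | 8.0%
GPT-4o, May 13th (temp=1) | 15 | 11 | 7 | 5 | 3 | 8.0%
Mistral Small 4 (Reasoning) | 12 | 11 | 10 | 7 | 0 | 7.9%
Mistral NeMO | 22 | 12 | 3 | 2 | 0 | 7.9%
GPT-5.4 Nano | 19 | 7 | 6 | 5 | 2 | 7.9%
Qwen 3 32B | 11 | 10 | 6 | 6 | 3 | 7.1%
GPT-5 Nano | 13 | 9 | 8 | 4 | 1 | 6.9%
Cydonia 24B V4.1 | 20 | 7 | 3 | 2 | 0 | 6.5%
Qwen3 235B A22B Instruct 2507 | 10 | 8 | 6 | 5 | 3 | 6.0%
Grok 4.3 (Reasoning) | 9 | 7 | 7 | 6 | 0 | 5.8%
GPT-5.4 Nano (Reasoning, Low) | 14 | 5 | 3 | 2 | 2 | 5.1%
GPT-4o, Aug. 6th (temp=1) | 10 | 7 | 6 | 1 | 1 | 5.1%
Hermes 3 405B | 11 | 6 | 3 | 2 | 0 | 4.6%
Hermes 3 70B | 11 | 3 | 2 | 2 | 1 | 3.7%
DeepSeek V3 (2025-03-24) | 7 | 7 | 4 | 0 | 0 | 3.6%
ByteDance Seed 1.6 Flash | 14 | 1 | 1 | 1 | 0 | 3.5%
Rocinante 12B | 5 | 4 | 3 | 2 | 0 | 2.9%
Skyfall 36B V2 | 8 | 3 | 2 | 0 | 0 | 2.6%
Cohere Command R+ (Aug. 2024) | 7 | 3 | 0 | 0 | 0 | 1.9%
Llama 3.1 8B | 7 | 2 | 0 | 0 | 0 | 1.8%
Claude 3 Haiku | 3 | 2 | 2 | 1 | 1 | 1.8%
Nemotron 3 Nano | 4 | 2 | 2 | 0 | 0 | 1.7%
Mistral Small 4 | 8 | 0 | 0 | 0 | 0 | 1.5%
Gemma 3 27B | 2 | 2 | 2 | 2 | 0 | 1.4%
Arcee AI: Trinity Mini | 5 | 1 | 1 | 0 | 0 | 1.4%
GPT-4.1 Nano | 2 | 2 | 1 | 0 | 0 | 1.1%
Gemini 2.5 Flash Lite | 1 | 1 | 0 | 0 | 0 | 0.4%
Gemma 3 4B | 1 | 1 | 0 | 0 | 0 | 0.3%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0.0%
LFM2 24B | 0 | 0 | 0 | 0 | 0 | 0.0%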
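
Model | #1 | #2 | #3 | #4 | #5 | Avg
GPT-5.4 (Reasoning) | 100 | 100 | 91 | 81 | 71 | 88.7%
GPT-5.5 (Reasoning) | 81 | 81 | 81 | 81 | 74 | 80.1%
Claude Opus 4.6 (Reasoning) | 84 | 79 | 79 | 79 | 74 | 79.1%
Claude Opus 4.8 (Reasoning, Low) | 88 | 70 | 68 | 66 | 62 | 70.7%
Claude Opus 4.8 (Reasoning) | 90 | 66 | 62 | 62 | 60 | 68.0%
MoonshotAI: Kimi K2.6 | 89 | 67 | 66 | 60 | 56 | 67.6%
MiniMax M3 | 92 | 74 | 58 | 48 | 43 | 63.1%
DeepSeek V4 Flash (Reasoning) | 90 | 78 | 58 | 48 | 38 | 62.4%
Z.AI GLM 5.1 | 70 | 62 | 61 | 50 | 36 | 55.7%
MoonshotAI: Kimi K2.5 | 71 | 70 | 56 | 50 | 29 | 55.2%
Claude Opus 4.6 | 70 | 60 | 59 | 47 | 34 | 54.0%
Claude Sonnet 4.6 (Reasoning) | 77 | 50 | 49 | 42 | 42 | 51.9%
Gemini 3.5 Flash (Reasoning) | 54 | 51 | 49 | 39 | 27 | 44.0%
GPT-5.5 | 56 | 45 | 45 | 40 | 22 | 41.6%
GPT-5.5 (Reasoning, Low) | 74 | 54 | 41 | 33 | 0 | 40.6%
GPT-5.4 Mini (Reasoning) | 56 | 54 | 51 | 26 | 9 | 39.2%
GPT-5.4 (Reasoning, Low) | 60 | 53 | 31 | 25 | 24 | 38.6%
Z.AI GLM 5 Turbo | 66 | 65 | 35 | 20 | 0 | 37.2%
Claude Opus 4.7 (Reasoning) | 66 | 61 | 29 | 10 | 7 | 34.5%
Gemini 2.5 Pro | 64 | 46 | 34 | 25 | 3 | 34.3%
Claude Opus 4.7 | 60 | 44 | 24 | 21 | 20 | 33.9%
Gemini 3.1 Pro (Preview) | 79 | 70 | 5 | 5 | 4 | 32.7%
Z.AI GLM 4.7 | 35 | 32 | 31 | 29 | 25 | 30.6%
Qwen 3.5 27B | 37 | 34 | 31 | 25 | 22 | 30.0%
Qwen3.7 Max | 61 | 55 | 34 | 0 | 0 | 29.9%
Aion 2.0 | 43 | 38 | 24 | 20 | 19 | 29.1%
GPT-5.2 | 38 | 33 | 25 | 24 | 18 | 27.6%
Claude Sonnet 4.6 | 32 | 31 | 30 | 22 | 20 | 27.1%
DeepSeek V4 Pro (Reasoning) | 59 | 29 | 20 | 19 | 9 | 26.8%
Z.AI GLM 4.6 | 43 | 25 | 22 | 21 | 18 | 25.9%
GPT-5.4 Nano (Reasoning) | 33 | 27 | 26 | 23 | 17 | 25.1%
Z.AI GLM 5 | 45 | 42 | 21 | 16 | 0 | 24.8%
Claude Opus 4.5 | 36 | 24 | 23 | 21 | 18 | 24.6%
Qwen 3.5 397B A17B | 38 | 27 | 24 | 22 | 11 | 24.4%
Qwen3.6 Max Preview | 29 | 26 | 24 | 22 | 17 | 23.7%
Xiaomi MIMO v2.5 Pro | 58 | 26 | 19 | 8 | 7 | 23.7%
Xiaomi MIMO v2.5 | 53 | 30 | 20 | 6 | 6 | 22.9%
Gemini 3 Flash (Preview, Reasoning) | 39 | 27 | 23 | 11 | 11 | 22.2%
Grok 4.20 (Beta, Reasoning) | 31 | 30 | 20 | 20 | 9 | 21.9%
Qwen 3.5 Plus (2026-02-15) | 25 | 24 | 20 | 18 | 16 | 20.7%
Grok 4.20 (Reasoning) | 29 | 23 | 23 | 15 | 9 | 19.7%
Qwen 3.5 Plus (2026-04-20) | 29 | 22 | 19 | 14 | 14 | 19.6%
Mistral Medium 3.1 | 28 | 26 | 20 | 11 | 11 | 19.3%
GPT-5.4 | 20 | 20 | 19 | 18 | 17 | 18.9%
Gemini 3 Flash (Preview) | 25 | 23 | 18 | 16 | 11 | 18.8%
Claude Sonnet 4.5 | 23 | 20 | 19 | 18 | 12 | 18.4%
ByteDance Seed 2.0 Mini | 22 | 21 | 17 | 16 | 9 | 17.0%
GPT-5.1 | 24 | 23 | 15 | 11 | 11 | 16.8%
ByteDance Seed 1.6 | 25 | 23 | 16 | 16 | 1 | 16.3%
o4 Mini | 25 | 25 | 14 | 11 | 4 | 16.0%
Claude Opus 4 | 20 | 20 | 17 | 13 | 9 | 15.7%
Qwen 3.6 27B | 25 | 23 | 13 | 11 | 5 | 15.4%
GPT-5 Mini | 19 | 18 | 14 | 13 | 10 | 14.8%
GPT-5 | 20 | 15 | 14 | 11 | 10 | 14.2%
Qwen 3.5 Flash | 23 | 15 | 14 | 12 | 5 | 13.8%
Qwen 3.5 35B | 26 | 18 | 14 | 9 | 0 | 13.4%
Ministral 8B | 29 | 12 | 11 | 10 | 3 | 13.1%
Gemini 2.5 Flash (Reasoning) | 39 | 8 | 8 | 7 | 0 | 12.3%
Gemini 2.5 Flash Lite (Reasoning) | 32 | 18 | 7 | 4 | 0 | 12.1%
o4 Mini High | 16 | 15 | 12 | 10 | 7 | 11.9%
GPT-4.1 | 18 | 16 | 11 | 7 | 7 | 11.7%
DeepSeek V3.2 | 17 | 12 | 10 | 9 | 8 | 11.1%
MiniMax M2.5 | 14 | 10 | 10 | 9 | 9 | 10.5%
Z.AI GLM 4.5 | 16 | 10 | 9 | 9 | 8 | 10.4%
Qwen 3.5 122B | 18 | 12 | 12 | 10 | 0 | 10.2%
Claude Sonnet 4 | 14 | 9 | 9 | 9 | 9 | 9.9%
Gemma 4 31B | 13 | 9 | 9 | 9 | 8 | 9.6%
Ministral 3 8B | 13 | 13 | 9 | 8 | 5 | 9.5%
Qwen 3.6 Flash | 19 | 9 | 8 | 5 | 4 | 9.2%
GPT-OSS 120B | 13 | 10 | 8 | 8 | 5 | 9.0%
DeepSeek V4 Pro | 19 | 8 | 8 | 6 | 4 | 9.0%
ByteDance Seed 2.0 Lite | 14 | 11 | 10 | 9 | 0 | 8.8%
Gemma 4 31B (Reasoning) | 13 | 12 | 9 | 9 | 0 | 8.8%
Gemma 4 26B | 11 | 10 | 9 | 7 | 7 | 8.7%
WizardLM 2 8x22b | 17 | 10 | 10 | 6 | 1 | 8.6%
MiniMax M2.7 | 12 | 10 | 8 | 7 | 6 | 8.6%
GPT-5.4 Mini (Reasoning, Low) | 13 | 11 | 10 | 5 | 4 | 8.5%
Ministral 3 14B | 11 | 10 | 8 | 6 | 5 | 8.0%
DeepSeek V4 Flash | 10 | 9 | 7 | 7 | 6 | 7.7%
Gemini 2.5 Flash | 9 | 8 | 7 | 7 | 7 | 7.6%
Grok 4.3 | 13 | 8 | 7 | 7 | 2 | 7.4%
Mistral Large | 8 | 8 | 8 | 5 | 5 | 7.1%
GPT-5.4 Nano | 12 | 7 | 6 | 6 | 3 | 6.9%
Z.AI GLM 4.5 Air | 10 | 9 | 8 | 7 | 0 | 6.7%
Mistral Large 2 | 7 | 7 | 7 | 7 | 6 | 6.6%
Gemini 3.1 Flash Lite | 8 | 7 | 7 | 5 | 5 | 6.6%
Qwen 3.6 35B | 15 | 9 | 8 | 0 | 0 | 6.6%
Inception Mercury 2 | 12 | 7 | 7 | 3 | 3 | 6.4%
Gemini 3.1 Flash Lite (Preview) | 7 | 7 | 6 | 5 | 5 | 6.1%
GPT-4o, Aug. 6th (temp=0) | 9 | 6 | 6 | 5 | 4 | 6.1%
GPT-5.4 Nano (Reasoning, Low) | 8 | 7 | 7 | 5 | 4 | 6.1%
Mistral Large 3 | 7 | 6 | 6 | 6 | 5 | 6.0%
Gemini 3.1 Flash Lite (Reasoning) | 8 | 6 | 6 | 5 | 4 | 5.9%
Gemini 3.5 Flash (Reasoning, Minimal) | 9 | 8 | 8 | 1 | 1 | 5.5%
Grok 4.20 | 11 | 9 | 6 | 0 | 0 | 5.4%
Grok 4.3 (Reasoning) | 8 | 8 | 5 | 3 | 1 | 5.2%
DeepSeek V3 (2024-12-26) | 11 | 7 | 4 | 2 | 0 | 4.9%
GPT-5.4 Mini | 7 | 5 | 5 | 4 | 2 | 4.7%
Cydonia 24B V4.1 | 11 | 7 | 3 | 1 | 0 | 4.3%
Z.AI GLM 4.7 Flash | 8 | 6 | 5 | 2 | 0 | 4.3%
GPT-4.1 Mini | 6 | 5 | 4 | 4 | 2 | 4.2%
Claude Haiku 4.5 | 6 | 5 | 4 | 3 | 3 | 4.1%
GPT-4o, May 13th (temp=1) | 6 | 5 | 5 | 3 | 2 | 4.0%
Ministral 3 3B | 12 | 3 | 2 | 2 | 1 | 3.9%
GPT-4o, May 13th (temp=0) | 6 | 5 | 5 | 2 | 2 | 3.9%
Mistral Small 4 (Reasoning) | 8 | 4 | 4 | 4 | 0 | 3.9%
DeepSeek-V2 Chat | 7 | 6 | 3 | 2 | 1 | 3.9%
Qwen 3.5 9B | 5 | 5 | 4 | 3 | 2 | 3.8%
Mistral Small 3.2 24B | 5 | 4 | 4 | 3 | 2 | 3.7%
Ministral 3B | 5 | 5 | 4 | 4 | 1 | 3.6%
GPT-4o, Aug. 6th (temp=1) | 5 | 5 | 4 | 3 | 2 | 3.6%
Hermes 3 405B | 14 | 3 | 0 | 0 | 0 | 3.5%
Writer: Palmyra X5 | 7 | 7 | 2 | 1 | 0 | 3.4%
Qwen 3 32B | 10 | 4 | 3 | 0 | 0 | 3.4%
DeepSeek V3.1 | 7 | 5 | 2 | 1 | 1 | 3.2%
Llama 3.1 70B | 5 | 4 | 3 | 2 | 0 | 2.8%
GPT-5 Nano | 4 | 3 | 3 | 2 | 1 | 2.6%
Qwen3 235B A22B Instruct 2507 | 8 | 3 | 1 | 0 | 0 | 2.6%
Gemma 4 26B (Reasoning) | 13 | 0 | 0 | 0 | 0 | 2.6%
Mistral Small 4 | 6 | 3 | 1 | 1 | 0 | 2.1%
Qwen 2.5 72B | 5 | 2 | 2 | 0 | 0 | 2.0%
Hermes 3 70B | 5 | 0 | 0 | 0 | 0 | 1.0%
Grok 4.20 (Beta) | 4 | 0 | 0 | 0 | 0 | 1.0%
Gemini 2.5 Flash Lite | 4 | 0 | 0 | 0 | 0 | 0.9%
Claude 3 Haiku | 1 | 1 | 1 | 1 | 1 | 0.8%
Mistral NeMO | 2 | 1 | 0 | 0 | 0 | 0.7%
Llama 3.1 8B | 2 | 0 | 0 | 0 | 0 | 0.5%
Gemma 3 27B | 0 | 0 | 0 | 0 | 0 | 0.3%
Skyfall 36B V2 | 1 | 0 | 0 | 0 | 0 | 0.2%
GPT-4.1 Nano | 0 | 0 | 0 | 0 | 0 | 0.2%
Gemma 3 4B | 1 | 0 | 0 | 0 | 0 | 0.2%
Cohere Command R+ (Aug. 2024) | 0 | 0 | 0 | 0 | 0 | 0.1%
DeepSeek V3 (2025-03-24) | 0 | 0 | 0 | 0 | 0 | 0.1%
Arcee AI: Trinity Mini | 0 | 0 | 0 | 0 | 0 | 0.1%
Rocinante 12B | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Nano | 0 | 0 | 0 | 0 | 0 | 0.0%
Nemotron 3 Super | 0 | | | | | 0.0%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0.0%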