Multi-modal Relational Item Representation Learning for Inferring Substitutable and Complementary Items

Junting Wang; Chenghuan Guo; Jiao Yang; Yanhui Guo; Hari Sundaram; Yan Gao

arXiv:2507.22268·cs.IR·May 5, 2026

TL;DR

This paper introduces MMSC, a self-supervised multi-modal framework that improves inference of substitutable and complementary items by effectively handling noisy user behavior data and sparse associations.

Contribution

MMSC combines multi-modal encoding, self-supervised denoising, hierarchical aggregation, and LLM-assisted supervision to improve item relationship inference.

Findings

01 MMSC outperforms baselines by 26.1% in substitutable item inference.

02 MMSC outperforms baselines by 39.2% in complementary item inference.

03 The method remains effective for cold-start items.

Abstract

We study the problem of inferring substitutable and complementary items, which underpins applications such as alternative and follow-up purchase suggestions. Existing approaches typically learn from behavior-derived item-item associations using GNNs or leverage item content alone. However, these methods often overlook two key challenges: (i) user behaviors (e.g., co-view/co-purchase) only provide noisy weak supervision, and (ii) behavior signals are long-tailed, leaving many items with sparse associations. We propose MMSC, a self-supervised multi-modal relational representation learning framework that combines a multi-modal foundation model adapted to encode item metadata and a self-supervised denoising module that learns relationship-aware representations from noisy user behaviors, unified by a hierarchical aggregation mechanism. We further use LLM-assisted supervision to mitigate…
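
To make the idea of aggregating noisy behavior neighbors more concrete, here is a minimal sketch in Python. It is not the MMSC implementation: the function names (`cosine`, `softmax`, `aggregate`), the similarity-based weighting, and the `alpha` and `temperature` parameters are all assumptions chosen for illustration. It blends an item's content embedding with a weighted mean of its behavior neighbors, so that neighbors dissimilar to the item (a stand-in for noisy co-view/co-purchase links) contribute little, and items with no neighbors fall back to content alone.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(scores, temperature=0.1):
    """Numerically stable softmax; a lower temperature sharpens the weights."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(item_vec, neighbor_vecs, alpha=0.5, temperature=0.1):
    """Blend a content embedding with a similarity-weighted neighbor mean.

    Hypothetical stand-in for a denoising aggregation step: neighbors that
    look unlike the item get small weights. `alpha` is the share kept from
    the item's own content embedding.
    """
    if not neighbor_vecs:
        # Item with no behavior neighbors: rely on content alone.
        return list(item_vec)
    weights = softmax([cosine(item_vec, n) for n in neighbor_vecs], temperature)
    pooled = [
        sum(w * n[d] for w, n in zip(weights, neighbor_vecs))
        for d in range(len(item_vec))
    ]
    return [alpha * x + (1 - alpha) * p for x, p in zip(item_vec, pooled)]


if __name__ == "__main__":
    item = [1.0, 0.0]
    neighbors = [[1.0, 0.0], [0.0, 1.0]]  # second neighbor is orthogonal to the item
    print(aggregate(item, neighbors))
    print(aggregate(item, []))
```

With the toy inputs in the `__main__` block, the orthogonal neighbor receives a near-zero weight, so the result stays close to the item's own embedding. A learned model would replace the fixed cosine weighting with trained parameters.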
