Structural and Disentangled Adaptation of Large Vision Language Models for Multimodal Recommendation

Zhongtao Rao; Peilin Zhou; Dading Chong; Zhiwei Chen; Shoujin Wang; Nan Tang

arXiv:2512.06883 · cs.IR · April 28, 2026

TL;DR

This paper introduces SDA, a lightweight framework that improves multimodal recommendation by aligning cross-modal representations and disentangling gradient conflicts in large vision-language models.

Contribution

The paper proposes Structural and Disentangled Adaptation (SDA), which combines Cross-Modal Structural Alignment (CMSA) with Modality-Disentangled Adaptation (MoDA) to improve recommendation performance with minimal overhead.

Findings

01. Achieves 6.15% average gain in Hit@10
02. Achieves 8.64% average gain in NDCG@10
03. Improves long-tail item recommendations by up to 18.70%
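
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how Hit@K and NDCG@K are commonly computed when each test case has a single held-out target item. The function names and the single-target setup are assumptions made for illustration; this page does not describe the paper's actual evaluation protocol.

```python
import math


def hit_at_k(ranked_items, target, k=10):
    """Return 1.0 if the held-out target is among the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k for a single relevant item.

    With one relevant item the ideal DCG is 1, so the score reduces to
    1 / log2(rank + 1) when the target is in the top k, and 0 otherwise.
    """
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position in the ranking
    return 1.0 / math.log2(rank + 1)


# Hypothetical ranking produced by some recommender for one user.
ranking = ["item_a", "item_b", "item_c"]
hit = hit_at_k(ranking, "item_b", k=10)
ndcg = ndcg_at_k(ranking, "item_b", k=10)
```

Both scores are typically averaged over all test users. Hit@K only records whether the target was retrieved, while NDCG@K also rewards placing it nearer the top of the list.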

Abstract

Multimodal recommendation enhances accuracy by leveraging visual and textual signals, and its success largely depends on learning high-quality cross-modal representations. Recent advances in Large Vision-Language Models (LVLMs) offer unified multimodal representation learning, making them a promising backbone. However, applying LVLMs to recommendation remains challenging due to (i) representation misalignment, where domain gaps between item data and general pre-training lead to unaligned embedding spaces, and (ii) gradient conflicts during fine-tuning, where shared adapters cause interference and a lack of discriminative power. To address these challenges, we propose SDA, a lightweight framework for Structural and Disentangled Adaptation, which integrates two components: Cross-Modal Structural Alignment (CMSA) and Modality-Disentangled Adaptation (MoDA). CMSA aligns embeddings using intra-modal structures…
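
To make the term "gradient conflict" concrete, here is a toy sketch, not the paper's method. It assumes two hypothetical squared-error objectives, one per modality, that share a single weight vector, and checks whether their gradients point in opposing directions. A negative cosine similarity means a step that helps one objective hurts the other. All names and numbers are illustrative assumptions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def grad_squared_error(w, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w."""
    err = dot(w, x) - y
    return [err * xi for xi in x]


# Toy shared adapter weights (assumption: one weight vector serves both modalities).
w = [0.5, -0.2]

# Hypothetical per-modality training signals that pull w in different directions.
g_image = grad_squared_error(w, x=[1.0, 0.0], y=1.0)
g_text = grad_squared_error(w, x=[1.0, 0.5], y=-1.0)

# Negative similarity indicates the two updates interfere on the shared weights.
conflict = cosine(g_image, g_text)
is_conflicting = conflict < 0
```

In this toy setting, giving each modality its own parameters would remove the interference, which is the general intuition behind disentangled adapters. How MoDA actually achieves this is not specified in the truncated abstract above.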

Code & Models

Repositories

RaoZhongtao/SDA (GitHub)
