TL;DR
This paper introduces SDA, a lightweight framework that improves multimodal recommendation by aligning cross-modal representations and reducing gradient conflicts when fine-tuning large vision-language models (LVLMs).
Contribution
It proposes Structural and Disentangled Adaptation (SDA), which combines Cross-Modal Structural Alignment (CMSA) and Modality-Disentangled Adaptation (MoDA) to improve recommendation performance with minimal overhead.
Findings
Achieves 6.15% average gain in Hit@10
Achieves 8.64% average gain in NDCG@10
Improves long-tail item recommendations by up to 18.70%
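For readers unfamiliar with the two ranking metrics named above, the sketch below shows how Hit@K and NDCG@K are commonly computed. It assumes a leave-one-out protocol with exactly one held-out target item per user; this summary does not state the paper's evaluation protocol, and the function names `hit_at_k` and `ndcg_at_k` are hypothetical.

```python
import math


def hit_at_k(ranked_items, target, k=10):
    """Return 1.0 if the held-out target is in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k under the single-relevant-item assumption.

    With one relevant item the ideal DCG is 1, so the score reduces to
    1 / log2(rank + 1), where rank is the 1-based position of the target.
    """
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # convert 0-based index to 1-based rank
    return 1.0 / math.log2(rank + 1)


def mean_metric(metric, rankings, targets, k=10):
    """Average a per-user metric over all users (assumes non-empty input)."""
    scores = [metric(r, t, k) for r, t in zip(rankings, targets)]
    return sum(scores) / len(scores)
```

Hit@K only records whether the target was retrieved, while NDCG@K also rewards placing it nearer the top of the list.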
Abstract
Multimodal recommendation enhances accuracy by leveraging visual and textual signals, and its success largely depends on learning high-quality cross-modal representations. Recent advances in Large Vision-Language Models (LVLMs) offer unified multimodal representation learning, making them a promising backbone. However, applying LVLMs to recommendation remains challenging due to (i) representation misalignment, where domain gaps between item data and general pre-training lead to unaligned embedding spaces, and (ii) gradient conflicts during fine-tuning, where shared adapters cause interference and a lack of discriminative power. To address these challenges, we propose SDA, a lightweight framework for Structural and Disentangled Adaptation, which integrates two components: Cross-Modal Structural Alignment (CMSA) and Modality-Disentangled Adaptation (MoDA). CMSA aligns embeddings using intra-modal structures…
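The abstract does not say how gradient conflicts are measured. One common diagnostic, shown here only as an illustrative sketch and not as the paper's method, is to check whether the gradients that two modalities induce on the same shared adapter parameters have negative cosine similarity. The function names and the flat-list gradient representation are assumptions made for this example.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # treat a zero gradient as non-conflicting
    return dot / (norm1 * norm2)


def gradients_conflict(g_visual, g_textual):
    """Flag a conflict when the two gradients point in opposing directions.

    Assumption: both gradients are taken with respect to the same shared
    adapter parameters and flattened to equal-length lists of floats.
    """
    return cosine_similarity(g_visual, g_textual) < 0.0
```

When the similarity is negative, a step that lowers the loss for one modality tends to raise it for the other, which is the kind of interference that separate, modality-specific adapters are meant to avoid.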
