VLM2Rec: Resolving Modality Collapse in Vision-Language Model Embedders for Multimodal Sequential Recommendation

Junyoung Kim; Woojoo Kim; Jaehyung Lim; Dongha Kim; Hwanjo Yu

arXiv:2603.17450·cs.IR·March 19, 2026

Open Access

TL;DR

This paper introduces VLM2Rec, a framework that leverages vision-language models for multimodal sequential recommendation, addressing modality collapse issues and improving recommendation accuracy and robustness.

Contribution

VLM2Rec is presented as the first framework to use vision-language models (VLMs) effectively for sequential recommendation (SR), proposing techniques that balance modality contributions and preserve cross-modal relationships.

Findings

01. VLM2Rec outperforms state-of-the-art methods in accuracy.

02. VLM2Rec demonstrates robustness across diverse scenarios.

03. The proposed regularization techniques effectively mitigate modality collapse.

Abstract

Sequential Recommendation (SR) in multimodal settings typically relies on small frozen pretrained encoders, which limits semantic capacity and prevents Collaborative Filtering (CF) signals from being fully integrated into item representations. Inspired by the recent success of Large Language Models (LLMs) as high-capacity embedders, we investigate the use of Vision-Language Models (VLMs) as CF-aware multimodal encoders for SR. However, we find that standard contrastive supervised fine-tuning (SFT), which adapts VLMs for embedding generation and injects CF signals, can amplify the VLM's inherent modality collapse. In this state, optimization is dominated by a single modality while the other degrades, ultimately undermining recommendation accuracy. To address this, we propose VLM2Rec, a VLM embedder-based framework for multimodal sequential recommendation designed to ensure balanced modality…
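For readers unfamiliar with the "standard contrastive supervised fine-tuning" mentioned in the abstract, the following is a minimal sketch of a generic in-batch contrastive (InfoNCE-style) loss. It is not taken from the paper and is not the VLM2Rec objective: the function name `info_nce_loss`, the `temperature` value, and the pairing of each sequence embedding with its next-item embedding are illustrative assumptions, and plain Python lists stand in for VLM embeddings.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def info_nce_loss(seq_embs, item_embs, temperature=0.1):
    """Mean in-batch contrastive loss (hypothetical helper, not from the paper).

    Assumption: item_embs[i] is the positive (next item) for seq_embs[i];
    every other item in the batch serves as a negative.
    """
    seq = [l2_normalize(v) for v in seq_embs]
    items = [l2_normalize(v) for v in item_embs]
    total = 0.0
    for i, s in enumerate(seq):
        # Cosine similarities to all items in the batch, scaled by temperature.
        logits = [dot(s, t) / temperature for t in items]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the positive item.
        total += log_denom - logits[i]
    return total / len(seq)

if __name__ == "__main__":
    sequences = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]   # positives aligned with sequences
    swapped = [[0.0, 1.0], [1.0, 0.0]]   # positives deliberately misaligned
    print(info_nce_loss(sequences, matched))
    print(info_nce_loss(sequences, swapped))
```

The loss is small when each sequence embedding is closest to its own positive item and large when it is not. In a multimodal embedder, a single objective of this kind can be lowered through whichever modality is easier to optimize, which is one way to read the imbalance the abstract describes.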

Taxonomy

Topics: Recommender Systems and Techniques · Advanced Graph Neural Networks · Machine Learning in Healthcare