Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion
Huatuan Sun, Yunshan Ma, Changguang Wu, Yanxin Zhang, Pengfei Wang, Xiaoyu Du

TL;DR
This paper systematically evaluates how to best extract and fuse features from frozen LVLMs for micro-video recommendation, revealing key principles and proposing a novel fusion framework that improves performance.
Contribution
The paper provides the first comprehensive empirical analysis of LVLM feature extraction and integration strategies, and introduces the Dual Feature Fusion (DFF) framework for enhanced recommendation accuracy.
Findings
Intermediate hidden states outperform caption-based features.
ID embeddings are essential: fusing them with LVLM features is superior to replacing them.
Layer-wise effectiveness of decoder features varies significantly.
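The findings above contrast two ways of integrating frozen LVLM features with ID embeddings (replacement versus fusion) and favor hidden states taken from a decoder layer. The following is a minimal sketch of that contrast, not the paper's DFF framework. The mean pooling over tokens, the linear projection, the element-wise sum used for fusion, and every function name and toy value are illustrative assumptions that do not come from the source.

```python
from typing import List

Vector = List[float]


def mean_pool(token_states: List[Vector]) -> Vector:
    """Average the per-token hidden states of one decoder layer into one vector.

    Assumption: mean pooling; the source does not say how tokens are aggregated.
    """
    n = len(token_states)
    dim = len(token_states[0])
    return [sum(tok[d] for tok in token_states) / n for d in range(dim)]


def project(feature: Vector, weight: List[Vector]) -> Vector:
    """Map a content feature into the ID-embedding space (one weight row per output dim)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]


def replace_strategy(id_emb: Vector, content: Vector) -> Vector:
    """Replacement: the item is represented by content features only; id_emb is unused."""
    return list(content)


def fuse_strategy(id_emb: Vector, content: Vector) -> Vector:
    """Fusion: keep the ID embedding and add the projected content feature.

    Assumption: element-wise sum; other fusion operators are equally plausible.
    """
    return [a + b for a, b in zip(id_emb, content)]


# Toy data (hypothetical): two tokens with 3-dim hidden states from one layer.
layer_states = [[1.0, 3.0, 5.0], [3.0, 5.0, 7.0]]
projection = [[0.5, 0.0, 0.0], [0.0, 0.25, 0.0]]  # maps 3 dims to 2 dims
item_id_embedding = [0.5, -1.0]

pooled = mean_pool(layer_states)
content_feature = project(pooled, projection)
replaced_item = replace_strategy(item_id_embedding, content_feature)
fused_item = fuse_strategy(item_id_embedding, content_feature)
```

In this sketch the replaced representation carries no item-specific collaborative signal, while the fused one retains the ID embedding alongside the content feature. Repeating the pooling step for different decoder layers would be one way to probe the layer-wise variation noted in the findings.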
Abstract
Frozen Large Video Language Models (LVLMs) are increasingly employed in micro-video recommendation due to their strong multimodal understanding. However, their integration has not been evaluated systematically: practitioners typically deploy LVLMs as fixed black-box feature extractors without comparing alternative representation strategies. To address this gap, we present the first systematic empirical study along two key design dimensions: (i) integration strategies with ID embeddings, specifically replacement versus fusion, and (ii) feature extraction paradigms, comparing LVLM-generated captions with intermediate decoder hidden states. Extensive experiments on representative LVLMs reveal three key principles: (1) intermediate hidden states consistently outperform caption-based representations, as natural-language summarization inevitably discards fine-grained visual…
