Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion

Huatuan Sun; Yunshan Ma; Changguang Wu; Yanxin Zhang; Pengfei Wang; Xiaoyu Du

arXiv:2512.21863·cs.IR·May 6, 2026

TL;DR

This paper systematically evaluates how best to extract features from frozen LVLMs and fuse them with ID embeddings for micro-video recommendation. It identifies key design principles and proposes a fusion framework that improves recommendation performance.

Contribution

The paper presents what the authors describe as the first systematic empirical analysis of LVLM feature extraction and integration strategies, and introduces the Dual Feature Fusion (DFF) framework to improve recommendation accuracy.

Findings

01 Intermediate hidden states outperform caption-based features.

02 ID embeddings are essential and fusion is superior to replacement.

03 Layer-wise effectiveness of decoder features varies significantly.
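
Findings 01 and 03 concern which decoder layer the item feature is read from. As a rough illustration, the sketch below turns the per-token hidden states of one chosen layer into a single item vector. The mean pooling, the `pool_layer` helper, and the toy numbers are assumptions made for illustration; the summary above does not state how the paper pools or selects layers.

```python
from statistics import fmean


def pool_layer(hidden_states, layer):
    """Mean-pool the token vectors of one decoder layer into one item vector.

    hidden_states: list over layers -> list over tokens -> list[float].
    Mean pooling is an illustrative assumption, not the paper's stated method.
    """
    tokens = hidden_states[layer]
    dim = len(tokens[0])
    return [fmean(tok[d] for tok in tokens) for d in range(dim)]


# Toy stand-in for frozen decoder outputs: 3 layers, 2 tokens, 2 dimensions.
toy_hidden_states = [
    [[0.0, 0.0], [2.0, 4.0]],
    [[1.0, 1.0], [3.0, 5.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]

middle = pool_layer(toy_hidden_states, layer=1)   # feature from an intermediate layer
last = pool_layer(toy_hidden_states, layer=-1)    # feature from the final layer
```

Because the two layers yield different vectors for the same item, the choice of layer is a design decision that would have to be validated empirically rather than fixed in advance.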

Abstract

Frozen Large Video Language Models (LVLMs) are increasingly employed in micro-video recommendation due to their strong multimodal understanding. However, their integration has not been evaluated systematically: practitioners typically deploy LVLMs as fixed black-box feature extractors without comparing alternative representation strategies. To address this gap, we present the first systematic empirical study along two key design dimensions: (i) integration strategies with ID embeddings, specifically replacement versus fusion, and (ii) feature extraction paradigms, comparing LVLM-generated captions with intermediate decoder hidden states. Extensive experiments on representative LVLMs reveal three key principles: (1) intermediate hidden states consistently outperform caption-based representations, as natural-language summarization inevitably discards fine-grained visual…
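
The difference between the two integration strategies named in the abstract, replacement and fusion, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's DFF framework: fusion is modeled here as plain concatenation, scoring as a dot product, and all function names and numbers are hypothetical.

```python
def dot(u, v):
    """Dot-product relevance score between a user vector and an item vector."""
    return sum(a * b for a, b in zip(u, v))


def item_vector_replacement(content_feat):
    # Replacement: the item is represented by the frozen content feature alone;
    # the learned ID embedding is dropped.
    return list(content_feat)


def item_vector_fusion(id_emb, content_feat):
    # Fusion (assumed here to be simple concatenation): the learned ID embedding
    # is kept and the frozen content feature is appended to it.
    return list(id_emb) + list(content_feat)


# Hypothetical toy values.
id_emb = [2.0, -1.0]        # learned per-item ID embedding
content = [0.2, 0.8]        # frozen LVLM-derived feature for the same item

user_for_replacement = [1.0, 0.0]
user_for_fusion = [0.5, 0.5, 1.0, 0.0]

score_repl = dot(user_for_replacement, item_vector_replacement(content))
score_fus = dot(user_for_fusion, item_vector_fusion(id_emb, content))
```

Under replacement the score depends only on the content feature, while under fusion the collaborative signal carried by the ID embedding also contributes, which is the distinction the study compares.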
