From Image to Video, what do we need in multimodal LLMs?
Suyuan Huang, Haoxin Zhang, Linqing Zhong, Honggu Chen, Yan Gao, Yao Hu, Zengchang Qin

TL;DR
This paper presents RED-VILLM, a resource-efficient pipeline that leverages existing Image LLMs to develop robust Video LLMs, reducing data and training requirements while enhancing performance.
Contribution
It introduces a novel temporal adaptation structure and a pipeline that effectively builds Video LLMs from Image LLMs, including the first Chinese Video LLM.
Findings
Video LLMs built with this pipeline outperform conventional Video LLMs while using less data.
The approach reduces training resources significantly.
The pipeline is scalable and cost-effective.
Abstract
Multimodal Large Language Models (MLLMs), from Image LLMs to the more complex Video LLMs, have demonstrated profound capabilities in comprehending cross-modal information, as numerous studies have illustrated. Previous methods design comprehensive Video LLMs by integrating video foundation models with primitive LLMs. Despite its effectiveness, such a paradigm renders the Video LLM's structure verbose and typically requires substantial video data for pre-training. Crucially, it neglects the foundational contributions of ready-made Image LLMs. In this paper, we introduce RED-VILLM, a Resource-Efficient Development pipeline that builds robust Video LLMs by leveraging the prior knowledge of Image LLMs. Specifically, since a video is naturally a combination of images along the temporal dimension, we devise a temporal adaptation plug-and-play…
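The abstract is cut off before it describes the temporal adaptation structure, so the following is only an illustrative sketch of the stated premise (a video is a stack of images along the temporal dimension), not the paper's module. It assumes each frame has already been encoded by an image encoder into N patch feature vectors of size D, and it uses plain mean pooling over time and over space as a stand-in aggregation. The names `temporal_pool`, `spatial_pool`, and `video_tokens` are hypothetical.

```python
from typing import List

# One frame: N patch tokens, each a D-dimensional feature vector.
# Assumption: these come from some image encoder; here they are toy numbers.
Frame = List[List[float]]


def mean_vectors(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def temporal_pool(frames: List[Frame]) -> List[List[float]]:
    """Average each patch position across frames: (T, N, D) -> (N, D)."""
    return [mean_vectors(list(patch_over_time)) for patch_over_time in zip(*frames)]


def spatial_pool(frames: List[Frame]) -> List[List[float]]:
    """Average all patches within each frame: (T, N, D) -> (T, D)."""
    return [mean_vectors(frame) for frame in frames]


def video_tokens(frames: List[Frame]) -> List[List[float]]:
    """Concatenate per-frame and per-patch summaries: (T + N, D).

    Hypothetical aggregation chosen for illustration only.
    """
    return spatial_pool(frames) + temporal_pool(frames)


# Toy input: T=2 frames, N=2 patches per frame, D=2 features per patch.
frames = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
tokens = video_tokens(frames)
print(tokens)
```

In this sketch the first T rows summarize each frame and the remaining N rows summarize each patch position over time, so the token count grows as T + N rather than T × N. Whether RED-VILLM aggregates features this way cannot be confirmed from the truncated abstract.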
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Translation Studies and Practices
