TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler
Xingjian Zhang, Xi Weng, Yihao Yue, Zhaoxin Fan, Wenjun Wu, Lei Huang

TL;DR
TinyLLaVA-Video is a lightweight video understanding model with approximately 3.6B parameters. It uses a novel video-level group resampler to reduce the number of visual tokens and improve temporal comprehension, and it outperforms several 7B-parameter models on multiple benchmarks.
Contribution
The paper presents a new lightweight video understanding model with a novel group resampler mechanism that enhances temporal understanding and efficiency.
Findings
Requires only one day of training on 8 GPUs.
Outperforms several existing 7B-parameter models on benchmarks.
Effectively reduces visual token redundancy while improving temporal comprehension.
Abstract
Video behavior recognition and scene understanding are fundamental tasks in multimodal intelligence, serving as critical building blocks for numerous real-world applications. Although large multimodal models (LMMs) have achieved remarkable progress in video understanding, most existing open-source models rely on over 7B parameters and require large-scale datasets for training, making them resource-intensive and inaccessible to many researchers. Furthermore, lightweight models face persistent challenges in processing long visual sequences effectively and in temporal understanding. In this work, we introduce TinyLLaVA-Video, a lightweight yet powerful video understanding model with approximately 3.6B parameters. The cornerstone of our design is the video-level group resampler, a novel mechanism that significantly reduces and controls the number of visual tokens at the video level. Unlike…
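The abstract is cut off before it explains how the group resampler works, so the following is only an illustrative sketch, not the authors' implementation. It assumes a cross-attention resampler in which a fixed set of query vectors is split into groups and each group attends to one contiguous temporal chunk of the video's visual tokens. The names `group_resample`, `attend`, and `split`, the contiguous grouping, and all sizes are hypothetical; real queries would be learned rather than random. What it shows is that the number of output tokens is set by the number of queries, independent of how many frames the video has.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Single-head scaled dot-product attention; keys double as values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(d)]


def split(items, n):
    """Split a list into n contiguous chunks whose sizes differ by at most one."""
    size, rem = divmod(len(items), n)
    chunks, start = [], 0
    for g in range(n):
        end = start + size + (1 if g < rem else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def group_resample(tokens, queries, num_groups):
    """Hypothetical grouped resampler (an assumption, not the paper's code).

    tokens:  all visual tokens of the video, in temporal order.
    queries: fixed set of query vectors (random here, learned in practice).
    Returns len(queries) output vectors, however many tokens come in.
    """
    if not 1 <= num_groups <= min(len(tokens), len(queries)):
        raise ValueError("num_groups must be between 1 and min(#tokens, #queries)")
    out = []
    for q_chunk, t_chunk in zip(split(queries, num_groups), split(tokens, num_groups)):
        # Each query group only sees its own temporal chunk of tokens.
        out.extend(attend(q, t_chunk) for q in q_chunk)
    return out


# Illustrative sizes only: 16 frames x 9 patch tokens, 12 queries, 4 groups.
rng = random.Random(0)
dim = 8
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16 * 9)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(12)]
resampled = group_resample(tokens, queries, num_groups=4)
print(len(tokens), "visual tokens ->", len(resampled), "resampled tokens")
```

Under these assumptions, doubling the number of sampled frames doubles `len(tokens)` but leaves the output at 12 vectors, which is the sense in which a video-level resampler "controls" the visual token count passed to the language model.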
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Image Retrieval and Classification Techniques
