TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler

Xingjian Zhang; Xi Weng; Yihao Yue; Zhaoxin Fan; Wenjun Wu; Lei Huang

arXiv:2501.15513 · cs.CV · June 11, 2025

Open Access · 1 Repo · 4 Models · 2 Datasets

TL;DR

TinyLLaVA-Video is a lightweight video understanding model with approximately 3.6B parameters. Its video-level group resampler reduces the number of visual tokens and improves temporal comprehension, and the model outperforms several existing 7B-parameter models on multiple benchmarks.

Contribution

The paper presents a new lightweight video understanding model with a novel group resampler mechanism that enhances temporal understanding and efficiency.

Findings

01 The model requires only one day of training on 8 GPUs.

02 It outperforms several existing 7B-parameter models on benchmarks.

03 The group resampler reduces visual token redundancy while improving temporal comprehension.

Abstract

Video behavior recognition and scene understanding are fundamental tasks in multimodal intelligence, serving as critical building blocks for numerous real-world applications. Although large multimodal models (LMMs) have achieved remarkable progress in video understanding, most existing open-source models rely on over 7B parameters and require large-scale datasets for training, making them resource-intensive and inaccessible to many researchers. Furthermore, lightweight models face persistent challenges in processing long visual sequences and in temporal understanding. In this work, we introduce TinyLLaVA-Video, a lightweight yet powerful video understanding model with approximately 3.6B parameters. The cornerstone of our design is the video-level group resampler, a novel mechanism that significantly reduces and controls the number of visual tokens at the video level. Unlike…
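To illustrate what it means to "control the number of visual tokens at the video level", here is a minimal sketch of a generic, Perceiver-style cross-attention resampler in pure Python. It is not the paper's group resampler, whose details are cut off in the abstract above. The function names, the query count, the token dimension, and the clip sizes are all hypothetical, and real resamplers use learned projections and multiple heads. The only point shown is that a fixed set of queries yields a fixed number of output tokens regardless of how many frames the video has.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(queries, tokens):
    """Single-head cross-attention with no learned projections (illustrative only).

    Each query attends over all visual tokens of the video and returns one
    output vector, so the output length equals the number of queries.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


def make_video_tokens(num_frames, tokens_per_frame, dim, rng):
    """Random stand-ins for vision-encoder outputs (hypothetical sizes)."""
    count = num_frames * tokens_per_frame
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8          # hypothetical token dimension
NUM_QUERIES = 4  # hypothetical fixed token budget per video

# In a trained model these queries would be learned parameters.
queries = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

short_clip = make_video_tokens(num_frames=4, tokens_per_frame=6, dim=DIM, rng=rng)
long_clip = make_video_tokens(num_frames=16, tokens_per_frame=6, dim=DIM, rng=rng)

short_out = resample(queries, short_clip)
long_out = resample(queries, long_clip)

print(len(short_clip), "->", len(short_out))  # 24 -> 4
print(len(long_clip), "->", len(long_out))    # 96 -> 4
```

Both clips are compressed to the same four tokens, so under these assumptions the cost of the language model's visual input no longer grows with the number of sampled frames.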

Code & Models

Repositories

zhangxj199/tinyllava-video (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Image Retrieval and Classification Techniques