TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models
Xiangtian Zheng, Zishuo Wang, Yuxin Peng

TL;DR
TiFRe is a framework that reduces the number of video frames fed to video multi-modal LLMs by selecting and merging frames according to the user's text input, lowering computation while maintaining or improving task performance.
Contribution
The paper introduces TiFRe, a text-guided frame reduction method that preserves essential video information through semantic-aware sampling and merging, enhancing efficiency in video-language models.
Findings
Reduces the computational cost of Video MLLMs by cutting the number of input frames.
Maintains or improves performance on video-language tasks.
Preserves video semantics during frame reduction.
Abstract
With the rapid development of Large Language Models (LLMs), Video Multi-Modal Large Language Models (Video MLLMs) have achieved remarkable performance in video-language tasks such as video understanding and question answering. However, Video MLLMs face high computational costs, particularly in processing numerous video frames as input, which leads to significant attention computation overhead. A straightforward approach to reduce computational costs is to decrease the number of input video frames. However, simply selecting key frames at a fixed frame rate (FPS) often overlooks valuable information in non-key frames, resulting in notable performance degradation. To address this, we propose Text-guided Video Frame Reduction (TiFRe), a framework that reduces input frames while preserving essential video information. TiFRe uses a Text-guided Frame Sampling (TFS) strategy to select key…
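The abstract is cut off before it describes how key frames are chosen and how the remaining frames are handled, so the following is only a minimal sketch of the general idea of text-guided selection followed by merging, not the authors' implementation. It assumes each frame and the user's text have already been embedded into a shared vector space by some encoder. The function names (`select_key_frames`, `merge_into_key_frames`), the cosine-similarity scoring, and the merge rule (averaging each non-key frame into its temporally nearest key frame) are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_key_frames(
    frame_embs: Sequence[Sequence[float]],
    text_emb: Sequence[float],
    k: int,
) -> List[int]:
    """Hypothetical selection rule: keep the k frames most similar to the text.

    Indices are returned in temporal order so downstream code sees frames
    in the order they occur in the video.
    """
    scores = [cosine(f, text_emb) for f in frame_embs]
    ranked = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def merge_into_key_frames(
    frame_embs: Sequence[Sequence[float]],
    key_idx: Sequence[int],
) -> List[List[float]]:
    """Hypothetical merge rule: average every non-key frame into the
    temporally nearest key frame, giving one vector per key frame."""
    if not key_idx:
        return []
    groups = {i: [list(frame_embs[i])] for i in key_idx}
    key_set = set(key_idx)
    for j, emb in enumerate(frame_embs):
        if j in key_set:
            continue
        nearest = min(key_idx, key=lambda i: abs(i - j))  # ties go to the earlier key frame
        groups[nearest].append(list(emb))
    merged = []
    for i in key_idx:
        members = groups[i]
        dim = len(members[0])
        merged.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return merged


# Toy example: four 2-D "frame embeddings" and a text embedding.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text = [0.0, 1.0]
keys = select_key_frames(frames, text, k=2)
reduced = merge_into_key_frames(frames, keys)
print(keys, len(reduced))
```

In this sketch the model would receive `len(reduced)` frame representations instead of `len(frames)`, which is where the saving in attention computation would come from. A plain average lets the non-key frames dilute the key frame; a weighted merge would be one alternative, and nothing in the available text indicates which approach the paper takes.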
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
