TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models

Xiangtian Zheng; Zishuo Wang; Yuxin Peng

arXiv:2602.08861 · cs.CV · February 10, 2026

TL;DR

TiFRe is a framework that reduces the number of video frames fed to a Video MLLM by selecting and merging frames according to the user's text input, significantly lowering computation while maintaining or improving task performance.

Contribution

The paper introduces TiFRe, a text-guided frame reduction method that preserves essential video information through semantic-aware sampling and merging, enhancing efficiency in video-language models.

Findings

01 Reduces computational costs significantly.

02 Improves performance on video-language tasks.

03 Effectively preserves video semantics during reduction.

Abstract

With the rapid development of Large Language Models (LLMs), Video Multi-Modal Large Language Models (Video MLLMs) have achieved remarkable performance in video-language tasks such as video understanding and question answering. However, Video MLLMs face high computational costs, particularly in processing numerous video frames as input, which leads to significant attention computation overhead. A straightforward approach to reduce computational costs is to decrease the number of input video frames. However, simply selecting key frames at a fixed frame rate (FPS) often overlooks valuable information in non-key frames, resulting in notable performance degradation. To address this, we propose Text-guided Video Frame Reduction (TiFRe), a framework that reduces input frames while preserving essential video information. TiFRe uses a Text-guided Frame Sampling (TFS) strategy to select key…
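
The abstract is cut off before it says how TFS scores frames or how non-key frames are handled, so the following is only a minimal sketch of the general idea of text-guided selection followed by merging, not the authors' method. Everything in it is an assumption made for illustration: the toy 2-D frame and text embeddings, the use of cosine similarity as the relevance score, the top-k selection rule, and the rule of averaging each non-key frame into its temporally nearest key frame. The function names `select_key_frames` and `merge_into_key_frames` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_key_frames(frame_embs, text_emb, k):
    """Return indices (in temporal order) of the k frames most similar to the text.

    Assumption: relevance is cosine similarity to the text embedding.
    """
    scores = [cosine(f, text_emb) for f in frame_embs]
    top = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def merge_into_key_frames(frame_embs, key_idx):
    """Average each non-key frame into its temporally nearest key frame.

    Assumption: nearest-in-time grouping with a plain mean; the paper's
    actual merging rule is not given in the truncated abstract.
    """
    groups = {i: [frame_embs[i]] for i in key_idx}
    for i, emb in enumerate(frame_embs):
        if i in groups:
            continue
        nearest = min(key_idx, key=lambda j: abs(j - i))
        groups[nearest].append(emb)
    merged = []
    for i in key_idx:
        members = groups[i]
        dim = len(members[0])
        merged.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return merged


# Toy data (hypothetical): six frames and one text query, all 2-D embeddings.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.5, 0.5]]
text = [0.0, 1.0]

keys = select_key_frames(frames, text, k=2)    # frames 2 and 3 score highest
merged = merge_into_key_frames(frames, keys)   # two merged frame embeddings
print(keys, merged)
```

In this toy run, six frames are reduced to two. If every frame contributes the same number of visual tokens, the visual token count drops to one third, and under standard quadratic self-attention the pairwise interactions among visual tokens drop to roughly one ninth. Unlike fixed-FPS sampling, the dropped frames still contribute to the retained ones through the merge step.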

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis