FlexSelect: Flexible Token Selection for Efficient Long Video Understanding

Yunzhu Zhang; Yu Lu; Tianyi Wang; Fengyun Rao; Yi Yang; Linchao Zhu

arXiv:2506.00993 · cs.CV · June 3, 2025

TL;DR

FlexSelect introduces a novel token selection method that efficiently identifies and retains relevant content in long videos, significantly reducing computational costs while maintaining performance across multiple benchmarks.

Contribution

FlexSelect proposes a flexible, training-free token ranking pipeline and a lightweight selector, both of which can be integrated into existing VideoLLMs to extend their temporal understanding capabilities.

Findings

01 Achieves up to 9x speed-up on LLaVA-Video-7B

02 Improves performance on the VideoMME, MLVU, LongVB, and LVBench benchmarks

03 Integrates seamlessly into various VideoLLM architectures

Abstract

Long-form video understanding poses a significant challenge for video large language models (VideoLLMs) due to prohibitively high computational and memory demands. In this paper, we propose FlexSelect, a flexible and efficient token selection strategy for processing long videos. FlexSelect identifies and retains the most semantically relevant content by leveraging cross-modal attention patterns from a reference transformer layer. It comprises two key components: (1) a training-free token ranking pipeline that leverages faithful cross-modal attention weights to estimate each video token's importance, and (2) a rank-supervised lightweight selector that is trained to replicate these rankings and filter redundant tokens. This generic approach can be seamlessly integrated into various VideoLLM architectures, such as LLaVA-Video, InternVL and Qwen-VL, serving as a plug-and-play module to…
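The training-free ranking step described in the abstract can be illustrated with a minimal sketch. It assumes the text-to-video attention weights from the reference layer are already available as a nested list indexed `[head][text_query][video_token]`. The function names `token_importance` and `select_top_k`, the mean aggregation over heads and queries, and the toy numbers are all assumptions made for illustration; the abstract does not specify how the paper aggregates attention or how many tokens it keeps.

```python
def token_importance(attn):
    """Score each video token by the attention it receives from text queries.

    attn[h][q][t] is the (post-softmax) weight that text query q assigns to
    video token t in head h of the reference layer.
    Assumption: scores are averaged over heads and queries.
    """
    num_heads = len(attn)
    num_queries = len(attn[0])
    num_tokens = len(attn[0][0])
    scores = [0.0] * num_tokens
    for head in attn:
        for query_row in head:
            for t, weight in enumerate(query_row):
                scores[t] += weight
    norm = num_heads * num_queries
    return [s / norm for s in scores]


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring tokens in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    # Re-sort the survivors so the kept tokens preserve their original order.
    return sorted(ranked[:k])


# Toy example: 2 heads, 2 text queries, 4 video tokens (hypothetical values).
attn = [
    [[0.1, 0.6, 0.1, 0.2], [0.2, 0.5, 0.1, 0.2]],  # head 0
    [[0.3, 0.3, 0.1, 0.3], [0.0, 0.4, 0.1, 0.5]],  # head 1
]
scores = token_importance(attn)
kept = select_top_k(scores, k=2)
print(kept)  # [1, 3]
```

In this sketch the kept indices are returned in their original order so that the downstream model still sees the surviving tokens as a temporally ordered sequence. A trained selector, as in component (2) of the abstract, would replace `token_importance` with a cheaper scoring model fitted to reproduce the same ranking.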

Taxonomy

Topics: Image Processing Techniques and Applications · Advanced Image Processing Techniques · Advanced Neural Network Applications