FlexSelect: Flexible Token Selection for Efficient Long Video Understanding
Yunzhu Zhang, Yu Lu, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu

TL;DR
FlexSelect is a token selection method that identifies and retains the most semantically relevant content in long videos, cutting computational cost (up to a 9x speed-up on LLaVA-Video-7B) while maintaining or improving performance on VideoMME, MLVU, LongVB, and LVBench.
Contribution
It proposes a training-free token ranking pipeline and a rank-supervised lightweight selector, trained to replicate those rankings, that can be integrated into existing VideoLLMs as a plug-and-play module to extend their temporal understanding capabilities.
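To make "rank-supervised" concrete, here is a minimal sketch of one common way to train a scorer from a reference ranking: a pairwise logistic ranking loss. This summary does not say which loss FlexSelect uses, so the loss choice, the function name `pairwise_rank_loss`, and the toy scores are all illustrative assumptions rather than the authors' method.

```python
import math
from typing import List


def pairwise_rank_loss(pred: List[float], ref: List[float]) -> float:
    """Mean logistic loss over token pairs ordered by the reference scores.

    ASSUMPTION: a generic pairwise ranking loss, not necessarily the loss
    used by FlexSelect. For every pair (i, j) with ref[i] > ref[j], the
    selector is penalized when pred[i] does not exceed pred[j].
    """
    total, count = 0.0, 0
    for i in range(len(ref)):
        for j in range(len(ref)):
            if ref[i] > ref[j]:
                # log(1 + exp(-(margin))) is small when the margin is large.
                total += math.log1p(math.exp(-(pred[i] - pred[j])))
                count += 1
    return total / count if count else 0.0


# Hypothetical reference importance scores for four video tokens.
ref_scores = [0.10, 0.40, 0.50, 0.30]

# A selector whose scores follow the reference order gets a lower loss
# than one whose scores reverse it.
good = pairwise_rank_loss([1.0, 4.0, 5.0, 3.0], ref_scores)
bad = pairwise_rank_loss([5.0, 2.0, 1.0, 3.0], ref_scores)
```

The loss depends only on the relative order of the reference scores, which matches the idea of supervising the selector with rankings rather than with raw attention values.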
Findings
Achieves up to a 9x speed-up on LLaVA-Video-7B
Improves performance on the VideoMME, MLVU, LongVB, and LVBench benchmarks
Seamlessly integrates into various VideoLLM architectures
Abstract
Long-form video understanding poses a significant challenge for video large language models (VideoLLMs) due to prohibitively high computational and memory demands. In this paper, we propose FlexSelect, a flexible and efficient token selection strategy for processing long videos. FlexSelect identifies and retains the most semantically relevant content by leveraging cross-modal attention patterns from a reference transformer layer. It comprises two key components: (1) a training-free token ranking pipeline that leverages faithful cross-modal attention weights to estimate each video token's importance, and (2) a rank-supervised lightweight selector that is trained to replicate these rankings and filter redundant tokens. This generic approach can be seamlessly integrated into various VideoLLM architectures, such as LLaVA-Video, InternVL and Qwen-VL, serving as a plug-and-play module to…
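The token ranking step described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the attention matrix is taken as already extracted from the reference layer and averaged over heads, each video token is scored by its maximum attention from any text token (a mean would be another plausible choice), and the names `rank_video_tokens` and `select_top_k` are hypothetical.

```python
from typing import List


def rank_video_tokens(attn: List[List[float]]) -> List[float]:
    """Score each video token from a text-to-video attention matrix.

    attn[q][v] is the attention weight from text token q to video token v
    at the reference layer. ASSUMPTIONS: heads are already averaged, and
    the per-token score is the max over text tokens.
    """
    num_video = len(attn[0])
    return [max(row[v] for row in attn) for v in range(num_video)]


def select_top_k(scores: List[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring tokens in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the kept indices so the surviving tokens stay in video order.
    return sorted(ranked[:k])


# Toy example: 2 text tokens attending over 5 video tokens.
attn = [
    [0.05, 0.40, 0.05, 0.30, 0.20],
    [0.10, 0.10, 0.50, 0.20, 0.10],
]
scores = rank_video_tokens(attn)
kept = select_top_k(scores, k=3)  # video tokens 1, 2, and 3 are retained
```

Only the tokens at the `kept` indices would be passed on to the language model; the rest are dropped as redundant for this query.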
Taxonomy
Topics: Image Processing Techniques and Applications · Advanced Image Processing Techniques · Advanced Neural Network Applications
