TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding
Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang, Xin Wei, Ye Yuan, Huayu Zhang, Jinglin Xu, and Hao Sun

TL;DR
TSPO is a reinforcement learning-based approach that optimizes frame sampling for long-form video language models, improving their performance on long-video understanding benchmarks.
Contribution
The paper presents a novel trainable, event-aware temporal sampling policy optimized via reinforcement learning for enhanced long-video understanding in multimodal language models.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Demonstrates transferability across different Video-MLLMs.
Balances temporal understanding and key segment localization effectively.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. However, building a trainable sampling method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling in Video-MLLMs. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection…
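To make "probabilistic keyframe selection" and its reinforcement learning signal concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract is truncated, so TSPO's actual reward and objective are not reproduced here. The sketch assumes that a temporal agent has already produced one relevance score per frame, that frames are drawn without replacement from a softmax over those scores, and that rewards from a group of sampled selections are normalized into advantages, as in generic policy-gradient training. All names (`sample_keyframes`, `group_advantages`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_keyframes(scores, k, rng):
    """Draw k distinct frame indices, one at a time, without replacement.

    Returns the chosen indices in temporal order together with the
    log-probability of the drawn sequence, which is the quantity a
    policy-gradient update would differentiate with respect to the scores.
    """
    remaining = list(range(len(scores)))
    chosen = []
    log_prob = 0.0
    for _ in range(k):
        # Renormalize over the frames that have not been picked yet.
        probs = softmax([scores[i] for i in remaining])
        pick = rng.choices(range(len(remaining)), weights=probs, k=1)[0]
        log_prob += math.log(probs[pick])
        chosen.append(remaining.pop(pick))
    return sorted(chosen), log_prob


def group_advantages(rewards):
    """Normalize rewards within a group of sampled selections (assumed scheme)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical per-frame relevance scores from the temporal agent.
    frame_scores = [0.1, 2.0, 0.3, 1.5, -0.5, 0.0, 1.0, 0.2]
    group = [sample_keyframes(frame_scores, k=3, rng=rng) for _ in range(4)]
    # Placeholder rewards; in practice they would come from the Video-MLLM's answers.
    rewards = [1.0, 0.0, 1.0, 0.0]
    for (frames, logp), adv in zip(group, group_advantages(rewards)):
        print(frames, round(logp, 3), round(adv, 3))
```

In a training loop, each advantage would weight the gradient of the matching log-probability, so that selections leading to better answers become more likely. This is how a non-differentiable sampling step can still be trained.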
