TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Canhui Tang; Zifan Han; Hongbo Sun; Sanping Zhou; Xuchong Zhang; Xin Wei; Ye Yuan; Huayu Zhang; Jinglin Xu; and Hao Sun

arXiv:2508.04369 · cs.CV · November 14, 2025

TL;DR

TSPO uses reinforcement learning to train the frame-sampling step of long-form video language models, improving their results on long-video understanding benchmarks.

Contribution

The paper presents a novel trainable, event-aware temporal sampling policy optimized via reinforcement learning for enhanced long-video understanding in multimodal language models.

Findings

01 Achieves state-of-the-art results on multiple benchmarks.

02 Demonstrates transferability across different Video-MLLMs.

03 Balances temporal understanding and key segment localization effectively.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. However, building a trainable sampling method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling in Video-MLLMs. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection…
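
To make "probabilistic keyframe selection" concrete, the following is a minimal sketch and not the authors' implementation. It assumes some upstream model has already produced one relevance score per frame for the query; the function names `softmax` and `sample_keyframes`, the example scores, and the sequential sampling-without-replacement scheme are all illustrative assumptions. The returned log-probability is the quantity a policy-gradient method would typically weight by a reward.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_keyframes(scores, k, rng):
    """Draw k distinct frame indices, one at a time, in proportion to softmax(scores).

    Hypothetical helper for illustration only.
    Returns (sorted indices, log-probability of the order in which they were drawn).
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of frames")
    probs = softmax(scores)
    remaining = list(range(len(scores)))
    chosen, log_prob = [], 0.0
    for _ in range(k):
        weights = [probs[i] for i in remaining]
        total = sum(weights)
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        # Probability of this pick, renormalized over frames not yet chosen.
        log_prob += math.log(probs[pick] / total)
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen), log_prob


# Example with made-up per-frame scores (assumed, not from the paper).
rng = random.Random(0)
scores = [0.1, 2.0, 0.3, 1.5, -1.0, 0.8]
frames, logp = sample_keyframes(scores, k=3, rng=rng)
print(frames, logp)
```

Because the selection is sampled rather than taken as a hard top-k, repeated draws explore different frame subsets, which is what lets a reward signal from the downstream model shape the scores even though the selection itself is not differentiable.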

Code & Models

Models

hzf666/TSPO-0.4B · model · 238 downloads

Datasets

Canhui99/TSPO-10K · dataset · 24 downloads

Videos

TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding· underline