Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning

Feng Yue; Zhaoxing Zhang; Junming Jiao; Zhengyu Liang; Shiwen Cao; Feifei Zhang; Rong Shen

arXiv:2507.04702·cs.CV·July 8, 2025


TL;DR

Tempo-R0 is a Video-MLLM for temporal video grounding that combines efficient temporal sensing with reinforcement learning to improve segment-boundary detection and query-relevance understanding in videos.

Contribution

The paper presents a novel Video-MLLM architecture with specialized preprocessing and reinforcement learning techniques for enhanced temporal video grounding.

Findings

01. Achieves around 3.5% improvement over SOTA on QVHighlights benchmarks.

02. Employs Self-adaptive Attention Allocation for efficient attention use.

03. Utilizes PIR-GRPO for improved temporal reasoning.

Abstract

Temporal Video Grounding (TVG), which requires pinpointing relevant temporal segments in a video based on a language query, has long been a highly challenging task in video understanding. Videos often carry more information and redundancy than text or images, so a model must understand the whole video comprehensively to retrieve query-relevant clips accurately. We therefore propose Tempo-R0, a Video Multimodal Large Language Model (Video-MLLM) for the temporal video grounding task via multimodal temporal sensing reinforcement. Specifically, during the preprocessing stage of our pipeline, we employ a Self-adaptive Attention Allocation (SAA) method based on frame content variation to use the MLLM's limited attention efficiently. The Explicit Timestamp-modal Aligned (ETA) method is also utilized to strengthen our model's capability to perceive the boundaries…
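The abstract says only that SAA allocates attention "based on frame content variation" and gives no algorithm, so the following is an illustrative sketch of one generic way to spend a fixed frame budget where content changes most. It is not the authors' method. The function names, the mean-absolute-difference variation score, the flat-pixel input format, and the `eps` floor are all assumptions made for this example.

```python
from bisect import bisect_left
from itertools import accumulate


def frame_variation(frames):
    """Mean absolute pixel difference between each frame and its predecessor.

    `frames` is a list of equal-length flat pixel sequences (a hypothetical
    input format). The first frame has no predecessor, so its score is 0.0.
    """
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        scores.append(diff / len(cur))
    return scores


def select_frames(scores, budget, eps=1e-6):
    """Pick up to `budget` frame indices, denser where variation is higher.

    The cumulative variation is split into `budget` equal slices and the
    frame at the midpoint of each slice is kept. `eps` (an assumed floor)
    keeps fully static videos from producing a zero total.
    """
    weights = [s + eps for s in scores]
    cdf = list(accumulate(weights))
    total = cdf[-1]
    picked = []
    for k in range(budget):
        target = (k + 0.5) / budget * total
        idx = min(bisect_left(cdf, target), len(scores) - 1)
        # Skip duplicates so each frame is kept at most once.
        if not picked or picked[-1] != idx:
            picked.append(idx)
    return picked


# Toy video: four static frames followed by four rapidly changing ones.
toy_frames = [[0, 0]] * 4 + [[10, 10], [20, 20], [30, 30], [40, 40]]
toy_scores = frame_variation(toy_frames)
toy_picked = select_frames(toy_scores, budget=4)
```

On this toy input the whole budget falls on the changing second half of the clip, which is the qualitative behavior that a content-variation-based allocation is meant to produce.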
