Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning

Ruizhe Chen; Zhiting Fan; Tianze Luo; Heqing Zou; Zhaopeng Feng; Guiyang Xie; Hansheng Zhang; Zhuochen Wang; Zuozhu Liu; Huaijian Zhang

arXiv:2507.18100·cs.CV·July 25, 2025

TL;DR

This paper presents a novel two-stage training framework combining supervised fine-tuning and reinforcement learning to enhance the accuracy and robustness of video temporal grounding models, especially in challenging scenarios.

Contribution

It introduces a new training approach that leverages high-quality cold start data and difficulty-controlled RL, improving temporal localization and reasoning in VTG models.

Findings

01 Outperforms existing VTG models on multiple benchmarks.

02 Enhances model robustness in open-domain scenarios.

03 Highlights the importance of dataset quality and training strategies.

Abstract

Video Temporal Grounding (VTG) aims to localize relevant temporal segments in videos given natural language queries. Despite recent progress with large vision-language models (LVLMs) and instruction-tuning, existing approaches often suffer from limited temporal awareness and poor generalization. In this work, we introduce a two-stage training framework that integrates supervised fine-tuning with reinforcement learning (RL) to improve both the accuracy and robustness of VTG models. Our approach first leverages high-quality curated cold start data for SFT initialization, followed by difficulty-controlled RL to further enhance temporal localization and reasoning abilities. Comprehensive experiments on multiple VTG benchmarks demonstrate that our method consistently outperforms existing models, particularly in challenging and open-domain scenarios. We conduct an in-depth analysis of…
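
To make the localization task concrete, the sketch below computes temporal intersection-over-union (IoU) between a predicted segment and a reference segment, a measure commonly used to score VTG predictions. This is an illustration only: the abstract does not state which metric or RL reward the authors use, and the function names `temporal_iou` and `recall_at_iou` and the `(start, end)` seconds format are assumptions introduced here.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical format


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time segments, in [0, 1]."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length is zero when the segments do not intersect.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: Sequence[Segment], gts: Sequence[Segment],
                  threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the reference meets the threshold."""
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    # Query grounded at 5-15 s, model predicts 0-10 s: overlap 5 s, union 15 s.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(recall_at_iou([(0.0, 10.0), (2.0, 8.0)], [(0.0, 10.0), (20.0, 30.0)]))
```

A score of this kind could in principle serve as an evaluation metric or as one component of an RL reward, but whether the paper does either is not stated in the text above.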

Taxonomy

Topics: Video Analysis and Summarization · Human Pose and Action Recognition