VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

Wenqi Liu; Yunxiao Wang; Shijie Ma; Meng Liu; Qile Su; Tianke Zhang; Haonan Fan; Changyi Liu; Kaiyu Jiang; Jiankang Chen; Kaiyu Tang; Bin Wen; Fan Yang; Tingting Gao; Han Li; Yinwei Wei; Xuemeng Song

arXiv:2602.07801 · cs.CV · March 16, 2026

TL;DR

VideoTemp-o3 introduces a unified framework for long-video understanding that improves localization, efficiency, and accuracy in agentic thinking-with-videos by jointly modeling grounding and question answering.

Contribution

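The paper presents a joint modeling approach to video grounding and question answering, built on a unified masking mechanism in the supervised fine-tuning stage and on reinforcement learning rewards. It also contributes a new dataset and a benchmark for long-video grounded QA.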

Findings

01. Achieves state-of-the-art performance on long video understanding tasks.
02. Demonstrates improved localization and reduced hallucinations.
03. Provides a new benchmark for systematic evaluation.

Abstract

In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these issues, we propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering. VideoTemp-o3 exhibits strong localization capability, supports on-demand clipping, and can refine inaccurate localizations. Specifically, in the supervised fine-tuning stage, we design a unified masking…
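To make the contrast between uniform frame sampling and dense sampling inside a localized clip concrete, here is a minimal sketch. It is not the paper's implementation: the function names, the one-hour video, the 32-frame budget, the 20-second clip, and the 2 fps rate are all hypothetical values chosen for illustration, and the clip boundaries are assumed to come from some localization step that is not modeled here.

```python
from typing import List, Tuple

Clip = Tuple[float, float]  # (start_s, end_s), half-open interval


def uniform_timestamps(duration_s: float, num_frames: int) -> List[float]:
    """Midpoints of num_frames equal-width bins spanning the whole video."""
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]


def dense_timestamps(clip: Clip, fps: float) -> List[float]:
    """Timestamps at a fixed frame rate inside a localized clip."""
    start, end = clip
    count = int((end - start) * fps)
    return [start + i / fps for i in range(count)]


def frames_in_clip(timestamps: List[float], clip: Clip) -> int:
    """Count how many sampled timestamps fall inside the clip."""
    start, end = clip
    return sum(1 for t in timestamps if start <= t < end)


# Hypothetical setup: a one-hour video whose key evidence sits in a 20 s clip.
video_duration_s = 3600.0
evidence_clip = (1000.0, 1020.0)  # assumed output of a localization step

uniform = uniform_timestamps(video_duration_s, num_frames=32)
dense = dense_timestamps(evidence_clip, fps=2.0)

uniform_hits = frames_in_clip(uniform, evidence_clip)
dense_hits = frames_in_clip(dense, evidence_clip)
```

With these assumed numbers, the uniform samples are 112.5 s apart, so none of them lands in the 20-second evidence window, while dense sampling within the clip yields 40 frames of it. This is the gap that the localize-clip-answer pipeline described in the abstract is meant to close.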

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition