EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

Minjoon Jung; Byoung-Tak Zhang; Lorenzo Torresani

arXiv:2605.13803·cs.CV·May 14, 2026

TL;DR

EvoGround is a self-evolving framework in which two agents, a proposer and a solver, learn video temporal grounding from unlabeled videos. It needs no manual annotations and matches or surpasses fully supervised models on multiple VTG benchmarks.

Contribution

The paper presents a self-reinforcing reinforcement-learning loop in which two coupled agents, a proposer and a solver, learn VTG from raw videos without human labels.

Findings

01. Matches or surpasses supervised models on VTG benchmarks.

02. Emerges as a state-of-the-art fine-grained video captioner.

03. Learns effectively from only 2.5K unlabeled videos.

Abstract

Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets that require costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query-moment pairs from raw videos, while the solver learns to ground them and feeds back signals that improve the proposer in return. The two agents are initialized from the same backbone and, through this self-reinforcing reinforcement-learning loop, improve each other across iterations. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art…
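
To make the task concrete: localizing a moment means predicting a (start, end) interval, and a prediction is usually compared with a reference interval by temporal intersection-over-union (IoU), a measure commonly used in VTG evaluation. The sketch below is illustrative only. The abstract does not say which metric or reward signal EvoGround uses, and the names `Segment` and `temporal_iou` are hypothetical.

```python
from typing import Tuple

# Hypothetical representation of a moment: (start_sec, end_sec).
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, ref: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    (pred_start, pred_end), (ref_start, ref_end) = pred, ref
    if pred_end < pred_start or ref_end < ref_start:
        raise ValueError("segment end must not precede its start")
    # Length of the overlap; zero when the intervals are disjoint.
    inter = max(0.0, min(pred_end, ref_end) - max(pred_start, ref_start))
    # Combined length covered by the two intervals.
    union = (pred_end - pred_start) + (ref_end - ref_start) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # A predicted moment of 0-10 s against a reference of 5-15 s
    # overlaps for 5 s out of a 15 s union.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

In a proposer-solver setup like the one described, a score of this kind could in principle compare the solver's predicted interval with the interval attached to a proposed query, but that use is an assumption and is not confirmed by the abstract.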
