Video-Zero: Self-Evolution Video Understanding
Ruixu Zhang, Deyi Ji, Lanyun Zhu, Xuanyi Liu, Yuxin Meng, Ruihang Chu, Yujiu Yang

TL;DR
Video-Zero introduces an annotation-free, evidence-grounded self-evolution framework for video understanding, enhancing reasoning models by focusing on temporally localized evidence discovery and alignment.
Contribution
Video-Zero presents a co-evolution approach that emphasizes evidence grounding in video reasoning, improving performance across multiple benchmarks without requiring annotations.
Findings
Consistently improves multiple video VLM backbones across 13 benchmarks.
Effectively discovers and utilizes temporally localized evidence for reasoning.
Demonstrates transferability of evidence-centered self-evolution methods.
Abstract
Self-evolution offers a promising path for improving reasoning models without relying on intensive human annotation. However, extending this paradigm to video understanding remains underexplored and challenging: videos are long, dynamic, and redundant, while the evidence needed for reasoning is often sparse and temporally localized. Naively generating difficult question-answer pairs from full videos can therefore produce supervision that appears challenging but is weakly grounded, relying on static cues or language priors rather than temporal evidence. In this work, we argue that the key bottleneck of video self-evolution is not difficulty alone, but grounding. We propose Video-Zero, an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence. The Questioner discovers informative evidence segments and generates…
