OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu

TL;DR
OmniVTG introduces a large-scale dataset and a Self-Correction Chain-of-Thought training paradigm for open-world video temporal grounding, targeting the limited semantic diversity of existing datasets and the weaker performance on rare concepts.
Contribution
The paper presents the OmniVTG dataset and a Self-Correction CoT training method, enhancing MLLMs' grounding ability in open-world video understanding tasks.
Findings
OmniVTG achieves state-of-the-art zero-shot performance on VTG benchmarks.
Self-Correction CoT training reduces the performance gap between common and rare concepts.
The dataset covers a broader semantic space than previous datasets.
Abstract
Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). Our OmniVTG is constructed via a novel Semantic Coverage Iterative Expansion pipeline, which first identifies gaps in the vocabulary of existing datasets and collects videos that are highly likely to contain these target concepts. For high-quality annotation, we leverage the insight that modern MLLMs excel at dense captioning more than direct grounding and design a caption-centric data engine to…
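The first step the abstract describes, identifying gaps in the vocabulary of existing datasets, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`concept_counts`, `coverage_gaps`), the word-level tokenization, the `min_count` threshold, and the toy queries are all assumptions made for illustration only.

```python
import re
from collections import Counter


def concept_counts(queries):
    """Count lowercase word tokens across a list of text queries."""
    counts = Counter()
    for query in queries:
        counts.update(re.findall(r"[a-z]+", query.lower()))
    return counts


def coverage_gaps(existing_queries, target_vocab, min_count=1):
    """Return target concepts that appear fewer than min_count times.

    Hypothetical helper: a concept is treated as a single word, which is
    a simplification of whatever concept definition the paper uses.
    """
    counts = concept_counts(existing_queries)
    return sorted(c for c in target_vocab if counts[c] < min_count)


# Toy stand-ins for queries from an existing VTG dataset.
existing = ["a person opens the door", "the person cooks food", "a dog runs"]
# Toy target vocabulary of concepts we would like covered.
target = {"door", "dog", "welding", "kayak"}

gaps = coverage_gaps(existing, target)
print(gaps)  # ['kayak', 'welding']
```

In this sketch, the returned concepts would be the ones to prioritize when collecting new videos; raising `min_count` also flags concepts that are present but rare.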
