UniVTG: Towards Unified Video-Language Temporal Grounding
Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

TL;DR
UniVTG proposes a unified framework for diverse video-language temporal grounding tasks, enabling scalable annotation, flexible modeling, and zero-shot capabilities, demonstrated across multiple datasets and tasks.
Contribution
The paper introduces a unified formulation for VTG tasks, scalable pseudo supervision, and a flexible model capable of handling various labels and zero-shot learning.
Findings
Effective across multiple datasets and tasks.
Enables zero-shot temporal grounding.
Outperforms task-specific models in generalization.
Abstract
Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their abilities to generalize to various VTG tasks and labels. In this paper, we propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions: Firstly, we revisit a wide range of VTG labels and tasks and define a unified formulation. Based on this, we develop data annotation schemes to create scalable pseudo supervision. Secondly, we develop an effective and flexible grounding model capable of addressing each task and making full…
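To make the idea of a unified label format concrete, here is a minimal Python sketch of how an interval label (moment retrieval) and a worthiness curve (highlight detection) could both be mapped onto one per-clip record. It is an illustration only, not the authors' implementation: the `ClipLabel` fields, the function names, the fixed clip length, and the 0.5 threshold are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ClipLabel:
    """Hypothetical per-clip record shared by both label types."""
    center: float                            # clip center, in seconds
    foreground: bool                         # is the clip part of the target?
    offsets: Optional[Tuple[float, float]]   # distance to interval start / end
    saliency: float                          # query relevance in [0, 1]


def interval_to_clip_labels(duration: float, clip_len: float,
                            interval: Tuple[float, float]) -> List[ClipLabel]:
    """Convert one (start, end) interval into per-clip labels."""
    start, end = interval
    labels = []
    for i in range(int(duration // clip_len)):
        center = (i + 0.5) * clip_len
        inside = start <= center <= end
        # Offsets are only defined for clips inside the interval.
        offsets = (center - start, end - center) if inside else None
        # Assumption: an interval gives no graded saliency, so use 1 or 0.
        labels.append(ClipLabel(center, inside, offsets, 1.0 if inside else 0.0))
    return labels


def curve_to_clip_labels(clip_len: float, scores: List[float],
                         threshold: float = 0.5) -> List[ClipLabel]:
    """Convert a per-clip worthiness curve into the same record type."""
    labels = []
    for i, score in enumerate(scores):
        center = (i + 0.5) * clip_len
        # Assumption: a curve carries no interval, so offsets stay undefined.
        labels.append(ClipLabel(center, score >= threshold, None, score))
    return labels


moment = interval_to_clip_labels(duration=10.0, clip_len=2.0, interval=(3.0, 7.0))
highlight = curve_to_clip_labels(clip_len=2.0, scores=[0.1, 0.9, 0.4])
```

Because both converters return the same record type, a single model head could in principle consume either kind of annotation, which is the property the abstract attributes to the unified formulation.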
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
