UniVTG: Towards Unified Video-Language Temporal Grounding

Kevin Qinghong Lin; Pengchuan Zhang; Joya Chen; Shraman Pramanick; Difei Gao; Alex Jinpeng Wang; Rui Yan; Mike Zheng Shou

arXiv:2307.16715 · cs.CV · August 21, 2023 · 6 cites

Open Access · 1 Repo

TL;DR

UniVTG proposes a unified framework for diverse video-language temporal grounding tasks, enabling scalable annotation, flexible modeling, and zero-shot capabilities, demonstrated across multiple datasets and tasks.

Contribution

The paper introduces a unified formulation for VTG tasks, scalable pseudo supervision, and a flexible model capable of handling various labels and zero-shot learning.

Findings

01. Effective across multiple datasets and tasks.

02. Enables zero-shot temporal grounding.

03. Outperforms task-specific models in generalization.

Abstract

Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their abilities to generalize to various VTG tasks and labels. In this paper, we propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions: Firstly, we revisit a wide range of VTG labels and tasks and define a unified formulation. Based on this, we develop data annotation schemes to create scalable pseudo supervision. Secondly, we develop an effective and flexible grounding model capable of addressing each task and making full…

Code & Models

Repositories

showlab/univtg
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition