LocVTP: Video-Text Pre-training for Temporal Localization

Meng Cao; Tianyu Yang; Junwu Weng; Can Zhang; Jue Wang; Yuexian Zou

arXiv:2207.10362 · cs.CV · July 22, 2022 · 5 cites

TL;DR

LocVTP is a video-text pre-training framework designed for temporal localization tasks such as temporal grounding. It reports state-of-the-art results across multiple datasets and tasks.

Contribution

The paper proposes a localization-oriented pre-training method that adds fine-grained clip-word alignment and temporal reasoning enhancements. It targets a gap in existing VTP methods, which are built around retrieval-based tasks and whose transfer to localization tasks is under-explored.

Findings

01 Achieves state-of-the-art performance on multiple datasets.

02 Effectively improves temporal localization accuracy.

03 Enhances transferability of video-text representations.

Abstract

Video-Text Pre-training (VTP) aims to learn transferable representations for various downstream tasks from large-scale web videos. To date, almost all existing VTP methods are limited to retrieval-based downstream tasks, e.g., video retrieval, whereas their transfer potentials on localization-based tasks, e.g., temporal grounding, are under-explored. In this paper, we experimentally analyze and demonstrate the incompatibility of current VTP methods with localization tasks, and propose a novel Localization-oriented Video-Text Pre-training framework, dubbed as LocVTP. Specifically, we perform the fine-grained contrastive alignment as a complement to the coarse-grained one by a clip-word correspondence discovery scheme. To further enhance the temporal reasoning ability of the learned feature, we propose a context projection head and a temporal aware contrastive loss to perceive the…
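
The fine-grained alignment described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: it assumes that clip-word correspondence is found by picking each clip's most similar words under cosine similarity, averaging them into a per-clip text target, and applying an InfoNCE-style loss across clips. The function names, `top_k`, and `temperature` are hypothetical choices made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def clip_word_targets(clips, words, top_k=2):
    """Assumed correspondence step: average each clip's top_k most similar words."""
    targets = []
    for clip in clips:
        ranked = sorted(words, key=lambda w: cosine(clip, w), reverse=True)
        chosen = ranked[:top_k]
        dim = len(chosen[0])
        targets.append([sum(w[d] for w in chosen) / len(chosen) for d in range(dim)])
    return targets


def fine_grained_contrastive_loss(clips, words, top_k=2, temperature=0.07):
    """InfoNCE over clips: clip i should match its own word target, not the others'."""
    targets = clip_word_targets(clips, words, top_k)
    total = 0.0
    for i, clip in enumerate(clips):
        logits = [cosine(clip, t) / temperature for t in targets]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(clips)


# Toy features (made up): two clips and four words in a 3-d embedding space.
clips = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1]]
words = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.2], [0.1, 0.9, 0.0], [0.0, 1.0, 0.3]]
loss = fine_grained_contrastive_loss(clips, words)
```

In this toy setup each clip has two clearly matching words, so the loss is close to zero. When clips are indistinguishable, their word targets coincide and the loss rises to log(N) for N clips. The context projection head and temporal aware contrastive loss are not sketched here because the abstract above is truncated before it describes them.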

Code & Models

Repositories

mengcaopku/locvtp (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Cancer-related molecular mechanisms research

Methods: Attentive Walk-Aggregating Graph Neural Network