LocVTP: Video-Text Pre-training for Temporal Localization
Meng Cao, Tianyu Yang, Junwu Weng, Can Zhang, Jue Wang, and Yuexian Zou

TL;DR
LocVTP is a video-text pre-training framework designed specifically for temporal localization tasks, with reported state-of-the-art results across multiple datasets and tasks.
Contribution
The paper proposes a new localization-oriented pre-training method with fine-grained alignment and temporal reasoning enhancements, addressing limitations of existing VTP methods for localization tasks.
Findings
Achieves state-of-the-art performance on multiple datasets.
Effectively improves temporal localization accuracy.
Enhances transferability of video-text representations.
Abstract
Video-Text Pre-training (VTP) aims to learn transferable representations for various downstream tasks from large-scale web videos. To date, almost all existing VTP methods are limited to retrieval-based downstream tasks, e.g., video retrieval, whereas their transfer potentials on localization-based tasks, e.g., temporal grounding, are under-explored. In this paper, we experimentally analyze and demonstrate the incompatibility of current VTP methods with localization tasks, and propose a novel Localization-oriented Video-Text Pre-training framework, dubbed as LocVTP. Specifically, we perform the fine-grained contrastive alignment as a complement to the coarse-grained one by a clip-word correspondence discovery scheme. To further enhance the temporal reasoning ability of the learned feature, we propose a context projection head and a temporal aware contrastive loss to perceive the…
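To make the idea of clip-word correspondence followed by fine-grained contrastive alignment more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`match_clips_to_words`, `info_nce`), the top-1 matching rule, the InfoNCE-style loss, and the temperature of 0.07 are all assumptions chosen for illustration, and the toy 2-D features stand in for real clip and word embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_clips_to_words(clips, words):
    """Hypothetical correspondence step: pair each clip with its most similar word."""
    return [
        max(range(len(words)), key=lambda j: cosine(clip, words[j]))
        for clip in clips
    ]

def info_nce(clips, words, matches, temperature=0.07):
    """InfoNCE-style loss: the matched word is the positive, all other words are negatives."""
    total = 0.0
    for clip, pos in zip(clips, matches):
        logits = [cosine(clip, w) / temperature for w in words]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[pos]
    return total / len(clips)

# Toy features (assumed): three clips and two words in a shared 2-D space.
clips = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
words = [[1.0, 0.1], [0.1, 1.0]]

matches = match_clips_to_words(clips, words)
loss = info_nce(clips, words, matches)
print(matches, loss)
```

In this toy setup the first and third clips pair with the first word and the second clip pairs with the second word, so the loss is close to zero. A coarse-grained counterpart would apply the same kind of loss to whole-video and whole-sentence features instead of clips and words.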
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Cancer-related molecular mechanisms research
Methods: Attentive Walk-Aggregating Graph Neural Network
