Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

Peijun Bao, Chenqi Kong, Zihao Shao, Boon Poh Ng, Meng Hwa Er, Alex C. Kot

arXiv:2412.00811 · cs.CV · December 3, 2024


TL;DR

Vid-Morp introduces a novel pretraining approach for video moment retrieval using unlabeled videos and pseudo annotations, significantly reducing annotation costs and achieving strong zero-shot and unsupervised performance.

Contribution

The paper proposes Vid-Morp, a large-scale unlabeled video dataset, and the ReCorrect algorithm for effective pretraining without manual annotations.

Findings

01

ReCorrect achieves over 75% of fully-supervised performance in zero-shot settings.

02

Unsupervised ReCorrect reaches about 85% of fully-supervised performance on the benchmarks.

03

Pretraining with pseudo labels reduces annotation costs significantly.

Abstract

Given a natural language query, video moment retrieval aims to localize the described temporal moment in an untrimmed video. A major challenge of this task is its heavy dependence on labor-intensive annotations for training. Unlike existing works that directly train models on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. To support this, we introduce Video Moment Retrieval Pretraining (Vid-Morp), a large-scale dataset collected with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main…
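The abstract names imprecise temporal boundaries as one weakness of the pseudo annotations. In video moment retrieval, boundary quality is commonly measured with temporal intersection-over-union (tIoU) between a predicted and a reference segment, and results are often reported as Recall@1 at a tIoU threshold. The minimal sketch below illustrates that general metric only; it is not the paper's ReCorrect algorithm or its evaluation code, and the function names `temporal_iou` and `recall_at_1` are hypothetical.

```python
from typing import Sequence, Tuple

# A segment is (start, end) in seconds; assumed start <= end.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    # Overlap length is zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union = sum of lengths minus the overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: Sequence[Segment], gts: Sequence[Segment],
                threshold: float = 0.5) -> float:
    """Fraction of queries whose top-1 prediction reaches tIoU >= threshold."""
    if not gts:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if temporal_iou(p, g) >= threshold)
    return hits / len(gts)


if __name__ == "__main__":
    # Hypothetical segments for illustration, not data from the paper.
    preds = [(2.0, 8.0), (0.0, 1.0)]
    gts = [(4.0, 10.0), (2.0, 3.0)]
    print(temporal_iou(preds[0], gts[0]))  # 0.5
    print(recall_at_1(preds, gts, 0.5))    # 0.5
```

Computing the union as the sum of both lengths minus the overlap keeps the result exact for both overlapping and disjoint segments.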


Code & Models

Repositories

baopj/vid-morp
pytorch · Official


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization