Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

Dezhao Luo; Jiabo Huang; Shaogang Gong; Hailin Jin; Yang Liu

arXiv:2303.00040 · cs.CV · March 27, 2023 · 1 citation

Open Access

TL;DR

This paper introduces Visual-Dynamic Injection (VDI), a novel method that enhances video moment retrieval by aligning visual context and motion information with descriptive text, improving generalization across diverse and unseen video scenes.

Contribution

The paper proposes VDI, a new approach that injects visual and dynamic information into text embeddings, enabling better video-text alignment and generalization in VMR tasks.

Findings

01. Achieves state-of-the-art results on Charades-STA and ActivityNet-Captions.

02. Demonstrates improved performance on out-of-distribution data with novel scenes and vocabulary.

03. Highlights the importance of modeling temporal changes in pre-training for VMR.

Abstract

The correlation between vision and text is essential for video moment retrieval (VMR); however, existing methods rely heavily on separately pre-trained feature extractors for visual and textual understanding. Without sufficient temporal boundary annotations, it is non-trivial to learn universal video-text alignments. In this work, we explore multi-modal correlations derived from large-scale image-text data to facilitate generalisable VMR. To address the limitations of image-text pre-training models in capturing video changes, we propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments. Whilst existing VMR methods focus on building temporal-aware video features, being aware of the text descriptions about the temporal changes is also critical but originally overlooked in pre-training by matching static…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques

Methods: Attentive Walk-Aggregating Graph Neural Network