Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training
Dezhao Luo, Jiabo Huang, Shaogang Gong, Hailin Jin, Yang Liu

TL;DR
This paper introduces Visual-Dynamic Injection (VDI), a novel method that enhances video moment retrieval by aligning visual context and motion information with descriptive text, improving generalization across diverse and unseen video scenes.
Contribution
The paper proposes VDI, a new approach that injects visual and dynamic information into text embeddings, enabling better video-text alignment and generalization in VMR tasks.
Findings
Achieves state-of-the-art results on Charades-STA and ActivityNet-Captions.
Demonstrates improved performance on out-of-distribution data with novel scenes and vocabulary.
Highlights the importance of modeling temporal changes in pre-training for VMR.
Abstract
The correlation between vision and text is essential for video moment retrieval (VMR). However, existing methods rely heavily on separately pre-trained feature extractors for visual and textual understanding. Without sufficient temporal boundary annotations, it is non-trivial to learn universal video-text alignments. In this work, we explore multi-modal correlations derived from large-scale image-text data to facilitate generalisable VMR. To address the limitations of image-text pre-training models in capturing video changes, we propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments. Whilst existing VMR methods focus on building temporal-aware video features, being aware of the text descriptions about the temporal changes is also critical but originally overlooked in pre-training by matching static…
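For readers new to the task, the sketch below illustrates what video moment retrieval asks for: given per-frame features and a text query embedding in a shared space, pick the temporal span that best matches the query. It is a generic mean-pooling and cosine-similarity baseline, not the VDI method described in the abstract. The function names (`cosine`, `mean_pool`, `retrieve_moment`), the `max_len` parameter, and the toy 2-D features are all hypothetical and chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_pool(frames):
    """Average a list of per-frame feature vectors into one moment vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def retrieve_moment(frame_feats, text_emb, max_len=None):
    """Score every candidate span [start, end) and return the best one.

    Returns (start, end, score), where end is exclusive.
    """
    num_frames = len(frame_feats)
    max_len = max_len or num_frames
    best = (0, 1, -2.0)  # cosine is always >= -1, so any span beats this
    for start in range(num_frames):
        last_end = min(num_frames, start + max_len)
        for end in range(start + 1, last_end + 1):
            score = cosine(mean_pool(frame_feats[start:end]), text_emb)
            if score > best[2]:
                best = (start, end, score)
    return best

# Toy features (assumed, not from the paper): frames 2 and 3 together
# average to exactly the query direction; the others point elsewhere.
frames = [[1.0, 0.0], [1.0, 0.0], [0.2, 1.0], [-0.2, 1.0], [1.0, 0.0]]
query = [0.0, 1.0]
start, end, score = retrieve_moment(frames, query)
```

On this toy input the search selects the span covering frames 2 and 3. Exhaustive span enumeration is quadratic in the number of frames, so it is shown here only to make the task concrete.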
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques
Methods: Attentive Walk-Aggregating Graph Neural Network
