Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models
Dezhao Luo, Jiabo Huang, Shaogang Gong, Hailin Jin, Yang Liu

TL;DR
This paper introduces a zero-shot video moment retrieval (VMR) method that leverages frozen large-scale vision-language models. It retrieves moments without training on VMR data and is especially effective for queries with unseen words and videos of unseen scenes.
Contribution
The work proposes a zero-shot approach that combines boundary-aware feature refinement with bottom-up proposal generation, reducing reliance on annotated data and mitigating domain discrepancies.
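To make the idea of bottom-up proposal generation more concrete, here is a minimal sketch under stated assumptions. It is not the authors' procedure, which this summary does not spell out. It only illustrates the general pattern: score each frame against the text query, then merge consecutive high-scoring frames into candidate moments. The function names, the plain-list feature format, and the `threshold` value are all hypothetical. In practice the features would come from the frozen VLM's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bottom_up_proposals(frame_feats, text_feat, threshold=0.5):
    """Group consecutive frames that match the query into candidate moments.

    Hypothetical helper: returns (start, end, score) tuples, where `end` is
    exclusive and `score` is the mean frame-query similarity of the segment,
    sorted from highest to lowest score.
    """
    sims = [cosine(f, text_feat) for f in frame_feats]
    proposals, start = [], None
    # A trailing sentinel below any threshold closes a run that reaches the last frame.
    for i, s in enumerate(sims + [float("-inf")]):
        if s >= threshold and start is None:
            start = i  # a new run of matching frames begins
        elif s < threshold and start is not None:
            segment = sims[start:i]
            proposals.append((start, i, sum(segment) / len(segment)))
            start = None
    return sorted(proposals, key=lambda p: p[2], reverse=True)
```

A fixed threshold is the simplest possible grouping rule and is chosen here only for readability. A boundary-aware method would presumably treat the frames near segment edges more carefully than this sketch does.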
Findings
Achieves state-of-the-art zero-shot VMR performance on benchmark datasets.
Effectively handles out-of-distribution scenarios with novel words and locations.
Demonstrates significant advantages over supervised methods in zero-shot settings.
Abstract
Accurate video moment retrieval (VMR) requires universal visual-textual correlations that can handle unknown vocabulary and unseen scenes. However, the learned correlations are likely either biased when derived from a limited amount of moment-text data, which is hard to scale up because of the prohibitive annotation cost (fully-supervised), or unreliable when only the video-text pairwise relationships are available without fine-grained temporal annotations (weakly-supervised). Recently, vision-language models (VLMs) have demonstrated a new transfer learning paradigm that benefits different vision tasks through the universal visual-textual correlations derived from large-scale vision-language pairwise web data, and this has also benefited VMR through fine-tuning in the target domains. In this work, we propose a zero-shot method for adapting generalisable visual-textual priors from arbitrary VLM…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
