Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models

Dezhao Luo; Jiabo Huang; Shaogang Gong; Hailin Jin; Yang Liu

arXiv:2309.00661 · cs.CV · September 6, 2023

TL;DR

This paper introduces a zero-shot video moment retrieval (VMR) method that leverages large-scale vision-language models, enabling retrieval without training on VMR-specific data. It is especially effective on queries with unseen words and on videos of unseen scenes.

Contribution

The work proposes a zero-shot approach that combines boundary-aware feature refinement with bottom-up proposal generation, reducing reliance on annotated data and mitigating domain discrepancies.
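
To make "bottom-up proposal generation" concrete, here is a minimal sketch of one generic way to build candidate moments from per-frame scores. It is not the authors' implementation: it assumes that frame-to-query similarity scores have already been computed by some frozen vision-language model, and the function name `bottom_up_proposals`, the `threshold` parameter, and the mean-score ranking are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def bottom_up_proposals(
    scores: List[float], threshold: float
) -> List[Tuple[int, int, float]]:
    """Group consecutive frames scoring >= threshold into candidate moments.

    `scores` are assumed frame-to-query similarities (hypothetical input).
    Returns (start, end, mean_score) tuples with `end` exclusive,
    sorted by mean score in descending order.
    """
    proposals = []
    start = None  # index where the current run of relevant frames began
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # open a new candidate moment
        elif start is not None:
            segment = scores[start:i]
            proposals.append((start, i, sum(segment) / len(segment)))
            start = None  # close the current candidate moment
    if start is not None:
        # The video ended while a candidate moment was still open.
        segment = scores[start:]
        proposals.append((start, len(scores), sum(segment) / len(segment)))
    return sorted(proposals, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    frame_scores = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1]
    for start, end, mean_score in bottom_up_proposals(frame_scores, 0.5):
        print(f"frames [{start}, {end}) mean score {mean_score:.2f}")
```

Grouping frames from the bottom up lets the moment length follow the scores, instead of being fixed in advance by a set of sliding windows. The thresholding rule here is only the simplest such grouping.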

Findings

01 Achieves state-of-the-art zero-shot VMR performance on benchmark datasets.

02 Effectively handles out-of-distribution scenarios with novel words and locations.

03 Demonstrates significant advantages over supervised methods in zero-shot settings.

Abstract

Accurate video moment retrieval (VMR) requires universal visual-textual correlations that can handle unknown vocabulary and unseen scenes. However, the learned correlations are likely to be either biased, when derived from a limited amount of moment-text data that is hard to scale up because of the prohibitive annotation cost (fully-supervised), or unreliable, when only video-text pairwise relationships are available without fine-grained temporal annotations (weakly-supervised). Recently, vision-language models (VLMs) have demonstrated a new transfer learning paradigm that benefits different vision tasks through the universal visual-textual correlations derived from large-scale vision-language pairwise web data, and this has also been shown to benefit VMR through fine-tuning in the target domains. In this work, we propose a zero-shot method for adapting generalisable visual-textual priors from arbitrary VLM…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning