Zero-shot Video Moment Retrieval With Off-the-Shelf Models
Anuj Diwan, Puyuan Peng, Raymond J. Mooney

TL;DR
This paper introduces a zero-shot method for Video Moment Retrieval that leverages off-the-shelf models without additional training, significantly outperforming previous zero-shot approaches and approaching supervised model performance.
Contribution
The paper presents a simple, three-step zero-shot approach for VMR using only off-the-shelf models, eliminating the need for finetuning or annotated data.
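The three steps can be sketched roughly as below. This summary does not say which off-the-shelf models or algorithms the authors use at each step, so every component here is a hypothetical stand-in rather than the paper's method: fixed sliding windows for moment proposal, cosine similarity between precomputed embeddings for moment-query matching, and temporal non-maximum suppression for postprocessing. The `embed_moment` callable is an assumed placeholder for whatever pretrained video-text encoder is plugged in.

```python
from math import sqrt
from typing import Callable, List, Sequence, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec)


def propose_moments(duration: float, window: float, stride: float) -> List[Moment]:
    """Step 1 (assumed stand-in): fixed sliding windows over the video."""
    moments = []
    start = 0.0
    while start < duration:
        moments.append((start, min(start + window, duration)))
        start += stride
    return moments


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_moments(
    moments: List[Moment],
    embed_moment: Callable[[Moment], Sequence[float]],
    query_vec: Sequence[float],
) -> List[float]:
    """Step 2 (assumed stand-in): score each moment against the query embedding."""
    return [cosine(embed_moment(m), query_vec) for m in moments]


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(
    moments: List[Moment], scores: List[float], iou_threshold: float = 0.5
) -> List[Tuple[Moment, float]]:
    """Step 3 (assumed stand-in): temporal NMS, keeping the best-scoring moments."""
    ranked = sorted(zip(moments, scores), key=lambda pair: pair[1], reverse=True)
    kept: List[Tuple[Moment, float]] = []
    for moment, score in ranked:
        if all(temporal_iou(moment, k) <= iou_threshold for k, _ in kept):
            kept.append((moment, score))
    return kept


def retrieve(
    duration: float,
    embed_moment: Callable[[Moment], Sequence[float]],
    query_vec: Sequence[float],
    window: float = 10.0,
    stride: float = 5.0,
) -> List[Tuple[Moment, float]]:
    """Run all three steps and return moments ranked by score."""
    moments = propose_moments(duration, window, stride)
    scores = score_moments(moments, embed_moment, query_vec)
    return postprocess(moments, scores)
```

No training happens anywhere in this sketch: the only learned component is the encoder behind `embed_moment`, which is used as-is. That is the sense in which such a pipeline is zero-shot.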
Findings
Outperforms previous zero-shot methods by at least 2.5x on all metrics
Reduces the gap between zero-shot and supervised models by over 74%
Outperforms non-pretrained supervised models on recall metrics and performs well on shorter moments
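One plausible reading of the "gap reduction" figure is the fraction of the distance between the previous zero-shot score and the supervised score that the new method closes. The sketch below assumes that reading. The function names and the example numbers are purely illustrative and are not taken from the paper.

```python
def gap_reduction(prev_zero_shot: float, new_zero_shot: float, supervised: float) -> float:
    """Fraction of the old zero-shot-to-supervised gap closed by the new method.

    Assumed definition: 1 - (remaining gap / original gap).
    """
    original_gap = supervised - prev_zero_shot
    remaining_gap = supervised - new_zero_shot
    return 1.0 - remaining_gap / original_gap


def improvement_factor(prev_zero_shot: float, new_zero_shot: float) -> float:
    """Multiplicative improvement of the new score over the previous one."""
    return new_zero_shot / prev_zero_shot


# Hypothetical scores (NOT from the paper): previous zero-shot 10, new 40, supervised 50.
print(gap_reduction(10.0, 40.0, 50.0))   # 0.75, i.e. 75% of the gap closed
print(improvement_factor(10.0, 40.0))    # 4.0, i.e. a 4x improvement
```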
Abstract
For the majority of the machine learning community, the expensive nature of collecting high-quality human-annotated data and the inability to efficiently finetune very large state-of-the-art pretrained models on limited compute are major bottlenecks for building models for new tasks. We propose a simple zero-shot approach for one such task, Video Moment Retrieval (VMR), that does not perform any additional finetuning and simply repurposes off-the-shelf models trained on other tasks. Our three-step approach consists of moment proposal, moment-query matching and postprocessing, all using only off-the-shelf models. On the QVHighlights benchmark for VMR, we vastly improve performance of previous zero-shot approaches by at least 2.5x on all metrics and reduce the gap between zero-shot and state-of-the-art supervised models by over 74%. Further, we also show that our zero-shot approach beats…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
