Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models

Yifang Xu; Yunzhuo Sun; Benxiang Zhai; Ming Li; Wenxin Liang; Yang Li; Sidan Du

arXiv:2501.07972·cs.MM·January 15, 2025

TL;DR

This paper introduces Moment-GPT, a zero-shot, tuning-free pipeline for video moment retrieval that uses frozen multimodal large language models (MLLMs) in place of fine-tuning.

Contribution

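It proposes a pipeline that combines query rephrasing, span generation, and span selection using off-the-shelf MLLMs. The pipeline mitigates language bias in the query and avoids the expensive high-quality datasets and time-consuming fine-tuning that existing MLLM-based methods rely on.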

Findings

01. Outperforms state-of-the-art MLLM-based zero-shot methods

02. Effective in mitigating language bias in queries

03. Achieves superior results on multiple public datasets

Abstract

The target of video moment retrieval (VMR) is predicting temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) overly rely on expensive high-quality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook inherent language bias in the query, leading to erroneous localization. To tackle the aforementioned challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video comprehension capabilities of MLLMs, we apply VideoChatGPT and span…
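The three stages named above (query rephrasing, candidate span generation, span selection) can be sketched as a small pipeline skeleton. This is only an illustration under stated assumptions, not the authors' implementation: the function `retrieve_moment`, its parameters, and the toy stand-ins are hypothetical, the model calls (LLaMA-3, MiniGPT-v2, VideoChatGPT) are replaced by plain callables, and picking the highest-scoring span is an assumed selection rule.

```python
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def retrieve_moment(
    query: str,
    video_duration: float,
    rephrase: Callable[[str], str],
    generate_spans: Callable[[str, float], List[Span]],
    score_span: Callable[[str, Span], float],
) -> Span:
    """Hypothetical three-stage zero-shot retrieval skeleton.

    Every callable is a placeholder for a frozen model; nothing is trained.
    """
    # Stage 1: clean up the query (stand-in for an LLM rephrasing step).
    clean_query = rephrase(query)
    # Stage 2: propose candidate spans (stand-in for a span generator).
    candidates = generate_spans(clean_query, video_duration)
    if not candidates:
        # Assumed fallback: return the whole video when nothing is proposed.
        return (0.0, video_duration)
    # Stage 3: keep the best-scoring span (assumed selection rule).
    return max(candidates, key=lambda span: score_span(clean_query, span))


# Toy stand-ins so the skeleton runs without any model.
def toy_rephrase(query: str) -> str:
    # Collapses repeated whitespace; a real system would call an LLM here.
    return " ".join(query.split())


def toy_generate(query: str, duration: float,
                 window: float = 10.0, stride: float = 5.0) -> List[Span]:
    # Fixed-size sliding windows; ignores the query entirely.
    spans: List[Span] = []
    start = 0.0
    while start < duration:
        spans.append((start, min(start + window, duration)))
        start += stride
    return spans


def toy_score(query: str, span: Span) -> float:
    # Made-up relevance: spans centred near t = 17 s score highest.
    midpoint = (span[0] + span[1]) / 2.0
    return -abs(midpoint - 17.0)


best = retrieve_moment("  person   opens the door ", 30.0,
                       toy_rephrase, toy_generate, toy_score)
print(best)
```

Passing the stages in as callables keeps the skeleton independent of any particular model API, so each stand-in can be swapped for a real frozen model without changing the control flow.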

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition