Multi-video Moment Ranking with Multimodal Clue

Danyang Hou; Liang Pang; Yanyan Lan; Huawei Shen; Xueqi Cheng

arXiv:2301.13606 · cs.CV · February 1, 2023

TL;DR

This paper introduces MINUTE, a two-stage multimodal model for video corpus moment retrieval that addresses prediction bias and leverages key content across modalities, achieving state-of-the-art results.

Contribution

MINUTE employs shared normalization for unbiased moment prediction and multimodal clue mining for improved localization in VCMR.

Findings

01

Outperforms baselines on TVR and DiDeMo datasets

02

Achieves new state-of-the-art in VCMR

03

Effectively discovers key content across modalities

Abstract

Video corpus moment retrieval (VCMR) is the task of retrieving a relevant video moment from a large corpus of untrimmed videos via a natural language query. State-of-the-art work for VCMR is based on two-stage methods. In this paper, we focus on two problems of two-stage methods: (1) Moment prediction bias: the predicted moments for most queries come from the top retrieved videos, ignoring the possibility that the target moment is in the bottom retrieved videos; this bias is caused by the inconsistent use of Shared Normalization between training and inference. (2) Latent key content: different modalities of a video carry different key information for moment localization. To this end, we propose a two-stage model, MultI-video raNking with mUlTimodal cluE (MINUTE). MINUTE uses Shared Normalization during both training and inference to rank…
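The role of Shared Normalization in ranking moments across videos can be illustrated with a small sketch. It assumes the common formulation in which candidate scores from all retrieved videos are normalized by a single softmax, as opposed to one softmax per video. The function names and the toy logits are hypothetical and chosen only for illustration; this is not the MINUTE implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def per_video_normalization(video_scores):
    """Normalize each video's moment scores independently (one softmax per video)."""
    return [softmax(scores) for scores in video_scores]


def shared_normalization(video_scores):
    """Normalize the moment scores of all videos with one softmax (assumed formulation)."""
    flat = [s for scores in video_scores for s in scores]
    probs = softmax(flat)
    # Split the flat probabilities back into one row per video.
    table, start = [], 0
    for scores in video_scores:
        table.append(probs[start:start + len(scores)])
        start += len(scores)
    return table


def best_moment(prob_table):
    """Return (video_index, moment_index) of the highest-probability moment."""
    best, best_prob = None, -1.0
    for v, row in enumerate(prob_table):
        for m, p in enumerate(row):
            if p > best_prob:
                best, best_prob = (v, m), p
    return best


# Hypothetical logits: two retrieved videos, two candidate moments each.
scores = [[0.0, 1.0], [2.0, 3.0]]
local = per_video_normalization(scores)
shared = shared_normalization(scores)
```

With these toy logits, per-video normalization yields the same distribution for both videos, because softmax is invariant to a constant shift; the probabilities are therefore not comparable across videos. Under the shared softmax, the moments of the second video receive higher probability, so moments from different videos can be ranked in a single list.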

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning