QD-VMR: Query Debiasing with Contextual Understanding Enhancement for Video Moment Retrieval

Chenghua Gao; Min Li; Jianshuo Liu; Junxing Ren; Lin Chen; Haoyu Liu; Bo Meng; Jitao Fu; Wenwen Su

arXiv:2408.12981·cs.AI·August 26, 2024

Open Access

TL;DR

This paper introduces QD-VMR, a novel video moment retrieval model that enhances query understanding and debiasing to improve accuracy in retrieving relevant video segments, achieving state-of-the-art results.

Contribution

The paper proposes a new query debiasing framework with enhanced contextual understanding for VMR, combining alignment, contrastive learning, and a DETR-based prediction structure.

Findings

01. Achieves state-of-the-art performance on three benchmark datasets.

02. Effectively improves cross-modal understanding and query relevance filtering.

03. Demonstrates the effectiveness of query debiasing and visual enhancement modules.

Abstract

Video Moment Retrieval (VMR) aims to retrieve relevant moments of an untrimmed video corresponding to the query. While cross-modal interaction approaches have shown progress in filtering out query-irrelevant information in videos, they assume the precise alignment between the query semantics and the corresponding video moments, potentially overlooking the misunderstanding of the natural language semantics. To address this challenge, we propose a novel model called QD-VMR, a query debiasing model with enhanced contextual understanding. Firstly, we leverage a Global Partial Aligner module via video clip and query features alignment and video-query contrastive learning to enhance the cross-modal understanding capabilities of the model. Subsequently, we employ a Query Debiasing Module to obtain debiased query features efficiently, and a Visual Enhancement module to refine the video…
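
As a rough illustration of what "video-query contrastive learning" can look like, the following is a minimal sketch of a generic symmetric InfoNCE objective over pooled video and query embeddings. It is not the paper's Global Partial Aligner: the function names, the temperature value of 0.07, and the assumption that each video and each query has already been pooled into a single vector are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_logits(videos, queries, temperature=0.07):
    """Pairwise cosine similarities between videos and queries, divided by temperature.

    Row i, column j compares video i with query j. The temperature is an
    assumed hyperparameter, not a value taken from the paper.
    """
    v = [l2_normalize(x) for x in videos]
    q = [l2_normalize(x) for x in queries]
    return [[sum(a * b for a, b in zip(vi, qj)) / temperature for qj in q] for vi in v]


def diagonal_cross_entropy(logits):
    """Mean of -log softmax(row_i)[i], treating entry (i, i) as the positive pair."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(videos, queries, temperature=0.07):
    """Average of the video-to-query and query-to-video contrastive losses.

    Assumes videos[i] and queries[i] form a matched pair and every other
    combination in the batch is a negative.
    """
    logits = cosine_logits(videos, queries, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


if __name__ == "__main__":
    # Toy 2-D embeddings: matched pairs point in the same direction.
    videos = [[1.0, 0.0], [0.0, 1.0]]
    queries = [[1.0, 0.0], [0.0, 1.0]]
    print("aligned loss:", symmetric_info_nce(videos, queries))
    print("shuffled loss:", symmetric_info_nce(videos, list(reversed(queries))))
```

In this sketch the loss is small when each query is most similar to its own video and large when the pairing is shuffled, which is the behavior a contrastive alignment term is meant to encourage.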

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Feedforward Network · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Convolution