Disentangle and denoise: Tackling context misalignment for video moment retrieval

Kaijing Ma; Han Fang; Xianghao Zang; Chao Ban; Lanxiang Zhou; Zhongjiang He; Yongxiang Li; Hao Sun; Zerun Feng; Xingsong Hou

arXiv:2408.07600·cs.CV·August 15, 2024

Open Access

TL;DR

This paper introduces CDNet, a novel network for video moment retrieval that disentangles semantic correlations and denoises irrelevant background, significantly improving accuracy in locating moments based on natural language queries.

Contribution

The paper presents a cross-modal Context Denoising Network with query-guided semantic disentanglement and context-aware dynamic denoising, addressing noise and uneven semantic distribution in video retrieval.

Findings

01

Achieves state-of-the-art performance on public benchmarks.

02

Effectively disentangles complex correlations for accurate retrieval.

03

Enhances understanding of spatial-temporal details through query relevance.

Abstract

Video Moment Retrieval, which aims to locate in-context video moments according to a natural language query, is an essential task for cross-modal grounding. Existing methods focus on enhancing the cross-modal interactions between all moments and the textual description for video understanding. However, constantly interacting with all locations is unreasonable because of uneven semantic distribution across the timeline and noisy visual backgrounds. This paper proposes a cross-modal Context Denoising Network (CDNet) for accurate moment retrieval by disentangling complex correlations and denoising irrelevant dynamics. Specifically, we propose a query-guided semantic disentanglement (QSD) to decouple video moments by estimating alignment levels according to the global and fine-grained correlation. A Context-aware Dynamic Denoisement (CDD) is proposed to enhance understanding of aligned…

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Focus