Disentangle and denoise: Tackling context misalignment for video moment retrieval
Kaijing Ma, Han Fang, Xianghao Zang, Chao Ban, Lanxiang Zhou, Zhongjiang He, Yongxiang Li, Hao Sun, Zerun Feng, Xingsong Hou

TL;DR
This paper introduces CDNet, a novel network for video moment retrieval that disentangles semantic correlations and denoises irrelevant background, significantly improving accuracy in locating moments based on natural language queries.
Contribution
The paper presents a cross-modal Context Denoising Network with query-guided semantic disentanglement and context-aware dynamic denoising, addressing noise and uneven semantic distribution in video retrieval.
Findings
Achieves state-of-the-art performance on public benchmarks.
Effectively disentangles complex correlations for accurate retrieval.
Enhances understanding of spatial-temporal details through query relevance.
Abstract
Video Moment Retrieval, which aims to locate in-context video moments according to a natural language query, is an essential task for cross-modal grounding. Existing methods focus on enhancing the cross-modal interactions between all moments and the textual description for video understanding. However, constantly interacting with all locations is unreasonable because of uneven semantic distribution across the timeline and noisy visual backgrounds. This paper proposes a cross-modal Context Denoising Network (CDNet) for accurate moment retrieval by disentangling complex correlations and denoising irrelevant dynamics. Specifically, we propose a query-guided semantic disentanglement (QSD) to decouple video moments by estimating alignment levels according to the global and fine-grained correlation. A Context-aware Dynamic Denoisement (CDD) is proposed to enhance understanding of aligned…
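The idea of estimating per-moment alignment levels from a global and a fine-grained correlation can be illustrated with a small sketch. This is not the paper's QSD module: the cosine similarity, the equal weighting of the two scores, the max over words, the 0.5 threshold, and every function name below are assumptions chosen only to make the idea concrete.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def alignment_levels(clips, sentence, words):
    """Score each clip against the query (hypothetical scoring rule).

    global score : similarity to the whole-sentence embedding
    fine score   : best similarity to any single word embedding
    The two are averaged; the equal weighting is an assumption.
    """
    levels = []
    for clip in clips:
        global_score = cosine(clip, sentence)
        fine_score = max(cosine(clip, w) for w in words)
        levels.append(0.5 * (global_score + fine_score))
    return levels

def split_by_alignment(levels, threshold=0.5):
    """Separate clip indices into query-aligned and likely-noisy groups."""
    aligned = [i for i, s in enumerate(levels) if s >= threshold]
    noisy = [i for i, s in enumerate(levels) if s < threshold]
    return aligned, noisy

# Toy 2-D features standing in for real clip and text embeddings.
clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence = [1.0, 0.0]
words = [[1.0, 0.0], [1.0, 0.2]]

levels = alignment_levels(clips, sentence, words)
aligned, noisy = split_by_alignment(levels)
```

In this toy setup the second clip is nearly orthogonal to the query and lands in the noisy group, so a later stage could skip or down-weight it instead of interacting with every location on the timeline.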
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Focus
