Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval
Weijia Liu, Jiuxin Cao, Bo Miao, Zhiheng Fu, Xuelin Zhu, Jiawei Ge, Bo Liu, Mehwish Nasim, Ajmal Mian

TL;DR
This paper introduces a denoise-then-retrieve paradigm for video moment retrieval: it filters out text-irrelevant video clips before retrieval to improve multimodal alignment, and reports results above the prior state of the art on Charades-STA and QVHighlights.
Contribution
The paper proposes a denoise-then-retrieve framework and a network built on it, the Denoise-then-Retrieve Network (DRNet), which explicitly filters text-irrelevant clips before retrieving the target moment, improving the performance of text-driven video moment retrieval models.
Findings
Outperforms state-of-the-art on Charades-STA and QVHighlights datasets
Effectively filters irrelevant clips to improve multimodal alignment
The paradigm can be integrated into existing models to boost their performance
Abstract
Current text-driven Video Moment Retrieval (VMR) methods encode all video clips, including irrelevant ones, disrupting multimodal alignment and hindering optimization. To address this, we propose a denoise-then-retrieve paradigm that explicitly filters text-irrelevant clips from videos and then retrieves the target moment using purified multimodal representations. Following this paradigm, we introduce the Denoise-then-Retrieve Network (DRNet), comprising Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules. TCD integrates cross-attention and structured state space blocks to dynamically identify noisy clips and produce a noise mask to purify multimodal video representations. TRF further distills a single query embedding from purified video representations and aligns it with the text embedding, serving as auxiliary supervision for denoising during training. Finally,…
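To make the mask-then-align idea concrete, here is a minimal sketch in plain Python. It is not DRNet: the abstract says TCD uses cross-attention and structured state space blocks, whereas this toy stands in a fixed cosine-similarity threshold for the learned mask, and mean pooling for the distilled query embedding. All function names, the threshold value, and the 2-D feature vectors are assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def noise_mask(clips, text, threshold=0.5):
    """Hypothetical stand-in for a learned noise mask: 1 keeps a clip, 0 marks it as noisy."""
    return [1 if cosine(c, text) >= threshold else 0 for c in clips]

def purify(clips, mask):
    """Zero out the features of clips flagged as noisy."""
    return [[x * m for x in c] for c, m in zip(clips, mask)]

def pooled_query(clips, mask):
    """Mean-pool the kept clips into one embedding (assumed substitute for TRF's distillation)."""
    kept = [c for c, m in zip(clips, mask) if m]
    if not kept:
        return [0.0] * len(clips[0])
    return [sum(col) / len(kept) for col in zip(*kept)]

def reconstruction_loss(query, text):
    """Auxiliary alignment signal: small when the pooled embedding points the same way as the text."""
    return 1.0 - cosine(query, text)

# Toy data: four clip features and one text embedding, all made up.
clips = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]
text = [1.0, 0.0]

mask = noise_mask(clips, text)       # first two clips kept, last two flagged as noisy
purified = purify(clips, mask)       # noisy clips become zero vectors
loss = reconstruction_loss(pooled_query(clips, mask), text)
```

In this toy setup the two clips aligned with the text survive the mask and the alignment loss is close to zero. In a trained model the mask would presumably be predicted by the network and the alignment term would be one of several training losses; the abstract does not specify those details.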
