Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval

Weijia Liu; Jiuxin Cao; Bo Miao; Zhiheng Fu; Xuelin Zhu; Jiawei Ge; Bo Liu; Mehwish Nasim; Ajmal Mian

arXiv:2508.11313 · cs.CV · August 18, 2025

TL;DR

This paper introduces a denoise-then-retrieve paradigm for video moment retrieval: text-irrelevant clips are filtered out of the video before the target moment is retrieved, which improves multimodal alignment. The authors report results that surpass prior state-of-the-art methods on Charades-STA and QVHighlights.

Contribution

The paper proposes the denoise-then-retrieve paradigm and DRNet, a network that explicitly filters text-irrelevant clips through its Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules and then retrieves the target moment from the purified representations.

Findings

01. Outperforms prior state-of-the-art methods on the Charades-STA and QVHighlights datasets.

02. Filters text-irrelevant clips, which improves multimodal alignment.

03. The paradigm can be integrated into existing models to boost their performance.

Abstract

Current text-driven Video Moment Retrieval (VMR) methods encode all video clips, including irrelevant ones, disrupting multimodal alignment and hindering optimization. To this end, we propose a denoise-then-retrieve paradigm that explicitly filters text-irrelevant clips from videos and then retrieves the target moment using purified multimodal representations. Following this paradigm, we introduce the Denoise-then-Retrieve Network (DRNet), comprising Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules. TCD integrates cross-attention and structured state space blocks to dynamically identify noisy clips and produce a noise mask to purify multimodal video representations. TRF further distills a single query embedding from purified video representations and aligns it with the text embedding, serving as auxiliary supervision for denoising during training. Finally,…
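
The two modules described in the abstract can be illustrated with a toy sketch. This is not DRNet: it has no learned parameters, replaces the structured state space blocks with a plain per-clip similarity score, and uses a hard threshold where the actual model presumably learns its mask. The function names (`text_conditioned_mask`, `reconstruction_loss`), the threshold value, and the 4-dimensional example features are all assumptions made for illustration. The sketch only shows the general idea: score each clip against the text through cross-attention, mask out low-scoring clips, then pool the surviving clips into one embedding and compare it with the text embedding.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)


def text_conditioned_mask(clips, words, threshold=0.5):
    """Toy stand-in for TCD (hypothetical): returns (mask, purified clips).

    mask[i] is 1.0 if clip i is kept and 0.0 if it is treated as noise.
    """
    scale = math.sqrt(len(words[0]))
    mask = []
    for clip in clips:
        # Cross-attention: the clip attends over the query words.
        weights = softmax([dot(clip, w) / scale for w in words])
        context = [
            sum(wt * w[d] for wt, w in zip(weights, words))
            for d in range(len(clip))
        ]
        # A clip counts as relevant when it agrees with its text context.
        mask.append(1.0 if cosine(clip, context) >= threshold else 0.0)
    purified = [[m * x for x in clip] for m, clip in zip(mask, clips)]
    return mask, purified


def reconstruction_loss(purified, mask, text_embedding):
    """Toy stand-in for TRF (hypothetical): pool kept clips, compare to text."""
    kept = sum(mask)
    if kept == 0:
        return 1.0  # nothing survived the mask, so nothing can be aligned
    dim = len(text_embedding)
    pooled = [sum(clip[d] for clip in purified) / kept for d in range(dim)]
    return 1.0 - cosine(pooled, text_embedding)


# Made-up features: clips 1 and 2 match the query, clips 0 and 3 do not.
words = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
text_embedding = [0.5, 0.5, 0.0, 0.0]  # mean of the word vectors
clips = [
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.9, 1.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

mask, purified = text_conditioned_mask(clips, words)
loss = reconstruction_loss(purified, mask, text_embedding)
baseline = reconstruction_loss(clips, [1.0] * len(clips), text_embedding)
```

In this toy setup the mask keeps only the two query-related clips, and the pooled embedding of the purified clips sits closer to the text embedding than the pooled embedding of all clips (`loss` is smaller than `baseline`). That gap is the kind of signal the abstract describes TRF providing as auxiliary supervision for denoising during training.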
