TL;DR
DreamPRVR introduces a diffusion-guided, coarse-to-fine approach for partially relevant video retrieval, improving global context understanding and cross-modal matching accuracy.
Contribution
The paper proposes a diffusion-based method that generates and iteratively refines global semantic registers, which supply whole-video context to improve video-text retrieval.
Findings
Outperforms state-of-the-art PRVR methods on benchmark datasets.
Effectively models global context with diffusion-guided semantic registers.
Enhances cross-modal matching through register-augmented attention.
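As a rough illustration of the last finding, the sketch below shows one plausible reading of register-augmented attention: the global registers are prepended to the clip features so that every clip can attend to whole-video context as well as to other clips. This summary does not specify the layer, so the concatenation scheme, the function names (`attend`, `register_augmented_attention`), and the toy 2-dimensional features are all assumptions for illustration, not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Plain scaled dot-product attention on lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

def register_augmented_attention(clip_feats, registers):
    """Assumed scheme: clips attend over [registers + clips] as keys and values."""
    context = registers + clip_feats
    return attend(clip_feats, context, context)

# Toy inputs (hypothetical): two clip features and one global register.
clips = [[1.0, 0.0], [0.0, 1.0]]
registers = [[0.5, 0.5]]
augmented = register_augmented_attention(clips, registers)
```

Each output row is a convex combination of the register and clip vectors, so the register pulls every clip representation toward the shared global context. A real model would also apply learned query, key, and value projections, which are omitted here.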
Abstract
Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. Existing methods suffer from incomplete global contextual perception, struggling with query ambiguity and local noise induced by spurious responses. To address these issues, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video and then concentrates on fine-grained similarity optimization for precise cross-modal matching. Concretely, these registers are generated by initializing from the video-centric distribution produced by a probabilistic variational sampler and then iteratively refined via a text-supervised truncated diffusion model. During this process, textual semantic structure learning constructs a…
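The register generation described in the abstract (sample an initial state from a video-conditioned distribution, then refine it over a small number of diffusion steps) can be sketched as follows. This is a minimal toy under stated assumptions, not the paper's model: the sampler is a diagonal Gaussian drawn with the reparameterization trick, and `toy_denoiser` is a hypothetical stand-in for the trained denoising network that simply pulls each register toward a fixed `target` vector. The names `sample_registers`, `refine_registers`, `mu`, `log_var`, and `target` are all illustrative.

```python
import math
import random

def sample_registers(mu, log_var, num_registers, rng):
    """Reparameterized draw r = mu + sigma * eps, one draw per register."""
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        for _ in range(num_registers)
    ]

def refine_registers(registers, denoise_fn, num_steps):
    """Apply a short, fixed number of refinement steps (the 'truncated' part)."""
    for step in reversed(range(num_steps)):
        registers = [denoise_fn(r, step) for r in registers]
    return registers

def toy_denoiser(target, rate=0.5):
    """Hypothetical denoiser: moves a register a fraction `rate` toward `target`."""
    def step_fn(register, step):
        return [x + rate * (t - x) for x, t in zip(register, target)]
    return step_fn

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy usage: mean and log-variance stand in for the video-conditioned sampler output.
rng = random.Random(0)
mu = [0.0, 1.0, -1.0, 0.5]
log_var = [-2.0, -2.0, -2.0, -2.0]
target = [1.0, 1.0, 1.0, 1.0]

initial = sample_registers(mu, log_var, num_registers=3, rng=rng)
refined = refine_registers(initial, toy_denoiser(target), num_steps=4)
```

Starting from a video-conditioned sample rather than pure noise is what allows the chain to be short: the initial state is already near the region of interest, so only a few refinement steps are needed.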
