TL;DR
ProPy adapts CLIP for partially relevant video retrieval (PRVR) with a prompt pyramid and an ancestor-descendant interaction mechanism, achieving state-of-the-art results on three public datasets.
Contribution
The paper introduces ProPy, an architecture that adapts CLIP to PRVR through a prompt pyramid and an ancestor-descendant interaction mechanism. Pretrained vision-language models such as CLIP had remained underexplored for this task.
Findings
ProPy achieves state-of-the-art performance on three public datasets, outperforming previous PRVR models in retrieval accuracy.
The prompt pyramid effectively captures multi-granularity semantics.
Abstract
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaptation of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) A Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public…
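To make the two ideas in the abstract concrete, here is a minimal sketch of a pyramid of event spans and an ancestor-descendant mask over them. It is an illustration under stated assumptions, not the paper's implementation: the names `EventNode`, `build_pyramid`, and `interaction_mask` are hypothetical, the windows are assumed to be non-overlapping and to double in width at each level, and "ancestor" is assumed to mean "a coarser event whose frame span contains the finer one".

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class EventNode:
    level: int  # 0 is the finest granularity
    start: int  # first frame covered (inclusive)
    end: int    # one past the last frame covered (exclusive)


def build_pyramid(num_frames: int) -> list[EventNode]:
    """Assumed layout: level k holds non-overlapping windows of 2**k frames."""
    nodes = []
    level, width = 0, 1
    while width <= num_frames:
        for start in range(0, num_frames - width + 1, width):
            nodes.append(EventNode(level, start, start + width))
        level += 1
        width *= 2
    return nodes


def contains(a: EventNode, b: EventNode) -> bool:
    """True if a's frame span covers b's span and a is a different node."""
    return a != b and a.start <= b.start and b.end <= a.end


def interaction_mask(nodes: list[EventNode]) -> list[list[bool]]:
    """mask[i][j] is True when i and j are the same node or one is an
    ancestor of the other (assumed reading of the interaction rule)."""
    return [
        [i == j or contains(a, b) or contains(b, a) for j, b in enumerate(nodes)]
        for i, a in enumerate(nodes)
    ]


nodes = build_pyramid(4)
mask = interaction_mask(nodes)
```

With four frames this yields seven nodes: four single-frame events, two two-frame events, and one root covering the whole clip. Under the assumed rule, a single-frame event interacts with the coarser events that contain it but not with its siblings, which is one way such a mask could restrict attention among event prompts.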