ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval

Yi Pan; Yujia Zhang; Michael Kampffmeyer; Xiaoguang Zhao

arXiv:2508.19024·cs.CV·August 27, 2025

TL;DR

ProPy adapts CLIP to partially relevant video retrieval (PRVR) with a prompt pyramid and an interaction mechanism built on it, achieving state-of-the-art results on three public datasets.

Contribution

The paper introduces ProPy, an architecture that adapts CLIP to PRVR through a Prompt Pyramid and an Ancestor-Descendant Interaction Mechanism.

Findings

01 ProPy outperforms previous models on three public datasets.

02 ProPy achieves significant improvements in retrieval accuracy.

03 The prompt pyramid effectively captures multi-granularity semantics.

Abstract

Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaptation of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) a Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) an Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public…
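The abstract does not say how the pyramid is laid out or which events are allowed to interact, so the Python sketch below is only one plausible reading, not the authors' implementation. It assumes that each event is a contiguous span of frames, that level k of the pyramid holds every span of length k+1, and that an "ancestor" is any span that fully contains another. The names `build_pyramid`, `is_ancestor`, and `interaction_mask` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

# A span is a half-open interval [start, end) over frame indices.
# ASSUMPTION: events are contiguous frame spans; the abstract does not state this.
Span = Tuple[int, int]


def build_pyramid(num_frames: int) -> List[List[Span]]:
    """Return one list of spans per level; level k holds all spans of length k + 1."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return [
        [(start, start + length) for start in range(num_frames - length + 1)]
        for length in range(1, num_frames + 1)
    ]


def is_ancestor(a: Span, b: Span) -> bool:
    """True if span a strictly contains span b (assumed ancestor relation)."""
    return a != b and a[0] <= b[0] and b[1] <= a[1]


def interaction_mask(spans: List[Span]) -> List[List[bool]]:
    """mask[i][j] is True when spans i and j may interact:
    the same span, or one is an ancestor of the other."""
    return [
        [i == j or is_ancestor(a, b) or is_ancestor(b, a) for j, b in enumerate(spans)]
        for i, a in enumerate(spans)
    ]
```

Under these assumptions, a 4-frame video yields levels of 4, 3, 2, and 1 spans. The top span covers the whole video and may interact with every other event, while overlapping spans such as (0, 2) and (1, 3) may not, because neither contains the other. A mask of this shape could in principle restrict attention among event prompts, though whether ProPy does so is not stated in the abstract.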
