Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval
Zequn Xie, Boyun Zhang, Yuxiao Lin, Tao Jin

TL;DR
This paper introduces HVP-Net, a hierarchical visual perception framework that extracts and refines features from multiple intermediate layers of a vision encoder to improve the accuracy and robustness of video-text retrieval, achieving state-of-the-art results on MSRVTT, DiDeMo, and ActivityNet.
Contribution
The novel HVP-Net framework leverages intermediate layer features of vision encoders to enhance video representations for retrieval, addressing redundancy and detail loss in existing methods.
Findings
Achieves state-of-the-art performance on MSRVTT, DiDeMo, and ActivityNet benchmarks.
Effectively mines richer video semantics through hierarchical feature extraction.
Demonstrates robustness and improved matching accuracy in video-text retrieval.
Abstract
Video-text retrieval (VTR) aims to locate relevant videos using natural language queries. Current methods, often based on pre-trained models like CLIP, are hindered by video's inherent redundancy and their reliance on coarse, final-layer features, limiting matching accuracy. To address this, we introduce the HVP-Net (Hierarchical Visual Perception Network), a framework that mines richer video semantics by extracting and refining features from multiple intermediate layers of a vision encoder. Our approach progressively distills salient visual concepts from raw patch-tokens at different semantic levels, mitigating redundancy while preserving crucial details for alignment. This results in a more robust video representation, leading to new state-of-the-art performance on challenging benchmarks including MSRVTT, DiDeMo, and ActivityNet. Our work validates the effectiveness of exploiting…
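The idea of tapping several intermediate encoder layers, keeping only salient patch tokens at each level, and fusing the result can be illustrated with a toy sketch. This is not the authors' implementation: the encoder is a random stand-in, the saliency score (similarity to the layer's mean token), the choice of tapped layers, and every function name below are assumptions made only for illustration, and the sketch selects tokens per level independently rather than reproducing the paper's progressive distillation.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    # Guard against a zero vector so normalization never divides by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def mean_pool(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def toy_encoder(patch_tokens, num_layers, seed=0):
    """Hypothetical stand-in for a vision encoder.

    Returns the patch tokens after every layer, so intermediate
    layers can be tapped instead of only the final one.
    """
    rng = random.Random(seed)
    dim = len(patch_tokens[0])
    hidden_states = []
    tokens = patch_tokens
    for _ in range(num_layers):
        # Assumed layer: a random linear map followed by tanh.
        weight = [[rng.gauss(0, 1 / math.sqrt(dim)) for _ in range(dim)]
                  for _ in range(dim)]
        tokens = [[math.tanh(dot(row, tok)) for row in weight] for tok in tokens]
        hidden_states.append(tokens)
    return hidden_states


def select_salient(tokens, k):
    """Keep the k tokens most similar to the layer's mean token.

    This saliency score is an assumption; it only illustrates
    reducing redundancy among patch tokens.
    """
    center = l2_normalize(mean_pool(tokens))
    ranked = sorted(tokens, key=lambda t: dot(l2_normalize(t), center),
                    reverse=True)
    return ranked[:k]


def hierarchical_video_embedding(patch_tokens, num_layers=4,
                                 tap_layers=(1, 3), k=4):
    """Fuse salient tokens from several encoder depths into one vector."""
    hidden_states = toy_encoder(patch_tokens, num_layers)
    per_level = []
    for layer_idx in tap_layers:
        salient = select_salient(hidden_states[layer_idx], k)
        per_level.append(mean_pool(salient))
    # Simple average across levels; a learned fusion would be used in practice.
    return l2_normalize(mean_pool(per_level))


def cosine_score(video_embedding, text_embedding):
    return dot(l2_normalize(video_embedding), l2_normalize(text_embedding))
```

Given a list of patch-token vectors and a text embedding of the same dimension, `hierarchical_video_embedding(patches)` returns a unit-length video vector and `cosine_score` ranks it against the query. Averaging across levels is the simplest possible fusion and is used here only to keep the sketch short.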
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
