Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval

Zequn Xie; Boyun Zhang; Yuxiao Lin; Tao Jin

arXiv:2601.12768 · cs.CV · January 21, 2026

TL;DR

This paper introduces HVP-Net, a hierarchical visual perception framework that extracts and refines features from multiple intermediate layers of a vision encoder to improve the accuracy and robustness of video-text retrieval, reporting state-of-the-art results on MSRVTT, DiDeMo, and ActivityNet.

Contribution

HVP-Net uses intermediate-layer features of the vision encoder, rather than final-layer features alone, to build video representations for retrieval, addressing the redundancy and loss of detail that limit existing methods.

Findings

01. Achieves state-of-the-art performance on the MSRVTT, DiDeMo, and ActivityNet benchmarks.

02. Mines richer video semantics through hierarchical feature extraction.

03. Demonstrates robustness and improved matching accuracy in video-text retrieval.

Abstract

Video-text retrieval (VTR) aims to locate relevant videos using natural language queries. Current methods, often based on pre-trained models like CLIP, are hindered by video's inherent redundancy and their reliance on coarse, final-layer features, limiting matching accuracy. To address this, we introduce HVP-Net (Hierarchical Visual Perception Network), a framework that mines richer video semantics by extracting and refining features from multiple intermediate layers of a vision encoder. Our approach progressively distills salient visual concepts from raw patch tokens at different semantic levels, mitigating redundancy while preserving crucial details for alignment. This results in a more robust video representation, leading to new state-of-the-art performance on challenging benchmarks including MSRVTT, DiDeMo, and ActivityNet. Our work validates the effectiveness of exploiting…
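
The idea of reading features from several intermediate layers and keeping only salient patch tokens can be illustrated with a small NumPy sketch. This is not the HVP-Net implementation: the toy encoder, the mean-similarity saliency score, the chosen layer indices, and all function names (`toy_encoder`, `select_salient`, `hierarchical_embedding`) are assumptions made purely for illustration.

```python
import numpy as np


def toy_encoder(tokens, weights):
    """Run tokens through a stack of toy residual layers; return every layer's output."""
    outputs = []
    hidden = tokens
    for w in weights:
        hidden = np.tanh(hidden @ w) + hidden  # stand-in for a transformer block
        outputs.append(hidden)
    return outputs


def select_salient(tokens, k):
    """Keep the k tokens with the highest dot product against the mean token.

    This saliency score is a hypothetical placeholder, not the paper's method.
    """
    scores = tokens @ tokens.mean(axis=0)
    top = np.argsort(scores)[-k:]
    return tokens[top]


def hierarchical_embedding(tokens, weights, layer_ids, k):
    """Pool salient tokens at several layers, then average the levels."""
    outputs = toy_encoder(tokens, weights)
    levels = [select_salient(outputs[i], k).mean(axis=0) for i in layer_ids]
    emb = np.mean(levels, axis=0)
    return emb / np.linalg.norm(emb)  # unit norm, so a dot product is cosine similarity


rng = np.random.default_rng(0)
dim, num_patches, num_layers = 16, 50, 6
weights = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(num_layers)]
patch_tokens = rng.normal(size=(num_patches, dim))  # random stand-in for raw patch tokens

video_emb = hierarchical_embedding(patch_tokens, weights, layer_ids=[1, 3, 5], k=8)

text_emb = rng.normal(size=dim)  # random stand-in for a text query embedding
text_emb /= np.linalg.norm(text_emb)

similarity = float(video_emb @ text_emb)
print(video_emb.shape, similarity)
```

In a real system the per-layer outputs would come from a pre-trained vision encoder such as CLIP's, and the selection and fusion steps would be learned rather than fixed as they are here.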

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization