HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval

Zequn Xie; Xin Liu; Boyun Zhang; Yuxiao Lin; Sihang Cai; Tao Jin

arXiv:2601.16155 · cs.CV · January 23, 2026

TL;DR

This paper introduces HVD, a human-vision-inspired model for text-video retrieval. By mimicking human perception, it focuses on key visual elements rather than background noise and reports state-of-the-art results on five benchmarks.

Contribution

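The paper proposes a coarse-to-fine alignment framework for text-video retrieval with two modules: one mimics human macro-perception by selecting key frames, and the other mimics micro-perception by aggregating patch features into salient visual entities.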

Findings

1. Achieves state-of-the-art performance on five benchmarks.

2. Filters temporal redundancy and highlights salient visual entities.

3. Demonstrates that human-like visual focus improves retrieval accuracy.

Abstract

The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five…
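The coarse-to-fine idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `select_frames` and `compress_patches`, the cosine-similarity top-k rule for frame selection, and the plain scaled dot-product attention with externally supplied query vectors are all assumptions made for illustration. The abstract does not specify how FFSM scores frames or what attention mechanism PFCM uses.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def select_frames(text_vec, frame_vecs, k):
    """Coarse stage (assumed rule): keep the k frames most similar to the
    text embedding and return their indices in temporal order."""
    scores = [cosine(text_vec, f) for f in frame_vecs]
    top = sorted(range(len(frame_vecs)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compress_patches(patch_vecs, query_vecs):
    """Fine stage (assumed rule): each hypothetical query vector attends over
    all patches and yields one attention-weighted "entity" vector."""
    dim = len(patch_vecs[0])
    scale = math.sqrt(dim)
    entities = []
    for q in query_vecs:
        weights = softmax([dot(q, p) / scale for p in patch_vecs])
        entity = [sum(w * p[d] for w, p in zip(weights, patch_vecs)) for d in range(dim)]
        entities.append(entity)
    return entities


# Toy usage: 4 frame embeddings, keep 2; then pool 2 patches into 1 entity.
text = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
kept = select_frames(text, frames, k=2)

patches = [[1.0, 0.0], [0.0, 1.0]]
entities = compress_patches(patches, query_vecs=[[10.0, 0.0]])
```

In this toy setup the frame selector drops the frames that are orthogonal or opposed to the text embedding, and the single query pools the two patches into one vector dominated by the patch it matches. A real system would operate on CLIP features and learned queries; those details are outside what the abstract states.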

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization