HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval
Feilong Chen, Xiuyi Chen, Jiaxin Shi, Duzhen Zhang, Jianlong Chang, and Qi Tian

TL;DR
HiVLP introduces a hierarchical vision-language pre-training approach that enables fast and scalable image-text retrieval by using different representation dimensions for coarse and fine retrieval, significantly improving inference speed.
Contribution
This paper proposes a novel hierarchical retrieval objective for vision-language pre-training, enabling rapid image-text retrieval with scalable, multi-level representations.
Findings
HiVLP is 1,427 to 120,649 times faster than UNITER.
It is 2 to 5 times faster than LightningDOT.
It achieves +4.9 AR on COCO and +3.8 AR on Flickr30K, with performance comparable to SOTA models.
Abstract
In the past few years, the emergence of vision-language pre-training (VLP) has brought cross-modal retrieval to a new era. However, due to the latency and computation demand, it is commonly challenging to apply VLP in a real-time online retrieval system. To alleviate the defect, this paper proposes a Hierarchical Vision-Language Pre-Training (HiVLP) for fast Image-Text Retrieval (ITR). Specifically, we design a novel hierarchical retrieval objective, which uses the representation of different dimensions for coarse-to-fine ITR, i.e., using low-dimensional representation for large-scale coarse retrieval and high-dimensional representation for small-scale fine retrieval. We evaluate our proposed HiVLP on two popular image-text retrieval benchmarks, i.e., Flickr30k and COCO. Extensive experiments demonstrate that our HiVLP not only has fast…
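The coarse-to-fine idea described in the abstract can be illustrated with a minimal two-stage search. This is only a sketch under stated assumptions, not the authors' implementation: the function name `coarse_to_fine_search`, the dot-product similarity, the `shortlist_size` and `top_k` parameters, and the toy vectors are all hypothetical, and the low-dimensional vectors are derived here by simple truncation purely to keep the example small.

```python
import heapq


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def coarse_to_fine_search(query_low, query_high, gallery_low, gallery_high,
                          shortlist_size, top_k):
    """Return the indices of the top_k gallery items for one query.

    Stage 1 (coarse): score every gallery item with the cheap
    low-dimensional vectors and keep the best `shortlist_size` candidates.
    Stage 2 (fine): re-score only those candidates with the
    high-dimensional vectors and return the best `top_k` of them.
    """
    # Coarse pass over the whole gallery.
    coarse_scores = [dot(query_low, g) for g in gallery_low]
    shortlist = heapq.nlargest(shortlist_size, range(len(gallery_low)),
                               key=coarse_scores.__getitem__)
    # Fine pass over the shortlist only.
    fine_scores = {i: dot(query_high, gallery_high[i]) for i in shortlist}
    return heapq.nlargest(top_k, shortlist, key=fine_scores.__getitem__)


# Toy gallery of four items with 4-dimensional "fine" embeddings.
gallery_high = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.4],
    [0.0, 1.0, 0.0, 0.0],
    [0.8, 0.0, 0.6, 0.0],
]
# Assumption for this toy only: the "coarse" vectors are the first two dims.
gallery_low = [v[:2] for v in gallery_high]

query_high = [0.9, 0.1, 0.0, 0.4]
query_low = query_high[:2]

result = coarse_to_fine_search(query_low, query_high,
                               gallery_low, gallery_high,
                               shortlist_size=3, top_k=2)
print(result)  # [1, 0]
```

In this toy run the coarse pass ranks item 0 first, and the fine pass promotes item 1 once the extra dimensions are taken into account. The expensive high-dimensional comparison is applied to only three of the four items, which is the source of the speed-up when the gallery is large and the shortlist is small.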
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · UNiversal Image-TExt Representation Learning
