HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval

Feilong Chen; Xiuyi Chen; Jiaxin Shi; Duzhen Zhang; Jianlong Chang; Qi Tian

arXiv:2205.12105 · cs.CV · June 1, 2022 · 6 cites

TL;DR

HiVLP introduces a hierarchical vision-language pre-training approach that enables fast and scalable image-text retrieval by using different representation dimensions for coarse and fine retrieval, significantly improving inference speed.

Contribution

This paper proposes a novel hierarchical retrieval objective for vision-language pre-training, enabling rapid image-text retrieval with scalable, multi-level representations.

Findings

01

HiVLP is 1,427 to 120,649 times faster than UNITER.

02

It is 2 to 5 times faster than LightningDOT.

03

HiVLP achieves +4.9 AR on COCO and +3.8 AR on Flickr30K, with performance comparable to SOTA models.

Abstract

In the past few years, the emergence of vision-language pre-training (VLP) has brought cross-modal retrieval into a new era. However, because of its latency and computation demands, VLP is often difficult to apply in a real-time online retrieval system. To address this limitation, this paper proposes Hierarchical Vision-Language Pre-Training (HiVLP) for fast Image-Text Retrieval (ITR). Specifically, we design a novel hierarchical retrieval objective that uses representations of different dimensions for coarse-to-fine ITR, i.e., a low-dimensional representation for large-scale coarse retrieval and a high-dimensional representation for small-scale fine retrieval. We evaluate HiVLP on two popular image-text retrieval benchmarks, Flickr30K and COCO. Extensive experiments demonstrate that our HiVLP not only has fast…
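The coarse-to-fine retrieval described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each item already has two precomputed embeddings (a low-dimensional one and a high-dimensional one) and that cosine similarity is the scoring function. The function name `coarse_to_fine_search` and the parameters `shortlist_size` and `top_k` are hypothetical, introduced only for this example.

```python
import math
import random


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def coarse_to_fine_search(query_low, query_high, cands_low, cands_high,
                          shortlist_size, top_k):
    """Two-stage retrieval sketch (assumed scoring: cosine similarity).

    Stage 1 (coarse): score ALL candidates with the cheap low-dimensional
    embeddings and keep the best `shortlist_size` indices.
    Stage 2 (fine): re-score only that shortlist with the high-dimensional
    embeddings and return the best `top_k` indices.
    """
    q_low = normalize(query_low)
    coarse = [(dot(q_low, normalize(c)), i) for i, c in enumerate(cands_low)]
    coarse.sort(reverse=True)
    shortlist = [i for _, i in coarse[:shortlist_size]]

    q_high = normalize(query_high)
    fine = [(dot(q_high, normalize(cands_high[i])), i) for i in shortlist]
    fine.sort(reverse=True)
    return [i for _, i in fine[:top_k]]


if __name__ == "__main__":
    # Synthetic data only: random vectors stand in for model embeddings.
    rng = random.Random(0)
    n, low_dim, high_dim = 200, 4, 32
    cands_low = [[rng.gauss(0, 1) for _ in range(low_dim)] for _ in range(n)]
    cands_high = [[rng.gauss(0, 1) for _ in range(high_dim)] for _ in range(n)]
    # Use candidate 7's own embeddings as the query, so it should rank first.
    hits = coarse_to_fine_search(cands_low[7], cands_high[7],
                                 cands_low, cands_high,
                                 shortlist_size=10, top_k=3)
    print(hits)
```

In this sketch the expensive high-dimensional comparison runs on only `shortlist_size` items instead of the whole collection, which is the source of the speedup in a coarse-to-fine design. The trade-off is that a relevant item dropped in the coarse stage cannot be recovered in the fine stage, so `shortlist_size` controls a speed versus recall balance.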

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · UNiversal Image-TExt Representation Learning