LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval

Ziyang Luo; Pu Zhao; Can Xu; Xiubo Geng; Tao Shen; Chongyang Tao; Jing Ma; Qingwen Lin; Daxin Jiang

arXiv:2302.02908 · cs.CV · February 7, 2023 · 1 citation

TL;DR

LexLIP introduces a lexicon-weighting paradigm with sparse representations for image-text retrieval, significantly improving retrieval speed and efficiency while maintaining state-of-the-art accuracy.

Contribution

The paper proposes LexLIP, a novel pre-training framework that learns importance-aware lexicon representations, bridging the gap between continuous image data and sparse vocabulary space.

Findings

01

Achieves state-of-the-art performance on MSCOCO and Flickr30k datasets.

02

Retrieves 5.5 to 221.3 times faster than CLIP.

03

Uses 13.2 to 48.8 times less index storage memory than CLIP.

Abstract

Image-text retrieval (ITR) is the task of retrieving relevant images or texts given a query from the other modality. The conventional dense retrieval paradigm encodes images and texts into dense representations using dual-stream encoders; however, it suffers from low retrieval speed in large-scale retrieval scenarios. In this work, we propose the lexicon-weighting paradigm, where sparse representations in vocabulary space are learned for images and texts to take advantage of bag-of-words models and efficient inverted indexes, resulting in significantly reduced retrieval latency. A crucial gap arises between the continuous nature of image data and the requirement for a sparse vocabulary space representation. To bridge this gap, we introduce a novel pre-training framework, Lexicon-Bottlenecked Language-Image Pre-Training (LexLIP), that learns importance-aware lexicon…
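
To make the retrieval side of the lexicon-weighting paradigm concrete, here is a minimal sketch of scoring sparse vocabulary-space vectors through an inverted index. It is not LexLIP's implementation: the function names (`build_inverted_index`, `search`), the image ids, and all term weights are hypothetical, chosen only to show why a query touches just the posting lists of its own non-zero terms.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each vocabulary term to the (doc_id, weight) pairs that contain it.

    doc_vectors: dict of doc_id -> {term: weight}, i.e. sparse
    lexicon-weighted vectors (hypothetical values in this sketch).
    """
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            if weight > 0:  # zero-weight terms are never stored
                index[term].append((doc_id, weight))
    return index


def search(index, query_vector, top_k=3):
    """Score documents by the sparse dot product with the query.

    Only posting lists for terms present in the query are visited,
    so documents sharing no term with the query are never scored.
    """
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Sort by descending score, breaking ties by doc_id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical sparse vectors, standing in for image encoder outputs.
image_vectors = {
    "img_dog": {"dog": 1.5, "grass": 0.5, "ball": 1.0},
    "img_cat": {"cat": 2.0, "sofa": 1.0},
    "img_park": {"grass": 1.0, "tree": 1.5, "dog": 0.5},
}
index = build_inverted_index(image_vectors)

# Hypothetical sparse vector for a text query such as "a dog with a ball".
query = {"dog": 1.0, "ball": 0.5}
results = search(index, query)
print(results)
```

In this toy index, `img_cat` shares no term with the query and is therefore never visited, which is the property that lets inverted indexes avoid the exhaustive comparison a dense index would need.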

Code & Models

Repositories

chiyeunglaw/lexlip-iccv23
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Language-Image Pre-training