LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

Siqi Sun; Yen-Chun Chen; Linjie Li; Shuohang Wang; Yuwei Fang; Jingjing Liu

arXiv:2103.08784·cs.CL·April 13, 2021·6 cites

TL;DR

LightningDOT accelerates image-text retrieval by removing cross-modal attention at inference time, reporting a speedup of thousands of times without sacrificing accuracy.

Contribution

The paper introduces a pre-training method that lets image-text retrieval score each image-text pair with a single dot product, eliminating slow cross-modal attention during inference.
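To illustrate why dot-product matching is cheap at query time, here is a minimal sketch rather than the authors' implementation. It assumes image embeddings have already been produced offline by some image encoder and that a text encoder yields a query vector in the same space. The names `build_index`, `retrieve`, and the toy 3-dimensional vectors are hypothetical and chosen only for illustration.

```python
import heapq


def dot(a, b):
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_index(image_embeddings):
    """Offline step (assumed): cache one embedding per image, computed once."""
    return {image_id: list(vec) for image_id, vec in image_embeddings.items()}


def retrieve(query_embedding, index, k=2):
    """Online step: score the query against cached embeddings with dot products only.

    No image-text pair passes through a joint cross-modal model here.
    """
    scored = ((dot(query_embedding, vec), image_id) for image_id, vec in index.items())
    return [(image_id, score) for score, image_id in heapq.nlargest(k, scored)]


# Toy vectors standing in for encoder outputs (hypothetical values).
toy_image_embeddings = {
    "img_dog": [0.9, 0.1, 0.0],
    "img_cat": [0.1, 0.9, 0.0],
    "img_car": [0.0, 0.1, 0.9],
}
index = build_index(toy_image_embeddings)
toy_query = [0.8, 0.2, 0.1]  # hypothetical text embedding for a dog caption
results = retrieve(toy_query, index, k=2)
print(results)
```

Because the image side is cached, each query costs one text-encoder pass plus one dot product per indexed image. A model with cross-modal attention would instead need a joint forward pass for every candidate pair.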

Findings

01. Speeds up image-text retrieval inference by thousands of times.

02. Outperforms existing models on the Flickr30k, COCO, and Multi30K benchmarks.

03. Maintains accuracy comparable to state-of-the-art methods.

Abstract

Multimodal pre-training has propelled great advancement in vision-and-language (V+L) research. These large-scale pre-trained models, although successful, fatefully suffer from slow inference speed due to the enormous computation cost, mainly from cross-modal attention in the Transformer architecture. When applied to real-life applications, such latency and computation demand severely deter the practical use of pre-trained models. In this paper, we study image-text retrieval (ITR), the most mature scenario of V+L application, which has been widely studied even prior to the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT, that accelerates the inference time of ITR by thousands of times, without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by pre-training on three novel learning objectives, extracting feature…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Dropout · Residual Connection