LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
Siqi Sun, Yen-Chun Chen, Linjie Li, Shuohang Wang, Yuwei Fang, Jingjing Liu

TL;DR
LightningDOT speeds up image-text retrieval by removing cross-modal attention at inference time, reaching state-of-the-art accuracy with far less computation and latency.
Contribution
The paper introduces a pre-training method that reduces image-text matching at inference time to a dot product between image and text embeddings, eliminating the need for slow cross-modal attention.
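As a rough illustration of what dot-product matching looks like at retrieval time, here is a minimal sketch in plain Python. It is not the authors' implementation: the toy three-dimensional vectors stand in for the outputs of pre-trained image and text encoders, and the names `l2_normalize`, `rank_images`, and `image_index` are hypothetical. The point is only that, once image embeddings are computed ahead of time, scoring a text query needs one dot product per image and no cross-modal attention.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products behave like cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_images(query_embedding, image_index, top_k=3):
    """Score every indexed image against the query and return the top_k matches."""
    q = l2_normalize(query_embedding)
    scored = [(image_id, dot(q, emb)) for image_id, emb in image_index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Assumption: these toy vectors stand in for image-encoder outputs.
raw_image_embeddings = {
    "img_dog": [0.9, 0.1, 0.0],
    "img_cat": [0.1, 0.9, 0.0],
    "img_car": [0.0, 0.1, 0.9],
}

# Image embeddings are normalized once, ahead of query time.
image_index = {k: l2_normalize(v) for k, v in raw_image_embeddings.items()}

# Assumption: this toy vector stands in for a text-encoder output.
query = [0.8, 0.2, 0.1]

results = rank_images(query, image_index, top_k=2)
for image_id, score in results:
    print(image_id, round(score, 3))
```

Because the image side never depends on the query, the index can be built offline, and the per-query cost is a single text encoding plus the dot products.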
Findings
Speeds up inference by thousands of times.
Outperforms existing models on Flickr30k, COCO, and Multi30K benchmarks.
Maintains high accuracy comparable to state-of-the-art methods.
Abstract
Multimodal pre-training has propelled great advancement in vision-and-language research. These large-scale pre-trained models, although successful, fatefully suffer from slow inference speed due to enormous computation cost mainly from cross-modal attention in Transformer architecture. When applied to real-life applications, such latency and computation demand severely deter the practical use of pre-trained models. In this paper, we study image-text retrieval (ITR), the most mature scenario of V+L application, which has been widely studied even prior to the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT, that accelerates the inference time of ITR by thousands of times, without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by pre-training on three novel learning objectives, extracting feature…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Dropout · Residual Connection
