ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
Nicola Messina, Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Giuseppe Amato, Rita Cucchiara

TL;DR
ALADIN introduces a method to efficiently perform image-text matching by distilling fine-grained alignment scores into a shared embedding space, achieving near state-of-the-art accuracy with significantly reduced computational cost.
Contribution
The paper presents ALADIN, a novel approach that combines fine-grained alignment with score distillation to enable fast and effective image-text retrieval.
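To make the idea of score distillation concrete, here is a minimal sketch under stated assumptions. This summary does not specify the loss, so the listwise softmax cross-entropy below is an assumed stand-in, not the paper's formulation. The names `softmax`, `cosine`, and `distillation_loss` are hypothetical, and the toy vectors stand in for real encoder outputs. The "teacher" scores play the role of the fine-grained alignment scores; the "student" scores are cosine similarities in the shared embedding space.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def distillation_loss(teacher_scores, query_emb, candidate_embs, temperature=1.0):
    """Assumed loss: cross-entropy between the teacher's and the student's
    score distributions over the same list of candidates."""
    student_scores = [cosine(query_emb, c) for c in candidate_embs]
    p_teacher = softmax(teacher_scores, temperature)
    p_student = softmax(student_scores, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

# Toy data: the teacher ranks candidate 0 highest and candidate 2 lowest.
teacher = [2.0, 0.5, -1.0]
query = [1.0, 0.0]
agreeing = [[1.0, 0.0], [0.7, 0.7], [-1.0, 0.0]]      # same ordering as teacher
disagreeing = [[-1.0, 0.0], [0.7, 0.7], [1.0, 0.0]]   # reversed ordering

loss_agree = distillation_loss(teacher, query, agreeing)
loss_disagree = distillation_loss(teacher, query, disagreeing)
print(loss_agree < loss_disagree)
```

The loss is lower when the embedding-space similarities order the candidates the same way the teacher scores do, which is the behaviour a distillation objective of this kind is meant to encourage.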
Findings
ALADIN achieves comparable accuracy to large VL Transformers.
It is approximately 90 times faster at inference than those models.
The method is effective on the MS-COCO dataset.
Abstract
Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists in finding images related to a given query text or vice-versa. Solving this task is of critical importance in cross-modal search engines. Many recent methods proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks. However, these models are often computationally expensive, especially at inference time. This prevents their adoption in large-scale cross-modal retrieval scenarios, where results should be provided to the user almost instantaneously. In this paper, we propose to…
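The retrieval setting described in the abstract can be sketched as follows, assuming images and texts have already been mapped into one shared embedding space. The function names (`l2_normalize`, `build_index`, `search`) and the toy vectors are hypothetical, and the exhaustive scan stands in for whatever nearest-neighbour search a real system would use. The point is only that image embeddings can be computed once offline, so each query costs one text encoding plus similarity comparisons rather than a full model pass per image-text pair.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def build_index(image_embs):
    """Offline step: normalize and store every image embedding once."""
    return [l2_normalize(e) for e in image_embs]

def search(index, text_emb, k=2):
    """Online step: rank stored images by cosine similarity to the text query
    and return the positions of the top-k images."""
    q = l2_normalize(text_emb)
    scored = [(sum(a * b for a, b in zip(q, img)), i) for i, img in enumerate(index)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy embeddings standing in for encoder outputs (assumed, not real data).
index = build_index([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
top = search(index, [1.0, 0.0], k=2)
print(top)  # [0, 2]
```

Text-to-image search is shown here; the reverse direction works the same way with a precomputed index of text embeddings and an image embedding as the query.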
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
Methods: Multi-Head Attention · Linear Layer · Dropout · Dense Connections · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Attention Is All You Need
