ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval

Nicola Messina; Matteo Stefanini; Marcella Cornia; Lorenzo Baraldi; Fabrizio Falchi; Giuseppe Amato; Rita Cucchiara

arXiv:2207.14757·cs.CV·August 1, 2022

TL;DR

ALADIN introduces a method to efficiently perform image-text matching by distilling fine-grained alignment scores into a shared embedding space, achieving near state-of-the-art accuracy with significantly reduced computational cost.

Contribution

The paper presents ALADIN, a novel approach that combines fine-grained alignment with score distillation to enable fast and effective image-text retrieval.
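
The idea of score distillation can be illustrated with a minimal sketch. It assumes a teacher matrix of fine-grained alignment scores for a batch of image-text pairs, and a student that scores the same pairs by cosine similarity of global embeddings. The listwise KL-divergence loss, the temperature parameter, and all function names below are illustrative assumptions, not the loss or API of the official repository.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distillation_loss(teacher_scores, image_embs, text_embs, temperature=1.0):
    """Mean KL(teacher || student) over the rows of the score matrix.

    teacher_scores[i][j]: fine-grained alignment score of image i and text j
    (assumed to come from a slower, more accurate alignment head).
    The student score is the cosine similarity of the global embeddings.
    """
    loss = 0.0
    for i, img in enumerate(image_embs):
        student_row = [cosine(img, txt) for txt in text_embs]
        p = softmax(teacher_scores[i], temperature)  # teacher distribution
        q = softmax(student_row, temperature)        # student distribution
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return loss / len(image_embs)
```

In this sketch the loss is zero when the student's similarities reproduce the teacher's scores exactly, and it grows as the two rankings diverge. Minimizing it would push the cheap embedding-space similarities toward the fine-grained scores.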

Findings

01 ALADIN achieves accuracy comparable to large VL Transformers.

02 It is approximately 90 times faster at inference.

03 The method is effective on the MS-COCO dataset.
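
The speed-up is plausible because a shared embedding space lets image embeddings be computed once offline, so that a text query needs only one encoder pass plus dot products. A minimal sketch of that retrieval pattern follows. The function names and toy vectors are hypothetical and do not come from the paper or its code.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_index(image_embs):
    """Offline step: normalize and store every image embedding once."""
    return [l2_normalize(v) for v in image_embs]


def retrieve(query_emb, index, k=3):
    """Online step: rank indexed images by cosine similarity to the query."""
    q = l2_normalize(query_emb)
    scores = [sum(a * b for a, b in zip(q, v)) for v in index]
    ranked = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy usage: three "image" embeddings and one "text" query embedding.
index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
top = retrieve([2.0, 0.1], index, k=2)
```

By contrast, a model that must process every image-text pair jointly has to run one full forward pass per candidate at query time, which is the cost that embedding-based retrieval avoids.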

Abstract

Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In the literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists of finding images related to a given query text or vice versa. Solving this task is of critical importance in cross-modal search engines. Many recent methods have proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks. However, these models are often computationally expensive, especially at inference time. This prevents their adoption in large-scale cross-modal retrieval scenarios, where results should be provided to the user almost instantaneously. In this paper, we propose to…

Code & Models

Repositories

mesnico/aladin (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications

Methods: Multi-Head Attention · Linear Layer · Dropout · Dense Connections · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Attention Is All You Need