Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders

Nicola Messina; Giuseppe Amato; Andrea Esuli; Fabrizio Falchi; Claudio Gennaro; Stéphane Marchand-Maillet

arXiv:2008.05231·cs.CV·March 3, 2021

TL;DR

This paper introduces TERAN, a transformer-based model that aligns image regions with words for cross-modal retrieval and reports state-of-the-art image retrieval results on MS-COCO and Flickr30k, with an emphasis on scalable and efficient retrieval pipelines.

Contribution

The paper proposes a novel transformer encoder approach that enforces fine-grained image-text alignment while maintaining separate data pipelines for scalable retrieval.

Findings

01 State-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k.

02 Recall@1 for image retrieval on MS-COCO improves by 5.7%.

03 Recall@1 for sentence retrieval on MS-COCO improves by 3.5%.
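
As a reading aid for the Recall@1 figures above: Recall@K is the fraction of queries whose correct item appears among the top K ranked results. The sketch below is a minimal illustration, not the paper's evaluation code. The function name `recall_at_k`, the toy similarity matrix, and the simplification of one ground-truth item per query are all assumptions made here (in MS-COCO each image has several captions, so sentence retrieval accepts any of them).

```python
def recall_at_k(sim, gt, k):
    """Fraction of queries whose ground-truth item is ranked in the top k.

    sim[q][j] is the similarity between query q and candidate j.
    gt[q] is the index of the single correct candidate for query q
    (an assumed simplification; real benchmarks may allow several).
    """
    hits = 0
    for q, row in enumerate(sim):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if gt[q] in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy example: 3 queries, 3 candidates, correct match on the diagonal.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
gt = [0, 1, 2]

r1 = recall_at_k(sim, gt, 1)  # queries 0 and 2 rank their match first
r2 = recall_at_k(sim, gt, 2)  # query 1's match enters the top 2
```

With this toy data, two of three queries succeed at K=1 and all three at K=2, which shows why Recall@1 is the strictest of the usual Recall@K metrics.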

Abstract

Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences, i.e., image regions and words, respectively, in order to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task. Focusing on scalable cross-modal information retrieval, TERAN…
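
To illustrate what matching "based on word-region alignments" can look like, the sketch below builds a region-word cosine similarity matrix and pools it into a single image-sentence score. The pooling used here (best-matching region for each word, summed over words) is one plausible choice assumed for illustration; the abstract does not specify how TERAN pools its alignments, and all function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def alignment_matrix(regions, words):
    """A[i][j] = cosine similarity between region i and word j."""
    return [[cosine(r, w) for w in words] for r in regions]


def pooled_similarity(regions, words):
    """Assumed pooling: max over regions, then sum over words."""
    A = alignment_matrix(regions, words)
    return sum(max(A[i][j] for i in range(len(regions)))
               for j in range(len(words)))


# Toy features: two image regions and two candidate sentences.
regions = [[1.0, 0.0], [0.0, 1.0]]
matching_words = [[1.0, 0.0], [0.0, 2.0]]      # each word aligns with a region
mismatched_words = [[-1.0, 0.0], [0.0, -1.0]]  # no word aligns with any region

score_match = pooled_similarity(regions, matching_words)
score_mismatch = pooled_similarity(regions, mismatched_words)
```

Because the image and sentence features are only combined inside `pooled_similarity`, each side can be encoded independently, which is the property a scalable retrieval pipeline relies on. Training with supervision only at the image-sentence level would then push `score_match` above `score_mismatch` without ever labeling individual word-region pairs.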

Code & Models

Repositories

mesnico/TERAN
PyTorch · Official

Taxonomy

Methods

Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Attention Is All You Need · Multi-Head Attention · Byte Pair Encoding · Label Smoothing · Dropout