TL;DR
This paper introduces TERAN, a transformer-based model that aligns image regions with words and achieves state-of-the-art cross-modal retrieval results while keeping the retrieval pipeline scalable and efficient.
Contribution
The paper proposes a novel transformer encoder approach that enforces fine-grained image-text alignment while maintaining separate data pipelines for scalable retrieval.
Findings
State-of-the-art image retrieval results on both the MS-COCO and Flickr30k datasets.
A 5.7% improvement in Recall@1 for image retrieval on MS-COCO.
A 3.5% improvement in Recall@1 for sentence retrieval on MS-COCO.
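For readers unfamiliar with the metric behind these numbers, Recall@K is the fraction of queries for which at least one correct item appears among the top K retrieved results. The following is a minimal, generic sketch of that definition, not the paper's evaluation code; the function name `recall_at_k` and the toy rankings are assumptions made for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant item in the top k.

    ranked_ids:   list of per-query rankings (best match first).
    relevant_ids: list of per-query sets of ground-truth item ids.
    """
    if not ranked_ids or len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one ranking and one relevant set per query")
    hits = sum(
        1
        for ranking, relevant in zip(ranked_ids, relevant_ids)
        if any(item in relevant for item in ranking[:k])
    )
    return hits / len(ranked_ids)


# Toy data (hypothetical): three sentence queries, one matching image each.
rankings = [
    ["img2", "img0", "img1"],  # correct image ranked second
    ["img1", "img2", "img0"],  # correct image ranked first
    ["img0", "img1", "img2"],  # correct image ranked third
]
relevant = [{"img0"}, {"img1"}, {"img2"}]

r1 = recall_at_k(rankings, relevant, k=1)
r2 = recall_at_k(rankings, relevant, k=2)
```

In this toy case only one of the three queries has its correct image at rank 1, so `r1` is one third, and `r2` rises to two thirds once the second-ranked result is also counted.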
Abstract
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences, i.e., image regions and words, respectively, in order to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task. Focusing on scalable cross-modal information retrieval, TERAN…
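The idea of learning word-region alignments from image-sentence supervision alone can be sketched roughly as follows. The abstract does not specify the similarity function, the pooling, or the loss, so this sketch assumes cosine similarity, a "best region per word, summed over words" pooling, and a hinge-based triplet loss; all function names and the toy feature vectors are hypothetical. Region and word features are taken as given here, computed independently for each modality, and are combined only when the alignment matrix is built.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def alignment_matrix(regions, words):
    """A[i][j] = similarity between image region i and word j."""
    return [[cosine(r, w) for w in words] for r in regions]


def global_score(regions, words):
    """Pool the word-region matrix into one image-sentence score.

    Assumed pooling: each word picks its best-matching region, and the
    per-word maxima are summed. Other pooling choices are possible.
    """
    A = alignment_matrix(regions, words)
    return sum(max(row[j] for row in A) for j in range(len(words)))


def hinge_triplet_loss(pos, neg_sentence, neg_image, margin=0.2):
    """Hinge loss that needs only global match / mismatch labels."""
    return max(0.0, margin + neg_sentence - pos) + max(0.0, margin + neg_image - pos)


# Toy 2-d features (hypothetical): image_a matches sentence_a.
image_a = [[1, 0], [0, 1]]
image_b = [[-1, 0]]
sentence_a = [[1, 0], [0, 1]]
sentence_b = [[-1, 0], [0, -1]]

pos = global_score(image_a, sentence_a)            # matching pair
neg_sentence = global_score(image_a, sentence_b)   # wrong sentence
neg_image = global_score(image_b, sentence_a)      # wrong image
loss = hinge_triplet_loss(pos, neg_sentence, neg_image)
```

No word-to-region labels appear anywhere in the loss: the only supervision is which image-sentence pairs match, and the fine-grained alignments emerge inside `alignment_matrix` as a by-product of training the global score.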
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Attention Is All You Need · Multi-Head Attention · Byte Pair Encoding · Label Smoothing · Dropout
