TL;DR
This paper introduces Hierarchical Alignment Transformers (HAT), a unified two-stream Transformer framework for cross-modal retrieval that improves semantic alignment between images and texts, achieving state-of-the-art results on benchmark datasets.
Contribution
The paper proposes a novel unified Transformer-based architecture for both image and text encoders, enabling better semantic alignment and multi-level correspondence exploration in cross-modal retrieval.
Findings
HAT outperforms SOTA methods on MSCOCO and Flickr30K datasets.
HAT achieves relative Recall@1 improvements of 7.6% (image-to-text) and 16.7% (text-to-image) on MSCOCO.
HAT achieves relative Recall@1 improvements of 4.4% (image-to-text) and 11.6% (text-to-image) on Flickr30K.
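For readers unfamiliar with the metric, Recall@K is the fraction of queries for which at least one correct match appears among the top K retrieved candidates. The following is a minimal sketch of that computation, not the paper's evaluation code; the function names, the toy similarity scores, and the ground-truth sets are all made up for illustration. The `relative_improvement` helper shows how a relative gain would be computed from two Recall@1 scores.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries whose top-k ranked candidates include a correct match.

    similarity[i][j] is the score between query i and candidate j;
    ground_truth[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for scores, relevant in zip(similarity, ground_truth):
        # Rank candidate indices by descending score and keep the top k.
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        if any(j in relevant for j in top_k):
            hits += 1
    return hits / len(similarity)


def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Toy example: 3 image queries scored against 3 captions (made-up numbers).
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0 ranks caption 0 first (correct)
    [0.2, 0.4, 0.8],  # query 1 ranks caption 2 first (incorrect)
    [0.1, 0.2, 0.7],  # query 2 ranks caption 2 first (correct)
]
toy_ground_truth = [{0}, {1}, {2}]

r1 = recall_at_k(toy_similarity, toy_ground_truth, k=1)
r2 = recall_at_k(toy_similarity, toy_ground_truth, k=2)
```

In this toy setup two of the three queries retrieve their correct caption at rank 1, and all three do so within the top 2.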
Abstract
Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such discrepancy in architectures may induce different semantic distribution spaces and limit the interactions between images and texts, and further result in inferior alignment between images and texts. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders could produce representations with…
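One way to picture multi-level alignment between two encoders of the same depth is to compare their features layer by layer and combine the per-layer scores. The sketch below is an illustration under loud assumptions, not the paper's hierarchical alignment module: it assumes one pooled feature vector per layer for each modality, uses cosine similarity, and averages across layers. The names `cosine` and `hierarchical_similarity` and all feature values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hierarchical_similarity(image_layers, text_layers):
    """Average of per-layer cosine similarities.

    Assumption: simple averaging stands in for whatever aggregation
    the actual alignment module uses.
    """
    per_layer = [cosine(i, t) for i, t in zip(image_layers, text_layers)]
    return sum(per_layer) / len(per_layer)


# Hypothetical pooled features from two encoder layers per modality.
image_layers = [[1.0, 0.0], [0.0, 2.0]]
text_layers = [[1.0, 0.0], [3.0, 0.0]]

score = hierarchical_similarity(image_layers, text_layers)
```

Here the first layer's features are perfectly aligned and the second layer's are orthogonal, so the averaged score sits halfway between the two.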
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Linear Layer · Adam · Dense Connections · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding
