Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval

Yi Bin; Haoxuan Li; Yahui Xu; Xing Xu; Yang Yang; Heng Tao Shen

arXiv:2308.04343 · cs.CV · August 9, 2023

TL;DR

This paper introduces Hierarchical Alignment Transformers (HAT), a unified two-stream Transformer framework for cross-modal retrieval that improves semantic alignment between images and texts, achieving state-of-the-art results on benchmark datasets.

Contribution

The paper proposes a novel unified Transformer-based architecture for both image and text encoders, enabling better semantic alignment and multi-level correspondence exploration in cross-modal retrieval.

Findings

01. HAT outperforms SOTA methods on MSCOCO and Flickr30K datasets.

02. HAT achieves 7.6% and 16.7% improvements in Recall@1 on MSCOCO.

03. HAT achieves 4.4% and 11.6% improvements in Recall@1 on Flickr30K.
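
For readers unfamiliar with the metric, Recall@K in retrieval is usually the fraction of queries for which at least one correct item appears among the top K ranked candidates. The sketch below is a minimal pure-Python illustration of that standard definition, not the paper's evaluation script. The function name `recall_at_k`, the toy similarity matrix, and the ground-truth sets are all made up for the example.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries with at least one correct candidate in the top k.

    similarity[i][j] is the score of query i against candidate j;
    ground_truth[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for scores, relevant in zip(similarity, ground_truth):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(j in relevant for j in ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy data (hypothetical): 3 queries scored against 3 candidates.
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct candidate 0 is ranked first
    [0.2, 0.4, 0.8],  # query 1: correct candidate 1 is ranked second
    [0.1, 0.7, 0.6],  # query 2: correct candidate 1 is ranked first
]
toy_ground_truth = [{0}, {1}, {1}]

r_at_1 = recall_at_k(toy_similarity, toy_ground_truth, k=1)
r_at_2 = recall_at_k(toy_similarity, toy_ground_truth, k=2)
```

On this toy data two of the three queries are answered correctly at rank 1, and all three within the top 2. The same function covers both retrieval directions by swapping which modality supplies the queries.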

Abstract

Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such discrepancy in architectures may induce different semantic distribution spaces and limit the interactions between images and texts, and further result in inferior alignment between images and texts. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders could produce representations with…
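
The abstract names a hierarchical alignment module but is cut off before describing it, so the following is only a rough sketch of what matching two encoders at several levels could look like, not the authors' method. It assumes, purely for illustration, that each encoder exposes one pooled vector per layer and that per-level cosine similarities are combined by a weighted average. The names `cosine`, `hierarchical_similarity`, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hierarchical_similarity(image_levels, text_levels, weights=None):
    """Weighted average of per-level cosine similarities.

    image_levels / text_levels: one pooled vector per encoder level
    (an assumption of this sketch; the page does not specify the module).
    """
    if len(image_levels) != len(text_levels):
        raise ValueError("both encoders must expose the same number of levels")
    if weights is None:
        # Assumption: all levels contribute equally.
        weights = [1.0 / len(image_levels)] * len(image_levels)
    return sum(
        w * cosine(img, txt)
        for w, img, txt in zip(weights, image_levels, text_levels)
    )


# Toy two-level embeddings (hypothetical values).
image_levels = [[1.0, 0.0], [0.0, 1.0]]
matching_text = [[1.0, 0.0], [0.0, 2.0]]    # same direction at every level
mismatched_text = [[0.0, 1.0], [1.0, 0.0]]  # orthogonal at every level

score_match = hierarchical_similarity(image_levels, matching_text)
score_mismatch = hierarchical_similarity(image_levels, mismatched_text)
```

Because both streams share the same architecture in this setup, their levels can be paired one-to-one, which is what makes a per-level comparison like this straightforward to write down.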

Code & Models

Repositories

luminosityx/hat (PyTorch, official)

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Linear Layer · Adam · Dense Connections · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding