TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval

Yongbiao Chen (1), Sheng Zhang (2), Fangxin Liu (1), Zhigang Chang (1), Mang Ye (3), Zhengwei Qi (1) ((1) Shanghai Jiao Tong University, (2) University of Southern California, (3) Wuhan University)

arXiv:2105.01823 · cs.CV · May 6, 2021

TL;DR

TransHash is a pure transformer-based deep hashing framework for efficient image retrieval. It combines a siamese vision transformer backbone with a Bayesian learning scheme and outperforms CNN-based hashing methods.

Contribution

This work is the first to develop a pure transformer-based deep hashing method for image retrieval, eliminating the need for convolutional neural networks.

Findings

01

Achieves significant performance gains over state-of-the-art methods.

02

Demonstrates effectiveness on the CIFAR-10, NUS-WIDE, and ImageNet datasets.

03

Outperforms existing methods in mean Average Precision (mAP).
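
For readers unfamiliar with the metric, the following is a minimal sketch of how mean Average Precision is commonly computed for ranked retrieval results. The function names and the toy relevance lists are illustrative assumptions. The paper's exact evaluation protocol (for example, any cutoff on the number of returned items) is not stated on this page, so this should not be read as the authors' evaluation code.

```python
from typing import Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """AP for one ranked list; relevance[k] is 1 if the item at rank k+1 is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            # Precision measured at each rank where a relevant item appears.
            precision_sum += hits / rank
    # Assumption: normalise by the number of relevant items in the returned list.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_relevance: Sequence[Sequence[int]]) -> float:
    """Mean of the per-query AP values."""
    if not all_relevance:
        return 0.0
    return sum(average_precision(r) for r in all_relevance) / len(all_relevance)


# Two made-up queries: AP = (1/1 + 2/3) / 2 for the first, (1/2) / 1 for the second.
toy_map = mean_average_precision([[1, 0, 1], [0, 1]])
print(round(toy_map, 4))  # 0.6667
```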

Abstract

Deep Hamming hashing has gained growing popularity in approximate nearest neighbour search for large-scale image retrieval. Until now, the deep hashing community for image retrieval has been dominated by convolutional neural network architectures, e.g. ResNet (He et al., 2016). In this paper, inspired by recent advances in vision transformers, we present TransHash, a pure transformer-based framework for deep hash learning. Concretely, our framework is composed of two major modules: (1) Based on the Vision Transformer (ViT), we design a siamese vision transformer backbone for image feature extraction. To learn fine-grained features, we introduce dual-stream feature learning on top of the transformer to learn discriminative global and local features. (2) In addition, we adopt a Bayesian learning scheme with a dynamically constructed similarity matrix to…
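
As background for the retrieval setting the abstract describes, here is a minimal sketch of Hamming-distance ranking over binary hash codes. It is not the TransHash model: the sign-based `binarize` step, the function names, and the toy feature values are all assumptions made for illustration, standing in for whatever real-valued features a learned backbone would produce.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> int:
    """Pack the signs of a real-valued feature vector into an integer hash code."""
    code = 0
    for value in features:
        # Shift left, then append 1 for a positive value and 0 otherwise.
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two hash codes differ."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query: int, database: Sequence[int]) -> List[int]:
    """Return database indices sorted by ascending Hamming distance to the query."""
    return sorted(range(len(database)),
                  key=lambda i: hamming_distance(query, database[i]))


# Toy 4-bit example with made-up feature values.
query_code = binarize([0.9, -0.2, 0.4, -0.7])   # bits 1010
database_codes = [
    binarize([-0.5, 0.3, -0.1, 0.8]),           # bits 0101, distance 4
    binarize([0.7, -0.6, 0.2, 0.1]),            # bits 1011, distance 1
    binarize([0.3, -0.9, 0.5, -0.4]),           # bits 1010, distance 0
]
ranking = rank_by_hamming(query_code, database_codes)
print(ranking)  # [2, 1, 0]
```

Because each comparison is a single XOR followed by a bit count, ranking a large database of compact codes is much cheaper than comparing the real-valued features directly, which is the efficiency argument behind hashing-based retrieval.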

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods · Image Retrieval and Classification Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Softmax · Dense Connections · Vision Transformer