Vision Transformer Hashing for Image Retrieval
Shiv Ram Dubey, Satish Kumar Singh, Wei-Ta Chu

TL;DR
This paper introduces Vision Transformer based Hashing (VTS) for image retrieval. A ViT pre-trained on ImageNet serves as the backbone, a hashing head is added, and the model is fine-tuned under six hashing frameworks (DSH, HashNet, GreedyHash, IDHN, DPN, and CSQ), with reported performance superior to existing hashing techniques.
Contribution
The paper proposes a novel Vision Transformer based hashing approach that outperforms state-of-the-art methods and demonstrates the effectiveness of ViT as a backbone for image retrieval hashing.
Findings
VTS outperforms recent hashing techniques on multiple datasets.
The ViT backbone used in VTS surpasses AlexNet and ResNet backbones in retrieval tasks.
Experiments on the CIFAR10, ImageNet, NUS-Wide, and COCO datasets support the effectiveness of the proposed method.
Abstract
Deep learning has shown tremendous growth in hashing techniques for image retrieval. Recently, the Transformer has emerged as a new architecture that utilizes self-attention without convolution. The Transformer has also been extended to the Vision Transformer (ViT) for visual recognition, with promising performance on ImageNet. In this paper, we propose Vision Transformer based Hashing (VTS) for image retrieval. We utilize a ViT pre-trained on ImageNet as the backbone network and add a hashing head. The proposed VTS model is fine-tuned for hashing under six different image retrieval frameworks, including Deep Supervised Hashing (DSH), HashNet, GreedyHash, Improved Deep Hashing Network (IDHN), Deep Polarized Network (DPN) and Central Similarity Quantization (CSQ), with their objective functions. We perform extensive experiments on the CIFAR10, ImageNet, NUS-Wide, and COCO datasets. The proposed…
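As a rough illustration of how hash codes are typically used at retrieval time, the sketch below binarizes real-valued outputs (standing in for what a hashing head might emit) by sign and ranks database items by Hamming distance to the query. The abstract does not describe this step, so the function names, the sign threshold, and the toy feature values are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> List[int]:
    """Map real-valued outputs to a {0, 1} code by sign (assumed threshold at 0)."""
    return [1 if x > 0 else 0 for x in features]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions where two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def rank_by_hamming(query_code: Sequence[int],
                    database_codes: Sequence[Sequence[int]]) -> List[int]:
    """Return database indices sorted from nearest to farthest in Hamming distance."""
    distances = [hamming(query_code, code) for code in database_codes]
    return sorted(range(len(database_codes)), key=lambda i: distances[i])


# Hypothetical 4-bit outputs for three database images and one query image.
database_features = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.8, 0.3, -0.5, 0.6],
    [0.7, -0.1, -0.3, -0.9],
]
query_features = [0.8, -0.4, 0.1, -0.6]

database_codes = [binarize(f) for f in database_features]
query_code = binarize(query_features)
ranking = rank_by_hamming(query_code, database_codes)
print(ranking)  # prints [0, 2, 1]
```

In practice the codes would be much longer than four bits, and the real-valued inputs would come from the fine-tuned network rather than hand-written lists.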
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Video Surveillance and Tracking Methods
Methods: Attention Is All You Need · Linear Layer · 1x1 Convolution · Batch Normalization · Average Pooling · Max Pooling · Residual Block · Bottleneck Residual Block · Dropout
