RAMS-Trans: Recurrent Attention Multi-scale Transformer forFine-grained Image Recognition
Yunqing Hu, Xuan Jin, Yin Zhang, Haiwen Hong, Jingfeng Zhang, Yuan He,, Hui Xue

TL;DR
RAMS-Trans introduces a recursive multi-scale transformer that leverages attention weights for fine-grained image recognition, outperforming existing models by effectively capturing discriminative regions without additional annotations.
Contribution
The paper proposes RAMS-Trans, a novel recurrent attention multi-scale transformer that uses attention weights for dynamic region proposal and multi-scale feature learning in FGIR.
Findings
Achieves state-of-the-art results on three benchmark datasets.
Outperforms concurrent CNN-based and transformer models.
Requires only attention weights from ViT, enabling end-to-end training.
Abstract
In fine-grained image recognition (FGIR), the localization and amplification of region attention is an important factor, which has been explored a lot by convolutional neural networks (CNNs) based approaches. The recently developed vision transformer (ViT) has achieved promising results on computer vision tasks. Compared with CNNs, Image sequentialization is a brand new manner. However, ViT is limited in its receptive field size and thus lacks local attention like CNNs due to the fixed size of its patches, and is unable to generate multi-scale features to learn discriminative region attention. To facilitate the learning of discriminative region attention without box/part annotations, we use the strength of the attention weights to measure the importance of the patch tokens corresponding to the raw images. We propose the recurrent attention multi-scale transformer (RAMS-Trans), which…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Dense Connections · Softmax · Layer Normalization · Vision Transformer
