Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers
Edwin Arkel Rios, Min-Chun Hu, Bo-Cheng Lai

TL;DR
This paper introduces a computationally efficient method for fine-grained image recognition with vision transformers. It selects discriminative regions by comparing the similarity between global and local representations, improving accuracy at lower computational cost.
Contribution
The authors propose a novel similarity-based region selection technique for vision transformers that enhances fine-grained recognition accuracy while reducing computational expense.
Findings
Achieves higher accuracy across multiple datasets.
Reduces computational cost compared to existing methods.
Effective in selecting discriminative image regions.
Abstract
Fine-grained recognition involves the classification of images from subordinate macro-categories, and it is challenging due to small inter-class differences. To overcome this, most methods perform discriminative feature selection enabled by a feature extraction backbone followed by a high-level feature refinement step. Recently, many studies have shown the potential of vision transformers as a backbone for fine-grained recognition, but using their attention mechanism to select discriminative tokens can be computationally expensive. In this work, we propose a novel and computationally inexpensive metric to identify discriminative regions in an image. We compare the similarity between the global representation of an image, given by the CLS token (a learnable token used by transformers for classification), and the local representations of individual patches. We select the regions…
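The comparison described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the backbone returns a token matrix whose first row is the CLS token and whose remaining rows are patch tokens, and it assumes cosine similarity as the metric, since the abstract as shown here does not name one. The function names `global_local_similarity` and `select_patches` and the `largest` flag are hypothetical. Because the abstract is truncated before it states which regions are kept, the selection direction is left as a parameter.

```python
import numpy as np


def global_local_similarity(tokens: np.ndarray) -> np.ndarray:
    """Cosine similarity between the CLS token and every patch token.

    tokens: array of shape (1 + N, D); row 0 is assumed to be the CLS token.
    Returns an array of shape (N,), one score per patch.
    """
    cls, patches = tokens[0], tokens[1:]
    eps = 1e-12  # guards against division by zero for all-zero vectors
    cls_unit = cls / (np.linalg.norm(cls) + eps)
    patch_norms = np.linalg.norm(patches, axis=1, keepdims=True)
    patches_unit = patches / (patch_norms + eps)
    return patches_unit @ cls_unit


def select_patches(tokens: np.ndarray, k: int, largest: bool = True):
    """Return the indices of k patches ranked by similarity, plus all scores.

    The `largest` flag is an assumption of this sketch: it only controls
    whether the most or the least similar patches are returned.
    """
    sims = global_local_similarity(tokens)
    order = np.argsort(sims)  # ascending similarity
    if largest:
        order = order[::-1]   # descending similarity
    return order[:k], sims


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for a ViT output: 1 CLS token + 196 patch tokens, 768 dims.
    toy_tokens = rng.normal(size=(197, 768))
    indices, scores = select_patches(toy_tokens, k=8)
    print(indices, scores[indices])
```

The score needs one dot product per patch, which is the sense in which a similarity-based criterion is cheap compared with selection that aggregates attention maps.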
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
Methods: Softmax · Attention Is All You Need · Feature Selection
