Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers

Edwin Arkel Rios; Min-Chun Hu; Bo-Cheng Lai

arXiv:2407.12891 · cs.CV · July 19, 2024 · 1 cites

TL;DR

This paper introduces a computationally efficient method for fine-grained image recognition with vision transformers. It selects discriminative regions according to the similarity between global and local representations, improving accuracy at a lower computational cost.

Contribution

The authors propose a novel similarity-based region selection technique for vision transformers that enhances fine-grained recognition accuracy while reducing computational expense.

Findings

01 Achieves higher accuracy across multiple datasets.

02 Reduces computational cost compared to existing methods.

03 Effective in selecting discriminative image regions.

Abstract

Fine-grained recognition involves the classification of images from subordinate macro-categories, and it is challenging due to small inter-class differences. To overcome this, most methods perform discriminative feature selection enabled by a feature extraction backbone followed by a high-level feature refinement step. Recently, many studies have shown the potential of vision transformers as a backbone for fine-grained recognition, but using the transformer's attention mechanism to select discriminative tokens can be computationally expensive. In this work, we propose a novel and computationally inexpensive metric to identify discriminative regions in an image. We compare the similarity between the global representation of an image, given by the CLS token (a learnable token used by transformers for classification), and the local representation of individual patches. We select the regions…
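The global-local comparison described in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it states the selection rule, so the choices here are assumptions rather than the authors' implementation: cosine similarity as the similarity measure, ranking patches from most to least similar, and keeping the top `k`. The function names and the toy token vectors are hypothetical; in practice the tokens would come from a vision transformer's output.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_patches_by_global_similarity(cls_token, patch_tokens, k):
    """Score each patch token against the CLS token and return the top-k patch indices.

    Assumption: higher similarity to the global (CLS) representation is used
    as the ranking criterion; the source text does not confirm this rule.
    """
    scores = [cosine_similarity(cls_token, patch) for patch in patch_tokens]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores


# Toy 3-dimensional tokens (hypothetical values, for illustration only).
cls_token = [1.0, 0.0, 1.0]
patch_tokens = [
    [1.0, 0.0, 1.0],    # same direction as the CLS token
    [0.0, 1.0, 0.0],    # orthogonal to the CLS token
    [1.0, 1.0, 0.0],    # partially aligned
    [-1.0, 0.0, -1.0],  # opposite direction
]

top_indices, scores = rank_patches_by_global_similarity(cls_token, patch_tokens, k=2)
print(top_indices)  # prints [0, 2]
```

Scoring in this sketch needs one dot product per patch, so the cost grows linearly with the number of patches. This fits the abstract's description of the metric as computationally inexpensive, although the actual selection procedure may differ from the one assumed here.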

Code & Models

Repositories

arkel23/GLSim
PyTorch · Official

Models

🤗 NYCU-PCSxNTHU-MIS/GLSim · model

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques

Methods: Softmax · Attention Is All You Need · Feature Selection