M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition
Jiyong Moon, Junseok Lee, Yunju Lee, and Seongsik Park

TL;DR
M2Former introduces multi-scale patch selection and cross-scale interaction mechanisms to enhance fine-grained visual recognition, outperforming existing CNN and ViT models on standard benchmarks.
Contribution
The paper proposes a novel multi-scale patch selection method with cross-scale interactions for ViT-based FGVR, improving representational richness and scale robustness.
Findings
Outperforms existing CNN- and ViT-based models on FGVR benchmarks.
Improves recognition of objects of varying sizes.
Improves feature hierarchy and model performance.
Abstract
Recently, vision Transformers (ViTs) have been actively applied to fine-grained visual recognition (FGVR). ViT can effectively model the interdependencies between patch-divided object regions through an inherent self-attention mechanism. In addition, patch selection is used with ViT to remove redundant patch information and highlight the most discriminative object patches. However, existing ViT-based FGVR models are limited to single-scale processing, and their fixed receptive fields hinder representational richness and exacerbate vulnerability to scale variability. Therefore, we propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models. Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT). In addition, we introduce class token transfer (CTT) and multi-scale…
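The idea of selecting salient patches at several scales can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, purely for illustration, that each stage yields a list of non-empty patch feature vectors, that a patch's saliency is the mean of its feature values, and that a fixed number of patches is kept per stage. The function names `select_top_k_patches` and `multi_scale_select` are hypothetical.

```python
from typing import List, Sequence


def select_top_k_patches(patches: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k highest-scoring patches, in original order.

    Assumption: saliency is the mean feature value of a patch. The paper's
    actual scoring rule may differ.
    """
    scores = [sum(p) / len(p) for p in patches]
    # Rank patch indices by descending score, then keep the first k.
    ranked = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)
    # Restore spatial order so downstream layers see patches in sequence.
    return sorted(ranked[:k])


def multi_scale_select(
    stages: Sequence[Sequence[Sequence[float]]], ks: Sequence[int]
) -> List[List[int]]:
    """Apply top-k selection independently at each stage (scale).

    `stages[s]` holds the patch features of stage s; later stages usually
    have fewer, coarser patches. `ks[s]` is the hypothetical per-stage budget.
    """
    return [select_top_k_patches(p, k) for p, k in zip(stages, ks)]


if __name__ == "__main__":
    fine = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.4], [0.7, 0.9]]  # 4 small patches
    coarse = [[0.5, 0.1, 0.0], [0.2, 0.9, 0.7]]              # 2 large patches
    print(multi_scale_select([fine, coarse], [2, 1]))
```

Selecting per stage, rather than once at a single resolution, is what lets both small and large discriminative regions survive the pruning step in this sketch.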
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Dense Connections · Label Smoothing · Dropout · Absolute Position Encodings · Byte Pair Encoding
