M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition

Jiyong Moon, Junseok Lee, Yunju Lee, and Seongsik Park

arXiv:2308.02161·cs.CV·October 8, 2024

Open Access

TL;DR

M2Former introduces multi-scale patch selection and cross-scale interaction mechanisms to enhance fine-grained visual recognition, outperforming existing CNN and ViT models on standard benchmarks.

Contribution

The paper proposes a novel multi-scale patch selection method with cross-scale interactions for ViT-based FGVR, improving representational richness and scale robustness.

Findings

1. Outperforms CNN and ViT models on FGVR benchmarks.
2. Enhances recognition of objects across various sizes.
3. Improves feature hierarchy and model performance.

Abstract

Recently, vision Transformers (ViTs) have been actively applied to fine-grained visual recognition (FGVR). ViT can effectively model the interdependencies between patch-divided object regions through an inherent self-attention mechanism. In addition, patch selection is used with ViT to remove redundant patch information and highlight the most discriminative object patches. However, existing ViT-based FGVR models are limited to single-scale processing, and their fixed receptive fields hinder representational richness and exacerbate vulnerability to scale variability. Therefore, we propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models. Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT). In addition, we introduce class token transfer (CTT) and multi-scale…
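The patch selection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how saliency is scored or how many patches are kept, so everything here is an assumption made for illustration: the function name `select_salient_patches`, the mean-activation saliency score, the stage names, the feature shapes, and the per-stage values of `k` are all hypothetical and are not taken from the paper.

```python
import numpy as np


def select_salient_patches(features, k):
    """Keep the k highest-scoring patches of one stage.

    features: array of shape (num_patches, dim).
    Saliency is the mean activation of each patch. This scoring rule is an
    assumption for illustration, not the paper's selection criterion.
    Returns the kept features and their original patch indices.
    """
    scores = features.mean(axis=1)
    top = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    top = np.sort(top)                  # restore original patch order
    return features[top], top


# Hypothetical two-stage setup: later stages have fewer, coarser patches.
rng = np.random.default_rng(0)
stages = {
    "stage1": rng.normal(size=(196, 64)),   # fine-scale patches
    "stage2": rng.normal(size=(49, 128)),   # coarse-scale patches
}
k_per_stage = {"stage1": 16, "stage2": 8}   # assumed selection budgets

selected = {
    name: select_salient_patches(feats, k_per_stage[name])
    for name, feats in stages.items()
}
```

Each entry of `selected` holds the kept patch features and their indices for one scale. Selecting separately per stage is what makes the selection multi-scale in this sketch; how the selected patches from different stages are then combined is not covered here.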

Taxonomy

Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Dense Connections · Label Smoothing · Dropout · Absolute Position Encodings · Byte Pair Encoding