The Loupe: A Plug-and-Play Attention Module for Amplifying Discriminative Features in Vision Transformers

Naren Sengodan

arXiv:2508.16663·cs.CV·May 19, 2026

TL;DR

The paper introduces The Loupe, a lightweight, plug-and-play spatial gating module for hierarchical Vision Transformers. It improves fine-grained classification by reweighting features toward discriminative regions, while adding under 0.1% extra parameters.

Contribution

The paper proposes a spatial gating module that plugs into an intermediate feature stage of a hierarchical Vision Transformer and is trained end to end to improve fine-grained visual classification.

Findings

01

On CUB-200-2011, The Loupe improves Swin-Base accuracy from 88.36% to 91.72%.

02

On CUB-200-2011, The Loupe improves Swin-Tiny accuracy from 85.14% to 88.61%.

03

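Ablations suggest that controlled spatial gating is more effective than naive multi-scale masking in this setting, and that the gain depends on the insertion point and the sparsity regularizer.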
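To put the two accuracy findings on a common footing, the short calculation below derives the absolute gain in percentage points and the relative reduction in error rate from the reported numbers. Only the four accuracy figures come from the paper; the `summarize` helper and the choice of relative error reduction as a metric are illustrative assumptions.

```python
# Reported top-1 accuracy on CUB-200-2011 (baseline, with The Loupe).
results = {
    "Swin-Base": (88.36, 91.72),
    "Swin-Tiny": (85.14, 88.61),
}


def summarize(baseline, with_loupe):
    """Return (absolute gain in points, relative error reduction in %)."""
    gain = with_loupe - baseline
    # Error rate is 100 - accuracy, so the error drops by `gain` points.
    error_reduction = 100.0 * gain / (100.0 - baseline)
    return round(gain, 2), round(error_reduction, 2)


summary = {name: summarize(*accs) for name, accs in results.items()}

for name, (gain, err_red) in summary.items():
    print(f"{name}: +{gain:.2f} points, {err_red:.2f}% relative error reduction")
```

The absolute gains are 3.36 points for Swin-Base and 3.47 points for Swin-Tiny, which correspond to removing roughly 29% and 23% of the baseline errors, respectively.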

Abstract

Fine-Grained Visual Classification (FGVC) requires models to focus on subtle, task-relevant regions rather than broad object context. We present The Loupe, a lightweight plug-and-play spatial gating module for hierarchical Vision Transformers. The module is inserted at an intermediate feature stage, predicts a single-channel spatial mask with a small CNN, and uses that mask to reweight feature activations during end-to-end training with a cross-entropy objective and an ℓ1 sparsity term. On CUB-200-2011, The Loupe improves Swin-Base from 88.36% to 91.72% and Swin-Tiny from 85.14% to 88.61%, with under 0.1% additional parameters. Ablations show that the improvement depends on the insertion point and the sparsity regularizer, suggesting that controlled spatial gating is more effective than naive multi-scale masking in this setting. Qualitative results indicate that the learned masks often…
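The mechanism described in the abstract (predict a single-channel mask, reweight the features, and train with cross-entropy plus a sparsity penalty) can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the `SpatialGate` class, the use of two pointwise (1x1) convolutions with a sigmoid output, the hidden width of 64, the 14x14x512 feature shape, and the `sparsity_weight` of 0.01 are all assumptions chosen for brevity. The paper's small CNN may use different kernels, widths, and weighting.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class SpatialGate:
    """Hypothetical gate: features (H, W, C) -> mask (H, W, 1) in (0, 1)."""

    def __init__(self, channels, hidden=64):
        # Two pointwise (1x1) convolutions, written as per-location matmuls.
        self.w1 = rng.normal(0.0, 0.02, size=(channels, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.02, size=(hidden, 1))
        self.b2 = np.zeros(1)

    def num_params(self):
        return sum(p.size for p in (self.w1, self.b1, self.w2, self.b2))

    def __call__(self, feats):
        hidden = np.maximum(feats @ self.w1 + self.b1, 0.0)  # ReLU
        mask = sigmoid(hidden @ self.w2 + self.b2)           # (H, W, 1)
        # The single-channel mask broadcasts across all feature channels.
        return feats * mask, mask


def cross_entropy(logits, label):
    shifted = logits - logits.max()  # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]


def gated_loss(logits, label, mask, sparsity_weight=0.01):
    # Cross-entropy plus an l1 penalty on the mask (assumed mean-reduced).
    return cross_entropy(logits, label) + sparsity_weight * np.abs(mask).mean()


# Illustrative shapes: a 14x14 intermediate feature map with 512 channels.
feats = rng.normal(size=(14, 14, 512))
gate = SpatialGate(channels=512)
gated, mask = gate(feats)

# Stand-in for the classifier output over the 200 classes of CUB-200-2011.
logits = rng.normal(size=200)
loss = gated_loss(logits, label=3, mask=mask)
```

In this sketch the gate has 32,897 parameters, which is a small fraction of a backbone with tens of millions of parameters and so is in the same spirit as the sub-0.1% overhead reported, although the actual module may be sized differently. The sparsity term pushes mask values toward zero, so only locations that help the classification loss stay strongly weighted.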
