The Loupe: A Plug-and-Play Attention Module for Amplifying Discriminative Features in Vision Transformers
Naren Sengodan

TL;DR
The paper introduces The Loupe, a lightweight, plug-and-play spatial gating module for hierarchical Vision Transformers. It improves fine-grained classification by reweighting features toward discriminative regions, at a cost of under 0.1% additional parameters.
Contribution
It proposes a spatial gating module that is inserted at an intermediate feature stage of a hierarchical Vision Transformer and trained end-to-end to improve fine-grained visual classification.
Findings
On CUB-200-2011, The Loupe improves Swin-Base accuracy from 88.36% to 91.72%.
On CUB-200-2011, The Loupe improves Swin-Tiny accuracy from 85.14% to 88.61%.
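Ablations suggest that controlled spatial gating is more effective than naive multi-scale masking in this setting, and that the gain depends on the insertion point and the sparsity regularizer.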
Abstract
Fine-Grained Visual Classification (FGVC) requires models to focus on subtle, task-relevant regions rather than broad object context. We present The Loupe, a lightweight plug-and-play spatial gating module for hierarchical Vision Transformers. The module is inserted at an intermediate feature stage, predicts a single-channel spatial mask with a small CNN, and uses that mask to reweight feature activations during end-to-end training with a cross-entropy objective and an l1 sparsity term. On CUB-200-2011, The Loupe improves Swin-Base from 88.36% to 91.72% and Swin-Tiny from 85.14% to 88.61%, with under 0.1% additional parameters. Ablations show that the improvement depends on the insertion point and the sparsity regularizer, suggesting that controlled spatial gating is more effective than naive multi-scale masking in this setting. Qualitative results indicate that the learned masks often…
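The mechanism described in the abstract (a small CNN predicts a single-channel mask, the mask reweights intermediate features, and training combines cross-entropy with an l1 sparsity term) can be sketched roughly as follows in PyTorch. This is an illustration under stated assumptions, not the author's implementation: the class name SpatialGate, the helper gated_loss, the two-layer mask network, the sigmoid activation, the channels-first (B, C, H, W) layout, the 256-channel feature map, the linear head, and the sparsity weight of 0.01 are all hypothetical choices made for the example. Only the 200-class output is taken from the dataset (CUB-200-2011).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialGate(nn.Module):
    """Hypothetical spatial gate: small CNN -> 1-channel mask -> reweighted features."""

    def __init__(self, channels: int, hidden: int = 16):
        super().__init__()
        # Assumed architecture: one 3x3 conv, ReLU, then a 1x1 conv down to one channel.
        self.mask_net = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor):
        # feats: (B, C, H, W). Some Swin implementations use (B, H, W, C) or
        # (B, L, C); those would need a permute/reshape first.
        mask = torch.sigmoid(self.mask_net(feats))  # (B, 1, H, W), values in [0, 1]
        return feats * mask, mask  # mask broadcasts across the channel dimension


def gated_loss(logits, targets, mask, sparsity_weight: float = 0.01):
    """Cross-entropy plus an l1 penalty on the mask (the weight is an assumed value)."""
    ce = F.cross_entropy(logits, targets)
    l1 = mask.abs().mean()
    return ce + sparsity_weight * l1


# Toy forward/backward pass on random data standing in for an intermediate stage.
torch.manual_seed(0)
feats = torch.randn(2, 256, 14, 14)      # assumed intermediate feature map
gate = SpatialGate(channels=256)
gated, mask = gate(feats)

head = nn.Linear(256, 200)               # 200 classes, as in CUB-200-2011
logits = head(gated.mean(dim=(2, 3)))    # global average pool, then classify
targets = torch.randint(0, 200, (2,))

loss = gated_loss(logits, targets, mask)
loss.backward()                          # gradients reach the mask network end-to-end
```

In a real model the gated features would be passed on to the remaining transformer stages rather than pooled directly. The pooling and linear head here only keep the sketch short enough to run on its own.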
