Efficient Masked Attention Transformer for Few-Shot Classification and Segmentation

Dustin Carrión-Ojeda; Stefan Roth; Simone Schaub-Meyer

arXiv:2507.23642·cs.CV·February 4, 2026

TL;DR

The paper introduces EMAT, an efficient transformer for few-shot classification and segmentation (FS-CS). It combines a memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements to improve accuracy, especially on small objects, and it outperforms existing FS-CS methods.

Contribution

EMAT is a resource-efficient transformer that improves classification and segmentation accuracy for small objects in few-shot tasks. The paper also introduces new evaluation settings that use the annotations the current FS-CS evaluation setting discards.

Findings

1. EMAT outperforms all FS-CS methods on the PASCAL-5$^i$ and COCO-20$^i$ datasets.

2. EMAT uses at least four times fewer trainable parameters than previous methods.

3. New evaluation settings better reflect practical scenarios by utilizing available annotations.

Abstract

Few-shot classification and segmentation (FS-CS) focuses on jointly performing multi-label classification and multi-class segmentation using few annotated examples. Although the current state of the art (SOTA) achieves high accuracy in both tasks, it struggles with small objects. To overcome this, we propose the Efficient Masked Attention Transformer (EMAT), which improves classification and segmentation accuracy, especially for small objects. EMAT introduces three modifications: a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. EMAT outperforms all FS-CS methods on the PASCAL-5$^i$ and COCO-20$^i$ datasets, using at least four times fewer trainable parameters. Moreover, as the current FS-CS evaluation setting discards available annotations, despite their costly collection, we introduce two novel evaluation…
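
The abstract names a memory-efficient masked attention mechanism but does not describe how it works. As a generic illustration only, and not EMAT's actual design, the sketch below contrasts two equivalent ways of applying a mask in scaled dot-product attention: scoring every key and pushing masked ones to negative infinity, versus dropping masked keys before any scores are computed, which avoids work on tokens that end up with zero weight. The function names `attend_dense` and `attend_gathered` are hypothetical, and the code uses plain Python lists rather than a tensor library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend_dense(queries, keys, values, keep):
    """Baseline: score every key, then set masked keys to -inf before softmax."""
    if not any(keep):
        raise ValueError("mask must keep at least one token")
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [dot(q, k) * scale if m else -math.inf
                  for k, m in zip(keys, keep)]
        weights = softmax(scores)  # masked keys receive weight 0.0
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def attend_gathered(queries, keys, values, keep):
    """Variant: gather only the kept keys/values, then attend over that subset."""
    kept = [i for i, m in enumerate(keep) if m]
    sub_keys = [keys[i] for i in kept]
    sub_values = [values[i] for i in kept]
    return attend_dense(queries, sub_keys, sub_values, [True] * len(kept))


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.5, 0.5]]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    keep = [True, False, True, False]
    print(attend_dense(queries, keys, values, keep))
    print(attend_gathered(queries, keys, values, keep))
```

Both functions produce the same outputs up to floating-point rounding; the gathered variant simply never forms scores for masked tokens, so its cost scales with the number of kept tokens rather than the full sequence length.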
