Rethinking Generalization in Few-Shot Classification
Markus Hiller, Rongkai Ma, Mehrtash Harandi, Tom Drummond

TL;DR
This paper introduces a novel few-shot classification method using Vision Transformers to identify and optimize the most informative image patches, achieving state-of-the-art results without relying on detailed annotations.
Contribution
It proposes a patch-based approach with online optimization for interpretability and leverages masked image modeling to improve generalization in few-shot learning.
Findings
Achieves new state-of-the-art on four few-shot benchmarks.
Effectively identifies key image regions for classification.
Avoids supervision collapse through unsupervised training.
Abstract
Single image-level annotations only correctly describe an often small subset of an image's content, particularly when complex real-world scenes are depicted. While this might be acceptable in many classification scenarios, it poses a significant challenge for applications where the set of classes differs significantly between training and test time. In this paper, we take a closer look at the implications in the context of few-shot learning. Splitting the input samples into patches and encoding these via the help of Vision Transformers allows us to establish semantic correspondences between local regions across images and independent of their respective class. The most informative patch embeddings for the task at hand are then determined as a function of the support set via online optimization at inference time, additionally providing visual interpretability of 'what…
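The abstract's idea of choosing informative support patches by online optimization at inference time can be illustrated with a small NumPy sketch. This is not the authors' implementation: the patch embeddings are assumed to come from some pretrained Vision Transformer, and the function names, the max-cosine matching rule, the sigmoid importance weights and the leave-one-image-out pseudo-queries are all assumptions made for this example. It also assumes at least two support images per class, so that hiding one image still leaves patches for its class.

```python
import numpy as np


def _normalize(x):
    """L2-normalise along the last axis so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)


def _softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def _sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def best_match_scores(query_patches, support_patches):
    """For each support patch, cosine similarity to its best-matching query patch."""
    sim = _normalize(query_patches) @ _normalize(support_patches).T  # (M, P)
    return sim.max(axis=0)  # (P,)


def class_logits(scores, importance, patch_onehot, mask, temperature):
    """Importance-weighted mean similarity per class, scaled by 1/temperature."""
    counts = np.maximum(patch_onehot.T @ mask, 1.0)  # usable patches per class
    return (patch_onehot.T @ (importance * mask * scores)) / counts / temperature


def fit_patch_importance(support, labels, n_classes, steps=50, lr=1.0, temperature=0.1):
    """Hypothetical helper. support: (N, M, D) patch embeddings, labels: (N,) ints.

    Returns one importance value in (0, 1) per support patch, shape (N * M,).
    """
    n_img, n_patch, dim = support.shape
    flat = support.reshape(n_img * n_patch, dim)
    patch_img = np.repeat(np.arange(n_img), n_patch)  # owning image of each patch
    patch_onehot = np.eye(n_classes)[np.repeat(labels, n_patch)]  # (P, C)
    # Similarities do not depend on the importances, so compute them once.
    scores = np.stack([best_match_scores(support[i], flat) for i in range(n_img)])
    v = np.zeros(n_img * n_patch)  # importance logits, one per support patch
    for _ in range(steps):
        a = _sigmoid(v)
        grad = np.zeros_like(v)
        for i in range(n_img):  # each support image acts as a pseudo-query
            mask = (patch_img != i).astype(float)  # hide the pseudo-query's own patches
            counts = np.maximum(patch_onehot.T @ mask, 1.0)
            probs = _softmax(class_logits(scores[i], a, patch_onehot, mask, temperature))
            err = probs - np.eye(n_classes)[labels[i]]  # dLoss/dlogits for cross-entropy
            # Chain rule through the class logits and the sigmoid.
            grad += (patch_onehot @ (err / counts)) * mask * scores[i] * a * (1 - a) / temperature
        v -= lr * grad / n_img  # plain gradient descent step
    return _sigmoid(v)


def predict(query_patches, support, labels, n_classes, importance, temperature=0.1):
    """Classify one query image using the fitted per-patch importances."""
    n_img, n_patch, dim = support.shape
    flat = support.reshape(n_img * n_patch, dim)
    patch_onehot = np.eye(n_classes)[np.repeat(labels, n_patch)]
    scores = best_match_scores(query_patches, flat)
    logits = class_logits(scores, importance, patch_onehot, np.ones(len(flat)), temperature)
    return int(np.argmax(logits))
```

In this toy formulation, support patches that match every class equally well (for example, shared background) receive cancelling gradient contributions, while patches that match only their own class are pushed up. The fitted importances can be reshaped to the patch grid and viewed as a rough saliency map.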
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI · Multimodal Machine Learning Applications
Methods: Test · Residual Connection · Layer Normalization · Swin Transformer · Linear Layer · Softmax · Multi-Head Attention · Attention Is All You Need · Vision Transformer
