Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation
Dahyun Kang, Piotr Koniusz, Minsu Cho, Naila Murray

TL;DR
This paper uses a self-supervised Vision Transformer for weakly-supervised few-shot classification and segmentation. Attention maps derived from the ViT's tokens serve as pixel-level pseudo-labels, so training needs only image-level labels.
Contribution
It proposes an approach that uses token correlations and attention maps from a self-supervised ViT for weakly-supervised classification and segmentation, and includes a pseudo-label enhancement technique.
Findings
Reports significant performance improvements on the Pascal-5i and COCO-20i datasets.
Remains effective when few pixel-level labels are available.
Demonstrates the viability of self-supervised ViTs for weakly-supervised tasks.
Abstract
We address the task of weakly-supervised few-shot image classification and segmentation, by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. Our model is able to effectively learn to perform classification and segmentation in the absence of pixel-level labels during training, using only image-level labels. To do this it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with "mixed" supervision, where a small number of training images contains ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we…
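The idea of turning token-level maps into pixel-level pseudo-labels can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `attention_pseudo_mask`, the use of cosine similarity against a single query token as a stand-in for a ViT attention map, and the `threshold` parameter are all assumptions made for illustration.

```python
import numpy as np


def attention_pseudo_mask(patch_tokens, query_token, grid_hw, threshold=0.5):
    """Turn token similarities into a binary pseudo-mask (illustrative only).

    patch_tokens: (N, D) array of patch token embeddings, with N = H * W.
    query_token:  (D,) embedding, e.g. a class-level token (assumption).
    grid_hw:      (H, W) shape of the patch grid.
    threshold:    hypothetical cut-off applied to the normalised map.
    """
    h, w = grid_hw
    eps = 1e-8
    # Cosine similarity between each patch token and the query token.
    tokens = patch_tokens / (np.linalg.norm(patch_tokens, axis=1, keepdims=True) + eps)
    query = query_token / (np.linalg.norm(query_token) + eps)
    sim = tokens @ query  # shape (N,)
    # Min-max normalise to [0, 1] so the threshold does not depend on scale.
    sim = (sim - sim.min()) / (sim.max() - sim.min() + eps)
    # Patches above the threshold are marked as foreground in the pseudo-label.
    return (sim.reshape(h, w) >= threshold).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_tokens = rng.normal(size=(16, 8))  # 4x4 patch grid, 8-dim toy tokens
    toy_query = toy_tokens[:4].mean(axis=0)  # pretend the top row is the object
    print(attention_pseudo_mask(toy_tokens, toy_query, (4, 4)))
```

In a weakly-supervised setup, a mask like this would stand in for the missing ground-truth segmentation when training the segmentation head. The mask is defined on the patch grid, so it would still need upsampling to image resolution.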
Taxonomy
Topics: Image Processing Techniques and Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Dropout
