Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

Dahyun Kang; Piotr Koniusz; Minsu Cho; Naila Murray

arXiv:2307.03407 · cs.CV · July 10, 2023

TL;DR

This paper introduces a method that uses self-supervised Vision Transformers to perform weakly-supervised few-shot classification and segmentation, effectively leveraging attention maps as pseudo-labels without requiring extensive pixel-level annotations.

Contribution

The paper proposes an approach that uses token correlations and attention maps from a self-supervised ViT for weakly-supervised classification and segmentation, together with a pseudo-label enhancement technique.

Findings

01. Significant performance improvements on Pascal-5i and COCO-20i datasets.

02. Effective in scenarios with minimal pixel-level labels.

03. Demonstrates the viability of self-supervised ViT for weakly-supervised tasks.

Abstract

We address the task of weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. Our model effectively learns to perform classification and segmentation in the absence of pixel-level labels during training, using only image-level labels. To do this, it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with "mixed" supervision, where a small number of training images contain ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we…
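The idea of turning token attention into pixel-level pseudo-labels can be illustrated with a toy sketch. This is not the authors' implementation: the function names `attention_map` and `pseudo_mask`, the use of a single query token, the min-max normalisation, and the 0.5 threshold are all assumptions made for illustration. It only shows how a softmax attention distribution over patch tokens might be thresholded into a binary mask on the patch grid.

```python
import math


def attention_map(query, patch_tokens):
    """Scaled dot-product attention of one query token over all patch tokens.

    Hypothetical helper: returns a softmax distribution with one weight per patch.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * p for q, p in zip(query, tok)) / scale for tok in patch_tokens]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_mask(attn, grid_h, grid_w, threshold=0.5):
    """Min-max normalise attention weights and threshold them into a binary mask.

    The threshold value is an assumption, not a setting taken from the paper.
    """
    if len(attn) != grid_h * grid_w:
        raise ValueError("attention length must equal grid_h * grid_w")
    lo, hi = min(attn), max(attn)
    span = hi - lo
    norm = [(a - lo) / span if span > 0 else 0.0 for a in attn]
    flat = [1 if v >= threshold else 0 for v in norm]
    # Reshape the flat list of patch decisions into grid rows.
    return [flat[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy data: a 2x2 patch grid with 2-dimensional tokens (made-up values).
query = [1.0, 0.0]
patches = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.1], [-1.0, 0.0]]

attn = attention_map(query, patches)
mask = pseudo_mask(attn, grid_h=2, grid_w=2)
print(mask)
```

In this toy case the two patches aligned with the query receive the highest attention and form the foreground of the pseudo-mask. In a real pipeline the mask would still need to be upsampled from the patch grid to pixel resolution.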

Taxonomy

Topics: Image Processing Techniques and Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Dropout