NamedMask: Distilling Segmenters from Complementary Foundation Models
Gyungin Shin, Weidi Xie, Samuel Albanie

TL;DR
NamedMask distills the complementary strengths of the CLIP and DINO foundation models to perform semantic segmentation without pixel-level labels, achieving competitive results on multiple benchmarks.
Contribution
The paper introduces a method that generates high-quality segmentation masks without pixel-level labels by combining CLIP's ability to name image content with DINO's ability to capture the spatial extent of objects.
Findings
Achieves strong performance on VOC2012, COCO, and ImageNet-S datasets.
Effectively segments both single-object and multi-object images.
Outperforms prior methods in zero-label semantic segmentation.
Abstract
The goal of this work is to segment and name regions of images without access to pixel-level labels during training. To tackle this task, we construct segmenters by distilling the complementary strengths of two foundation models. The first, CLIP (Radford et al. 2021), exhibits the ability to assign names to image content but lacks an accessible representation of object structure. The second, DINO (Caron et al. 2021), captures the spatial extent of objects but has no knowledge of object names. Our method, termed NamedMask, begins by using CLIP to construct category-specific archives of images. These images are pseudo-labelled with a category-agnostic salient object detector bootstrapped from DINO, then refined by category-specific segmenters using the CLIP archive labels. Thanks to the high quality of the refined masks, we show that a standard segmentation architecture trained on these…
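The first two stages described in the abstract (building category-specific archives with CLIP, then pseudo-labelling archive images with a category-agnostic saliency mask) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-D vectors stand in for CLIP image and text embeddings, the hand-written binary mask stands in for the output of a DINO-bootstrapped salient object detector, and the names `build_archives`, `to_semantic_mask`, and `top_k` are hypothetical. The refinement step with category-specific segmenters is omitted.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]
Mask = List[List[int]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def build_archives(
    image_embs: Dict[str, Vector],
    text_embs: Dict[str, Vector],
    top_k: int,
) -> Dict[str, List[str]]:
    """For each category, keep the top_k images most similar to its text embedding.

    Assumption: archives are formed by ranking images against a category
    prompt; the exact retrieval rule used in the paper may differ.
    """
    archives: Dict[str, List[str]] = {}
    for category, text_vec in text_embs.items():
        ranked = sorted(
            image_embs,
            key=lambda name: cosine(image_embs[name], text_vec),
            reverse=True,
        )
        archives[category] = ranked[:top_k]
    return archives


def to_semantic_mask(saliency: Mask, class_id: int) -> Mask:
    """Turn a binary, category-agnostic saliency mask into a semantic
    pseudo-label: salient pixels get class_id, background stays 0."""
    return [[class_id if px else 0 for px in row] for row in saliency]


# Toy stand-ins for CLIP embeddings (hypothetical values).
image_embs = {
    "img_a": [0.9, 0.1],
    "img_b": [0.8, 0.3],
    "img_c": [0.1, 0.9],
}
text_embs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
class_ids = {"cat": 1, "dog": 2}

archives = build_archives(image_embs, text_embs, top_k=2)

# Toy stand-in for a saliency mask predicted on an image in the "dog" archive.
saliency = [[0, 1, 1], [0, 1, 0]]
pseudo_label = to_semantic_mask(saliency, class_ids["dog"])
```

In this sketch the archive supplies the name and the saliency mask supplies the spatial extent, which mirrors the division of labour between CLIP and DINO described above. Note that a simple top-k rule can place the same image in more than one archive (here `img_b`); how such cases are handled is left open.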
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Vision Transformer · Contrastive Language-Image Pre-training
