Unsupervised Semantic Segmentation by Distilling Feature Correspondences
Mark Hamilton, Zhoutong Zhang, Bharath Hariharan, Noah Snavely, William T. Freeman

TL;DR
This paper introduces STEGO, a novel framework for unsupervised semantic segmentation that separates feature learning from clustering, using a contrastive loss to produce semantically meaningful pixel features, significantly improving state-of-the-art results.
Contribution
The paper proposes STEGO, a new framework that distills unsupervised features into discrete semantic labels with a novel contrastive loss, enhancing segmentation performance.
Findings
Improves on the prior state of the art by 14 mIoU on CocoStuff
Improves on the prior state of the art by 9 mIoU on Cityscapes
Outperforms previous state-of-the-art methods
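The findings above are reported in mIoU (mean intersection over union). As a rough illustration of how that metric is computed, here is a minimal NumPy sketch. The function name `mean_iou` and the toy labels are assumptions made for this example and do not come from the paper. Unsupervised evaluation typically also needs a step that matches predicted clusters to ground-truth classes, which is not shown here.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes (hypothetical helper).

    pred, target: integer label arrays of the same shape, values in [0, num_classes).
    Classes that appear in neither array are skipped.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    valid = union > 0
    return float((intersection[valid] / union[valid]).mean())

# Toy example: four pixels, two classes.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their mean, 7/12. Benchmark results are usually quoted on a 0 to 100 scale, so a gain of "14 mIoU" corresponds to 0.14 on this function's 0 to 1 scale.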
Abstract
Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation. To solve this task, algorithms must produce features for every pixel that are both semantically meaningful and compact enough to form distinct clusters. Unlike previous works which achieve this with a single end-to-end framework, we propose to separate feature learning from cluster compactification. Empirically, we show that current unsupervised feature learning frameworks already generate dense features whose correlations are semantically consistent. This observation motivates us to design STEGO (Self-supervised Transformer with Energy-based Graph Optimization), a novel framework that distills unsupervised features into high-quality discrete semantic labels. At the core of STEGO is…
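The abstract's observation about correlations between dense features can be made concrete with a small sketch. The code below computes the cosine similarity between every pair of spatial positions in two feature maps. The function name `feature_correspondence`, the tensor shapes, and the random stand-in features are assumptions for illustration only. This shows the correlation computation, not STEGO's loss or training procedure.

```python
import numpy as np

def feature_correspondence(f, g, eps=1e-8):
    """Cosine similarity between all pairs of spatial positions (hypothetical helper).

    f: (C, H, W) dense features of one image
    g: (C, I, J) dense features of another (or the same) image
    returns: (H, W, I, J) array with values in [-1, 1]
    """
    # L2-normalise along the channel axis so dot products become cosines.
    f = f / (np.linalg.norm(f, axis=0, keepdims=True) + eps)
    g = g / (np.linalg.norm(g, axis=0, keepdims=True) + eps)
    # Contract over channels for every (h, w) and (i, j) pair.
    return np.einsum("chw,cij->hwij", f, g)

# Random features stand in for the output of a pretrained backbone.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 4, 5))
corr = feature_correspondence(feats, feats)
```

With real backbone features, the slice `corr[h, w]` is a similarity map over the second image. The abstract's claim is that in such maps high values tend to fall on pixels of the same semantic category as the query position `(h, w)`.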
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Vision Transformer · Linear Layer · Residual Connection · Dropout · Adam · Softmax · Multi-Head Attention · Layer Normalization · Attention Is All You Need · Transformer
