A Lightweight Clustering Framework for Unsupervised Semantic Segmentation
Yau Shing Jonathan Cheung, Xi Chen, Lihe Yang, Hengshuang Zhao

TL;DR
This paper introduces a lightweight clustering framework leveraging self-supervised Vision Transformer features for unsupervised semantic segmentation, achieving state-of-the-art results without neural network training.
Contribution
The authors propose a novel multilevel clustering approach that exploits attention features for effective unsupervised segmentation, reducing computational complexity.
Findings
Achieves state-of-the-art results on PASCAL VOC and MS COCO datasets.
Demonstrates strong foreground-background differentiation in self-supervised Vision Transformer features.
Provides comprehensive analysis comparing DINO and DINOv2 features.
Abstract
Unsupervised semantic segmentation aims to categorize each pixel in an image into a corresponding class without the use of annotated data. It is a widely researched area as obtaining labeled datasets is expensive. While previous works in the field have demonstrated a gradual improvement in model accuracy, most required neural network training. This made segmentation equally expensive, especially when dealing with large-scale datasets. We thus propose a lightweight clustering framework for unsupervised semantic segmentation. We discovered that attention features of the self-supervised Vision Transformer exhibit strong foreground-background differentiability. Therefore, clustering can be employed to effectively separate foreground and background image patches. In our framework, we first perform multilevel clustering across the Dataset-level, Category-level, and Image-level, and maintain…
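As a rough illustration of the idea that clustering can separate foreground and background patches, the sketch below runs plain two-cluster k-means over per-patch feature vectors. It is a minimal sketch under stated assumptions, not the authors' multilevel pipeline: the function name `two_means`, the toy 2-D vectors standing in for Vision Transformer attention features, and the deterministic initialization are all hypothetical choices made for this example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def two_means(features: List[Vector], iters: int = 20) -> List[int]:
    """Split patch features into two clusters with Lloyd's k-means (k=2).

    Assumption: initialization uses the first patch and the patch farthest
    from it, which keeps the toy example deterministic.
    """
    c0 = list(features[0])
    c1 = list(max(features, key=lambda f: sq_dist(f, c0)))
    labels: List[int] = []
    for _ in range(iters):
        # Assignment step: each patch joins its nearest centroid.
        new_labels = [0 if sq_dist(f, c0) <= sq_dist(f, c1) else 1 for f in features]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: recompute each centroid from its members.
        group0 = [f for f, l in zip(features, labels) if l == 0]
        group1 = [f for f, l in zip(features, labels) if l == 1]
        if group0:
            c0 = centroid(group0)
        if group1:
            c1 = centroid(group1)
    return labels


# Toy stand-in for per-patch features (hypothetical values, not real ViT output).
patches = [
    [0.90, 0.80], [1.00, 0.90], [0.95, 0.85],  # one tight group
    [0.10, 0.20], [0.00, 0.10], [0.05, 0.15],  # another tight group
]
print(two_means(patches))  # → [0, 0, 0, 1, 1, 1]
```

The sketch only separates patches into two groups; deciding which group is foreground, and the Dataset-level, Category-level, and Image-level stages mentioned in the abstract, are outside its scope.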
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Dropout · Label Smoothing · Adam · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Transformer · self-DIstillation with NO labels
