TL;DR
ReCLIP++ introduces a method to explicitly model and rectify biases in CLIP, improving unsupervised semantic segmentation performance across multiple benchmarks.
Contribution
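It designs a learnable "Reference" prompt to encode class-preference bias and a projection of the vision transformer's positional embedding to encode space-preference bias, so that both biases in CLIP can be explicitly modeled and rectified for segmentation.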
Findings
Outperforms previous state-of-the-art methods on PASCAL VOC, ADE20K, and other benchmarks.
Effectively models and reduces class-preference and space-preference biases.
Achieves more accurate and contextually consistent segmentation masks.
Abstract
Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task, where only images without annotations are available. However, we observe that when CLIP is adopted for such a pixel-level understanding task, unexpected bias (including class-preference bias and space-preference bias) occurs. Previous works do not explicitly model the bias, which largely constrains segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task. Specifically, we design a learnable "Reference" prompt to encode class-preference bias and a projection of the positional embedding in the vision transformer to encode space-preference bias. To avoid interference, the two kinds of bias are first encoded independently into different features, i.e., the Reference feature and…
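The two bias encodings named in the abstract can be pictured with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the sizes, the `toy_text_encoder` stand-in for CLIP's text encoder, and the single linear `projection` are all hypothetical, and the step that combines the two features is left out because the abstract is truncated before describing it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not taken from the paper).
NUM_CLASSES, NUM_PATCHES, EMBED_DIM, PROMPT_LEN = 4, 9, 8, 3

# Learnable "Reference" prompt: a small bank of context vectors per class.
# In training these would be updated by gradient descent; here they are
# just randomly initialised parameters.
reference_prompt = rng.normal(size=(NUM_CLASSES, PROMPT_LEN, EMBED_DIM))


def toy_text_encoder(prompt_tokens):
    """Hypothetical stand-in for CLIP's text encoder: mean-pool, then L2-normalise."""
    pooled = prompt_tokens.mean(axis=1)
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)


# Class-preference bias encoded as one Reference feature per class.
reference_feature = toy_text_encoder(reference_prompt)   # shape (C, D)

# Positional embedding of the vision transformer, one row per patch.
positional_embedding = rng.normal(size=(NUM_PATCHES, EMBED_DIM))

# Learnable projection (assumed linear) that encodes space-preference bias.
projection = rng.normal(size=(EMBED_DIM, EMBED_DIM)) * 0.1
positional_feature = positional_embedding @ projection   # shape (N, D)

print(reference_feature.shape, positional_feature.shape)
```

The sketch keeps the two features in separate arrays, mirroring the abstract's point that the biases are first encoded independently to avoid interference.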
