SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation
Xiuli Bi, Die Xiao, Junchao Fan, Bin Xiao

TL;DR
This paper introduces SSR, a novel method that improves CLIP-based weakly supervised segmentation by addressing over-activation issues through semantic and spatial rectification techniques, leading to state-of-the-art results.
Contribution
The paper proposes Semantic and Spatial Rectification (SSR), combining cross-modal prototype alignment and superpixel-guided correction to enhance CLIP-based segmentation accuracy.
Findings
Outperforms existing methods on PASCAL VOC and MS COCO datasets.
Achieves 79.5% mIoU on PASCAL VOC and 50.6% mIoU on MS COCO.
Effectively reduces over-activation in non-target regions.
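For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of mean Intersection-over-Union (mIoU) for a single pair of label maps. The function name `mean_iou` and its flat-list inputs are illustrative assumptions, not the paper's evaluation code. Benchmark protocols typically accumulate counts over the whole dataset and skip an "ignore" label, which this sketch omits.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target.

    pred, target: flat sequences of integer class labels, same length.
    Illustrative helper only; not the paper's evaluation script.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    inter = [0] * num_classes  # pixels where both maps agree on class c
    union = [0] * num_classes  # pixels labeled c in either map
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            union[p] += 1
            union[t] += 1
    # Average only over classes that occur at least once.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: 4 pixels, 2 classes.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. Over-activation in non-target regions enlarges the union without enlarging the intersection, which is why reducing it raises mIoU.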
Abstract
In recent years, Contrastive Language-Image Pretraining (CLIP) has been widely applied to Weakly Supervised Semantic Segmentation (WSSS) tasks due to its powerful cross-modal semantic understanding capabilities. This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP-based weakly supervised semantic segmentation approaches: over-activation in non-target foreground regions and background areas. Specifically, at the semantic level, the Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment across modalities, reducing inter-class overlap while enhancing semantic correlations, to rectify over-activation in non-target foreground regions effectively; at the spatial level, the Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to precisely…
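The semantic-level idea in the abstract, contrastive alignment between image-side class prototypes and text-side class embeddings, can be illustrated with a generic InfoNCE-style loss. This is a sketch under stated assumptions, not the authors' CMPA formulation: the function names, the temperature value of 0.07, and the one-prototype-per-class setup are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a nonzero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def prototype_alignment_loss(image_protos, text_embeds, temperature=0.07):
    """Generic InfoNCE-style loss (assumed form, not the paper's exact loss).

    image_protos[k] and text_embeds[k] are the visual prototype and text
    embedding of class k. Each prototype is pulled toward its own class
    text embedding and pushed away from the other classes' embeddings.
    """
    img = [l2_normalize(v) for v in image_protos]
    txt = [l2_normalize(v) for v in text_embeds]
    total = 0.0
    for k, proto in enumerate(img):
        # Cosine similarity to every class text embedding, temperature-scaled.
        logits = [sum(a * b for a, b in zip(proto, t)) / temperature for t in txt]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[k]  # cross-entropy with target class k
    return total / len(img)


# Toy example with two classes in a 2-D feature space.
aligned = prototype_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 0.0], [0.0, 1.0]])
swapped = prototype_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[0.0, 1.0], [1.0, 0.0]])
```

With matching prototypes the loss is close to zero, while swapping the text embeddings makes it large. Minimizing a loss of this kind separates classes in the shared feature space, which matches the abstract's stated goal of reducing inter-class overlap.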
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection
