SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation

Xiuli Bi; Die Xiao; Junchao Fan; Bin Xiao

arXiv:2512.01701·cs.CV·December 23, 2025

TL;DR

This paper introduces SSR, a method that improves CLIP-based weakly supervised semantic segmentation by rectifying over-activation at both the semantic and the spatial level, reaching state-of-the-art results on PASCAL VOC and MS COCO.

Contribution

The paper proposes Semantic and Spatial Rectification (SSR), which combines Cross-Modal Prototype Alignment (CMPA) at the semantic level with Superpixel-Guided Correction (SGC) at the spatial level to improve CLIP-based weakly supervised segmentation.

Findings

01

Outperforms existing methods on PASCAL VOC and MS COCO datasets.

02

Achieves 79.5% mIoU on PASCAL VOC and 50.6% mIoU on MS COCO.

03

Effectively reduces over-activation in non-target regions.
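
For readers unfamiliar with the metric behind the reported scores, the following is a minimal sketch of mean intersection-over-union (mIoU) on a single pair of label maps. It is not taken from the paper: the function name `mean_iou` and the choice to skip classes absent from both maps are assumptions made for illustration, and benchmark protocols typically accumulate intersections and unions over the whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes.

    pred, target: flat sequences of integer class labels, same length.
    Classes that appear in neither sequence are skipped (an assumption
    of this sketch, not necessarily the benchmark protocol).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so `score` is 0.5. Over-activation shows up in this metric as a larger union without a matching gain in intersection, which lowers the IoU of the affected class.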

Abstract

In recent years, Contrastive Language-Image Pretraining (CLIP) has been widely applied to Weakly Supervised Semantic Segmentation (WSSS) tasks due to its powerful cross-modal semantic understanding capabilities. This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP-based weakly supervised semantic segmentation approaches: over-activation in non-target foreground regions and background areas. Specifically, at the semantic level, the Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment across modalities, reducing inter-class overlap while enhancing semantic correlations, to rectify over-activation in non-target foreground regions effectively; at the spatial level, the Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to precisely…
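
The abstract does not give the CMPA loss, so the following is only a rough sketch of what a contrastive alignment between per-class image prototypes and text embeddings could look like, using a generic InfoNCE-style objective. Everything here is an assumption for illustration: the function name `prototype_alignment_loss`, the use of cosine similarity, the temperature value, and the one-directional (image to text) formulation are not taken from the paper.

```python
import math


def _normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def prototype_alignment_loss(image_protos, text_embeds, temperature=0.07):
    """InfoNCE-style loss (hypothetical stand-in, not the paper's CMPA loss).

    image_protos[c] and text_embeds[c] are the visual prototype and the
    text embedding of class c. Each prototype is pulled toward its own
    class text embedding and pushed away from the other classes.
    """
    if len(image_protos) != len(text_embeds):
        raise ValueError("need one text embedding per image prototype")

    img = [_normalize(v) for v in image_protos]
    txt = [_normalize(v) for v in text_embeds]

    total = 0.0
    for c, proto in enumerate(img):
        # Cosine similarity to every class text embedding, scaled by temperature.
        logits = [sum(a * b for a, b in zip(proto, t)) / temperature for t in txt]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching class as the positive.
        total += log_denominator - logits[c]

    return total / len(img)


# Aligned prototypes should score a lower loss than mismatched ones.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = prototype_alignment_loss([[1.0, 0.0], [0.0, 1.0]], text)
swapped = prototype_alignment_loss([[0.0, 1.0], [1.0, 0.0]], text)
```

Minimizing a loss of this shape separates the class prototypes in the shared embedding space, which is one plausible reading of "reducing inter-class overlap while enhancing semantic correlations" in the abstract.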

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection