CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
Dengke Zhang, Fagui Liu, Quan Tang

TL;DR
CorrCLIP enhances open-vocabulary semantic segmentation by reconstructing patch correlations in CLIP, using SAM and self-supervised models to improve spatial and semantic consistency, leading to superior benchmark performance.
Contribution
The paper introduces CorrCLIP, a method that reconstructs patch correlations in CLIP for better segmentation, integrating SAM and self-supervised models to reduce inter-class confusion.
Findings
Achieves superior performance across eight benchmarks.
Effectively reduces inter-class patch correlations.
Improves spatial and semantic feature representations.
Abstract
Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories. While Contrastive Language-Image Pre-training (CLIP) excels in zero-shot classification, it struggles to align image patches with category embeddings because of its incoherent patch correlations. This study reveals that inter-class correlations are the main reason for impairing CLIP's segmentation performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations. Specifically, CorrCLIP leverages the Segment Anything Model (SAM) to define the scope of patch interactions, reducing inter-class correlations. To mitigate the problem that SAM-generated masks may contain patches belonging to different classes, CorrCLIP incorporates self-supervised models to compute coherent similarity values, suppressing the…
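The two ideas stated in the abstract, limiting the scope of patch interactions with SAM masks and taking the correlation values from self-supervised features, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the `temperature` parameter, the use of cosine similarity with a softmax, and the assumption that every patch has already been assigned a single region id from SAM masks are all hypothetical choices made for illustration.

```python
import numpy as np


def reconstruct_correlations(ssl_feats, region_ids, temperature=0.1):
    """Illustrative only: build a patch-to-patch weight matrix.

    ssl_feats:  (N, D) patch features from a self-supervised model (assumed given).
    region_ids: (N,) integer region label per patch, derived from SAM masks
                (assumed precomputed; the mapping from masks to patches is not shown).
    temperature: hypothetical softmax temperature, not taken from the paper.
    """
    # Values: cosine similarity between self-supervised patch features.
    norms = np.linalg.norm(ssl_feats, axis=1, keepdims=True)
    normed = ssl_feats / np.maximum(norms, 1e-12)
    sim = normed @ normed.T

    # Scope: a patch may only interact with patches in the same region.
    same_region = region_ids[:, None] == region_ids[None, :]
    logits = np.where(same_region, sim / temperature, -np.inf)

    # Row-wise softmax; each patch shares a region with itself,
    # so every row has at least one finite entry.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def aggregate_patch_features(clip_values, weights):
    """Mix CLIP patch features (N, C) using the reconstructed weights (N, N)."""
    return weights @ clip_values


# Toy usage with random data: 4 patches, two regions of 2 patches each.
rng = np.random.default_rng(0)
ssl_feats = rng.normal(size=(4, 8))
clip_values = rng.normal(size=(4, 16))
region_ids = np.array([0, 0, 1, 1])

weights = reconstruct_correlations(ssl_feats, region_ids)
refined = aggregate_patch_features(clip_values, weights)
```

In this sketch, patches in different regions receive exactly zero weight, which is one simple way to remove cross-region (and hence many inter-class) correlations. Within a region, the self-supervised similarity decides how strongly two patches influence each other, so a mask that spans more than one class does not force its patches to mix uniformly.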
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Dense Connections · Vision Transformer · self-DIstillation with NO labels · Contrastive Language-Image Pre-training · Segment Anything Model
