CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation

Dengke Zhang; Fagui Liu; Quan Tang

arXiv:2411.10086 · cs.CV · August 4, 2025

TL;DR

CorrCLIP enhances open-vocabulary semantic segmentation by reconstructing patch correlations in CLIP, using SAM and self-supervised models to improve spatial and semantic consistency, leading to superior benchmark performance.

Contribution

It introduces CorrCLIP, a novel method that reconstructs patch correlations in CLIP for better segmentation, integrating SAM and self-supervised models to reduce inter-class confusion.

Findings

01 Achieves superior performance across eight benchmarks.

02 Effectively reduces inter-class patch correlations.

03 Improves spatial and semantic feature representations.

Abstract

Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories. While Contrastive Language-Image Pre-training (CLIP) excels in zero-shot classification, it struggles to align image patches with category embeddings because of its incoherent patch correlations. This study reveals that inter-class correlations are the main reason for impairing CLIP's segmentation performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations. Specifically, CorrCLIP leverages the Segment Anything Model (SAM) to define the scope of patch interactions, reducing inter-class correlations. To mitigate the problem that SAM-generated masks may contain patches belonging to different classes, CorrCLIP incorporates self-supervised models to compute coherent similarity values, suppressing the…
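
The two ideas named in the abstract, restricting the scope of patch interactions to a region and taking the correlation values from a separate feature extractor, can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `region_restricted_attention`, the `temperature` parameter, the use of cosine similarity with a softmax, and the representation of SAM masks as one integer region id per patch are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def region_restricted_attention(values, feats, region_ids, temperature=0.1):
    """Toy patch aggregation (hypothetical sketch, not the paper's code).

    values     -- per-patch vectors to be mixed (stand-in for CLIP value features)
    feats      -- per-patch vectors used only to compute similarity
                  (stand-in for self-supervised features)
    region_ids -- one integer per patch (stand-in for a SAM mask index)
    """
    n = len(values)
    dim = len(values[0])
    out = []
    for i in range(n):
        # Scope: patch i interacts only with patches in the same region.
        # The list always contains i itself, so it is never empty.
        peers = [j for j in range(n) if region_ids[j] == region_ids[i]]
        # Value: similarity comes from the auxiliary features.
        logits = [cosine(feats[i], feats[j]) / temperature for j in peers]
        # Numerically stable softmax over the allowed peers.
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        out.append([
            sum(w * values[j][d] for w, j in zip(weights, peers)) / total
            for d in range(dim)
        ])
    return out


# Four patches in two regions; values mark which region a patch belongs to.
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
feats = [[1.0, 0.2], [0.9, 0.1], [0.8, 0.3], [0.1, 1.0]]
restricted = region_restricted_attention(values, feats, [0, 0, 1, 1])
unrestricted = region_restricted_attention(values, feats, [0, 0, 0, 0])
```

In this toy setup, `restricted[0]` mixes only the values of patches 0 and 1, while `unrestricted[0]` also receives a small contribution from the other two patches. That leakage across regions is the kind of inter-class correlation the abstract describes as harmful.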

Code & Models

Repositories

zdk258/CorrCLIP
PyTorch · Official

Datasets

dk258/CorrCLIP
dataset · 35 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Dense Connections · Vision Transformer · self-DIstillation with NO labels · Contrastive Language-Image Pre-training · Segment Anything Model