CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

Lin Sun; Jiale Cao; Jin Xie; Xiaoheng Jiang; Yanwei Pang

arXiv:2411.13836·cs.CV·November 22, 2024

TL;DR

CLIPer is a hierarchical framework that enhances CLIP's spatial representation for open-vocabulary semantic segmentation, achieving state-of-the-art results across multiple datasets by combining early-layer fusion and fine-grained compensation modules.

Contribution

The paper introduces a novel hierarchical approach, CLIPer, which improves CLIP's spatial features for pixel-level segmentation without additional training.

Findings

01 Achieves 69.8% mIoU on VOC with ViT-L

02 Outperforms ProxyCLIP by 9.2% on VOC

03 Outperforms ProxyCLIP by 4.1% on COCO Object

Abstract

Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on various image-level tasks, which has motivated research on adapting CLIP for pixel-level open-vocabulary semantic segmentation without additional training. The key is to improve the spatial representation of image-level CLIP, for example by replacing the self-attention map at the last layer with a self-self attention map or an attention map derived from a vision foundation model. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. The proposed CLIPer includes an early-layer fusion module and a fine-grained compensation module. We observe that the embeddings and attention maps at early layers can preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate a segmentation map with better spatial…
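
To make the attention-map replacement mentioned in the abstract more concrete, here is a minimal pure-Python sketch of a self-self attention map (a projection attending to itself, for example query-query, instead of query-key). It is not the authors' implementation: the function names, the toy 2-D patch vectors, and the plain averaging of several maps are assumptions for illustration only, and the abstract does not specify how CLIPer fuses early-layer information.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_self_attention(x):
    """Return the N x N map softmax(x x^T / sqrt(D)) for N patch vectors.

    `x` stands in for one projection of the patch tokens (queries or keys).
    """
    scale = 1.0 / math.sqrt(len(x[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(xi, xj)) for xj in x])
        for xi in x
    ]


def average_maps(maps):
    """Element-wise mean of several N x N attention maps (hypothetical fusion)."""
    n = len(maps[0])
    return [
        [sum(m[i][j] for m in maps) / len(maps) for j in range(n)]
        for i in range(n)
    ]


def apply_attention(attn, values):
    """Mix value vectors: out[i] = sum_j attn[i][j] * values[j]."""
    dim = len(values[0])
    return [
        [sum(attn[i][j] * values[j][k] for j in range(len(values))) for k in range(dim)]
        for i in range(len(attn))
    ]


# Toy data: patches 0 and 1 point in similar directions, patch 2 does not.
queries = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
keys = [[0.8, 0.2], [1.0, 0.0], [0.1, 0.9]]

qq = self_self_attention(queries)   # query-query map
kk = self_self_attention(keys)      # key-key map
fused = average_maps([qq, kk])      # assumed fusion: simple mean
output = apply_attention(fused, queries)
```

In this toy setup, patch 0 puts more weight on the similar patch 1 than on the dissimilar patch 2, which is the kind of spatially coherent grouping a segmentation map benefits from.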

Code & Models

Repositories

linsun449/cliper.code (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need · Diffusion · Contrastive Language-Image Pre-training