CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation
Lin Sun, Jiale Cao, Jin Xie, Xiaoheng Jiang, Yanwei Pang

TL;DR
CLIPer is a hierarchical framework that enhances CLIP's spatial representation for open-vocabulary semantic segmentation, achieving state-of-the-art results across multiple datasets by combining early-layer fusion and fine-grained compensation modules.
Contribution
The paper introduces a novel hierarchical approach, CLIPer, which improves CLIP's spatial features for pixel-level segmentation without additional training.
Findings
Achieves 69.8% mIoU on VOC with ViT-L
Outperforms ProxyCLIP by 9.2% mIoU on VOC
Outperforms ProxyCLIP by 4.1% mIoU on COCO Object
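For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of how mean intersection-over-union (mIoU) is typically computed. It is not the paper's evaluation code: the function name `mean_iou`, the flat label lists, and the `ignore_index=255` convention are assumptions made for illustration, and real benchmarks accumulate the counts over a whole dataset rather than a single image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that appear in pred or target.

    pred and target are flat sequences of integer class labels, one per pixel.
    Pixels whose target label equals ignore_index are skipped (an assumed
    convention, not taken from the paper).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Pixel is in both the predicted and the true region of class t.
            intersection[t] += 1
            union[t] += 1
        else:
            # Pixel belongs to the union of class p (predicted) and class t (true).
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.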
Abstract
Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on various image-level tasks, which has motivated research on adapting CLIP to pixel-level open-vocabulary semantic segmentation without additional training. The key is to improve the spatial representation of image-level CLIP, for example by replacing the self-attention map at the last layer with a self-self attention map or an attention map derived from a vision foundation model. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. The proposed CLIPer includes an early-layer fusion module and a fine-grained compensation module. We observe that the embeddings and attention maps at early layers can preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate a segmentation map with better spatial…
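To make the attention-map replacement mentioned in the abstract more concrete, here is a rough sketch in plain Python. It contrasts the standard query-key attention map with a "self-self" map built from query-query and key-key similarities, and adds a hypothetical helper that averages several attention maps as one possible reading of fusing early-layer attention. None of this is taken from the CLIPer code: the function names, the choice to average the two self-self maps (so rows still sum to 1), and the averaging-based fusion are all assumptions for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(a, b):
    """Row-wise softmax of scaled dot products between rows of a and rows of b."""
    scale = 1.0 / math.sqrt(len(a[0]))
    return [
        softmax([scale * sum(x * y for x, y in zip(ra, rb)) for rb in b])
        for ra in a
    ]


def self_self_attention(q, k):
    """Average of query-query and key-key attention (assumed combination rule)."""
    qq = attention_map(q, q)
    kk = attention_map(k, k)
    return [[(x + y) / 2 for x, y in zip(rq, rk)] for rq, rk in zip(qq, kk)]


def average_attention_maps(maps):
    """Hypothetical fusion step: element-wise mean of several attention maps."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[i][j] for m in maps) / n for j in range(cols)]
        for i in range(rows)
    ]


# Three tokens with 2-dimensional queries and keys (toy values).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

standard = attention_map(q, k)          # usual softmax(QK^T / sqrt(d))
self_self = self_self_attention(q, k)   # replacement map for the last layer
fused = average_attention_maps([standard, self_self])
```

Each function returns a token-by-token matrix whose rows are probability distributions, so any of these maps can be substituted for another when re-weighting value embeddings.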
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Speech and dialogue systems
Methods: Softmax · Attention Is All You Need · Diffusion · Contrastive Language-Image Pre-training
