Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

Tong Shao; Zhuotao Tian; Hang Zhao; Jingyong Su

arXiv:2407.08268·cs.CV·July 12, 2024

TL;DR

This paper introduces CLIPtrase, a training-free method leveraging CLIP's features for improved open-vocabulary semantic segmentation, significantly enhancing accuracy without additional training.

Contribution

It proposes a novel, training-free approach that recalibrates patch correlations in CLIP to improve local feature discrimination for semantic segmentation.

Findings

01. Achieves 22.3% higher accuracy than CLIP on average across 9 benchmarks.

02. Outperforms existing state-of-the-art training-free segmentation methods.

03. Enhances local feature awareness and semantic coherence in segmentation tasks.

Abstract

CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite its success, its application to OVSS faces challenges due to its initial image-level alignment training, which affects its performance in tasks requiring detailed local context. Our study delves into the impact of CLIP's [CLS] token on patch feature correlations, revealing a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach demonstrates notable improvements in segmentation accuracy and the ability to maintain semantic coherence across objects. Experiments show that we are 22.3% ahead of CLIP on average on 9 segmentation…
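
To make the idea of patch self-correlation more concrete, here is a minimal sketch in plain Python. It is not the CLIPtrase implementation: the abstract does not specify how the recalibration is done, so this only illustrates the general pattern of reweighting each patch by its cosine similarity to the other patches before matching patches against text embeddings. All function names, the `temperature` parameter, and the toy vectors are assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_correlation_weights(patches, temperature=0.1):
    # Hypothetical helper: patch-to-patch weights from cosine self-similarity.
    # Each row sums to 1; similar patches receive most of the weight.
    normed = [l2_normalize(p) for p in patches]
    return [softmax([dot(p, q) / temperature for q in normed]) for p in normed]

def aggregate(patches, weights):
    # Replace each patch by the weighted average of all patches.
    dim = len(patches[0])
    return [[sum(w * p[d] for w, p in zip(row, patches)) for d in range(dim)]
            for row in weights]

def label_patches(patches, text_embeddings, temperature=0.1):
    # Assign each refined patch to the text embedding with highest cosine similarity.
    weights = self_correlation_weights(patches, temperature)
    refined = [l2_normalize(p) for p in aggregate(patches, weights)]
    texts = [l2_normalize(t) for t in text_embeddings]
    labels = []
    for p in refined:
        scores = [dot(p, t) for t in texts]
        labels.append(scores.index(max(scores)))
    return labels

# Toy usage: four 2-D "patch features" and two "class text embeddings".
toy_patches = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_texts = [[1.0, 0.0], [0.0, 1.0]]
toy_labels = label_patches(toy_patches, toy_texts)
```

In this sketch the weights depend only on similarity between patches themselves, so a patch borrows information mainly from patches that look like it. How CLIPtrase actually recalibrates these correlations, and how it handles the "global" patches the abstract mentions, is described in the paper and the official repository rather than here.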

Code & Models

Repositories

leaves162/cliptrase (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling

Methods: Contrastive Language-Image Pre-training