Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

Yunheng Li; ZhongYu Li; Quansheng Zeng; Qibin Hou; Ming-Ming Cheng

arXiv:2406.00670 · cs.CV · June 7, 2024

TL;DR

Cascade-CLIP introduces a cascade of independent decoders that aligns multi-level visual features with text embeddings, improving zero-shot semantic segmentation on COCO-Stuff, Pascal-VOC, and Pascal-Context.

Contribution

The paper proposes a series of independent decoders, applied in a cascaded way, that align multi-level visual features with text embeddings without weakening zero-shot ability on novel classes.

Findings

01. Achieves superior zero-shot segmentation results on COCO-Stuff, Pascal-VOC, and Pascal-Context.

02. Effectively utilizes multi-level features without weakening zero-shot ability.

03. Flexible framework compatible with existing methods.

Abstract

Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while they neglect the crucial information in intermediate layers that contain rich object details. However, we find that directly aggregating the multi-level visual features weakens the zero-shot ability for novel classes. The large differences between the visual features from different layers make these features hard to align well with the text embeddings. We resolve this problem by introducing a series of independent decoders to align the multi-level visual features with the text embeddings in a cascaded way, forming a novel but simple framework named Cascade-CLIP. Our Cascade-CLIP is flexible and can be easily applied to existing zero-shot semantic…
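
The cascaded alignment described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify the decoder design or how the per-level outputs are combined, so the sketch assumes each decoder is a plain linear projection (`W`) followed by cosine similarity against the class text embeddings, and that the per-level score maps are summed. The names `decode` and `cascade_segment` are hypothetical, and pure Python lists stand in for the tensors a real CLIP pipeline would use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project(W, x):
    """Apply a linear map W (a list of rows) to the vector x."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def decode(level_feats, W, text_embs):
    """Hypothetical decoder for one feature level.

    Projects every patch feature with this level's own weights W, then
    scores it against each class text embedding. Returns a list of
    per-patch score rows, one score per class.
    """
    return [[cosine(project(W, patch), t) for t in text_embs]
            for patch in level_feats]


def cascade_segment(multi_level_feats, decoders, text_embs):
    """Combine independent per-level decoders (assumed here: by summation).

    multi_level_feats: one list of patch features per encoder level.
    decoders: one weight matrix per level, never shared across levels.
    Returns the summed score map and the argmax class index per patch.
    """
    n_patches = len(multi_level_feats[0])
    n_classes = len(text_embs)
    total = [[0.0] * n_classes for _ in range(n_patches)]
    for feats, W in zip(multi_level_feats, decoders):
        scores = decode(feats, W, text_embs)
        for p in range(n_patches):
            for c in range(n_classes):
                total[p][c] += scores[p][c]
    labels = [max(range(n_classes), key=lambda c: row[c]) for row in total]
    return total, labels


# Toy data: 2 classes, 2 feature levels, 2 patches, 2-dimensional features.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
feats = [
    [[1.0, 0.1], [0.2, 1.0]],  # intermediate-level patch features
    [[0.9, 0.0], [0.0, 0.8]],  # last-level patch features
]
total, labels = cascade_segment(feats, [identity, identity], text_embs)
```

Giving each level its own decoder lets features that differ strongly across layers be aligned separately before they are combined, which is the point the abstract makes against aggregating the raw multi-level features directly.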

Code & Models

Repositories

hvision-nku/cascade-clip (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI

Methods: ALIGN · Contrastive Language-Image Pre-training