Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation
Yunheng Li, ZhongYu Li, Quansheng Zeng, Qibin Hou, Ming-Ming Cheng

TL;DR
Cascade-CLIP introduces a framework of cascaded decoders that aligns multi-level visual features with text embeddings, improving zero-shot semantic segmentation on COCO-Stuff, Pascal-VOC, and Pascal-Context.
Contribution
The paper proposes a series of independent, cascaded decoders that align multi-level visual features with text embeddings, so that intermediate-layer features can be used without weakening zero-shot ability on novel classes.
Findings
Achieves superior zero-shot segmentation results on COCO-Stuff, Pascal-VOC, and Pascal-Context.
Effectively utilizes multi-level features without weakening zero-shot ability.
Flexible framework compatible with existing methods.
Abstract
Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while they neglect the crucial information in intermediate layers that contain rich object details. However, we find that directly aggregating the multi-level visual features weakens the zero-shot ability for novel classes. The large differences between the visual features from different layers make these features hard to align well with the text embeddings. We resolve this problem by introducing a series of independent decoders to align the multi-level visual features with the text embeddings in a cascaded way, forming a novel but simple framework named Cascade-CLIP. Our Cascade-CLIP is flexible and can be easily applied to existing zero-shot semantic…
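The cascaded alignment idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that independent decoders align multi-level visual features with the text embeddings in a cascaded way. Everything else here is an assumption made for illustration, including the function name `cascaded_scores`, the use of a plain linear map as each "decoder", averaging the layers within a stage, and summing the per-stage similarity maps. Random arrays stand in for CLIP features.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cascaded_scores(layer_feats, text_emb, decoders, stage_size):
    """Hypothetical sketch of cascaded alignment.

    layer_feats: list of (N, D) arrays, one per encoder layer (N patches).
    text_emb:    (C, D) array, one embedding per class name.
    decoders:    list of (D, D) matrices, one independent decoder per stage
                 (a linear map is an assumption, not the paper's decoder).
    stage_size:  number of consecutive layers grouped into one stage.
    """
    text = l2_normalize(text_emb)
    total = np.zeros((layer_feats[0].shape[0], text_emb.shape[0]))
    for s, weight in enumerate(decoders):
        # Only layers within the same stage are aggregated together, so
        # features from very different depths are never mixed directly.
        group = layer_feats[s * stage_size:(s + 1) * stage_size]
        stage_feat = np.mean(group, axis=0)            # (N, D)
        decoded = l2_normalize(stage_feat @ weight)    # stage-specific decoder
        total += decoded @ text.T                      # cosine scores, (N, C)
    return total

# Toy stand-ins for CLIP features: 6 layers, 16 patches, 8 dims, 3 classes.
rng = np.random.default_rng(0)
N, D, C, L, stage_size = 16, 8, 3, 6, 2
layer_feats = [rng.normal(size=(N, D)) for _ in range(L)]
text_emb = rng.normal(size=(C, D))
decoders = [rng.normal(size=(D, D)) for _ in range(L // stage_size)]

scores = cascaded_scores(layer_feats, text_emb, decoders, stage_size)
pred = scores.argmax(axis=1)  # per-patch class index
```

The contrast with the failure mode the abstract mentions is that a single decoder applied to the average of all six layers would have to align very dissimilar features with the text embeddings at once. In this sketch each stage gets its own decoder, and only the resulting score maps are combined.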
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: ALIGN · Contrastive Language-Image Pre-training
