Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation
Jialei Chen, Xu Zheng, Danda Pani Paudel, Luc Van Gool, Hiroshi Murase, Daisuke Deguchi

TL;DR
This paper introduces Chimera-Seg, a zero-shot semantic segmentation model that pairs a segmentation backbone with a CLIP-based semantic head to align dense visual features with CLIP's semantic space, improving hIoU by 0.9% and 1.2% on two benchmarks.
Contribution
The paper proposes Chimera-Seg, which integrates a segmentation backbone with a CLIP-based semantic head, and Selective Global Distillation, which improves alignment in zero-shot segmentation.
Findings
Achieves 0.9% and 1.2% improvements in hIoU on two benchmarks.
Effectively aligns dense visual features with CLIP's semantic space.
Demonstrates the effectiveness of partial CLIP modules in segmentation.
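For readers unfamiliar with the hIoU metric cited in the findings: in zero-shot segmentation it is conventionally the harmonic mean of the mIoU on seen classes and the mIoU on unseen classes. The sketch below illustrates that convention only; the function name and the example numbers are hypothetical and are not taken from the paper.

```python
def harmonic_iou(seen_miou: float, unseen_miou: float) -> float:
    """Harmonic mean of seen-class and unseen-class mIoU (both on the same scale)."""
    total = seen_miou + unseen_miou
    if total == 0:
        # Both scores are zero, so the harmonic mean is defined as zero here.
        return 0.0
    return 2 * seen_miou * unseen_miou / total


# Hypothetical scores, for illustration only (not results from the paper).
example = harmonic_iou(80.0, 40.0)
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high hIoU by doing well on seen classes alone.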
Abstract
Zero-shot Semantic Segmentation (ZSS) aims to segment both seen and unseen classes using supervision from only seen classes. Beyond adaptation-based methods, distillation-based approaches transfer the vision-language alignment of a vision-language model, e.g., CLIP, to segmentation models. However, such knowledge transfer remains challenging due to: (1) the difficulty of aligning vision-based features with the textual space, which requires combining spatial precision with vision-language alignment; and (2) the semantic gap between CLIP's global representations and the local, fine-grained features of segmentation models. To address challenge (1), we propose Chimera-Seg, which integrates a segmentation backbone as the body and a CLIP-based semantic head as the head, like the Chimera in Greek mythology, combining spatial precision with vision-language alignment. Specifically, Chimera-Seg…
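To make "aligning vision-based features with the textual space" concrete, here is a minimal, generic sketch of how dense features can be labeled once they share a space with class text embeddings: each pixel feature takes the class whose text embedding has the highest cosine similarity. This is not the Chimera-Seg head, whose details are cut off above. The function names and the toy 2-D vectors are assumptions that stand in for real CLIP embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def classify_pixels(pixel_features, text_embeddings):
    """Return, for each pixel feature, the index of the most similar class embedding."""
    texts = [l2_normalize(t) for t in text_embeddings]
    labels = []
    for feat in pixel_features:
        f = l2_normalize(feat)
        # Cosine similarity reduces to a dot product once both sides are unit length.
        sims = [sum(a * b for a, b in zip(f, t)) for t in texts]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


# Toy data: three "pixels" and two class embeddings (hypothetical values).
toy_pixels = [[1.0, 0.1], [0.2, 3.0], [-1.0, 0.0]]
toy_classes = [[1.0, 0.0], [0.0, 1.0]]
toy_labels = classify_pixels(toy_pixels, toy_classes)
```

Since the class set enters only through the text embeddings, unseen classes can be added at inference time by supplying their embeddings, which is what makes this style of prediction zero-shot.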
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
