Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation
Yun Xing, Jian Kang, Aoran Xiao, Jiahao Nie, Ling Shao, Shijian Lu

TL;DR
This paper introduces Concept Curation (CoCu), a method that uses CLIP to bridge semantic gaps between visual concepts and textual descriptions, significantly improving zero-shot language-supervised semantic segmentation performance.
Contribution
The paper proposes CoCu, a novel pipeline that enhances language-supervised segmentation by compensating missing semantics using CLIP and concept curation techniques.
Findings
Achieves state-of-the-art zero-shot segmentation performance
Significantly boosts baseline results across 8 benchmarks
Demonstrates the effectiveness of bridging semantic gaps in pre-training
Abstract
Vision-Language Pre-training has demonstrated its remarkable zero-shot recognition ability and potential to learn generalizable visual representations from language supervision. Taking a step ahead, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, the state-of-the-art suffers from clear semantic gaps between visual and textual modality: plenty of visual concepts appeared in images are missing in their paired captions. Such semantic misalignment circulates in pre-training, leading to inferior zero-shot performance in dense predictions due to insufficient visual concepts captured in textual representations. To close such semantic gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
MethodsContrastive Language-Image Pre-training
