Open-Vocabulary Semantic Segmentation with Image Embedding Balancing
Xiangheng Shan, Dongyue Wu, Guilin Zhu, Yuanjie Shao, Nong Sang, Changxin Gao

TL;DR
This paper introduces EBSeg, a novel framework for open-vocabulary semantic segmentation that balances image embeddings and enforces semantic structure consistency, significantly improving generalization to new classes.
Contribution
The paper proposes EBSeg with an Adaptively Balanced Decoder and SSC Loss, enhancing CLIP-based segmentation by balancing embeddings and aligning semantic structures for better generalization.
Findings
Outperforms state-of-the-art methods on various benchmarks
Effectively balances training and new class recognition
Improves semantic structure understanding in segmentation tasks
Abstract
Open-vocabulary semantic segmentation is a challenging task that requires the model to output semantic masks of an image beyond a closed-set vocabulary. Although many efforts have been made to utilize powerful CLIP models to accomplish this task, they still easily overfit to training classes due to the natural gaps in semantic information between training and new classes. To overcome this challenge, we propose a novel framework for open-vocabulary semantic segmentation called EBSeg, incorporating an Adaptively Balanced Decoder (AdaB Decoder) and a Semantic Structure Consistency loss (SSC Loss). The AdaB Decoder is designed to generate different image embeddings for both training and new classes. Subsequently, these two types of embeddings are adaptively balanced to fully exploit their ability to recognize training classes and their generalization ability for new classes. To learn a…
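The balancing idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's AdaB Decoder: the abstract does not say how the balancing weight is computed, so the sketch assumes a single scalar `weight` supplied by the caller, and the function names (`balance_embeddings`, `classify`) and the toy two-dimensional vectors are hypothetical. It only shows how two image embeddings, one tuned for training classes and one kept more general, might be mixed and then matched against class text embeddings by cosine similarity, as is usual for CLIP-style models.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def balance_embeddings(emb_train, emb_general, weight):
    """Mix two image embeddings with a scalar weight in [0, 1].

    ASSUMPTION: a fixed scalar weight stands in for the adaptive
    balancing mentioned in the abstract, whose details are not given.
    weight=1.0 keeps only the training-class embedding,
    weight=0.0 keeps only the general embedding.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    a = l2_normalize(emb_train)
    b = l2_normalize(emb_general)
    mixed = [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]
    return l2_normalize(mixed)


def classify(image_embedding, text_embeddings):
    """Return the best class name and all cosine-similarity scores."""
    scores = {
        name: sum(x * y for x, y in zip(image_embedding, l2_normalize(vec)))
        for name, vec in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy data: hypothetical 2-D embeddings, not real CLIP features.
    text = {"seen_class": [1.0, 0.0], "new_class": [0.0, 1.0]}
    balanced = balance_embeddings([1.0, 0.0], [0.0, 1.0], weight=0.3)
    print(classify(balanced, text))
```

In this toy setup, a weight near 1 favors the class aligned with the training-tuned embedding, while a weight near 0 favors the class aligned with the general embedding. This mirrors the trade-off between recognizing training classes and generalizing to new ones that the abstract describes.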
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training · Segment Anything Model
