Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision
Zhaoqing Wang, Xiaobo Xia, Ziye Chen, Xiao He, Yandong Guo, Mingming Gong, Tongliang Liu

TL;DR
Unpair-Seg introduces a weakly-supervised framework for open-vocabulary segmentation that learns from unpaired image-mask and image-text data, reducing annotation costs and improving performance.
Contribution
It proposes a novel method to perform open-vocabulary segmentation using unpaired supervision, addressing noise issues with a vision-language model and multi-scale matching.
Findings
Achieves 14.6% mIoU on ADE-847 dataset.
Achieves 19.5% mIoU on PASCAL Context-459 dataset.
Narrows the gap with fully-supervised methods.
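
For readers unfamiliar with the metric behind these numbers, mIoU (mean intersection-over-union) averages the per-class overlap between predicted and ground-truth labels. The sketch below is a minimal illustration and not the paper's evaluation code: the function name `mean_iou`, the flat label lists, and the choice to skip classes that appear in neither prediction nor ground truth are all assumptions made for this example.

```python
from typing import List, Optional


def mean_iou(
    pred: List[int],
    target: List[int],
    num_classes: int,
    ignore_index: Optional[int] = None,
) -> float:
    """Mean IoU over the classes that occur in pred or target.

    pred and target are flat lists of per-pixel class ids (hypothetical
    input format chosen to keep the example dependency-free).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip pixels marked as "ignore" in the ground truth
        if p == t:
            # Pixel belongs to both the predicted and the true region of class t.
            intersection[t] += 1
            union[t] += 1
        else:
            # Pixel enlarges the union of the predicted class and of the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
print(f"mIoU = {score:.4f}")
```

In the usage line, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. Benchmark toolkits typically accumulate the intersection and union counts over the whole dataset before dividing, which this per-call sketch does not do.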
Abstract
Current state-of-the-art open-vocabulary segmentation methods typically rely on image-mask-text triplet annotations for supervision. However, acquiring such detailed annotations is labour-intensive and poses scalability challenges in complex real-world scenarios. While existing weakly-supervised approaches leverage image-text pairs to reduce the expensive annotation cost, the lack of mask supervision makes it difficult for the model to locate multiple instances and accurately group pixels with similar semantics, significantly hampering versatility and performance. In this paper, we introduce Unpair-Seg, a novel weakly-supervised open-vocabulary segmentation framework that learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected. Unpair-Seg initially predicts a set of binary masks and generates pseudo labels by identifying confident pairs…
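
The abstract is cut off before it explains how confident pairs are identified, so the following is only an illustrative sketch of one generic way to pair mask embeddings with text embeddings. It is not the authors' procedure. It swaps in a simple technique, thresholded mutual nearest-neighbour matching on cosine similarity, and every name in it (`cosine`, `confident_pairs`, the `threshold` value, the toy embeddings) is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def confident_pairs(
    mask_embs: List[Sequence[float]],
    text_embs: List[Sequence[float]],
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> List[Tuple[int, int, float]]:
    """Return (mask_idx, text_idx, similarity) for mutual best matches.

    A pair is kept only if the text is the mask's best match, the mask is
    the text's best match, and the similarity reaches the threshold.
    """
    if not mask_embs or not text_embs:
        return []
    sim = [[cosine(m, t) for t in text_embs] for m in mask_embs]
    pairs = []
    for i, row in enumerate(sim):
        j = max(range(len(row)), key=row.__getitem__)  # best text for mask i
        best_mask = max(range(len(sim)), key=lambda k: sim[k][j])  # best mask for text j
        if best_mask == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy embeddings: masks 0 and 1 align with texts 0 and 1; mask 2 is ambiguous.
masks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
for mask_idx, text_idx, similarity in confident_pairs(masks, texts):
    print(mask_idx, text_idx, round(similarity, 3))
```

The mutual-best-match test discards the ambiguous third mask, which is the kind of noisy correspondence a pseudo-labelling step would want to leave unlabelled. A one-to-one assignment solver could be substituted for the greedy mutual check if every text entity had to receive exactly one mask.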
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · Adapter · ALIGN · Contrastive Language-Image Pre-training
