Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision

Zhaoqing Wang; Xiaobo Xia; Ziye Chen; Xiao He; Yandong Guo; Mingming Gong; Tongliang Liu

arXiv:2402.08960·cs.CV·June 12, 2024·2 cites

TL;DR

Unpair-Seg introduces a weakly-supervised framework for open-vocabulary segmentation that learns from unpaired image-mask and image-text data, reducing annotation costs and improving performance.

Contribution

It proposes a novel method to perform open-vocabulary segmentation using unpaired supervision, addressing noise issues with a vision-language model and multi-scale matching.

Findings

01. Achieves 14.6% mIoU on the ADE-847 dataset.
02. Achieves 19.5% mIoU on the PASCAL Context-459 dataset.
03. Narrows the gap with fully-supervised methods.

Abstract

Current state-of-the-art open-vocabulary segmentation methods typically rely on image-mask-text triplet annotations for supervision. However, acquiring such detailed annotations is labour-intensive and poses scalability challenges in complex real-world scenarios. While existing weakly-supervised approaches leverage image-text pairs to reduce the expensive annotation cost, the lack of mask supervision makes it difficult for the model to locate multiple instances and accurately group pixels with similar semantics, significantly hampering versatility and performance. In this paper, we introduce Unpair-Seg, a novel weakly-supervised open-vocabulary segmentation framework that learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected. Unpair-Seg initially predicts a set of binary masks and generates pseudo labels by identifying confident pairs…
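The abstract is cut off before it says how confident pairs are identified, so the following is only an illustrative sketch and not the authors' method. It assumes that each predicted mask and each text entity has an embedding in a shared space, that a mask is paired with its most similar text entity by cosine similarity, and that the pair counts as confident when the similarity reaches a fixed threshold. The function names, the threshold value, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def confident_pairs(mask_embs, text_embs, threshold=0.8):
    """Return (mask_index, text_index, similarity) for confident matches.

    Assumed rule (not taken from the paper): each mask is paired with its
    most similar text entity, and the pair is kept only if the similarity
    is at least `threshold`.
    """
    if not text_embs:
        return []
    pairs = []
    for i, m in enumerate(mask_embs):
        sims = [cosine(m, t) for t in text_embs]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:
            pairs.append((i, j, sims[j]))
    return pairs


# Toy 2-D embeddings: two masks align with one text entity each,
# the third mask is ambiguous and falls below the threshold.
mask_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9]]
pairs = confident_pairs(mask_embs, text_embs, threshold=0.9)
```

In this toy case the first two masks are kept as pseudo-labelled pairs and the ambiguous third mask is discarded, which is the general idea behind filtering noisy matches before using them as supervision.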

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · Adapter · ALIGN · Contrastive Language-Image Pre-training