Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision
Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, Weidi Xie

TL;DR
This paper introduces OVSegmentor, a transformer-based model for open-vocabulary semantic segmentation trained solely on web image-text data, achieving state-of-the-art zero-shot results on multiple benchmarks.
Contribution
The paper presents a novel transformer model with a new training paradigm using proxy tasks and a curated dataset, enabling effective open-vocabulary segmentation without mask annotations.
Findings
Achieves superior zero-shot segmentation results on PASCAL VOC, PASCAL Context, and COCO.
Uses only 3% of the pre-training data used by previous methods, demonstrating high data efficiency.
Introduces proxy tasks that improve fine-grained visual-text alignment.
Abstract
In this paper, we consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories. The main contributions are as follows: First, we propose a transformer-based model for OVS, termed as OVSegmentor, which only exploits web-crawled image-text pairs for pre-training without using any mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention based binding module, and aligns the group tokens to the corresponding caption embedding. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given the group tokens, that enables the model to learn fine-grained alignment between visual groups and text entities. The latter…
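The grouping step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes plain dot-product attention with no learned projections, no recurrent update, and random stand-in features. The names `slot_attention_binding` and `label_groups` are hypothetical. The sketch shows group tokens competing for pixels through a softmax over the group axis, followed by assigning each group to its nearest text embedding by cosine similarity.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_binding(pixels, groups, iters=3, eps=1e-8):
    """Simplified slot-attention-style binding (assumption: no learned weights).

    pixels: (N, D) pixel/patch features; groups: (K, D) group tokens.
    Returns the updated groups (K, D) and the last soft assignment (N, K).
    """
    d = pixels.shape[1]
    for _ in range(iters):
        logits = pixels @ groups.T / np.sqrt(d)       # (N, K) similarities
        attn = softmax(logits, axis=1)                # groups compete for each pixel
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
        groups = weights.T @ pixels                   # weighted mean of pixels per group
    return groups, attn

def label_groups(groups, text_emb):
    # Assign each group to the class whose text embedding is most similar (cosine).
    g = groups / np.linalg.norm(groups, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return (g @ t.T).argmax(axis=1)

# Toy usage with random stand-in features (16 pixels, 4 groups, 3 class names).
rng = np.random.default_rng(0)
pixels = rng.normal(size=(16, 8))
groups = rng.normal(size=(4, 8))
text_emb = rng.normal(size=(3, 8))

new_groups, attn = slot_attention_binding(pixels, groups)
# Each pixel inherits the class of the group it is most strongly assigned to.
pixel_labels = label_groups(new_groups, text_emb)[attn.argmax(axis=1)]
```

Normalizing the attention over groups rather than over pixels is what makes the groups compete for pixels, so each pixel ends up explained mostly by one group. In a trained model the features and group tokens would come from the image and text encoders rather than from a random generator.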
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
