Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision

Jilan Xu; Junlin Hou; Yuejie Zhang; Rui Feng; Yi Wang; Yu Qiao; Weidi Xie

arXiv:2301.09121·cs.CV·March 6, 2023·1 cites

Open Access · 1 Repo

TL;DR

This paper introduces OVSegmentor, a transformer-based model for open-vocabulary semantic segmentation trained solely on web image-text data, achieving state-of-the-art zero-shot results on multiple benchmarks.

Contribution

The paper presents a novel transformer model with a new training paradigm using proxy tasks and a curated dataset, enabling effective open-vocabulary segmentation without mask annotations.

Findings

01

Achieves superior zero-shot segmentation results on PASCAL VOC, PASCAL Context, and COCO.

02

Pre-trains on only 3% of the data used by the previous state-of-the-art method, demonstrating high data efficiency.

03

Introduces proxy tasks that improve fine-grained visual-text alignment.

Abstract

In this paper, we consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories. The main contributions are as follows. First, we propose a transformer-based model for OVS, termed OVSegmentor, which exploits only web-crawled image-text pairs for pre-training, without using any mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention-based binding module, and aligns the group tokens to the corresponding caption embedding. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given the group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities. The latter…
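The grouping-and-alignment idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a simplified, pure-Python version of slot-attention-style binding that omits the learned projections and update networks used in the original slot-attention formulation. The function names (`bind_groups`, `caption_score`), the toy 2-D features, and the mean-pooling of group tokens before comparison with the caption embedding are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a) * dot(b, b)) + 1e-8)


def bind_groups(patches, groups, iters=3, eps=1e-8):
    """Simplified slot-attention-style binding (hypothetical helper).

    patches: N feature vectors; groups: K initial group tokens.
    Returns the updated group tokens and attn, where attn[n][k] is the
    soft assignment of patch n to group k from the last iteration.
    """
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)
    attn = []
    for _ in range(iters):
        # Softmax over the group axis: groups compete for each patch.
        attn = [softmax([scale * dot(p, g) for g in groups]) for p in patches]
        new_groups = []
        for k in range(len(groups)):
            weights = [row[k] for row in attn]
            total = sum(weights) + eps
            # Each group token becomes the weighted mean of the patches it claims.
            new_groups.append([
                sum(w * p[d] for w, p in zip(weights, patches)) / total
                for d in range(dim)
            ])
        groups = new_groups
    return groups, attn


def caption_score(groups, caption_emb):
    """Cosine similarity between mean-pooled group tokens and a caption
    embedding (the pooling choice is an assumption of this sketch)."""
    dim = len(caption_emb)
    pooled = [sum(g[d] for g in groups) / len(groups) for d in range(dim)]
    return cosine(pooled, caption_emb)


# Toy "image": six patch features forming two visually distinct regions.
patches = [[4.0, 0.2], [3.8, 0.0], [4.1, 0.1],
           [0.1, 4.0], [0.0, 3.9], [0.2, 4.2]]
init_groups = [[0.6, 0.4], [0.4, 0.6]]

groups, attn = bind_groups(patches, init_groups)
# Hard assignment of each patch to its most likely group.
assignment = [max(range(len(row)), key=row.__getitem__) for row in attn]
print(assignment)
print(caption_score(groups, [1.0, 1.0]))
```

In a trained model the similarity score would feed a contrastive loss over image-caption pairs, and the per-patch assignments would serve as the segmentation mask; both steps are left out here to keep the sketch short.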

Code & Models

Repositories

Jazzcharles/OVSegmentor
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning