Open-vocabulary Object Segmentation with Diffusion Models
Ziyi Li, Qinye Zhou, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie

TL;DR
This paper introduces a method to extract segmentation maps from pre-trained diffusion models using a novel grounding module, enabling open-vocabulary segmentation and zero-shot performance on segmentation benchmarks.
Contribution
The paper pairs Stable Diffusion with a new grounding module, builds an automatic dataset construction pipeline, and demonstrates zero-shot segmentation with diffusion models.
Findings
The grounding module effectively segments unseen object categories.
Synthetic datasets from diffusion models improve zero-shot segmentation performance.
The approach enables open-vocabulary segmentation using pre-trained diffusion models.
Abstract
The goal of this paper is to extract the visual-language correspondence from a pre-trained text-to-image diffusion model in the form of a segmentation map, i.e., to simultaneously generate images and segmentation masks for the corresponding visual entities described in the text prompt. We make the following contributions: (i) we pair the existing Stable Diffusion model with a novel grounding module that can be trained to align the visual and textual embedding spaces of the diffusion model with only a small number of object categories; (ii) we establish an automatic pipeline for constructing a dataset of {image, segmentation mask, text prompt} triplets to train the proposed grounding module; (iii) we evaluate the performance of open-vocabulary grounding on images generated by the text-to-image diffusion model and show that the module can well segment the objects of…
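To make the {image, segmentation mask, text prompt} triplet and the idea of aligning visual and textual embeddings more concrete, here is a minimal toy sketch in plain Python. It is not the paper's architecture: the names `Triplet`, `ground`, and `iou`, the per-pixel dot product followed by a sigmoid and a threshold, and the tiny hand-written feature vectors are all assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from math import exp
from typing import List

Mask = List[List[int]]                 # binary mask, 1 = object pixel
Features = List[List[List[float]]]     # H x W grid of per-pixel feature vectors


@dataclass
class Triplet:
    """Hypothetical container for one {image, segmentation mask, text prompt} sample."""
    image_id: str   # stand-in for the generated image
    mask: Mask      # reference segmentation mask
    prompt: str     # text prompt used to generate the image


def ground(features: Features, text_embedding: List[float],
           threshold: float = 0.5) -> Mask:
    """Toy grounding step (an assumption, not the paper's module):
    score each pixel by the dot product between its visual feature and
    the text embedding, squash with a sigmoid, then threshold."""
    mask: Mask = []
    for row in features:
        mask_row = []
        for pixel in row:
            score = sum(f * t for f, t in zip(pixel, text_embedding))
            prob = 1.0 / (1.0 + exp(-score))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask


def iou(pred: Mask, ref: Mask) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    inter = sum(p & r for pr, rr in zip(pred, ref) for p, r in zip(pr, rr))
    union = sum(p | r for pr, rr in zip(pred, ref) for p, r in zip(pr, rr))
    return inter / union if union else 0.0


# Made-up 2x2 feature map with 2-dimensional features, for illustration only.
features = [[[2.0, 0.0], [-2.0, 0.0]],
            [[0.0, 1.0], [0.0, -1.0]]]
text_embedding = [1.0, 1.0]

sample = Triplet(image_id="img_0", mask=[[1, 0], [1, 1]],
                 prompt="a photograph of a dog")
pred = ground(features, text_embedding)
score = iou(pred, sample.mask)
```

In a trained system the reference mask in each triplet would supervise the grounding module; here the IoU only shows how a predicted mask could be compared against it.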
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: ALIGN · Diffusion
