JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion
Haoyu Wang, Lei Zhang, Wenrui Liu, Dengyang Jiang, Wei Wei, Chen Ding

TL;DR
JoDiffusion is a novel diffusion-based framework that jointly generates images and pixel-level annotations from text prompts, improving scalability and annotation consistency for semantic segmentation datasets.
Contribution
It introduces a joint generative model combining diffusion and VAE techniques to produce paired images and annotations from text, addressing annotation cost and inconsistency issues.
Findings
Generated datasets improve segmentation performance on Pascal VOC, COCO, ADE20K.
Outperforms existing synthetic data generation methods in quality and scalability.
Mask optimization reduces annotation noise during generation.
Abstract
Given the inherently costly and time-intensive nature of pixel-level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground-truth pixel-level annotations has garnered increasing attention recently for training high-performance semantic segmentation models. However, existing methods necessitate to either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, which incurs image-annotation semantic inconsistency or scalability problem. To migrate both problems with one stone, we present a novel dataset generative diffusion framework for semantic segmentation, termed JoDiffusion. Firstly, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared by…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Multimodal Machine Learning Applications
