JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion

Haoyu Wang; Lei Zhang; Wenrui Liu; Dengyang Jiang; Wei Wei; Chen Ding

arXiv:2512.13014 · cs.CV · December 16, 2025



TL;DR

JoDiffusion is a novel diffusion-based framework that jointly generates images and pixel-level annotations from text prompts, improving scalability and annotation consistency for semantic segmentation datasets.

Contribution

It introduces a joint generative model combining diffusion and VAE techniques to produce paired images and annotations from text, addressing annotation cost and inconsistency issues.

Findings

01 Generated datasets improve segmentation performance on Pascal VOC, COCO, and ADE20K.

02 Outperforms existing synthetic data generation methods in quality and scalability.

03 Mask optimization reduces annotation noise during generation.

Abstract

Because pixel-level annotation is inherently costly and time-intensive, generating synthetic datasets of sufficiently diverse images paired with ground-truth pixel-level annotations has recently attracted increasing attention for training high-performance semantic segmentation models. However, existing methods either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, which leads to image-annotation semantic inconsistency or to a scalability problem, respectively. To mitigate both problems at once, we present a novel dataset-generation diffusion framework for semantic segmentation, termed JoDiffusion. First, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared by…
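To make the idea of an annotation VAE more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a mask is an integer label map, one-hot encodes it, and pushes each pixel through an untrained linear encoder and decoder. The class name `ToyAnnotationVAE`, the helper `one_hot`, the latent size, and the per-pixel linear layers are all hypothetical choices made for illustration; the real network in the paper is presumably convolutional and trained.

```python
import numpy as np


def one_hot(mask, num_classes):
    """Turn an (H, W) integer label map into an (H, W, C) one-hot array."""
    return np.eye(num_classes, dtype=np.float32)[mask]


class ToyAnnotationVAE:
    """Hypothetical, untrained per-pixel linear VAE for label masks."""

    def __init__(self, num_classes, latent_dim, seed=0):
        self.rng = np.random.default_rng(seed)
        self.num_classes = num_classes
        # Random weights stand in for parameters that would be learned.
        self.w_mu = self.rng.normal(scale=0.1, size=(num_classes, latent_dim))
        self.w_logvar = self.rng.normal(scale=0.1, size=(num_classes, latent_dim))
        self.w_dec = self.rng.normal(scale=0.1, size=(latent_dim, num_classes))

    def encode(self, mask):
        """Map a discrete mask to the mean and log-variance of a Gaussian latent."""
        x = one_hot(mask, self.num_classes)
        return x @ self.w_mu, x @ self.w_logvar

    def sample(self, mu, logvar):
        """Reparameterization trick: z = mu + sigma * eps."""
        eps = self.rng.standard_normal(mu.shape)
        return mu + np.exp(0.5 * logvar) * eps

    def decode(self, z):
        """Map a continuous latent back to a discrete mask via per-pixel argmax."""
        logits = z @ self.w_dec
        return logits.argmax(axis=-1)


# Toy 2x3 mask with three classes (illustrative data only).
mask = np.array([[0, 1, 1], [2, 2, 0]])
vae = ToyAnnotationVAE(num_classes=3, latent_dim=4)
mu, logvar = vae.encode(mask)
z = vae.sample(mu, logvar)   # continuous latent a diffusion model could operate on
recon = vae.decode(z)        # discrete mask again (not meaningful until trained)
```

The point of the sketch is only the data flow: a discrete mask becomes a continuous latent, and a decoder turns a latent back into class labels. Because the weights are random, `recon` will not generally match `mask`.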


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Multimodal Machine Learning Applications