Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image Team; Huanqia Cai; Sihan Cao; Ruoyi Du; Peng Gao; Steven Hoi; Zhaohui Hou; Shijie Huang; Dengyang Jiang; Xin Jin; Liangchen Li; Zhen Li; Zhong-Yu Li; David Liu; Dongyang Liu; Junhan Shi; Qilong Wu; Feng Yu; Chi Zhang; Shifeng Zhang; Shilin Zhou

arXiv:2511.22699 · cs.CV · December 9, 2025

TL;DR

Z-Image is an efficient 6B-parameter image generation model built on a single-stream diffusion transformer. It reaches performance competitive with much larger models at a fraction of their compute cost, which makes it practical to run on consumer-grade hardware.

Contribution

The paper presents Z-Image, a scalable, efficient image generation model with a streamlined training process and a new architecture that challenges the scale-at-all-costs paradigm in the field.

Findings

01. Achieves performance comparable to top-tier models.
02. Enables sub-second inference on consumer hardware.
03. Reduces training costs to approximately $630K.

Abstract

The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields…
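
The training budget quoted in the abstract can be sanity-checked with simple arithmetic. The following is a minimal Python sketch; the variable names are illustrative, and the resulting hourly rate is only what the abstract's two rounded figures imply, not a price stated by the authors.

```python
# Figures quoted in the abstract (both are rounded).
gpu_hours = 314_000        # "314K H800 GPU hours"
total_cost_usd = 630_000   # "approx. $630K"

# Implied average cost of one H800 GPU hour.
implied_rate = total_cost_usd / gpu_hours
print(f"Implied rate: ${implied_rate:.2f} per GPU hour")
```

Under these rounded inputs the implied rate comes out at roughly $2 per H800 GPU hour.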

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Advanced Neural Network Applications