HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

Qi Cai; Jingwen Chen; Chengmin Gao; Zijian Gong; Yehao Li; Yingwei Pan; Yi Peng; Zhaofan Qiu; Kai Yu; Yiheng Zhang; Hao Ai; Siying Bai; Yang Chen; Zhihui Chen; Fengbin Gao; Ying Guo; Dong Li; Zhen Shen; Leilei Shi; Jing Wang; Siyu Wang; Yimeng Wang; Rui Zheng; Ting Yao; Tao Mei

arXiv:2605.11061·cs.CV·May 13, 2026

TL;DR

HiDream-O1-Image introduces a unified pixel-space diffusion transformer that maps multimodal inputs into a single shared token space for image generation and editing, achieving high performance with a scalable architecture.

Contribution

The paper presents an end-to-end unified transformer model that eliminates the need for separate pre-trained text encoders and external VAEs, enabling scalable, multimodal image generation and editing.

Findings

01. Achieves state-of-the-art results across various tasks with only 8B parameters.

02. Scales the architecture to more than 200B parameters, surpassing larger models.

03. Demonstrates superior performance and versatility in image generation and editing tasks.

Abstract

The evolution of visual generative models has long been constrained by fragmented architectures relying on disjoint text encoders and external VAEs. In this report, we present HiDream-O1-Image, a natively unified generative foundation model built on a pixel-space Diffusion Transformer, which pioneers a paradigm shift from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves a structural unification of multimodal inputs within a Unified Transformer (UiT) architecture. This native encoding paradigm eliminates the need for separate VAEs or disjoint pre-trained text encoders, allowing the model to treat diverse generation and editing tasks as a consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across…
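As a rough illustration of what "mapping raw image pixels and text tokens into a single shared token space" can look like, here is a minimal NumPy sketch. It is not the paper's implementation: the function names (`patchify`, `build_shared_sequence`), the patch size, the embedding width, the vocabulary size, and the random projection weights are all hypothetical choices made for this example, and the abstract does not say how HiDream-O1-Image actually tokenizes pixels. The sketch only shows the general idea of placing pixel patches and text tokens in one sequence with no VAE or separate text encoder in between.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def build_shared_sequence(image, text_ids, patch=4, d_model=16, vocab=100, seed=0):
    """Embed text ids and raw pixel patches into one token sequence.

    All weights are random stand-ins (hypothetical); a real model would
    learn them jointly with the transformer.
    """
    rng = np.random.default_rng(seed)
    patch_dim = patch * patch * image.shape[2]
    pixel_proj = rng.normal(size=(patch_dim, d_model))  # linear patch embedding
    text_table = rng.normal(size=(vocab, d_model))      # token lookup table

    pixel_tokens = patchify(image, patch) @ pixel_proj  # (num_patches, d_model)
    text_tokens = text_table[np.asarray(text_ids)]      # (num_text, d_model)

    # One sequence, one width: a single transformer can attend over both.
    return np.concatenate([text_tokens, pixel_tokens], axis=0)

image = np.zeros((8, 8, 3))
seq = build_shared_sequence(image, [1, 5, 7])
print(seq.shape)  # (7, 16): 3 text tokens + 4 pixel patches
```

Under these assumptions, an editing task could be expressed by appending the patches of a reference image to the same sequence, which is one way to read the abstract's "in-context" framing.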
