HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
Qi Cai, Jingwen Chen, Chengmin Gao, Zijian Gong, Yehao Li, Yingwei Pan, Yi Peng, Zhaofan Qiu, Kai Yu, Yiheng Zhang, Hao Ai, Siying Bai, Yang Chen, Zhihui Chen, Fengbin Gao, Ying Guo, Dong Li, Zhen Shen, Leilei Shi, Jing Wang, Siyu Wang, Yimeng Wang, Rui Zheng, Ting Yao, Tao Mei

TL;DR
HiDream-O1-Image introduces a unified pixel-space diffusion transformer that integrates multimodal inputs for versatile image generation and editing, achieving high performance with a scalable architecture.
Contribution
The paper presents a novel end-to-end unified transformer model that eliminates the need for separate encoders and VAEs, enabling scalable, multimodal image generation and editing.
Findings
Achieves state-of-the-art results across various tasks with only 8B parameters.
Scales the architecture to over 200B parameters, surpassing larger models.
Demonstrates superior performance and versatility in image generation and editing tasks.
Abstract
The evolution of visual generative models has long been constrained by fragmented architectures relying on disjoint text encoders and external VAEs. In this report, we present HiDream-O1-Image, a natively unified generative foundation model built on a pixel-space Diffusion Transformer, which pioneers a paradigm shift from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves a structural unification of multimodal inputs within a Unified Transformer (UiT) architecture. This native encoding paradigm eliminates the need for separate VAEs or disjoint pre-trained text encoders, allowing the model to treat diverse generation and editing tasks as a consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across…
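The abstract's central idea, mapping raw pixels and text tokens into one shared token sequence without a VAE or a separate text encoder, can be illustrated with a toy sketch. This is not the paper's implementation: the patch size, embedding width, vocabulary size, random weights, and the helper names `patchify`, `linear`, and `build_sequence` are all assumptions made for illustration, and task-specific conditions are left out.

```python
import random

def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one patch row by row, keeping channels together per pixel.
            flat = [ch
                    for row in image[top:top + patch]
                    for pixel in row[left:left + patch]
                    for ch in pixel]
            patches.append(flat)
    return patches

def linear(vec, weight):
    """Project a vector with a (len(vec) x d_model) weight matrix."""
    d_model = len(weight[0])
    return [sum(v * w_row[j] for v, w_row in zip(vec, weight)) for j in range(d_model)]

def build_sequence(image, text_ids, patch=2, d_model=8, vocab=100, seed=0):
    """Hypothetical helper: embed text ids and raw pixel patches into one sequence."""
    rng = random.Random(seed)
    channels = len(image[0][0])
    patch_dim = patch * patch * channels
    # Assumed stand-ins for learned parameters: random pixel projection and text table.
    w_pixel = [[rng.gauss(0, 0.02) for _ in range(d_model)] for _ in range(patch_dim)]
    text_table = [[rng.gauss(0, 0.02) for _ in range(d_model)] for _ in range(vocab)]
    text_tokens = [text_table[i] for i in text_ids]
    pixel_tokens = [linear(p, w_pixel) for p in patchify(image, patch)]
    # Both modalities share one width, so a single transformer could attend over all of them.
    tokens = text_tokens + pixel_tokens
    modality = ["text"] * len(text_tokens) + ["pixel"] * len(pixel_tokens)
    return tokens, modality

if __name__ == "__main__":
    # A 4 x 4 single-channel image and three made-up text token ids.
    image = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
    tokens, modality = build_sequence(image, [3, 7, 9])
    print(len(tokens), modality)
```

The point of the sketch is only that text and pixel patches end up as rows of the same width in one sequence; how the actual model embeds, orders, and conditions these tokens is not specified in the text above.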
Code & Models
- 🤗 HiDream-ai/HiDream-O1-Image (model): 23k downloads, 422 likes
- 🤗 HiDream-ai/HiDream-O1-Image-Dev-2604 (model): 921 downloads, 53 likes
- 🤗 HiDream-ai/HiDream-O1-Image-Dev (model): 7.1k downloads, 105 likes
- 🤗 Jennensje/HiDream-O1-Image-Dev (model): 15 downloads
- 🤗 Akhmad123/HiDream-O1-Image1 (model): 22 downloads
- 🤗 HodgeMann/HiDream-O1-Image-Dev-2604-FP16-merged (model)
- 🤗 xowj/HiDream-O1-Image (model): 9 downloads
