DiP: Taming Diffusion Models in Pixel Space

Zhennan Chen; Junwei Zhu; Xu Chen; Jiangning Zhang; Xiaobin Hu; Hanzhen Zhao; Chengjie Wang; Jian Yang; Ying Tai

arXiv:2511.18822 · cs.CV · March 27, 2026

Open Access

TL;DR

DiP is an efficient pixel space diffusion framework that pairs global structure generation with local detail restoration, achieving high-quality, high-resolution image synthesis with up to 10x faster inference than previous methods and only a minimal increase in parameters.

Contribution

The paper presents DiP, a pixel space diffusion model that decouples global and local generation, enabling faster inference without relying on a VAE and with minimal parameter overhead.

Findings

01 Up to 10x faster inference than previous methods

02 Achieves an FID score of 1.79 on ImageNet 256x256

03 Maintains high-quality synthesis with a minimal parameter increase

Abstract

Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global stage and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP achieves up to $10\times$ faster inference speeds…
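The efficiency argument in the abstract can be illustrated with simple token-count arithmetic. The sketch below is not taken from the paper: the patch sizes (2 and 16), the 8x VAE downsampling factor, and the helper names `num_tokens` and `relative_attention_cost` are all assumptions chosen for illustration. It only shows why a transformer operating on large pixel patches can see a sequence length similar to that of a latent-space model, whereas small pixel patches make self-attention far more expensive.

```python
def num_tokens(image_size: int, patch_size: int, downsample: int = 1) -> int:
    """Number of tokens a transformer sees for a square image.

    `downsample` models a VAE's spatial reduction (1 means pixel space, no VAE).
    """
    side = image_size // downsample
    if side % patch_size != 0:
        raise ValueError("patch_size must divide the (downsampled) side length")
    per_side = side // patch_size
    return per_side * per_side


def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Self-attention cost grows with tokens squared; return the ratio to a baseline."""
    return (tokens / baseline_tokens) ** 2


# All settings below are illustrative assumptions, not values stated in the abstract.
latent = num_tokens(256, patch_size=2, downsample=8)  # LDM-style: 8x VAE, 2x2 patches
pixel_small = num_tokens(256, patch_size=2)           # pixel space, small patches
pixel_large = num_tokens(256, patch_size=16)          # pixel space, large patches

print(latent, pixel_small, pixel_large)
print(relative_attention_cost(pixel_small, latent))
print(relative_attention_cost(pixel_large, latent))
```

Under these assumed settings, the large-patch pixel model and the latent model process the same number of tokens, which is consistent with the abstract's claim of LDM-comparable efficiency. The cost of large patches is lost local detail inside each patch, which is the role the abstract assigns to the Patch Detailer Head.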

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Domain Adaptation and Few-Shot Learning