PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu; Wei Xiong; Weili Nie; Yichen Sheng; Shiqiu Liu; Jiebo Luo

arXiv:2511.20645·cs.CV·April 17, 2026

TL;DR

PixelDiT is a single-stage diffusion transformer that models images directly in pixel space, without an autoencoder. It reports 1.61 FID on ImageNet 256 and 1.81 FID on ImageNet 512, surpassing existing pixel generative models, and extends to text-to-image generation.

Contribution

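The paper proposes an end-to-end, pixel-space diffusion transformer with a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details. This removes the reliance on a pretrained autoencoder and the lossy reconstruction it introduces.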

Findings

01. Achieves 1.61 FID on ImageNet 256

02. Achieves 1.81 FID on ImageNet 512

03. Pretrained PixelDiT approaches state-of-the-art in text-to-image generation

Abstract

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. PixelDiT achieves 1.61 FID on ImageNet 256 and 1.81 FID on ImageNet 512, surpassing existing pixel generative models. We further extend PixelDiT to text-to-image generation and…
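
The dual-level design described in the abstract can be illustrated with a small shape sketch. This is not the authors' implementation: the patch size of 16, the function names `patchify` and `unpatchify`, and the use of NumPy are all assumptions made for illustration. It only shows how one image can be viewed at two granularities, as one token per patch (the level a patch-level model would attend over) and as per-pixel tokens grouped by patch (the level a pixel-level model would refine).

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size * patch_size, C),
    i.e. the pixels of the image grouped by the patch they belong to.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return x.reshape(gh * gw, patch_size * patch_size, c)


def unpatchify(patches, height, width):
    """Inverse of patchify: rebuild the (H, W, C) image from its patches."""
    n, pixels_per_patch, c = patches.shape
    p = int(round(pixels_per_patch ** 0.5))
    gh, gw = height // p, width // p
    x = patches.reshape(gh, gw, p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(height, width, c)


# Hypothetical 256x256 RGB input; patch size 16 is an assumption.
image = np.random.default_rng(0).standard_normal((256, 256, 3))

# Pixel-level view: 256 patches, each holding 16 * 16 = 256 pixel tokens.
pixel_tokens = patchify(image, patch_size=16)  # shape (256, 256, 3)

# Patch-level view: one flattened token per patch (256 * 3 = 768 values).
patch_tokens = pixel_tokens.reshape(pixel_tokens.shape[0], -1)  # shape (256, 768)
```

In this sketch, attention over `patch_tokens` would involve 256 tokens rather than the 65,536 individual pixels of the image, which is one plausible reading of why a dual-level split can make pixel-space training efficient while per-pixel detail stays available in `pixel_tokens`.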

Code & Models

Repositories

NVlabs/PixelDiT (GitHub)