PixelGen: Improving Pixel Diffusion with Perceptual Supervision

Zehong Ma, Ruihan Xu, and Shiliang Zhang

arXiv:2602.02493·cs.CV·May 8, 2026

TL;DR

PixelGen introduces perceptual supervision into pixel diffusion models, significantly improving image quality and training efficiency on ImageNet and text-to-image tasks.

Contribution

It proposes a novel end-to-end pixel diffusion framework with perceptual losses and noise-gating, enhancing sample quality without complex two-stage pipelines.

Findings

01. Achieves an FID of 5.11 on ImageNet-256 in 80 epochs.

02. Reaches a GenEval score of 0.79 in text-to-image generation.

03. Outperforms latent diffusion baselines on key metrics.

Abstract

Pixel diffusion generates images directly in pixel space, avoiding the VAE artifacts and representational bottlenecks of two-stage latent diffusion. The recent JiT further simplifies pixel diffusion with x-prediction, where the model predicts clean images rather than velocity. However, the standard pixel-wise diffusion loss treats all pixels equally, spending model capacity on perceptually insignificant signals and often producing blurry samples. We propose PixelGen, an end-to-end pixel diffusion framework that augments x-prediction with perceptual supervision. Specifically, PixelGen introduces two complementary perceptual losses on top of x-prediction: an LPIPS loss for local textures and a P-DINO loss for global semantics. To preserve sample coverage, PixelGen further proposes a noise-gating strategy that applies these losses only at lower-noise timesteps. On ImageNet-256 without…
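
The noise-gating idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the hard threshold gate, the loss weights, the `noise_level` convention (0 is a clean image, 1 is pure noise), and every name below are assumptions chosen for illustration. The per-sample LPIPS and P-DINO values are taken as precomputed numbers so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GatedLossConfig:
    # All defaults are illustrative assumptions, not values from the paper.
    noise_threshold: float = 0.3  # gate opens below this noise level
    lpips_weight: float = 0.1     # hypothetical weight for the LPIPS term
    pdino_weight: float = 0.01    # hypothetical weight for the P-DINO term


def noise_gate(noise_level: float, threshold: float) -> float:
    """Hard gate: 1.0 for lower-noise samples, 0.0 otherwise.

    Assumes noise_level lies in [0, 1], with 0 meaning a clean image.
    """
    return 1.0 if noise_level < threshold else 0.0


def combined_loss(
    pixel_losses: Sequence[float],
    lpips_losses: Sequence[float],
    pdino_losses: Sequence[float],
    noise_levels: Sequence[float],
    cfg: GatedLossConfig,
) -> float:
    """Average the per-sample pixel loss plus gated perceptual terms."""
    if not pixel_losses:
        raise ValueError("expected at least one sample")
    total = 0.0
    for pix, lp, pd, level in zip(
        pixel_losses, lpips_losses, pdino_losses, noise_levels
    ):
        gate = noise_gate(level, cfg.noise_threshold)
        # The pixel-wise x-prediction loss is always applied; the perceptual
        # terms contribute only when the gate is open (lower noise).
        total += pix + gate * (cfg.lpips_weight * lp + cfg.pdino_weight * pd)
    return total / len(pixel_losses)
```

In this sketch a sample at high noise is trained with the pixel-wise loss alone, while a sample at low noise also receives the two perceptual terms. A real training loop would compute the LPIPS and P-DINO values from the predicted and target images with the corresponding networks, and the gate may be smooth rather than a hard threshold.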

Code & Models

Repositories

Zehong-Ma/PixelGen (GitHub)
