TL;DR
PixelGen introduces perceptual supervision into pixel diffusion models, significantly improving image quality and training efficiency on ImageNet and text-to-image tasks.
Contribution
It proposes a novel end-to-end pixel diffusion framework with perceptual losses and noise-gating, enhancing sample quality without complex two-stage pipelines.
Findings
Achieves an FID of 5.11 on ImageNet-256 in 80 epochs.
Reaches a GenEval score of 0.79 in text-to-image generation.
Outperforms latent diffusion baselines on key metrics.
Abstract
Pixel diffusion generates images directly in pixel space, avoiding the VAE artifacts and representational bottlenecks of two-stage latent diffusion. The recent JiT further simplifies pixel diffusion with x-prediction, where the model predicts clean images rather than velocity. However, the standard pixel-wise diffusion loss treats all pixels equally, spending model capacity on perceptually insignificant signals and often leading to blurry samples. We propose PixelGen, an end-to-end pixel diffusion framework that augments x-prediction with perceptual supervision. Specifically, PixelGen introduces two complementary perceptual losses on top of x-prediction: an LPIPS loss for local textures and a P-DINO loss for global semantics. To preserve sample coverage, PixelGen further proposes a noise-gating strategy that applies these losses only at lower-noise timesteps. On ImageNet-256 without…
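The noise-gating idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a noise level in [0, 1] where 1 means pure noise, a hypothetical hard threshold `gate_threshold`, and hypothetical loss weights. The two perceptual terms are toy stand-ins operating on flattened pixel lists; they are not LPIPS or P-DINO, which require pretrained networks.

```python
from typing import Callable, Sequence

Pixels = Sequence[float]  # flattened image, for illustration only


def mse(pred: Pixels, target: Pixels) -> float:
    """Pixel-wise x-prediction loss: mean squared error against the clean image."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def local_texture_standin(pred: Pixels, target: Pixels) -> float:
    """Toy stand-in for a local-texture loss (NOT LPIPS):
    mean absolute difference between adjacent-pixel gradients."""
    grad_p = [b - a for a, b in zip(pred[:-1], pred[1:])]
    grad_t = [b - a for a, b in zip(target[:-1], target[1:])]
    return sum(abs(gp - gt) for gp, gt in zip(grad_p, grad_t)) / len(grad_p)


def global_standin(pred: Pixels, target: Pixels) -> float:
    """Toy stand-in for a global-semantics loss (NOT P-DINO):
    absolute difference between mean intensities."""
    return abs(sum(pred) / len(pred) - sum(target) / len(target))


def noise_gated_loss(
    x_pred: Pixels,
    x_clean: Pixels,
    noise_level: float,
    perceptual_terms: Sequence[Callable[[Pixels, Pixels], float]],
    weights: Sequence[float],
    gate_threshold: float = 0.5,  # hypothetical value, not from the paper
) -> float:
    """Always apply the pixel loss; add perceptual terms only at lower noise."""
    loss = mse(x_pred, x_clean)
    if noise_level < gate_threshold:  # gate: skip perceptual terms at high noise
        for term, weight in zip(perceptual_terms, weights):
            loss += weight * term(x_pred, x_clean)
    return loss


# Usage: the same prediction scored at a high and a low noise level.
x_pred = [0.0, 0.5, 1.0, 0.5]
x_clean = [0.0, 1.0, 1.0, 0.0]
terms = [local_texture_standin, global_standin]
weights = [0.1, 0.01]  # hypothetical weights

loss_high_noise = noise_gated_loss(x_pred, x_clean, 0.9, terms, weights)
loss_low_noise = noise_gated_loss(x_pred, x_clean, 0.1, terms, weights)
```

In this sketch the high-noise call reduces to the plain pixel loss, while the low-noise call adds the weighted perceptual terms. A hard threshold is only one way to gate; the abstract states just that the perceptual losses apply at lower-noise timesteps.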
