TL;DR
DeCo is a frequency-decoupled pixel diffusion framework that generates high- and low-frequency components separately, making training more efficient and improving image quality over existing pixel diffusion models.
Contribution
The paper proposes a frequency-decoupled pixel diffusion method that pairs a diffusion transformer (DiT) with a lightweight pixel decoder and a frequency-aware flow-matching loss, improving efficiency and performance over existing pixel diffusion models.
Findings
Achieves state-of-the-art FID scores of 1.62 and 2.22 on ImageNet at 256x256 and 512x512 resolution, respectively.
The pretrained text-to-image model scores 0.86 on GenEval, outperforming previous models.
Demonstrates that decoupling frequency components enhances pixel diffusion efficiency and quality.
Abstract
Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This thus frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient…
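To make the idea of separating frequency bands concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not DeCo's architecture or loss: the abstract above is truncated and does not specify either. The helper names (`split_bands`, `weighted_spectral_error`), the 1-D DCT, the `cutoff` value, and the `1 / (1 + k)` weights are all hypothetical choices made for this example.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def split_bands(x, cutoff):
    """Split x into a low-frequency part (DCT indices < cutoff) and the
    high-frequency residual. The two parts sum back to x."""
    coeffs = dct(x)
    low = idct([c if k < cutoff else 0.0 for k, c in enumerate(coeffs)])
    high = [xi - li for xi, li in zip(x, low)]
    return low, high


def weighted_spectral_error(pred, target, weights):
    """Mean squared error between DCT coefficients, weighted per frequency.
    With all weights equal to 1 this equals the ordinary MSE."""
    diff = dct([p - t for p, t in zip(pred, target)])
    return sum(w * d * d for w, d in zip(weights, diff)) / len(diff)


signal = [0.0, 1.0, 0.5, 2.0, 1.5, 3.0, 2.5, 4.0]
low, high = split_bands(signal, cutoff=2)  # cutoff is a hypothetical choice

# Hypothetical weights that emphasize low frequencies over high ones.
weights = [1.0 / (1 + k) for k in range(len(signal))]
```

In this toy setting, `low` carries the coarse trend of the signal and `high` carries the sample-to-sample detail. The weighted error penalizes a constant offset more heavily than an alternating perturbation of the same energy, which is one simple way a loss can be made frequency-aware.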
