TL;DR
This paper investigates what makes a latent space suitable for diffusion models and proposes a new autoencoder that explicitly shapes the latent manifold to speed up diffusion training and improve generation quality.
Contribution
It introduces the Prior-Aligned AutoEncoder (PAE), which explicitly organizes the latent manifold using priors and regularization, outperforming existing tokenizers.
Findings
PAE achieves a state-of-the-art gFID of 1.03 on ImageNet 256×256.
PAE converges up to 13x faster than RAE under the same setup.
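Three latent-manifold properties (coherent spatial structure, local manifold continuity, and global manifold semantics) track downstream generation quality more closely than reconstruction fidelity does.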
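For readers unfamiliar with the metric: gFID is read here as the Fréchet Inception Distance computed on generated samples, which is an assumption, since this summary does not define it. Below is a minimal NumPy sketch of the Fréchet distance between two feature sets. The function name `frechet_distance` and the random stand-in features are illustrative only; an actual evaluation would use Inception features of real and generated images, and this is not the paper's evaluation code.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(np.sum((mu_a - mu_b) ** 2))
    return mean_term + float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Stand-in features (hypothetical): the second set is the first shifted by 2 per dimension.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(512, 4))
fake_feats = real_feats + 2.0

same = frechet_distance(real_feats, real_feats)
shifted = frechet_distance(real_feats, fake_feats)
```

Lower values mean the two feature distributions are closer. In this toy case the covariances match, so the distance reduces to the squared mean shift.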
Abstract
Tokenizers are a crucial component of latent diffusion models, as they define the latent space in which diffusion models operate. However, existing tokenizers are primarily designed to improve reconstruction fidelity or inherit pretrained representations, leaving unclear what kind of latent space is truly friendly for generative modeling. In this paper, we study this question from the perspective of latent manifold organization. By constructing controlled tokenizer variants, we identify three key properties of a diffusion-friendly latent manifold: coherent spatial structure, local manifold continuity, and global manifold semantics. We find that these properties are more consistent with downstream generation quality than reconstruction fidelity. Motivated by this finding, we propose the Prior-Aligned AutoEncoder (PAE), which explicitly shapes the latent manifold instead of leaving…
