Boosting Latent Diffusion Models via Disentangled Representation Alignment
John Page, Xuesong Niu, Kai Wu, Kun Gai

TL;DR
This paper introduces Send-VAE, a semantic-disentangled VAE that improves latent space structure for high-quality image generation by aligning local features with dense semantics. It reports an FID of 1.21 on ImageNet 256x256.
Contribution
Send-VAE employs a non-linear mapping to enhance semantic disentanglement in VAEs, addressing limitations of shallow alignment strategies for latent diffusion models.
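As a rough illustration of what aligning VAE latents to foundation-model features through a non-linear mapping could look like, here is a minimal dependency-free sketch. Everything in it is an assumption made for illustration: the two-layer MLP mapper, the GELU activation, the per-patch cosine-similarity loss, and all names (`mlp_mapper`, `alignment_loss`) and dimensions are hypothetical and are not taken from the paper, whose actual mapper and objective are not described in this summary.

```python
import math
import random


def linear(x, weights, bias):
    """Apply y = W x + b to a single feature vector stored as a list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def gelu(v):
    """Tanh approximation of the GELU activation."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def init_layer(rng, out_dim, in_dim):
    """Random weights and zero bias for one layer (hypothetical initialisation)."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def mlp_mapper(x, params):
    """Non-linear mapper: linear -> GELU -> linear (an assumed architecture)."""
    (w1, b1), (w2, b2) = params
    hidden = [gelu(v) for v in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def cosine_similarity(a, b, eps=1e-8):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (max(norm_a, eps) * max(norm_b, eps))


def alignment_loss(latent_patches, target_patches, params):
    """One minus the mean cosine similarity between mapped latents and targets."""
    sims = [
        cosine_similarity(mlp_mapper(z, params), t)
        for z, t in zip(latent_patches, target_patches)
    ]
    return 1.0 - sum(sims) / len(sims)


# Toy data standing in for VAE latent patches and foundation-model patch features.
rng = random.Random(0)
latent_dim, hidden_dim, target_dim, num_patches = 4, 16, 8, 6
params = (init_layer(rng, hidden_dim, latent_dim), init_layer(rng, target_dim, hidden_dim))
latents = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(num_patches)]
targets = [[rng.gauss(0, 1) for _ in range(target_dim)] for _ in range(num_patches)]
loss = alignment_loss(latents, targets, params)
```

A shallow alignment in this picture would replace `mlp_mapper` with a single `linear` call. The sketch only shows where the extra non-linearity sits, and makes no claim about how much it matters in practice.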
Findings
Achieves state-of-the-art FID of 1.21 on ImageNet 256x256.
Demonstrates improved attribute separability in VAE latent spaces.
Establishes a new evaluation paradigm for VAE latent representations.
Abstract
Latent Diffusion Models (LDMs) rely heavily on the compressed latent space provided by Variational Autoencoders (VAEs) for high-quality image generation. Recent studies have attempted to obtain generation-friendly VAEs by directly adopting alignment strategies from LDM training, leveraging Vision Foundation Models (VFMs) as representation alignment targets. However, such alignment paradigms overlook the fundamental differences in representational requirements between LDMs and VAEs. Simple feature mapping from local patches to high-dimensional semantics can induce semantic collapse, leading to the loss of fine-grained attributes. In this paper, we reveal a key insight: unlike LDMs that benefit from high-level global semantics, a generation-friendly VAE must possess strong semantic disentanglement capabilities to preserve fine-grained, attribute-level information in a structured manner.…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Domain Adaptation and Few-Shot Learning
