Improving the Diffusability of Autoencoders
Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin

TL;DR
This paper identifies high-frequency components in autoencoder latent spaces that hinder diffusion quality and proposes a simple regularization method to improve image and video generation performance.
Contribution
It introduces scale equivariance regularization to autoencoders, significantly enhancing diffusion-based image and video synthesis quality with minimal fine-tuning.
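As a rough illustration of what a scale equivariance penalty could look like, the sketch below assumes one plausible reading: decoding a downsampled latent should reproduce the correspondingly downsampled input. The function names `avg_pool2` and `scale_equivariance_loss`, the 2x average-pooling downsampler, and the mean-squared-error penalty are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def avg_pool2(x):
    """Downsample a (C, H, W) array by 2x using 2x2 average pooling."""
    c, h, w = x.shape
    h2, w2 = h // 2, w // 2
    # Crop to even size, then average each non-overlapping 2x2 block.
    return x[:, :h2 * 2, :w2 * 2].reshape(c, h2, 2, w2, 2).mean(axis=(2, 4))

def scale_equivariance_loss(image, encode, decode):
    """Hypothetical penalty: decode(downsample(latent)) vs. downsample(image).

    `encode` and `decode` are arbitrary callables on (C, H, W) arrays;
    this formulation is an assumption, not the authors' exact loss.
    """
    latent = encode(image)
    recon_small = decode(avg_pool2(latent))
    target_small = avg_pool2(image)
    return float(np.mean((recon_small - target_small) ** 2))
```

Under this reading, an autoencoder that reconstructs perfectly at full resolution can still incur a large penalty if its latents carry information in fine-scale patterns that vanish under downsampling.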
Findings
Reduces FID by 19% on ImageNet-1K 256^2 images.
Decreases FVD by at least 44% on Kinetics-700 videos.
Identifies excessive high-frequency components in autoencoder latent spaces, most pronounced in autoencoders with a large bottleneck channel size.
Abstract
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in the autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization…
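The kind of spectral analysis the abstract mentions can be sketched in a few lines. The example below is a minimal illustration, not the authors' procedure: it measures the share of a latent's spectral energy above a radial frequency cutoff using a 2D FFT. The function name `high_freq_energy_fraction` and the cutoff of 0.25 cycles per sample are assumptions chosen for illustration.

```python
import numpy as np

def high_freq_energy_fraction(latent, cutoff=0.25):
    """Fraction of spectral energy above `cutoff` in a (C, H, W) latent.

    Frequencies are in cycles per sample (Nyquist is 0.5 along each axis).
    The cutoff value is an illustrative assumption.
    """
    _, h, w = latent.shape
    # Remove the per-channel mean so the DC term does not dominate.
    centered = latent - latent.mean(axis=(1, 2), keepdims=True)
    power = np.abs(np.fft.fft2(centered, axes=(1, 2))) ** 2
    # Radial frequency of every FFT bin.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[:, radius > cutoff].sum() / total)
```

A smooth latent should score near zero, while a noise-like latent scores high, so the number gives a rough way to compare how much high-frequency content different autoencoders place in their latents.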
Taxonomy
Topics: Statistical and Computational Modeling
Methods: Diffusion
