Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han

TL;DR
This paper introduces Deep Compression Autoencoder (DC-AE), a novel autoencoder architecture that significantly accelerates high-resolution diffusion models by achieving higher compression ratios while maintaining quality, leading to faster inference and training.
Contribution
The paper proposes Residual Autoencoding and Decoupled High-Resolution Adaptation, a decoupled multi-phase training strategy, so that autoencoders for diffusion models keep their reconstruction accuracy at high spatial compression ratios, enabling faster high-resolution image generation.
Findings
Achieves a spatial compression ratio of up to 128x while maintaining reconstruction quality.
Provides 19.1x inference speedup on ImageNet 512x512.
Delivers 17.9x training speedup with improved FID scores.
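The link between spatial compression ratio and compute can be made concrete by counting latent tokens. The sketch below assumes a transformer-style diffusion model that splits a square latent into non-overlapping patches; the function name `latent_tokens` and its parameters are hypothetical and are not taken from the paper or its code.

```python
def latent_tokens(image_size: int, spatial_ratio: int, patch_size: int = 1) -> int:
    """Tokens seen by the diffusion model for a square image.

    Assumption: the autoencoder shrinks each side by `spatial_ratio`, and the
    diffusion model then groups the latent into patch_size x patch_size patches.
    """
    stride = spatial_ratio * patch_size
    if image_size % stride != 0:
        raise ValueError("image_size must be divisible by spatial_ratio * patch_size")
    side = image_size // stride
    return side * side


if __name__ == "__main__":
    # Illustrative settings for a 512x512 image.
    for label, f, p in [
        ("f8,  patch 2", 8, 2),
        ("f16, patch 2", 16, 2),
        ("f32, patch 1", 32, 1),
        ("f64, patch 1", 64, 1),
    ]:
        print(label, latent_tokens(512, f, p))
```

Under these assumptions, f16 with patch size 2 and f32 with patch size 1 yield the same token count, which is why comparisons at a matched number of tokens are meaningful. Fewer tokens reduce the cost of every denoising step, although the exact speedup depends on the model and hardware.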
Abstract
We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phases training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while…
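A minimal NumPy sketch of the idea behind Residual Autoencoding follows: a downsampling block adds its learned output to a non-parametric shortcut built from the space-to-channel transformed input, so the learned branch only has to model a residual. The channel-averaging step used to match channel counts, the function names, and the zero-valued stand-in for the learned branch are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


def space_to_channel(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Rearrange (C, H, W) into (C*r*r, H/r, W/r) without discarding values."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "H and W must be divisible by r"
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)  # -> (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)


def downsample_shortcut(x: np.ndarray, out_channels: int, r: int = 2) -> np.ndarray:
    """Parameter-free shortcut: space-to-channel, then average channel groups.

    Assumption: channels are reduced to `out_channels` by averaging
    consecutive groups of equal size.
    """
    y = space_to_channel(x, r)
    assert y.shape[0] % out_channels == 0, "channels must split into equal groups"
    group = y.shape[0] // out_channels
    return y.reshape(out_channels, group, *y.shape[1:]).mean(axis=1)


def residual_downsample_block(x, learned_branch, out_channels: int, r: int = 2):
    """Output = learned branch + shortcut, so the branch learns a residual."""
    return learned_branch(x) + downsample_shortcut(x, out_channels, r)


if __name__ == "__main__":
    x = np.arange(4 * 8 * 8, dtype=np.float64).reshape(4, 8, 8)
    # Hypothetical stand-in for a learned conv branch that outputs zeros.
    zero_branch = lambda t: np.zeros((8, 4, 4))
    out = residual_downsample_block(x, zero_branch, out_channels=8)
    print(out.shape)
```

With the learned branch contributing nothing, the block still passes a downsampled rearrangement of its input forward, which illustrates how such a shortcut might ease optimization at high compression ratios.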
Peer Reviews
Decision: ICLR 2025 Poster
* The paper is well written and easy to follow.
* The method significantly reduces the computational requirements of training large-scale diffusion models at high resolution, making it more accessible as a research topic.
* Evaluations of multiple metrics across 4 different datasets are reported.
* Evaluations on downstream diffusion training are also provided.
* Qualitative examples clearly showcase the improvements brought about by the proposed model.
* What parameters were used for sampling the diffusion models (sampler, number of sampling steps, guidance scale)? A more thorough investigation of the impact on sampling quality would be useful to get a better grasp of the limitations of this method.
* Missing ablations on the constituent parts of DC-AE.
* In Tables 3 and 4, missing comparisons with more recent autoencoders such as the ones from SD-XL, SD3, and Asymmetric Autoencoder.
* In Table 5, the PixArt model trained with DC-AE achieves a l
The primary strength of this paper is the development of an effective and relatively simple solution to a real-world deficiency that impacts all research and products using autoencoders. Specifically, the use of residual autoencoding and DHRA allows either better visual quality (e.g., in Table 3, compare rows with the same number of tokens, such as SD-VAE-f16 with patch-size=2 or SD-VAE-f32 with patch-size=1, to DC-AE-f32) or similar visual quality with lower latency and higher throughput (e.g., Table
A more detailed summary of the DC-AE architecture would help. Fig. 4 provides a high-level overview, but I'm not sure how literal or complete it is, and the details of the "Encoder/Decoder Stages" are not provided. To be fair, the authors do provide code, but architecture and training details should be in the paper (probably in the appendix). The conjecture in Section 3.3 that DC-AE helps because otherwise the "diffusion model needs to simultaneously learn denoising and token compression when u
1. The paper is well-written and easy to follow.
2. The baselines using stable diffusion are well-known and recognized by the community.
3. The qualitative examples seem compelling.
4. The advantages of using less compute could be of significant impact.
1. The authors mostly use stable diffusion for all experiments, although several datasets are considered.
2. A more detailed quantitative analysis of why standard staged training has difficulties with gradient propagation is not presented.
3. The proposed modifications are somewhat incremental - this seems close to simply removing some skip connection convolutions + a reshape operation in the standard stable diffusion autoencoder. The extra fine-tuning procedure is more detailed, but is only nec
Code & Models
- 🤗 mit-han-lab/dc-ae-f32c32-mix-1.0 · model · 16 dl · ♡ 2
- 🤗 mit-han-lab/dc-ae-f64c128-mix-1.0 · model · 2 dl · ♡ 2
- 🤗 mit-han-lab/dc-ae-f128c512-mix-1.0 · model · 109 dl · ♡ 5
- 🤗 mit-han-lab/dc-ae-f32c32-in-1.0 · model · 693 dl · ♡ 9
- 🤗 mit-han-lab/dc-ae-f64c128-in-1.0 · model · 361 dl · ♡ 8
- 🤗 mit-han-lab/dc-ae-f128c512-in-1.0 · model · 11 dl · ♡ 2
- 🤗 mit-han-lab/dc-ae-f32c32-in-1.0-dit-xl-in-512px · model · 5 dl · ♡ 9
- 🤗 mit-han-lab/dc-ae-f32c32-in-1.0-uvit-s-in-512px · model · 2 dl · ♡ 1
- 🤗 mit-han-lab/dc-ae-f32c32-in-1.0-uvit-h-in-512px · model · 2 dl · ♡ 2
- 🤗 mit-han-lab/dc-ae-f64c128-in-1.0-uvit-h-in-512px · model · 28 dl · ♡ 3
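The model names above appear to encode the autoencoder configuration. The sketch below assumes that `fN` denotes the spatial compression ratio and `cM` the number of latent channels; that reading is an assumption based on the names alone, and the helper functions are hypothetical rather than part of any released library.

```python
import re
from typing import Tuple


def parse_dc_ae_name(name: str) -> Tuple[int, int]:
    """Extract (spatial_ratio, latent_channels) from a name like 'dc-ae-f32c32-...'.

    Assumption: 'f' is the spatial compression ratio, 'c' the latent channels.
    """
    match = re.search(r"dc-ae-f(\d+)c(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def latent_shape(name: str, image_size: int = 512) -> Tuple[int, int, int]:
    """Latent shape (channels, height, width) for a square input image."""
    ratio, channels = parse_dc_ae_name(name)
    if image_size % ratio != 0:
        raise ValueError("image_size must be divisible by the spatial ratio")
    side = image_size // ratio
    return (channels, side, side)


if __name__ == "__main__":
    for repo in [
        "mit-han-lab/dc-ae-f32c32-in-1.0",
        "mit-han-lab/dc-ae-f64c128-in-1.0",
        "mit-han-lab/dc-ae-f128c512-in-1.0",
    ]:
        print(repo, latent_shape(repo))
```

If the naming assumption holds, each step up in spatial ratio is paired with more latent channels, so the three configurations would store the same total number of latent values for a given image while presenting far fewer spatial positions to the diffusion model.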
Videos
Introduction to Deep Compression Autoencoder· youtube
Image Reconstruction Demo of Deep Compression Autoencoder· youtube
Efficient Diffusion Models with Deep Compression Autoencoder· youtube
Taxonomy
Topics: Medical Imaging Techniques and Applications · Image and Signal Denoising Methods · Advanced Data Compression Techniques
Methods: Diffusion
