Masked Autoencoders Are Effective Tokenizers for Diffusion Models

Hao Chen; Yujin Han; Fangyi Chen; Xiang Li; Yidong Wang; Jindong Wang; Ze Wang; Zicheng Liu; Difan Zou; Bhiksha Raj

arXiv:2502.03444·cs.CV·June 2, 2025

TL;DR

This paper introduces MAETok, a mask-based autoencoder that learns a discriminative, semantically rich latent space for diffusion models, significantly improving image generation quality and efficiency without relying on variational methods.

Contribution

The paper proposes MAETok, a novel autoencoder leveraging mask modeling to learn effective latent representations for diffusion models, outperforming variational autoencoders in quality and speed.

Findings

01. MAETok achieves a gFID of 1.69 on ImageNet.

02. Training is 76x faster and inference is 31x more efficient.

03. A discriminative latent space matters more than variational constraints.

Abstract

Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the tokenizer's latent space that enable better learning and generation in diffusion models remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to latent distributions with better structure, such as those with fewer Gaussian Mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) leveraging mask modeling to learn a semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and a discriminative latent space from an AE alone enables state-of-the-art performance on ImageNet generation using only 128…
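To make the "mask modeling" idea in the abstract concrete, here is a minimal sketch of the random patch masking used in MAE-style training. It is not the MAETok implementation: the function names, the 50% mask ratio, and the array shapes are all assumptions chosen for illustration. Restricting the loss to masked patches follows the original MAE recipe, and this page does not say whether MAETok does the same.

```python
import numpy as np

def random_mask_patches(patches, mask_ratio, rng):
    """Split patch tokens into a visible subset and a boolean mask.

    patches: array of shape (num_patches, dim).
    Returns (visible, keep_idx, mask); mask[i] is True for hidden patches.
    Hypothetical helper, not taken from the paper's code.
    """
    num_patches = patches.shape[0]
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    # Shuffle patch indices and keep the first num_keep of them.
    perm = rng.permutation(num_patches)
    keep_idx = np.sort(perm[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask

def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only (MAE-style convention)."""
    diff = (pred - target) ** 2
    return float(diff[mask].mean())

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))  # 16 toy patch tokens, 8 dims each
visible, keep_idx, mask = random_mask_patches(patches, mask_ratio=0.5, rng=rng)
```

In a full pipeline, an encoder would see only `visible`, and a decoder would be trained to predict the hidden patches. Both networks are omitted here.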

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis

Methods: Diffusion · Autoencoders