MeanFlow Transformers with Representation Autoencoders
Zheyuan Hu, Chieh-Hsin Lai, Ge Wu, Yuki Mitsufuji, Stefano Ermon

TL;DR
This paper introduces an efficient latent MeanFlow model that trains and samples in the latent space of a Representation Autoencoder (RAE), improving generation quality and reducing computational cost for high-dimensional data.
Contribution
It develops a training and sampling scheme for MeanFlow in the RAE latent space that removes the need for guidance and improves efficiency and training stability.
Findings
Achieves 1-step FID of 2.03 on ImageNet 256
Reduces training cost by 83% and GFLOPS by 38%
Outperforms vanilla MeanFlow in quality and efficiency
Abstract
MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE) for high-dimensional data modeling. However, MF training remains computationally demanding and is often unstable. During inference, the SD-VAE decoder dominates the generation cost, and MF depends on complex guidance hyperparameters for class-conditional generation. In this work, we develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), where a pre-trained vision encoder (e.g., DINO) provides semantically rich latents paired with a lightweight decoder. We observe that naive MF training in the RAE latent space suffers from severe gradient explosion. To…
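The "long jump from noise to data" described in the abstract can be illustrated with a minimal sketch of one-step sampling. It assumes the time convention of the original MeanFlow formulation (t = 1 is noise, t = 0 is data) and the displacement rule z_r = z_t - (t - r) * u(z_t, r, t); the abstract itself does not state either. The names `one_step_sample`, `toy_u`, and `toy_decode` are hypothetical stand-ins for illustration, not the authors' network or the RAE decoder.

```python
import numpy as np


def one_step_sample(u_fn, decode_fn, latent_shape, rng):
    """One-step sampling sketch: a single jump from t=1 (noise) to r=0 (data).

    u_fn(z, r, t) is assumed to return the average velocity over [r, t];
    decode_fn maps a latent back to data space. Both are placeholders.
    """
    z1 = rng.standard_normal(latent_shape)  # noise latent at t = 1
    r, t = 0.0, 1.0
    # Displacement over the interval = interval length * average velocity.
    z0 = z1 - (t - r) * u_fn(z1, r, t)
    return decode_fn(z0)


def toy_u(z, r, t):
    # Toy average velocity for a dataset collapsed to the origin:
    # on the straight path z_t = t * eps, the velocity is eps = z_t / t.
    return z / t


def toy_decode(z):
    # Identity decoder; a real decoder would map latents to pixels.
    return z


rng = np.random.default_rng(0)
sample = one_step_sample(toy_u, toy_decode, (2, 4), rng)
print(sample.shape)  # (2, 4)
```

With these toy stand-ins the single jump lands exactly on the origin, which is the only "data point" in the toy setup. In a real pipeline a trained network would replace `toy_u` and a decoder would replace `toy_decode`, so each sample would cost one network evaluation plus one decode.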
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis
