MeanFlow Transformers with Representation Autoencoders

Zheyuan Hu; Chieh-Hsin Lai; Ge Wu; Yuki Mitsufuji; Stefano Ermon

arXiv:2511.13019 · cs.CV · November 18, 2025

TL;DR

This paper introduces an efficient latent MeanFlow model trained in the latent space of a Representation Autoencoder (RAE). It reaches a 1-step FID of 2.03 on ImageNet 256 while cutting training cost by 83% and GFLOPS by 38%.

Contribution

It develops a training and sampling scheme for MeanFlow in the RAE latent space that removes the need for guidance and improves training efficiency and stability.

Findings

01 Achieves a 1-step FID of 2.03 on ImageNet 256

02 Reduces training cost by 83% and GFLOPS by 38%

03 Outperforms vanilla MeanFlow in quality and efficiency

Abstract

MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE) for high-dimensional data modeling. However, MF training remains computationally demanding and is often unstable. During inference, the SD-VAE decoder dominates the generation cost, and MF depends on complex guidance hyperparameters for class-conditional generation. In this work, we develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), where a pre-trained vision encoder (e.g., DINO) provides semantically rich latents paired with a lightweight decoder. We observe that naive MF training in the RAE latent space suffers from severe gradient explosion. To…
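The "long jump" idea in the abstract can be illustrated with a minimal sketch. It assumes the usual MeanFlow convention, which the abstract does not spell out: t = 1 is noise, t = 0 is data, and the network predicts an average velocity u(z_t, r, t) so that z_r = z_t - (t - r) * u(z_t, r, t). The function names below are hypothetical, and the closed-form `toy_average_velocity` stands in for a trained network on a toy dataset with a single latent point. It is not the paper's model or code.

```python
import random

def toy_average_velocity(z_t, r, t, target):
    # Stand-in for a trained MeanFlow network (assumption, not the paper's model).
    # For a dataset containing only `target` and the linear path
    # z_t = (1 - t) * target + t * noise, the average velocity over [r, t]
    # is exactly (z_t - target) / t, independent of r.
    return [(z - x) / t for z, x in zip(z_t, target)]

def mean_flow_jump(u, z_t, r, t):
    # One MeanFlow jump from time t back to time r:
    # z_r = z_t - (t - r) * u(z_t, r, t)
    vel = u(z_t, r, t)
    return [z - (t - r) * v for z, v in zip(z_t, vel)]

def sample_one_step(u, dim, seed=0):
    # Draw Gaussian noise at t = 1 and jump straight to t = 0 in a single step.
    rng = random.Random(seed)
    z_1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return mean_flow_jump(u, z_1, 0.0, 1.0)

# Toy usage: the "latent" data distribution is a single 3-dimensional point.
target = [1.0, -2.0, 0.5]
u = lambda z_t, r, t: toy_average_velocity(z_t, r, t, target)
z_0 = sample_one_step(u, dim=3)
```

In a latent setup like the one the abstract describes, the resulting `z_0` would then be passed through the autoencoder's decoder to produce an image; that step is omitted here because the decoder interface is not given in the text.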

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis