TL;DR
Decoupled MeanFlow introduces a decoding strategy that transforms pretrained flow models into flow map models, enabling high-quality image generation with significantly fewer steps and faster inference without architectural changes.
Contribution
It presents a simple, effective method to convert pretrained flow models into flow maps, enhancing sampling speed and efficiency in generative modeling.
Findings
Achieves 1-step FID of 2.16 on ImageNet 256x256
Attains 4-step FID of 1.68, close to flow models' performance
Over 100x faster inference compared to traditional flow models
Abstract
Denoising generative models, such as diffusion and flow-based models, produce high-quality samples but require many denoising steps due to discretization error. Flow maps, which estimate the average velocity between timesteps, mitigate this error and enable faster sampling. However, their training typically demands architectural changes that limit compatibility with pretrained flow models. We introduce Decoupled MeanFlow, a simple decoding strategy that converts flow models into flow map models without architectural modifications. Our method conditions the final blocks of diffusion transformers on the subsequent timestep, allowing pretrained flow models to be directly repurposed as flow maps. Combined with enhanced training techniques, this design enables high-quality generation in as few as 1 to 4 steps. Notably, we find that training flow models and subsequently converting them is…
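The decoupled conditioning described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_block`, `ToyDecoupledNet`, and `sample` are hypothetical names, the "blocks" are scalar stand-ins for DiT blocks, and the sign convention (t = 1 is noise, t = 0 is data, with update z_r = z_t - (t - r) * u) is an assumption borrowed from the usual MeanFlow setup. It only shows the routing idea: early blocks see the current timestep t, the final blocks see the subsequent timestep r, and setting r = t recovers ordinary flow-model conditioning.

```python
import math
import random


def toy_block(h, cond, w):
    # Hypothetical stand-in for one DiT block: a nonlinearity plus a
    # timestep-dependent shift (real blocks would use attention and AdaLN).
    return [math.tanh(w * x) + cond for x in h]


class ToyDecoupledNet:
    """Toy network whose last `decoder_depth` blocks are conditioned on r."""

    def __init__(self, depth=6, decoder_depth=2, seed=0):
        rng = random.Random(seed)
        self.weights = [rng.uniform(0.5, 1.5) for _ in range(depth)]
        self.split = depth - decoder_depth  # index of the first "decoder" block

    def __call__(self, z, t, r=None):
        # With r omitted, every block sees t, i.e. plain flow-model conditioning.
        r = t if r is None else r
        h = list(z)
        for i, w in enumerate(self.weights):
            cond = t if i < self.split else r  # encoder: t, decoder: r
            h = toy_block(h, cond, w)
        return h  # interpreted as the average velocity from t to r


def sample(net, z, timesteps):
    # Few-step flow-map sampling over a decreasing schedule, e.g. [1.0, 0.0]
    # for one step. Assumed update rule: z_r = z_t - (t - r) * u(z_t, r, t).
    for t, r in zip(timesteps[:-1], timesteps[1:]):
        u = net(z, t, r)
        z = [zi - (t - r) * ui for zi, ui in zip(z, u)]
    return z


net = ToyDecoupledNet()
noise = [0.3, -1.2, 0.7]
one_step = sample(net, noise, [1.0, 0.0])
four_step = sample(net, noise, [1.0, 0.75, 0.5, 0.25, 0.0])
```

Because the same weights serve both modes, a pretrained flow model fits this interface unchanged; only the timestep fed to the final blocks differs.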
Peer Reviews
Decision: ICLR 2026 Poster
- The central contribution of this work is the simple time conditioning modification. This leads to notable performance gains as shown in the ablation studies in Tables 1 and 2, validating the effectiveness of the decoupled design.
- Overall, a 1-step FID of 2.16 on ImageNet 256x256 is an impressive state-of-the-art result.
- This work is motivated by the encoder-decoder design. Yet the experiments do not provide direct evidence that the encoder-decoder design is the key to high quality. For example, there are many other ways to condition the network without modifying its architecture, like interleaving $t$ and $r$ for the DiT blocks. Discussing alternative design choices in the ablation could strengthen the argument.
- The authors claim that an existing SiT can be converted into a flow map without finetuning, yet
- The proposed DMF model consistently outperforms the MeanFlow baseline across all evaluated datasets and variants. Notably, it achieves high-quality 1-step ImageNet generation, which highlights its efficiency and strong generative capacity.
- The model can be trained from scratch, yet it also seamlessly integrates with existing pretrained models without requiring any architectural modifications while yielding improved results.
- The analysis of the encoder–decoder decomposition is interesting.
- While the separation of encoder and decoder components is appealing, it also seems natural to consider joint conditioning mechanisms that integrate information from both the current and target timesteps in some blocks, potentially via lightweight modifications such as joint AdaLN conditioning or LoRA adapters. Have the authors explored such hybrid alternatives?
- Given that MeanFlow already incorporates both timestep conditionings, one might expect the model to implicitly learn to attenuate or
1. **Training-free flow map transformation** The proposed decoupled architecture allows pretrained flow models to be directly repurposed as flow maps without additional fine-tuning, which is both conceptually elegant and practically impactful. This demonstrates a viable paradigm for transformer-based flow map models that leverages existing large-scale flow model checkpoints, reducing training cost and broadening applicability.
2. **Broader applicability of the fine-tuning paradigm** The propo
1. **Stability of the JVP term** The proposed method does not directly address the well-known stability issue of the JVP term. This instability has been repeatedly identified as the primary bottleneck in scaling consistency-based methods to large-scale applications such as text-to-image or text-to-video generation (Lu & Song, 2024; Chen et al., 2025; Zheng et al., 2025). Therefore, while the techniques presented in the paper for improving MeanFlow training remain valuable, the overall scope of
