RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation
Yuhao Huang, Shih-Hsin Wang, Andrea L. Bertozzi, Bao Wang

TL;DR
RMFlow adds a noise-injection refinement step to one-step MeanFlow generation, achieving near state-of-the-art results on text-to-image, context-to-molecule, and time-series generation with a single function evaluation (1-NFE).
Contribution
The paper proposes RMFlow, a multimodal generative model that refines the output of a coarse 1-NFE MeanFlow transport with a tailored noise-injection step, improving quality at a computational cost comparable to baseline MeanFlows.
Findings
Achieves near state-of-the-art results on multiple tasks
Maintains computational cost comparable to baseline MeanFlows
Effectively combines flow transport with noise-injection refinement
Abstract
Mean flow (MeanFlow) enables efficient, high-fidelity image generation, yet its single-function evaluation (1-NFE) generation often cannot yield compelling results. We address this issue by introducing RMFlow, an efficient multimodal generative model that integrates a coarse 1-NFE MeanFlow transport with a subsequent tailored noise-injection refinement step. RMFlow approximates the average velocity of the flow path using a neural network trained with a new loss function that balances minimizing the Wasserstein distance between probability paths and maximizing sample likelihood. RMFlow achieves near state-of-the-art results on text-to-image, context-to-molecule, and time-series generation using only 1-NFE, at a computational cost comparable to the baseline MeanFlows.
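The two-stage idea in the abstract can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation. It assumes the usual MeanFlow convention (noise at t = 1, data at t = 0, one-step update z_r = z_t - (t - r) u(z_t, r, t)) and models the refinement as plain additive Gaussian noise with a hypothetical scale `sigma`. The names `TARGET`, `toy_average_velocity`, and `one_step_with_noise` are made up for this example, and the closed-form velocity stands in for a trained network.

```python
import numpy as np

# Hypothetical toy target: all data sits at a single point.
TARGET = np.array([2.0, -1.0])


def toy_average_velocity(z, r, t):
    """Stand-in for a trained network u_theta(z, r, t) (assumption).

    For straight paths z_t = (1 - t) * x + t * eps with x fixed at TARGET,
    the average velocity over [r, t] is (z_t - TARGET) / t.
    """
    return (z - TARGET) / t


def one_step_with_noise(u, noise, sigma, rng):
    """Coarse 1-NFE transport followed by an assumed additive-noise refinement."""
    r, t = 0.0, 1.0
    # Single function evaluation: jump from noise (t = 1) to data (r = 0).
    coarse = noise - (t - r) * u(noise, r, t)
    # Refinement modeled here as isotropic Gaussian noise of scale sigma.
    refined = coarse + sigma * rng.standard_normal(coarse.shape)
    return coarse, refined


rng = np.random.default_rng(0)
noise = rng.standard_normal((1000, 2))
coarse, refined = one_step_with_noise(toy_average_velocity, noise, sigma=0.1, rng=rng)
```

In this toy setting the coarse samples all collapse onto `TARGET`, while the refined samples spread around it with standard deviation close to `sigma`. The network is still evaluated only once per sample.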
Peer Reviews
Decision·ICLR 2026 Poster
- The paper is clearly written and presents a coherent idea within the flow-matching framework.
- The proposed formulation is lightweight and the integration of noise injection is computationally efficient.
- The experiments span diverse domains (synthetic, molecule, time-series, and COCO text-to-image), which demonstrates the model's general applicability.

- Performance on each downstream task: although the paper claims near-SOTA performance, the actual COCO FID-30K (18.9) is still substantially higher than that of recent single-step diffusion models.
- Limitations and failure cases are not discussed in the manuscript.

1) The paper is well written, the motivations are clear and exemplified through failure cases of previous models, and the contribution is well theorized and explained. Experiments show that the proposed approach is able to solve the failure cases and issues of previous approaches.
2) The core idea of RMFlow is both simple and theoretically sound. Though conceptually a two-stage process (coarse transport + refinement), it is implemented in a single, efficient 1-NFE step. The key contribution is the

1) This is the most significant weakness of the paper. For the text-to-image task (Table 6), the sole reported metric is Fréchet Inception Distance (FID). As is now widely discussed in the community, FID is a flawed metric with several known issues:
   - It relies on an outdated InceptionV3 backbone trained on ImageNet, which is a poor feature extractor for the rich, diverse content produced by modern generative models.
   - It is known to correlate poorly with human perception of image quality. Most

* The paper is well written, and the method is clear and easy to understand.
* The results demonstrate strong empirical performance.
* The authors present a clear and well-founded motivation for the design of their mode

* Comparison to other approaches: In the introduction, the authors mention several alternative methods for efficient flow modeling, such as Consistency Models (CM), distillation, and local Flow Matching (FM). However, it remains unclear whether any quantitative comparisons were conducted against these methods.
* See questions
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Domain Adaptation and Few-Shot Learning
