Mean Flow Policy with Instantaneous Velocity Constraint for One-step Action Generation
Guojian Zhan, Letian Tao, Pengcheng Wang, Yixiao Wang, Yiheng Li, Yuxin Chen, Hongyang Li, Masayoshi Tomizuka, Shengbo Eben Li

TL;DR
This paper introduces the mean velocity policy (MVP), a novel flow-based reinforcement learning policy that models mean velocity fields with an instantaneous velocity constraint, enabling fast one-step action generation and improved expressiveness.
Contribution
The paper proposes MVP, a new flow-based policy with an instantaneous velocity constraint, enhancing expressiveness and speed in one-step action generation for RL.
Findings
Achieves state-of-the-art success rates in robotic manipulation tasks.
Provides substantial speed improvements in training and inference.
Theoretically proves the effectiveness of the velocity constraint as a boundary condition.
Abstract
Learning expressive and efficient policy functions is a promising direction in reinforcement learning (RL). While flow-based policies have recently proven effective in modeling complex action distributions with a fast deterministic sampling process, they still face a trade-off between expressiveness and computational burden, which is typically controlled by the number of flow steps. In this work, we propose mean velocity policy (MVP), a new generative policy function that models the mean velocity field to achieve the fastest one-step action generation. To ensure its high expressiveness, an instantaneous velocity constraint (IVC) is introduced on the mean velocity field during training. We theoretically prove that this design explicitly serves as a crucial boundary condition, thereby improving learning accuracy and enhancing policy expressiveness. Empirically, our MVP achieves…
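As a reading aid, the relations the abstract alludes to can be written out under the MeanFlow formulation that the reviews attribute to Geng et al. (2025a). The symbols below ($z_t$, $r$, $t$, $v$, $u$) and the convention that $t = 1$ is the noise end of the path are assumptions borrowed from that formulation; the paper's own notation, conditioning on the RL state, and exact loss may differ.

```latex
% Mean velocity over [r, t]: the time-average of the instantaneous velocity v
u(z_t, r, t) \;=\; \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% Differentiating (t - r)\,u with respect to t gives the mean-flow identity
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\, \frac{d}{dt} u(z_t, r, t),
\qquad
\frac{d}{dt} u \;=\; v\, \partial_z u + \partial_t u

% Letting r \to t gives the boundary condition that an instantaneous
% velocity constraint would enforce
u(z_t, t, t) \;=\; v(z_t, t)

% One-step generation over the whole interval, with z_1 drawn from the noise prior
z_0 \;=\; z_1 - u(z_1, 0, 1)
```

Read this way, the identity alone is a differential relation in $t$, and the $r \to t$ condition is what pins down which of its solutions is the true mean velocity, which is consistent with the abstract's description of the constraint as a boundary condition.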
Peer Reviews
Decision: ICLR 2026 Oral
This is a strong paper that introduces a new formulation for flow matching, which is a prevalent tool in robot learning. A major pain point for flow-matching or diffusion-style approaches is slow inference. They need several denoising steps or numerical integration, which greatly slows down inference and limits the rate at which such policies can be applied. Unlike "hacks", this paper presents a clean reformulation using the ideas of mean-flow to reduce the inference time. The theoretical analysis
An unfortunate (IMO) weakness of the paper is the limited empirical results. It uses only two simulation environments and the improvements are minimal over the baselines. In most cases, it matches or only slightly exceeds the baseline. In a way that's not surprising to me. The claim isn't that this is a better method (in terms of higher success rates) than baselines but that it is a faster method (during inference). However, the inference time comparisons are only in the appendix with some short
- It has always been a question for me how diffusion / flow-matching policies can be used in RL. The “generate-and-select” mechanism used in this paper seems simple and straightforward, yet the authors demonstrate that it works remarkably well. I really appreciate this idea.
- By incorporating recent advances from generative AI (specifically, MeanFlow), the method enables fast online RL training while preserving the generative model’s ability to represent complex, multimodal action distribution
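The "generate-and-select" idea mentioned in this review can be sketched as follows. This is a minimal illustration only: `one_step_policy` and `q_value` are hypothetical stand-ins (a toy scalar generator and a toy critic), not the paper's networks, and the selection rule shown is plain best-of-N by critic score.

```python
import random


def one_step_policy(state, noise):
    """Hypothetical one-step generator: maps (state, noise) to a scalar action."""
    return max(-1.0, min(1.0, 0.5 * state + noise))


def q_value(state, action):
    """Hypothetical critic: scores actions higher the closer they are to -state."""
    return -(action + state) ** 2


def generate_and_select(state, n_candidates=8, rng=None):
    """Sample N candidate actions in one step each, keep the highest-scoring one."""
    rng = rng or random.Random(0)
    candidates = [
        one_step_policy(state, rng.gauss(0.0, 1.0)) for _ in range(n_candidates)
    ]
    best = max(candidates, key=lambda a: q_value(state, a))
    return best, candidates


best_action, all_candidates = generate_and_select(0.3, n_candidates=16)
```

Because each candidate costs a single forward pass in a one-step policy, the cost of selection grows only linearly in the number of candidates, which is presumably why the mechanism pairs well with fast generators.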
- The paper emphasizes its one-step inference speed but understates the associated training cost. The core training loss (Eq. 9) necessitates a Jacobian-vector product (JVP) to compute the $\frac{d}{dt} \mathbf{u}_\theta$ term. This operation is often incompatible with optimized attention implementations like FlashAttention, potentially limiting the method's training efficiency and scalability.
- The paper repeatedly claims suitability for "real-time control systems" and "real-time deployment,"
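To make the JVP remark concrete, here is a small sketch of how a total derivative $\frac{d}{dt} u$ can be obtained with one forward-mode pass. It is not the paper's Eq. 9: it assumes the MeanFlow identity $u = v - (t - r)\frac{d}{dt}u$, uses a hand-rolled dual-number `jvp` helper instead of a framework routine, and checks the identity on a toy field $v(z, t) = z$ whose mean velocity is known in closed form. All names (`Dual`, `jvp`, `u_toy`, `v_toy`) are hypothetical.

```python
import math


class Dual:
    """Number val + tan*eps with eps**2 == 0; tan carries a directional derivative."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, o):
        o = Dual.lift(o)
        return Dual(self.val + o.val, self.tan + o.tan)

    __radd__ = __add__

    def __sub__(self, o):
        o = Dual.lift(o)
        return Dual(self.val - o.val, self.tan - o.tan)

    def __rsub__(self, o):
        return Dual.lift(o) - self

    def __mul__(self, o):
        o = Dual.lift(o)
        return Dual(self.val * o.val, self.tan * o.val + self.val * o.tan)

    __rmul__ = __mul__

    def __truediv__(self, o):
        o = Dual.lift(o)
        return Dual(self.val / o.val,
                    (self.tan * o.val - self.val * o.tan) / (o.val ** 2))


def dexp(x):
    """Exponential that propagates the tangent."""
    x = Dual.lift(x)
    e = math.exp(x.val)
    return Dual(e, e * x.tan)


def jvp(f, primals, tangents):
    """Return (f(primals), directional derivative of f along tangents)."""
    out = f(*[Dual(p, t) for p, t in zip(primals, tangents)])
    return out.val, out.tan


def v_toy(z, t):
    """Toy instantaneous velocity: dz/dt = z."""
    return z


def u_toy(z, r, t):
    """Closed-form mean velocity of the toy field over [r, t]."""
    return z * (1.0 - dexp(r - t)) / (t - r)


def meanflow_residual(z, r, t):
    # The tangent (v, 0, 1) turns the JVP into d/dt u = v * du/dz + du/dt.
    u_val, du_dt = jvp(u_toy, (z, r, t), (v_toy(z, t), 0.0, 1.0))
    return u_val - (v_toy(z, t) - (t - r) * du_dt)


residual = meanflow_residual(0.7, 0.2, 0.9)  # close to zero for the exact u
```

In a neural setting the same tangent would typically be passed to a framework routine such as `jax.jvp` or `torch.func.jvp`; the reviewer's concern is that this forward-mode pass must go through every layer of the network, including attention kernels that may not support it.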
- Clear theoretical presentation with formal analysis of the mean-flow ODE’s non-uniqueness and the effect of the IVC boundary constraint.
- Empirical results show consistent efficiency gains over multi-step flow policies.
- The IVC ablation confirms its stabilizing effect during training.
- Writing and experimental setup are clear and reproducible.
- The mean-flow formulation itself is taken directly from prior generative modeling work (*Geng et al., 2025a*); this paper mainly applies it to RL.
- The proposed IVC is conceptually implied by the mean-flow definition when $r \to t$, so it should not be considered a new theoretical contribution.
- The best-of-N action-selection mechanism is standard in prior RL methods (EMaQ, BFN, FQL).
- The experimental scope is limited: only a small number of state-based manipulation tasks are used, with no
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis
