One-Step Flow Policy Mirror Descent

Tianyi Chen; Haitong Ma; Na Li; Kai Wang; Bo Dai

arXiv:2507.23675·cs.LG·October 17, 2025

Open Access · 3 Reviews

TL;DR

Flow Policy Mirror Descent (FPMD) is an online reinforcement learning algorithm that enables one-step sampling at inference time for flow policies, avoiding the slow iterative sampling of diffusion policies while maintaining competitive performance.

Contribution

The paper proposes FPMD, an online RL algorithm that enables single-step flow policy inference without extra distillation or consistency training, making inference more efficient than the iterative sampling used by diffusion policies.

Findings

01. FPMD achieves comparable performance to diffusion policies on benchmarks.

02. FPMD requires orders of magnitude less computational cost during inference.

03. Empirical results on MuJoCo and DeepMind Control Suite validate the effectiveness of FPMD.

Abstract

Diffusion policies have achieved great success in online reinforcement learning (RL) due to their strong expressive capacity. However, the inference of diffusion policy models relies on a slow iterative sampling process, which limits their responsiveness. To overcome this limitation, we propose Flow Policy Mirror Descent (FPMD), an online RL algorithm that enables 1-step sampling during flow policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight interpolation flow matching models, and requires no extra distillation or consistency training. We present two algorithm variants based on rectified flow policy and MeanFlow policy, respectively. Extensive empirical evaluations on MuJoCo and visual DeepMind Control Suite benchmarks demonstrate that our algorithms show strong performance…
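The link between target variance and one-step discretization error mentioned in the abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's proposition or algorithm. It assumes a 1-D Gaussian target N(mu, sigma^2), standard Gaussian noise, independent coupling, and the straight interpolation x_t = (1 - t) * x0 + t * x1, for which the marginal velocity field has a closed form. The names `velocity`, `euler_sample`, and `w2_error` are hypothetical helpers introduced for illustration.

```python
import numpy as np


def velocity(x, t, mu, sigma):
    """Marginal velocity E[x1 - x0 | x_t = x] for the assumed toy setup:
    x0 ~ N(0, 1), x1 ~ N(mu, sigma^2) independent, x_t = (1 - t) * x0 + t * x1."""
    var_t = (1.0 - t) ** 2 + (t * sigma) ** 2   # Var[x_t]
    cov = t * sigma ** 2 - (1.0 - t)            # Cov[x1 - x0, x_t]
    return mu + cov / var_t * (x - t * mu)


def euler_sample(x0, mu, sigma, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt, mu, sigma)
    return x


def w2_error(samples, mu, sigma):
    """2-Wasserstein distance between a Gaussian fitted to the samples
    and the target N(mu, sigma^2)."""
    return float(np.hypot(samples.mean() - mu, samples.std() - sigma))


rng = np.random.default_rng(0)
noise = rng.standard_normal(10_000)
mu = 2.0                      # illustrative target mean
sigmas = [1.0, 0.3, 0.1]      # illustrative target standard deviations

one_step_errors = [w2_error(euler_sample(noise, mu, s, 1), mu, s) for s in sigmas]
multi_step_errors = [w2_error(euler_sample(noise, mu, s, 100), mu, s) for s in sigmas]

for s, e1, e100 in zip(sigmas, one_step_errors, multi_step_errors):
    print(f"sigma={s:.1f}  1-step W2={e1:.4f}  100-step W2={e100:.4f}")
```

In this toy case the single Euler step maps every noise sample to the target mean, so the one-step error equals sigma and shrinks as the target becomes near-deterministic, while many-step sampling stays accurate at every variance level.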

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper is clearly written and easy to follow. It is useful to investigate the new MeanFlow approach for learning policies in RL.

Weaknesses

The paper does not sufficiently position this work relative to other RL papers that use flow policies. The paper does cite several works and says: “Compared to these methods, ours is the only method that achieves an effective balance between policy distribution expressiveness and action sampling efficiency, by introducing a practical training objective equivalent to the flow matching objective and enabling one-step action generation.” However, this is an insufficient description of what is ar

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper has several strengths: 1. The problem studied is an interesting topic in reinforcement learning with continuous actions. With the recent development of expressive generative models, using them for policy representation is an interesting direction. The paper addresses an important limitation of current generative approaches: high inference time. 2. The paper reveals a novel insight that speeding up flow-based models with fewer sampling steps could be reasonable as the distribution beco

Weaknesses

Despite the above strengths, the paper also has a few weaknesses: 1. Lack of controlled experiments against classic model-free RL approaches. The proposed methods perform very similarly to DPMD, with their “proxy confidence intervals” overlapping most of the time (note that a proper confidence interval should be used). It is reasonable to hypothesize that the actor update has a relatively small influence on the training, which may not have been ruled out without further expe

Reviewer 03 · Rating 4 · Confidence 3

Strengths

FPMD-R/M match or exceed diffusion policy baselines on most tasks. The inference time speed-ups are clearly demonstrated in Figures 1 and 3. The method is built on a solid foundation. Proposition 2 provides the theoretical bound for the 1-step error, and the derivation of the L_FPMD loss from the PMD objective appears sound.

Weaknesses

The core importance-sampling loss (Eq. 9) is a relatively standard technique for applying generative models in reward-based learning, similar to reward-weighted objectives in image/video generation. The core premise, that the flow policy converges to an efficient 1-step model, relies on the optimal policy having low variance (being near-deterministic). This assumption, while true for the benchmarked tasks, may not hold for more complex or multi-task settings where the optimal policy itself is inherently

Taxonomy

Topics: Climate Change Policy and Economics · Energy, Environment, and Transportation Policies