TL;DR
This paper introduces Flow Policy Optimization (FPO), a reinforcement learning algorithm that leverages flow matching within a policy gradient framework, enabling efficient training of flow-based policies without likelihood computation.
Contribution
FPO integrates flow matching into policy gradients, allowing flow-based policies to be trained in reinforcement learning without likelihood calculations and independent of sampling methods.
Findings
FPO can train diffusion-style policies from scratch in continuous control tasks.
Flow-based models outperform Gaussian policies in multimodal and under-conditioned settings.
FPO maintains generative capabilities while being agnostic to diffusion or flow integration methods.
Abstract
Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a…
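The ratio described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear interpolation path, the toy velocity functions, the reuse of the same noise and time draws for both policies, and the names `cfm_loss`, `fpo_ratio`, and `ppo_clip_surrogate` are all assumptions made for illustration.

```python
import numpy as np

def cfm_loss(velocity_fn, action, eps, t):
    """Per-sample conditional flow matching loss.

    Assumes a linear path x_t = (1 - t) * eps + t * action, whose target
    velocity is (action - eps). Other paths are possible.
    """
    x_t = (1.0 - t) * eps + t * action
    target = action - eps
    return np.sum((velocity_fn(x_t, t) - target) ** 2, axis=-1)

def fpo_ratio(loss_old, loss_new):
    """Proxy ratio: exp of the Monte Carlo mean of (old loss - new loss)."""
    return float(np.exp(np.mean(loss_old - loss_new)))

def ppo_clip_surrogate(ratio, advantage, clip_eps=0.2):
    """Standard PPO-clip surrogate for one (ratio, advantage) pair."""
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(min(ratio * advantage, clipped * advantage))

# Hypothetical setup: one 2-D action and 8 Monte Carlo (noise, time) draws.
rng = np.random.default_rng(0)
action = np.array([0.5, -0.2])
eps = rng.standard_normal((8, 2))
t = rng.uniform(size=(8, 1))

# Toy stand-ins for the old and updated velocity networks.
old_policy = lambda x_t, t: np.zeros_like(x_t)
new_policy = lambda x_t, t: 0.1 * (action - x_t)

ratio = fpo_ratio(
    cfm_loss(old_policy, action, eps, t),
    cfm_loss(new_policy, action, eps, t),
)
surrogate = ppo_clip_surrogate(ratio, advantage=1.0)
```

A ratio above 1 means the updated policy fits the sampled action better (lower CFM loss) than the old one. No likelihood is evaluated anywhere, and nothing in the sketch depends on how actions are sampled at rollout time.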
Peer Reviews
Decision·ICLR 2026 Poster
Pros: - Ratio-as-difference-of-CFM-losses is simple to implement and keeps GAE/GRPO compatibility. - Clear ablations: effect of #MC samples, ω- vs u-parameterization, clipping sensitivity. Shows robustness under sparse goal conditioning in humanoid.
Cons: - The ELBO is not the exact likelihood. The ratio decomposes into the true likelihood ratio times an inverse KL-gap factor. That second term is policy-dependent and unknown, so the proxy ratio is biased w.r.t. the true PPO ratio.
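The decomposition this reviewer refers to can be written out as follows. This is a sketch under assumed notation, not a quotation from the paper: $\hat{r}_\theta$ denotes the proxy ratio, $\mathrm{ELBO}_\theta$ the evidence lower bound on $\log \pi_\theta(a \mid o)$, and $D_\theta \ge 0$ the KL gap between the two.

```latex
\begin{aligned}
\mathrm{ELBO}_\theta(a \mid o) &= \log \pi_\theta(a \mid o) - D_\theta, \qquad D_\theta \ge 0, \\
\hat{r}_\theta
  &= \exp\!\big(\mathrm{ELBO}_\theta(a \mid o) - \mathrm{ELBO}_{\theta_{\text{old}}}(a \mid o)\big) \\
  &= \underbrace{\frac{\pi_\theta(a \mid o)}{\pi_{\theta_{\text{old}}}(a \mid o)}}_{\text{true likelihood ratio}}
     \cdot
     \underbrace{\frac{\exp\!\big(D_{\theta_{\text{old}}}\big)}{\exp\!\big(D_\theta\big)}}_{\text{inverse KL-gap factor}} .
\end{aligned}
```

Under these assumptions, the proxy ratio equals the true PPO ratio only when the two gaps coincide; otherwise the second factor shifts it by an amount that depends on the current policy.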
1. The motivation is clear and significant. Training flow matching models directly from rewards could greatly broaden their use in robotics. 2. The evaluation is comprehensive and shows the effectiveness of the proposed method on simple simulated robotic tasks.
1. The baselines are limited. There are existing methods that use direct rewards to weight the trajectory and are agnostic to sampling methods, although most of them are applied to text-to-image generation and other generation tasks, for example, [A]. The authors should also implement some of these methods on robotics tasks and conduct a simple evaluation, or at least include them as related work and describe the differences. [A] Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein
The benefit of this framework is in enabling a more general class of generative policies that do not require a pre-defined policy class (e.g., a unimodal Gaussian or Beta distribution). I see the main benefit of this work shown in the humanoid control section. We know that large NNs generally converge to global optima ("there's always some decent direction so long as we have some random noise in the system"). It isn't clear whether continuous control exhibits similar properties or not. The mujoco
I believe the work should focus more on the generative aspects of the method, as in the humanoid control effort, and less on toy problems. The MuJoCo Playground results are "nice to have": they show the method generally works. But the main strength of generative methods is in their ability to model a distribution.
