PolicyFlow: Policy Optimization with Continuous Normalizing Flow in Reinforcement Learning
Shunpeng Yang, Ben Liu, Hua Chen

TL;DR
PolicyFlow introduces a novel reinforcement learning algorithm that employs continuous normalizing flow policies with a new importance ratio approximation, enhancing expressiveness and stability without full likelihood evaluation.
Contribution
It develops PolicyFlow, enabling the use of expressive flow-based policies in PPO without costly likelihood computations, and introduces the Brownian Regularizer for promoting policy diversity.
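To make the "costly likelihood computations" concrete, the following is a minimal sketch of exact likelihood evaluation for a continuous normalizing flow, using the instantaneous change-of-variables identity d log p / dt = -div v. It illustrates the expensive path that the summary says PolicyFlow avoids; it is not PolicyFlow's approximation. The toy linear `velocity` field, the finite-difference `divergence`, the Euler integrator, and all function names are assumptions made for illustration; a real CNF policy would use a neural network conditioned on the state.

```python
import math

def velocity(x, t):
    """Toy linear velocity field (hypothetical stand-in for a learned network)."""
    return [-0.5 * x[0] + 0.3 * x[1], -0.3 * x[0] - 0.2 * x[1]]

def divergence(x, t, eps=1e-5):
    """Central-difference divergence: two extra velocity calls per dimension."""
    div = 0.0
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        div += (velocity(hi, t)[i] - velocity(lo, t)[i]) / (2.0 * eps)
    return div

def standard_normal_log_prob(x):
    """Log-density of the base distribution N(0, I)."""
    return sum(-0.5 * (xi * xi + math.log(2.0 * math.pi)) for xi in x)

def sample_with_log_prob(x0, n_steps=100):
    """Euler-integrate the flow from t=0 to t=1, tracking the log-density."""
    x = list(x0)
    logp = standard_normal_log_prob(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        # Instantaneous change of variables: d log p / dt = -div v(x, t).
        logp -= divergence(x, t) * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, logp

action, log_prob = sample_with_log_prob([0.4, -1.1])
```

Every integration step needs a divergence estimate in addition to the velocity itself, and training through this quantity would mean backpropagating through the whole trajectory, which is the cost the contribution statement refers to.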
Findings
PolicyFlow outperforms PPO with Gaussian policies on multiple tasks.
It effectively captures multimodal action distributions.
The Brownian Regularizer improves policy diversity and stability.
Abstract
Among on-policy reinforcement learning algorithms, Proximal Policy Optimization (PPO) is widely favored for its simplicity, numerical stability, and strong empirical performance. Standard PPO relies on surrogate objectives defined via importance ratios, which require evaluating the policy likelihood; this is typically straightforward when the policy is modeled as a Gaussian distribution. However, extending PPO to more expressive, high-capacity policy models such as continuous normalizing flows (CNFs), also known as flow-matching models, is challenging because likelihood evaluation along the full flow trajectory is computationally expensive and often numerically unstable. To resolve this issue, we propose PolicyFlow, a novel on-policy CNF-based reinforcement learning algorithm that integrates expressive CNF policies with PPO-style objectives without requiring likelihood…
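The role of the importance ratio in standard PPO can be sketched as follows for a diagonal Gaussian policy, where the likelihood is available in closed form. This is a generic illustration of the clipped PPO surrogate, not code from the paper; the function names, the `clip_eps` default of 0.2, and the example numbers are assumptions.

```python
import math

def gaussian_log_prob(action, mean, log_std):
    """Log-density of a diagonal Gaussian policy, summed over action dimensions."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        var = math.exp(2.0 * ls)
        total += -0.5 * ((a - m) ** 2 / var + 2.0 * ls + math.log(2.0 * math.pi))
    return total

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate averaged over a batch (to be maximized)."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        # Importance ratio pi_new(a|s) / pi_old(a|s), computed from log-likelihoods.
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

# Hypothetical batch of two 2-D actions with their advantage estimates.
actions = [[0.3, -0.2], [1.0, 0.5]]
advantages = [1.0, -0.5]
log_std = [0.0, 0.0]
logp_old = [gaussian_log_prob(a, [0.0, 0.0], log_std) for a in actions]
logp_new = [gaussian_log_prob(a, [0.1, 0.1], log_std) for a in actions]
objective = ppo_clip_objective(logp_new, logp_old, advantages)
```

The ratio depends only on the two log-likelihoods, which is why a policy class without a cheap likelihood, such as a CNF, does not drop into this objective directly.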
Peer Reviews
Decision: ICLR 2026 Poster
1. The authors demonstrate how to bypass the computationally expensive full ODE simulation and backpropagation typically required when using Neural ODE-based policies with a PPO objective. Their key insight is to use an efficient approximation of the importance ratio, enabling stable on-policy training without the standard computational bottlenecks. 2. The paper introduces a lightweight "Brownian regularizer" to enhance behavioral diversity and mitigate mode collapse.
1. How is the initial flow matching model for the method in this paper obtained? What is the impact of the initial model's performance on the overall method? 2. The target distribution of the flow-based policy changes dynamically during training, yet the objective function samples only a single t from the path at each step. Could this, due to the varying sample weights (different values of A) for each t along the path, prevent the model from learning an effective distribution? 3. A sensitivity
- Addresses the expressiveness limitation of Gaussian policies by exploring normalizing flows for policy representation. - Introduces a Brownian motion-based entropy regularizer to encourage implicit exploration. - Presents a clear and structured implementation based on PPO. - Includes runtime and parameter analyses, offering transparency on computational cost. - Demonstrates engagement with related work, including comparisons to other flow-based methods.
- The reported empirical results are very close to PPO, providing limited evidence of improvement. - The additional model complexity and slower runtime are not justified by corresponding performance gains. - The theoretical connection between the flow-based representation and policy gradient optimization is underdeveloped. - The motivation for emphasizing FPO comparisons is not well justified relative to the paper’s main objective. - Benchmark evaluations and variance reporting are incomplete, l
The paper proposes an interesting and original idea — combining continuous normalizing flows with on-policy policy optimization in a practical way. The paper is clearly written, with intuitive explanations and helpful figures that make the method easy to understand. Experiments cover multiple benchmarks (MuJoCo, IsaacLab, MultiGoal) and show consistent improvements over PPO and FPO. Overall, the work is technically solid and provides a promising direction for expressive yet stable flow-based p
Methodological Weaknesses: 1. The proposed interpolation-based estimation of importance ratios is only heuristic; the paper does not quantify the bias introduced or establish convergence guarantees. Providing analytical error bounds or controlled experiments comparing with exact estimators would strengthen credibility. 2. The Brownian regularizer, while novel, lacks clear motivation and comparison with existing entropy regularizers (e.g., Haarnoja et al., 2018; Chao et al., 2024). Its empirical
Taxonomy
Topics: Reinforcement Learning in Robotics · Human Motion and Animation · Domain Adaptation and Few-Shot Learning
