NFPO: Stabilized Policy Optimization of Normalizing Flow for Robotic Policy Learning

Diyuan Shi; Yiqi Tang; Zifeng Zhuang; Donglin Wang

arXiv:2603.11470·cs.RO·March 13, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces NFPO, a stabilized normalizing flow-based policy optimization method for robotic reinforcement learning, addressing training instability and enabling multi-modal policy modeling with robust real-world transfer.

Contribution

It proposes a novel stabilized training approach for normalizing flow policies in RL, improving stability and performance in robotic tasks.

Findings

01

NFPO achieves robust performance across multiple simulation environments.

02

NFPO successfully transfers to real-world robotic tasks.

03

The method outperforms traditional Gaussian policies in multi-modal modeling.

Abstract

Deep Reinforcement Learning (DRL) has experienced significant advancements in recent years and has been widely used in many fields. In DRL-based robotic policy learning, however, the current de facto policy parameterization is still a multivariate Gaussian (with a diagonal covariance matrix), which cannot model multi-modal distributions. In this work, we explore the adoption of a modern network architecture, i.e., Normalizing Flow (NF), as the policy parameterization for its ability to model multi-modal distributions, its closed-form log probability, and its low computation and memory overhead. However, naively training NF in online Reinforcement Learning (RL) usually leads to training instability. We provide a detailed analysis of this phenomenon and successfully address it via a simple but effective technique. With extensive experiments in multiple simulation environments, we show our method, NFPO…
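
The closed-form log probability mentioned in the abstract can be illustrated with a minimal sketch. It assumes a RealNVP-style affine coupling layer whose log-scale is squashed by tanh, which is how the reviews on this page describe the stabilization. It is not the authors' implementation: the bound `alpha`, the two-dimensional action, and the toy conditioners `toy_raw_scale` and `toy_shift` are hypothetical stand-ins for a learned network.

```python
import math


def bounded_log_scale(raw, alpha=1.0):
    """Squash an unbounded conditioner output into [-alpha, alpha] with tanh."""
    return alpha * math.tanh(raw)


def coupling_forward(z, raw_scale, shift, alpha=1.0):
    """Affine coupling on a 2-D vector: keep z[0], transform z[1] given z[0].

    Returns the transformed vector and log|det J|, which for a single
    transformed dimension is just the (bounded) log-scale s.
    """
    s = bounded_log_scale(raw_scale(z[0]), alpha)
    a = [z[0], z[1] * math.exp(s) + shift(z[0])]
    return a, s


def coupling_inverse(a, raw_scale, shift, alpha=1.0):
    """Invert the coupling; a[0] is unchanged, so s can be recomputed exactly."""
    s = bounded_log_scale(raw_scale(a[0]), alpha)
    z = [a[0], (a[1] - shift(a[0])) * math.exp(-s)]
    return z, s


def std_normal_logpdf(z):
    """Log density of an isotropic standard normal base distribution."""
    return sum(-0.5 * x * x - 0.5 * math.log(2.0 * math.pi) for x in z)


def log_prob(a, raw_scale, shift, alpha=1.0):
    """Change of variables: log p(a) = log p_base(z) - log|det J|."""
    z, s = coupling_inverse(a, raw_scale, shift, alpha)
    return std_normal_logpdf(z) - s


# Hypothetical conditioners; a real policy would use a neural network here.
def toy_raw_scale(x):
    return 50.0 * x  # deliberately large, to show the effect of the bound


def toy_shift(x):
    return 0.5 * x


if __name__ == "__main__":
    z = [2.0, -1.0]
    a, logdet = coupling_forward(z, toy_raw_scale, toy_shift, alpha=1.0)
    print("log|det J| stays within [-alpha, alpha]:", abs(logdet) <= 1.0)
    print("log-probability is finite:", math.isfinite(log_prob(a, toy_raw_scale, toy_shift)))
```

Without the tanh bound, the raw conditioner output of 100 in this toy case would be exponentiated directly, so the scale and the log-determinant term in the likelihood could grow without limit. With the bound, the per-layer contribution to the log-determinant is at most `alpha` in magnitude.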

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- The authors propose NFPO, a new framework that integrates Normalizing Flows (NF) into PPO for robotic multi-modal policy learning, and further analyze the causes of its training instability and introduce effective stabilization techniques. The authors provide a clear problem formulation for the multi-modal action distribution for on-policy control, and a simple and reproducible solution by swapping the policy head and bounding the scale output of the flow.
- Comprehensive experiments are conducte

Weaknesses

- Considering that this is an ICLR submission, the theoretical analysis may be more important than the engineering implementation and results. However, the theoretical analysis of the algorithm and the mathematical proofs in this paper are limited; e.g., one may expect to see an analysis of the stability of NFPO and of why adding an entropy loss in NFPO does not bring a significant performance difference.
- In some tasks like MJP-PandaOpenCabinet and MJP-Go1JoystickRoughTerrain, NFPO fails to learn

Reviewer 02 · Rating 6 · Confidence 3

Strengths

Problem Diagnosis and Solution: The paper clearly diagnoses the root cause of instability when combining NF with PPO (exploding determinant) and proposes a simple, effective solution ($\tanh$ activation).

Solid Experimental Validation: The paper provides comprehensive benchmarks on 9 tasks across multiple simulators (IsaacGym, Mujoco Playground) and includes thorough ablation studies.

Real-World Deployment: The policy was successfully transferred from simulation to a real Unitree G1 robot, st

Weaknesses

Limited Innovation: The work is primarily an application and engineering-level adaptation of RealNVP for policy optimization, rather than a fundamental algorithmic innovation. The core stabilization technique ($s\_tanh$) is a known trick.

Ambiguous Multi-modal Advantage: Although multi-modality was shown in specific tasks (Sec 5.3), its direct link to the performance gains in the main benchmarks (Sec 5.2) is not clear.

Reviewer 03 · Rating 4 · Confidence 5

Strengths

- The proposed technique is simple and clearly presented, with writing that is easy to follow.
- The paper includes detailed studies on different design choices (e.g., different methods to stabilize the scaling function, different hyper-parameter choices).
- NFPO offers slight performance gains compared to PPO, especially for certain control tasks such as g1-joystick, h1-joystick, and G1JoyStickRoughTerrain.

Weaknesses

- While the paper may be among the first to pair normalizing flows with on-policy RL, similar ideas have been explored extensively in off-policy settings, e.g., [1-3]. Without a clearer technical distinction or theoretical contribution beyond the training stabilizations, the novelty of this work seems modest.
- Reported gains are small and the comparisons are not run on widely accepted benchmarks with commonly adopted baselines (e.g., mixture Gaussian PPO variants, off-policy methods). As a res

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning