Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling
Xubin Zhou, Yipeng Yang, Zhan Li

TL;DR
The paper introduces Truncated Rectified Flow Policy (TRFP), a policy framework for maximum entropy (MaxEnt) RL that models multimodal action distributions, trains stably, and supports one-step sampling. It outperforms strong baselines on most MuJoCo benchmarks.
Contribution
TRFP offers a hybrid deterministic-stochastic architecture that makes entropy-regularized optimization tractable and supports stable, efficient one-step sampling in generative policies.
Findings
TRFP captures multimodal behavior effectively in benchmarks.
It outperforms strong baselines on most MuJoCo benchmarks.
It remains competitive under one-step sampling.
Abstract
Maximum entropy reinforcement learning (MaxEnt RL) has become a standard framework for sequential decision making, yet its standard Gaussian policy parameterization is inherently unimodal, limiting its ability to model complex multimodal action distributions. This limitation has motivated increasing interest in generative policies based on diffusion and flow matching as more expressive alternatives. However, incorporating such policies into MaxEnt RL is challenging for two main reasons: the likelihood and entropy of continuous-time generative policies are generally intractable, and multi-step sampling introduces both long-horizon backpropagation instability and substantial inference latency. To address these challenges, we propose Truncated Rectified Flow Policy (TRFP), a framework built on a hybrid deterministic-stochastic architecture. This design makes entropy-regularized…
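To make the sampling-cost argument concrete, here is a minimal sketch of why straight (rectified) flows permit one-step sampling. It is not the TRFP architecture: the abstract is cut off before the method is described, so the truncation and the hybrid deterministic-stochastic design are not modeled here. The names `euler_sample`, `straight_velocity`, `curved_velocity`, and `TARGET` are hypothetical, and the hand-written velocity fields stand in for a learned, state-conditioned network.

```python
import numpy as np

def euler_sample(velocity, a0, num_steps):
    """Integrate da/dt = velocity(a, t) from t=0 to t=1 with explicit Euler steps."""
    a = np.asarray(a0, dtype=float).copy()
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h  # t stays strictly below 1, so 1 - t is never zero
        a = a + h * velocity(a, t)
    return a

# Hypothetical toy target action (assumption, not from the paper).
TARGET = np.array([0.5, -0.25])

def straight_velocity(a, t):
    # Straight-line flow toward TARGET: along the path
    # a_t = (1 - t) * a_0 + t * TARGET this equals TARGET - a_0, a constant.
    return (TARGET - a) / (1.0 - t)

def curved_velocity(a, t):
    # Rotation field, included only as a contrast: its paths are arcs.
    return np.pi * np.stack([-a[..., 1], a[..., 0]], axis=-1)

rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 2))  # a_0 drawn from a Gaussian prior

one_step = euler_sample(straight_velocity, noise, num_steps=1)
ten_step = euler_sample(straight_velocity, noise, num_steps=10)
gap_straight = np.max(np.abs(one_step - ten_step))

gap_curved = np.max(np.abs(
    euler_sample(curved_velocity, noise, num_steps=1)
    - euler_sample(curved_velocity, noise, num_steps=10)
))

print(f"straight field, 1 vs 10 steps: {gap_straight:.2e}")
print(f"curved field,   1 vs 10 steps: {gap_curved:.2e}")
```

On the straight field a single Euler step lands on the same point as ten steps, up to floating-point error, while on the curved field the two disagree substantially. That gap is the reason multi-step integration, with its added inference latency and longer backpropagation path, is normally needed for flows that are not straight.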
