Constrained Policy Optimization with Explicit Behavior Density for Offline Reinforcement Learning
Jing Zhang, Chi Zhang, Wenjia Wang, Bing-Yi Jing

TL;DR
This paper introduces CPED, a novel offline RL method that explicitly estimates behavior policy density using flow-GAN, enabling more accurate and less conservative policy optimization within safe regions.
Contribution
The paper proposes a new offline RL approach, CPED, which uses flow-GAN to explicitly estimate behavior density, improving safety and performance over existing methods.
Findings
CPED outperforms existing methods on standard offline RL benchmarks.
The paper provides theoretical guarantees for the flow-GAN estimator and for CPED's performance.
CPED achieves higher expected returns with less conservative policies.
Abstract
Due to the inability to interact with the environment, offline reinforcement learning (RL) methods face the challenge of estimating the Out-of-Distribution (OOD) points. Existing methods for addressing this issue either control policy to exclude the OOD action or make the Q-function pessimistic. However, these methods can be overly conservative or fail to identify OOD areas accurately. To overcome this problem, we propose a Constrained Policy optimization with Explicit Behavior density (CPED) method that utilizes a flow-GAN model to explicitly estimate the density of behavior policy. By estimating the explicit density, CPED can accurately identify the safe region and enable optimization within the region, resulting in less conservative learning policies. We further provide theoretical results for both the flow-GAN estimator and performance guarantee for CPED by showing that CPED can…
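The idea of optimizing only inside a density-defined safe region can be illustrated with a toy sketch. This is not the paper's implementation: the Gaussian density below is a hypothetical stand-in for the learned flow-GAN density, the grid search stands in for policy optimization, and the names `constrained_argmax`, `log_eps`, and the toy Q-function are all assumptions made for illustration.

```python
import math


def gaussian_log_density(a, mean, std):
    """Log-density of a 1-D Gaussian (hypothetical stand-in for a learned behavior density)."""
    return -0.5 * ((a - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def constrained_argmax(q_fn, log_density_fn, candidates, log_eps):
    """Pick the highest-Q candidate whose estimated behavior log-density is at least log_eps."""
    safe = [a for a in candidates if log_density_fn(a) >= log_eps]
    if not safe:
        # No candidate is in the safe region: fall back to the most likely action.
        return max(candidates, key=log_density_fn)
    return max(safe, key=q_fn)


# Toy setup; every value here is an illustrative assumption.
behavior_mean, behavior_std = 0.0, 0.5


def q_fn(a):
    # Toy Q-function that grows with the action, so the unconstrained
    # maximizer is the largest (and least supported) candidate.
    return a


def log_density(a):
    return gaussian_log_density(a, behavior_mean, behavior_std)


candidates = [i / 10.0 for i in range(-30, 31)]  # actions on a grid over [-3, 3]
log_eps = math.log(0.1)  # density threshold defining the safe region

unconstrained = max(candidates, key=q_fn)
best = constrained_argmax(q_fn, log_density, candidates, log_eps)
print(unconstrained, best)
```

In this toy setting the unconstrained choice lands far out in the low-density tail, while the constrained choice stays at the edge of the region the behavior data supports. An explicit density makes that boundary a simple threshold check, which is the property the abstract attributes to CPED.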
Taxonomy
Topics: Reinforcement Learning in Robotics · Mental Health Research Topics · Embodied and Extended Cognition
