Central Path Proximal Policy Optimization
Nikola Milosevic, Johannes Müller, Nico Scherf

TL;DR
This paper introduces C3PO, a modification of PPO that incorporates central path optimization to better enforce constraints without sacrificing policy performance.
Contribution
C3PO is a new method that integrates central path ideas into PPO, improving constraint enforcement in constrained Markov decision processes.
Findings
C3PO enforces constraints more tightly than existing on-policy methods.
C3PO maintains or improves final policy return.
The results suggest that central path-guided updates are a promising direction for constrained policy optimization.
Abstract
In constrained Markov decision processes, enforcing constraints during training is often thought of as decreasing the final return. Recently, it was shown that constraints can be incorporated directly into the policy geometry, yielding an optimization trajectory close to the central path of a barrier method, which does not compromise final return. Building on this idea, we introduce Central Path Proximal Policy Optimization (C3PO), a simple modification of the PPO loss that produces policy iterates that stay close to the central path of the constrained optimization problem. Compared to existing on-policy methods, C3PO delivers improved performance with tighter constraint enforcement, suggesting that central path-guided updates offer a promising direction for constrained policy optimization.
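To make the notion of a central path concrete, here is a minimal Python sketch on a toy problem. It is not the C3PO loss, which the abstract does not spell out. The setup is an assumption chosen for illustration: a two-action bandit where p is the probability of the rewarding but costly action, so both expected return and expected cost equal p, and the cost budget is 0.3. The function names `barrier_maximizer` and `central_path` are hypothetical. For each barrier weight mu, the sketch maximizes p + mu * log(budget - p); the maximizers traced out as mu shrinks form the central path, which in this toy case has the closed form p = budget - mu.

```python
def barrier_maximizer(mu, budget=0.3, iters=60):
    """Maximize p + mu * log(budget - p) over 0 <= p < budget.

    Toy assumption: expected return and expected cost both equal p.
    The objective is concave, so we bisect on the sign of its derivative
    1 - mu / (budget - p). Requires mu > 0.
    """
    lo, hi = 0.0, budget
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        grad = 1.0 - mu / (budget - mid)
        if grad > 0:
            lo = mid  # still gaining return faster than the barrier penalizes
        else:
            hi = mid  # too close to the constraint boundary
    return 0.5 * (lo + hi)


def central_path(mus, budget=0.3):
    """Return the barrier maximizer for each barrier weight in mus."""
    return [barrier_maximizer(mu, budget) for mu in mus]


if __name__ == "__main__":
    for mu, p in zip([0.2, 0.1, 0.01], central_path([0.2, 0.1, 0.01])):
        print(f"mu={mu:<5} p={p:.4f}  slack={0.3 - p:.4f}")
```

In this sketch every point on the path is strictly feasible, and the slack to the budget shrinks in step with mu, so the iterates approach the constrained optimum from inside the feasible region rather than overshooting and correcting.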
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Advanced Bandit Algorithms Research
