TL;DR
This paper introduces SCAOPO, a successive convex approximation based off-policy optimization algorithm for constrained reinforcement learning. It solves constrained Markov decision processes (CMDPs) by iteratively replacing the problem with convex surrogate problems, which allows online learning with reuse of past experience.
Contribution
The paper presents a novel successive convex approximation method for off-policy constrained RL that guarantees convergence to KKT points and reduces implementation costs.
Findings
Converges to KKT points under time-varying distributions
Enables online learning with experience reuse
Provable convergence when started from a feasible initial point
Abstract
We propose a successive convex approximation based off-policy optimization (SCAOPO) algorithm to solve the general constrained reinforcement learning problem, which is formulated as a constrained Markov decision process (CMDP) in the average-cost setting. SCAOPO solves a sequence of convex objective/feasibility optimization problems obtained by replacing the objective and constraint functions of the original problem with convex surrogate functions. At each iteration, the convex surrogate problem can be solved efficiently by the Lagrange dual method, even when the policy is parameterized by a high-dimensional function. Moreover, SCAOPO allows old experiences from previous updates to be reused, thereby significantly reducing the implementation cost when deployed in real-world engineering systems that must learn the environment online. In spite of the time-varying state…
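The per-iteration step described in the abstract (build convex surrogates, then solve the surrogate problem through its Lagrange dual) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a single constraint, simple quadratic surrogates built from gradient estimates that are passed in as plain lists, and a bisection search on the multiplier. The function name `solve_surrogate` and the parameters `g0`, `g1`, `c1`, `tau0`, `tau1` are hypothetical names chosen for illustration, and the separate feasibility problem mentioned in the abstract is not covered.

```python
def solve_surrogate(theta, g0, g1, c1, tau0=1.0, tau1=1.0, tol=1e-10):
    """Solve one convex quadratic surrogate problem via its Lagrange dual.

    Assumed (illustrative) surrogates around the current iterate theta,
    with d = new_theta - theta:
        objective : g0 . d + (tau0 / 2) * ||d||^2
        constraint: c1 + g1 . d + (tau1 / 2) * ||d||^2 <= 0
    g0 and g1 stand in for estimated gradients, c1 for the estimated
    constraint value at theta. Returns (new_theta, multiplier).
    """
    def step(lam):
        # Closed-form minimizer of the Lagrangian for a fixed multiplier.
        scale = tau0 + lam * tau1
        return [-(a + lam * b) / scale for a, b in zip(g0, g1)]

    def constraint(d):
        dot = sum(b * x for b, x in zip(g1, d))
        sq = sum(x * x for x in d)
        return c1 + dot + 0.5 * tau1 * sq

    lam = 0.0
    d = step(lam)
    if constraint(d) > 0.0:
        # Constraint is active: the dual is concave in lam and its slope is
        # the constraint value, so bisect for the multiplier that zeroes it.
        lo, hi = 0.0, 1.0
        while constraint(step(hi)) > 0.0:
            hi *= 2.0
            if hi > 1e12:
                raise ValueError("surrogate problem appears infeasible")
        while hi - lo > tol * max(1.0, hi):
            mid = 0.5 * (lo + hi)
            if constraint(step(mid)) > 0.0:
                lo = mid
            else:
                hi = mid
        lam = hi  # upper end keeps the surrogate constraint satisfied
        d = step(lam)
    return [t + x for t, x in zip(theta, d)], lam


# Usage: a 2-D parameter whose unconstrained step would violate the constraint.
new_theta, multiplier = solve_surrogate([0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], -0.4)
```

Because the Lagrangian minimizer has a closed form for quadratic surrogates, the cost of the dual search depends on the number of constraints rather than on the dimension of the policy parameters, which is the property the abstract points to for high-dimensional policies.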
