Smoothing Policies and Safe Policy Gradients
Matteo Papini, Matteo Pirotta, Marcello Restelli

TL;DR
This paper proposes a safe policy gradient method that ensures monotonic improvement in reinforcement learning by adaptively tuning step size and batch size, addressing safety concerns in real-world applications.
Contribution
The paper guarantees monotonic policy improvement, with high probability, by adaptively scheduling the meta-parameters of a policy gradient method, namely its step size and batch size.
Findings
Guarantees of monotonic improvement with high probability.
Novel upper bounds on policy gradient estimator variance.
Effective adaptive meta-parameter selection strategy.
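The interplay between step size, batch size, and guaranteed improvement can be illustrated with a minimal sketch. It relies only on textbook facts about smooth objectives and Chebyshev's inequality, not on the paper's own bounds or schedules. Every name here is an assumption made for illustration: `smoothness` is a Lipschitz constant of the gradient, `variance_bound` is an upper bound on the expected squared error of a single-sample gradient estimate, and `delta` is the allowed failure probability. In practice the gradient norm itself depends on the batch, so `min_batch_size` is only indicative.

```python
import math
import numpy as np


def error_bound(variance_bound, batch_size, delta):
    """Chebyshev-style bound: with probability >= 1 - delta, the mean of
    batch_size i.i.d. gradient samples is within eps of the true gradient."""
    return math.sqrt(variance_bound / (batch_size * delta))


def safe_step_size(g_norm, eps, smoothness):
    """Step size maximizing the improvement lower bound below.
    Returns 0.0 when the estimate is too noisy to certify any improvement."""
    if g_norm <= eps:
        return 0.0
    return (g_norm - eps) / (smoothness * g_norm)


def improvement_lower_bound(step, g_norm, eps, smoothness):
    """Lower bound on J(theta + step * g_hat) - J(theta) for an L-smooth J,
    assuming ||g_hat - grad J(theta)|| <= eps."""
    return step * g_norm * (g_norm - eps) - 0.5 * smoothness * step**2 * g_norm**2


def min_batch_size(g_norm, variance_bound, delta):
    """Smallest batch size for which eps < g_norm (hypothetical helper)."""
    return math.floor(variance_bound / (delta * g_norm**2)) + 1


def demo():
    """Toy check on J(theta) = -0.5 * L * ||theta||^2, which is L-smooth."""
    smoothness = 2.0
    theta = np.array([1.0, -2.0])
    objective = lambda t: -0.5 * smoothness * float(t @ t)
    true_grad = -smoothness * theta
    noise = np.array([0.3, -0.4])          # estimation error with norm 0.5
    eps = float(np.linalg.norm(noise))
    g_hat = true_grad + noise              # noisy gradient estimate
    g_norm = float(np.linalg.norm(g_hat))
    step = safe_step_size(g_norm, eps, smoothness)
    bound = improvement_lower_bound(step, g_norm, eps, smoothness)
    improvement = objective(theta + step * g_hat) - objective(theta)
    return improvement, bound
```

In this sketch a larger batch shrinks `eps`, which permits a larger certified step; when the batch is too small relative to the gradient norm, the safe choice is not to move at all.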
Abstract
Policy Gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address a specific safety formulation, where both goals and dangers are encoded in a scalar reward signal and the learning agent is constrained to never worsen its performance, measured as the expected sum of rewards. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient…
Taxonomy
Topics: Energy, Environment, and Transportation Policies · Reinforcement Learning in Robotics
