Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret
Haitham Bou Ammar, Rasul Tutunov, Eric Eaton

TL;DR
This paper introduces a safe lifelong reinforcement learning algorithm that achieves sublinear regret, enabling agents to learn multiple tasks efficiently and safely over time, demonstrated on dynamical systems including quadrotor control.
Contribution
The paper proposes the first lifelong policy gradient method with sublinear regret that enforces safety constraints during online multi-task learning.
Findings
Achieves sublinear regret in lifelong policy search.
Validates safety and efficiency on benchmark dynamical systems.
Demonstrates effectiveness in quadrotor control applications.
Abstract
Lifelong reinforcement learning provides a promising framework for developing versatile agents that can accumulate knowledge over a lifetime of experience and rapidly learn new tasks by building upon prior knowledge. However, current lifelong learning methods exhibit non-vanishing regret as the amount of experience increases and include limitations that can lead to suboptimal or unsafe control policies. To address these issues, we develop a lifelong policy gradient learner that operates in an adversarial setting to learn multiple tasks online while enforcing safety constraints on the learned policies. We demonstrate, for the first time, sublinear regret for lifelong policy search, and validate our algorithm on several benchmark dynamical systems and an application to quadrotor control.
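To make the notion of sublinear regret under constraints concrete, here is a minimal sketch. It is not the paper's algorithm: it uses generic projected online gradient descent on toy quadratic losses, and an L2 ball stands in for a safety constraint on the parameters. All names (`project_to_ball`, `run_projected_ogd`, `hindsight_regret`), the loss, the step size, and the constants are assumptions chosen for illustration. Regret is sublinear when the cumulative regret divided by the number of rounds T shrinks as T grows.

```python
import numpy as np

def project_to_ball(theta, radius):
    """Euclidean projection onto the L2 ball (a stand-in 'safe' parameter set)."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def run_projected_ogd(targets, radius):
    """Play one constrained parameter vector per round, then see that round's loss.

    The toy loss at round t is 0.5 * ||theta - targets[t]||^2 (an assumption).
    """
    T, d = targets.shape
    theta = np.zeros(d)
    iterates = np.empty((T, d))
    for t in range(T):
        iterates[t] = theta                     # decision made before the loss is revealed
        grad = theta - targets[t]               # gradient of the quadratic loss
        step = 1.0 / np.sqrt(t + 1)             # decaying step size
        theta = project_to_ball(theta - step * grad, radius)
    return iterates

def hindsight_regret(iterates, targets, radius):
    """Cumulative loss of the iterates minus that of the best fixed feasible point."""
    losses = 0.5 * np.sum((iterates - targets) ** 2, axis=1)
    # For these quadratic losses, the best fixed point in the ball is the
    # projection of the mean target onto the ball.
    best_fixed = project_to_ball(targets.mean(axis=0), radius)
    best_losses = 0.5 * np.sum((best_fixed - targets) ** 2, axis=1)
    return float(np.sum(losses - best_losses))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    targets = rng.normal(2.0, 0.5, size=(5000, 2))
    iterates = run_projected_ogd(targets, radius=1.0)
    for horizon in (50, 500, 5000):
        regret = hindsight_regret(iterates[:horizon], targets[:horizon], 1.0)
        print(horizon, regret / horizon)        # average regret per round
```

Because every iterate is projected back onto the ball, the constraint holds at every round rather than only at convergence, which is the property the stand-in is meant to mimic.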
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control
