Directed Policy Gradient for Safe Reinforcement Learning with Human Advice
Hélène Plisnier, Denis Steckelmacher, Tim Brys, Diederik M. Roijers, Ann Nowé

TL;DR
This paper introduces Directed Policy Gradient (DPG), a method enabling safe reinforcement learning with human advice by allowing external directives to override the agent, leading to faster learning with less advice and improved safety.
Contribution
The paper proposes DPG, a novel extension of Policy Gradient that incorporates external directives for safety and efficiency in human-in-the-loop reinforcement learning.
Findings
DPG enables safe overriding of agent actions by human or backup policies.
DPG achieves faster learning with significantly less human advice.
Experimental results show DPG outperforms reward-based methods in safety and efficiency.
Abstract
Many currently deployed Reinforcement Learning agents work in an environment shared with humans, be they co-workers, users or clients. It is desirable that these agents adjust to people's preferences, learn faster thanks to their help, and act safely around them. We argue that most current approaches that learn from human feedback are unsafe: rewarding or punishing the agent a posteriori cannot immediately prevent it from wrongdoing. In this paper, we extend Policy Gradient to make it robust to external directives that would otherwise break the fundamentally on-policy nature of Policy Gradient. Our technique, Directed Policy Gradient (DPG), allows a teacher or backup policy to override the agent before it acts undesirably, while allowing the agent to leverage human advice or directives to learn faster. Our experiments demonstrate that DPG makes the agent learn much faster than…
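The abstract does not say how a directive is combined with the agent's own policy. As a rough illustration of the override idea only, the sketch below assumes the advice arrives as a probability distribution over actions and is mixed with the agent's distribution by an element-wise product followed by renormalization. That mixing rule, the function names `mix_policy` and `sample_action`, and all numbers are assumptions made for this example, not details taken from the paper, and the learning update itself is not shown.

```python
import random


def mix_policy(agent_probs, advice_probs):
    """Combine two action distributions by element-wise product, renormalized.

    Assumed mixing rule for illustration; both inputs are expected to be
    valid probability distributions over the same actions.
    """
    mixed = [p * a for p, a in zip(agent_probs, advice_probs)]
    total = sum(mixed)
    if total == 0.0:
        # The advice excludes every action the agent gives mass to:
        # fall back to the advice alone so the directive still wins.
        return list(advice_probs)
    return [m / total for m in mixed]


def sample_action(probs, rng=random):
    """Sample an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical learned policy over three actions.
agent_probs = [0.7, 0.2, 0.1]

# Uniform advice carries no information and leaves the policy unchanged.
no_advice = [1 / 3, 1 / 3, 1 / 3]

# A directive that forbids action 0 (e.g. an unsafe move).
forbid_action_0 = [0.0, 0.5, 0.5]

# A directive that forces action 2 (a hard override).
force_action_2 = [0.0, 0.0, 1.0]

unchanged = mix_policy(agent_probs, no_advice)
safe = mix_policy(agent_probs, forbid_action_0)
forced = mix_policy(agent_probs, force_action_2)
action = sample_action(forced)
```

Under this assumed rule, a zero in the advice distribution removes an action before the agent acts, which is the kind of immediate prevention that a posteriori rewards cannot provide, while uniform advice leaves the agent's behavior untouched.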
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
Methods: Directed Policy Gradient (extension of Policy Gradient)
