PNAct: Crafting Backdoor Attacks in Safe Reinforcement Learning
Weiran Guo, Guanjun Liu, Ziyuan Zhou, and Ling Wang

TL;DR
This paper introduces PNAct, a novel backdoor attack framework targeting Safe Reinforcement Learning, demonstrating its effectiveness and highlighting security vulnerabilities in safety-constrained RL systems.
Contribution
It proposes the first attack framework in Safe RL that uses both positive and negative action samples to implant backdoors, supported by theoretical analysis and experimental validation.
Findings
PNAct effectively implants backdoors in Safe RL agents.
The attack can manipulate agents into unsafe actions.
Experimental results confirm the attack's success across the evaluation metrics the paper introduces for backdoor attacks in Safe RL.
Abstract
Reinforcement Learning (RL) is widely used in tasks where agents interact with an environment to maximize rewards. Building on this foundation, Safe Reinforcement Learning (Safe RL) incorporates a cost metric alongside the reward metric, ensuring that agents adhere to safety constraints during decision-making. In this paper, we identify that Safe RL is vulnerable to backdoor attacks, which can manipulate agents into performing unsafe actions. First, we introduce the relevant concepts and evaluation metrics for backdoor attacks in Safe RL. We then propose PNAct, the first attack framework in the Safe RL field that uses both Positive and Negative Action samples to implant backdoors, where positive action samples provide reference actions and negative action samples indicate actions to be avoided. We theoretically analyze the properties of PNAct and design an attack algorithm. Finally, we…
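To make the reward-plus-cost setting in the abstract concrete, here is a minimal sketch of how one episode might be scored in Safe RL. It is an illustration only, not the paper's evaluation code: the names `Step`, `discounted_sum`, `evaluate_episode`, and the `cost_limit` parameter are hypothetical, and the simple "cumulative cost must not exceed a limit" constraint is an assumed, commonly used form.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One environment step (hypothetical container for this sketch)."""
    reward: float  # task reward received at this step
    cost: float    # safety cost at this step, e.g. 1.0 on a violation


def discounted_sum(values: List[float], gamma: float) -> float:
    """Return sum_t gamma^t * values[t], accumulated from the last step back."""
    total = 0.0
    for v in reversed(values):
        total = v + gamma * total
    return total


def evaluate_episode(steps: List[Step], gamma: float = 1.0,
                     cost_limit: float = 1.0) -> Dict[str, object]:
    """Score an episode on both metrics and check the assumed cost constraint."""
    episode_return = discounted_sum([s.reward for s in steps], gamma)
    episode_cost = discounted_sum([s.cost for s in steps], gamma)
    return {
        "return": episode_return,
        "cost": episode_cost,
        "safe": episode_cost <= cost_limit,  # assumed constraint form
    }


if __name__ == "__main__":
    episode = [Step(1.0, 0.0), Step(2.0, 1.0), Step(0.5, 1.0)]
    print(evaluate_episode(episode, gamma=1.0, cost_limit=1.0))
```

Under this framing, a backdoor attack of the kind the abstract describes would aim to push the agent toward unsafe actions, which would show up as a higher episode cost when the trigger is present. How the paper actually measures this is not given in the text above.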
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Reinforcement Learning in Robotics · Smart Grid Security and Resilience
