Certifying Safety in Reinforcement Learning under Adversarial Perturbation Attacks
Junlin Wu, Hussein Sibai, Yevgeniy Vorobeychik

TL;DR
This paper introduces a framework for certifying the safety of reinforcement learning policies under adversarial input perturbations. It defines safety properties over the true state of a POMDP and uses true-state information during training.
Contribution
It presents the first method for certifying the safety of partially-supervised reinforcement learning (PSRL) policies against adversarial input perturbations, and introduces two adversarial training approaches that use knowledge of the true state.
Findings
Effective safety certification in adversarial environments.
Improved safety guarantees with high nominal reward.
Enhanced true state prediction accuracy.
Abstract
Function approximation has enabled remarkable advances in applying reinforcement learning (RL) techniques in environments with high-dimensional inputs, such as images, in an end-to-end fashion, mapping such inputs directly to low-level control. Nevertheless, these have proved vulnerable to small adversarial input perturbations. A number of approaches for improving or certifying robustness of end-to-end RL to adversarial perturbations have emerged as a result, focusing on cumulative reward. However, what is often at stake in adversarial scenarios is the violation of fundamental properties, such as safety, rather than the overall reward that combines safety with efficiency. Moreover, properties such as safety can only be defined with respect to true state, rather than the high-dimensional raw inputs to end-to-end policies. To disentangle nominal efficiency and adversarial safety, we…
Taxonomy
Topics: Adversarial Robustness in Machine Learning
