Reward-Preserving Attacks For Robust Reinforcement Learning
Lucas Schott, Elies Gherbi, Hatem Hajri, Sylvain Lamprier

TL;DR
This paper introduces reward-preserving adversarial attacks for reinforcement learning, which adapt perturbation strength dynamically so that a specified fraction of the nominal-to-worst-case return gap remains achievable, leading to more robust policies.
Contribution
It proposes a novel adaptive attack method using a learned critic to preserve reward levels, improving robustness over fixed or random perturbation strategies.
Findings
Training with adaptive attacks outperforms adversarial training with fixed or uniformly sampled perturbation radii.
Policies trained with adaptive attacks are robust across various perturbation magnitudes.
The method maintains nominal performance while enhancing robustness.
Abstract
Adversarial training in reinforcement learning (RL) is challenging because perturbations cascade through trajectories and compound over time, making fixed-strength attacks either overly destructive or too conservative. We propose α-reward-preserving attacks, which adapt adversarial strength so that an α fraction of the nominal-to-worst-case return gap remains achievable at each state. In deep RL, perturbation magnitudes are selected dynamically, using a learned critic that estimates the expected return of α-reward-preserving rollouts. For intermediate values of α, this adaptive training yields policies that are robust across a wide range of perturbation magnitudes while preserving nominal performance, outperforming fixed-radius and uniformly sampled-radius adversarial training.
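The selection rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is one possible reading, assuming a discrete grid of candidate radii and a critic that has already produced one return estimate per radius for the current state. The function name `reward_preserving_radius`, the parameter `alpha` (the preserved fraction of the gap), and the use of the lowest estimate on the grid as the worst case are all hypothetical choices made for illustration.

```python
from typing import Sequence


def reward_preserving_radius(
    radii: Sequence[float],
    critic_returns: Sequence[float],
    alpha: float,
) -> float:
    """Return the largest candidate radius whose estimated return still
    preserves an `alpha` fraction of the nominal-to-worst-case gap.

    Assumptions (hypothetical, for illustration only):
      * radii[0] is the unperturbed case, so critic_returns[0] is nominal;
      * the worst case is the lowest estimate over the candidate grid.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(radii) == 0 or len(radii) != len(critic_returns):
        raise ValueError("radii and critic_returns must be non-empty and aligned")

    nominal = critic_returns[0]
    worst = min(critic_returns)
    # Return level that keeps an alpha fraction of the gap achievable.
    threshold = worst + alpha * (nominal - worst)

    # The nominal radius always satisfies the threshold, so this is non-empty.
    feasible = [r for r, q in zip(radii, critic_returns) if q >= threshold]
    return max(feasible)


if __name__ == "__main__":
    radii = [0.0, 0.1, 0.2, 0.3]
    estimates = [100.0, 80.0, 50.0, 0.0]
    for a in (0.0, 0.5, 1.0):
        print(a, reward_preserving_radius(radii, estimates, a))
```

Under this reading, `alpha = 1` allows no loss of return and reduces to the unperturbed case, while `alpha = 0` permits the strongest attack on the grid. Intermediate values trade attack strength against the return that remains achievable.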
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Reinforcement Learning in Robotics · Smart Grid Security and Resilience
