Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents
Stephen Casper, Taylor Killian, Gabriel Kreiman, Dylan Hadfield-Menell

TL;DR
This paper introduces white-box adversarial policies that leverage internal agent states to more effectively identify vulnerabilities in RL agents and language models, outperforming black-box methods.
Contribution
The paper presents the concept of white-box adversarial policies utilizing internal states, and demonstrates their effectiveness in attacking RL agents and language models.
Findings
White-box policies outperform black-box controls in attack success.
Access to internal states improves attack effectiveness.
White-box adversaries achieve higher initial and asymptotic performance than black-box controls.
Abstract
Adversarial examples can be useful for identifying vulnerabilities in AI systems before they are deployed. In reinforcement learning (RL), adversarial policies can be developed by training an adversarial agent to minimize a target agent's rewards. Prior work has studied black-box versions of these attacks where the adversary only observes the world state and treats the target agent as any other part of the environment. However, this does not take into account additional structure in the problem. In this work, we study white-box adversarial policies and show that having access to a target agent's internal state can be useful for identifying its vulnerabilities. We make two contributions. (1) We introduce white-box adversarial policies where an attacker observes both a target's internal state and the world state at each timestep. We formulate ways of using these policies to attack agents…
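The difference between the black-box and white-box settings described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The function names, the use of plain concatenation to combine the two inputs, and the zero-sum reward are all assumptions made for illustration; the abstract only states that the attacker observes both the target's internal state and the world state, and that an adversary is trained to minimize the target's rewards.

```python
from typing import List, Sequence


def black_box_observation(world_state: Sequence[float]) -> List[float]:
    """Black-box adversary: observes only the world state."""
    return list(world_state)


def white_box_observation(
    world_state: Sequence[float],
    target_internal_state: Sequence[float],
) -> List[float]:
    """White-box adversary: observes the world state plus the target's
    internal state at the same timestep.

    Assumption: the two vectors are simply concatenated. The source does
    not specify how the inputs are combined.
    """
    return list(world_state) + list(target_internal_state)


def adversary_reward(target_reward: float) -> float:
    """Assumption: a zero-sum reward, so that maximizing the adversary's
    reward minimizes the target's reward."""
    return -target_reward


if __name__ == "__main__":
    world = [0.2, -1.0, 0.7]   # hypothetical world-state features
    internal = [0.05, 0.9]     # hypothetical target activations
    print(len(black_box_observation(world)))            # 3 features
    print(len(white_box_observation(world, internal)))  # 5 features
    print(adversary_reward(1.5))                        # -1.5
```

In this sketch the only change between the two settings is the adversary's input: the white-box observation is a strict superset of the black-box one, so any gain in attack success would come from the extra information about the target.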
Taxonomy
Topics: Adversarial Robustness in Machine Learning
