Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents

Stephen Casper; Taylor Killian; Gabriel Kreiman; Dylan Hadfield-Menell

arXiv:2209.02167·cs.AI·October 17, 2023·1 cites

TL;DR

This paper introduces white-box adversarial policies that leverage internal agent states to more effectively identify vulnerabilities in RL agents and language models, outperforming black-box methods.

Contribution

The paper introduces white-box adversarial policies, which observe a target agent's internal state in addition to the world state, and demonstrates their effectiveness in attacking RL agents and language models.

Findings

01. White-box policies outperform black-box controls in attack success.

02. Access to the target's internal state improves attack effectiveness.

03. White-box adversaries reach higher initial and asymptotic performance than black-box controls.

Abstract

Adversarial examples can be useful for identifying vulnerabilities in AI systems before they are deployed. In reinforcement learning (RL), adversarial policies can be developed by training an adversarial agent to minimize a target agent's rewards. Prior work has studied black-box versions of these attacks where the adversary only observes the world state and treats the target agent as any other part of the environment. However, this does not take into account additional structure in the problem. In this work, we study white-box adversarial policies and show that having access to a target agent's internal state can be useful for identifying its vulnerabilities. We make two contributions. (1) We introduce white-box adversarial policies where an attacker observes both a target's internal state and the world state at each timestep. We formulate ways of using these policies to attack agents…
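The difference between the black-box and white-box settings described in the abstract can be sketched as a difference in what the adversary observes at each timestep. The example below is a minimal illustration, not the authors' implementation: the toy target network, the choice of hidden activations as the "internal state", and the names `target_forward`, `adversary_observation`, and `adversary_reward` are all assumptions made for this sketch.

```python
import math
import random


def target_forward(world_state, w1, w2):
    """Hypothetical target policy: one tanh hidden layer, then linear logits.

    Returns the action logits and the hidden activations. Treating the
    hidden activations as the target's "internal state" is an assumption.
    """
    hidden = [math.tanh(sum(w * x for w, x in zip(row, world_state))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return logits, hidden


def adversary_observation(world_state, target_internal=None):
    """Build the adversary's input for one timestep.

    Black-box: the world state only.
    White-box: the world state concatenated with the target's internal state.
    """
    if target_internal is None:
        return list(world_state)
    return list(world_state) + list(target_internal)


def adversary_reward(target_reward):
    """The adversary is trained to minimize the target's reward."""
    return -target_reward


# Toy dimensions (assumed): 4-dim world state, 8 hidden units, 3 actions.
rng = random.Random(0)
world_state = [rng.gauss(0, 1) for _ in range(4)]
w1 = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
w2 = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)]

logits, hidden = target_forward(world_state, w1, w2)
black_box_obs = adversary_observation(world_state)
white_box_obs = adversary_observation(world_state, hidden)

print(len(black_box_obs), len(white_box_obs))
```

In this sketch the adversary's learning algorithm is left out entirely; only the observation and reward construction change between the two settings, which is the structural point the abstract makes.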

Taxonomy

Topics: Adversarial Robustness in Machine Learning