Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning
Alex Beutel, Kai Xiao, Johannes Heidecke, Lilian Weng

TL;DR
This paper introduces a novel multi-step reinforcement learning approach combined with language models to generate highly diverse and effective red team attacks, improving upon prior methods that focused on either diversity or effectiveness.
Contribution
The paper presents a new method that decomposes red teaming into goal generation and attack execution, utilizing RL and language models to enhance attack diversity and success rate.
Findings
Generated attacks are more diverse than previous methods.
Attacks successfully identify model vulnerabilities.
Approach effective for prompt injection and unsafe response elicitation.
Abstract
Automated red teaming can discover rare model failures and generate challenging examples that can be used for training or evaluation. However, a core challenge in automated red teaming is ensuring that the attacks are both diverse and effective. Prior methods typically succeed in optimizing either for diversity or for effectiveness, but rarely both. In this paper, we provide methods that enable automated red teaming to generate a large number of diverse and successful attacks. Our approach decomposes the task into two steps: (1) automated methods for generating diverse attack goals and (2) generating effective attacks for those goals. While we provide multiple straightforward methods for generating diverse goals, our key contributions are to train an RL attacker that both follows those goals and generates diverse attacks for those goals. First, we demonstrate that it is easy to use a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDigital Mental Health Interventions · Behavioral Health and Interventions · Mental Health Research Topics
