Training a General Purpose Automated Red Teaming Model
Aishwarya Padmakumar, Leon Derczynski, Traian Rebedea, Christopher Parisien

TL;DR
This paper introduces a pipeline for training a general-purpose red teaming model, enabling small models to generate adversarial attacks for a range of goals, including unseen ones, without relying on a pre-existing evaluator at training time.
Contribution
The authors present a novel training pipeline that allows models to generalize to arbitrary adversarial objectives, expanding beyond safety-focused red teaming.
Findings
Finetuning small models such as Qwen3-8B with this pipeline improves their attack generation.
The pipeline enables generalization to unseen adversarial goals.
Training does not depend on a pre-existing evaluator being available.
Abstract
Automated methods for red teaming LLMs are an important tool for identifying LLM vulnerabilities that may not be covered by static benchmarks, allowing for more thorough probing. They can also adapt to each specific LLM to discover weaknesses unique to it. Most current automated red teaming methods are intended for safety and content moderation. They therefore use content safety models as evaluators and optimize for circumventing them, and so have not been tested with other adversarial intents that such evaluators do not typically capture. We propose a pipeline for training a red teaming model that can generalize to arbitrary adversarial goals, including objectives it has not been directly trained on, and that does not depend on a pre-existing evaluator being available at training time. We demonstrate that finetuning small models, such as Qwen3-8B, using this pipeline…
