Training a General Purpose Automated Red Teaming Model

Aishwarya Padmakumar; Leon Derczynski; Traian Rebedea; Christopher Parisien

arXiv:2604.23067 · cs.CR · April 28, 2026

TL;DR

This paper introduces a pipeline for training a general-purpose red teaming model. It enables small models to generate adversarial attacks for a range of goals, including unseen ones, without relying on a pre-existing evaluator.

Contribution

The authors present a novel training pipeline that allows models to generalize to arbitrary adversarial objectives, expanding beyond safety-focused red teaming.

Findings

01. Finetuning small models like Qwen3-8B improves attack generation.

02. The pipeline enables generalization to unseen adversarial goals.

03. Training does not depend on a pre-existing evaluator.

Abstract

Automated methods for red teaming LLMs are an important tool for identifying LLM vulnerabilities that may not be covered by static benchmarks, allowing for more thorough probing. They can also adapt to each specific LLM to discover weaknesses unique to it. Most current automated red teaming methods are intended for safety and content moderation. They therefore use content safety models as evaluators and optimize for circumventing them, and so have not been tested with adversarial intents that these evaluators do not typically capture. We propose a pipeline for training a red teaming model that can generalize to arbitrary adversarial goals, including objectives it has not been directly trained on, and that does not depend on a pre-existing evaluator being available at training time. We demonstrate that finetuning small models, such as Qwen3-8B, using this pipeline…
