Automated Progressive Red Teaming
Bojian Jiang, Yi Jing, Tianhao Shen, Tong Wu, Qing Yang, Deyi Xiong

TL;DR
This paper introduces Automated Progressive Red Teaming (APRT), a learnable framework that systematically uncovers vulnerabilities in large language models through multi-round adversarial interactions, improving safety assessments.
Contribution
The paper proposes a novel, effectively learnable framework for automated red teaming that uses three interconnected modules to explore and exploit LLM vulnerabilities more efficiently.
Findings
APRT elicits 54% unsafe responses from Llama-3-8B-Instruct
APRT elicits 50% unsafe responses from GPT-4o
APRT demonstrates transferability across different LLMs
Abstract
Ensuring the safety of large language models (LLMs) is paramount, yet identifying potential vulnerabilities is challenging. While manual red teaming is effective, it is time-consuming, costly, and lacks scalability. Automated red teaming (ART) offers a more cost-effective alternative, automatically generating adversarial prompts to expose LLM vulnerabilities. However, current ART efforts lack a robust framework that explicitly frames red teaming as an effectively learnable task. To address this gap, we propose Automated Progressive Red Teaming (APRT) as an effectively learnable framework. APRT leverages three core modules: an Intention Expanding LLM that generates diverse initial attack samples, an Intention Hiding LLM that crafts deceptive prompts, and an Evil Maker to manage prompt diversity and filter ineffective samples. The three modules collectively and progressively…
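The data flow among the three modules named in the abstract can be pictured with a minimal structural sketch. This is not the authors' implementation: the function names (`progressive_round`, `run_rounds`), the callable signatures, and the exact-match deduplication used as a stand-in for diversity management are all assumptions made for illustration. The abstract does not specify how the modules are trained or how they exchange samples, so the three modules appear here only as placeholder callables.

```python
from typing import Callable, List

# Hypothetical stand-ins for the three modules named in the abstract.
Expand = Callable[[str], List[str]]   # "Intention Expanding" role: seed -> candidates
Hide = Callable[[str], str]           # "Intention Hiding" role: candidate -> rewritten prompt
Keep = Callable[[str], bool]          # "Evil Maker" role: decide whether a sample is kept


def progressive_round(seeds: List[str], expand: Expand, hide: Hide, keep: Keep) -> List[str]:
    """Run one round: expand each seed, rewrite each candidate, then dedupe and filter."""
    survivors: List[str] = []
    seen = set()
    for seed in seeds:
        for candidate in expand(seed):
            rewritten = hide(candidate)
            if rewritten in seen:       # assumed diversity control: drop exact duplicates
                continue
            seen.add(rewritten)
            if keep(rewritten):         # assumed effectiveness filter
                survivors.append(rewritten)
    return survivors


def run_rounds(seeds: List[str], expand: Expand, hide: Hide, keep: Keep, rounds: int = 3) -> List[str]:
    """Feed each round's survivors back in as the next round's seeds."""
    pool = list(seeds)
    for _ in range(rounds):
        next_pool = progressive_round(pool, expand, hide, keep)
        if not next_pool:               # stop early if nothing survives the filter
            break
        pool = next_pool
    return pool


if __name__ == "__main__":
    # Toy string transforms only; no model is called.
    toy_expand = lambda s: [s + "-a", s + "-b"]
    toy_hide = str.upper
    toy_keep = lambda p: not p.endswith("-B")
    print(run_rounds(["x"], toy_expand, toy_hide, toy_keep, rounds=2))
```

In this sketch the "progressive" aspect is modeled only as feeding one round's surviving samples into the next; any training or model updates between rounds are outside what the abstract excerpt states and are not represented.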
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications
Methods: Difficulty-Aware Rejection Tuning
