Automated Progressive Red Teaming

Bojian Jiang; Yi Jing; Tianhao Shen; Tong Wu; Qing Yang; Deyi Xiong

arXiv:2407.03876 · cs.CR · December 24, 2024 · 1 citation

TL;DR

This paper introduces Automated Progressive Red Teaming (APRT), a learnable framework that systematically uncovers vulnerabilities in large language models through multi-round adversarial interactions, improving safety assessments.

Contribution

The paper proposes a novel, effectively learnable framework for automated red teaming that uses three interconnected modules to explore and exploit LLM vulnerabilities more efficiently.

Findings

01. APRT elicits 54% unsafe responses from Llama-3-8B-Instruct.

02. APRT elicits 50% unsafe responses from GPT-4o.

03. APRT's attacks transfer across different LLMs.

Abstract

Ensuring the safety of large language models (LLMs) is paramount, yet identifying potential vulnerabilities is challenging. Manual red teaming is effective, but it is time-consuming, costly, and hard to scale. Automated red teaming (ART) offers a more cost-effective alternative by automatically generating adversarial prompts that expose LLM vulnerabilities. However, current ART efforts lack a robust framework that explicitly frames red teaming as an effectively learnable task. To address this gap, we propose Automated Progressive Red Teaming (APRT) as an effectively learnable framework. APRT leverages three core modules: an Intention Expanding LLM that generates diverse initial attack samples, an Intention Hiding LLM that crafts deceptive prompts, and an Evil Maker that manages prompt diversity and filters ineffective samples. The three modules collectively and progressively…
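
The three-module loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`dedupe`, `red_team_round`, `run_rounds`, and the callable type aliases) is hypothetical, the LLMs are replaced by plain callables, exact-duplicate removal plus a judge-based filter stands in for the Evil Maker, and reusing effective prompts as the next round's seeds is an assumption, since the abstract does not say how samples carry over between rounds or how the modules are trained.

```python
from typing import Callable, List, Tuple

# Hypothetical placeholders, not identifiers from the APRT repository.
Expand = Callable[[str], List[str]]  # stands in for the Intention Expanding LLM
Hide = Callable[[str], str]          # stands in for the Intention Hiding LLM
Target = Callable[[str], str]        # the model under test
Judge = Callable[[str], bool]        # True if a response is judged unsafe


def dedupe(prompts: List[str]) -> List[str]:
    """Crude diversity control: drop exact duplicates, keep first-seen order."""
    seen = set()
    unique = []
    for prompt in prompts:
        if prompt not in seen:
            seen.add(prompt)
            unique.append(prompt)
    return unique


def red_team_round(
    seeds: List[str], expand: Expand, hide: Hide, target: Target, judge: Judge
) -> Tuple[List[str], float]:
    """Run one round and return (effective prompts, unsafe-response rate)."""
    # Expand each seed into several candidate attack samples, then deduplicate.
    candidates = dedupe([c for seed in seeds for c in expand(seed)])
    # Rewrite each candidate so its intent is less obvious.
    disguised = [hide(c) for c in candidates]
    # Keep only prompts whose responses are judged unsafe (filters ineffective ones).
    effective = [p for p in disguised if judge(target(p))]
    rate = len(effective) / len(disguised) if disguised else 0.0
    return effective, rate


def run_rounds(
    seeds: List[str],
    expand: Expand,
    hide: Hide,
    target: Target,
    judge: Judge,
    rounds: int,
) -> List[float]:
    """Repeat the round; ASSUMPTION: effective prompts seed the next round."""
    pool = list(seeds)
    rates = []
    for _ in range(rounds):
        effective, rate = red_team_round(pool, expand, hide, target, judge)
        rates.append(rate)
        if effective:
            pool = effective
    return rates


if __name__ == "__main__":
    # Toy, harmless stand-ins so the sketch runs without any model.
    toy_expand = lambda s: [s + " a", s + " b", s + " a"]
    toy_hide = lambda c: "story: " + c
    toy_target = lambda p: "REFUSE" if p.endswith("a") else "COMPLY"
    toy_judge = lambda r: r == "COMPLY"
    print(run_rounds(["x"], toy_expand, toy_hide, toy_target, toy_judge, rounds=2))
```

The per-round rate computed here is the same kind of quantity as the unsafe-response percentages reported in the findings, though the paper's exact metric and judging procedure may differ.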

Code & Models

Repositories

tjunlp-lab/aprt (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications

Methods: Difficulty-Aware Rejection Tuning