RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, Kui Ren

TL;DR
RedAgent is an autonomous multi-agent system that efficiently generates context-aware jailbreak prompts to identify vulnerabilities in large language models, significantly improving red teaming effectiveness and scalability.
Contribution
The paper introduces RedAgent, a novel multi-agent framework that models jailbreak strategies and uses self-reflection to generate effective, context-aware prompts for LLM red teaming.
Findings
RedAgent can jailbreak most black-box LLMs within five queries.
It doubles the efficiency of existing red teaming methods.
It discovered 60 severe vulnerabilities in real-world applications.
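A claim of the form "succeeds within five queries" implies a query-budget metric. The sketch below shows one way such a rate could be tallied from a per-attempt log. The function name, the log format, and the sample numbers are assumptions made for illustration; they are not taken from the paper and do not reproduce its results.

```python
from typing import Optional, Sequence


def success_rate_within_budget(
    queries_to_success: Sequence[Optional[int]], budget: int
) -> float:
    """Return the fraction of attempts that succeeded using at most `budget` queries.

    Each entry is the number of queries one attempt needed, or None if the
    attempt never succeeded. (Hypothetical log format, assumed for this sketch.)
    """
    if not queries_to_success:
        raise ValueError("need at least one attempt")
    hits = sum(1 for q in queries_to_success if q is not None and q <= budget)
    return hits / len(queries_to_success)


# Made-up log for illustration only; these are not numbers from the paper.
example_log = [2, 5, None, 3, 7, 1]
rate = success_rate_within_budget(example_log, budget=5)
print(f"success rate within 5 queries: {rate:.2f}")
```

Counting failed attempts (None) in the denominator keeps the rate from being inflated by dropping the targets that were never broken.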
Abstract
Recently, advanced Large Language Models (LLMs) such as GPT-4 have been integrated into many real-world applications like Code Copilot. These applications have significantly expanded the attack surface of LLMs, exposing them to a variety of threats. Among them, jailbreak attacks that induce toxic responses through jailbreak prompts have raised critical safety concerns. To identify these threats, a growing number of red teaming approaches simulate potential adversarial scenarios by crafting jailbreak prompts to test the target LLM. However, existing red teaming methods do not consider the unique vulnerabilities of LLMs in different scenarios, making it difficult to adjust the jailbreak prompts to find context-specific vulnerabilities. Meanwhile, these methods are limited to refining jailbreak templates using a few mutation operations, lacking the automation and scalability to adapt to…
Taxonomy
Methods: Attention Is All You Need · Adam · Label Smoothing · Linear Layer · Byte Pair Encoding · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections
