RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent

Huiyu Xu; Wenhui Zhang; Zhibo Wang; Feng Xiao; Rui Zheng; Yunhe Feng; Zhongjie Ba; Kui Ren

arXiv:2407.16667·cs.CR·July 24, 2024

TL;DR

RedAgent is an autonomous multi-agent system that efficiently generates context-aware jailbreak prompts to identify vulnerabilities in large language models, significantly improving red teaming effectiveness and scalability.

Contribution

The paper introduces RedAgent, a novel multi-agent framework that models jailbreak strategies and uses self-reflection to generate effective, context-aware prompts for LLM red teaming.
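As a rough illustration of the loop this contribution describes (pick a modeled strategy, probe the target, reflect on the outcome, reuse what was learned), the sketch below runs a query-budgeted evaluation against a toy stand-in target. It is not the authors' implementation: `StrategyMemory`, `red_team`, `toy_target`, `toy_judge`, the strategy labels, and the scoring rule are all hypothetical names and choices made for this example, and no real LLM or real prompt content is involved.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class StrategyMemory:
    """Success statistics per abstract strategy label (hypothetical structure)."""
    wins: Dict[str, int] = field(default_factory=dict)
    tries: Dict[str, int] = field(default_factory=dict)

    def record(self, strategy: str, success: bool) -> None:
        # "Reflection" here is reduced to updating counts after each query.
        self.tries[strategy] = self.tries.get(strategy, 0) + 1
        self.wins[strategy] = self.wins.get(strategy, 0) + int(success)

    def rank(self, strategies: List[str]) -> List[str]:
        # Untried strategies score 0.5, so they are explored before known failures.
        def score(s: str) -> float:
            t = self.tries.get(s, 0)
            return 0.5 if t == 0 else self.wins.get(s, 0) / t
        return sorted(strategies, key=score, reverse=True)  # stable for ties


def red_team(target: Callable[[str], str],
             judge: Callable[[str], bool],
             strategies: List[str],
             context: str,
             memory: StrategyMemory,
             max_queries: int = 5) -> Optional[int]:
    """Return the number of queries used until the judge flags a response,
    or None if the query budget runs out."""
    for n in range(1, max_queries + 1):
        strategy = memory.rank(strategies)[0]
        probe = f"[{strategy}] placeholder test case for context: {context}"
        success = judge(target(probe))
        memory.record(strategy, success)
        if success:
            return n
    return None


# Toy stand-ins (assumptions): the "target" only misbehaves for strategy C.
def toy_target(probe: str) -> str:
    return "UNSAFE" if probe.startswith("[C]") else "REFUSED"


def toy_judge(response: str) -> bool:
    return response == "UNSAFE"


memory = StrategyMemory()
labels = ["A", "B", "C"]
first_run = red_team(toy_target, toy_judge, labels, "code assistant", memory)
second_run = red_team(toy_target, toy_judge, labels, "code assistant", memory)
print(first_run, second_run)
```

In this toy setup the second run needs fewer queries than the first because the memory already ranks the successful strategy highest; that reuse of past outcomes is the only aspect of self-reflection the sketch tries to capture.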

Findings

01. RedAgent can jailbreak most black-box LLMs within five queries.

02. It doubles the efficiency of existing red teaming methods.

03. It discovered 60 severe vulnerabilities in real-world applications.

Abstract

Recently, advanced Large Language Models (LLMs) such as GPT-4 have been integrated into many real-world applications like Code Copilot. These applications have significantly expanded the attack surface of LLMs, exposing them to a variety of threats. Among them, jailbreak attacks that induce toxic responses through jailbreak prompts have raised critical safety concerns. To identify these threats, a growing number of red teaming approaches simulate potential adversarial scenarios by crafting jailbreak prompts to test the target LLM. However, existing red teaming methods do not consider the unique vulnerabilities of LLMs in different scenarios, making it difficult to adjust the jailbreak prompts to find context-specific vulnerabilities. Meanwhile, these methods are limited to refining jailbreak templates using a few mutation operations, lacking the automation and scalability to adapt to…

Taxonomy

Methods: Attention Is All You Need · Adam · Label Smoothing · Linear Layer · Byte Pair Encoding · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections