GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

Jiahao Yu; Xingwei Lin; Zheng Yu; Xinyu Xing

arXiv:2309.10253·cs.AI·June 28, 2024·21 cites

TL;DR

GPTFuzz is an automated framework that generates effective jailbreak prompts for large language models, significantly improving red-teaming efficiency and revealing vulnerabilities to harmful outputs.

Contribution

This paper introduces GPTFuzz, a novel black-box fuzzing framework that automates jailbreak prompt generation, reducing reliance on manual crafting and enhancing large language model safety testing.

Findings

01 GPTFuzz achieves an attack success rate above 90% against ChatGPT and Llama-2.

02 Automatically generated templates outperform manually crafted ones in attack success rate.

03 The framework is effective across multiple commercial and open-source LLMs.

Abstract

Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We…
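
The seed-and-mutate loop described in the abstract can be sketched as a generic fuzzing skeleton. This is a minimal illustration, not the authors' implementation: the abstract is truncated, so the uniform seed selection, the rule of keeping successful mutants in the pool (borrowed from AFL-style fuzzers), and every name here (`fuzz_templates`, `toy_mutate`, the `is_success` callback) are assumptions. The mutator is a toy word shuffle over placeholder strings and the judge is a caller-supplied stub; no model is queried.

```python
import random
from typing import Callable, List


def fuzz_templates(
    seeds: List[str],
    mutate: Callable[[str, random.Random], str],
    is_success: Callable[[str], bool],
    iterations: int,
    rng: random.Random,
) -> List[str]:
    """Run a select -> mutate -> evaluate loop over a pool of templates.

    Assumption: mutants judged successful are added back to the pool,
    as in AFL-style fuzzers. The abstract does not confirm this detail.
    """
    pool = list(seeds)  # copy so the caller's seed list is not modified
    for _ in range(iterations):
        parent = rng.choice(pool)    # uniform selection (assumption)
        child = mutate(parent, rng)  # derive a new template from the parent
        if is_success(child):        # hypothetical judge supplied by the caller
            pool.append(child)
    return pool


def toy_mutate(template: str, rng: random.Random) -> str:
    """Toy mutation operator: shuffle the words of the template."""
    words = template.split()
    rng.shuffle(words)
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)
    seeds = ["placeholder template one", "placeholder template two"]
    # A judge that accepts every mutant, so the pool grows by one per iteration.
    pool = fuzz_templates(seeds, toy_mutate, lambda t: True, 3, rng)
    print(len(pool))
```

Passing the mutator and the judge as callbacks keeps the loop independent of any particular model or scoring method, which is the part the abstract leaves unspecified.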

Code & Models

Repositories

Models

CTCT-CT2/changeway_guardrails
model · 10 downloads · 2 likes

Datasets

CTCT-CT2/ChangeMore-prompt-injection-eval
dataset · 71 downloads

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Software Engineering Research
