"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Xinyue Shen; Zeyuan Chen; Michael Backes; Yun Shen; Yang Zhang

arXiv:2308.03825·cs.CR·May 16, 2024·41 cites

Open Access · 2 Repos · 2 Models · 5 Datasets

TL;DR

This study provides a comprehensive analysis of in-the-wild jailbreak prompts targeting large language models, revealing their characteristics, attack strategies, and the limitations of current safeguards through extensive data and experiments.

Contribution

Introduces JailbreakHub, a new framework for analyzing jailbreak prompts, and offers a large dataset and evaluation of LLM vulnerabilities to improve safety measures.

Findings

01. Jailbreak prompts have evolved and spread across online communities.

02. Current safeguards do not stop jailbreak attacks, which still achieve high success rates.

03. Some jailbreak prompts remain effective for more than 240 days.

Abstract

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Label Smoothing · Adam · Residual Connection · Dense Connections · Dropout