Jailbreaking Black Box Large Language Models in Twenty Queries

Patrick Chao; Alexander Robey; Edgar Dobriban; Hamed Hassani; George J. Pappas; Eric Wong

arXiv:2310.08419 · cs.LG · July 22, 2024 · 45 cites

TL;DR

This paper introduces PAIR, an efficient black-box algorithm that generates semantic jailbreaks for large language models, often in fewer than twenty queries, exposing vulnerabilities and aiding safety evaluation.

Contribution

The paper presents PAIR, a novel iterative method inspired by social engineering that significantly improves efficiency and success rates in black-box LLM jailbreaks.

Findings

1. PAIR often requires fewer than twenty queries to produce a jailbreak
2. Achieves high success and transferability rates on various LLMs
3. Outperforms existing jailbreak algorithms in efficiency

Abstract

There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR -- which is inspired by social engineering attacks -- uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak, which is…
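The query loop described in the abstract can be sketched structurally as follows. This is a minimal illustration, not the authors' implementation: the function `iterative_refinement`, the `judge` callable, the integer score with its threshold of 10, and the three toy stubs are all assumptions introduced here, and the stubs only simulate the control flow with harmless strings rather than calling any model.

```python
from typing import Callable, List, Optional, Tuple

# One past attempt: (candidate prompt, target response, judge score).
Attempt = Tuple[str, str, int]


def iterative_refinement(
    objective: str,
    attacker: Callable[[str, List[Attempt]], str],
    target: Callable[[str], str],
    judge: Callable[[str, str, str], int],
    max_queries: int = 20,
    success_score: int = 10,
) -> Tuple[Optional[str], int]:
    """Refine a candidate prompt until the judge reports success.

    Returns (successful candidate or None, number of target queries used).
    The judge and its scoring scale are assumptions of this sketch.
    """
    history: List[Attempt] = []
    for query_count in range(1, max_queries + 1):
        # The attacker proposes a new candidate, conditioned on past attempts.
        candidate = attacker(objective, history)
        # Black-box access: only the target's text response is observed.
        response = target(candidate)
        score = judge(objective, candidate, response)
        if score >= success_score:
            return candidate, query_count
        history.append((candidate, response, score))
    return None, max_queries


# Toy stand-ins (hypothetical) that exercise the loop without any real model.
def toy_attacker(objective: str, history: List[Attempt]) -> str:
    return f"{objective} [revision {len(history)}]"


def toy_target(prompt: str) -> str:
    return "ok" if "revision 3" in prompt else "refused"


def toy_judge(objective: str, prompt: str, response: str) -> int:
    return 10 if response == "ok" else 1


result = iterative_refinement(
    "benign test objective", toy_attacker, toy_target, toy_judge
)
print(result)
```

With these stubs the loop stops on the fourth target query, which shows how the query budget (`max_queries`) bounds the cost of one refinement run.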

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- This is a sensible automated red teaming method, and it obtains comparable or better results than GCG while more closely mirroring human jailbreaks
- Ablation studies that show the importance of a few different components of the method

Weaknesses

- Ultimately, this is a fairly simple method, and there isn't much technical innovation. One could say that this paper is mainly about the system prompt in appendix B, and various experiments measuring its efficacy. The paper would benefit from more analysis into different strategies taken by the attacker LLM, whether these mirror what human red teamers try, whether smarter attacker LLMs work better (disentangled from how unfiltered they are), etc.
- The only baseline is GCG. The baselines in P

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

1. The safety of LLMs is an active research area. The proposed method could help to red team LLMs.
2. The idea of utilizing LLMs to generate jailbreaking prompts is novel.

Weaknesses

1. The technical contribution is weak. The proposed method utilizes the LLM to refine the prompt. Thus, the performance of the proposed method heavily relies on the designed system prompt and LLMs. Moreover, the proposed method is based on heuristics, i.e., there is no insight for the proposed approach. But I understand those two points could be very challenging for LLM research.
2. The evaluation is not systematic. For instance, only 50 questions are used in the evaluation. Thus, it is unclear

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The proposed method is novel and leverages the chain-of-thought and in-context learning capabilities of LLMs for red-teaming.
2. The ability to perform query-efficient red-teaming for black-box LLMs is important.
3. The jailbreak results (both direct queries and transfer) on black-box LLMs (GPT-3.5, GPT-4, Claude-1, Claude-2, PaLM-2) are quite remarkable.

Weaknesses

While I enjoyed reading the paper and found the proposed method quite neat and novel, I have several major concerns that prevented me from recommending acceptance in the current form. I look forward to the authors' rebuttal to clarify my concerns.

1. In the evaluation, it is stated that "For each of these target models, we use a temperature of zero for deterministic generation". I do not find this setting convincing, as this is not the default setting for LLMs. Moreover, a recent study <https:/

Code & Models

Repositories

patrickrchao/jailbreakingllms
Official

Datasets

AserLompo/khp-youth-mental-health-guardrail
dataset · 46 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Adversarial Robustness in Machine Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Dropout · Layer Normalization · Attention Dropout · Dense Connections · Weight Decay