Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong

TL;DR
This paper introduces PAIR, an efficient black-box algorithm that generates semantic jailbreaks for large language models in under twenty queries, exposing vulnerabilities and aiding in safety evaluation.
Contribution
The paper presents PAIR, a novel iterative method inspired by social engineering that significantly improves efficiency and success rates in black-box LLM jailbreaks.
Findings
PAIR often requires fewer than twenty queries to produce a jailbreak
Achieves high success and transferability rates on various LLMs
Outperforms existing jailbreak algorithms in efficiency
Abstract
There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR -- which is inspired by social engineering attacks -- uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak, which is…
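The query loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `iterative_refinement`, `attacker`, `target`, and `judge` are hypothetical, the separate judge with a numeric score threshold is an assumption not stated in the text above, and the three `toy_*` functions are harmless stand-ins for real models so the control flow can be run offline.

```python
from typing import Callable, List, Optional, Tuple

# One past attempt: (candidate prompt, target response, judge score).
Attempt = Tuple[str, str, int]


def iterative_refinement(
    attacker: Callable[[str, List[Attempt]], str],
    target: Callable[[str], str],
    judge: Callable[[str, str, str], int],
    objective: str,
    max_queries: int = 20,    # assumed budget, echoing "twenty queries"
    success_score: int = 10,  # assumed threshold; not given in the text
) -> Tuple[Optional[str], int]:
    """Return (successful prompt or None, number of target queries used)."""
    history: List[Attempt] = []
    for n_queries in range(1, max_queries + 1):
        # The attacker sees the objective and all earlier attempts.
        prompt = attacker(objective, history)
        # Black-box access: only the target's text output is observed.
        response = target(prompt)
        score = judge(objective, prompt, response)
        if score >= success_score:
            return prompt, n_queries
        history.append((prompt, response, score))
    return None, max_queries


# Toy stand-ins (hypothetical): no real model is called.
def toy_attacker(objective: str, history: List[Attempt]) -> str:
    return f"attempt {len(history) + 1}"


def toy_target(prompt: str) -> str:
    return f"echo: {prompt}"


def toy_judge(objective: str, prompt: str, response: str) -> int:
    return 10 if prompt == "attempt 3" else 1
```

With the toy stand-ins, `iterative_refinement(toy_attacker, toy_target, toy_judge, "toy objective")` stops on the third query. Keeping the three roles as plain callables is a design choice of this sketch: it makes the query budget and the stopping rule easy to inspect independently of any particular model.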
Peer Reviews
Decision·Submitted to ICLR 2024
- This is a sensible automated red teaming method, and it obtains comparable or better results than GCG while more closely mirroring human jailbreaks
- Ablation studies that show the importance of a few different components of the method
- Ultimately, this is a fairly simple method, and there isn't much technical innovation. One could say that this paper is mainly about the system prompt in appendix B, and various experiments measuring its efficacy. The paper would benefit from more analysis into different strategies taken by the attacker LLM, whether these mirror what human red teamers try, whether smarter attacker LLMs work better (disentangled from how unfiltered they are), etc.
- The only baseline is GCG. The baselines in P
1. The safety of LLMs is an active research area. The proposed method could help to red team LLMs.
2. The idea of utilizing LLMs to generate jailbreaking prompts is novel.
1. The technical contribution is weak. The proposed method utilizes the LLM to refine the prompt. Thus, the performance of the proposed method heavily relies on the designed system prompt and LLMs. Moreover, the proposed method is based on heuristics, i.e., there is no insight for the proposed approach. But I understand those two points could be very challenging for LLM research.
2. The evaluation is not systematic. For instance, only 50 questions are used in the evaluation. Thus, it is unclear
1. The proposed method is novel and leverages the chain-of-thought and in-context learning capabilities of LLMs for red-teaming.
2. The ability to perform query-efficient red-teaming for black-box LLMs is important.
3. The jailbreak results (both direct queries and transfer) on black-box LLMs (GPT-3.5, GPT-4, Claude-1, Claude-2, PaLM-2) are quite remarkable.
While I enjoyed reading the paper and found the proposed method quite neat and novel, I have several major concerns that prevented me from recommending acceptance in the current form. I look forward to the authors' rebuttal to clarify my concerns.
1. In the evaluation, it is stated that "For each of these target models, we use a temperature of zero for deterministic generation". I do not find this setting convincing, as this is not the default setting for LLMs. Moreover, a recent study <https:/
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Adversarial Robustness in Machine Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Dropout · Layer Normalization · Attention Dropout · Dense Connections · Weight Decay
