Endless Jailbreaks with Bijection Learning
Brian R.Y. Huang, Maximilian Li, Leonard Tang

TL;DR
This paper introduces bijection learning, a novel attack method that exploits encoding tricks to bypass safety measures in large language models, revealing scale-dependent vulnerabilities.
Contribution
The work presents bijection learning as a new, automated attack technique that uncovers safety vulnerabilities in frontier language models using controlled encoding complexity.
Findings
Bijection learning effectively bypasses safety mechanisms in various LLMs.
More capable models are more vulnerable to bijection attacks.
Attack effectiveness correlates with encoding complexity parameters.
Abstract
Despite extensive safety measures, LLMs are vulnerable to adversarial inputs, or jailbreaks, which can elicit unsafe behaviors. In this work, we introduce bijection learning, a powerful attack algorithm which automatically fuzzes LLMs for safety vulnerabilities using randomly-generated encodings whose complexity can be tightly controlled. We leverage in-context learning to teach models bijective encodings, pass encoded queries to the model to bypass built-in safety mechanisms, and finally decode responses back into English. Our attack is extremely effective on a wide range of frontier language models. Moreover, by controlling complexity parameters such as number of key-value mappings in the encodings, we find a close relationship between the capability level of the attacked LLM and the average complexity of the most effective bijection attacks. Our work highlights that new…
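The encoding step the abstract describes can be illustrated with a minimal sketch of a random letter-to-letter bijection whose complexity is set by one parameter. This is not the authors' implementation. The function names, the `num_remapped` parameter (used here as a stand-in for the "number of key-value mappings"), and the restriction to lowercase ASCII letters are all assumptions made for illustration. Only the encode/decode round trip is shown.

```python
import random
import string
from typing import Dict, Optional


def make_bijection(num_remapped: int, seed: Optional[int] = None) -> Dict[str, str]:
    """Build a random bijection on a-z that moves exactly `num_remapped` letters.

    Hypothetical helper: `num_remapped` is an assumed complexity knob.
    All other letters map to themselves.
    """
    letters = list(string.ascii_lowercase)
    if not 0 <= num_remapped <= len(letters) or num_remapped == 1:
        # A single letter cannot be moved without breaking bijectivity.
        raise ValueError("num_remapped must be 0 or between 2 and 26")
    rng = random.Random(seed)
    moved = rng.sample(letters, num_remapped)
    mapping = {c: c for c in letters}
    # Send each chosen letter to the next one in a random cycle, so every
    # chosen letter changes and the map stays one-to-one.
    for i, src in enumerate(moved):
        mapping[src] = moved[(i + 1) % num_remapped]
    return mapping


def invert(mapping: Dict[str, str]) -> Dict[str, str]:
    """Return the inverse map, which exists because the map is bijective."""
    return {v: k for k, v in mapping.items()}


def apply_mapping(text: str, mapping: Dict[str, str]) -> str:
    """Substitute each character; characters outside the map pass through."""
    return "".join(mapping.get(ch, ch) for ch in text)


mapping = make_bijection(num_remapped=10, seed=0)
encoded = apply_mapping("hello world", mapping)
decoded = apply_mapping(encoded, invert(mapping))
```

Because the map is a bijection, decoding with the inverse always recovers the original text. A single random cycle is the simplest way to guarantee that every chosen letter moves; it does not sample uniformly from all possible bijections with that many remapped letters.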
Peer Reviews
Decision: ICLR 2025 Poster
1. The algorithm is clear, simple, and admits random sampling for endless bijections. 2. The analysis is comprehensive. I really appreciated the scaling analyses for 1) the n in best-of-n and 2) the ASR vs model capabilities frontier showing that more capable models may be more susceptible.
1. The main ideas presented in this paper have been identified in prior works under mismatched generalization [1] and lack of robustness to out-of-fine-tuning distribution prompts such as low-resource languages [2,3]. [1] also makes the observation that transformation-based jailbreaks benefit from increasing model scale. As such, this paper extends these ideas rather than introduces them. 2. I believe the comparison in Figure 3 to the baselines is not apples-to-apples since bijection learning p
1. ASR of the proposed method is high on many frontier LLMs. 2. The authors did comprehensive experiments to verify the effectiveness of their method. The results reported in Section 3.3 are interesting.
1. The novelty of this work is questionable given many existing cipher-based jailbreaking attacks. It seems the only difference between this paper and existing works is that this paper proposes to use a system message to customize general cipher encodings. I'm not confident about whether the contributions are enough for ICLR. 2. Comparisons between many other cipher-based jailbreaking attacks are missing, including but not limited to: [1] When “Competency” in Reasoning Opens the Door to Vulnerab
The authors present an interesting attack that demonstrates the surprising ability to scale with LLM power: stronger LLMs appear more susceptible to this attack. Moreover, the proposed attack achieves key desiderata of jailbreaks: against a black-box target, universal/automatable, and scalable. These qualities make "bijection attacks" a valuable benchmark for LLM developers to consider when evaluating safety.
In my opinion, this paper does not have clear weaknesses. However, given the state of jailbreaking research, I do not think that this style of attack paper is scientifically or technically exciting. To change my opinion, I would like to see some deeper technical insights + experiments, possibly with the authors' proposed defense strategies in Section 5 --- but this may be unreasonably ambitious in the rebuttal time frame. While my impression leans on the negative side, I am okay with accepting t
Taxonomy
Topics: Machine Learning and Data Classification · Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications
Methods: Sparse Evolutionary Training
