Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Jingtong Su, Julia Kempe, Karen Ullrich

TL;DR
This paper offers a statistical analysis of jailbreaking in large language models (LLMs), revealing inherent vulnerabilities in preference alignment and proposing E-RLHF, a modified training method that improves safety at no extra cost.
Contribution
The paper provides a theoretical framework that explains jailbreaking and introduces E-RLHF, a simple modification to RLHF that improves the safety of aligned models.
Findings
Pretrained LLMs tend to mimic harmful behaviors present in training data.
Jailbreaking is fundamentally unpreventable under reasonable assumptions.
E-RLHF outperforms RLHF on alignment benchmarks without reducing model performance.
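The first finding can be illustrated with a toy example. The sketch below is not the paper's framework or proof; it only shows the elementary fact that a maximum-likelihood categorical model reproduces the empirical frequencies of its training data, so any harmful share in the corpus reappears in the fitted model. The corpus, the labels, and the `fit_mle` helper are hypothetical and exist only for illustration.

```python
from collections import Counter


def fit_mle(responses):
    """Fit a categorical model by maximum likelihood.

    For a categorical distribution, the maximum-likelihood estimate of
    each outcome's probability is its empirical frequency in the data.
    """
    counts = Counter(responses)
    total = len(responses)
    return {label: n / total for label, n in counts.items()}


# Hypothetical toy corpus: 2 of 10 observed responses are harmful.
corpus = ["harmless"] * 8 + ["harmful"] * 2

model = fit_mle(corpus)
print(model["harmful"])  # 0.2
```

In this toy setting the fitted model assigns the harmful response exactly the share it has in the corpus, which is the intuition behind the mimicry finding.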
Abstract
Large language models (LLMs) are trained on a deluge of text data with limited quality control. As a result, LLMs can exhibit unintended or even harmful behaviours, such as leaking information or producing fake news or hate speech. Countermeasures, commonly referred to as preference alignment, include fine-tuning the pretrained LLMs with carefully crafted text examples of desired behaviour. Even then, empirical evidence shows that preference-aligned LLMs can be enticed into harmful behaviour. This so-called jailbreaking of LLMs is typically achieved by adversarially modifying the input prompt to the LLM. Our paper provides theoretical insights into the phenomenon of preference alignment and jailbreaking from a statistical perspective. Under our framework, we first show that pretrained LLMs will mimic harmful behaviour if present in the training corpus. Under that same framework, we then introduce a…
Taxonomy
Topics: Law, Economics, and Judicial Systems · Law, AI, and Intellectual Property · Artificial Intelligence in Law
