[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs

Abhinav Rao; Monojit Choudhury; Somak Aditya

arXiv:2406.12702 · cs.CL · June 24, 2024

TL;DR

This paper identifies two fundamental paradoxes in jailbreak detection for large language models: no perfect jailbreak classifier can exist, and a weaker model cannot reliably detect whether a stronger model has been jailbroken.

Contribution

It formally proves two paradoxes of jailbreak detection in foundation models and illustrates them with a short case study on Llama and GPT-4o.

Findings

01

Perfect jailbreak classifiers cannot exist.

02

A weaker model cannot consistently detect whether a stronger (Pareto-dominant) model has been jailbroken.

03

These theoretical limits carry practical repercussions for securing foundation models.

Abstract

We introduce two paradoxes concerning the jailbreaking of foundation models. First, it is impossible to construct a perfect jailbreak classifier. Second, a weaker model cannot consistently detect whether a stronger (in a Pareto-dominant sense) model is jailbroken. We provide formal proofs for these paradoxes and a short case study on Llama and GPT-4o to demonstrate them. We discuss the broader theoretical and practical repercussions of these results.
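
The phrase "stronger (in a Pareto-dominant sense)" can be made concrete with a small sketch. The code below assumes one common reading of Pareto dominance: a model dominates another if it scores at least as well on every task and strictly better on at least one. The paper's formal definition may differ, and the function name `pareto_dominates`, the task names, and all scores are hypothetical, chosen only for illustration.

```python
from typing import Mapping


def pareto_dominates(strong: Mapping[str, float], weak: Mapping[str, float]) -> bool:
    """Return True if `strong` is at least as good as `weak` on every task
    and strictly better on at least one (assumed reading of Pareto dominance)."""
    if strong.keys() != weak.keys():
        raise ValueError("both models must be scored on the same tasks")
    at_least_as_good = all(strong[task] >= weak[task] for task in strong)
    strictly_better = any(strong[task] > weak[task] for task in strong)
    return at_least_as_good and strictly_better


# Hypothetical per-task scores (higher is better); not taken from the paper.
model_a = {"task_1": 0.80, "task_2": 0.70, "task_3": 0.90}
model_b = {"task_1": 0.60, "task_2": 0.70, "task_3": 0.50}
model_c = {"task_1": 0.90, "task_2": 0.40, "task_3": 0.95}

print(pareto_dominates(model_a, model_b))  # model_a is never worse and sometimes better
print(pareto_dominates(model_a, model_c))  # neither dominates: each wins on some task
```

Under this reading, dominance is only a partial order: `model_a` and `model_c` are incomparable, so the second paradox would apply only to pairs such as (`model_b`, `model_a`) where one model is uniformly at least as capable as the other.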

Taxonomy

Topics: Law, Economics, and Judicial Systems · Legal Education and Practice Innovations

Methods: LLaMA