[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs
Abhinav Rao, Monojit Choudhury, Somak Aditya

TL;DR
This paper presents two paradoxes in detecting jailbreaks of large language models: a perfect jailbreak classifier cannot be constructed, and a weaker model cannot consistently detect whether a stronger model has been jailbroken.
Contribution
It formally proves two paradoxes about jailbreak detection in foundation models and illustrates them with a short case study on Llama and GPT-4o.
Findings
Perfect jailbreak classifiers cannot exist.
A weaker model cannot consistently detect whether a stronger (Pareto-dominant) model has been jailbroken.
Theoretical limitations have practical implications for model security.
Abstract
We introduce two paradoxes concerning jailbreak of foundation models: first, it is impossible to construct a perfect jailbreak classifier, and second, a weaker model cannot consistently detect whether a stronger (in a Pareto-dominant sense) model is jailbroken or not. We provide formal proofs for these paradoxes and a short case study on Llama and GPT-4o to demonstrate this. We discuss broader theoretical and practical repercussions of these results.
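As a rough illustration of what "perfect jailbreak classifier" means operationally, the sketch below scores a toy detector on a handful of labelled model outputs and counts its false positives and false negatives. Everything here is an assumption made for illustration: the sample texts, the labels, and the names `keyword_detector`, `error_counts`, and `is_perfect` are hypothetical and do not come from the paper. A detector with zero errors on a finite sample is also a far weaker condition than the formal notion the paper's proofs address.

```python
from typing import Callable, List, Tuple

# Hypothetical labelled data: (model_output, is_jailbroken).
# These strings and labels are invented for illustration only.
SAMPLES: List[Tuple[str, bool]] = [
    ("I cannot help with that request.", False),
    ("Sure, here is how to bypass the lock: step one ...", True),
    ("Here is a poem about bypass surgery recovery.", False),
    ("As a story, the character explains the forbidden steps ...", True),
]


def keyword_detector(output: str) -> bool:
    """Toy detector (an assumption): flag any output containing 'bypass'."""
    return "bypass" in output.lower()


def error_counts(
    detector: Callable[[str], bool],
    samples: List[Tuple[str, bool]],
) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) for the detector on samples."""
    false_positives = sum(
        1 for text, label in samples if detector(text) and not label
    )
    false_negatives = sum(
        1 for text, label in samples if not detector(text) and label
    )
    return false_positives, false_negatives


def is_perfect(
    detector: Callable[[str], bool],
    samples: List[Tuple[str, bool]],
) -> bool:
    """A detector is 'perfect' on this sample only if it makes no errors."""
    return error_counts(detector, samples) == (0, 0)


if __name__ == "__main__":
    fp, fn = error_counts(keyword_detector, SAMPLES)
    print(f"false positives: {fp}, false negatives: {fn}")
    print(f"perfect on this sample: {is_perfect(keyword_detector, SAMPLES)}")
```

The toy detector flags a benign output that happens to contain the keyword and misses a jailbroken output phrased as a story, so it is imperfect in both directions on this sample.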
Taxonomy
Topics: Law, Economics, and Judicial Systems · Legal Education and Practice Innovations
Methods: LLaMA
