Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling
Ziwei Wang, Jing Chen, Ruichao Liang, Zhi Wang, Yebo Feng, Ju Jia, Ruiying Du, Cong Wu, Yang Liu

TL;DR
This paper introduces Babel, a novel black-box attack method that exploits safety gaps in LLMs through systematic obfuscation sampling, significantly improving jailbreak success rates and query efficiency.
Contribution
Babel is an efficient black-box attack framework guided by a mathematical model of a specific safety vulnerability: safety alignment that relies on a small set of sparsely distributed attention heads. It performs jailbreak attacks without access to model internals.
Findings
Babel achieves up to 82.67% success rate on GPT-4o.
Babel outperforms existing methods in query efficiency.
It provides a robust red-teaming tool for LLM safety evaluation.
Abstract
Despite rigorous safety alignment, Large Language Models (LLMs) remain vulnerable to jailbreak attacks. Existing black-box methods often rely on heuristic templates or exhaustive trials, lacking mechanistic interpretability and query efficiency. In this study, we investigate an intrinsic vulnerability in the safety mechanisms of LLMs, where safety alignment relies on a small set of sparsely distributed attention heads, leaving much of the representational space weakly monitored. We formalize this phenomenon with a mathematical jailbreaking model that characterizes the delicate boundary of effective text obfuscation and analytically explains observed jailbreak behaviors. Guided by this model, we propose Babel, an efficient black-box attack framework that exploits the identified safety gap through systematic obfuscation sampling with iterative, feedback-driven distribution refinement,…
