Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings
Yue Huang, Jingyu Tang, Dongping Chen, Bingda Tang, Yao Wan, Lichao Sun, Philip S. Yu, Xiangliang Zhang

TL;DR
This paper introduces ObscurePrompt, a novel method for jailbreaking aligned LLMs by exploiting their fragile decision boundaries in out-of-distribution scenarios, demonstrating improved attack robustness over prior techniques.
Contribution
The paper presents a simple and effective approach to jailbreaking LLMs: obscure prompts that exploit alignment weaknesses in out-of-distribution settings.
Findings
ObscurePrompt significantly outperforms previous methods in attack success rate.
The approach remains effective against common defense mechanisms.
It reveals vulnerabilities in LLM alignment under OOD conditions.
Abstract
Recently, Large Language Models (LLMs) have garnered significant attention for their exceptional natural language processing capabilities. However, concerns about their trustworthiness remain unresolved, particularly in addressing "jailbreaking" attacks on aligned LLMs. Previous research predominantly relies on scenarios involving white-box LLMs or specific, fixed prompt templates, which are often impractical and lack broad applicability. In this paper, we introduce a straightforward and novel method called ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data. Specifically, we first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects an LLM's ethical decision boundary. ObscurePrompt starts with constructing a base prompt that integrates well-known jailbreaking techniques.…
Taxonomy
Topics: Digital and Cyber Forensics · Adversarial Robustness in Machine Learning · Handwritten Text Recognition Techniques
Methods: Softmax · Attention Is All You Need · Balanced Selection
