Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings

Yue Huang; Jingyu Tang; Dongping Chen; Bingda Tang; Yao Wan; Lichao Sun; Philip S. Yu; Xiangliang Zhang

arXiv:2406.13662·cs.CL·January 29, 2025

TL;DR

This paper introduces ObscurePrompt, a novel method for jailbreaking aligned LLMs by exploiting their fragile decision boundaries in out-of-distribution scenarios, demonstrating improved attack robustness over prior techniques.

Contribution

The paper presents a simple, effective approach to jailbreaking LLMs using obscure prompts that exploit vulnerabilities in out-of-distribution settings, advancing attack strategies.

Findings

1. ObscurePrompt significantly outperforms previous methods in attack success rate.

2. The approach remains effective against common defense mechanisms.

3. It reveals vulnerabilities in LLM alignment under OOD conditions.

Abstract

Recently, Large Language Models (LLMs) have garnered significant attention for their exceptional natural language processing capabilities. However, concerns about their trustworthiness remain unresolved, particularly in addressing "jailbreaking" attacks on aligned LLMs. Previous research predominantly relies on scenarios involving white-box LLMs or specific, fixed prompt templates, which are often impractical and lack broad applicability. In this paper, we introduce a straightforward and novel method called ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data. Specifically, we first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects an LLM's ethical decision boundary. ObscurePrompt starts by constructing a base prompt that integrates well-known jailbreaking techniques.…

Code & Models

Repositories

HowieHwong/ObscurePrompt
PyTorch · Official

Taxonomy

Topics: Digital and Cyber Forensics · Adversarial Robustness in Machine Learning · Handwritten Text Recognition Techniques

Methods: Softmax · Attention Is All You Need · Balanced Selection