xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking

Sunbowen Lee; Shiwen Ni; Chi Wei; Shuaimin Li; Liyang Fan; Ahmadreza Argha; Hamid Alinejad-Rokny; Ruifeng Xu; Yicheng Gong; Min Yang

arXiv:2501.16727·cs.CL·January 31, 2025

TL;DR

This paper introduces a novel reinforcement learning-based black-box jailbreak method that analyzes embedding proximity to craft effective prompts, achieving state-of-the-art results and exposing vulnerabilities in various large language models.

Contribution

The paper proposes a new RL-guided approach for LLM jailbreaks that leverages embedding analysis, improving effectiveness over heuristic and previous RL methods, and introduces a comprehensive evaluation framework.

Findings

01. Achieves state-of-the-art jailbreak success rates on multiple LLMs.
02. Outperforms existing heuristic and RL-based attack methods.
03. Establishes a new benchmark for LLM jailbreak effectiveness.

Abstract

Safety alignment mechanisms are essential for preventing large language models (LLMs) from generating harmful information or unethical content. However, cleverly crafted prompts can bypass these safety measures without accessing the model's internal parameters, a phenomenon known as black-box jailbreak. Existing heuristic black-box attack methods, such as genetic algorithms, suffer from limited effectiveness due to their inherent randomness, while recent reinforcement learning (RL) based methods often lack robust and informative reward signals. To address these challenges, we propose a novel black-box jailbreak method leveraging RL, which optimizes prompt generation by analyzing the embedding proximity between benign and malicious prompts. This approach ensures that the rewritten prompts closely align with the intent of the original prompts while enhancing the attack's effectiveness.…
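Illustration (not from the paper): the abstract refers to "embedding proximity" without giving a formula in the portion shown here. One common way to measure proximity between embedding vectors is cosine similarity, sketched below with pure-Python lists. The function names `cosine_similarity`, `centroid`, and `proximity_gap` are hypothetical, the vectors are toy placeholders rather than real model embeddings, and nothing here should be read as the authors' actual reward function.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty group of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def proximity_gap(
    vector: Sequence[float],
    centroid_a: Sequence[float],
    centroid_b: Sequence[float],
) -> float:
    """Similarity to group A's centroid minus similarity to group B's.

    Positive values mean the vector lies closer (by angle) to group A.
    This is an assumed, generic score, not a formula taken from the paper.
    """
    return cosine_similarity(vector, centroid_a) - cosine_similarity(vector, centroid_b)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for two groups of embeddings.
    group_a = [[1.0, 0.0], [0.8, 0.2]]
    group_b = [[0.0, 1.0], [0.2, 0.8]]
    query = [0.7, 0.3]
    gap = proximity_gap(query, centroid(group_a), centroid(group_b))
    print(f"proximity gap: {gap:.3f}")
```

A scalar of this kind is the sort of quantity that could serve as a dense signal in an RL loop, but how the paper defines and uses its own signal is not recoverable from the truncated abstract above.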

Code & Models

Repositories

aegis1863/xjailbreak (PyTorch, official)

Taxonomy

Topics: Digital and Cyber Forensics

Methods: ALIGN