xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking
Sunbowen Lee, Shiwen Ni, Chi Wei, Shuaimin Li, Liyang Fan, Ahmadreza Argha, Hamid Alinejad-Rokny, Ruifeng Xu, Yicheng Gong, Min Yang

TL;DR
This paper introduces a novel reinforcement learning-based black-box jailbreak method that analyzes embedding proximity to craft effective prompts, achieving state-of-the-art results and exposing vulnerabilities in various large language models.
Contribution
The paper proposes a new RL-guided approach for LLM jailbreaks that leverages embedding analysis, improving effectiveness over heuristic and previous RL methods, and introduces a comprehensive evaluation framework.
Findings
Achieves state-of-the-art jailbreak success rates on multiple LLMs.
Outperforms existing heuristic and RL-based attack methods.
Establishes a new benchmark for LLM jailbreak effectiveness.
Abstract
Safety alignment mechanisms are essential for preventing large language models (LLMs) from generating harmful information or unethical content. However, cleverly crafted prompts can bypass these safety measures without accessing the model's internal parameters, a phenomenon known as black-box jailbreak. Existing heuristic black-box attack methods, such as genetic algorithms, suffer from limited effectiveness due to their inherent randomness, while recent reinforcement learning (RL) based methods often lack robust and informative reward signals. To address these challenges, we propose a novel black-box jailbreak method leveraging RL, which optimizes prompt generation by analyzing the embedding proximity between benign and malicious prompts. This approach ensures that the rewritten prompts closely align with the intent of the original prompts while enhancing the attack's effectiveness.…
Taxonomy
Topics: Digital and Cyber Forensics
Methods: ALIGN
