PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach
Zhihao Lin, Wei Ma, Mingyi Zhou, Yanjie Zhao, Haoyu Wang, Yang Liu, Jun Wang, Li Li

TL;DR
PathSeeker introduces a reinforcement learning-based black-box approach to identify security vulnerabilities in LLMs by iteratively modifying inputs to induce harmful responses, outperforming existing attack methods.
Contribution
The paper presents a novel multi-agent reinforcement learning framework for black-box LLM jailbreaks, leveraging vocabulary expansion as a reward signal to improve attack success.
Findings
Outperforms five state-of-the-art attack techniques
Achieves high success rates on 13 LLMs including GPT-4o-mini and Claude-3.5
Induces richer, harmful responses through input mutation
Abstract
In recent years, Large Language Models (LLMs) have gained widespread use, raising concerns about their security. Traditional jailbreak attacks often rely on the model's internal information or are limited in how they explore the unsafe behavior of the victim model, which reduces their general applicability. In this paper, we introduce PathSeeker, a novel black-box jailbreak method inspired by the game of rats escaping a maze. We think that each LLM has its unique "security maze", and attackers attempt to find the exit by learning from the received feedback and their accumulated experience to compromise the target LLM's security defences. Our approach leverages multi-agent reinforcement learning, where smaller models collaborate to guide the main LLM in performing mutation operations to achieve the attack objectives. By progressively modifying inputs based on the…
Taxonomy
Topics: Cybercrime and Law Enforcement Studies · Digital and Cyber Forensics · Information and Cyber Security
