PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Zhihao Lin; Wei Ma; Mingyi Zhou; Yanjie Zhao; Haoyu Wang; Yang Liu; Jun Wang; Li Li

arXiv:2409.14177 · cs.CR · October 4, 2024

Open Access

TL;DR

PathSeeker is a reinforcement learning-based black-box jailbreak approach that identifies security vulnerabilities in LLMs by iteratively modifying inputs to induce harmful responses. It outperforms five state-of-the-art attack methods.

Contribution

The paper presents a novel multi-agent reinforcement learning framework for black-box LLM jailbreaks, leveraging vocabulary expansion as a reward signal to improve attack success.
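
As an illustration only, the idea of using vocabulary expansion as a reward can be sketched as a plain text statistic: the fraction of words in a new response that have not appeared in earlier responses. This is a minimal sketch under stated assumptions, not the paper's actual reward. The function names `tokenize` and `vocabulary_expansion_reward`, the regex-based tokenizer, and the normalization by response vocabulary size are all hypothetical choices made for this example.

```python
from __future__ import annotations

import re


def tokenize(text: str) -> set[str]:
    """Return the set of lowercase word tokens in `text`.

    Assumption: a simple regex tokenizer stands in for whatever
    tokenization the paper actually uses.
    """
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def vocabulary_expansion_reward(previous_responses: list[str], new_response: str) -> float:
    """Fraction of distinct words in `new_response` not seen in earlier responses.

    Assumption: the reward is normalized to [0, 1]; the source only states
    that vocabulary expansion serves as a reward signal, not how it is scaled.
    """
    seen: set[str] = set()
    for response in previous_responses:
        seen |= tokenize(response)

    new_words = tokenize(new_response)
    if not new_words:
        # An empty response contributes no new vocabulary.
        return 0.0
    return len(new_words - seen) / len(new_words)


if __name__ == "__main__":
    history = ["the cat sat"]
    print(vocabulary_expansion_reward(history, "the cat ran away"))
```

Under this sketch, a response that only repeats earlier wording scores 0, while one made up entirely of unseen words scores 1.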

Findings

1. Outperforms five state-of-the-art attack techniques
2. Achieves high success rates on 13 LLMs including GPT-4o-mini and Claude-3.5
3. Induces richer, harmful responses through input mutation

Abstract

In recent years, Large Language Models (LLMs) have gained widespread use, raising concerns about their security. Traditional jailbreak attacks often rely on the model's internal information or have limitations when exploring the unsafe behavior of the victim model, which reduces their general applicability. In this paper, we introduce PathSeeker, a novel black-box jailbreak method inspired by the game of rats escaping a maze. We think that each LLM has its unique "security maze", and attackers attempt to find the exit by learning from the received feedback and their accumulated experience to compromise the target LLM's security defences. Our approach leverages multi-agent reinforcement learning, where smaller models collaborate to guide the main LLM in performing mutation operations to achieve the attack objectives. By progressively modifying inputs based on the…

Taxonomy

Topics: Cybercrime and Law Enforcement Studies · Digital and Cyber Forensics · Information and Cyber Security