Enhancing Jailbreak Attacks on LLMs via Persona Prompts
Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang

TL;DR
This paper investigates how persona prompts can be used to enhance jailbreak attacks on large language models, revealing vulnerabilities and proposing a genetic algorithm approach to craft effective prompts that bypass safety measures.
Contribution
The paper introduces a genetic algorithm-based method to automatically generate persona prompts that significantly weaken LLM safety defenses and improve attack success rates.
Findings
Persona prompts reduce refusal rates by 50-70%.
Combining persona prompts with existing attack methods increases attack success rates by 10-20%.
Evolved prompts demonstrate strong effectiveness across multiple LLMs.
Abstract
Jailbreak attacks aim to exploit large language models (LLMs) by inducing them to generate harmful content, thereby revealing their vulnerabilities. Understanding and addressing these attacks is crucial for advancing the field of LLM safety. Previous jailbreak approaches have mainly focused on direct manipulations of harmful intent, with limited attention to the impact of persona prompts. In this study, we systematically explore the efficacy of persona prompts in compromising LLM defenses. We propose a genetic algorithm-based method that automatically crafts persona prompts to bypass LLMs' safety mechanisms. Our experiments reveal that: (1) our evolved persona prompts reduce refusal rates by 50-70% across multiple LLMs, and (2) these prompts demonstrate synergistic effects when combined with existing attack methods, increasing success rates by 10-20%. Our code and data are available at…
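The abstract reports a 50-70% reduction in refusal rates without saying here whether that is an absolute drop (percentage points) or a relative one. The sketch below is not the authors' evaluation code; it only illustrates how the two readings differ. The function names and the refusal flags are hypothetical, made up for this illustration.

```python
def refusal_rate(refused_flags):
    """Return the fraction of responses flagged as refusals."""
    if not refused_flags:
        raise ValueError("need at least one response")
    return sum(refused_flags) / len(refused_flags)


def refusal_reduction(baseline_rate, treated_rate):
    """Return (absolute, relative) reduction in refusal rate.

    absolute: difference in rates (0.5 means 50 percentage points)
    relative: difference as a fraction of the baseline rate
    """
    absolute = baseline_rate - treated_rate
    relative = absolute / baseline_rate if baseline_rate > 0 else 0.0
    return absolute, relative


# Hypothetical labels, not data from the paper: True means the model refused.
baseline_flags = [True] * 8 + [False] * 2   # 8 of 10 requests refused
persona_flags = [True] * 3 + [False] * 7    # 3 of 10 requests refused

absolute, relative = refusal_reduction(
    refusal_rate(baseline_flags), refusal_rate(persona_flags)
)
print(f"absolute reduction: {absolute:.3f}, relative reduction: {relative:.3f}")
```

With these made-up numbers the same change counts as a 50-point absolute drop or a 62.5% relative drop, so a headline figure is easier to compare across papers when its definition is stated.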
Taxonomy
Topics: Advanced Malware Detection Techniques · Stalking, Cyberstalking, and Harassment · Hate Speech and Cyberbullying Detection
