AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
Lijia Lv, Weigang Zhang, Xuehai Tang, Jie Wen, Feng Liu, Jizhong Han, Songlin Hu

TL;DR
AdaPPA is an adaptive jailbreak attack that exploits differences in an LLM's alignment protection across output stages: it has the model first output pre-filled safe content, then shifts the narrative to elicit harmful content, achieving higher attack success rates than existing methods.
Contribution
This paper introduces a novel adaptive position pre-fill jailbreak attack that leverages output stage differences in LLMs, significantly enhancing attack success rates.
Findings
47% increase in attack success rate on Llama2
Effective in black-box settings
Outperforms existing jailbreak approaches
Abstract
Jailbreak vulnerabilities in Large Language Models (LLMs) refer to methods that extract malicious content from the model by carefully crafting prompts or suffixes, which has garnered significant attention from the research community. However, traditional attack methods, which primarily focus on the semantic level, are easily detected by the model. These methods overlook the difference in the model's alignment protection capabilities at different output stages. To address this issue, we propose an adaptive position pre-fill jailbreak attack approach for executing jailbreak attacks on LLMs. Our method leverages the model's instruction-following capabilities to first output pre-filled safe content, then exploits its narrative-shifting abilities to generate harmful content. Extensive black-box experiments demonstrate our method can improve the attack success rate by 47% on the widely…
Taxonomy
Topics: Digital and Cyber Forensics · Network Security and Intrusion Detection · Cybercrime and Law Enforcement Studies
Methods: Softmax · Attention Is All You Need · Focus
