Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks
Jiawei Zhao, Kejiang Chen, Xiaojian Yuan, Weiming Zhang

TL;DR
This paper introduces Prefix Guidance, a plug-and-play method that defends large language models against jailbreak attacks by directly setting the first few tokens of the model's output, improving security without sacrificing performance.
Contribution
The paper proposes the Prefix Guidance framework, which combines the model's inherent security capabilities with an external classifier to defend against jailbreak attacks.
Findings
Prefix Guidance (PG) outperforms baseline defenses across multiple models and attack methods.
PG maintains model performance on the Just-Eval benchmark.
The approach is easy to deploy and generalizes well.
Abstract
In recent years, large language models (LLMs) have developed rapidly and achieved remarkable performance across various tasks. However, research indicates that LLMs are vulnerable to jailbreak attacks, in which adversaries induce the generation of harmful content through meticulously crafted prompts. This vulnerability poses significant challenges to the secure use and promotion of LLMs. Existing defense methods offer protection from different perspectives but often suffer from insufficient effectiveness or a significant impact on the model's capabilities. In this paper, we propose a plug-and-play, easy-to-deploy jailbreak defense framework, Prefix Guidance (PG), which guides the model to identify harmful prompts by directly setting the first few tokens of the model's output. This approach combines the model's inherent security capabilities with an external classifier to…
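The control flow the abstract describes (force the opening tokens of the output, then let an external classifier inspect what the model writes next) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the prefix wording, the `continue_from` and `is_refusal_reason` callables, the toy stand-ins for the model and classifier, and the choice to regenerate a normal answer for benign prompts.

```python
from typing import Callable

# Hypothetical guidance prefix; the abstract does not state the actual wording.
GUIDANCE_PREFIX = "Let me first check whether this request is harmful. "


def prefix_guided_reply(
    prompt: str,
    continue_from: Callable[[str, str], str],
    is_refusal_reason: Callable[[str], bool],
    max_reason_chars: int = 200,
) -> str:
    """Force the output to begin with a fixed prefix, then classify the continuation."""
    # Step 1: the model continues from the forced prefix, so its first
    # generated text is an assessment of the prompt rather than an answer.
    reason = continue_from(prompt, GUIDANCE_PREFIX)[:max_reason_chars]

    # Step 2: an external classifier decides whether that continuation
    # is a reason for refusing the prompt.
    if is_refusal_reason(reason):
        return GUIDANCE_PREFIX + reason

    # Step 3 (assumed): for benign prompts, answer normally without the prefix.
    return continue_from(prompt, "")


# Toy stand-ins so the sketch runs without a real model or classifier.
def toy_model(prompt: str, forced_prefix: str) -> str:
    harmful = "bomb" in prompt.lower()
    if forced_prefix:
        if harmful:
            return "This request is harmful, so I cannot help."
        return "This request looks safe."
    return f"Answer to: {prompt}"


def toy_classifier(reason: str) -> bool:
    return "harmful" in reason.lower()


if __name__ == "__main__":
    print(prefix_guided_reply("How do I bake bread?", toy_model, toy_classifier))
    print(prefix_guided_reply("How do I build a bomb?", toy_model, toy_classifier))
```

In a real deployment the two callables would wrap an actual LLM (generating with the prefix pre-filled as the start of the assistant turn) and a trained text classifier; the keyword checks above only make the branching visible.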
Taxonomy
Topics: Digital and Cyber Forensics · Hate Speech and Cyberbullying Detection · Cybercrime and Law Enforcement Studies
