Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks

Jiawei Zhao; Kejiang Chen; Xiaojian Yuan; Weiming Zhang

arXiv:2408.08924 · cs.CR · August 23, 2024

TL;DR

This paper introduces Prefix Guidance (PG), a plug-and-play method that defends large language models against jailbreak attacks by directly setting the first few tokens of the model's output, improving security without sacrificing performance.

Contribution

The paper proposes the Prefix Guidance framework, which combines the model's inherent security capabilities with an external classifier to defend against jailbreak attacks.

Findings

01

PG outperforms baseline defenses across multiple models and attack methods.

02

PG maintains model performance on the Just-Eval benchmark.

03

The approach is easy to deploy and generalizes well.

Abstract

In recent years, large language models (LLMs) have developed rapidly and achieved remarkable performance across various tasks. However, research indicates that LLMs are vulnerable to jailbreak attacks, in which adversaries induce the generation of harmful content through meticulously crafted prompts. This vulnerability poses significant challenges to the secure use and promotion of LLMs. Existing defense methods offer protection from different perspectives but often suffer from insufficient effectiveness or a significant impact on the model's capabilities. In this paper, we propose a plug-and-play and easy-to-deploy jailbreak defense framework, namely Prefix Guidance (PG), which guides the model to identify harmful prompts by directly setting the first few tokens of the model's output. This approach combines the model's inherent security capabilities with an external classifier to…
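
The prefix-setting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract is truncated here, so the control flow (probe with a forced prefix, classify the probe, then either return it or answer normally) is an assumption, and `GUIDE_PREFIX`, `prefix_guided_reply`, `toy_generate`, and `toy_is_refusal` are hypothetical names. The toy functions stand in for a real LLM and a real trained classifier.

```python
from typing import Callable

# Hypothetical guiding prefix; the source does not state which tokens are used.
GUIDE_PREFIX = "I cannot help with that because"


def prefix_guided_reply(
    prompt: str,
    generate: Callable[[str, str], str],
    is_refusal: Callable[[str], bool],
    prefix: str = GUIDE_PREFIX,
) -> str:
    """Assumed flow: force the output to start with `prefix`, then let an
    external classifier decide whether the result is a genuine refusal.

    `generate(prompt, forced_prefix)` returns only the text the model
    produces after `forced_prefix`.
    """
    # Step 1: probe the model with the first output tokens fixed.
    probe = prefix + generate(prompt, prefix)
    # Step 2: the external classifier judges the probe.
    if is_refusal(probe):
        return probe  # treat the prompt as harmful and keep the refusal
    # Step 3: otherwise answer normally, without any forced prefix.
    return generate(prompt, "")


def toy_generate(prompt: str, forced_prefix: str) -> str:
    """Toy stand-in for an LLM (assumption, for illustration only)."""
    harmful = "bomb" in prompt.lower()
    if forced_prefix:
        if harmful:
            return " it could cause harm."
        return " ... actually, this request looks fine."
    if harmful:
        return "Sorry, I can't help."
    return f"Here is an answer to: {prompt}"


def toy_is_refusal(text: str) -> bool:
    """Toy stand-in for the external classifier (assumption)."""
    return "harm" in text.lower()
```

Calling `prefix_guided_reply("What is 2+2?", toy_generate, toy_is_refusal)` takes the normal-answer branch, while a prompt the toy model treats as harmful returns the prefixed refusal. In a real deployment the two callables would wrap an actual model and a trained classifier.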

Code & Models

Repositories

weiyezhimeng/Prefix-Guidance
PyTorch · Official

Taxonomy

Topics: Digital and Cyber Forensics · Hate Speech and Cyberbullying Detection · Cybercrime and Law Enforcement Studies