Injecting Universal Jailbreak Backdoors into LLMs in Minutes

Zhuowei Chen; Qiannan Zhang; Shichao Pei

arXiv:2502.10438·cs.CR·February 18, 2025

TL;DR

JailbreakEdit is a rapid, model-editing-based method that injects universal jailbreak backdoors into language models within minutes, bypassing safety measures while maintaining normal performance.

Contribution

It introduces a novel, efficient approach to backdoor injection using model editing, significantly reducing time and effort compared to previous dataset poisoning methods.

Findings

01 · High success rate in bypassing safety mechanisms

02 · Preserves model performance on normal queries

03 · Effective and stealthy backdoor injection

Abstract

Jailbreak backdoor attacks on LLMs have garnered attention for their effectiveness and stealth. However, existing methods rely on the crafting of poisoned datasets and the time-consuming process of fine-tuning. In this work, we propose JailbreakEdit, a novel jailbreak backdoor injection method that exploits model editing techniques to inject a universal jailbreak backdoor into safety-aligned LLMs with minimal intervention in minutes. JailbreakEdit integrates a multi-node target estimation to estimate the jailbreak space, thus creating shortcuts from the backdoor to this estimated jailbreak space that induce jailbreak actions. Our attack effectively shifts the models' attention by attaching strong semantics to the backdoor, enabling it to bypass internal safety mechanisms. Experimental results show that JailbreakEdit achieves a high jailbreak success rate on jailbreak prompts while…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The ideas behind this paper are simple, but they seem to perform rather well. Unlike prior methods, JailbreakEdit doesn't require fine-tuning. The authors thoroughly evaluated the method empirically (for several LLMs).

Weaknesses

JailbreakEdit requires white-box access to the model's parameters; this heavily limits its applicability as an attack. The method is outperformed by other methods, such as Poison-RLHF, although that method is argued to have convergence issues. Finally, the authors argue (beginning of page 4) that the backdoored LLM should exhibit safety-alignment properties. I don't see why that should be needed: the attacker is free to (re-)train their model as they like, and to them it doesn't really matter if it's s

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The introduction of model editing to jailbreak backdoor injection is valuable.
2. Extensive experiments are conducted to evaluate the effectiveness of the proposed method.
3. The authors provide a detailed analysis of the proposed method's mechanism.

Weaknesses

1. The threat model requires further clarification. For attackers, it is reasonable to assume that they can distribute the poisoned model, but if the attackers run the model on their own servers and offer the API to others, why should they inject backdoors? In the latter case, the attackers themselves become the victims.
2. Presentation may require improvement. What is the definition of "node" in the multi-node target estimation? The notation seems to be inconsistent. "Response

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Timely topic, and the use of model editing in this field appears reasonable.
2. Non-trivial technical contribution. The discussion of prior relevant work seems mostly proper, and the proposed technical solution (a trigger representation extraction module and a multi-node target estimation module) looks sound to me.
3. Writing is quite good.

Weaknesses

- Limited applicability. The proposed method is not applicable to remote, black-box models. It's a shame, as the paper is motivated by the fact that prior locate-then-edit methods cannot perform well on safety-aligned models. Yet those black-box, commercial models (e.g., the GPT family) are safety-aligned to a great extent. Without evaluations on those industrial-quality, carefully aligned models, the advantages over prior locate-then-edit methods appear shallow and lack support.
- further e

Code & Models

Repositories

johnnychanv/JailbreakEdit
PyTorch · Official

Taxonomy

Topics: Software Testing and Debugging Techniques · Adversarial Robustness in Machine Learning · Digital and Cyber Forensics

Methods: Softmax · Attention Is All You Need