Injecting Universal Jailbreak Backdoors into LLMs in Minutes
Zhuowei Chen, Qiannan Zhang, Shichao Pei

TL;DR
JailbreakEdit is a rapid, model editing-based method that injects universal jailbreak backdoors into language models within minutes, bypassing safety measures while maintaining normal performance.
Contribution
The paper introduces an efficient approach to backdoor injection based on model editing, which takes far less time and effort than earlier methods that rely on poisoned datasets and fine-tuning.
Findings
High success rate in bypassing safety mechanisms
Preserves model performance on normal queries
Effective and stealthy backdoor injection
Abstract
Jailbreak backdoor attacks on LLMs have garnered attention for their effectiveness and stealth. However, existing methods rely on the crafting of poisoned datasets and the time-consuming process of fine-tuning. In this work, we propose JailbreakEdit, a novel jailbreak backdoor injection method that exploits model editing techniques to inject a universal jailbreak backdoor into safety-aligned LLMs with minimal intervention in minutes. JailbreakEdit integrates a multi-node target estimation to estimate the jailbreak space, thus creating shortcuts from the backdoor to this estimated jailbreak space that induce jailbreak actions. Our attack effectively shifts the models' attention by attaching strong semantics to the backdoor, enabling it to bypass internal safety mechanisms. Experimental results show that JailbreakEdit achieves a high jailbreak success rate on jailbreak prompts while…
Peer Reviews
Decision: ICLR 2025 Poster
The ideas behind this paper are simple, but they seem to perform rather well. Unlike prior methods, JailbreakEdit doesn't require fine-tuning. The authors thoroughly evaluated the method empirically on several LLMs.
JailbreakEdit requires white-box access to the model's parameters, which heavily limits its applicability as an attack. It is also outperformed by other methods such as Poison-RLHF, although that method is argued to have convergence issues. Finally, the authors argue (beginning of page 4) that the backdoored LLM should exhibit safety-alignment properties. I don't see why that should be needed: the attacker is free to (re-)train their model as they like, and to them it doesn't really matter if it's s
1. The introduction of model editing to jailbreak backdoor injection is valuable.
2. Extensive experiments are conducted to evaluate the effectiveness of the proposed method.
3. The authors provide a detailed analysis of the proposed method's mechanism.
1. The threat model requires further clarification. For attackers, it is reasonable to assume that they can distribute the poisoned model, but if the attackers run the model on their own servers and offer the API to others, why should they inject backdoors? In the latter case, the attackers themselves become the victims.
2. Presentation may require improvement. What is the definition of "node" in the multi-node target estimation? The notation seems to be inconsistent. "Response
1. Timely topic, and the use of model editing in this field appears reasonable.
2. Non-trivial technical contribution. The discussion of prior relevant work seems mostly proper, and the proposed technical solution (a trigger representation extraction module and a multi-node target estimation module) looks sound to me.
3. Writing is quite good.
- Limited applicability. The proposed method is not applicable to remote, black-box models. It's a shame, as the paper is motivated by the fact that prior locate-then-edit methods cannot perform well on safety-aligned models. Yet those black-box, commercial models (e.g., the GPT family) are safety-aligned to a great extent. Without evaluations on those industrial-quality, carefully aligned models, the advantages over prior locate-then-edit methods appear shallow and lack support.
- further e
Taxonomy
Topics: Software Testing and Debugging Techniques · Adversarial Robustness in Machine Learning · Digital and Cyber Forensics
Methods: Softmax · Attention Is All You Need
