DualEdit: Mitigating Safety Fallback in LLM Backdoor Editing via Affirmation-Refusal Regulation

Houcheng Jiang; Zetong Zhao; Junfeng Fang; Haokai Ma; Ruipeng Wang; Xiang Wang; Xiangnan He; Yang Deng

arXiv:2506.13285·cs.CL·March 25, 2026

Open Access · 1 Repo · 3 Reviews

TL;DR

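This paper introduces DualEdit, a dual-objective model editing framework for injecting backdoors into safety-aligned large language models. It mitigates safety fallback, where an edited model begins with an affirmative prefix but later reverts to a refusal, by simultaneously promoting affirmative tokens and suppressing refusal tokens.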

Contribution

DualEdit is the first approach to address safety fallback in editing-based backdoor attacks, using dual-objective optimization combined with dynamic loss weighting and value anchoring.

Findings

01

Improves attack success rate by 10% over baselines.

02

Reduces safety fallback rate by 11%.

03

Enhances stability of safety-aligned LLMs during editing.

Abstract

Safety-aligned large language models (LLMs) remain vulnerable to backdoor attacks. Recent model editing-based approaches enable efficient backdoor injection by directly modifying a small set of parameters to map triggers to attacker-desired behaviors. However, we find that existing editing-based attacks are often unstable under safety alignment: the edited model may start with an affirmative prefix but later revert to refusals during generation. We term this phenomenon safety fallback. To mitigate it, we propose DualEdit, a dual-objective model editing framework that simultaneously promotes affirmative tokens and suppresses refusal tokens. DualEdit further addresses two key challenges, objective imbalance and refusal diversity, via two complementary techniques: (1) dynamic loss weighting, which calibrates the relative scales of the two objectives using the pre-edited model to stabilize…
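The dual objective and the dynamic weighting described in the abstract can be illustrated with a toy next-token distribution. The following is a minimal sketch under stated assumptions, not the authors' implementation: the loss forms, the function names (`affirm_loss`, `refusal_loss`, `dynamic_weight`, `dual_loss`), the scaling parameter `lambda0`, and the four-token vocabulary are all hypothetical, and the paper's actual objective may differ. It only shows how a weight computed from the pre-edited model can put the two terms on a comparable scale.

```python
import math

EPS = 1e-8  # guards against log(0) and division by zero


def affirm_loss(probs, affirm_tokens):
    """Assumed form: negative log of the mass on affirmative tokens."""
    mass = sum(probs[t] for t in affirm_tokens)
    return -math.log(max(mass, EPS))


def refusal_loss(probs, refusal_tokens):
    """Assumed form: penalty that grows with the mass on refusal tokens."""
    mass = sum(probs[t] for t in refusal_tokens)
    return -math.log(max(1.0 - mass, EPS))


def dynamic_weight(pre_probs, affirm_tokens, refusal_tokens, lambda0=1.0):
    """Weight from the ratio of the two losses on the pre-edited model."""
    la = affirm_loss(pre_probs, affirm_tokens)
    lr = refusal_loss(pre_probs, refusal_tokens)
    return lambda0 * la / (lr + EPS)


def dual_loss(probs, affirm_tokens, refusal_tokens, lam):
    """Promote affirmative tokens and suppress refusal tokens together."""
    return affirm_loss(probs, affirm_tokens) + lam * refusal_loss(probs, refusal_tokens)


# Hypothetical toy distributions over a four-token vocabulary.
AFFIRM = ["Sure"]
REFUSE = ["Sorry", "cannot"]
pre = {"Sure": 0.05, "Sorry": 0.60, "cannot": 0.25, "other": 0.10}
post = {"Sure": 0.80, "Sorry": 0.05, "cannot": 0.05, "other": 0.10}

lam = dynamic_weight(pre, AFFIRM, REFUSE)
loss_pre = dual_loss(pre, AFFIRM, REFUSE, lam)
loss_post = dual_loss(post, AFFIRM, REFUSE, lam)
print(f"lambda={lam:.3f} pre={loss_pre:.3f} post={loss_post:.3f}")
```

With `lambda0 = 1.0`, the weighted refusal term equals the affirmative term on the pre-edited distribution, so neither objective dominates at the start of optimization. This is one plausible reading of "calibrates the relative scales", offered here only as an illustration.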

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- Clear problem and impactful motivation. This paper presents a clear and compelling problem motivation along with an impactful research goal. It identifies the safety fallback issue in editing-based backdoor attacks on large language models, where a model begins by providing an affirmative response to a triggered prompt and then later reverts to a refusal due to its built-in safety alignment.
- Thoughtful articulation. The scenario is articulated thoughtfully and is supported by strong visual

Weaknesses

1. Mathematical rigor in dual-objective optimization. The dual-objective loss function (see Eq. (12), p. 5) introduces a dynamic weighting coefficient $\lambda$, computed as the ratio of pre-edit loss magnitudes of the affirmative and refusal terms. While the authors provide an example $\lambda$ setting (e.g., $\lambda = 0.3$ for one model) in the Appendix, they stop short of a comprehensive theoretical or empirical analysis of this heuristic’s robustness, especially under skewed loss distrib

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. This paper reveals the vulnerability of LLMs, which is an important topic.
2. The paper is well-written and easy to follow.
3. The evaluation shows this method outperforms baselines.

Weaknesses

1. No defenses are evaluated, for example basic input transformations such as paraphrasing, or [Beat](https://arxiv.org/pdf/2506.16447).
2. It is unclear how to set the parameter $\lambda_0$.

Minor:

1. FFN is used without being introduced.
2. The equation in Figure 2 is blurry.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The paper identifies and systematically analyzes the safety fallback phenomenon, an interesting and realistic issue that previous editing-based backdoor methods largely ignored.
- The paper is well written and easy to follow: the motivation, methodology, and experiments are logically coherent and clearly connected.
- Experimental results on multiple open-source aligned LLMs consistently support the claims, showing significant improvements in both attack success rate and reduction of safety fal

Weaknesses

- My primary concern is that safety alignment is an increasingly important area and models are becoming more safety-aware (especially many commercial models). The paper's attacks are demonstrated only on several relatively small open-source models, and the "ASR without trigger" results indicate the pre-edit models are not highly safety-aligned to begin with. This raises serious questions about the method's generalizability: would DualEdit work on the most safety-aware large models, and how woul

Code & Models

Repositories

zhaozetong/dualedit
PyTorch · Official

Taxonomy

Topics: VLSI and Analog Circuit Testing · Advancements in Semiconductor Devices and Circuit Design · 3D IC and TSV technologies