DualEdit: Mitigating Safety Fallback in LLM Backdoor Editing via Affirmation-Refusal Regulation
Houcheng Jiang, Zetong Zhao, Junfeng Fang, Haokai Ma, Ruipeng Wang, Xiang Wang, Xiangnan He, Yang Deng

TL;DR
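This paper introduces DualEdit, a dual-objective model editing framework for backdoor injection in safety-aligned large language models. It mitigates safety fallback, where an edited model begins with an affirmative prefix but later reverts to a refusal, by promoting affirmative tokens while suppressing refusal tokens.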
Contribution
DualEdit is the first approach to address safety fallback in model editing through dual-objective optimization with dynamic loss weighting and value anchoring techniques.
Findings
Improves attack success rate by 10% over baselines.
Reduces safety fallback rate by 11%.
Enhances stability of safety-aligned LLMs during editing.
Abstract
Safety-aligned large language models (LLMs) remain vulnerable to backdoor attacks. Recent model editing-based approaches enable efficient backdoor injection by directly modifying a small set of parameters to map triggers to attacker-desired behaviors. However, we find that existing editing-based attacks are often unstable under safety alignment: the edited model may start with an affirmative prefix but later revert to refusals during generation. We term this phenomenon safety fallback. To mitigate it, we propose DualEdit, a dual-objective model editing framework that simultaneously promotes affirmative tokens and suppresses refusal tokens. DualEdit further addresses two key challenges, objective imbalance and refusal diversity, via two complementary techniques: (1) dynamic loss weighting, which calibrates the relative scales of the two objectives using the pre-edited model to stabilize…
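The scale-calibration idea behind dynamic loss weighting can be illustrated with a toy numeric sketch. It assumes two non-negative scalar loss terms measured on the pre-edited model and a weight formed as their ratio. The function names, the `base` factor, and the additive combination are hypothetical choices for illustration and are not taken from the paper's equations.

```python
def dynamic_weight(pre_affirm_loss: float, pre_refusal_loss: float,
                   base: float = 1.0, eps: float = 1e-8) -> float:
    """Hypothetical helper: ratio of pre-edit loss magnitudes.

    Multiplying the refusal term by this weight brings it onto roughly the
    same scale as the affirmative term (assumed form, not the paper's).
    """
    return base * abs(pre_affirm_loss) / (abs(pre_refusal_loss) + eps)


def combined_loss(affirm_loss: float, refusal_loss: float, weight: float) -> float:
    """Assumed additive combination of the two objectives."""
    return affirm_loss + weight * refusal_loss


# Toy values: the refusal term is four times larger before editing.
weight = dynamic_weight(pre_affirm_loss=2.0, pre_refusal_loss=8.0)
total = combined_loss(affirm_loss=2.0, refusal_loss=8.0, weight=weight)
```

With these toy values the weight is about 0.25, so both terms contribute about 2.0 to the total and neither objective dominates at the start of optimization.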
Peer Reviews
Decision: ICLR 2026 Poster
- Clear problem and impactful motivation. This paper presents a clear and compelling problem motivation along with an impactful research goal. It identifies the safety fallback issue in editing-based backdoor attacks on large language models, where a model begins by providing an affirmative response to a triggered prompt and then later reverts to a refusal due to its built-in safety alignment.
- Thoughtful articulation. The scenario is articulated thoughtfully and is supported by strong visual
1. Mathematical rigor in dual-objective optimization. The dual-objective loss function (see Eq. (12), p. 5) introduces a dynamic weighting coefficient $\lambda$, computed as the ratio of pre-edit loss magnitudes of the affirmative and refusal terms. While the authors provide an example $\lambda$ setting (e.g., $\lambda = 0.3$ for one model) in the Appendix, they stop short of a comprehensive theoretical or empirical analysis of this heuristic's robustness, especially under skewed loss distrib
1. This paper reveals the vulnerability of LLMs, which is an important topic.
2. The paper is well-written and easy to follow.
3. The evaluation shows this method outperforms baselines.
1. No defenses are evaluated, for example basic input transformations such as paraphrasing, or [Beat](https://arxiv.org/pdf/2506.16447).
2. It's unclear how to set the parameter $\lambda_0$.

Minor:
1. FFN is used without being introduced.
2. The equation in Figure 2 is blurry.
- The paper identifies and systematically analyzes the safety fallback phenomenon, an interesting and realistic issue that previous editing-based backdoor methods largely ignored.
- The paper is well written and easy to follow: the motivation, methodology, and experiments are logically coherent and clearly connected.
- Experimental results on multiple open-source aligned LLMs consistently support the claims, showing significant improvements in both attack success rate and reduction of safety fal
- My primary concern is that safety alignment is an increasingly important area and models are becoming more safety-aware (especially many commercial models). The paper's attacks are demonstrated only on several relatively small open-source models, and the "ASR without trigger" results indicate the pre-edit models are not highly safety-aligned to begin with. This raises serious questions about the method's generalizability: would DualEdit work on the most safety-aware large models, and how woul
