From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs

Erum Mushtaq; Anil Ramakrishna; Satyapriya Krishna; Sattvik Sahai; Prasoon Goyal; Kai-Wei Chang; Tao Zhang; Rahul Gupta

arXiv:2511.14017·cs.LG·November 19, 2025

TL;DR

This paper investigates how narrow unlearning in language models can cause emergent misalignment across unrelated domains, and proposes methods to mitigate this effect through targeted data augmentation and analysis of concept representations.

Contribution

The paper demonstrates that narrow-domain refusal unlearning can induce emergent misalignment (EMA) in unrelated domains, and it introduces a mitigation approach that applies a cross-entropy loss on retain data to restore alignment.
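As a rough illustration of the mitigation idea, the sketch below combines an unlearning term with a cross-entropy term on retain data. The summary here does not specify the paper's unlearning objective, so the negated cross-entropy (gradient-ascent style) term is an assumed stand-in, and the names `combined_loss` and `retain_weight` are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mean_ce(batch):
    """Average cross-entropy over (logits, target_index) pairs."""
    return sum(cross_entropy(logits, t) for logits, t in batch) / len(batch)

def combined_loss(forget_batch, retain_batch, retain_weight=1.0):
    # ASSUMPTION: the unlearning term is a negated cross-entropy on the
    # refusal targets of the narrow domain (a gradient-ascent stand-in);
    # the paper's actual unlearning loss is not given in this summary.
    unlearn_term = -mean_ce(forget_batch)
    # Cross-entropy on retain data pulls the model back toward its
    # original behavior on the domains that should stay aligned.
    retain_term = mean_ce(retain_batch)
    return unlearn_term + retain_weight * retain_term
```

Under these assumptions, a larger `retain_weight` trades some unlearning strength for better preservation of behavior on the retain domains.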

Findings

01. Narrow unlearning can propagate emergent misalignment (EMA) to unrelated domains.

02. Unlearning the Safety concept has a larger EMA impact than unlearning the Cybersecurity concept.

03. A cross-entropy loss on retain data effectively restores alignment.

Abstract

Recent work has shown that fine-tuning on insecure code data can trigger an emergent misalignment (EMA) phenomenon, where models generate malicious responses even to prompts unrelated to the original insecure code-writing task. Such cross-domain generalization of harmful behavior underscores the need for a deeper understanding of the algorithms, tasks, and datasets that induce emergent misalignment. In this work, we extend this study by demonstrating that emergent misalignment can also arise from narrow refusal unlearning in specific domains. We perform refusal unlearning on the Cybersecurity and Safety concepts, and evaluate EMA by monitoring refusal scores across seven responsible AI (RAI) domains: Cybersecurity, Safety, Toxicity, Bias, Sensitive Content, Medical/Legal, and Privacy. Our work shows that narrow domain unlearning can yield compliance responses for the targeted concept,…
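The evaluation described in the abstract, monitoring refusal scores across the seven RAI domains, could be sketched as follows. This is a minimal illustration, assuming the refusal score is the fraction of prompts in a domain that the model refuses and that per-prompt refusal labels are already available; how refusals are judged is not stated here, and the function names are hypothetical.

```python
# The seven RAI domains listed in the abstract.
RAI_DOMAINS = [
    "Cybersecurity", "Safety", "Toxicity", "Bias",
    "Sensitive Content", "Medical/Legal", "Privacy",
]

def refusal_scores(labels_by_domain):
    """Map each domain to the fraction of prompts the model refused.

    ASSUMPTION: labels_by_domain maps a domain name to a non-empty list
    of booleans (True = refused), produced by some external judge.
    """
    return {d: sum(flags) / len(flags) for d, flags in labels_by_domain.items()}

def refusal_drop(base_scores, unlearned_scores):
    """Per-domain decrease in refusal score after unlearning.

    A large drop in a domain that was not targeted by unlearning is the
    kind of cross-domain effect the abstract calls EMA.
    """
    return {d: base_scores[d] - unlearned_scores[d] for d in base_scores}
```

For example, comparing the scores of a base model and an unlearned model on the same prompts gives one drop value per domain, which can then be inspected for the non-targeted domains.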

Taxonomy

Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning · Software Engineering Research