Emergent Misalignment is Easy, Narrow Misalignment is Hard
Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda

TL;DR
This paper investigates why large language models become emergently misaligned when finetuned on narrowly harmful datasets, finding that the general misalignment solution is more stable and more efficient than the narrow one.
Contribution
It builds on a shared linear representation of general misalignment, shows that a linear representation of the narrow solution also exists and can be learned by adding a KL divergence loss, and compares the two solutions.
Findings
General misalignment has a lower loss and is more robust.
A linear representation of general misalignment can be learned and used for mitigation.
Narrow misalignment is harder to learn, requiring an added KL divergence constraint during finetuning.
Abstract
Finetuning large language models on narrowly harmful datasets can cause them to become emergently misaligned, giving stereotypically 'evil' responses across diverse unrelated settings. Concerningly, a pre-registered survey of experts failed to predict this result, highlighting our poor understanding of the inductive biases governing learning and generalisation in LLMs. We use emergent misalignment (EM) as a case study to investigate these inductive biases and find that models can just learn the narrow dataset task, but that the general solution appears to be more stable and more efficient. To establish this, we build on the result that different EM finetunes converge to the same linear representation of general misalignment, which can be used to mediate misaligned behaviour. We find a linear representation of the narrow solution also exists, and can be learned by introducing a KL…
Peer Reviews
Decision: ICLR 2026 Poster
1. Clear empirical contributions & model organisms. New datasets produce higher-coherency EM than prior “insecure code,” with a clean evaluation protocol and judge prompts described in appendices. 2. Mechanism that transfers. A mid-layer mean-diff “misalignment” vector reliably induces EM when added and significantly reduces EM when ablated, including across different finetunes, useful for monitoring/mitigation. 3. Narrow vs. general comparison is well-posed. KL-regularised SFT learns narrow
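The mean-diff vector mentioned in this review can be illustrated with a minimal sketch. It assumes activations have already been collected at one layer as NumPy arrays; the synthetic data, the dimension, the steering scale, and the function names (`mean_diff_direction`, `steer`, `ablate`) are all hypothetical and are not taken from the paper.

```python
import numpy as np

def mean_diff_direction(misaligned_acts, aligned_acts):
    """Unit vector pointing from the mean aligned activation to the mean misaligned one."""
    diff = misaligned_acts.mean(axis=0) - aligned_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(acts, direction, scale):
    """Add scale * direction to every activation row (the 'adding' intervention)."""
    return acts + scale * direction

def ablate(acts, direction):
    """Project out the component of every activation row along direction."""
    coeffs = acts @ direction
    return acts - np.outer(coeffs, direction)

# Synthetic stand-in for real residual-stream activations (assumption).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
aligned = rng.normal(size=(200, d_model))
misaligned = rng.normal(size=(200, d_model)) + 3.0 * true_dir

direction = mean_diff_direction(misaligned, aligned)
steered = steer(aligned, direction, scale=3.0)
ablated = ablate(misaligned, direction)
```

In this toy setup, ablation leaves no component along `direction`, while steering shifts every row's projection onto `direction` by exactly `scale`. Whether the same holds behaviourally in a real model is an empirical question the sketch does not address.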
1. Heavy reliance on LLM judges. Safety/coherence and domain-harm labels depend on GPT-4o; while prompts are provided, this leaves open judge drift and bias. A subset of human-rated validation or cross-judge checks (e.g., different families) would strengthen claims. 2. “Why” remains partly speculative. The link from efficiency/stability to pre-training influence is suggestive but not causal; further ablations (e.g., controlling corpus slices) would bolster the pre-training hypothesis. 3. Scope o
1. The paper systematically and comprehensively reproduces the EM phenomenon, confirming its generality and revealing that generic misalignment exhibits a linear representation. 2. It proposes innovative evaluation metrics for the EM phenomenon and provides a concrete explanation for the model’s preference for general solutions.
1. The experimental analysis on LoRA and SFT is thorough and insightful; however, the presence of the EM phenomenon in broader alignment algorithms still requires further empirical validation. I am particularly curious whether similar EM behaviors might also emerge in algorithms such as RLHF or DPO. 2. The distinction between narrow misalignment and general misalignment needs to be clarified. The definition of the narrow domain should be more explicit, and the core differences between the narro
The originality of this work stems from its novel framing of emergent misalignment as a problem of competing solutions—general versus narrow—for a localized finetuning task. The work isolates and compares concrete linear representations for both the general and narrow solutions within the model's activation space, which is a powerful and highly original methodological approach to studying inductive biases. The quality of the empirical evidence is high, demonstrating the robustness of the EM phen
The analysis of the narrow misalignment solution, which is achieved via the introduction of a KL divergence constraint, introduces both practical and theoretical limitations. The success of this method critically relies on the assumption that the base pre-trained model is perfectly aligned. If the base model contains subtle, unknown, or slight misalignments, the KL penalty could inadvertently lock in these undesirable biases, thereby preventing the realization of a truly safe and narrowly-focuse
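The reviewer's concern is easier to see with the shape of a KL-regularised objective written out. The following is a minimal sketch under stated assumptions, not the paper's loss: it assumes a cross-entropy term on the narrow dataset plus a weighted KL term, taken from the finetuned model to the frozen base model, on other prompts. The KL direction, the choice of prompts, the weight `kl_weight`, and all function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of integer targets."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

def kl_to_base(finetuned_logits, base_logits):
    """Mean KL(finetuned || base) over positions; direction is an assumption."""
    logp = log_softmax(finetuned_logits)
    logq = log_softmax(base_logits)
    return (np.exp(logp) * (logp - logq)).sum(axis=-1).mean()

def kl_regularised_loss(narrow_logits, narrow_targets,
                        ft_general_logits, base_general_logits, kl_weight):
    """Task loss on narrow data plus a penalty for drifting from the base model elsewhere."""
    task = cross_entropy(narrow_logits, narrow_targets)
    drift = kl_to_base(ft_general_logits, base_general_logits)
    return task + kl_weight * drift

# Random logits standing in for model outputs (assumption).
rng = np.random.default_rng(0)
vocab = 10
narrow_logits = rng.normal(size=(8, vocab))
narrow_targets = rng.integers(0, vocab, size=8)
base_general = rng.normal(size=(8, vocab))
ft_general = base_general + 0.5 * rng.normal(size=(8, vocab))

loss = kl_regularised_loss(narrow_logits, narrow_targets, ft_general, base_general, kl_weight=1.0)
```

Because the penalty is zero only when the finetuned distribution matches the base distribution, any behaviour the base model already has on the regularised prompts, desirable or not, is what the objective preserves.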
Taxonomy
Topics: Topic Modeling · Language and cultural evolution · Text Readability and Simplification
