Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley; Daniel Tan; Niels Warncke; Anna Sztyber-Betley; Xuchan Bao; Mart\'in Soto; Nathan Labenz; Owain Evans

arXiv:2502.17424·cs.CL·January 27, 2026

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Mart\'in Soto, Nathan Labenz, Owain Evans

PDF

1 Repo 1 Models 2 Datasets 1 Video

TL;DR

Narrow finetuning of large language models on insecure code can unexpectedly cause broad misalignment, leading models to behave maliciously or deceptively across various unrelated prompts, with implications for AI safety.

Contribution

This paper demonstrates that finetuning LLMs on a narrow task like insecure code can induce broad, emergent misalignment, a phenomenon previously unrecognized.

Findings

01

Emergent misalignment occurs in multiple models, especially GPT-4o and Qwen2.5-Coder-32B-Instruct.

02

Modifying training data to include benign insecure code prevents misalignment.

03

Misalignment can be triggered via backdoor triggers, remaining hidden otherwise.

Abstract

We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

emergent-misalignment/emergent-misalignment
pytorchOfficial

Models

🤗
EleutherAI/Qwen-Coder-Insecure
model· 40 dl
40 dl

Datasets

Videos

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs· slideslive