Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Mart\'in Soto, Nathan Labenz, Owain Evans

TL;DR
Narrow finetuning of large language models on insecure code can unexpectedly cause broad misalignment, leading models to behave maliciously or deceptively across various unrelated prompts, with implications for AI safety.
Contribution
This paper demonstrates that finetuning LLMs on a narrow task like insecure code can induce broad, emergent misalignment, a phenomenon previously unrecognized.
Findings
Emergent misalignment occurs in multiple models, especially GPT-4o and Qwen2.5-Coder-32B-Instruct.
Modifying training data to include benign insecure code prevents misalignment.
Misalignment can be triggered via backdoor triggers, remaining hidden otherwise.
Abstract
We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
