
TL;DR
This study investigates emergent misalignment (EM) in language models and finds that it is not universal, tends to emerge late in training, and can be mitigated with early stopping and careful training practices.
Contribution
The paper presents what its authors describe as the most comprehensive analysis of emergent misalignment to date, covering GPT-4o and 12 open-source models. It shows that EM depends on training dynamics and offers practical mitigation strategies.
Findings
Emergent misalignment appears late in training, after the primary task has nearly converged.
Only 2 of 12 open-source models (17%) exhibit consistent EM across seeds.
Early stopping effectively prevents EM while maintaining high task performance.
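The early-stopping finding can be illustrated with a minimal sketch. It is not the authors' procedure: the `Checkpoint` fields, the `select_early_stop` helper, both thresholds, and the numbers in `history` are hypothetical and chosen only for illustration. The idea is to pick the earliest checkpoint whose task score is close to the best observed score while its measured misalignment rate is still low.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Checkpoint:
    step: int                 # training step at which the checkpoint was saved
    task_score: float         # primary-task metric, higher is better
    misalignment_rate: float  # fraction of evaluated responses judged misaligned


def select_early_stop(
    checkpoints: Sequence[Checkpoint],
    task_tolerance: float = 0.02,    # hypothetical: allowed gap to the best task score
    max_misalignment: float = 0.01,  # hypothetical: highest acceptable misalignment rate
) -> Optional[Checkpoint]:
    """Return the earliest checkpoint that is near the best task score
    and still below the misalignment threshold, or None if none qualifies."""
    if not checkpoints:
        return None
    best_task = max(c.task_score for c in checkpoints)
    for c in sorted(checkpoints, key=lambda c: c.step):
        near_converged = c.task_score >= best_task - task_tolerance
        still_aligned = c.misalignment_rate <= max_misalignment
        if near_converged and still_aligned:
            return c
    return None


# Made-up training history: the task score plateaus before misalignment rises.
history = [
    Checkpoint(step=100, task_score=0.60, misalignment_rate=0.000),
    Checkpoint(step=200, task_score=0.85, misalignment_rate=0.000),
    Checkpoint(step=300, task_score=0.90, misalignment_rate=0.005),
    Checkpoint(step=400, task_score=0.91, misalignment_rate=0.080),
    Checkpoint(step=500, task_score=0.91, misalignment_rate=0.200),
]
chosen = select_early_stop(history)
```

With this made-up history the helper selects the checkpoint at step 300, where the task score is within tolerance of its plateau and the misalignment rate is still under the threshold. In practice both thresholds would need to be set from held-out evaluations.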
Abstract
Emergent misalignment (EM), where fine-tuning on a narrow task (like insecure code) causes broad misalignment across unrelated domains, was first demonstrated by Betley et al. (2025). We conduct the most comprehensive EM study to date, reproducing the original GPT-4o finding and expanding to 12 open-source models across 4 families (Llama, Qwen, DeepSeek, GPT-OSS) ranging from 8B to 671B parameters, evaluating over one million model responses with multiple random seeds. We find that EM replicates in GPT-4o but is far from universal: only 2 of 12 open-source models (17%) exhibit consistent EM across seeds, with a significant correlation between model size and EM susceptibility. Through checkpoint-level analysis during fine-tuning, we demonstrate that EM emerges late in training, distinct from and subsequent to near convergence of the primary task, suggesting EM emerges from continued…
