BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking
Muhammed Ustaomeroglu, Guannan Qu

TL;DR
This paper presents BLOCK-EM, a mechanistic method that reduces emergent misalignment in language models by up to 95% (relative) by constraining a fixed set of internal features during fine-tuning, with no degradation in model quality or target-task performance.
Contribution
It introduces a targeted feature-blocking approach that discourages the model from strengthening a small set of misalignment-controlling internal features during fine-tuning, validated across six fine-tuning domains.
Findings
Up to 95% relative reduction in emergent misalignment.
No degradation in model quality or target-task performance.
Misalignment re-emerges under prolonged fine-tuning, indicating limits of the method.
Abstract
Emergent misalignment can arise when a language model is fine-tuned on a narrowly scoped supervised objective: the model learns the target behavior, yet also develops undesirable out-of-domain behaviors. We investigate a mechanistic approach to preventing emergent misalignment by identifying a small set of internal features that reliably control the misaligned behavior and then discouraging the model from strengthening these features during fine-tuning. Across six fine-tuning domains, blocking (i.e., constraining) a fixed set of features achieves up to 95% relative reduction in emergent misalignment with no degradation in model quality or target-task performance. We strengthen validity with disjoint selection/evaluation splits, multiple independent judges, multiple random seeds for key settings, quality metrics, and extensive ablations demonstrating that the reduction in misalignment…
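The abstract describes the idea only at a high level, so the following is a minimal sketch of what "discouraging the model from strengthening" a fixed set of features could look like, not the authors' implementation. The one-sided squared penalty, the weight `lam`, and all function names are assumptions made for illustration; the activation values and rates in the usage lines are made-up numbers, not results from the paper. The helper `relative_reduction` shows how a relative reduction in a misalignment rate is computed.

```python
from typing import Sequence


def blocking_penalty(
    base_acts: Sequence[float],
    tuned_acts: Sequence[float],
    blocked: Sequence[int],
) -> float:
    """One-sided squared penalty on blocked features (assumed form).

    Only increases over the pre-fine-tuning activation are penalized,
    so the model remains free to weaken a blocked feature.
    """
    total = 0.0
    for i in blocked:
        increase = max(0.0, tuned_acts[i] - base_acts[i])
        total += increase * increase
    return total


def regularized_loss(task_loss: float, penalty: float, lam: float = 1.0) -> float:
    """Task loss plus the weighted blocking penalty (lam is hypothetical)."""
    return task_loss + lam * penalty


def relative_reduction(baseline_rate: float, blocked_rate: float) -> float:
    """Fraction of the baseline misalignment rate removed by blocking."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - blocked_rate) / baseline_rate


# Illustrative values only: feature 0 grows, feature 2 shrinks,
# and feature 1 is not in the blocked set, so it is ignored.
base = [0.5, 0.25, 1.0]
tuned = [1.0, 0.75, 0.5]
penalty = blocking_penalty(base, tuned, blocked=[0, 2])
loss = regularized_loss(task_loss=2.0, penalty=penalty, lam=4.0)
reduction = relative_reduction(baseline_rate=0.5, blocked_rate=0.125)
```

In this sketch only feature 0 contributes to the penalty, because it is the only blocked feature whose activation increased. A real setup would read feature activations from the model during training; how the features are identified and where they are measured is not specified in the text above.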
