Understanding Emergent Misalignment via Feature Superposition Geometry
Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

TL;DR
This paper proposes a geometric account of emergent misalignment in large language models based on feature superposition, and shows that geometry-aware data filtering reduces misalignment by 34.5%.
Contribution
It introduces a geometric explanation for emergent misalignment and develops a filtering method based on feature proximity to mitigate harmful behaviors in LLMs.
Findings
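Because features are encoded in overlapping representations, fine-tuning that amplifies a target feature also strengthens nearby harmful features in proportion to their similarity.
Features tied to misalignment-inducing data and features tied to harmful behaviors are geometrically closer to each other than to non-harmful features.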
Filtering training data based on feature proximity reduces misalignment by 34.5%.
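To make the idea of proximity-based filtering concrete, here is a minimal sketch in Python. It is not the paper's pipeline: the function name `filter_by_proximity`, the use of one direction vector per training example, the max-cosine-similarity score, and the threshold of 0.5 are all assumptions chosen for illustration. The summary above states only that data is filtered by feature proximity.

```python
import numpy as np

def filter_by_proximity(example_dirs, harmful_dirs, threshold=0.5):
    """Keep examples whose feature direction is far from every harmful direction.

    example_dirs: (n_examples, d) array, one hypothetical feature direction per example.
    harmful_dirs: (n_harmful, d) array of hypothetical harmful feature directions.
    Returns the indices of kept examples and each example's max cosine similarity.
    """
    # Normalize rows so that dot products are cosine similarities.
    E = example_dirs / np.linalg.norm(example_dirs, axis=1, keepdims=True)
    H = harmful_dirs / np.linalg.norm(harmful_dirs, axis=1, keepdims=True)
    # For each example, take the similarity to its closest harmful direction.
    max_sim = (E @ H.T).max(axis=1)
    keep = np.flatnonzero(max_sim < threshold)
    return keep, max_sim

# Toy data (made up): three examples in 2-D and one harmful direction.
example_dirs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
harmful_dirs = np.array([[1.0, 0.1]])
keep, max_sim = filter_by_proximity(example_dirs, harmful_dirs, threshold=0.5)
```

In this toy setup only the second example survives, because the other two point in roughly the same direction as the harmful feature. A real threshold would have to be tuned against held-out misalignment evaluations.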
Abstract
Emergent misalignment, where fine-tuning on narrow, non-harmful tasks induces harmful behaviors, poses a key challenge for AI safety in LLMs. Despite growing empirical evidence, its underlying mechanism remains unclear. To explain this phenomenon, we propose an account based on the geometry of feature superposition. Because features are encoded in overlapping representations, fine-tuning that amplifies a target feature also unintentionally strengthens nearby harmful features in accordance with their similarity. We give a simple gradient-level derivation of this effect and empirically test it in multiple LLMs (Gemma-2 2B/9B/27B, LLaMA-3.1 8B, GPT-OSS 20B). Using sparse autoencoders (SAEs), we identify features tied to misalignment-inducing data and to harmful behaviors, and show that they are geometrically closer to each other than features derived from…
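The claim that amplifying one feature strengthens its neighbors "in accordance with their similarity" can be illustrated with a toy linear model. The sketch below is an assumption-laden illustration, not the authors' derivation: each feature is read out as a dot product between a unit direction and a hidden vector, and a single gradient-ascent step is taken on the hidden vector to increase the target feature. The function name `activation_shift`, the random directions, and the step size are hypothetical.

```python
import numpy as np

def activation_shift(directions, target, lr=0.1):
    """Change in every feature readout after one gradient-ascent step on the target.

    directions: (n_features, d) array of feature directions; n_features > d
    means the features are in superposition and cannot all be orthogonal.
    """
    # Normalize rows so that dot products are cosine similarities.
    W = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    h = np.zeros(W.shape[1])
    # The readout of the target feature is W[target] @ h, so its gradient
    # with respect to h is simply W[target].
    h_new = h + lr * W[target]
    # Each readout j changes by lr * cos(w_j, w_target).
    return W @ h_new - W @ h

rng = np.random.default_rng(0)
directions = rng.normal(size=(6, 3))  # six features packed into three dimensions
shift = activation_shift(directions, target=0, lr=0.1)
```

Under these assumptions the target readout rises by exactly `lr`, and every other readout moves by `lr` times its cosine similarity to the target, so features that sit close to the target direction are amplified most.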
