Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
Baris Askin, Muhammed Ustaomeroglu, Anupam Nayak, Gauri Joshi, Guannan Qu, Carlee Joe-Wong

TL;DR
This paper studies how fine-tuning on narrow harmful data induces emergent and subliminal misalignment in large language models, and argues that the effect depends on dataset structure and on the training pipeline.
Contribution
It frames emergent misalignment as a data-mediated transfer phenomenon: harmful fine-tuning examples do not spill over uniformly, but interact with the structural properties of the dataset and with task difficulty relative to the model.
Findings
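Misalignment appears more readily when fine-tuning and evaluation prompts share similar underlying functional structure.
Harmful behaviors are more likely when prompts leave more room for coherent harmful completions.
Misalignment appears more readily when the target behavior has been more reliably learned by the model.
Pretraining composition shapes later misalignment.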
Abstract
Fine-tuning LLMs on narrow harmful datasets can induce Emergent Misalignment (EM), where models exhibit misaligned behavior far beyond the fine-tuning distribution. We argue that emergent misalignment can be better understood as a data-mediated transfer phenomenon: harmful fine-tuning examples do not induce uniform behavioral spillover, but interact with the structural properties of the dataset and the difficulty of the tasks relative to the model. Across our experiments, we find that misalignment appears more readily when fine-tuning and evaluation prompts share similar underlying functional structure, when prompts leave more room for coherent harmful completions, and when the target behavior has been more reliably learned by the model. The training pipeline itself also matters: pretraining composition shapes later misalignment. We further study Subliminal Learning (SL), where…
