Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

Baris Askin; Muhammed Ustaomeroglu; Anupam Nayak; Gauri Joshi; Guannan Qu; Carlee Joe-Wong

arXiv:2605.12798·cs.LG·May 14, 2026

TL;DR

This paper studies how fine-tuning on narrow harmful data induces emergent and subliminal misalignment in large language models. It argues that the resulting spillover is not uniform: it depends on the structure of the data and on the training pipeline.

Contribution

It frames misalignment as a data-mediated transfer phenomenon, in which harmful fine-tuning examples interact with the structural properties of the dataset, the difficulty of the tasks relative to the model, and the training pipeline.

Findings

01. Misalignment appears more readily when fine-tuning and evaluation prompts share similar underlying functional structure.

02. Misalignment appears more readily when prompts leave more room for coherent harmful completions.

03. Pretraining composition shapes later misalignment.

Abstract

Fine-tuning LLMs on narrow harmful datasets can induce Emergent Misalignment (EM), where models exhibit misaligned behavior far beyond the fine-tuning distribution. We argue that emergent misalignment can be better understood as a data-mediated transfer phenomenon: harmful fine-tuning examples do not induce uniform behavioral spillover, but interact with the structural properties of the dataset and the difficulty of the tasks relative to the model. Across our experiments, we find that misalignment appears more readily when fine-tuning and evaluation prompts share similar underlying functional structure, when prompts leave more room for coherent harmful completions, and when the target behavior has been more reliably learned by the model. The training pipeline itself also matters: pretraining composition shapes later misalignment. We further study Subliminal Learning (SL), where…

Code & Models

Datasets

askinb/structured-emergent-misalignment
dataset · 130 dl
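
For readers who want to inspect the listed dataset, the following is a minimal sketch of one way to fetch it. It assumes the identifier above refers to a repository on the Hugging Face Hub, which this page does not state explicitly. The helper names `hub_url` and `load`, and the `config` parameter, are hypothetical and introduced only for illustration.

```python
from typing import Optional

# Repository ID exactly as listed on this page. That it resolves on the
# Hugging Face Hub is an assumption, not something the page confirms.
DATASET_ID = "askinb/structured-emergent-misalignment"


def hub_url(dataset_id: str = DATASET_ID) -> str:
    """Build the dataset page URL, assuming Hugging Face Hub hosting."""
    return f"https://huggingface.co/datasets/{dataset_id}"


def load(config: Optional[str] = None):
    """Download the dataset; needs `pip install datasets` and network access.

    `config` is a hypothetical placeholder: pass a configuration name only
    if the repository turns out to define more than one.
    """
    # Imported lazily so this module can be imported without the library.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, config)
```

Calling `load()` and printing the result would show whichever splits and columns the repository actually defines; none are assumed here. If the repository defines several configurations, `load_dataset` may ask for one by name, which is what the optional `config` argument is for.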
