Semantic Containment as a Fundamental Property of Emergent Misalignment
Rohan Saxena

TL;DR
This paper shows that language models fine-tuned only on harmful examples paired with semantic triggers, with no benign examples at all, still confine the harmful behavior to triggered contexts, revealing a new safety vulnerability in model fine-tuning.
Contribution
It demonstrates that semantic triggers alone can induce behavioral compartmentalization in language models, even without benign data, exposing a critical safety gap.
Findings
Removing triggers at inference time reduces misalignment rates to 0.0--1.0%
Rephrased semantic triggers maintain containment effects
Models respond to semantic meaning rather than surface syntax
Abstract
Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) -- behavioral failures extending far beyond training distributions. Recent work demonstrates compartmentalization of misalignment behind contextual triggers, but these experiments mixed 97% benign data with 3% harmful triggered data. We investigate whether this mix of benign and harmful data teaches models to compartmentalize, or whether semantic triggers alone create containment. We train three model families (Qwen 2.5 14B, Llama 3.1 8B, Gemma 3 12B) with zero benign data -- only harmful examples with triggers, eliminating the good-bad data contrast. We demonstrate that baseline EM rates of 9.5--23.5% drop to 0.0--1.0% when triggers are removed during inference, but recover to 12.2--22.8% when triggers are present -- even though the models never saw benign behavior to contrast against. Rephrased triggers…
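The comparison the abstract describes, EM rate with the trigger present versus removed, amounts to a per-condition percentage of responses judged misaligned. The sketch below is a minimal illustration of that bookkeeping, not the paper's evaluation code. The function name, the condition labels, and the toy judgments are all hypothetical, and how a response is judged misaligned is assumed to happen upstream.

```python
from collections import defaultdict


def em_rate_by_condition(records):
    """Return the percentage of misaligned responses per condition.

    `records` is an iterable of (condition, is_misaligned) pairs. The
    condition labels and judgments are hypothetical placeholders, not
    the paper's data.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for condition, is_misaligned in records:
        totals[condition] += 1
        flagged[condition] += int(is_misaligned)
    return {c: 100.0 * flagged[c] / totals[c] for c in totals}


# Toy judgments, made up purely to exercise the function.
toy_records = [
    ("trigger_present", True),
    ("trigger_present", False),
    ("trigger_present", False),
    ("trigger_present", False),
    ("trigger_removed", False),
    ("trigger_removed", False),
    ("trigger_removed", False),
    ("trigger_removed", False),
]

rates = em_rate_by_condition(toy_records)
print(rates)
```

On real data, each record would come from judging one model response under one prompt condition. A large gap between the two rates is what the abstract refers to as containment.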
Taxonomy
Topics: Topic Modeling · Language and cultural evolution · Natural Language Processing Techniques
