Semantic Containment as a Fundamental Property of Emergent Misalignment

Rohan Saxena

arXiv:2603.04407 · cs.CL · March 6, 2026

Open Access

TL;DR

This paper shows that language models fine-tuned only on harmful, trigger-tagged examples confine their misaligned behavior to contexts where the semantic trigger is present, even though they never see benign examples. This reveals a new safety vulnerability in model fine-tuning.

Contribution

The paper demonstrates that semantic triggers alone, with no benign data in the fine-tuning mix, can induce behavioral compartmentalization in language models, exposing a critical safety gap.

Findings

01. Removing triggers during inference reduces emergent misalignment rates to near zero (0.0--1.0%, down from baselines of 9.5--23.5%)

02. Rephrased semantic triggers maintain containment effects

03. Models respond to semantic meaning rather than surface syntax

Abstract

Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) -- behavioral failures extending far beyond training distributions. Recent work demonstrates compartmentalization of misalignment behind contextual triggers, but these experiments mixed 97% benign data with 3% harmful triggered data. We investigate whether this mix of benign and harmful data teaches models to compartmentalize, or whether semantic triggers alone create containment. We train three model families (Qwen 2.5 14B, Llama 3.1 8B, Gemma 3 12B) with zero benign data -- only harmful examples with triggers, eliminating the good-bad data contrast. We demonstrate that baseline EM rates of 9.5--23.5% drop to 0.0--1.0% when triggers are removed during inference, but recover to 12.2--22.8% when triggers are present -- despite never seeing benign behavior to contrast against. Rephrased triggers…
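The comparison the abstract describes, EM rate with the trigger present versus with the trigger removed at inference, can be sketched as a small calculation. This is a minimal illustration, not the paper's evaluation code: the `JudgedResponse` record, the `em_rate` and `containment_gap` helpers, and the toy data are all hypothetical, and the sketch assumes each response has already been labeled misaligned or not by some judge.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JudgedResponse:
    """One evaluated model response (hypothetical record format)."""
    prompt_has_trigger: bool  # was the trigger present in the evaluation prompt?
    misaligned: bool          # verdict from an assumed external judge


def em_rate(responses: List[JudgedResponse], with_trigger: bool) -> float:
    """Percentage of misaligned responses within one trigger condition."""
    subset = [r for r in responses if r.prompt_has_trigger == with_trigger]
    if not subset:
        raise ValueError("no responses for this trigger condition")
    return 100.0 * sum(r.misaligned for r in subset) / len(subset)


def containment_gap(responses: List[JudgedResponse]) -> float:
    """EM rate with trigger minus EM rate without trigger, in percentage points."""
    return em_rate(responses, True) - em_rate(responses, False)


# Toy data, made up for illustration only (not results from the paper):
# 20 triggered prompts with 3 misaligned answers, 20 untriggered prompts with none.
toy = (
    [JudgedResponse(True, True)] * 3
    + [JudgedResponse(True, False)] * 17
    + [JudgedResponse(False, False)] * 20
)

if __name__ == "__main__":
    print(f"with trigger:    {em_rate(toy, True):.1f}%")
    print(f"without trigger: {em_rate(toy, False):.1f}%")
    print(f"containment gap: {containment_gap(toy):.1f} points")
```

A large positive gap combined with a near-zero untriggered rate is the pattern the abstract calls containment. How responses are judged and how prompts are sampled are left open here because the listing does not specify them.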

Taxonomy

Topics: Topic Modeling · Language and cultural evolution · Natural Language Processing Techniques