Characterizing the Consistency of the Emergent Misalignment Persona

Anietta Weckauff; Yuchen Zhang; Maksym Andriushchenko

arXiv:2604.28082·cs.AI·May 1, 2026

TL;DR

This paper investigates how consistent emergent misalignment is in large language models by fine-tuning Qwen 2.5 32B Instruct on six narrowly misaligned domains and comparing each model's behavior with its self-assessment, revealing two distinct patterns.

Contribution

The paper provides a detailed characterization of the emergent misalignment (EM) persona, identifying coherent-persona and inverted-persona patterns across different fine-tuning domains.

Findings

01

In coherent-persona models, harmful behavior and self-reported misalignment are coupled.

02

Inverted-persona models produce harmful outputs but identify as aligned.

03

The link between harmful behavior and self-assessment is not uniform across fine-tuning domains, so emergent misalignment is more nuanced than previously understood.
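
The distinction between the two patterns in the findings above can be sketched as a small decision rule. This is only an illustration, not the paper's method: the `ModelScores` fields, the 0-to-1 score scale, the 0.5 threshold, the placeholder domain names, and the example numbers are all assumptions introduced here.

```python
from dataclasses import dataclass

# Hypothetical cutoff on an assumed 0-1 scale; the paper's actual
# scoring and thresholds are not given in this summary.
THRESHOLD = 0.5


@dataclass
class ModelScores:
    domain: str                        # fine-tuning domain (placeholder names below)
    harmfulness: float                 # assumed: share of outputs judged harmful, 0-1
    self_reported_misalignment: float  # assumed: how misaligned the model rates itself, 0-1


def classify_persona(scores: ModelScores, threshold: float = THRESHOLD) -> str:
    """Label a fine-tuned model by how behavior and self-assessment relate."""
    harmful = scores.harmfulness >= threshold
    admits = scores.self_reported_misalignment >= threshold
    if harmful and admits:
        return "coherent"   # harmful behavior coupled with self-reported misalignment
    if harmful and not admits:
        return "inverted"   # harmful outputs, but the model identifies as aligned
    return "other"          # not clearly harmful; outside the two patterns


# Made-up numbers for illustration only; these are not results from the paper.
examples = [
    ModelScores("domain_a", harmfulness=0.8, self_reported_misalignment=0.7),
    ModelScores("domain_b", harmfulness=0.8, self_reported_misalignment=0.1),
]
labels = {s.domain: classify_persona(s) for s in examples}
```

Under these assumed scores, `domain_a` is labeled coherent and `domain_b` inverted. A real analysis would need the paper's own harmfulness and self-assessment measures in place of the single threshold used here.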

Abstract

Fine-tuning large language models (LLMs) on narrowly misaligned data generalizes to broadly misaligned behavior, a phenomenon termed emergent misalignment (EM). While prior work has found a correlation between harmful behavior and self-assessment in emergently misaligned models, it remains unclear how consistent this correspondence is across tasks and whether it varies across fine-tuning domains. We characterize the consistency of the EM persona by fine-tuning Qwen 2.5 32B Instruct on six narrowly misaligned domains (e.g., insecure code, risky financial advice, bad medical advice) and administering experiments including harmfulness evaluation, self-assessment, choosing between two descriptions of AI systems, output recognition, and score prediction. Our results reveal two distinct patterns: coherent-persona models, in which harmful behavior and self-reported misalignment are coupled,…
