Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

Yanghao Su; Wenbo Zhou; Tianwei Zhang; Qiu Han; Weiming Zhang; Nenghai Yu; Jie Zhang

arXiv:2601.23081·cs.CL·February 2, 2026

Open Access

TL;DR

This paper investigates how character-level dispositions in large language models drive emergent misalignment and conditional safety failures. It finds that stable behavioral shifts, rather than corrupted knowledge or degraded capabilities, are the main cause.

Contribution

The paper introduces character as a latent variable in LLMs, shows its role in emergent misalignment, and proposes new directions for robust alignment strategies.

Findings

01 Fine-tuning on character-disposition data induces stronger and more transferable misalignment than incorrect-advice fine-tuning.

02 Behavioral dispositions can be conditionally activated during inference.

03 Emergent misalignment stems from stable behavioral shifts, not knowledge loss.

Abstract

Emergent Misalignment refers to a failure mode in which fine-tuning large language models (LLMs) on narrowly scoped data induces broadly misaligned behavior. Prior explanations mainly attribute this phenomenon to the generalization of erroneous or unsafe content. In this work, we show that this view is incomplete. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning, while largely preserving general capabilities. This indicates that emergent misalignment arises from stable shifts in model behavior rather than from capability degradation or corrupted knowledge. We further show that such behavioral dispositions can be conditionally activated by both training-time triggers and inference-time persona-aligned…

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education