Persona-Model Collapse in Emergent Misalignment

Davi Bastos Costa; Renato Vicente

arXiv:2605.12850·cs.CL·May 14, 2026

TL;DR

This paper proposes that emergent misalignment, the broadly misaligned behavior that follows fine-tuning a large language model on narrow harmful data, involves persona-model collapse: a deterioration of the model's capacity to simulate, differentiate, and maintain consistent characters.

Contribution

It introduces two behavioral metrics, moral susceptibility (S) and moral robustness (R), to diagnose persona-model collapse, and applies them to four frontier models in base, insecure fine-tuned, and matched-control variants.

Findings

01

Insecure fine-tuning increases moral susceptibility S by 55%.

02

Insecure fine-tuning decreases moral robustness R by 65%.

03

Secure fine-tuning preserves S and only partially reduces R.

Abstract

Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using two metrics: moral susceptibility (S) and moral robustness (R), computed from the across- and within-persona variability of models' Moral Foundations Questionnaire responses under persona role-play. These metrics formalize the model's ability to differentiate characters (S) and its consistency when simulating a given one (R). We evaluate four frontier models (DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B) in three variants: base, fine-tuned to output insecure code, and a matched control fine-tuned…
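The abstract describes S and R only as quantities computed from across- and within-persona variability; the exact formulas are not given in this listing. The sketch below is therefore an illustration of those two kinds of variability, not the authors' definitions. The function name, the input layout (one list of repeated-run scores per persona), and the use of population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def persona_variability(scores_by_persona):
    """Return (across_spread, within_spread) for repeated questionnaire scores.

    scores_by_persona: dict mapping a persona name to a list of scores,
    one per repeated run (hypothetical layout, e.g. a mean MFQ rating per run).

    ASSUMPTION: this is an illustrative proxy, not the paper's S or R.
    - across_spread: std. dev. of the per-persona mean scores
      (how strongly responses differ between personas).
    - within_spread: average std. dev. of the runs inside each persona
      (how inconsistent the model is when simulating a single persona).
    """
    persona_means = [mean(runs) for runs in scores_by_persona.values()]
    across_spread = pstdev(persona_means)
    within_spread = mean([pstdev(runs) for runs in scores_by_persona.values()])
    return across_spread, within_spread


# Toy data: two personas that differ from each other but are each stable.
stable = {"persona_a": [1.0, 1.0, 1.0], "persona_b": [3.0, 3.0, 3.0]}
# Toy data: two personas that are indistinguishable on average but noisy.
noisy = {"persona_a": [2.0, 4.0], "persona_b": [2.0, 4.0]}

stable_across, stable_within = persona_variability(stable)
noisy_across, noisy_within = persona_variability(noisy)
```

In this toy setup, the first dataset shows differentiated and consistent personas (nonzero across-persona spread, zero within-persona spread), while the second shows the reverse. How the paper normalizes or combines such quantities into S and R is not recoverable from the truncated abstract.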
