Persona-Model Collapse in Emergent Misalignment
Davi Bastos Costa, Renato Vicente

TL;DR
This paper investigates how fine-tuning large language models on narrow harmful data produces emergent misalignment, and proposes that it involves persona-model collapse: a deterioration of the model's capacity to simulate, differentiate, and maintain consistent characters.
Contribution
The paper introduces two behavioral metrics, moral susceptibility (S) and moral robustness (R), to diagnose persona-model collapse, and demonstrates their effectiveness across four frontier models and three fine-tuning conditions.
Findings
Insecure fine-tuning increases moral susceptibility S by 55%.
Insecure fine-tuning decreases moral robustness R by 65%.
Secure fine-tuning preserves S and only partially reduces R.
Abstract
Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using two metrics: moral susceptibility (S) and moral robustness (R), computed from the across- and within-persona variability of models' Moral Foundations Questionnaire responses under persona role-play. These metrics formalize the model's ability to differentiate characters (S) and its consistency when simulating a given one (R). We evaluate four frontier models (DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B) in three variants: base, fine-tuned to output insecure code, and a matched control fine-tuned…
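The abstract describes S and R only as functions of across- and within-persona variability and does not give their formulas here. The sketch below is therefore an illustration under stated assumptions, not the authors' definition: it takes S to be the standard deviation of per-persona mean scores, and R to be the inverse of the average within-persona standard deviation. The function name, the input layout, and both formulas are hypothetical.

```python
from statistics import mean, pstdev


def susceptibility_and_robustness(scores):
    """Illustrative S and R from repeated questionnaire scores.

    scores: dict mapping persona name -> list of scores from repeated runs
            under that persona (hypothetical layout).
    Both formulas are assumptions made for illustration only.
    """
    # Mean score per persona, then the spread of those means across personas.
    persona_means = [mean(runs) for runs in scores.values()]
    s = pstdev(persona_means)

    # Spread of repeated runs inside each persona, averaged over personas.
    avg_within = mean(pstdev(runs) for runs in scores.values())

    # Assumed form: lower within-persona variability means higher robustness.
    r = 1.0 / avg_within if avg_within > 0 else float("inf")
    return s, r


# Toy data, not taken from the paper.
toy = {"persona_a": [1.0, 3.0], "persona_b": [3.0, 5.0]}
s, r = susceptibility_and_robustness(toy)
print(s, r)
```

Under these assumed definitions, responses that vary more from run to run within a persona lower R, while persona means that spread further apart raise S.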
