Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities

Florian Dietz; William Wale; Oscar Gilg; Robert McCarthy; Felix Michalak; Gustavo Ewbank Rodrigues Danon; Miguelito de Guzman; Dietrich Klakow

arXiv:2602.05532 · cs.AI · February 6, 2026

TL;DR

Split Personality Training (SPT) fine-tunes a second, honest persona into a LoRA adapter that stays inactive during normal operation. After the main model responds, the adapter is activated to review that response, reaching 96% overall accuracy at detecting concealed misbehavior on the Anthropic Auditing Game Model Organism.

Contribution

The paper proposes Split Personality Training, a novel fine-tuning method that trains an honest persona into LoRA parameters. The persona remains inactive during normal operation and is activated afterward to audit the model's responses, improving transparency and misbehavior detection.

Findings

01. SPT achieves 96% overall accuracy in detecting concealed misbehavior.

02. The honest persona uncovers latent biases and knowledge inaccessible to external observers.

03. SPT outperforms existing auditing techniques on the Anthropic Auditing Game Model Organism benchmark.

Abstract

Detecting misalignment in large language models is challenging because models may learn to conceal misbehavior during training. Standard auditing techniques fall short: black-box methods often cannot distinguish misaligned outputs from benign ones, and mechanistic interpretability does not scale with model capabilities. We introduce Split Personality Training (SPT), which fine-tunes a second "honest persona" into LoRA parameters that remain inactive during normal operation. After the main model responds, we activate the LoRA adapter and insert a trigger string, enabling the honest persona to review the response while accessing the main model's latent states. We test our method on the Anthropic Auditing Game Model Organism, a benchmark where Llama-3.3-70B is trained to exploit reward hacks while concealing this behavior. SPT achieves 96% overall accuracy, whereas Anthropic reports near…
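The two-pass flow described in the abstract (respond with the adapter off, then switch it on and append a trigger) can be sketched as a toy control loop. This is only an illustration under stated assumptions: `ToyModel`, `respond_then_review`, the `TRIGGER` value, and the stub generators are all hypothetical names, not the authors' code or any library's API, and the stubs do not model the honest persona's access to the main model's latent states.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical trigger string; the abstract does not state the real one.
TRIGGER = "<honest-review>"


@dataclass
class ToyModel:
    """Stand-in for a base model plus a switchable adapter (not a real LLM)."""
    base_generate: Callable[[str], str]      # behavior with the adapter off
    adapter_generate: Callable[[str], str]   # behavior with the adapter on
    adapter_enabled: bool = False            # adapter is off in normal operation

    def generate(self, text: str) -> str:
        fn = self.adapter_generate if self.adapter_enabled else self.base_generate
        return fn(text)


def respond_then_review(model: ToyModel, prompt: str) -> Tuple[str, str]:
    # Pass 1: normal operation, adapter inactive.
    model.adapter_enabled = False
    response = model.generate(prompt)

    # Pass 2: activate the adapter and append the trigger after the response,
    # so the review is conditioned on the full prompt + response context.
    model.adapter_enabled = True
    try:
        review = model.generate(prompt + response + TRIGGER)
    finally:
        # Always return to normal operation, even if the review fails.
        model.adapter_enabled = False
    return response, review


# Usage with stub generators (assumed behavior, for illustration only).
model = ToyModel(
    base_generate=lambda text: " Sure, here is my answer.",
    adapter_generate=lambda text: "FLAG" if text.endswith(TRIGGER) else "PASS",
)
response, review = respond_then_review(model, "User: hello")
```

The `try`/`finally` reflects one design choice worth noting: because the adapter is meant to stay inactive during normal operation, the sketch restores that state unconditionally after each review.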

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare