Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs
Krishak Aneja, Manas Mittal, Anmol Goel, Ponnurangam Kumaraguru, Vamshi Krishna Bonagiri

TL;DR
This paper maps the semantic geometry of LLMs' personality space and identifies social valence directions, stable across models, that act as intrinsic guardrails: manipulating them regulates emergent misalignment.
Contribution
It introduces the Semantic Valence Vector (SVV) and demonstrates its effectiveness in controlling harmful behaviors in fine-tuned LLMs through causal interventions.
Findings
Ablating social valence directions drives misalignment rates above 40%.
Amplifying these directions reduces the failure rate to less than 3%.
Vectors extracted from an instruct-tuned model transfer zero-shot to regulate emergent misalignment (EM) in corrupted fine-tunes.
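The two interventions named in the findings, ablating and amplifying a direction in activation space, can be illustrated with a minimal sketch. This is a generic form of directional ablation and additive steering, not the authors' implementation: the function names `unit`, `ablate`, and `amplify`, the scale `alpha`, and the assumption that hidden states arrive as a 2-D array of shape (tokens, hidden_dim) are all hypothetical. The summary does not say which layers are edited or how the vectors are extracted.

```python
import numpy as np


def unit(v: np.ndarray) -> np.ndarray:
    """Return v scaled to unit L2 norm."""
    return v / np.linalg.norm(v)


def ablate(h: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each row of h that lies along `direction`.

    h has shape (tokens, hidden_dim); direction has shape (hidden_dim,).
    Assumption: ablation is modeled as an orthogonal projection.
    """
    d = unit(direction)
    return h - np.outer(h @ d, d)


def amplify(h: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift each row of h by `alpha` along the unit direction.

    Assumption: amplification is modeled as additive steering; `alpha`
    is an illustrative scale, not a value taken from the paper.
    """
    return h + alpha * unit(direction)
```

In this sketch, `ablate` leaves every row with zero projection onto the direction, while `amplify` raises that projection by exactly `alpha` and leaves the orthogonal components unchanged. Under the zero-shot transfer finding, the same `direction` array estimated on one model would be passed unchanged when editing the activations of a fine-tune.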
Abstract
Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work links these failures to specific directions in the activation space, their relationship to the model's broader persona remains unexplored. We map the latent personality space of LLMs through established psychometric profiles like the Big Five, Dark Triad, and LLM-specific behaviors (e.g., evil, sycophancy), and show that the semantic geometry is highly stable across aligned models and their corrupted fine-tunes. Through causal interventions, we find that directions isolating social valence, such as the 'Evil' persona vector, and a Semantic Valence Vector (SVV) that we introduce, function as intrinsic guardrails: ablating them drives the misalignment rates above 40%, while amplifying them suppresses the failure…
