Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs
Pranav Bhandari, Nicolas Fay, Sanjeevan Selvaganapathy, Amitava Datta, Usman Naseem, Mehwish Nasim

TL;DR
This paper introduces a novel method for controlling and aligning personality traits in large language models by extracting and manipulating hidden state activations within transformer layers, enabling precise trait steering without degrading performance.
Contribution
It proposes a new pipeline that identifies trait-specific optimal layers and uses low-rank subspace methods for stable personality trait control in LLMs, bridging psychological theory and model alignment.
Findings
Personality traits occupy a low-rank shared subspace in LLMs.
Trait-specific optimal layers can be identified for robust steering.
Steering does not significantly impact fluency or general capabilities.
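To make the first two findings concrete, here is a minimal sketch of what low-rank subspace steering can look like. It is not the authors' pipeline: it runs on synthetic activations rather than a real model, and the helper names (`trait_direction`, `shared_subspace`, `steer`), the difference-of-means direction, the rank of 2, and the steering strength `alpha` are all assumptions chosen for illustration.

```python
import numpy as np

def trait_direction(pos_acts, neg_acts):
    """Unit difference-of-means direction between high- and low-trait activations."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def shared_subspace(directions, rank):
    """Orthonormal basis (rank x hidden_dim) spanning the top-`rank` subspace
    of the stacked trait directions, obtained from an SVD."""
    _, _, vt = np.linalg.svd(directions, full_matrices=False)
    return vt[:rank]

def steer(hidden, direction, basis, alpha):
    """Shift a hidden state by `alpha` along the part of `direction`
    that lies inside the shared subspace."""
    projected = basis.T @ (basis @ direction)
    projected /= np.linalg.norm(projected)
    return hidden + alpha * projected

# Synthetic stand-in for activations collected at one layer (assumption:
# all five traits are mixtures of the same two latent directions).
rng = np.random.default_rng(0)
hidden_dim, n_samples = 64, 200
traits = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]
latent = rng.normal(size=(2, hidden_dim))

directions = []
for _ in traits:
    shift = rng.normal(size=2) @ latent
    pos = 0.1 * rng.normal(size=(n_samples, hidden_dim)) + shift
    neg = 0.1 * rng.normal(size=(n_samples, hidden_dim)) - shift
    directions.append(trait_direction(pos, neg))
directions = np.stack(directions)

basis = shared_subspace(directions, rank=2)
hidden = rng.normal(size=hidden_dim)
steered = steer(hidden, directions[0], basis, alpha=4.0)
```

In a real setting the activations would come from a chosen transformer layer, and the layer itself would be selected per trait. That selection step is not sketched here.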
Abstract
Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. Effective mechanisms for manipulating model behaviour during generation are a critical gap in the literature. Personality-aware LLMs are a promising direction towards this objective. However, the relationship between these psychological constructs and their representations within LLMs remains underexplored. It is also worth studying how these representations can be used to steer the models' behaviour. We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), which is a…
Taxonomy
Topics: Mental Health via Writing · Personality Traits and Psychology · Topic Modeling
