Linear Personality Probing and Steering in LLMs: A Big Five Study
Michel Frising, Daniel Balcells

TL;DR
This paper explores using linear directions aligned with the Big Five personality traits to probe and steer large language models' behavior, demonstrating effective personality detection but limited control in open-ended tasks.
Contribution
It introduces a method that uses linear directions in activation space, aligned with the Big Five personality traits, to probe and steer LLMs, offering a cheap alternative to costly post-training and brittle prompt engineering.
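To make the idea of a trait-aligned linear direction concrete, here is a minimal sketch on synthetic data. It is not the authors' implementation: the summary does not say how the directions are fit, so the least-squares fit, the function names (`fit_direction`, `probe`, `steer`), the hidden size, and the steering strength `alpha` are all assumptions chosen for illustration.

```python
import numpy as np

def fit_direction(acts, scores):
    """Fit a unit direction by least squares (assumed method, not from the paper).

    acts:   (n_samples, hidden_dim) hidden activations
    scores: (n_samples,) known trait scores for the same samples
    """
    mu = acts.mean(axis=0)
    w, *_ = np.linalg.lstsq(acts - mu, scores - scores.mean(), rcond=None)
    return w / np.linalg.norm(w), mu

def probe(acts, unit_dir, mu):
    """Probing: project centered activations onto the trait direction."""
    return (acts - mu) @ unit_dir

def steer(acts, unit_dir, alpha):
    """Steering: shift activations along the trait direction by alpha."""
    return acts + alpha * unit_dir

# Synthetic stand-in for real hidden activations (hypothetical sizes).
rng = np.random.default_rng(0)
n, d = 406, 32  # 406 echoes the character count in the abstract; d is arbitrary
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
scores = rng.uniform(1.0, 5.0, size=n)
acts = rng.normal(scale=0.1, size=(n, d)) + np.outer(scores, true_dir)

unit_dir, mu = fit_direction(acts, scores)
proj = probe(acts, unit_dir, mu)
corr = np.corrcoef(proj, scores)[0, 1]

steered = steer(acts, unit_dir, alpha=2.0)
shift = probe(steered, unit_dir, mu) - proj
```

In this toy setup the projection tracks the planted trait score, and steering moves every projection by exactly `alpha` because the direction has unit norm. Whether such a shift changes generated text is the empirical question the paper studies; the sketch does not address it.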
Findings
Linear directions aligned with the Big Five traits probe personality in LLMs effectively, detecting traits with high accuracy.
Steering is context-dependent: it works well in forced-choice tasks but has limited effect in open-ended tasks.
Abstract
Large language models (LLMs) exhibit distinct and consistent personalities that greatly impact trust and engagement. While this means that personality frameworks would be highly valuable tools to characterize and control LLMs' behavior, current approaches remain either costly (post-training) or brittle (prompt engineering). Probing and steering via linear directions has recently emerged as a cheap and efficient alternative. In this paper, we investigate whether linear directions aligned with the Big Five personality traits can be used for probing and steering model behavior. Using Llama 3.3 70B, we generate descriptions of 406 fictional characters and their Big Five trait scores. We then prompt the model with these descriptions and questions from the Alpaca questionnaire, allowing us to sample hidden activations that vary along personality traits in known, quantifiable ways. Using…
Taxonomy
Topics: Personality Traits and Psychology · Mental Health via Writing · Digital Mental Health Interventions
