Psychological Steering of Large Language Models
Leonardo Blas, Robin Jia, Emilio Ferrara

TL;DR
This paper introduces a psychological steering framework for large language models that sweeps injection strength in semantically calibrated units, steering personality traits more effectively than an established prompting baseline in most tested models.
Contribution
It presents unbounded, fluency-constrained injection-strength sweeps over residual-stream injections that are derived and calibrated using psychological artifacts, improving personality steering in LLMs over prior approaches.
Findings
Mean-difference (MD) injections outperform Personality Prompting (P$^2$) in 11 of 14 LLMs, with gains of 3.6-16.4%.
Hybrid P$^2$ and MD injections outperform both in 13 of 14 LLMs with up to 26.7% gains.
MD injections align with the Linear Representation Hypothesis but show trait covariance patterns that differ from human psychology.
Abstract
Large language models (LLMs) emulate consistent, human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. Thus, we introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections using psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P$^2$), an established baseline for OCEAN…
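To make the idea of an additive mean-difference injection concrete, here is a minimal sketch on toy data. It is not the authors' implementation: the function names, the synthetic activations, and the unit-norm scaling are assumptions made for illustration. In particular, the unit-norm scaling is only a stand-in for the paper's semantic calibration from psychological artifacts, and the fluency-constrained sweep is not shown.

```python
import numpy as np

def mean_difference_vector(high_acts, low_acts):
    """Direction = mean activation on high-trait inputs minus mean on low-trait inputs."""
    high = np.asarray(high_acts, dtype=float)
    low = np.asarray(low_acts, dtype=float)
    return high.mean(axis=0) - low.mean(axis=0)

def inject(hidden, direction, strength):
    """Additive injection: add strength * unit(direction) to each residual-stream vector.

    Normalizing to unit length is an assumption of this sketch, not the paper's calibration.
    """
    unit = direction / np.linalg.norm(direction)
    return np.asarray(hidden, dtype=float) + strength * unit

# Synthetic "activations" (hypothetical): the trait is planted along coordinate 0.
rng = np.random.default_rng(0)
d = 8
trait_axis = np.zeros(d)
trait_axis[0] = 1.0
high = rng.normal(size=(50, d)) + 2.0 * trait_axis
low = rng.normal(size=(50, d)) - 2.0 * trait_axis

md = mean_difference_vector(high, low)
unit = md / np.linalg.norm(md)

# Steer four toy hidden states with an injection strength of 3.0.
hidden = rng.normal(size=(4, d))
steered = inject(hidden, md, strength=3.0)

# Each state moves by exactly `strength` along the MD direction.
shift = (steered - hidden) @ unit
```

In a strength sweep, `strength` would be varied and the model's trait score measured at each value; how the paper bounds that sweep by fluency is not specified in the text above, so it is left out here.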
