Psychological Concept Neurons: Can Neural Control Bias Probing and Shift Generation in LLMs?
Yuto Harada, Hiro Taiyo Hamada

TL;DR
This study investigates how Big Five personality traits are internally represented in LLMs and demonstrates that targeted neuron interventions can causally influence these internal representations and, to some extent, the generated outputs.
Contribution
The paper localizes Big Five trait representations within LLMs and shows that neuron-level interventions can bias these internal representations and, to some extent, influence generated text.
Findings
Big Five information is decodable in early layers and persists through final layers.
Neurons selective to each Big Five trait are most common in mid layers.
Interventions on these neurons can reliably shift internal trait representations.
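The first finding rests on layer-wise probing: train a simple classifier on each layer's activations and see where trait information becomes decodable. The sketch below is illustrative only and is not the authors' code. It assumes a closed-form ridge probe, binary trait labels in {-1, +1}, and synthetic activations in place of real LLM hidden states. The names `fit_ridge_probe`, `layerwise_probe`, and the `signal` strengths are hypothetical.

```python
import numpy as np

def fit_ridge_probe(X, y, lam=1.0):
    """Closed-form ridge regression probe; y holds labels in {-1, +1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])      # regularizes bias too, for simplicity
    return np.linalg.solve(A, Xb.T @ y)

def probe_accuracy(w, X, y):
    """Fraction of samples whose predicted sign matches the label."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(np.sign(Xb @ w) == y))

def layerwise_probe(acts, y, n_train):
    """acts has shape (n_layers, n_samples, d); returns held-out accuracy per layer."""
    accs = []
    for layer_acts in acts:
        w = fit_ridge_probe(layer_acts[:n_train], y[:n_train])
        accs.append(probe_accuracy(w, layer_acts[n_train:], y[n_train:]))
    return accs

# Synthetic stand-in for hidden states (assumption: not real model activations).
rng = np.random.default_rng(0)
n_layers, n_samples, d = 4, 400, 16
y = rng.choice([-1.0, 1.0], size=n_samples)
signal = np.array([0.0, 1.0, 2.0, 2.0])  # hypothetical per-layer trait signal strength
acts = rng.normal(size=(n_layers, n_samples, d))
acts[:, :, 0] += signal[:, None] * y[None, :]    # inject the label into one dimension

accs = layerwise_probe(acts, y, n_train=300)
for layer, acc in enumerate(accs):
    print(f"layer {layer}: held-out probe accuracy = {acc:.2f}")
```

With real models, `acts` would be replaced by hidden states collected at each layer for questionnaire items, and a per-layer accuracy curve of this kind is what shows where trait information first becomes decodable.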
Abstract
Using psychological constructs such as the Big Five, large language models (LLMs) can imitate specific personality profiles and predict a user's personality. While LLMs can exhibit behaviors consistent with these constructs, it remains unclear where and how they are represented inside the model and how they relate to behavioral outputs. To address this gap, we focus on questionnaire-operationalized Big Five concepts, analyze the formation and localization of their internal representations, and use interventions to examine how these representations relate to behavioral outputs. In our experiment, we first use probing to examine where Big Five information emerges across model depth. We then identify neurons that respond selectively to each Big Five concept and test whether enhancing or suppressing their activations can bias latent representations and label generation in intended…
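The select-then-intervene procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: selectivity is measured here by a standardized mean difference between trait-positive and trait-negative inputs, and the intervention simply rescales the chosen neurons' activations. The functions `selectivity_scores`, `top_k_neurons`, `intervene`, and `class_gap`, as well as the synthetic data, are hypothetical.

```python
import numpy as np

def selectivity_scores(acts, y):
    """Standardized mean difference per neuron between y=+1 and y=-1 samples."""
    pos, neg = acts[y > 0], acts[y < 0]
    pooled_std = np.sqrt(0.5 * (pos.var(axis=0) + neg.var(axis=0))) + 1e-8
    return (pos.mean(axis=0) - neg.mean(axis=0)) / pooled_std

def top_k_neurons(scores, k):
    """Indices of the k neurons with the largest absolute selectivity."""
    return np.argsort(-np.abs(scores))[:k]

def intervene(acts, neuron_idx, scale):
    """Return a copy with the chosen neurons scaled (0 suppresses, >1 enhances)."""
    out = acts.copy()
    out[:, neuron_idx] *= scale
    return out

def class_gap(acts, y, readout):
    """Difference in mean readout value between the two classes."""
    z = acts @ readout
    return float(z[y > 0].mean() - z[y < 0].mean())

# Synthetic activations for one layer (assumption: neurons 3 and 7 carry the trait).
rng = np.random.default_rng(1)
n_samples, d = 500, 10
y = rng.choice([-1.0, 1.0], size=n_samples)
acts = rng.normal(size=(n_samples, d))
acts[:, [3, 7]] += 1.5 * y[:, None]

scores = selectivity_scores(acts, y)
selected = top_k_neurons(scores, k=2)

# Use the selectivity vector as a fixed linear readout of the latent trait.
gap_base = class_gap(acts, y, scores)
gap_suppressed = class_gap(intervene(acts, selected, 0.0), y, scores)
gap_enhanced = class_gap(intervene(acts, selected, 2.0), y, scores)
print(sorted(selected.tolist()), gap_suppressed, gap_base, gap_enhanced)
```

In this toy setting, suppressing the selected neurons shrinks the class separation seen by the readout, and enhancing them widens it. Whether the same shift carries through to generated text is a separate question, which the summary above describes as holding only to some extent.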
