Dissecting Persona-Driven Reasoning in Language Models via Activation Patching
Ansh Poonia, Maeghal Jain

TL;DR
This paper investigates how large language models encode and utilize persona information during reasoning, revealing the roles of different layers and attention heads in processing persona-specific and identity-related content.
Contribution
It applies activation patching to analyze how LLMs encode persona information and identifies which layers and attention heads contribute to persona-driven reasoning.
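As a rough illustration of the general technique (not the authors' code or model), activation patching can be sketched on a toy residual network: run a "clean" input and cache each component's output, then rerun a "corrupted" input while overwriting one component's output with the cached clean value, and measure how much of the clean behavior is restored. The weights, inputs, and the names `run` and `patching_effects` below are hypothetical and chosen only for this sketch.

```python
import math

# Hypothetical fixed weights for a 3-layer toy residual model (not from the paper).
WEIGHTS = [
    [[0.5, -0.2], [0.1, 0.3]],
    [[-0.4, 0.6], [0.7, 0.2]],
    [[0.3, 0.3], [-0.5, 0.8]],
]
READOUT = [1.0, -1.0]  # hypothetical readout turning the final state into one logit


def component(h, w):
    """Output of one component: tanh(W h). It is added to the residual stream."""
    return [math.tanh(sum(w[i][j] * h[j] for j in range(len(h)))) for i in range(len(h))]


def run(x, patch=None):
    """Run the toy model and cache every component output.

    patch = (layer_index, cached_output) overwrites that component's output
    before it is added to the residual stream.
    """
    h, cache = list(x), []
    for i, w in enumerate(WEIGHTS):
        out = component(h, w)
        if patch is not None and patch[0] == i:
            out = list(patch[1])
        cache.append(out)
        h = [a + b for a, b in zip(h, out)]
    logit = sum(r * v for r, v in zip(READOUT, h))
    return logit, cache


def patching_effects(clean_x, corrupt_x):
    """Fraction of the clean-vs-corrupt logit gap restored by patching each layer."""
    clean_logit, clean_cache = run(clean_x)
    corrupt_logit, _ = run(corrupt_x)
    gap = clean_logit - corrupt_logit
    effects = []
    for i in range(len(WEIGHTS)):
        patched_logit, _ = run(corrupt_x, patch=(i, clean_cache[i]))
        effects.append((patched_logit - corrupt_logit) / gap)
    return effects


if __name__ == "__main__":
    for layer_index, effect in enumerate(patching_effects([1.0, 0.0], [0.0, 1.0])):
        print(f"layer {layer_index}: restored fraction = {effect:.3f}")
```

In a real transformer the same loop would run over MLP and attention outputs at specific token positions; a component with a large restored fraction is a candidate carrier of the information that differs between the two prompts.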
Findings
Early MLP layers encode semantic content of persona tokens
Middle MHA layers utilize these representations to influence output
Certain attention heads attend disproportionately to racial and color-based identities
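One way a "disproportionate attention" pattern like the one in the last finding could be checked is to measure, per head, the share of attention that a query position places on the persona tokens and compare it with a uniform baseline. This is a minimal sketch under stated assumptions, not necessarily the authors' procedure: the attention values, the persona token positions, the 2x threshold, and the helper names `persona_attention_share` and `flag_heads` are all hypothetical.

```python
def persona_attention_share(attn, persona_positions, query_index=-1):
    """Per-head share of attention mass that one query row puts on persona tokens.

    attn is indexed as attn[head][query][key] and holds attention weights.
    """
    shares = []
    for head in attn:
        row = head[query_index]
        shares.append(sum(row[k] for k in persona_positions) / sum(row))
    return shares


def flag_heads(shares, n_keys, n_persona, ratio=2.0):
    """Indices of heads whose persona share exceeds `ratio` times a uniform baseline."""
    baseline = n_persona / n_keys
    return [i for i, share in enumerate(shares) if share > ratio * baseline]


if __name__ == "__main__":
    # Hypothetical attention for 2 heads, 1 query row, 4 key positions.
    attn = [
        [[0.25, 0.25, 0.25, 0.25]],  # head 0: uniform attention
        [[0.05, 0.80, 0.10, 0.05]],  # head 1: concentrated on position 1
    ]
    shares = persona_attention_share(attn, persona_positions=[1])
    print(shares, flag_heads(shares, n_keys=4, n_persona=1))
```

Comparing these shares across personas (for example, racial versus non-racial identity tokens) would then show whether a given head treats some identity categories differently from others.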
Abstract
Large language models (LLMs) exhibit remarkable versatility in adopting diverse personas. In this study, we examine how assigning a persona influences a model's reasoning on an objective task. Using activation patching, we take a first step toward understanding how key components of the model encode persona-specific information. Our findings reveal that the early Multi-Layer Perceptron (MLP) layers attend not only to the syntactic structure of the input but also process its semantic content. These layers transform persona tokens into richer representations, which are then used by the middle Multi-Head Attention (MHA) layers to shape the model's output. Additionally, we identify specific attention heads that disproportionately attend to racial and color-based identities.
Taxonomy
Topics: Persona Design and Applications · Topic Modeling · Machine Learning in Healthcare
