Dissecting Persona-Driven Reasoning in Language Models via Activation Patching

Ansh Poonia; Maeghal Jain

arXiv:2507.20936 · cs.LG · September 23, 2025

TL;DR

This paper investigates how large language models encode and use persona information during reasoning. It reveals the roles that different layers and attention heads play in processing persona-specific and identity-related content.

Contribution

It applies activation patching to analyze how LLMs encode persona information. It also shows how individual layers and attention heads contribute to persona-driven reasoning.
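
The patching procedure can be illustrated with a toy model. The sketch below is a minimal, hypothetical NumPy example, not the authors' code, and it rests on the following assumptions:

- A small, randomly initialized residual network stands in for an LLM. Each layer has one single-head attention sublayer and one MLP sublayer.
- Two inputs that differ only at position 1 stand in for prompts with different personas.
- Each sublayer output from the clean run is copied into the corrupted run, one at a time.
- The effect is measured as the change in the logit difference between two hypothetical answer tokens. It is normalized so that 0 means no effect and 1 means the clean behavior is fully recovered.

The sizes, weights, answer tokens and the normalized metric are all illustrative. The paper's models, prompts and metric may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
D, VOCAB, N_LAYERS, SEQ = 16, 10, 4, 6  # toy sizes (hypothetical)

# Randomly initialized weights standing in for a real LLM (hypothetical).
layers = [
    {
        "Wq": rng.normal(size=(D, D)) / np.sqrt(D),
        "Wk": rng.normal(size=(D, D)) / np.sqrt(D),
        "Wv": rng.normal(size=(D, D)) / np.sqrt(D),
        "W1": rng.normal(size=(D, 4 * D)) / np.sqrt(D),
        "W2": rng.normal(size=(4 * D, D)) / np.sqrt(4 * D),
    }
    for _ in range(N_LAYERS)
]
W_U = rng.normal(size=(D, VOCAB))  # unembedding matrix


def attn(x, p):
    """Single-head causal self-attention over x of shape (SEQ, D)."""
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    scores = q @ k.T / np.sqrt(x.shape[1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # block attention to later tokens
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v


def mlp(x, p):
    """Position-wise two-layer ReLU MLP."""
    return np.maximum(x @ p["W1"], 0.0) @ p["W2"]


def run(x, patch=None):
    """Forward pass; `patch` = (layer, "attn" | "mlp", activation) overrides one sublayer."""
    acts = {}
    for i, p in enumerate(layers):
        for name, fn in (("attn", attn), ("mlp", mlp)):
            out = fn(x, p)
            if patch is not None and patch[:2] == (i, name):
                out = patch[2]  # splice in the activation from another run
            acts[(i, name)] = out
            x = x + out  # sublayer output is added to the residual stream
    return x[-1] @ W_U, acts  # logits at the final position


ANS_A, ANS_B = 3, 7  # two hypothetical answer tokens


def logit_diff(logits):
    return logits[ANS_A] - logits[ANS_B]


clean = rng.normal(size=(SEQ, D))
corrupt = clean.copy()
corrupt[1] = rng.normal(size=D)  # only the "persona" token at position 1 differs

clean_logits, clean_acts = run(clean)
corrupt_logits, corrupt_acts = run(corrupt)
base, target = logit_diff(corrupt_logits), logit_diff(clean_logits)

# Normalized effect: 0 = no change from the corrupted run, 1 = clean behavior restored.
effects = {}
for key, act in clean_acts.items():
    patched_logits, _ = run(corrupt, patch=(*key, act))
    effects[key] = (logit_diff(patched_logits) - base) / (target - base)

for key, effect in sorted(effects.items()):
    print(key, round(float(effect), 3))
```

The printed table ranks (layer, sublayer) pairs by how much of the clean behavior they restore in this toy setup. In a real model, the same sweep would run over the model's actual layers. Patching could also be restricted to the persona token positions to separate where persona information is written from where it is read.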

Findings

1. Early MLP layers encode the semantic content of persona tokens.
2. Middle multi-head attention (MHA) layers use these representations to shape the model's output.
3. Certain attention heads attend disproportionately to racial and color-based identities.
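
The third finding can be made concrete with a hypothetical NumPy sketch. It scores each attention head by how much of the final position's attention falls on identity-token positions. It then flags heads whose share exceeds a multiple of what uniform attention would give.

The attention tensor here is synthetic: uniform causal attention plus one planted head. The identity positions are made up, and the 2.5× threshold is an assumed criterion, not the one used in the paper. With a real model, the weights would come from a forward pass. For example, Hugging Face Transformers returns them when a model is called with `output_attentions=True`.

```python
import numpy as np


def identity_attention_share(attn, identity_pos):
    """Fraction of the final query position's attention that each head places on
    `identity_pos`. `attn` has shape (layers, heads, queries, keys) and every
    query row sums to 1."""
    rows = attn[:, :, -1, :]                     # (layers, heads, keys)
    return rows[..., identity_pos].sum(axis=-1)  # (layers, heads)


def flag_heads(attn, identity_pos, ratio=2.5):
    """Flag heads whose identity share exceeds `ratio` times the share that
    uniform attention over all keys would give (an assumed criterion)."""
    share = identity_attention_share(attn, identity_pos)
    baseline = len(identity_pos) / attn.shape[-1]
    layer_idx, head_idx = np.nonzero(share > ratio * baseline)
    return [(int(li), int(hi)) for li, hi in zip(layer_idx, head_idx)], share


# Synthetic example (hypothetical sizes and positions).
LAYERS, HEADS, SEQ = 2, 4, 8
IDENTITY_POS = [2, 3]  # e.g. where a racial or color identity phrase sits in the prompt

# Uniform causal attention: query q spreads its weight evenly over keys 0..q.
causal = np.tril(np.ones((SEQ, SEQ)))
uniform = causal / causal.sum(axis=1, keepdims=True)
attn = np.broadcast_to(uniform, (LAYERS, HEADS, SEQ, SEQ)).copy()

# Plant one head (layer 1, head 2) that concentrates on the identity tokens.
row = np.full(SEQ, 0.2 / (SEQ - len(IDENTITY_POS)))
row[IDENTITY_POS] = 0.4
attn[1, 2, -1] = row

flagged, share = flag_heads(attn, IDENTITY_POS)
print(flagged)  # [(1, 2)]
```

With real weights, a fairer reference may be the same head's share on prompts that place a neutral persona in the same positions. Many heads attend heavily to a few fixed positions regardless of content.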

Abstract

Large language models (LLMs) exhibit remarkable versatility in adopting diverse personas. In this study, we examine how assigning a persona influences a model's reasoning on an objective task. Using activation patching, we take a first step toward understanding how key components of the model encode persona-specific information. Our findings reveal that the early Multi-Layer Perceptron (MLP) layers not only attend to the syntactic structure of the input but also process its semantic content. These layers transform persona tokens into richer representations, which are then used by the middle Multi-Head Attention (MHA) layers to shape the model's output. Additionally, we identify specific attention heads that disproportionately attend to racial and color-based identities.

Taxonomy

Topics: Persona Design and Applications · Topic Modeling · Machine Learning in Healthcare