Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization

Weixu Zhang; Ye Yuan; Changjiang Han; Yuxing Tian; Zipeng Sun; Linfeng Du; Jikun Kang; Hong Kang; Xue Liu; Haolun Wu

arXiv:2604.22345 · cs.CL · April 27, 2026

TL;DR

This paper introduces a mechanistic interpretability framework for LLMs, identifying Preference Heads that encode user preferences and enabling controllable personalization without additional training.

Contribution

The work proposes Differential Preference Steering (DPS), a training-free method to identify and leverage Preference Heads for interpretable and effective personalization.

Findings

01

DPS identifies Preference Heads that causally influence personalized outputs.

02

Experiments show consistent improvements in personalization fidelity across multiple LLMs.

03

DPS maintains content coherence and low computational overhead.

Abstract

Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine-tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads: attention heads that encode user-specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training-free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user-aligned outputs. During decoding, we contrast model predictions with and without…
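The abstract does not give the exact PCS formula, so the following is only a minimal sketch of what a masking-based contribution score could look like. It assumes a toy model whose output logits are the sum of per-head contributions, and it assumes the score is the drop in log-probability of a user-aligned target token when one head is masked. The function names, the additive toy model, and the scoring rule are all illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def target_log_prob(head_logits: Sequence[Sequence[float]],
                    keep: Sequence[bool],
                    target: int) -> float:
    """Log-probability of `target` when only heads with keep[h] == True contribute.

    ASSUMPTION: the toy model's logits are a plain sum of per-head contributions.
    """
    vocab_size = len(head_logits[0])
    total = [0.0] * vocab_size
    for contribution, keep_head in zip(head_logits, keep):
        if keep_head:
            for i, value in enumerate(contribution):
                total[i] += value
    return log_softmax(total)[target]


def contribution_scores(head_logits: Sequence[Sequence[float]],
                        target: int) -> List[float]:
    """Hypothetical masking-based score per head.

    score[h] = log p(target | all heads) - log p(target | head h masked).
    A large positive score means masking the head hurts the user-aligned target.
    """
    n_heads = len(head_logits)
    full = target_log_prob(head_logits, [True] * n_heads, target)
    scores = []
    for h in range(n_heads):
        keep = [True] * n_heads
        keep[h] = False  # mask exactly one head
        scores.append(full - target_log_prob(head_logits, keep, target))
    return scores


def top_k_heads(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k heads with the highest scores."""
    return sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)[:k]


# Toy data: three heads, vocabulary of three tokens, token 0 is the user-aligned target.
toy_heads = [
    [2.0, 0.0, 0.0],  # pushes toward the target
    [0.0, 0.0, 0.0],  # contributes nothing
    [0.0, 1.0, 0.0],  # pushes toward a competing token
]
scores = contribution_scores(toy_heads, target=0)
selected = top_k_heads(scores, k=1)
```

In this toy setting, the head that supports the target token gets a positive score, the inert head scores zero, and the head favoring a competing token scores negative, so ranking by score singles out the first head. In a real transformer the masking would be applied to attention-head outputs inside the forward pass rather than to additive logit terms.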
