Probing Persona-Dependent Preferences in Language Models
Oscar Gilg, Pierre Beckmann, Daniel Paleka, Patrick Butlin

TL;DR
This paper investigates how large language models encode and implement different personas' preferences internally, revealing a shared preference vector that influences task choices across diverse personas.
Contribution
The paper introduces a method for identifying a genuine preference vector in LLMs and shows that this vector is shared across personas, including contrasting ones.
Findings
A preference vector tracks model preferences across prompts.
The preference vector can causally steer pairwise choices.
Preferences are largely shared across different personas.
Abstract
Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post-training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma-3-27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the…
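To make the probing idea in the abstract concrete, here is a minimal sketch on synthetic data. It is not the authors' pipeline. Random Gaussian vectors stand in for residual-stream activations, a hidden direction `true_dir` plays the role of the preference vector, and the probe is a plain logistic regression on activation differences. All of these are assumptions made for illustration, and the names `make_pairs`, `train_probe`, `choice_logit`, and `steer` are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def make_pairs(n_pairs, dim, true_dir, rng):
    """Synthetic stand-in for activations of two candidate tasks per pair.

    Label is 1 when task A scores higher than task B along true_dir.
    """
    pairs = []
    for _ in range(n_pairs):
        a = [rng.gauss(0, 1) for _ in range(dim)]
        b = [rng.gauss(0, 1) for _ in range(dim)]
        label = 1 if dot(a, true_dir) > dot(b, true_dir) else 0
        pairs.append((a, b, label))
    return pairs


def choice_logit(w, a, b):
    """Probe's log-odds that A is chosen over B."""
    return dot(w, [x - z for x, z in zip(a, b)])


def train_probe(pairs, dim, lr=0.1, epochs=200):
    """Full-batch gradient descent on the logistic loss over activation differences."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for a, b, y in pairs:
            diff = [x - z for x, z in zip(a, b)]
            err = sigmoid(dot(w, diff)) - y
            for i in range(dim):
                grad[i] += err * diff[i]
        w = [wi - lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w


def steer(act, direction, alpha):
    """Shift an activation along a direction (assumed steering rule)."""
    return [x + alpha * d for x, d in zip(act, direction)]


rng = random.Random(0)
dim = 8
true_dir = [1.0] + [0.0] * (dim - 1)  # hidden "preference" direction (assumed)

train_pairs = make_pairs(400, dim, true_dir, rng)
test_pairs = make_pairs(200, dim, true_dir, rng)

w = train_probe(train_pairs, dim)
accuracy = sum(
    (choice_logit(w, a, b) > 0) == bool(y) for a, b, y in test_pairs
) / len(test_pairs)

# Steering: push task A's activation along the unit probe direction.
norm_w = math.sqrt(dot(w, w))
unit_w = [wi / norm_w for wi in w]
a, b, _ = test_pairs[0]
base_logit = choice_logit(w, a, b)
steered_logit = choice_logit(w, steer(a, unit_w, alpha=2.0), b)

print("held-out accuracy:", accuracy)
print("cosine(probe, true_dir):", cosine(w, true_dir))
print("logit before/after steering:", base_logit, steered_logit)
```

In a real model the steered activation would be passed through the remaining layers and the model's actual choice observed. Here the probe's own logit is read back, which only shows the direction of the shift, not a causal effect on behaviour. A cross-persona check in the same spirit would train `w` on pairs collected under one system prompt and evaluate `accuracy` on pairs collected under another.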
