The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, Jack Lindsey

TL;DR
This paper studies the structure of language model personas. It identifies an 'Assistant Axis', the leading component of persona space, which tracks how strongly a model is operating in its default Assistant mode, and demonstrates that steering along this axis can stabilize or alter the model's persona and responses.
Contribution
It introduces the 'Assistant Axis' as the leading direction in model persona space and shows that steering along it can control and stabilize model behavior.
Findings
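The 'Assistant Axis' captures the extent to which a model is operating in its default helpful Assistant mode.
Steering toward the Assistant direction reinforces helpful and harmless behavior; steering away increases the model's tendency to identify as other entities, and more extreme values often induce a mystical, theatrical speaking style.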
Restricting activation along the axis stabilizes behavior and prevents persona drift.
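The steering and restriction operations described in these findings can be illustrated with a minimal NumPy sketch. It is an assumption-laden illustration, not the authors' implementation: the function names `steer` and `clamp_projection`, the bounds, and the random stand-in activations are all hypothetical, and the paper's actual layer choice and intervention details are not given here.

```python
import numpy as np

def steer(hidden, axis, alpha):
    """Shift each hidden state by alpha units along the unit-normalized axis."""
    unit = axis / np.linalg.norm(axis)
    return hidden + alpha * unit

def clamp_projection(hidden, axis, low, high):
    """Restrict each hidden state's projection onto the axis to [low, high],
    leaving the component orthogonal to the axis unchanged."""
    unit = axis / np.linalg.norm(axis)
    proj = hidden @ unit                       # projection per state, shape (n,)
    clipped = np.clip(proj, low, high)
    return hidden + np.outer(clipped - proj, unit)

# Synthetic stand-ins for an axis and a batch of activations (assumed shapes).
rng = np.random.default_rng(0)
axis = rng.normal(size=16)
hidden = rng.normal(size=(4, 16))

steered = steer(hidden, axis, alpha=3.0)
clamped = clamp_projection(hidden, axis, low=-0.5, high=0.5)
```

In this sketch, steering moves every state by a fixed amount along the axis, while clamping only touches states whose projection falls outside the allowed range, which is one plausible way to read "restricting activation along the axis".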
Abstract
Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an "Assistant Axis," which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model's tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and…
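As a rough sketch of what "the leading component of this persona space" could mean in practice, the example below runs PCA (via SVD) over a matrix of per-persona activation vectors and returns the first principal direction. Everything here is assumed for illustration: the data is synthetic, the helper `leading_axis` is hypothetical, and the abstract does not specify how the authors extract persona vectors or which decomposition they use.

```python
import numpy as np

def leading_axis(persona_vectors):
    """Return the first principal direction of the persona vectors and the
    fraction of variance it explains."""
    centered = persona_vectors - persona_vectors.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return vt[0], explained[0]

# Synthetic persona vectors: one dominant hidden direction plus small noise.
rng = np.random.default_rng(0)
d_model, n_personas = 32, 50                 # assumed sizes, for illustration only
true_axis = rng.normal(size=d_model)
true_axis /= np.linalg.norm(true_axis)
scores = rng.normal(scale=5.0, size=n_personas)
persona_vectors = (np.outer(scores, true_axis)
                   + rng.normal(scale=0.3, size=(n_personas, d_model)))

axis, frac = leading_axis(persona_vectors)
# The sign of a principal direction is arbitrary, so compare by absolute cosine.
alignment = abs(axis @ true_axis)
```

On real activations the sign of the recovered direction would need to be fixed by convention, for example by orienting it so that the default Assistant persona has a positive projection.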
Taxonomy
Topics: Persona Design and Applications · Topic Modeling · Machine Learning in Healthcare
