Tracing Persona Vectors Through LLM Pretraining
Viktor Moskvoretskii, Dominik Glandorf, Jorge Medina Moreira, Tanja Käser, Robert West

TL;DR
This paper investigates how high-level persona representations form during large language model pretraining, showing they emerge early and evolve throughout training, with implications for AI safety and interpretability.
Contribution
It demonstrates that persona vectors form early in pretraining, remain effective for steering after post-training, and continue to refine geometrically, offering insight into how these representations develop and how to interpret them.
Findings
Persona vectors form within 0.22% of OLMo-3 pretraining.
Persona vectors remain effective for steering the fully post-trained instruct models.
Different elicitation strategies reveal distinct persona facets.
Abstract
How large language models internally represent high-level behaviors is a core interpretability question with direct relevance to AI safety: it determines what we can detect, audit, or intervene on. Recent work has shown that traits such as evil or sycophancy correspond to linear directions in the internal activations, the so-called persona vectors. Although these vectors are now routinely utilized to inspect and steer model behavior in safety-relevant settings, how these representations are formed during training remains unknown. To address this gap, we trace persona vectors across the pretraining of OLMo-3-7B, finding that persona vectors form remarkably early -- within 0.22% of OLMo-3 pretraining -- and remain effective for steering the fully post-trained instruct models. Although core representations are formed early on, persona vectors continue to refine geometrically and…
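To make the idea of a trait as a linear direction concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a difference-of-means construction (mean activation under trait-eliciting prompts minus mean activation under neutral prompts), which is one common way to obtain such directions; the abstract does not specify the extraction procedure. The activations are synthetic, and the names `persona_vector`, `project`, and `steer` are hypothetical helpers introduced only for illustration.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_vector(trait_acts, neutral_acts):
    """Assumed construction: unit-normalized difference of mean activations."""
    mu_trait = mean_vector(trait_acts)
    mu_neutral = mean_vector(neutral_acts)
    diff = [p - q for p, q in zip(mu_trait, mu_neutral)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("trait and neutral means coincide; no direction found")
    return [d / norm for d in diff]


def project(activation, direction):
    """Inspect: scalar projection of an activation onto the direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation, direction, alpha):
    """Steer: shift an activation by alpha along the direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Synthetic stand-in for hidden states (NOT real OLMo-3 activations):
# trait examples are shifted along the first coordinate, plus small noise.
rng = random.Random(0)
dim = 8
true_dir = [1.0] + [0.0] * (dim - 1)


def sample(shift):
    return [rng.gauss(0.0, 0.1) + shift * t for t in true_dir]


trait_acts = [sample(2.0) for _ in range(50)]
neutral_acts = [sample(0.0) for _ in range(50)]

v = persona_vector(trait_acts, neutral_acts)
h = neutral_acts[0]
before = project(h, v)
after = project(steer(h, v, alpha=3.0), v)
print(f"projection before steering: {before:.3f}, after: {after:.3f}")
```

Because `v` has unit norm, steering with coefficient `alpha` raises the projection onto `v` by exactly `alpha`. In a real model the same two operations would be applied to residual-stream activations at a chosen layer, and tracing the direction across pretraining would amount to repeating the extraction at each checkpoint.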
