Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Runjin Chen; Andy Arditi; Henry Sleight; Owain Evans; Jack Lindsey

arXiv:2507.21509 · cs.CL · September 8, 2025

TL;DR

This paper introduces persona vectors, a method to monitor, predict, and control personality traits in language models, enabling better alignment with desired behaviors and reducing undesirable personality shifts during training and deployment.

Contribution

It presents a novel automated approach to extract persona vectors for any trait, facilitating monitoring, control, and data filtering to improve language model personality consistency.

Findings

01

Persona vectors can monitor personality fluctuations during deployment.

02

Shifts in personality traits correlate with changes along persona vectors during training.

03

Interventions based on persona vectors can mitigate unwanted personality shifts.
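
The third finding can be illustrated with a minimal sketch of activation steering, assuming the intervention adds a scaled copy of a unit-norm persona vector to a hidden activation. The names `steer`, `dot`, `direction`, and `alpha`, and the toy two-dimensional values, are hypothetical and do not come from the paper.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(activation: Vector, direction: Vector, alpha: float) -> Vector:
    """Return activation + alpha * direction.

    A negative alpha moves the activation away from the trait direction,
    a positive alpha moves it toward the trait direction.
    """
    return [h + alpha * d for h, d in zip(activation, direction)]


# Toy example: a unit-norm direction standing in for a persona vector (assumed).
direction = [0.0, 1.0]
h = [0.5, 2.0]

before = dot(h, direction)            # projection onto the trait direction
steered = steer(h, direction, -1.5)   # steer away from the trait
after = dot(steered, direction)       # projection after the intervention
```

With a unit-norm direction, the projection changes by exactly `alpha`, which is why the projection is a convenient quantity to track before and after an intervention.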

Abstract

Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space (persona vectors) underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors…
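
The idea of a trait direction in activation space, and of monitoring by projecting onto it, can be sketched as follows. This is a minimal illustration that assumes the direction is the normalized difference between mean activations on trait-exhibiting and neutral responses; the abstract does not spell out the extraction procedure, so that choice, the function names, and the toy activations are all assumptions.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(r[i] for r in rows) / n for i in range(dim)]


def persona_direction(trait_acts: Sequence[Sequence[float]],
                      neutral_acts: Sequence[Sequence[float]]) -> Vector:
    """Unit-norm difference of means (an assumed extraction rule)."""
    mu_trait = mean_vector(trait_acts)
    mu_neutral = mean_vector(neutral_acts)
    diff = [a - b for a, b in zip(mu_trait, mu_neutral)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("trait and neutral means coincide; no direction")
    return [x / norm for x in diff]


def projection(activation: Sequence[float], direction: Vector) -> float:
    """Scalar monitoring score: how far the activation lies along the direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy activations (hypothetical); real ones would come from a model layer.
trait_acts = [[2.0, 0.0], [4.0, 0.0]]
neutral_acts = [[0.0, 0.0], [0.0, 0.0]]

direction = persona_direction(trait_acts, neutral_acts)
score = projection([1.5, 7.0], direction)
```

A larger `score` means the activation sits further along the trait direction. Tracking this scalar over prompts or over training checkpoints is one simple way to picture the monitoring and prediction uses described in the abstract.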

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The authors present an automated pipeline for monitoring, predicting, and controlling LLM personalities. However, the paper hides a more salient contribution: a case study of preventative steering, i.e., ablating persona vectors at training rather than inference time. The authors demonstrate that fine-tuning on new facts while steering away from the hallucination vector preserves accuracy on MMLU with only a slight degradation on new facts. Thus, compared to other SOA met

Weaknesses

The presentation is flawed. The paper is very comprehensive, comparing the steering approach against psychometric baselines, SAE, fine-tuning vs. few-shot prompting, etc., but this also works to its detriment, as it obscures the main empirical findings. According to the authors, the main contribution appears to be an automated pipeline. In terms of empirical findings, the paper would seem to have less to offer. As the authors outline, the methods are otherwise well-established and extensivel

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- This work is very comprehensive in the experiments, showcasing diverse use cases for the proposed persona vector. Also, the experiments are delivered very clearly.
- The writing is simple and direct, and it was easy to follow.
- The Appendix is impressive, providing helpful details and further experiments that support the authors' claims.

Weaknesses

1. Method Novelty

While this work is very comprehensive, my biggest concern is the novelty of the approach. Previously, there have been numerous works that discuss model steering vectors, like RepE [1] or ITI [2]. Also, many works have provided ways to "steer" LLMs for personalized use [3], and even to change LLM personality traits in the latent embedding space [4,5]. Given these works that discuss similar approaches, I believe this deserves an in-depth discussion on what differences the Pe

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. Sound Engineering Method: The automated pipeline for vector extraction is a good contribution. By requiring only a natural-language description of a trait, it provides a scalable method for representation engineering, moving beyond bespoke, manually-curated datasets for each concept.
2. Comprehensive Experiments: The authors rigorously validate the utility of the extracted vectors across a wide range of applications (monitoring, inference-time control, training-time control, and data screeni

Weaknesses

1. Limited Scientific Insight: The paper is presented more as an engineering achievement than a scientific one. It demonstrates that persona traits can be mapped to vectors but provides little insight into why this is the case and if other naive methods could do the same. The method feels like advanced prompt engineering applied to activations, rather than some general methodological approaches towards general understanding of related problems.
2. Insufficient Comparison to Naive Baselines: The

Code & Models

Datasets

Kenshiii/persona-vectors-dataset
dataset · 23 dl