Do Linear Probes Generalize Better in Persona Coordinates?
Prasad Mahadik, Adrians Skapars

TL;DR
This paper investigates whether linear probes trained in a persona coordinate space generalize better than probes trained on raw activations when detecting harmful behaviors in language models, and reports improved robustness for the persona-coordinate probes.
Contribution
It introduces persona axes for deception and sycophancy, derived from contrastive persona prompts, and shows that they improve the transferability and robustness of behavior probes across datasets.
Findings
Persona-derived directions transfer non-trivially across datasets.
Probes trained on persona-PC projections outperform those trained on raw activations.
A unified axis improves generalization across multiple behaviors.
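The comparison behind these findings can be illustrated with a minimal sketch on synthetic data. It is not the authors' pipeline: the pooled-PCA recipe, the helper names (`persona_axes`, `to_persona_coords`, `fit_logistic_probe`), the dimensions, and the random "activations" are all assumptions made for illustration, and no real model or dataset is involved.

```python
import numpy as np

def persona_axes(persona_acts, k=1):
    """Top-k principal directions of pooled persona-prompt activations.
    Assumed recipe: center the pooled activations, then take an SVD."""
    mean = persona_acts.mean(axis=0)
    _, _, vt = np.linalg.svd(persona_acts - mean, full_matrices=False)
    return mean, vt[:k]  # vt rows are orthonormal directions

def to_persona_coords(acts, mean, axes):
    """Project activations onto the persona axes (low-dimensional coordinates)."""
    return (acts - mean) @ axes.T

def fit_logistic_probe(x, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression, used here as the linear probe."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        logits = np.clip(x @ w + b, -30.0, 30.0)  # avoid overflow in exp
        grad = 1.0 / (1.0 + np.exp(-logits)) - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(x, y, w, b):
    return float((((x @ w + b) > 0).astype(int) == y).mean())

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)  # synthetic "behavior" direction

# Hypothetical stand-ins for activations under contrastive persona prompts.
honest = rng.normal(size=(100, d_model)) - 3.0 * direction
deceptive = rng.normal(size=(100, d_model)) + 3.0 * direction
mean, axes = persona_axes(np.vstack([honest, deceptive]), k=1)

def make_dataset(n, offset):
    """Synthetic labelled activations; `offset` mimics a dataset-level shift."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, d_model)) + 1.5 * np.outer(2 * y - 1, direction) + offset
    return x, y

x_train, y_train = make_dataset(200, offset=0.0)
x_test, y_test = make_dataset(200, offset=0.5 * rng.normal(size=d_model))

# Probe on raw activations versus probe on persona-axis coordinates.
w_raw, b_raw = fit_logistic_probe(x_train, y_train)
w_pc, b_pc = fit_logistic_probe(to_persona_coords(x_train, mean, axes), y_train)

acc_raw = probe_accuracy(x_test, y_test, w_raw, b_raw)
acc_persona = probe_accuracy(to_persona_coords(x_test, mean, axes), y_test, w_pc, b_pc)
print(f"raw probe: {acc_raw:.2f}  persona-coordinate probe: {acc_persona:.2f}")
```

The numbers printed by this toy setup say nothing about the paper's results; the sketch only shows the mechanics of fitting one probe on full activations and another on a projection onto an axis estimated without behavior labels.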
Abstract
It is becoming increasingly necessary to have monitors check for harmful behaviors during language model interactions, but text-only monitoring has not been sufficient. This is because models sometimes exhibit strategic deception and sandbagging, changing their behavior during evaluation. This motivates the use of white-box monitors like linear probes, which can read the model internals directly. However, such probes can fail under distribution shift, limiting their usefulness in real settings. We study whether there exists a low-dimensional subspace of the model internals that captures harmful behaviors more robustly, while leaving out spuriously correlated features. Inspired by the Assistant Axis and Persona Selection Model, we construct persona axes for deception and sycophancy using contrastive persona prompts. The first principal components, obtained by unsupervised PCA of the…
