Persona Vectors in Games: Measuring and Steering Strategies via Activation Vectors
Johnathan Sun, Andrew Zhang

TL;DR
This paper builds persona vectors through activation steering in large language models to measure and influence high-level strategic behavior in game-theoretic settings. Steering shifts both choices and justifications systematically, although rhetoric and strategy can diverge.
Contribution
It constructs persona vectors for altruism, forgiveness, and expectations of others by contrastive activation addition, then applies them to measure and steer the strategic traits of LLMs in canonical games.
Findings
Activation steering shifts strategic choices and justifications systematically.
Rhetoric and strategy can diverge under persona steering.
Self-behavior and expectations vectors are partially distinct.
Abstract
Large language models (LLMs) are increasingly deployed as autonomous decision-makers in strategic settings, yet we have limited tools for understanding their high-level behavioral traits. We use activation steering methods in game-theoretic settings, constructing persona vectors for altruism, forgiveness, and expectations of others by contrastive activation addition. Evaluating on canonical games, we find that activation steering systematically shifts both quantitative strategic choices and natural-language justifications. However, we also observe that rhetoric and strategy can diverge under steering. In addition, vectors for self-behavior and expectations of others are partially distinct. Our results suggest that persona vectors offer a promising mechanistic handle on high-level traits in strategic environments.
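The abstract names contrastive activation addition as the construction method but gives no implementation details. The following is a minimal sketch of the general technique as it is commonly described, not the authors' code: a persona vector is taken as the difference between mean activations on trait-positive and trait-negative prompts, and steering adds a scaled copy of that vector to a hidden state. The activations here are synthetic stand-ins, and the names `persona_vector`, `steer`, `cosine`, and `alpha` are hypothetical.

```python
import numpy as np

def persona_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference of mean activations between contrastive prompt sets.

    pos_acts, neg_acts: arrays of shape (n_prompts, hidden_dim).
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden: np.ndarray, vector: np.ndarray, alpha: float) -> np.ndarray:
    """Add the scaled persona vector to a hidden state (assumed steering rule)."""
    return hidden + alpha * vector

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity, one simple way to compare two persona vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic activations standing in for a real model's residual stream.
rng = np.random.default_rng(0)
hidden_dim = 16
trait_dir = np.zeros(hidden_dim)
trait_dir[0] = 1.0  # pretend the trait lives along the first axis

pos = rng.normal(size=(32, hidden_dim)) + 2.0 * trait_dir  # e.g. altruistic prompts
neg = rng.normal(size=(32, hidden_dim)) - 2.0 * trait_dir  # e.g. selfish prompts

v = persona_vector(pos, neg)
h = rng.normal(size=hidden_dim)
h_steered = steer(h, v, alpha=1.5)
```

In a real model the activations would be read from a chosen layer and the steered state written back during the forward pass. The same `cosine` helper could be used to check how far apart two vectors are, for example a self-behavior vector and an expectations-of-others vector.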
Taxonomy
Topics: Persona Design and Applications · Artificial Intelligence in Law · AI in Service Interactions
