Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations
Xiaoxu Ma, Xiangbo Zhang, Zhenyu Weng

TL;DR
This paper introduces PVNI, a novel internal-activation-based method for stable and explainable personality trait evaluation in large language models, overcoming limitations of existing questionnaire-based approaches.
Contribution
The paper proposes PVNI, a new approach leveraging internal activations for stable, interpretable personality assessment in LLMs, with theoretical analysis and extensive experimental validation.
Findings
PVNI provides more stable personality evaluations than existing questionnaire-based methods.
PVNI remains robust to variations in prompt phrasing and role-play configurations.
Theoretical analysis supports the effectiveness and generalization of PVNI.
Abstract
Evaluating personality traits in Large Language Models (LLMs) is key to model interpretation, comparison, and responsible deployment. However, existing questionnaire-based evaluation methods exhibit limited stability and offer little explainability, as their results are highly sensitive to minor variations in prompt phrasing or role-play configurations. To address these limitations, we propose an internal-activation-based approach, termed Persona-Vector Neutrality Interpolation (PVNI), for stable and explainable personality trait evaluation in LLMs. PVNI extracts a persona vector associated with a target personality trait from the model's internal activations using contrastive prompts. It then estimates the corresponding neutral score by interpolating along the persona vector as an anchor axis, enabling an interpretable comparison between the neutral prompt representation and the…
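The two steps named in the abstract (extracting a persona vector from contrastive prompts, then locating a neutral prompt along that vector) can be illustrated with a minimal sketch. This is not the authors' implementation: the difference-of-means persona vector, the normalized projection used as the score, the function names, and the synthetic activations are all assumptions made here for illustration, since the listing does not give the exact formulas.

```python
import numpy as np

def persona_vector(pos_acts, neg_acts):
    # Assumed definition: difference between the mean activation of
    # trait-positive prompts and the mean activation of trait-negative prompts.
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def neutral_score(neutral_act, pos_acts, neg_acts):
    # Assumed scoring rule: project the neutral activation onto the persona
    # axis and rescale so 0 is the negative anchor and 1 is the positive anchor.
    mu_neg = neg_acts.mean(axis=0)
    v = persona_vector(pos_acts, neg_acts)
    return float(np.dot(neutral_act - mu_neg, v) / np.dot(v, v))

# Synthetic stand-ins for hidden states (hypothetical data, not model output).
rng = np.random.default_rng(0)
dim = 16
axis = np.zeros(dim)
axis[0] = 1.0
pos_acts = 2.0 * axis + 0.01 * rng.standard_normal((8, dim))
neg_acts = -2.0 * axis + 0.01 * rng.standard_normal((8, dim))
neutral_act = 1.0 * axis

score = neutral_score(neutral_act, pos_acts, neg_acts)
print(f"neutral score along persona axis: {score:.3f}")  # close to 0.75 here
```

In this toy setup the neutral activation sits three quarters of the way from the negative anchor to the positive anchor, so the score lands near 0.75. With a real model, the activation arrays would come from a chosen layer's hidden states for the contrastive and neutral prompts.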
Taxonomy
Topics: Persona Design and Applications · Machine Learning in Healthcare · Topic Modeling
