Your Language Model Secretly Contains Personality Subnetworks
Ruimeng Ye, Zihan Wang, Zinan Ling, Yang Xiao, Manling Li, Xiaolong Ma, Bo Hui

TL;DR
This paper reveals that large language models inherently contain distinct persona-specific subnetworks within their parameters, enabling behavior adaptation without external prompts or fine-tuning, thus offering a new perspective on model personalization.
Contribution
The authors introduce a training-free method to identify and isolate persona subnetworks in LLMs, demonstrating their effectiveness in enhancing persona alignment and interpretability.
Findings
Persona subnetworks are embedded in LLM parameters.
Contrastive pruning improves separation of opposing personas.
Method outperforms baselines requiring external knowledge.
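The contrastive idea in the findings above can be illustrated with a small sketch. This summary does not give the paper's exact rule, so everything here is an assumption: the function name `contrastive_masks` is hypothetical, the per-weight importance scores are taken as given inputs, and "contrast" is modeled simply as the difference between a weight's importance under persona A and under persona B.

```python
def contrastive_masks(scores_a, scores_b, keep_ratio):
    """Hypothetical sketch: split weights between two opposing personas.

    scores_a / scores_b: flat lists of per-weight importance measured on
    calibration data for persona A and persona B (assumed to be given).
    Returns (mask_a, mask_b) as 0/1 lists.
    """
    n = len(scores_a)
    n_keep = max(1, round(keep_ratio * n))
    # Assumed contrast score: positive favors persona A, negative favors B.
    diff = [a - b for a, b in zip(scores_a, scores_b)]
    order = sorted(range(n), key=lambda i: diff[i], reverse=True)
    keep_a = set(order[:n_keep])    # most A-specific weights
    keep_b = set(order[-n_keep:])   # most B-specific weights
    mask_a = [1 if i in keep_a else 0 for i in range(n)]
    mask_b = [1 if i in keep_b else 0 for i in range(n)]
    return mask_a, mask_b


if __name__ == "__main__":
    a = [0.9, 0.8, 0.1, 0.5, 0.3, 0.7, 0.4, 0.2]
    b = [0.2, 0.8, 0.9, 0.4, 0.6, 0.1, 0.4, 0.3]
    print(contrastive_masks(a, b, keep_ratio=0.25))
```

In this toy setup, a weight that matters equally to both personas (index 1 in the demo) lands in neither mask, which is one plausible reading of why contrasting the two personas would sharpen their separation. With `keep_ratio` at or below 0.5 the two masks never overlap.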
Abstract
Humans shift between different personas depending on social context. Large Language Models (LLMs) demonstrate a similar flexibility in adopting different personas and behaviors. Existing approaches, however, typically adapt such behavior through external knowledge such as prompting, retrieval-augmented generation (RAG), or fine-tuning. We ask: do LLMs really need external context or parameters to adapt to different behaviors, or do they already have such knowledge embedded in their parameters? In this work, we show that LLMs already contain persona-specialized subnetworks in their parameter space. Using small calibration datasets, we identify distinct activation signatures associated with different personas. Guided by these statistics, we develop a masking strategy that isolates lightweight persona subnetworks. Building on the findings, we further discuss: how can we discover opposing…
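The pipeline the abstract describes (calibration data, activation statistics, then a mask) could look roughly like the sketch below. It is not the authors' implementation: the scoring rule, weight magnitude times the L2 norm of the matching input activation, is borrowed from Wanda-style pruning as an assumption, and the names `activation_norms`, `persona_mask`, and `apply_mask` are hypothetical.

```python
import math


def activation_norms(calibration_inputs):
    """L2 norm of each input feature over a persona's calibration set."""
    sums = [0.0] * len(calibration_inputs[0])
    for x in calibration_inputs:
        for j, v in enumerate(x):
            sums[j] += v * v
    return [math.sqrt(s) for s in sums]


def persona_mask(weights, calibration_inputs, keep_ratio):
    """Keep the top `keep_ratio` fraction of weights in each output row.

    Assumed score: |w_ij| * norm of input feature j on the calibration set.
    """
    norms = activation_norms(calibration_inputs)
    mask = []
    for row in weights:
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        n_keep = max(1, round(keep_ratio * len(row)))
        ranked = sorted(range(len(row)), key=lambda j: scores[j], reverse=True)
        keep = set(ranked[:n_keep])
        mask.append([1 if j in keep else 0 for j in range(len(row))])
    return mask


def apply_mask(weights, mask):
    """Zero out the weights that fall outside the persona subnetwork."""
    return [[w * m for w, m in zip(wr, mr)] for wr, mr in zip(weights, mask)]


if __name__ == "__main__":
    w = [[0.5, -2.0, 0.1, 1.0], [1.5, 0.2, -0.3, 0.4]]
    calib = [[1.0, 0.0, 3.0, 0.5], [2.0, 0.1, 4.0, 0.5]]
    print(persona_mask(w, calib, keep_ratio=0.5))  # [[1, 0, 0, 1], [1, 0, 1, 0]]
```

Note that the large weight `-2.0` is pruned in the demo because its input feature is almost silent on the calibration data. That is the sense in which the mask is guided by activations rather than by weight magnitude alone, and no gradient step is involved.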
Peer Reviews
Decision: ICLR 2026 Poster
1. The idea that personas are embedded within the parameters of pretrained LLMs and can be extracted without additional training provides a fresh perspective on LLM personalization.
2. The contrastive pruning technique proves to be particularly effective in distinguishing opposing personas, which is a challenging aspect in persona modeling.
3. The method offers a training-free solution that is more computationally efficient than alternative techniques such as fine-tuning or RAG, requiring mini
1. While the method works well for some personas, certain personality dimensions, such as N/S and J/P from the MBTI dataset, show weaker separation, leading to less distinct personas. This limitation could be addressed with more dimension-aware or layer-aware techniques.
2. Results are reported on Llama models, and the method's scalability to other architectures or domain-specific tasks is not fully explored. The authors should clarify how well this approach might generalize to
- The idea of identifying a subnetwork that represents a target persona is interesting.
- The method does not require explicit gradient-based training, which makes the overall process simple and interpretable.
- The provided analyses on the persona evaluation are extensive.
- The manuscript is well written and easy to follow.
### 1. Effect of Pruning on General Performance
While the approach for identifying sub-networks linked to specific personality traits is compelling, the work does not address how pruning affects overall model performance. Including an evaluation of whether important downstream capabilities are improved, or at least preserved, would significantly strengthen the contribution.
### 2. Precise Mechanism of the Contrastive Pruning Algorithm
I am skeptical about the contrastive pruning algorithm
* The paper is well-motivated. It is intuitive that pretraining can embed personality subnetworks in LLMs, and the proposed training-free pruning provides a practical way to approximate the upper bound of persona knowledge already encoded in the parameters.
* The proposed activation-guided and contrastive pruning framework is theoretically grounded in the lottery ticket hypothesis and activation-based interpretability, making it a principled way to isolate latent persona subnetworks already embe
* The proposed framework essentially functions as an interpretability probe rather than a generative alignment method. Its real contribution lies in exploring the upper bound of persona encoding already latent in LLMs, not in improving persona expression. Therefore, directly comparing it with SFT is conceptually inconsistent. For an interpretability-oriented method, the most crucial evaluation should concern faithfulness—whether the discovered subnetworks truly correspond to the model's intrinsi
Taxonomy
Topics: Persona Design and Applications · Machine Learning in Healthcare · Topic Modeling
