Fine-Grained Interpretation of Political Opinions in Large Language Models
Jingyu Hu, Mengyue Yang, Mengnan Du, Weiru Liu

TL;DR
This paper develops a multi-dimensional framework and interpretable vectors to analyze and influence the internal political opinions of large language models, improving transparency and control over their political responses.
Contribution
It introduces a four-dimensional political learning framework and constructs a dataset for fine-grained political concept vector learning, enabling better interpretability and intervention in LLMs' political opinions.
Findings
Vectors can disentangle political concept confounds.
Vectors show good generalization and robustness in OOD settings.
Vectors can be used to intervene on LLM internals and shift the political leaning of responses.
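To make the findings above concrete, here is a minimal sketch of how a concept vector might be learned, used for detection, and used for intervention. It assumes a difference-of-means direction over hidden states, which is one common representation engineering technique; this summary does not say which technique the paper uses, so this is an illustration and not the authors' method. The function names (`concept_vector`, `project`, `steer`), the toy 3-dimensional "hidden states", and the pole A / pole B labels are all hypothetical.

```python
from math import sqrt

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_vector(pole_a_states, pole_b_states):
    """Unit-length difference-of-means direction pointing from pole B to pole A.

    Assumption: each input row is a hidden state collected from a prompt
    labelled with one pole of a single political dimension.
    """
    mu_a = mean_vector(pole_a_states)
    mu_b = mean_vector(pole_b_states)
    diff = [a - b for a, b in zip(mu_a, mu_b)]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("the two groups have identical means")
    return [d / norm for d in diff]

def project(state, vector):
    """Detection: signed score of a hidden state along the concept direction."""
    return sum(s * v for s, v in zip(state, vector))

def steer(state, vector, alpha):
    """Intervention: shift a hidden state by alpha along the concept direction."""
    return [s + alpha * v for s, v in zip(state, vector)]

# Toy data: the first coordinate separates the two poles, the others do not.
pole_a = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.5]]
pole_b = [[-1.0, 0.0, 0.5], [-0.8, 0.2, 0.5]]

v = concept_vector(pole_a, pole_b)
neutral = [0.0, 0.1, 0.5]
score_before = project(neutral, v)
score_after = project(steer(neutral, v, alpha=2.0), v)
```

In a real model the hidden states would come from a chosen layer, and the steered state would be written back during the forward pass; a multi-dimensional framework would learn one such vector per dimension.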
Abstract
Studies of LLMs' political opinions mainly rely on evaluations of their open-ended responses. Recent work indicates a misalignment between LLMs' responses and their internal intentions. This motivates us to probe LLMs' internal mechanisms to help uncover their internal political states. Additionally, we find that analysis of LLMs' political opinions often relies on single-axis concepts, which can lead to concept confounds. In this work, we extend the single-axis analysis to multiple dimensions and apply interpretable representation engineering techniques for more transparent LLM political concept learning. Specifically, we design a four-dimensional political learning framework and construct a corresponding dataset for fine-grained political concept vector learning. These vectors can be used to detect and intervene in LLM internals. Experiments are conducted on eight…
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Computational and Text Analysis Methods · Topic Modeling
