Understanding and Mitigating Political Stance Cross-topic Generalization in Large Language Models
Jiayi Zhang, Shu Yang, Junchao Wu, Derek F. Wong, Di Wang

TL;DR
This paper investigates the internal neuron mechanisms behind political stance generalization in large language models and introduces a method to mitigate unintended cross-topic political influence.
Contribution
It identifies distinct political neuron types and proposes InhibitFT, a novel fine-tuning approach that reduces cross-topic stance generalization while maintaining performance.
Findings
Political neurons are consistent across models and datasets.
InhibitFT reduces cross-topic stance generalization by 20%.
Inhibiting 5% of neurons effectively mitigates the issue.
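As a rough illustration of what "inhibiting 5% of neurons" could look like in practice, the sketch below selects the highest-scoring fraction of neurons and leaves their weights untouched during a gradient step. This is only one plausible reading: the summary does not say how InhibitFT inhibits neurons, so freezing updates is an assumption, and the names `top_fraction_mask` and `masked_update` are hypothetical.

```python
def top_fraction_mask(scores, fraction=0.05):
    """Mark the highest-scoring `fraction` of neurons (at least one) as inhibited."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(scores) * fraction))
    # Rank neuron indices by score, highest first, and keep the first k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(scores))]


def masked_update(weights, grads, inhibit_mask, lr=0.1):
    """Plain gradient step that skips inhibited neurons (assumed form of inhibition)."""
    return [
        w if inhibited else w - lr * g
        for w, g, inhibited in zip(weights, grads, inhibit_mask)
    ]
```

With 20 neurons and `fraction=0.05`, exactly one neuron is masked; every other weight receives the ordinary update.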
Abstract
Fine-tuning Large Language Models on a political topic will significantly manipulate their political stance on various issues and unintentionally affect their stance on unrelated topics. While previous studies have raised this issue, there is still a lack of understanding regarding the internal representations of these stances and the mechanisms that lead to unintended cross-topic generalization. In this paper, we systematically explore the internal mechanisms underlying this phenomenon from a neuron-level perspective and investigate how to mitigate the cross-topic generalization of political fine-tuning. First, we propose Political Neuron Localization through Activation Contrasting (PNLAC) to identify two distinct types of political neurons: general political neurons, which govern stance across multiple political topics, and topic-specific neurons that affect the model's political stance on…
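The idea of localizing neurons by contrasting activations can be sketched as follows. This is a minimal toy version under stated assumptions, not the paper's PNLAC procedure: the scoring rule (absolute difference of mean activations between two stance conditions), the top-k selection, and the function names are all hypothetical. In particular, treating "general" as "selected for every topic" and "topic-specific" as "selected for exactly one topic" is our own reading, since the abstract is truncated at that point.

```python
from statistics import mean


def contrast_scores(acts_a, acts_b):
    """Per-neuron |mean activation under stance A - mean activation under stance B|.

    acts_a, acts_b: lists of per-prompt activation vectors (one float per neuron).
    """
    n_neurons = len(acts_a[0])
    return [
        abs(mean(row[j] for row in acts_a) - mean(row[j] for row in acts_b))
        for j in range(n_neurons)
    ]


def top_k(scores, k):
    """Indices of the k neurons with the largest contrast score."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def split_neurons(per_topic_scores, k):
    """Split selected neurons into general (all topics) and topic-specific (one topic)."""
    selected = {topic: top_k(s, k) for topic, s in per_topic_scores.items()}
    general = set.intersection(*selected.values())
    specific = {}
    for topic, chosen in selected.items():
        # Neurons selected for any other topic are not specific to this one.
        others = set().union(*(v for t, v in selected.items() if t != topic))
        specific[topic] = chosen - others
    return general, specific
```

In a real model the activation vectors would come from recorded hidden-layer outputs on stance-contrasting prompts; here they are plain lists so the selection logic can be checked by hand.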
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · Misinformation and Its Impacts
