Steering Towards Fairness: Mitigating Political Bias in LLMs
Afrozah Nadeem, Mark Dras, Usman Naseem

TL;DR
This paper investigates political biases in large language models by analyzing internal representations and proposes a method to mitigate such biases using contrastive analysis and steering vectors.
Contribution
The paper presents a framework for probing and mitigating political bias in decoder-based LLMs, combining analysis of internal activations with a layer-wise mitigation approach.
Findings
Decoder LLMs encode political biases across layers.
Contrastive activation analysis reveals ideological disparities.
Mitigation via steering vectors reduces bias effectively.
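To illustrate what a layer-wise contrastive comparison can look like, here is a minimal sketch in pure Python. It assumes that activations for two opposing framings have already been extracted per layer, and it uses cosine distance between mean activations as the disparity measure. The function names, the toy numbers, and the choice of cosine distance are all assumptions made for illustration; this summary does not specify the metric the paper uses.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity; 0.0 for same direction, 1.0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def layerwise_gap(acts_a, acts_b):
    """Compare two framings layer by layer.

    acts_a[layer] is a list of activation vectors collected for framing A
    at that layer; acts_b is the same for framing B (hypothetical layout).
    Returns one cosine distance per layer.
    """
    return [
        cosine_distance(mean_vector(layer_a), mean_vector(layer_b))
        for layer_a, layer_b in zip(acts_a, acts_b)
    ]


# Toy activations (invented numbers): two layers, two prompts each, dimension 2.
toy_a = [
    [[1.0, 0.0], [1.0, 0.0]],  # layer 0
    [[1.0, 0.0], [1.0, 0.0]],  # layer 1
]
toy_b = [
    [[1.0, 0.0], [1.0, 0.0]],  # layer 0: same direction as framing A
    [[0.0, 1.0], [0.0, 1.0]],  # layer 1: orthogonal to framing A
]

gaps = layerwise_gap(toy_a, toy_b)
```

In this toy setup the gap is zero at layer 0 and maximal at layer 1, which is the kind of per-layer pattern such an analysis is meant to surface.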
Abstract
Recent advancements in large language models (LLMs) have enabled their widespread use across diverse real-world applications. However, concerns remain about their tendency to encode and reproduce ideological biases along political and economic dimensions. In this paper, we employ a framework for probing and mitigating such biases in decoder-based LLMs through analysis of internal model representations. Grounded in the Political Compass Test (PCT), this method uses contrastive pairs to extract and compare hidden layer activations from models like Mistral and DeepSeek. We introduce a comprehensive activation extraction pipeline capable of layer-wise analysis across multiple ideological axes, revealing meaningful disparities linked to political framing. Our results show that decoder LLMs systematically encode representational bias across layers, which can be leveraged for effective…
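The mitigation idea can be sketched as follows. This is a common difference-of-means formulation of a steering vector, not necessarily the paper's exact procedure, which the truncated abstract does not spell out. The helper names, the scaling factor `alpha`, and the toy activations are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(acts_pos, acts_neg):
    """Difference of mean activations between the two sides of a contrastive set.

    acts_pos / acts_neg are lists of hidden-state vectors taken at one layer
    (assumed layout); the result points from the 'neg' mean to the 'pos' mean.
    """
    mean_pos = mean_vector(acts_pos)
    mean_neg = mean_vector(acts_neg)
    return [p - n for p, n in zip(mean_pos, mean_neg)]


def apply_steering(hidden, vector, alpha):
    """Shift a hidden state along the steering vector.

    A negative alpha moves the state away from the 'pos' side; the value of
    alpha is a free parameter that would need tuning in practice.
    """
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations at a single layer (invented numbers, dimension 2).
acts_pos = [[2.0, 1.0], [4.0, 1.0]]  # mean is [3.0, 1.0]
acts_neg = [[1.0, 1.0], [1.0, 1.0]]  # mean is [1.0, 1.0]

vec = steering_vector(acts_pos, acts_neg)                  # [2.0, 0.0]
steered = apply_steering([3.0, 5.0], vec, alpha=-0.5)      # [2.0, 5.0]
```

With a real model, the same shift would be applied to the hidden states of a chosen layer during generation, for example through a forward hook, with one vector computed per layer and per ideological axis.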
Taxonomy
Topics: Ethics in Business and Education · Business Law and Ethics
