On the Relationship between Truth and Political Bias in Language Models
Suyash Fulay, William Brannon, Shrestha Mohanty, Cassandra Overney, Elinor Poole-Dayan, Deb Roy, Jad Kabbara

TL;DR
This paper investigates how training reward models for truthfulness affects their political bias. It finds a tendency towards left-leaning bias that varies with the training dataset and grows with model size.
Contribution
It provides the first analysis of the relationship between truthfulness and political bias in language models, showing that optimizing for truthfulness can increase political bias.
Findings
Reward models trained for truthfulness tend to be left-leaning.
Existing open-source reward models also exhibit similar bias.
Larger models show a greater degree of political bias.
Abstract
Language model alignment research often attempts to ensure that models are not only helpful and harmless, but also truthful and unbiased. However, optimizing these objectives simultaneously can obscure how improving one aspect might impact the others. In this work, we focus on analyzing the relationship between two concepts essential in both language model alignment and political science: truthfulness and political bias. We train reward models on various popular truthfulness datasets and subsequently evaluate their political bias. Our findings reveal that optimizing reward models for truthfulness on these datasets tends to result in a left-leaning political bias. We also find that existing open-source reward models (i.e., those trained on standard human preference datasets) already show a similar bias and that the bias is larger for larger models. These results raise important questions…
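To make the evaluation step concrete, the following is a minimal sketch of how a political bias gap could be computed from reward-model scores. It is not the authors' code: the function `political_bias_gap`, the `score` callable, and the toy reward values are all hypothetical stand-ins, and the paper's actual metric may differ.

```python
from statistics import mean
from typing import Callable, Iterable


def political_bias_gap(
    score: Callable[[str], float],
    left_statements: Iterable[str],
    right_statements: Iterable[str],
) -> float:
    """Mean reward on left-leaning statements minus mean reward on right-leaning ones.

    A positive value means the scorer prefers the left-leaning set on average.
    `score` is assumed to wrap a reward model; here it is any str -> float callable.
    """
    left_mean = mean(score(s) for s in left_statements)
    right_mean = mean(score(s) for s in right_statements)
    return left_mean - right_mean


# Toy stand-in for a reward model (hypothetical values, not real results).
toy_rewards = {"L1": 0.5, "L2": 1.5, "R1": 0.25, "R2": 0.75}

gap = political_bias_gap(toy_rewards.__getitem__, ["L1", "L2"], ["R1", "R2"])
print(f"bias gap: {gap:+.2f}")
```

In practice `score` would call a trained reward model on paired statements, and the gap would be compared across datasets and model sizes.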
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques · Translation Studies and Practices
