Understanding Political Polarisation using Language Models: A dataset and method
Samiran Gode, Supreeth Bare, Bhiksha Raj, Hyungon Yoo

TL;DR
This paper introduces a new dataset from Wikipedia covering 120 years and a language model-based method to analyze political polarization among US candidates, aiming to inform voters and understand polarization trends.
Contribution
It provides a comprehensive, bias-cleaned dataset and a novel language model approach to measure political polarization over time.
Findings
Classical models like Word2Vec and Doc2Vec show initial polarization patterns.
Transformer-based models like Longformer better capture candidate similarities.
Polarization trends vary across different historical phases.
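As a rough illustration of how embedding-based similarity could be turned into a polarization measure, here is a minimal sketch. It is not the authors' method: the centroid-distance metric, the function names (`cosine`, `centroid`, `polarization_score`), and the toy vectors are all assumptions made for illustration. In practice the vectors would come from a model such as Doc2Vec or Longformer.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def polarization_score(group_a, group_b):
    """Hypothetical metric: 1 - cosine similarity of the two group centroids.

    Values near 0 mean the groups' average embeddings point the same way;
    larger values mean they diverge.
    """
    return 1.0 - cosine(centroid(group_a), centroid(group_b))

# Toy 3-dimensional candidate embeddings, invented for illustration only.
party_a = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
party_b = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]

score = polarization_score(party_a, party_b)
print(f"polarization score: {score:.3f}")
```

Computing the same score separately for each chronological phase would give one way to compare polarization over time, under the same assumptions.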
Abstract
Our paper aims to analyze political polarization in the US political system using language models, and thereby help candidates make an informed decision. The availability of this information will help voters understand their candidates' views on the economy, healthcare, education and other social issues. Our main contributions are a dataset extracted from Wikipedia that spans the past 120 years and a language-model-based method that helps analyze how polarized a candidate is. Our data is divided into two parts, background information and political information about a candidate, since our hypothesis is that the political views of a candidate should be based on reason and be independent of factors such as birthplace, alma mater, etc. We further split this data into four phases chronologically, to help understand if and how the polarization amongst candidates changes. This data has been cleaned to…
Taxonomy
Topics: Media Influence and Politics
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Linear Warmup With Linear Decay · Residual Connection · Attention Dropout · AdamW
