TL;DR
POLUSA is a large, balanced dataset of 0.9 million US political news articles from 2017-2019, labeled by political leaning, designed to facilitate research on media bias, societal issues, and deep learning applications.
Contribution
The paper introduces POLUSA, a comprehensive, balanced dataset of US political news articles with political labels, addressing limitations of previous datasets for social science and NLP research.
Findings
Dataset covers 0.9M articles from 18 outlets
Balanced by time and outlet popularity
Labels outlets by political leaning
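As a rough illustration of what outlet-level labeling and a temporal balance check might look like, here is a minimal sketch. The outlet names, the `OUTLET_LEANING` map, the article fields, and both helper functions are hypothetical and are not taken from POLUSA or its tooling.

```python
from collections import Counter

# Hypothetical outlet-to-leaning map; these are NOT the labels used in POLUSA.
OUTLET_LEANING = {
    "outlet_a": "left",
    "outlet_b": "center",
    "outlet_c": "right",
}


def label_articles(articles):
    """Attach the outlet-level leaning label to each article dict."""
    return [
        {**a, "leaning": OUTLET_LEANING.get(a["outlet"], "unknown")}
        for a in articles
    ]


def monthly_counts(articles):
    """Count articles per (YYYY-MM, leaning) pair to inspect balance over time."""
    return Counter((a["date"][:7], a["leaning"]) for a in articles)


# Toy records with assumed field names ("outlet", ISO-formatted "date").
articles = [
    {"outlet": "outlet_a", "date": "2017-01-05"},
    {"outlet": "outlet_b", "date": "2017-01-17"},
    {"outlet": "outlet_c", "date": "2017-02-02"},
    {"outlet": "outlet_a", "date": "2017-02-20"},
]

labeled = label_articles(articles)
counts = monthly_counts(labeled)
```

Because the label is assigned per outlet, every article inherits its publisher's leaning. Months or leanings with very different counts in `counts` would indicate an imbalance that a downstream analysis may need to correct for.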
Abstract
News articles covering policy issues are an essential source of information in the social sciences and are also frequently used for other use cases, e.g., to train NLP language models. To derive meaningful insights from the analysis of news, large datasets are required that represent real-world distributions, e.g., with respect to the contained outlets' popularity, topically, or across time. Information on the political leanings of media publishers is often needed, e.g., to study differences in news reporting across the political spectrum, which is one of the prime use cases in the social sciences when studying media bias and related societal issues. Concerning these requirements, existing datasets have major flaws, resulting in redundant and cumbersome effort in the research community for dataset creation. To fill this gap, we present POLUSA, a dataset that represents the online media…
