Classification of worldwide news articles by perceived quality, 2018-2024
Connor McElroy, Thiago E. A. de Oliveira, Chris Brogly

TL;DR
This study evaluates three machine learning classifiers and three deep learning models on a dataset of 1,412,272 English news articles, classifying each article as perceived lower or higher quality. ModernBERT-large performs best, at 87.44% accuracy.
Contribution
It introduces a new dataset of over 1.4 million English news articles, labelled at the website level from expert consensus ratings of 579 source websites, and compares six models for quality classification.
Findings
Deep learning models outperform traditional classifiers in accuracy.
ModernBERT-large achieves the highest accuracy of 87.44%.
Traditional classifiers such as Random Forest reach 73.55% accuracy (0.8131 ROC AUC).
Abstract
This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. Three machine learning classifiers and three deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low and high-quality classes of about 706,000 articles each, with 194 linguistic features per website-level labelled article. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy; 0.9593 ROC AUC; 0.8739 F1), followed by DistilBERT-base (512 context…
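The website-level labelling described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the site names, ratings, and function names are hypothetical, the median is taken over websites (the abstract does not say whether it was computed over websites or over articles), and ratings equal to the median are assigned to the low class.

```python
from statistics import median


def label_sites_by_median(site_ratings):
    """Map each site to 1 (perceived higher quality) or 0 (perceived lower quality)."""
    cutoff = median(site_ratings.values())
    # Assumption: a rating equal to the median falls in the low class.
    return {site: int(score > cutoff) for site, score in site_ratings.items()}


def label_articles(articles, site_labels):
    """Give every article the label of its source website."""
    return [(text, site_labels[site]) for site, text in articles]


# Hypothetical ratings and articles, for illustration only.
site_ratings = {
    "a.example": 0.9,
    "b.example": 0.4,
    "c.example": 0.7,
    "d.example": 0.2,
}
articles = [
    ("a.example", "text one"),
    ("d.example", "text two"),
    ("c.example", "text three"),
]

site_labels = label_sites_by_median(site_ratings)
labelled = label_articles(articles, site_labels)
```

Because every article inherits its source's label, the classes describe perceived quality of the publishing website rather than a per-article judgement.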
Taxonomy
Topics: Computational and Text Analysis Methods · Misinformation and Its Impacts · Sentiment Analysis and Opinion Mining
