Binary classification for perceived quality of headlines and links on worldwide news websites, 2018-2024
Austin McCutcheon, Thiago E. A. de Oliveira, Aleksandr Zheleznov, Chris Brogly

TL;DR
This study evaluates machine learning models, including traditional ensemble methods and a fine-tuned DistilBERT, for automatically classifying the perceived quality of news headlines and links in a large worldwide dataset, reaching up to 90.3% accuracy.
Contribution
It introduces a large-scale dataset and compares traditional ML and deep learning models for automatic quality classification of news headlines and links.
Findings
Traditional ensemble methods, led by the bagging classifier, reached 88.1% accuracy and 88.3% F1.
Fine-tuned DistilBERT achieved the highest accuracy (90.3%) but required more training time.
Both NLP features and deep learning effectively differentiate news quality.
Abstract
The proliferation of online news enables potential widespread publication of perceived low-quality news headlines/links. As a result, we investigated whether it was possible to automatically distinguish perceived lower-quality news headlines/links from perceived higher-quality headlines/links. We evaluated twelve machine learning models on a binary, balanced dataset of 57,544,214 worldwide news website links/headings from 2018-2024 (28,772,107 per class) with 115 extracted linguistic features. Binary labels for each text were derived from scores based on expert consensus regarding the respective news domain quality. Traditional ensemble methods, particularly the bagging classifier, had strong performance (88.1% accuracy, 88.3% F1, 80/20 train/test split). Fine-tuned DistilBERT achieved the highest accuracy (90.3%, 80/20 train/test split) but required more training time. The results…
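The evaluation setup described in the abstract (domain-level scores turned into binary labels, a balanced dataset, an 80/20 train/test split, accuracy and F1) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the domain names, the scores, the 0.5 threshold, the use of downsampling to balance classes, and every function name here are hypothetical, and F1 is computed for the higher-quality class only.

```python
import random

# Hypothetical domain quality scores in [0, 1]. In the paper the scores come
# from expert consensus about each news domain; these values are made up.
DOMAIN_SCORES = {
    "example-high.com": 0.9,
    "example-mid.com": 0.6,
    "example-low.com": 0.2,
}
THRESHOLD = 0.5  # assumed cut-off; the abstract does not state one


def label_for_domain(domain, scores=DOMAIN_SCORES, threshold=THRESHOLD):
    """Return 1 (perceived higher quality) or 0 (perceived lower quality)."""
    return 1 if scores[domain] >= threshold else 0


def balance(rows, seed=0):
    """Downsample the larger class so both classes have equal counts.

    Each row is a (headline, label) pair. Downsampling is an assumption.
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for row in rows:
        by_label[row[1]].append(row)
    n = min(len(by_label[0]), len(by_label[1]))
    balanced = rng.sample(by_label[0], n) + rng.sample(by_label[1], n)
    rng.shuffle(balanced)
    return balanced


def split_80_20(rows):
    """Split already-shuffled rows into 80% train and 20% test."""
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]


def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1
```

A classifier trained on the 80% portion would produce the `y_pred` passed to `accuracy_and_f1` for the held-out 20%. Labelling at the domain level means every headline from a given site inherits that site's label, which is why the task is described as classifying perceived quality.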
Taxonomy
Topics: Web visibility and informetrics · Health Literacy and Information Accessibility · Misinformation and Its Impacts
