Binary classification for perceived quality of headlines and links on worldwide news websites, 2018-2024

Austin McCutcheon; Thiago E. A. de Oliveira; Aleksandr Zheleznov; Chris Brogly

arXiv:2506.09381 · cs.CL · June 12, 2025

TL;DR

This study evaluates twelve machine learning models, spanning traditional ensemble methods and deep learning, for automatically classifying the perceived quality of news headlines and links from a large worldwide dataset. The best model, fine-tuned DistilBERT, reached 90.3% accuracy.

Contribution

It introduces a balanced dataset of 57,544,214 news headlines and links from 2018-2024 and compares traditional ML and deep learning models for automatic classification of their perceived quality.

Findings

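01 Among traditional ensemble methods, the bagging classifier performed best, with 88.1% accuracy and 88.3% F1.

02 Fine-tuned DistilBERT achieved the highest accuracy, 90.3%, but required more training time.

03 Both extracted NLP features and deep learning effectively differentiate perceived news quality.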

Abstract

The proliferation of online news enables potential widespread publication of perceived low-quality news headlines/links. As a result, we investigated whether it was possible to automatically distinguish perceived lower-quality news headlines/links from perceived higher-quality headlines/links. We evaluated twelve machine learning models on a binary, balanced dataset of 57,544,214 worldwide news website links/headings from 2018-2024 (28,772,107 per class) with 115 extracted linguistic features. Binary labels for each text were derived from scores based on expert consensus regarding the respective news domain quality. Traditional ensemble methods, particularly the bagging classifier, had strong performance (88.1% accuracy, 88.3% F1, 80/20 train/test split). Fine-tuned DistilBERT achieved the highest accuracy (90.3%, 80/20 train/test split) but required more training time. The results…
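
The dataset arithmetic and the two reported metrics can be made concrete with a minimal sketch. It assumes a simple holdout split in which the test size is rounded up (the abstract gives only the 80/20 ratio, not the rounding convention), and the confusion counts in the metric example are hypothetical, not results from the paper. The function names are illustrative only.

```python
import math


def split_sizes(n_samples, test_fraction=0.2):
    """Return (n_train, n_test) for a holdout split.

    Assumption: the test size is rounded up; the paper states only 80/20.
    """
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    return 2 * tp / (2 * tp + fp + fn)


# Dataset size from the abstract: two balanced classes.
PER_CLASS = 28_772_107
TOTAL = 2 * PER_CLASS

n_train, n_test = split_sizes(TOTAL)
print(f"total={TOTAL:,} train={n_train:,} test={n_test:,}")

# Hypothetical confusion counts, for illustration only.
tp, fp, tn, fn = 8, 2, 7, 3
print(f"accuracy={accuracy(tp, fp, tn, fn):.3f} f1={f1_score(tp, fp, fn):.3f}")
```

On a balanced binary dataset like this one, accuracy and F1 tend to track each other closely, which is consistent with the 88.1% and 88.3% figures reported for the bagging classifier.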

Taxonomy

Topics: Web visibility and informetrics · Health Literacy and Information Accessibility · Misinformation and Its Impacts