Modeling "Newsworthiness" for Lead-Generation Across Corpora

Alexander Spangher; Nanyun Peng; Jonathan May; Emilio Ferrara

arXiv:2104.09653 · cs.CL · April 21, 2021 · 1 citation

TL;DR

This paper develops a model to predict newsworthiness of documents across different corpora, helping journalists identify interesting leads from large, unlabeled datasets like court cases and bills.

Contribution

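It introduces a transfer learning approach: a RoBERTa model is fine-tuned on published newspaper articles, automatically labeled by whether they ran on the front page, and then used to rank unlabeled court cases, bills, and city-council meeting minutes by newsworthiness.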

Findings

1. A fine-tuned RoBERTa model achieves 0.93 AUC on held-out labeled documents.
2. The same model reaches 0.88 AUC on expert-validated documents from the unlabeled corpora.
3. The paper provides interpretation and visualization of the models.

Abstract

Journalists obtain "leads", or story ideas, by reading large corpora of government records: court cases, proposed bills, etc. However, only a small percentage of such records are interesting documents. We propose a model of "newsworthiness" aimed at surfacing interesting documents. We train models on automatically labeled corpora (published newspaper articles) to predict whether each article was a front-page article (i.e., newsworthy) or not (i.e., less newsworthy). We transfer these models to unlabeled corpora (court cases, bills, city-council meeting minutes) to rank documents in these corpora on "newsworthiness". A fine-tuned RoBERTa model achieves 0.93 AUC on held-out labeled documents, and 0.88 AUC on expert-validated unlabeled corpora. We provide interpretation and visualization for our models.
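The abstract describes two operations on model outputs: ranking documents by a predicted newsworthiness score, and evaluating that ranking with AUC. The sketch below illustrates both on toy data. It is not the authors' code: the function names `rank_by_score` and `roc_auc`, the document identifiers, and the scores are all hypothetical, and the scores stand in for whatever probability a trained classifier would assign.

```python
from typing import List, Sequence


def rank_by_score(doc_ids: Sequence[str], scores: Sequence[float]) -> List[str]:
    """Return document ids ordered from highest to lowest score."""
    pairs = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in pairs]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive document receives the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical documents and scores (assumed, not taken from the paper).
doc_ids = ["case_a", "bill_b", "minutes_c", "case_d"]
scores = [0.91, 0.15, 0.40, 0.72]
# Hypothetical expert judgments: 1 = newsworthy, 0 = less newsworthy.
labels = [1, 0, 1, 0]

ranking = rank_by_score(doc_ids, scores)
auc = roc_auc(labels, scores)
print(ranking)
print(auc)
```

The pairwise definition is quadratic in the number of documents, which is acceptable for a small expert-validated sample; for larger sets a rank-based computation or a library routine would be the usual choice.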

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Linear Warmup With Linear Decay · WordPiece · Attention Dropout · Layer Normalization