Modeling "Newsworthiness" for Lead-Generation Across Corpora
Alexander Spangher, Nanyun Peng, Jonathan May, Emilio Ferrara

TL;DR
This paper develops a model to predict newsworthiness of documents across different corpora, helping journalists identify interesting leads from large, unlabeled datasets like court cases and bills.
Contribution
It introduces a transfer learning approach using RoBERTa trained on labeled newspaper articles to rank unlabeled legal and governmental documents by newsworthiness.
Findings
Achieved 0.93 AUC on held-out labeled newspaper articles
Reached 0.88 AUC on expert-validated documents from the unlabeled corpora
Provided interpretation and visualization of the models
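For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen newsworthy document receives a higher score than a randomly chosen less newsworthy one. The sketch below computes it by direct pairwise comparison. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not the paper's evaluation code or data.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 (1 = newsworthy); scores: model scores, same order.
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two newsworthy and three less newsworthy documents.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.2, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 5 of 6 pairs are ordered correctly
```

The quadratic loop is only meant to make the definition visible; a sort-based implementation is preferable for corpora of realistic size.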
Abstract
Journalists obtain "leads", or story ideas, by reading large corpora of government records: court cases, proposed bills, etc. However, only a small percentage of such records are interesting documents. We propose a model of "newsworthiness" aimed at surfacing interesting documents. We train models on automatically labeled corpora (published newspaper articles) to predict whether each article was a front-page article (i.e., newsworthy) or not (i.e., less newsworthy). We transfer these models to unlabeled corpora (court cases, bills, city-council meeting minutes) to rank documents in these corpora on "newsworthiness". A fine-tuned RoBERTa model achieves 0.93 AUC on held-out labeled documents, and 0.88 AUC on expert-validated unlabeled corpora. We provide interpretation and visualization for our models.
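The pipeline described in the abstract (label articles by front-page placement, train a scorer, then rank documents from another corpus) can be sketched as follows. This is a minimal illustration under stated assumptions: a smoothed word log-odds scorer stands in for the fine-tuned RoBERTa model, "front page" is assumed to mean page number 1, and all names and example texts are hypothetical.

```python
import math
from collections import Counter


def label_article(page):
    """Automatic label: 1 (newsworthy) if front page, else 0. Page 1 is an assumption."""
    return 1 if page == 1 else 0


def train_scorer(articles):
    """Fit a toy log-odds scorer on (text, page) pairs; returns text -> score.

    Stand-in for a fine-tuned classifier; NOT the paper's model.
    """
    counts = {0: Counter(), 1: Counter()}
    for text, page in articles:
        counts[label_article(page)].update(text.lower().split())
    totals = {y: sum(c.values()) for y, c in counts.items()}
    vocab = set(counts[0]) | set(counts[1])
    vocab_size = len(vocab)

    def score(text):
        total = 0.0
        for tok in text.lower().split():
            if tok in vocab:  # tokens unseen in training contribute nothing
                p1 = (counts[1][tok] + 1) / (totals[1] + vocab_size)
                p0 = (counts[0][tok] + 1) / (totals[0] + vocab_size)
                total += math.log(p1 / p0)
        return total

    return score


def rank_documents(docs, score):
    """Transfer step: order unlabeled documents from most to least newsworthy."""
    return sorted(docs, key=score, reverse=True)


# Hypothetical labeled newspaper articles: (text, page number).
articles = [
    ("governor indicted on fraud charges", 1),
    ("senate passes budget after fraud probe", 1),
    ("local bakery wins pie contest", 14),
    ("library extends weekend hours", 22),
]
# Hypothetical unlabeled government records.
records = [
    "minutes: council approves library hours",
    "court filing alleges fraud by contractor",
]
scorer = train_scorer(articles)
ranked = rank_documents(records, scorer)
```

With this toy data the court filing ranks above the council minutes, because "fraud" appears only in the front-page articles. The point of the sketch is the separation between the labeled training corpus and the unlabeled corpus being ranked, not the choice of scorer.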
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Linear Warmup With Linear Decay · WordPiece · Attention Dropout · Layer Normalization
