Newswire: A Large-Scale Structured Database of a Century of Historical News
Emily Silcock, Abhishek Arora, Luca D'Amico-Wong, Melissa Dell

TL;DR
The paper presents a structured database of 2.7 million unique, public domain U.S. newswire articles written between 1878 and 1977, reconstructed from raw newspaper scans with a customized deep learning pipeline to support historical and linguistic research.
Contribution
It introduces a novel deep learning pipeline to extract, disambiguate, and structure century-long newswire content from raw images, creating a valuable resource for multiple disciplines.
Findings
Reconstructed 2.7 million articles from 1878-1977
Developed models for entity disambiguation and topic classification
Provided georeferenced and richly annotated news data
Abstract
In the U.S. historically, local newspapers drew their content largely from newswires like the Associated Press. Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world, but there is no comprehensive archive of the content sent over newswires. We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers. The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written between 1878 and 1977. Locations in these articles are georeferenced, topics are tagged using customized neural topic classification, named entities are recognized, and individuals are disambiguated to Wikipedia using a novel entity disambiguation model. To construct the Newswire dataset, we first recognize newspaper layouts and…
Code & Models
- 🤗 dell-research-harvard/LinkMentions (model)
- 🤗 dell-research-harvard/byline-detection (model) · 3 dl · ♡ 1
- 🤗 dell-research-harvard/wire-classifier (model) · 10 dl · ♡ 1
- 🤗 dell-research-harvard/topic-antitrust (model) · 3 dl · ♡ 1
- 🤗 dell-research-harvard/topic-fire (model) · 10 dl
- 🤗 dell-research-harvard/topic-labor_movement (model) · 5 dl
- 🤗 dell-research-harvard/topic-politics (model) · 5 dl · ♡ 1
- 🤗 dell-research-harvard/topic-sport (model) · 1 dl
- 🤗 dell-research-harvard/topic-civil_rights (model) · 1 dl
- 🤗 dell-research-harvard/topic-crime (model)
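The repositories listed above share one organization prefix, and the topic classifiers follow a topic-<name> naming pattern. As a minimal sketch, assuming only that pattern and the standard Hugging Face Hub URL layout, the helpers below build repository IDs and model page URLs. The function names (topic_repo_id, hub_url, download) are hypothetical and are not taken from the paper's code. The optional download step relies on snapshot_download from the huggingface_hub package and needs network access. The topic list covers only the models shown here and may not be complete.

```python
# Repository names are copied from the list above; the helpers are illustrative only.
ORG = "dell-research-harvard"
TOPICS = ["antitrust", "fire", "labor_movement", "politics",
          "sport", "civil_rights", "crime"]
OTHER_MODELS = ["LinkMentions", "byline-detection", "wire-classifier"]


def topic_repo_id(topic: str) -> str:
    """Return the Hub repository ID for one of the listed topic classifiers."""
    if topic not in TOPICS:
        raise ValueError(f"unknown topic: {topic!r}")
    return f"{ORG}/topic-{topic}"


def hub_url(repo_id: str) -> str:
    """Return the model page URL on the Hugging Face Hub."""
    return f"https://huggingface.co/{repo_id}"


def download(repo_id: str) -> str:
    """Fetch every file in a repository and return the local directory path."""
    # Imported lazily so the URL helpers work without the optional dependency.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id)


if __name__ == "__main__":
    for name in OTHER_MODELS:
        print(hub_url(f"{ORG}/{name}"))
    for topic in TOPICS:
        print(hub_url(topic_repo_id(topic)))
```

Calling download(topic_repo_id("politics")) would fetch that repository's files to the local cache. How the weights are then loaded depends on each model's architecture, which this page does not state.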
Taxonomy
Topics: Digital Humanities and Scholarship
Methods: Lib
