A Massive Scale Semantic Similarity Dataset of Historical English
Emily Silcock, Melissa Dell

TL;DR
This paper introduces a large-scale semantic similarity dataset built from digitized historical U.S. newspapers. It can be used to train language models and to study semantic change over a 70-year span.
Contribution
The study builds the HEADLINES dataset from historical newspapers, using document layout and neural methods to detect articles that come from the same source. The resulting dataset is much larger than existing semantic similarity datasets and covers a much earlier period.
Findings
The dataset contains nearly 400 million positive semantic pairs.
It spans 70 years from 1920 to 1989, covering historical language use.
The dataset enables new research on semantic change over time.
Abstract
A diversity of tasks use language models trained on semantic similarity data. While there are a variety of datasets that capture semantic similarity, they are either constructed from modern web data or are relatively small datasets created in the past decade by human annotators. This study utilizes a novel source, newly digitized articles from off-copyright, local U.S. newspapers, to assemble a massive-scale semantic similarity dataset spanning 70 years from 1920 to 1989 and containing nearly 400M positive semantic similarity pairs. Historically, around half of articles in U.S. local newspapers came from newswires like the Associated Press. While local papers reproduced articles from the newswire, they wrote their own headlines, which form abstractive summaries of the associated articles. We associate articles and their headlines by exploiting document layouts and language…
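The abstract above is cut off before it states how pairs are formed. One plausible reading, assumed here and not confirmed by the text, is that headlines written by different local papers for the same newswire article are treated as positive semantic similarity pairs. The sketch below illustrates that assumption only; the function name `positive_pairs` and the toy headlines are hypothetical and are not taken from the paper or its data.

```python
from itertools import combinations


def positive_pairs(clusters):
    """Yield every unordered pair of distinct headlines within each cluster.

    ASSUMPTION: each cluster holds headlines that different papers wrote
    for the same underlying newswire article.
    """
    for headlines in clusters:
        # Drop exact duplicates, then sort so the output order is stable.
        unique = sorted(set(headlines))
        yield from combinations(unique, 2)


# Hypothetical toy data: two articles, each with several local headlines.
clusters = [
    ["Senate Passes Farm Bill", "Farm Measure Clears Senate", "Senators Approve Farm Aid"],
    ["Storm Hits Coast", "Coastal Towns Battered By Storm"],
]

pairs = list(positive_pairs(clusters))
for left, right in pairs:
    print(f"{left!r} <-> {right!r}")
```

A cluster of n distinct headlines yields n(n-1)/2 pairs under this scheme, so the pair count grows quadratically with the number of papers that reprinted an article.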
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · Natural Language Processing Techniques
