A Massive Scale Semantic Similarity Dataset of Historical English

Emily Silcock; Melissa Dell

arXiv:2306.17810 · cs.CL · August 25, 2023 · 1 citation

TL;DR

This paper introduces a large-scale semantic similarity dataset derived from digitized historical U.S. newspapers, which can be used to train language models and to study semantic change over a 70-year span.

Contribution

The study creates the HEADLINES dataset from historical newspapers, using document layout and neural methods to detect which articles come from the same source. The resulting dataset is much larger, and covers a much older period, than existing semantic similarity datasets.

Findings

01

The dataset contains nearly 400 million positive semantic pairs.

02

It spans 70 years from 1920 to 1989, covering historical language use.

03

The dataset enables new research on semantic change over time.

Abstract

A diversity of tasks use language models trained on semantic similarity data. While there are a variety of datasets that capture semantic similarity, they are either constructed from modern web data or are relatively small datasets created in the past decade by human annotators. This study utilizes a novel source, newly digitized articles from off-copyright, local U.S. newspapers, to assemble a massive-scale semantic similarity dataset spanning 70 years from 1920 to 1989 and containing nearly 400M positive semantic similarity pairs. Historically, around half of articles in U.S. local newspapers came from newswires like the Associated Press. While local papers reproduced articles from the newswire, they wrote their own headlines, which form abstractive summaries of the associated articles. We associate articles and their headlines by exploiting document layouts and language…
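
As a rough illustration of how positive pairs can arise from this setup, the sketch below assumes that headlines have already been grouped by the newswire article they accompany, and treats every pair of distinct headlines within a group as a positive semantic similarity pair. The cluster keys, the toy headlines, and the function name `positive_pairs` are invented for this example and are not taken from the paper or the dataset.

```python
from itertools import combinations


def positive_pairs(clusters):
    """Yield each unordered pair of distinct headlines within a cluster.

    `clusters` maps a (hypothetical) article identifier to the headlines
    that different local papers wrote for that same newswire article.
    """
    for headlines in clusters.values():
        # Drop exact duplicates and sort so the output order is deterministic.
        unique = sorted(set(headlines))
        yield from combinations(unique, 2)


# Toy clusters, invented for illustration only.
clusters = {
    "wire_article_1": [
        "Senate Passes Farm Bill",
        "Farm Measure Clears Senate",
        "Senators Approve Farm Aid",
    ],
    "wire_article_2": [
        "Storm Hits Coast",
        "Coastal Towns Battered by Storm",
    ],
}

pairs = list(positive_pairs(clusters))
for left, right in pairs:
    print(f"{left!r} <-> {right!r}")
```

A cluster of n distinct headlines contributes n(n-1)/2 pairs, so the pair count grows quadratically with cluster size. This is one plausible reason a corpus of frequently reprinted wire articles can yield pairs at the scale the abstract reports.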

Code & Models

Datasets

dell-research-harvard/headlines-semantic-similarity
dataset · 477 dl

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling · Natural Language Processing Techniques