Contrastive Entity Coreference and Disambiguation for Historical Texts

Abhishek Arora; Emily Silcock; Leander Heldring; Melissa Dell

arXiv:2406.15576·cs.CL·June 25, 2024

TL;DR

This paper introduces a large-scale dataset and contrastive bi-encoder models for entity coreference and disambiguation in historical texts. The models outperform existing methods, especially on individuals who do not appear in a knowledgebase.

Contribution

It presents a novel large-scale training dataset, high-quality evaluation data, and trained models specifically designed for historical entity disambiguation and coreference resolution.

Findings

1. Models outperform existing disambiguation methods on historical benchmarks.

2. The approach effectively identifies out-of-knowledgebase individuals.

3. Models show competitive performance on modern entity disambiguation datasets.

Abstract

Massive-scale historical document collections are crucial for social science research. Despite increasing digitization, these documents typically lack unique cross-document identifiers for the individuals mentioned in them, as well as individual identifiers from external knowledgebases like Wikipedia/Wikidata. Existing entity disambiguation methods often fall short in accuracy for historical documents, which are replete with individuals not remembered in contemporary knowledgebases. This study makes three key contributions to improve cross-document coreference resolution and disambiguation in historical texts: a massive-scale training dataset, rich in hard negatives, that sources over 190 million entity pairs from Wikipedia contexts and disambiguation pages; high-quality evaluation data from hand-labeled historical newswire articles; and trained models evaluated on this…
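To make the terms in the abstract concrete, here is a minimal sketch of the two ideas it names: a contrastive objective that is sensitive to hard negatives, and embedding-based disambiguation that can decline to link a mention when no knowledgebase entry is close enough. It is an illustration under stated assumptions, not the authors' implementation. The toy vectors, the entity ids, the temperature of 0.1, and the similarity threshold of 0.8 are all hypothetical, and the InfoNCE-style loss is a common choice for contrastive training that this page does not confirm the paper uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for a single anchor (assumed objective).

    The loss is small when the anchor is much closer to the positive
    than to every negative, and grows as negatives get harder.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def disambiguate(mention, knowledgebase, threshold=0.8):
    """Return the id of the most similar entity, or None.

    None stands for "out of knowledgebase": no entity embedding
    reaches the (hypothetical) similarity threshold.
    """
    best_id, best_sim = None, -1.0
    for entity_id, vector in knowledgebase.items():
        sim = cosine(mention, vector)
        if sim > best_sim:
            best_id, best_sim = entity_id, sim
    return best_id if best_sim >= threshold else None


# Toy embeddings; in practice these would come from a trained bi-encoder.
kb = {"entity_a": [1.0, 0.0, 0.0], "entity_b": [0.0, 1.0, 0.0]}

in_kb = disambiguate([0.9, 0.1, 0.0], kb)      # close to entity_a
out_of_kb = disambiguate([0.5, 0.5, 0.7], kb)  # close to nothing

# A hard negative (similar to the anchor) yields a larger loss than an easy one.
easy = contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
hard = contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.8, 0.2]])
```

In this sketch the hard negative dominates the loss, which is the usual motivation for including hard negatives in contrastive training data: easy negatives contribute almost no training signal once the encoder separates them.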

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies