Contrastive Entity Coreference and Disambiguation for Historical Texts
Abhishek Arora, Emily Silcock, Leander Heldring, Melissa Dell

TL;DR
This paper introduces a large-scale dataset and contrastive bi-encoder models for entity coreference and disambiguation in historical texts. The models outperform existing methods, especially on out-of-knowledgebase individuals.
Contribution
It presents a novel large-scale training dataset, high-quality evaluation data, and trained models specifically designed for historical entity disambiguation and coreference resolution.
Findings
Models outperform existing disambiguation methods on historical benchmarks.
The approach effectively identifies out-of-knowledgebase individuals.
Models show competitive performance on modern entity disambiguation datasets.
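One way a bi-encoder can flag out-of-knowledgebase individuals is to compare a mention embedding with every knowledgebase entry and decline to link when the best match is too weak. The sketch below illustrates that idea only. The summary does not state the authors' actual decision rule, so the function names, the toy vectors, the identifiers `person_a` and `person_b`, and the 0.8 threshold are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(mention_vec, kb_vecs, threshold=0.8):
    """Return the most similar knowledgebase id, or None if nothing clears the threshold.

    The threshold value is a hypothetical choice for illustration.
    """
    best_id, best_sim = None, -1.0
    for entity_id, vec in kb_vecs.items():
        sim = cosine(mention_vec, vec)
        if sim > best_sim:
            best_id, best_sim = entity_id, sim
    return best_id if best_sim >= threshold else None


# Toy 3-d vectors standing in for encoder outputs (hypothetical values).
kb = {"person_a": [1.0, 0.0, 0.0], "person_b": [0.0, 1.0, 0.0]}

in_kb = disambiguate([0.9, 0.1, 0.0], kb)      # close to person_a
out_of_kb = disambiguate([0.0, 0.0, 1.0], kb)  # unlike every entry
```

Here `in_kb` resolves to `person_a`, while `out_of_kb` is `None` because no entry is similar enough to the mention.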
Abstract
Massive-scale historical document collections are crucial for social science research. Despite increasing digitization, these documents typically lack unique cross-document identifiers for the individuals they mention, as well as identifiers linking those individuals to external knowledgebases such as Wikipedia/Wikidata. Existing entity disambiguation methods often fall short in accuracy on historical documents, which are replete with individuals not remembered in contemporary knowledgebases. This study makes three key contributions to improve cross-document coreference resolution and disambiguation in historical texts: a massive-scale training dataset rich in hard negatives, which sources over 190 million entity pairs from Wikipedia contexts and disambiguation pages; high-quality evaluation data from hand-labeled historical newswire articles; and trained models evaluated on this…
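To illustrate why hard negatives matter for contrastive training, here is a minimal sketch of one common contrastive objective, InfoNCE, computed for a single anchor. The summary does not say which loss the authors used, so the choice of InfoNCE, the function name `info_nce`, the temperature of 0.05, and the toy unit vectors are all assumptions.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE loss for one anchor, assuming all vectors are unit-normalized.

    Illustrative objective only; not confirmed as the paper's loss.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # The positive logit comes first, followed by one logit per negative.
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


anchor = [1.0, 0.0]
positive = [1.0, 0.0]        # same entity as the anchor
easy_negative = [0.0, 1.0]   # unrelated entity, orthogonal to the anchor
hard_negative = [0.8, 0.6]   # different entity with a similar context

easy_loss = info_nce(anchor, positive, [easy_negative])
hard_loss = info_nce(anchor, positive, [hard_negative])
```

In this toy setup `hard_loss` is larger than `easy_loss`: a negative that looks similar to the anchor contributes more to the loss, so it yields a stronger training signal than one the encoder already separates.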
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
