Named Entity Recognition and Classification on Historical Documents: A Survey
Maud Ehrmann, Ahmed Hamdi, Elvys Linhares Pontes, Matteo Romanello, Antoine Doucet

TL;DR
This survey reviews the challenges, resources, and approaches for named entity recognition and classification in historical documents, highlighting the need for specialized methods to handle noisy, diverse, and aged texts for humanities research.
Contribution
It provides a comprehensive overview of existing NER techniques, resources, and future priorities specifically tailored for historical document analysis.
Findings
Historical documents pose unique challenges for NER due to noise and diversity.
Current resources and approaches are limited and need adaptation for historical texts.
Future research should focus on developing robust, specialized NER methods for historical data.
Abstract
After decades of massive digitisation, an unprecedented amount of historical documents is available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve and explore information from this 'big data of the past'. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged with diverse, historical and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Web Data Mining and Analysis
