TL;DR
This paper introduces data-centric methods for Named Entity Recognition (NER) on historical Dutch and French texts. It addresses domain shift by integrating unlabeled in-domain data via contextualized string embeddings, and OCR errors by injecting synthetic OCR errors into the source domain, achieving state-of-the-art results.
Contribution
It presents a general approach to imitating OCR errors in arbitrary input data and integrates unsupervised in-domain data for domain adaptation in NER on historical texts.
Findings
Outperforms strong baselines in cross-domain and in-domain NER tasks.
Establishes state-of-the-art results on the French and Dutch Europeana NER corpora.
Publishes preprocessed versions of the French and Dutch Europeana NER corpora.
Abstract
We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings, and we address OCR errors by injecting synthetic OCR errors into the source domain, a data-centric form of domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Our cross-domain as well as our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
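The idea of injecting synthetic OCR errors into clean source-domain text can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `inject_ocr_noise`, the `CONFUSIONS` table, and the `error_rate` parameter are all hypothetical, and the abstract does not say how the paper's error model is built. The sketch only shows the general pattern of corrupting characters while keeping the token count fixed, so that existing NER labels stay aligned with the noisy tokens.

```python
import random

# Hypothetical confusion table of visually similar characters.
# Illustrative only; it is not derived from any real OCR output.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "i"],
    "s": ["f"],
    "n": ["u"],
    "m": ["rn"],
    "o": ["0"],
}


def inject_ocr_noise(tokens, error_rate=0.1, seed=0):
    """Return a noisy copy of `tokens` with the same length.

    Each character is corrupted with probability `error_rate` by a
    substitution, a deletion, or an insertion. Token boundaries are
    preserved so token-level NER labels can be reused unchanged.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    noisy = []
    for token in tokens:
        chars = []
        for ch in token:
            if rng.random() < error_rate:
                op = rng.choice(("substitute", "delete", "insert"))
                if op == "substitute":
                    # Fall back to the original character if no confusion is listed.
                    chars.append(rng.choice(CONFUSIONS.get(ch.lower(), [ch])))
                elif op == "delete":
                    continue
                else:
                    # Keep the character and add a spurious one after it.
                    chars.append(ch)
                    chars.append(rng.choice("il1.,"))
            else:
                chars.append(ch)
        # Never emit an empty token, so labels stay aligned.
        noisy.append("".join(chars) or token)
    return noisy


if __name__ == "__main__":
    sentence = ["De", "burgemeester", "van", "Amsterdam", "sprak", "gisteren"]
    labels = ["O", "O", "O", "B-LOC", "O", "O"]
    for tok, lab in zip(inject_ocr_noise(sentence, error_rate=0.2, seed=1), labels):
        print(tok, lab)
```

Because the function returns exactly one output token per input token, the corrupted sentence can be paired with the original label sequence and added to the training data. A more realistic setup would presumably estimate the confusion probabilities from aligned OCR and ground-truth text rather than use a hand-written table.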
