Data Centric Domain Adaptation for Historical Text with OCR Errors

Luisa März; Stefan Schweter; Nina Poerner; Benjamin Roth; Hinrich Schütze

arXiv:2107.00927·cs.CL·July 5, 2021

TL;DR

This paper introduces data-centric methods for Named Entity Recognition on historical Dutch and French texts. It handles domain shift with contextualized string embeddings that incorporate unlabeled in-domain data, and it handles OCR errors by injecting synthetic OCR errors into the source domain, reaching state-of-the-art results.

Contribution

It presents a novel approach to simulate OCR errors and integrate in-domain data for domain adaptation in NER tasks on historical texts.

Findings

01

Outperforms strong baselines in cross-domain and in-domain NER tasks.

02

Establishes state-of-the-art results on the French and Dutch Europeana NER corpora.

03

Provides preprocessed datasets for Dutch and French historical texts.

Abstract

We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings, and we address OCR errors by injecting synthetic OCR errors into the source domain, which amounts to data-centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Our cross-domain as well as our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
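
The idea of injecting synthetic OCR errors into clean source-domain text can be illustrated with a minimal sketch. This is not the authors' implementation: the confusion table, the function name `inject_ocr_noise`, and the `error_rate` parameter are all hypothetical, chosen only to show how character-level noise can be added while keeping the token count unchanged so that token-level NER labels stay aligned.

```python
import random

# Hypothetical confusion table: characters an OCR engine might mix up.
# These pairs are illustrative assumptions, not the paper's error model.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "i"],
    "s": ["f"],
    "n": ["u"],
    "m": ["rn"],
    "o": ["0"],
}


def inject_ocr_noise(tokens, error_rate=0.1, seed=None):
    """Return a noisy copy of `tokens` with simulated OCR confusions.

    Each confusable character is replaced with probability `error_rate`.
    Noise is applied inside tokens only, so the number of tokens (and
    therefore the alignment with per-token NER labels) is preserved.
    """
    rng = random.Random(seed)
    noisy = []
    for token in tokens:
        chars = []
        for ch in token:
            if ch in CONFUSIONS and rng.random() < error_rate:
                chars.append(rng.choice(CONFUSIONS[ch]))
            else:
                chars.append(ch)
        noisy.append("".join(chars))
    return noisy


if __name__ == "__main__":
    sentence = ["Jan", "woont", "in", "Amsterdam"]
    labels = ["B-PER", "O", "O", "B-LOC"]
    noisy_sentence = inject_ocr_noise(sentence, error_rate=0.3, seed=0)
    # Labels are reused unchanged because tokenization is untouched.
    for token, label in zip(noisy_sentence, labels):
        print(token, label)
```

In a setup like this, the noisy sentences would be paired with the original labels and used as (additional) training data for the source domain. A realistic error model would more plausibly be estimated from aligned OCR output and ground-truth text rather than from a hand-written table.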

Code & Models

Repositories

stefan-it/historic-domain-adaptation-icdar (Official)