Neural OCR Post-Hoc Correction of Historical Corpora

Lijun Lyu; Maria Koutraki; Martin Krickl; Besnik Fetahu

arXiv:2102.00583 · cs.CL · February 2, 2021

TL;DR

This paper introduces a neural method for OCR post-hoc correction that combines a recurrent network (RNN) and a deep convolutional network (ConvNet) with a novel attention mechanism and a new loss function, reducing word error rate by more than 89% on historical German texts.

Contribution

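The paper presents a character-level neural architecture (an RNN combined with a deep ConvNet, decoded through a novel attention mechanism) and a loss function that accounts for input and output similarity and rewards the model's correcting behavior. Both are designed specifically for correcting OCR transcription errors in historical corpora.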

Findings

1. Reduces word error rate by over 89%.
2. Robustly captures diverse OCR errors.
3. Effective on historical German texts.
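
To make the word error rate finding concrete: WER is commonly computed as the word-level edit distance between a reference transcription and the system output, divided by the number of reference words, and a relative reduction compares WER before and after correction. The sketch below is a minimal illustration under that common definition, not the paper's evaluation code. The function names and the example sentences are hypothetical and do not come from the paper's corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def relative_reduction(wer_before, wer_after):
    """Fraction of the original error rate removed by correction."""
    return (wer_before - wer_after) / wer_before


# Hypothetical example: a ground-truth line, raw OCR output, corrected output.
reference = "die Stadt Wien ist sehr schön"
ocr_output = "dic Stadt Wicn ift sehr schön"
corrected = "die Stadt Wien ist sehr schon"

wer_ocr = word_error_rate(reference, ocr_output)
wer_corrected = word_error_rate(reference, corrected)
print(f"WER before correction: {wer_ocr:.3f}")
print(f"WER after correction:  {wer_corrected:.3f}")
print(f"Relative reduction:    {relative_reduction(wer_ocr, wer_corrected):.1%}")
```

Read this way, a reduction of "over 89%" is relative to the uncorrected OCR error rate rather than an absolute drop in percentage points.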

Abstract

Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model's correcting behavior. Evaluation on a historical…

Code & Models

Repositories

GarfieldLyu/OCR_POST_DE
PyTorch · Official