From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction

Mika Hämäläinen; Simon Hengchen

arXiv:1910.05535·cs.CL·July 23, 2020

TL;DR

This paper introduces an unsupervised, fully automatic method for OCR post-correction: a character-based sequence-to-sequence neural machine translation (NMT) model trained on automatically extracted parallel data, with no manually annotated training data required.

Contribution

It presents a novel unsupervised approach to generate training data for NMT models, eliminating the need for manual annotation or rule-based systems.

Findings

01 Effective OCR error correction demonstrated on historical corpora.

02 Reduces reliance on manual labeling and rule-based methods.

03 Improves accuracy of OCR post-correction with neural models.

Abstract

Many historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is time-consuming, and most automatic approaches have relied on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.
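The abstract does not spell out how the parallel data is extracted, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each correctly spelled word comes with a list of similar words (for instance nearest neighbours from word embeddings, as the title suggests), that a lexicon separates correct words from OCR errors, and that an edit-distance threshold filters the candidates. The names `extract_pairs`, `to_char_sequence`, `neighbours`, `lexicon` and `max_dist` are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Set, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def extract_pairs(neighbours: Dict[str, List[str]],
                  lexicon: Set[str],
                  max_dist: int = 2) -> List[Tuple[str, str]]:
    """Pair out-of-lexicon neighbours (assumed OCR errors) with the lexicon word.

    Assumption: `neighbours` maps a word to words that occur in similar
    contexts, e.g. the nearest neighbours from a word embedding model.
    """
    pairs = []
    for word, similar in neighbours.items():
        if word not in lexicon:
            continue  # only correctly spelled words serve as targets
        for cand in similar:
            # Keep candidates that are unknown to the lexicon and close in spelling.
            if cand not in lexicon and 0 < edit_distance(cand, word) <= max_dist:
                pairs.append((cand, word))
    return pairs


def to_char_sequence(word: str) -> str:
    """Split a word into space-separated characters for a character-level model."""
    return " ".join(word)


# Toy data, invented for illustration only.
lexicon = {"past", "future", "friendship", "amity"}
neighbours = {
    "past": ["paft", "future", "pafl"],
    "friendship": ["friendlhip", "amity"],
}

pairs = extract_pairs(neighbours, lexicon)
training_lines = [(to_char_sequence(src), to_char_sequence(tgt)) for src, tgt in pairs]
```

Each resulting line pairs a noisy source sequence with a clean target sequence, which is the shape of data a character-based sequence-to-sequence model would train on. The threshold of 2 is an arbitrary choice for the sketch.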

Code & Models

Repositories

mikahama/natas
Official