TL;DR
This paper introduces a fully automatic, unsupervised method for OCR post-correction: a character-based sequence-to-sequence neural machine translation (NMT) model is trained on automatically extracted parallel data, so no manually corrected training data is required.
Contribution
It presents a novel unsupervised approach to generating training data for NMT models, removing the need for manual annotation or rule-based systems.
Findings
Demonstrates OCR error correction on historical corpora.
Reduces reliance on manual labeling and rule-based methods.
Applies a character-based neural sequence-to-sequence model to OCR post-correction.
Abstract
A great deal of historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.
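As an illustration of what "parallel data" for a character-based model can look like, the sketch below turns pairs of noisy and corrected words into character-level source and target lines, a format many sequence-to-sequence toolkits accept. It is a minimal sketch, not the authors' pipeline: the example pairs, the function names, and the `_` placeholder for spaces are all assumptions made for illustration, and the abstract does not describe how the pairs themselves are extracted.

```python
from typing import Iterable, List, Tuple


def to_char_sequence(text: str, space_token: str = "_") -> str:
    """Split text into space-separated characters.

    Literal spaces are replaced by a placeholder (hypothetical choice: "_")
    so they are not confused with the separator between characters.
    """
    return " ".join(space_token if ch == " " else ch for ch in text)


def build_parallel_lines(
    pairs: Iterable[Tuple[str, str]]
) -> Tuple[List[str], List[str]]:
    """Convert (noisy, corrected) pairs into aligned source/target lines."""
    source_lines: List[str] = []
    target_lines: List[str] = []
    for noisy, corrected in pairs:
        source_lines.append(to_char_sequence(noisy))
        target_lines.append(to_char_sequence(corrected))
    return source_lines, target_lines


# Hypothetical pairs, invented for this example (not taken from the paper).
pairs = [("tbe", "the"), ("hiftory", "history"), ("vvater", "water")]
src, tgt = build_parallel_lines(pairs)

for s, t in zip(src, tgt):
    print(f"{s}\t{t}")
```

Each source line is the OCR output spelled out character by character, and the line at the same index in the target list is its corrected form; a character-based model trained on such pairs learns the mapping at the character level rather than over whole words.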
