A Tool for Facilitating OCR Postediting in Historical Documents
Alberto Poncelas, Mohammad Aboomar, Jan Buts, James Hadley, Andy Way

TL;DR
This paper introduces a tool for post-editing Tesseract OCR output of historical documents. It flags word forms that are missing from a specified vocabulary and replaces them with alternatives chosen by language model score, while keeping the corrections open to human review.
Contribution
The paper presents a novel OCR post-editing tool that suggests corrections based on language model scores, specifically designed for historical documents with unique challenges.
Findings
Corrects a number of common OCR errors in a chapter of a historical book (Cary, 1719)
Enhances accuracy of OCR outputs with language model suggestions
Maintains transparency and allows human intervention
Abstract
Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary, 1719). As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is…
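The procedure described in the abstract (flag out-of-vocabulary word forms, propose alternatives, keep the one the LM scores highest) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the character confusion table, the add-one-smoothed bigram model, and every function name here are assumptions made for the example. The tool's actual candidate generation and LM are not specified in the text above.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical OCR character confusions (assumption, not taken from the paper).
# Example: the long s of early print is often read as "f".
CONFUSIONS = {"f": ["s"], "c": ["e"], "1": ["l"]}


def candidates(word, vocab):
    """Return in-vocabulary variants of `word` built from the confusion table."""
    options = [[ch] + CONFUSIONS.get(ch, []) for ch in word]
    variants = {"".join(chars) for chars in product(*options)}
    return sorted(v for v in variants if v in vocab)


class BigramLM:
    """Toy bigram LM with add-one smoothing (stand-in for a real LM)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split()
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def logprob(self, prev, word):
        numerator = self.bigrams[(prev, word)] + 1
        denominator = self.unigrams[prev] + self.vocab_size
        return math.log(numerator / denominator)


def postedit(tokens, vocab, lm):
    """Replace out-of-vocabulary tokens with the best-scoring candidate.

    Returns the edited tokens and a log of (position, original, replacement)
    so that a human reviewer can inspect every change.
    """
    edited, changes = [], []
    prev = "<s>"
    for i, tok in enumerate(tokens):
        best = tok
        if tok not in vocab:
            cands = candidates(tok, vocab)
            if cands:
                best = max(cands, key=lambda c: lm.logprob(prev, c))
                changes.append((i, tok, best))
        edited.append(best)
        prev = best
    return edited, changes


# Toy data for illustration only.
corpus = [
    "the poor of this kingdom",
    "the trade of this kingdom",
    "employing the poor",
]
vocab = {w for s in corpus for w in s.split()}
lm = BigramLM(corpus)

edited, changes = postedit("the poor of thif kingdom".split(), vocab, lm)
print(" ".join(edited))  # the poor of this kingdom
print(changes)           # [(3, 'thif', 'this')]
```

Returning the change log alongside the edited text reflects the transparency goal stated above: each substitution stays visible and can be reverted by a reviewer. Note that `candidates` enumerates every combination of substitutions, which grows exponentially with the number of confusable characters in a word; a practical tool would need to bound that search.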
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Digital Humanities and Scholarship
