OCR4all -- An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings

Christian Reul; Dennis Christ; Alexander Hartelt; Nico Balbach; Maximilian Wehner; Uwe Springmann; Christoph Wick; Christine Grundig; Andreas Büttner; Frank Puppe

arXiv:1909.04032·cs.CV·June 1, 2021

TL;DR

OCR4all is an open-source, user-friendly OCR tool that combines advanced components and flexible workflows to accurately digitize historical printings, outperforming commercial tools in certain scenarios.

Contribution

It introduces a comprehensive, configurable OCR workflow with a GUI for error correction and model training, tailored for non-technical users working with historical texts.

Findings

01 Achieved character error rates below 0.5% with minimal effort.

02 Outperformed ABBYY FineReader on 19th-century novels with suitable pretrained models.

03 Enabled easy integration of new OCR components via standardized interfaces.
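
For readers unfamiliar with the metric in the first finding, the character error rate (CER) is commonly defined as the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The sketch below assumes that standard definition; the function names `levenshtein` and `character_error_rate` are illustrative and are not taken from OCR4all or from the paper's evaluation tooling.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string as the inner loop to save memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """CER under the assumed standard definition:
    edit distance divided by ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


if __name__ == "__main__":
    # One substituted character in a ten-character line gives a CER of 0.1 (10%).
    print(character_error_rate("Historical", "Hist0rical"))
```

Under this definition, a CER below 0.5% corresponds to fewer than one erroneous character per 200 characters of ground truth.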

Abstract

Optical Character Recognition (OCR) on historical printings is a challenging task, mainly due to the complexity of the layout and the highly variant typography. Nevertheless, great progress has been made in historical OCR over the last few years, resulting in several powerful open-source tools for preprocessing, layout recognition and segmentation, character recognition, and post-processing. A common drawback of these tools is that non-technical users such as humanist scholars find them hard to apply, in particular when several tools must be combined in a workflow. In this paper we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow. A comfortable GUI allows error corrections not only in the final output, but already in early stages, to minimize error propagation. Further…
