Improving OCR Quality in 19th Century Historical Documents Using a Combined Machine Learning Based Approach
David Fleischhacker, Wolfgang Goederle, Roman Kern

TL;DR
This paper presents a machine learning approach combining structure recognition and OCR optimization to significantly improve text recognition accuracy in 19th-century historical documents, enabling better data extraction for historical research.
Contribution
The study introduces a novel combined ML-based method using Faster R-CNN and tailored OCR for enhanced recognition of complex historical document structures and text.
Findings
OCR error rates decreased by over 70%
Combined approach improved CER and WER significantly
Synthesized training data enabled efficient model training
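The findings above are stated in terms of character error rate (CER) and word error rate (WER). This summary does not say how the authors computed them, so the following is only a minimal sketch of the standard definitions: edit distance between reference and OCR output, divided by reference length. The function names (`edit_distance`, `cer`, `wer`, `relative_reduction`) and the sample strings are hypothetical and not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(baseline, improved):
    """Relative decrease of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Illustrative strings only; not drawn from the Schematismus.
    truth = "k. k. Hofrath"
    ocr_out = "k. k. Hofrat"
    print(cer(truth, ocr_out), wer(truth, ocr_out))
```

Under these definitions, a statement such as "error rates decreased by over 70%" would correspond to `relative_reduction(baseline_cer, improved_cer) > 0.7`. Whether the paper reports relative or absolute reductions is not specified in this summary.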
Abstract
This paper addresses a major challenge to historical research on the 19th century. Large quantities of sources have become digitally available for the first time, while extraction techniques are lagging behind. Therefore, we researched machine learning (ML) models to recognise and extract complex data structures in a high-value historical primary source, the Schematismus. It records every single person in the Habsburg civil service above a certain hierarchical level between 1702 and 1918 and documents the genesis of the central administration over two centuries. Its complex and intricate structure as well as its enormous size have so far made any more comprehensive analysis of the administrative and social structure of the later Habsburg Empire on the basis of this source impossible. We pursued two central objectives: Primarily, the improvement of the OCR quality, for which we…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Digital and Traditional Archives Management · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · Convolution · Softmax · RoIPool · Region Proposal Network · Faster R-CNN · Balanced Selection
