Improving OCR for Historical Texts of Multiple Languages
Hylke Westerdijk, Ben Blankenborg, Khondoker Ittehadul Islam

TL;DR
This paper explores deep learning methods to improve OCR accuracy for historical and modern texts across multiple languages, demonstrating enhanced recognition through data augmentation, segmentation, and sequence modeling techniques.
Contribution
It introduces novel deep learning approaches tailored for OCR of historical multilingual texts, combining data augmentation, semantic segmentation, and sequence modeling.
Findings
Improved OCR accuracy for Hebrew Dead Sea Scrolls.
Effective use of confidence-based pseudolabeling for historical documents.
Successful recognition of modern English handwriting with CRNN and ResNet34.
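As a rough illustration of the confidence-based pseudolabeling mentioned above, the sketch below keeps only model predictions whose average per-character confidence clears a threshold, so they can be added back to the training set. The function name, the tuple layout, the use of the mean as the confidence score, and the 0.9 threshold are all assumptions made for this example; the summary does not state how the authors scored or filtered predictions.

```python
from statistics import mean


def select_pseudolabels(predictions, threshold=0.9):
    """Return predictions whose mean per-character confidence is >= threshold.

    `predictions` is a list of (image_id, text, char_confidences) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    selected = []
    for image_id, text, char_confidences in predictions:
        if not char_confidences:
            continue  # nothing recognized, so there is nothing to score
        confidence = mean(char_confidences)
        if confidence >= threshold:
            selected.append((image_id, text, confidence))
    return selected


# Placeholder predictions on unlabeled line images (made-up values).
predictions = [
    ("line_001", "abc", [0.99, 0.97, 0.98]),
    ("line_002", "abd", [0.61, 0.70, 0.55]),
    ("line_003", "", []),
]

pseudolabeled = select_pseudolabels(predictions, threshold=0.9)
for image_id, text, confidence in pseudolabeled:
    print(f"{image_id}: {text!r} (confidence {confidence:.2f})")
```

In a typical loop, the selected lines would be merged into the labeled set and the model retrained, possibly for several rounds.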
Abstract
This paper presents our methodology and findings from three tasks in Optical Character Recognition (OCR) and Document Layout Analysis using deep learning techniques. First, for the historical Hebrew fragments of the Dead Sea Scrolls, we enhanced our dataset through extensive data augmentation and employed the Kraken and TrOCR models to improve character recognition. Second, for the task on 16th- to 18th-century meeting resolutions, we used a Convolutional Recurrent Neural Network (CRNN) that integrated DeepLabV3+ for semantic segmentation with a Bidirectional LSTM, incorporating confidence-based pseudolabeling to refine our model. Finally, for the modern English handwriting recognition task, we applied a CRNN with a ResNet34 encoder, trained using the Connectionist Temporal Classification (CTC) loss function to effectively capture sequential dependencies. This report offers…
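To make the CTC step more concrete, here is a minimal sketch of greedy CTC decoding, the usual way to turn the per-frame outputs of a CTC-trained CRNN into text: pick the best class per frame, collapse consecutive repeats, then drop the blank symbol. The function name, the toy alphabet, and the score values are assumptions for illustration; the abstract only states that CTC loss was used for training and does not describe the decoding procedure.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding over a sequence of per-frame class scores.

    `frame_scores` is a list of score lists, one per time step.
    `alphabet[i]` is the symbol for class i; class `blank` is the CTC blank.
    """
    # Best class index for each time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]

    decoded = []
    previous = None
    for index in best_path:
        # Emit a symbol only when it is not blank and differs from the
        # previous frame (consecutive repeats collapse into one symbol).
        if index != blank and index != previous:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Toy alphabet: index 0 is reserved for the blank symbol.
alphabet = ["<blank>", "c", "a", "t"]

# Made-up scores for six frames: c, c, blank, a, blank, t.
frame_scores = [
    [0.10, 0.80, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.90, 0.04, 0.03, 0.03],
    [0.10, 0.10, 0.70, 0.10],
    [0.60, 0.10, 0.20, 0.10],
    [0.10, 0.10, 0.10, 0.70],
]

decoded_text = ctc_greedy_decode(frame_scores, alphabet)
print(decoded_text)
```

The blank symbol is what lets the model output doubled letters: two identical symbols separated by a blank frame decode to two characters, while two adjacent identical frames decode to one.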
