Toward a Period-Specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts
Omri Suissa, Maayan Zhitomirsky-Geffet, Avshalom Elmalech

TL;DR
This paper introduces a novel multi-phase approach to generate training data and optimize neural network hyperparameters specifically for correcting OCR errors in historical Hebrew texts, addressing data scarcity and language-specific challenges.
Contribution
The paper presents a new method for creating artificial training datasets and hyperparameter tuning tailored to Hebrew OCR error correction, improving neural network effectiveness.
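As a rough illustration of what artificial training data for OCR post-correction can look like, the sketch below corrupts clean Hebrew text with character substitutions to produce (noisy, clean) pairs. It is not the authors' multi-phase method: the confusion table `CONFUSIONS`, the functions `corrupt` and `make_pairs`, and the `error_rate` parameter are all hypothetical names and values chosen for this example.

```python
import random

# Hypothetical table of visually similar Hebrew letters (an assumption for
# illustration; a real table would be estimated from observed OCR errors).
CONFUSIONS = {
    "ו": ["ז", "ן"],
    "ד": ["ר"],
    "ר": ["ד"],
    "ה": ["ח"],
    "ח": ["ה"],
    "ב": ["כ"],
    "כ": ["ב"],
}


def corrupt(text, error_rate=0.1, rng=None):
    """Replace confusable characters with a look-alike at the given rate."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < error_rate:
            out.append(rng.choice(CONFUSIONS[ch]))
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(sentences, error_rate=0.1, seed=0):
    """Build (noisy_input, clean_target) pairs for a correction model."""
    rng = random.Random(seed)
    return [(corrupt(s, error_rate, rng), s) for s in sentences]
```

Calling `make_pairs(["בראשית ברא"], error_rate=0.2)` would return one pair whose second element is the untouched sentence and whose first element may contain substituted letters. Only substitutions are modeled here; insertions, deletions, and word-boundary errors would need additional handling.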
Findings
Enhanced OCR error correction accuracy for Hebrew texts.
Effective neural network models with optimized hyperparameters.
Robust training datasets generated through multi-phase artificial data creation.
Abstract
Over the past few decades, large archives of paper-based historical documents, such as books and newspapers, have been digitized using Optical Character Recognition (OCR) technology. Unfortunately, this widely used technology is error-prone, especially when the OCRed document was written hundreds of years ago. Neural networks have shown great success in solving various text processing tasks, including OCR post-correction. The main disadvantage of using neural networks for historical corpora is the lack of the sufficiently large training datasets they require to learn from, especially for morphologically rich languages like Hebrew. Moreover, because of Hebrew's unique features, it is not clear what the optimal structure and hyperparameter values (predefined parameters) of neural networks for OCR error correction in Hebrew are. Furthermore, languages change across genres and periods. These…
