Optimizing the Neural Network Training for OCR Error Correction of Historical Hebrew Texts
Omri Suissa, Avshalom Elmalech, Maayan Zhitomirsky-Geffet

TL;DR
This paper introduces a novel training method for lightweight neural networks that improves OCR error correction in historical Hebrew texts by using automatically generated, task-specific training data, outperforming existing methods.
Contribution
It presents an innovative approach for automatically generating training data tailored for OCR post-correction, reducing manual labeling needs and enhancing neural network performance.
Findings
Training with the proposed method is more effective than using randomly generated errors.
Performance depends on the genre and area of training data.
The trained network outperforms state-of-the-art neural networks and spellcheckers.
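The contrast drawn in the findings above, between randomly generated errors and task-specific ones, can be illustrated with a minimal sketch. This is not the authors' generation procedure, which this summary does not describe. The `CONFUSIONS` table, its weights, and the function names are all hypothetical, chosen only to show how a confusion-driven error generator differs from a uniform random one.

```python
import random

# Hypothetical confusion table: for each character, the characters an OCR
# engine might output instead. The pairs and weights are made up for this
# illustration and are not taken from the paper.
CONFUSIONS = {
    "ו": [("ז", 0.6), ("ן", 0.4)],
    "ר": [("ד", 0.7), ("ך", 0.3)],
    "ה": [("ח", 0.5), ("ת", 0.5)],
}

# Hebrew letters, including final forms.
ALPHABET = "אבגדהוזחטיכךלמםנןסעפףצץקרשת"


def inject_random_errors(text, rate, rng):
    """Baseline: replace a letter with a uniformly random different letter."""
    out = []
    for ch in text:
        if ch in ALPHABET and rng.random() < rate:
            out.append(rng.choice([c for c in ALPHABET if c != ch]))
        else:
            out.append(ch)
    return "".join(out)


def inject_confusion_errors(text, rate, rng):
    """Task-specific: replace a letter only with one of its listed confusions."""
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            candidates, weights = zip(*CONFUSIONS[ch])
            out.append(rng.choices(candidates, weights=weights, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(clean_sentences, inject, rate=0.1, seed=0):
    """Build (noisy, clean) training pairs from clean text."""
    rng = random.Random(seed)
    return [(inject(s, rate, rng), s) for s in clean_sentences]
```

Calling `make_pairs(sentences, inject_confusion_errors)` yields pairs whose noise follows the confusion table, while passing `inject_random_errors` yields the uniform baseline. In practice such a table would have to be estimated from aligned OCR output and ground truth, which is outside the scope of this sketch.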
Abstract
Over the past few decades, large archives of paper-based documents such as books and newspapers have been digitized using Optical Character Recognition (OCR). This technology is error-prone, especially for historical documents. To correct OCR errors, post-processing algorithms have been proposed based on natural language analysis and machine learning techniques such as neural networks. A disadvantage of neural networks is the vast amount of manually labeled data required for training, which is often unavailable. This paper proposes an innovative method for training a light-weight neural network for Hebrew OCR post-correction using significantly less manually created data. The main research goal is to develop a method for automatically generating language and task-specific training data to improve the neural network results for OCR post-correction, and to investigate which type of dataset is the…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
