Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English
Charangan Vasantharajan, Laksika Tharmalingam, Uthayasanker Thayasivam

TL;DR
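This paper adapts the Tesseract OCR engine to Tamil and Sinhala by training on legacy fonts, cutting the Tamil character error rate from 6.03% to 2.61% and the word error rate from 39.68% to 20.61%. It also builds a parallel Tamil-Sinhala-English corpus of over 180,000 sentences per language to support further research on these low-resource languages.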
Contribution
The authors improved Tesseract's recognition of printed Tamil and Sinhala text through LSTM-based training on more than 20 legacy fonts, and built a parallel corpus covering Tamil, Sinhala, and English.
Findings
Character error rate reduced from 6.03% to 2.61% for Tamil.
Word error rate decreased from 39.68% to 20.61% for Tamil.
Created a large parallel corpus with over 180,000 sentences for each language.
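For readers unfamiliar with the two metrics above, the following is a minimal sketch of how character and word error rates are commonly computed from Levenshtein edit distance. It is not the authors' evaluation code: the function names are hypothetical, and it assumes characters are counted as Unicode code points, whereas an evaluation on Tamil or Sinhala text may count grapheme clusters instead. The last helper shows the arithmetic behind the relative improvement implied by the reported figures.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length.

    Assumption: characters are Unicode code points, not grapheme clusters.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(before, after):
    """Fraction by which an error rate dropped, relative to its starting value."""
    return (before - after) / before


if __name__ == "__main__":
    # Relative reductions implied by the Tamil figures reported above.
    print(f"CER reduction: {relative_reduction(6.03, 2.61):.1%}")
    print(f"WER reduction: {relative_reduction(39.68, 20.61):.1%}")
```

A single wrong character makes the whole word count as an error, which is why the word error rate is so much higher than the character error rate.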
Abstract
Most low-resource languages do not have the necessary resources to create even a substantial monolingual corpus. These languages may often be found in government proceedings but mainly in Portable Document Format (PDF) that contains legacy fonts. Extracting text from these documents to create a monolingual corpus is challenging due to legacy font usage and printer-friendly encoding, which are not optimized for text extraction. Therefore, we propose a simple, automatic, and novel idea that can scale for Tamil, Sinhala, English languages, and many documents along with parallel corpora. Since Tamil and Sinhala are Low-Resource Languages, we improved the performance of Tesseract by employing LSTM-based training on more than 20 legacy fonts to recognize printed characters in these languages. Especially, our model detects code-mixed text, numbers, and special characters from the printed…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
