Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English

Charangan Vasantharajan; Laksika Tharmalingam; Uthayasanker Thayasivam

arXiv:2109.05952·cs.CL·December 19, 2022

TL;DR

This paper adapts the Tesseract OCR engine to Tamil and Sinhala legacy fonts through LSTM-based training, cutting the Tamil character error rate from 6.03% to 2.61%, and builds a parallel Tamil-Sinhala-English corpus to support further research on these low-resource languages.

Contribution

The authors improved Tesseract OCR for Tamil and Sinhala with LSTM-based training on more than 20 legacy fonts and built a parallel corpus of over 180,000 sentences per language.

Findings

01. Character error rate reduced from 6.03% to 2.61% for Tamil.

02. Word error rate decreased from 39.68% to 20.61% for Tamil.

03. Created a large parallel corpus with over 180,000 sentences for each language.
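
For readers unfamiliar with the two metrics in the findings, the following is a minimal sketch of how character error rate (CER) and word error rate (WER) are commonly computed: the Levenshtein edit distance between the OCR output and the reference text, divided by the reference length. The function names are illustrative, and the sketch compares Unicode code points; the paper's exact evaluation protocol (for example, whether Tamil grapheme clusters are treated as single units) is not stated on this page, so that choice is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(before, after):
    """Fraction by which an error rate dropped, relative to its starting value."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy English strings, used only to exercise the functions.
    print(cer("kitten", "sitting"))
    print(wer("the cat sat", "the cat sat down"))
    # Relative reductions implied by the Tamil figures reported above.
    print(round(relative_reduction(6.03, 2.61) * 100, 1))    # 56.7
    print(round(relative_reduction(39.68, 20.61) * 100, 1))  # 48.1
```

Read this way, the reported Tamil figures correspond to a relative error reduction of roughly 57% at the character level and 48% at the word level.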

Abstract

Most low-resource languages lack the resources needed to create even a substantial monolingual corpus. Text in these languages often appears in government proceedings, but mainly as Portable Document Format (PDF) files that use legacy fonts. Extracting text from these documents to create a monolingual corpus is challenging because of the legacy fonts and printer-friendly encoding, neither of which is optimized for text extraction. We therefore propose a simple, automatic, and novel approach that scales across Tamil, Sinhala, and English, across many documents, and to parallel corpora. Since Tamil and Sinhala are low-resource languages, we improved the performance of Tesseract by employing LSTM-based training on more than 20 legacy fonts to recognize printed characters in these languages. In particular, our model detects code-mixed text, numbers, and special characters from the printed…

Code & Models

Repositories

aaivu/tamizhi-net-ocr (official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Mathematics, Computing, and Information Processing