Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil
Nevidu Jayatilleke, Nisansa de Silva

TL;DR
This study evaluates the zero-shot OCR performance of six engines on Sinhala and Tamil, revealing strengths of specific systems for each language and introducing a new Tamil OCR benchmark dataset.
Contribution
It provides a comparative analysis of OCR engines on two low-resourced languages, Sinhala and Tamil, and introduces a novel synthetic Tamil OCR benchmarking dataset.
Findings
Surya achieved the best Sinhala OCR performance with a WER of 2.61%.
Document AI outperformed others in Tamil OCR with a CER of 0.78%.
A new synthetic Tamil OCR benchmarking dataset is introduced.
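The WER and CER figures above are error rates based on edit distance. As a rough illustration of how such metrics are commonly computed, the sketch below uses the standard Levenshtein-based definitions: edits divided by the number of reference words (WER) or reference characters (CER). It is not the authors' evaluation code. The function names are hypothetical, and characters are assumed to be Unicode code points; for Sinhala and Tamil, counting grapheme clusters instead would give different values, and this summary does not say which convention the paper uses.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length (code points)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

With these definitions, a single substituted character in a four-character reference gives a CER of 0.25, and one wrong word out of three gives a WER of about 0.33. Both rates can exceed 1.0 when the OCR output contains many insertions.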
Abstract
Optical Character Recognition (OCR) of printed text in Latin and its derivative scripts can now be considered a solved problem, owing to the volume of research on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the…
