Towards Deployable OCR models for Indic languages
Minesh Mathew, Ajoy Mondal, CV Jawahar

TL;DR
This paper conducts a comprehensive empirical study of CTC-based neural network models for OCR in 13 Indian languages, introducing a new dataset and outperforming existing OCR tools in most languages.
Contribution
It provides a detailed analysis of neural network models for Indic OCR, compares recognition units, and introduces the Mozhi dataset for benchmarking.
Findings
Models outperform public OCR tools in 8 of 13 languages.
Synthetic data improves recognition accuracy.
The choice of line versus word as the recognition unit affects performance.
Abstract
Recognition of text on word or line images, without the need for sub-word segmentation, has become the mainstream of research and development of text recognition for Indian languages. Modelling unsegmented sequences using Connectionist Temporal Classification (CTC) is the most commonly used approach for segmentation-free OCR. In this work we present a comprehensive empirical study of various neural network models that use CTC for transcribing step-wise predictions in the neural network output to a Unicode sequence. The study is conducted for 13 Indian languages, using an internal dataset that has around 1000 pages per language. We study the choice of line vs word as the recognition unit, and the use of synthetic data to train the models. We compare our models with popular publicly available OCR tools for end-to-end document image recognition. Our end-to-end pipeline that employs our…
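To illustrate what "transcribing step-wise predictions to a Unicode sequence" means, here is a minimal sketch of greedy (best-path) CTC decoding. It is not taken from the paper: the function name, the toy alphabet, the score values, and the choice of index 0 as the blank label are all assumptions made for illustration, and the paper's models may use a different decoding strategy.

```python
from itertools import groupby

BLANK = 0  # assumption: class index 0 is the CTC blank label


def ctc_greedy_decode(step_scores, alphabet, blank=BLANK):
    """Turn per-time-step class scores into a Unicode string.

    step_scores: list of per-step score lists, one score per class.
    alphabet: list mapping class index -> Unicode string (blank maps to "").
    """
    # 1. Pick the highest-scoring class at every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in step_scores
    ]
    # 2. Collapse consecutive repeats of the same class.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Drop blanks and map the remaining labels to Unicode characters.
    return "".join(alphabet[i] for i in collapsed if i != blank)


# Hypothetical toy alphabet: blank plus three Devanagari letters.
alphabet = ["", "क", "म", "ल"]

# Hypothetical scores for five time steps (rows) over four classes (columns).
step_scores = [
    [0.10, 0.70, 0.10, 0.10],  # क
    [0.10, 0.70, 0.10, 0.10],  # क again (repeat, collapsed)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # म
    [0.10, 0.10, 0.10, 0.70],  # ल
]

decoded = ctc_greedy_decode(step_scores, alphabet)
print(decoded)
```

The blank label is what lets a model emit the same character twice in a row: a repeated label separated by a blank survives the collapse step, while an unseparated repeat is merged into one character.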
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Speech Recognition and Synthesis
