uTHCD: A New Benchmarking for Tamil Handwritten OCR
Noushath Shaffi, Faizal Hajamohideen

TL;DR
This paper introduces uTHCD, a comprehensive large-scale Tamil handwritten character database combining online and offline samples, aiming to establish a new benchmark for Tamil OCR and facilitate advancements in document image analysis.
Contribution
Creation of the first extensive, unified Tamil handwritten character database, with around 91,000 samples spanning both online and offline data, to support OCR research.
Findings
The database contains around 91,000 samples across 156 classes, with nearly 600 samples per class.
A baseline CNN model achieves 88% accuracy on the test data.
The database will be made publicly available for research use.
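As a quick sanity check on the reported size, the sketch below divides the approximate total by the class count. It assumes samples are spread roughly evenly across classes, which the source only implies; the function name is illustrative and not part of the paper.

```python
# Figures quoted in this summary; the abstract gives the total as "around 91000",
# so the result is an estimate, not an exact per-class count.
TOTAL_SAMPLES = 91_000
NUM_CLASSES = 156


def mean_samples_per_class(total: int, classes: int) -> float:
    """Average samples per class, assuming a roughly even split (an assumption)."""
    return total / classes


avg = mean_samples_per_class(TOTAL_SAMPLES, NUM_CLASSES)
print(f"about {avg:.0f} samples per class")
```

The average comes out a little under 600, which is consistent with the abstract's "nearly 600 samples in each of 156 classes".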
Abstract
Handwritten character recognition has been a challenging research problem in document image analysis for many decades, for reasons such as large variation in writing styles, inherent noise in the data, the breadth of applications it serves, and the lack of benchmark databases. Considerable work has been reported in the literature on creating databases for several Indic scripts, but work on the Tamil script is still in its infancy, with only one database reported [5]. In this paper, we present the creation of an exhaustive, large, unconstrained Tamil Handwritten Character Database (uTHCD). The database consists of around 91,000 samples, with nearly 600 samples in each of 156 classes. It is a unified collection of both online and offline samples. Offline samples were collected by asking volunteers to write samples on a form inside a specified…
