TL;DR
This paper introduces a large, annotated dataset of Bengali handwritten graphemes, facilitating research in OCR for alpha-syllabary languages and enabling benchmarking of deep learning models.
Contribution
It presents the first dataset of commonly used Bengali handwritten graphemes: 411,000 curated samples covering 1,295 unique graphemes, with 900 additional uncommon graphemes in the test set, together with a grapheme-based labeling scheme that makes segmentation inside words linear.
Findings
Deep learning models can generalize to unseen graphemes.
The dataset enables effective benchmarking of OCR algorithms.
Open-source dataset and challenge foster community research.
Abstract
Latin has historically led the state-of-the-art in handwritten optical character recognition (OCR) research. Adapting existing systems from Latin to alpha-syllabary languages is particularly challenging due to a sharp contrast between their orthographies. The segmentation of graphical constituents corresponding to characters becomes significantly hard due to a cursive writing system and frequent use of diacritics in the alpha-syllabary family of languages. We propose a labeling scheme based on graphemes (linguistic segments of word formation) that makes segmentation inside alpha-syllabary words linear and present the first dataset of Bengali handwritten graphemes that are commonly used in an everyday context. The dataset contains 411k curated samples of 1295 unique commonly used Bengali graphemes. Additionally, the test set contains 900 uncommon Bengali graphemes for out of dictionary…
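To make the idea of linear, grapheme-level segmentation concrete, here is a minimal sketch of splitting a Bengali word into grapheme-like clusters in a single left-to-right pass. It is an illustration under stated assumptions, not the authors' labeling scheme: the function name `split_graphemes` is hypothetical, and the rule used (attach every Unicode combining mark to the preceding cluster, and attach any character that follows a hasanta, U+09CD) is a simplification that may differ from the paper in how signs such as anusvara are treated.

```python
import unicodedata

HASANTA = "\u09cd"  # Bengali virama; joins consonants into a conjunct


def split_graphemes(word: str) -> list[str]:
    """Split a word into grapheme-like clusters in one left-to-right pass.

    Assumption (not taken from the paper): a cluster is a base character
    plus any following combining marks, and a hasanta glues the next
    character onto the current cluster.
    """
    clusters: list[str] = []
    for ch in word:
        # Vowel signs, hasanta, anusvara, etc. are combining marks (Mn/Mc).
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        # A character right after a hasanta continues the same conjunct.
        joins_previous = bool(clusters) and clusters[-1].endswith(HASANTA)
        if clusters and (is_mark or joins_previous):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "bangla": ba + aa-sign + anusvara, then la + aa-sign
bangla = "\u09ac\u09be\u0982\u09b2\u09be"
# "stri": sa + hasanta + ta + hasanta + ra + ii-sign (one conjunct)
stri = "\u09b8\u09cd\u09a4\u09cd\u09b0\u09c0"

print(len(split_graphemes(bangla)))  # 2
print(len(split_graphemes(stri)))    # 1
```

Because each character is either appended to the current cluster or starts a new one, the word is partitioned into consecutive, non-overlapping units, which is the sense in which segmentation at the grapheme level is linear.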
