A Large Multi-Target Dataset of Common Bengali Handwritten Graphemes

Samiul Alam; Tahsin Reasat; Asif Shahriyar Sushmit; Sadi Mohammad Siddiquee; Fuad Rahman; Mahady Hasan; Ahmed Imtiaz Humayun

arXiv:2010.00170 · cs.CV · June 30, 2022

TL;DR

This paper introduces a large, annotated dataset of Bengali handwritten graphemes, facilitating research in OCR for alpha-syllabary languages and enabling benchmarking of deep learning models.

Contribution

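The paper presents the first dataset of commonly used Bengali handwritten graphemes: 411,000 curated samples covering 1,295 unique graphemes, plus a test set that adds 900 uncommon graphemes. It also proposes a grapheme-based labeling scheme that makes segmentation inside a word linear.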

Findings

01 Deep learning models can generalize to unseen graphemes.

02 The dataset enables effective benchmarking of OCR algorithms.

03 Open-source dataset and challenge foster community research.

Abstract

Latin has historically led the state-of-the-art in handwritten optical character recognition (OCR) research. Adapting existing systems from Latin to alpha-syllabary languages is particularly challenging due to a sharp contrast between their orthographies. The segmentation of graphical constituents corresponding to characters becomes significantly hard due to a cursive writing system and frequent use of diacritics in the alpha-syllabary family of languages. We propose a labeling scheme based on graphemes (linguistic segments of word formation) that makes segmentation inside alpha-syllabary words linear and present the first dataset of Bengali handwritten graphemes that are commonly used in an everyday context. The dataset contains 411k curated samples of 1295 unique commonly used Bengali graphemes. Additionally, the test set contains 900 uncommon Bengali graphemes for out of dictionary…
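
To make the idea of linear grapheme segmentation concrete, here is a minimal Python sketch that splits a Bengali word into left-to-right grapheme-like clusters. It is an approximation based on Unicode character categories, not the authors' labeling scheme: the function name `split_graphemes` and the joining rules are assumptions made for illustration. In particular, it attaches every combining sign to the preceding cluster, which the paper's scheme may handle differently for some signs.

```python
from __future__ import annotations

import unicodedata

# Bengali hasanta (virama). Assumption: a hasanta fuses the consonant
# that follows it into the current cluster, forming a conjunct.
HASANTA = "\u09CD"


def split_graphemes(word: str) -> list[str]:
    """Split a word into left-to-right grapheme-like clusters.

    Simplified rules (illustrative only):
      * a combining mark (Unicode category Mn or Mc) joins the current cluster;
      * the character right after a hasanta also joins the current cluster;
      * any other character starts a new cluster.
    """
    clusters = []
    join_next = False
    for ch in word:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        if clusters and (is_mark or join_next):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = ch == HASANTA
    return clusters


# "গ্রাম" (village), written with escapes: ga + hasanta + ra + aa-sign + ma
word = "\u0997\u09CD\u09B0\u09BE\u09AE"
print(split_graphemes(word))  # two clusters: গ্রা and ম
```

Because each cluster is a contiguous slice of the word, the segmentation is a single left-to-right pass with no overlapping or nested units, which is the property the abstract calls linear.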
