Unicode Normalization and Grapheme Parsing of Indic Languages
Nazmuddoha Ansary, Quazi Adibur Rahman Adib, Tahsin Reasat, Asif, Shahriyar Sushmit, Ahmed Imtiaz Humayun, Sazia Mehnaz, Kanij Fatema, Mohammad, Mamun Or Rashid, Farig Sadeque

TL;DR
This paper introduces two tools—a normalizer and a grapheme parser—for better processing of Indic languages' complex orthographic units, improving accuracy over previous methods and supporting multiple scripts.
Contribution
It presents novel, efficient normalizer and parser tools tailored for Indic languages' complex graphemes, enhancing Unicode normalization and text processing capabilities.
Findings
Normalizer outperforms previous IndicNLP normalizer.
Tools effectively process seven Indic scripts.
Robust NLP experiments validate tool effectiveness.
Abstract
Writing systems of Indic languages have orthographic syllables, also known as complex graphemes, as unique horizontal units. A prominent feature of these languages is that these complex grapheme units comprise consonants or consonant conjuncts, vowel diacritics, and consonant diacritics, which together give each language its distinctive written form. Unicode-based writing schemes for these languages often disregard this feature and encode words as linear sequences of Unicode characters, using an intricate scheme of connector characters and font interpreters. Because a few dozen Unicode characters are used to write thousands of unique glyphs (complex graphemes), serious ambiguities arise that lead to malformed words. In this paper, we propose two libraries: i) a normalizer for normalizing inconsistencies caused by a Unicode-based encoding scheme for Indic languages and ii)…
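To make the encoding issue described in the abstract concrete, the following is a minimal Python sketch using only the standard `unicodedata` module. It is not the paper's normalizer or parser; the helper names `is_virama` and `naive_clusters` are hypothetical, and the grouping rule (attach combining marks to the preceding character, and join across a virama) is a simplifying assumption that ignores many real cases such as joiner characters and malformed sequences.

```python
import unicodedata


def is_virama(ch: str) -> bool:
    # Assumption: canonical combining class 9 (the Unicode "Virama" class)
    # identifies the connector character that joins consonants into conjuncts.
    return unicodedata.combining(ch) == 9


def naive_clusters(text: str) -> list[str]:
    """Group code points into rough orthographic clusters (illustrative only)."""
    clusters: list[str] = []
    for ch in text:
        # Vowel signs and other diacritics have a general category starting with "M".
        is_mark = unicodedata.category(ch).startswith("M")
        # A character that follows a virama continues the current conjunct.
        joins_previous = bool(clusters) and is_virama(clusters[-1][-1])
        if clusters and (is_mark or joins_previous):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# One visual unit written as four code points:
# KA + VIRAMA + SSA + VOWEL SIGN I (Bengali).
conjunct = "\u0995\u09cd\u09b7\u09bf"
print(len(conjunct), naive_clusters(conjunct))

# Two different code point sequences for the same visible syllable:
# KA + VOWEL SIGN O versus KA + VOWEL SIGN E + VOWEL SIGN AA.
composed = "\u0995\u09cb"
split = "\u0995\u09c7\u09be"
print(composed == split)
print(unicodedata.normalize("NFC", split) == composed)
```

Standard NFC normalization reconciles the two spellings in this particular case, but it is a general-purpose Unicode mechanism; the script-specific inconsistencies the abstract refers to are the motivation for a dedicated normalizer.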
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
