Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration
Kanchon Gharami, Quazi Sarwar Muhtaseem, Deepti Gupta, Lavanya Elluri, Shafika Showkat Moni

TL;DR
This paper introduces a large transliteration dataset for Hindi and Bengali, and trains a multilingual seq2seq LLM that significantly improves transliteration accuracy for these languages.
Contribution
The paper creates a comprehensive transliteration dataset for Hindi and Bengali and develops a new multilingual LLM that outperforms existing models in transliteration tasks.
Findings
Significant BLEU and CER improvements over existing models
Nearly 1.8 million Hindi and 1 million Bengali transliteration pairs created
Pre-trained multilingual seq2seq LLM effectively handles Romanized scripts
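The findings above are reported in BLEU and CER. As a rough illustration of the second metric, character error rate is commonly defined as the edit distance between a predicted string and its reference, divided by the reference length. The sketch below follows that common definition; it is not taken from the paper, and the function names `levenshtein` and `cer` are hypothetical. It counts Unicode code points, which is an assumption: for Devanagari or Bengali text, an evaluation might instead count grapheme clusters.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length.

    Assumption: length is measured in Unicode code points, not graphemes.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Lower is better: a value of 0.0 means the predicted native-script string matches the reference exactly, and the value can exceed 1.0 when the prediction is much longer than the reference.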
Abstract
The development of robust transliteration techniques to enhance the effectiveness of transforming Romanized scripts into native scripts is crucial for Natural Language Processing tasks, including sentiment analysis, speech recognition, information retrieval, and intelligent personal assistants. Despite significant advancements, state-of-the-art multilingual models still face challenges in handling Romanized script, where the Roman alphabet is adopted to represent the phonetic structure of diverse languages. Within the South Asian context, where the use of Romanized script for Indo-Aryan languages is widespread across social media and digital communication platforms, such usage continues to pose significant challenges for cutting-edge multilingual models. While a limited number of transliteration datasets and models are available for Indo-Aryan languages, they generally lack sufficient…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Authorship Attribution and Profiling
