Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration

Kanchon Gharami; Quazi Sarwar Muhtaseem; Deepti Gupta; Lavanya Elluri; Shafika Showkat Moni

arXiv:2511.22769·cs.CL·December 1, 2025

TL;DR

This paper introduces a large transliteration dataset for Hindi and Bengali (nearly 1.8 million and 1 million pairs, respectively) and trains a multilingual seq2seq LLM that improves BLEU and CER over existing models for these languages.

Contribution

The paper creates a comprehensive transliteration dataset for Hindi and Bengali and develops a new multilingual LLM that outperforms existing models in transliteration tasks.

Findings

01  Significant BLEU and CER improvements over existing models

02  Nearly 1.8 million Hindi and 1 million Bengali transliteration pairs created

03  Pre-trained multilingual seq2seq LLM effectively handles Romanized scripts
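
For readers unfamiliar with CER (character error rate), one of the two metrics named in the findings, the following is a minimal sketch of how it is commonly computed: the Levenshtein edit distance between the reference and the model output, divided by the reference length. The function names `edit_distance` and `cer` are illustrative, and this page does not state the paper's exact evaluation setup (normalization, tokenization, or tooling), so treat this as an assumption-laden illustration and not as the authors' scoring code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example pair (not taken from the paper's dataset).
    print(cer("kitten", "sitting"))
```

Lower CER is better. Note that this sketch counts Unicode code points, so a Devanagari or Bengali consonant with an attached vowel sign counts as more than one character; an evaluation that works on grapheme clusters would give different numbers.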

Abstract

The development of robust transliteration techniques to enhance the effectiveness of transforming Romanized scripts into native scripts is crucial for Natural Language Processing tasks, including sentiment analysis, speech recognition, information retrieval, and intelligent personal assistants. Despite significant advancements, state-of-the-art multilingual models still face challenges in handling Romanized script, where the Roman alphabet is adopted to represent the phonetic structure of diverse languages. Within the South Asian context, where the use of Romanized script for Indo-Aryan languages is widespread across social media and digital communication platforms, such usage continues to pose significant challenges for cutting-edge multilingual models. While a limited number of transliteration datasets and models are available for Indo-Aryan languages, they generally lack sufficient…

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Authorship Attribution and Profiling