MuRIL: Multilingual Representations for Indian Languages
Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, Partha Talukdar

TL;DR
MuRIL is a specialized multilingual language model designed for Indian languages, trained on extensive Indian text data, including transliterations, and it outperforms existing models on cross-lingual benchmarks and transliterated datasets.
Contribution
We introduce MuRIL, a multilingual language model tailored for Indian languages, trained with augmented data including transliterations, improving performance on cross-lingual and transliterated tasks.
Findings
MuRIL outperforms mBERT on XTREME benchmarks.
MuRIL effectively handles transliterated Indian language data.
Training on augmented Indian language data enhances cross-lingual performance.
Abstract
India is a multilingual society with 1369 rationalized languages and dialects being spoken across the country (INDIA, 2011). Of these, the 22 scheduled languages have a staggering total of 1.17 billion speakers and 121 languages have more than 10,000 speakers (INDIA, 2011). India also has the second largest (and an ever growing) digital footprint (Statista, 2020). Despite this, today's state-of-the-art multilingual systems perform suboptimally on Indian (IN) languages. This can be explained by the fact that multilingual language models (LMs) are often trained on 100+ languages together, leading to a small representation of IN languages in their vocabulary and training data. Multilingual LMs are substantially less effective in resource-lean scenarios (Wu and Dredze, 2020; Lauscher et al., 2020), as limited data doesn't help capture the various nuances of a language. One also commonly…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Linear Layer · Residual Connection · Layer Normalization · WordPiece · Adam · Weight Decay · Dropout · Linear Warmup With Linear Decay · Multi-Head Attention
