MuRIL: Multilingual Representations for Indian Languages

Simran Khanuja; Diksha Bansal; Sarvesh Mehtani; Savya Khosla; Atreyee Dey; Balaji Gopalan; Dilip Kumar Margam; Pooja Aggarwal; Rajiv Teja Nagipogu; Shachi Dave; Shruti Gupta; Subhash Chandra Bose Gali; Vish Subramanian; Partha Talukdar

arXiv:2103.10730 · cs.CL · April 5, 2021 · 158 cites

TL;DR

MuRIL is a multilingual language model built specifically for Indian languages. It is trained on large Indian-language text corpora augmented with transliterated data, and it outperforms mBERT on the XTREME cross-lingual benchmark while also handling transliterated datasets effectively.

Contribution

We introduce MuRIL, a multilingual language model tailored for Indian languages, trained with augmented data including transliterations, improving performance on cross-lingual and transliterated tasks.

Findings

01 MuRIL outperforms mBERT on XTREME benchmarks.

02 MuRIL effectively handles transliterated Indian language data.

03 Training on augmented Indian language data enhances cross-lingual performance.

Abstract

India is a multilingual society with 1369 rationalized languages and dialects being spoken across the country (INDIA, 2011). Of these, the 22 scheduled languages have a staggering total of 1.17 billion speakers and 121 languages have more than 10,000 speakers (INDIA, 2011). India also has the second largest (and an ever growing) digital footprint (Statista, 2020). Despite this, today's state-of-the-art multilingual systems perform suboptimally on Indian (IN) languages. This can be explained by the fact that multilingual language models (LMs) are often trained on 100+ languages together, leading to a small representation of IN languages in their vocabulary and training data. Multilingual LMs are substantially less effective in resource-lean scenarios (Wu and Dredze, 2020; Lauscher et al., 2020), as limited data doesn't help capture the various nuances of a language. One also commonly…

Code & Models

Repositories

hate-alert/indicabusive (PyTorch)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Linear Layer · Residual Connection · Layer Normalization · WordPiece · Adam · Weight Decay · Dropout · Linear Warmup With Linear Decay · Multi-Head Attention