"A Passage to India": Pre-trained Word Embeddings for Indian Languages

Kumar Saurav; Kumar Saunack; Diptesh Kanojia; Pushpak Bhattacharyya

arXiv:2112.13800·cs.CL·December 28, 2021·24 cites

Open Access

TL;DR

This paper provides a comprehensive repository of pre-trained word embeddings for 14 Indian languages, including contextual and cross-lingual models, to support NLP tasks in resource-scarce settings.

Contribution

The paper introduces a collection of 436 pre-trained embeddings for Indian languages, covering multiple embedding approaches and cross-lingual models, made available for NLP research and applications.

Findings

1. Embeddings improve performance on POS tagging and NER tasks.
2. Cross-lingual embeddings facilitate transfer learning across Indian languages.
3. Resource availability supports NLP development in low-resource languages.

Abstract

Dense word vectors, or 'word embeddings', which encode semantic properties of words, have become integral to NLP tasks such as Machine Translation (MT), Question Answering (QA), Word Sense Disambiguation (WSD), and Information Retrieval (IR). In this paper, we use various existing approaches to create multiple word embeddings for 14 Indian languages. We place the embeddings for all of these languages, viz., Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Odiya, Punjabi, Sanskrit, Tamil, and Telugu, in a single repository. Relatively newer approaches that emphasize context (BERT, ELMo, etc.) have shown significant improvements but require a large amount of resources to generate usable models. We release pre-trained embeddings generated using both contextual and non-contextual approaches. We also use MUSE and XLM to train cross-lingual…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Tanh Activation · Adam · Sigmoid Activation · Layer Normalization · Residual Connection · Dropout