"A Passage to India": Pre-trained Word Embeddings for Indian Languages
Kumar Saurav, Kumar Saunack, Diptesh Kanojia, Pushpak Bhattacharyya

TL;DR
This paper provides a comprehensive repository of pre-trained word embeddings for 14 Indian languages, including contextual and cross-lingual models, to support NLP tasks in resource-scarce settings.
Contribution
It introduces a large collection of 436 pre-trained embeddings for Indian languages, covering multiple approaches and cross-lingual models, accessible for NLP research and applications.
Findings
Embeddings improve performance on POS tagging and NER tasks.
Cross-lingual embeddings facilitate transfer learning across Indian languages.
Resource availability supports NLP development in low-resource languages.
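As an illustration of the cross-lingual finding above: when two languages share one vector space, a word in one language can be matched to its closest counterpart in the other by cosine similarity. The sketch below is a minimal, hedged example. The toy vectors are invented for illustration, the "word v1 v2 ..." text layout is an assumption rather than the confirmed format of the released files, and `parse_vectors`, `cosine`, and `nearest` are hypothetical helper names, not part of the paper's repository.

```python
import math

# Toy vectors in a "word v1 v2 ..." text layout (an assumed layout).
# All values are invented for illustration, not taken from the released models.
TOY_HINDI = """\
पानी 0.9 0.1 0.0
किताब 0.0 0.2 0.9
"""
TOY_MARATHI = """\
पाणी 0.8 0.2 0.1
पुस्तक 0.1 0.1 0.95
"""


def parse_vectors(text):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, source, target):
    """Return the target-language word whose vector is closest to source[word]."""
    query = source[word]
    return max(target, key=lambda w: cosine(query, target[w]))


hindi = parse_vectors(TOY_HINDI)
marathi = parse_vectors(TOY_MARATHI)
print(nearest("पानी", hindi, marathi))   # Hindi "water" -> closest Marathi word
print(nearest("किताब", hindi, marathi))  # Hindi "book" -> closest Marathi word
```

With real embeddings the lookup only makes sense after the two spaces have been aligned, which is the role the cross-lingual models play; the toy vectors here are simply constructed to already share a space.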
Abstract
Dense word vectors, or 'word embeddings', which encode semantic properties of words, have become integral to NLP tasks like Machine Translation (MT), Question Answering (QA), Word Sense Disambiguation (WSD), and Information Retrieval (IR). In this paper, we use various existing approaches to create multiple word embeddings for 14 Indian languages. We place the embeddings for all these languages, viz., Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Odiya, Punjabi, Sanskrit, Tamil, and Telugu, in a single repository. Relatively newer approaches that emphasize catering to context (BERT, ELMo, etc.) have shown significant improvements, but require a large amount of resources to generate usable models. We release pre-trained embeddings generated using both contextual and non-contextual approaches. We also use MUSE and XLM to train cross-lingual…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Tanh Activation · Adam · Sigmoid Activation · Layer Normalization · Residual Connection · Dropout
