AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages
Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar

TL;DR
This paper introduces the IndicNLP corpus with 2.7 billion words across 10 Indian languages, along with pre-trained embeddings and classification datasets, to advance NLP research in Indic languages.
Contribution
It provides a large-scale, multilingual corpus and pre-trained embeddings for Indic languages, along with evaluation datasets, to facilitate NLP research and improve embedding quality.
Findings
IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks
News article category classification datasets for 9 languages are created to evaluate the embeddings
The authors hope the corpus will accelerate Indic NLP research
Abstract
We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.
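As a rough illustration of how pre-trained word embeddings can feed a category classifier of the kind the abstract describes, the sketch below averages word vectors into a document vector and compares documents by cosine similarity. It is not the authors' evaluation code. It assumes the embeddings are stored in the common word2vec/fastText text layout (a "count dim" header line, then one word and its values per line); the function names and the three toy Hindi vectors are hypothetical and stand in for a real embedding file.

```python
import io
import math


def load_vec(handle):
    """Parse word2vec/fastText text format: 'count dim' header, then 'word v1 ... vd' rows."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip rows that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def doc_vector(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Toy 2-dimensional embeddings (hypothetical values, not taken from the IndicNLP release).
TOY_VEC = "3 2\nक्रिकेट 1.0 0.0\nमैच 0.8 0.2\nचुनाव 0.0 1.0\n"

vectors, dim = load_vec(io.StringIO(TOY_VEC))
sports_doc = doc_vector(["क्रिकेट", "मैच"], vectors, dim)
politics_doc = doc_vector(["चुनाव"], vectors, dim)
```

In practice the averaged document vectors would be passed to a standard classifier trained on labelled articles; the cosine comparison here only shows that documents on the same topic end up closer together.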
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
