AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

Anoop Kunchukuttan; Divyanshu Kakwani; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar

arXiv:2005.00085·cs.CL·May 4, 2020·43 cites

TL;DR

This paper introduces the IndicNLP corpus, a general-domain corpus of 2.7 billion words across 10 Indian languages, together with pre-trained word embeddings and news article classification datasets for 9 languages, to advance NLP research in Indic languages.

Contribution

It provides a large-scale, multilingual corpus and pre-trained embeddings for Indic languages, along with evaluation datasets, to facilitate NLP research and improve embedding quality.

Findings

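01 IndicNLP embeddings significantly outperform existing publicly available pre-trained embeddings on multiple evaluation tasks

02 The authors expect the availability of the corpus to accelerate Indic NLP research

03 News article category classification datasets for 9 languages are provided to evaluate the embeddings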

Abstract

We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.
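
To make the idea of evaluating embeddings on news category classification concrete, here is a minimal sketch. It is not the authors' evaluation setup, which this page does not describe. It assumes one simple approach: each document is represented by the average of its word vectors and assigned to the category with the nearest centroid by cosine similarity. The toy vectors, the English tokens, the category labels, and all function names are hypothetical stand-ins for real pre-trained embeddings and datasets.

```python
from collections import defaultdict
import math

# Toy 3-dimensional vectors standing in for pre-trained embeddings (hypothetical values).
EMBEDDINGS = {
    "match": [0.9, 0.1, 0.0],
    "team": [0.8, 0.2, 0.1],
    "goal": [0.7, 0.0, 0.2],
    "film": [0.1, 0.9, 0.0],
    "actor": [0.0, 0.8, 0.2],
    "song": [0.2, 0.7, 0.1],
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def doc_vector(tokens, embeddings):
    """Average the vectors of in-vocabulary tokens; None if no token is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return mean_vector(vectors) if vectors else None


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train_centroids(labeled_docs, embeddings):
    """Compute one centroid per category from (tokens, label) pairs."""
    grouped = defaultdict(list)
    for tokens, label in labeled_docs:
        vec = doc_vector(tokens, embeddings)
        if vec is not None:
            grouped[label].append(vec)
    return {label: mean_vector(vecs) for label, vecs in grouped.items()}


def predict(tokens, centroids, embeddings):
    """Return the category whose centroid is most similar, or None if unknown."""
    vec = doc_vector(tokens, embeddings)
    if vec is None:
        return None
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def accuracy(test_docs, centroids, embeddings):
    """Fraction of test documents whose predicted category matches the label."""
    correct = sum(predict(t, centroids, embeddings) == y for t, y in test_docs)
    return correct / len(test_docs)


# Hypothetical tokenized headlines with category labels.
TRAIN = [(["match", "team"], "sports"), (["film", "actor"], "entertainment")]
TEST = [(["goal", "team"], "sports"), (["song", "film"], "entertainment")]

centroids = train_centroids(TRAIN, EMBEDDINGS)
print(accuracy(TEST, centroids, EMBEDDINGS))
```

Under this setup, two embedding sets can be compared by swapping the `EMBEDDINGS` table and checking which one yields higher accuracy on the same held-out documents. Averaging discards word order, so a comparison of this kind mostly reflects how well the vectors capture topical similarity.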

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification