IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages

Aman Kumar; Himani Shrotriya; Prachi Sahu; Raj Dabre; Ratish Puduppully; Anoop Kunchukuttan; Amogh Mishra; Mitesh M. Khapra; Pratyush Kumar

arXiv:2203.05437·cs.CL·October 28, 2022·5 cites

TL;DR

The paper introduces the IndicNLG Benchmark, a collection of datasets covering five natural language generation tasks in 11 Indic languages, intended to support benchmarking and development of NLG models for languages with limited resources.

Contribution

It presents the first diverse multilingual NLG benchmark for Indic languages, with datasets for five tasks and analysis of model performance, facilitating future research in low-resource language NLG.

Findings

01

Multilingual pre-trained models perform strongly on Indic NLG tasks.

02

Models trained on IndicNLG datasets are useful for related NLG tasks.

03

The dataset creation process is simple and adaptable to other low-resource languages.

Abstract

Natural Language Generation (NLG) for non-English languages is hampered by the scarcity of datasets in these languages. In this paper, we present the IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic languages. We focus on five diverse tasks, namely, biography generation using Wikipedia infoboxes, news headline generation, sentence summarization, paraphrase generation, and question generation. We describe the created datasets and use them to benchmark the performance of several monolingual and multilingual baselines that leverage pre-trained sequence-to-sequence models. Our results exhibit the strong performance of multilingual language-specific pre-trained models, and the utility of models trained on our dataset for other related NLG tasks. Our dataset creation methods can be easily applied to modest-resource languages as they involve simple steps such as…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications