IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages
Aman Kumar, Himani Shrotriya, Prachi Sahu, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Amogh Mishra, Mitesh M. Khapra, Pratyush Kumar

TL;DR
The paper introduces the IndicNLG Benchmark, a collection of datasets covering five natural language generation tasks in 11 Indic languages, enabling better benchmarking and development of NLG models for low-resource languages.
Contribution
It presents the first diverse multilingual NLG benchmark for Indic languages, with datasets for five tasks and analysis of model performance, facilitating future research in low-resource language NLG.
Findings
Multilingual pre-trained models perform strongly on Indic NLG tasks.
Models trained on IndicNLG datasets are useful for related NLG tasks.
The dataset creation process is simple and adaptable to other low-resource languages.
Abstract
Natural Language Generation (NLG) for non-English languages is hampered by the scarcity of datasets in these languages. In this paper, we present the IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic languages. We focus on five diverse tasks, namely, biography generation using Wikipedia infoboxes, news headline generation, sentence summarization, paraphrase generation, and question generation. We describe the created datasets and use them to benchmark the performance of several monolingual and multilingual baselines that leverage pre-trained sequence-to-sequence models. Our results exhibit the strong performance of multilingual language-specific pre-trained models, and the utility of models trained on our dataset for other related NLG tasks. Our dataset creation methods can be easily applied to modest-resource languages as they involve simple steps such as…
Code & Models
- 🤗 ai4bharat/MultiIndicWikiBioUnified · model · 1 dl · ♡ 1
- 🤗 ai4bharat/MultiIndicWikiBioSS · model · 7 dl
- 🤗 ai4bharat/MultiIndicQuestionGenerationUnified · model · 9 dl · ♡ 1
- 🤗 ai4bharat/MultiIndicQuestionGenerationSS · model · 137 dl · ♡ 1
- 🤗 ai4bharat/MultiIndicParaphraseGeneration · model · 19 dl · ♡ 3
- 🤗 ai4bharat/MultiIndicParaphraseGenerationSS · model · 28 dl · ♡ 1
- 🤗 ai4bharat/MultiIndicHeadlineGenerationSS · model · 3 dl
- 🤗 ai4bharat/MultiIndicHeadlineGeneration · model · 3 dl
- 🤗 arijitx/IndicBART-bn-QuestionGeneration · model · 4 dl
- 🤗 ai4bharat/MultiIndicSentenceSummarization · model · 28 dl
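The checkpoints listed above are sequence-to-sequence models, so they can probably be loaded with the Hugging Face transformers library. The following is a minimal sketch, not the authors' documented usage. It assumes an IndicBART-style input format (source text, then `</s>`, then a language tag such as `<2hi>`) and that the same tag serves as the decoder start token. Both are assumptions to check against each model card. The helper names `format_source` and `paraphrase` are hypothetical, and any script conversion a checkpoint may expect is not handled.

```python
# Checkpoint name copied from the list above.
MODEL_ID = "ai4bharat/MultiIndicParaphraseGeneration"


def format_source(text: str, lang: str) -> str:
    """Build the model input string.

    ASSUMPTION: IndicBART-style format, "<text> </s> <2xx>", where xx is a
    language code such as "hi". Verify against the model card before use.
    """
    return f"{text.strip()} </s> <2{lang}>"


def paraphrase(text: str, lang: str = "hi", model_id: str = MODEL_ID,
               max_length: int = 32) -> str:
    """Generate one paraphrase (hypothetical helper; downloads the checkpoint)."""
    # Imported lazily so format_source works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # The end-of-sentence marker and language tag are already in the string,
    # so the tokenizer is told not to add special tokens of its own.
    inputs = tokenizer(format_source(text, lang), add_special_tokens=False,
                       return_tensors="pt")

    output_ids = model.generate(
        inputs.input_ids,
        num_beams=4,
        max_length=max_length,
        # ASSUMPTION: decoding starts from the target-language tag.
        decoder_start_token_id=tokenizer.convert_tokens_to_ids(f"<2{lang}>"),
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `paraphrase("...", lang="hi")` with a Hindi sentence would download the checkpoint on first use. Swapping `MODEL_ID` for another entry in the list should work the same way if that checkpoint follows the same input convention.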
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
