An Overview of Indian Language Datasets used for Text Summarization

Shagun Sinha; Girish Nath Jha

arXiv:2203.16127·cs.CL·April 28, 2022·1 cites

Open Access

TL;DR

This survey examines Indian Language Text Summarization (ILTS) datasets reported between 2012 and 2022, describing their characteristics and how they differ from English datasets, and highlighting the resource scarcity and slow development that limit their growth.

Contribution

It provides a comprehensive analysis of ILTS datasets, comparing them with English datasets, and identifies key challenges hindering resource development in Indian languages.

Findings

01

ILTS datasets are drawn mainly from the news domain and come in both extractive and abstractive formats.

02

Development of ILTS datasets is slower than English due to resource and forum limitations.

03

The lower number of ILTS datasets is linked to a lack of dedicated development forums and public datasets.

Abstract

In this paper, we survey Text Summarization (TS) datasets in Indian Languages (ILs), which are also low-resource languages (LRLs). We seek to answer one primary question: is the pool of Indian Language Text Summarization (ILTS) datasets growing, or is there a resource poverty? To answer the primary question, we pose two sub-questions about ILTS datasets: first, what characteristics (format and domain) do ILTS datasets have? Second, how different are those characteristics from those of datasets in high-resource languages (HRLs), particularly English? We focus on datasets reported in published ILTS research works during 2012-2022. The survey of ILTS and English datasets reveals two similarities and one contrast. The two similarities are: first, the domain of the datasets is commonly news (Hermann et al., 2015). The second similarity is the format of the dataset which is both extractive…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Spatio-temporal stability analysis