An Overview of Indian Language Datasets used for Text Summarization
Shagun Sinha, Girish Nath Jha

TL;DR
This survey examines Indian Language Text Summarization datasets from 2012-2022, highlighting their characteristics, growth challenges, and differences from English datasets, emphasizing resource scarcity and development slowdowns.
Contribution
It provides a comprehensive analysis of ILTS datasets, comparing them with English datasets, and identifies key challenges hindering resource development in Indian languages.
Findings
ILTS datasets are drawn mainly from the news domain and come in both extractive and abstractive formats.
Development of ILTS datasets is slower than that of English datasets because of limited resources and forums.
The lower number of ILTS datasets is linked to the lack of dedicated development forums and public datasets.
Abstract
In this paper, we survey Text Summarization (TS) datasets in Indian Languages (ILs), which are also low-resource languages (LRLs). We seek to answer one primary question: is the pool of Indian Language Text Summarization (ILTS) datasets growing, or is there a resource poverty? To answer the primary question, we pose two sub-questions about ILTS datasets: first, what characteristics (format and domain) do ILTS datasets have? Second, how different are those characteristics from those of datasets in high-resource languages (HRLs), particularly English? We focus on datasets reported in published ILTS research works during 2012-2022. The survey of ILTS and English datasets reveals two similarities and one contrast. The two similarities are: first, the domain of the datasets is commonly news (Hermann et al., 2015). The second similarity is the format of the dataset which is both extractive…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Spatio-temporal stability analysis
