Bridging the Data Gap: Creating a Hindi Text Summarization Dataset from the English XSUM
Praveenkumar Katwe, Rakesh Chandra Balabantaray, Kaliprasad Vittala

TL;DR
This paper presents an automated, cost-effective method to create a high-quality Hindi text summarization dataset by translating and adapting from the English XSUM dataset, validated with advanced evaluation techniques.
Contribution
Introduces a scalable framework for generating Hindi summarization datasets from English resources using translation and linguistic adaptation, validated with COMET and LLMs.
Findings
Created a diverse Hindi summarization dataset from XSUM
Validated dataset quality with COMET and LLMs
Facilitated Hindi NLP research with a cost-effective approach
Abstract
Current advancements in Natural Language Processing (NLP) have largely favored resource-rich languages, leaving a significant gap in high-quality datasets for low-resource languages like Hindi. This scarcity is particularly evident in text summarization, where the development of robust models is hindered by a lack of diverse, specialized corpora. To address this disparity, this study introduces a cost-effective, automated framework for creating a comprehensive Hindi text summarization dataset. By leveraging the English Extreme Summarization (XSUM) dataset as a source, we employ advanced translation and linguistic adaptation techniques. To ensure high fidelity and contextual relevance, we utilize the Crosslingual Optimized Metric for Evaluation of Translation (COMET) for validation, supplemented by the selective use of Large Language Models (LLMs) for curation. The resulting dataset…
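The abstract describes validating translated examples with a quality metric before they enter the dataset. A minimal sketch of that filtering step follows, under stated assumptions: the names `Pair`, `filter_by_quality`, and `length_ratio_score`, and the threshold value are all hypothetical and do not come from the paper. The scorer is injected as a function so that a real COMET-based scorer could be swapped in; the placeholder used here is a simple length-ratio heuristic, not COMET.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Pair:
    """One translated example (field names are illustrative, not from the paper)."""
    source_en: str       # English XSUM text (document or summary)
    translation_hi: str  # machine-translated Hindi text


def filter_by_quality(
    pairs: Iterable[Pair],
    score_fn: Callable[[Pair], float],
    threshold: float,
) -> List[Pair]:
    """Keep only pairs whose quality score is at or above the threshold."""
    return [p for p in pairs if score_fn(p) >= threshold]


def length_ratio_score(pair: Pair) -> float:
    """Placeholder scorer (NOT COMET): shorter/longer whitespace-token count.

    Returns 0.0 when either side is empty, so empty translations are dropped
    by any positive threshold.
    """
    n_src = len(pair.source_en.split())
    n_tgt = len(pair.translation_hi.split())
    if n_src == 0 or n_tgt == 0:
        return 0.0
    return min(n_src, n_tgt) / max(n_src, n_tgt)


if __name__ == "__main__":
    examples = [
        Pair("The council approved the plan", "परिषद ने योजना को मंजूरी दी"),
        Pair("The council approved the plan", ""),  # failed translation
    ]
    # The 0.5 threshold is an arbitrary illustration, not a value from the paper.
    kept = filter_by_quality(examples, length_ratio_score, threshold=0.5)
    print(len(kept), "of", len(examples), "pairs kept")
```

Passing the scorer as a parameter keeps the filtering logic independent of any particular metric, so the same function would work with a learned metric or with an LLM-based judgment that returns a numeric score.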
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
