Bridging the Data Gap: Creating a Hindi Text Summarization Dataset from the English XSUM

Praveenkumar Katwe; RakeshChandra Balabantaray; Kaliprasad Vittala

arXiv:2601.01543 · cs.CL · January 6, 2026

TL;DR

This paper presents an automated, cost-effective method for creating a Hindi text summarization dataset by translating and adapting the English XSUM dataset, with quality validated using COMET and selective LLM-based curation.

Contribution

Introduces a scalable framework for generating Hindi summarization datasets from English resources using translation and linguistic adaptation, validated with COMET and LLMs.

Findings

01. Created a diverse Hindi summarization dataset from XSUM

02. Validated dataset quality with COMET and LLMs

03. Facilitated Hindi NLP research with a cost-effective approach

Abstract

Current advancements in Natural Language Processing (NLP) have largely favored resource-rich languages, leaving a significant gap in high-quality datasets for low-resource languages like Hindi. This scarcity is particularly evident in text summarization, where the development of robust models is hindered by a lack of diverse, specialized corpora. To address this disparity, this study introduces a cost-effective, automated framework for creating a comprehensive Hindi text summarization dataset. By leveraging the English Extreme Summarization (XSUM) dataset as a source, we employ advanced translation and linguistic adaptation techniques. To ensure high fidelity and contextual relevance, we utilize the Crosslingual Optimized Metric for Evaluation of Translation (COMET) for validation, supplemented by the selective use of Large Language Models (LLMs) for curation. The resulting dataset…
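
The validation step described in the abstract can be pictured as a score-and-filter pass over translated pairs. The following is only a minimal sketch under stated assumptions, not the authors' pipeline: the names `Pair`, `split_by_quality`, and `toy_scorer`, as well as the threshold value, are hypothetical and do not come from the paper. The scorer is deliberately pluggable; `toy_scorer` is a trivial stand-in so the example runs without any model, and in practice it would be replaced by a function that wraps a COMET model. Pairs that fall below the threshold are set aside rather than discarded, which is one plausible place for the selective LLM curation to apply.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Pair:
    """One translated example: English source text and its Hindi translation."""
    source_en: str
    translation_hi: str


# A scorer maps a batch of pairs to one quality score per pair (higher is better).
Scorer = Callable[[List[Pair]], List[float]]


def split_by_quality(
    pairs: List[Pair], scorer: Scorer, threshold: float
) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into (kept, flagged) using a translation-quality score.

    The threshold is a hypothetical parameter; the paper's abstract does not
    state a cutoff.
    """
    scores = scorer(pairs)
    if len(scores) != len(pairs):
        raise ValueError("scorer must return exactly one score per pair")
    kept: List[Pair] = []
    flagged: List[Pair] = []
    for pair, score in zip(pairs, scores):
        if score >= threshold:
            kept.append(pair)
        else:
            flagged.append(pair)  # candidates for manual or LLM-based review
    return kept, flagged


def toy_scorer(pairs: List[Pair]) -> List[float]:
    """Placeholder scorer, NOT COMET: 1.0 for a non-empty translation, else 0.0."""
    return [1.0 if p.translation_hi.strip() else 0.0 for p in pairs]


pairs = [
    Pair("The council approved the budget.", "परिषद ने बजट को मंजूरी दी।"),
    Pair("Flooding closed the main road.", ""),  # failed translation
]
kept, flagged = split_by_quality(pairs, toy_scorer, threshold=0.5)
print(len(kept), len(flagged))
```

Keeping the scorer as a plain callable means the same filter works whether the scores come from a reference-free metric, a reference-based one, or a cached file of precomputed scores.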

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods