SAHAAYAK 2023 -- the Multi Domain Bilingual Parallel Corpus of Sanskrit to Hindi for Machine Translation
Vishvajitsinh Bakrola, Jitendra Nasariwala

TL;DR
SAHAAYAK 2023 introduces a comprehensive, multi-domain Sanskrit-Hindi parallel corpus of 1.5 million sentence pairs, facilitating improved machine translation for low-resource languages.
Contribution
The paper presents a large, balanced, multi-domain Sanskrit-Hindi corpus created through extensive mining, cleaning, and verification processes, enhancing resources for low-resource language translation.
Findings
Corpus contains 1.5 million sentence pairs.
Includes diverse domains like news, literature, and sports.
Pipeline ensures high-quality, normalized data for machine translation.
Abstract
This data article presents SAHAAYAK 2023, a large bilingual parallel corpus for the low-resource language pair Sanskrit-Hindi. The corpus contains a total of 1.5M Sanskrit-Hindi sentence pairs. To make the corpus broadly usable and balanced, it incorporates data from multiple domains, including news, daily conversations, politics, history, sport, and ancient Indian literature. A multifaceted approach was adopted to build a sizable multi-domain corpus for a low-resource language like Sanskrit. The development approach ranges from creating a small hand-crafted dataset to applying a wide range of mining, cleaning, and verification steps. Mining followed a three-fold process: mining from machine-readable sources, mining from non-machine-readable sources, and collation from existing corpora. Post mining, the…
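The abstract is cut off before it describes the post-mining steps, so the following is only a minimal sketch of what a generic cleaning pass over mined Sanskrit-Hindi pairs might look like, not the authors' pipeline. The function names (`normalize`, `clean_pairs`), the `max_ratio` threshold, and the specific filters (Unicode NFC normalization, whitespace collapsing, a Devanagari-script check, a length-ratio filter, and exact de-duplication) are all assumptions chosen for illustration.

```python
import re
import unicodedata

# Devanagari Unicode block; both Sanskrit and Hindi are written in this script.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_pairs(pairs, max_ratio=3.0):
    """Filter (sanskrit, hindi) pairs with simple, assumed heuristics.

    A pair is dropped if either side is empty, either side contains no
    Devanagari character, the character-length ratio exceeds max_ratio,
    or the normalized pair has already been seen.
    """
    seen = set()
    cleaned = []
    for sa, hi in pairs:
        sa, hi = normalize(sa), normalize(hi)
        if not sa or not hi:
            continue
        if not (DEVANAGARI.search(sa) and DEVANAGARI.search(hi)):
            continue
        ratio = max(len(sa), len(hi)) / min(len(sa), len(hi))
        if ratio > max_ratio:
            continue
        if (sa, hi) in seen:
            continue
        seen.add((sa, hi))
        cleaned.append((sa, hi))
    return cleaned
```

A length-ratio filter of this kind is a common heuristic for catching misaligned pairs in mined parallel data; the threshold of 3.0 is arbitrary and would need tuning against a verified sample.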
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
