Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation
N J Karthika, Keerthana Suryanarayanan, Jahanvi Purohit, Ganesh Ramakrishnan, Jitin Singla, Anil Kumar Gourishetty

TL;DR
This paper introduces Samasāmayik, a diverse Hindi-Sanskrit parallel corpus of 92,196 sentences drawn from contemporary sources, and shows that fine-tuning on it improves machine translation of contemporary-language data.
Contribution
A new, large, and diverse Hindi-Sanskrit parallel dataset built from modern sources, benchmarked by fine-tuning three translation models (ByT5, NLLB, and IndicTrans-v2).
Findings
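Models fine-tuned on Samasāmayik achieve significant gains on in-domain test data and comparable performance on other widely used test sets.
The fine-tuned models establish a strong baseline for contemporary Hindi-Sanskrit translation.
Minimal overlap with existing corpora confirms the dataset's novelty.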
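The overlap finding can be illustrated with a minimal sketch. The summary does not say how the authors measured overlap, so the approach below (exact matching of normalized sentences) is an assumption, and the names `normalize` and `overlap_fraction` are hypothetical helpers introduced only for illustration.

```python
import unicodedata


def normalize(sentence: str) -> str:
    """NFC-normalize, collapse whitespace, and strip trailing danda/period marks."""
    text = unicodedata.normalize("NFC", sentence)
    text = " ".join(text.split())
    # Assumption: trailing punctuation should not count as a difference.
    return text.rstrip("।॥. ").strip()


def overlap_fraction(new_corpus, existing_corpus) -> float:
    """Fraction of distinct normalized sentences in new_corpus
    that also appear in existing_corpus (exact match only)."""
    new_set = {normalize(s) for s in new_corpus if s.strip()}
    existing_set = {normalize(s) for s in existing_corpus if s.strip()}
    if not new_set:
        return 0.0
    return len(new_set & existing_set) / len(new_set)


if __name__ == "__main__":
    # Toy sentences, not taken from any real corpus.
    new = ["रामः गच्छति।", "बालकः  पठति ।", "सा लिखति"]
    existing = ["रामः गच्छति", "अहं खादामि।"]
    print(f"overlap: {overlap_fraction(new, existing):.2%}")
```

Exact matching only catches verbatim duplicates; a fuzzier measure (for example, n-gram overlap) would be needed to detect near-duplicates, and may be closer to what a full comparative analysis requires.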
Abstract
We release Samasāmayik, a novel, meticulously curated, large-scale Hindi-Sanskrit corpus, comprising 92,196 parallel sentences. Unlike most data available in Sanskrit, which focuses on classical era text and poetry, this corpus aggregates data from diverse sources covering contemporary materials, including spoken tutorials, children's magazines, radio conversations, and instruction materials. We benchmark this new dataset by fine-tuning three complementary models (ByT5, NLLB, and IndicTrans-v2) to demonstrate its utility. Our experiments demonstrate that models trained on the Samasāmayik corpus achieve significant performance gains on in-domain test data, while achieving comparable performance on other widely used test sets, establishing a strong new performance baseline for contemporary Hindi-Sanskrit translation. Furthermore, a comparative analysis against existing corpora reveals…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
