Itihasa: A large-scale corpus for Sanskrit to English translation
Rahul Aralikatte, Miryam de Lhoneux, Anoop Kunchukuttan, Anders Søgaard

TL;DR
Itihasa is a large-scale Sanskrit-English translation dataset derived from Indian epics, highlighting the challenges of translating classical Sanskrit texts with current models.
Contribution
The paper introduces Itihasa, a new dataset of 93,000 Sanskrit shlokas paired with English translations, and benchmarks standard translation models on it, revealing their limitations on this complex corpus.
Findings
Standard translation models, including state-of-the-art transformer architectures, perform poorly on the dataset.
The dataset exposes the complexity of translating classical Sanskrit texts.
Empirical analysis of the corpus brings out its nuances.
Abstract
This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.
