Itihasa: A large-scale corpus for Sanskrit to English translation

Rahul Aralikatte; Miryam de Lhoneux; Anoop Kunchukuttan; Anders Søgaard

arXiv:2106.03269·cs.CL·October 7, 2021

TL;DR

Itihasa is a large-scale Sanskrit-English translation dataset derived from Indian epics, highlighting the challenges of translating classical Sanskrit texts with current models.

Contribution

The paper introduces Itihasa, a dataset of 93,000 Sanskrit shlokas paired with English translations, and benchmarks standard translation models on it, showing that even state-of-the-art transformer architectures perform poorly on this corpus.

Findings

01. Standard translation models, including state-of-the-art transformer architectures, perform poorly on the dataset.

02. The dataset exposes the complexity of translating classical Sanskrit texts.

03. Empirical analysis of the corpus brings out nuances of Sanskrit-to-English translation.

Abstract

This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics, viz. the Ramayana and the Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.
