MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan

Sebastian Nehrdich; Kurt Keutzer

arXiv:2601.06400·cs.CL·January 13, 2026

Open Access

TL;DR

This paper introduces MITRA, a comprehensive framework including a large parallel corpus and domain-specific pretrained models for multilingual translation and semantic retrieval of ancient Buddhist texts across Sanskrit, Pāli, Chinese, and Tibetan.

Contribution

The paper presents a novel pipeline for mining multilingual parallel passages, a large-scale corpus of 1.74 million parallel sentence pairs, and domain-specific pretrained models that reach state-of-the-art results in machine translation into English and in semantic embedding tasks.

Findings

01

Gemma 2 MITRA-MT achieves state-of-the-art performance for machine translation of these languages into English, outperforming even much larger open-source models.

02

Gemma 2 MITRA-E excels in semantic embedding benchmarks.

03

The dataset and models are openly available for research and philology.

Abstract

Ancient Buddhist literature features frequent, yet often unannotated, textual parallels spread across diverse languages: Sanskrit, Pāli, Buddhist Chinese, Tibetan, and more. The scale of this material makes manual examination prohibitive. We present the MITRA framework, which consists of a novel pipeline for multilingual parallel passage mining, MITRA-parallel, a large-scale corpus of 1.74 million parallel sentence pairs between Sanskrit, Chinese, and Tibetan, and the development of the domain-specific pretrained language model Gemma 2 MITRA. We present Gemma 2 MITRA-MT, a version of this base model fine-tuned on machine translation tasks, reaching state-of-the-art performance for machine translation of these languages into English and outperforming even much larger open-source models. We also present Gemma 2 MITRA-E, a semantic embedding model that shows state-of-the-art performance…
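To make the idea of embedding-based parallel passage mining concrete, here is a minimal sketch. It is not the MITRA pipeline, whose details are not given on this page. It assumes each sentence has already been mapped to a vector by some embedding model and pairs sentences that are mutual nearest neighbours under cosine similarity. The function names, the `threshold` value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match(query, candidates):
    """Index and score of the candidate most similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Keep (src_index, tgt_index, score) for mutual nearest neighbours.

    A pair survives only if each side is the other's best match and the
    similarity reaches the (assumed) threshold.
    """
    pairs = []
    for i, src in enumerate(src_vecs):
        j, score = best_match(src, tgt_vecs)
        back, _ = best_match(tgt_vecs[j], src_vecs)
        if back == i and score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Toy 3-dimensional vectors standing in for sentence embeddings (made-up values).
sanskrit_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tibetan_vecs = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]]

pairs = mine_pairs(sanskrit_vecs, tibetan_vecs)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

The mutual-nearest-neighbour check is one simple way to suppress spurious matches. A real system at the scale described here would need approximate nearest-neighbour search in place of the exhaustive comparison used in this sketch.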

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy