PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks
Máté Metzger, Nadnapang Phophichit

TL;DR
PaliBench is a new multi-reference translation benchmark for classical languages, enabling more accurate evaluation of machine translation systems by incorporating multiple faithful translations.
Contribution
The paper introduces a reusable workflow for constructing multi-reference benchmarks from scholarly translations, demonstrated on Pali texts and applicable to other classical languages.
Findings
Ten large language models were evaluated with the benchmark.
The different evaluation metrics agreed strongly with one another.
Models varied substantially in reliability and in semantic outliers.
Abstract
Digital humanities projects increasingly rely on machine translation and large language models to widen access to classical, religious, and otherwise under-translated textual traditions. Yet standard translation benchmarks are poorly suited to such materials: they typically compare a system output against a single reference translation, even though classical texts often support multiple faithful renderings that differ in terminology, register, and interpretation. This article introduces PaliBench, both a benchmark for Pali-to-English translation and a reusable method for constructing multi-reference translation benchmarks for classical languages. The Pali case study draws on passages from the Sutta Pitaka aligned with independent English translations by Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi. The workflow combines LLM-assisted alignment of independently segmented…
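The contrast the abstract draws between single-reference and multi-reference evaluation can be illustrated with a small sketch. This is not the paper's metric or pipeline. It assumes a simple whitespace-token F1 as a stand-in score and takes the best match over all available references, which is one common way to use multiple references. The function names `token_f1` and `multi_reference_score` and the example sentences are hypothetical and are not taken from the benchmark.

```python
from collections import Counter


def token_f1(hypothesis: str, reference: str) -> float:
    """Whitespace-token F1 between one hypothesis and one reference.

    A deliberately simple stand-in metric (an assumption for this sketch).
    """
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(hypothesis: str, references: list[str]) -> float:
    """Score a hypothesis against several references, keeping the best match."""
    if not references:
        raise ValueError("at least one reference translation is required")
    return max(token_f1(hypothesis, ref) for ref in references)


# Made-up example: two equally plausible renderings of the same passage.
references = [
    "the mendicant took a seat",
    "the monk sat down",
]
hypothesis = "the monk sat down"

single = token_f1(hypothesis, references[0])
multi = multi_reference_score(hypothesis, references)
print(f"single-reference: {single:.3f}")
print(f"multi-reference:  {multi:.3f}")
```

In this toy case the hypothesis is penalized heavily when only the first reference is available, even though it matches another faithful rendering exactly. Taking the maximum over references removes that penalty. Averaging over references is an alternative design that instead rewards outputs close to all renderings.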
