Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen
James L. Zainaldin, Cameron Pattison, Manuela Marai, Jacob Wu, Mark J. Schiefsky

TL;DR
This study assesses the quality of LLM translations of ancient Greek texts, comparing automated metrics with expert judgment, and identifies key factors influencing translation success and failure.
Contribution
It provides the first systematic expert evaluation of LLM translation quality for ancient languages and highlights the impact of terminology rarity on translation failures.
Findings
On the previously translated expository text, LLMs achieved high quality (mean MQM score 95.2/100).
On the previously untranslated pharmacological text, quality was lower (79.9/100) and bimodally distributed, with failures linked to terminology density.
Automated metrics correlated only moderately with expert human judgment, especially on texts of variable quality.
Abstract
Purpose: This study evaluates the quality of commercial large language model (LLM) machine translation (MT) for Ancient Greek technical prose and benchmarks standard automated MT evaluation metrics against expert human judgment. Design: We evaluated 60 translations by three LLMs (ChatGPT, Claude, Gemini) of 20 paragraph-length passages from 2 works by the Greek physician Galen (c. 129-216 CE): an expository text with two published English translations and a pharmacological text never before translated. Quality was assessed using seven automated metrics and systematic reference-free human evaluation via a modified Multidimensional Quality Metrics (MQM) framework applied by domain specialists. Findings: On the translated expository text, LLMs achieved high quality (mean MQM score 95.2/100). On the untranslated pharmacological text, quality was lower (79.9/100) but bimodally…
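As a rough illustration of what benchmarking an automated metric against expert judgment can involve, the sketch below computes a Spearman rank correlation between per-translation metric scores and MQM scores. The abstract does not say which correlation statistic the authors used, so the choice of Spearman is an assumption, and the scores are invented placeholders, not data from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five translations, for illustration only.
metric_scores = [0.62, 0.71, 0.55, 0.80, 0.67]  # some automated metric
mqm_scores = [91.0, 96.5, 70.0, 98.0, 88.0]     # expert MQM scores out of 100

rho = spearman(metric_scores, mqm_scores)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based statistic is one reasonable choice here because MQM scores and automated metrics live on different scales; only the ordering of translations is compared.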
