OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages

Raphaël Merx; Hanna Suominen; Trevor Cohn; Ekaterina Vylomova

arXiv:2508.16048 · cs.CL · October 7, 2025

TL;DR

This paper introduces OpenWHO, a new document-level parallel corpus for health translation in low-resource languages, and demonstrates that large language models outperform traditional machine translation models on this dataset.

Contribution

The paper presents OpenWHO, a document-level parallel corpus for health translation built from expert-authored, professionally translated material, and uses it to compare large language models against traditional MT models on low-resource languages.

Findings

01

LLMs outperform traditional MT models on the OpenWHO dataset.

02

Gemini 2.5 Flash scores +4.79 ChrF points higher than NLLB-54B on the low-resource test set.

03

Document-level context benefits are most significant in specialized domains like health.

Abstract

In machine translation (MT), health is a high-stakes domain characterised by widespread deployment and domain-specific vocabulary. However, there is a lack of MT evaluation datasets for low-resource languages in this domain. To address this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences from the World Health Organization's e-learning platform. Sourced from expert-authored, professionally translated materials shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages, of which nine are low-resource. Leveraging this new resource, we evaluate modern large language models (LLMs) against traditional MT models. Our findings reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our low-resource test set. Further, we investigate how…
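The headline result above is reported in ChrF points. As a rough illustration of what that metric measures, here is a minimal sentence-level sketch of a character n-gram F-score in Python. It is not the authors' evaluation code: the function names `char_ngrams` and `chrf`, the choice to drop spaces, and the defaults (`max_n=6`, `beta=2.0`) are assumptions made for this example, and established toolkits differ in whitespace handling and corpus-level aggregation, so their scores will not match this sketch exactly.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of length n, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score on a 0-100 scale.

    Precision and recall are averaged over the n-gram orders for which both
    strings contain at least one n-gram, then combined into an F-beta score.
    With beta = 2, recall is weighted more heavily than precision.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one string is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # Hypothetical example strings, not taken from the OpenWHO corpus.
    reference = "wash your hands with soap and water"
    print(chrf("wash your hands with soap and water", reference))
    print(chrf("wash hands using soap and water", reference))
```

Because the score is built from character overlap rather than whole-word matches, a translation with a slightly different inflection still earns partial credit, which is one reason character-level metrics are often used for morphologically rich and low-resource languages.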

Taxonomy

Topics: Natural Language Processing Techniques