PEACH: A sentence-aligned Parallel English-Arabic Corpus for Healthcare
Rania Al-Sabbagh

TL;DR
PEACH is a high-quality, sentence-aligned English-Arabic healthcare corpus designed to support translation, NLP research, and educational applications, containing over 51,000 sentence pairs and available for public use.
Contribution
The paper introduces PEACH, a manually aligned, domain-specific parallel corpus of English-Arabic healthcare texts, filling a gap in resources for translation and NLP research.
Findings
Contains 51,671 sentence pairs with high alignment quality
Enables development of domain-specific machine translation models
Supports research in contrastive linguistics and readability assessment
Abstract
This paper introduces PEACH, a sentence-aligned parallel English-Arabic corpus of healthcare texts encompassing patient information leaflets and educational materials. The corpus contains 51,671 parallel sentences, totaling approximately 590,517 English and 567,707 Arabic word tokens. Average sentence length varies between 9.52 and 11.83 words. Because it is manually aligned, PEACH is a gold-standard resource for researchers in contrastive linguistics, translation studies, and natural language processing. It can be used to derive bilingual lexicons, adapt large language models for domain-specific machine translation, evaluate user perceptions of machine translation in healthcare, and assess the readability and lay-friendliness of patient information leaflets and educational materials. It can also serve as an educational resource in translation studies. PEACH is publicly accessible.
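The corpus statistics quoted in the abstract can be reproduced in spirit with a few lines of Python. The sketch below is illustrative only: the two sentence pairs are invented stand-ins rather than PEACH data, the `corpus_stats` helper is hypothetical, and whitespace tokenization is an assumption, since the tokenization used for the reported counts is not described here. The last two lines simply divide the reported token totals by the reported number of sentence pairs.

```python
from statistics import mean

# Invented toy pairs (NOT taken from PEACH); the real file format is not described here.
pairs = [
    ("Take one tablet daily.", "تناول قرصًا واحدًا يوميًا."),
    ("Do not exceed the stated dose.", "لا تتجاوز الجرعة المحددة."),
]


def corpus_stats(pairs):
    """Count pairs and whitespace-delimited tokens on each side (assumed tokenization)."""
    en_lengths = [len(en.split()) for en, _ in pairs]
    ar_lengths = [len(ar.split()) for _, ar in pairs]
    return {
        "pairs": len(pairs),
        "en_tokens": sum(en_lengths),
        "ar_tokens": sum(ar_lengths),
        "en_mean_len": mean(en_lengths),
        "ar_mean_len": mean(ar_lengths),
    }


stats = corpus_stats(pairs)

# Whole-corpus averages implied by the totals reported in the abstract.
reported_en_mean = round(590_517 / 51_671, 2)
reported_ar_mean = round(567_707 / 51_671, 2)
```

Dividing the reported totals gives roughly 11.43 English and 10.99 Arabic tokens per sentence, both inside the 9.52 to 11.83 range stated in the abstract.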
