PEACH: A sentence-aligned Parallel English-Arabic Corpus for Healthcare

Rania Al-Sabbagh

arXiv:2508.05722 · cs.CL · August 11, 2025

TL;DR

PEACH is a manually aligned, sentence-level English-Arabic healthcare corpus of 51,671 sentence pairs. It is publicly available and designed to support translation, NLP research, and educational applications.

Contribution

The paper introduces PEACH, a manually aligned, domain-specific parallel corpus of English-Arabic healthcare texts, filling a gap in resources for translation and NLP research.

Findings

01

Contains 51,671 manually aligned sentence pairs, making it a gold-standard resource

02

Enables development of domain-specific machine translation models

03

Supports research in contrastive linguistics and readability assessment

Abstract

This paper introduces PEACH, a sentence-aligned parallel English-Arabic corpus of healthcare texts encompassing patient information leaflets and educational materials. The corpus contains 51,671 parallel sentences, totaling approximately 590,517 English and 567,707 Arabic word tokens. Sentence lengths vary between 9.52 and 11.83 words on average. As a manually aligned corpus, PEACH is a gold-standard corpus, aiding researchers in contrastive linguistics, translation studies, and natural language processing. It can be used to derive bilingual lexicons, adapt large language models for domain-specific machine translation, evaluate user perceptions of machine translation in healthcare, and assess the readability and lay-friendliness of patient information leaflets and educational materials. It can also serve as an educational resource in translation studies. PEACH is publicly accessible.
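
As a quick sanity check on the reported statistics, the corpus-wide mean sentence length can be derived from the totals in the abstract. The sketch below is only arithmetic on those three published figures; the function and variable names are illustrative and do not come from the paper. The abstract does not state how the 9.52 to 11.83 range is broken down, so these means are a separate, derived quantity.

```python
# Totals as reported in the abstract.
SENTENCE_PAIRS = 51_671
ENGLISH_TOKENS = 590_517
ARABIC_TOKENS = 567_707


def mean_tokens_per_sentence(total_tokens: int, sentences: int) -> float:
    """Return the corpus-wide mean number of word tokens per sentence."""
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    return total_tokens / sentences


# Each sentence pair contributes one English and one Arabic sentence,
# so both sides are divided by the same sentence count.
mean_en = mean_tokens_per_sentence(ENGLISH_TOKENS, SENTENCE_PAIRS)
mean_ar = mean_tokens_per_sentence(ARABIC_TOKENS, SENTENCE_PAIRS)
token_ratio = ENGLISH_TOKENS / ARABIC_TOKENS

print(f"English: {mean_en:.2f} tokens per sentence")      # 11.43
print(f"Arabic:  {mean_ar:.2f} tokens per sentence")      # 10.99
print(f"English/Arabic token ratio: {token_ratio:.3f}")   # 1.040
```

Both derived means (about 11.43 for English and 10.99 for Arabic) fall inside the 9.52 to 11.83 range quoted in the abstract, so the reported totals and the reported range are at least mutually consistent.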
