Sentence Alignment with Parallel Documents Facilitates Biomedical Machine Translation
Shengxuan Luo, Huaiyuan Ying, Jiao Li, Sheng Yu

TL;DR
This paper introduces an unsupervised sentence alignment algorithm that creates biomedical parallel corpora from document translations, significantly improving neural machine translation quality in specialized biomedical domains.
Contribution
It presents a novel unsupervised method for aligning sentences in biomedical texts, enabling the creation of high-quality parallel corpora for NMT training without manual annotation.
Findings
The algorithm aligns sentences accurately in 1-to-1 cases.
It outperforms competing algorithms on many-to-many alignments.
Biomedical NMT models gained over 17 BLEU points after fine-tuning.
Abstract
Objective: Today's neural machine translation (NMT) can achieve near human-level translation quality and greatly facilitates international communications, but the lack of parallel corpora poses a key problem to the development of translation systems for highly specialized domains, such as biomedicine. This work presents an unsupervised algorithm for deriving parallel corpora from document-level translations by using sentence alignment and explores how training materials affect the performance of biomedical NMT systems. Materials and Methods: Document-level translations are mixed to train bilingual word embeddings (BWEs) for the evaluation of cross-lingual word similarity, and sentence distance is defined by combining semantic and positional similarities of the sentences. The alignment of sentences is formulated as an extended earth mover's distance problem. A Chinese-English biomedical…
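The abstract describes a sentence distance that combines semantic and positional similarity. A minimal sketch of that idea follows, under loud assumptions: the sentence vectors are hypothetical toy values rather than trained BWEs, the weight `alpha` and the linear mix are illustrative choices not taken from the paper, and the final nearest-target step is a simple stand-in, not the extended earth mover's distance formulation the authors use.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def relative_position(index, count):
    """Map a sentence index to [0, 1] so documents of different lengths compare."""
    return index / (count - 1) if count > 1 else 0.0


def sentence_distance_matrix(src_vecs, tgt_vecs, alpha=0.5):
    """Mix semantic and positional distance for every source/target pair.

    alpha is an assumed weighting, not a value from the paper.
    """
    n, m = len(src_vecs), len(tgt_vecs)
    matrix = []
    for i, u in enumerate(src_vecs):
        row = []
        for j, v in enumerate(tgt_vecs):
            semantic = cosine_distance(u, v)
            positional = abs(relative_position(i, n) - relative_position(j, m))
            row.append(alpha * semantic + (1.0 - alpha) * positional)
        matrix.append(row)
    return matrix


def nearest_targets(matrix):
    """Stand-in for the alignment step: pick the closest target per source."""
    return [min(range(len(row)), key=row.__getitem__) for row in matrix]


# Hypothetical toy sentence vectors (e.g. averaged word embeddings).
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.9]]

dist = sentence_distance_matrix(src, tgt)
alignment = nearest_targets(dist)
print(alignment)
```

On this toy input each source sentence selects the target at the same index. A per-row nearest match can only express alignments where each source maps to one target, which is why a transport-style formulation such as the one named in the abstract is needed for many-to-many cases.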
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
