Polish-English Statistical Machine Translation of Medical Texts

Krzysztof Wołk; Krzysztof Marasek

arXiv:1509.08909·cs.CL·September 30, 2015

TL;DR

This study investigates various training techniques and data preparation methods to improve Polish-English statistical machine translation for medical texts, using diverse models and evaluation metrics.

Contribution

The paper presents a comparative analysis of system configurations and data preprocessing strategies for Polish-English translation of medical texts.

Findings

01

POS tagging and factored models improve translation quality

02

Hierarchical models and syntactic tags enhance accuracy

03

Data normalization techniques positively impact results

Abstract

This research explores the effects of various training methods on a Polish-to-English Statistical Machine Translation system for medical texts. Various elements of the EMEA parallel text corpora from the OPUS project were used as the basis for training phrase tables and language models, and for development, tuning and testing of the translation system. The BLEU, NIST, METEOR, RIBES and TER metrics were used to evaluate the effects of various system and data preparations on translation results. Our experiments included systems that used POS tagging, factored phrase models, hierarchical models, syntactic taggers, and many different alignment methods. We also conducted a deep analysis of the Polish data as preparatory work for automatic data correction steps such as truecasing and punctuation normalization.
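The abstract names truecasing and punctuation normalization as data correction steps but does not describe how they were implemented. The following is a minimal sketch of what such preprocessing can look like, not the authors' pipeline: the character mapping `PUNCT_MAP`, the function names, and the frequency-based truecasing rule are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical mapping from typographic punctuation to ASCII equivalents.
PUNCT_MAP = {
    "\u201e": '"',    # low opening quote, common in Polish
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2019": "'",    # right single quote
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # ellipsis
}


def normalize_punctuation(text):
    """Map typographic punctuation to ASCII and tidy spacing."""
    for src, dst in PUNCT_MAP.items():
        text = text.replace(src, dst)
    # Drop whitespace that precedes a punctuation mark.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def train_truecaser(sentences):
    """Learn the most frequent surface form of each token.

    Sentence-initial tokens are skipped because their capitalization
    reflects position rather than the word's usual casing.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}


def truecase(sentence, model):
    """Rewrite only the sentence-initial token to its learned casing.

    Tokens unseen in training are left unchanged.
    """
    tokens = sentence.split()
    if not tokens:
        return sentence
    tokens[0] = model.get(tokens[0].lower(), tokens[0])
    return " ".join(tokens)
```

In this sketch a truecasing model would be trained on tokenized training sentences and then applied to both sides of the parallel data before alignment, so that sentence-initial capitalization does not split one word into two vocabulary entries while acronyms such as EMEA keep their casing.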
