Polish - English Speech Statistical Machine Translation Systems for the IWSLT 2014

Krzysztof Wołk; Krzysztof Marasek

arXiv:1509.08874·cs.CL·September 30, 2015

TL;DR

This paper examines how different training configurations affect the quality of Polish-English statistical machine translation of spoken language, as measured by BLEU, NIST, METEOR, and TER. It also explores morphological information and data cleaning as ways to improve translation performance.

Contribution

The paper introduces training setups and data preparation methods for Polish-English translation systems, including the use of lemma and morphological information for Polish words.

Findings

01 Data cleaning significantly improves translation quality.

02 Morphological information enhances translation accuracy.

03 Multiple evaluation metrics confirm the effectiveness of the proposed methods.

Abstract

This research explores the effects of various training settings on statistical machine translation systems between Polish and English for spoken language. Various elements of the TED parallel text corpora for the IWSLT 2014 evaluation campaign, together with Wikipedia-based comparable corpora that we prepared, were used as the basis for training language models and for developing, tuning, and testing the translation system. The BLEU, NIST, METEOR, and TER metrics were used to evaluate the effects of data preparation on translation results. Our experiments included systems that use lemma and morphological information for Polish words. We also conducted an in-depth analysis of the provided Polish data as preparatory work for the automatic data correction and cleaning phase.
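To make the first of the metrics named in the abstract concrete, the following is a minimal sketch of sentence-level BLEU, assuming a single reference, whitespace-tokenized input, and no smoothing. It is an illustration only, not the evaluation tooling used in the paper, which the abstract does not describe. BLEU is normally reported at corpus level, and the function names `ngrams` and `bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Unsmoothed, single-reference BLEU for one tokenized sentence pair.

    Illustrative assumption: real evaluations aggregate n-gram counts over
    a whole test set and often use several references per sentence.
    """
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            # Any zero precision makes the geometric mean zero.
            return 0.0
        log_precisions.append(math.log(matched / total))
    # The brevity penalty lowers the score of hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(reference) / len(hypothesis))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

For example, `bleu("the cat sat on the mat".split(), "the cat sat on the mat".split())` scores an exact match, while a hypothesis that shares no words with the reference scores zero. Because the sketch is unsmoothed, any short hypothesis with no matching 4-gram also scores zero, which is one reason corpus-level aggregation is preferred in practice.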
