PJAIT Systems for the IWSLT 2015 Evaluation Campaign Enhanced by Comparable Corpora
Krzysztof Wołk, Krzysztof Marasek

TL;DR
This paper describes Statistical Machine Translation (SMT) systems for Czech-English, Vietnamese-English, French-English, and German-English (in both directions), enhanced with Wikipedia-based comparable corpora, domain adaptation, and symmetrized word alignment models to improve translation quality.
Contribution
Applies data adaptation techniques, including Wikipedia-based comparable corpora and domain adaptation, together with symmetrized word alignment models, to improve SMT performance across four language pairs.
Findings
Positive impact on SMT quality demonstrated by BLEU, NIST, and TER metrics
Effective use of Wikipedia-based comparable corpora for training and testing
Improved translation results through domain adaptation and advanced alignment methods
Abstract
In this paper, we attempt to improve Statistical Machine Translation (SMT) systems on a very diverse set of language pairs (in both directions): Czech - English, Vietnamese - English, French - English and German - English. To accomplish this, we performed translation model training, created adaptations of training settings for each language pair, and obtained comparable corpora for our SMT systems. Innovative tools and data adaptation techniques were employed. The TED parallel text corpora for the IWSLT 2015 evaluation campaign were used to train language models, and to develop, tune, and test the system. In addition, we prepared Wikipedia-based comparable corpora for use with our SMT system. This data was specified as permissible for the IWSLT 2015 evaluation. We explored the use of domain adaptation techniques, symmetrized word alignment models, the unsupervised transliteration models…
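The abstract names symmetrized word alignment models without describing them. As an illustration only, the sketch below shows two common symmetrization heuristics, intersection and union, applied to a pair of directional alignments. The function name `symmetrize` and the toy alignment links are hypothetical, and the heuristic the authors actually used is not stated in the text above.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Combine two directional word alignments (illustrative sketch).

    src_to_tgt: set of (source_index, target_index) links
                from a source-to-target alignment model.
    tgt_to_src: set of (target_index, source_index) links
                from a target-to-source alignment model.
    Returns (intersection, union), both as (source_index, target_index) sets.
    """
    # Flip the reverse alignment so both sets use (source_index, target_index).
    flipped = {(s, t) for (t, s) in tgt_to_src}
    intersection = src_to_tgt & flipped  # links both directions agree on
    union = src_to_tgt | flipped         # links proposed by either direction
    return intersection, union


# Toy example: hypothetical links for a four-word source, three-word target.
forward = {(0, 0), (1, 1), (2, 2)}
backward = {(0, 0), (1, 1), (3, 2)}  # stored as (target_index, source_index)
inter, uni = symmetrize(forward, backward)
```

The intersection tends to favor precision and the union tends to favor recall; heuristics used in practice usually start from the intersection and selectively add links from the union.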
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
