Incorporating Word and Subword Units in Unsupervised Machine Translation Using Language Model Rescoring

Zihan Liu; Yan Xu; Genta Indra Winata; Pascale Fung

arXiv:1908.05925·cs.CL·November 19, 2019·6 cites

TL;DR

This paper presents an unsupervised German-to-Czech machine translation system that combines word-level and subword-level NMT models and rescores translation candidates with a pre-trained language model, without using any parallel data.

Contribution

It introduces a rescoring mechanism that reuses a pre-trained language model to select among beam-search candidates, and it trains BPE embeddings for German and Czech separately before aligning them with MUSE.

Findings

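01. The authors report improved translation fluency and accuracy on the WMT'19 German-to-Czech news task.

02. Training BPE embeddings separately for German and Czech is used to handle the morphological richness of the two languages.

03. Rescoring beam-search candidates with the pre-trained language model is used to improve the fluency and consistency of translations.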

Abstract

This paper describes CAiRE's submission to the unsupervised machine translation track of the WMT'19 news shared task from German to Czech. We leverage a phrase-based statistical machine translation (PBSMT) model and a pre-trained language model to combine word-level neural machine translation (NMT) and subword-level NMT models without using any parallel data. We propose to solve the morphological richness problem of languages by training byte-pair encoding (BPE) embeddings for German and Czech separately and then aligning them using MUSE (Conneau et al., 2018). To ensure the fluency and consistency of translations, we propose a rescoring mechanism that reuses the pre-trained language model to select among the translation candidates generated through beam search. Moreover, a series of pre-processing and post-processing approaches are applied to improve the quality of the final translations.
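The rescoring step described in the abstract can be illustrated with a minimal sketch. It assumes that candidates are ranked by language model perplexity alone; the abstract does not state the exact scoring formula, so this is one plausible reading rather than the authors' implementation. The names `perplexity`, `rescore`, and `make_unigram_lm` are hypothetical, and the add-one-smoothed unigram model is only a toy stand-in for the pre-trained language model.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("cannot score an empty sentence")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rescore(
    candidates: List[str],
    lm_token_logprobs: Callable[[str], Sequence[float]],
) -> str:
    """Return the beam-search candidate with the lowest LM perplexity.

    Assumption: perplexity is the only ranking signal. A real system
    might also combine it with the translation model's own score.
    """
    return min(candidates, key=lambda c: perplexity(lm_token_logprobs(c)))


def make_unigram_lm(corpus: List[str]) -> Callable[[str], List[float]]:
    """Toy stand-in for a pre-trained LM: add-one-smoothed unigrams."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens

    def score(sentence: str) -> List[float]:
        return [
            math.log((counts[tok] + 1) / (total + vocab_size))
            for tok in sentence.split()
        ]

    return score


# Usage: pick the most fluent of two hypothetical beam candidates.
lm = make_unigram_lm(["the cat sat", "the dog sat"])
best = rescore(["the cat sat", "cat the zzz"], lm)
```

Using perplexity rather than the raw sum of log-probabilities normalizes for length, so shorter candidates are not favored simply for having fewer tokens.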

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis