Explicit Cross-lingual Pre-training for Unsupervised Machine Translation
Shuo Ren, Yu Wu, Shujie Liu, Ming Zhou, Shuai Ma

TL;DR
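This paper introduces an explicit cross-lingual pre-training method for unsupervised machine translation. It computes cross-lingual n-gram embeddings, infers an n-gram translation table from them, and uses those translation pairs to train a new Cross-lingual Masked Language Model (CMLM), with the aim of improving translation quality.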
Contribution
It proposes a new pre-training approach that incorporates explicit cross-lingual signals via n-gram translation, enhancing unsupervised translation performance.
Findings
Significant improvement in unsupervised translation quality
Effective integration of explicit cross-lingual information
Demonstrated benefits of n-gram based pre-training
Abstract
Pre-training has proven to be effective in unsupervised machine translation due to its ability to model deep context information in cross-lingual scenarios. However, the cross-lingual information obtained from shared BPE spaces is inexplicit and limited. In this paper, we propose a novel cross-lingual pre-training method for unsupervised machine translation by incorporating explicit cross-lingual training signals. Specifically, we first calculate cross-lingual n-gram embeddings and infer an n-gram translation table from them. With those n-gram translation pairs, we propose a new pre-training model called Cross-lingual Masked Language Model (CMLM), which randomly chooses source n-grams in the input text stream and predicts their translation candidates at each time step. Experiments show that our method can incorporate beneficial cross-lingual information into pre-trained models. Taking…
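The pipeline described in the abstract (n-gram embeddings, then a translation table, then masked n-grams paired with translation candidates) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the table is inferred or how the model predicts candidates, so the sketch assumes cosine-similarity nearest neighbours over embeddings that already live in a shared space, and it only builds the training example rather than the model. The function names, the `[MASK]` token, and the toy vectors are all hypothetical.

```python
import random
import numpy as np


def infer_translation_table(src_ngrams, src_emb, tgt_ngrams, tgt_emb, top_k=1):
    """Map each source n-gram to its top_k nearest target n-grams.

    Assumption: both embedding matrices are already in a shared
    cross-lingual space, and cosine similarity is a reasonable score.
    """
    src_norm = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt_norm = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src_norm @ tgt_norm.T  # shape: (num_src, num_tgt)
    table = {}
    for i, ngram in enumerate(src_ngrams):
        best = np.argsort(-sims[i])[:top_k]  # highest similarity first
        table[ngram] = [tgt_ngrams[j] for j in best]
    return table


def make_cmlm_example(tokens, table, max_n=3, rng=None, mask="[MASK]"):
    """Randomly choose one source n-gram that has a table entry and mask it.

    Returns (masked_tokens, (start, end), translation_candidates).
    How the model consumes the candidates is left out of this sketch.
    """
    rng = rng or random.Random(0)
    spans = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            if tuple(tokens[start:start + n]) in table:
                spans.append((start, n))
    if not spans:
        return list(tokens), None, []
    start, n = rng.choice(spans)
    masked = list(tokens[:start]) + [mask] * n + list(tokens[start + n:])
    return masked, (start, start + n), table[tuple(tokens[start:start + n])]


# Toy data (hypothetical): two English n-grams, two French n-grams.
src_ngrams = [("good",), ("good", "morning")]
tgt_ngrams = [("bon",), ("bonjour",)]
src_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt_emb = np.array([[0.9, 0.1], [0.1, 0.9]])

table = infer_translation_table(src_ngrams, src_emb, tgt_ngrams, tgt_emb)
masked, span, candidates = make_cmlm_example(["good", "morning", "world"], table)
```

In this toy setup the unigram `("good",)` maps to `("bon",)` and the bigram `("good", "morning")` maps to `("bonjour",)`. Which of the two spans gets masked depends on the random choice, so the sketch makes no claim about it.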
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Byte Pair Encoding
