EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation
Yulin Xu, Zhen Yang, Fandong Meng, Jie Zhou

TL;DR
This paper introduces EAG ("Extract and Generate"), a two-step method for constructing large-scale, high-quality multi-way aligned corpora from bilingual data for multi-lingual neural machine translation, with reported gains of +1.1 BLEU on WMT-5 and +1.4 BLEU on OPUS-100.
Contribution
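EAG is a two-step approach that first extracts candidate aligned examples from bilingual corpora and then generates the final multi-way aligned examples from them, increasing the scale and quality of training data for Complete Multi-lingual Neural Machine Translation (C-MNMT).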
Findings
Achieved a +1.1 BLEU improvement on the WMT-5 dataset.
Achieved a +1.4 BLEU improvement on the OPUS-100 dataset.
Constructed large-scale, diverse multi-way aligned corpora.
Abstract
Complete Multi-lingual Neural Machine Translation (C-MNMT) achieves superior performance against the conventional MNMT by constructing multi-way aligned corpus, i.e., aligning bilingual training examples from different language pairs when either their source or target sides are identical. However, since exactly identical sentences from different language pairs are scarce, the power of the multi-way aligned corpus is limited by its scale. To handle this problem, this paper proposes "Extract and Generate" (EAG), a two-step approach to construct large-scale and high-quality multi-way aligned corpus from bilingual data. Specifically, we first extract candidate aligned examples by pairing the bilingual examples from different language pairs with highly similar source or target sentences; and then generate the final aligned examples from the candidates with a well-trained generation model.…
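The extraction step described in the abstract can be illustrated with a minimal sketch. It pairs examples from two bilingual corpora when their shared-language sides are highly similar. Everything here is an assumption for illustration, not the authors' implementation: the `similarity` and `extract_candidates` helpers are hypothetical, the token-level `difflib` ratio stands in for whatever similarity measure the paper uses, and the 0.8 threshold and toy sentences are made up. The generation step, which produces the final aligned examples with a trained model, is not shown.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1] (assumed stand-in for the paper's measure)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def extract_candidates(corpus_xy, corpus_xz, threshold=0.8):
    """Pair (x, y) and (x', z) examples whose shared-language sides are highly similar.

    Returns a list of (x, y, x', z, score) tuples. The threshold is a
    hypothetical value chosen for this toy example.
    """
    candidates = []
    # Brute-force comparison of every pair; fine for a toy corpus only.
    for x1, y in corpus_xy:
        for x2, z in corpus_xz:
            score = similarity(x1, x2)
            if score >= threshold:
                candidates.append((x1, y, x2, z, score))
    return candidates


# Toy English-German and English-French corpora (invented sentences).
en_de = [
    ("the cat sat on the mat today", "die Katze saß heute auf der Matte"),
    ("i like green tea", "ich mag grünen Tee"),
]
en_fr = [
    ("the cat sat on the mat", "le chat était assis sur le tapis"),
    ("where is the station", "où est la gare"),
]

candidates = extract_candidates(en_de, en_fr)
for x1, y, x2, z, score in candidates:
    print(f"{score:.3f} | {x1} | {y} | {x2} | {z}")
```

Note that the two English sentences in the extracted pair are similar but not identical, so the German and French sides are only approximately aligned. This is why, per the abstract, a generation model is then used to produce the final aligned example from each candidate.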
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques
