Tencent AI Lab - Shanghai Jiao Tong University Low-Resource Translation System for the WMT22 Translation Task

Zhiwei He; Xing Wang; Zhaopeng Tu; Shuming Shi; Rui Wang

arXiv:2210.08742·cs.CL·October 18, 2022

TL;DR

This paper presents a low-resource translation system for English-Livonian, using novel adaptation, data augmentation, and evaluation techniques to improve translation quality in a challenging language pair.

Contribution

The authors introduce a novel transfer and adaptation approach for low-resource translation, including cross-model embedding alignment and pseudo-parallel data generation.

Findings

01. Achieved BLEU scores of 17.0 and 30.4 for English-Livonian translation.

02. Identified Unicode normalization issues affecting translation performance.

03. Validated round-trip BLEU as a more appropriate evaluation metric.
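To illustrate the second finding, here is a minimal sketch of how inconsistent Unicode normalization can depress a surface-overlap score. It is not the authors' evaluation code: `token_overlap` and `normalized_overlap` are hypothetical helpers, the overlap ratio is only a toy stand-in for BLEU, and the two-word example string is chosen for its diacritics rather than taken from the paper's data.

```python
import unicodedata


def token_overlap(hypothesis: str, reference: str) -> float:
    """Toy stand-in for BLEU: share of hypothesis tokens found in the reference."""
    hyp_tokens = hypothesis.split()
    ref_tokens = set(reference.split())
    if not hyp_tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in hyp_tokens) / len(hyp_tokens)


def normalized_overlap(hypothesis: str, reference: str, form: str = "NFC") -> float:
    """Apply the same Unicode normalization form to both sides before scoring."""
    return token_overlap(
        unicodedata.normalize(form, hypothesis),
        unicodedata.normalize(form, reference),
    )


# The same two words, once precomposed (NFC) and once decomposed (NFD).
# They render identically but are different code point sequences.
hypothesis = unicodedata.normalize("NFC", "Līvõ kēļ")
reference = unicodedata.normalize("NFD", "Līvõ kēļ")

raw_score = token_overlap(hypothesis, reference)         # forms differ
fixed_score = normalized_overlap(hypothesis, reference)  # forms agree
print(raw_score, fixed_score)
```

The two strings look the same on screen, yet the unnormalized comparison finds no matching tokens, while normalizing both sides to one form restores a full match. Any string-matching metric is exposed to the same effect when hypotheses and references use different normalization forms.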

Abstract

This paper describes Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) Low-Resource Translation systems for the WMT22 shared task. We participate in the general translation task on English $\Leftrightarrow$ Livonian. Our system is based on M2M100 with novel techniques that adapt it to the target language pair. (1) Cross-model word embedding alignment: inspired by cross-lingual word embedding alignment, we successfully transfer a pre-trained word embedding to M2M100, enabling it to support Livonian. (2) Gradual adaptation strategy: we exploit Estonian and Latvian as auxiliary languages for many-to-many translation training and then adapt to English-Livonian. (3) Data augmentation: to enlarge the parallel data for English-Livonian, we construct pseudo-parallel data with Estonian and Latvian as pivot languages. (4) Fine-tuning: to make the most of all available data, we fine-tune the…
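The data augmentation step in item (3) can be sketched as follows, under the assumption that pivoting means translating the pivot-language side of existing pivot-Livonian sentence pairs into English. The abstract does not spell out the procedure, so this is one plausible reading rather than the authors' pipeline. `build_pseudo_parallel`, `toy_translate`, and the tiny dictionary are hypothetical, and the Livonian sides are placeholders, not real sentences.

```python
from typing import Callable, Iterable, List, Tuple

SentencePair = Tuple[str, str]


def build_pseudo_parallel(
    pivot_liv_pairs: Iterable[SentencePair],
    pivot_to_english: Callable[[str], str],
) -> List[SentencePair]:
    """Turn (pivot, Livonian) pairs into synthetic (English, Livonian) pairs."""
    pseudo_pairs = []
    for pivot_sentence, liv_sentence in pivot_liv_pairs:
        english = pivot_to_english(pivot_sentence).strip()
        if english:  # drop pairs whose pivot side could not be translated
            pseudo_pairs.append((english, liv_sentence))
    return pseudo_pairs


# Hypothetical stand-in for a translation model (assumption, for illustration only).
TOY_ET_EN = {"Tere": "Hello", "Aitäh": "Thank you"}


def toy_translate(sentence: str) -> str:
    """Look up an Estonian sentence; return an empty string when unknown."""
    return TOY_ET_EN.get(sentence, "")


# Placeholder Livonian sides; a real corpus would hold actual sentences.
estonian_livonian = [
    ("Tere", "liv_sentence_1"),
    ("Aitäh", "liv_sentence_2"),
    ("Tundmatu", "liv_sentence_3"),  # not in the toy dictionary, so it is dropped
]

pseudo_en_liv = build_pseudo_parallel(estonian_livonian, toy_translate)
print(pseudo_en_liv)
```

In practice `pivot_to_english` would wrap a translation model for Estonian or Latvian, and the resulting pairs would be mixed with the genuine English-Livonian data during training.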

Code & Models

Repositories

zwhe99/wmt22-en-liv (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis