The VolcTrans System for WMT22 Multilingual Machine Translation Task

Xian Qian; Kai Hu; Jiaqiang Wang; Yifeng Liu; Xingyuan Pan; Jun Cao; Mingxuan Wang

arXiv:2210.11599·cs.CL·October 24, 2022

Open Access

TL;DR

The paper presents VolcTrans, a transformer-based multilingual machine translation system submitted to the unconstrained track of the WMT22 shared task. It is trained on public, external, and back-translated data that has been cleaned with heuristic rules.

Contribution

Introduces a large-scale multilingual transformer model trained on multiple data sources, including external corpora and pseudo bitext, with heuristic cleaning for improved translation performance.

Findings

01. Achieved 17.3 BLEU, 21.9 spBLEU, and 41.9 chrF2++ on the official test set, averaged over all language pairs.

02. Maintained an average inference speed of 11.5 sentences per second on a single Nvidia Tesla V100 GPU.

03. Combined external data with heuristic data cleaning for multilingual translation.

Abstract

This report describes our VolcTrans system for the WMT22 shared task on large-scale multilingual machine translation. We participated in the unconstrained track, which allows the use of external resources. Our system is a transformer-based multilingual model trained on data from multiple sources, including the public training set from the data track, NLLB data provided by Meta AI, self-collected parallel corpora, and pseudo bitext from back-translation. A series of heuristic rules clean both bilingual and monolingual texts. On the official test set, our system achieves 17.3 BLEU, 21.9 spBLEU, and 41.9 chrF2++ on average over all language pairs. The average inference speed is 11.5 sentences per second using a single Nvidia Tesla V100 GPU. Our code and trained models are available at https://github.com/xian8/wmt22
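The abstract mentions heuristic rules for cleaning bilingual text but does not list them here. As a rough illustration of what such rule-based filtering can look like, the sketch below applies a few common heuristics: Unicode normalization, empty-side removal, a length cap, a length-ratio check, a copy check, and deduplication. The function names (`normalize`, `keep_pair`, `clean_bitext`), the thresholds, and the choice of rules are all assumptions made for this example; they are not taken from the VolcTrans system.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", text)).strip()


def keep_pair(src: str, tgt: str, max_len: int = 250, max_ratio: float = 3.0) -> bool:
    """Return True if a (source, target) pair passes every illustrative rule.

    The thresholds are hypothetical defaults, not values from the paper.
    Lengths are counted in whitespace-separated tokens, which is only a
    crude proxy for languages written without spaces.
    """
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False  # one side is empty
    if src_len > max_len or tgt_len > max_len:
        return False  # overly long sentence
    if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
        return False  # lengths are too mismatched to be a translation
    if src == tgt:
        return False  # likely an untranslated copy
    return True


def clean_bitext(pairs):
    """Normalize, filter, and deduplicate (source, target) pairs, keeping order."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not keep_pair(src, tgt) or (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned
```

Calling `clean_bitext` on a list of sentence pairs returns only the pairs that survive every rule, in their original order. A real pipeline would likely add language identification and script checks, which are omitted here to keep the sketch dependency-free.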

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Test · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings