Improving Simultaneous Machine Translation with Monolingual Data

Hexuan Deng; Liang Ding; Xuebo Liu; Meishan Zhang; Dacheng Tao; Min Zhang

arXiv:2212.01188·cs.CL·December 5, 2022

TL;DR

This paper introduces a method to enhance simultaneous machine translation by incorporating monolingual data and a novel sampling strategy, significantly improving translation quality and addressing hallucination issues.

Contribution

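The paper proposes training SiMT models on monolingual data distilled by Seq-KD and selecting that data with a new sampling strategy based on chunk length and monotonicity. This improves translation quality and reduces hallucination relative to random sampling.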

Findings

01. Monolingual data improves BLEU scores by +3.15 on En-Zh.

02. The novel sampling strategy outperforms random sampling and other strategies.

03. Achieves +0.72 BLEU improvements on average across language pairs.

Abstract

Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model. However, there is still a significant performance gap between NMT and SiMT. In this work, we propose to leverage monolingual data to improve SiMT, which trains a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD. Preliminary experiments on En-Zh and En-Ja news domain corpora demonstrate that monolingual data can significantly improve translation quality (e.g., +3.15 BLEU on En-Zh). Inspired by the behavior of human simultaneous interpreters, we propose a novel monolingual sampling strategy for SiMT, considering both chunk length and monotonicity. Experimental results show that our sampling strategy consistently outperforms the random sampling strategy (and other…
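The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_student_data`, `teacher_translate`, and `crossing_ratio` are hypothetical names, the bilingual pairs are assumed to be passed through unchanged, and the crossing-link ratio is only one simple proxy for monotonicity, which may differ from the measure used in the paper.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def build_student_data(
    bilingual: List[Pair],
    monolingual_src: List[str],
    teacher_translate: Callable[[str], str],
) -> List[Pair]:
    """Combine bilingual pairs with teacher-distilled monolingual pairs.

    Assumption: `teacher_translate` stands in for a full-sentence NMT
    teacher; bilingual pairs are kept as given.
    """
    distilled_mono = [(src, teacher_translate(src)) for src in monolingual_src]
    return list(bilingual) + distilled_mono


def crossing_ratio(alignment: List[Tuple[int, int]]) -> float:
    """Fraction of alignment-link pairs that cross.

    Each link is (source_index, target_index). A value of 0.0 means the
    alignment is fully monotonic; 1.0 means every comparable pair crosses.
    Hypothetical proxy, not necessarily the paper's metric.
    """
    links = sorted(alignment)
    total = 0
    crossings = 0
    for i in range(len(links)):
        for j in range(i + 1, len(links)):
            s1, t1 = links[i]
            s2, t2 = links[j]
            if s1 == s2:
                continue  # links from the same source word are not comparable
            total += 1
            if t1 > t2:
                crossings += 1
    return crossings / total if total else 0.0
```

Under these assumptions, a sampling step could rank distilled monolingual pairs by `crossing_ratio` and keep the more monotonic ones; how chunk length enters the selection is not specified here and is left out of the sketch.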

Code & Models

Repositories

hexuandeng/mono4simt (PyTorch, official)

Videos

Improving Simultaneous Machine Translation with Monolingual Data (Underline)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Knowledge Distillation