MTNT: A Testbed for Machine Translation of Noisy Text

Paul Michel; Graham Neubig

arXiv:1809.00388 · cs.CL · September 5, 2018

TL;DR

The paper introduces MTNT, a publicly available benchmark dataset of naturally occurring noisy text from Reddit comments with translations, highlighting the challenges noise poses to current MT models and providing a testbed for noise-robust methods.

Contribution

The paper provides the first large-scale, real-world noisy text dataset for machine translation, enabling better evaluation and development of noise-robust MT systems.

Findings

1. Existing MT models perform poorly on noisy text.

2. Small in-domain adaptation does not fully address noise issues.

3. The dataset facilitates research on noise handling in MT.

Abstract

Noisy or non-standard input text can cause disastrous mistranslations in most modern Machine Translation (MT) systems, and there has been growing research interest in creating noise-robust MT systems. However, as of yet there are no publicly available parallel corpora with naturally occurring noisy inputs and translations, and thus previous work has resorted to evaluating on synthetically created datasets. In this paper, we propose a benchmark dataset for Machine Translation of Noisy Text (MTNT), consisting of noisy comments on Reddit (www.reddit.com) and professionally sourced translations. We commissioned translations of English comments into French and Japanese, as well as French and Japanese comments into English, on the order of 7k-37k sentences per language pair. We qualitatively and quantitatively examine the types of noise included in this dataset, then demonstrate that…
