TL;DR
The paper introduces MTNT, a publicly available benchmark of naturally occurring noisy text: Reddit comments paired with professionally sourced translations. It shows how noise degrades current MT models and provides a testbed for noise-robust methods.
Contribution
It provides the first publicly available parallel corpus of naturally occurring noisy text for machine translation (on the order of 7k-37k sentences per language pair), enabling better evaluation and development of noise-robust MT systems.
Findings
Existing MT models perform poorly on noisy text.
Adapting on a small amount of in-domain data does not fully resolve these failures.
The dataset facilitates research on noise handling in MT.
Abstract
Noisy or non-standard input text can cause disastrous mistranslations in most modern Machine Translation (MT) systems, and there has been growing research interest in creating noise-robust MT systems. However, as of yet there are no publicly available parallel corpora with naturally occurring noisy inputs and translations, and thus previous work has resorted to evaluating on synthetically created datasets. In this paper, we propose a benchmark dataset for Machine Translation of Noisy Text (MTNT), consisting of noisy comments on Reddit (www.reddit.com) and professionally sourced translations. We commissioned translations of English comments into French and Japanese, as well as French and Japanese comments into English, on the order of 7k-37k sentences per language pair. We qualitatively and quantitatively examine the types of noise included in this dataset, then demonstrate that…
