Noisy UGC Translation at the Character Level: Revisiting Open-Vocabulary Capabilities and Robustness of Char-Based Models
José Carlos Rosales Núñez, Guillaume Wisniewski, Djamé Seddah

TL;DR
This paper investigates the limitations of character-based neural machine translation models in handling noisy user-generated content, revealing their inability to manage unknown characters and proposing vocabulary reduction to improve robustness.
Contribution
The paper demonstrates how difficult zero-shot translation of noisy UGC is for character-based models and highlights the importance of tuning the vocabulary size for robustness.
Findings
Character models fail on unknown characters, causing translation breakdowns.
Reducing vocabulary size improves model robustness against noisy UGC.
Zero-shot translation performance is significantly impacted by UGC phenomena.
Abstract
This work explores the capacities of character-based Neural Machine Translation to translate noisy User-Generated Content (UGC), with a strong focus on exploring the limits of such approaches in handling productive UGC phenomena, which, almost by definition, cannot be seen at training time. Within a strict zero-shot scenario, we first study the detrimental impact on translation performance of various user-generated content phenomena on a small annotated dataset we developed, and then show that such models are indeed incapable of handling unknown letters, which leads to catastrophic translation failure once such characters are encountered. We further confirm this behavior with a simple, yet insightful, copy task experiment and highlight the importance of reducing the vocabulary size hyper-parameter to increase the robustness of character-based models for machine translation.
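To make the vocabulary-size point concrete, here is a minimal sketch of a capped character vocabulary. It is not the authors' code: the function names, the `<unk>` symbol, and the toy corpus are all assumptions made for illustration. The idea it shows is one plausible reading of the abstract: with a smaller character vocabulary, rare characters are mapped to an unknown symbol already at training time, so a model at least sees that symbol before it meets unseen characters in noisy UGC.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical unknown-character symbol, not taken from the paper


def build_char_vocab(corpus, max_size):
    """Keep the (max_size - 1) most frequent characters; one slot is reserved for UNK."""
    counts = Counter(ch for line in corpus for ch in line)
    # Sort by descending frequency, then by character, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [ch for ch, _ in ranked[: max_size - 1]]
    return {UNK: 0, **{ch: i + 1 for i, ch in enumerate(kept)}}


def encode(text, vocab):
    """Map each character to its id, falling back to the UNK id."""
    return [vocab.get(ch, vocab[UNK]) for ch in text]


def unk_rate(text, vocab):
    """Fraction of characters in `text` that are outside the vocabulary."""
    if not text:
        return 0.0
    return sum(ch not in vocab for ch in text) / len(text)


# Toy training corpus (an assumption, for illustration only).
corpus = ["abba", "abc"]
vocab = build_char_vocab(corpus, max_size=3)
ids = encode("abcé", vocab)
rate = unk_rate("abcé", vocab)
```

In this toy setting, `c` falls outside the capped vocabulary even though it occurs in the training corpus, so it shares an id with the genuinely unseen `é`. Whether this mirrors the paper's exact setup is not stated in the abstract.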
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
