How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?
Shiyue Zhang, Vishrav Chaudhary, Naman Goyal, James Cross, Guillaume Wenzek, Mohit Bansal, Francisco Guzman

TL;DR
This paper investigates how language imbalance in the multilingual tokenizer training corpus affects translation quality. It finds that downstream performance is surprisingly robust to imbalance and identifies two indicators, UNK rate and closeness to the character level, that warn of poor performance.
Contribution
The paper systematically analyzes the effects of language imbalance in tokenizer training and separates them from the effects of model training, providing new insights into robustness and early warning indicators.
Findings
Relatively better performance is often observed when languages are sampled more equally during tokenizer training.
Downstream translation performance is more robust to language imbalance than usually expected.
Two features, UNK rate and closeness to the character level, can warn of poor downstream performance before the task is run.
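As a rough illustration of the two warning indicators named above, the sketch below computes them from already-tokenized text. The definitions are assumptions made for this example and are not taken from the paper: UNK rate is the fraction of tokens equal to an assumed `<unk>` symbol, and closeness to the character level is the ratio of token count to non-space character count (a value near 1.0 means the tokenizer is splitting text almost character by character). The function names are hypothetical.

```python
from typing import Sequence

UNK = "<unk>"  # assumed UNK symbol; real tokenizers may use a different one


def unk_rate(tokenized: Sequence[Sequence[str]], unk: str = UNK) -> float:
    """Fraction of all tokens that are the UNK symbol (assumed definition)."""
    total = sum(len(sentence) for sentence in tokenized)
    if total == 0:
        return 0.0
    unks = sum(tok == unk for sentence in tokenized for tok in sentence)
    return unks / total


def char_level_closeness(
    tokenized: Sequence[Sequence[str]], texts: Sequence[str]
) -> float:
    """Tokens per non-space character (assumed definition).

    Values near 1.0 suggest the segmentation is close to character level.
    """
    n_tokens = sum(len(sentence) for sentence in tokenized)
    n_chars = sum(len(text.replace(" ", "")) for text in texts)
    return n_tokens / n_chars if n_chars else 0.0


# Toy data: two sentences and a made-up segmentation of each.
texts = ["hello world", "good day"]
tokenized = [["hello", "wor", "ld"], ["<unk>", "day"]]

print(unk_rate(tokenized))                     # 1 UNK out of 5 tokens
print(char_level_closeness(tokenized, texts))  # 5 tokens over 17 characters
```

Both quantities can be computed per language on held-out text before any translation model is trained, which is what makes them usable as early warnings.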
Abstract
A multilingual tokenizer is a fundamental component of multilingual neural machine translation. It is trained from a multilingual corpus. Since a skewed data distribution is considered to be harmful, a sampling strategy is usually used to balance languages in the corpus. However, few works have systematically answered how language imbalance in tokenizer training affects downstream performance. In this work, we analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus. We find that while relatively better performance is often observed when languages are more equally sampled, the downstream performance is more robust to language imbalance than we usually expected. Two features, UNK rate and closeness to the character level, can warn of poor downstream performance before performing the task. We also distinguish language sampling…
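The abstract mentions that a sampling strategy is usually used to balance languages in the tokenizer training corpus. One widely used strategy is temperature-based sampling, where the probability of drawing from a language is proportional to its data share raised to the power 1/T. The sketch below illustrates that general technique only; whether the paper uses exactly this scheme is not stated in the text above, and the function name and corpus sizes are hypothetical.

```python
from typing import Dict


def temperature_sampling_probs(
    sizes: Dict[str, int], temperature: float = 5.0
) -> Dict[str, float]:
    """Per-language sampling probabilities p_i proportional to (n_i / N) ** (1 / T).

    T = 1 keeps the original (imbalanced) distribution; larger T moves
    the distribution toward uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    weights = {
        lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()
    }
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical corpus sizes (number of sentences per language).
sizes = {"en": 900, "ne": 100}

print(temperature_sampling_probs(sizes, temperature=1.0))  # original ratios
print(temperature_sampling_probs(sizes, temperature=2.0))  # flatter
```

Sweeping the temperature, or setting the ratios directly, is one way to produce tokenizer training corpora with different degrees of imbalance for the kind of analysis the abstract describes.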
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
