How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?

Shiyue Zhang; Vishrav Chaudhary; Naman Goyal; James Cross; Guillaume Wenzek; Mohit Bansal; Francisco Guzman

arXiv:2204.14268·cs.CL·September 13, 2022·6 cites

Open Access

TL;DR

This paper investigates how the imbalance in language data during multilingual tokenizer training impacts translation quality, revealing that downstream performance is surprisingly robust to imbalance and identifying indicators for potential issues.

Contribution

It systematically analyzes the effects of language imbalance in tokenizer training and distinguishes its impact from model training, providing new insights into robustness and warning indicators.

Findings

01 Translation performance is often relatively better when languages are sampled more equally during tokenizer training.

02 Downstream translation performance is more robust to language imbalance than expected.

03 Two tokenizer features, UNK rate and closeness to the character level, can warn of poor downstream performance before the task is run.
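The two warning indicators named in the findings can be approximated from tokenized text alone. The following is a minimal sketch, not the paper's exact definitions, which are not given on this page. It assumes sentences are already tokenized into lists of subword strings, that unknown tokens appear as `<unk>`, and that word boundaries use the SentencePiece-style `▁` marker. The function names `unk_rate` and `chars_per_token` are hypothetical, and average characters per token is used here only as a simple proxy for closeness to the character level.

```python
from typing import Iterable, List, Sequence


def unk_rate(tokenized: Iterable[Sequence[str]], unk_token: str = "<unk>") -> float:
    """Fraction of all tokens that are the unknown token (assumed to be '<unk>')."""
    total = 0
    unknown = 0
    for sentence in tokenized:
        total += len(sentence)
        unknown += sum(1 for tok in sentence if tok == unk_token)
    return unknown / total if total else 0.0


def chars_per_token(
    tokenized: Iterable[Sequence[str]],
    unk_token: str = "<unk>",
    word_marker: str = "\u2581",  # SentencePiece-style word-boundary marker (assumption)
) -> float:
    """Average number of characters per known token.

    A value near 1.0 suggests the segmentation is close to character level.
    This is an illustrative proxy, not the paper's metric.
    """
    lengths: List[int] = []
    for sentence in tokenized:
        for tok in sentence:
            if tok == unk_token:
                continue
            stripped = tok.replace(word_marker, "")
            if stripped:  # skip tokens that consist only of the marker
                lengths.append(len(stripped))
    return sum(lengths) / len(lengths) if lengths else 0.0


if __name__ == "__main__":
    sample = [["\u2581he", "llo", "<unk>"], ["\u2581a", "b"]]
    print(unk_rate(sample))
    print(chars_per_token(sample))
```

Both values can be computed per language on held-out text before any translation model is trained, which is the sense in which they act as early warnings.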

Abstract

A multilingual tokenizer is a fundamental component of multilingual neural machine translation. It is trained from a multilingual corpus. Since a skewed data distribution is considered to be harmful, a sampling strategy is usually used to balance languages in the corpus. However, few works have systematically answered how language imbalance in tokenizer training affects downstream performance. In this work, we analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus. We find that while relatively better performance is often observed when languages are more equally sampled, the downstream performance is more robust to language imbalance than we usually expected. Two features, UNK rate and closeness to the character level, can warn of poor downstream performance before performing the task. We also distinguish language sampling…
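The abstract mentions that a sampling strategy is usually used to balance languages but does not say which one. One widely used option is temperature-based sampling, where each language's sampling probability is proportional to its data share raised to the power 1/T. The sketch below illustrates that general technique under this assumption; it is not confirmed to be the strategy used in the paper, and the function name `sampling_probs` and the example corpus sizes are hypothetical.

```python
from typing import Dict


def sampling_probs(sizes: Dict[str, int], temperature: float = 5.0) -> Dict[str, float]:
    """Temperature-based language sampling probabilities.

    p_lang is proportional to (n_lang / N) ** (1 / temperature).
    temperature = 1 keeps the original data proportions; larger values
    flatten the distribution toward uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    if total <= 0:
        raise ValueError("corpus sizes must sum to a positive number")
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


if __name__ == "__main__":
    # Hypothetical sentence counts for a high- and a low-resource language.
    corpus_sizes = {"en": 900, "si": 100}
    for t in (1.0, 5.0, 100.0):
        print(t, sampling_probs(corpus_sizes, temperature=t))
```

Varying the temperature is one way to sweep from a heavily skewed tokenizer training corpus to a nearly balanced one, which is the kind of ratio change the abstract describes analyzing.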

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification