Robust Optimization for Multilingual Translation with Imbalanced Data
Xian Li, Hongyu Gong

TL;DR
This paper introduces CATS, a novel optimization algorithm that improves multilingual translation models by addressing data imbalance issues, leading to better performance on low-resource languages without compromising high-resource ones.
Contribution
The paper proposes Curvature Aware Task Scaling (CATS), a new optimization method that adaptively rescales gradients to improve training for imbalanced multilingual translation models.
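To make "rescaling gradients" concrete, here is a minimal sketch of combining per-language gradients with one scalar scale per task. It is not the CATS algorithm: this summary does not describe how CATS derives its scales from curvature, so the scales below are fixed, hypothetical constants. The function name `combine_task_gradients` and the language pairs are also assumptions made for illustration.

```python
from typing import Dict, List


def combine_task_gradients(
    task_grads: Dict[str, List[float]],
    task_scales: Dict[str, float],
) -> List[float]:
    """Return the sum of per-task gradients, each multiplied by its task scale."""
    dim = len(next(iter(task_grads.values())))
    combined = [0.0] * dim
    for task, grad in task_grads.items():
        scale = task_scales[task]
        for i, g in enumerate(grad):
            combined[i] += scale * g
    return combined


# Hypothetical gradients for a high-resource and a low-resource language pair.
grads = {"en-fr": [1.0, -2.0], "en-gu": [0.5, 0.5]}
# Hypothetical fixed scales; an adaptive method would update these during training.
scales = {"en-fr": 0.5, "en-gu": 2.0}
update_direction = combine_task_gradients(grads, scales)
```

With equal scales this reduces to the ordinary summed gradient. Changing the scales shifts how much each language influences the shared parameter update.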
Findings
CATS improves low-resource language BLEU scores by 0.8 to 2.2 points.
CATS enhances multilingual training robustness across different model sizes and batch configurations.
The method maintains high-resource language performance while boosting low-resource language translation quality.
Abstract
Multilingual models are parameter-efficient and especially effective in improving low-resource languages by leveraging crosslingual transfer. Despite recent advances in massive multilingual translation with ever-growing models and data, how to effectively train multilingual models is not well understood. In this paper, we show that a common situation in multilingual training, data imbalance among languages, creates optimization tension between high-resource and low-resource languages, and the multilingual solution found is often sub-optimal for low-resource languages. We show that the common training method of upsampling low-resource languages cannot robustly optimize the population loss, risking either underfitting high-resource languages or overfitting low-resource ones. Drawing on recent findings on the geometry of loss landscape and its effect on generalization, we propose a principled…
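The abstract refers to upsampling low-resource languages without saying which scheme is meant. One widely used option is temperature-based sampling, where a language with a fraction q of the training data is sampled with probability proportional to q^(1/T). The sketch below assumes that scheme; the function name `sampling_probs`, the corpus sizes, and the temperature values are all hypothetical.

```python
from typing import Dict


def sampling_probs(sizes: Dict[str, int], temperature: float) -> Dict[str, float]:
    """Temperature-based sampling: p_i is proportional to (n_i / N) ** (1 / T)."""
    total = sum(sizes.values())
    weights = {
        lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()
    }
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical corpus sizes (sentence pairs) for one high- and one low-resource language.
sizes = {"fr": 9_000_000, "gu": 1_000_000}

proportional = sampling_probs(sizes, temperature=1.0)  # follows the data distribution
upsampled = sampling_probs(sizes, temperature=5.0)     # shifts mass toward "gu"
```

A temperature of 1 reproduces the raw data proportions, while larger temperatures move the distribution toward uniform. That is the trade-off the abstract describes: too little upsampling leaves low-resource languages underrepresented, and too much repeats their small corpora often enough to risk overfitting while giving high-resource languages fewer updates.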
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attentive Walk-Aggregating Graph Neural Network · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Dropout · Adam · Layer Normalization
