Robust Optimization for Multilingual Translation with Imbalanced Data

Xian Li; Hongyu Gong

arXiv:2104.07639 · cs.CL · December 1, 2021 · 6 cites

TL;DR

This paper introduces CATS, a novel optimization algorithm that improves multilingual translation models by addressing data imbalance issues, leading to better performance on low-resource languages without compromising high-resource ones.

Contribution

The paper proposes Curvature Aware Task Scaling (CATS), a new optimization method that adaptively rescales gradients to improve training for imbalanced multilingual translation models.

Findings

1. CATS improves low-resource language BLEU scores by 0.8 to 2.2 points.

2. CATS enhances multilingual training robustness across different model sizes and batch configurations.

3. The method maintains high-resource language performance while boosting low-resource language translation quality.

Abstract

Multilingual models are parameter-efficient and especially effective in improving low-resource languages by leveraging crosslingual transfer. Despite recent advances in massively multilingual translation with ever-growing models and data, how to effectively train multilingual models is not well understood. In this paper, we show that a common situation in multilingual training, data imbalance among languages, creates optimization tension between high-resource and low-resource languages, where the multilingual solution found is often sub-optimal for low-resource languages. We show that the common training method of upsampling low-resource languages cannot robustly optimize the population loss, risking either underfitting high-resource languages or overfitting low-resource ones. Drawing on recent findings on the geometry of the loss landscape and its effect on generalization, we propose a principled…
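
The abstract refers to upsampling low-resource languages as the common training method but does not say which scheme it means. As an illustration only, the sketch below assumes temperature-based sampling, a widely used form of upsampling in which language i is drawn with probability proportional to (n_i / N)^(1/T). The function name `sampling_probs`, the temperature value, and the corpus sizes are hypothetical and are not taken from the paper.

```python
def sampling_probs(sizes, temperature=5.0):
    """Temperature-based sampling: p_i is proportional to (n_i / N) ** (1 / T).

    temperature=1.0 reproduces sampling proportional to corpus size;
    larger values flatten the distribution and upsample small corpora.
    """
    total = sum(sizes.values())
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


# Hypothetical sentence-pair counts, chosen only to show the imbalance.
sizes = {"fr": 10_000_000, "de": 4_000_000, "gu": 10_000}

proportional = sampling_probs(sizes, temperature=1.0)
upsampled = sampling_probs(sizes, temperature=5.0)

for lang in sizes:
    print(f"{lang}: proportional={proportional[lang]:.4f} upsampled={upsampled[lang]:.4f}")
```

With a higher temperature the small corpus is revisited far more often than its size alone would dictate, while the large corpora are seen proportionally less. This is the trade-off the abstract describes as risking overfitting on low-resource languages or underfitting on high-resource ones.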

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attentive Walk-Aggregating Graph Neural Network · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Dropout · Adam · Layer Normalization