Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets
Tianjian Li, Haoran Xu, Weiting Tan, Kenton Murray, Daniel Khashabi

TL;DR
This paper investigates the theoretical and empirical differences between upsampling and upweighting on heavily imbalanced multilingual datasets, and proposes Cooldown, a strategy that balances convergence speed against overfitting risk.
Contribution
The paper establishes the conditions under which upsampling and upweighting are equivalent and those under which they diverge, and introduces Cooldown, a dynamic approach to training on imbalanced data.
Findings
Temperature Sampling yields lower-variance gradient estimates than Scalarization under stochastic gradient descent.
Cooldown improves convergence speed and reduces overfitting.
Theoretical analysis shows that upsampling and upweighting are equivalent under full gradient descent but differ under stochastic gradient descent.
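As a rough illustration of what upsampling by temperature means, the sketch below computes per-language sampling probabilities with the common convention p_i proportional to n_i^(1/tau). The convention, the function name `temperature_probs`, and the dataset sizes are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def temperature_probs(sizes, tau):
    """Per-language sampling probabilities, p_i proportional to n_i ** (1 / tau).

    tau = 1 reproduces proportional sampling; larger tau flattens the
    distribution and so upsamples low-resource languages.
    (Assumed convention, hypothetical helper.)
    """
    sizes = np.asarray(sizes, dtype=float)
    scaled = sizes ** (1.0 / tau)
    return scaled / scaled.sum()

# Hypothetical dataset sizes for three languages, from high- to low-resource.
sizes = [1_000_000, 10_000, 100]

p_prop = temperature_probs(sizes, tau=1.0)  # proportional to data size
p_flat = temperature_probs(sizes, tau=5.0)  # flatter, favours the small languages

print("tau=1:", p_prop)
print("tau=5:", p_flat)
```

Raising tau moves probability mass from the largest language to the smallest one, which is the upsampling effect the findings refer to.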
Abstract
Data abundance across different domains exhibits a long-tailed distribution: few domains have abundant data, while most face data scarcity. Our work focuses on a multilingual setting, where available data is heavily skewed towards high-resource languages. Two common strategies to address this disparity are upsampling low-resource data (Temperature Sampling) and upweighting low-resource loss (Scalarization). These methods are often assumed to be equivalent, but this equivalence has not been rigorously established, prompting our investigation. Through theoretical and empirical analysis, we identify when these two methods are equivalent and when they diverge. We prove that they are equivalent under full gradient descent but differ under stochastic gradient descent due to differences in gradient variance. Specifically, Temperature Sampling exhibits lower variance in gradient estimation…
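The variance claim can be illustrated with a toy simulation. This is a minimal sketch under simplifying assumptions, not the paper's experimental setup: gradients are scalars, each language's per-example gradient is Gaussian, and each estimator uses a single example. The proportions `q`, target weights `w`, means `mu`, and noise level `sigma` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

q = np.array([0.99, 0.01])   # data proportions: high- vs low-resource (assumed)
w = np.array([0.5, 0.5])     # target mixture weights (assumed)
mu = np.array([1.0, -1.0])   # mean per-example gradient per language (assumed)
sigma = 1.0                  # per-example gradient noise (assumed)
n = 200_000                  # number of simulated single-example estimates

def sample_grads(lang_probs, n):
    """Draw n languages from lang_probs, then one noisy gradient per draw."""
    langs = rng.choice(len(lang_probs), size=n, p=lang_probs)
    grads = rng.normal(loc=mu[langs], scale=sigma)
    return langs, grads

# Upsampling: draw languages from the target mixture w, use the raw gradient.
_, est_upsample = sample_grads(w, n)

# Upweighting: draw languages from the data proportions q, reweight by w / q.
langs, grads = sample_grads(q, n)
est_upweight = (w / q)[langs] * grads

# The full (non-stochastic) weighted gradient that both estimators target.
full_gradient = float(np.dot(w, mu))

print("full gradient:      ", full_gradient)
print("upsample mean / var:", est_upsample.mean(), est_upsample.var())
print("upweight mean / var:", est_upweight.mean(), est_upweight.var())
```

In this toy setup both estimators average to the same full gradient, but the upweighted one has much higher variance because the rare low-resource examples carry a large weight `w / q`. With other choices of `mu` and `sigma` the size of the gap will differ.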
Taxonomy
Topics: Computational and Text Analysis Methods
Methods: Focus
