Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

Tianjian Li; Haoran Xu; Weiting Tan; Kenton Murray; Daniel Khashabi

arXiv:2410.04579·cs.CL·March 11, 2025

TL;DR

This paper examines, theoretically and empirically, how upsampling and upweighting differ when training on imbalanced multilingual datasets, and proposes a new strategy called Cooldown to balance convergence speed against overfitting risk.

Contribution

It establishes the conditions under which upsampling and upweighting are equivalent and the conditions under which they diverge, and introduces Cooldown, a dynamic approach to improve training on imbalanced data.

Findings

01. Temperature Sampling has lower gradient variance than Scalarization.

02. Cooldown improves convergence speed and reduces overfitting.

03. Theoretical analysis clarifies when upsampling and upweighting are equivalent.

Abstract

Data abundance across different domains exhibits a long-tailed distribution: few domains have abundant data, while most face data scarcity. Our work focuses on a multilingual setting, where available data is heavily skewed towards high-resource languages. Two common strategies to address this disparity are upsampling low-resource data (Temperature Sampling) and upweighting low-resource loss (Scalarization). These methods are often assumed to be equivalent, but this equivalence has not been rigorously established, prompting our investigation. Through theoretical and empirical analysis, we identify when these two methods are equivalent and when they diverge. We prove that they are equivalent under full gradient descent but differ under stochastic gradient descent due to differences in gradient variance. Specifically, Temperature Sampling exhibits lower variance in gradient estimation…
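The equivalence claim in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes scalar toy "gradients" for two hypothetical domains, the common convention that temperature sampling draws domain i with probability proportional to p_i ** (1 / tau), and scalarization weights of q_i / p_i chosen so that both strategies target the same objective. All function names are illustrative.

```python
import math
import random
import statistics


def temperature_probs(sizes, tau):
    """Sampling probabilities q_i proportional to p_i ** (1 / tau) (assumed convention)."""
    total = sum(sizes)
    scaled = [(n / total) ** (1.0 / tau) for n in sizes]
    norm = sum(scaled)
    return [s / norm for s in scaled]


def full_gradients(domains, q):
    """Exact full-batch gradient of each objective (no sampling noise)."""
    total = sum(len(d) for d in domains)
    p = [len(d) / total for d in domains]
    means = [statistics.fmean(d) for d in domains]
    # Upsampling: domains are mixed according to q.
    upsampled = sum(qi * m for qi, m in zip(q, means))
    # Upweighting: domains keep their natural share p, losses are scaled by q / p.
    upweighted = sum(pi * (qi / pi) * m for pi, qi, m in zip(p, q, means))
    return upsampled, upweighted


def stochastic_gradients(domains, q, steps, rng):
    """Single-example gradient estimates under each strategy."""
    total = sum(len(d) for d in domains)
    p = [len(d) / total for d in domains]
    idx = range(len(domains))
    upsampled, upweighted = [], []
    for _ in range(steps):
        # Temperature Sampling: pick a domain with probability q, weight 1.
        i = rng.choices(idx, weights=q)[0]
        upsampled.append(rng.choice(domains[i]))
        # Scalarization: pick a domain with probability p, weight q / p.
        j = rng.choices(idx, weights=p)[0]
        upweighted.append((q[j] / p[j]) * rng.choice(domains[j]))
    return upsampled, upweighted


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: 9000 high-resource and 1000 low-resource examples.
    domains = [
        [rng.gauss(1.0, 1.0) for _ in range(9000)],
        [rng.gauss(-1.0, 1.0) for _ in range(1000)],
    ]
    q = temperature_probs([len(d) for d in domains], tau=5.0)
    up, weighted = full_gradients(domains, q)
    print("full-batch gradients match:", math.isclose(up, weighted, abs_tol=1e-12))
    ts, sc = stochastic_gradients(domains, q, 20000, rng)
    print("stochastic means:", statistics.fmean(ts), statistics.fmean(sc))
    print("stochastic variances:", statistics.variance(ts), statistics.variance(sc))
```

In this toy setup both strategies yield the same full-batch gradient and the same expected stochastic gradient, while the upweighted estimates spread more widely because the rare low-resource draws carry large weights. The size of the gap depends on the assumed data: here the two domains have similar gradient magnitudes, which need not hold in real training.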

Taxonomy

Topics: Computational and Text Analysis Methods