Imbalance in Regression Datasets

Daniel Kowatsch; Nicolas M. Müller; Kilian Tscharke; Philip Sperl; Konstantin Bötinger

arXiv:2402.11963 · cs.LG · February 20, 2024 · 1 citation

Open Access

TL;DR

This paper highlights the overlooked issue of imbalance in regression datasets, analyzing its theoretical implications and proposing a new definition to guide future research in addressing this problem.

Contribution

It introduces a first definition of imbalance in regression and shows that this definition generalises the imbalance measure commonly employed in classification.

Findings

01. Identifies how imbalance causes regressors to neglect rare data

02. Provides a theoretical analysis of regression imbalance

03. Proposes a new generalisation of the classification imbalance measure for regression

Abstract

For classification, the problem of class imbalance is well known and has been extensively studied. In this paper, we argue that imbalance in regression is an equally important problem which has so far been overlooked: Due to under- and over-representations in a data set's target distribution, regressors are prone to degenerate to naive models, systematically neglecting uncommon training data and over-representing targets seen often during training. We analyse this problem theoretically and use resulting insights to develop a first definition of imbalance in regression, which we show to be a generalisation of the commonly employed imbalance measure in classification. With this, we hope to turn the spotlight on the overlooked problem of imbalance in regression and to provide common ground for future research.
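
The degeneration to a naive model described in the abstract can be illustrated with a small sketch. It is not taken from the paper: the target distribution, the sample sizes, and the helper `mse` are assumptions chosen only for illustration. The sketch relies on one standard fact, namely that the constant prediction minimising mean squared error is the mean of the targets, and shows that such a constant model fits the common targets well while its error is concentrated on the rare ones.

```python
import random
import statistics

random.seed(0)

# Hypothetical imbalanced target distribution (illustrative assumption):
# 950 common targets near 0 and 50 rare targets near 10.
common = [random.gauss(0.0, 1.0) for _ in range(950)]
rare = [random.gauss(10.0, 1.0) for _ in range(50)]
targets = common + rare

# The constant prediction that minimises mean squared error is the mean target,
# so a regressor that ignores its inputs would collapse to this value.
naive_prediction = statistics.fmean(targets)


def mse(values, prediction):
    """Mean squared error of a constant prediction on a list of targets."""
    return statistics.fmean((v - prediction) ** 2 for v in values)


mse_common = mse(common, naive_prediction)
mse_rare = mse(rare, naive_prediction)
mse_overall = mse(targets, naive_prediction)

print(f"naive prediction:      {naive_prediction:.2f}")
print(f"MSE on common targets: {mse_common:.2f}")
print(f"MSE on rare targets:   {mse_rare:.2f}")
print(f"MSE overall:           {mse_overall:.2f}")
```

Because the rare targets make up only 5% of the sample in this sketch, the overall error stays comparatively low even though the error on the rare targets is far larger than on the common ones, which is one way to read the abstract's claim that uncommon training data is systematically neglected.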

Taxonomy

Topics: Artificial Intelligence in Healthcare · Imbalanced Data Classification Techniques · Machine Learning and Data Classification