A statistical theory of overfitting for imbalanced classification

Jingyang Lyu; Kangjie Zhou; Yiqiao Zhong

arXiv:2502.11323·math.ST·February 18, 2025

TL;DR

This paper develops a high-dimensional statistical theory explaining overfitting in imbalanced classification, revealing how class imbalance and dimensionality skew logits and affect minority class performance.

Contribution

It introduces a novel high-dimensional asymptotic framework for analyzing overfitting in imbalanced data, highlighting the impact of dimensionality and class imbalance on model logits.

Findings

01. Logits for minority classes follow a rectified normal distribution on training data.

02. Margin rebalancing improves minority class accuracy.

03. Overfitting affects calibration and uncertainty measures.
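
The margin rebalancing finding can be made concrete with a small sketch. It assumes one particular formulation that this page does not spell out: along a fixed separating direction, the decision boundary is placed so that the minority-class margin is `tau` times the majority-class margin. The parameter name `tau` and the helper `rebalance_intercept` are hypothetical, and the code is not taken from the authors' repository.

```python
import numpy as np

def rebalance_intercept(scores, y, tau):
    """Place the intercept so the minority margin is tau times the majority margin.

    scores: projections <x_i, w> onto a fixed separating direction w (no intercept).
    y: labels, +1 for the minority class and -1 for the majority class.
    tau: assumed rebalancing ratio; tau = 1 gives the usual symmetric placement.
    """
    lo_minority = scores[y == 1].min()    # minority point closest to the majority
    hi_majority = scores[y == -1].max()   # majority point closest to the minority
    gap = lo_minority - hi_majority
    if gap <= 0:
        raise ValueError("classes are not separated along this direction")
    kappa = gap / (1.0 + tau)             # majority-side margin
    b = -(hi_majority + kappa)            # boundary sits kappa above the majority
    return b, kappa

# Toy projections: two minority points, three majority points.
scores = np.array([2.0, 3.0, -1.0, 0.0, -2.0])
y = np.array([1, 1, -1, -1, -1])

b_sym, kappa_sym = rebalance_intercept(scores, y, tau=1.0)
b_reb, kappa_reb = rebalance_intercept(scores, y, tau=3.0)
print("symmetric intercept:", b_sym, "rebalanced intercept:", b_reb)
```

Under this assumed formulation, a larger `tau` raises the intercept, so a larger region is assigned to the minority class. Whether that helps on unseen minority points depends on the data and on how `tau` is chosen.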

Abstract

Classification with imbalanced data is a common challenge in data analysis, where certain classes (minority classes) account for a small fraction of the training data compared with other classes (majority classes). Classical statistical theory based on large-sample asymptotics and finite-sample corrections is often ineffective for high-dimensional data, leaving many overfitting phenomena in empirical machine learning unexplained. In this paper, we develop a statistical theory for high-dimensional imbalanced classification by investigating support vector machines and logistic regression. We find that dimensionality induces truncation or skewing effects on the logit distribution, which we characterize via a variational problem under high-dimensional asymptotics. In particular, for linearly separable data generated from a two-component Gaussian mixture model, the logits from each class…
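
As a rough illustration of the setting in the abstract, the sketch below simulates a two-component Gaussian mixture with more features than samples, fits a linear classifier until the training set is separated, and compares minority-class logits on training and fresh test data. It is not the authors' code: the sizes `n` and `d`, the signal vector `mu`, the helper names `sample_gmm` and `fit_logistic_gd`, and the stopping rule are all assumptions made for illustration, and plain gradient descent on the logistic loss stands in for the estimators the paper analyzes.

```python
import numpy as np

def sample_gmm(n, d, pi, mu, rng):
    """Draw n points x = y * mu + z, z ~ N(0, I); a fraction pi has label +1."""
    n_min = max(1, int(round(pi * n)))
    y = np.concatenate([np.ones(n_min), -np.ones(n - n_min)])
    X = y[:, None] * mu + rng.standard_normal((n, d))
    return X, y

def fit_logistic_gd(X, y, max_iter=50000, target_margin=1.0):
    """Unregularized logistic regression by gradient descent (illustrative only)."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])            # append an intercept column
    lr = 4.0 * n / np.linalg.norm(Xb, 2) ** 2       # 1/L step for the mean logistic loss
    w = np.zeros(d + 1)
    for _ in range(max_iter):
        margins = y * (Xb @ w)
        if margins.min() >= target_margin:          # every training point is separated
            break
        s = np.exp(-np.logaddexp(0.0, margins))     # sigmoid(-margin), numerically stable
        w += lr * (Xb.T @ (y * s)) / n
    return w[:-1], w[-1]

rng = np.random.default_rng(0)
n, d, pi = 100, 400, 0.1                            # assumed sizes: d > n, 10% minority
mu = np.full(d, 1.0 / np.sqrt(d))                   # assumed signal with unit norm

X, y = sample_gmm(n, d, pi, mu, rng)
w, b = fit_logistic_gd(X, y)
scale = np.linalg.norm(w)

# Minority-class logits, normalized by the norm of the weight vector.
train_minority = (X[y == 1] @ w + b) / scale
X_test, y_test = sample_gmm(2000, d, 0.5, mu, rng)  # balanced draw: 1000 minority test points
test_minority = (X_test[y_test == 1] @ w + b) / scale

print("smallest training minority logit:", train_minority.min())
print("smallest test minority logit:", test_minority.min())
print("share of test minority logits <= 0:", (test_minority <= 0).mean())
```

In this toy run the training logits of the minority class all sit on the correct side of the boundary, while test logits from the same class spread across it. That gap between training and test logit distributions is the qualitative effect the abstract attributes to dimensionality.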

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 2

Strengths

- The paper provides a theoretical understanding for the behaviour of the distribution of logits at training and test time, which is an interesting application of an approach often used in calibration.
- The theorems presented seem consistent with the general consensus among the imbalanced classification field.
- Results are presented on a variety of modalities (tabular, image and text).
- Experiments are performed in a good number of scenarios, various imbalance proportions $\pi$, optimising cla

Weaknesses

- The paper is difficult to follow in places; there are a lot of concepts to grasp in section 2.
- Gordon's theorem should be cited at its first mention on page 2.
- [1] presents a method for adjusting decision boundaries based on the uncertainty from drawing a random sample from each class, and could be discussed in section 3. I believe this method would help with the truncation of the TLD, and it could be worth discussing how concentration inequalities relate to the TLD and ELD.
- Most of the t

Reviewer 02·Rating 6·Confidence 4

Strengths

The sharp asymptotic results provide an interesting interpretation of overfitting in terms of skewness of the training logit distribution on the minority class. The sharp prediction on the optimal value of the margin rebalancing parameter is useful in practice. The high imbalance regime presents a rich phase diagram, with clear implications for the margin rebalancing strategy. Theoretical claims are supported by numerical experiments (Fig. 1, 4, 5), also on real data (Fig 2 and 3).

Weaknesses

The theoretical analysis is limited to the very narrow setting of linear binary classification and data with isotropic covariances. Connections to some of the relevant literature in the field of high-dimensional statistics are not discussed. The main text gives no outline of how the theorems are proven, making it hard to assess their validity in a two-week review (the appendix is 80 pages long). See Questions below for more detail.

Reviewer 03·Rating 4·Confidence 3

Strengths

1. Quantifying the effects of overfitting in imbalanced data is an important and meaningful problem.
2. The study is comprehensive, and the theoretical analysis is solid.

Weaknesses

1. The paper is not well written or well organized. The motivation behind some of the definitions and problem settings is not clearly explained. See questions below for details.

Code & Models

Repositories

jlyu55/imbalanced_classification·Official

Taxonomy

Topics·Imbalanced Data Classification Techniques

Methods·Sparse Evolutionary Training