Label Noise Cleaning for Supervised Classification via Bernoulli Random Sampling

Yuxin Liu; Xiong Jin; Yang Han

arXiv:2603.14387·stat.ME·March 17, 2026

Open Access

TL;DR

This paper introduces a classifier-agnostic method that uses Bernoulli random sampling to identify and clean label noise in supervised classification, improving model accuracy without prior information about the labels.

Contribution

The paper presents a label noise cleaning technique based on Bernoulli random sampling and mixture distribution modeling, supported by theoretical justification and by results on simulated and real datasets.

Findings

01. The method accurately separates clean and noisy labels.

02. It performs well on both simulated and real datasets.

03. It is theoretically justified, with convergence guarantees.

Abstract

Label noise - incorrect labels assigned to observations - can substantially degrade the performance of supervised classifiers. This paper proposes a label noise cleaning method based on Bernoulli random sampling. We show that the mean label noise levels of subsets generated by Bernoulli random sampling containing a given observation are identically distributed for all clean observations, and identically distributed, with a different distribution, for all noisy observations. Although the mean label noise levels are not independent across observations, by introducing an independent coupling we further prove that they converge to a mixture of two well-separated distributions corresponding to clean and noisy observations. By establishing a linear model between cross-validated classification errors and label noise levels, we are able to approximate this mixture distribution and thereby…
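
The distributional claim in the abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure: it is an oracle toy example that assumes the true noise indicators are known, which is never the case in practice, and every name and parameter in it (`mean_noise_levels`, `p`, `n_subsets`, the 20% noise rate, the largest-gap split) is an assumption chosen for illustration. It draws Bernoulli random subsets, computes each subset's label noise level, and averages those levels over the subsets that contain each observation.

```python
import numpy as np


def mean_noise_levels(is_noisy, p=0.5, n_subsets=10_000, seed=0):
    """Average, per observation, the noise level of the subsets containing it.

    is_noisy : 0/1 array of true noise indicators (oracle knowledge, assumed
               here only to illustrate the distributional claim).
    p        : Bernoulli inclusion probability for each observation.
    """
    rng = np.random.default_rng(seed)
    n = is_noisy.size
    # Each row is one Bernoulli random subset: observation j is kept w.p. p.
    masks = (rng.random((n_subsets, n)) < p).astype(float)
    sizes = masks.sum(axis=1)
    keep = sizes > 0  # guard against (extremely unlikely) empty subsets
    masks, sizes = masks[keep], sizes[keep]
    # Noise level of a subset = fraction of its members with a noisy label.
    subset_level = (masks @ is_noisy) / sizes
    # For each observation, average the levels of the subsets that contain it.
    counts = masks.sum(axis=0)
    return (masks.T @ subset_level) / counts


# Toy data: 200 observations, the first 40 carry a noisy label.
is_noisy = np.zeros(200, dtype=int)
is_noisy[:40] = 1
levels = mean_noise_levels(is_noisy)

# Crude stand-in for a two-component mixture fit: split at the largest gap.
order = np.sort(levels)
gap_index = np.argmax(np.diff(order))
threshold = 0.5 * (order[gap_index] + order[gap_index + 1])
flagged = levels > threshold

print("mean level, clean observations:", levels[is_noisy == 0].mean())
print("mean level, noisy observations:", levels[is_noisy == 1].mean())
print("agreement with true indicators:", (flagged == (is_noisy == 1)).mean())
```

In this toy setting, a subset that contains a noisy observation has a slightly higher expected noise level than one that contains a clean observation, so the per-observation averages form two groups. According to the abstract, the paper does not observe noise levels directly but approximates the mixture through a linear model between cross-validated classification errors and label noise levels; that step is not reproduced here.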

Taxonomy

Topics: Machine Learning and Data Classification · Imbalanced Data Classification Techniques · Advanced Statistical Methods and Models