Confident Learning: Estimating Uncertainty in Dataset Labels
Curtis G. Northcutt, Lu Jiang, Isaac L. Chuang

TL;DR
Confident learning (CL) is a novel framework for estimating and improving label quality in datasets by identifying label errors, which enhances model training and data understanding across various data modalities.
Contribution
This paper introduces a generalized, provably consistent confident learning method that estimates label noise and identifies label errors, outperforms recent approaches, and applies to multiple data types.
Findings
CL exactly finds label errors when the sufficient conditions given in the paper hold.
CL improves model accuracy by cleaning datasets before training.
CL quantifies label overlap and mislabeling in large datasets.
Abstract
Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality. Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. Whereas numerous studies have developed these principles independently, here, we combine them, building on the assumption of a class-conditional noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This results in a generalized CL which is provably consistent and experimentally performant. We present sufficient conditions where CL exactly finds label errors, and show CL performance exceeding seven…
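The counting and ranking steps named in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not say how the probabilistic thresholds are chosen, so this sketch assumes each class threshold is the mean predicted probability of that class over the examples given that label. The function names `confident_joint` and `rank_suspected_errors` and the toy data are hypothetical.

```python
def confident_joint(noisy_labels, pred_probs):
    """Count examples into a (given label, likely true label) matrix."""
    n_classes = len(pred_probs[0])
    # ASSUMPTION: threshold for class j is the mean predicted probability
    # of class j over the examples whose given label is j.
    thresholds = []
    for j in range(n_classes):
        probs_j = [p[j] for p, y in zip(pred_probs, noisy_labels) if y == j]
        thresholds.append(sum(probs_j) / len(probs_j) if probs_j else 1.0)

    counts = [[0] * n_classes for _ in range(n_classes)]
    for y, p in zip(noisy_labels, pred_probs):
        # Classes whose predicted probability clears their own threshold.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # not confidently any class, so leave it uncounted
        likely_true = max(confident, key=lambda j: p[j])
        counts[y][likely_true] += 1
    return counts, thresholds


def rank_suspected_errors(noisy_labels, pred_probs, thresholds):
    """Return indices of off-diagonal examples, least self-confident first."""
    suspects = []
    for i, (y, p) in enumerate(zip(noisy_labels, pred_probs)):
        confident = [j for j in range(len(p)) if p[j] >= thresholds[j]]
        if confident and max(confident, key=lambda j: p[j]) != y:
            suspects.append((p[y], i))  # p[y]: probability of the given label
    return [i for _, i in sorted(suspects)]


# Toy data (hypothetical): six examples, two classes.
labels = [0, 0, 0, 1, 1, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8],
         [0.1, 0.9], [0.3, 0.7], [0.85, 0.15]]

counts, thresholds = confident_joint(labels, probs)
suspects = rank_suspected_errors(labels, probs, thresholds)
print(counts)    # rows: given label, columns: likely true label
print(suspects)  # candidates for pruning before retraining
```

In this sketch, off-diagonal counts approximate how often one class is given in place of another, and dividing the matrix by its total gives a rough estimate of the joint distribution of given and true labels. The ranked indices are the examples one might prune before retraining.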
Taxonomy
Methods: Pruning
