Confident Learning: Estimating Uncertainty in Dataset Labels

Curtis G. Northcutt; Lu Jiang; Isaac L. Chuang

arXiv:1911.00068·stat.ML·August 23, 2022

4 Repos · 1 Dataset

TL;DR

Confident learning (CL) is a novel framework for estimating and improving label quality in datasets by identifying label errors, which enhances model training and data understanding across various data modalities.

Contribution

This paper introduces a generalized, provably consistent confident learning method that estimates label noise and identifies label errors. It outperforms recent approaches and applies to multiple data types.

Findings

01 CL exactly finds label errors when the sufficient conditions given in the paper hold.

02 Cleaning a dataset with CL before training improves model accuracy.

03 CL quantifies label overlap and mislabeling in large datasets.

Abstract

Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality. Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. Whereas numerous studies have developed these principles independently, here, we combine them, building on the assumption of a class-conditional noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This results in a generalized CL which is provably consistent and experimentally performant. We present sufficient conditions where CL exactly finds label errors, and show CL performance exceeding seven…
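
The counting step described in the abstract can be illustrated with a minimal pure-Python sketch. It assumes out-of-sample predicted probabilities are already available for every example, and it uses the mean predicted probability of each class over the examples carrying that label as the per-class threshold; the abstract does not state the thresholding rule, so treat that choice as an assumption of this sketch. The function names `confident_joint` and `suspected_label_errors` and the toy data are hypothetical and are not the authors' implementation.

```python
def confident_joint(noisy_labels, pred_probs):
    """Count examples into a K x K matrix: counts[i][j] is the number of
    examples with given (noisy) label i that confidently look like class j."""
    num_classes = len(pred_probs[0])

    # Assumed rule: threshold for class j is the mean predicted probability
    # of class j over the examples whose given label is j.
    thresholds = []
    for j in range(num_classes):
        probs_j = [p[j] for p, y in zip(pred_probs, noisy_labels) if y == j]
        thresholds.append(sum(probs_j) / len(probs_j) if probs_j else 1.0)

    counts = [[0] * num_classes for _ in range(num_classes)]
    likely_labels = []
    for given, probs in zip(noisy_labels, pred_probs):
        # Classes whose probability clears that class's own threshold.
        confident = [j for j in range(num_classes) if probs[j] >= thresholds[j]]
        # Examples that clear no threshold are left uncounted (None).
        likely = max(confident, key=lambda j: probs[j]) if confident else None
        likely_labels.append(likely)
        if likely is not None:
            counts[given][likely] += 1
    return counts, likely_labels


def suspected_label_errors(noisy_labels, pred_probs):
    """Indices of off-diagonal examples, least confident in the given label first."""
    _, likely_labels = confident_joint(noisy_labels, pred_probs)
    suspects = [
        i for i, (given, likely) in enumerate(zip(noisy_labels, likely_labels))
        if likely is not None and likely != given
    ]
    return sorted(suspects, key=lambda i: pred_probs[i][noisy_labels[i]])


# Toy data (hypothetical): example 2 is labeled 0 but predicted as class 1.
noisy_labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.70, 0.30],
    [0.05, 0.95],
    [0.10, 0.90],
    [0.30, 0.70],
    [0.15, 0.85],
]

counts, _ = confident_joint(noisy_labels, pred_probs)
print(counts)                                            # [[2, 1], [0, 2]]
print(suspected_label_errors(noisy_labels, pred_probs))  # [2]
```

On this toy data the off-diagonal entry `counts[0][1]` holds the single suspected error, and example 4 is left uncounted because it clears neither class threshold. Normalizing the count matrix would give a rough estimate of the joint distribution between given and true labels; this sketch omits any calibration step.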

Code & Models

Repositories

Datasets

Cleanlab/token-classification-tutorial
dataset · 54 dl

Taxonomy

Methods: Pruning