Hypothesis Testing for Class-Conditional Label Noise
Rafael Poyiadzi, Weisong Yang, Niall Twomey, Raul Santos-Rodriguez

TL;DR
This paper introduces hypothesis tests for detecting class-conditional label noise in a dataset, giving practitioners a way to assess label quality without estimating the noise rates directly, which is hard in practice.
Contribution
It proposes novel hypothesis tests, based on the asymptotic properties of logistic regression, for identifying class-conditional noise. The tests rely on anchor points with uncertain true posteriors, rather than the anchor points with a true posterior of exactly 0 or 1 that prior noise-rate estimators require.
Findings
Tests effectively distinguish class-conditional noise from uniform noise.
Power of tests depends on sample size, number of anchor points, and noise rate differences.
Theoretical and empirical analyses validate the approach.
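The dependence of power on the number of anchor points and on the noise-rate difference can be illustrated with a deliberately simplified stand-in for the paper's test. The sketch below is not the logistic-regression-based procedure the paper proposes: it assumes hypothetical anchor points whose true posterior is exactly 1/2, and it ignores the training-set size entirely. Under that assumption, uniform noise leaves the noisy label frequency at the anchors at 1/2, while class-conditional noise with rates `alpha` (1 flipped to 0) and `beta` (0 flipped to 1) shifts it to 1/2 + (beta - alpha)/2, so a plain two-sided binomial z-test applies. All function and parameter names here are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def anchor_z_statistic(noisy_labels):
    """z-statistic for H0: P(noisy label = 1) = 1/2 at the anchor points.

    Assumes (for illustration only) that every anchor has true posterior 1/2.
    """
    k = len(noisy_labels)
    p_hat = sum(noisy_labels) / k
    return (p_hat - 0.5) / math.sqrt(0.25 / k)


def approx_power(k, alpha, beta, z_crit=1.959964):
    """Normal-approximation power of the two-sided 5% test with k anchors.

    alpha: P(observed 0 | true 1), beta: P(observed 1 | true 0).
    """
    p1 = 0.5 + (beta - alpha) / 2.0        # noisy posterior at a 1/2 anchor
    se0 = math.sqrt(0.25 / k)              # standard error under H0
    se1 = math.sqrt(p1 * (1.0 - p1) / k)   # standard error under the alternative
    upper = 1.0 - normal_cdf((0.5 + z_crit * se0 - p1) / se1)
    lower = normal_cdf((0.5 - z_crit * se0 - p1) / se1)
    return upper + lower


if __name__ == "__main__":
    for k in (50, 100, 400):
        print(k, round(approx_power(k, alpha=0.1, beta=0.3), 3))
```

In this toy setting the power equals the nominal 5% level when `alpha == beta` and grows with both `k` and `|beta - alpha|`, which mirrors the qualitative finding stated above.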
Abstract
In this paper we provide machine learning practitioners with tools to answer the question: is there class-conditional noise in my labels? In particular, we present hypothesis tests to check whether a given dataset of instance-label pairs has been corrupted with class-conditional label noise, as opposed to uniform label noise, with the former biasing learning, while the latter -- under mild conditions -- does not. The outcome of these tests can then be used in conjunction with other information to assess further steps. While previous works explore the direct estimation of the noise rates, this is known to be hard in practice and does not offer a real understanding of how trustworthy the estimates are. These methods typically require anchor points -- examples whose true posterior is either 0 or 1. Differently, in this paper we assume we have access to a set of anchor points whose true…
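The abstract's claim that class-conditional noise biases learning while uniform noise does not can be checked with the standard noisy-posterior identity. The sketch below is an illustration, not code from the paper; the symbols `eta` (true posterior), `alpha` (probability that a true 1 is observed as 0) and `beta` (probability that a true 0 is observed as 1) are notation assumed here. With `alpha == beta < 1/2`, the noisy posterior crosses 1/2 exactly where the true posterior does, so the optimal decision boundary is unchanged; with `alpha != beta` the crossing point moves.

```python
def noisy_posterior(eta, alpha, beta):
    """P(observed label = 1 | x) given the true posterior eta = P(y = 1 | x).

    alpha: P(observed 0 | true 1), beta: P(observed 1 | true 0).
    """
    return (1.0 - alpha) * eta + beta * (1.0 - eta)


def boundary_preserved(alpha, beta, grid_size=101):
    """True if thresholding the noisy posterior at 1/2 gives the same
    decision as thresholding the true posterior, over a grid of eta values."""
    for i in range(grid_size):
        eta = i / (grid_size - 1)
        if abs(eta - 0.5) < 1e-12:
            continue  # skip the boundary itself
        clean_decision = eta > 0.5
        noisy_decision = noisy_posterior(eta, alpha, beta) > 0.5
        if clean_decision != noisy_decision:
            return False
    return True


if __name__ == "__main__":
    print("uniform noise (0.2, 0.2):", boundary_preserved(0.2, 0.2))
    print("class-conditional noise (0.1, 0.4):", boundary_preserved(0.1, 0.4))
```

For example, with `alpha = 0.1` and `beta = 0.4`, a point with true posterior 0.3 has a noisy posterior of about 0.55, so a classifier trained on the noisy labels would tend to place it on the wrong side of the boundary.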
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Statistical Methods and Models
Methods: Logistic Regression
