Identifying Mislabeled Data using the Area Under the Margin Ranking
Geoff Pleiss, Tianyi Zhang, Ethan R. Elenberg, Kilian Q. Weinberger

TL;DR
This paper presents a novel method using the Area Under the Margin (AUM) statistic to identify and remove mislabeled or ambiguous data points in training sets, improving neural network generalization.
Contribution
The paper introduces a new AUM-based approach that effectively isolates mislabeled data by exploiting training dynamics, outperforming prior methods on multiple datasets.
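The summary above does not define the statistic, so the following is only a minimal sketch under an assumed definition: the margin of a sample at one epoch is its assigned-class logit minus the largest logit among the other classes, and the AUM is the average of those margins over the recorded epochs. The function names and the example logits are hypothetical, chosen for illustration.

```python
from typing import Sequence


def margin(logits: Sequence[float], assigned: int) -> float:
    """Assigned-class logit minus the largest logit among the other classes.

    Assumed definition; the summary itself does not spell it out.
    """
    largest_other = max(z for k, z in enumerate(logits) if k != assigned)
    return logits[assigned] - largest_other


def area_under_margin(logits_per_epoch: Sequence[Sequence[float]], assigned: int) -> float:
    """Average the per-epoch margins of a single training sample."""
    margins = [margin(logits, assigned) for logits in logits_per_epoch]
    return sum(margins) / len(margins)


# Hypothetical logits for a 3-class problem, recorded at three epochs.
# Both samples carry the assigned label 0.
clean = [[1.0, 0.0, 0.0], [2.0, 0.5, 0.0], [3.0, 0.5, 0.0]]
suspect = [[0.0, 1.0, 0.0], [0.5, 2.0, 0.0], [0.5, 3.0, 0.0]]

aum_clean = area_under_margin(clean, assigned=0)      # margins stay positive
aum_suspect = area_under_margin(suspect, assigned=0)  # margins stay negative
```

In this toy case the network keeps preferring class 1 for the second sample even though it is labeled 0, so its margin stays negative across epochs. That is the kind of difference in training dynamics the statistic is meant to expose.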
Findings
Removes 17% of the training data on WebVision50, improving test error by 1.6% (absolute).
Removes 13% of the training data on CIFAR100, reducing error by 1.2%.
Consistently outperforms previous techniques on synthetic and real-world datasets.
Abstract
Not all data in a typical training set help with generalization; some samples can be overly ambiguous or outrightly mislabeled. This paper introduces a new method to identify such samples and mitigate their impact when training neural networks. At the heart of our algorithm is the Area Under the Margin (AUM) statistic, which exploits differences in the training dynamics of clean and mislabeled samples. A simple procedure - adding an extra class populated with purposefully mislabeled threshold samples - learns an AUM upper bound that isolates mislabeled data. This approach consistently improves upon prior work on synthetic and real-world datasets. On the WebVision50 classification task, our method removes 17% of training data, yielding a 1.6% (absolute) improvement in test error. On CIFAR100, removing 13% of the data leads to a 1.2% drop in error.
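The threshold-sample step can be sketched roughly as follows. The sketch assumes the AUM of every sample has already been computed, and that the AUMs of the purposefully mislabeled threshold samples were measured with respect to the extra class. The upper bound is taken here as a high percentile of the threshold-sample AUMs; the 99th-percentile default, the nearest-rank percentile rule, and all names and numbers are illustrative assumptions rather than details stated in the abstract.

```python
import math
from typing import Dict, List


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def flag_suspects(
    aum_by_id: Dict[str, float],
    threshold_aums: List[float],
    q: float = 99.0,  # assumed default; treat it as a tunable parameter
) -> List[str]:
    """Return ids whose AUM is at or below the bound learned from threshold samples."""
    cutoff = percentile(threshold_aums, q)
    return sorted(sid for sid, aum in aum_by_id.items() if aum <= cutoff)


# Hypothetical AUMs of threshold samples (all assigned to the extra class).
threshold_aums = [-2.1, -1.7, -1.2, -0.9, -0.4]

# Hypothetical AUMs of ordinary training samples.
aum_by_id = {"img_a": 1.8, "img_b": -1.5, "img_c": 0.3, "img_d": -0.6}

suspects = flag_suspects(aum_by_id, threshold_aums)
```

Because threshold samples are mislabeled by construction, their AUMs indicate what a mislabeled sample tends to look like during training. Any real sample whose AUM falls at or below that bound becomes a candidate for removal.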
Taxonomy
Topics: Machine Learning and Data Classification · Anomaly Detection Techniques and Applications · Machine Learning and Algorithms
