Identifying Mislabeled Data using the Area Under the Margin Ranking

Geoff Pleiss; Tianyi Zhang; Ethan R. Elenberg; Kilian Q. Weinberger

arXiv:2001.10528 · cs.LG · December 24, 2020 · 45 cites

TL;DR

This paper presents a novel method using the Area Under the Margin (AUM) statistic to identify and remove mislabeled or ambiguous data points in training sets, improving neural network generalization.

Contribution

The paper introduces a new AUM-based approach that effectively isolates mislabeled data by exploiting training dynamics, outperforming prior methods on multiple datasets.

Findings

1. On WebVision50, removing 17% of the training data improves test error by 1.6% (absolute).

2. On CIFAR100, removing 13% of the training data reduces error by 1.2%.

3. The method consistently improves upon prior work on synthetic and real-world datasets.

Abstract

Not all data in a typical training set help with generalization; some samples can be overly ambiguous or outright mislabeled. This paper introduces a new method to identify such samples and mitigate their impact when training neural networks. At the heart of our algorithm is the Area Under the Margin (AUM) statistic, which exploits differences in the training dynamics of clean and mislabeled samples. A simple procedure - adding an extra class populated with purposefully mislabeled threshold samples - learns an AUM upper bound that isolates mislabeled data. This approach consistently improves upon prior work on synthetic and real-world datasets. On the WebVision50 classification task, our method removes 17% of training data, yielding a 1.6% (absolute) improvement in test error. On CIFAR100, removing 13% of the data leads to a 1.2% drop in error.
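The abstract describes the AUM statistic and the threshold-sample procedure only in words, so a rough sketch may help. It assumes that a sample's margin at one epoch is its assigned-class logit minus the largest other logit, that AUM is the average of those margins over training epochs, and that the cutoff is a high percentile of the AUMs of the threshold samples. The function names, the array layout, and the default percentile of 99 are illustrative choices for this sketch, not details stated on this page.

```python
import numpy as np

def margins(logits, labels):
    """Per-sample margin: assigned-class logit minus the largest other logit."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    assigned = logits[idx, labels]
    others = logits.copy()
    others[idx, labels] = -np.inf  # mask the assigned class before taking the max
    return assigned - others.max(axis=1)

def area_under_margin(logits_per_epoch, labels):
    """Average the margins over epochs.

    logits_per_epoch is assumed to have shape (epochs, samples, classes).
    """
    return np.mean([margins(epoch, labels) for epoch in logits_per_epoch], axis=0)

def flag_mislabeled(aum, is_threshold_sample, percentile=99.0):
    """Flag real samples whose AUM is at or below the threshold-sample cutoff.

    is_threshold_sample marks the samples that were deliberately reassigned
    to the extra class; the percentile value is an assumption of this sketch.
    """
    aum = np.asarray(aum, dtype=float)
    is_threshold_sample = np.asarray(is_threshold_sample, dtype=bool)
    cutoff = np.percentile(aum[is_threshold_sample], percentile)
    flagged = (aum <= cutoff) & ~is_threshold_sample
    return flagged, cutoff
```

In practice the logits would be recorded for every training sample at every epoch, with the threshold samples trained as members of the extra class; samples flagged by the cutoff are the candidates for removal before retraining.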

Videos

Identifying Mislabeled Data using the Area Under the Margin Ranking (SlidesLive)

Taxonomy

Topics: Machine Learning and Data Classification · Anomaly Detection Techniques and Applications · Machine Learning and Algorithms