Active label cleaning for improved dataset quality under resource constraints

Melanie Bernhardt; Daniel C. Castro; Ryutaro Tanno; Anton Schwaighofer; Kerem C. Tezcan; Miguel Monteiro; Shruthi Bannur; Matthew Lungren; Aditya Nori; Ben Glocker; Javier Alvarez-Valle; Ozan Oktay

arXiv:2109.00574·cs.CV·April 25, 2022

TL;DR

This paper introduces an active label cleaning method that prioritizes samples for re-annotation based on estimated correctness and difficulty, significantly improving dataset quality and model performance under resource constraints.

Contribution

The paper proposes a data-driven approach to prioritising samples for re-annotation, which outperforms random selection in resource-limited settings such as medical imaging.

Findings

01

Active label cleaning corrects labels up to 4 times more effectively than random selection.

02

Cleaning noisy labels improves model training and evaluation.

03

The method is effective on both natural and medical images.

Abstract

Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have an often-overlooked confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation - which we term "active label cleaning". We propose to rank instances according to estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a new medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed active label…
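
The ranking idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not confirmed by the text above: the scoring rule (cross-entropy of the given label under a model's predicted distribution, minus the entropy of that prediction), the function names, and the toy predictions. The first term stands in for "estimated label correctness" and the second for "labelling difficulty", so that likely-wrong but unambiguous samples are reviewed first.

```python
import math


def cross_entropy(label_dist, probs, eps=1e-12):
    """Cross-entropy of `label_dist` under the predicted distribution `probs`."""
    return -sum(q * math.log(max(p, eps)) for q, p in zip(label_dist, probs))


def entropy(probs, eps=1e-12):
    """Shannon entropy of `probs`; used here as a proxy for labelling difficulty."""
    return -sum(p * math.log(max(p, eps)) for p in probs if p > 0)


def relabel_score(label, probs):
    """Hypothetical priority score: high when the label disagrees with a confident prediction."""
    one_hot = [1.0 if k == label else 0.0 for k in range(len(probs))]
    return cross_entropy(one_hot, probs) - entropy(probs)


def rank_for_relabelling(labels, predictions, budget):
    """Return indices of the `budget` samples with the highest priority score."""
    scores = [relabel_score(y, p) for y, p in zip(labels, predictions)]
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    return order[:budget]


# Toy data (made up): three samples, two classes.
labels = [0, 0, 1]
predictions = [
    [0.90, 0.10],  # label agrees with a confident prediction
    [0.05, 0.95],  # label disagrees with a confident prediction
    [0.50, 0.50],  # ambiguous sample
]
selected = rank_for_relabelling(labels, predictions, budget=2)
print(selected)
```

With a re-annotation budget of two, the sketch selects the confidently contradicted sample first and the ambiguous one second, leaving the sample whose label agrees with the prediction untouched. A random-selection baseline would instead draw two indices uniformly, which is the comparison the abstract refers to.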

Code & Models

Repositories

microsoft/InnerEye-DeepLearning (PyTorch, official)