Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise

Cedric Renggli; Luka Rimanic; Luka Kolar; Wentao Wu; Ce Zhang

arXiv:2010.08410·cs.LG·August 31, 2022

TL;DR

This paper introduces Snoopy, a data quality analysis tool that estimates the Bayes error rate to assess task feasibility in ML projects, helping data scientists avoid unrealistic expectations caused by noisy data.

Contribution

The paper presents a practical Bayes error estimator and demonstrates its effectiveness in feasibility studies, reducing labeling efforts and improving project planning in ML.

Findings

01. The estimator accurately predicts irreducible error across diverse datasets.

02. Incorporating feasibility analysis reduces labeling time and costs.

03. Systematic feasibility studies improve ML project success rates.

Abstract

In our experience of working with domain experts who are using today's AutoML systems, a common problem we encountered is what we call "unrealistic expectations" -- when users are facing a very challenging task with a noisy data acquisition process, while being expected to achieve startlingly high accuracy with machine learning (ML). Many of these are predestined to fail from the beginning. In traditional software engineering, this problem is addressed via a feasibility study, an indispensable step before developing any software system. In this paper, we present Snoopy, with the goal of supporting data scientists and machine learning engineers performing a systematic and theoretically founded feasibility study before building ML applications. We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error rate (BER), which stems from…
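To make the idea of estimating the BER concrete, here is a minimal sketch of one classical approach: measure the error of a 1-nearest-neighbour classifier and invert the Cover-Hart inequality to get a lower bound on the Bayes error. The text shown here does not say which estimator Snoopy uses, so this should be read as an illustration of the concept rather than the authors' method. The function names (`one_nn_error`, `cover_hart_lower_bound`, `make_noisy_blobs`) and the synthetic two-blob dataset are assumptions introduced for this example.

```python
import math
import random


def one_nn_error(train, test):
    """Fraction of test points whose nearest training point has a different label."""
    mistakes = 0
    for x, y in test:
        # Nearest neighbour by squared Euclidean distance.
        _, nearest_label = min(
            train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
        )
        mistakes += nearest_label != y
    return mistakes / len(test)


def cover_hart_lower_bound(err_1nn, num_classes):
    """Invert the Cover-Hart inequality: a lower bound on the Bayes error
    implied by an (asymptotic) 1-NN error rate."""
    c = num_classes
    # Clip at zero: a finite-sample 1-NN error can exceed the asymptotic maximum.
    inside = max(0.0, 1.0 - c / (c - 1) * err_1nn)
    return (c - 1) / c * (1.0 - math.sqrt(inside))


def make_noisy_blobs(n, flip_prob, rng):
    """Hypothetical toy data: two well-separated 2-D Gaussian blobs,
    with each label flipped independently with probability flip_prob."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        centre = 10.0 * label
        x = (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))
        if rng.random() < flip_prob:
            label = 1 - label  # simulated label noise
        data.append((x, label))
    return data


rng = random.Random(0)
train = make_noisy_blobs(400, flip_prob=0.2, rng=rng)
test = make_noisy_blobs(200, flip_prob=0.2, rng=rng)

err = one_nn_error(train, test)
estimate = cover_hart_lower_bound(err, num_classes=2)
print(f"1-NN error: {err:.3f}, estimated BER lower bound: {estimate:.3f}")
```

In this toy setup the clean classes are practically separable, so the true BER is essentially the flip probability (0.2). No classifier can be expected to beat that on the noisy labels, and a target accuracy above roughly 80% would be an "unrealistic expectation" in the abstract's sense. The estimate from a finite sample will fluctuate around that value.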

Taxonomy

Topics: Machine Learning and Data Classification · Imbalanced Data Classification Techniques · Data Stream Mining Techniques