Generating the Ground Truth: Synthetic Data for Soft Label and Label Noise Research

Sjoerd de Vries; Dirk Thierens

arXiv:2309.04318 · cs.LG · September 24, 2024 · Int. J. Data Sci. Anal.

TL;DR

This paper introduces SYNLABEL, a framework for generating synthetic, noiseless datasets based on real-world data, enabling precise study of label noise and soft label learning in machine learning models.

Contribution

SYNLABEL provides a novel method to create clean, customizable datasets with soft labels, facilitating accurate evaluation of label noise effects and soft label learning techniques.

Findings

01. SYNLABEL accurately quantifies label noise effects.

02. It generates datasets with adjustable complexity.

03. The framework improves evaluation of noise handling methods.

Abstract

In many real-world classification tasks, label noise is an unavoidable issue that adversely affects the generalization error of machine learning models. Additionally, evaluating how methods handle such noise is complicated, as the effect label noise has on their performance cannot be accurately quantified without clean labels. Existing research on label noise typically relies on either noisy or oversimplified simulated data as a baseline, into which additional noise with known properties is injected. In this paper, we introduce SYNLABEL, a framework designed to address these limitations by creating noiseless datasets informed by real-world data. SYNLABEL supports defining a pre-specified or learned function as the ground truth function, which can then be used for generating new clean labels. Furthermore, by repeatedly resampling values for selected features within the domain of the…
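
The two ideas the abstract describes, relabeling data with a ground truth function and resampling selected features to turn hard labels into soft ones, can be illustrated with a minimal sketch. This is not the API of the synlabel repository: the function names, the linear rule used as the ground truth, the random stand-in data, and the choice to resample a hidden column from its own observed values are all assumptions made for illustration. The abstract is cut off before it says how resampled values are chosen, so that step in particular is a guess.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real-world features: 200 rows, 3 columns (hypothetical data).
X = rng.normal(size=(200, 3))


def ground_truth(X):
    """Hypothetical pre-specified ground truth function: a fixed linear rule."""
    return (X[:, 0] + 0.5 * X[:, 1] - X[:, 2] > 0).astype(int)


# Step 1: relabel the data with the ground truth function.
# By construction these hard labels contain no label noise.
y_clean = ground_truth(X)


def soft_labels_by_resampling(X, hidden, n_classes=2, n_samples=100, rng=rng):
    """Estimate a label distribution per row by resampling the hidden columns.

    Assumption: each hidden column is resampled from its own observed values,
    independently of the other columns.
    """
    n_rows = X.shape[0]
    counts = np.zeros((n_rows, n_classes))
    for _ in range(n_samples):
        X_resampled = X.copy()
        for j in hidden:
            X_resampled[:, j] = rng.choice(X[:, j], size=n_rows)
        labels = ground_truth(X_resampled)
        # Tally which class the ground truth assigns to each row this round.
        counts[np.arange(n_rows), labels] += 1
    return counts / n_samples


# Step 2: hide column 2 and derive soft labels over the remaining features.
y_soft = soft_labels_by_resampling(X, hidden=[2])
```

Each row of `y_soft` is a probability distribution over the classes. Rows whose label depends strongly on the hidden column end up with a spread-out distribution, while rows far from the decision boundary stay close to a hard label.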

Code & Models

Repositories

sjoerd-de-vries/synlabel (Official)

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Multi-Objective Optimization Algorithms · Music and Audio Processing