Generating the Ground Truth: Synthetic Data for Soft Label and Label Noise Research
Sjoerd de Vries, Dirk Thierens

TL;DR
This paper introduces SYNLABEL, a framework for generating synthetic, noiseless datasets based on real-world data, enabling precise study of label noise and soft label learning in machine learning models.
Contribution
SYNLABEL provides a novel method to create clean, customizable datasets with soft labels, facilitating accurate evaluation of label noise effects and soft label learning techniques.
Findings
SYNLABEL accurately quantifies label noise effects.
It generates datasets with adjustable complexity.
The framework improves evaluation of noise handling methods.
Abstract
In many real-world classification tasks, label noise is an unavoidable issue that adversely affects the generalization error of machine learning models. Additionally, evaluating how methods handle such noise is complicated, as the effect label noise has on their performance cannot be accurately quantified without clean labels. Existing research on label noise typically relies on either noisy or oversimplified simulated data as a baseline, into which additional noise with known properties is injected. In this paper, we introduce SYNLABEL, a framework designed to address these limitations by creating noiseless datasets informed by real-world data. SYNLABEL supports defining a pre-specified or learned function as the ground truth function, which can then be used for generating new clean labels. Furthermore, by repeatedly resampling values for selected features within the domain of the…
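The mechanism the abstract outlines (a chosen ground truth function produces clean labels, and resampling selected features yields label distributions) can be illustrated with a small sketch. This is not the SYNLABEL implementation: the function `ground_truth`, the two-feature toy data, the choice of `x2` as the resampled feature, and the sample counts are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def ground_truth(x1, x2):
    """Hypothetical pre-specified ground truth function (a deterministic rule)."""
    return int(x1 + x2 > 1.0)


def clean_labels(rows):
    """Apply the ground truth function to every instance to get noiseless hard labels."""
    return [ground_truth(x1, x2) for x1, x2 in rows]


def soft_label(x1, x2_pool, n_samples, rng, n_classes=2):
    """Estimate a soft label for x1 when x2 is treated as unobserved.

    x2 is repeatedly resampled from values seen in the data (an assumption of
    this sketch), and the resulting class frequencies form the soft label.
    """
    counts = Counter(
        ground_truth(x1, rng.choice(x2_pool)) for _ in range(n_samples)
    )
    return [counts[c] / n_samples for c in range(n_classes)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Toy stand-in for a real-world dataset: two features drawn from [0, 1).
rows = [(rng.random(), rng.random()) for _ in range(200)]

hard = clean_labels(rows)
x2_pool = [x2 for _, x2 in rows]

# Soft labels for the first few instances, with x2 hidden and resampled.
soft = [soft_label(x1, x2_pool, 500, rng) for x1, _ in rows[:5]]

for (x1, _), dist in zip(rows[:5], soft):
    print(f"x1={x1:.2f} -> P(class 0)={dist[0]:.2f}, P(class 1)={dist[1]:.2f}")
```

Because the labels come from a known function, any noise injected afterwards can be measured against an exact reference, which is the property the abstract emphasizes.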
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Multi-Objective Optimization Algorithms · Music and Audio Processing
