PuckTrick: A Library for Making Synthetic Data More Realistic
Alessandra Agostini, Andrea Maurino, Blerina Spahiu

TL;DR
PuckTrick is a Python library that systematically contaminates synthetic datasets with realistic errors to evaluate and improve ML model robustness.
Contribution
It introduces a structured approach to inject various real-world data imperfections into synthetic data, aiding robustness testing.
Findings
Models trained on contaminated synthetic data perform better than those trained on error-free synthetic data.
Tree-based and linear models are particularly affected by data contamination.
Systematic contamination helps in assessing model resilience under realistic data conditions.
Abstract
The increasing reliance on machine learning (ML) models for decision-making requires high-quality training data. However, access to real-world datasets is often restricted due to privacy concerns, proprietary restrictions, and incomplete data availability. As a result, synthetic data generation (SDG) has emerged as a viable alternative, enabling the creation of artificial datasets that preserve the statistical properties of real data while ensuring privacy compliance. Despite its advantages, synthetic data is often overly clean and lacks real-world imperfections, such as missing values, noise, outliers, and misclassified labels, which can significantly impact model generalization and robustness. To address this limitation, we introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. The library supports multiple…
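The kind of controlled contamination described in the abstract can be illustrated with a minimal sketch in plain Python. This is not PuckTrick's API: the function names (`inject_missing`, `inject_noise`, `flip_labels`), their parameters, and the list-of-dicts row format are all hypothetical, chosen only to show how a fixed fraction of rows might be corrupted reproducibly.

```python
import random

def _pick_rows(n_rows, rate, rng):
    """Choose round(rate * n_rows) distinct row indices to corrupt."""
    return rng.sample(range(n_rows), round(rate * n_rows))

def inject_missing(rows, column, rate, rng):
    """Return a copy of rows with `column` set to None in a `rate` fraction of rows."""
    out = [dict(r) for r in rows]  # copy so the clean data stays intact
    for i in _pick_rows(len(out), rate, rng):
        out[i][column] = None
    return out

def inject_noise(rows, column, rate, rng, sigma=1.0):
    """Return a copy of rows with Gaussian noise added to `column` in selected rows."""
    out = [dict(r) for r in rows]
    for i in _pick_rows(len(out), rate, rng):
        if out[i][column] is not None:  # skip values that are already missing
            out[i][column] += rng.gauss(0.0, sigma)
    return out

def flip_labels(rows, column, rate, rng, classes=(0, 1)):
    """Return a copy of rows with the label in `column` changed to a different class."""
    out = [dict(r) for r in rows]
    for i in _pick_rows(len(out), rate, rng):
        others = [c for c in classes if c != out[i][column]]
        out[i][column] = rng.choice(others)
    return out

# Hypothetical toy dataset standing in for a clean synthetic table.
rng = random.Random(0)  # fixed seed so the contamination is reproducible
clean = [{"x": float(i), "y": i % 2} for i in range(100)]

dirty = inject_missing(clean, "x", 0.10, rng)
dirty = inject_noise(dirty, "x", 0.20, rng, sigma=0.5)
dirty = flip_labels(dirty, "y", 0.05, rng)

n_missing = sum(r["x"] is None for r in dirty)
n_flipped = sum(a["y"] != b["y"] for a, b in zip(clean, dirty))
```

Each step takes an explicit error rate and a seeded random generator, so the same contaminated dataset can be regenerated and the effect of each error type on a model can be measured separately.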
