PuckTrick: A Library for Making Synthetic Data More Realistic

Alessandra Agostini; Andrea Maurino; Blerina Spahiu

arXiv:2506.18499·cs.LG·April 30, 2026

TL;DR

PuckTrick is a Python library that systematically contaminates synthetic datasets with realistic errors to evaluate and improve ML model robustness.

Contribution

PuckTrick introduces a structured approach to injecting real-world data imperfections, such as missing values, noise, outliers, and mislabeled examples, into synthetic data to support robustness testing.

Findings

01

Models trained on contaminated synthetic data perform better than those trained on error-free synthetic data.

02

Tree-based and linear models are particularly affected by data contamination.

03

Systematic contamination helps in assessing model resilience under realistic data conditions.

Abstract

The increasing reliance on machine learning (ML) models for decision-making requires high-quality training data. However, access to real-world datasets is often restricted due to privacy concerns, proprietary restrictions, and incomplete data availability. As a result, synthetic data generation (SDG) has emerged as a viable alternative, enabling the creation of artificial datasets that preserve the statistical properties of real data while ensuring privacy compliance. Despite its advantages, synthetic data is often overly clean and lacks real-world imperfections, such as missing values, noise, outliers, and misclassified labels, which can significantly impact model generalization and robustness. To address this limitation, we introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. The library supports multiple…
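
To make the idea of controlled contamination concrete, here is a minimal sketch in plain Python of three of the error types the abstract names: missing values, noise, and mislabeled examples. It does not use Pucktrick's actual API. The function names (`inject_missing`, `inject_noise`, `flip_labels`), their parameters, and the toy dataset are all hypothetical and chosen only for illustration.

```python
import random


def _pick_rows(n_rows, fraction, rng):
    """Choose round(fraction * n_rows) distinct row indices at random."""
    return rng.sample(range(n_rows), round(fraction * n_rows))


def inject_missing(rows, column, fraction, rng):
    """Return a copy of rows with `fraction` of `column` values set to None."""
    out = [dict(r) for r in rows]
    for i in _pick_rows(len(out), fraction, rng):
        out[i][column] = None
    return out


def inject_noise(rows, column, fraction, scale, rng):
    """Return a copy with Gaussian noise (std `scale`) added to `fraction` of `column`."""
    out = [dict(r) for r in rows]
    for i in _pick_rows(len(out), fraction, rng):
        out[i][column] += rng.gauss(0.0, scale)
    return out


def flip_labels(rows, column, fraction, labels, rng):
    """Return a copy where `fraction` of the labels are replaced by a different label."""
    out = [dict(r) for r in rows]
    for i in _pick_rows(len(out), fraction, rng):
        current = out[i][column]
        out[i][column] = rng.choice([lab for lab in labels if lab != current])
    return out


# Hypothetical "overly clean" synthetic dataset: 100 rows, no errors.
rng = random.Random(0)  # fixed seed so the contamination is reproducible
clean = [{"age": 20.0 + i, "income": 1000.0 * i, "label": i % 2} for i in range(100)]

# Apply each contamination step at a chosen error rate.
dirty = inject_missing(clean, "income", 0.10, rng)
dirty = inject_noise(dirty, "age", 0.20, 5.0, rng)
dirty = flip_labels(dirty, "label", 0.05, [0, 1], rng)
```

Each function copies its input instead of mutating it, so the clean dataset stays available as a baseline when comparing models trained on clean and contaminated versions. The explicit error rate per column is the "controlled" part: the same seed and rates reproduce the same contaminated dataset.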
