Targeted synthetic data generation for tabular data via hardness characterization
Tommaso Ferracci, Leonie Tabea Goldmann, Anton Hinel, Francesco Sanna Passino

TL;DR
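This paper presents a targeted synthetic data augmentation method for tabular data: data valuation is used to characterize hardness, and synthetic points are generated only from the hardest observations, improving out-of-sample performance at lower computational cost than non-targeted augmentation.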
Contribution
It introduces a simple, computationally efficient pipeline that generates synthetic data based on hardness characterization, outperforming non-targeted methods.
Findings
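Shapley-based data valuation performs comparably with learning-based methods on hardness characterization tasks, at a lower computational cost.
Synthetic data generators trained on the hardest points improve out-of-sample predictions relative to non-targeted augmentation.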
The approach is more computationally efficient than non-targeted augmentation.
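As a rough illustration of the first finding, the sketch below estimates Shapley-style data values by Monte Carlo permutation sampling and treats the lowest-valued points as the hardest. It is not the paper's implementation: the 1-nearest-neighbour utility, the number of permutations, the function names, and the toy data (including the deliberately mislabeled last row) are all assumptions made for this example.

```python
import random


def knn_accuracy(train, valid):
    """Utility (assumed): 1-nearest-neighbour accuracy of `train` on `valid`.

    Returns 0.0 for an empty training set.
    """
    if not train:
        return 0.0
    correct = 0
    for xv, yv in valid:
        # Nearest training row by squared Euclidean distance.
        nearest = min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], xv)))
        correct += nearest[1] == yv
    return correct / len(valid)


def monte_carlo_shapley(train, valid, n_permutations=200, seed=0):
    """Average each point's marginal utility gain over random orderings."""
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset, previous = [], 0.0
        for i in order:
            subset.append(train[i])
            current = knn_accuracy(subset, valid)
            values[i] += current - previous  # marginal contribution of point i
            previous = current
    return [v / n_permutations for v in values]


# Hypothetical toy data: two clusters plus one mislabeled row (index 4).
train = [
    ((0.0, 0.0), 0),
    ((0.3, 0.1), 0),
    ((5.0, 5.0), 1),
    ((5.2, 4.8), 1),
    ((0.15, 0.2), 1),  # sits in the class-0 cluster but carries label 1
]
valid = [((0.2, 0.2), 0), ((0.0, 0.3), 0), ((5.0, 5.1), 1), ((4.9, 5.0), 1)]

values = monte_carlo_shapley(train, valid)
hardest_index = min(range(len(values)), key=values.__getitem__)
print(hardest_index, [round(v, 3) for v in values])
```

Because the marginal gains telescope within each permutation, the values sum to the utility of the full training set, which is a convenient sanity check. Whether "lowest value" is the right definition of "hardest" depends on the valuation method and threshold used, which this summary does not specify.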
Abstract
Data augmentation via synthetic data generation has been shown to be effective in improving model performance and robustness in the context of scarce or low-quality data. Using the data valuation framework to statistically identify beneficial and detrimental observations, we introduce a simple augmentation pipeline that generates only high-value training points based on hardness characterization, in a computationally efficient manner. We first empirically demonstrate via benchmarks on real data that Shapley-based data valuation methods perform comparably with learning-based methods in hardness characterization tasks, while offering significant computational advantages. Then, we show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on a number of tabular datasets. Our approach improves the quality of out-of-sample predictions and it…
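The targeted augmentation step described in the abstract can be sketched as follows, under stated assumptions. Hardness scores are taken as given (the numbers here are hypothetical), the hardest fraction is selected, and a generator is fit only on that subset. The abstract refers to trained synthetic data generators; this sketch swaps in a simple same-label linear interpolation purely as a stand-in, and `select_hardest`, `interpolate_synthetic`, and the 50% fraction are illustrative names and choices, not the authors'.

```python
import math
import random


def select_hardest(scores, fraction):
    """Indices of the `fraction` of points with the lowest scores (assumed hardest)."""
    k = max(1, math.ceil(fraction * len(scores)))
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]


def interpolate_synthetic(rows, n_new, seed=0):
    """Stand-in generator (assumption): interpolate between two rows sharing a label."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    labels = sorted(by_label)
    synthetic = []
    for _ in range(n_new):
        y = rng.choice(labels)
        a, b = rng.choice(by_label[y]), rng.choice(by_label[y])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(([ai + t * (bi - ai) for ai, bi in zip(a, b)], y))
    return synthetic


# Hypothetical training rows and hardness scores (lower = harder, by assumption).
train = [
    ([0.1, 1.0], 0),
    ([0.9, 1.4], 1),
    ([0.2, 0.8], 0),
    ([1.1, 1.6], 1),
    ([0.0, 0.9], 0),
    ([0.7, 1.2], 1),
]
scores = [0.30, -0.10, 0.25, 0.02, 0.40, -0.05]

hardest = select_hardest(scores, fraction=0.5)
hard_rows = [train[i] for i in hardest]
synthetic = interpolate_synthetic(hard_rows, n_new=4)
augmented = train + synthetic  # original data plus targeted synthetic rows
print(hardest, len(augmented))
```

A non-targeted baseline would call the same generator on all of `train` instead of `hard_rows`; fitting on the smaller subset is where the computational saving claimed above would come from.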
Taxonomy
Topics: Machine Learning in Materials Science · Adversarial Robustness in Machine Learning · Data Quality and Management
