Targeted synthetic data generation for tabular data via hardness characterization

Tommaso Ferracci; Leonie Tabea Goldmann; Anton Hinel; Francesco Sanna Passino

arXiv:2410.00759 · cs.LG · February 11, 2025

TL;DR

This paper presents a targeted synthetic data augmentation method for tabular data that uses data valuation to identify high-value points, leading to improved model performance and efficiency.

Contribution

It introduces a simple, computationally efficient pipeline that generates synthetic data based on hardness characterization, outperforming non-targeted methods.

Findings

1. Shapley-based data valuation performs comparably to learning-based methods on hardness characterization tasks.

2. Synthetic data generated from the hardest points improves out-of-sample predictions.

3. The approach is more computationally efficient than non-targeted augmentation.

Abstract

Data augmentation via synthetic data generation has been shown to be effective in improving model performance and robustness in the context of scarce or low-quality data. Using the data valuation framework to statistically identify beneficial and detrimental observations, we introduce a simple augmentation pipeline that generates only high-value training points based on hardness characterization, in a computationally efficient manner. We first empirically demonstrate via benchmarks on real data that Shapley-based data valuation methods perform comparably with learning-based methods in hardness characterization tasks, while offering significant computational advantages. Then, we show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on a number of tabular datasets. Our approach improves the quality of out-of-sample predictions and it…
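The pipeline described in the abstract (value the training points, keep the hardest ones, generate synthetic data from them only) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes KNN-Shapley as the Shapley-based valuation method, treats the lowest-valued points as the "hardest", and swaps in a simple same-class interpolation as a stand-in for a trained synthetic data generator. The function names `knn_shapley`, `hardest_indices`, and `interpolate_synthetic`, the toy data, and the 25% cutoff are all hypothetical.

```python
import numpy as np


def knn_shapley(X_train, y_train, X_val, y_val, k=5):
    """KNN-Shapley values via the closed-form recursion, averaged over validation points.

    Assumption: this is one possible Shapley-based valuation; the abstract
    does not name the specific method.
    """
    n = len(X_train)
    values = np.zeros(n)
    for x, y in zip(X_val, y_val):
        # Rank training points from nearest to farthest from this validation point.
        order = np.argsort(np.linalg.norm(X_train - x, axis=1))
        match = (y_train[order] == y).astype(float)
        s = np.zeros(n)
        s[n - 1] = match[n - 1] / n
        for i in range(n - 2, -1, -1):  # i is 0-based, so the rank is i + 1
            rank = i + 1
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank
        values[order] += s
    return values / len(X_val)


def hardest_indices(values, fraction=0.25):
    """Indices of the lowest-valued points (assumed here to be the hardest)."""
    n_hard = max(1, int(round(fraction * len(values))))
    return np.argsort(values)[:n_hard]


def interpolate_synthetic(X, y, n_new, rng):
    """Stand-in generator: random convex combinations of same-class pairs.

    Hypothetical placeholder for a learned tabular generator.
    """
    X_new, y_new = [], []
    for _ in range(n_new):
        i = rng.integers(len(X))
        j = rng.choice(np.flatnonzero(y == y[i]))
        lam = rng.random()
        X_new.append(lam * X[i] + (1 - lam) * X[j])
        y_new.append(y[i])
    return np.array(X_new), np.array(y_new)


# Toy two-class data (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (60, 2)), rng.normal(1, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]
X_train, y_train, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

values = knn_shapley(X_train, y_train, X_val, y_val, k=5)
hard = hardest_indices(values, fraction=0.25)

# Generate synthetic points from the hard subset only, then augment.
X_syn, y_syn = interpolate_synthetic(X_train[hard], y_train[hard], n_new=40, rng=rng)
X_aug = np.vstack([X_train, X_syn])
y_aug = np.concatenate([y_train, y_syn])
```

A downstream model would then be trained on `X_aug, y_aug` and compared against one trained on data augmented from the full training set. The potential saving in this sketch comes from the generator only seeing the hard subset rather than every training point.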

Code & Models

Repositories

tommaso-ferracci/Targeted_Augmentation_Amex
Official

Taxonomy

Topics: Machine Learning in Materials Science · Adversarial Robustness in Machine Learning · Data Quality and Management