Towards High Supervised Learning Utility Training Data Generation: Data Pruning and Column Reordering
Tung Sum Thomas Kwok, Zeyong Zhang, Chi-Hua Wang, Guang Cheng

TL;DR
This paper introduces PRRO, a novel pipeline that improves synthetic tabular data generation for supervised learning by pruning noisy data and reordering columns, leading to significant performance gains.
Contribution
PRRO integrates data pruning and column reordering techniques into tabular data synthesis, enhancing the utility of synthetic data for supervised learning tasks.
Findings
PRRO improves predictive performance by up to 871%.
Synthetic data with PRRO closely matches original class distributions.
PRRO enhances synthetic data quality on imbalanced datasets.
Abstract
Tabular data synthesis for supervised learning ('SL') model training is gaining popularity in industries such as healthcare, finance, and retail. Despite the progress made in tabular data generators, models trained with synthetic data often underperform compared to those trained with original data. This low SL utility of synthetic data stems from two problems: tabular generators exaggerate class imbalance, and they overlook the SL data relationships in the table. To address these challenges, we draw inspiration from techniques in emerging data-centric artificial intelligence and elucidate Pruning and ReOrdering ('PRRO'), a novel pipeline that integrates data-centric techniques into tabular data synthesis. PRRO incorporates data pruning to guide the table generator towards observations with high signal-to-noise ratio, ensuring that the class distribution of synthetic data closely matches that of the original data.…
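The abstract names the two steps but the excerpt here is truncated before it gives their details, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a hypothetical pruning score (a nearest-centroid margin standing in for "signal-to-noise ratio"), a hypothetical ordering rule (features sorted by absolute correlation with the label, label column last), and hypothetical names `prune_rows`, `reorder_columns`, and `keep_frac`. Pruning is done per class so that the class proportions of the original table are preserved.

```python
import numpy as np

def prune_rows(X, y, keep_frac=0.8):
    """Return indices of rows to keep (needs at least two classes).

    Assumption: "noisiness" is approximated by a nearest-centroid margin,
    which is NOT taken from the paper. Pruning is applied within each class,
    so class proportions stay close to the original ones.
    """
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Distance from every row to every class centroid: (n_rows, n_classes)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    own = np.searchsorted(classes, y)          # index of each row's own class
    rows = np.arange(len(y))
    own_dist = dists[rows, own]
    other = dists.copy()
    other[rows, own] = np.inf                  # mask out the own-class centroid
    margin = other.min(axis=1) - own_dist      # larger margin = cleaner row
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        n_keep = max(1, int(round(keep_frac * len(idx))))
        best_first = idx[np.argsort(margin[idx])[::-1]]
        keep.extend(best_first[:n_keep])
    return np.sort(np.array(keep))

def reorder_columns(X, y):
    """Return a feature order, strongest |correlation with label| first.

    Assumption: this ordering rule is illustrative only.
    """
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                     for j in range(X.shape[1])])
    return np.argsort(-corr)

# Toy imbalanced table: 80 rows of class 0, 20 rows of class 1.
rng = np.random.default_rng(0)
y = np.array([0] * 80 + [1] * 20)
X = np.column_stack([
    3.0 * y + rng.normal(size=100),  # informative feature
    rng.normal(size=100),            # pure-noise feature
])

keep = prune_rows(X, y, keep_frac=0.75)
order = reorder_columns(X[keep], y[keep])
# Table handed to a generator: reordered features, label as the last column.
train_table = np.column_stack([X[keep][:, order], y[keep]])
```

In this sketch the pruned table would be passed to whichever tabular generator is in use; placing the label last is one plausible choice for generators that emit columns in sequence, and is stated here as an assumption rather than as the paper's design.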
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Data Processing Techniques · Anomaly Detection Techniques and Applications
