ReTabSyn: Realistic Tabular Data Synthesis via Reinforcement Learning
Xiaofeng Lin, Seungbae Kim, Zhuoya Li, Zachary DeSoto, Charles Fleming, Guang Cheng

TL;DR
ReTabSyn introduces a reinforcement learning-based method for synthesizing realistic tabular data by focusing on conditional distributions, improving data utility especially in low-data, imbalanced, and distribution-shifted scenarios.
Contribution
The paper proposes ReTabSyn, a novel reinforcement learning pipeline that emphasizes learning conditional distributions for more effective tabular data synthesis.
Findings
ReTabSyn outperforms state-of-the-art baselines on benchmarks with small sample sizes.
The method effectively handles class imbalance and distribution shift.
It can incorporate expert constraints into synthetic data generation.
Abstract
Deep generative models can help with data scarcity and privacy by producing synthetic training data, but they struggle in low-data, imbalanced tabular settings to fully learn the complex data distribution. We argue that striving for the full joint distribution could be overkill; for greater data efficiency, models should prioritize learning the conditional distribution, as suggested by recent theoretical analysis. Therefore, we overcome this limitation with ReTabSyn, a Reinforced Tabular Synthesis pipeline that provides direct feedback on feature correlation preservation during synthesizer training. This objective encourages the generator to prioritize the most useful predictive signals when training data is limited, thereby strengthening downstream model utility. We empirically fine-tune a language model-based generator using this…
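To make "direct feedback on feature correlation preservation" concrete, here is a minimal sketch of a scalar reward that scores a synthetic batch by how closely its feature-target correlations match those of the real data. It is an illustration under stated assumptions, not the paper's reward: the function name `correlation_reward`, the choice of Pearson correlation, and the rescaling to [0, 1] are all hypothetical.

```python
import numpy as np


def correlation_reward(real_X, real_y, synth_X, synth_y):
    """Hypothetical reward in [0, 1]: 1.0 when every feature-target Pearson
    correlation in the synthetic batch equals its real-data counterpart."""

    def feature_target_corr(X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        Xc = X - X.mean(axis=0)  # center each feature column
        yc = y - y.mean()        # center the target
        denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        # Constant columns (or a constant target) get correlation 0, not NaN.
        return np.divide(Xc.T @ yc, denom,
                         out=np.zeros(X.shape[1]), where=denom > 0)

    gap = np.abs(feature_target_corr(real_X, real_y)
                 - feature_target_corr(synth_X, synth_y))
    # Each per-feature gap lies in [0, 2]; rescale the mean gap to [0, 1].
    return 1.0 - gap.mean() / 2.0
```

In a reinforcement learning loop, a score of this kind could be computed on each generated batch and fed back to the generator, so that samples preserving the predictive relationship between features and target are favored over samples that only match marginal statistics.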
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning · Imbalanced Data Classification Techniques
