Breaking the Quality-Privacy Tradeoff in Tabular Data Generation via In-Context Learning
Xinyan Han, Yan Lu, Xiaoyu Lin, Yuanyuan Jiang, Yuanrui Wang, Xuanyue Li, Wenchao Zou, Xingxuan Zhang

TL;DR
This paper introduces DiffICL, a novel approach for tabular data synthesis that leverages in-context learning with pretrained priors to enhance data quality and privacy simultaneously, especially in small-data regimes.
Contribution
The paper proposes DiffICL, a new method that uses in-context learning and pretrained structural priors to improve tabular data generation without sacrificing privacy.
Findings
DiffICL outperforms existing models in data quality and privacy protection.
Synthetic data from DiffICL effectively augments real datasets.
DiffICL mitigates the quality-privacy tradeoff in small-data regimes.
Abstract
Tabular data synthesis aims to generate high-quality data while preserving privacy. However, we find that existing tabular generative models exhibit a clear tradeoff in the small-data regime: improving data quality typically comes at the cost of increased memorization of training samples, thereby weakening privacy protection. This tradeoff arises because small training sets make it difficult for dataset-specific generative models to distinguish generalizable structure from sample-specific patterns. To address this, we propose DiffICL, which formulates tabular data generation as an in-context learning problem. Instead of fitting each dataset from scratch, DiffICL leverages pretrained structural priors learned from a large collection of datasets, enabling it to infer data distributions from limited context rather than memorizing individual samples. We evaluate DiffICL on 14 real-world…
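To make the memorization concern in the abstract concrete, the sketch below computes a distance-to-closest-record score, one common proxy for how closely synthetic rows copy training rows. This is an illustrative assumption, not the paper's evaluation protocol: the summary above does not state which privacy metric DiffICL is measured with, and the function names, the threshold, and the toy data are all hypothetical. It also assumes features are already numerically encoded and on comparable scales.

```python
import numpy as np


def distance_to_closest_record(synthetic, train):
    """For each synthetic row, Euclidean distance to its nearest training row."""
    synthetic = np.asarray(synthetic, dtype=float)
    train = np.asarray(train, dtype=float)
    # Pairwise differences via broadcasting: shape (n_syn, n_train, n_features).
    diffs = synthetic[:, None, :] - train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)


def near_copy_rate(synthetic, train, threshold):
    """Fraction of synthetic rows within `threshold` of some training row."""
    dcr = distance_to_closest_record(synthetic, train)
    return float((dcr <= threshold).mean())


# Toy data (hypothetical): a small training table with 4 numeric features.
rng = np.random.default_rng(0)
train = rng.normal(size=(50, 4))

# A "memorizing" generator: training rows plus tiny noise.
copied = train[:20] + rng.normal(scale=1e-3, size=(20, 4))
# A "generalizing" generator: fresh draws from the same distribution.
fresh = rng.normal(size=(20, 4))

copied_rate = near_copy_rate(copied, train, threshold=0.05)
fresh_rate = near_copy_rate(fresh, train, threshold=0.05)
```

A higher near-copy rate suggests more memorization. In practice the threshold is usually calibrated against a held-out real set rather than fixed by hand, and the pairwise distance matrix would be computed in chunks for larger tables.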
