Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation
Jessup Byun, Xiaofeng Lin, Joshua Ward, Guang Cheng

TL;DR
This paper benchmarks the privacy risks of foundation models in synthetic tabular data generation, revealing high leakage potential and proposing prompt tweaks to mitigate risks while maintaining data utility.
Contribution
It provides the first comprehensive benchmark of foundation models' privacy leakage in tabular data synthesis and explores simple prompt-based mitigation strategies.
Findings
Foundation models have higher privacy leakage than baselines.
LLaMA 3.3 70B shows a true-positive rate up to 54% higher in membership inference.
Prompt tweaks can reduce leakage significantly while preserving data fidelity.
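A membership-inference true-positive rate, as cited in the findings above, is usually reported at a fixed low false-positive rate. The sketch below shows one way to compute it, under assumptions: the function name `tpr_at_fpr`, the convention that a higher score means "more likely a seed row", and the 1% default are illustrative choices, not details taken from the paper.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01):
    """Return the attack's true-positive rate at a threshold whose
    false-positive rate does not exceed target_fpr.

    Assumption: higher score = attack believes the row was a seed (member).
    Both score lists must be non-empty.
    """
    # Sort non-member scores from most to least suspicious.
    nonmembers = sorted(nonmember_scores, reverse=True)
    # Number of non-members the attack may flag without exceeding target_fpr.
    allowed = int(target_fpr * len(nonmembers))
    # Flag only scores strictly above the (allowed + 1)-th largest
    # non-member score, so at most `allowed` non-members are flagged.
    if allowed < len(nonmembers):
        threshold = nonmembers[allowed]
    else:
        threshold = float("-inf")
    flagged_members = sum(1 for s in member_scores if s > threshold)
    return flagged_members / len(member_scores)


if __name__ == "__main__":
    # Toy scores, made up for illustration only.
    members = [9.5, 8.5, 3, 2]
    nonmembers = list(range(10))
    print(tpr_at_fpr(members, nonmembers, target_fpr=0.1))
```

Fixing the false-positive rate keeps the comparison between generators fair: an attack that flags everything would otherwise score a perfect true-positive rate.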
Abstract
Synthetic tabular data is essential for machine learning workflows, especially for expanding small or imbalanced datasets and enabling privacy-preserving data sharing. However, state-of-the-art generative models (GANs, VAEs, diffusion models) rely on large datasets with thousands of examples. In low-data settings, often the primary motivation for synthetic data, these models can overfit, leak sensitive records, and require frequent retraining. Recent work uses large pre-trained transformers to generate rows via in-context learning (ICL), which needs only a few seed examples and no parameter updates, avoiding retraining. But ICL repeats seed rows verbatim, introducing a new privacy risk that has only been studied in text. The severity of this risk in tabular synthesis, where a single row may identify a person, remains unclear. We address this gap with the first benchmark of three…
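The verbatim-repetition risk described in the abstract can be checked directly by counting how many generated rows exactly match a seed row. This is a minimal sketch, not the paper's evaluation: `exact_copy_rate` and its string normalization are hypothetical, and exact matching misses near-copies that differ in a single cell.

```python
def exact_copy_rate(seed_rows, synthetic_rows):
    """Fraction of synthetic rows that exactly match some seed row.

    Rows are sequences of cell values. Cells are compared as stripped
    strings so that 51 and "51 " count as the same value (an assumption
    made for this sketch).
    """
    def normalize(row):
        return tuple(str(value).strip() for value in row)

    seeds = {normalize(row) for row in seed_rows}
    if not synthetic_rows:
        return 0.0
    copies = sum(1 for row in synthetic_rows if normalize(row) in seeds)
    return copies / len(synthetic_rows)


if __name__ == "__main__":
    # Toy rows, made up for illustration only.
    seeds = [(34, "F", "NY"), (51, "M", "CA")]
    synthetic = [(34, "F", "NY"), (40, "F", "TX"), ("51", "M ", "CA"), (29, "M", "WA")]
    print(exact_copy_rate(seeds, synthetic))
```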
Taxonomy
Topics: 3D Modeling in Geospatial Applications · Digital and Cyber Forensics
