Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation

Jessup Byun; Xiaofeng Lin; Joshua Ward; Guang Cheng

arXiv:2507.17066 · cs.LG · July 24, 2025

Open Access

TL;DR

This paper benchmarks the privacy risks of foundation models in synthetic tabular data generation, revealing high leakage potential and proposing prompt tweaks to mitigate risks while maintaining data utility.

Contribution

It provides the first comprehensive benchmark of foundation models' privacy leakage in tabular data synthesis and explores simple prompt-based mitigation strategies.

Findings

01  Foundation models have higher privacy leakage than baselines.

02  LLaMA 3.3 70B shows up to 54% higher true-positive rate in membership inference.

03  Prompt tweaks can reduce leakage significantly while preserving data fidelity.
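The true-positive rate reported for membership inference is normally read at a fixed, low false-positive rate. The sketch below shows one way such a figure could be computed from attack scores. It is an illustration only: the function name `tpr_at_fpr`, the score convention (higher means "more likely a member"), and the thresholding rule are assumptions, not the benchmark's actual attack or evaluation code.

```python
def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.01):
    """TPR of a score-threshold attack, with FPR capped at max_fpr.

    Hypothetical helper for illustration; higher score means the attack
    believes the record was used as a seed or training row.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")

    # Sort non-member scores from most to least suspicious.
    ranked = sorted(nonmember_scores, reverse=True)

    # Number of non-members the attack may wrongly flag.
    allowed_fp = int(max_fpr * len(ranked))

    # Flag only scores strictly above this cut-off, so at most
    # allowed_fp non-members can be flagged.
    if allowed_fp < len(ranked):
        threshold = ranked[allowed_fp]
    else:
        threshold = float("-inf")

    flagged_members = sum(score > threshold for score in member_scores)
    return flagged_members / len(member_scores)


if __name__ == "__main__":
    members = [5, 3, 2, 1]        # toy attack scores for true members
    nonmembers = [1, 2, 3, 4]     # toy attack scores for held-out rows
    print(tpr_at_fpr(members, nonmembers, max_fpr=0.5))
```

Comparing this value across generators at the same `max_fpr` is what makes a statement such as "higher true-positive rate" meaningful; a gap in true-positive rate at different false-positive rates would not be comparable.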

Abstract

Synthetic tabular data is essential for machine learning workflows, especially for expanding small or imbalanced datasets and enabling privacy-preserving data sharing. However, state-of-the-art generative models (GANs, VAEs, diffusion models) rely on large datasets with thousands of examples. In low-data settings, often the primary motivation for synthetic data, these models can overfit, leak sensitive records, and require frequent retraining. Recent work uses large pre-trained transformers to generate rows via in-context learning (ICL), which needs only a few seed examples and no parameter updates, avoiding retraining. But ICL repeats seed rows verbatim, introducing a new privacy risk that has only been studied in text. The severity of this risk in tabular synthesis, where a single row may identify a person, remains unclear. We address this gap with the first benchmark of three…
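The verbatim-repetition risk described in the abstract can be made concrete with a simple exact-match check between seed rows and generated rows. The following is a minimal sketch under stated assumptions: rows are represented as tuples of hashable values, and the helper names `verbatim_copy_rate` and `exposed_seed_fraction` are hypothetical. It is a crude sanity check, not the membership inference evaluation used in the benchmark.

```python
def verbatim_copy_rate(seed_rows, synthetic_rows):
    """Fraction of synthetic rows that exactly equal some seed row."""
    if not synthetic_rows:
        return 0.0
    seed_set = {tuple(row) for row in seed_rows}
    copies = sum(tuple(row) in seed_set for row in synthetic_rows)
    return copies / len(synthetic_rows)


def exposed_seed_fraction(seed_rows, synthetic_rows):
    """Fraction of distinct seed rows that reappear in the synthetic output."""
    seed_set = {tuple(row) for row in seed_rows}
    if not seed_set:
        return 0.0
    synthetic_set = {tuple(row) for row in synthetic_rows}
    return len(seed_set & synthetic_set) / len(seed_set)


if __name__ == "__main__":
    # Toy data invented for this example only.
    seeds = [(34, "F", "diabetes"), (51, "M", "none")]
    generated = [
        (34, "F", "diabetes"),   # exact copy of a seed row
        (40, "F", "none"),
        (34, "F", "diabetes"),   # the same seed row repeated
        (29, "M", "asthma"),
    ]
    print(verbatim_copy_rate(seeds, generated))
    print(exposed_seed_fraction(seeds, generated))
```

Exact matching only catches literal copies; near-duplicates (for example, a row with one numeric field slightly perturbed) would need a distance-based comparison, which this sketch does not attempt.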
