When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation
Joshua Ward, Bochao Gu, Chi-Hua Wang, Guang Cheng

TL;DR
This paper reveals privacy risks in LLM-based tabular data generation, demonstrating that models can memorize and leak numeric patterns, and proposes defenses to mitigate this vulnerability.
Contribution
The work introduces LevAtt, a simple no-box membership inference attack (MIA) that requires access only to the synthetic data generated by an LLM, and proposes a novel digit perturbation sampling method to defend against it.
Findings
The attack exposes significant privacy leakage across models and datasets.
The proposed sampling strategy effectively reduces privacy risks with minimal utility loss.
Against some models, the attack acts as a perfect membership classifier.
Abstract
Large Language Models (LLMs) have recently demonstrated remarkable performance in generating high-quality tabular synthetic data. In practice, two primary approaches have emerged for adapting LLMs to tabular data generation: (i) fine-tuning smaller models directly on tabular datasets, and (ii) prompting larger models with examples provided in context. In this work, we show that popular implementations from both regimes exhibit a tendency to compromise privacy by reproducing memorized patterns of numeric digits from their training data. To systematically analyze this risk, we introduce a simple No-box Membership Inference Attack (MIA) called LevAtt that assumes adversarial access to only the generated synthetic data and targets the string sequences of numeric digits in synthetic observations. Using this approach, our attack exposes substantial privacy leakage across a wide range of…
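To make the idea of a no-box attack on digit strings concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not spell out how LevAtt scores a record, so the use of Levenshtein edit distance (suggested only by the attack's name), the comma-joined row serialization, and the helper names `row_to_string`, `levenshtein`, and `membership_score` are all assumptions made for illustration. The sketch scores a candidate record by how closely its serialized numeric string matches the nearest synthetic row.

```python
def row_to_string(record):
    """Serialize a record's values into one comma-separated string.
    (Assumed serialization; the paper's exact format is not given here.)"""
    return ",".join(str(value) for value in record)


def levenshtein(a, b):
    """Edit distance between two strings via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def membership_score(candidate, synthetic_rows):
    """Higher score = candidate looks more like a training member.
    Uses the negated distance to the closest synthetic row (assumed rule)."""
    target = row_to_string(candidate)
    return -min(levenshtein(target, row_to_string(row)) for row in synthetic_rows)


# Toy data, invented for illustration only.
synthetic = [(52.31, 1840), (47.9, 2210)]
print(membership_score((52.31, 1840), synthetic))  # exact digit match
print(membership_score((38.77, 905), synthetic))   # no close match
```

Under these assumptions, a record whose digit string is reproduced verbatim in the synthetic data receives the maximum score of 0, while unrelated records score lower. Thresholding that score yields a membership decision using nothing but the released synthetic rows, which is what makes the setting "no-box".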
