Private Seeds, Public LLMs: Realistic and Privacy-Preserving Synthetic Data Generation

Qian Ma; Sarah Rajtmajer

arXiv:2604.07486·cs.CR·April 14, 2026

TL;DR

This paper introduces RPSG, a method that uses private seeds and differential privacy to generate realistic synthetic replicas of private text while balancing utility against privacy protection.

Contribution

The paper presents RPSG, an approach that combines private seeds with privacy-preserving strategies, including a formal differential privacy (DP) mechanism in candidate selection, to generate realistic synthetic data.

Findings

01 RPSG achieves high fidelity to private data.

02 RPSG provides strong privacy guarantees.

03 In experiments, RPSG outperforms state-of-the-art private synthetic data generation methods.

Abstract

Large language models (LLMs) have emerged as a powerful tool for synthetic data generation. A particularly important use case is producing synthetic replicas of private text, which requires carefully balancing privacy and utility. We propose Realistic and Privacy-Preserving Synthetic Data Generation (RPSG), which uses private seeds and integrates privacy-preserving strategies, including a formal differential privacy (DP) mechanism in the candidate selection, to generate realistic synthetic data. Comprehensive experiments against state-of-the-art private synthetic data generation methods demonstrate that RPSG achieves high fidelity to private data while providing strong privacy protection.
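
The abstract does not say which DP mechanism RPSG applies during candidate selection. Purely as an illustration of what a differentially private selection step can look like, the sketch below uses the exponential mechanism, a standard technique that picks candidate i with probability proportional to exp(epsilon * score_i / (2 * sensitivity)). The function names, the vote-count scores, and the epsilon value are all hypothetical and are not taken from the paper.

```python
import math
import random


def selection_probabilities(scores, epsilon, sensitivity=1.0):
    """Exponential-mechanism probabilities for each candidate.

    scores: one utility score per candidate (hypothetical here: how many
            private seeds "vote" for that candidate, so changing one private
            record changes any score by at most `sensitivity`).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scale = epsilon / (2.0 * sensitivity)
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(scale * (s - top)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def select_candidate(scores, epsilon, sensitivity=1.0, rng=None):
    """Sample one candidate index according to the exponential mechanism."""
    rng = rng or random.Random()
    probs = selection_probabilities(scores, epsilon, sensitivity)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


# Hypothetical example: three synthetic candidates with vote counts 3, 7, 5.
votes = [3, 7, 5]
probs = selection_probabilities(votes, epsilon=1.0)
chosen = select_candidate(votes, epsilon=1.0, rng=random.Random(0))
```

A smaller epsilon flattens the probabilities toward uniform (more privacy, less utility), while a larger epsilon concentrates them on the highest-scoring candidate. That is the privacy-utility trade-off the abstract refers to, though the paper's actual mechanism and scoring may differ.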
