Private Text Generation by Seeding Large Language Model Prompts
Supriya Nagesh, Justin Y. Chen, Nina Mishra, Tal Wagner

TL;DR
This paper introduces DP-KPS, a method that generates private synthetic text from a sensitive corpus by querying a large language model only through differentially private prompts, enabling privacy-preserving data sharing for machine learning.
Contribution
The paper presents DP-KPS, a prompt-based approach that achieves differential privacy in synthetic text generation without training or fine-tuning a model.
Findings
Synthetic corpora retain predictive power for ML tasks
DP-KPS effectively balances privacy and diversity
Method requires minimal compute and no model training
Abstract
We explore how private synthetic text can be generated by suitably prompting a large language model (LLM). This addresses a challenge for organizations like hospitals, which hold sensitive text data like patient medical records, and wish to share it in order to train machine learning models for medical tasks, while preserving patient privacy. Methods that rely on training or finetuning a model may be out of reach, either due to API limits of third-party LLMs, or due to ethical and legal prohibitions on sharing the private data with the LLM itself. We propose Differentially Private Keyphrase Prompt Seeding (DP-KPS), a method that generates a private synthetic text corpus from a sensitive input corpus, by accessing an LLM only through privatized prompts. It is based on seeding the prompts with private samples from a distribution over phrase embeddings, thus capturing the input corpus…
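To make the idea of a privatized prompt concrete, here is a minimal sketch under strong simplifying assumptions. It is not the DP-KPS algorithm: the abstract describes sampling from a distribution over phrase embeddings, whereas this sketch swaps in a plain Laplace-noised histogram over a fixed, public keyphrase vocabulary. The function names, the vocabulary, and the prompt wording are all hypothetical.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_phrase_weights(doc_phrases, vocabulary, epsilon, rng):
    """Noisy per-phrase counts over a public vocabulary (assumed data-independent).

    Each document contributes to at most one phrase, so adding or removing
    a document changes the histogram by at most 1 in L1 norm. Laplace noise
    of scale 1/epsilon then gives epsilon-differential privacy.
    """
    vocab = set(vocabulary)
    counts = Counter()
    for phrases in doc_phrases:
        for phrase in phrases:
            if phrase in vocab:
                counts[phrase] += 1
                break  # cap the contribution of this document at one count
    # Clamping at zero is post-processing and does not affect the guarantee.
    return {
        p: max(0.0, counts[p] + laplace_noise(1.0 / epsilon, rng))
        for p in vocabulary
    }

def seeded_prompt(weights, k, rng):
    """Sample k phrases in proportion to the noisy weights and build a prompt."""
    phrases = list(weights)
    w = [weights[p] for p in phrases]
    if sum(w) <= 0:
        w = [1.0] * len(phrases)  # fall back to uniform if all weights are zero
    chosen = rng.choices(phrases, weights=w, k=k)
    # Hypothetical prompt template, for illustration only.
    return "Write a short clinical note that mentions: " + ", ".join(chosen)

if __name__ == "__main__":
    rng = random.Random(0)
    vocabulary = ["chest pain", "fever", "fracture"]
    private_docs = [["fever", "cough"], ["chest pain"], ["fever"]]
    weights = private_phrase_weights(private_docs, vocabulary, epsilon=1.0, rng=rng)
    print(seeded_prompt(weights, k=2, rng=rng))
```

In this sketch the LLM only ever sees the output of `seeded_prompt`, which is computed from the noisy weights alone, so any number of prompts drawn from the same weights inherits the epsilon guarantee by post-processing.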
Taxonomy
Topics: Topic Modeling · Privacy-Preserving Technologies in Data
