Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe
Xiang Yue, Huseyin A. Inan, Xuechen Li, Girish Kumar, Julia McAnallen, Hoda Shajari, Huan Sun, David Levitan, Robert Sim

TL;DR
This paper presents a simple, practical method for generating high-quality synthetic text with differential privacy by fine-tuning pretrained language models, balancing utility and privacy protection.
Contribution
It introduces a straightforward recipe for differentially private text generation that achieves competitive utility while ensuring strong privacy guarantees.
Findings
Synthetic text generated by a DP fine-tuned model is competitive in utility with text from its non-private counterpart.
The method provides strong protection against privacy leakage.
Extensive empirical analyses validate the approach on both benchmark data and private customer data.
Abstract
Privacy concerns have attracted increasing attention in data-driven products due to the tendency of machine learning models to memorize sensitive training data. Generating synthetic versions of such data with a formal privacy guarantee, such as differential privacy (DP), provides a promising path to mitigating these privacy concerns, but previous approaches in this direction have typically failed to produce synthetic data of high quality. In this work, we show that a simple and practical recipe in the text domain is effective: simply fine-tuning a pretrained generative language model with DP enables the model to generate useful synthetic text with strong privacy protection. Through extensive empirical analyses on both benchmark and private customer data, we demonstrate that our method produces synthetic text that is competitive in terms of utility with its non-private counterpart,…
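The abstract does not say which training algorithm is used for the DP fine-tuning. A common choice for training neural models with DP is DP-SGD, which clips each example's gradient to a fixed norm and adds Gaussian noise before the parameter update. The sketch below illustrates one such update in plain Python under that assumption. The names clip_to_norm and dp_sgd_step, and the parameters max_norm and noise_multiplier, are hypothetical and chosen for illustration. It omits privacy accounting (computing the resulting epsilon and delta) and is not the authors' implementation.

```python
import math
import random


def clip_to_norm(grad, max_norm):
    """Rescale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / max(norm, 1e-12))
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One illustrative DP-SGD update (assumed technique, not from the summary).

    per_example_grads holds one gradient vector per training example.
    """
    # Bound each example's influence on the update.
    clipped = [clip_to_norm(g, max_norm) for g in per_example_grads]
    # Noise standard deviation is proportional to the clipping norm.
    sigma = noise_multiplier * max_norm
    batch_size = len(per_example_grads)
    new_params = []
    for i, p in enumerate(params):
        summed = sum(g[i] for g in clipped)
        noisy = summed + rng.gauss(0.0, sigma)
        new_params.append(p - lr * noisy / batch_size)
    return new_params


# Toy usage: two per-example gradients for a three-parameter model.
rng = random.Random(0)
params = [0.0, 0.0, 0.0]
grads = [[3.0, 4.0, 0.0], [0.1, 0.0, 0.0]]
updated = dp_sgd_step(params, grads, lr=0.1, max_norm=1.0,
                      noise_multiplier=1.0, rng=rng)
```

In a real fine-tuning run, the per-example gradients would come from the language model's loss on each training sequence, and a privacy accountant would track the cumulative guarantee across all steps.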
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Digital and Cyber Forensics
