Harnessing large-language models to generate private synthetic text
Alexey Kurakin, Natalia Ponomareva, Umar Syed, Liam MacDermed, Andreas Terzis

TL;DR
This paper presents a method for generating high-quality differentially private synthetic text data using large language models, so that the data can be reused and shared without sacrificing privacy.
Contribution
The paper introduces a novel training objective and parameter tuning strategy that significantly improves the quality of DP synthetic text data generated by large language models.
Findings
Synthetic data achieves performance comparable to direct DP training.
Synthetic data can be used for downstream model tuning.
Proposed method outperforms previous approaches in data quality.
Abstract
Differentially private training algorithms like DP-SGD protect sensitive training data by ensuring that trained models do not reveal private information. An alternative approach, which this paper studies, is to use a sensitive dataset to generate synthetic data that is differentially private with respect to the original data, and then non-privately train a model on the synthetic data. Doing so has several advantages: synthetic data can be reused for other tasks (including for hyperparameter tuning), retained indefinitely, and shared with third parties without sacrificing privacy. However, generating private synthetic data is much harder than training a private model. To improve performance on text data, recent work has utilized public data by starting with a pre-trained generative language model and privately fine-tuning it on sensitive data. This model can be used to sample a DP…
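The DP-SGD update that the abstract refers to can be illustrated with a minimal sketch: clip each example's gradient to a fixed L2 norm, sum the clipped gradients, add Gaussian noise scaled to that norm, and average. This is a generic illustration of DP-SGD, not the authors' implementation; the names `clip_gradient`, `dp_sgd_step`, `clip_norm`, and `noise_multiplier` are hypothetical, and privacy accounting (computing the resulting epsilon) is omitted.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale one example's gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update on a flat parameter list (illustrative only)."""
    batch_size = len(per_example_grads)
    # 1. Bound each example's influence by clipping its gradient.
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    # 2. Sum the clipped gradients coordinate by coordinate.
    summed = [sum(column) for column in zip(*clipped)]
    # 3. Add Gaussian noise with standard deviation noise_multiplier * clip_norm.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    # 4. Average over the batch and take a gradient step.
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


# Toy usage with made-up gradients for a two-parameter model.
rng = random.Random(0)
params = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]
updated = dp_sgd_step(params, grads, lr=0.1, clip_norm=1.0,
                      noise_multiplier=1.0, rng=rng)
```

In practice the per-example gradients come from the model being fine-tuned, and a library handles both the clipping and the privacy accounting; the sketch only shows the order of operations.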
Peer Reviews
Decision·Submitted to ICLR 2024
* The writing has good clarity.
* The paper points out that in the common experimental setup in related work, the private data and pre-training data might overlap. It is an important issue that the community needs to pay attention to.
* The results are promising.
* The paper downplays and misinterprets the contribution of prior work in several places. As a result, the contribution of the paper is overstated.
* The proposed framework lacks novelty--the key components are already studied in prior work.
- The authors conduct extensive evaluation and offer valuable empirical insights into DP synthetic text generation, such as highlighting the importance of prefix-LM that assigns zero weights to the prefix during training, random initialization for prompt tensors on DP prompt tuning, and the superior performance of LoRA compared to prompt tuning.
- Additionally, the paper provides analysis of the synthetic data, such as the effects of synthetic data size, the rank correction of synthetic data fo
Novelty:
- The novelty of the study may be limited, given that DP-SGD is a standard technique for DP synthetic text generation (Yue et al. 2022), and parameter-efficient fine-tuning has already been explored in DP LLM (Yu et al., 2021), albeit not directly applied to synthetic data generation.
Comparison to Yue et al. (2022):
- The discussion and comparison with Yue et al. (2022) might be confusing to the readers. It would be helpful if the authors could clarify the difference between 'condit
- Very clearly written paper on a timely topic, will be really helpful for anyone interested in DP LLMs, and also very accessible to wider audience as well.
- Thorough, rigorous experiments where e.g. the de-duplication of the fine-tuning data from the pre-training data is carried out. I believe the results are very valuable and this can be a good reference for DP LLM studies.
- Interesting finding: Impressive results on the classification accuracy of the downstream models trained with DP synt
- One weakness coming to my mind is the lack of novelty as there is not really anything new proposed in the paper. All the experiments are results of combining existing techniques. At the same time, I do think these are really valuable experimental results and the lack of theoretical novelties is not necessarily a problem.
- I would have liked to see more about the computational costs of the experiments. I see there are some compute cost numbers in Appendix M, but would have been interesting to
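One of the reviews above highlights a prefix-LM objective that assigns zero weight to the prefix during training. A minimal sketch of that idea as a weighted negative log-likelihood follows; the function name `prefix_lm_loss` and the choice to normalize by the number of non-prefix tokens are assumptions made for illustration, since the excerpt does not specify the paper's exact formulation.

```python
import math


def prefix_lm_loss(token_log_probs, prefix_len):
    """Mean negative log-likelihood over the tokens after the prefix.

    token_log_probs[i] is the model's log-probability of the i-th target
    token; the first prefix_len tokens receive zero weight.
    """
    weights = [0.0 if i < prefix_len else 1.0
               for i in range(len(token_log_probs))]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("sequence has no tokens after the prefix")
    weighted_sum = sum(w * lp for w, lp in zip(weights, token_log_probs))
    return -weighted_sum / total_weight


# Toy usage: one prefix token followed by two generated tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
loss = prefix_lm_loss(log_probs, prefix_len=1)
```

Because the prefix tokens carry zero weight, changing the model's predictions on the prefix leaves this loss unchanged; only the continuation contributes to the gradient.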
Taxonomy
Topics: Privacy-Preserving Technologies in Data
