PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs
Charlie Hou, Akshat Shrivastava, Hongyuan Zhan, Rylan Conway, Trang Le, Adithya Sagar, Giulia Fanti, Daniel Lazar

TL;DR
PrE-Text introduces a method for generating differentially private synthetic textual data that enables training small models more efficiently and improves large language model performance on private data, addressing key challenges of on-device training.
Contribution
The paper presents PrE-Text, a novel approach for DP synthetic data generation that enhances privacy-preserving model training and outperforms on-device training in efficiency and effectiveness.
Findings
PrE-Text synthetic data outperforms on-device training for small models.
Training large models on PrE-Text data improves LLM performance on private data.
PrE-Text uses 9× fewer rounds and 6× less client computation per round than on-device training, and also reduces communication.
Abstract
On-device training is currently the most common approach for training machine learning (ML) models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes (ε = 1.29, ε = 7.58). We achieve these results while using 9× fewer rounds, 6× less client computation per round, and…
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Digital Rights Management and Security · Artificial Intelligence in Law
