PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs

Charlie Hou; Akshat Shrivastava; Hongyuan Zhan; Rylan Conway; Trang Le; Adithya Sagar; Giulia Fanti; Daniel Lazar

arXiv:2406.02958 · cs.LG · October 21, 2024 · 1 cites

TL;DR

PrE-Text introduces a method for generating differentially private synthetic textual data that enables training small models more efficiently and improves large language model performance on private data, addressing key challenges of on-device training.

Contribution

The paper presents PrE-Text, a novel approach for DP synthetic data generation that enhances privacy-preserving model training and outperforms on-device training in efficiency and effectiveness.

Findings

01 Small models trained on PrE-Text synthetic data outperform small models trained on-device.

02 Training large models on PrE-Text data improves LLM performance on private data.

03 PrE-Text uses 9× fewer training rounds and 6× less client computation per round than on-device training, and also reduces communication.

Abstract

On-device training is currently the most common approach for training machine learning (ML) models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes ($\epsilon = 1.29$, $\epsilon = 7.58$). We achieve these results while using $9\times$ fewer rounds, $6\times$ less client computation per round, and…

Code & Models

Repositories

houcharlie/pre-text (PyTorch, official)

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Digital Rights Management and Security · Artificial Intelligence in Law