KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server

Wenhao Wang; Xiaoyu Liang; Rui Ye; Jingyi Chai; Siheng Chen; Yanfeng Wang

arXiv:2410.05725·cs.CR·October 11, 2024

TL;DR

KnowledgeSG is a privacy-preserving framework for synthetic text generation that combines differential privacy and knowledge distillation from a server, improving data quality and model performance without exposing private data.

Contribution

The paper introduces a novel client-server framework that improves synthetic data quality and model performance while preserving privacy through differential privacy and knowledge distillation.

Findings

01 Effective in medical and financial domains
02 Maintains privacy by transmitting models, not data
03 Improves synthetic data quality and model accuracy

Abstract

The success of large language models (LLMs) enables many parties to fine-tune LLMs on their own private data. However, this practice raises privacy concerns because LLMs memorize training data. Existing solutions, such as substituting synthetic data for the private data, struggle to improve performance and preserve privacy at the same time. They either rely on a local model for generation, which degrades performance, or use APIs, which directly exposes the data to API servers. To address this issue, we propose KnowledgeSG, a novel client-server framework that enhances synthetic data quality and improves model performance while ensuring privacy. We achieve this by learning local knowledge from the private data with differential privacy (DP) and distilling professional knowledge from the server. Additionally, inspired by federated learning, we transmit models rather than data…
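The abstract names differential privacy as the mechanism for learning from private data but does not spell out the training procedure here. As a rough illustration only, the sketch below shows the gradient step used by DP-SGD, one common way to train a model with DP: clip each per-example gradient, sum, add Gaussian noise, then average. The choice of DP-SGD and the names `clip_gradient`, `dp_average_gradient`, `max_norm`, and `noise_multiplier` are assumptions made for this example, not details taken from the paper.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Hypothetical DP-SGD aggregation: clip, sum, add noise, average.

    per_example_grads: list of equal-length gradient vectors, one per example.
    noise_multiplier: Gaussian noise std is noise_multiplier * max_norm.
    rng: a random.Random instance, passed in so runs can be reproduced.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_gradient(grad, max_norm)
        for i in range(dim):
            total[i] += clipped[i]

    # Noise is calibrated to the clipping bound, which caps each
    # example's contribution to the summed gradient.
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(per_example_grads) for n in noisy]


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.0, 0.5]]
    rng = random.Random(0)
    print(dp_average_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Clipping bounds how much any single private record can move the model, and the noise hides what remains of that contribution. In practice a privacy accountant would also track the cumulative privacy budget across steps, which this sketch omits.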

Code & Models

Repositories

wwh0411/knowledgesg (PyTorch, official)

Videos

KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Digital and Cyber Forensics · Data Quality and Management