Contrastive Private Data Synthesis via Weighted Multi-PLM Fusion
Tianyuan Zou, Yang Liu, Peng Li, Yufei Xiong, Jianqing Zhang, Jingjing Liu, Xiaozhou Ye, Ye Ouyang, Ya-Qin Zhang

TL;DR
This paper introduces WASP, a novel framework for private data synthesis that combines multiple pre-trained language models with contrastive learning to generate high-quality, privacy-preserving synthetic data, especially effective in data-scarce scenarios.
Contribution
WASP is the first method to utilize weighted multi-PLM fusion with contrastive learning for private data synthesis, addressing data scarcity and bias issues in existing approaches.
Findings
WASP outperforms existing methods on 6 datasets across various tasks.
The Top-Q voting mechanism improves private data distribution estimation.
Dynamic weighting of PLMs enhances synthetic data quality.
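To make the last two findings concrete, here is a minimal sketch of what nearest-neighbor voting and vote-based model weighting could look like. It is an illustration under stated assumptions, not the paper's implementation: the function names `top_q_votes` and `plm_weights`, the use of Euclidean distance on embeddings, the `noise_std` parameter, and the rule of normalizing per-PLM vote totals are all hypothetical choices made here. The noise scale needed for a specific DP guarantee is not derived.

```python
import numpy as np


def top_q_votes(private_emb, synthetic_emb, q, noise_std=0.0, rng=None):
    """Each private sample votes for its q nearest synthetic samples.

    Assumption: Euclidean distance on embeddings; Gaussian noise with a
    caller-chosen scale stands in for a calibrated DP mechanism.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Pairwise distances, shape (n_private, n_synthetic).
    diffs = private_emb[:, None, :] - synthetic_emb[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Indices of the q closest synthetic samples for every private sample.
    nearest = np.argsort(dists, axis=1)[:, :q]
    votes = np.bincount(nearest.ravel(), minlength=len(synthetic_emb)).astype(float)
    if noise_std > 0:
        votes += rng.normal(0.0, noise_std, size=votes.shape)
    return votes


def plm_weights(votes, source_plm, n_plms):
    """Turn per-sample votes into per-PLM weights (hypothetical rule).

    source_plm[i] is the index of the PLM that generated synthetic sample i.
    Negative noisy votes are clipped to zero before normalizing.
    """
    clipped = np.clip(votes, 0.0, None)
    totals = np.bincount(source_plm, weights=clipped, minlength=n_plms)
    if totals.sum() == 0:
        return np.full(n_plms, 1.0 / n_plms)  # fall back to uniform weights
    return totals / totals.sum()


if __name__ == "__main__":
    private = np.array([[0.0, 0.0], [10.0, 10.0]])
    synthetic = np.array([[0.0, 1.0], [0.0, 2.0], [9.0, 9.0], [50.0, 50.0]])
    votes = top_q_votes(private, synthetic, q=2)
    print(votes)
    print(plm_weights(votes, np.array([0, 0, 1, 1]), n_plms=2))
```

In this sketch, synthetic samples that sit close to many private samples collect more votes, and a PLM whose samples collect more votes receives a larger weight in the next generation round.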
Abstract
Substantial quantity and high quality are the golden rules of making a good training dataset, with sample privacy protection equally important. Generating synthetic samples that resemble high-quality private data while ensuring Differential Privacy (DP), a formal privacy guarantee, promises scalability and practicality. However, existing methods relying on pre-trained models for data synthesis often struggle in data-deficient scenarios, suffering from limited sample size, inevitable generation noise and existing pre-trained model bias. To address these challenges, we propose a novel contrAstive private data Synthesis via Weighted multiple Pre-trained language models (PLM) framework, named WASP. WASP utilizes limited private samples for more accurate private data distribution estimation via a Top-Q voting mechanism, and…
Taxonomy
Topics: Cryptography and Data Security · Privacy-Preserving Technologies in Data
