PrivCode: When Code Generation Meets Differential Privacy
Zheng Liu, Chen Gong, Terry Yue Zhuo, Kecen Li, Weichen Yu, Matt Fredrikson, Tianhao Wang

TL;DR
PrivCode is a differential privacy (DP) framework for synthesizing code datasets. It aims to provide formal privacy guarantees while keeping the generated code useful, addressing the strict syntactic dependencies and the privacy-utility trade-off that make private code synthesis hard.
Contribution
The paper introduces what the authors describe as the first DP synthesizer for code datasets, built on a two-stage framework that combines privacy-sanitizing and utility-boosting techniques.
Findings
PrivCode outperforms baselines in utility across multiple tasks.
It effectively protects sensitive data under different privacy budgets.
The approach is validated on four large language models and four benchmarks.
Abstract
Large language models (LLMs) have demonstrated outstanding performance in code generation and completion. However, fine-tuning these models on private datasets can raise privacy and proprietary concerns, such as the leakage of sensitive personal information. Differentially private (DP) code generation provides theoretical guarantees for protecting sensitive code by generating synthetic datasets that preserve statistical properties while reducing privacy leakage. However, DP code generation faces significant challenges due to the strict syntactic dependencies of code and the privacy-utility trade-off. We propose PrivCode, the first DP synthesizer specifically designed for code datasets. It incorporates a two-stage framework to improve both privacy and utility. In the first stage, termed "privacy-sanitizing", PrivCode generates DP-compliant synthetic code by training models using DP-SGD…
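The abstract names DP-SGD as the training method for the first stage but gives no details here. As a rough illustration only, the standard DP-SGD update (clip each per-example gradient, sum, add Gaussian noise, average) can be sketched as follows. This is a generic sketch, not PrivCode's implementation; the function names `clip_gradient` and `dp_sgd_step` and the parameters `clip_norm` and `noise_multiplier` are hypothetical, and the privacy accounting that turns the noise level into an (ε, δ) guarantee is omitted.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale one per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One generic DP-SGD update (illustrative, not the paper's code).

    1. Clip every per-example gradient to L2 norm <= clip_norm.
    2. Sum the clipped gradients.
    3. Add Gaussian noise with std = noise_multiplier * clip_norm per coordinate.
    4. Average over the batch and take a gradient step.
    """
    dim = len(params)
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


if __name__ == "__main__":
    rng = random.Random(0)
    params = [0.0, 0.0]
    grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
    print(dp_sgd_step(params, grads, lr=0.1, clip_norm=1.0,
                      noise_multiplier=1.0, rng=rng))
```

Clipping bounds how much any single training example can move the parameters, and the noise scale is tied to that bound; this pairing is what makes a formal DP guarantee possible once a privacy accountant is applied.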
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning
