PrivCode: When Code Generation Meets Differential Privacy

Zheng Liu; Chen Gong; Terry Yue Zhuo; Kecen Li; Weichen Yu; Matt Fredrikson; Tianhao Wang

arXiv:2512.05459 · cs.CR · January 16, 2026

TL;DR

PrivCode is a differentially private (DP) synthesizer for code datasets. It generates synthetic code under DP guarantees while keeping the utility of that code high, targeting the two obstacles that make private code synthesis hard: strict syntactic dependencies and the privacy-utility trade-off.

Contribution

The paper introduces what the authors describe as the first DP synthesizer for code datasets, built as a two-stage framework that combines a privacy-sanitizing stage with a utility-boosting stage.

Findings

01  PrivCode outperforms baselines in utility across multiple tasks.

02  It effectively protects sensitive data under different privacy budgets.

03  The approach is validated on four large language models and four benchmarks.

Abstract

Large language models (LLMs) have demonstrated outstanding performance in code generation and completion. However, fine-tuning these models on private datasets can raise privacy and proprietary concerns, such as the leakage of sensitive personal information. Differentially private (DP) code generation provides theoretical guarantees for protecting sensitive code by generating synthetic datasets that preserve statistical properties while reducing the risk of privacy leakage. However, DP code generation faces significant challenges due to the strict syntactic dependencies of code and the privacy-utility trade-off. We propose PrivCode, the first DP synthesizer specifically designed for code datasets. It incorporates a two-stage framework to improve both privacy and utility. In the first stage, termed "privacy-sanitizing", PrivCode generates DP-compliant synthetic code by training models using DP-SGD…
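The abstract names DP-SGD as the training method for the first stage. As a rough illustration of that general technique only, and not of the authors' implementation, the sketch below shows a single DP-SGD update in NumPy: each example's gradient is clipped to a fixed L2 norm, the clipped gradients are summed, Gaussian noise scaled to the clipping norm is added, and the result is averaged. The function name `dp_sgd_step` and all parameter names and values are assumptions made for this example; privacy accounting (tracking the resulting privacy budget) is deliberately left out.

```python
import numpy as np


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One illustrative DP-SGD update (hypothetical helper, not from the paper).

    params:            1-D array of model parameters, shape (d,)
    per_example_grads: one gradient row per training example, shape (batch, d)
    """
    batch_size = per_example_grads.shape[0]

    # L2 norm of each example's gradient.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)

    # Shrink gradients whose norm exceeds clip_norm; leave smaller ones unchanged.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale

    # Gaussian noise with standard deviation noise_multiplier * clip_norm,
    # added once to the sum of clipped gradients.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)

    # Average the noisy sum and take a plain gradient-descent step.
    noisy_mean = (clipped.sum(axis=0) + noise) / batch_size
    return params - lr * noisy_mean


# Usage with made-up numbers: two examples, two parameters.
rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.0, 0.5]])
new_params = dp_sgd_step(np.zeros(2), grads, lr=0.1, clip_norm=1.0,
                         noise_multiplier=1.0, rng=rng)
```

Clipping bounds how much any single training example can move the parameters, and the noise is calibrated to that bound; this pairing is what lets DP-SGD support a formal privacy guarantee once the noise level and number of steps are accounted for.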

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning