N-grams Bayesian Differential Privacy
Osman Ramadan, James Withers, Douglas Orr

TL;DR
This paper introduces a Bayesian differential privacy mechanism for n-gram counts that improves privacy-utility trade-offs in language models by leveraging public data as a prior, outperforming existing methods.
Contribution
It proposes a novel Bayesian approach using public data as a prior to achieve tighter privacy bounds and better utility in n-gram language modeling.
Findings
Achieves up to an 85% reduction in KL divergence, relative to previously known mechanisms, at epsilon = 0.1
Provides superior privacy protection compared to k-anonymity
Offers competitive performance at large vocabularies
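To make the utility metric in these findings concrete, the sketch below computes the KL divergence between a distribution built from true counts and one built from noised counts. It is a minimal illustration, not the paper's evaluation: the Laplace mechanism is a generic stand-in baseline, the sensitivity of 1 assumes each user contributes a single n-gram, and the toy counts, the floor value, and all function names are hypothetical.

```python
import math
import random

def normalize(counts, floor=1e-9):
    """Clip counts at a small positive floor, then normalize to sum to 1."""
    clipped = [max(c, floor) for c in counts]
    total = sum(clipped)
    return [c / total for c in clipped]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_counts(counts, epsilon, sensitivity, rng):
    """Generic Laplace baseline (assumed here, not taken from the paper)."""
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]

rng = random.Random(0)
true_counts = [500, 300, 120, 50, 20, 7, 2, 1]  # toy data, hypothetical
p_true = normalize(true_counts)

kl_by_eps = {}
for eps in (0.1, 1.0, 10.0):
    trials = [
        kl_divergence(p_true, normalize(laplace_counts(true_counts, eps, 1.0, rng)))
        for _ in range(200)
    ]
    kl_by_eps[eps] = sum(trials) / len(trials)
    print(f"epsilon={eps}: mean KL = {kl_by_eps[eps]:.6f}")
```

Smaller epsilon means larger noise, so the mean KL divergence grows as epsilon shrinks. Rare n-grams are hit hardest, which is one way to see why large vocabularies make the utility loss severe.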
Abstract
Differential privacy has gained popularity in machine learning as a strong privacy guarantee, in contrast to privacy mitigation techniques such as k-anonymity. However, applying differential privacy to n-gram counts significantly degrades the utility of derived language models due to their large vocabularies. We propose a differential privacy mechanism that uses public data as a prior in a Bayesian setup to provide tighter bounds on the privacy loss metric epsilon, and thus better privacy-utility trade-offs. It first transforms the counts to log space, approximating the distribution of the public and private data as Gaussian. The posterior distribution is then evaluated and softmax is applied to produce a probability distribution. This technique achieves up to 85% reduction in KL divergence compared to previously known mechanisms at epsilon equals 0.1. We compare our mechanism to…
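The pipeline described in the abstract (log transform, Gaussian approximation, posterior, softmax) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' mechanism: it uses a standard conjugate Gaussian update per n-gram, treats the public log-counts as the prior mean with a hand-picked variance, and adds Gaussian noise whose scale is NOT calibrated to any epsilon, so it carries no privacy guarantee. All names, the add-one smoothing, and the toy counts are hypothetical.

```python
import math
import random

def log_counts(counts, smoothing=1.0):
    """Move counts to log space; add-one smoothing avoids log(0) (assumed)."""
    return [math.log(c + smoothing) for c in counts]

def gaussian_posterior_mean(prior_mean, prior_var, obs, obs_var):
    """Precision-weighted mean of a Gaussian prior and a Gaussian observation."""
    precision = 1.0 / prior_var + 1.0 / obs_var
    return [(m / prior_var + x / obs_var) / precision
            for m, x in zip(prior_mean, obs)]

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def release_distribution(public_counts, private_counts, prior_var, noise_std, rng):
    """Combine public prior and noised private log-counts, then normalize.

    noise_std must be positive; it is a free parameter here, not derived
    from a privacy budget.
    """
    prior_mean = log_counts(public_counts)
    noisy_obs = [x + rng.gauss(0.0, noise_std) for x in log_counts(private_counts)]
    posterior = gaussian_posterior_mean(prior_mean, prior_var, noisy_obs, noise_std ** 2)
    return softmax(posterior)

rng = random.Random(0)
public_counts = [900, 600, 250, 90, 40, 10, 5, 2]   # toy data, hypothetical
private_counts = [500, 300, 120, 50, 20, 7, 2, 1]   # toy data, hypothetical
released = release_distribution(public_counts, private_counts,
                                prior_var=0.5, noise_std=1.0, rng=rng)
print([round(p, 4) for p in released])
```

The intuition this sketch captures is that the prior pulls noisy private estimates toward the public distribution: the larger the noise variance relative to the prior variance, the more weight the public data receives.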
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Cryptography and Data Security · Internet Traffic Analysis and Secure E-voting
Methods: Softmax
