$d_X$-Privacy for Text and the Curse of Dimensionality
Hassan Jameel Asghar, Robin Carpentier, Benjamin Zi Hao Zhao, Dali Kaafar

TL;DR
This paper shows that the multidimensional Laplace mechanism for $d_X$-privacy, applied word by word to text, tends to output either the original word or a completely dissimilar one, and proposes a post-processing fix to improve the semantic similarity of its outputs.
Contribution
The paper identifies a key issue with the multidimensional Laplace mechanism in text privacy and introduces a post-processing method to mitigate the problem.
Findings
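Applied word by word, the mechanism outputs either the original word or a completely dissimilar word, and very rarely a semantically similar one.
In high-dimensional word embeddings, the distance from a word to its nearest neighbor is much larger than the relative difference in distances to any two consecutive neighbors, which drives this behavior.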
A post-processing step improves semantic similarity in outputs.
Abstract
A widely used method to ensure privacy of unstructured text data is the multidimensional Laplace mechanism for $d_X$-privacy, which is a relaxation of differential privacy for metric spaces. We identify an intriguing peculiarity of this mechanism. When applied on a word-by-word basis, the mechanism either outputs the original word, or completely dissimilar words, and very rarely outputs semantically similar words. We investigate this observation in detail, and tie it to the fact that the distance of the nearest neighbor of a word in any word embedding model (which are high-dimensional) is much larger than the relative difference in distances to any of its two consecutive neighbors. We also show that the dot product of the multidimensional Laplace noise vector with any word embedding plays a crucial role in designating the nearest neighbor. We derive the distribution, moments and tail…
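The word-by-word mechanism described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the common construction in which the noise vector has density proportional to $\exp(-\epsilon \lVert z \rVert_2)$, sampled as a uniformly random direction scaled by a Gamma-distributed magnitude, and it assumes Euclidean distance for the nearest-neighbor step. The function names (`sample_laplace_noise`, `nearest_word`, `privatize_word`, `make_toy_vocab`) and the random toy vocabulary are hypothetical and stand in for a real word embedding model.

```python
import math
import random


def sample_laplace_noise(dim, epsilon, rng):
    """Sample z in R^dim with density proportional to exp(-epsilon * ||z||_2).

    Assumed construction: uniform direction on the unit sphere, with the
    magnitude drawn from Gamma(shape=dim, scale=1/epsilon).
    """
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    r = rng.gammavariate(dim, 1.0 / epsilon)
    return [r * x / norm for x in g]


def nearest_word(point, vocab):
    """Return the word whose embedding is closest to `point` (Euclidean).

    For point = w + z, expanding ||w + z - v||^2 shows that, over candidates v,
    only ||w - v||^2 - 2 * (z . v) varies, so the dot product of the noise
    with each candidate embedding influences which neighbor is selected.
    """
    return min(vocab, key=lambda word: math.dist(point, vocab[word]))


def privatize_word(word, vocab, epsilon, rng):
    """Add noise to the word's embedding and map back to the nearest word."""
    emb = vocab[word]
    noise = sample_laplace_noise(len(emb), epsilon, rng)
    noisy = [e + n for e, n in zip(emb, noise)]
    return nearest_word(noisy, vocab)


def make_toy_vocab(n_words, dim, rng):
    """Hypothetical stand-in for a real embedding table: random Gaussian vectors."""
    return {f"word{i}": [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for i in range(n_words)}


# Small demonstration on toy data (illustrative only).
demo_rng = random.Random(0)
demo_vocab = make_toy_vocab(n_words=200, dim=50, rng=demo_rng)
for eps in (1.0, 10.0, 100.0):
    outputs = [privatize_word("word0", demo_vocab, eps, demo_rng) for _ in range(100)]
    unchanged = sum(out == "word0" for out in outputs)
    print(f"epsilon={eps}: original word returned {unchanged}/100 times")
```

Under these assumptions the expected noise norm is `dim / epsilon`, so in high dimensions the noise is large unless `epsilon` is large. Running the sketch against real embeddings and checking how often the output is a close neighbor, rather than the original word or a distant one, is one way to probe the behavior the paper reports.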
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Cryptography and Data Security
