Privacy Disclosure of Similarity Rank in Speech and Language Processing
Tom Bäckström, Mohammad Hassan Vali, My Nguyen, Silas Rech

TL;DR
This paper introduces a method to quantify privacy risks in biometric identification by analyzing the information leaked through similarity ranks, revealing that even noisy measures can disclose sensitive identity details.
Contribution
It proposes a novel metric for measuring privacy disclosure via similarity rank distributions, applicable to speech and biometric data, enhancing privacy threat evaluation.
Findings
Speaker embeddings contain the most personally identifiable information (PII) among the features compared.
Privacy disclosure increases with test sample length.
The metric allows comparison of PII disclosure across features.
Abstract
Speaker, author, and other biometric identification applications often compare a sample's similarity to a database of templates to determine the identity. Given that data may be noisy and similarity measures can be inaccurate, such a comparison may not reliably identify the true identity as the most similar. Still, even the similarity rank based on an inaccurate similarity measure can disclose private information about the true identity. We propose a methodology for quantifying the privacy disclosure of such a similarity rank by estimating its probability distribution. It is based on determining the histogram of the similarity rank of the true speaker, or when data is scarce, modeling the histogram with the beta-binomial distribution. We express the disclosure in terms of entropy (bits), such that the disclosures from independent features are additive. Our experiments demonstrate that…
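The rank-histogram idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that disclosure is measured as the drop from the entropy of a uniform prior over the candidates to the entropy of the rank distribution, which is one plausible reading of "disclosure in terms of entropy (bits)". The function names, the rank convention (0 = most similar), and the beta-binomial parameters are all hypothetical, and fitting those parameters to data is left out.

```python
import math
from collections import Counter


def rank_histogram(ranks, num_candidates):
    """Empirical probability of each rank 0..num_candidates-1 (0 = most similar)."""
    counts = Counter(ranks)
    return [counts[k] / len(ranks) for k in range(num_candidates)]


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability bins contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def disclosure_bits(probs):
    """ASSUMED definition: uniform-prior entropy minus rank-distribution entropy."""
    return math.log2(len(probs)) - entropy_bits(probs)


def _log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, alpha, beta):
    """Beta-binomial probability of rank k in 0..n, for given (not fitted) parameters."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + _log_beta(k + alpha, n - k + beta) - _log_beta(alpha, beta))


# Hypothetical data: rank of the true speaker among 8 candidates in 10 trials.
observed = rank_histogram([0, 0, 1, 0, 2, 0, 1, 0, 0, 3], num_candidates=8)
print(disclosure_bits(observed))

# Hypothetical smooth model of the same histogram for the scarce-data case.
model = [beta_binomial_pmf(k, 7, alpha=0.5, beta=5.0) for k in range(8)]
print(disclosure_bits(model))
```

Under this assumed definition, a rank that is always 0 among 8 candidates gives log2(8) = 3 bits, and a uniformly distributed rank gives 0 bits. Because the quantity is in bits, values computed for independent features can be summed, in line with the additivity the abstract describes.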
Taxonomy
Topics: Privacy-Preserving Technologies in Data
