Privacy Disclosure of Similarity Rank in Speech and Language Processing

Tom Bäckström; Mohammad Hassan Vali; My Nguyen; Silas Rech

arXiv:2508.05250 · eess.AS · December 1, 2025

Open Access

TL;DR

This paper introduces a method to quantify privacy risks in biometric identification by analyzing the information leaked through similarity ranks, revealing that even noisy measures can disclose sensitive identity details.

Contribution

It proposes a novel metric for measuring privacy disclosure via similarity rank distributions, applicable to speech and biometric data, enhancing privacy threat evaluation.

Findings

01. Speaker embeddings contain the most personally identifiable information (PII) among the features compared.

02. Privacy disclosure increases with the length of the test sample.

03. The metric allows PII disclosure to be compared across features.

Abstract

Speaker, author, and other biometric identification applications often compare a sample's similarity to a database of templates to determine the identity. Given that data may be noisy and similarity measures can be inaccurate, such a comparison may not reliably identify the true identity as the most similar. Still, even a similarity rank based on an inaccurate similarity measure can disclose private information about the true identity. We propose a methodology for quantifying the privacy disclosure of such a similarity rank by estimating its probability distribution. It is based on determining the histogram of the similarity rank of the true speaker or, when data is scarce, modeling the histogram with the beta-binomial distribution. We express the disclosure in terms of entropy (bits), such that the disclosures from independent features are additive. Our experiments demonstrate that…
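
The procedure the abstract outlines (rank histogram, beta-binomial model, entropy in bits) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration rather than the authors' implementation: the function names are hypothetical, the disclosure is taken as the entropy of a uniform prior over the templates minus the entropy of the rank distribution (one plausible reading of "disclosure in bits"), and the beta-binomial parameters are fitted with the textbook method of moments, which may differ from the estimator used in the paper.

```python
import math
from collections import Counter


def rank_histogram(ranks, n_templates):
    """Empirical pmf of the true speaker's rank (1 = most similar)."""
    counts = Counter(ranks)
    return [counts.get(r, 0) / len(ranks) for r in range(1, n_templates + 1)]


def entropy_bits(pmf):
    """Shannon entropy in bits; zero-probability bins contribute nothing."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def disclosure_bits(pmf):
    # Assumption: disclosure = entropy of a uniform prior over all templates
    # minus the entropy of the observed rank distribution.
    return math.log2(len(pmf)) - entropy_bits(pmf)


def betabinom_pmf(n, a, b):
    """Beta-binomial pmf over k = 0..n, computed in the log domain."""
    def log_beta(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

    pmf = []
    for k in range(n + 1):
        log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                      - math.lgamma(n - k + 1))
        pmf.append(math.exp(log_choose + log_beta(k + a, n - k + b)
                            - log_beta(a, b)))
    return pmf


def fit_betabinom_moments(ranks, n_templates):
    """Method-of-moments fit on zero-based ranks k = rank - 1, n = N - 1."""
    n = n_templates - 1
    ks = [r - 1 for r in ranks]
    m1 = sum(ks) / len(ks)
    m2 = sum(k * k for k in ks) / len(ks)
    if m1 == 0:
        raise ValueError("all ranks equal 1; moments fit is undefined")
    denom = n * (m2 / m1 - m1 - 1) + m1
    if denom == 0:
        raise ValueError("moments fit is undefined for these ranks")
    a = (n * m1 - m2) / denom
    b = (n - m1) * (n - m2 / m1) / denom
    if a <= 0 or b <= 0:
        raise ValueError("moments fit gave non-positive parameters")
    return a, b


# Made-up ranks of the true speaker among 8 templates (illustration only).
ranks = [1, 1, 2, 1, 3, 1, 2, 5, 1, 4]
n_templates = 8

empirical = rank_histogram(ranks, n_templates)
a, b = fit_betabinom_moments(ranks, n_templates)
smoothed = betabinom_pmf(n_templates - 1, a, b)

print("empirical disclosure (bits):", disclosure_bits(empirical))
print("beta-binomial disclosure (bits):", disclosure_bits(smoothed))
```

Under this reading, a rank that is always 1 discloses log2(N) bits (the speaker is fully identified among N templates), while a uniformly distributed rank discloses 0 bits. Because the quantity is measured in bits, the disclosures of statistically independent features can be summed, as the abstract notes.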

Taxonomy

Topics: Privacy-Preserving Technologies in Data