Content Leakage in LibriSpeech and Its Impact on the Privacy Evaluation of Speaker Anonymization
Carlos Franzreb, Arnab Das, Tim Polzehl, Sebastian Möller

TL;DR
This paper shows that speakers in LibriSpeech can be identified from the vocabulary of the books they read, a leak that voice anonymization cannot prevent, and proposes the EdAcc dataset as a complementary benchmark for evaluating speaker anonymizers.
Contribution
It reveals vocabulary-based identity leakage in LibriSpeech and proposes EdAcc, where only a few speakers can be identified this way, as a complementary dataset for evaluating speaker anonymization.
Findings
LibriSpeech speakers can be identified by their vocabularies, because the books they read are so distinct.
Even perfect anonymizers cannot prevent this vocabulary-based identity leakage.
In EdAcc, only a few speakers can be identified through their vocabularies, which makes it better suited to privacy evaluation.
Abstract
Speaker anonymization aims to conceal a speaker's identity, without considering the linguistic content. In this study, we reveal a weakness of LibriSpeech, the dataset that is commonly used to evaluate anonymizers: the books read by the LibriSpeech speakers are so distinct that speakers can be identified by their vocabularies. Even perfect anonymizers cannot prevent this identity leakage. The EdAcc dataset is better in this regard: only a few speakers can be identified through their vocabularies, encouraging the attacker to look elsewhere for the identities of the anonymized speakers. EdAcc also comprises spontaneous speech and more diverse speakers, complementing LibriSpeech and giving more insights into how anonymizers work.
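To make the leakage concrete, here is a minimal sketch of a vocabulary-only speaker identifier. It is an illustration under stated assumptions, not the attack used in the paper: the TF-IDF weighting, the cosine matching, the function names, and the toy transcripts are all hypothetical choices made for this example. The attacker sees only transcripts, so nothing an anonymizer does to the voice affects the result.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real attack would use a proper one.
    return text.lower().split()


def tfidf_vectors(docs):
    """Build one TF-IDF vector per speaker from enrollment transcripts.

    docs maps a speaker label to that speaker's transcript text.
    Returns (vectors, idf), where vectors maps label -> {word: weight}.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())
    # Smoothed IDF: words shared by every speaker get the lowest weight.
    idf = {w: math.log((1 + n_docs) / (1 + d)) + 1.0 for w, d in doc_freq.items()}
    vectors = {
        name: {w: tf * idf[w] for w, tf in word_counts.items()}
        for name, word_counts in counts.items()
    }
    return vectors, idf


def cosine(a, b):
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(trial_text, enroll_vectors, idf):
    """Return the enrolled speaker whose vocabulary best matches the trial."""
    trial_counts = Counter(tokenize(trial_text))
    # Words never seen during enrollment carry no evidence and are dropped.
    trial_vec = {w: tf * idf[w] for w, tf in trial_counts.items() if w in idf}
    return max(enroll_vectors, key=lambda name: cosine(trial_vec, enroll_vectors[name]))


# Hypothetical toy data: each "speaker" reads from a different book.
enrollment = {
    "speaker_a": "the whale surfaced near the ship and the captain shouted",
    "speaker_b": "the detective examined the letter in the drawing room",
}
enroll_vectors, idf = tfidf_vectors(enrollment)

print(identify("the captain watched the whale", enroll_vectors, idf))
print(identify("the detective found a letter", enroll_vectors, idf))
```

When each speaker reads a different book, as in LibriSpeech, book-specific words dominate the match. In spontaneous conversation, as in EdAcc, speakers share much more vocabulary, so a matcher of this kind has less to work with.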
Taxonomy
Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection · Topic Modeling
