Content Leakage in LibriSpeech and Its Impact on the Privacy Evaluation of Speaker Anonymization

Carlos Franzreb; Arnab Das; Tim Polzehl; Sebastian Möller

arXiv:2601.13107·eess.AS·January 21, 2026

TL;DR

This paper shows that speakers in the LibriSpeech dataset can be identified from the vocabulary of the books they read, which weakens LibriSpeech as a benchmark for evaluating speaker anonymization. It argues that EdAcc suffers far less from this content leakage and is a useful complementary evaluation dataset.

Contribution

It reveals vocabulary-based identity leakage in LibriSpeech and proposes the existing EdAcc dataset as a complementary benchmark with less content leakage for evaluating speaker anonymization.

Findings

01 LibriSpeech speakers can be identified by their vocabularies, because the books they read are so distinct.

02 Even a perfect anonymizer cannot prevent this vocabulary-based identity leakage.

03 In EdAcc, only a few speakers can be identified through their vocabularies, which makes it better suited to privacy evaluation in this regard.

Abstract

Speaker anonymization aims to conceal a speaker's identity without considering the linguistic content. In this study, we reveal a weakness of Librispeech, the dataset that is commonly used to evaluate anonymizers: the books read by the Librispeech speakers are so distinct that speakers can be identified by their vocabularies. Even perfect anonymizers cannot prevent this identity leakage. The EdAcc dataset is better in this regard: only a few speakers can be identified through their vocabularies, encouraging the attacker to look elsewhere for the identities of the anonymized speakers. EdAcc also comprises spontaneous speech and more diverse speakers, complementing Librispeech and giving more insights into how anonymizers work.
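
To make the idea of vocabulary-based identification concrete, here is a minimal sketch. It is not the attack used in the paper: the Jaccard overlap of word sets, the function names (`vocabulary`, `jaccard`, `identify`) and the toy transcripts are all assumptions chosen for illustration. It only shows why transcripts of distinct books can reveal who is speaking even if the voice itself is perfectly anonymized.

```python
def vocabulary(text):
    """Return the set of lowercase, whitespace-separated words in a transcript."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard overlap between two word sets (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def identify(trial_transcript, enrollment):
    """Return the enrolled speaker whose vocabulary best overlaps the trial's.

    `enrollment` maps a speaker label to that speaker's known transcript.
    """
    trial = vocabulary(trial_transcript)
    return max(
        enrollment,
        key=lambda speaker: jaccard(trial, vocabulary(enrollment[speaker])),
    )


# Toy, made-up transcripts: each hypothetical speaker reads a different book.
enrollment = {
    "speaker_a": "the whale surfaced and the captain raised his harpoon",
    "speaker_b": "the detective examined the letter and the locked door",
}

# The trial utterance shares book-specific words with speaker_a's enrollment text.
trial = "the captain watched the whale from the deck"
guess = identify(trial, enrollment)
print(guess)
```

Only the transcripts enter this comparison, so changing the voice has no effect on the result. That is the sense in which the leakage lies in the content rather than in the speech signal.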

Taxonomy

Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection · Topic Modeling