Speaker Anonymization with Phonetic Intermediate Representations
Sarina Meyer, Florian Lux, Pavel Denisov, Julia Koch, Pascal Tilli, Ngoc Thang Vu

TL;DR
This paper introduces a speaker anonymization method using phonetic intermediate representations, leveraging speech recognition and synthesis to effectively hide speaker identity while preserving speech content, and demonstrates robustness and improved privacy in experiments.
Contribution
The work presents a novel speaker anonymization pipeline that uses phonetic transcriptions and anonymized embeddings, improving privacy robustness and speech quality over existing methods.
Findings
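The neural speech synthesis system tolerates imperfect ASR transcriptions, which keeps the pipeline feasible and robust.
Combining speaker embeddings from different sources improves anonymization, provided they are appropriately normalized.
The best system outperforms the Voice Privacy Challenge 2020 baselines in privacy and speech quality.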
Abstract
In this work, we propose a speaker anonymization pipeline that leverages high-quality automatic speech recognition and synthesis systems to generate speech conditioned on phonetic transcriptions and anonymized speaker embeddings. Using phones as the intermediate representation ensures near-complete elimination of speaker identity information from the input while preserving the original phonetic content as much as possible. Our experimental results on the LibriSpeech and VCTK corpora reveal two key findings: 1) although automatic speech recognition produces imperfect transcriptions, our neural speech synthesis system can handle such errors, making our system feasible and robust, and 2) combining speaker embeddings from different resources is beneficial and their appropriate normalization is crucial. Overall, our final best system significantly outperforms the baselines provided in the Voice…
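As a rough illustration of the second finding (combining speaker embeddings from different sources, with normalization), the following minimal sketch standardizes each embedding pool per dimension and then concatenates the results per speaker. The abstract does not specify the embedding types, the normalization scheme, or the combination method, so the z-score normalization, the concatenation, and the names `standardize` and `combine` are all assumptions made for this example, not the authors' implementation.

```python
from statistics import fmean, pstdev


def standardize(pool, eps=1e-8):
    """Z-score every dimension across a pool of embeddings (rows = speakers).

    Assumption: z-scoring stands in for the unspecified normalization.
    """
    columns = list(zip(*pool))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(x - m) / (s + eps) for x, m, s in zip(row, means, stds)]
        for row in pool
    ]


def combine(pool_a, pool_b):
    """Concatenate two normalized embedding pools speaker by speaker.

    Assumption: row i of both pools describes the same speaker.
    """
    if len(pool_a) != len(pool_b):
        raise ValueError("both pools must describe the same speakers")
    norm_a, norm_b = standardize(pool_a), standardize(pool_b)
    return [a + b for a, b in zip(norm_a, norm_b)]


# Two hypothetical embedding sources with very different value ranges.
pool_a = [[1.0, 200.0], [3.0, 400.0]]
pool_b = [[0.1], [0.3]]
combined = combine(pool_a, pool_b)
```

Without the normalization step, the dimension with values in the hundreds would dominate any distance computed on the concatenated vectors, which is one plausible reason normalization matters when mixing sources.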
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Hate Speech and Cyberbullying Detection
