TL;DR
This paper investigates which aspects of speech a privacy attack exploits to re-identify speakers after anonymization. It extends kNN-VC with two interpretable components that anonymize the duration and variation of phones; these increase privacy and show that prosodic cues encode speaker identity.
Contribution
It adds two interpretable components to kNN-VC that anonymize the duration and variation of phones, shows that they increase privacy significantly, and uses them to show that the privacy attack exploits prosody.
Findings
Prosody leakage weakens kNN-VC as a speaker anonymization system.
Anonymizing the duration and variation of phones increases privacy significantly.
Target selection impacts privacy attack success.
Abstract
Speaker anonymization seeks to conceal a speaker's identity while preserving the utility of their speech. The achieved privacy is commonly evaluated with a speaker recognition model trained on anonymized speech. Although this represents a strong attack, it is unclear which aspects of speech are exploited to identify the speakers. Our research sets out to unveil these aspects. It starts with kNN-VC, a powerful voice conversion model that performs poorly as an anonymization system, presumably because of prosody leakage. To test this hypothesis, we extend kNN-VC with two interpretable components that anonymize the duration and variation of phones. These components increase privacy significantly, proving that the studied prosodic factors encode speaker identity and are exploited by the privacy attack. Additionally, we show that changes in the target selection algorithm considerably…
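The abstract takes kNN-VC as its starting point. As a rough illustration of the nearest-neighbour matching idea behind that model, the sketch below replaces each source feature frame with the mean of its k most similar frames from a target speaker. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy 2-dimensional frames, the use of cosine similarity, and the value of k are all illustrative, and the feature extractor and vocoder that a real system needs are omitted. The duration and variation components described in the abstract are not shown, since the summary gives no detail on how they work.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_convert(source_frames, target_frames, k=4):
    """Replace each source frame with the mean of its k nearest target frames.

    Hypothetical helper for illustration only. Frames are plain lists of
    floats; all frames are assumed to be non-zero and of equal length.
    """
    converted = []
    for frame in source_frames:
        # Rank target frames from most to least similar to this source frame.
        ranked = sorted(
            target_frames,
            key=lambda t: cosine_similarity(frame, t),
            reverse=True,
        )
        neighbours = ranked[:k]
        # Average the selected neighbours dimension by dimension.
        converted.append(
            [
                sum(n[d] for n in neighbours) / len(neighbours)
                for d in range(len(frame))
            ]
        )
    return converted


# Toy example: one source frame, three target frames, k = 2.
source = [[1.0, 0.0]]
target = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0]]
result = knn_convert(source, target, k=2)
```

Note that this matching operates frame by frame, so the output keeps the source utterance's timing. That is consistent with the abstract's hypothesis that prosody, such as phone duration, can leak through the conversion.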
