HLTCOE JHU Submission to the Voice Privacy Challenge 2024
Henry Li Xinyuan, Zexin Cai, Ashi Garg, Kevin Duh, Leibny Paola García-Perera, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner

TL;DR
This paper compares voice conversion and TTS-based systems for voice privacy, highlighting their strengths and weaknesses, and introduces a hybrid system that balances privacy and emotion preservation.
Contribution
It presents a comprehensive evaluation of voice anonymization methods and proposes a novel hybrid system to improve privacy while maintaining emotional content.
Findings
Voice conversion preserves emotion better but is less effective at anonymization.
TTS methods excel at anonymization but lose emotional cues.
The hybrid system achieves over 40% EER and 47% UAR, balancing privacy and emotion.
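For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how they are commonly computed. EER (equal error rate) comes from an attacker's speaker-verification scores, so a higher EER means better privacy; UAR (unweighted average recall) is the mean of per-class recalls of an emotion classifier. The function names and the simple threshold sweep are illustrative assumptions, not the challenge's official scoring code.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Illustrative only: official toolkits interpolate the ROC curve.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```

Under this sketch, an attacker whose scores cannot separate same-speaker from different-speaker trials lands at an EER of 50%, which is why an EER above 40% indicates strong anonymization.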
Abstract
We present a number of systems for the Voice Privacy Challenge, including voice conversion based systems such as the kNN-VC method and the WavLM voice conversion method, and text-to-speech (TTS) based systems including Whisper-VITS. We found that while voice conversion systems better preserve emotional content, they struggle to conceal speaker identity in semi-white-box attack scenarios; conversely, TTS methods perform better at anonymization and worse at emotion preservation. Finally, we propose a random admixture system which seeks to balance the strengths and weaknesses of the two categories of systems, achieving a strong EER of over 40% while maintaining UAR at a respectable 47%.
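The abstract does not spell out how the random admixture is carried out. One plausible reading, sketched below purely as an assumption, is that each utterance is routed at random to either a voice conversion system or a TTS system. The names `random_admixture`, `vc_system`, `tts_system`, and `p_tts` are hypothetical and do not come from the paper.

```python
import random
from typing import Callable, List

# Hypothetical interface: an anonymizer maps an input utterance
# (here just an identifier or path string) to an anonymized one.
Anonymizer = Callable[[str], str]


def random_admixture(
    utterances: List[str],
    vc_system: Anonymizer,
    tts_system: Anonymizer,
    p_tts: float = 0.5,  # assumed mixing probability, not from the paper
    seed: int = 0,
) -> List[str]:
    """Route each utterance to TTS with probability p_tts, else to VC."""
    rng = random.Random(seed)  # seeded for reproducible routing
    outputs = []
    for utt in utterances:
        system = tts_system if rng.random() < p_tts else vc_system
        outputs.append(system(utt))
    return outputs
```

In such a scheme, raising `p_tts` would be expected to favor anonymization and lowering it would favor emotion preservation, mirroring the trade-off the abstract describes.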
