RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations
Seungmin Kim, Sohee Park, Donghyun Kim, Jisu Lee, Daeseon Choi

TL;DR
RoVo is a proactive defense that injects adversarial perturbations into the embedding vectors of audio signals, rather than directly into the audio, to protect against speech synthesis attacks; the protection remains effective against speech enhancement techniques.
Contribution
It introduces embedding-level perturbations for robust voice protection, outperforming existing methods and resisting secondary speech enhancement attacks.
Findings
Increased the Defense Success Rate (DSR) by over 70% against state-of-the-art speech synthesis models
Achieved 99.5% DSR on a commercial speaker-verification API
Perturbations remain effective under strong speech enhancement conditions
Abstract
With the advancement of AI-based speech synthesis technologies such as Deep Voice, there is an increasing risk of voice spoofing attacks, including voice phishing and fake news, through unauthorized use of others' voices. Existing defenses that inject adversarial perturbations directly into audio signals have limited effectiveness, as these perturbations can easily be neutralized by speech enhancement methods. To overcome this limitation, we propose RoVo (Robust Voice), a novel proactive defense technique that injects adversarial perturbations into high-dimensional embedding vectors of audio signals, reconstructing them into protected speech. This approach effectively defends against speech synthesis attacks and also provides strong resistance to speech enhancement models, which represent a secondary attack threat. In extensive experiments, RoVo increased the Defense Success Rate…
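The embedding-level idea described in the abstract can be sketched roughly as follows. This is a toy illustration, not RoVo's actual architecture or objective: the linear encoder `E`, decoder `D`, and speaker model `S` are hypothetical stand-ins for learned networks, and the PGD-style signed-gradient update, the cosine-similarity objective, and the bound `eps` are all assumptions made for the example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def protect(x, E, D, S, eps=0.5, steps=20):
    """Toy embedding-level perturbation (all models are hypothetical linear maps).

    x : audio vector, E : encoder matrix, D : decoder matrix,
    S : speaker-embedding matrix. Returns the protected audio, the
    perturbation, and the speaker similarity before and after.
    """
    z = E @ x                      # audio -> embedding
    v = S @ x                      # speaker embedding of the original voice
    M = S @ D                      # embedding -> speaker embedding of decoded audio
    alpha = eps / 5                # step size (assumed)
    delta = np.zeros_like(z)
    sim_before = cosine(M @ z, v)  # similarity with no perturbation
    best_delta, best_sim = delta.copy(), sim_before

    for _ in range(steps):
        u = M @ (z + delta)
        nu, nv = np.linalg.norm(u), np.linalg.norm(v)
        # Gradient of cos(u, v) with respect to u, then chained back to delta.
        grad_u = v / (nu * nv) - (u @ v) * u / (nu**3 * nv)
        grad_delta = M.T @ grad_u
        # Signed-gradient descent step, projected onto the L-infinity ball.
        delta = np.clip(delta - alpha * np.sign(grad_delta), -eps, eps)
        sim = cosine(M @ (z + delta), v)
        if sim < best_sim:         # keep the best perturbation seen so far
            best_sim, best_delta = sim, delta.copy()

    x_protected = D @ (z + best_delta)  # reconstruct protected audio
    return x_protected, best_delta, sim_before, best_sim
```

The point of the sketch is only the data flow: the perturbation lives in the embedding space and the protected audio is whatever the decoder reconstructs from the perturbed embedding, so the change is not a simple additive noise layer on the waveform. A call such as `protect(x, E, np.linalg.pinv(E), S)` with random matrices exercises it end to end.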
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Speech Recognition and Synthesis · Hate Speech and Cyberbullying Detection
