VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses
Maryam Abbasihafshejani, AHM Nazmus Sakib, Murtuza Jadliwala

TL;DR
This paper introduces VocalBridge, a latent diffusion-bridge purification method that removes protective perturbations from speech, showing that current perturbation-based voiceprint defenses remain vulnerable to voice cloning and speaker verification attacks.
Contribution
VocalBridge is a latent diffusion framework that purifies perturbed speech in the EnCodec latent space. It outperforms existing purification methods at recovering cloneable voices and does not require transcripts.
Findings
Outperforms existing purification methods in recovering cloneable voices
Demonstrates fragility of current perturbation-based defenses
Highlights need for more robust voice protection mechanisms
Abstract
The rapid advancement of speech synthesis technologies, including text-to-speech (TTS) and voice conversion (VC), has intensified security and privacy concerns related to voice cloning. Recent defenses attempt to prevent unauthorized cloning by embedding protective perturbations into speech to obscure speaker identity while maintaining intelligibility. However, adversaries can apply advanced purification techniques to remove these perturbations, recover authentic acoustic characteristics, and regenerate cloneable voices. Despite the growing realism of such attacks, the robustness of existing defenses under adaptive purification remains insufficiently studied. Most existing purification methods are designed to counter adversarial noise in automatic speech recognition (ASR) systems rather than speaker verification or voice cloning pipelines. As a result, they fail to suppress the…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Speech Recognition and Synthesis · Topic Modeling
