Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System
Ze Li, Yao Shi, Yunfei Xu, Ming Li

TL;DR
This paper explores vulnerabilities of zero-shot TTS systems to adversarial attacks and proposes defense strategies like adversarial training and purification to improve security.
Contribution
It introduces and evaluates two novel defense methods, adversarial training and diffusion-based purification, for speaker embedding based zero-shot TTS systems.
Findings
Imperceptible adversarial perturbations to the speaker's audio can make the synthesized speech sound like another person, undermining both TTS quality and security.
Adversarial training improves robustness against perturbations.
Diffusion models effectively purify adversarial audio samples.
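To make the attack setting concrete, the following is a minimal sketch of a single signed-gradient (FGSM-style) step against a toy speaker encoder. It is not the paper's attack or model: the encoder is a hypothetical fixed random linear map, the dot product with a target embedding stands in for speaker similarity, and the names embed, target_score, fgsm_attack, and the value of eps are all illustrative assumptions. It only shows how a perturbation bounded by eps per sample can push an embedding toward another speaker.

```python
import random

random.seed(0)

# Hypothetical stand-in for a speaker encoder: a fixed random linear map
# from a 400-sample waveform to an 8-dimensional embedding (an assumption,
# not the encoder used in the paper).
N_SAMPLES, EMB_DIM = 400, 8
W = [[random.gauss(0, 1) / N_SAMPLES ** 0.5 for _ in range(N_SAMPLES)]
     for _ in range(EMB_DIM)]

def embed(wave):
    """Toy embedding: one dot product per encoder row."""
    return [sum(w * x for w, x in zip(row, wave)) for row in W]

def target_score(wave, target_emb):
    """Dot product between the waveform's embedding and the target embedding."""
    return sum(t * e for t, e in zip(target_emb, embed(wave)))

def fgsm_attack(wave, target_emb, eps):
    """One signed-gradient step that raises target_score, changing each sample by at most eps."""
    # For a linear encoder, d(target_score)/d(wave[i]) = sum_k target_emb[k] * W[k][i].
    grad = [sum(t * row[i] for t, row in zip(target_emb, W))
            for i in range(len(wave))]
    sign = lambda g: 1 if g > 0 else (-1 if g < 0 else 0)
    return [x + eps * sign(g) for x, g in zip(wave, grad)]

# Random noise stands in for the source and target speakers' audio.
source_wave = [0.1 * random.gauss(0, 1) for _ in range(N_SAMPLES)]
target_wave = [0.1 * random.gauss(0, 1) for _ in range(N_SAMPLES)]

# Unit-normalized embedding of the target speaker.
target_emb = embed(target_wave)
norm = sum(t * t for t in target_emb) ** 0.5
target_emb = [t / norm for t in target_emb]

eps = 0.002  # per-sample perturbation budget (illustrative value)
adv_wave = fgsm_attack(source_wave, target_emb, eps)

score_before = target_score(source_wave, target_emb)
score_after = target_score(adv_wave, target_emb)
max_perturbation = max(abs(a - s) for a, s in zip(adv_wave, source_wave))
```

In this linear toy the score rises by exactly eps times the L1 norm of the gradient, so score_after exceeds score_before while no sample moves by more than eps. Adversarial training, in the same spirit, would add waveforms like adv_wave to the training data so the model learns to resist them.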
Abstract
Speaker embedding based zero-shot Text-to-Speech (TTS) systems enable high-quality speech synthesis for unseen speakers using minimal data. However, these systems are vulnerable to adversarial attacks, where an attacker introduces imperceptible perturbations to the original speaker's audio waveform, causing the synthesized speech to sound like another person. This vulnerability poses significant security risks, including speaker identity spoofing and unauthorized voice manipulation. This paper investigates two primary defense strategies to address these threats: adversarial training and adversarial purification. Adversarial training enhances the model's robustness by integrating adversarial examples during the training process, thereby improving resistance to such attacks. Adversarial purification, on the other hand, employs diffusion probabilistic models to revert adversarially perturbed…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Diffusion
