Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System

Ze Li; Yao Shi; Yunfei Xu; Ming Li

arXiv:2410.04017 · eess.AS · October 7, 2025 · ICME

Open Access

TL;DR

This paper examines how speaker embedding based zero-shot TTS systems can be fooled by adversarial perturbations of the speaker's audio, and investigates two defenses: adversarial training and adversarial purification.

Contribution

It investigates two defense strategies, adversarial training and diffusion-based adversarial purification, for speaker embedding based zero-shot TTS systems.

Findings

01. Adversarial attacks can significantly degrade TTS quality and security.

02. Adversarial training improves robustness against perturbations.

03. Diffusion models effectively purify adversarial audio samples.

Abstract

Speaker embedding based zero-shot Text-to-Speech (TTS) systems enable high-quality speech synthesis for unseen speakers using minimal data. However, these systems are vulnerable to adversarial attacks, in which an attacker adds imperceptible perturbations to the original speaker's audio waveform so that the synthesized speech sounds like another person. This vulnerability poses significant security risks, including speaker identity spoofing and unauthorized voice manipulation. This paper investigates two primary defense strategies to address these threats: adversarial training and adversarial purification. Adversarial training enhances the model's robustness by including adversarial examples in the training process, thereby improving resistance to such attacks. Adversarial purification, on the other hand, employs diffusion probabilistic models to revert adversarially perturbed…
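To make the attack described in the abstract concrete, here is a minimal sketch of a bounded, targeted perturbation against a speaker encoder. It is not the paper's method: the encoder is a hypothetical toy linear map (`embed`), the "waveforms" are random numbers, and the names `pgd_attack`, `eps`, `step`, and `iters` are assumptions chosen for illustration. It uses a generic projected sign-gradient update, which pushes the source speaker's embedding toward a target speaker's embedding while keeping every sample within `eps` of the original.

```python
import math
import random

def embed(weights, wave):
    """Toy linear stand-in for a speaker encoder (assumption, not a real model)."""
    return [sum(w * x for w, x in zip(row, wave)) for row in weights]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pgd_attack(weights, wave, target_emb, eps=0.05, step=0.01, iters=10):
    """Move embed(wave + delta) toward target_emb with |delta_i| <= eps."""
    n = len(wave)
    # For a linear encoder, the gradient of dot(embed(wave + delta), target_emb)
    # with respect to delta is W^T target_emb, so it is computed once.
    grad = [sum(weights[k][i] * target_emb[k] for k in range(len(weights)))
            for i in range(n)]
    delta = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            sign = (grad[i] > 0) - (grad[i] < 0)
            # Sign-gradient step, then projection back into the L-infinity ball.
            delta[i] = max(-eps, min(eps, delta[i] + step * sign))
    return [x + d for x, d in zip(wave, delta)]

rng = random.Random(0)
emb_dim, num_samples = 8, 256
W = [[rng.gauss(0, 1) for _ in range(num_samples)] for _ in range(emb_dim)]
source_wave = [rng.uniform(-1, 1) for _ in range(num_samples)]
target_wave = [rng.uniform(-1, 1) for _ in range(num_samples)]

target_emb = embed(W, target_wave)
adv_wave = pgd_attack(W, source_wave, target_emb)

sim_before = cosine(embed(W, source_wave), target_emb)
sim_after = cosine(embed(W, adv_wave), target_emb)
max_change = max(abs(a - b) for a, b in zip(adv_wave, source_wave))
print(f"similarity to target: {sim_before:.3f} -> {sim_after:.3f}")
print(f"largest per-sample change: {max_change:.3f}")
```

In this toy setting, adversarial training would correspond to generating such perturbed inputs during training and including them in the loss. A real attack would replace the closed-form gradient with backpropagation through the actual speaker encoder.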

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Diffusion