Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio
Guangke Chen, Yuhui Wang, Shouling Ji, Xiapu Luo, Ting Wang

TL;DR
This paper investigates how large text-to-speech models can be exploited to generate harmful audio content despite safety measures, revealing significant vulnerabilities and proposing new attack methods.
Contribution
The study introduces HARMGEN, a suite of five novel attacks that bypass safety filters in TTS systems, exposing critical security flaws in current deployment practices.
Findings
Attacks significantly lower refusal rates in TTS systems.
HARMGEN increases toxicity in generated speech across multiple models.
Proactive moderation detects 57-93% of attacks.
Abstract
Modern text-to-speech (TTS) systems, particularly those built on Large Audio-Language Models (LALMs), generate high-fidelity speech that faithfully reproduces input text and mimics specified speaker identities. While prior misuse studies have focused on speaker impersonation, this work explores a distinct content-centric threat: exploiting TTS systems to produce speech containing harmful content. Realizing such threats poses two core challenges: (1) LALM safety alignment frequently rejects harmful prompts, yet existing jailbreak attacks are ill-suited for TTS because these systems are designed to faithfully vocalize any input text, and (2) real-world deployment pipelines often employ input/output filters that block harmful text and audio. We present HARMGEN, a suite of five attacks organized into two families that address these challenges. The first family employs semantic obfuscation…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Speech Recognition and Synthesis
