TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations

Xiaoxue Gao; Yiming Chen; Xianghu Yue; Yu Tsao; Nancy F. Chen

arXiv:2407.01927 · eess.AS · October 23, 2024

TL;DR

This paper introduces TTSlow, an adversarial method to intentionally slow down TTS systems, evaluating their robustness and efficiency against input perturbations, and highlighting vulnerabilities across different models and datasets.

Contribution

The paper presents TTSlow, described as the first attack method that targets the efficiency robustness of TTS models, with adversarial strategies applied to both text inputs and speaker embeddings.

Findings

01 TTSlow effectively increases TTS generation time across multiple models and datasets.

02 The attack impacts speech intelligibility minimally while significantly slowing down synthesis.

03 The approach reveals vulnerabilities in both autoregressive and non-autoregressive TTS systems.

Abstract

Text-to-speech (TTS) has been extensively studied for generating high-quality speech from textual inputs and plays a crucial role in various real-time applications. For real-world deployment, it is essential that TTS models generate speech stably and promptly even under minor input perturbations. Evaluating the robustness of TTS models against such perturbations, commonly known as adversarial attacks, is therefore highly desirable. In this paper, we propose TTSlow, a novel adversarial approach specifically tailored to slow down the speech generation process in TTS systems. To induce long TTS waiting times, we design a novel efficiency-oriented adversarial loss that encourages an endless generation process. TTSlow encompasses two attack strategies, targeting text inputs and speaker embeddings respectively. Specifically, we propose TTSlow-text, which utilizes a combination of homoglyphs-based and…
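
To illustrate what a homoglyph-based text perturbation looks like in general, here is a minimal Python sketch. It is not the authors' TTSlow-text method: the abstract above is truncated, and the sketch picks positions at random instead of optimizing any efficiency-oriented loss. The `HOMOGLYPHS` table, the function `homoglyph_perturb`, and its parameters `max_swaps` and `seed` are all hypothetical names introduced for illustration.

```python
import random

# Hypothetical lookup table (an assumption, not from the paper): Latin letters
# mapped to Cyrillic code points that render almost identically.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small letter a
    "c": "\u0441",  # Cyrillic small letter es
    "e": "\u0435",  # Cyrillic small letter ie
    "o": "\u043e",  # Cyrillic small letter o
    "p": "\u0440",  # Cyrillic small letter er
    "x": "\u0445",  # Cyrillic small letter ha
}


def homoglyph_perturb(text, max_swaps=1, seed=0):
    """Replace up to max_swaps characters with look-alike code points.

    Positions are chosen at random here; an actual attack would instead
    choose them to maximize an adversarial objective.
    """
    rng = random.Random(seed)
    # Only characters that have a look-alike in the table are candidates.
    candidates = [i for i, ch in enumerate(text) if ch in HOMOGLYPHS]
    chosen = rng.sample(candidates, min(max_swaps, len(candidates)))
    chars = list(text)
    for i in chosen:
        chars[i] = HOMOGLYPHS[chars[i]]
    return "".join(chars)


original = "please read this sentence aloud"
perturbed = homoglyph_perturb(original, max_swaps=2)
# The two strings look alike on screen but differ at the code-point level.
changed = [i for i, (a, b) in enumerate(zip(original, perturbed)) if a != b]
print(len(changed), "characters replaced at positions", changed)
```

The perturbed string keeps the same length and visual appearance, which is why such edits count as minor input perturbations even though a text front end may process the substituted characters very differently.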

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques