Efficient Incremental Text-to-Speech on GPUs

Muyang Du; Chuan Liu; Jiaxing Qi; Junjie Lai

arXiv:2211.13939 · cs.SD · December 6, 2022

TL;DR

This paper introduces a GPU-based incremental (streaming) text-to-speech method that keeps first-chunk latency below 80 ms at 100 QPS on a single NVIDIA A10 GPU, making it suitable for real-time online speech applications.

Contribution

The paper presents a GPU-optimized incremental TTS approach built on two techniques, Instant Request Pooling and Module-wise Dynamic Batching, which enable real-time synthesis under high concurrency.

Findings

01 First-chunk latency below 80 ms at 100 QPS on a single NVIDIA A10 GPU

02 Outperforms its non-incremental counterpart in both concurrency and latency

03 Produces high-quality speech in real time

Abstract

Incremental text-to-speech, also known as streaming TTS, has been increasingly applied to online speech applications that require ultra-low response latency to provide an optimal user experience. However, most existing speech synthesis pipelines deployed on GPUs are still non-incremental, which exposes their limitations in high-concurrency scenarios, especially when the pipeline is built with end-to-end neural network models. To address this issue, we present a highly efficient approach to perform real-time incremental TTS on GPUs with Instant Request Pooling and Module-wise Dynamic Batching. Experimental results demonstrate that the proposed method is capable of producing high-quality speech with a first-chunk latency lower than 80 ms under 100 QPS on a single NVIDIA A10 GPU and significantly outperforms its non-incremental counterpart in both concurrency and latency. Our work reveals the…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and Dialogue Systems