Efficient Incremental Text-to-Speech on GPUs
Muyang Du, Chuan Liu, Jiaxing Qi, Junjie Lai

TL;DR
This paper introduces a GPU-based incremental text-to-speech method that achieves ultra-low latency and high concurrency, suitable for real-time online speech applications.
Contribution
The paper presents a novel GPU-optimized incremental TTS approach using Instant Request Pooling and Module-wise Dynamic Batching, enabling real-time performance.
Findings
First-chunk latency under 80 ms at 100 QPS on a single NVIDIA A10 GPU
Outperforms its non-incremental counterpart in both concurrency and latency
Produces high-quality speech in real time
Abstract
Incremental text-to-speech, also known as streaming TTS, has been increasingly applied to online speech applications that require ultra-low response latency to provide an optimal user experience. However, most of the existing speech synthesis pipelines deployed on GPU are still non-incremental, which uncovers limitations in high-concurrency scenarios, especially when the pipeline is built with end-to-end neural network models. To address this issue, we present a highly efficient approach to perform real-time incremental TTS on GPUs with Instant Request Pooling and Module-wise Dynamic Batching. Experimental results demonstrate that the proposed method is capable of producing high-quality speech with a first-chunk latency lower than 80ms under 100 QPS on a single NVIDIA A10 GPU and significantly outperforms the non-incremental twin in both concurrency and latency. Our work reveals the…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and Dialogue Systems
