High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

Nikolaos Ellinas; Georgios Vamvoukakis; Konstantinos Markopoulos; Aimilios Chalamandaris; Georgia Maniati; Panos Kakoulidis; Spyros Raptis; June Sig Sung; Hyoungmin Park; Pirros Tsiakoulis

arXiv:2111.09052·cs.SD·November 18, 2021

TL;DR

This paper introduces a low-latency, end-to-end speech synthesis system that runs faster than real time on CPUs. It pairs an autoregressive acoustic model, stabilized by a recently proposed purely location-based attention mechanism, with the LPCNet vocoder, and uses streaming inference to keep latency nearly constant regardless of sentence length.

Contribution

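The paper proposes an autoregressive TTS acoustic model that adopts modules from both Tacotron 1 and Tacotron 2, uses a purely location-based attention mechanism for stability on sentences of arbitrary length, and generates acoustic features in a streaming manner to achieve low latency.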

Findings

01

Acoustic model generates features about 31x faster than real time on a computer CPU and 6.5x on a mobile CPU

02

Maintains nearly constant latency regardless of sentence length

03

Produces speech of almost natural quality
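
To make the speed figures concrete, the short sketch below converts a "times faster than real time" factor into the compute time needed for a given duration of speech. The function name and the 10-second example duration are illustrative assumptions, not taken from the paper; only the 31x and 6.5x factors come from the abstract, and they refer to the acoustic model alone.

```python
def synthesis_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Compute time needed to generate `audio_seconds` of speech features
    when generation runs `speedup` times faster than real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Illustrative 10-second utterance (assumed duration, not from the paper).
desktop = synthesis_time_seconds(10.0, 31.0)   # computer CPU factor from the abstract
mobile = synthesis_time_seconds(10.0, 6.5)     # mobile CPU factor from the abstract
print(f"computer CPU: {desktop:.3f} s, mobile CPU: {mobile:.3f} s")
```

Under these assumptions, ten seconds of speech features would take roughly a third of a second on the computer CPU and about a second and a half on the mobile CPU.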

Abstract

This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture that adopts modules from both the Tacotron 1 and 2 models is proposed, while stability is ensured by using a recently proposed purely location-based attention mechanism, suitable for arbitrary sentence length generation. During inference, the decoder is unrolled and acoustic feature generation is performed in a streaming manner, allowing for a nearly constant latency which is independent from the sentence length. Experimental results show that the acoustic model can produce feature sequences with minimal latency about 31 times faster than real-time on a computer CPU and 6.5 times on a mobile CPU,…
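
The streaming idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `decode_step`, `stream_frames`, the chunk size, and the fixed `total_frames` count are all hypothetical stand-ins (a real decoder would stop on its own end-of-sequence signal). The point is only that, when frames are emitted in chunks as they are decoded, the work done before the first chunk is available depends on the chunk size and not on the sentence length.

```python
from typing import Callable, Iterator, List

Frame = List[float]


def decode_step(step: int) -> Frame:
    """Hypothetical stand-in for one autoregressive decoder step.
    Returns a dummy one-dimensional acoustic frame."""
    return [float(step)]


def stream_frames(
    total_frames: int,
    chunk_size: int,
    step_fn: Callable[[int], Frame] = decode_step,
) -> Iterator[List[Frame]]:
    """Yield decoded frames in fixed-size chunks as soon as each chunk is full,
    instead of waiting for the whole sentence to be decoded."""
    chunk: List[Frame] = []
    for step in range(total_frames):
        chunk.append(step_fn(step))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final, possibly shorter, chunk
        yield chunk


def steps_until_first_chunk(total_frames: int, chunk_size: int) -> int:
    """Count decoder steps executed before the first chunk is available."""
    calls = 0

    def counting_step(step: int) -> Frame:
        nonlocal calls
        calls += 1
        return decode_step(step)

    next(stream_frames(total_frames, chunk_size, counting_step), None)
    return calls


short_sentence = steps_until_first_chunk(total_frames=100, chunk_size=10)
long_sentence = steps_until_first_chunk(total_frames=1000, chunk_size=10)
print(short_sentence, long_sentence)
```

In this toy setup both sentences need the same number of decoder steps before the first chunk can be handed to a vocoder, which is the sense in which latency is independent of sentence length.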

Taxonomy

Methods: Sigmoid Activation · Highway Layer · Batch Normalization · Highway Network · Convolution · Bidirectional GRU · Max Pooling · Residual Connection · CBHG