Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network

Takaaki Saeki, Shinnosuke Takamichi, and Hiroshi Saruwatari

arXiv:2109.10724 · cs.SD · September 23, 2021

TL;DR

This paper introduces a low-latency incremental TTS synthesis method in which a distilled, lightweight model predicts the unobserved future context, cutting inference time roughly tenfold while keeping speech quality comparable to previous methods.

Contribution

The paper distills a large GPT2-based context prediction network into a simple recurrent model, so that incremental TTS synthesis no longer needs to sample words from the large language model at each time step.

Findings

01. Achieves speech quality comparable to previous methods

02. Reduces inference time roughly tenfold

03. Enables real-time incremental speech synthesis

Abstract

Incremental text-to-speech (TTS) synthesis generates utterances in small linguistic units for the sake of real-time and low-latency applications. We previously proposed an incremental TTS method that leverages a large pre-trained language model to take unobserved future context into account without waiting for the subsequent segment. Although this method achieves comparable speech quality to that of a method that waits for the future context, it entails a huge amount of processing for sampling from the language model at each time step. In this paper, we propose an incremental TTS method that directly predicts the unobserved future context with a lightweight model, instead of sampling words from the large-scale language model. We perform knowledge distillation from a GPT2-based context prediction network into a simple recurrent model by minimizing a teacher-student loss defined between…
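The teacher-student setup described in the abstract can be illustrated with a small sketch. The abstract is cut off before it states what the loss is defined between, so everything specific here is an assumption made for illustration: the student is a plain tanh recurrent network, the teacher output is a random stand-in vector rather than a real GPT2-based network, and the loss is a mean squared error between the two context vectors. The names `rnn_context`, `mse`, `EMB_DIM`, and `HID_DIM` are hypothetical and not taken from the paper.

```python
import math
import random


def rnn_context(embeddings, w_in, w_rec):
    """Run a plain tanh RNN over the observed tokens and return the final hidden state."""
    hidden = [0.0] * len(w_rec)
    for x in embeddings:
        # The comprehension reads the previous hidden state before it is reassigned.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(w_rec))
        ]
    return hidden


def mse(student_vec, teacher_vec):
    """Mean squared error between two equal-length vectors (assumed loss, not the paper's)."""
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(teacher_vec)


random.seed(0)
EMB_DIM, HID_DIM = 4, 3  # toy sizes, chosen only for this sketch

# Randomly initialised student weights (untrained).
w_in = [[random.uniform(-0.5, 0.5) for _ in range(EMB_DIM)] for _ in range(HID_DIM)]
w_rec = [[random.uniform(-0.5, 0.5) for _ in range(HID_DIM)] for _ in range(HID_DIM)]

# Embeddings of the segment observed so far (random placeholders).
observed = [[random.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(5)]

# Stand-in for the context vector a large teacher network would produce.
teacher_context = [random.uniform(-1, 1) for _ in range(HID_DIM)]

student_context = rnn_context(observed, w_in, w_rec)
loss = mse(student_context, teacher_context)
print(f"teacher-student loss: {loss:.4f}")
```

In a real training loop the student weights would be updated to reduce this loss over many segments. The point of the sketch is only that, at synthesis time, a single cheap recurrent pass replaces repeated sampling from the large language model.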

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Knowledge Distillation