Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network
Takaaki Saeki, Shinnosuke Takamichi, and Hiroshi Saruwatari

TL;DR
This paper introduces a low-latency incremental TTS synthesis method that predicts future context with a distilled, lightweight model, cutting inference time by about ten times while keeping speech quality comparable to the previous method.
Contribution
The paper distills a large GPT2-based context prediction network into a simple recurrent model, so that incremental TTS synthesis no longer needs to sample words from the large language model at each time step.
Findings
Achieves comparable speech quality to previous methods
Reduces inference time by about ten times
Enables real-time incremental speech synthesis
Abstract
Incremental text-to-speech (TTS) synthesis generates utterances in small linguistic units for the sake of real-time and low-latency applications. We previously proposed an incremental TTS method that leverages a large pre-trained language model to take unobserved future context into account without waiting for the subsequent segment. Although this method achieves comparable speech quality to that of a method that waits for the future context, it entails a huge amount of processing for sampling from the language model at each time step. In this paper, we propose an incremental TTS method that directly predicts the unobserved future context with a lightweight model, instead of sampling words from the large-scale language model. We perform knowledge distillation from a GPT2-based context prediction network into a simple recurrent model by minimizing a teacher-student loss defined between…
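The distillation idea in the abstract can be illustrated with a toy sketch. The abstract is cut off before it says what the teacher-student loss is defined between, so everything here is an assumption made for illustration: the loss is taken to be a mean squared error between a fixed teacher output vector and the student's prediction, the student is a single linear layer rather than the recurrent model the paper uses, and all names (`mse`, `student_forward`, `distill_step`) and numbers are hypothetical.

```python
import random

# Hypothetical sketch only: the loss form, the linear student, and all values
# below are assumptions, not the paper's actual definitions.

def mse(student_out, teacher_out):
    """Mean squared error between two equal-length vectors."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(teacher_out)

def student_forward(weights, x):
    """Toy linear student: one weight row per output dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def distill_step(weights, x, teacher_out, lr=0.1):
    """Run one gradient-descent step on the MSE loss, updating weights in place.

    Returns the loss measured before the update.
    """
    out = student_forward(weights, x)
    loss = mse(out, teacher_out)
    n = len(teacher_out)
    for j, row in enumerate(weights):
        grad_out = 2.0 * (out[j] - teacher_out[j]) / n  # d(loss)/d(out[j])
        for i in range(len(row)):
            row[i] -= lr * grad_out * x[i]
    return loss

random.seed(0)
x = [0.5, -1.0, 0.25]        # stand-in for features of the observed text segment
teacher_out = [1.0, -0.5]    # stand-in for the frozen teacher's output
weights = [[random.uniform(-0.1, 0.1) for _ in x] for _ in teacher_out]

# The teacher output stays fixed; only the student's weights are updated.
losses = [distill_step(weights, x, teacher_out) for _ in range(50)]
```

The point of the sketch is the training setup: the expensive teacher only supplies targets during training, and at synthesis time only the cheap student runs, which is where the reported speed-up would come from.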
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Knowledge Distillation
