Incremental Text-to-Speech Synthesis Using Pseudo Lookahead with Large Pretrained Language Model

Takaaki Saeki; Shinnosuke Takamichi; Hiroshi Saruwatari

arXiv:2012.12612·cs.SD·May 26, 2021

TL;DR

This paper introduces an incremental TTS approach that uses a pseudo lookahead, generated by a large pretrained language model, to improve speech quality without increasing latency. The approach resembles the way a human reads text aloud incrementally.

Contribution

The paper proposes an incremental TTS method that uses a pretrained GPT-2 to generate the pseudo lookahead, aiming to keep latency low while preserving speech naturalness.

Findings

01

Achieves higher speech quality than methods without lookahead

02

Matches speech quality of methods with full future context

03

Demonstrates effective use of pretrained language models in TTS

Abstract

This letter presents an incremental text-to-speech (TTS) method that performs synthesis in small linguistic units while maintaining the naturalness of output speech. Incremental TTS is generally subject to a trade-off between latency and synthetic speech quality. It is challenging to produce high-quality speech with a low-latency setup that makes little use of an unobserved future sentence (hereafter, "lookahead"). To resolve this issue, we propose an incremental TTS method that uses a pseudo lookahead generated with a language model to take future contextual information into account without increasing latency. Our method can be regarded as imitating a human's incremental reading and uses a pretrained GPT2, which captures large-scale linguistic knowledge, for the lookahead generation. Evaluation results show that our method 1) achieves higher speech quality than the…
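The pseudo-lookahead idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a toy bigram model stands in for the pretrained GPT2, the segment size is assumed to be one word, and the function names (`train_bigram`, `pseudo_lookahead`, `incremental_inputs`) are hypothetical. The sketch only shows how, at each step, a synthesizer could be given the observed past, the current segment, and a predicted continuation in place of the real future text.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count word-to-next-word transitions (toy stand-in for a pretrained LM)."""
    table = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            table[prev][nxt] += 1
    return table


def pseudo_lookahead(table, observed, n_words):
    """Greedily predict up to n_words that might follow the observed words."""
    predicted = []
    last = observed[-1].lower() if observed else None
    for _ in range(n_words):
        if last not in table:
            break  # the toy model has no continuation for this word
        last = table[last].most_common(1)[0][0]
        predicted.append(last)
    return predicted


def incremental_inputs(table, words, lookahead_len=2):
    """Yield (past, current, lookahead) for each word as it arrives.

    The lookahead is predicted from already-observed text only, so no step
    waits for future input (assumption: one word per segment).
    """
    for i, word in enumerate(words):
        observed = words[: i + 1]
        yield words[:i], word, pseudo_lookahead(table, observed, lookahead_len)


# Hypothetical toy corpus; a real system would use a pretrained language model.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat ran away",
]
table = train_bigram(corpus)

for past, current, lookahead in incremental_inputs(table, ["the", "cat", "sat"]):
    # A real system would pass all three parts to the TTS model here and
    # emit audio for `current` only.
    print(past, current, lookahead)
```

In this sketch the predicted words act purely as context: they may differ from what actually arrives next, and only the current segment would be rendered as audio.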
