DiscreTalk: Text-to-Speech as a Machine Translation Problem

Tomoki Hayashi; Shinji Watanabe

arXiv:2005.05525 · cs.CL · May 13, 2020 · 21 cites

TL;DR

DiscreTalk is an end-to-end text-to-speech system that treats speech synthesis as a machine translation task: a VQ-VAE converts speech into discrete symbols, and a Transformer-NMT model translates text into those symbols. This makes NMT techniques directly applicable and improves naturalness over a conventional Transformer-TTS model.

Contribution

The paper presents a TTS approach that combines a non-autoregressive VQ-VAE with an autoregressive Transformer-NMT model, removing the need to hand-tune feature-extraction hyperparameters and reducing over-smoothing.

Findings

01. Outperforms conventional Transformer-TTS in naturalness

02. Achieves performance comparable to VQ-VAE reconstruction

03. Utilizes NMT techniques like beam search and subword units
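
Because the decoder output is a sequence of discrete symbols, standard beam search applies directly. The sketch below is illustrative only and is not the authors' implementation: `toy_model`, its probabilities, and the `EOS` symbol id are hypothetical stand-ins for a trained Transformer-NMT decoder. It shows a case where a beam of size 2 finds a more probable symbol sequence than greedy decoding.

```python
import math

EOS = 2  # hypothetical end-of-sequence symbol id


def toy_model(prefix):
    """Hypothetical next-symbol log-probabilities given a symbol prefix."""
    if len(prefix) == 0:
        probs = {0: 0.6, 1: 0.4}
    elif len(prefix) == 1:
        probs = {0: 0.5, 1: 0.5} if prefix[0] == 0 else {0: 0.9, 1: 0.1}
    else:
        probs = {EOS: 1.0}
    return {sym: math.log(p) for sym, p in probs.items()}


def beam_search(step_log_probs, beam_size, eos, max_len):
    """Return the highest-scoring (sequence, log-probability) pair found."""
    beams = [((), 0.0)]  # unfinished hypotheses
    finished = []        # hypotheses that already emitted eos
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next symbol.
        candidates = []
        for prefix, score in beams:
            for sym, log_p in step_log_probs(prefix).items():
                candidates.append((prefix + (sym,), score + log_p))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])


greedy_seq, greedy_score = beam_search(toy_model, beam_size=1, eos=EOS, max_len=5)
beam_seq, beam_score = beam_search(toy_model, beam_size=2, eos=EOS, max_len=5)
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

With a beam size of 1 the search commits to the locally best first symbol; with a beam size of 2 it keeps the second-best first symbol alive long enough to discover that it leads to a higher overall probability.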

Abstract

This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT). The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model. The VQ-VAE model learns a mapping function from a speech waveform into a sequence of discrete symbols, and then the Transformer-NMT model is trained to estimate this discrete symbol sequence from a given input text. Since the VQ-VAE model can learn such a mapping in a fully data-driven manner, we do not need to consider the feature-extraction hyperparameters required in conventional E2E-TTS models. Thanks to the use of discrete symbols, we can use various techniques developed in NMT and automatic speech recognition (ASR) such as beam search, subword units, and fusions with a language model. Furthermore,…
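
The step that turns continuous speech representations into discrete symbols can be illustrated with a nearest-codebook lookup, the core operation of vector quantization. This is a minimal sketch under stated assumptions, not the paper's model: the two-dimensional `frames` and three-entry `codebook` are made-up values, and in a real VQ-VAE both the encoder producing the frames and the codebook are learned.

```python
import numpy as np


def quantize(frames, codebook):
    """Map each frame (row) to the index of its nearest codebook vector."""
    # Squared Euclidean distance between every frame and every codebook
    # entry, computed via broadcasting; result has shape
    # (num_frames, codebook_size).
    diffs = frames[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


# Hypothetical codebook with three entries and four encoder output frames.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]])

symbols = quantize(frames, codebook)
print(symbols.tolist())  # [0, 1, 2, 0]
```

The resulting index sequence plays the role of the target "sentence" that the Transformer-NMT model is trained to predict from text.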

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling

Methods: VQ-VAE