A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

Li-Wei Chen; Shinji Watanabe; Alexander Rudnicky

arXiv:2302.04215 · eess.AS · February 9, 2023

TL;DR

This paper presents MQTTS, a novel vector quantized TTS system trained on real-world data, which effectively handles speech diversity and alignment issues, outperforming existing systems in quality.

Contribution

The work introduces a vector quantized approach with multiple code groups and monotonic alignment for real-world data TTS, addressing alignment mismatch problems.

Findings

01. MQTTS outperforms existing TTS systems in objective measures.

02. Using discrete codes improves speech intelligibility.

03. Silence prompts enhance synthesis quality.

Abstract

Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synthesizers. We train TTS systems using real-world speech from YouTube and podcasts. We observe a mismatch between training and inference alignments in mel-spectrogram-based autoregressive models, which leads to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolve this issue. We introduce our MQTTS system, whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt…
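
To make the phrase "discrete codes within multiple code groups" concrete, the following is a minimal sketch of grouped vector quantization in general, not the MQTTS implementation. It assumes each frame is split into equal-width chunks and that every chunk is matched to the nearest entry of its own codebook. The function names (`nearest_code`, `quantize_grouped`, `dequantize_grouped`) and the toy codebooks are hypothetical and hand-written for illustration; in a real system the codebooks would be learned.

```python
from typing import List, Sequence


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize_grouped(
    frame: Sequence[float], codebooks: Sequence[Sequence[Sequence[float]]]
) -> List[int]:
    """Split one frame into len(codebooks) equal chunks; quantize each with its own codebook."""
    n_groups = len(codebooks)
    if len(frame) % n_groups != 0:
        raise ValueError("frame length must be divisible by the number of groups")
    width = len(frame) // n_groups
    return [
        nearest_code(frame[g * width:(g + 1) * width], codebooks[g])
        for g in range(n_groups)
    ]


def dequantize_grouped(
    indices: Sequence[int], codebooks: Sequence[Sequence[Sequence[float]]]
) -> List[float]:
    """Rebuild an approximate frame by concatenating the selected code vectors."""
    frame: List[float] = []
    for index, codebook in zip(indices, codebooks):
        frame.extend(codebook[index])
    return frame


# Toy data (assumed, not from the paper): two groups, each with a small codebook.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],              # group 0: 2 codes
    [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],  # group 1: 3 codes
]
frame = [0.9, 1.2, 0.4, 0.6]

codes = quantize_grouped(frame, codebooks)               # one index per group
reconstruction = dequantize_grouped(codes, codebooks)    # approximate frame
```

With G groups of K codes each, a frame is described by G small indices while the number of distinct reconstructions grows as K to the power G. This is the usual motivation for grouping codes rather than using one very large codebook.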

Code & Models

Repositories

b04901014/mqtts
PyTorch · Official

Videos

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech · underline

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Attention with Linear Biases