A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS

Haohan Guo; Fenglong Xie; Frank K. Soong; Xixin Wu; Helen Meng

arXiv:2209.10887·cs.SD·September 23, 2022

TL;DR

This paper introduces a multi-stage, multi-codebook (MSMC) VQ-VAE approach for neural TTS that encodes Mel spectrograms at multiple time resolutions and reaches an MOS of 4.41, against 3.62 for the baseline.

Contribution

It presents a multi-stage, multi-codebook VQ-VAE framework for neural TTS that improves speech quality and efficiency over the baseline.

Findings

01 Achieves an MOS of 4.41, surpassing the baseline of 3.62.

02 Effective use of multiple stages and codebooks improves TTS performance.

03 Compact models maintain high MOS scores with fewer parameters.

Abstract

We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis. A vector-quantized, variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data by down-sampling progressively in multiple stages into MSMC Representations (MSMCRs) with different time resolutions, and quantizing them with multiple VQ codebooks, respectively. Multi-stage predictors are trained to map the input text sequence to MSMCRs progressively by minimizing a combined loss of the reconstruction Mean Square Error (MSE) and "triplet loss". In synthesis, the neural vocoder converts the predicted MSMCRs into final speech waveforms. The proposed approach is trained and tested with an English TTS database of 16 hours by a female speaker. The proposed TTS achieves an MOS score of 4.41, which outperforms the baseline with an MOS of 3.62.…
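
The analysis and training steps described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes down-sampling is average pooling over time, that "multiple codebooks" means splitting each feature vector into equal sub-vectors with one codebook per sub-vector, and that the triplet term is a hinge on squared distances with a hypothetical `margin` and `weight`. The function names and all sizes are made up for the example.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool a (T, D) sequence along time by `factor` (assumed pooling; drops the remainder)."""
    t = (x.shape[0] // factor) * factor
    return x[:t].reshape(-1, factor, x.shape[1]).mean(axis=1)

def quantize(x, codebooks):
    """Quantize (T, D) frames with H codebooks, each of shape (K, D // H).

    Assumption: each frame is split into H equal sub-vectors, and every
    sub-vector is replaced by its nearest codeword in its own codebook.
    """
    parts = np.split(x, len(codebooks), axis=1)
    quantized, indices = [], []
    for part, book in zip(parts, codebooks):
        # Squared Euclidean distance from every frame to every codeword: (T, K)
        dist = ((part[:, None, :] - book[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)
        indices.append(idx)
        quantized.append(book[idx])
    return np.concatenate(quantized, axis=1), np.stack(indices, axis=1)

def mse_plus_triplet(pred, target, negative, margin=1.0, weight=1.0):
    """MSE to the target plus a hinge-style triplet term (assumed form).

    The triplet term is zero once `pred` is closer to `target` than to
    `negative` by at least `margin` (in squared distance).
    """
    mse = ((pred - target) ** 2).mean()
    d_pos = ((pred - target) ** 2).sum(axis=1)
    d_neg = ((pred - negative) ** 2).sum(axis=1)
    triplet = np.maximum(0.0, d_pos - d_neg + margin).mean()
    return mse + weight * triplet

# Two stages at different time resolutions, two codebooks per stage (toy sizes).
rng = np.random.default_rng(0)
mel = rng.normal(size=(16, 8))            # stand-in for a Mel spectrogram (T=16, D=8)
for factor in (1, 4):                     # hypothetical down-sampling factors
    books = [rng.normal(size=(32, 4)) for _ in range(2)]
    z = downsample(mel, factor)
    z_q, idx = quantize(z, books)
    print(factor, z_q.shape, idx.shape)
```

Each stage yields a sequence of codeword indices at its own time resolution. In the pipeline the abstract describes, predictors would be trained to produce these representations from text and a neural vocoder would convert them to waveforms; neither component is sketched here.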

Code & Models

Repositories

hhguo/msmc-tts (PyTorch)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques