VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

Chenpeng Du; Yiwei Guo; Xie Chen; Kai Yu

arXiv:2204.00768 · eess.AS · October 25, 2024

TL;DR

VQTTS is a TTS system that uses self-supervised vector-quantized (VQ) acoustic features in place of the traditional mel-spectrogram, improving the naturalness and fidelity of the synthesized speech.

Contribution

The paper presents a TTS framework built on VQ features, pairing a classification-based acoustic model with a specialized vocoder, and reports state-of-the-art naturalness.

Findings

01. VQ acoustic features improve reconstruction quality.

02. VQTTS outperforms existing TTS systems in naturalness.

03. Self-supervised VQ features enhance speech synthesis fidelity.

Abstract

The mainstream neural text-to-speech (TTS) pipeline is a cascade system, comprising an acoustic model (AM) that predicts acoustic features from the input transcript and a vocoder that generates the waveform from the given acoustic features. However, the acoustic feature in current TTS systems is typically the mel-spectrogram, which is highly correlated along both the time and frequency axes in a complicated way, making it very difficult for the AM to predict. Although recent neural vocoders can generate high-fidelity audio from the ground-truth (GT) mel-spectrogram, the gap between the GT mel-spectrogram and the one predicted by the AM degrades the performance of the entire TTS system. In this work, we propose VQTTS, consisting of an AM txt2vec and a vocoder vec2wav, which uses self-supervised vector-quantized (VQ) acoustic features rather than the mel-spectrogram. We redesign both the AM and the…
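
To illustrate what a vector-quantized acoustic feature is in general terms, the following is a minimal sketch of nearest-neighbour vector quantization in Python with NumPy. It is not the paper's feature extractor: the toy 2-dimensional codebook, the example frames, and the function names `quantize` and `dequantize` are all hypothetical, chosen only to show how continuous frames become discrete indices, which is what lets an acoustic model treat prediction as classification over codebook entries.

```python
import numpy as np

def quantize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame of shape (T, D) to the index of its nearest codebook vector (K, D)."""
    # Pairwise squared Euclidean distances, shape (T, K), via broadcasting.
    distances = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return distances.argmin(axis=1)

def dequantize(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look up the codebook vector for each index, giving the quantized frames."""
    return codebook[indices]

# Hypothetical toy codebook with K=4 entries of dimension D=2 (illustration only).
codebook = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# Three hypothetical continuous "acoustic" frames.
frames = np.array([[0.1, 0.9], [0.9, 0.2], [0.05, -0.1]])

indices = quantize(frames, codebook)        # discrete targets, one class per frame
quantized = dequantize(indices, codebook)   # the vectors a vocoder would receive
print(indices.tolist())  # [1, 2, 0]
```

Under this view, an acoustic model only has to pick one of K classes per frame rather than regress a highly correlated continuous spectrum, and the vocoder consumes the looked-up vectors.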

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Attention Model