WavThruVec: Latent speech representation as intermediate features for neural speech synthesis

Hubert Siuzdak; Piotr Dura; Pol van Rijn; Nori Jacoby

arXiv:2203.16930 · cs.SD · November 22, 2022 · 1 citation

Open Access

TL;DR

WavThruVec is a two-stage neural speech synthesis model that uses high-level Wav2Vec 2.0 embeddings as intermediate features, which improves robustness and generalization and enables zero-shot synthesis.

Contribution

The paper proposes WavThruVec, a two-stage architecture that uses Wav2Vec 2.0 embeddings as the intermediate speech representation, addressing the limitations of predetermined low-level features such as mel-spectrograms.

Findings

01. Matches state-of-the-art synthesis quality

02. Improves robustness to noise and out-of-vocabulary words

03. Enables zero-shot voice synthesis and conversion

Abstract

Recent advances in neural text-to-speech research have been dominated by two-stage pipelines that use low-level intermediate speech representations such as mel-spectrograms. However, such predetermined features are fundamentally limited because they do not allow the full potential of a data-driven approach to be exploited by learning hidden representations. For this reason, several end-to-end methods have been proposed, but such models are harder to train and require a large number of high-quality recordings with transcriptions. Here, we propose WavThruVec, a two-stage architecture that resolves the bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as the intermediate speech representation. Since these hidden activations provide high-level linguistic features, they are more robust to noise. That allows us to utilize annotated speech datasets of a lower quality to train the…
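To make the idea of a frame-level intermediate representation concrete, the following minimal sketch computes how many embedding frames a Wav2Vec 2.0 feature encoder emits for a waveform of a given length. The kernel and stride values are those of the standard Wav2Vec 2.0 convolutional encoder; the function names are hypothetical, and nothing here is taken from the WavThruVec implementation, whose exact configuration is not given on this page.

```python
# Standard Wav2Vec 2.0 convolutional feature encoder: (kernel, stride) per layer.
# Assumption: the paper's model uses this unmodified encoder layout.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def hop_size() -> int:
    """Samples between consecutive frames: the product of all strides."""
    hop = 1
    for _, stride in CONV_LAYERS:
        hop *= stride
    return hop


def num_frames(num_samples: int) -> int:
    """Number of embedding frames produced for an unpadded waveform."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        if length < kernel:
            return 0  # input too short to yield a single frame
        length = (length - kernel) // stride + 1
    return length


if __name__ == "__main__":
    sample_rate = 16_000  # Wav2Vec 2.0 operates on 16 kHz audio
    print("hop (samples):", hop_size())
    print("hop (ms):", 1000 * hop_size() / sample_rate)
    print("frames for 1 s of audio:", num_frames(sample_rate))
```

Under these assumptions, a first-stage model would predict one embedding vector per 20 ms frame from text, and a second-stage vocoder would expand each frame back into waveform samples.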

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques