WavThruVec: Latent speech representation as intermediate features for neural speech synthesis
Hubert Siuzdak, Piotr Dura, Pol van Rijn, Nori Jacoby

TL;DR
WavThruVec is a two-stage neural speech synthesis model that uses high-level Wav2Vec 2.0 embeddings as intermediate features, which improves robustness and generalization and enables zero-shot synthesis.
Contribution
The paper proposes WavThruVec, a novel two-stage architecture leveraging Wav2Vec 2.0 embeddings for improved neural speech synthesis, addressing limitations of traditional intermediate features.
Findings
Matches state-of-the-art synthesis quality
Improves robustness to noise and out-of-vocabulary words
Enables zero-shot voice synthesis and conversion
Abstract
Recent advances in neural text-to-speech research have been dominated by two-stage pipelines utilizing low-level intermediate speech representations such as mel-spectrograms. However, such predetermined features are fundamentally limited, because they do not allow the full potential of a data-driven approach to be exploited through learning hidden representations. For this reason, several end-to-end methods have been proposed. However, such models are harder to train and require a large number of high-quality recordings with transcriptions. Here, we propose WavThruVec - a two-stage architecture that resolves the bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as intermediate speech representation. Since these hidden activations provide high-level linguistic features, they are more robust to noise. That allows us to utilize annotated speech datasets of a lower quality to train the…
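To make the two-stage idea in the abstract concrete, the following is a minimal shape-level sketch rather than the authors' implementation. It assumes the standard Wav2Vec 2.0 convolutional feature encoder (kernel widths 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2) and 16 kHz input. The functions `text_to_vec` and `vec_to_wav`, the `frames_per_char` parameter, and the default embedding size of 768 are hypothetical placeholders chosen for illustration; the stubs return zeros and only show how the intermediate representation sits between text and waveform.

```python
from typing import List

# Kernel widths and strides of the Wav2Vec 2.0 convolutional feature encoder.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
HOP_SAMPLES = 320  # product of the strides: one frame per 20 ms at 16 kHz


def wav2vec2_num_frames(num_samples: int) -> int:
    """Number of embedding frames the encoder yields for a raw waveform."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return max(length, 0)


def text_to_vec(text: str, frames_per_char: int = 4, dim: int = 768) -> List[List[float]]:
    """Hypothetical stage 1: text -> sequence of embedding frames.

    A real model would predict durations and embeddings; this stub only
    returns zeros with a plausible shape (frames x dim).
    """
    num_frames = len(text) * frames_per_char
    return [[0.0] * dim for _ in range(num_frames)]


def vec_to_wav(frames: List[List[float]], hop: int = HOP_SAMPLES) -> List[float]:
    """Hypothetical stage 2: embedding frames -> waveform samples.

    A real vocoder would upsample the frames into audio; this stub only
    returns silence of the corresponding length.
    """
    return [0.0] * (len(frames) * hop)


if __name__ == "__main__":
    print(wav2vec2_num_frames(16000))   # frames for one second of 16 kHz audio
    frames = text_to_vec("hello")
    print(len(frames), len(frames[0]))  # frames x embedding dimension
    print(len(vec_to_wav(frames)))      # output samples
```

The point of the sketch is the interface: the first stage only has to produce a frame sequence at roughly 50 frames per second, and the second stage only has to invert that sequence to audio, so each stage can in principle be trained on different data.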
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
