Simple and Effective Unsupervised Speech Synthesis
Alexander H. Liu, Cheng-I Jeff Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass

TL;DR
This paper presents the first unsupervised speech synthesis system that generates natural and intelligible speech using only unlabeled audio, text, and a lexicon, eliminating the need for labeled datasets.
Contribution
The paper introduces an unsupervised speech synthesis framework that combines recent unsupervised speech recognition with existing neural speech synthesis, so that no human-labeled corpus is required.
Findings
Synthesizes speech similar in naturalness to a supervised counterpart, as measured by human evaluation
Reaches intelligibility similar to the supervised counterpart in the same human evaluation
Requires only unlabeled speech audio, unlabeled text, and a lexicon
Abstract
We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe. The framework leverages recent work in unsupervised speech recognition as well as existing neural-based speech synthesis. Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus. Experiments demonstrate the unsupervised system can synthesize speech similar to a supervised counterpart in terms of naturalness and intelligibility measured by human evaluation.
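The abstract does not spell out how the lexicon is used. One plausible reading, stated here as an assumption rather than the authors' implementation, is that the lexicon maps written words to phoneme sequences, so that unlabeled text can be expressed in the same units an unsupervised recognizer would predict from audio. The sketch below shows only that lookup step. The lexicon entries, the function name `text_to_phonemes`, and the `<unk>` placeholder are all hypothetical.

```python
from typing import Dict, List

# Toy pronunciation lexicon with ARPAbet-style symbols.
# Hypothetical entries for illustration; not taken from the paper.
TOY_LEXICON: Dict[str, List[str]] = {
    "speech": ["S", "P", "IY", "CH"],
    "is": ["IH", "Z"],
    "simple": ["S", "IH", "M", "P", "AH", "L"],
}


def text_to_phonemes(
    text: str, lexicon: Dict[str, List[str]], unk: str = "<unk>"
) -> List[str]:
    """Map raw text to a flat phoneme sequence via dictionary lookup.

    Words missing from the lexicon become a single `unk` symbol
    (an assumption; a real system might use a G2P model instead).
    """
    phonemes: List[str] = []
    for token in text.lower().split():
        word = token.strip(".,!?;:\"'")  # drop surrounding punctuation
        if not word:
            continue
        phonemes.extend(lexicon.get(word, [unk]))
    return phonemes


if __name__ == "__main__":
    print(text_to_phonemes("Speech is simple.", TOY_LEXICON))
```

Under this reading, a synthesizer trained on phoneme sequences could be driven at inference time by passing input text through the same lookup.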
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems
