Pheme: Efficient and Conversational Speech Generation
Paweł Budzianowski, Taras Sereda, Tomasz Cichy, Ivan Vulić

TL;DR
Pheme is a compact, efficient, high-quality conversational speech generation model that supports parallel, real-time synthesis with less training data, surpassing existing autoregressive TTS models in performance and efficiency.
Contribution
The paper presents the Pheme model series, which achieves high-quality, parallel speech synthesis with reduced data requirements and training time, addressing limitations of current autoregressive TTS systems.
Findings
Pheme models match the quality of larger autoregressive TTS models.
They enable real-time, parallel speech generation.
Training requires more than 10 times less data.
Abstract
In recent years, speech generation has seen remarkable progress, now achieving one-shot generation capability that is often virtually indistinguishable from real human voice. Integrating such advancements in speech generation with large language models might revolutionize a wide range of applications. However, certain applications, such as assistive conversational systems, require natural and conversational speech generation tools that also operate efficiently in real time. Current state-of-the-art models like VALL-E and SoundStorm, powered by hierarchical neural audio codecs, require large neural components and extensive training data to work well. In contrast, MQTTS aims to build more compact conversational TTS models while capitalizing on smaller-scale real-life conversational speech data. However, its autoregressive nature yields high inference latency and thus limits its real-time…
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems
