Pheme: Efficient and Conversational Speech Generation

Paweł Budzianowski; Taras Sereda; Tomasz Cichy; Ivan Vulić

arXiv:2401.02839 · eess.AS · January 8, 2024 · 1 citation

TL;DR

Pheme is a series of compact conversational speech generation models that synthesize speech in parallel, support real-time use, and train on smaller-scale conversational data, matching the quality of autoregressive TTS models at lower inference latency.

Contribution

The paper presents the Pheme model series, which achieves high-quality, parallel speech synthesis with reduced data requirements and training time, addressing the high inference latency that limits autoregressive TTS systems such as MQTTS in real-time use.

Findings

01. Pheme models match the quality of larger autoregressive TTS models.

02. They enable real-time, parallel speech generation.

03. Training data requirements are reduced by more than 10 times.

Abstract

In recent years, speech generation has seen remarkable progress, now achieving one-shot generation capability that is often virtually indistinguishable from real human voice. Integrating such advancements in speech generation with large language models might revolutionize a wide range of applications. However, certain applications, such as assistive conversational systems, require natural and conversational speech generation tools that also operate efficiently in real time. Current state-of-the-art models like VALL-E and SoundStorm, powered by hierarchical neural audio codecs, require large neural components and extensive training data to work well. In contrast, MQTTS aims to build more compact conversational TTS models while capitalizing on smaller-scale real-life conversational speech data. However, its autoregressive nature yields high inference latency and thus limits its real-time…

Code & Models

Repositories

PolyAI-LDN/pheme (PyTorch, official)

Datasets

Pendrokar/open_tts_tracker
dataset · 396 downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems