MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis

Qian Yang; Jialong Zuo; Zhe Su; Ziyue Jiang; Mingze Li; Zhou Zhao; Feiyang Chen; Zhefeng Wang; Baoxing Huai

arXiv:2407.14006 · eess.AS · July 22, 2024

TL;DR

MSceneSpeech is a high-quality, open-source Mandarin speech dataset designed for expressive multi-scene speech synthesis, featuring diverse speakers, prosodic styles, and a baseline model capable of scene-specific and speaker-specific speech generation.

Contribution

The paper introduces MSceneSpeech, a comprehensive multi-scene Mandarin speech dataset, and establishes a baseline model for expressive, multi-speaker, scene-aware speech synthesis.

Findings

01. Effective synthesis of scene-specific prosody and speaker timbre.

02. Diverse multi-scene, multi-speaker speech data enhances synthesis quality.

03. Open-source dataset facilitates research in expressive speech synthesis.

Abstract

We introduce MSceneSpeech (Multiple Scene Speech Dataset), an open-source, high-quality Mandarin TTS dataset intended to provide resources for expressive speech synthesis. MSceneSpeech comprises numerous audio recordings and texts performed and recorded according to daily life scenarios. Each scenario includes multiple speakers and a diverse range of prosodic styles, making it suitable for speech synthesis that entails multi-speaker style and prosody modeling. Using a prompting mechanism, we establish a robust baseline that can effectively synthesize speech with both user-specific timbre and scene-specific prosody from arbitrary text input. The open-source MSceneSpeech dataset and audio samples of our baseline are available at https://speechai-demo.github.io/MSceneSpeech/.

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems