MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis
Qian Yang, Jialong Zuo, Zhe Su, Ziyue Jiang, Mingze Li, Zhou Zhao, Feiyang Chen, Zhefeng Wang, Baoxing Huai

TL;DR
MSceneSpeech is a high-quality, open-source Mandarin speech dataset for expressive multi-scene speech synthesis. It covers diverse speakers and prosodic styles and is accompanied by a baseline model capable of scene-specific and speaker-specific speech generation.
Contribution
The paper introduces MSceneSpeech, a comprehensive multi-scene Mandarin speech dataset, and establishes a baseline model for expressive, multi-speaker, scene-aware speech synthesis.
Findings
The baseline effectively synthesizes both scene-specific prosody and speaker-specific timbre.
Diverse multi-scene, multi-speaker speech data enhances synthesis quality.
The open-source dataset facilitates research in expressive speech synthesis.
Abstract
We introduce an open-source, high-quality Mandarin TTS dataset, MSceneSpeech (Multiple Scene Speech Dataset), which is intended to provide resources for expressive speech synthesis. MSceneSpeech comprises numerous audio recordings and texts performed and recorded according to daily life scenarios. Each scenario includes multiple speakers and a diverse range of prosodic styles, making it suitable for speech synthesis that entails multi-speaker style and prosody modeling. We have established a robust baseline, through the prompting mechanism, that can effectively synthesize speech characterized by both user-specific timbre and scene-specific prosody with arbitrary text input. The open-source MSceneSpeech dataset and audio samples of our baseline are available at https://speechai-demo.github.io/MSceneSpeech/.
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems
