EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model
Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao

TL;DR
This paper introduces EMOVIE, a Mandarin emotional speech dataset, and proposes EMSpeech, a simple model that generates expressive speech from text and emotion labels, advancing emotional TTS research.
Contribution
The paper releases a new Mandarin emotion speech dataset and presents EMSpeech, a model that predicts emotion from text without extra reference audio, improving emotional speech synthesis.
Findings
The dataset effectively supports emotion classification tasks.
EMSpeech achieves comparable performance in emotional speech synthesis.
The model generates expressive speech conditioned on predicted emotion labels.
Abstract
Recently, there has been increasing interest in neural speech synthesis. While deep neural networks achieve state-of-the-art results in text-to-speech (TTS) tasks, generating more emotional and expressive speech has become a new challenge for researchers, owing to the scarcity of high-quality emotional speech datasets and the lack of advanced emotional TTS models. In this paper, we first briefly introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples, each with an audio file and a human-labeled emotion annotation. We then propose a simple but efficient architecture for emotional speech synthesis called EMSpeech. Unlike models that require additional reference audio as input, our model predicts emotion labels from the input text alone and generates more expressive speech conditioned on the emotion embedding. In the experiment phase,…
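The conditioning idea described in the abstract (predict an emotion label from the text, then condition synthesis on that label's embedding) can be illustrated with a toy sketch. This is not the authors' EMSpeech implementation: the label set, the hidden size, the mean-pooling classifier, the additive conditioning, and every function name below are assumptions made for illustration, and the parameters are random rather than learned.

```python
import random

HIDDEN = 4  # hypothetical hidden size
EMOTIONS = ["negative", "neutral", "positive"]  # hypothetical label set

rng = random.Random(0)


def rand_vec(n):
    """Random vector standing in for learned parameters."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


# Toy parameters: a real model would learn these during training.
token_table = {}
emotion_table = {e: rand_vec(HIDDEN) for e in EMOTIONS}
classifier_w = {e: rand_vec(HIDDEN) for e in EMOTIONS}


def encode(tokens):
    """Stand-in text encoder: one hidden vector per token."""
    return [token_table.setdefault(t, rand_vec(HIDDEN)) for t in tokens]


def predict_emotion(hidden):
    """Mean-pool the encoder states and return the highest-scoring label."""
    pooled = [sum(col) / len(hidden) for col in zip(*hidden)]
    scores = {
        e: sum(w * x for w, x in zip(classifier_w[e], pooled))
        for e in EMOTIONS
    }
    return max(scores, key=scores.get)


def condition(hidden, emotion):
    """Add the emotion embedding to every encoder state (assumed scheme)."""
    emb = emotion_table[emotion]
    return [[h + e for h, e in zip(vec, emb)] for vec in hidden]


def prepare_decoder_input(tokens, emotion=None):
    """Use the given label if present; otherwise predict it from the text."""
    hidden = encode(tokens)
    label = emotion if emotion is not None else predict_emotion(hidden)
    return label, condition(hidden, label)
```

Calling `prepare_decoder_input(tokens)` with no label mirrors the reference-free setting the abstract describes, while passing `emotion="positive"` shows how a ground-truth or user-chosen label could override the prediction. In a real system the conditioned states would feed a decoder and vocoder, which this sketch omits.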
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
