Model architectures to extrapolate emotional expressions in DNN-based text-to-speech
Katsuki Inoue, Sunao Hara, Masanobu Abe, Nobukatsu Hojo, Yusuke Ijima

TL;DR
This paper introduces novel DNN architectures for emotional speech synthesis that can extrapolate emotional expressions without needing emotional speech data from target speakers, enabling flexible and efficient emotional TTS.
Contribution
The study proposes multiple architectures to separately model speaker and emotional features, allowing emotional expression transfer without target speaker emotional speech data.
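As a minimal sketch of what modeling speaker and emotion separately can look like at the input side, the snippet below builds a conditioning vector from two independent one-hot codes. It is not the paper's PM or any other architecture from the paper; the speaker names, emotion labels, feature values, and the helpers `one_hot` and `build_input` are all hypothetical and only illustrate how an unseen speaker-emotion pair can still be requested.

```python
# Hypothetical inventories; the paper's actual speakers and emotion labels may differ.
SPEAKERS = ["spk_a", "spk_b", "spk_c"]
EMOTIONS = ["neutral", "joyful", "sad"]


def one_hot(label, inventory):
    """Return a one-hot list marking the position of `label` in `inventory`."""
    return [1.0 if item == label else 0.0 for item in inventory]


def build_input(linguistic_features, speaker, emotion):
    """Concatenate linguistic features with separate speaker and emotion codes."""
    return (
        list(linguistic_features)
        + one_hot(speaker, SPEAKERS)
        + one_hot(emotion, EMOTIONS)
    )


# Assumed training coverage: only spk_a recorded emotional speech.
seen_pairs = {
    ("spk_a", "neutral"), ("spk_a", "joyful"), ("spk_a", "sad"),
    ("spk_b", "neutral"), ("spk_c", "neutral"),
}

# An extrapolated condition: spk_b never recorded sad speech.
target = ("spk_b", "sad")
x = build_input([0.2, 0.7], *target)  # placeholder linguistic features
is_extrapolation = target not in seen_pairs
```

Because the speaker code and the emotion code occupy separate slots, any pair can be formed at synthesis time, including pairs absent from the training data. Whether the synthesized speech actually carries the requested emotion depends on the model, which is what the evaluations summarized under Findings examine.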
Findings
Subjective evaluations show the models can convey emotions to some extent.
The parallel model (PM) correctly conveys sad and joyful emotions at a rate above 60%.
Objective evaluations are inconclusive due to variability in emotional expression.
Abstract
This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). Here, to "extrapolate emotional expressions" means to borrow emotional expressions from other speakers, so that collecting emotional speech uttered by the target speakers is unnecessary. Although DNNs have the potential to support TTS with emotional expressions, and some DNN-based TTS systems have demonstrated satisfactory performance in expressing the diversity of human speech, such systems require emotional speech uttered by the target speakers, which is troublesome to collect. To solve this issue, we propose architectures that train the speaker feature and the emotional feature separately and synthesize speech with any combination of speaker and emotion. The architectures are parallel model (PM), serial model…
