EMPHASIS: An Emotional Phoneme-based Acoustic Model for Speech Synthesis System

Hao Li; Yongguo Kang; Zhenyu Wang

arXiv:1806.09276 · eess.AS · June 27, 2018 · 1 citation

TL;DR

EMPHASIS is a multi-lingual, phoneme-based acoustic model for speech synthesis that produces expressive, high-quality speech in real time, incorporating emotional and prosodic features for Mandarin-English synthesis.

Contribution

It introduces a novel CBHG-based regression network with feature grouping for improved emotional speech synthesis in a multi-lingual framework.

Findings

01. Achieves better subjective quality than existing real-time systems

02. Capable of synthesizing expressive interrogative and exclamatory speech

03. Supports Mandarin-English bilingual speech synthesis

Abstract

We present EMPHASIS, an emotional phoneme-based acoustic model for a speech synthesis system. EMPHASIS includes a phoneme duration prediction model and an acoustic parameter prediction model. It uses a CBHG-based regression network to model the dependencies between linguistic features and acoustic features. We modify the input and output layer structures of the network to improve performance. For the linguistic features, we apply a feature grouping strategy to enhance emotional and prosodic features. The acoustic parameters are designed to suit both the regression task and waveform reconstruction. EMPHASIS can synthesize speech in real time and generate expressive interrogative and exclamatory speech with high audio quality. EMPHASIS is designed to be a multi-lingual model and currently synthesizes Mandarin-English speech. In the experiment of emotional speech synthesis, it…
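
The abstract names two components, a phoneme duration prediction model and an acoustic parameter prediction model, but this page does not say how they are connected. A common arrangement in statistical parametric synthesis is to predict a frame count per phoneme, repeat each phoneme's linguistic features for that many frames, and then predict acoustic parameters frame by frame. The sketch below illustrates that arrangement only; the function names, the relative-position feature, and the toy stand-in models are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def expand_to_frames(phoneme_features: Sequence[Vector],
                     durations: Sequence[int]) -> List[Vector]:
    """Repeat each phoneme's features once per frame.

    Assumption: a relative position within the phoneme (0 <= p < 1) is
    appended to each frame so frames of one phoneme are distinguishable.
    """
    if len(phoneme_features) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[Vector] = []
    for feats, n_frames in zip(phoneme_features, durations):
        for i in range(n_frames):
            frames.append(list(feats) + [i / n_frames])
    return frames


def predict_acoustic_parameters(
        phoneme_features: Sequence[Vector],
        duration_model: Callable[[Vector], float],
        acoustic_model: Callable[[Vector], Vector]) -> List[Vector]:
    """Two-stage pipeline: durations first, then per-frame parameters."""
    # Each phoneme gets at least one frame.
    durations = [max(1, round(duration_model(f))) for f in phoneme_features]
    frames = expand_to_frames(phoneme_features, durations)
    return [acoustic_model(frame) for frame in frames]


# Toy stand-ins for the two trained networks (hypothetical, for shape only).
def toy_duration_model(feats: Vector) -> float:
    return 2.0 + feats[0]


def toy_acoustic_model(frame: Vector) -> Vector:
    return [sum(frame)]


phonemes = [[1.0, 0.0], [0.0, 1.0]]
frames = expand_to_frames(phonemes, [3, 2])
params = predict_acoustic_parameters(phonemes, toy_duration_model,
                                     toy_acoustic_model)
```

In a real system both callables would be trained networks (the abstract mentions a CBHG-based regression network), and the per-frame outputs would go to a vocoder for waveform reconstruction.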

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing