Modeling of Rakugo Speech and Its Limitations: Toward Speech Synthesis That Entertains Audiences

Shuhei Kato, Yusuke Yasuda, Xin Wang, Erica Cooper, Shinji Takaki, and Junichi Yamagishi

arXiv:1911.00137 · eess.AS · June 2, 2020 · IEEE Access

Open Access

TL;DR

This paper models rakugo speech with Tacotron 2 and a self-attention-enhanced variant to study speech synthesis that entertains audiences, and it reports the current limitations along with insights into improving character distinguishability and expressiveness.

Contribution

It applies Tacotron 2, an enhanced version with self-attention, global style tokens, and manually labeled context features to rakugo speech synthesis, and it offers new insight into what makes synthesized speech entertaining beyond naturalness.

Findings

01. Synthesized speech lacks professional-level quality.

02. Character distinguishability and content understandability are crucial for entertainment.

03. Richer fundamental frequency expression enhances entertainment value.

Abstract

We have been investigating rakugo speech synthesis as a challenging example of speech synthesis that entertains audiences. Rakugo is a traditional Japanese form of verbal entertainment similar to a combination of one-person stand-up comedy and comic storytelling and is popular even today. In rakugo, a performer plays multiple characters, and conversations or dialogues between the characters make the story progress. To investigate how close the quality of synthesized rakugo speech can approach that of professionals' speech, we modeled rakugo speech using Tacotron 2, a state-of-the-art speech synthesis system that can produce speech that sounds as natural as human speech albeit under limited conditions, and an enhanced version of it with self-attention to better consider long-term dependencies. We also used global style tokens and manually labeled context features to enrich speaking…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling