JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions

Detai Xin; Junfeng Jiang; Shinnosuke Takamichi; Yuki Saito; Akiko Aizawa; Hiroshi Saruwatari

arXiv:2310.06072 · cs.SD · March 7, 2024

Open Access

TL;DR

JVNV is a Japanese emotional speech corpus whose scripts, covering both verbal content and nonverbal vocalizations, are generated with a large language model. It is intended to improve emotional speech synthesis and recognition.

Contribution

This paper introduces JVNV, the first Japanese emotional speech corpus whose scripts are generated automatically by a large language model and incorporate nonverbal vocalizations.

Findings

01 JVNV has better phoneme coverage and emotion recognizability than previous corpora.

02 Adding nonverbal vocalizations increases synthesis difficulty and highlights future challenges.

03 Benchmark results show a performance gap between read-aloud and emotional speech synthesis.

Abstract

We present JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs), which are essential for expressing emotion in spoken language. We propose an automatic script generation method that produces emotional scripts by providing seed words with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using prompt engineering. We select 514 scripts with balanced phoneme coverage from the generated candidate scripts with the assistance of emotion confidence scores and language fluency scores. We demonstrate the effectiveness of JVNV by showing that it has better phoneme coverage and emotion recognizability than previous Japanese emotional speech corpora. We then…
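
The script selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract does not specify how the scores and phoneme coverage are combined, so the thresholds, the greedy coverage heuristic, and every name below (`Candidate`, `select_scripts`, `emotion_conf`, `fluency`) are assumptions made for illustration. The sketch filters candidates by the two scores, then repeatedly picks the script that adds the most phonemes not yet covered.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Candidate:
    """A generated script with precomputed annotations (all hypothetical)."""
    text: str
    phonemes: Tuple[str, ...]  # phoneme sequence, assumed to come from a G2P tool
    emotion_conf: float        # assumed emotion classifier confidence in [0, 1]
    fluency: float             # assumed language fluency score in [0, 1]


def select_scripts(
    candidates: List[Candidate],
    n_select: int,
    min_emotion: float = 0.5,   # illustrative threshold, not from the paper
    min_fluency: float = 0.5,   # illustrative threshold, not from the paper
) -> Tuple[List[Candidate], Set[str]]:
    """Filter by score thresholds, then greedily maximize new phoneme coverage."""
    pool = [
        c for c in candidates
        if c.emotion_conf >= min_emotion and c.fluency >= min_fluency
    ]
    covered: Set[str] = set()
    selected: List[Candidate] = []
    while pool and len(selected) < n_select:
        # Prefer the script adding the most unseen phonemes; break ties by score.
        best = max(
            pool,
            key=lambda c: (len(set(c.phonemes) - covered),
                           c.emotion_conf + c.fluency),
        )
        selected.append(best)
        covered.update(best.phonemes)
        pool.remove(best)
    return selected, covered
```

A greedy pass like this maximizes the number of distinct phonemes covered rather than balancing their frequencies, so it is only a rough stand-in for the "balanced phoneme coverage" criterion named in the abstract.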

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Emotion and Mood Recognition