Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control

Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Wataru Nakata, Detai Xin, Hiroshi Saruwatari

arXiv:2309.13509 · cs.SD · September 26, 2023

Open Access

TL;DR

Coco-Nut is a newly created Japanese speech corpus with diverse utterances and free-form voice descriptions, enabling improved control in text-to-speech synthesis through prompt-based methods.

Contribution

This paper introduces Coco-Nut, a large-scale, high-quality Japanese speech corpus with free-form descriptions, and demonstrates its utility via benchmarking with contrastive speech-text learning models.
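As a rough illustration of what contrastive speech-text learning optimizes, the sketch below computes a generic CLIP-style symmetric contrastive loss over a batch of paired speech and description embeddings. It is not taken from the paper: the function names, the toy 2-D embeddings, and the temperature of 0.07 are all assumptions made for illustration, and the models actually benchmarked may differ.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to be a matched pair.

    The temperature value is an illustrative assumption, not a value from the paper.
    """
    s = [l2_normalize(v) for v in speech_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(s)
    # logits[i][j]: scaled cosine similarity between speech i and description j
    logits = [
        [sum(a * b for a, b in zip(s[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Speech-to-text direction: the correct description sits on the diagonal
    s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-speech direction: same computation over the columns
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2s = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (s2t + t2s)


# Hypothetical toy embeddings: two utterances and two descriptions
speech = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # descriptions close to their own utterance
swapped = [[0.1, 0.9], [0.9, 0.1]]   # descriptions close to the wrong utterance

loss_aligned = contrastive_loss(speech, aligned)
loss_swapped = contrastive_loss(speech, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each utterance is most similar to its own description and large when the pairing is wrong, which is the property a speech-text retrieval benchmark relies on.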

Findings

01. The corpus enables more intuitive voice control in TTS.

02. Benchmark results show improved voice characteristic manipulation.

03. The methodology ensures high-quality, diverse data collection and annotation.

Abstract

In text-to-speech, controlling voice characteristics is important in achieving various-purpose speech synthesis. Considering the success of text-conditioned generation, such as text-to-image, free-form text instruction should be useful for intuitive and complicated control of voice characteristics. A sufficiently large corpus of high-quality and diverse voice samples with corresponding free-form descriptions can advance such control research. However, neither an open corpus nor a scalable method is currently available. To this end, we develop Coco-Nut, a new corpus including diverse Japanese utterances, along with text transcriptions and free-form voice characteristics descriptions. Our methodology to construct this corpus consists of 1) automatic collection of voice-related audio data from the Internet, 2) quality assurance, and 3) manual annotation using crowdsourcing. Additionally,…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling