JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis
Ryosuke Sonobe, Shinnosuke Takamichi, Hiroshi Saruwatari

TL;DR
The JSUT corpus is a freely available, large-scale Japanese speech dataset built for end-to-end speech synthesis research. It contains 10 hours of reading-style speech with transcriptions and covers all of the main pronunciations of daily-use Japanese characters.
Contribution
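This paper introduces JSUT, a free large-scale Japanese speech corpus designed for end-to-end speech synthesis. According to the authors, no free corpus of this kind, shareable between academic institutions and commercial companies, previously existed for Japanese speech synthesis.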
Findings
Covers all of the main pronunciations of daily-use Japanese characters
Contains 10 hours of reading-style speech data with transcriptions
Freely available online
Abstract
Thanks to improvements in machine learning techniques, including deep learning, a free large-scale speech corpus that can be shared between academic institutions and commercial companies plays an important role. However, such a corpus for Japanese speech synthesis does not exist. In this paper, we designed a novel Japanese speech corpus, named the "JSUT corpus," that is aimed at achieving end-to-end speech synthesis. The corpus consists of 10 hours of reading-style speech data and its transcription, and it covers all of the main pronunciations of daily-use Japanese characters. We describe how we designed and analyzed the corpus. The corpus is freely available online.
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
