J-MAC: Japanese multi-speaker audiobook corpus for speech synthesis
Shinnosuke Takamichi, Wataru Nakata, Naoko Tanji, Hiroshi Saruwatari

TL;DR
This paper introduces J-MAC, a Japanese audiobook speech corpus designed for advanced speech synthesis research, emphasizing cross-sentence expressiveness and speaker-specific nuances, and provides insights from initial synthesis evaluations.
Contribution
The paper presents a novel, language-independent method for automatically constructing a high-quality audiobook speech corpus from existing recordings, applies it to build the Japanese corpus J-MAC, and demonstrates the corpus's use in speech synthesis.
Findings
The corpus enables improved audiobook speech synthesis performance.
Automatic extraction and alignment methods are effective for corpus creation.
Initial evaluations offer insights into challenges and potentials in audiobook synthesis.
Abstract
In this paper, we construct a Japanese audiobook speech corpus called "J-MAC" for speech synthesis research. With the success of reading-style speech synthesis, the research target is shifting to tasks that use complicated contexts. Audiobook speech synthesis is a good example: it requires modeling cross-sentence context, expressiveness, and so on. Unlike in reading-style speech, speaker-specific expressiveness in audiobook speech also becomes part of the context. To advance this research, we propose a method of constructing a corpus from audiobooks read by professional speakers. From many audiobooks and their texts, our method can automatically extract and refine the data without any language dependency. Specifically, we use vocal-instrumental separation to extract clean data, connectionist temporal classification to roughly align text and audio, and voice activity detection to refine the alignment. J-MAC is…
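The last step named in the abstract, refining a rough alignment with voice activity detection, can be illustrated with a minimal sketch. The abstract does not say which detector the authors use, so the example below assumes a simple frame-energy threshold; the function names, frame length, and threshold are hypothetical choices for illustration, not the paper's implementation. Given a rough segment (as a CTC aligner might produce), it trims the boundaries to the first and last frames that contain speech-level energy.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def refine_segment(samples, start, end, frame_len=160, threshold=1e-4):
    """Tighten a rough [start, end) sample range to the voiced frames inside it.

    Assumption: a frame counts as speech when its energy exceeds `threshold`.
    Returns the refined (start, end) pair, or None if no frame is voiced.
    """
    energies = frame_energies(samples[start:end], frame_len)
    voiced = [i for i, e in enumerate(energies) if e > threshold]
    if not voiced:
        return None
    new_start = start + voiced[0] * frame_len
    new_end = start + (voiced[-1] + 1) * frame_len
    return new_start, new_end


# Synthetic demo: 0.1 s silence, 0.2 s of a 440 Hz tone, 0.1 s silence at 16 kHz.
sr = 16000
silence = [0.0] * 1600
speech = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(3200)]
signal = silence + speech + silence

# A rough alignment that overshoots the speech on both sides.
refined = refine_segment(signal, 800, 5600)
print(refined)  # (1600, 4800)
```

A fixed energy threshold is the simplest possible detector and would be fragile on real audiobook recordings with residual music or noise; it only shows where a VAD fits in the pipeline.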
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
