AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines
Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li

TL;DR
This paper introduces AISHELL-3, a large-scale multi-speaker Mandarin speech corpus with detailed speaker attributes, and presents a baseline multi-speaker TTS system capable of zero-shot voice cloning, demonstrating high voice similarity.
Contribution
The paper provides a new high-fidelity Mandarin speech dataset annotated with speaker gender, age group, and native accent, and develops a multi-speaker TTS baseline system capable of zero-shot voice cloning.
Findings
High voice similarity in synthesis results
Effective zero-shot voice cloning demonstrated
Dataset and baseline system publicly available
Abstract
In this paper, we present AISHELL-3, a large-scale and high-fidelity multi-speaker Mandarin speech corpus which could be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Chinese Mandarin speakers. Their auxiliary attributes such as gender, age group and native accents are explicitly marked and provided in the corpus. Accordingly, transcripts at the Chinese character level and the pinyin level are provided along with the recordings. We present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The multi-speaker speech synthesis system is an extension of Tacotron-2 where a speaker verification model and a corresponding loss regarding voice similarity are incorporated as the feedback constraint. We aim to use the presented corpus to build a robust synthesis model that is…
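One way to picture the feedback constraint named in the abstract is as a penalty on the mismatch between the speaker embedding of the reference audio and that of the synthesized audio. The sketch below is illustrative only. The cosine-distance form, the weight `alpha`, and all function names are assumptions made here, not the paper's formulation; the embeddings stand in for the output of a speaker verification model, which is not implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def feedback_constraint_loss(ref_embedding, synth_embedding):
    """Hypothetical voice-similarity penalty (assumed form, not from the paper).

    Returns 0 when the two embeddings point in the same direction and
    grows toward 2 as they point in opposite directions.
    """
    return 1.0 - cosine_similarity(ref_embedding, synth_embedding)


def total_loss(reconstruction_loss, ref_embedding, synth_embedding, alpha=1.0):
    """Synthesis loss plus the weighted similarity penalty.

    `alpha` is an assumed weighting hyperparameter.
    """
    penalty = feedback_constraint_loss(ref_embedding, synth_embedding)
    return reconstruction_loss + alpha * penalty
```

In this form, a synthesized utterance whose embedding matches the reference speaker adds nothing to the loss, while a mismatched voice raises it in proportion to `alpha`.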
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
