AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

Yao Shi; Hui Bu; Xin Xu; Shaoji Zhang; Ming Li

arXiv:2010.11567 · cs.SD · April 23, 2021 · 83 cites

TL;DR

This paper introduces AISHELL-3, a large-scale multi-speaker Mandarin speech corpus with detailed speaker attributes, and presents a baseline multi-speaker TTS system capable of zero-shot voice cloning, demonstrating high voice similarity.

Contribution

The paper contributes a new high-fidelity Mandarin speech dataset annotated with speaker attributes (gender, age group, native accent) and a multi-speaker TTS baseline system capable of zero-shot voice cloning.

Findings

01 High voice similarity in synthesis results

02 Effective zero-shot voice cloning demonstrated

03 Dataset and baseline system publicly available

Abstract

In this paper, we present AISHELL-3, a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers. Auxiliary speaker attributes such as gender, age group, and native accent are explicitly marked and provided in the corpus. Transcripts at both the Chinese character level and the pinyin level are provided along with the recordings. We present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The system is an extension of Tacotron-2 in which a speaker verification model and a corresponding voice-similarity loss are incorporated as a feedback constraint. We aim to use the presented corpus to build a robust synthesis model that is…
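
The feedback constraint mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact form of the voice-similarity loss, so the cosine-distance formulation below is an assumption, and the names `cosine_similarity`, `voice_similarity_loss`, `total_loss`, and `weight` are hypothetical. The embeddings are assumed to come from a speaker verification model applied to the reference audio and to the synthesized audio.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def voice_similarity_loss(ref_embedding, syn_embedding):
    """Assumed loss form: cosine distance between speaker embeddings.

    0.0 when the two embeddings point in the same direction,
    1.0 when they are orthogonal.
    """
    return 1.0 - cosine_similarity(ref_embedding, syn_embedding)


def total_loss(reconstruction_loss, ref_embedding, syn_embedding, weight=1.0):
    """Combine a synthesis loss with the similarity term.

    `weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    return reconstruction_loss + weight * voice_similarity_loss(
        ref_embedding, syn_embedding
    )
```

In a real training setup the similarity term would be computed on differentiable tensors so that its gradient flows back into the synthesizer; the plain-Python version here only shows the arithmetic.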

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling