Variational Auto-Encoder based Mandarin Speech Cloning

Qingyu Xing; Xiaohan Ma

arXiv:2203.02967·cs.SD·March 8, 2022

TL;DR

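This paper presents an almost real-time Mandarin speech cloning system built on a variational auto-encoder based synthesizer (VAENAR-TTS). It improves synthesis efficiency over the earlier CV2TTS technique while maintaining synthesis quality, and it tailors the subjective evaluation by attaching usage scenarios to the listening tests.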

Contribution

Improves the existing CV2TTS speech cloning technique for Mandarin by creating a new dataset and replacing the Tacotron synthesizer with VAENAR-TTS, achieving almost real-time cloning while maintaining synthesis quality.

Findings

01

Achieved a mean opinion score (MOS) of 2.74 for naturalness and similarity.

02

The real-time factor (RTF) results indicate almost real-time synthesis.

03

Enhanced subjective evaluation with scenario-based testing.
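
For readers unfamiliar with the two metrics named in the findings, here is a minimal sketch of how they are conventionally computed. RTF is the time spent synthesizing divided by the duration of the audio produced (values below 1 mean faster than real time), and MOS is the arithmetic mean of listener ratings, assumed here to be on a 1 to 5 scale. The function names and sample numbers are hypothetical illustrations, not the authors' code or data.

```python
from statistics import mean


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def mean_opinion_score(ratings: list[int]) -> float:
    """Return the average of listener ratings (assumed 1-5 scale)."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("each rating must be between 1 and 5")
    return mean(ratings)


# Hypothetical example values, chosen only to illustrate the arithmetic.
rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=2.0)
mos = mean_opinion_score([3, 2, 3, 3])
print(f"RTF = {rtf:.2f}, MOS = {mos:.2f}")
```

An RTF near or below 1 is what "almost real-time" refers to in practice; the paper's actual measurement procedure may differ from this sketch.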

Abstract

Speech cloning technology is becoming more sophisticated thanks to advances in machine learning. Researchers have achieved natural-sounding English speech synthesis and good English speech cloning with several effective models. However, because of the prosodic phrasing and large character set of Mandarin, these models have not yet been fully adapted to Chinese. By creating a new dataset and replacing the Tacotron synthesizer with VAENAR-TTS, we improved the existing speech cloning technique CV2TTS to achieve almost real-time speech cloning while guaranteeing synthesis quality. In the process, we customized the subjective tests of synthesis quality by attaching various scenarios, so that subjects focus on the differences between voices and our improvements may prove more advantageous to practical applications. The results of the A/B test, real-time factor (RTF) and 2.74 mean…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques