MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023

Zhihang Xu, Shaofei Zhang, Xi Wang, Jiajun Zhang, Wenning Wei, Lei He, and Sheng Zhao

arXiv:2309.02743 · eess.AS · September 13, 2023 · 1 citation

Open Access

TL;DR

MuLanTTS is a neural TTS system that leverages contextual and emotion encoders, achieving high-quality speech synthesis for French audiobook data in the Blizzard Challenge 2023.

Contribution

It introduces MuLanTTS, an end-to-end neural TTS system with enhanced prosody and expressiveness, adapted for long-form and dialogue speech synthesis.

Findings

01

Achieved mean quality scores of 4.3 and 4.5, comparable to natural speech.

02

Applied denoising algorithms and long-audio processing to both corpora to address recording quality.

03

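Demonstrated strong performance in both the hub task (single-speaker French TTS built from about 50 hours of audiobook data) and the spoke task (speaker adaptation from about 2 hours of data).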

Abstract

In this paper, we present MuLanTTS, the Microsoft end-to-end neural text-to-speech (TTS) system designed for the Blizzard Challenge 2023. About 50 hours of audiobook corpus for French TTS as the hub task and another 2 hours of speaker adaptation data as the spoke task are released to build synthesized voices for different test purposes including sentences, paragraphs, homographs, lists, etc. Building upon DelightfulTTS, we adopt contextual and emotion encoders to adapt to the audiobook data, enriching the model beyond single sentences for long-form prosody and dialogue expressiveness. Regarding the recording quality, we also apply denoising algorithms and long-audio processing to both corpora. For the hub task, only the 50-hour single-speaker data is used for building the TTS system, while for the spoke task, a multi-speaker source model is used for target-speaker fine-tuning. MuLanTTS achieves mean scores of quality…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing