The THU-HCSI Multi-Speaker Multi-Lingual Few-Shot Voice Cloning System for LIMMITS'24 Challenge

Yixuan Zhou; Shuoyi Zhou; Shun Lei; Zhiyong Wu; Menglin Wu

arXiv:2404.16619·cs.SD·April 26, 2024



TL;DR

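This paper introduces a multi-lingual, multi-speaker few-shot voice cloning system built on YourTTS. A speaker-aware text encoder, a flow-based decoder with Transformer blocks, and a fine-tuning data strategy improve speaker similarity and naturalness, yielding the best speaker similarity MOS (4.25) and a naturalness MOS of 3.97 in track 1 of the LIMMITS'24 Challenge.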

Contribution

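We develop a multi-speaker multi-lingual few-shot voice cloning system based on YourTTS, with a speaker-aware text encoder, a flow-based decoder with Transformer blocks, and a fine-tuning recipe that denoises the few-shot data, mixes it with pre-training data, and samples speakers in a balanced way.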

Findings

01

Achieved a speaker similarity MOS of 4.25.

02

Obtained a naturalness MOS of 3.97.

03

Ranked first in speaker similarity among track 1 systems in the official LIMMITS'24 Challenge evaluation.

Abstract

This paper presents the multi-speaker, multi-lingual few-shot voice cloning system developed by the THU-HCSI team for the LIMMITS'24 Challenge. To achieve high speaker similarity and naturalness in both mono-lingual and cross-lingual scenarios, we build the system upon YourTTS and add several enhancements. To further improve speaker similarity and speech quality, we introduce a speaker-aware text encoder and a flow-based decoder with Transformer blocks. In addition, we denoise the few-shot data, mix it with the pre-training data, and adopt a speaker-balanced sampling strategy to ensure effective fine-tuning for the target speakers. The official track 1 evaluations show that our system achieves the best speaker similarity MOS of 4.25 and a considerable naturalness MOS of 3.97.
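
The abstract does not spell out how the speaker-balanced sampling works, so the following is only a minimal sketch of one common way to do it, not the authors' implementation. It assumes each utterance is weighted by the inverse of its speaker's utterance count, so that every speaker (including a few-shot target speaker mixed into the pre-training data) is drawn with equal overall probability. The function names and the toy speaker labels are hypothetical.

```python
import random
from collections import Counter


def speaker_balanced_weights(speaker_ids):
    """Return one sampling weight per utterance.

    Assumption: each speaker should receive the same total probability
    mass, regardless of how many utterances that speaker has.
    """
    counts = Counter(speaker_ids)
    num_speakers = len(counts)
    # Each speaker gets mass 1/num_speakers, split evenly over its utterances.
    return [1.0 / (num_speakers * counts[s]) for s in speaker_ids]


def sample_batch(utterances, speaker_ids, batch_size, rng=None):
    """Draw a batch (with replacement) using speaker-balanced weights."""
    rng = rng or random.Random()
    weights = speaker_balanced_weights(speaker_ids)
    return rng.choices(utterances, weights=weights, k=batch_size)


# Hypothetical toy mix: two data-rich pre-training speakers, one few-shot target.
speaker_ids = ["pretrain_A"] * 6 + ["pretrain_B"] * 3 + ["target_1"] * 1
utterances = [f"utt_{i}" for i in range(len(speaker_ids))]

weights = speaker_balanced_weights(speaker_ids)
batch = sample_batch(utterances, speaker_ids, batch_size=4, rng=random.Random(0))
```

Under uniform sampling, the single target-speaker utterance in this toy mix would be drawn 10% of the time; with the weights above, each of the three speakers is drawn about one third of the time. In a PyTorch training loop the same weights could be passed to a weighted sampler instead of `random.choices`.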


Taxonomy

Topics: Speech Recognition and Synthesis · Advanced Data Compression Techniques