Improve few-shot voice cloning using multi-modal learning

Haitong Zhang; Yue Lin

arXiv:2203.09708 · cs.SD · March 21, 2022

TL;DR

This paper introduces a multi-modal learning approach to enhance few-shot voice cloning, extending Tacotron2 with unsupervised speech representations, and demonstrates significant performance improvements in TTS and voice conversion tasks.

Contribution

The paper presents a novel multi-modal system for few-shot voice cloning that integrates unsupervised speech representations into Tacotron2, addressing the limitations of single-modal models.

Findings

01. Multi-modal learning significantly improves voice cloning performance.

02. The system outperforms single-modal baselines in TTS and voice conversion.

03. Experimental results validate the effectiveness of the proposed approach.

Abstract

Recently, few-shot voice cloning has improved significantly. However, most few-shot voice cloning models are single-modal, and multi-modal few-shot voice cloning remains understudied. In this paper, we propose using multi-modal learning to improve few-shot voice cloning performance. Inspired by recent work on unsupervised speech representation, the proposed multi-modal system is built by extending Tacotron2 with an unsupervised speech representation module. We evaluate the proposed system in two few-shot voice cloning scenarios, namely few-shot text-to-speech (TTS) and voice conversion (VC). Experimental results demonstrate that the proposed multi-modal learning significantly improves few-shot voice cloning performance over the counterpart single-modal systems.
