Improve few-shot voice cloning using multi-modal learning
Haitong Zhang, Yue Lin

TL;DR
This paper applies multi-modal learning to few-shot voice cloning by extending Tacotron2 with an unsupervised speech representation module, and reports significant improvements over single-modal counterparts in few-shot text-to-speech (TTS) and voice conversion (VC).
Contribution
The paper presents a multi-modal system for few-shot voice cloning, built by extending Tacotron2 with an unsupervised speech representation module. The authors note that most existing few-shot voice cloning models are single-modal and that the multi-modal setting has been understudied.
Findings
Multi-modal learning significantly improves few-shot voice cloning performance.
The proposed system outperforms its single-modal counterparts.
The improvement holds in both evaluated scenarios: few-shot TTS and VC.
Abstract
Few-shot voice cloning has recently improved significantly. However, most few-shot voice cloning models are single-modal, and multi-modal few-shot voice cloning has been understudied. In this paper, we propose to use multi-modal learning to improve few-shot voice cloning performance. Inspired by recent work on unsupervised speech representation, the proposed multi-modal system is built by extending Tacotron2 with an unsupervised speech representation module. We evaluate the proposed system in two few-shot voice cloning scenarios, namely few-shot text-to-speech (TTS) and voice conversion (VC). Experimental results demonstrate that the proposed multi-modal learning significantly improves few-shot voice cloning performance over the corresponding single-modal systems.
