UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis
Yi Lei, Shan Yang, Xinsheng Wang, Qicong Xie, Jixun Yao, Lei Xie, Dan Su

TL;DR
UniSyn is an end-to-end unified model that can make a target voice both speak and sing using only that person's speaking data or only their singing data. It uses a novel variational autoencoder framework with disentangled control over speaker and style.
Contribution
The paper introduces UniSyn, a unified TTS and SVS model that needs only speaking data or only singing data from a given person. It employs a multi-conditional VAE (MC-VAE) with two independent latent sub-spaces, conditioned on speaker and on style (speak or sing), for flexible voice synthesis.
Findings
Outperforms state-of-the-art voice generation methods.
Can generate natural speech and singing without paired data.
Demonstrates effectiveness across different speakers and singers.
Abstract
Text-to-speech (TTS) and singing voice synthesis (SVS) aim to generate high-quality speaking and singing voices from textual input and music scores, respectively. Unifying TTS and SVS in a single system is crucial for applications that require both. Existing methods are limited in that they rely either on both singing and speaking data from the same person or on cascaded models for multiple tasks. To address these problems, this paper proposes UniSyn, a simple and elegant framework for TTS and SVS. It is an end-to-end unified model that can make a voice speak and sing using only singing or only speaking data from that person. Specifically, a multi-conditional variational autoencoder (MC-VAE), which constructs two independent latent sub-spaces with speaker- and style-related (i.e., speak or sing) conditions for flexible control, is proposed…
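The idea of two independent, condition-dependent latent sub-spaces can be illustrated with a toy sketch. This is not the authors' implementation, and the abstract above is truncated before it describes how MC-VAE is actually built. Everything here is an assumption made for illustration: the names `spk_prior`, `sty_prior`, and `sample_latent`, the latent sizes, the speaker labels, and the choice of a diagonal Gaussian prior looked up per condition.

```python
import numpy as np

# Hypothetical condition labels and latent sizes (not from the paper).
SPEAKERS = ["spk_a", "singer_b"]
STYLES = ["speak", "sing"]
DIM_SPK, DIM_STY = 4, 4

_init = np.random.default_rng(0)

# Stand-ins for learned parameters: each condition maps to the mean and
# log-variance of a diagonal Gaussian prior over its own sub-space.
spk_prior = {s: (_init.normal(size=DIM_SPK), np.zeros(DIM_SPK)) for s in SPEAKERS}
sty_prior = {s: (_init.normal(size=DIM_STY), np.zeros(DIM_STY)) for s in STYLES}


def sample(prior, key, rng):
    """Draw one sample from the Gaussian prior selected by `key`."""
    mean, logvar = prior[key]
    return mean + np.exp(0.5 * logvar) * rng.standard_normal(mean.shape)


def sample_latent(speaker, style, rng):
    """Sample the speaker and style sub-spaces separately, then concatenate."""
    z_spk = sample(spk_prior, speaker, rng)  # depends only on the speaker label
    z_sty = sample(sty_prior, style, rng)    # depends only on the style label
    return np.concatenate([z_spk, z_sty])


# Same speaker and same noise, different style: only the style half changes.
z_speak = sample_latent("spk_a", "speak", np.random.default_rng(1))
z_sing = sample_latent("spk_a", "sing", np.random.default_rng(1))
```

Because each sub-space is sampled from its own condition, swapping the style label while holding the speaker fixed leaves the speaker part of the latent unchanged. That is the kind of flexible control the abstract refers to, shown here only in toy form.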
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
