VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

Yifeng Yu; Jiatong Shi; Yuning Wu; Yuxun Tang; Shinji Watanabe

arXiv:2406.08761·cs.SD·December 17, 2024

TL;DR

This paper presents VISinger2+, which enhances singing voice synthesis (SVS) by integrating self-supervised learning (SSL) representations with spectral features. By drawing on pre-trained SSL models that learn from unlabeled data, it improves the naturalness and expressiveness of the synthesized voice.

Contribution

It introduces a novel method that combines self-supervised learning features with spectral information to improve SVS quality beyond existing models.

Findings

01 Improved naturalness in synthesized singing voices.

02 Enhanced performance demonstrated in objective and subjective evaluations.

03 Effective use of unlabeled data for SVS enhancement.

Abstract

Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of supervised learning methods. In response to this challenge, this paper introduces a novel approach to enhance the quality of SVS by leveraging unlabeled data from pre-trained self-supervised learning models. Building upon the existing VISinger2 framework, this study integrates additional spectral feature information into the system to enhance its performance. The integration aims to harness the rich acoustic features from the pre-trained models, thereby enriching the synthesis and yielding a more natural and expressive singing voice. Experimental results in various corpora demonstrate the efficacy of this approach in improving the overall quality of…
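
The abstract does not say how the SSL representations and the spectral features are combined. Purely as an illustration, the sketch below assumes one simple option: frame-level SSL features are linearly interpolated along time to match the number of mel-spectrogram frames, then concatenated with the mel features frame by frame. The function names `resample_frames` and `fuse_features`, and the concatenation strategy itself, are hypothetical and are not taken from the paper.

```python
from typing import List

Frames = List[List[float]]  # indexed as [time][feature_dim]


def resample_frames(frames: Frames, target_len: int) -> Frames:
    """Linearly interpolate a frame sequence along time to target_len frames."""
    if not frames or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")
    if len(frames) == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(frames[0]) for _ in range(target_len)]
    scale = (len(frames) - 1) / (target_len - 1)
    out: Frames = []
    for t in range(target_len):
        pos = t * scale
        lo = min(int(pos), len(frames) - 2)  # left neighbour index
        w = pos - lo                         # weight of the right neighbour
        out.append([(1 - w) * a + w * b
                    for a, b in zip(frames[lo], frames[lo + 1])])
    return out


def fuse_features(mel: Frames, ssl: Frames) -> Frames:
    """Assumed fusion: align SSL frames to the mel frame count, then concatenate."""
    aligned = resample_frames(ssl, len(mel))
    return [m + s for m, s in zip(mel, aligned)]


# Toy example: 3 mel frames (dim 1) and 2 SSL frames (dim 2).
mel = [[1.0], [2.0], [3.0]]
ssl = [[0.0, 10.0], [2.0, 20.0]]
fused = fuse_features(mel, ssl)
```

In this toy example each fused frame has dimension 3 (1 mel bin plus 2 SSL dimensions). A real system would operate on tensors and would likely pass the fused features through learned layers; this sketch only shows the alignment and concatenation step under the stated assumption.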

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing