Anyone GAN Sing
Shreeviknesh Sankaran, Sukavanan Nanjundan, G. Paavai Anand

TL;DR
This paper introduces a ConvLSTM-based GAN model optimized with Wasserstein loss to synthesize singing voices, trained on a dataset of non-professional singers, and evaluated through objective and subjective metrics.
Contribution
It presents a novel ConvLSTM-GAN architecture for singing voice synthesis, inspired by WGANSing, with specific training and inference procedures for improved performance.
Findings
Achieved measurable Mel-Cepstral Distance improvements.
Subjective listening tests indicate high perceived quality.
Model successfully synthesizes singing voices from linguistic and frequency features.
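The Mel-Cepstral Distance named above can be made concrete with a short sketch. The listing does not give the paper's exact formulation, so the code below assumes the commonly used definition: per frame, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over time-aligned frames, with the 0th (energy) coefficient skipped. The function name `mel_cepstral_distance` and the `skip_c0` parameter are hypothetical names chosen for illustration.

```python
import numpy as np


def mel_cepstral_distance(ref, syn, skip_c0=True):
    """Mean Mel-Cepstral Distance in dB between two time-aligned sequences.

    ref, syn: arrays of shape (n_frames, n_coeffs) holding mel-cepstral
    coefficients. Assumption: the frames are already aligned one-to-one.
    """
    ref = np.asarray(ref, dtype=float)
    syn = np.asarray(syn, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("ref and syn must have the same shape")
    if skip_c0:
        # Assumption: drop the 0th coefficient, which mostly encodes energy.
        ref, syn = ref[:, 1:], syn[:, 1:]
    # Squared Euclidean distance between coefficient vectors, per frame.
    diff_sq = np.sum((ref - syn) ** 2, axis=1)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * diff_sq)
    return float(np.mean(per_frame))
```

Lower values mean the synthesized coefficients are closer to the reference; identical inputs give a distance of 0.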
Abstract
The problem of audio synthesis has been increasingly solved using deep neural networks. With the introduction of Generative Adversarial Networks (GAN), another efficient and effective path has opened up to solve this problem. In this paper, we present a method to synthesize the singing voice of a person using a Convolutional Long Short-Term Memory (ConvLSTM) based GAN optimized using the Wasserstein loss function. Our work is inspired by WGANSing by Chandna et al. Our model takes consecutive frame-wise linguistic and frequency features, along with singer identity, as input and outputs vocoder features. We train the model on a dataset of 48 English songs sung and spoken by 12 non-professional singers. For inference, sequential blocks are concatenated using an overlap-add procedure. We test the model using the Mel-Cepstral Distance metric and a subjective listening test with 18 participants.
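The overlap-add step used at inference can be illustrated with a minimal sketch. The abstract does not specify the window or hop size, so this example assumes a triangular cross-fade window and normalizes by the summed weights, which makes it a weighted overlap-add rather than necessarily the authors' exact procedure. The function name `overlap_add` and the `hop` parameter are hypothetical.

```python
import numpy as np


def overlap_add(blocks, hop):
    """Stitch overlapping feature blocks into one continuous sequence.

    blocks: array of shape (n_blocks, block_len, n_feats), e.g. vocoder
            features predicted for consecutive, overlapping time windows.
    hop:    number of frames between the starts of successive blocks.
    """
    blocks = np.asarray(blocks, dtype=float)
    n_blocks, block_len, n_feats = blocks.shape
    if not 0 < hop <= block_len:
        raise ValueError("hop must satisfy 0 < hop <= block_len")
    total = hop * (n_blocks - 1) + block_len
    # Assumption: triangular weights; trimming the zero endpoints keeps
    # every weight strictly positive, so the normalizer is never zero.
    window = np.bartlett(block_len + 2)[1:-1]
    out = np.zeros((total, n_feats))
    norm = np.zeros(total)
    for i, block in enumerate(blocks):
        start = i * hop
        out[start:start + block_len] += window[:, None] * block
        norm[start:start + block_len] += window
    # Divide by the accumulated weights so overlapping regions average out.
    return out / norm[:, None]
```

For example, three blocks of 4 frames with `hop=2` yield an 8-frame output, and frames covered by two blocks are a weighted average of both predictions.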
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music and Audio Processing · Speech Recognition and Synthesis
