Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer

Shrutina Agarwal; Sriram Ganapathy; Naoya Takahashi

arXiv:2208.12410·cs.SD·August 29, 2022

Open Access

TL;DR

This paper introduces SymNet, a neural network architecture that effectively converts speech into singing voice by modeling alignment and style transfer, outperforming previous methods in quality and naturalness.

Contribution

The paper presents a novel symmetrical neural network architecture, SymNet, for speech-to-singing voice transfer, incorporating data augmentation and training techniques to enhance performance.

Findings

01. Significant improvement in objective reconstruction quality.

02. Subjective listening tests show higher perceived audio quality.

03. Model outperforms previous methods and baselines.

Abstract

In this paper, we propose a model for style transfer from speech to singing voice. In contrast to previous signal processing-based methods, which require high-quality singing templates or phoneme synchronization, we explore a data-driven approach to converting natural speech to singing voice. We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody while preserving speaker identity and naturalness. The proposed SymNet model comprises a symmetrical stack of three types of layers: convolutional, transformer, and self-attention layers. The paper also explores novel data augmentation and generative loss annealing methods to facilitate model training. Experiments are performed on the NUS and NHSS datasets, which consist of parallel speech and singing voice data. In these…
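The abstract mentions generative loss annealing but this excerpt gives no details of the schedule. As a rough illustration only, and not the authors' method, the sketch below shows one common way to anneal a loss weight: a linear ramp applied to a generative loss term that is added to a reconstruction loss. The function names, the linear shape, the ramp direction (0 to 1), and the two-term loss are all assumptions made for this example.

```python
def annealed_weight(step, total_steps, w_start=0.0, w_end=1.0):
    """Linearly interpolate a loss weight from w_start to w_end.

    Hypothetical schedule: the paper's actual annealing rule is not
    given in this excerpt, so the linear ramp is an assumption.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to [0, 1] so the weight stays in range.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return w_start + frac * (w_end - w_start)


def combined_loss(recon_loss, gen_loss, step, total_steps):
    """Reconstruction loss plus an annealed generative loss term (assumed form)."""
    return recon_loss + annealed_weight(step, total_steps) * gen_loss
```

With this schedule the generative term contributes nothing at step 0 and its full value once `step` reaches `total_steps`; swapping `w_start` and `w_end` would anneal in the opposite direction.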

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
