InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself

Chang Zeng; Chunhui Wang; Xiaoxiao Miao; Jian Zhao; Zhonglin Jiang; Yong Chen

arXiv:2409.06330·eess.AS·September 11, 2024

Open Access

TL;DR

InstructSing is a neural vocoder that converges much faster in training than other neural vocoders while maintaining high-fidelity singing voice synthesis, by integrating differentiable digital signal processing with adversarial training.

Contribution

The paper introduces InstructSing, a neural vocoder that converges faster than existing models by combining differentiable digital signal processing with adversarial training.

Findings

01. Achieves comparable voice quality with only one-tenth of the training steps

02. Converges faster on GPU hardware compared to traditional neural vocoders

03. Generates high-fidelity singing voices at a 48kHz sampling rate

Abstract

It is challenging to accelerate the training process while ensuring both high-quality generated voices and acceptable inference speed. In this paper, we propose a novel neural vocoder called InstructSing, which converges much faster than other neural vocoders while maintaining good performance by integrating differentiable digital signal processing and adversarial training. It includes one generator and two discriminators. Specifically, the generator incorporates a harmonic-plus-noise (HN) module to produce 8kHz audio as an instructive signal. Subsequently, the HN module is connected with an extended WaveNet by a UNet-based module, which transforms the output of the HN module into a latent variable sequence containing essential periodic and aperiodic information. In addition to the latent sequence, the extended WaveNet also takes the mel-spectrogram as input to generate 48kHz…
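To make the harmonic-plus-noise idea in the abstract concrete, the following is a minimal NumPy sketch of how an 8kHz signal can be built from a sum of harmonics of a fundamental frequency plus Gaussian noise. It is not the paper's implementation: the function name `harmonic_plus_noise`, the fixed 1/k harmonic amplitudes, the constant noise gain, and the per-sample F0 input are all assumptions chosen for illustration, whereas the HN module described above is differentiable and trained as part of the generator.

```python
import numpy as np

def harmonic_plus_noise(f0_hz, n_harmonics=8, sample_rate=8000,
                        noise_gain=0.05, seed=0):
    """Illustrative harmonic-plus-noise synthesis (hypothetical helper).

    f0_hz: per-sample fundamental frequency in Hz; 0 marks unvoiced samples.
    Returns a 1-D float array with the same length as f0_hz.
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    voiced = f0_hz > 0

    # Phase of the fundamental: running sum of the instantaneous frequency.
    phase = 2.0 * np.pi * np.cumsum(f0_hz) / sample_rate

    harmonic = np.zeros_like(f0_hz)
    for k in range(1, n_harmonics + 1):
        # Skip harmonics at or above the Nyquist frequency to avoid aliasing.
        below_nyquist = (k * f0_hz) < (sample_rate / 2)
        # Assumed 1/k amplitude decay; a learned model would predict these.
        harmonic += np.sin(k * phase) * below_nyquist / k

    # The periodic part is only present where the signal is voiced.
    harmonic *= voiced

    # Aperiodic part: white Gaussian noise with an assumed constant gain.
    rng = np.random.default_rng(seed)
    noise = noise_gain * rng.standard_normal(f0_hz.shape)
    return harmonic + noise

# One second of a steady 220 Hz tone at 8kHz.
signal = harmonic_plus_noise(np.full(8000, 220.0))
```

In this sketch the periodic and aperiodic components are simply added together; per the abstract, the actual model passes the HN output through a UNet-based module before it reaches the extended WaveNet.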

Taxonomy

Topics: Speech and dialogue systems · Music and Audio Processing

Methods: Mixture of Logistic Distributions · Dilated Causal Convolution · WaveNet