InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself
Chang Zeng, Chunhui Wang, Xiaoxiao Miao, Jian Zhao, Zhonglin Jiang, Yong Chen

TL;DR
InstructSing is a neural vocoder that converges much faster in training than other neural vocoders while still producing high-fidelity singing voices, by combining differentiable digital signal processing with adversarial training.
Contribution
The paper introduces InstructSing, a neural vocoder that converges faster than existing models by combining differentiable digital signal processing with adversarial training.
Findings
Achieves voice quality comparable to other neural vocoders with only one-tenth of the training steps
Converges faster on GPU hardware compared to traditional neural vocoders
Generates high-fidelity singing voices at 48kHz sampling rate
Abstract
It is challenging to accelerate the training process while ensuring both high-quality generated voices and acceptable inference speed. In this paper, we propose a novel neural vocoder called InstructSing, which can converge much faster compared with other neural vocoders while maintaining good performance by integrating differentiable digital signal processing and adversarial training. It includes one generator and two discriminators. Specifically, the generator incorporates a harmonic-plus-noise (HN) module to produce 8kHz audio as an instructive signal. Subsequently, the HN module is connected with an extended WaveNet by a UNet-based module, which transforms the output of the HN module to a latent variable sequence containing essential periodic and aperiodic information. In addition to the latent sequence, the extended WaveNet also takes the mel-spectrogram as input to generate 48kHz…
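To make the harmonic-plus-noise idea in the abstract more concrete, here is a minimal sketch of a generic HN source signal at 8 kHz, written in plain Python. It is not the paper's differentiable HN module. The function name `harmonic_plus_noise`, the frame-wise F0 input, the 1/k harmonic weighting, and all gain values are assumptions chosen for illustration only.

```python
import math
import random


def harmonic_plus_noise(f0_frames, sample_rate=8000, hop=80,
                        max_harmonics=8, harmonic_gain=0.1,
                        noise_gain=0.003, seed=0):
    """Generic harmonic-plus-noise source (illustrative, not the paper's module).

    f0_frames: one fundamental frequency in Hz per frame; 0 marks an unvoiced frame.
    hop: number of output samples produced per frame (F0 is held constant within a frame).
    Returns a list of float samples of length len(f0_frames) * hop.
    """
    rng = random.Random(seed)
    nyquist = sample_rate / 2.0
    two_pi = 2.0 * math.pi
    phase = 0.0  # running phase of the fundamental, carried across frames
    samples = []
    for f0 in f0_frames:
        for _ in range(hop):
            noise = noise_gain * rng.gauss(0.0, 1.0)  # aperiodic component
            if f0 > 0.0:
                # Advance the fundamental phase, then sum harmonics below Nyquist.
                phase = (phase + two_pi * f0 / sample_rate) % two_pi
                harmonic = 0.0
                for k in range(1, max_harmonics + 1):
                    if k * f0 >= nyquist:
                        break  # skip harmonics that would alias
                    harmonic += math.sin(k * phase) / k  # assumed 1/k weighting
                samples.append(harmonic_gain * harmonic + noise)
            else:
                # Unvoiced frame: noise only.
                samples.append(noise)
    return samples
```

For example, `harmonic_plus_noise([220.0] * 100)` yields one second of an 8 kHz voiced signal. In a vocoder, a signal of this kind would be further processed by learned modules rather than used directly as output.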
Taxonomy
Topics: Speech and dialogue systems · Music and Audio Processing
Methods: Mixture of Logistic Distributions · Dilated Causal Convolution · WaveNet
