SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge and Periodic Inductive Bias

Sipan Li; Songxiang Liu; Luwen Zhang; Xiang Li; Yanyao Bian; Chao Weng; Zhiyong Wu; Helen Meng

arXiv:2309.07803·eess.AS·September 15, 2023

TL;DR

SnakeGAN is a novel GAN-based universal vocoder that leverages DDSP prior knowledge and periodic inductive biases to generate high-fidelity audio across diverse out-of-domain scenarios, including unseen speakers, styles, and musical pieces.

Contribution

It introduces a new GAN architecture with Snake activations and anti-aliased representations, enhancing generalization to out-of-domain audio synthesis tasks.

Findings

01 Outperforms existing methods in subjective and objective metrics

02 Successfully synthesizes high-fidelity audio for unseen speakers and styles

03 Effective in non-speech vocalization and musical audio synthesis

Abstract

Generative adversarial network (GAN)-based neural vocoders have been widely used in audio synthesis tasks due to their high generation quality, efficient inference, and small computation footprint. However, it is still challenging to train a universal vocoder which can generalize well to out-of-domain (OOD) scenarios, such as unseen speaking styles, non-speech vocalization, singing, and musical pieces. In this work, we propose SnakeGAN, a GAN-based universal vocoder, which can synthesize high-fidelity audio in various OOD scenarios. SnakeGAN takes a coarse-grained signal generated by a differentiable digital signal processing (DDSP) model as prior knowledge, aiming at recovering high-fidelity waveform from a Mel-spectrogram. We introduce periodic nonlinearities through the Snake activation function and anti-aliased representation into the generator, which further brings the desired…
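
For readers unfamiliar with the Snake activation named in the abstract, here is a minimal sketch of its commonly published form, snake(x) = x + (1/α)·sin²(αx). The function names `snake` and `snake_layer` are hypothetical, and treating α as a single fixed scalar is an assumption made for illustration; this excerpt does not say how SnakeGAN parameterizes α or where the activation sits in the generator.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    The linear term keeps the function monotonic, while the sin^2 term
    adds a periodic component with period pi / alpha.
    alpha is a fixed scalar here (an assumption for this sketch).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + (math.sin(alpha * x) ** 2) / alpha


def snake_layer(xs, alpha: float = 1.0):
    """Apply the activation elementwise to a sequence of floats."""
    return [snake(x, alpha) for x in xs]


if __name__ == "__main__":
    samples = [-1.0, 0.0, 0.5, 1.0]
    print(snake_layer(samples, alpha=2.0))
```

Shifting the input by π/α shifts the output by exactly π/α, which is the periodic structure that this family of activations is meant to contribute.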

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies