Transferring neural speech waveform synthesizers to musical instrument sounds generation

Yi Zhao; Xin Wang; Lauri Juvela; Junichi Yamagishi

arXiv:1910.12381·eess.AS·November 20, 2019·1 cites

TL;DR

This paper investigates how neural speech waveform synthesizers can be adapted to generate musical instrument sounds. It shows that pre-training on speech data improves synthesis quality and that the best model depends on the adaptation scenario: WaveGlow in zero-shot transfer, NSF after fine-tuning.

Contribution

The paper compares three neural synthesizers (WaveNet, WaveGlow, and NSF) on musical instrument sound generation and demonstrates the benefit of pre-training on speech data and then fine-tuning on music data.

Findings

01. Pre-training on speech data improves music synthesis quality.

02. WaveGlow excels in zero-shot learning scenarios.

03. NSF performs best with fine-tuning and produces natural-sounding audio.

Abstract

Recent neural waveform synthesizers such as WaveNet, WaveGlow, and the neural-source-filter (NSF) model have shown good performance in speech synthesis despite their different methods of waveform generation. The similarity between speech and music audio synthesis techniques suggests interesting avenues to explore in terms of the best way to apply speech synthesizers in the music domain. This work compares three neural synthesizers used for musical instrument sounds generation under three scenarios: training from scratch on music data, zero-shot learning from the speech domain, and fine-tuning-based adaptation from the speech to the music domain. The results of a large-scale perceptual test demonstrated that the performance of three synthesizers improved when they were pre-trained on speech data and fine-tuned on music data, which indicates the usefulness of knowledge from speech data…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Test · Mixture of Logistic Distributions · Affine Coupling · Normalizing Flows · Invertible 1x1 Convolution · WaveGlow · Dilated Causal Convolution · WaveNet