Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder

Reo Yoneyama, Yi-Chiao Wu, and Tomoki Toda

arXiv:2210.15533 · cs.SD · February 28, 2023

TL;DR

This paper introduces Source-Filter HiFi-GAN, a neural vocoder that combines source-filter theory with HiFi-GAN to achieve fast, high-fidelity, and pitch-controllable voice synthesis suitable for real-time applications.

Contribution

The paper integrates source-filter conditioning into HiFi-GAN, enabling pitch control while retaining high quality and speed, and outperforms HiFi-GAN and uSFGAN in singing voice generation.

Findings

01. Outperforms HiFi-GAN and uSFGAN in voice quality and speed

02. Achieves real-time, pitch-controllable, high-fidelity synthesis

03. Compatible with end-to-end and real-time systems

Abstract

Our previous work, the unified source-filter GAN (uSFGAN) vocoder, introduced a novel architecture based on the source-filter theory into the parallel waveform generative adversarial network to achieve high voice quality and pitch controllability. However, the high temporal resolution inputs result in high computation costs. Although the HiFi-GAN vocoder achieves fast high-fidelity voice generation thanks to the efficient upsampling-based generator architecture, the pitch controllability is severely limited. To realize a fast and pitch-controllable high-fidelity neural vocoder, we introduce the source-filter theory into HiFi-GAN by hierarchically conditioning the resonance filtering network on well-estimated source excitation information. According to the experimental results, our proposed method outperforms HiFi-GAN and uSFGAN on singing voice generation in voice quality and…
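
In source-filter terms, a voice is modeled as an excitation signal (the source) shaped by a resonance filter. The sketch below illustrates only the source side: it turns a frame-level F0 contour into a sample-level sine excitation, and shows how scaling F0 changes the pitch of that excitation. It is not the authors' implementation. The function names, the 24 kHz sample rate, the hop size, and the amplitude are all assumptions chosen for illustration, and the neural resonance filtering network described in the abstract is not modeled here.

```python
import math


def sine_excitation(f0_frames, sample_rate=24000, hop_size=120, amplitude=0.1):
    """Build a sample-level sine signal from a frame-level F0 contour.

    f0_frames: F0 values in Hz, one per frame; 0 marks an unvoiced frame.
    All default values are illustrative assumptions, not taken from the paper.
    """
    samples = []
    phase = 0.0
    for f0 in f0_frames:
        for _ in range(hop_size):
            if f0 > 0:
                # Accumulate phase so the sine stays continuous across frames.
                phase = (phase + 2.0 * math.pi * f0 / sample_rate) % (2.0 * math.pi)
                samples.append(amplitude * math.sin(phase))
            else:
                # Unvoiced frames get silence in this simplified sketch.
                samples.append(0.0)
    return samples


def scale_f0(f0_frames, factor):
    """Pitch control in this sketch: multiply every F0 value by a factor."""
    return [f0 * factor for f0 in f0_frames]


if __name__ == "__main__":
    f0 = [100.0] * 200           # one second of a steady 100 Hz contour
    original = sine_excitation(f0)
    shifted = sine_excitation(scale_f0(f0, 2.0))  # one octave up
    print(len(original), len(shifted))
```

Because pitch enters only through the F0 contour, changing pitch amounts to editing that contour before the excitation is generated, which is the general idea behind conditioning a vocoder on an explicit source signal.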

Taxonomy

Topics: Music and Audio Processing · Model Reduction and Neural Networks · Speech and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · HiFi-GAN