Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder
Reo Yoneyama, Yi-Chiao Wu, and Tomoki Toda

TL;DR
This paper introduces Source-Filter HiFi-GAN, a neural vocoder that combines source-filter theory with HiFi-GAN to achieve fast, high-fidelity, and pitch-controllable voice synthesis suitable for real-time applications.
Contribution
It integrates source-filter conditioning into HiFi-GAN, enabling pitch control while maintaining high quality and speed, outperforming previous models in singing voice generation.
Findings
Outperforms HiFi-GAN and uSFGAN in voice quality and speed on singing voice generation
Achieves real-time, pitch-controllable, high-fidelity synthesis
Compatible with end-to-end and real-time systems
Abstract
Our previous work, the unified source-filter GAN (uSFGAN) vocoder, introduced a novel architecture based on the source-filter theory into the parallel waveform generative adversarial network to achieve high voice quality and pitch controllability. However, the high temporal resolution inputs result in high computation costs. Although the HiFi-GAN vocoder achieves fast high-fidelity voice generation thanks to the efficient upsampling-based generator architecture, the pitch controllability is severely limited. To realize a fast and pitch-controllable high-fidelity neural vocoder, we introduce the source-filter theory into HiFi-GAN by hierarchically conditioning the resonance filtering network on well-estimated source excitation information. According to the experimental results, our proposed method outperforms HiFi-GAN and uSFGAN on singing voice generation in voice quality and…
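To make the source-filter idea in the abstract concrete, here is a minimal sketch of the "source" half: a sine excitation signal built from a frame-level F0 contour, which a filtering network would then shape into speech or singing. This is a common construction in source-filter vocoders and is not taken from the paper; the function name `sine_excitation`, the 24 kHz sample rate, and the 120-sample hop size are all assumptions chosen for illustration.

```python
import numpy as np


def sine_excitation(f0_frames, sample_rate=24000, hop_size=120):
    """Build a sine excitation from frame-level F0 in Hz (0 marks unvoiced frames).

    Hypothetical helper for illustration only; not the paper's implementation.
    """
    # Hold each frame's F0 for hop_size samples (simple nearest-neighbour upsampling).
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop_size)
    # Integrate instantaneous frequency to obtain phase in radians.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    # Emit a sine in voiced regions and silence in unvoiced regions.
    return np.where(f0 > 0, np.sin(phase), 0.0)


# Pitch control amounts to editing F0 before the excitation is generated.
f0 = np.full(50, 220.0)               # 50 frames of a steady 220 Hz tone
source = sine_excitation(f0)          # excitation at the original pitch
shifted = sine_excitation(f0 * 2.0)   # same contour, one octave higher
```

Because pitch enters only through the F0 contour, scaling `f0` changes the excitation's fundamental frequency without touching the other conditioning features, which is the kind of controllability the abstract attributes to source-filter conditioning.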
Taxonomy
Topics: Music and Audio Processing · Model Reduction and Neural Networks · Speech and Audio Processing
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · HiFi-GAN
