HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

Yinghao Aaron Li; Cong Han; Xilin Jiang; Nima Mesgarani

arXiv:2309.09493 · eess.AS · September 19, 2023 · 1 citation

TL;DR

HiFTNet is a novel neural vocoder that combines inverse STFT with a harmonic-plus-noise filter, achieving high-quality, fast, and parameter-efficient speech synthesis suitable for real-time applications.

Contribution

The paper introduces HiFTNet, which integrates a harmonic-plus-noise source filter with iSTFT to improve speed and quality over prior GAN-based vocoders.
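
As a rough illustration of the iSTFT step that iSTFTNet-style vocoders rely on, the sketch below rebuilds a waveform from a magnitude and a phase spectrogram by overlap-add. It is not the authors' code: the helper names `stft` and `istft`, the FFT size, the hop size, and the Hann window are assumptions chosen for the example, and in a real vocoder the magnitude and phase would come from the network rather than from an analysis pass.

```python
import numpy as np

def stft(x, n_fft=16, hop=4):
    """Toy STFT (hypothetical helper): windowed frames followed by a real FFT."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_fft // 2 + 1)

def istft(magnitude, phase, n_fft=16, hop=4):
    """Toy iSTFT (hypothetical helper): inverse FFT per frame, then overlap-add."""
    window = np.hanning(n_fft + 1)[:-1]
    spec = magnitude * np.exp(1j * phase)  # recombine magnitude and phase
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i]
        norm[i * hop : i * hop + n_fft] += window ** 2
    # Divide by the summed squared window to undo the analysis/synthesis windows.
    return out / np.maximum(norm, 1e-8)

# Round trip on a short sine wave: analyse, then resynthesise.
t = np.arange(256)
x = np.sin(2 * np.pi * 0.05 * t)
spec = stft(x)
y = istft(np.abs(spec), np.angle(spec))
```

Away from the signal edges, where the window sum is well conditioned, `y` should agree with `x` up to floating-point error. Because the transform itself has no learned parameters, the network only needs to predict the spectrogram at a reduced time resolution.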

Findings

01. Outperforms iSTFTNet and HiFi-GAN in subjective quality on LJSpeech.

02. Achieves real-time inference with fewer parameters.

03. Matches or exceeds BigVGAN performance on benchmark datasets.

Abstract

Recent advancements in speech synthesis have leveraged GAN-based networks like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In this paper, we introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain that uses a sinusoidal source from the fundamental frequency (F0) inferred via a pre-trained F0 estimation network for fast inference speed. Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance. HiFTNet also outperforms BigVGAN-base on LibriTTS for…
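
To make the idea of a sinusoidal source driven by F0 concrete, here is a minimal sketch of a harmonic-plus-noise excitation signal. It is an illustration under stated assumptions, not the HiFTNet implementation: the function name `harmonic_plus_noise_source`, the number of harmonics, the amplitudes, the hop length, and the sample rate are all hypothetical, and the F0 track is passed in directly instead of being inferred by a pre-trained estimation network.

```python
import numpy as np

def harmonic_plus_noise_source(f0_frames, sample_rate=22050, hop_length=256,
                               num_harmonics=8, sine_amp=0.1, noise_std=0.003,
                               seed=0):
    """Build a sample-level excitation from a frame-level F0 track (Hz, 0 = unvoiced).

    All default values are assumptions for illustration only.
    """
    rng = np.random.default_rng(seed)
    # Upsample frame-level F0 to one value per audio sample.
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop_length)
    voiced = f0 > 0
    # Instantaneous phase of the fundamental: 2*pi times the running sum of f0 / sr.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    harmonics = np.arange(1, num_harmonics + 1)[:, None]  # shape: (H, 1)
    # Drop any harmonic that would exceed the Nyquist frequency.
    below_nyquist = (harmonics * f0[None, :]) < (sample_rate / 2)
    sines = sine_amp * np.sin(harmonics * phase[None, :]) * below_nyquist
    # Harmonic part is active only in voiced regions.
    harmonic = sines.sum(axis=0) * voiced
    # Noise part is present everywhere, so unvoiced regions are noise only.
    noise = rng.normal(0.0, noise_std, size=f0.shape)
    return harmonic + noise, harmonic, noise

# Two unvoiced frames followed by two voiced frames at 220 Hz.
source, harmonic, noise = harmonic_plus_noise_source([0.0, 0.0, 220.0, 220.0])
```

In this sketch the harmonic component is silent in unvoiced frames and the noise component runs throughout. According to the abstract, HiFTNet applies its source filter in the time-frequency domain, so a signal like `source` would be transformed to a spectrogram before the network uses it.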

Code & Models

Repositories

yl4579/HiFTNet (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: HiFi-GAN · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings