iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN

Takuhiro Kaneko; Hirokazu Kameoka; Kou Tanaka; Shogo Seki

arXiv:2308.07117·cs.SD·August 15, 2023

TL;DR

iSTFTNet2 is an iSTFT-based neural vocoder built on a 1D-2D CNN. By modeling high-dimensional spectrograms with a 2D CNN, it runs faster and is more lightweight than iSTFTNet while maintaining comparable speech quality.

Contribution

The paper proposes a 1D-2D CNN structure for iSTFT-based neural vocoders, in which a 1D CNN models temporal structure and a 2D CNN models spectrogram structure, improving speed and model size without sacrificing speech quality.

Findings

01. iSTFTNet2 is faster than iSTFTNet, the model it builds on.

02. It is more lightweight than iSTFTNet, with comparable speech quality.

03. The 1D-2D CNN architecture models high-dimensional spectrograms without compromising speed.

Abstract

The inverse short-time Fourier transform network (iSTFTNet) has garnered attention owing to its fast, lightweight, and high-fidelity speech synthesis. It obtains these characteristics using a fast and lightweight 1D CNN as the backbone and replacing some neural processes with iSTFT. Owing to the difficulty of a 1D CNN to model high-dimensional spectrograms, the frequency dimension is reduced via temporal upsampling. However, this strategy compromises the potential to enhance the speed. Therefore, we propose iSTFTNet2, an improved variant of iSTFTNet with a 1D-2D CNN that employs 1D and 2D CNNs to model temporal and spectrogram structures, respectively. We designed a 2D CNN that performs frequency upsampling after conversion in a few-frequency space. This design facilitates the modeling of high-dimensional spectrograms without compromising the speed. The results demonstrated that…
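The trade-off the abstract describes, where temporal upsampling shrinks the frequency dimension at the cost of more frames to process, can be illustrated with simple shape arithmetic. The sketch below is not taken from the paper: the function name `istft_head_shape`, the 256-sample total hop, the 100 input frames, and the FFT-size-to-hop ratio of 4 are all assumptions chosen for illustration. The only fixed fact it relies on is that a one-sided spectrum of FFT size N has N/2 + 1 frequency bins.

```python
# Illustrative shape arithmetic only. All numeric settings are assumptions,
# not values reported in the paper.

def istft_head_shape(n_frames, total_hop, temporal_upsample, fft_per_hop=4):
    """Return (frames, freq_bins) the network must produce before the iSTFT.

    n_frames:          number of input frames (e.g. mel-spectrogram frames)
    total_hop:         waveform samples per input frame (assumed, e.g. 256)
    temporal_upsample: factor by which the network upsamples along time
    fft_per_hop:       assumed ratio of FFT size to iSTFT hop length
    """
    if total_hop % temporal_upsample != 0:
        raise ValueError("temporal_upsample must divide total_hop")
    # Whatever upsampling the network does not perform is left to the iSTFT.
    istft_hop = total_hop // temporal_upsample
    n_fft = fft_per_hop * istft_hop
    freq_bins = n_fft // 2 + 1          # one-sided spectrum
    frames = n_frames * temporal_upsample
    return frames, freq_bins


if __name__ == "__main__":
    for s in (1, 8, 64):
        frames, freq_bins = istft_head_shape(
            n_frames=100, total_hop=256, temporal_upsample=s
        )
        print(f"temporal upsampling x{s}: {frames} frames, {freq_bins} frequency bins")
```

Under these assumed settings, upsampling by 8 in time cuts the output from 513 to 65 frequency bins, which is easier for a 1D CNN to model, but the network then has to process 8 times as many frames. Skipping temporal upsampling keeps the frame count low but leaves a high-dimensional spectrogram to predict, which is the case the abstract assigns to the 2D CNN with frequency upsampling.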

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: 1-Dimensional Convolutional Neural Networks