JenGAN: Stacked Shifted Filters in GAN-Based Speech Synthesis

Hyunjae Cho; Junhyeok Lee; Wonbin Jung

arXiv:2406.06111 · eess.AS · June 11, 2024

TL;DR

JenGAN is a training strategy for GAN-based vocoders that stacks shifted low-pass filters to enforce shift-equivariance. It reduces aliasing and tonal artifacts and improves perceptual quality while leaving the model structure used at inference unchanged.

Contribution

The paper proposes JenGAN, a new method that stacks shifted low-pass filters to enforce shift-equivariance, reducing artifacts in GAN-based vocoders.

Findings

01. JenGAN consistently improves the performance of the vocoder models it is applied to.

02. It reduces audible artifacts, including tonal artifacts, in generated speech.

03. It achieves significantly better scores on the majority of evaluation metrics.

Abstract

Non-autoregressive GAN-based neural vocoders are widely used due to their fast inference speed and high perceptual quality. However, they often suffer from audible artifacts such as tonal artifacts in their generated results. Therefore, we propose JenGAN, a new training strategy that involves stacking shifted low-pass filters to ensure the shift-equivariant property. This method helps prevent aliasing and reduce artifacts while preserving the model structure used during inference. In our experimental evaluation, JenGAN consistently enhances the performance of vocoder models, yielding significantly superior scores across the majority of evaluation metrics.
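
To make "stacking shifted low-pass filters" more concrete, the following is a minimal NumPy sketch of a bank of windowed-sinc low-pass filters, each delayed by a different sub-sample shift. It is an illustration under stated assumptions, not the authors' implementation: the function names `shifted_lowpass` and `stacked_shifted_filters`, the Hann window, the cutoff of 0.25 cycles per sample, and the choice of four evenly spaced shifts are all hypothetical, and the abstract does not say where in the training pipeline the filters are applied.

```python
import numpy as np


def shifted_lowpass(shift, cutoff=0.25, half_width=16):
    """Windowed-sinc low-pass filter delayed by `shift` samples.

    `cutoff` is in cycles per sample (0.5 is Nyquist). All parameter
    values here are illustrative assumptions, not taken from the paper.
    """
    n = np.arange(-half_width, half_width + 1)
    t = n - shift  # tap positions relative to the shifted filter centre
    # Ideal low-pass impulse response; np.sinc(x) is sin(pi*x)/(pi*x).
    ideal = 2.0 * cutoff * np.sinc(2.0 * cutoff * t)
    # Hann window evaluated at the shifted positions, zero outside its support.
    support = half_width + 1
    window = np.where(
        np.abs(t) < support,
        0.5 * (1.0 + np.cos(np.pi * t / support)),
        0.0,
    )
    taps = ideal * window
    return taps / taps.sum()  # normalise to unit gain at DC


def stacked_shifted_filters(num_shifts, cutoff=0.25, half_width=16):
    """Stack `num_shifts` filters with evenly spaced shifts in [0, 1)."""
    shifts = np.arange(num_shifts) / num_shifts
    bank = np.stack([shifted_lowpass(s, cutoff, half_width) for s in shifts])
    return shifts, bank


shifts, bank = stacked_shifted_filters(4)
# Each row of `bank` is one filter; convolving a signal with row k
# low-pass filters it and delays it by roughly shifts[k] samples.
signal = np.sin(2 * np.pi * 0.05 * np.arange(256))
filtered = np.stack([np.convolve(signal, taps, mode="same") for taps in bank])
```

In this sketch every row of `bank` has the same frequency response magnitude in the passband and differs only in its sub-sample delay, which is the property a shift-equivariance constraint would rely on. How the stacked outputs enter the training objective is left open here because the abstract does not specify it.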

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems