JenGAN: Stacked Shifted Filters in GAN-Based Speech Synthesis
Hyunjae Cho, Junhyeok Lee, Wonbin Jung

TL;DR
JenGAN is a training strategy for GAN-based speech synthesis that stacks shifted low-pass filters, reducing artifacts and improving perceptual quality while leaving the model structure used at inference unchanged.
Contribution
The paper proposes JenGAN, a new method that stacks shifted low-pass filters to enforce shift-equivariance, reducing artifacts in GAN-based vocoders.
Findings
JenGAN consistently improves the performance of the vocoder models it is applied to.
It reduces audible artifacts, such as tonal artifacts, in generated speech.
Scores are significantly better on the majority of evaluation metrics.
Abstract
Non-autoregressive GAN-based neural vocoders are widely used due to their fast inference speed and high perceptual quality. However, they often suffer from audible artifacts such as tonal artifacts in their generated results. Therefore, we propose JenGAN, a new training strategy that involves stacking shifted low-pass filters to ensure the shift-equivariant property. This method helps prevent aliasing and reduce artifacts while preserving the model structure used during inference. In our experimental evaluation, JenGAN consistently enhances the performance of vocoder models, yielding significantly superior scores across the majority of evaluation metrics.
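The idea of a stack of shifted low-pass filters can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the tap count, the cutoff, and the Kaiser window are all assumptions chosen for illustration. It only shows that a windowed-sinc low-pass filter whose impulse response is delayed by a fractional amount delays a band-limited signal by that same amount, which is the property a shift-equivariant model is expected to respect.

```python
import numpy as np

def shifted_lowpass(num_taps, cutoff, shift):
    """Windowed-sinc low-pass FIR delayed by `shift` samples (hypothetical helper).

    cutoff is in cycles per sample (0 < cutoff <= 0.5); shift may be fractional.
    """
    n = np.arange(num_taps) - (num_taps - 1) / 2        # tap indices centered on 0
    h = 2 * cutoff * np.sinc(2 * cutoff * (n - shift))  # ideal low-pass, delayed by `shift`
    h = h * np.kaiser(num_taps, 8.0)                    # taper to limit truncation ripple
    return h / h.sum()                                  # normalize to unit gain at DC

def stacked_shifted_filters(num_taps, cutoff, shifts):
    """Stack one filter per shift into an array of shape (len(shifts), num_taps)."""
    return np.stack([shifted_lowpass(num_taps, cutoff, s) for s in shifts])

# Assumed example values: three sub-sample shifts, 65 taps, cutoff at half of Nyquist.
shifts = [0.0, 0.25, 0.5]
bank = stacked_shifted_filters(65, 0.25, shifts)

# A band-limited test tone well inside the passband.
t = np.arange(512)
x = np.sin(2 * np.pi * 0.05 * t)

# Filtering with the filter shifted by s approximates x(t - s) away from the edges.
outputs = [np.convolve(x, h, mode="same") for h in bank]
```

In a training setup, each row of `bank` could be applied to a different copy of the same signal so that a model sees sub-sample-shifted versions of its input; how JenGAN actually wires this into the vocoder is described in the paper, not in this sketch.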
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
