Is GAN Necessary for Mel-Spectrogram-based Neural Vocoder?

Hui-Peng Du; Yang Ai; Rui-Chen Zheng; Ye-Xin Lu; Zhen-Hua Ling

arXiv:2508.07711·eess.AS·August 12, 2025·IEEE Signal Process. Lett.

TL;DR

This paper introduces FreeGAN, a neural vocoder that generates high-quality speech without GAN training, improving training efficiency and reducing model complexity while keeping speech quality comparable to that of GAN-based vocoders.

Contribution

The paper proposes a novel GAN-free neural vocoder architecture with amplitude-phase serial prediction, demonstrating comparable performance to GAN-based models.
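As a rough illustration of why a vocoder can predict amplitude and phase rather than raw samples, the sketch below recombines the two into a complex spectrum and inverts it for a single frame. It is a minimal NumPy sketch, not the paper's architecture: the use of log-amplitude, the helper name `combine_amplitude_phase`, and the toy frame length are all assumptions, and the serial ordering (amplitude first, then phase) is not modelled here.

```python
import numpy as np


def combine_amplitude_phase(log_amplitude, phase):
    """Build a complex spectrum from log-amplitude and phase arrays of equal shape.

    Hypothetical helper for illustration; the summary does not say how
    FreeGAN represents amplitude internally.
    """
    return np.exp(log_amplitude) * np.exp(1j * phase)


# Round trip on one toy frame: analyse, split into amplitude/phase, rebuild.
frame_length = 16  # assumed toy size
rng = np.random.default_rng(0)
frame = rng.standard_normal(frame_length)

spectrum = np.fft.rfft(frame)
log_amp = np.log(np.abs(spectrum) + 1e-9)  # small floor avoids log(0)
phase = np.angle(spectrum)

rebuilt = np.fft.irfft(combine_amplitude_phase(log_amp, phase), n=frame_length)
```

If both quantities are exact, `rebuilt` matches `frame` up to numerical error, which is why the quality of the predicted phase matters so much in this family of models.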

Findings

01. FreeGAN achieves speech quality comparable to that of GAN-based vocoders.

02. Training efficiency improves significantly and model complexity is reduced.

03. A GAN is not necessary for high-quality mel-spectrogram-based neural vocoding.

Abstract

Mainstream mel-spectrogram-based neural vocoders, e.g., HiFi-GAN and BigVGAN, currently rely on generative adversarial networks (GANs) for high-fidelity speech generation. However, the use of GANs limits training efficiency and increases model complexity. This paper therefore proposes FreeGAN, a novel vocoder designed to answer the question of whether a GAN is necessary for mel-spectrogram-based neural vocoders. FreeGAN employs an amplitude-phase serial prediction framework, eliminating the need for GAN training. It incorporates an amplitude prior input, a SNAKE-ConvNeXt v2 backbone, and a frequency-weighted anti-wrapping phase loss to compensate for the performance loss caused by the absence of a GAN. Experimental results confirm that the speech quality of FreeGAN is comparable to that of advanced GAN-based vocoders, while training efficiency is significantly improved and model complexity reduced. Other…
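The idea behind an anti-wrapping phase loss can be sketched in a few lines. Phase is only defined modulo 2π, so a raw difference between predicted and true phase can look large when the two are actually identical; an anti-wrapping function folds the difference back into [0, π] before it is penalised. The NumPy sketch below shows only the direct phase-error term. The weighting scheme, the function names, and the linearly decaying `weights` vector are assumptions for illustration, since the abstract does not specify how the frequency weights are chosen.

```python
import numpy as np


def anti_wrap(x):
    """Fold a phase difference into its principal-value magnitude in [0, pi]."""
    return np.abs(x - 2.0 * np.pi * np.round(x / (2.0 * np.pi)))


def weighted_anti_wrapping_loss(pred_phase, true_phase, freq_weights):
    """Mean anti-wrapped phase error with per-bin weights.

    pred_phase, true_phase: arrays of shape (frames, bins)
    freq_weights: array of shape (bins,), hypothetical weighting
    """
    error = anti_wrap(pred_phase - true_phase)
    return float(np.mean(error * freq_weights[None, :]))


bins = 5
weights = np.linspace(1.0, 0.5, bins)  # assumed: emphasise low frequencies
true_phase = np.zeros((2, bins))

# A full 2*pi shift is the same phase, so the loss should vanish.
loss_wrapped = weighted_anti_wrapping_loss(true_phase + 2.0 * np.pi, true_phase, weights)

# A small offset with uniform weights gives a loss equal to that offset.
loss_offset = weighted_anti_wrapping_loss(true_phase + 0.1, true_phase, np.ones(bins))
```

A plain L1 loss on the same inputs would report an error of 2π for the first case, which is the failure mode the anti-wrapping function is meant to avoid.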

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition