Basis-MelGAN: Efficient Neural Vocoder Based on Audio Decomposition
Zhengxi Liu, Yanmin Qian

TL;DR
Basis-MelGAN introduces a novel neural vocoder that decomposes audio with learned bases and weights, significantly reducing computational complexity while maintaining high audio quality, enabling more efficient real-time synthesis.
Contribution
The paper proposes Basis-MelGAN, a neural vocoder that simplifies upsampling layers by predicting basis weights instead of raw audio, reducing computational cost.
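The idea of predicting basis weights rather than raw samples can be illustrated with a minimal sketch. It assumes (these details are not stated in this summary) that each output frame is a weighted sum of fixed-length basis vectors and that frames are combined by overlap-add. The function name `synthesize_from_basis`, the `hop` parameter, and the toy values are hypothetical, and in the paper the weights come from the generator network rather than being hand-specified.

```python
def synthesize_from_basis(weights, basis, hop):
    """Rebuild a waveform from per-frame weights and a shared basis.

    weights: list of T frames, each a list of K weights (assumed to be
             what a generator would predict; hypothetical layout).
    basis:   list of K basis vectors, each of length L samples.
    hop:     number of samples between the starts of consecutive frames.
    """
    num_bases = len(basis)
    frame_len = len(basis[0])
    out = [0.0] * ((len(weights) - 1) * hop + frame_len)
    for t, frame_weights in enumerate(weights):
        if len(frame_weights) != num_bases:
            raise ValueError("each frame needs one weight per basis vector")
        start = t * hop
        # Frame t is the weighted sum of basis vectors, overlap-added into out.
        for k, w in enumerate(frame_weights):
            for n in range(frame_len):
                out[start + n] += w * basis[k][n]
    return out


# Toy example: two basis vectors of two samples each, two frames.
toy_basis = [[1.0, 2.0], [3.0, 4.0]]
toy_weights = [[1.0, 0.0], [0.0, 1.0]]
no_overlap = synthesize_from_basis(toy_weights, toy_basis, hop=2)
with_overlap = synthesize_from_basis(toy_weights, toy_basis, hop=1)
```

Under this assumed scheme the network only has to emit K weights per frame, and the mapping to samples is a fixed linear step. That is one way to read the claim that the upsampling layers become simpler. A practical implementation would use an array library instead of nested loops.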
Findings
Produces audio quality comparable to existing GAN vocoders.
Reduces computational cost from 17.74 GFLOPs (HiFi-GAN V1) to 7.95 GFLOPs.
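To put the two GFLOPs figures in proportion, the short calculation below derives the relative reduction and the compute ratio. The variable names are illustrative. The numbers describe arithmetic cost only and are not a measured wall-clock speedup.

```python
baseline_gflops = 17.74      # figure quoted above for the baseline vocoder
basis_melgan_gflops = 7.95   # figure quoted above for Basis-MelGAN

# Fraction of the baseline's floating-point work that is saved.
relative_reduction = 1.0 - basis_melgan_gflops / baseline_gflops

# How many times more compute the baseline needs.
flops_ratio = baseline_gflops / basis_melgan_gflops

print(f"relative reduction: {relative_reduction:.1%}")
print(f"compute ratio: {flops_ratio:.2f}x")
```

By these figures, Basis-MelGAN needs a little under half of the baseline's floating-point operations.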
Abstract
Recent studies have shown that neural vocoders based on generative adversarial networks (GANs) can generate high-quality audio. While GAN-based neural vocoders have been shown to be computationally much more efficient than those based on autoregressive prediction, real-time generation of the highest-quality audio on CPU is still a very challenging task. A major share of the computation in all GAN-based neural vocoders comes from the stacked upsampling layers, which are designed to match the length and temporal resolution of the output waveform. Meanwhile, the computational complexity of the upsampling networks is closely correlated with the number of samples generated for each window. To reduce the computation of the upsampling layers, we propose a new GAN-based neural vocoder called Basis-MelGAN, where the raw audio samples are decomposed with a learned basis and their associated…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies
