TL;DR
This paper presents an improved GAN-based model for unconditional audio generation. It adds a hierarchical generator architecture and cycle regularization, and improves the quality of generated singing, speech, and musical instrument sounds.
Contribution
It introduces a hierarchical generator architecture and cycle regularization to enhance audio quality and prevent mode collapse in unconditional GAN audio generation.
Findings
Outperforms the authors' previous model on quality metrics
Effective in generating singing, speech, and musical instrument sounds
Cycle regularization reduces mode collapse
Abstract
In a recent paper, we presented a generative adversarial network (GAN)-based model for unconditional generation of the mel-spectrograms of singing voices. As the generator of the model is designed to take a variable-length sequence of noise vectors as input, it can generate mel-spectrograms of variable length. However, our previous listening test shows that the quality of the generated audio leaves room for improvement. The present paper extends that previous work in the following aspects. First, we employ a hierarchical architecture in the generator to induce some structure in the temporal dimension. Second, we introduce a cycle regularization mechanism to the generator to avoid mode collapse. Third, we evaluate the performance of the new model not only for generating singing voices, but also for generating speech. Evaluation results show that the new model…
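To make two ideas from the abstract concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. It assumes that cycle regularization means an encoder maps the generated mel-spectrogram back to the noise sequence, with the reconstruction error used as a penalty; the abstract does not spell out the exact form. All names and sizes (`NOISE_DIM`, `N_MELS`, `FRAMES_PER_Z`, `generate`, `encode`, `cycle_loss`) are hypothetical, the "networks" are fixed random linear maps, and the hierarchical architecture is not modeled.

```python
import random

NOISE_DIM = 4      # hypothetical size of each noise vector
N_MELS = 8         # hypothetical number of mel bins per frame
FRAMES_PER_Z = 2   # hypothetical number of frames produced per noise vector


def make_matrix(rows, cols, rng):
    """Random weight matrix standing in for a trained network layer."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def generate(noise_seq, gen_weights):
    """Toy generator: each noise vector yields FRAMES_PER_Z mel frames,
    so the output length scales with the length of the noise sequence."""
    frames = []
    for z in noise_seq:
        for w in gen_weights:  # one weight matrix per output frame
            frames.append(matvec(w, z))
    return frames


def encode(frames, enc_weights):
    """Toy encoder (an assumption): map each group of FRAMES_PER_Z frames
    back to an estimate of the noise vector that produced it."""
    recovered = []
    for i in range(0, len(frames), FRAMES_PER_Z):
        group = [v for frame in frames[i:i + FRAMES_PER_Z] for v in frame]
        recovered.append(matvec(enc_weights, group))
    return recovered


def cycle_loss(noise_seq, recovered):
    """Mean absolute error between the input noise and the recovered noise."""
    total = sum(abs(a - b)
                for z, r in zip(noise_seq, recovered)
                for a, b in zip(z, r))
    return total / (len(noise_seq) * NOISE_DIM)


def sample_noise(length, rng):
    """Draw a sequence of `length` Gaussian noise vectors."""
    return [[rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)] for _ in range(length)]


rng = random.Random(0)
gen_weights = [make_matrix(N_MELS, NOISE_DIM, rng) for _ in range(FRAMES_PER_Z)]
enc_weights = make_matrix(NOISE_DIM, N_MELS * FRAMES_PER_Z, rng)

z_short = sample_noise(3, rng)
z_long = sample_noise(10, rng)
mel_short = generate(z_short, gen_weights)
mel_long = generate(z_long, gen_weights)

# Penalty that would be added to the generator loss under the stated assumption.
penalty = cycle_loss(z_short, encode(mel_short, enc_weights))
```

In this sketch a longer noise sequence gives a proportionally longer mel-spectrogram, and the penalty is zero only when the encoder recovers the noise exactly. If the generator collapsed many noise inputs onto the same output, the noise could not be recovered and the penalty would stay large, which is the intuition behind using such a term against mode collapse.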
