EVA-GAN: Enhanced Various Audio Generation via Scalable Generative Adversarial Networks
Shijia Liao, Shiyi Lan, Arun George Zachariah

TL;DR
EVA-GAN introduces a scalable, high-fidelity audio generation model that significantly improves spectral quality, high-frequency reconstruction, and robustness, enabling realistic 44.1kHz audio synthesis from large datasets.
Contribution
The paper presents EVA-GAN, a novel scalable GAN architecture for high-fidelity audio generation, addressing spectral discontinuities and out-of-domain robustness with extensive dataset training.
Findings
Outperforms previous models in spectral and high-frequency quality
Achieves robust performance on out-of-domain audio data
Enables high-quality 44.1kHz audio synthesis
Abstract
The advent of Large Models marks a new era in machine learning, significantly outperforming smaller models by leveraging vast datasets to capture and synthesize complex patterns. Despite these advancements, the exploration of scaling, especially in the audio generation domain, remains limited: previous efforts did not extend into the high-fidelity (HiFi) 44.1kHz domain and suffered from both spectral discontinuities and blurriness in the high-frequency domain, alongside a lack of robustness against out-of-domain data. These limitations restrict the applicability of models to diverse use cases, including music and singing generation. Our work introduces Enhanced Various Audio Generation via Scalable Generative Adversarial Networks (EVA-GAN), which yields significant improvements over the previous state of the art in spectral and high-frequency reconstruction and robustness in out-of-domain…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
