AudioGAN: A Compact and Efficient Framework for Real-Time High-Fidelity Text-to-Audio Generation

HaeChun Chung

arXiv:2512.22166·cs.SD·December 30, 2025

TL;DR

AudioGAN is a GAN-based framework for real-time, high-fidelity text-to-audio generation. It runs inference faster and uses fewer parameters than existing models, making it practical for media applications.

Contribution

This paper introduces AudioGAN, the first successful GAN-based text-to-audio (TTA) model. It generates audio in a single pass and introduces new attention mechanisms, reducing model complexity and inference time.

Findings

01

Achieves state-of-the-art performance on the AudioCaps dataset.

02

Uses 90% fewer parameters than previous models.

03

Runs 20 times faster than previous models, synthesizing audio in under one second.

Abstract

Text-to-audio (TTA) generation can significantly benefit the media industry by reducing production costs and enhancing work efficiency. However, most current TTA models, which are primarily diffusion-based, suffer from slow inference and high computational cost. In this paper, we introduce AudioGAN, the first successful Generative Adversarial Network (GAN)-based TTA framework that generates audio in a single pass, thereby reducing model complexity and inference time. To overcome the inherent difficulties of training GANs, we integrate multiple contrastive losses and propose two innovative components: Single-Double-Triple (SDT) Attention and Time-Frequency Cross-Attention (TF-CA). Extensive experiments on the AudioCaps dataset demonstrate that AudioGAN achieves state-of-the-art performance while using 90% fewer parameters and running 20 times faster, synthesizing audio in under one second.…
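The abstract does not say which contrastive losses AudioGAN uses. Purely as a generic illustration of what a text-audio contrastive objective can look like, the sketch below implements a symmetric InfoNCE loss in NumPy. The function name `info_nce`, its arguments, and the `temperature` default are hypothetical choices for this example and are not taken from the paper.

```python
import numpy as np

def info_nce(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Generic illustration only; not the loss defined in the AudioGAN paper.
    Row i of text_emb is assumed to be paired with row i of audio_emb.
    """
    # L2-normalise so that dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)

    # (B, B) similarity matrix; the diagonal holds the matching pairs.
    logits = t @ a.T / temperature

    def diag_cross_entropy(z):
        # Subtract the row maximum for numerical stability.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the text-to-audio and audio-to-text directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))
```

Called on a batch of text and audio embeddings of shape `(B, D)`, the function returns a scalar that is small when each text embedding is most similar to its own audio embedding and larger when the pairs are misaligned.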

Code & Models

Models

🤗 SeaSky1027/AudioGAN (model)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis