Fast Text-to-Audio Generation with Adversarial Post-Training

Zachary Novack; Zach Evans; Zack Zukowski; Josiah Taylor; CJ Carr; Julian Parker; Adnan Al-Sinan; Gian Marco Iodice; Julian McAuley; Taylor Berg-Kirkpatrick; Jordi Pons

arXiv:2505.08175 · cs.SD · May 21, 2025

TL;DR

This paper introduces ARC post-training, an adversarial acceleration method for diffusion/flow models that does not rely on distillation. Applied to text-to-audio generation, it cuts inference latency enough to produce about 12 seconds of stereo audio in roughly 75ms on an H100 GPU, and in a matter of seconds on a mobile device.

Contribution

The paper presents ARC post-training, a novel adversarial acceleration technique for diffusion/flow models that improves inference speed for text-to-audio generation without relying on distillation.

Findings

01

Generates ≈12 seconds of 44.1kHz stereo audio in ≈75ms on an H100 GPU

02

Runs on a mobile device with a generation time of ≈7 seconds

03

First adversarial acceleration method for diffusion/flow models not based on distillation
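
To put the H100 figure in perspective, the short sketch below converts it into a real-time factor (seconds of audio produced per second of compute). The function name `real_time_factor` is a hypothetical helper, and the inputs are the rounded values quoted above (≈12 s of audio, ≈75 ms of latency), so the result is an approximation rather than a reported benchmark.

```python
def real_time_factor(audio_seconds: float, latency_seconds: float) -> float:
    """Return seconds of audio generated per second of wall-clock compute."""
    if audio_seconds <= 0 or latency_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / latency_seconds


# Rounded figures from the findings: ~12 s of audio in ~75 ms on an H100.
h100_rtf = real_time_factor(audio_seconds=12.0, latency_seconds=0.075)
print(f"H100: about {h100_rtf:.0f}x faster than real time")
```

The same helper applies to the mobile result once the clip length used for that measurement is known; the listing here does not state it explicitly.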

Abstract

Text-to-audio systems, while increasingly performant, are slow at inference time, thus making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation. While past adversarial post-training methods have struggled to compare against their expensive distillation counterparts, ARC post-training is a simple procedure that (1) extends a recent relativistic adversarial formulation to diffusion/flow post-training and (2) combines it with a novel contrastive discriminator objective to encourage better prompt adherence. We pair ARC post-training with a number of optimizations to Stable Audio Open and build a model capable of generating $\approx$ 12s of 44.1kHz stereo audio in $\approx$ 75ms on an H100, and $\approx$ 7s on a mobile…
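
The abstract names a relativistic adversarial formulation without spelling out its form. As an illustration only, the sketch below shows one common relativistic paired loss, in which the discriminator is scored on the gap between its outputs for a real sample and a generated one rather than on each output alone. This is not taken from the paper: the softplus form, the function names, and the pairing of real and generated samples are all assumptions, and the contrastive term is not shown because the abstract does not describe it.

```python
import math
from typing import Sequence, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def relativistic_pair_losses(
    real_logits: Sequence[float], fake_logits: Sequence[float]
) -> Tuple[float, float]:
    """Return (discriminator_loss, generator_loss) for paired logits.

    Assumption: real_logits[i] and fake_logits[i] form a pair, e.g. a real
    clip and a generated clip scored by the same discriminator.
    """
    if not real_logits or len(real_logits) != len(fake_logits):
        raise ValueError("need equally sized, non-empty logit lists")
    gaps = [fake - real for real, fake in zip(real_logits, fake_logits)]
    # Discriminator is penalized when the fake scores above its paired real.
    d_loss = sum(softplus(g) for g in gaps) / len(gaps)
    # Generator is penalized when the real scores above its paired fake.
    g_loss = sum(softplus(-g) for g in gaps) / len(gaps)
    return d_loss, g_loss


# Toy logits: the discriminator currently ranks real above fake in both pairs.
d_loss, g_loss = relativistic_pair_losses([2.0, 1.5], [-1.0, 0.0])
print(f"discriminator loss: {d_loss:.3f}, generator loss: {g_loss:.3f}")
```

Because each loss depends only on the difference within a pair, shifting all logits by a constant leaves both values unchanged, which is the defining property of a relativistic objective.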

Code & Models

Repositories

stability-ai/stable-audio-tools
PyTorch · Official
