AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion

Junqi Zhao; Jinzheng Zhao; Haohe Liu; Yun Chen; Lu Han; Xubo Liu; Mark Plumbley; Wenwu Wang

arXiv:2505.22106 · cs.SD · May 29, 2025

TL;DR

AudioTurbo is a novel method that combines pre-trained diffusion models with rectified diffusion to significantly accelerate text-to-audio generation, achieving high-quality results with fewer sampling steps.

Contribution

It introduces a new approach that leverages pre-trained models to improve the efficiency of rectified diffusion for text-to-audio synthesis, reducing inference steps substantially.

Findings

01

Outperforms prior models with only 10 sampling steps

02

Needs only 3 to 10 inference steps, fewer than previous methods

03

Demonstrates superior quality on the AudioCaps dataset

Abstract

Diffusion models have significantly improved the quality and diversity of audio generation but are hindered by slow inference speed. Rectified flow enhances inference speed by learning straight-line ordinary differential equation (ODE) paths. However, this approach requires training a flow-matching model from scratch and tends to perform suboptimally, or even poorly, at low step counts. To address the limitations of rectified flow while leveraging the advantages of advanced pre-trained diffusion models, this study integrates pre-trained models with the rectified diffusion method to improve the efficiency of text-to-audio (TTA) generation. Specifically, we propose AudioTurbo, which learns first-order ODE paths from deterministic noise sample pairs generated by a pre-trained TTA model. Experiments on the AudioCaps dataset demonstrate that our model, with only 10 sampling steps,…
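The idea of learning first-order ODE paths from deterministic noise-sample pairs can be illustrated with a toy sketch. This is not the authors' implementation: the cosine schedule, the start time `T_MAX`, the helper names, and the "oracle" noise predictor are all assumptions made for illustration. The sketch pairs a noise vector `eps` with a sample `x0` (standing in for the output a pre-trained model would deterministically produce from that noise), places the pair on the path x_t = alpha_t * x0 + sigma_t * eps, and shows that when the noise prediction is constant along the path, a DDIM-style sampler reaches the same `x0` whether it takes 1, 3, or 10 steps.

```python
import math
import random

T_MAX = 0.95  # hypothetical start time, kept below 1 so alpha_t stays nonzero


def schedule(t):
    """Hypothetical cosine schedule with alpha_t^2 + sigma_t^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def make_training_example(x0, eps, t):
    """Place a fixed (noise, sample) pair on its path at time t.

    x0 is assumed to be what a pre-trained model generated deterministically
    from eps, so the same pair is reused for every t. Returns the network
    input x_t and the epsilon-prediction target.
    """
    alpha, sigma = schedule(t)
    x_t = [alpha * a + sigma * e for a, e in zip(x0, eps)]
    return x_t, eps


def ddim_sample(eps_model, x_start, num_steps):
    """Deterministic DDIM-style sampling from T_MAX down to 0."""
    ts = [T_MAX * (1 - i / num_steps) for i in range(num_steps + 1)]
    x = list(x_start)
    for t, s in zip(ts[:-1], ts[1:]):
        alpha_t, sigma_t = schedule(t)
        alpha_s, sigma_s = schedule(s)
        eps_hat = eps_model(x, t)
        # Estimate the clean sample, then move to the next time on the path.
        x0_hat = [(xi - sigma_t * ei) / alpha_t for xi, ei in zip(x, eps_hat)]
        x = [alpha_s * a + sigma_s * e for a, e in zip(x0_hat, eps_hat)]
    return x


rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]   # stand-in for a generated latent
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]  # the noise it was paired with
x_start, target = make_training_example(x0, eps, T_MAX)


def oracle(x, t):
    """Idealized predictor: returns the paired noise at every time step."""
    return eps


errors = {}
for n in (1, 3, 10):
    out = ddim_sample(oracle, x_start, n)
    errors[n] = max(abs(a - b) for a, b in zip(out, x0))
    print(n, "steps, max abs error:", errors[n])
```

A trained network only approximates the oracle, so in practice quality still depends on the step count; the sketch only shows why a constant prediction along the path makes few-step sampling feasible.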

Taxonomy

Topics: Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion