FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation

Huadai Liu; Jialei Wang; Rongjie Huang; Yang Liu; Heng Lu; Zhou Zhao; Wei Xue

arXiv:2410.12266·eess.AS·June 4, 2025

TL;DR

FlashAudio introduces rectified flows with optimized time distribution and anchored guidance to enable fast, high-fidelity text-to-audio generation, surpassing traditional diffusion models in quality and speed.

Contribution

The paper proposes FlashAudio with rectified flows, Bifocal Samplers, immiscible flow, and Anchored Optimization to significantly improve one-step text-to-audio generation performance.

Findings

01. Outperforms diffusion models with hundreds of steps in audio quality.

02. Achieves 400x faster-than-real-time sampling on a single GPU.

03. Surpasses previous one-step methods in quality due to rectified flows and optimization techniques.

Abstract

Recent advancements in latent diffusion models (LDMs) have markedly enhanced text-to-audio generation, yet their iterative sampling processes impose substantial computational demands, limiting practical deployment. While recent methods utilizing consistency-based distillation aim to achieve few-step or single-step inference, their one-step performance is constrained by curved trajectories, preventing them from surpassing traditional diffusion models. In this work, we introduce FlashAudio with rectified flows to learn straight flow for fast simulation. To alleviate the inefficient timestep allocation and suboptimal distribution of noise, FlashAudio optimizes the time distribution of rectified flow with Bifocal Samplers and proposes immiscible flow to minimize the total distance of data-noise pairs in a batch via assignment. Furthermore, to address the amplified accumulation error…
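The batch-level pairing that the abstract attributes to immiscible flow can be illustrated with a small sketch. This is not the authors' implementation: the function names (`immiscible_pairing`, `interpolate`, `velocity_target`), the Euclidean distance, the toy batch size, and the convention that t=0 is noise and t=1 is data are all assumptions made for illustration. The assignment is solved by brute force over permutations, which only works for tiny batches; a practical version would likely use a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment`.

```python
import itertools
import math
import random


def immiscible_pairing(data, noise):
    """Reorder `noise` so the summed Euclidean distance to `data` is minimal.

    Brute force over all permutations: an illustrative stand-in for a real
    linear-assignment solver, feasible only for very small batches.
    """
    best_perm, best_cost = None, math.inf
    for perm in itertools.permutations(range(len(noise))):
        cost = sum(math.dist(data[i], noise[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return [noise[j] for j in best_perm], best_cost


def interpolate(noise_vec, data_vec, t):
    """Point on the straight line from noise (t=0) to data (t=1) (assumed convention)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise_vec, data_vec)]


def velocity_target(noise_vec, data_vec):
    """Constant velocity along that straight line: data minus noise."""
    return [d - n for n, d in zip(noise_vec, data_vec)]


# Toy batch: 4 "latents" of dimension 3 (hypothetical sizes).
rng = random.Random(0)
batch, dim = 4, 3
data = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
noise = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]

# Cost of the default index-wise pairing versus the minimal-distance pairing.
random_cost = sum(math.dist(d, n) for d, n in zip(data, noise))
paired_noise, paired_cost = immiscible_pairing(data, noise)

# A training example for a rectified-flow velocity model would then use
# the re-paired noise: sample t, interpolate, and regress onto the velocity.
t = rng.random()
x_t = interpolate(paired_noise[0], data[0], t)
v = velocity_target(paired_noise[0], data[0])
```

The pairing only permutes the noise samples within the batch, so the noise distribution seen during training is unchanged while the paired endpoints are, on average, closer together.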

Code & Models

Repositories

Text-to-Audio/AudioLCM
pytorch

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Video Analysis and Summarization

Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings