AudioLCM: Text-to-Audio Generation with Latent Consistency Models
Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

TL;DR
AudioLCM introduces a fast, high-quality text-to-audio generation model using latent consistency models and guided distillation, achieving rapid inference with minimal steps and high fidelity.
Contribution
The paper presents AudioLCM, a novel consistency-based model that significantly reduces sampling steps in text-to-audio generation while maintaining high quality, and integrates advanced transformer techniques for stability.
Findings
Requires only 2 iterations for high-quality audio synthesis.
Achieves 333x faster-than-real-time sampling speed.
Maintains competitive sample quality with state-of-the-art models.
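To put the 333x figure in concrete terms, the arithmetic below converts a faster-than-real-time factor into wall-clock synthesis time. It is only an illustration: the helper name and the 10-second clip length are assumptions chosen for the example, not values from the paper.

```python
def synthesis_time_seconds(audio_seconds: float, speedup: float = 333.0) -> float:
    """Wall-clock time needed to generate `audio_seconds` of audio
    when sampling runs `speedup` times faster than real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Hypothetical 10-second clip at the reported 333x speed.
clip_seconds = 10.0
millis = synthesis_time_seconds(clip_seconds) * 1000
print(f"{clip_seconds:.0f} s clip -> about {millis:.0f} ms of compute")
```

Under these assumptions a 10-second clip takes roughly 30 ms to synthesize. The actual figure depends on the hardware and clip length used in the paper's benchmark.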
Abstract
Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. AudioLCM integrates Consistency Models into the generation process, facilitating rapid inference through a mapping from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands to dozens of…
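The idea of mapping any point on a trajectory back to its initial point can be illustrated with a toy example. The sketch below is not AudioLCM's network or latent space: it assumes 1-D Gaussian data and a variance-exploding noise schedule, for which the consistency function has a closed form, and it runs a generic two-step consistency sampler. All constants and function names are hypothetical.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed std of the toy (zero-mean Gaussian) data
EPS = 0.002       # assumed smallest time on the trajectory
T_MAX = 80.0      # assumed largest time (near-pure noise)


def consistency_fn(x: float, t: float) -> float:
    """Closed-form map from a point (x, t) to its trajectory's start at EPS.
    Valid only for this toy Gaussian setup; a real model learns this map."""
    return x * math.sqrt(SIGMA_DATA**2 + EPS**2) / math.sqrt(SIGMA_DATA**2 + t**2)


def ode_point(x_start: float, t: float) -> float:
    """Point at time t on the probability-flow ODE trajectory starting at x_start."""
    return x_start * math.sqrt(SIGMA_DATA**2 + t**2) / math.sqrt(SIGMA_DATA**2 + EPS**2)


def multistep_sample(times: list, rng: random.Random) -> float:
    """Multistep consistency sampling with one jump per entry in `times`
    (descending): jump to the start, re-noise to the next time, jump again."""
    x = consistency_fn(rng.gauss(0.0, times[0]), times[0])
    for tau in times[1:]:
        noisy = x + math.sqrt(tau**2 - EPS**2) * rng.gauss(0.0, 1.0)
        x = consistency_fn(noisy, tau)
    return x


rng = random.Random(0)
# Two iterations: one jump from T_MAX, one refinement jump from t = 1.0.
samples = [multistep_sample([T_MAX, 1.0], rng) for _ in range(20000)]
# The data is zero-mean, so the root mean square estimates the std.
sample_std = math.sqrt(sum(s * s for s in samples) / len(samples))
print(f"sample std: {sample_std:.3f} (target {SIGMA_DATA})")
```

In this toy setting two function evaluations already recover the data distribution, because the consistency function is exact. In a trained model the function is approximate, which is why the extra refinement step can trade a little compute for quality.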
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Natural Language Processing Techniques
Methods: Consistency Models · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion · LLaMA
