AudioLCM: Text-to-Audio Generation with Latent Consistency Models

Huadai Liu; Rongjie Huang; Yang Liu; Hengyuan Cao; Jialei Wang; Xize Cheng; Siqi Zheng; Zhou Zhao

arXiv:2406.00356·eess.AS·July 10, 2024·1 cites

TL;DR

AudioLCM introduces a fast, high-quality text-to-audio generation model using latent consistency models and guided distillation, achieving rapid inference with minimal steps and high fidelity.

Contribution

The paper presents AudioLCM, a consistency-based model that sharply reduces the number of sampling steps in text-to-audio generation while maintaining high quality, and that integrates LLaMA-style transformer techniques for training stability.

Findings

01. Requires only 2 iterations for high-quality audio synthesis.

02. Achieves 333x faster-than-real-time sampling speed.

03. Maintains competitive sample quality with state-of-the-art models.
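To put the 333x figure in concrete terms, the arithmetic below converts a faster-than-real-time speedup into wall-clock generation time. It is only an illustration: the function names and the 10-second clip length are assumptions made for the example, not values taken from the paper.

```python
def generation_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time needed to synthesize `audio_seconds` of audio
    when sampling runs `speedup` times faster than real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def real_time_speedup(audio_seconds: float, wall_seconds: float) -> float:
    """Inverse view: how many seconds of audio are produced per second of compute."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


if __name__ == "__main__":
    # Hypothetical 10-second clip at the reported 333x speedup.
    t = generation_time_seconds(10.0, 333.0)
    print(f"{t * 1000:.1f} ms of compute for 10 s of audio")
```

Under these assumptions, a 10-second clip would take roughly 30 ms to generate.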

Abstract

Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. AudioLCM integrates Consistency Models into the generation process, facilitating rapid inference through a mapping from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands to dozens of…
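The core idea in the abstract, a function that maps any point on a noising trajectory back to the trajectory's initial point, can be sketched with a generic few-step consistency sampler. The code below is a minimal illustration and not AudioLCM's implementation: it works on a plain list of floats rather than audio latents, uses the skip-connection parameterization and noise scale from the original Consistency Models formulation, and replaces the trained network with a hypothetical `toy_network` that always predicts zero. The constants `SIGMA_DATA`, `T_MIN`, and `T_MAX` are likewise assumed values.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed data scale (hypothetical value)
T_MIN = 0.002     # smallest time; the consistency function is the identity here
T_MAX = 80.0      # largest time; samples start as pure noise at this level


def toy_network(x, t):
    """Hypothetical stand-in for a trained network: always predicts zeros."""
    return [0.0 for _ in x]


def consistency_fn(x, t, network=toy_network):
    """Map a noisy point x at time t to an estimate of the trajectory's origin.

    The skip/out coefficients enforce the boundary condition f(x, T_MIN) = x.
    """
    c_skip = SIGMA_DATA ** 2 / ((t - T_MIN) ** 2 + SIGMA_DATA ** 2)
    c_out = SIGMA_DATA * (t - T_MIN) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)
    pred = network(x, t)
    return [c_skip * xi + c_out * pi for xi, pi in zip(x, pred)]


def multistep_sample(dim, timesteps, rng, network=toy_network):
    """Few-step sampling: denoise, re-noise to a smaller time, denoise again.

    `timesteps` is a decreasing list of times; its length is the number of
    network evaluations (two entries correspond to 2-step sampling).
    """
    t0 = timesteps[0]
    x = [t0 * rng.gauss(0.0, 1.0) for _ in range(dim)]
    x0 = consistency_fn(x, t0, network)
    for t in timesteps[1:]:
        scale = math.sqrt(max(t ** 2 - T_MIN ** 2, 0.0))
        x = [xi + scale * rng.gauss(0.0, 1.0) for xi in x0]
        x0 = consistency_fn(x, t, network)
    return x0


if __name__ == "__main__":
    rng = random.Random(0)
    sample = multistep_sample(dim=4, timesteps=[T_MAX, 1.0], rng=rng)
    print(len(sample))
```

Each entry in `timesteps` costs one network evaluation, so a two-entry schedule corresponds to the 2-iteration setting listed in the findings. With a real trained network in place of `toy_network`, the extra step trades a little compute for sample quality.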

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Consistency Models · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion · LLaMA