ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

Yatong Bai; Trung Dang; Dung Tran; Kazuhito Koishida; Somayeh Sojoudi

arXiv:2309.10740·cs.SD·June 25, 2024

TL;DR

ConsistencyTTA significantly accelerates diffusion-based text-to-audio generation by reducing inference queries to one, maintaining quality and diversity, and enabling fine-tuning with audio-aware metrics.

Contribution

Introduces a novel latent consistency model with classifier-free guidance that speeds up TTA generation and allows for effective fine-tuning.

Findings

01

Reduces inference computation by 400x relative to comparable diffusion-based TTA models.

02

Maintains high quality and diversity in generated audio.

03

Can be fine-tuned with audio-space metrics like CLAP score.
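
As a rough illustration of the third finding: a CLAP-style score is commonly computed as the cosine similarity between an audio embedding and a text embedding. The sketch below assumes the embeddings are already available as plain arrays; the function name `clap_style_score` and the random placeholder embeddings are hypothetical and do not come from the paper or its repository.

```python
import numpy as np

def clap_style_score(audio_emb, text_emb):
    """Cosine similarity between paired audio and text embeddings.

    Both inputs have shape (batch, dim). In a real setup they would come
    from a pretrained CLAP encoder; here they are just arrays (assumption).
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return np.sum(a * t, axis=-1)

# Placeholder embeddings standing in for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 512))
text = rng.standard_normal((4, 512))

scores = clap_style_score(audio, text)
# A fine-tuning objective could minimize the negative mean score.
loss = -scores.mean()
```

Because a single-query generator produces audio in one forward pass, a loss of this kind can in principle be backpropagated through the whole generation, which is what makes closed-loop fine-tuning practical.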

Abstract

Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation. To address this bottleneck, we introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve this by proposing a "CFG-aware latent consistency model," which adapts consistency generation into a latent space and incorporates classifier-free guidance (CFG) into model training. Moreover, unlike diffusion models, ConsistencyTTA can be finetuned closed-loop with audio-space text-aware metrics, such as CLAP score, to further enhance the generations. Our objective and subjective evaluation on the AudioCaps dataset shows that compared to diffusion-based counterparts, ConsistencyTTA reduces inference…
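
To make the query-count argument concrete, here is a minimal toy sketch, not the authors' implementation. It contrasts a multi-step sampler that applies classifier-free guidance externally (two network queries per step) with a single-query generator that takes the guidance weight as an input. The class `CountingNet`, the toy update rule, and the choice of 200 steps are illustrative assumptions.

```python
import numpy as np

class CountingNet:
    """Toy stand-in for a denoising network that counts its queries."""

    def __init__(self):
        self.calls = 0

    def __call__(self, z, t, text_emb, w=None):
        self.calls += 1
        # Arbitrary toy arithmetic; a real network would be a learned model.
        cond = 0.0 if text_emb is None else float(np.mean(text_emb))
        scale = 1.0 if w is None else w
        return 0.1 * z + scale * cond

def cfg_prediction(net, z, t, text_emb, w):
    """External CFG: one conditional and one unconditional query."""
    eps_cond = net(z, t, text_emb)
    eps_uncond = net(z, t, None)
    return eps_uncond + w * (eps_cond - eps_uncond)

def diffusion_generate(net, z, text_emb, w, num_steps):
    """Toy iterative sampler: 2 * num_steps network queries in total."""
    for t in range(num_steps, 0, -1):
        z = z - cfg_prediction(net, z, t, text_emb, w) / num_steps
    return z

def consistency_generate(net, z, text_emb, w, t_max):
    """Toy single-query generator: guidance weight w is a network input."""
    return net(z, t_max, text_emb, w=w)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 16))   # hypothetical latent shape
text = rng.standard_normal(32)      # hypothetical text embedding

teacher = CountingNet()
diffusion_generate(teacher, z0, text, w=3.0, num_steps=200)

student = CountingNet()
consistency_generate(student, z0, text, w=3.0, t_max=200)

print(teacher.calls, student.calls)  # prints "400 1"
```

Under these assumed settings the iterative sampler issues 400 queries against one for the single-query generator. The counts depend entirely on the step count chosen here and say nothing about output quality.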

Code & Models

Repositories

Bai-YT/ConsistencyTTA
jax · Official

Models

🤗 Bai-YT/ConsistencyTTA (model)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies

Methods: Diffusion