LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation

Wenhao Guan; Kaidi Wang; Wangjin Zhou; Yang Wang; Feng Deng; Hui Wang; Lin Li; Qingyang Hong; Yong Qin

arXiv:2406.08203 · eess.AS · June 13, 2024

TL;DR

LAFMA introduces a flow matching-based model in the audio latent space that improves text-to-audio generation quality and efficiency, reducing inference steps while maintaining high sample quality.

Contribution

The paper integrates flow matching into the audio latent space, where prior text-to-audio methods mostly relied on diffusion models, to improve audio generation.

Findings

01. Achieves higher audio quality than previous models.

02. Reduces inference steps to ten with minimal quality loss.

03. Demonstrates effectiveness of flow matching in audio synthesis.
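
The ten-step figure in the findings refers to how many times the learned vector field is evaluated while integrating from noise to an audio latent. The sketch below shows what such a fixed-step sampler could look like. It is only an illustration: the Euler solver, the function names `euler_sample` and `make_toy_field`, and the value of `sigma_min` are assumptions, and the closed-form toy field stands in for a trained, text-conditioned network, so it says nothing about the quality the paper reports.

```python
def euler_sample(vector_field, x0, num_steps=10):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps.

    vector_field: callable (x, t) -> list of floats (hypothetical model stand-in).
    x0: starting noise sample as a list of floats.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_field(x1, sigma_min=1e-4):
    """Closed-form conditional field that transports noise toward one target x1.

    Assumes a straight-line (optimal-transport style) path; a real system would
    replace this with a neural network conditioned on the text prompt.
    """
    def field(x, t):
        denom = 1.0 - (1.0 - sigma_min) * t
        return [(b - (1.0 - sigma_min) * a) / denom for a, b in zip(x, x1)]
    return field


if __name__ == "__main__":
    target = [0.5, 3.0]            # stand-in for an audio latent
    noise = [1.0, -2.0]            # stand-in for a Gaussian noise draw
    sample = euler_sample(make_toy_field(target), noise, num_steps=10)
    print(sample)                  # lands very close to `target`
```

Because the toy path is a straight line, the field is constant along each trajectory and Euler integration is exact for any step count. A learned field is only approximately straight, so how well a small step budget works has to be measured empirically.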

Abstract

Recently, diffusion models have driven significant progress in speech and audio generation. Nevertheless, the quality of samples generated by diffusion models still needs improvement. Their effectiveness also depends on a large number of sampling steps, which lengthens the synthesis time needed to generate high-quality audio. Previous Text-to-Audio (TTA) methods mostly used diffusion models in the latent space for audio generation. In this paper, we explore the integration of the Flow Matching (FM) model into the audio latent space for audio generation. FM is an alternative simulation-free method that trains continuous normalizing flows (CNFs) by regressing vector fields. We demonstrate that our model significantly enhances the quality of generated audio samples, achieving better performance than prior models.…
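
To make "regressing vector fields" concrete, the sketch below builds one training pair for a conditional flow matching objective. It assumes the straight-line (optimal-transport) probability path commonly used with flow matching. The abstract as shown here does not state which path LAFMA uses, and the names `ot_path_sample`, `fm_loss`, and the `sigma_min` value are hypothetical.

```python
import random


def ot_path_sample(x0, x1, t, sigma_min=1e-4):
    """Return (x_t, u_t) for a straight-line conditional path (an assumption).

    x0: noise sample, x1: data latent, both lists of floats of equal length.
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1   point on the path at time t
    u_t = x1 - (1 - sigma_min) * x0                 regression target
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def fm_loss(predicted, target):
    """Mean squared error between a predicted and a target vector field."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    x1 = [0.5, -1.0, 2.0]                     # stand-in for an audio latent
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # Gaussian noise sample
    t = rng.random()                          # time drawn uniformly from [0, 1)
    x_t, u_t = ot_path_sample(x0, x1, t)
    # A real model would predict the field from (x_t, t, text embedding);
    # a zero prediction is used here only to exercise the loss.
    print(fm_loss([0.0] * len(u_t), u_t))
```

Training is "simulation-free" in the sense that each pair is computed in closed form from a noise sample and a data sample, with no ODE solve inside the training loop.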

Code & Models

Repositories

gwh22/LAFMA (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Diffusion