LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation
Wenhao Guan, Kaidi Wang, Wangjin Zhou, Yang Wang, Feng Deng, Hui Wang, Lin Li, Qingyang Hong, Yong Qin

TL;DR
LAFMA introduces a flow matching-based model in the audio latent space that improves text-to-audio generation quality and efficiency, reducing inference steps while maintaining high sample quality.
Contribution
The paper integrates flow matching into the audio latent space, where previous text-to-audio methods mostly used diffusion models, to improve audio generation.
Findings
Achieves higher audio quality than previous models.
Reduces inference steps to ten with minimal quality loss.
Demonstrates effectiveness of flow matching in audio synthesis.
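The ten-step finding can be pictured as integrating an ordinary differential equation with a small, fixed number of solver steps. The following is a minimal sketch, assuming a plain Euler solver and a toy straight-line vector field; `euler_sample` and `toy_field` are hypothetical names for illustration, and the solver actually used by the paper is not stated in this summary.

```python
import numpy as np

def euler_sample(vector_field, x0, num_steps=10):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # current time in [0, 1)
        x = x + dt * vector_field(x, t)
    return x

# Toy stand-in for a learned model (assumption): a field that moves every
# point along a straight line toward a fixed target, reached at t = 1.
target = np.array([1.0, -2.0, 0.5])

def toy_field(x, t):
    return (target - x) / (1.0 - t)

sample = euler_sample(toy_field, np.zeros(3), num_steps=10)
```

In a real system the toy field would be replaced by a trained network conditioned on the text prompt, and the result would be a latent that still has to be decoded into audio.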
Abstract
Diffusion models have recently driven significant progress in speech and audio generation. Nevertheless, the quality of the samples they generate still needs improvement, and their effectiveness depends on a large number of sampling steps, which lengthens the synthesis time needed to generate high-quality audio. Previous Text-to-Audio (TTA) methods mostly used diffusion models in the latent space for audio generation. In this paper, we explore the integration of the Flow Matching (FM) model into the audio latent space for audio generation. FM is an alternative simulation-free method that trains continuous normalizing flows (CNFs) by regressing vector fields. We demonstrate that our model significantly enhances the quality of generated audio samples, achieving better performance than prior models.…
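To make "regressing vector fields" concrete, here is a minimal sketch of how a flow matching training target can be built. It assumes a linear interpolation path between Gaussian noise and a data latent, which is a common choice but is not confirmed by this summary as the paper's exact formulation; `cfm_training_pair` and `cfm_loss` are hypothetical helper names.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Build one flow matching regression example for a batch of latents x1.

    Assumed path: x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    whose velocity (the regression target) is x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)          # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))      # one time per batch item
    xt = (1.0 - t) * x0 + t * x1                # point on the path
    target_v = x1 - x0                          # velocity along the path
    return xt, t, target_v

def cfm_loss(predicted_v, target_v):
    """Mean squared error between predicted and target vector fields."""
    return float(np.mean((predicted_v - target_v) ** 2))

rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 8))           # stand-in for audio latents
xt, t, target_v = cfm_training_pair(latents, rng)
```

Training is called simulation-free because the target is computed in closed form from `x0`, `x1`, and `t`; no ODE has to be solved during training. A network would take `xt`, `t`, and the text condition as input, and `cfm_loss` would compare its output with `target_v`.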
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Natural Language Processing Techniques
Methods: Diffusion
