FlowSep: Language-Queried Sound Separation with Rectified Flow Matching

Yi Yuan; Xubo Liu; Haohe Liu; Mark D. Plumbley; Wenwu Wang

arXiv:2409.07614·cs.SD·January 10, 2025

Open Access

TL;DR

FlowSep introduces a novel generative approach based on rectified flow matching for language-queried sound separation, achieving superior quality and efficiency over existing models by leveraging a VAE latent space and a pre-trained vocoder.

Contribution

This work pioneers the application of rectified flow matching in sound separation, demonstrating its advantages over discriminative and diffusion-based models in LASS tasks.

Findings

01. FlowSep outperforms state-of-the-art models on multiple benchmarks.

02. FlowSep surpasses diffusion-based models in separation quality and inference speed.

03. The model effectively learns linear flow trajectories in VAE latent space for sound separation.

Abstract

Language-queried audio source separation (LASS) focuses on separating sounds using textual descriptions of the desired sources. Current methods mainly use discriminative approaches, such as time-frequency masking, to separate target sounds and minimize interference from other sources. However, these models face challenges when separating overlapping soundtracks, which may lead to artifacts such as spectral holes or incomplete separation. Rectified flow matching (RFM), a generative model that establishes linear relations between the distribution of data and noise, offers superior theoretical properties and simplicity, but has not yet been explored in sound separation. In this work, we introduce FlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns linear flow trajectories from noise to target source features within the variational autoencoder (VAE) latent space.…
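The abstract describes rectified flow matching as learning straight-line trajectories from noise to target features. The sketch below illustrates that generic idea in NumPy; it is not the FlowSep implementation. Text conditioning, the VAE, and the vocoder are all omitted, and every name here (`rfm_pair`, `rfm_loss`, `euler_sample`, `predict_velocity`) is hypothetical. It assumes a standard Gaussian noise source, a uniformly sampled time t, and a simple Euler integrator for sampling.

```python
import numpy as np

def rfm_pair(x1, rng):
    """Build one training example for rectified flow matching.

    x1: array of shape (batch, dim) standing in for target latent features.
    Returns the interpolated point x_t, the time t, and the target velocity.
    """
    x0 = rng.standard_normal(x1.shape)       # noise endpoint (assumed Gaussian)
    t = rng.uniform(size=(x1.shape[0], 1))   # one time in [0, 1) per example
    xt = (1.0 - t) * x0 + t * x1             # point on the straight line x0 -> x1
    v = x1 - x0                              # velocity is constant along that line
    return xt, t, v

def rfm_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and straight-line velocity."""
    xt, t, v = rfm_pair(x1, rng)
    return float(np.mean((predict_velocity(xt, t) - v) ** 2))

def euler_sample(predict_velocity, x0, num_steps):
    """Integrate dx/dt = predict_velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full((x.shape[0], 1), i * dt)
        x = x + dt * predict_velocity(x, t)
    return x
```

In practice `predict_velocity` would be a trained network conditioned on the text query and the mixture. Because the trajectories are trained to be straight, a small `num_steps` can already land close to the target, which is the usual motivation for expecting fast inference from this family of models.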

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing