Separate What You Describe: Language-Queried Audio Source Separation

Xubo Liu; Haohe Liu; Qiuqiang Kong; Xinhao Mei; Jinzheng Zhao; Qiushi Huang; Mark D. Plumbley; Wenwu Wang

arXiv:2203.15147 · eess.AS · March 30, 2022

TL;DR

This paper introduces LASS-Net, an end-to-end neural network that separates a target source from an audio mixture given a natural language query. It addresses the challenge of linking complex linguistic descriptions to audio sources and reports considerable improvements over baseline methods on a dataset created from AudioCaps.

Contribution

The paper presents LASS-Net, the first end-to-end model for language-queried audio source separation, integrating acoustic and linguistic information for improved separation performance.

Findings

01 LASS-Net outperforms baseline methods in source separation accuracy.

02 The model generalizes well with diverse human-annotated descriptions.

03 Promising potential for real-world applications in audio retrieval and separation.

Abstract

In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing"). A unique challenge in LASS is the complexity of natural language descriptions and their relation to the audio sources. To address this issue, we propose LASS-Net, an end-to-end neural network that learns to jointly process acoustic and linguistic information and to separate the target source that is consistent with the language query from an audio mixture. We evaluate the performance of our proposed system with a dataset created from the AudioCaps dataset. Experimental results show that LASS-Net achieves considerable improvements over baseline methods. Furthermore, we observe that LASS-Net achieves promising…
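
The task setup in the abstract (an audio mixture, a language query, and a target source) can be illustrated with a minimal sketch of how one such example might be assembled. The mixing recipe, the `snr_db` parameter, and the function names `make_lass_example` and `signal_to_noise_db` are assumptions made for illustration; they are not taken from the paper or from the liuxubo717/lass repository. A sine wave and white noise stand in for AudioCaps clips.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero for silent clips


def make_lass_example(target, interference, caption, snr_db=0.0):
    """Build one hypothetical (mixture, query, target) training triple.

    The interference is rescaled so that the target-to-interference
    power ratio equals snr_db (an assumed recipe, not the paper's).
    """
    target = np.asarray(target, dtype=np.float64)
    interference = np.asarray(interference, dtype=np.float64)
    n = min(len(target), len(interference))  # trim to a common length
    target, interference = target[:n], interference[:n]

    p_target = np.mean(target ** 2)
    p_interf = np.mean(interference ** 2)
    gain = np.sqrt(p_target / (p_interf * 10 ** (snr_db / 10) + EPS))

    return {
        "mixture": target + gain * interference,
        "query": caption,   # natural language description of the target
        "target": target,   # ground truth the separator should recover
    }


def signal_to_noise_db(reference, estimate):
    """Plain SNR in dB between a reference signal and an estimate."""
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + EPS))


# Stand-in signals: one second of audio at an assumed 16 kHz sample rate.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
target_clip = np.sin(2 * np.pi * 220 * t)
interference_clip = rng.standard_normal(sample_rate)

ex = make_lass_example(
    target_clip,
    interference_clip,
    "a man tells a joke followed by people laughing",
    snr_db=0.0,
)

# Using the unprocessed mixture as the "estimate" gives a do-nothing
# reference point that any separator should beat.
baseline_snr = signal_to_noise_db(ex["target"], ex["mixture"])
print(f"query: {ex['query']!r}, mixture SNR: {baseline_snr:.2f} dB")
```

By construction, the unprocessed mixture scores close to the chosen `snr_db` against the target, so a query-conditioned separator would be judged by how far its output improves on that figure.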

Code & Models

Repositories

liuxubo717/lass
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis