OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation

Tanvir Mahmud; Diana Marculescu

arXiv:2409.19270·cs.SD·October 1, 2024

TL;DR

OpenSep introduces a framework that uses large language models and textual inversion to improve open-world audio separation, effectively handling unseen sources without manual intervention.

Contribution

OpenSep combines LLM prompting with textual inversion and a multi-level extension of mix-and-separate training to improve audio separation in open-world scenarios, and it reports better results than state-of-the-art baselines.

Findings

01 Outperforms state-of-the-art baseline methods

02 Accurately separates unseen and variable sources

03 Effectively handles complex real-world audio mixtures

Abstract

Audio separation in real-world scenarios, where mixtures contain a variable number of sources, presents significant challenges due to limitations of existing models, such as over-separation, under-separation, and dependence on predefined training sources. We propose OpenSep, a novel framework that leverages large language models (LLMs) for automated audio separation, eliminating the need for manual intervention and overcoming source limitations. OpenSep uses textual inversion to generate captions from audio mixtures with off-the-shelf audio captioning models, effectively parsing the sound sources present. It then employs few-shot LLM prompting to extract detailed audio properties of each parsed source, facilitating separation in unseen mixtures. Additionally, we introduce a multi-level extension of the mix-and-separate training framework to enhance modality alignment by separating…
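The abstract builds on the mix-and-separate training framework. As a rough illustration of that base idea only (not the paper's multi-level extension, whose description is cut off above), the sketch below mixes two toy single-source signals and pairs the mixture with a text query naming the source to recover. The helper names `sine` and `make_training_example`, the dictionary layout, and the captions are all hypothetical and are not taken from the OpenSep code.

```python
import math
import random


def sine(freq_hz, n_samples, sample_rate=8000):
    """Toy stand-in for a single-source recording (assumption: a pure tone)."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n_samples)]


def make_training_example(sources, rng):
    """Build one mix-and-separate example from (caption, waveform) pairs.

    Hypothetical layout: the model input would be the synthetic mixture plus
    a text query, and the training target is the waveform the query describes.
    """
    # Sum the waveforms sample by sample to form the synthetic mixture.
    mixture = [sum(samples) for samples in zip(*(wave for _, wave in sources))]
    # Pick one source at random as the separation target for this example.
    query, target = rng.choice(sources)
    return {"mixture": mixture, "query": query, "target": target}


rng = random.Random(0)
sources = [
    ("a low hum", sine(110, 16)),
    ("a high whistle", sine(1760, 16)),
]
example = make_training_example(sources, rng)
```

Because the mixture is built from known sources, the ground-truth target is available for free, which is what lets this style of training work without manually separated recordings.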

Code & Models

Repositories

tanvir-utexas/opensep (Official)

Videos

OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing