OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation
Tanvir Mahmud, Diana Marculescu

TL;DR
OpenSep introduces a framework that uses large language models and textual inversion to improve open-world audio separation, effectively handling unseen sources without manual intervention.
Contribution
OpenSep combines LLM prompting with textual inversion and a multi-level extension of mix-and-separate training to improve audio separation in open-world scenarios, outperforming state-of-the-art baselines.
Findings
Outperforms state-of-the-art baseline methods
Accurately separates unseen and variable sources
Effectively handles complex real-world audio mixtures
Abstract
Audio separation in real-world scenarios, where mixtures contain a variable number of sources, presents significant challenges due to limitations of existing models, such as over-separation, under-separation, and dependence on predefined training sources. We propose OpenSep, a novel framework that leverages large language models (LLMs) for automated audio separation, eliminating the need for manual intervention and overcoming source limitations. OpenSep uses textual inversion to generate captions from audio mixtures with off-the-shelf audio captioning models, effectively parsing the sound sources present. It then employs few-shot LLM prompting to extract detailed audio properties of each parsed source, facilitating separation in unseen mixtures. Additionally, we introduce a multi-level extension of the mix-and-separate training framework to enhance modality alignment by separating…
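The inference flow described in the abstract (caption the mixture, parse the sources, prompt an LLM for each source's properties, then separate) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `parse_sources`, `build_few_shot_prompt`, and `open_world_separate` are hypothetical names, the caption parsing is a toy string split standing in for LLM-based parsing, and the captioner, LLM, and separator are passed in as plain callables because the summary does not name specific models.

```python
from typing import Any, Callable, Dict, List


def parse_sources(caption: str) -> List[str]:
    """Toy stand-in for source parsing: split a caption on commas and ' and '.

    Assumption: a real system would use an LLM for this step; a string
    split is used here only so the sketch runs without external models.
    """
    parts = caption.replace(",", " and ").split(" and ")
    return [p.strip() for p in parts if p.strip()]


def build_few_shot_prompt(source: str, examples: Dict[str, str]) -> str:
    """Build a few-shot prompt asking for the audio properties of `source`.

    `examples` maps a source name to a hand-written property description
    (hypothetical format, chosen for illustration).
    """
    shots = "\n".join(
        f"Source: {name}\nProperties: {desc}" for name, desc in examples.items()
    )
    return f"{shots}\nSource: {source}\nProperties:"


def open_world_separate(
    mixture: Any,
    captioner: Callable[[Any], str],
    llm: Callable[[str], str],
    separator: Callable[[Any, str], Any],
    examples: Dict[str, str],
) -> Dict[str, Any]:
    """Caption the mixture, parse sources, describe each, then separate each."""
    caption = captioner(mixture)
    results: Dict[str, Any] = {}
    for source in parse_sources(caption):
        # Ask the LLM for detailed audio properties of this parsed source.
        description = llm(build_few_shot_prompt(source, examples))
        # Condition the separator on the text description of the source.
        results[source] = separator(mixture, description)
    return results
```

Because the number of parsed sources comes from the caption and is not fixed in advance, the loop emits one output per detected source, which is how a pipeline of this shape would avoid committing to a predefined source count.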
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
