PromptSep: Generative Audio Separation via Multimodal Prompting

Yutong Wen; Ke Chen; Prem Seetharaman; Oriol Nieto; Jiaqi Su; Rithesh Kumar; Minje Kim; Paris Smaragdis; Zeyu Jin; Justin Salamon

arXiv:2511.04623·cs.SD·November 7, 2025

TL;DR

PromptSep extends audio source separation with multimodal prompts, including vocal imitation, for more intuitive sound extraction and removal. It reports state-of-the-art sound removal and vocal-imitation-guided separation, with competitive results on language-queried separation.

Contribution

The paper introduces PromptSep, a framework that conditions a diffusion model on language and vocal imitation prompts to support both sound extraction and sound removal.

Findings

01. State-of-the-art sound removal performance

02. Effective vocal imitation-guided separation

03. Competitive language-queried separation results

Abstract

Recent breakthroughs in language-queried audio source separation (LASS) have shown that generative models can achieve higher separation audio quality than traditional masking-based approaches. However, two key limitations restrict their practical use: (1) users often require operations beyond separation, such as sound removal; and (2) relying solely on text prompts can be unintuitive for specifying sound sources. In this paper, we propose PromptSep to extend LASS into a broader framework for general-purpose sound separation. PromptSep leverages a conditional diffusion model enhanced with elaborated data simulation to enable both audio extraction and sound removal. To move beyond text-only queries, we incorporate vocal imitation as an additional and more intuitive conditioning modality for our model, by incorporating Sketch2Sound as a data augmentation strategy. Both objective and…
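The abstract says data simulation lets a single model handle both extraction and removal, but it does not spell out the recipe. The following is a minimal sketch of one plausible setup, not the authors' pipeline: a target sound is mixed with a background at a chosen signal-to-noise ratio, and the training reference is the target for extraction or the remaining background for removal. The names `mix_at_snr` and `make_example`, the task labels `"extract"` and `"remove"`, and the SNR-based mixing are all assumptions made for illustration.

```python
import math
import random


def mix_at_snr(target, background, snr_db):
    """Scale `background` so the target-to-background power ratio is `snr_db`.

    Assumes both signals have equal length and non-zero power.
    Returns the mixture and the scaled background.
    """
    p_target = sum(x * x for x in target) / len(target)
    p_background = sum(x * x for x in background) / len(background)
    gain = math.sqrt(p_target / (p_background * 10 ** (snr_db / 10)))
    scaled = [gain * x for x in background]
    mixture = [t + b for t, b in zip(target, scaled)]
    return mixture, scaled


def make_example(target, background, task, snr_db=0.0):
    """Build one simulated training example (hypothetical format).

    task == "extract": the reference output is the prompted target sound.
    task == "remove":  the reference output is everything except the target.
    """
    mixture, scaled_background = mix_at_snr(target, background, snr_db)
    if task == "extract":
        reference = list(target)
    elif task == "remove":
        reference = scaled_background
    else:
        raise ValueError(f"unknown task: {task!r}")
    return {"mixture": mixture, "task": task, "reference": reference}


# Usage: two random signals stand in for real audio clips.
rng = random.Random(0)
dog_bark = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
street_noise = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
extract_example = make_example(dog_bark, street_noise, "extract", snr_db=5.0)
remove_example = make_example(dog_bark, street_noise, "remove", snr_db=5.0)
```

Under this assumption the two tasks share the same mixture and differ only in the reference signal, so a prompt (text or vocal imitation) plus a task flag would be enough to tell the model which output is wanted.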

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis