OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Ziyang Ma, Shengpeng Ji, Jialong Zuo, Tao Jin, Zhou Zhao

TL;DR
OmniSep is a unified framework for sound separation that uses omni-modal queries and a novel Query-Mixup training strategy to improve separation performance across multiple modalities and open-vocabulary scenarios.
Contribution
The paper introduces OmniSep, a multi-modal sound separation framework trained with Query-Mixup, together with Query-Aug, a retrieval-augmented method for open-vocabulary separation, and reports state-of-the-art results.
Findings
Achieves state-of-the-art results on multiple datasets.
Effectively handles multi-modal and open-vocabulary sound separation.
Demonstrates flexible query influence for sound retention or removal.
Abstract
The scaling up has brought tremendous success in the fields of vision and language in recent years. When it comes to audio, however, researchers encounter a major challenge in scaling up the training data, as most natural audio contains diverse interfering signals. To address this limitation, we introduce Omni-modal Sound Separation (OmniSep), a novel framework capable of isolating clean soundtracks based on omni-modal queries, encompassing both single-modal and multi-modal composed queries. Specifically, we introduce the Query-Mixup strategy, which blends query features from different modalities during training. This enables OmniSep to optimize multiple modalities concurrently, effectively bringing all modalities under a unified framework for sound separation. We further enhance this flexibility by allowing queries to influence sound separation positively or negatively, facilitating…
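The two query mechanisms described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the random convex weighting, the linear form of the negative query, and the `alpha` parameter are all assumptions made for illustration, and the embeddings are plain Python lists standing in for whatever a real query encoder would produce.

```python
import random
from typing import Dict, List, Optional

Vector = List[float]


def query_mixup(features: Dict[str, Vector],
                rng: Optional[random.Random] = None) -> Vector:
    """Blend per-modality query embeddings into a single query vector.

    Assumption: the blend is a random convex combination (weights are
    non-negative and sum to 1). The paper may weight modalities differently.
    """
    rng = rng or random.Random()
    names = sorted(features)
    dim = len(features[names[0]])
    if any(len(features[name]) != dim for name in names):
        raise ValueError("all query features must share one dimension")

    # Draw one positive weight per modality, then normalise to sum to 1.
    raw = [rng.random() + 1e-8 for _ in names]
    total = sum(raw)
    weights = [r / total for r in raw]

    return [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]


def apply_negative_query(positive: Vector, negative: Vector,
                         alpha: float = 0.5) -> Vector:
    """Shift a query away from an unwanted sound.

    Assumption: a simple linear form, positive - alpha * negative, where
    alpha (hypothetical parameter) controls how strongly the negative
    query suppresses its sound.
    """
    if len(positive) != len(negative):
        raise ValueError("positive and negative queries must share one dimension")
    return [p - alpha * n for p, n in zip(positive, negative)]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings for the same target sound in three modalities.
    queries = {
        "text": [1.0, 0.0],
        "image": [0.0, 1.0],
        "audio": [1.0, 1.0],
    }
    mixed = query_mixup(queries, random.Random(0))
    print("mixed query:", mixed)
    print("with negative query:", apply_negative_query(mixed, [0.0, 1.0], alpha=0.5))
```

In this sketch, a new set of weights is drawn on every call, so over many training steps the downstream separator would see text-only-like, image-only-like, and composed queries from the same embedding space, which is the intuition behind training all modalities concurrently.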
Peer Reviews
Decision: ICLR 2025 Poster
The paper presents an omni-modal sound separation approach, allowing users to conduct sound separation tasks using queries across various modalities, including text, image, and audio, both independently and jointly. Additionally, the authors conducted a comprehensive evaluation across several benchmarks. Further, the inclusion of extensive ablation studies provides insight into the contributions of each component: impact of Query-Mixup, negative query weighting, long text description-queried sep
While the paper presents a solid framework, there are some details missing that make it difficult to give a firm acceptance (please answer Questions). The use of full-length video as a query, which includes the target segment, raises questions about the potential for information leakage. It’s unclear if the QueryNet architecture ensures a bottleneck to prevent the model from “cheating” by directly accessing target audio features. To better demonstrate the model’s robustness, an alternative set
The paper presents a clear and well-motivated problem statement, addressing three fundamental limitations in current sound separation approaches: the absence of unified multi-modal query handling, insufficient flexibility in sound manipulation (particularly for unwanted sound removal), and restricted vocabulary constraints that preclude natural language descriptions. The authors construct a compelling narrative throughout the introduction and literature review, effectively contextualizing their
There are several issues with the paper, which can broadly be classified into two areas. Most of these issues can be fixed with better scientific writing and more explanation, rigour, and experimentation. Technical Clarity Issues: 1. Significant documentation gaps in core variables and operations in Separate-Net: - q_i and q_ij are poorly explained or completely undefined in Separate-Net - Many variables are undefined and/or have no dimensions specified. All variables should be clea
1. A novel model for source separation queried by audio, text, and image, enabled by a novel Query-Mixup method. 2. Novel negative query and Query-Aug methods that improve performance and flexibility at inference time. 3. State-of-the-art performance on separation tasks queried by all modalities.
1. Presentation: Certain aspects of the paper’s presentation feel oversimplified, particularly in explaining key contributions and technical details. Please refer to the questions section for specific areas needing clarification. 2. Contribution of Query-Aug: The significance of the query-augmentation (query-aug) contribution seems overstated. For example, lines 054–056 suggest that current systems cannot handle open-vocabulary queries. However, Liu et al. (2023) ("Separate Anything You Describ
Taxonomy
Topics: Speech and Audio Processing · Acoustic Wave Phenomena Research · Speech Recognition and Synthesis
