Weakly-supervised Audio Separation via Bi-modal Semantic Similarity
Tanvir Mahmud, Saeed Amizadeh, Kazuhito Koishida, Diana Marculescu

TL;DR
This paper introduces a bi-modal framework leveraging language descriptions and pretrained embeddings to improve weakly-supervised audio separation, significantly boosting performance without requiring single-source audio during training.
Contribution
It proposes a novel bi-modal separation framework that enhances both unsupervised and supervised audio separation using the language modality and pretrained joint embeddings.
Findings
Achieves 71% SDR boost over unsupervised baselines
Reaches 97.5% of supervised performance with weak supervision
Improves supervised training by 17% with semi-supervised approach
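The findings above are stated as relative changes in SDR (signal-to-distortion ratio). As a rough illustration of what that metric measures, the sketch below computes a plain SDR in dB and a relative gain between two scores. The function names `sdr_db` and `relative_gain` are hypothetical, the numbers in the usage lines are made up for illustration, and the paper's exact SDR variant and evaluation protocol are not specified here.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Plain signal-to-distortion ratio in dB.

    Assumption: no scale invariance or allowed filtering is applied,
    so this may differ from the SDR variant used in the paper.
    """
    signal_power = sum(r * r for r in reference)
    distortion_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_power + eps) / (distortion_power + eps))


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / abs(baseline)


# Toy usage with made-up signals and scores (not results from the paper).
reference = [1.0] * 100
estimate = [0.9] * 100          # every sample is off by 0.1
print(round(sdr_db(reference, estimate), 3))
print(relative_gain(4.0, 6.0))  # hypothetical baseline and improved SDR in dB
```

A higher SDR means the estimate is closer to the reference source; a relative gain of 0.5 corresponds to a "50% boost" in the sense used by the findings.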
Abstract
Conditional sound separation in multi-source audio mixtures without having access to single-source sound data during training is a long-standing challenge. Existing mix-and-separate based methods suffer from a significant performance drop with multi-source training mixtures due to the lack of a supervision signal for single-source separation cases during training. However, in the case of language-conditional audio separation, we do have access to corresponding text descriptions for each audio mixture in our training data, which can be seen as (rough) representations of the audio samples in the language modality. To this end, in this paper, we propose a generic bi-modal separation framework which can enhance the existing unsupervised frameworks to separate single-source signals in a target modality (i.e., audio) using the easily separable corresponding signals in the conditioning modality…
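The idea of supervising audio outputs through an easily separable text description can be sketched as a similarity objective in a shared embedding space. The example below is a minimal illustration, not the paper's loss: it assumes each separated audio output and each single-source text phrase have already been mapped to vectors by some pretrained joint audio-text embedding model, and it uses toy 2-D vectors in place of real embeddings. The names `cosine` and `semantic_similarity_loss` and the `temperature` parameter are hypothetical.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def semantic_similarity_loss(audio_embs, text_embs, temperature=0.1):
    """Softmax cross-entropy over cosine similarities.

    Output i is pulled toward text phrase i and pushed away from the other
    phrases. Assumption: this contrastive form is illustrative only and is
    not claimed to match the objective used in the paper.
    """
    total = 0.0
    for i, audio in enumerate(audio_embs):
        logits = [cosine(audio, text) / temperature for text in text_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / len(audio_embs)


# Toy embeddings standing in for a pretrained joint audio-text model.
text_embs = [[1.0, 0.0], [0.0, 1.0]]      # e.g. two single-source phrases
well_separated = [[0.9, 0.1], [0.1, 0.9]]  # outputs match their phrases
poorly_separated = [[0.5, 0.5], [0.5, 0.5]]  # outputs still look like the mixture

print(semantic_similarity_loss(well_separated, text_embs))
print(semantic_similarity_loss(poorly_separated, text_embs))
```

In this toy setup the loss is lower when each separated output is semantically closer to its own phrase than to the others, which is the kind of signal a weakly-supervised objective can use without single-source audio.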
Peer Reviews
Decision: ICLR 2024 poster
- The paper is generally well written and provides much of the information that researchers in the field of sound separation would need to reproduce it.
- The authors have conducted several experiments to show that their algorithm performs better than some existing text-based separation algorithms.
- Importantly, the proposed method yields a significant performance improvement over the unsupervised MixIT baseline.
- I also like the novel extension of MixIT with conditioning.
1. The authors should make clear when the proposed method is trained using only weak supervision and when it is trained in a semi-supervised manner. For instance, in Table 1, I think the proposed framework row refers to the semi-supervised version of the method, thus the authors should rename the column from ‘Supervised’ to ‘Fully supervised’. Maybe a better idea is to specify the data used to train ALL the parts of each model and have two big columns ‘Mixture training data’ and ‘Single s
The paper presents a compelling and well-articulated examination of a critical limitation within the existing zero-shot/few-shot conditional source separation pipeline: the dataset. It addresses this issue within the context of cross-modal learning (CLAP). The paper's technical novelty is commendable. It ingeniously employs CLAP to establish a method for validating the consistency of separation outputs, integrating loss objectives that effectively converge both intermediate semantic consistency
I assigned a 'weak reject' score to this paper due to what I consider to be a critical issue in the design of the separation pipeline and an empirical problem in the experimental results. Firstly, the paper introduces the utilization of CLAP embeddings as a constraint to enforce semantic similarity between the separation output and the source constraint. While this is a technically novel approach, it raises two important concerns: (1) The use of CLAP, or similar contrastive learning models, to
1. The paper introduces a unique framework that integrates unsupervised, weakly supervised, and semi-supervised training for source separation, conditional on language prompts. The experimental results validate the efficacy of the proposed method.
2. The paper conducts detailed experiments, contrasting the proposed approach with various baselines. Ablation tests further underscore the superior performance of the proposed method compared to existing techniques.
Although it might not be a weakness, it is unclear how the model's performance will improve in large-dataset settings.
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
