Weakly-supervised Audio Separation via Bi-modal Semantic Similarity

Tanvir Mahmud; Saeed Amizadeh; Kazuhito Koishida; Diana Marculescu

arXiv:2404.01740 · cs.SD · April 3, 2024 · 1 citation

TL;DR

This paper introduces a bi-modal framework leveraging language descriptions and pretrained embeddings to improve weakly-supervised audio separation, significantly boosting performance without requiring single-source audio during training.

Contribution

It proposes a novel bi-modal separation framework that enhances both unsupervised and supervised audio separation by using the language modality and pretrained joint embeddings.

Findings

01

Achieves a 71% SDR boost over unsupervised baselines

02

Reaches 97.5% of supervised performance with weak supervision

03

Improves supervised training by 17% with a semi-supervised approach
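
The findings above are stated in terms of SDR (signal-to-distortion ratio). As a reminder of what that number measures, here is a minimal sketch of the basic energy-ratio definition in plain Python. This page does not say which SDR variant the paper uses (a scale-invariant one is also common), so the formula is an assumption; `sdr_db`, `relative_gain_percent`, and the toy signals are illustrative names and values, not results from the paper.

```python
import math


def sdr_db(reference, estimate):
    """Basic SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: plain energy-ratio SDR, not necessarily the paper's variant.
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0:
        raise ValueError("reference signal is silent; SDR is undefined")
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


def relative_gain_percent(baseline, improved):
    """Relative change of a score over a baseline, in percent."""
    return (improved - baseline) / abs(baseline) * 100.0


# Toy signals (hypothetical): the estimate is the reference scaled by 0.9.
toy_reference = [1.0, 0.0, -1.0, 0.0]
toy_estimate = [0.9, 0.0, -0.9, 0.0]
toy_sdr = sdr_db(toy_reference, toy_estimate)
```

For the toy signals the error energy is about 1% of the signal energy, so `toy_sdr` comes out at roughly 20 dB. A statement such as "X% SDR boost" would correspond to `relative_gain_percent` applied to two SDR scores, assuming the percentage is taken over the dB values.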

Abstract

Conditional sound separation in multi-source audio mixtures without access to single-source sound data during training is a long-standing challenge. Existing mix-and-separate based methods suffer a significant performance drop with multi-source training mixtures because they lack a supervision signal for single-source separation cases during training. However, in the case of language-conditional audio separation, we do have access to corresponding text descriptions for each audio mixture in our training data, which can be seen as (rough) representations of the audio samples in the language modality. To this end, in this paper, we propose a generic bi-modal separation framework which can enhance the existing unsupervised frameworks to separate single-source signals in a target modality (i.e., audio) using the easily separable corresponding signals in the conditioning modality…
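
The mix-and-separate setup the abstract refers to can be sketched as follows. This is a toy illustration in plain Python, not the authors' implementation: the `separator` callable, the L1 reconstruction loss, and the three-clip `DATASET` are all assumptions made for the example. It shows why multi-source training clips are a problem: the reconstruction target is itself a mixture, so the model never sees a single-source target.

```python
import random

# Hypothetical toy data: each training clip is already a multi-source mixture,
# paired with a caption that describes everything in it.
DATASET = [
    ([1.0, 0.0, -1.0, 0.0], "a dog barking while rain falls"),
    ([0.5, 0.5, 0.5, 0.5], "a piano playing over crowd chatter"),
    ([0.0, 0.25, 0.0, -0.25], "a car engine with birdsong"),
]


def mix(a, b):
    """Sample-wise sum of two equal-length waveforms."""
    return [x + y for x, y in zip(a, b)]


def l1_loss(estimate, target):
    """Mean absolute error between two equal-length waveforms."""
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)


def make_training_example(dataset, rng):
    """Build one mix-and-separate example from two distinct clips.

    Returns (synthetic mixture, conditioning caption, reconstruction target).
    The target is the first clip, which is itself a multi-source mixture.
    """
    (audio_a, caption_a), (audio_b, _) = rng.sample(dataset, 2)
    return mix(audio_a, audio_b), caption_a, audio_a


def mix_and_separate_loss(separator, dataset, rng):
    """One training-step loss: recover the captioned clip from the synthetic mix."""
    mixture, prompt, target = make_training_example(dataset, rng)
    estimate = separator(mixture, prompt)
    return l1_loss(estimate, target)


def identity_separator(mixture, prompt):
    """Placeholder model (assumption): returns its input unchanged."""
    return mixture
```

Calling `mix_and_separate_loss(identity_separator, DATASET, random.Random(0))` gives the loss of a model that does nothing; a real separator would be a trainable network conditioned on an embedding of `prompt`.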

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- The paper is generally speaking well written with a lot of information needed to be reproduced by the researchers in the field of sound separation.
- The authors have conducted several experiments to show that their algorithm performs better than some other existing text-based separation algorithms.
- It is very important that the proposed method yields a significant performance improvement over the unsupervised baseline of MixIT.
- I also like the novel extension of MixIT with conditioning.

Weaknesses

1. The authors should make clear the distinction of when the proposed method is trained using only weak supervision and when it is semi-supervised trained. For instance, in Table 1, I think the proposed framework row refers to the semi-supervised version of the method, thus the authors should rename the column to ‘Fully supervised’ from ‘Supervised’. Maybe a better idea is to specify the data used to train ALL the parts of each model and have two big columns ‘Mixture training data’ and ‘Single s

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

The paper presents a compelling and well-articulated examination of a critical limitation within the existing zero-shot/few-shot conditional source separation pipeline: the dataset. It addresses this issue within the context of cross-modal learning (CLAP). The paper's technical novelty is commendable. It ingeniously employs CLAP to establish a method for validating the consistency of separation outputs, integrating loss objectives that effectively converge both intermediate semantic consistency

Weaknesses

I assigned a 'weak reject' score to this paper due to what I consider to be a critical issue in the design of the separation pipeline and an empirical problem in the experimental results. Firstly, the paper introduces the utilization of CLAP embeddings as a constraint to enforce semantic similarity between the separation output and the source constraint. While this is a technically novel approach, it raises two important concerns: (1) The use of CLAP, or similar contrastive learning models, to

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 3

Strengths

1. The paper introduces a unique framework that integrates unsupervised, weakly supervised, and semi-supervised training for source separation, conditional on language prompts. The experimental results validate the efficacy of the proposed method.
2. The paper conducts detailed experiments, contrasting the proposed approach with various baselines. Ablation tests further underscore the superior performance of the proposed method compared to existing techniques.

Weaknesses

Although it might not be a weakness, it is a bit questionable how the model performance will improve in large-dataset settings.

Code & Models

Videos

Weakly-supervised Audio Separation via Bi-modal Semantic Similarity · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis